The Speex Codec Manual
(version 1.0.4)

Jean-Marc Valin
14th July 2004

Copyright (c) 2002-2004 Jean-Marc Valin.

Permission is granted to copy, distribute and/or modify this document under the
terms of the GNU Free Documentation License, Version 1.1 or any later version
published by the Free Software Foundation; with no Invariant Sections, with no
Front-Cover Texts, and with no Back-Cover Texts. A copy of the license is
included in the section entitled "GNU Free Documentation License".

Contents

1 Introduction to Speex
2 Feature description
3 Command-line encoder/decoder
  3.1 speexenc
  3.2 speexdec
4 Programming with Speex (the libspeex API)
  4.1 Encoding
  4.2 Decoding
  4.3 Codec Options (speex_*_ctl)
  4.4 Mode queries
  4.5 Packing and in-band signalling
5 Formats and standards
  5.1 RTP Payload Format
  5.2 MIME Type
  5.3 Ogg file format
6 Introduction to CELP Coding
  6.1 Linear Prediction (LPC)
  6.2 Pitch Prediction
  6.3 Innovation Codebook
  6.4 Analysis-by-Synthesis and Error Weighting
7 Speex narrowband mode
  7.1 LPC Analysis
  7.2 Pitch Prediction (adaptive codebook)
  7.3 Innovation Codebook
  7.4 Bit allocation
  7.5 Perceptual enhancement
8 Speex wideband mode (sub-band CELP)
  8.1 Linear Prediction
  8.2 Pitch Prediction
  8.3 Excitation Quantization
  8.4 Bit allocation
A FAQ
B Sample code
  B.1 sampleenc.c
  B.2 sampledec.c
C IETF RTP Profile
D Speex License
E GNU Free Documentation License

List of Tables

1 In-band signalling codes
2 Ogg/Speex header packet
3 Bit allocation for narrowband modes
4 Quality versus bit-rate
5 Bit allocation for high-band in wideband mode

1 Introduction to Speex

The Speex project (http://www.speex.org/) was started because there was a need
for a speech codec that is open-source and free from software patents. These
are essential conditions for use in any open-source software. Vorbis already
handles general audio, but it is not really suitable for speech. Also, unlike
many other speech codecs, Speex is not targeted at cell phones (not many
open-source cell phones anyway :-) ) but rather at voice over IP (VoIP) and
file-based compression.

As design goals, we wanted a codec that would allow both very good quality
speech and low bit-rate (unfortunately not at the same time!), which led us to
develop a codec with multiple bit-rates. Of course, very good quality also
meant we had to do wideband (16 kHz sampling rate) in addition to narrowband
(telephone quality, 8 kHz sampling rate).

Designing for VoIP instead of cell phone use means that Speex must be robust to
lost packets, but not to corrupted ones, since packets either arrive unaltered
or don't arrive at all. Also, the idea was to have reasonable complexity and
memory requirements without compromising too much on the efficiency of the
codec.

All this led us to the choice of CELP
as the encoding technique to use for Speex. One of the main reasons is that
CELP has long proved that it can do the job and scale well to both low
bit-rates (think DoD CELP at 4.8 kbps) and high bit-rates (think G.728 at
16 kbps).

The main characteristics can be summarized as follows:

  * Free software/open-source, patent and royalty-free
  * Integration of narrowband and wideband in the same bit-stream
  * Wide range of bit-rates available (from 2 kbps to 44 kbps)
  * Dynamic bit-rate switching and Variable Bit-Rate (VBR)
  * Voice Activity Detection (VAD, integrated with VBR)
  * Variable complexity
  * Ultra-wideband mode at 32 kHz (up to 48 kHz)
  * Intensity stereo encoding option

This document is organized as follows. Section 2 describes the different Speex
features and defines some terms that are used in later sections. Section 3
provides information about the standard command-line tools, while section 4
contains information about programming with the Speex API. Section 5 has some
information related to Speex and standards. The last three sections describe
the internals of the codec and require some signal processing knowledge.
Section 6 explains the general idea behind CELP, while sections
7 and 8 are specific to Speex. Note that if you are only interested in using
Speex, the last three sections are not required.

2 Feature description

This section explains the main Speex features, as well as some concepts in
speech coding that help in understanding the later sections.

Sampling rate

Speex is mainly designed for three different sampling rates: 8 kHz, 16 kHz, and
32 kHz. These are respectively referred to as narrowband, wideband and
ultra-wideband.

Quality

Speex encoding is controlled most of the time by a quality parameter that
ranges from 0 to 10. In constant bit-rate (CBR) operation, the quality
parameter is an integer, while for variable bit-rate (VBR), the parameter is a
float.

Complexity (variable)

With Speex, it is possible to vary the complexity allowed for the encoder. This
is done by controlling how the search is performed with an integer ranging from
1 to 10
in a way that's similar to the -1 to -9 options of the gzip and bzip2
compression utilities. For normal use, the noise level at complexity 1 is
between 1 and 2 dB higher than at complexity 10, but the CPU requirements for
complexity 10 are about 5 times higher than for complexity 1. In practice, the
best trade-off is between complexity 2 and 4, though higher settings are often
useful when encoding non-speech sounds like DTMF tones.

Variable Bit-Rate (VBR)

Variable bit-rate (VBR) allows a codec to change its bit-rate dynamically to
adapt to the "difficulty" of the audio being encoded. In the example of Speex,
sounds like vowels and high-energy transients require a higher bit-rate to
achieve good quality, while fricatives (e.g. s and f sounds) can be coded
adequately with fewer bits. For this reason, VBR can achieve a lower bit-rate
for the same quality, or a better quality for a given bit-rate. Despite its
advantages, VBR has two main drawbacks: first, by only specifying quality,
there is no guarantee about the final average bit-rate. Second, for some
real-time applications like voice over IP (VoIP), what counts is the maximum
bit-rate, which must be low enough for the communication channel.

Average Bit-Rate (ABR)

Average bit-rate solves one of the problems of VBR, as it dynamically adjusts
VBR quality in order to meet a specific target bit-rate. Because the
quality/bit-rate is adjusted in real-time (open-loop), the global quality will
be slightly lower than that obtained by encoding in VBR with exactly the right
quality setting to meet the target average bit-rate.

Voice Activity Detection (VAD)

When enabled, voice activity detection detects whether the audio being encoded
is speech or silence/background noise. VAD is always implicitly activated when
encoding in VBR, so the option is only useful in non-VBR operation. In this
case, Speex detects non-speech periods and encodes them with just enough bits
to reproduce the background noise. This is called "comfort noise generation"
(CNG).

Discontinuous Transmission (DTX)

Discontinuous transmission is an addition to VAD/VBR operation that makes it
possible to stop transmitting completely when the background noise is
stationary. In file-based operation, since we cannot just stop writing to the
file, only 5 bits are used for such frames (corresponding to 250 bps).

Perceptual enhancement

Perceptual enhancement is a part of the decoder which, when turned on, tries to
reduce (the perception of) the noise produced by the coding/decoding process.
In most cases, perceptual enhancement makes the sound further from the original
objectively (if you use SNR), but in the end it still sounds better (subjective
improvement).

Algorithmic delay

Every speech codec introduces a delay in the transmission. For Speex, this
delay is equal to the frame size, plus some amount of "look-ahead" required to
process each frame. In narrowband operation (8 kHz), the delay is 30 ms, while
for wideband (16 kHz), the delay is 34 ms. These values don't account for the
CPU time it takes to encode or decode the frames.

3 Command-line encoder/decoder

The base Speex distribution includes a command-line encoder (speexenc) and
decoder (speexdec). This section describes how to use these tools.

3.1 speexenc

The speexenc utility is used to create Speex files from raw PCM or wave files.
It can be used by calling:

    speexenc [options] input_file output_file

The value - for input_file or output_file corresponds respectively to stdin and
stdout. The valid options are:

  --narrowband (-n)
      Tell Speex to treat the input as narrowband (8 kHz). This is the default
  --wideband (-w)
      Tell Speex to treat the input as wideband (16 kHz)
  --ultra-wideband (-u)
      Tell Speex to treat the input as "ultra-wideband" (32 kHz)
  --quality n
      Set the encoding quality (0-10), default is 8
  --bitrate n
      Encoding bit-rate (use bit-rate n or lower)
  --vbr
      Enable VBR (Variable Bit-Rate), disabled by default
  --abr n
      Enable ABR (Average Bit-Rate) at n kbps, disabled by default
  --vad
      Enable VAD (Voice Activity Detection), disabled by default
  --dtx
      Enable DTX (Discontinuous Transmission), disabled by default
  --nframes n
      Pack n frames in each Ogg packet (this saves space at low bit-rates)
  --comp n
      Set encoding speed/quality tradeoff. The higher the value of n, the
      slower the encoding (default is 3)
  -V
      Verbose operation, print bit-rate currently in use
  --help (-h)
      Print the help
  --version (-v)
      Print version information

Speex comments

  --comment
      Add the given string as an extra comment. This may be used multiple
      times.
  --author
      Author of this track.
  --title
      Title for this track.

Raw input options

  --rate n
      Sampling rate for raw input
  --stereo
      Consider raw input as stereo
  --le
      Raw input is little-endian
  --be
      Raw input is big-endian
  --8bit
      Raw input is 8-bit unsigned
  --16bit
      Raw input is 16-bit signed

3.2 speexdec

The speexdec utility is used to decode Speex files and can be used by calling:

    speexdec [options] speex_file [output_file]

The value - for speex_file or output_file corresponds respectively to stdin and
stdout. Also, when no output_file is specified, the file is played to the
soundcard. The valid options are:

  --enh
      Enable post-filter (default)
  --no-enh
      Disable post-filter
  --force-nb
      Force decoding in narrowband
  --force-wb
      Force decoding in wideband
  --force-uwb
      Force decoding in ultra-wideband
  --mono
      Force decoding in mono
  --stereo
      Force decoding in stereo
  --rate n
      Force decoding at n Hz sampling rate
  --packet-loss n
      Simulate n % random packet
      loss
  -V
      Verbose operation, print bit-rate currently in use
  --help (-h)
      Print the help
  --version (-v)
      Print version information

4 Programming with Speex (the libspeex API)

This section explains how to use the Speex API. Examples of code can also be
found in appendix B.

4.1 Encoding

In order to encode speech using Speex, you first need to:

    #include <speex.h>

You then need to declare a Speex bit-packing struct

    SpeexBits bits;

and a Speex encoder state

    void *enc_state;

The two are initialized by:

    speex_bits_init(&bits);
    enc_state = speex_encoder_init(&speex_nb_mode);

For wideband coding, speex_nb_mode will be replaced by speex_wb_mode. In most
cases, you will need to know the frame size used by the mode you are using. You
can get that value in the frame_size variable with:

    speex_encoder_ctl(enc_state, SPEEX_GET_FRAME_SIZE, &frame_size);

Once the initialization is done, for every input frame:

    speex_bits_reset(&bits);
    speex_encode(enc_state, input_frame, &bits);
    nbBytes = speex_bits_write(&bits, byte_ptr, MAX_NB_BYTES);

where input_frame is a (float *) pointing to the beginning of a speech frame,
byte_ptr is a (char *) where the encoded frame will be written, MAX_NB_BYTES is
the maximum number of bytes that can be written to byte_ptr without causing an
overflow, and nbBytes is the number of bytes actually written to byte_ptr (the
encoded size in bytes). Before calling speex_bits_write, it is possible to find
the number of bytes that need to be written by calling
speex_bits_nbytes(&bits). After you're done with the encoding, free all
resources with:

    speex_bits_destroy(&bits);
    speex_encoder_destroy(enc_state);

That's about it for the encoder.

4.2 Decoding

In order to decode speech using Speex, you first need to:

    #include <speex.h>

You also need to declare a Speex bit-packing struct

    SpeexBits bits;

and a Speex decoder state

    void *dec_state;

The two are initialized by:

    speex_bits_init(&bits);
    dec_state = speex_decoder_init(&speex_nb_mode);

For wideband decoding, speex_nb_mode will be replaced by speex_wb_mode. If you
need to obtain the size of the frames that will be used by the decoder, you can
get that value in the frame_size variable with:

    speex_decoder_ctl(dec_state, SPEEX_GET_FRAME_SIZE, &frame_size);

There is also a parameter that can be set for the decoder: whether or not to
use a perceptual post-filter. This can be set by:

    speex_decoder_ctl(dec_state, SPEEX_SET_ENH, &enh);

where enh is an int with value 0 to have the post-filter disabled and 1 to have
it enabled.

Again, once the decoder initialization is done, for every input frame:

    speex_bits_read_from(&bits, input_bytes, nbBytes);
    speex_decode(dec_state, &bits, output_frame);

where input_bytes is a (char *) containing the bit-stream data received for a
frame, nbBytes is the size (in bytes) of that bit-stream, and output_frame is a
(float *) that points to the area where the decoded speech frame will be
written. A NULL value as the second argument (in place of &bits) indicates that
we don't have the bits for the current frame. When a frame is lost, the Speex
decoder will do its best to "guess" the correct signal.

After you're done with the decoding, free all resources with:

    speex_bits_destroy(&bits);
    speex_decoder_destroy(dec_state);
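
When sizing the byte_ptr and input_bytes buffers used in sections 4.1 and 4.2,
it can help to estimate how many bytes one encoded frame occupies at a given
bit-rate. The following is a minimal sketch in C and is not part of libspeex:
the helper names (frame_samples, frame_bits, frame_bytes) are hypothetical, and
it assumes 20 ms frames (consistent with the 5-bit DTX frames at 250 bps
mentioned in section 2) and one frame per packet. The value returned by
speex_bits_nbytes remains the authoritative figure.

```c
#include <assert.h>

/* Assumption: every Speex frame covers 20 ms of audio. This is not stated
 * as a constant in the text above; it is inferred from "5 bits per frame
 * corresponds to 250 bps". */
#define FRAME_MS 20

/* Hypothetical helper: samples per frame at a given sampling rate,
 * e.g. 8000 Hz (narrowband) or 16000 Hz (wideband). */
int frame_samples(int sample_rate_hz)
{
    assert(sample_rate_hz > 0);
    return sample_rate_hz * FRAME_MS / 1000;
}

/* Hypothetical helper: bits produced for one frame at a given bit-rate. */
int frame_bits(int bitrate_bps)
{
    assert(bitrate_bps > 0);
    return bitrate_bps * FRAME_MS / 1000;
}

/* Hypothetical helper: bytes needed to hold one frame, rounded up to a
 * whole byte, since a byte buffer cannot hold a partial byte. */
int frame_bytes(int bitrate_bps)
{
    return (frame_bits(bitrate_bps) + 7) / 8;
}
```

Under these assumptions, frame_bytes(44000) gives 110, so a MAX_NB_BYTES of a
few hundred leaves a comfortable margin for a single frame at the highest
bit-rate listed in section 1.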