Multiple mode variable rate speech coding

Information

  • Patent Number
    6,691,084
  • Date Filed
    Monday, December 21, 1998
  • Date Issued
    Tuesday, February 10, 2004
Abstract
A method and apparatus for the variable rate coding of a speech signal. An input speech signal is classified and an appropriate coding mode is selected based on this classification. For each classification, the coding mode that achieves the lowest bit rate with an acceptable quality of speech reproduction is selected. Low average bit rates are achieved by employing high fidelity modes (i.e., high bit rate, broadly applicable to different types of speech) only during portions of the speech where this fidelity is required for acceptable output. Lower bit rate modes are used during portions of speech where these modes produce acceptable output. The input speech signal is classified into active and inactive regions. Active regions are further classified into voiced, unvoiced, and transient regions. Various coding modes are applied to active speech, depending upon the required level of fidelity, and may be utilized according to the strengths and weaknesses of each particular mode. The apparatus dynamically switches between these modes as the properties of the speech signal vary with time. Where appropriate, regions of speech are modeled as pseudo-random noise, resulting in a significantly lower bit rate; this coding is used in a dynamic fashion whenever unvoiced speech or background noise is detected.
Description




BACKGROUND OF THE INVENTION




I. Field of the Invention




The present invention relates to the coding of speech signals. Specifically, the present invention relates to classifying speech signals and employing one of a plurality of coding modes based on the classification.




II. Description of the Related Art




Many communication systems today transmit voice as a digital signal, particularly long distance and digital radio telephone applications. The performance of these systems depends, in part, on accurately representing the voice signal with a minimum number of bits. Transmitting speech simply by sampling and digitizing requires a data rate on the order of 64 kilobits per second (kbps) to achieve the speech quality of a conventional analog telephone. However, coding techniques are available that significantly reduce the data rate required for satisfactory speech reproduction.




The term “vocoder” typically refers to devices that compress voiced speech by extracting parameters based on a model of human speech generation. Vocoders include an encoder and a decoder. The encoder analyzes the incoming speech and extracts the relevant parameters. The decoder synthesizes the speech using the parameters that it receives from the encoder via a transmission channel. The speech signal is often divided into frames of data and block processed by the vocoder.




Vocoders built around linear-prediction-based time domain coding schemes far exceed in number all other types of coders. These techniques extract correlated elements from the speech signal and encode only the uncorrelated elements. The basic linear predictive filter predicts the current sample as a linear combination of past samples. An example of a coding algorithm of this particular class is described in the paper “A 4.8 kbps Code Excited Linear Predictive Coder,” by Thomas E. Tremain et al., Proceedings of the Mobile Satellite Conference, 1988.




These coding schemes compress the digitized speech signal into a low bit rate signal by removing all of the natural redundancies (i.e., correlated elements) inherent in speech. Speech typically exhibits short term redundancies resulting from the mechanical action of the lips and tongue, and long term redundancies resulting from the vibration of the vocal cords. Linear predictive schemes model these operations as filters, remove the redundancies, and then model the resulting residual signal as white Gaussian noise. Linear predictive coders therefore achieve a reduced bit rate by transmitting filter coefficients and quantized noise rather than a full bandwidth speech signal.




However, even these reduced bit rates often exceed the available bandwidth where the speech signal must either propagate a long distance (e.g., ground to satellite) or coexist with many other signals in a crowded channel. A need therefore exists for an improved coding scheme that achieves a lower bit rate than linear predictive schemes.




SUMMARY OF THE INVENTION




The present invention is a novel and improved method and apparatus for the variable rate coding of a speech signal. The present invention classifies the input speech signal and selects an appropriate coding mode based on this classification. For each classification, the present invention selects the coding mode that achieves the lowest bit rate with an acceptable quality of speech reproduction. The present invention achieves low average bit rates by only employing high fidelity modes (i.e., high bit rate, broadly applicable to different types of speech) during portions of the speech where this fidelity is required for acceptable output. The present invention switches to lower bit rate modes during portions of speech where these modes produce acceptable output.




An advantage of the present invention is that speech is coded at a low bit rate. Low bit rates translate into higher capacity, greater range, and lower power requirements.




A feature of the present invention is that the input speech signal is classified into active and inactive regions. Active regions are further classified into voiced, unvoiced, and transient regions. The present invention therefore can apply various coding modes to different types of active speech, depending upon the required level of fidelity.




Another feature of the present invention is that coding modes may be utilized according to the strengths and weaknesses of each particular mode. The present invention dynamically switches between these modes as properties of the speech signal vary with time.




A further feature of the present invention is that, where appropriate, regions of speech are modeled as pseudo-random noise, resulting in a significantly lower bit rate. The present invention uses this coding in a dynamic fashion whenever unvoiced speech or background noise is detected.




The features, objects, and advantages of the present invention will become more apparent from the detailed description set forth below when taken in conjunction with the drawings in which like reference numbers indicate identical or functionally similar elements. Additionally, the left-most digit of a reference number identifies the drawing in which the reference number first appears.











BRIEF DESCRIPTION OF THE DRAWINGS





FIG. 1 is a diagram illustrating a signal transmission environment;

FIG. 2 is a diagram illustrating encoder 102 and decoder 104 in greater detail;

FIG. 3 is a flowchart illustrating variable rate speech coding according to the present invention;

FIG. 4A is a diagram illustrating a frame of voiced speech split into subframes;

FIG. 4B is a diagram illustrating a frame of unvoiced speech split into subframes;

FIG. 4C is a diagram illustrating a frame of transient speech split into subframes;

FIG. 5 is a flowchart that describes the calculation of initial parameters;

FIG. 6 is a flowchart describing the classification of speech as either active or inactive;

FIG. 7A depicts a CELP encoder;

FIG. 7B depicts a CELP decoder;

FIG. 8 depicts a pitch filter module;

FIG. 9A depicts a PPP encoder;

FIG. 9B depicts a PPP decoder;

FIG. 10 is a flowchart depicting the steps of PPP coding, including encoding and decoding;

FIG. 11 is a flowchart describing the extraction of a prototype residual period;

FIG. 12 depicts a prototype residual period extracted from the current frame of a residual signal, and the prototype residual period from the previous frame;

FIG. 13 is a flowchart depicting the calculation of rotational parameters;

FIG. 14 is a flowchart depicting the operation of the encoding codebook;

FIG. 15A depicts a first filter update module embodiment;

FIG. 15B depicts a first period interpolator module embodiment;

FIG. 16A depicts a second filter update module embodiment;

FIG. 16B depicts a second period interpolator module embodiment;

FIG. 17 is a flowchart describing the operation of the first filter update module embodiment;

FIG. 18 is a flowchart describing the operation of the second filter update module embodiment;

FIG. 19 is a flowchart describing the aligning and interpolating of prototype residual periods;

FIG. 20 is a flowchart describing the reconstruction of a speech signal based on prototype residual periods according to a first embodiment;

FIG. 21 is a flowchart describing the reconstruction of a speech signal based on prototype residual periods according to a second embodiment;

FIG. 22A depicts a NELP encoder;

FIG. 22B depicts a NELP decoder; and

FIG. 23 is a flowchart describing NELP coding.











DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS




I. Overview of the Environment




II. Overview of the Invention




III. Initial Parameter Determination




A. Calculation of LPC Coefficients




B. LSI Calculation




C. NACF Calculation




D. Pitch Track and Lag Calculation




E. Calculation of Band Energy and Zero Crossing Rate




F. Calculation of the Formant Residual




IV. Active/Inactive Speech Classification




A. Hangover Frames




V. Classification of Active Speech Frames




VI. Encoder/Decoder Mode Selection




VII. Code Excited Linear Prediction (CELP) Coding Mode




A. Pitch Encoding Module




B. Encoding codebook




C. CELP Decoder




D. Filter Update Module




VIII. Prototype Pitch Period (PPP) Coding Mode




A. Extraction Module




B. Rotational Correlator




C. Encoding Codebook




D. Filter Update Module




E. PPP Decoder




F. Period Interpolator




IX. Noise Excited Linear Prediction (NELP) Coding Mode




X. Conclusion




I. Overview of the Environment




The present invention is directed toward novel and improved methods and apparatuses for variable rate speech coding. FIG. 1 depicts a signal transmission environment 100 including an encoder 102, a decoder 104, and a transmission medium 106. Encoder 102 encodes a speech signal s(n), forming encoded speech signal s_enc(n), for transmission across transmission medium 106 to decoder 104. Decoder 104 decodes s_enc(n), thereby generating synthesized speech signal ŝ(n).




The term “coding” as used herein refers generally to methods encompassing both encoding and decoding. Generally, coding methods and apparatuses seek to minimize the number of bits transmitted via transmission medium 106 (i.e., minimize the bandwidth of s_enc(n)) while maintaining acceptable speech reproduction (i.e., ŝ(n) ≈ s(n)). The composition of the encoded speech signal will vary according to the particular speech coding method. Various encoders 102, decoders 104, and the coding methods according to which they operate are described below.




The components of encoder 102 and decoder 104 described below may be implemented as electronic hardware, as computer software, or combinations of both. These components are described below in terms of their functionality. Whether the functionality is implemented as hardware or software will depend upon the particular application and design constraints imposed on the overall system. Skilled artisans will recognize the interchangeability of hardware and software under these circumstances, and how best to implement the described functionality for each particular application.




Those skilled in the art will recognize that transmission medium 106 can represent many different transmission media, including, but not limited to, a land-based communication line, a link between a base station and a satellite, wireless communication between a cellular telephone and a base station, or between a cellular telephone and a satellite.




Those skilled in the art will also recognize that often each party to a communication transmits as well as receives. Each party would therefore require an encoder 102 and a decoder 104. However, signal transmission environment 100 will be described below as including encoder 102 at one end of transmission medium 106 and decoder 104 at the other. Skilled artisans will readily recognize how to extend these ideas to two-way communication.




For purposes of this description, assume that s(n) is a digital speech signal obtained during a typical conversation including different vocal sounds and periods of silence. The speech signal s(n) is preferably partitioned into frames, and each frame is further partitioned into subframes (preferably 4). These arbitrarily chosen frame/subframe boundaries are commonly used where some block processing is performed, as is the case here. Operations described as being performed on frames might also be performed on subframes—in this sense, frame and subframe are used interchangeably herein. However, s(n) need not be partitioned into frames/subframes at all if continuous processing rather than block processing is implemented. Skilled artisans will readily recognize how the block techniques described below might be extended to continuous processing.




In a preferred embodiment, s(n) is digitally sampled at 8 kHz. Each frame preferably contains 20 ms of data, or 160 samples at the preferred 8 kHz rate. Each subframe therefore contains 40 samples of data. It is important to note that many of the equations presented below assume these values. However, those skilled in the art will recognize that while these parameters are appropriate for speech coding, they are merely exemplary and other suitable alternative parameters could be used.
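As an illustration of this framing (a sketch using the exemplary parameters above, not code from the patent; the helper name frames is hypothetical), all code examples in this description use Python:

FRAME = 160                 # 20 ms frame at the exemplary 8 kHz rate
SUBFRAMES = 4               # subframes per frame
SUB = FRAME // SUBFRAMES    # 40 samples per subframe

def frames(s):
    # Yield (frame, list of subframes) pairs from a sampled speech signal s.
    for start in range(0, len(s) - FRAME + 1, FRAME):
        frame = s[start:start + FRAME]
        yield frame, [frame[k * SUB:(k + 1) * SUB] for k in range(SUBFRAMES)]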




II. Overview of the Invention




The methods and apparatuses of the present invention involve coding the speech signal s(n). FIG. 2 depicts encoder 102 and decoder 104 in greater detail. According to the present invention, encoder 102 includes an initial parameter calculation module 202, a classification module 208, and one or more encoder modes 204. Decoder 104 includes one or more decoder modes 206. The number of decoder modes, N_d, in general equals the number of encoder modes, N_e. As would be apparent to one skilled in the art, encoder mode 1 communicates with decoder mode 1, and so on. As shown, the encoded speech signal, s_enc(n), is transmitted via transmission medium 106.




In a preferred embodiment, encoder 102 dynamically switches between multiple encoder modes from frame to frame, depending on which mode is most appropriate given the properties of s(n) for the current frame. Decoder 104 also dynamically switches between the corresponding decoder modes from frame to frame. A particular mode is chosen for each frame to achieve the lowest bit rate available while maintaining acceptable signal reproduction at the decoder. This process is referred to as variable rate speech coding, because the bit rate of the coder changes over time (as properties of the signal change).





FIG. 3 is a flowchart 300 that describes variable rate speech coding according to the present invention. In step 302, initial parameter calculation module 202 calculates various parameters based on the current frame of data. In a preferred embodiment, these parameters include one or more of the following: linear predictive coding (LPC) filter coefficients, line spectrum information (LSI) coefficients, the normalized autocorrelation functions (NACFs), the open loop lag, band energies, the zero crossing rate, and the formant residual signal.




In step 304, classification module 208 classifies the current frame as containing either “active” or “inactive” speech. As described above, s(n) is assumed to include both periods of speech and periods of silence, common to an ordinary conversation. Active speech includes spoken words, whereas inactive speech includes everything else, e.g., background noise, silence, pauses. The methods used to classify speech as active/inactive according to the present invention are described in detail below.




As shown in FIG. 3, step 306 considers whether the current frame was classified as active or inactive in step 304. If active, control flow proceeds to step 308. If inactive, control flow proceeds to step 310.




Those frames which are classified as active are further classified in step 308 as either voiced, unvoiced, or transient frames. Those skilled in the art will recognize that human speech can be classified in many different ways. Two conventional classifications of speech are voiced and unvoiced sounds. According to the present invention, all speech which is not voiced or unvoiced is classified as transient speech.





FIG. 4A depicts an example portion of s(n) including voiced speech 402. Voiced sounds are produced by forcing air through the glottis with the tension of the vocal cords adjusted so that they vibrate in a relaxed oscillation, thereby producing quasi-periodic pulses of air which excite the vocal tract. One common property measured in voiced speech is the pitch period, as shown in FIG. 4A.




FIG. 4B depicts an example portion of s(n) including unvoiced speech 404. Unvoiced sounds are generated by forming a constriction at some point in the vocal tract (usually toward the mouth end), and forcing air through the constriction at a high enough velocity to produce turbulence. The resulting unvoiced speech signal resembles colored noise.





FIG. 4C depicts an example portion of s(n) including transient speech 406 (i.e., speech which is neither voiced nor unvoiced). The example transient speech 406 shown in FIG. 4C might represent s(n) transitioning between unvoiced speech and voiced speech. Skilled artisans will recognize that many different classifications of speech could be employed according to the techniques described herein to achieve comparable results.




In step 310, an encoder/decoder mode is selected based on the frame classification made in steps 306 and 308. The various encoder/decoder modes are connected in parallel, as shown in FIG. 2. One or more of these modes can be operational at any given time. However, as described in detail below, only one mode preferably operates at any given time, and is selected according to the classification of the current frame.




Several encoder/decoder modes are described in the following sections. The different encoder/decoder modes operate according to different coding schemes. Certain modes are more effective at coding portions of the speech signal s(n) exhibiting certain properties.




In a preferred embodiment, a “Code Excited Linear Predictive” (CELP) mode is chosen to code frames classified as transient speech. The CELP mode excites a linear predictive vocal tract model with a quantized version of the linear prediction residual signal. Of all the encoder/decoder modes described herein, CELP generally produces the most accurate speech reproduction but requires the highest bit rate. In one embodiment, the CELP mode performs encoding at 8500 bits per second.




A “Prototype Pitch Period” (PPP) mode is preferably chosen to code frames classified as voiced speech. Voiced speech contains slowly time varying periodic components which are exploited by the PPP mode. The PPP mode codes only a subset of the pitch periods within each frame. The remaining periods of the speech signal are reconstructed by interpolating between these prototype periods. By exploiting the periodicity of voiced speech, PPP is able to achieve a lower bit rate than CELP and still reproduce the speech signal in a perceptually accurate manner. In one embodiment, the PPP mode performs encoding at 3900 bits per second.




A “Noise Excited Linear Predictive” (NELP) mode is chosen to code frames classified as unvoiced speech. NELP uses a filtered pseudo-random noise signal to model unvoiced speech. NELP uses the simplest model for the coded speech, and therefore achieves the lowest bit rate. In one embodiment, the NELP mode performs encoding at 1500 bits per second.




The same coding technique can frequently be operated at different bit rates, with varying levels of performance. The different encoder/decoder modes in FIG. 2 can therefore represent different coding techniques, or the same coding technique operating at different bit rates, or combinations of the above. Skilled artisans will recognize that increasing the number of encoder/decoder modes will allow greater flexibility when choosing a mode, which can result in a lower average bit rate, but will increase complexity within the overall system. The particular combination used in any given system will be dictated by the available system resources and the specific signal environment.




In step 312, the selected encoder mode 204 encodes the current frame and preferably packs the encoded data into data packets for transmission. And in step 314, the corresponding decoder mode 206 unpacks the data packets, decodes the received data and reconstructs the speech signal. These operations are described in detail below with respect to the appropriate encoder/decoder modes.




III. Initial Parameter Determination





FIG. 5 is a flowchart describing step 302 in greater detail. Various initial parameters are calculated according to the present invention. The parameters preferably include, e.g., LPC coefficients, line spectrum information (LSI) coefficients, normalized autocorrelation functions (NACFs), open loop lag, band energies, zero crossing rate, and the formant residual signal. These parameters are used in various ways within the overall system, as described below.




In a preferred embodiment, initial parameter calculation module 202 uses a “look ahead” of 160+40 samples. This serves several purposes. First, the 160 sample look ahead allows a pitch frequency track to be computed using information in the next frame, which significantly improves the robustness of the voice coding and the pitch period estimation techniques, described below. Second, the 160 sample look ahead also allows the LPC coefficients, the frame energy, and the voice activity to be computed for one frame in the future. This allows for efficient, multi-frame quantization of the frame energy and LPC coefficients. Third, the additional 40 sample look ahead is for calculation of the LPC coefficients on Hamming windowed speech as described below. Thus the number of samples buffered before processing the current frame is 160+160+40, which includes the current frame and the 160+40 sample look ahead.




A. Calculation of LPC Coefficients




The present invention utilizes an LPC prediction error filter to remove the short term redundancies in the speech signal. The transfer function for the LPC filter is:







A(z) = 1 − Σ_{i=1}^{10} a_i z^{−i}
The present invention preferably implements a tenth-order filter, as shown in the previous equation. An LPC synthesis filter in the decoder reinserts the redundancies, and is given by the inverse of A(z):







1/A(z) = 1 / (1 − Σ_{i=1}^{10} a_i z^{−i})
In step 502, the LPC coefficients, a_i, are computed from s(n) as follows. The LPC parameters are preferably computed for the next frame during the encoding procedure for the current frame.




A Hamming window is applied to the current frame centered between the 119th and 120th samples (assuming the preferred 160 sample frame with a “look ahead”). The windowed speech signal, s_w(n), is given by:









s_w(n) = s(n + 40) (0.5 + 0.46 cos(π (n − 79.5)/80)), 0 ≤ n < 160

The offset of 40 samples results in the window of speech being centered between the 119th and 120th sample of the preferred 160 sample frame of speech.




Eleven autocorrelation values are preferably computed as

R(k) = Σ_{m=0}^{159−k} s_w(m) s_w(m + k), 0 ≤ k ≤ 10

The autocorrelation values are windowed to reduce the probability of missing roots of line spectral pairs (LSPs) obtained from the LPC coefficients, as given by:

R(k) = h(k) R(k), 0 ≤ k ≤ 10

resulting in a slight bandwidth expansion, e.g., 25 Hz. The values h(k) are preferably taken from the center of a 255 point Hamming window.




The LPC coefficients are then obtained from the windowed autocorrelation values using Durbin's recursion. Durbin's recursion, a well known efficient computational method, is discussed in the text Digital Processing of Speech Signals by Rabiner & Schafer.
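Durbin's recursion itself can be sketched as follows. This is the generic textbook algorithm rather than code from the patent; it takes the eleven windowed autocorrelation values R(0)..R(10) and returns a_1..a_10 under the A(z) = 1 − Σ a_i z^{−i} convention above:

import numpy as np

def durbin(R, order=10):
    # Levinson-Durbin recursion: autocorrelations R[0..order] -> a[1..order].
    a = np.zeros(order + 1)
    err = R[0]
    for i in range(1, order + 1):
        # reflection coefficient for stage i
        k = (R[i] - np.dot(a[1:i], R[i-1:0:-1])) / err
        new_a = a.copy()
        new_a[i] = k
        new_a[1:i] = a[1:i] - k * a[i-1:0:-1]
        a = new_a
        err *= (1.0 - k * k)     # prediction error shrinks each stage
    return a[1:]                 # a_1 .. a_10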




B. LSI Calculation




In step 504, the LPC coefficients are transformed into line spectrum information (LSI) coefficients for quantization and interpolation. The LSI coefficients are computed according to the present invention in the following manner.




As before, A(z) is given by

A(z) = 1 − a_1 z^{−1} − . . . − a_10 z^{−10},

where a_i are the LPC coefficients, and 1 ≤ i ≤ 10.




P_A(z) and Q_A(z) are defined as the following:

P_A(z) = A(z) + z^{−11} A(z^{−1}) = p_0 + p_1 z^{−1} + . . . + p_11 z^{−11},

Q_A(z) = A(z) − z^{−11} A(z^{−1}) = q_0 + q_1 z^{−1} + . . . + q_11 z^{−11},

where

p_i = −a_i − a_{11−i}, 1 ≤ i ≤ 10
q_i = −a_i + a_{11−i}, 1 ≤ i ≤ 10

and

p_0 = 1, p_11 = 1
q_0 = 1, q_11 = −1


The line spectral cosines (LSCs) are the ten roots in −1.0 < x < 1.0 of the following two functions:

P′(x) = p′_0 cos(5 cos^{−1}(x)) + p′_1 cos(4 cos^{−1}(x)) + . . . + p′_4 x + p′_5/2

Q′(x) = q′_0 cos(5 cos^{−1}(x)) + q′_1 cos(4 cos^{−1}(x)) + . . . + q′_4 x + q′_5/2

where

p′_0 = 1, q′_0 = 1
p′_i = p_i − p′_{i−1}, 1 ≤ i ≤ 5
q′_i = q_i + q′_{i−1}, 1 ≤ i ≤ 5

The LSI coefficients are then calculated as:

lsi_i = { 0.5 √(1 − lsc_i),        lsc_i ≥ 0
        { 1.0 − 0.5 √(1 + lsc_i),  lsc_i < 0

The LSCs can be obtained back from the LSI coefficients according to:

lsc_i = { 1.0 − 4 lsi_i^2,        lsi_i ≤ 0.5
        { 4 (1 − lsi_i)^2 − 1.0,  lsi_i > 0.5

The stability of the LPC filter guarantees that the roots of the two functions alternate, i.e., the smallest root, lsc_1, is the smallest root of P′(x), the next smallest root, lsc_2, is the smallest root of Q′(x), etc. Thus, lsc_1, lsc_3, lsc_5, lsc_7, and lsc_9 are the roots of P′(x), and lsc_2, lsc_4, lsc_6, lsc_8, and lsc_10 are the roots of Q′(x).
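Written out directly, the LSC/LSI mapping and its inverse look as follows (an illustrative sketch, not the patent's implementation; the branch forms follow the reconstruction above and join continuously at lsi_i = 0.5):

import numpy as np

def lsc_to_lsi(lsc):
    # Map line spectral cosines in (-1, 1) to LSI coefficients in [0, 1].
    lsc = np.asarray(lsc, dtype=float)
    return np.where(lsc >= 0.0,
                    0.5 * np.sqrt(1.0 - lsc),
                    1.0 - 0.5 * np.sqrt(1.0 + lsc))

def lsi_to_lsc(lsi):
    # Inverse mapping; both branches give 0.0 at lsi = 0.5.
    lsi = np.asarray(lsi, dtype=float)
    return np.where(lsi <= 0.5,
                    1.0 - 4.0 * lsi ** 2,
                    4.0 * (1.0 - lsi) ** 2 - 1.0)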




Those skilled in the art will recognize that it is preferable to employ some method for computing the sensitivity of the LSI coefficients to quantization. “Sensitivity weightings” can be used in the quantization process to appropriately weight the quantization error in each LSI.




The LSI coefficients are quantized using a multistage vector quantizer (VQ). The number of stages preferably depends on the particular bit rate and codebooks employed. The codebooks are chosen based on whether or not the current frame is voiced.




The vector quantization minimizes a weighted-mean-squared error (WMSE) which is defined as

E(x, y) = Σ_{i=0}^{P−1} w_i (x_i − y_i)^2

where x is the vector to be quantized, w the weight vector associated with it, and y is the codevector. In a preferred embodiment, w are sensitivity weightings and P = 10.




The LSI vector is reconstructed from the LSI codes obtained by way of quantization as

qlsi = Σ_{i=1}^{N} CB^i_{code_i}

where CB^i is the i-th stage VQ codebook for either voiced or unvoiced frames (this is based on the code indicating the choice of the codebook) and code_i is the LSI code for the i-th stage.




Before the LSI coefficients are transformed to LPC coefficients, a stability check is performed to ensure that the resulting LPC filters have not been made unstable due to quantization noise or channel errors injecting noise into the LSI coefficients. Stability is guaranteed if the LSI coefficients remain ordered.




In calculating the original LPC coefficients, a speech window centered between the 119th and 120th sample of the frame was used. The LPC coefficients for other points in the frame are approximated by interpolating between the previous frame's LSCs and the current frame's LSCs. The resulting interpolated LSCs are then converted back into LPC coefficients. The exact interpolation used for each subframe is given by:

ilsc_j = (1 − α_i) lscprev_j + α_i lsccurr_j, 1 ≤ j ≤ 10

where α_i are the interpolation factors 0.375, 0.625, 0.875, 1.000 for the four subframes of 40 samples each and ilsc are the interpolated LSCs. P̂_A(z) and Q̂_A(z) are computed from the interpolated LSCs as









P̂_A(z) = (1 + z^{−1}) Π_{j=1}^{5} (1 − 2 ilsc_{2j−1} z^{−1} + z^{−2})

Q̂_A(z) = (1 − z^{−1}) Π_{j=1}^{5} (1 − 2 ilsc_{2j} z^{−1} + z^{−2})



The interpolated LPC coefficients for all four subframes are computed as coefficients of

Â(z) = (P̂_A(z) + Q̂_A(z)) / 2

Thus,

â_i = { −(p̂_i + q̂_i)/2,            1 ≤ i ≤ 5
      { −(p̂_{11−i} − q̂_{11−i})/2,  6 ≤ i ≤ 10
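One way to check the reconstruction Â(z) = (P̂_A(z) + Q̂_A(z))/2 numerically (a sketch, not the patent's implementation; lscs_to_lpc is a hypothetical helper, and polynomial coefficients are ordered by increasing power of z^{−1}):

import numpy as np

def lscs_to_lpc(ilsc):
    # Rebuild the ten LPC coefficients from ten interpolated LSCs.
    P = np.array([1.0, 1.0])     # the (1 + z^-1) factor of P_A(z)
    Q = np.array([1.0, -1.0])    # the (1 - z^-1) factor of Q_A(z)
    for j in range(5):
        P = np.convolve(P, [1.0, -2.0 * ilsc[2 * j], 1.0])      # odd LSCs
        Q = np.convolve(Q, [1.0, -2.0 * ilsc[2 * j + 1], 1.0])  # even LSCs
    A = 0.5 * (P + Q)            # A(z) has degree 11; the z^-11 term cancels
    return -A[1:11]              # a_i with A(z) = 1 - sum a_i z^-i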
















C. NACF Calculation




In step 506, the normalized autocorrelation functions (NACFs) are calculated according to the current invention.




The formant residual for the next frame is computed over four 40 sample subframes as

r(n) = s(n) − Σ_{i=1}^{10} ã_i s(n − i)

where ã_i is the i-th interpolated LPC coefficient of the corresponding subframe, where the interpolation is done between the current frame's unquantized LSCs and the next frame's LSCs. The next frame's energy is also computed as







E_N = 0.5 log2( (Σ_{n=0}^{159} r^2(n)) / 160 )
The residual calculated above is low pass filtered and decimated, preferably using a zero phase FIR filter of length 15, the coefficients of which, df_i, −7 ≤ i ≤ 7, are {0.0800, 0.1256, 0.2532, 0.4376, 0.6424, 0.8268, 0.9544, 1.000, 0.9544, 0.8268, 0.6424, 0.4376, 0.2532, 0.1256, 0.0800}. The low pass filtered, decimated residual is computed as

r_d(n) = Σ_{i=−7}^{7} df_i r(Fn + i), 0 ≤ n < 160/F

where F = 2 is the decimation factor, and r(Fn + i), −7 ≤ Fn + i ≤ 6, are obtained from the last 14 values of the current frame's residual based on unquantized LPC coefficients. As mentioned above, these LPC coefficients are computed and stored during the previous frame.




The NACFs for two subframes (40 samples decimated) of the next frame are calculated as follows:

Exx_k = Σ_{i=0}^{39} r_d(40k + i) r_d(40k + i), k = 0, 1

Exy_{k,j} = Σ_{i=0}^{39} r_d(40k + i) r_d(40k + i − j), 12/2 ≤ j < 128/2, k = 0, 1

Eyy_{k,j} = Σ_{i=0}^{39} r_d(40k + i − j) r_d(40k + i − j), 12/2 ≤ j < 128/2, k = 0, 1

n_corr_{k, j−12/2} = (Exy_{k,j})^2 / (Exx_k Eyy_{k,j}), 12/2 ≤ j < 128/2, k = 0, 1


For r_d(n) with negative n, the current frame's low-pass filtered and decimated residual (stored during the previous frame) is used. The NACFs for the current subframe, c_corr, were also computed and stored during the previous frame.
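The NACF itself can be sketched as below (illustrative only; it assumes the decimated residual rd carries at least max-lag history samples from the previous frame, as the text requires for negative n):

import numpy as np

def nacf(rd, hist, k, lags):
    # Normalized autocorrelation Exy^2 / (Exx * Eyy) for decimated subframe k.
    # rd:   decimated residual with `hist` history samples prepended
    # lags: candidate decimated lags j, e.g. range(6, 64)
    base = hist + 40 * k
    x = rd[base:base + 40]
    exx = float(np.dot(x, x))
    out = np.empty(len(lags))
    for n, j in enumerate(lags):
        y = rd[base - j:base - j + 40]
        out[n] = float(np.dot(x, y)) ** 2 / (exx * float(np.dot(y, y)))
    return out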




D. Pitch Track and Lag Calculation




In step 508, the pitch track and pitch lag are computed according to the present invention. The pitch lag is preferably calculated using a Viterbi-like search with a backward track as follows.











R1
i

=


n_corr

0
,
i


+

max


{

n_corr

1
,

j
+

FAN

i
,
0





}




,





0

i
<

116
/
2


,




0

j
<

FAN

i
,
1











R2
i

=


c_corr

l
,
i


+

max


{

R1

j
+

FAN

i
,
o








)

,





0

i
<

116
/
2


,




0

j
<

FAN

i
,
1











RM

2

i


=


R2
i

+

max


{

c_corr

0
,

j
+

FAN

i
,
0









)

,





0

i
<

116
/
2


,




0

j
<

FAN

i
,
1















where FAN_{i,j} is the 2×58 matrix {{0,2}, {0,3}, {2,2}, {2,3}, {2,4}, {3,4}, {4,4}, {5,4}, {5,5}, {6,5}, {7,5}, {8,6}, {9,6}, {10,6}, {11,6}, {11,7}, {12,7}, {13,7}, {14,8}, {15,8}, {16,8}, {16,9}, {17,9}, {18,9}, {19,9}, {20,10}, {21,10}, {22,10}, {22,11}, {23,11}, {24,11}, {25,12}, {26,12}, {27,12}, {28,12}, {28,13}, {29,13}, {30,13}, {31,14}, {32,14}, {33,14}, {33,15}, {34,15}, {35,15}, {36,15}, {37,16}, {38,16}, {39,16}, {39,17}, {40,17}, {41,16}, {42,16}, {43,15}, {44,14}, {45,13}, {45,13}, {46,12}, {47,11}}. The vector RM_{2i} is interpolated to get values for RM_{2i+1} as











RM_{iF+1} = Σ_{j=0}^{3} cf_j RM_{(i−1+j)F}, 1 ≤ i < 112/2

RM_1 = (RM_0 + RM_2)/2

RM_{2·56+1} = (RM_{2·56} + RM_{2·57})/2

RM_{2·57+1} = RM_{2·57}




where cf_j is the interpolation filter whose coefficients are {−0.0625, 0.5625, 0.5625, −0.0625}. The lag L_C is then chosen such that R_{L_C − 12} = max{R_i}, 4 ≤ i ≤ 116, and the current frame's NACF is set equal to R_{L_C − 12}/4. Lag multiples are then removed by searching for the lag corresponding to the maximum correlation greater than 0.9 R_{L_C − 12} amidst

R_{max{⌊L_C/M⌋ − 14, 16}} . . . R_{⌊L_C/M⌋ − 10}

for all 1 ≤ M ≤ ⌊L_C/16⌋.






E. Calculation of Band Energy and Zero Crossing Rate




In step 510, energies in the 0-2 kHz band and 2 kHz-4 kHz band are computed according to the present invention as







E_L = Σ_{n=0}^{159} s_L^2(n)

E_H = Σ_{n=0}^{159} s_H^2(n)

where

S_L(z) = S(z) (bl_0 + Σ_{i=1}^{15} bl_i z^{−i}) / (al_0 + Σ_{i=1}^{15} al_i z^{−i})

S_H(z) = S(z) (bh_0 + Σ_{i=1}^{15} bh_i z^{−i}) / (ah_0 + Σ_{i=1}^{15} ah_i z^{−i})

S(z), S_L(z) and S_H(z) being the z-transforms of the input speech signal s(n), low-pass signal s_L(n) and high-pass signal s_H(n), respectively, bl = {0.0003, 0.0048, 0.0333, 0.1443, 0.4329, 0.9524, 1.5873, 2.0409, 2.0409, 1.5873, 0.9524, 0.4329, 0.1443, 0.0333, 0.0048, 0.0003}, al = {1.0, 0.9155, 2.4074, 1.6511, 2.0597, 1.0584, 0.7976, 0.3020, 0.1465, 0.0394, 0.0122, 0.0021, 0.0004, 0.0, 0.0, 0.0}, bh = {0.0013, −0.0189, 0.1324, −0.5737, 1.7212, −3.7867, 6.3112, −8.1144, 8.1144, −6.3112, 3.7867, −1.7212, 0.5737, −0.1324, 0.0189, −0.0013} and ah = {1.0, −2.8818, 5.7550, −7.7730, 8.2419, −6.8372, 4.6171, −2.5257, 1.1296, −0.4084, 0.1183, −0.0268, 0.0046, −0.0006, 0.0, 0.0}.




The speech signal energy itself is

E = Σ_{n=0}^{159} s^2(n).

The zero crossing rate ZCR is computed as

if (s(n) s(n+1) < 0) ZCR = ZCR + 1, 0 ≤ n < 159
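As a one-line sketch (illustrative, not the patent's code), the zero crossing count over a 160 sample frame is:

import numpy as np

def zero_crossing_rate(s):
    # Count sign changes between successive samples of one frame.
    s = np.asarray(s, dtype=float)
    return int(np.sum(s[:-1] * s[1:] < 0.0))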






F. Calculation of the Formant Residual




In step 512, the formant residual for the current frame is computed over four subframes as

r_curr(n) = s(n) − Σ_{i=1}^{10} â_i s(n − i)

where â_i is the i-th LPC coefficient of the corresponding subframe.




IV. Active/Inactive Speech Classification




Referring back to FIG. 3, in step 304, the current frame is classified as either active speech (e.g., spoken words) or inactive speech (e.g., background noise, silence). FIG. 6 is a flowchart 600 that depicts step 304 in greater detail. In a preferred embodiment, a two energy band based thresholding scheme is used to determine if active speech is present. The lower band (band 0) spans frequencies from 0.1-2.0 kHz and the upper band (band 1) from 2.0-4.0 kHz. Voice activity detection is preferably determined for the next frame during the encoding procedure for the current frame, in the following manner.




In step 602, the band energies E_b[i] for bands i = 0, 1 are computed. The autocorrelation sequence, as described above in Section III.A., is extended to 19 using the following recursive equation:











R(k) = Σ_{i=1}^{10} a_i R(k − i), 11 ≤ k ≤ 19
Using this equation, R(11) is computed from R(1) to R(10), R(12) is computed from R(2) to R(11), and so on. The band energies are then computed from the extended autocorrelation sequence using the following equation:












E
b



(
i
)


=


log
2



(



R


(
0
)





R
h



(
0
)




(
0
)


+

2





k
=
1

19




R


(
k
)





R
h



(
i
)




(
k
)





)



,





i
=
0

,
1













where R(k) is the extended autocorrelation sequence for the current frame and R


h


(i)(k) is the band filter autocorrelation sequence for band i given in Table 1.












TABLE 1
Filter Autocorrelation Sequences for Band Energy Calculations

 k    R_h(0)(k) band 0    R_h(1)(k) band 1
 0     4.230889E-01        4.042770E-01
 1     2.693014E-01       -2.503076E-01
 2    -1.124000E-02       -3.059308E-02
 3    -1.301279E-01        1.497124E-01
 4    -5.949044E-02       -7.905954E-02
 5     1.494007E-02        4.371288E-03
 6    -2.087666E-03       -2.088545E-02
 7    -3.823536E-02        5.622753E-02
 8    -2.748034E-02       -4.420598E-02
 9     3.015699E-04        1.443167E-02
10     3.722060E-03       -8.462525E-03
11    -6.416949E-03        1.627144E-02
12    -6.551736E-03       -1.476080E-02
13     5.493820E-04        6.187041E-03
14     2.934550E-03       -1.898632E-03
15     8.041829E-04        2.053577E-03
16    -2.857628E-04       -1.860064E-03
17     2.585250E-04        7.729618E-04
18     4.816371E-04       -2.297862E-04
19     1.692738E-04        2.107964E-04


In step 604, the band energy estimates are smoothed. The smoothed band energy estimates, E_sm(i), are updated for each frame using the following equation:

E_sm(i) = 0.6 E_sm(i) + 0.4 E_b(i), i = 0, 1



In step 606, signal energy and noise energy estimates are updated. The signal energy estimates, E_s(i), are preferably updated using the following equation:

E_s(i) = max(E_sm(i), E_s(i)), i = 0, 1


The noise energy estimates, E_n(i), are preferably updated using the following equation:

E_n(i) = min(E_sm(i), E_n(i)), i = 0, 1


In step 608, the long term signal-to-noise ratios for the two bands, SNR(i), are computed as

SNR(i) = E_s(i) − E_n(i), i = 0, 1



In step 610, these SNR values are preferably divided into eight regions Reg_SNR(i) defined as

Reg_SNR(i) = { 0,                       0.6 SNR(i) − 4 < 0
             { round(0.6 SNR(i) − 4),   0 ≤ 0.6 SNR(i) − 4 < 7
             { 7,                       0.6 SNR(i) − 4 ≥ 7

In step 612, the voice activity decision is made in the following manner according to the current invention. If either E_b(0) − E_n(0) > THRESH(Reg_SNR(0)), or E_b(1) − E_n(1) > THRESH(Reg_SNR(1)), then the frame of speech is declared active. Otherwise, the frame of speech is declared inactive. The values of THRESH are defined in Table 2.












TABLE 2
Threshold Factors as a Function of the SNR Region

SNR Region    THRESH
0             2.807
1             2.807
2             3.000
3             3.104
4             3.154
5             3.233
6             3.459
7             3.982

The signal energy estimates, E_s(i), are preferably updated using the following equation:

E_s(i) = E_s(i) − 0.014499, i = 0, 1.


The noise energy estimates, E_n(i), are preferably updated using the following equation:

E_n(i) = { 4,                E_n(i) + 0.0066 < 4
         { 23,               23 < E_n(i) + 0.0066     , i = 0, 1
         { E_n(i) + 0.0066,  otherwise
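Putting steps 602 through 612 and the estimate updates together, one frame of the decision logic might look like the following sketch (illustrative only; vad_update is a hypothetical name, THRESH is the eight-entry list from Table 2, and Eb, Esm, Es, En are per-band [band 0, band 1] lists, updated in place, in the log2 domain used above):

def vad_update(Eb, Esm, Es, En, THRESH):
    # One frame of the two-band voice activity decision.
    active = False
    for i in (0, 1):
        Esm[i] = 0.6 * Esm[i] + 0.4 * Eb[i]          # step 604: smoothing
        Es[i] = max(Esm[i], Es[i])                   # step 606: signal energy
        En[i] = min(Esm[i], En[i])                   # step 606: noise energy
        snr = Es[i] - En[i]                          # step 608
        reg = min(max(int(round(0.6 * snr - 4.0)), 0), 7)   # step 610
        if Eb[i] - En[i] > THRESH[reg]:              # step 612
            active = True
    for i in (0, 1):
        Es[i] -= 0.014499                            # drift signal estimate down
        En[i] = min(max(En[i] + 0.0066, 4.0), 23.0)  # drift noise estimate up
    return active

For example, THRESH = [2.807, 2.807, 3.000, 3.104, 3.154, 3.233, 3.459, 3.982] reproduces Table 2.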















A. Hangover Frames




When signal-to-noise ratios are low, “hangover” frames are preferably added to improve the quality of the reconstructed speech. If the three previous frames were classified as active, and the current frame is classified inactive, then the next M frames including the current frame are classified as active speech. The number of hangover frames, M, is preferably determined as a function of SNR(0) as defined in Table 3.












TABLE 3
Hangover Frames as a Function of SNR(0)

SNR(0)    M
0         4
1         3
2         3
3         3
4         3
5         3
6         3
7         3

V. Classification of Active Speech Frames




Referring back to FIG. 3, in step 308, current frames which were classified as being active in step 304 are further classified according to properties exhibited by the speech signal s(n). In a preferred embodiment, active speech is classified as either voiced, unvoiced, or transient. The degree of periodicity exhibited by the active speech signal determines how it is classified. Voiced speech exhibits the highest degree of periodicity (quasi-periodic in nature). Unvoiced speech exhibits little or no periodicity. Transient speech exhibits degrees of periodicity between voiced and unvoiced.




However, the general framework described herein is not limited to the preferred classification scheme and the specific encoder/decoder modes described below. Active speech can be classified in alternative ways, and alternative encoder/decoder modes are available for coding. Those skilled in the art will recognize that many combinations of classifications and encoder/decoder modes are possible. Many such combinations can result in a reduced average bit rate according to the general framework described herein, i.e., classifying speech as inactive or active, further classifying active speech, and then coding the speech signal using encoder/decoder modes particularly suited to the speech falling within each classification.




Although the active speech classifications are based on degree of periodicity, the classification decision is preferably not based on some direct measurement of periodicity. Rather, the classification decision is based on various parameters calculated in step 302, e.g., signal to noise ratios in the upper and lower bands and the NACFs. The preferred classification may be described by the following pseudo-code:

if not(previousNACF < 0.5 and currentNACF > 0.6)
    if (currentNACF < 0.75 and ZCR > 60) UNVOICED
    else if (previousNACF < 0.5 and currentNACF < 0.55 and ZCR > 50) UNVOICED
    else if (currentNACF < 0.4 and ZCR > 40) UNVOICED

if (UNVOICED and currentSNR > 28 dB and E_L > α E_H) TRANSIENT

if (previousNACF < 0.5 and currentNACF < 0.5 and E < 5e4 + N_noise) UNVOICED

if (VOICED and low-bandSNR > high-bandSNR and previousNACF < 0.8 and 0.6 < currentNACF < 0.75) TRANSIENT




where






α
=

{




1.0
,




E
>


5

e5

+

N
noise








20.0
,




E



5

e5

+

N
noise

















and N_noise is an estimate of the background noise. E_prev is the previous frame's input energy.
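A direct transliteration of this pseudo-code (illustrative; attaching the opening guard to the three UNVOICED tests is an assumption, since the patent lists the tests sequentially, and the parameter names are not from the patent):

def classify_active(prev_nacf, cur_nacf, zcr, E, EL, EH,
                    cur_snr, low_snr, high_snr, N_noise):
    # Returns "VOICED", "UNVOICED", or "TRANSIENT" for one active frame.
    label = "VOICED"                      # frames failing every test below
    if not (prev_nacf < 0.5 and cur_nacf > 0.6):
        if ((cur_nacf < 0.75 and zcr > 60)
                or (prev_nacf < 0.5 and cur_nacf < 0.55 and zcr > 50)
                or (cur_nacf < 0.4 and zcr > 40)):
            label = "UNVOICED"
    alpha = 1.0 if E > 5e5 + N_noise else 20.0
    if label == "UNVOICED" and cur_snr > 28.0 and EL > alpha * EH:
        label = "TRANSIENT"
    if prev_nacf < 0.5 and cur_nacf < 0.5 and E < 5e4 + N_noise:
        label = "UNVOICED"
    if (label == "VOICED" and low_snr > high_snr
            and prev_nacf < 0.8 and 0.6 < cur_nacf < 0.75):
        label = "TRANSIENT"
    return label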




The method described by this pseudo code can be refined according to the specific environment in which it is implemented. Those skilled in the art will recognize that the various thresholds given above are merely exemplary, and could require adjustment in practice depending upon the implementation. The method may also be refined by adding additional classification categories, such as dividing TRANSIENT into two categories: one for signals transitioning from high to low energy, and the other for signals transitioning from low to high energy.




Those skilled in the art will recognize that other methods are available for distinguishing voiced, unvoiced, and transient active speech. Similarly, skilled artisans will recognize that other classification schemes for active speech are also possible.




VI. Encoder/Decoder Mode Selection




In step 310, an encoder/decoder mode is selected based on the classification of the current frame in steps 304 and 308. According to a preferred embodiment, modes are selected as follows: inactive frames and active unvoiced frames are coded using a NELP mode, active voiced frames are coded using a PPP mode, and active transient frames are coded using a CELP mode. Each of these encoder/decoder modes is described in detail in following sections.




In an alternative embodiment, inactive frames are coded using a zero rate mode. Skilled artisans will recognize that many alternative zero rate modes are available which require very low bit rates. The selection of a zero rate mode may be further refined by considering past mode selections. For example, if the previous frame was classified as active, this may preclude the selection of a zero rate mode for the current frame. Similarly, if the next frame is active, a zero rate mode may be precluded for the current frame. Another alternative is to preclude the selection of a zero rate mode for too many consecutive frames (e.g., 9 consecutive frames). Those skilled in the art will recognize that many other modifications might be made to the basic mode selection decision in order to refine its operation in certain environments.




As described above, many other combinations of classifications and encoder/decoder modes might be alternatively used within this same framework. The following sections provide detailed descriptions of several encoder/decoder modes according to the present invention. The CELP mode is described first, followed by the PPP mode and the NELP mode.




VII. Code Excited Linear Prediction (CELP) Coding Mode




As described above, the CELP encoder/decoder mode is employed when the current frame is classified as active transient speech. The CELP mode provides the most accurate signal reproduction (as compared to the other modes described herein) but at the highest bit rate.





FIG. 7 depicts a CELP encoder mode 204 and a CELP decoder mode 206 in further detail. As shown in FIG. 7A, CELP encoder mode 204 includes a pitch encoding module 702, an encoding codebook 704, and a filter update module 706. CELP encoder mode 204 outputs an encoded speech signal, s_enc(n), which preferably includes codebook parameters and pitch filter parameters, for transmission to CELP decoder mode 206. As shown in FIG. 7B, CELP decoder mode 206 includes a decoding codebook module 708, a pitch filter 710, and an LPC synthesis filter 712. CELP decoder mode 206 receives the encoded speech signal and outputs synthesized speech signal ŝ(n).




A. Pitch Encoding Module




Pitch encoding module 702 receives the speech signal s(n) and the quantized residual from the previous frame, p_c(n) (described below). Based on this input, pitch encoding module 702 generates a target signal x(n) and a set of pitch filter parameters. In a preferred embodiment, these pitch filter parameters include an optimal pitch lag L* and an optimal pitch gain b*. These parameters are selected according to an “analysis-by-synthesis” method in which the encoding process selects the pitch filter parameters that minimize the weighted error between the input speech and the synthesized speech using those parameters.





FIG. 8 depicts pitch encoding module 702 in greater detail. Pitch encoding module 702 includes a perceptual weighting filter 802, adders 804 and 816, weighted LPC synthesis filters 806 and 808, a delay and gain 810, and a minimize sum of squares 812.




Perceptual weighting filter 802 is used to weight the error between the original speech and the synthesized speech in a perceptually meaningful way. The perceptual weighting filter is of the form

W(z) = A(z) / A(z/γ)

where A(z) is the LPC prediction error filter, and γ preferably equals 0.8. Weighted LPC analysis filter 806 receives the LPC coefficients calculated by initial parameter calculation module 202. Filter 806 outputs a_zir(n), which is the zero input response given the LPC coefficients. Adder 804 sums a negative input a_zir(n) and the filtered input signal to form target signal x(n).




Delay and gain 810 outputs an estimated pitch filter output b p_L(n) for a given pitch lag L and pitch gain b. Delay and gain 810 receives the quantized residual samples from the previous frame, p_c(n), and an estimate of future output of the pitch filter, given by p_o(n), and forms p(n) according to:

p(n) = { p_c(n),  −128 < n < 0
       { p_o(n),  0 ≤ n < L_p

which is then delayed by L samples and scaled by b to form b p_L(n). L_p is the subframe length (preferably 40 samples). In a preferred embodiment, the pitch lag, L, is represented by 8 bits and can take on values 20.0, 20.5, 21.0, 21.5, . . . 126.0, 126.5, 127.0, 127.5.




Weighted LPC analysis filter 808 filters b p_L(n) using the current LPC coefficients, resulting in b y_L(n). Adder 816 sums a negative input b y_L(n) with x(n), the output of which is received by minimize sum of squares 812. Minimize sum of squares 812 selects the optimal L, denoted by L*, and the optimal b, denoted by b*, as those values of L and b that minimize E_pitch(L) according to:

E_pitch(L) = Σ_{n=0}^{L_p−1} { x(n) − b y_L(n) }^2

If

E_xy(L) = Σ_{n=0}^{L_p−1} x(n) y_L(n)   and   E_yy(L) = Σ_{n=0}^{L_p−1} y_L(n)^2,

then the value of b which minimizes E_pitch(L) for a given value of L is

b* = E_xy(L) / E_yy(L)

for which

E_pitch(L) = K − E_xy(L)^2 / E_yy(L)

where K is a constant that can be neglected.




The optimal values of L and b (L* and b*) are found by first determining the value of L which minimizes E_pitch(L) and then computing b*.
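Since minimizing E_pitch(L) = K − E_xy(L)^2/E_yy(L) is the same as maximizing E_xy^2/E_yy, the search can be sketched as follows (illustrative only; y_of is a hypothetical helper standing in for delay and gain 810 followed by weighted LPC synthesis filter 808, and it would hide the non-integer-lag handling):

import numpy as np

def pitch_search(x, y_of):
    # Analysis-by-synthesis search for (L*, b*) over the 8-bit lag grid.
    best_L, best_b, best_metric = None, 0.0, -np.inf
    for L in np.arange(20.0, 128.0, 0.5):      # 20.0, 20.5, ..., 127.5
        yL = y_of(L)                           # filtered pitch prediction
        Exy = float(np.dot(x, yL))
        Eyy = float(np.dot(yL, yL))
        if Eyy > 0.0 and Exy * Exy / Eyy > best_metric:
            best_L, best_b, best_metric = L, Exy / Eyy, Exy * Exy / Eyy
    return best_L, best_b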




These pitch filter parameters are preferably calculated for each subframe and then quantized for efficient transmission. In a preferred embodiment, the transmission codes PLAGj and PGAINj for the j-th subframe are computed as

PGAINj = ⌊ min{b*, 2} (8/2) + 0.5 ⌋ − 1

PLAGj = { 0,     PGAINj = −1
        { 2 L*,  0 ≤ PGAINj < 8

PGAINj is then adjusted to −1 if PLAGj is set to 0. These transmission codes are transmitted to CELP decoder mode 206 as the pitch filter parameters, part of the encoded speech signal s_enc(n).




B. Encoding Codebook




Encoding codebook


704


receives the target signal x(n) and determines a set of codebook excitation parameters which are used by CELP decoder mode


206


, along with the pitch filter parameters, to reconstruct the quantized residual signal.




Encoding codebook 704 first updates x(n) as follows:

x(n) = x(n) − y_pzir(n), 0 ≤ n < 40

where y_pzir(n) is the output of the weighted LPC synthesis filter (with memories retained from the end of the previous subframe) to an input which is the zero-input-response of the pitch filter with parameters L̂* and b̂* (and memories resulting from the previous subframe's processing).




A backfiltered target vector d = {d_n}, 0 ≤ n < 40, is created as d = Hᵀx, where

H = [ h_0    0     0    . . .   0
      h_1    h_0   0    . . .   0
      .      .     .            .
      h_39   h_38  h_37 . . .   h_0 ]

is the impulse response matrix formed from the impulse response {h_n}, and x = {x(n)}, 0 ≤ n < 40. Two more vectors, φ = {φ_n} and s, are created as well:

s = sign(d)

φ_n = { 2 Σ_{i=0}^{39−n} h_i h_{i+n},  0 < n < 40
      { Σ_{i=0}^{39} h_i^2,            n = 0

where

sign(x) = { 1,   x ≥ 0
          { −1,  x < 0



Encoding codebook


704


initializes the values Exy* and Eyy* to zero and searches for the optimum excitation parameters, preferably with four values of N (


0


,


1


,


2


,


3


), according to:







$$p = (N + \{0, 1, 2, 3, 4\})\,\%\,5$$

$$A = \{p_0,\ p_0 + 5,\ \ldots,\ i < 40\} \qquad B = \{p_1,\ p_1 + 5,\ \ldots,\ k < 40\}$$

$$Den_{i,k} = 2\varphi_0 + s_i s_k \varphi_{|k-i|}, \quad i \in A,\ k \in B$$

$$\{I_0, I_1\} = \arg\max_{i \in A,\ k \in B} \left\{ \frac{(|d_i| + |d_k|)^2}{Den_{i,k}} \right\}$$

$$\{S_0, S_1\} = \{s_{I_0}, s_{I_1}\} \qquad Exy0 = |d_{I_0}| + |d_{I_1}| \qquad Eyy0 = Den_{I_0, I_1}$$

$$A = \{p_2,\ p_2 + 5,\ \ldots,\ i < 40\} \qquad B = \{p_3,\ p_3 + 5,\ \ldots,\ k < 40\}$$

$$Den_{i,k} = Eyy0 + 2\varphi_0 + s_i \left( S_0 \varphi_{|I_0 - i|} + S_1 \varphi_{|I_1 - i|} \right) + s_k \left( S_0 \varphi_{|I_0 - k|} + S_1 \varphi_{|I_1 - k|} \right) + s_i s_k \varphi_{|k-i|}, \quad i \in A,\ k \in B$$

$$\{I_2, I_3\} = \arg\max_{i \in A,\ k \in B} \left\{ \frac{(Exy0 + |d_i| + |d_k|)^2}{Den_{i,k}} \right\}$$

$$\{S_2, S_3\} = \{s_{I_2}, s_{I_3}\} \qquad Exy1 = Exy0 + |d_{I_2}| + |d_{I_3}| \qquad Eyy1 = Den_{I_2, I_3}$$

$$A = \{p_4,\ p_4 + 5,\ \ldots,\ i < 40\}$$

$$Den_i = Eyy1 + \varphi_0 + s_i \left( S_0 \varphi_{|I_0 - i|} + S_1 \varphi_{|I_1 - i|} + S_2 \varphi_{|I_2 - i|} + S_3 \varphi_{|I_3 - i|} \right), \quad i \in A$$

$$I_4 = \arg\max_{i \in A} \left\{ \frac{(Exy1 + |d_i|)^2}{Den_i} \right\}$$

$$S_4 = s_{I_4} \qquad Exy2 = Exy1 + |d_{I_4}| \qquad Eyy2 = Den_{I_4}$$

If Exy2² · Eyy* > Exy*² · Eyy2 {
    Exy* = Exy2
    Eyy* = Eyy2
    {ind_p0, ind_p1, ind_p2, ind_p3, ind_p4} = {I_0, I_1, I_2, I_3, I_4}
    {sgn_p0, sgn_p1, sgn_p2, sgn_p3, sgn_p4} = {S_0, S_1, S_2, S_3, S_4}
}
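The following C sketch illustrates the first pulse-pair search above for a single starting phase N. It is a sketch under assumptions, not the patent's reference implementation: the names are ours, and the comparison is cross-multiplied to avoid divisions.

#include <math.h>
#include <stdlib.h>

void search_first_pair(const double d[40], const double s[40],
                       const double phi[40], int p0, int p1,
                       int *I0, int *I1, double *Exy0, double *Eyy0)
{
    double bn = 0.0, bd = 1.0;   /* best numerator |d_i|+|d_k| and its Den */
    for (int i = p0; i < 40; i += 5) {          /* track A positions */
        for (int k = p1; k < 40; k += 5) {      /* track B positions */
            double den = 2.0 * phi[0] + s[i] * s[k] * phi[abs(k - i)];
            double num = fabs(d[i]) + fabs(d[k]);
            /* keep the pair with (num^2 / den) > (bn^2 / bd) */
            if (num * num * bd > bn * bn * den) {
                bn = num; bd = den;
                *I0 = i; *I1 = k;
            }
        }
    }
    *Exy0 = bn;   /* becomes Exy0 */
    *Eyy0 = bd;   /* becomes Eyy0 = Den_{I0,I1} */
}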














Encoding codebook 704 calculates the codebook gain G* as $Exy^*/Eyy^*$, and then quantizes the set of excitation parameters as the following transmission codes for the jth subframe:






$$CBIjk = \left\lfloor \frac{ind_k}{5} \right\rfloor, \quad 0 \leq k < 5$$

$$SIGNjk = \begin{cases} 0, & sgn_k = 1 \\ 1, & sgn_k = -1 \end{cases}, \quad 0 \leq k < 5$$

$$CBGj = \left\lfloor \min\left\{ \log_2(\max\{1, G^*\}),\ 11.2636 \right\} \cdot \frac{31}{11.2636} + 0.5 \right\rfloor$$

















and the quantized gain Ĝ* is $2^{CBGj \cdot 11.2636 / 31}$.
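As a hedged illustration of this gain round trip (the function names are ours; the constants are taken from the formulas above):

#include <math.h>

int quantize_cbg(double G)                     /* G* -> CBGj in 0..31 */
{
    double v = log2(G > 1.0 ? G : 1.0);        /* log2(max{1, G*})    */
    if (v > 11.2636) v = 11.2636;
    return (int)(v * 31.0 / 11.2636 + 0.5);
}

double dequantize_cbg(int CBGj)                /* CBGj -> quantized gain */
{
    return pow(2.0, CBGj * 11.2636 / 31.0);
}

For example, G* = 100 quantizes to CBGj = 18, which decodes back to a gain of roughly 93.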











Lower bit rate embodiments of the CELP encoder/decoder mode may be realized by removing pitch encoding module 702 and only performing a codebook search to determine an index I and gain G for each of the four subframes. Those skilled in the art will recognize how the ideas described above might be extended to accomplish this lower bit rate embodiment.




C. CELP Decoder




CELP decoder mode 206 receives the encoded speech signal, preferably including codebook excitation parameters and pitch filter parameters, from CELP encoder mode 204, and based on this data outputs synthesized speech ŝ(n). Decoding codebook module 708 receives the codebook excitation parameters and generates the excitation signal cb(n) with a gain of G. The excitation signal cb(n) for the jth subframe contains mostly zeroes except for the five locations:








$$I_k = 5\,CBIjk + k, \quad 0 \leq k < 5$$

which correspondingly have impulses of value

$$S_k = 1 - 2\,SIGNjk, \quad 0 \leq k < 5$$

all of which are scaled by the gain G, which is computed to be $2^{CBGj \cdot 11.2636 / 31}$, to provide Gcb(n).
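A minimal C sketch of this reconstruction (names are ours; the 40-sample subframe and the constants come from the text):

#include <math.h>

void decode_celp_excitation(const int CBI[5], const int SIGN[5],
                            int CBG, double Gcb[40])
{
    double G = pow(2.0, CBG * 11.2636 / 31.0);  /* decoded gain G     */
    for (int n = 0; n < 40; n++) Gcb[n] = 0.0;  /* mostly zeroes      */
    for (int k = 0; k < 5; k++) {
        int pos = 5 * CBI[k] + k;               /* I_k = 5 CBIjk + k  */
        double Sk = 1.0 - 2.0 * SIGN[k];        /* S_k = 1 - 2 SIGNjk */
        Gcb[pos] = G * Sk;                      /* scaled impulse     */
    }
}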




Pitch filter 710 decodes the pitch filter parameters from the received transmission codes according to:

$$\hat{L}^* = \frac{PLAGj}{2}$$

$$\hat{b}^* = \begin{cases} 0, & \hat{L}^* = 0 \\[4pt] \dfrac{2}{8}\,PGAINj, & \hat{L}^* \neq 0 \end{cases}$$















Pitch filter 710 then filters Gcb(n), where the filter has a transfer function given by

$$\frac{1}{P(z)} = \frac{1}{1 - b^* z^{-L^*}}$$


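In the time domain this is a one-tap long-term predictor: each output sample adds b* times the output L* samples earlier. A simplified C sketch follows; the memory layout and the MAX_LAG bound are assumptions of the sketch.

#define SF 40
#define MAX_LAG 128                 /* assumed bound on the lag L* */

/* One-tap pitch synthesis 1/P(z) = 1/(1 - b z^-L) over one subframe.
 * mem[] holds the previous MAX_LAG output samples, newest last. */
void pitch_synthesis(const double in[SF], double out[SF],
                     double mem[MAX_LAG], int L, double b)
{
    for (int n = 0; n < SF; n++) {
        double past = (n >= L) ? out[n - L] : mem[MAX_LAG - L + n];
        out[n] = in[n] + b * past;
    }
    /* slide the memory and append this subframe's output */
    for (int i = 0; i < MAX_LAG - SF; i++) mem[i] = mem[i + SF];
    for (int i = 0; i < SF; i++) mem[MAX_LAG - SF + i] = out[i];
}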






In a preferred embodiment, CELP decoder mode 206 also adds an extra pitch filtering operation, a pitch prefilter (not shown), after pitch filter 710. The lag for the pitch prefilter is the same as that of pitch filter 710, whereas its gain is preferably half of the pitch gain, up to a maximum of 0.5.

LPC synthesis filter 712 receives the reconstructed quantized residual signal r̂(n) and outputs the synthesized speech signal ŝ(n).




D. Filter Update Module




Filter update module 706 synthesizes speech as described in the previous section in order to update filter memories. Filter update module 706 receives the codebook excitation parameters and the pitch filter parameters, generates an excitation signal cb(n), pitch filters Gcb(n), and then synthesizes ŝ(n). By performing this synthesis at the encoder, memories in the pitch filter and in the LPC synthesis filter are updated for use when processing the following subframe.




VIII. Prototype Pitch Period (PPP) Coding Mode




Prototype pitch period (PPP) coding exploits the periodicity of a speech signal to achieve lower bit rates than may be obtained using CELP coding. In general, PPP coding involves extracting a representative period of the residual signal, referred to herein as the prototype residual, and then using that prototype to construct earlier pitch periods in the frame by interpolating between the prototype residual of the current frame and a similar pitch period from the previous frame (i.e., the prototype residual if the last frame was PPP). The effectiveness (in terms of lowered bit rate) of PPP coding depends, in part, on how closely the current and previous prototype residuals resemble the intervening pitch periods. For this reason, PPP coding is preferably applied to speech signals that exhibit relatively high degrees of periodicity (e.g., voiced speech), referred to herein as quasi-periodic speech signals.





FIG. 9 depicts a PPP encoder mode 204 and a PPP decoder mode 206 in further detail. PPP encoder mode 204 includes an extraction module 904, a rotational correlator 906, an encoding codebook 908, and a filter update module 910. PPP encoder mode 204 receives the residual signal r(n) and outputs an encoded speech signal s_enc(n), which preferably includes codebook parameters and rotational parameters. PPP decoder mode 206 includes a codebook decoder 912, a rotator 914, an adder 916, a period interpolator 920, and a warping filter 918.





FIG. 10 is a flowchart 1000 depicting the steps of PPP coding, including encoding and decoding. These steps are discussed along with the various components of PPP encoder mode 204 and PPP decoder mode 206.




A. Extraction Module




In step 1002, extraction module 904 extracts a prototype residual r_p(n) from the residual signal r(n). As described above in Section III.F., initial parameter calculation module 202 employs an LPC analysis filter to compute r(n) for each frame. In a preferred embodiment, the LPC coefficients in this filter are perceptually weighted as described in Section VII.A. The length of r_p(n) is equal to the pitch lag L computed by initial parameter calculation module 202 during the last subframe in the current frame.





FIG. 11 is a flowchart depicting step 1002 in greater detail. PPP extraction module 904 preferably selects a pitch period as close to the end of the frame as possible, subject to certain restrictions discussed below. FIG. 12 depicts an example of a residual signal calculated based on quasi-periodic speech, including the current frame and the last subframe from the previous frame.




In step 1102, a "cut-free region" is determined. The cut-free region defines a set of samples in the residual which cannot be endpoints of the prototype residual. The cut-free region ensures that high energy regions of the residual do not occur at the beginning or end of the prototype (which could cause discontinuities in the output were it allowed to happen). The absolute value of each of the final L samples of r(n) is calculated. The variable P_S is set equal to the time index of the sample with the largest absolute value, referred to herein as the "pitch spike." For example, if the pitch spike occurred in the last sample of the final L samples, P_S = L−1. In a preferred embodiment, the minimum sample of the cut-free region, CF_min, is set to be P_S−6 or P_S−0.25L, whichever is smaller. The maximum of the cut-free region, CF_max, is set to be P_S+6 or P_S+0.25L, whichever is larger.




In step 1104, the prototype residual is selected by cutting L samples from the residual. The region chosen is as close as possible to the end of the frame, under the constraint that the endpoints of the region cannot be within the cut-free region. The L samples of the prototype residual are determined using the algorithm described in the following pseudo-code:




if (CF_min < 0) {
    for (i = 0 to L + CF_min − 1)  r_p(i) = r(i + 160 − L)
    for (i = L + CF_min to L − 1)  r_p(i) = r(i + 160 − 2L)
}
else if (CF_max ≧ L) {
    for (i = 0 to CF_min − 1)  r_p(i) = r(i + 160 − L)
    for (i = CF_min to L − 1)  r_p(i) = r(i + 160 − 2L)
}
else {
    for (i = 0 to L − 1)  r_p(i) = r(i + 160 − L)
}




B. Rotational Correlator




Referring back to FIG. 10, in step 1004, rotational correlator 906 calculates a set of rotational parameters based on the current prototype residual, r_p(n), and the prototype residual from the previous frame, r_prev(n). These parameters describe how r_prev(n) can best be rotated and scaled for use as a predictor of r_p(n). In a preferred embodiment, the set of rotational parameters includes an optimal rotation R* and an optimal gain b*. FIG. 13 is a flowchart depicting step 1004 in greater detail.




In step 1302, the perceptually weighted target signal x(n) is computed by circularly filtering the prototype pitch residual period r_p(n). This is achieved as follows. A temporary signal tmp1(n) is created from r_p(n) as







$$tmp1(n) = \begin{cases} r_p(n), & 0 \leq n < L \\ 0, & L \leq n < 2L \end{cases}$$

which is filtered by the weighted LPC synthesis filter with zero memories to provide an output tmp2(n). In a preferred embodiment, the LPC coefficients used are the perceptually weighted coefficients corresponding to the last subframe in the current frame. The target signal x(n) is then given by








$$x(n) = tmp2(n) + tmp2(n + L), \quad 0 \leq n < L$$
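A compact C sketch of this circular filtering, under assumptions noted in the comments (10th-order filter, one common all-pole sign convention, and L ≦ 160):

#define ORDER 10                         /* 10th order LPC, per the text */

void circular_filter(const double rp[], int L, const double a[ORDER],
                     double x[])
{
    double tmp2[320];                    /* 2L samples, assuming L <= 160 */
    for (int n = 0; n < 2 * L; n++) {
        double acc = (n < L) ? rp[n] : 0.0;          /* tmp1(n)            */
        for (int j = 1; j <= ORDER && j <= n; j++)   /* all-pole recursion */
            acc += a[j - 1] * tmp2[n - j];           /* with zero memories */
        tmp2[n] = acc;
    }
    for (int n = 0; n < L; n++)
        x[n] = tmp2[n] + tmp2[n + L];    /* x(n) = tmp2(n) + tmp2(n+L) */
}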








In step 1304, the prototype residual from the previous frame, r_prev(n), is extracted from the previous frame's quantized formant residual (which is also in the pitch filter's memories). The previous prototype residual is preferably defined as the last L_p values of the previous frame's formant residual, where L_p is equal to L if the previous frame was not a PPP frame, and is set to the previous pitch lag otherwise.




In step 1306, the length of r_prev(n) is altered to be of the same length as x(n) so that correlations can be correctly computed. This technique for altering the length of a sampled signal is referred to herein as warping. The warped pitch excitation signal, rw_prev(n), may be described as








$$rw_{prev}(n) = r_{prev}(n \cdot TWF), \quad 0 \leq n < L$$

where TWF is the time warping factor $L_p / L$.










The sample values at non-integral points n·TWF are preferably computed using a set of sinc function tables. The sinc sequence chosen is sinc(−3−F : 4−F) where F is the fractional part of n·TWF rounded to the nearest multiple of 1/8. The beginning of this sequence is aligned with r_prev((N−3)%L_p) where N is the integral part of n·TWF after being rounded to the nearest eighth.
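A simplified C sketch of this warping under stated assumptions: the sinc taps are computed inline rather than read from the precomputed tables, and M_PI is the common (but non-standard) math.h constant.

#include <math.h>

void warp(const double rprev[], int Lp, double rwprev[], int L)
{
    double TWF = (double)Lp / (double)L;         /* time warping factor */
    for (int n = 0; n < L; n++) {
        double t = round(n * TWF * 8.0) / 8.0;   /* nearest eighth      */
        int    N = (int)floor(t);                /* integral part       */
        double F = t - N;                        /* fractional part     */
        double acc = 0.0;
        for (int j = -3; j <= 4; j++) {          /* taps sinc(-3-F..4-F) */
            double arg = M_PI * (j - F);
            double tap = (fabs(arg) < 1e-9) ? 1.0 : sin(arg) / arg;
            acc += tap * rprev[((N + j) % Lp + Lp) % Lp];
        }
        rwprev[n] = acc;
    }
}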




In step 1308, the warped pitch excitation signal rw_prev(n) is circularly filtered, resulting in y(n). This operation is the same as that described above with respect to step 1302, but applied to rw_prev(n).




In step 1310, the pitch rotation search range is computed by first calculating an expected rotation E_rot:

$$E_{rot} = L - \mathrm{round}\left( L \cdot \mathrm{frac}\left( \frac{(160 - L)(L_p + L)}{2 L_p L} \right) \right)$$



where frac(x) gives the fractional part of x. If L < 80, the pitch rotation search range is defined to be {E_rot − 8, E_rot − 7.5, . . . , E_rot + 7.5}; if L ≧ 80, it is {E_rot − 16, E_rot − 15, . . . , E_rot + 15}.




In step 1312, the rotational parameters, optimal rotation R* and optimal gain b*, are calculated. The pitch rotation which results in the best prediction between x(n) and y(n) is chosen along with the corresponding gain b. These parameters are preferably chosen to minimize the error signal e(n) = x(n) − y(n). The optimal rotation R* and the optimal gain b* are those values of rotation R and gain b which result in the maximum value of








$$\frac{Exy_R^2}{Eyy}$$

where

$$Exy_R = \sum_{i=0}^{L-1} x((i + R)\,\%\,L)\; y(i) \qquad \text{and} \qquad Eyy = \sum_{i=0}^{L-1} y(i)\, y(i)$$

for which the optimal gain b* is $Exy_{R^*} / Eyy$











at rotation R*. For fractional values of rotation, the value of Exy_R is approximated by interpolating the values of Exy_R computed at integer values of rotation. A simple four-tap interpolation filter is used. For example,

$$Exy_R = 0.54\,(Exy_{R'} + Exy_{R'+1}) - 0.04\,(Exy_{R'-1} + Exy_{R'+2})$$

where R is a non-integral rotation (with precision of 0.5) and R′ = ⌊R⌋.
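For illustration, a C sketch of the integer-rotation portion of this search (the half-sample refinement through the four-tap interpolator is omitted, and the names are ours):

int best_rotation(const double x[], const double y[], int L,
                  int Rmin, int Rmax, double *b_opt)
{
    double Eyy = 0.0;                    /* Eyy does not depend on R */
    for (int i = 0; i < L; i++) Eyy += y[i] * y[i];

    int Rstar = Rmin;
    double best = -1.0;
    for (int R = Rmin; R <= Rmax; R++) {
        double Exy = 0.0;                /* ExyR = sum x((i+R)%L) y(i) */
        for (int i = 0; i < L; i++)
            Exy += x[((i + R) % L + L) % L] * y[i];
        if (Exy * Exy > best) {          /* maximize ExyR^2 / Eyy */
            best = Exy * Exy;
            Rstar = R;
            *b_opt = Exy / Eyy;          /* b* = ExyR* / Eyy */
        }
    }
    return Rstar;
}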




In a preferred embodiment, the rotational parameters are quantized for efficient transmission. The optimal gain b* is preferably quantized uniformly between 0.0625 and 4.0 as

$$PGAIN = \max\left\{ \min\left( \left\lfloor 63 \left( \frac{b^* - 0.0625}{4 - 0.0625} \right) + 0.5 \right\rfloor,\ 63 \right),\ 0 \right\}$$












where PGAIN is the transmission code and the quantized gain b̂* is given by

$$\hat{b}^* = \max\left\{ 0.0625 + \frac{PGAIN\,(4 - 0.0625)}{63},\ 0.0625 \right\}$$
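As a worked check of this round trip: b* = 1.0 gives PGAIN = ⌊63·(0.9375/3.9375) + 0.5⌋ = ⌊15.5⌋ = 15, which decodes back to b̂* = 0.0625 + 15·(3.9375/63) = 1.0 exactly.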











The optimal rotation R* is quantized as the transmission code PROT, which is set to 2(R* − E_rot + 8) if L < 80, and to R* − E_rot + 16 if L ≧ 80.




C. Encoding Codebook




Referring back to FIG. 10, in step 1006, encoding codebook 908 generates a set of codebook parameters based on the received target signal x(n). Encoding codebook 908 seeks to find one or more codevectors which, when scaled, added, and filtered, sum to a signal which approximates x(n). In a preferred embodiment, encoding codebook 908 is implemented as a multi-stage codebook, preferably three stages, where each stage produces a scaled codevector. The set of codebook parameters therefore includes the indexes and gains corresponding to three codevectors. FIG. 14 is a flowchart depicting step 1006 in greater detail.




In step 1402, before the codebook search is performed, the target signal x(n) is updated as








$$x(n) = x(n) - b\, y((n - R^*)\,\%\,L), \quad 0 \leq n < L$$








If in the above subtraction the rotation R* is non-integral (i.e., has a fraction of 0.5), then

$$y(i - 0.5) = -0.0073\,(y(i-4) + y(i+3)) + 0.0322\,(y(i-3) + y(i+2)) - 0.1363\,(y(i-2) + y(i+1)) + 0.6076\,(y(i-1) + y(i))$$

where i = n − ⌊R*⌋.




In step 1404, the codebook values are partitioned into multiple regions. According to a preferred embodiment, the codebook is determined as







$$c(n) = \begin{cases} 1, & n = 0 \\ 0, & 0 < n < L \\ CBP(n - L), & L \leq n < 128 + L \end{cases}$$

where CBP are the values of a stochastic or trained codebook. Those skilled in the art will recognize how these codebook values are generated. The codebook is partitioned into multiple regions, each of length L. The first region is a single pulse, and the remaining regions are made up of values from the stochastic or trained codebook. The number of regions N will be ⌈128/L⌉.




In step 1406, the multiple regions of the codebook are each circularly filtered to produce the filtered codebooks, y_reg(n), the concatenation of which is the signal y(n). For each region, the circular filtering is performed as described above with respect to step 1302.




In step 1408, the filtered codebook energy, Eyy(reg), is computed for each region and stored:








$$Eyy(reg) = \sum_{i=0}^{L-1} y_{reg}^2(i), \quad 0 \leq reg < N$$











In step 1410, the codebook parameters (i.e., codevector index and gain) for each stage of the multi-stage codebook are computed. According to a preferred embodiment, let Region(I) = reg, defined as the region in which sample I resides, or







$$Region(I) = \begin{cases} 0, & 0 \leq I < L \\ 1, & L \leq I < 2L \\ 2, & 2L \leq I < 3L \\ \vdots \end{cases}$$

and let Exy(I) be defined as

$$Exy(I) = \sum_{i=0}^{L-1} x(i)\; y_{Region(I)}((i + I)\,\%\,L)$$

The codebook parameters, I* and G*, for the jth codebook stage are computed using the following pseudo-code.








Exy* = 0, Eyy* = 0
for (I = 0 to 127) {
    compute Exy(I)
    if (Exy(I)² · Eyy* > Exy*² · Eyy(Region(I))) {
        Exy* = Exy(I)
        Eyy* = Eyy(Region(I))
        I* = I
    }
}

and G* = Exy*/Eyy*.











According to a preferred embodiment, the codebook parameters are quantized for efficient transmission. The transmission code CBIj (j = stage number: 0, 1, or 2) is preferably set to I*, and the transmission codes CBGj and SIGNj are set by quantizing the gain G*:






$$SIGNj = \begin{cases} 0, & G^* \geq 0 \\ 1, & G^* < 0 \end{cases}$$

$$CBGj = \left\lfloor \min\left\{ \max\{0,\ \log_2(|G^*|)\},\ 11.25 \right\} \cdot \frac{4}{3} + 0.5 \right\rfloor$$















and the quantized gain Ĝ* is

$$\hat{G}^* = \begin{cases} 2^{0.75\,CBGj}, & SIGNj = 0 \\ -2^{0.75\,CBGj}, & SIGNj = 1 \end{cases}$$















The target signal x(n) is then updated by subtracting the contribution of the codebook vector of the current stage:

$$x(n) = x(n) - \hat{G}^*\, y_{Region(I^*)}((n + I^*)\,\%\,L), \quad 0 \leq n < L$$

The above procedures, starting from the pseudo-code, are repeated to compute I*, G*, and the corresponding transmission codes for the second and third stages.
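The following C sketch ties steps 1408 through 1410 together for a single stage. It is illustrative only: the names are ours, it applies the unquantized G* in the target update, and it accepts ties so that the zero-initialized best survives the first comparison.

void codebook_stage(const double x_in[], double x_out[], int L,
                    const double y[], const double Eyy_region[],
                    int *I_star, double *G_star)
{
    double Exy_best = 0.0, Eyy_best = 0.0;
    int best = 0;
    for (int I = 0; I < 128; I++) {
        int reg = I / L;                          /* Region(I) */
        double Exy = 0.0;                         /* Exy(I)    */
        for (int i = 0; i < L; i++)
            Exy += x_in[i] * y[reg * L + (i + I) % L];
        /* keep Exy^2 / Eyy(reg) >= current best, cross-multiplied */
        if (Exy * Exy * Eyy_best >= Exy_best * Exy_best * Eyy_region[reg]) {
            Exy_best = Exy;
            Eyy_best = Eyy_region[reg];
            best = I;
        }
    }
    *I_star = best;
    *G_star = Exy_best / Eyy_best;
    for (int i = 0; i < L; i++)                   /* update the target */
        x_out[i] = x_in[i] - *G_star * y[(best / L) * L + (i + best) % L];
}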




D. Filter Update Module




Referring back to FIG. 10, in step 1008, filter update module 910 updates the filters used by PPP encoder mode 204. Two alternative embodiments are presented for filter update module 910, as shown in FIGS. 15A and 16A. As shown in the first alternative embodiment in FIG. 15A, filter update module 910 includes a decoding codebook 1502, a rotator 1504, a warping filter 1506, an adder 1510, an alignment and interpolation module 1508, an update pitch filter module 1512, and an LPC synthesis filter 1514. The second embodiment, as shown in FIG. 16A, includes a decoding codebook 1602, a rotator 1604, a warping filter 1606, an adder 1608, an update pitch filter module 1610, a circular LPC synthesis filter 1612, and an update LPC filter module 1614. FIGS. 17 and 18 are flowcharts depicting step 1008 in greater detail, according to the two embodiments.




In step 1702 (and 1802, the first step of both embodiments), the current reconstructed prototype residual, r_curr(n), L samples in length, is reconstructed from the codebook parameters and rotational parameters. In a preferred embodiment, rotator 1504 (and 1604) rotates a warped version of the previous prototype residual according to the following:








$$r_{curr}((n + R^*)\,\%\,L) = b\; rw_{prev}(n), \quad 0 \leq n < L$$

where r_curr is the current prototype to be created, rw_prev is the warped version of the previous period (as described above in Section VIII.A., with TWF = L_p/L) obtained from the most recent L samples of the pitch filter memories, and b and R are the pitch gain and rotation obtained from the packet transmission codes as






$$b = \max\left\{ 0.0625 + \frac{PGAIN\,(4 - 0.0625)}{63},\ 0.0625 \right\}$$

$$R = \begin{cases} \dfrac{PROT}{2} + E_{rot} - 8, & L < 80 \\[6pt] PROT + E_{rot} - 16, & L \geq 80 \end{cases}$$

where E_rot is the expected rotation computed as described above in Section VIII.B.




Decoding codebook 1502 (and 1602) adds the contributions for each of the three codebook stages to r_curr(n) as








$$r_{curr}((n - I)\,\%\,L) = r_{curr}((n - I)\,\%\,L) + \begin{cases} G, & I < L,\ n = 0 \\ G\,CBP(I - L + n), & I \geq L,\ 0 \leq n < L \end{cases}$$

where I = CBIj and G is obtained from CBGj and SIGNj as described in the previous section, j being the stage number.




At this point, the two alternative embodiments for filter update module 910 differ. Referring first to the embodiment of FIG. 15A, in step 1704, alignment and interpolation module 1508 fills in the remainder of the residual samples from the beginning of the current frame to the beginning of the current prototype residual (as shown in FIG. 12, an illustration 1200 that depicts a prototype residual period extracted from the current frame of a residual signal, and the prototype residual period from the previous frame). Here, the alignment and interpolation are performed on the residual signal. However, these same operations can also be performed on speech signals, as described below. FIG. 19 is a flowchart describing step 1704 in further detail.




In step 1902, it is determined whether the previous lag L_p is a double or a half relative to the current lag L. In a preferred embodiment, other multiples are considered too improbable, and are therefore not considered. If L_p > 1.85L, L_p is halved and only the first half of the previous period r_prev(n) is used. If L_p < 0.54L, the current lag L is likely a double and consequently L_p is also doubled and the previous period r_prev(n) is extended by repetition.




In step 1904, r_prev(n) is warped to form rw_prev(n) as described above with respect to step 1306, with TWF = L_p/L, so that the lengths of both prototype residuals are now the same. Note that this operation was performed in step 1702, as described above, by warping filter 1506. Those skilled in the art will recognize that step 1904 would be unnecessary if the output of warping filter 1506 were made available to alignment and interpolation module 1508.




In step 1906, the allowable range of alignment rotations is computed. The expected alignment rotation, E_A, is computed to be the same as E_rot, as described above in Section VIII.B. The alignment rotation search range is defined to be {E_A − δA, E_A − δA + 0.5, E_A − δA + 1, . . . , E_A + δA − 1.5, E_A + δA − 1}, where δA = max{6, 0.15L}.




In step 1908, the cross-correlations between the previous and current prototype periods for integer alignment rotations, A, are computed as

$$C(A) = \sum_{i=0}^{L-1} r_{curr}((i + A)\,\%\,L)\; rw_{prev}(i)$$














and the cross-correlations for non-integral rotations A are approximated by interpolating the values of the correlations at integral rotation:

$$C(A) = 0.54\,(C(A') + C(A'+1)) - 0.04\,(C(A'-1) + C(A'+2))$$

where A′ = A − 0.5.




In step 1910, the value of A (over the range of allowable rotations) which results in the maximum value of C(A) is chosen as the optimal alignment, A*.




In step 1912, the average lag or pitch period for the intermediate samples, L_av, is computed in the following manner. A period number estimate, N_per, is computed as

$$N_{per} = \mathrm{round}\left( \frac{A^*}{L} + \frac{(160 - L)(L_p + L)}{2 L_p L} \right)$$

with the average lag for the intermediate samples given by

$$L_{av} = \frac{(160 - L)\,L}{N_{per}\,L - A^*}$$













In step 1914, the remaining residual samples in the current frame are calculated according to the following interpolation between the previous and current prototype residuals:

$$\hat{r}(n) = \begin{cases} \left(1 - \dfrac{n}{160 - L}\right) rw_{prev}((n\alpha)\,\%\,L) + \dfrac{n}{160 - L}\; r_{curr}((n\alpha + A^*)\,\%\,L), & 0 \leq n < 160 - L \\[6pt] r_{curr}(n + L - 160), & 160 - L \leq n < 160 \end{cases}$$

where α = L/L_av.











The sample values at non-integral points ñ (equal to either nα or nα + A*) are computed using a set of sinc function tables. The sinc sequence chosen is sinc(−3−F : 4−F) where F is the fractional part of ñ rounded to the nearest multiple of 1/8. The beginning of this sequence is aligned with r_prev((N−3)%L_p) where N is the integral part of ñ after being rounded to the nearest eighth.
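A simplified C sketch of this interpolation; indices are rounded to integers for brevity where the text interpolates with eighth-sample sinc tables, and the names are ours.

#include <math.h>

void interpolate_periods(const double rwprev[], const double rcurr[],
                         int L, double Astar, double Lav, double rhat[160])
{
    double alpha = (double)L / Lav;
    for (int n = 0; n < 160 - L; n++) {
        double w = (double)n / (double)(160 - L);      /* cross-fade weight */
        int ip = (int)round(n * alpha) % L;            /* index into rwprev */
        int ic = (int)round(n * alpha + Astar) % L;    /* index into rcurr  */
        rhat[n] = (1.0 - w) * rwprev[ip] + w * rcurr[ic];
    }
    for (int n = 160 - L; n < 160; n++)                /* last L samples */
        rhat[n] = rcurr[n + L - 160];
}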




Note that this operation is essentially the same as warping, as described above with respect to step 1306. Therefore, in an alternative embodiment, the interpolation of step 1914 is computed using a warping filter. Those skilled in the art will recognize that economies might be realized by reusing a single warping filter for the various purposes described herein.




Returning to FIG. 17, in step 1706, update pitch filter module 1512 copies values from the reconstructed residual r̂(n) to the pitch filter memories. Likewise, the memories of the pitch prefilter are also updated.

In step 1708, LPC synthesis filter 1514 filters the reconstructed residual r̂(n), which has the effect of updating the memories of the LPC synthesis filter.




The second embodiment of filter update module 910, as shown in FIG. 16A, is now described. As described above with respect to step 1702, in step 1802, the prototype residual is reconstructed from the codebook and rotational parameters, resulting in r_curr(n).




In step 1804, update pitch filter module 1610 updates the pitch filter memories by copying replicas of the L samples from r_curr(n), according to

pitch_mem(i) = r_curr((L − (131 % L) + i) % L), 0 ≦ i < 131

or alternatively,

pitch_mem(131 − 1 − i) = r_curr(L − 1 − (i % L)), 0 ≦ i < 131

where 131 is preferably the pitch filter order for a maximum lag of 127.5. In a preferred embodiment, the memories of the pitch prefilter are identically replaced by replicas of the current period r_curr(n):






pitch_prefilt_mem(i) = pitch_mem(i), 0 ≦ i < 131
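A small C sketch of this circular copy (names are ours; the 131-sample order is from the text):

void update_pitch_mem(const double rcurr[], int L,
                      double pitch_mem[131], double pitch_prefilt_mem[131])
{
    for (int i = 0; i < 131; i++)          /* newest sample last */
        pitch_mem[131 - 1 - i] = rcurr[L - 1 - (i % L)];
    for (int i = 0; i < 131; i++)          /* mirror into the prefilter */
        pitch_prefilt_mem[i] = pitch_mem[i];
}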






In step 1806, r_curr(n) is circularly filtered as described in Section VIII.B., resulting in s_c(n), preferably using perceptually weighted LPC coefficients.

In step 1808, values from s_c(n), preferably the last ten values (for a 10th order LPC filter), are used to update the memories of the LPC synthesis filter.




E. PPP Decoder




Returning to FIGS. 9 and 10, in step 1010, PPP decoder mode 206 reconstructs the prototype residual r_curr(n) based on the received codebook and rotational parameters. Decoding codebook 912, rotator 914, and warping filter 918 operate in the manner described in the previous section. Period interpolator 920 receives the reconstructed prototype residual r_curr(n) and the previous reconstructed prototype residual r_prev(n), interpolates the samples between the two prototypes, and outputs synthesized speech signal ŝ(n). Period interpolator 920 is described in the following section.




F. Period Interpolator




In step 1012, period interpolator 920 receives r_curr(n) and outputs synthesized speech signal ŝ(n). Two alternative embodiments for period interpolator 920 are presented herein, as shown in FIGS. 15B and 16B. In the first alternative embodiment, FIG. 15B, period interpolator 920 includes an alignment and interpolation module 1516, an LPC synthesis filter 1518, and an update pitch filter module 1520. The second alternative embodiment, as shown in FIG. 16B, includes a circular LPC synthesis filter 1616, an alignment and interpolation module 1618, an update pitch filter module 1622, and an update LPC filter module 1620. FIGS. 20 and 21 are flowcharts depicting step 1012 in greater detail, according to the two embodiments.




Referring to FIG. 15B, in step 2002, alignment and interpolation module 1516 reconstructs the residual signal for the samples between the current residual prototype r_curr(n) and the previous residual prototype r_prev(n), forming r̂(n). Alignment and interpolation module 1516 operates in the manner described above with respect to step 1704 (as shown in FIG. 19).

In step 2004, update pitch filter module 1520 updates the pitch filter memories based on the reconstructed residual signal r̂(n), as described above with respect to step 1706.

In step 2006, LPC synthesis filter 1518 synthesizes the output speech signal ŝ(n) based on the reconstructed residual signal r̂(n). The LPC filter memories are automatically updated when this operation is performed.




Referring now to FIGS. 16B and 21, in step 2102, update pitch filter module 1622 updates the pitch filter memories based on the reconstructed current residual prototype, r_curr(n), as described above with respect to step 1804.

In step 2104, circular LPC synthesis filter 1616 receives r_curr(n) and synthesizes a current speech prototype, s_c(n) (which is L samples in length), as described above in Section VIII.B.

In step 2106, update LPC filter module 1620 updates the LPC filter memories as described above with respect to step 1808.

In step 2108, alignment and interpolation module 1618 reconstructs the speech samples between the previous prototype period and the current prototype period. The previous prototype residual, r_prev(n), is circularly filtered (in an LPC synthesis configuration) so that the interpolation may proceed in the speech domain. Alignment and interpolation module 1618 operates in the manner described above with respect to step 1704 (see FIG. 19), except that the operations are performed on speech prototypes rather than residual prototypes. The result of the alignment and interpolation is the synthesized speech signal ŝ(n).




IX. Noise Excited Linear Prediction (NELP) Coding Mode




Noise Excited Linear Prediction (NELP) coding models the speech signal as a pseudo-random noise sequence and thereby achieves lower bit rates than may be obtained using either CELP or PPP coding. NELP coding operates most effectively, in terms of signal reproduction, where the speech signal has little or no pitch structure, such as unvoiced speech or background noise.





FIG. 22 depicts a NELP encoder mode 204 and a NELP decoder mode 206 in further detail. NELP encoder mode 204 includes an energy estimator 2202 and an encoding codebook 2204. NELP decoder mode 206 includes a decoding codebook 2206, a random number generator 2210, a multiplier 2212, and an LPC synthesis filter 2208.





FIG. 23 is a flowchart 2300 depicting the steps of NELP coding, including encoding and decoding. These steps are discussed along with the various components of NELP encoder mode 204 and NELP decoder mode 206.




In step 2302, energy estimator 2202 calculates the energy of the residual signal for each of the four subframes as

$$Esf_i = 0.5 \log_2\left( \frac{\sum_{n=40i}^{40i+39} s^2(n)}{40} \right), \quad 0 \leq i < 4$$













In step 2304, encoding codebook 2204 calculates a set of codebook parameters, forming encoded speech signal s_enc(n). In a preferred embodiment, the set of codebook parameters includes a single parameter, index I0. Index I0 is set equal to the value of j which minimizes

$$\sum_{i=0}^{3} \left( Esf_i - SFEQ(j, i) \right)^2, \quad 0 \leq j < 128$$

The codebook vectors, SFEQ, are used to quantize the subframe energies Esf_i and include a number of elements equal to the number of subframes within a frame (i.e., 4 in a preferred embodiment). These codebook vectors are preferably created according to standard techniques known to those skilled in the art for creating stochastic or trained codebooks.
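A C sketch of this encoder under stated assumptions: the residual is passed in directly and the trained table SFEQ is given; the names are ours.

#include <math.h>

int nelp_encode(const double res[160], const double SFEQ[128][4])
{
    double Esf[4];
    for (int i = 0; i < 4; i++) {                 /* step 2302 */
        double e = 1e-12;                         /* guard log2(0) */
        for (int n = 40 * i; n < 40 * i + 40; n++)
            e += res[n] * res[n];
        Esf[i] = 0.5 * log2(e / 40.0);
    }
    int I0 = 0;                                   /* step 2304 */
    double best = 1e300;
    for (int j = 0; j < 128; j++) {
        double err = 0.0;
        for (int i = 0; i < 4; i++) {
            double dE = Esf[i] - SFEQ[j][i];
            err += dE * dE;
        }
        if (err < best) { best = err; I0 = j; }
    }
    return I0;    /* the single transmitted codebook parameter */
}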




In step 2306, decoding codebook 2206 decodes the received codebook parameters. In a preferred embodiment, the set of subframe gains G_i is decoded according to:








$$G_i = 2^{SFEQ(I_0,\,i)}$$

or, where the previous frame was coded using a zero-rate coding scheme,

$$G_i = 2^{0.2\,SFEQ(I_0,\,i) + 0.8\,\log_2 Gprev - 2}$$

where 0 ≦ i < 4 and Gprev is the codebook excitation gain corresponding to the last subframe of the previous frame.




In step 2308, random number generator 2210 generates a unit variance random vector nz(n). This random vector is scaled by the appropriate gain G_i within each subframe in step 2310, creating the excitation signal G_i nz(n).




In step 2312, LPC synthesis filter 2208 filters the excitation signal G_i nz(n) to form the output speech signal, ŝ(n).
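A C sketch of the decoder's excitation construction; the rand()-based noise source and its unit-variance scaling are assumptions of the sketch, since the text only requires a unit variance random vector.

#include <math.h>
#include <stdlib.h>

void nelp_excitation(int I0, const double SFEQ[128][4], double exc[160])
{
    for (int i = 0; i < 4; i++) {
        double G = pow(2.0, SFEQ[I0][i]);     /* G_i = 2^SFEQ(I0,i) */
        for (int n = 40 * i; n < 40 * i + 40; n++) {
            double u = (double)rand() / RAND_MAX - 0.5;  /* variance 1/12 */
            exc[n] = G * u * sqrt(12.0);      /* roughly unit variance */
        }
    }
}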




In a preferred embodiment, a zero rate mode is also employed where the gain G_i and LPC parameters obtained from the most recent non-zero-rate NELP subframe are used for each subframe in the current frame. Those skilled in the art will recognize that this zero rate mode can effectively be used where multiple NELP frames occur in succession.




X. Conclusion




While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.




The previous description of the preferred embodiments is provided to enable any person skilled in the art to make or use the present invention. While the invention has been particularly shown and described with reference to preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.



Claims
  • 1. A method for the variable rate coding of a speech signal, comprising:classifying the speech signal as either active or inactive, wherein classifying speech as active or inactive comprises a two energy band based thresholding scheme; classifying said active speech into one of a plurality of types of active speech, wherein said plurality of types of active speech include voiced, unvoiced, and transient speech; selecting an encoder mode based on whether the speech signal is active or inactive, and if active, based further on said type of active speech, wherein said selected encoder mode is characterized by either a coding bit rate or a coding algorithm, or by a coding bit rate and a coding algorithm; and encoding the speech signal according to said encoder mode, forming an encoded speech signal.
  • 2. A method for the variable rate coding of a speech signal, comprising:classifying the speech signal as either active or inactive, wherein classifying speech as active or inactive comprises classifying the next M frames as active if the previous N_ho frames were classified as active; classifying said active speech into one of a plurality of types of active speech, wherein said plurality of types of active speech include voiced, unvoiced, and transient speech; selecting an encoder mode based on whether the speech signal is active or inactive, and if active, based further on said type of active speech, wherein said selected encoder mode is characterized by either a coding bit rate or a coding algorithm, or by a coding bit rate and a coding algorithm; and encoding the speech signal according to said encoder mode, forming an encoded speech signal.
  • 3. A variable rate coding system for coding a speech signal, comprising:classification means for classifying the speech signal as active or inactive based on a two energy band thresholding scheme, and if active, for classifying the active speech as one of a plurality of types of active speech; and a plurality of encoding means for encoding the speech signal as an encoded speech signal, wherein said encoding means are dynamically selected to encode the speech signal based on whether the speech signal is active or inactive, and if active, based further on said type of active speech.
  • 4. A variable rate coding system for coding a speech signal, comprising:classification means for classifying the speech signal as active or inactive, wherein said classification means classifies the next M frames as active if the previous N_ho frames were classified as active, and if active, for classifying the active speech as one of a plurality of types of active speech; and a plurality of encoding means for encoding the speech signal as an encoded speech signal, wherein said encoding means are dynamically selected to encode the speech signal based on whether the speech signal is active or inactive, and if active, based further on said type of active speech.