Determining linear predictive coding filter parameters for encoding a voice signal

Information

  • Patent Grant
  • 6782359
  • Patent Number
    6,782,359
  • Date Filed
    Wednesday, May 28, 2003
  • Date Issued
    Tuesday, August 24, 2004
Abstract
Linear predictive coding (LPC) filter parameters are determined for use in encoding a voice signal. Samples of a speech signal are pre-emphasized using a filter defined by a z-transform function. The pre-emphasized samples are analyzed to produce LPC reflection coefficients. The LPC reflection coefficients are quantized by a voiced quantizer and by an unvoiced quantizer, producing two sets of quantized reflection coefficients. Each set is converted into respective spectral coefficients. The set which produces the smaller log-spectral distance is determined and is selected to encode the voice signal.
Description




BACKGROUND




This invention relates to digital voice coders performing at relatively low voice rates but maintaining high voice quality. In particular, it relates to improved multipulse linear predictive voice coders.




The multipulse coder incorporates the linear predictive all-pole filter (LPC filter). The basic function of a multipulse coder is finding a suitable excitation pattern for the LPC all-pole filter which produces an output that closely matches the original speech waveform. The excitation signal is a series of weighted impulses. The weight values and impulse locations are found in a systematic manner. The weight and location of each excitation impulse are selected by minimizing an error criterion between the all-pole filter output and the original speech signal. Some multipulse coders incorporate a perceptual weighting filter in the error criterion function. This filter serves to frequency weight the error, which in essence allows more error in the formant regions of the speech signal and less in low energy portions of the spectrum. Incorporation of pitch filters improves the performance of multipulse speech coders. This is done by modeling the long term redundancy of the speech signal, thereby allowing the excitation signal to account for the pitch related properties of the signal.




SUMMARY




Linear predictive coding (LPC) filter parameters are determined for use in encoding a voice signal. Samples of a speech signal are pre-emphasized using a filter defined by a z-transform function. The pre-emphasized samples are analyzed to produce LPC reflection coefficients. The LPC reflection coefficients are quantized by a voiced quantizer and by an unvoiced quantizer, producing two sets of quantized reflection coefficients. Each set is converted into respective spectral coefficients. The set which produces the smaller log-spectral distance is determined and is selected to encode the voice signal.











BRIEF DESCRIPTION OF THE DRAWING(S)





FIG. 1 is a block diagram of an 8 kbps multipulse LPC speech coder.

FIG. 2 is a block diagram of a sample/hold and A/D circuit used in the system of FIG. 1.

FIG. 3 is a block diagram of the spectral whitening circuit of FIG. 1.

FIG. 4 is a block diagram of the perceptual speech weighting circuit of FIG. 1.

FIG. 5 is a block diagram of the reflection coefficient quantization circuit of FIG. 1.

FIG. 6 is a block diagram of the LPC interpolation/weighting circuit of FIG. 1.

FIG. 7 is a flow chart diagram of the pitch analysis block of FIG. 1.

FIG. 8 is a flow chart diagram of the multipulse analysis block of FIG. 1.

FIG. 9 is a block diagram of the impulse response generator of FIG. 1.

FIG. 10 is a block diagram of the perceptual synthesizer circuit of FIG. 1.

FIG. 11 is a block diagram of the ringdown generator circuit of FIG. 1.

FIG. 12 is a diagrammatic view of the factorial tables address storage used in the system of FIG. 1.











DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT(S)




This invention incorporates improvements to the prior art of multipulse coders, specifically a new type of LPC spectral quantization, a pitch filter implementation, incorporation of the pitch synthesis filter in the multipulse analysis, and excitation encoding/decoding.




Shown in FIG. 1 is a block diagram of an 8 kbps multipulse LPC speech coder, generally designated 10.

It comprises a pre-emphasis block 12 to receive the speech signals s(n). The pre-emphasized signals are applied to an LPC analysis block 14 as well as to a spectral whitening block 16 and to a perceptually weighted speech block 18.

The output of the block 14 is applied to a reflection coefficient quantization and LPC conversion block 20, whose output is applied both to the bit packing block 22 and to an LPC interpolation/weighting block 24.




The output from block 20 to block 24 is indicated at α and the outputs from block 24 are indicated at α, α_1 and at αρ, α_1ρ.

The signal α, α_1 is applied to the spectral whitening block 16 and the signal αρ, α_1ρ is applied to the impulse generation block 26.




The output of spectral whitening block 16 is applied to the pitch analysis block 28 whose output is applied to quantizer block 30. The quantized output p̂ from quantizer 30 is applied to the bit packer 22 and also as a second input to the impulse response generation block 26. The output of block 26, indicated at h(n), is applied to the multipulse analysis block 32.




The perceptual weighting block 18 receives both outputs from block 24 and its output, indicated at s_p(n), is applied to an adder 34 which also receives the output r(n) from a ringdown generator 36. The ringdown component r(n) is a fixed signal due to the contributions of the previous frames. The output x(n) of the adder 34 is applied as a second input to the multipulse analysis block 32. The two outputs Ê and Ĝ of the multipulse analysis block 32 are fed to the bit packing block 22.

The signals α, α_1, p and Ê, Ĝ are fed to the perceptual synthesizer block 38 whose output y(n), comprising the combined weighted reflection coefficients, quantized spectral coefficients and multipulse analysis signals of previous frames, is applied to the block delay N/2 40. The output of block 40 is applied to the ringdown generator 36.

The output of the block 22 is fed to the synthesizer/postfilter 42.




The operation of the aforesaid system is described as follows: The original speech is digitized using sample/hold and A/D circuitry 44 comprising a sample and hold block 46 and an analog to digital block 48 (FIG. 2). The sampling rate is 8 kHz. The digitized speech signal, s(n), is analyzed on a block basis, meaning that before analysis can begin, N samples of s(n) must be acquired. Once a block of speech samples s(n) is acquired, it is passed to the preemphasis filter 12 which has a z-transform function

$$P(z) = 1 - \alpha z^{-1} \tag{1}$$
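As an illustration of eq. (1), the following Python sketch applies the first-order pre-emphasis filter to one block of samples. The coefficient value 0.9 and the `prev_sample` argument (the last sample of the preceding block) are assumptions made for the example; the text does not give a value for α or say how block boundaries are handled.

```python
def pre_emphasize(block, alpha=0.9, prev_sample=0.0):
    """Apply P(z) = 1 - alpha * z^-1 to one block of speech samples.

    alpha=0.9 and prev_sample are illustrative assumptions, not values
    taken from the text.
    """
    out = []
    prev = prev_sample
    for s in block:
        # y(n) = s(n) - alpha * s(n - 1)
        out.append(s - alpha * prev)
        prev = s
    return out


if __name__ == "__main__":
    # A constant (low-frequency) input is strongly attenuated after the
    # first sample, which is the intended high-frequency boost.
    print(pre_emphasize([1.0, 1.0, 1.0, 1.0]))
```

Passing the last sample of the previous block as `prev_sample` keeps the filter state continuous from block to block.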






It is then passed to the LPC analysis block 14 from which the signal K is fed to the reflection coefficient quantizer and LPC converter whitening block 20 (shown in detail in FIG. 3). The LPC analysis block 14 produces LPC reflection coefficients which are related to the all-pole filter coefficients. The reflection coefficients are then quantized in block 20 in the manner shown in detail in FIG. 5, wherein two sets of quantizer tables are previously stored. One set has been designed using training databases based on voiced speech, while the other has been designed using unvoiced speech. The reflection coefficients are quantized twice: once using the voiced quantizer 48 and once using the unvoiced quantizer 50. Each quantized set of reflection coefficients is converted to its respective spectral coefficients, as at 52 and 54, which, in turn, enables the computation of the log-spectral distance between the unquantized spectrum and the quantized spectrum. The set of quantized reflection coefficients which produces the smaller log-spectral distance, shown at 56, is then retained. The retained reflection coefficient parameters are encoded for transmission and also converted to the corresponding all-pole LPC filter coefficients in block 58.
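The two-quantizer selection described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the quantizer tables are hypothetical, a single scalar table is applied to every coefficient, the sign convention A(z) = 1 + Σ a_i z^-i is assumed for the step-up recursion, and the log-spectral distance is taken as an RMS difference in dB over a 64-point frequency grid.

```python
import cmath
import math


def reflection_to_lpc(ks):
    """Step-up recursion: reflection coefficients -> a_1..a_p of A(z).

    Assumes A(z) = 1 + sum_i a_i z^-i (sign conventions vary by text).
    """
    a = []
    for k in ks:
        m = len(a)
        a = [a[i] + k * a[m - 1 - i] for i in range(m)] + [k]
    return a


def log_spectrum_db(a, n_points=64):
    """Log magnitude (dB) of the all-pole filter 1/A(z) on [0, pi)."""
    spec = []
    for j in range(n_points):
        w = math.pi * j / n_points
        big_a = 1.0 + sum(c * cmath.exp(-1j * w * (i + 1))
                          for i, c in enumerate(a))
        spec.append(-20.0 * math.log10(abs(big_a)))
    return spec


def log_spectral_distance(a_ref, a_test):
    """RMS difference (dB) between two all-pole log spectra."""
    ref, test = log_spectrum_db(a_ref), log_spectrum_db(a_test)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ref, test)) / len(ref))


def quantize(ks, table):
    """Nearest-level scalar quantization of each reflection coefficient."""
    return [min(table, key=lambda level: abs(level - k)) for k in ks]


def select_quantizer(ks, voiced_table, unvoiced_table):
    """Quantize with both tables and keep the set with the smaller distance."""
    a_ref = reflection_to_lpc(ks)
    results = {}
    for name, table in (("voiced", voiced_table), ("unvoiced", unvoiced_table)):
        kq = quantize(ks, table)
        results[name] = (log_spectral_distance(a_ref, reflection_to_lpc(kq)), kq)
    best = min(results, key=lambda name: results[name][0])
    return best, results[best][1]


if __name__ == "__main__":
    # Hypothetical tables, for illustration only.
    voiced = [-0.9, -0.5, 0.0, 0.5, 0.9]
    unvoiced = [-0.6, -0.2, 0.2, 0.6]
    print(select_quantizer([-0.85, 0.45, -0.1], voiced, unvoiced))
```

Only the selected set of quantized coefficients would then be encoded and converted to all-pole filter coefficients, as the text describes for block 58.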




Following the reflection quantization and LPC coefficient conversion, the LPC filter parameters are interpolated using the scheme described herein. As previously discussed, LPC analysis is performed on speech of block length N which corresponds to N/8000 seconds (sampling rate=8000 Hz). Therefore, a set of filter coefficients is generated for every N samples of speech or every N/8000 sec.




In order to enhance spectral trajectory tracking, the LPC filter parameters are interpolated on a sub-frame basis at block 24 where the sub-frame rate is twice the frame rate. The interpolation scheme is implemented (as shown in detail in FIG. 6) as follows: let the LPC filter coefficients for frame k−1 be α_0 and for frame k be α_1. The filter coefficients for the first sub-frame of frame k are then

$$\alpha = (\alpha_0 + \alpha_1)/2 \tag{2}$$

and the α_1 parameters are applied to the second sub-frame. Therefore a different set of LPC filter parameters is available every 0.5*(N/8000) sec.
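The sub-frame interpolation of eq. (2) amounts to an element-wise average. A short Python sketch, with the function name and list representation chosen here only for illustration:

```python
def subframe_coefficients(alpha_prev, alpha_curr):
    """Return the LPC coefficient sets for the two sub-frames of frame k.

    alpha_prev holds the coefficients of frame k-1 (alpha_0 in eq. (2)),
    alpha_curr those of frame k (alpha_1).
    """
    # First sub-frame: element-wise average of the two frames, eq. (2).
    first = [(a0 + a1) / 2.0 for a0, a1 in zip(alpha_prev, alpha_curr)]
    # Second sub-frame: the current frame's coefficients, unchanged.
    second = list(alpha_curr)
    return first, second


if __name__ == "__main__":
    print(subframe_coefficients([1.0, 0.0], [0.0, 1.0]))
```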




Pitch Analysis




Prior methods of pitch filter implementation for multipulse LPC coders have focused on closed loop pitch analysis methods (U.S. Pat. No. 4,701,954). However, such closed loop methods are computationally expensive. In the present invention the pitch analysis procedure, indicated by block 28, is performed in an open loop manner on the speech spectral residual signal. Open loop methods have reduced computational requirements. The spectral residual signal is generated using the inverse LPC filter which can be represented in the z-transform domain as A(z); A(z)=1/H(z) where H(z) is the LPC all-pole filter. This is known as spectral whitening and is represented by block 16. This block 16 is shown in detail in FIG. 3. The spectral whitening process removes the short-time sample correlation which in turn enhances pitch analysis.

A flow chart diagram of the pitch analysis block 28 of FIG. 1 is shown in FIG. 7. The first step in the pitch analysis process is the collection of N samples of the spectral residual signal. This spectral residual signal is obtained from the pre-emphasized speech signal by the method illustrated in FIG. 3. These residual samples are appended to the prior K retained residual samples to form a segment, r(n), where −K ≤ n ≤ N.




The autocorrelation Q(i) is computed for τ_l ≤ i ≤ τ_h, or

$$Q(i) = \sum_{n=-K}^{N} r(n)\, r(n-i), \qquad \tau_l \le i \le \tau_h \tag{3}$$

The limits of i are arbitrary but for speech sounds a typical range is between 20 and 147 (assuming 8 kHz sampling). The next step is to search Q(i) for the maximum value, M_1, where

$$M_1 = \max(Q(i)) = Q(k_1) \tag{4}$$

The value k_1 is stored and Q(k_1−1), Q(k_1) and Q(k_1+1) are set to a large negative value.

We next find a second value M_2 where

$$M_2 = \max(Q(i)) = Q(k_2) \tag{5}$$

The values k_1 and k_2 correspond to delay values that produce the two largest correlation values. The values k_1 and k_2 are used to check for pitch period doubling. The following algorithm is employed: if ABS(k_2 − 2*k_1) < C, where C can be chosen to be equal to the number of taps (3 in this invention), then the delay value, D, is equal to k_2; otherwise D = k_1. Once the frame delay value, D, is chosen the 3-tap gain terms are solved by first computing the matrix and vector values in eq. (6).










$$
\begin{bmatrix}
\sum r(n)\, r(n-i-1) \\
\sum r(n)\, r(n-i) \\
\sum r(n)\, r(n-i+1)
\end{bmatrix}
=
\begin{bmatrix}
\sum r(n-i-1)\, r(n-i-1) & \sum r(n-i)\, r(n-i-1) & \sum r(n-i+1)\, r(n-i-1) \\
\sum r(n-i-1)\, r(n-i) & \sum r(n-i)\, r(n-i) & \sum r(n-i+1)\, r(n-i) \\
\sum r(n-i-1)\, r(n-i+1) & \sum r(n-i)\, r(n-i+1) & \sum r(n-i+1)\, r(n-i+1)
\end{bmatrix}
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix}
\tag{6}
$$













The matrix equation is solved using Cholesky decomposition. Once the gain values are calculated, they are quantized using a 32-word vector codebook. The codebook index along with the frame delay parameter are transmitted. The symbol P̂ signifies the quantized delay value and the index of the gain codebook.
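The open-loop pitch steps above (autocorrelation search, masking of the first peak, the doubling check, and the 3-tap gain solution) can be sketched in Python as follows. Several details are assumptions made so that the example runs: the segment is a plain list whose index 0 is the oldest retained sample, sums are restricted to index pairs that fall inside the segment, gains are computed over samples from index D+1 onward, and the gain vector is ordered to match the lags D+1, D, D−1. Codebook quantization of the gains is not shown.

```python
import math


def autocorr(seg, lag):
    """Q(lag): sum of r(n) * r(n - lag) over pairs inside the segment."""
    return sum(seg[n] * seg[n - lag] for n in range(lag, len(seg)))


def choose_delay(seg, lag_min=20, lag_max=147, c=3):
    """Pick the frame delay D from the two largest correlation peaks."""
    q = {i: autocorr(seg, i) for i in range(lag_min, lag_max + 1)}
    k1 = max(q, key=q.get)
    for i in (k1 - 1, k1, k1 + 1):      # mask the first peak and neighbours
        if i in q:
            q[i] = float("-inf")        # the "large negative value"
    k2 = max(q, key=q.get)
    # Doubling check as stated in the text: D = k2 if |k2 - 2*k1| < C.
    return k2 if abs(k2 - 2 * k1) < c else k1


def cholesky_solve(R, c):
    """Solve R x = c for symmetric positive definite R via R = L L^T."""
    n = len(c)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = R[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    y = []
    for i in range(n):                  # forward substitution, L y = c
        y.append((c[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    x = [0.0] * n
    for i in reversed(range(n)):        # back substitution, L^T x = y
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def pitch_gains(seg, delay):
    """Build the matrix and vector of eq. (6) and solve for three gains."""
    lags = (delay + 1, delay, delay - 1)
    start = delay + 1                   # keeps every index non-negative

    def corr(d1, d2):
        return sum(seg[n - d1] * seg[n - d2] for n in range(start, len(seg)))

    R = [[corr(a, b) for b in lags] for a in lags]
    c = [corr(0, a) for a in lags]
    return cholesky_solve(R, c)


if __name__ == "__main__":
    residual = [1.0 if n % 40 == 0 else 0.0 for n in range(400)]
    d = choose_delay(residual)
    print(d, pitch_gains(residual, d))
```

The Cholesky step fails with a math domain error if the correlation matrix is not positive definite (for example on an all-zero segment), so a practical coder would need a guard for that case.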




Excitation Analysis




Multipulse's name stems from the operation of exciting a vocal tract model with multiple impulses. The location and amplitude of an excitation pulse are chosen by minimizing the mean-squared error between the real and synthetic speech signals. This system incorporates the perceptual weighting filter 18. A detailed flow chart of the multipulse analysis is shown in FIG. 8. The pulse location and amplitude are determined in a systematic manner. The basic algorithm can be described as follows: let h(n) be the system impulse response of the pitch analysis filter and the LPC analysis filter in cascade; the synthetic speech is the system's response to the multipulse excitation. This is indicated as the excitation convolved with the system response or

$$\hat{s}(n) = \sum_{k=1}^{n} \mathrm{ex}(k)\, h(n-k) \tag{7}$$

where ex(n) is a set of weighted impulses located at positions n_1, n_2, . . . n_j or

$$\mathrm{ex}(n) = \beta_1 \delta(n-n_1) + \beta_2 \delta(n-n_2) + \cdots + \beta_j \delta(n-n_j) \tag{8}$$






The synthetic speech can be re-written as

$$\hat{s}(n) = \sum_{i=1}^{j} \beta_i\, h(n-n_i) \tag{9}$$

In the present invention, the excitation pulse search is performed one pulse at a time, therefore j=1. The error between the real and synthetic speech is

$$e(n) = s_p(n) - \hat{s}(n) - r(n) \tag{10}$$






The squared error

$$E = \sum_{n=1}^{N} e^2(n) \tag{11}$$

or

$$E = \sum_{n=1}^{N} \left(s_p(n) - \hat{s}(n) - r(n)\right)^2 \tag{12}$$













where s_p(n) is the original speech after pre-emphasis and perceptual weighting (FIG. 4) and r(n) is a fixed signal component due to the previous frames' contributions and is referred to as the ringdown component.

FIGS. 10 and 11 show the manner in which this signal is generated, FIG. 10 illustrating the perceptual synthesizer 38 and FIG. 11 illustrating the ringdown generator 36. The squared error is now written as









$$E = \sum_{n=1}^{N} \left(x(n) - \beta_1 h(n-n_1)\right)^2 \tag{13}$$

where x(n) is the speech signal s_p(n)−r(n) as shown in FIG. 1. Expanding the square and writing B for the pulse weight β_1 gives

$$E = S - 2BC + B^2 H \tag{14}$$






where

$$C = \sum_{n=1}^{N-1} x(n)\, h(n-n_1) \tag{15}$$

and

$$S = \sum_{n=1}^{N-1} x^2(n) \tag{16}$$

and

$$H = \sum_{n=1}^{N-1} h(n-n_1)\, h(n-n_1) \tag{17}$$













The error, E, is minimized by setting dE/dB=0 or

$$dE/dB = -2C + 2HB = 0 \tag{18}$$

or

$$B = C/H \tag{19}$$

The error, E, can then be written as

$$E = S - C^2/H \tag{20}$$

From the above equations it is evident that two signals are required for multipulse analysis, namely h(n) and x(n). These two signals are input to the multipulse analysis block 32.




The first step in excitation analysis is to generate the system impulse response. The system impulse response is the concatenation of the 3-tap pitch synthesis filter and the LPC weighted filter. The impulse response filter has the z-transform:

$$H_p(z) = \frac{1}{1 - \sum_{i=1}^{3} b_i z^{-\tau-i}} \cdot \frac{1}{1 - \sum_{i=1}^{\rho} \alpha_i \mu^i z^{-i}} \tag{20}$$













The b values are the pitch gain coefficients, the α values are the spectral filter coefficients, and μ is a filter weighting coefficient. The error signal, e(n), can be written in the z-transform domain as

$$E(z) = X(z) - B\, H_p(z)\, z^{-n_1} \tag{21}$$






where X(z) is the z-transform of x(n) previously defined.
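As a rough illustration of how h(n) could be generated from the cascade H_p(z), the Python sketch below feeds a unit impulse through the 3-tap pitch synthesis filter and then through the weighted LPC filter, using the difference equations implied by the two denominators. The function name, argument layout and example values are assumptions made for this sketch; FIG. 9 may organize the computation differently.

```python
def impulse_response(b, tau, alpha, mu, length):
    """First `length` samples of the impulse response of H_p(z).

    b     : three pitch gains b_1..b_3 (taps at delays tau+1..tau+3)
    tau   : pitch delay parameter
    alpha : spectral filter coefficients alpha_1..alpha_p
    mu    : filter weighting coefficient
    """
    v = [0.0] * length   # output of the pitch synthesis filter
    h = [0.0] * length   # output of the weighted LPC filter
    for n in range(length):
        # v(n) = delta(n) + sum_i b_i * v(n - tau - i)
        v[n] = 1.0 if n == 0 else 0.0
        for i, gain in enumerate(b, start=1):
            if n - tau - i >= 0:
                v[n] += gain * v[n - tau - i]
        # h(n) = v(n) + sum_i alpha_i * mu^i * h(n - i)
        h[n] = v[n]
        for i, a in enumerate(alpha, start=1):
            if n - i >= 0:
                h[n] += a * (mu ** i) * h[n - i]
    return h


if __name__ == "__main__":
    # Illustrative values only.
    print(impulse_response([0.1, 0.6, 0.1], 40, [1.2, -0.5], 0.8, 80))
```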




The impulse response weight β and impulse response time shift location n_1 are computed by minimizing the energy of the error signal, e(n). The time shift variable n_l (l=1 for the first pulse) is now varied from 1 to N. The value of n_l is chosen such that it produces the smallest energy error E. Once n_l is found, β_l can be calculated. Once the first location, n_1, and impulse weight, β_1, are determined, the synthetic signal is written as

$$\hat{s}(n) = \beta_1 h(n-n_1) \tag{22}$$






When two weighted impulses are considered in the excitation sequence, the error energy can be written as

$$E = \sum \left(x(n) - \beta_1 h(n-n_1) - \beta_2 h(n-n_2)\right)^2$$

Since the first pulse weight and location are known, the equation is rewritten as

$$E = \sum \left(x'(n) - \beta_2 h(n-n_2)\right)^2 \tag{23}$$

where

$$x'(n) = x(n) - \beta_1 h(n-n_1) \tag{24}$$






The procedure for determining β_2 and n_2 is identical to that of determining β_1 and n_1. This procedure can be repeated p times. In the present instantiation p=5. The excitation pulse locations are encoded using an enumerative encoding scheme.
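The one-pulse-at-a-time search can be sketched in Python as below. For each candidate location the sketch evaluates C and H as in eqs. (15) and (17), picks the location that maximizes C²/H (equivalently, minimizes E = S − C²/H), sets the weight to C/H, and subtracts the pulse's contribution from the target before searching for the next pulse. Zero-based indexing, a causal h(n) truncated to a finite list, and the function name are assumptions of this sketch.

```python
def multipulse_search(x, h, n_pulses=5):
    """Greedy multipulse search; returns (locations, weights).

    x is the target signal x(n) for one frame and h the system impulse
    response (assumed causal and truncated to len(h) samples).
    """
    n = len(x)
    target = list(x)
    locations, weights = [], []
    for _ in range(n_pulses):
        best = None
        for loc in range(n):
            span = range(loc, min(n, loc + len(h)))
            c = sum(target[m] * h[m - loc] for m in span)   # eq. (15)
            hh = sum(h[m - loc] ** 2 for m in span)         # eq. (17)
            if hh > 0.0 and (best is None or c * c / hh > best[0]):
                best = (c * c / hh, loc, c / hh)            # weight B = C/H
        if best is None:
            break
        _, loc, beta = best
        locations.append(loc)
        weights.append(beta)
        # Remove this pulse's contribution: x'(n) = x(n) - beta * h(n - loc).
        for m in range(loc, min(n, loc + len(h))):
            target[m] -= beta * h[m - loc]
    return locations, weights


if __name__ == "__main__":
    h = [1.0, 0.5, 0.25]
    x = [0.0] * 16
    for loc, beta in ((3, 2.0), (10, -1.0)):
        for k, hk in enumerate(h):
            x[loc + k] += beta * hk
    print(multipulse_search(x, h, n_pulses=2))
```

Because each pulse is chosen with the earlier pulses held fixed, the result is a sequential approximation and is not guaranteed to be the jointly optimal set of pulses.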




Excitation Encoding




A normal encoding scheme for 5 pulse locations would take 5*Int(log_2 N + 0.5) bits, where N is the number of possible locations. For p=5 and N=80, 35 bits are required. The approach taken here is to employ an enumerative encoding scheme. For the same conditions, the number of bits required is 25 bits. The first step is to order the pulse locations (i.e. 0 ≤ L_1 ≤ L_2 ≤ L_3 ≤ L_4 ≤ L_5 ≤ N−1 where L_1 = min(n_1, n_2, n_3, n_4, n_5) etc.). The 25 bit number, B, is:

$$B = \binom{L_1}{1} + \binom{L_2}{2} + \binom{L_3}{3} + \binom{L_4}{4} + \binom{L_5}{5}$$












Computing the 5 sets of factorials is prohibitive on a DSP device; therefore the approach taken here is to pre-compute the values and store them in a DSP ROM. This is shown in FIG. 12. Many of the numbers require double precision (32 bits). A quick calculation yields a required storage (for N=80) of 790 words ((N−1)*2*5). This amount of storage can be reduced by first realizing that $\binom{L_1}{1}$ is simply L_1; therefore no storage is required. Secondly, $\binom{L_2}{2}$ contains only single precision numbers; therefore storage can be reduced to 553 words. The code is written such that the five addresses are computed from the pulse locations starting with the 5th location (assuming pulse locations range from 1 to 80). The address of the 5th pulse is 2*L_5+393. The factor of 2 is due to double precision storage of L_5's elements. The address of L_4 is 2*L_4+235; for L_3, 2*L_3+77; for L_2, L_2−1. The numbers stored at these locations are added and a 25-bit number representing the unique set of locations is produced. A block diagram of the enumerative encoding schemes is listed.
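The enumerative code itself is a sum of binomial coefficients and is easy to check numerically. The Python sketch below computes B directly with `math.comb` (Python 3.8 or later) instead of the ROM table lookup described above. It assumes zero-based locations in 0..N−1 and, as an added assumption, that the five locations are distinct, which is what makes the code uniquely decodable.

```python
from math import comb


def encode_locations(locations):
    """Enumerative code B = C(L1,1) + C(L2,2) + ... for sorted locations.

    Assumes distinct zero-based pulse locations; comb(n, k) is 0 for n < k.
    """
    ordered = sorted(locations)
    return sum(comb(loc, k) for k, loc in enumerate(ordered, start=1))


if __name__ == "__main__":
    code = encode_locations([41, 3, 79, 20, 10])
    # For N = 80 and 5 pulses the largest code is C(80, 5) - 1,
    # which fits in 25 bits.
    print(code, code.bit_length(), (comb(80, 5) - 1).bit_length())
```

With 80 positions and 5 pulses there are C(80, 5) = 24,040,016 possible location sets, which is why 25 bits suffice where independent coding of each location needs 35.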




Excitation Decoding




Decoding the 25-bit word at the receiver involves repeated subtractions. For example, given B is the 25-bit word, the 5th location is found by finding the value X such that

$$
\begin{aligned}
B - \binom{79}{5} &< 0 \\
B - \binom{X}{5} &< 0 \\
B - \binom{X-1}{5} &> 0
\end{aligned}
$$

then L_5 = X−1. Next let

$$B = B - \binom{L_5}{5}$$












The fourth pulse location is found by finding a value X such that

$$
\begin{aligned}
B - \binom{L_5 - 1}{4} &< 0 \\
B - \binom{X}{4} &< 0 \\
B - \binom{X-1}{4} &> 0
\end{aligned}
$$

then L_4 = X−1. This is repeated for L_3 and L_2. The remaining number is L_1.
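The repeated-subtraction decoding can be sketched in Python as follows. For each pulse, starting with the fifth, the sketch steps the candidate location down until its binomial coefficient no longer exceeds the remaining code, then subtracts it. It uses `math.comb` rather than a stored table and accepts equality (remaining code ≥ binomial) where the text writes a strict inequality; both choices, along with zero-based distinct locations, are assumptions of this sketch.

```python
from math import comb


def decode_locations(code, n_pulses=5, n_positions=80):
    """Recover sorted pulse locations from an enumerative code word."""
    locations = []
    upper = n_positions - 1
    for k in range(n_pulses, 0, -1):
        loc = upper
        # Step down until C(loc, k) no longer exceeds the remaining code.
        while comb(loc, k) > code:
            loc -= 1
        locations.append(loc)
        code -= comb(loc, k)
        upper = loc - 1          # the next location must be smaller
    return locations[::-1]


if __name__ == "__main__":
    code = comb(3, 1) + comb(10, 2) + comb(20, 3) + comb(41, 4) + comb(79, 5)
    print(decode_locations(code))
```

For the last pulse (k = 1) the loop reduces to reading off the remaining value, matching the statement that the remaining number is L_1.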



Claims
  • 1. Method of processing speech comprising: receiving an original speech signal; using sample and hold techniques to digitize the original speech signal at a predetermined sampling rate to produce samples; analyzing the samples on a block basis by acquiring a predetermined number of the samples; providing preemphasis filtering of the block of samples; generating reflection coefficients for the block of samples; quantizing the reflection coefficients for voiced and unvoiced speech values; converting the voiced and unvoiced speech values to respective spectral coefficients; and using the spectral coefficients to compute respective log-spectral distances between the unquantized spectrum and the quantized spectrum.
  • 2. The method of claim 1, further comprising the preemphasis filtering providing a z-transform function.
  • 3. The method of claim 1, further comprising the quantizing of the reflection coefficients performed by using quantizer tables, the quantizer tables corresponding to the respective voiced and unvoiced speech values, thereby resulting in quantizing the reflection coefficients for voiced speech and quantizing the reflection coefficients for unvoiced speech.
  • 4. The method of claim 1, wherein the digitization of the original speech signal uses A/D circuitry along with said sample and hold techniques.
  • 5. The method of claim 1, further comprising providing the quantized reflection coefficients to a circuit for signal whitening.
  • 6. The method of claim 1, further comprising performing a predictive all-pole (LPC) analysis of the samples to generate the reflection coefficients.
  • 7. The method of claim 1, comprising: determining log-spectral distances of the quantized reflection coefficients; and selecting and retaining the set of quantized reflection coefficients which produces a smaller log-spectral distance.
  • 8. The method of claim 7, further comprising: encoding the retained reflection coefficient parameters for transmission; and converting the encoded retained reflection coefficient parameters to corresponding all-pole linear predictive LPC filter coefficients.
  • 9. The method of claim 1, further comprising: the LPC analysis performed on speech of block length N which corresponds to N/x seconds, where x is a sampling rate; and generating a set of filter coefficients for every N samples of speech or every N/x sec.
  • 10. The method of claim 9, further comprising interpolating the LPC parameters on a sub-frame basis at a sub-frame rate of twice the frame rate, thereby providing a set of parameters at a rate of twice the frame rate.
  • 11. The method of claim 1, wherein the digitization of the original speech signal uses sample/hold and A/D circuitry at sampling rate of 8 kHz.
  • 12. The method of claim 11, further comprising: the LPC analysis performed on speech of block length N which corresponds to N/8000 seconds; and generating a set of filter coefficients for every N samples of speech or every N/8000 sec.
Parent Case Info

This application is a continuation of U.S. patent application Ser. No. 10/083,237, filed Feb. 26, 2002, now U.S. Pat. No. 6,611,799 which is a continuation of U.S. patent application Ser. No. 09/805,634, filed Mar. 14, 2001, now U.S. Pat. No. 6,385,577, which is a continuation of U.S. patent application Ser. No. 09/441,743, filed Nov. 16, 1999, now U.S. Pat. No. 6,223,152, which is a continuation of U.S. patent application Ser. No. 08/950,658, filed Oct. 15, 1997, now U.S. Pat. No. 6,006,174, which is a file wrapper continuation of U.S. patent application Ser. No. 08/670,986, filed Jun. 28, 1996 now abandoned, which is a file wrapper continuation of U.S. patent application Ser. No. 08/104,174, filed Aug. 9, 1993, now abandoned, which is a continuation of U.S. patent application Ser. No. 07/592,330, filed Oct. 3, 1990, now U.S. Pat. No. 5,235,670, which applications are incorporated herein by reference.

US Referenced Citations (21)
Number Name Date Kind
4618982 Horvath et al. Oct 1986 A
4669120 Ono May 1987 A
4776015 Takeda et al. Oct 1988 A
4815134 Picone et al. Mar 1989 A
4845753 Yasunaga Jul 1989 A
4868867 Davidson et al. Sep 1989 A
4890327 Bertrand et al. Dec 1989 A
4980916 Zinser Dec 1990 A
4991213 Wilson Feb 1991 A
5001759 Fukui Mar 1991 A
5027405 Ozawa Jun 1991 A
5235670 Lin et al. Aug 1993 A
5265167 Akamine et al. Nov 1993 A
5307441 Tzeng Apr 1994 A
5999899 Robinson Dec 1999 A
6006174 Lin et al. Dec 1999 A
6223152 Lin et al. Apr 2001 B1
6385577 Lin et al. May 2002 B2
6591234 Chandran et al. Jul 2003 B1
6611799 Lin et al. Aug 2003 B2
6633839 Kushner et al. Oct 2003 B2
Foreign Referenced Citations (1)
Number Date Country
WO8602726 Jun 1986 WO
Non-Patent Literature Citations (6)
Entry
Proc. ICASSP '82, A New Model of LPC Excitation for Producing Natural-Sounding Speech at Low Bit Rates, B.S. Atal and J.R. Remde, pp 614-617, Apr. 1982.
Proc. ICASSP '84, Improving Performance of Multi-Pulse Coders at Low Bit Rates, S. Singhal and B.S. Atal, paper 1.3, Mar. 1984.
Proc. ICASSP '84, Efficient Computation and Encoding of the Multiple Excitation for LPC, M. Berouti et al., paper 10. Mar. 1, 1984.
Proc. ICASSP '86, Implementation of Multi-Pulse Coder on a Single Chip Floating-Point Signal Processor, H. Alrutz, paper 44. Apr. 3, 1986.
Digital Telephony, John Bellamy, pp 153-154, 1991.
Veeneman et al., Computationally Efficient Stochastic Coding of Speech, 1990, IEEE 40th Vehicular Technology Conference, May 1990, pp. 331-335.
Continuations (7)
Number Date Country
Parent 10/083237 Feb 2002 US
Child 10/446314 US
Parent 09/805634 Mar 2001 US
Child 10/083237 US
Parent 09/441743 Nov 1999 US
Child 09/805634 US
Parent 08/950658 Oct 1997 US
Child 09/441743 US
Parent 08/670986 Jun 1996 US
Child 08/950658 US
Parent 08/104174 Aug 1993 US
Child 08/670986 US
Parent 07/592330 Oct 1990 US
Child 08/104174 US