Systems and methods for encoding and decoding speech for lossy transmission networks

Information

  • Patent Grant
  • 6389006
  • Patent Number
    6,389,006
  • Date Filed
    Wednesday, May 6, 1998
    28 years ago
  • Date Issued
    Tuesday, May 14, 2002
    24 years ago
  • Inventors
  • Original Assignees
  • Examiners
    • Ton; Dang
    • Nguyen; Brian
    Agents
    • Eitan, Pearl, Latzer & Cohen-Zedek
Abstract
A voice encoder and decoder which attempt to minimize the effects of voice data packet loss, typically over wide area networks is provided. The voice encoder utilizes future data, such as the lookahead data typically available for linear predictive coding (LPC), to partially encode a future packet and to send the partial encoding as part of the current packet. The decoder utilizes the partial encoding of the previous packet to decode the current packet if the latter did not arrive properly.
Description




FIELD OF THE INVENTION




The present relates to systems and methods for transmitting speech and voice over a packet data network.




BACKGROUND OF THE INVENTION




Packet data networks send packets of data from one computer to another. They can be configured as local area networks (LANs) or as wide area networks (WANs). One example of the latter is the Internet.




Each packet of data is separately addressed and sent by the transmitting computer. The network routes each packet separately and thus, each packet might take a different amount of time to arrive at the destination. When the data being sent is part of a file which will not be touched until it has completely arrived, the varying delays is of no concern.




However, files and email messages are not the only type of data sent on packet data networks. Recently, it has become possible to also send real-time voice signals, thereby providing the ability to have voice conversations over the networks. For voice conversations, the voice data packets are played shortly after they are received which becomes difficult if a data packet is significantly delayed. For voice conversations, a packet which arrives very late is equivalent to being lost. On the Internet, 5%-25% of the packets are lost and, as a result, Internet phone conversations are often very choppy.




One solution is to increase the delay between receiving a packet and playing it, thereby allowing late packets to be received. However, if the delay is too large, the phone conversation becomes awkward.




Standards for compressing voice signals exist which define how to compress (or encode) and decompress (e.g. decode) the voice signal and how to create the packet of compressed data. The standards also define how to function in the presence of packet loss.




Most vocoders (systems which encode and decode voice signals) utilize already stored information regarding previous voice packets to interpolate what the lost packet might sound like. For example,

FIGS. 1A

,


1


B and


1


C illustrate a typical vocoder and its operation, where

FIG. 1A

illustrates the encoder


10


,

FIG. 1B

illustrates the operation of a pitch processor and

FIG. 1C

illustrates the decoder


12


. Examples of many commonly utilized methods are described in the book by Sadaoki Furui,


Digital Speech Processing, Synthesis and Recognition,


Marcel Dekker Inc., New York, N.Y., 1989. This book and the articles in its bibliography are incorporated herein by reference.




The encoder


10


receives a digitized frame of speech data and includes a short term component analyzer


14


, such as a linear prediction coding (LPC) processor, a long term component analyzer


16


, such as a pitch processor, a history buffer


18


, a remnant excitation processor


20


and a packet creator


17


. The LPC processor


14


determines the spectral coefficients (e.g. the LPC coefficients) which define the spectral envelope of each frame and, using the spectral coefficients, creates a noise shaping filter with which to filter the frame. Thus, the speech signal output of the LPC processor


14


, a “residual signal”, is generally devoid of the spectral information of the frame. An LPC converter


19


converts the LPC coefficients to a more transmittable form, known as “LSP” coefficients.




The pitch processor


16


analyses the residual signal which includes therein periodic spikes which define the pitch of the signal. To determine the pitch, pitch processor


16


correlates the residual signal of the current frame to residual signals of previous frames produced as described hereinbelow with respect to FIG.


1


B. The offset at which the correlation signal has the highest value is the pitch value for the frame. In other words, the pitch value is the number of samples prior to the start of the current frame at which the current frame best matches previous frame data. Pitch processor


16


then determines a long-term prediction which models the fine structure in the spectra of the speech in a subframe, typically of 40-80 samples. The resultant modeled waveform is subtracted from the signal in the subframe thereby producing a “remnant” signal which is provided to remnant excitation processor


20


and is stored in the history buffer


18


.





FIG. 1B

schematically illustrates the operation of pitch processor


16


where the residual signal of the current frame is shown to the right of a line


11


and data in the history buffer is shown to its left. Pitch processor


16


takes a window


13


of data of the same length as the current frame and which begins P samples before line


11


, where P is the current pitch value to be tested and provides window


13


to an LPC synthesizer


15


.




If the pitch value P is less than the size of a frame, there will not be enough history data to fill a frame. In this case, pitch processor


16


creates window


13


by repeating the data from the history buffer until the window is full.




Synthesizer


15


then synthesizes the residual signal associated with the window


13


of data by utilizing the LPC coefficients. Typically, synthesizer


15


also includes a format perceptual weighting filter which aids in the synthesis operation. The synthesized signal, shown at


21


, is then compared to the current frame and the quality of the difference signal is noted. The process is repeated for a multiplicity of values of pitch P and the selected pitch P is the one whose synthesized signal is closest to the current residual signal (i.e. the one which has the smallest difference signal).




The remnant excitation processor


20


characterizes the shape of the remnant signal and the characterization is provided to packet creator


17


. Packet creator


17


combines the LPC spectral coefficients, the pitch value and the remnant characterization into a packet of data and sends them to decoder


12


(FIG.


1


C), which includes a packet receiver


25


, a selector


22


, an LSP converter


24


, a history buffer


26


, a summer


28


, an LPC synthesizer


30


and a post-filter


32


.




Packet receiver


25


receives the packet and separates the packet data into the pitch value, the remnant signal and the LSP coefficients. LSP converter


24


converts the LSP coefficients to LPC coefficients.




History buffer


26


stores previous residual signals up to the present moment and selector


22


utilizes the pitch value to select a relevant window of the data from history buffer


26


. The selected window of the data is added to the remnant signal (by summer


28


) and the result is stored in the history buffer


26


, as a new signal. The new signal is also provided to LPC synthesis unit


30


which, using the LPC coefficients, produces a speech waveform. Post-filter


32


then distorts the waveform, also using the LPC coefficients, to reproduce the input speech signal in a way which is pleasing to the human ear.




In the G.723 vocoder standard of the International Telephone Union (ITU) remnants are interpolated in order to reproduce a lost packet. The remnant interpolation is performed in two different ways, depending on the state of the last good frame prior to the lost, or erased, frame. The state of the last good frame is checked with a voiced/unvoiced classifier.




The classifier is based on a cross-correlation maximization function. The last 120 samples of the last good frame (“vector”) are cross correlated with a drift of up to three samples. The index which reaches the maximum correlation value is chosen as the interpolation index candidate. Then, the prediction gain of the best vector is tested. If its gain is more than 2 dB, the frame is declared as voiced. Otherwise, the frame is declared as unvoiced.




The classifier returns 0 for the unvoiced case and the estimated pitch value for the voiced case. If the frame was declared unvoiced, an average gain is saved. If the current frame is marked as erased and the previous frame is classified as unvoiced, the remnant signal for the current frame is generated using a uniform random number generator. The random number generator output is scaled using the previously computed gain value.




In the voiced case, the current frame is regenerated with periodic excitation having a period equal to the value provided by the classifier. If the frame erasure state continues for the next two frames, the regenerated vector is attenuated by an additional 2 dB for each frame. After three interpolated frames, the output is muted completely.




SUMMARY OF THE INVENTION




There is provided, in accordance with a preferred embodiment of the present invention, a voice encoder and decoder which attempt to minimize the effects of voice data packet loss, typically over wide area networks.




Furthermore, in accordance with a preferred embodiment of the present invention, the voice encoder utilizes future data, such as the lookahead data typically available for linear predictive coding (LPC), to partially encode a future packet and to send the partial encoding as part of the current packet. The decoder utilizes the partial encoding of the previous packet to decode the current packet if the latter did not arrive properly.




There is also provided, in accordance with a preferred embodiment of the present invention, a voice data packet which includes a first portion containing information regarding the current voice frame and a second portion containing partial information regarding the future voice frame.











BRIEF DESCRIPTION OF THE DRAWINGS




The present invention will be understood and appreciated more fully from the following detailed description taken in conjunction with the appended drawings in which:





FIGS. 1A

,


1


B and


1


C are of a prior art vocoder and its operation, where

FIG. 1A

is a block diagram of an encoder,

FIG. 1B

is a schematic illustration of the operation of a part of the encoder of FIG.


1


A and

FIG. 1C

is a block diagram illustration of decoder;





FIG. 2

is a schematic illustration of the data utilized for LPC encoding;





FIG. 3

is a schematic illustration of a combination packet, constructed and operative in accordance with a preferred embodiment of the present invention;





FIGS. 4A and 4B

are block diagram illustrations of a voice encoder and decoder, respectively, in accordance with a preferred embodiment of the present invention; and





FIG. 5

is a schematic illustration, similar to

FIG. 1B

, of the operation of one part of the encoder of FIG.


4


A.











DETAILED DESCRIPTION OF THE PRESENT INVENTION




Reference is now made to

FIGS. 2

,


3


,


4


A,


4


B and


5


which illustrate the vocoder of the present invention.

FIG. 2

illustrates the data which is utilized for LPC encoding,

FIG. 3

illustrates the packet which is transmitted,

FIG. 4A

illustrates the encoder,

FIG. 4B

illustrates the decoder and

FIG. 5

illustrates how the data is used for future frame encoding.




It is noted that the short term analysis, such as the LPC encoding performed by LPC processor


14


, typically utilizes lookahead and lookbehind data. This is illustrated in

FIG. 2

which shows three frames, the current frame


40


, the future frame


42


and the previous frame


44


. The data utilized for the short term analysis is indicated by arc


46


and includes all of current frame


40


, a lookbehind portion


48


of previous frame


44


and a lookahead portion


50


of future frame


42


. The sizes of portions


48


and


50


and typically 30-50% of the size of frames


40


,


42


and


44


and is set for a specific vocoder.




Applicant has realized that lookahead portion


50


can be utilized to provide at least partial information regarding future frame


42


to help the decoder reconstruct future frame


42


, if the packet containing future frame


42


is improperly received (i.e. lost or corrupted).




In accordance with a preferred embodiment of the present invention and as shown in

FIG. 3

, a voice data packet


52


comprises a current frame portion


54


having a compressed version of current frame


40


and a future frame portion


56


having some data regarding future frame


42


based on lookahead portion


50


. It is noted that future frame portion


56


is considerably smaller than current frame portion


54


; typically, future frame portion


56


is of the order of 2-4 bits. The size of future frame portion


56


can be preset or, if there is a mechanism to determine the extent of packet loss, the size can be adaptive, increasing when there is greater packet loss and decreasing when the transmission is more reliable.




In the example provided hereinbelow, the future frame portion


56


stores a change in the pitch from current frame


40


to lookahead portion


50


assuming that the LPC coefficients have decayed slightly. Thus, all that has to be transmitted is just the change in the pitch; the LPC coefficients are present from current frame


40


as is the base pitch. It will be appreciated that the present invention incorporates all types of future frame portions


56


and the vocoders which encode and decode them.





FIGS. 4A and 4B

illustrate an exemplary version of an updated encoder


10


′ and decoder


12


′, respectively, for a future frame portion


56


storing a change in pitch. Similar reference numerals refer to similar elements.




Encoder


10


′ processes current frame


40


as in prior art encoder


10


. Accordingly, encoder


10


′ includes a short term analyzer and encoder, such as LPC processor


14


and LPC converter


25


, a long term analyzer, such as pitch processor


16


, history butter


18


, remnant excitation processor


20


and packet creator


17


. Encoder


10


′ operates as described hereinabove with respect to

FIG. 1B

, determining the LPC coefficients, LPC


c


, pitch P


c


and remnants for the current frame and providing the residual signal to the history buffer


18


.




Packet creator


17


combines the LSP, pitch and remnant data and, in accordance with a preferred embodiment of the present invention, creates current frame portion


54


of the allotted size. The remaining bits of the packet will hold the future frame portion


56


.




To create future frame portion


56


for this embodiment, encoder


10


′ additionally includes an LSP converter


60


, a multiplier


62


and a pitch change processor


64


which operate to provide an indication of the change in pitch which is present in future frame


42


.




Encoder


10


′ assumes that the spectral shape of lookahead portion


50


(FIG.


2


), is almost the same as that in current frame


40


. Thus, multiplier


62


multiplies the LSP coefficients LSP


c


of current frame


40


by a constant α, where α is close to 1, thereby creating the LSP coefficients LSP


L


of lookahead portion


50


. LSP converter


61


converts the LSP


L


coefficients to LPC


L


coefficients.




Encoder


10


′ then assumes that the pitch of lookahead portion


50


is close to the pitch of current frame


40


. Thus, pitch change processor


64


extends or shrinks the pitch value P


c


of current frame


40


by a few samples in each direction where the maximal shift s depends on the number of bits N available for future frame portion


56


of packet


52


. Thus, maximal shift s is: 2


N−1


samples.




As shown in

FIG. 5

, pitch change processor


64


retrieves windows


65


starting at the sample which is P


c


+s samples from an input end (indicated by line


68


) of the history buffer


18


. It is noted that the history buffer already includes the residual signal for current frame


40


. In this embodiment, pitch change processor


64


provides each window


65


to an LPC synthesizer


69


which synthesizes the residual signal associated with the window


65


by utilizing the LPC


L


coefficients of the lookahead portion


50


. Synthesizer


69


does not include a format perceptual weighting filter.




As with pitch processor


16


, pitch change processor


64


compares the synthesized signal to the lookahead portion


50


and the selected pitch P


c


+s is the one which best matches the lookahead portion


50


. Packet creator


17


then includes the bit value of s in packet


52


as future frame portion


56


.




If lookahead portion


50


is part of an unvoiced frame, then the quality of the matches will be low. Encoder


10


′ can include a threshold level which defines the minimal match quality. If none of the matches is greater than the threshold level, then the future frame is declared an unvoiced frame. Accordingly, packet creator


17


provides a bit value for the future frame portion


56


which is out of the range of s. For example, if s has the values of −2, −1, 0, 1 or 2 and future frame portion


56


is three bits wide, then there are three bit combinations which are not used for the value of s. One or more of these combinations can be defined as an “unvoiced flag”.




When future frame


42


is an unvoiced frame, encoder


10


′ does not add anything into history buffer


18


.




In this embodiment (as shown in FIG.


4


B), decoder


12


′ has two extra elements, a summer


70


and a multiplier


72


. For decoding current frame


40


, decoder


12


′ includes packet receiver


25


, selector


22


, LSP converter


24


, history buffer


26


, summer


28


, LPC synthesizer


30


and post-filter


32


. Elements


22


,


24


,


26


,


28


,


30


and


32


operate as described hereinabove on the LPC coefficients LPC


c


, current frame pitch P


c


, and the remnant excitation signal of the current frame, thereby to create the reconstructed current frame signal. The latter operation is marked with solid lines.




Decoding future frame


42


, indicated with dashed lines, only occurs if packet receiver


25


determines that the next packet has been improperly received. If the pitch change value s is the unvoiced flag value, packet receiver


25


randomly selects a pitch value P


R


. Otherwise, summer


70


adds the pitch change value s to the current pitch value P


c


to create the pitch value P


L


of the lost frame. Selector


22


then selects the data of history buffer


26


beginning at the P


L


sample (or at the P


R


sample for an unvoiced frame) and provides the selected data both to the LPC synthesizer


30


and back into the history buffer


26


.




Multiplier


72


multiplies the LSP coefficients LSP


c


of the current frame by α (which has the same value as in encoder


10


′) and LSP converter


24


converts the resultant LSP


L


coefficients to create the LPC coefficients LPC


L


of the lookahead portion. The latter are provided to both LPC synthesizer


30


and post-filter


32


. Using the LPC coefficients LPC


L


, LPC synthesizer


30


operates on the output of history buffer


26


and post-filter


32


operates on the output of LPC synthesizer


30


. The result is an approximate reconstruction of the improperly received frame.




It will be appreciated that the present invention is not limited by what has been described hereinabove and that numerous modifications, all of which fall within the scope of the present invention, exist. For example, while the present invention has been described with respect to transmitting pitch change information, it also incorporates creating a future frame portion


56


describing other parts of the data, such as the remnant signal etc. in addition to or instead of describing the pitch change.




It will be appreciated by persons skilled in the art that the present invention is not limited by what has been particularly shown and described herein above. Rather the scope of the invention is defined by the claims which follow.



Claims
  • 1. A voice decoder comprising:a packet receiver for receiving a current packet including a current frame portion including a pitch value and short term spectral parameters describing a current frame of voice data and a future frame portion including a pitch change value at least partially describing at least a section of a future frame of voice data; current decoding means for decoding said current frame of voice data from said current frame portion when said current packet is properly received; and future decoding means for decoding a future frame of voice data from at least the future frame portion of a previously properly received packet when said current packet is improperly received, said future decoding means including: means for creating a new pitch value for said improperly received packet from said pitch value and said pitch change value of said properly received packet; an extrapolator for extrapolating new short term spectral parameters for said improperly received packet from said short term spectral parameters of said properly received packet; and means for decoding said improperly received packet using said new pitch value and said new short term spectral parameters.
  • 2. A method for decoding a packet of voice data, the method comprising:receiving a current packet including a current frame portion including a pitch value and short term spectral parameters describing a current frame of voice data and a future frame portion including a pitch change value at least partially describing at least a section of a future frame of voice data; decoding said current frame of voice data from said current frame portion when said current packet is properly received; and decoding a future frame of voice data from at least the future frame portion of a previously properly received packet when said current packet is improperly received, including: creating a new pitch value for said improperly received packet from said pitch value and said pitch change value of said properly received packet; extrapolating new short term spectral parameters for said improperly received packet from said short term spectral parameters of said properly received packet; and decoding said improperly received packet using said new pitch value and said new short term spectral parameters.
Priority Claims (1)
Number Date Country Kind
120788 May 1997 IL
US Referenced Citations (12)
Number Name Date Kind
4969192 Chen et al. Nov 1990 A
5293449 Tzeng Mar 1994 A
5307441 Tzeng Apr 1994 A
5384891 Asakawa et al. Jan 1995 A
5457783 Chhatwal Oct 1995 A
5544278 Bialik et al. Aug 1996 A
5699485 Shoham Dec 1997 A
5732389 Kroon et al. Mar 1998 A
5774846 Morii Jun 1998 A
5778335 Ubale et al. Jul 1998 A
5890108 Yeldener Mar 1999 A
6018706 Huang et al. Jan 2000 A
Non-Patent Literature Citations (2)
Entry
Peter Kroon et al., “A Class Analysis by Synthesis Predictive Coders for High Quality Speech Coding at Rates Between 4.8 and 16 kbits/s” IEEE Journal on Selected Areas in Communications, vol. 5, No. 2, Feb. 1988, pp. 353-363.
Furui, “Digital Speech Processing, Synthesis and Recognition”, Marcel Dekker Inc., New York, 1989.