The invention relates to electronic devices, and more particularly to speech coding, transmission, storage, and decoding/synthesis methods and circuitry.
The performance of digital speech systems using low bit rates has become increasingly important with current and foreseeable digital communications. Both dedicated channel and packetized-over-network (e.g., Voice over IP or Voice over Packet) transmissions benefit from compression of speech signals. The widely-used linear prediction (LP) digital speech coding compression method models the vocal tract as a time-varying filter and a time-varying excitation of the filter to mimic human speech. Linear prediction analysis determines LP coefficients ai, i=1, 2, . . . , M, for an input frame of digital speech samples {s(n)} by setting
r(n) = s(n) + Σ1≤i≤M ai s(n−i)  (1)
and minimizing the energy Σ r(n)² of the residual r(n) in the frame. Typically, M, the order of the linear prediction filter, is taken to be about 10-12; the sampling rate to form the samples s(n) is typically taken to be 8 kHz (the same as the public switched telephone network sampling for digital transmission); and the number of samples {s(n)} in a frame is typically 80 or 160 (10 or 20 ms frames). A frame of samples may be generated by various windowing operations applied to the input speech samples. The name “linear prediction” arises from the interpretation of r(n) = s(n) + Σ1≤i≤M ai s(n−i) as the error in predicting s(n) by the linear combination of preceding speech samples −Σ1≤i≤M ai s(n−i). Thus minimizing Σ r(n)² yields the {ai} which furnish the best linear prediction for the frame. The coefficients {ai} may be converted to line spectral frequencies (LSFs) for quantization and transmission or storage and converted to line spectral pairs (LSPs) for interpolation between subframes.
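As a concrete illustration of equation (1), the residual computation can be sketched in NumPy as follows; this is a minimal sketch assuming zero sample history before the frame, and the function name is illustrative:

```python
import numpy as np

def lp_residual(s, a):
    """Residual of equation (1): r(n) = s(n) + sum_{1<=i<=M} a_i s(n-i).

    s : speech samples for the frame (1-D array)
    a : LP coefficients a_1..a_M
    Samples before the start of the frame are taken as zero.
    """
    s = np.asarray(s, dtype=float)
    r = s.copy()
    for i in range(1, len(a) + 1):
        # add a_i * s(n-i) for every n >= i in the frame
        r[i:] += a[i - 1] * s[:-i]
    return r
```

With a = [−1] (predicting each sample by its predecessor) a constant signal yields a residual that is zero after the first sample, as expected.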
The {r(n)} is the LP residual for the frame, and ideally the LP residual would be the excitation for the synthesis filter 1/A(z) where A(z) is the transfer function of equation (1). Of course, the LP residual is not available at the decoder; thus the task of the encoder is to represent the LP residual so that the decoder can generate an excitation which emulates the LP residual from the encoded parameters. Physiologically, for voiced frames the excitation roughly has the form of a series of pulses at the pitch frequency, and for unvoiced frames the excitation roughly has the form of white noise.
The LP compression approach basically only transmits/stores updates for the (quantized) filter coefficients, the (quantized) residual (waveform or parameters such as pitch), and (quantized) gain(s). A receiver decodes the transmitted/stored items and regenerates the input speech with the same perceptual characteristics. Periodic updating of the quantized items requires fewer bits than direct representation of the speech signal, so a reasonable LP coder can operate at bit rates as low as 2-3 kb/s (kilobits per second). In more detail, the ITU standard G.729 uses frames of 10 ms length (80 samples) divided into two 5-ms 40-sample subframes for better tracking of pitch and gain parameters plus reduced codebook search complexity. Each subframe has an excitation represented by an adaptive-codebook contribution plus a fixed (algebraic) codebook contribution, and thus the name CELP for code-excited linear prediction. The adaptive-codebook contribution provides periodicity in the excitation and is the product of v(n), the prior frame's excitation translated by the current frame's pitch lag in time and interpolated, multiplied by a gain, gP. The algebraic codebook contribution approximates the difference between the actual residual and the adaptive codebook contribution with a four-pulse vector, c(n), multiplied by a gain, gC. Thus the excitation is u(n)=gP v(n)+gC c(n) where v(n) comes from the prior (decoded) frame and gP, gC, and c(n) come from the transmitted parameters for the current frame. The speech synthesized from the excitation is then postfiltered to mask noise. Postfiltering essentially comprises three successive filters: a short-term filter, a long-term filter, and a tilt compensation filter. The short-term filter emphasizes the formants; the long-term filter emphasizes periodicity; and the tilt compensation filter compensates for the spectral tilt typical of the short-term filter.
Further, as illustrated in
CELP coders apparently perform well in the 6-16 kb/s bit rates often found with VoIP transmissions. However, known CELP coders perform less well at higher bit rates in a layered coding design, probably because the transmitter does not know how many layers will be decoded at the receiver.
The present invention provides a layered CELP coding with one or more filterings: progressively weaker perceptual filtering in the encoder, progressively weaker short-term postfiltering in the decoder, and pitch postfiltering for all layers in the decoder.
This has advantages including achieving non-layered quality with a layered CELP coding system.
FIGS. 2a-2b illustrate a layered CELP encoder and decoder.
FIGS. 3a-3c show filter spectra.
1. Overview
The preferred embodiment systems include preferred embodiment encoders and decoders which use layered CELP coding with one or more of three filterings: progressively weaker perceptual filtering in the encoder for enhancement layer codebook searches, progressively weaker short-term postfiltering in the decoder for successively higher bit rates, and decoder long-term postfiltering for all layers.
2. Encoder Details
First consider a layered CELP encoder in more detail in order to explain the preferred embodiment filters.
In more detail, a preferred embodiment includes the following steps.
(1) Sample an input speech signal (which may be preprocessed to filter out dc and low frequencies, etc.) at 8 kHz or 16 kHz to obtain a sequence of digital samples, s(n). Partition the sample stream into 80-sample or 160-sample frames (e.g., 10 ms frames) or other convenient frame size. The analysis and coding may use various size subframes of the frames.
(2) For each frame (or subframes) apply linear prediction (LP) analysis to find LP (and thus LSF/LSP) coefficients and thereby also define the LPC synthesis filter 1/A(z). Quantize the LSP coefficients for transmission; this also defines the quantized LPC synthesis filter 1/Â(z). The same synthesis filter will be used for all enhancement layers in addition to the base layer. Note that the roots of A(z)=0 are within the complex unit circle and correspond to formants (peaks) in the spectrum of the synthesis filter. LP analysis typically uses a windowed version of s(n).
(3) Perceptually filter the speech s(n) with the perceptual weighting filter (PWF) defined by W(z)=A(z/γ1)/A(z/γ2) to yield s′(n). This filtering masks quantization noise by shaping the noise to appear near formants where the speech signal is stronger and thereby gives better results in the error minimization which defines the estimation. The parameters γ1 and γ2 determine the level of noise masking (1≧γ1≧γ2>0). In general, a low bit rate CELP encoder uses the PWF with stronger noise masking (e.g., γ1=0.9 and γ2=0.5) while a high bit rate CELP encoder uses a PWF with weaker noise masking (e.g., γ1=0.9 and γ2=0.65).
In contrast, the first preferred embodiments progressively weaken the PWF from layer to layer as illustrated in
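A direct-form sketch of the perceptual weighting filter of step (3), assuming zero filter state at the frame boundary (in practice the filter state carries across frames); the function name is illustrative:

```python
import numpy as np

def perceptual_weight(s, a, g1=0.9, g2=0.5):
    """Apply W(z) = A(z/g1)/A(z/g2) to a frame s (direct form, zero state).

    A(z) = 1 + sum_i a_i z^-i, so A(z/g) has coefficients a_i * g**i.
    A larger gap between g1 and g2 gives stronger noise masking.
    """
    a = np.asarray(a, dtype=float)
    M = len(a)
    num = np.concatenate(([1.0], a * g1 ** np.arange(1, M + 1)))
    den = np.concatenate(([1.0], a * g2 ** np.arange(1, M + 1)))
    out = np.zeros(len(s))
    for n in range(len(s)):
        # FIR part A(z/g1) applied to the input
        acc = sum(num[i] * s[n - i] for i in range(M + 1) if n - i >= 0)
        # IIR part 1/A(z/g2) fed back from prior outputs
        acc -= sum(den[i] * out[n - i] for i in range(1, M + 1) if n - i >= 0)
        out[n] = acc
    return out
```

Setting g1 = g2 makes W(z) = 1, so the filter passes the input unchanged, which is a convenient sanity check.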
(4) Find a pitch delay (for the base layer) by searching correlations of s′(n) with s′(n+k) in a windowed range. The search may be in two stages: first perform an open loop search using correlations of s′(n) to find a pitch delay. Then perform a closed loop search to refine the pitch delay by interpolation from maximizations of the normalized inner product <x|yk> of the target speech x(n) in the (sub)frame with the speech yk(n) generated by applying the (sub)frame's quantized LP synthesis filter and PWF to the prior (sub)frame's base layer excitation delayed by k. The target x(n) is s′(n) minus the zero-input response of the quantized LP synthesis filter plus PWF. The adaptive codebook vector v(n) is then the prior (sub)frame's base layer excitation (uprior(n)) translated by the refined pitch delay and interpolated. The same adaptive codebook vector applies to all enhancement layers in the sense that the enhancement layers only add to the fixed codebook contribution to the excitation. Thus the decoder will generate an excitation u(n) as gP v(n)+gC0 c0(n)+gC1 c1(n)+ . . . where gP is the adaptive codebook gain, gCj is the layer j fixed codebook gain, and cj(n) is the layer j fixed codebook vector.
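The open-loop stage of step (4) can be sketched as a normalized-correlation search over candidate lags; the lag range 20-143 follows common 8 kHz CELP practice and is an assumption here:

```python
import numpy as np

def open_loop_pitch(sw, lag_min=20, lag_max=143):
    """Open-loop pitch: the lag k maximizing the normalized correlation
    of the weighted speech sw(n) with its delayed version sw(n - k)."""
    best_lag, best_score = lag_min, -np.inf
    for k in range(lag_min, lag_max + 1):
        x, y = sw[k:], sw[:-k]           # y is sw delayed by k
        e = np.dot(y, y)
        score = np.dot(x, y) / np.sqrt(e) if e > 0 else 0.0
        if score > best_score:
            best_score, best_lag = score, k
    return best_lag
```

On a sinusoid of period 40 samples the search returns 40 rather than a multiple, because the normalization favors the shortest lag with full correlation.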
(5) Determine the adaptive codebook gain, gP, as the ratio of the inner product <x|y> divided by <y|y> where x(n) is the target in the (sub)frame and y(n) is the (sub)frame signal generated by applying the quantized LP synthesis filter and then PWF to the adaptive codebook vector v(n) from step (4). Thus gPv(n) is the adaptive codebook contribution to the excitation and gPy(n) is the adaptive codebook contribution to the speech in the (sub)frame.
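Step (5) reduces to a one-line least-squares ratio; the clamp to [0, 1.2] is a common CELP bound and an assumption here, as is the function name:

```python
import numpy as np

def adaptive_gain(x, y):
    """g_P = <x|y> / <y|y>, clamped to a common CELP range [0, 1.2]."""
    e = np.dot(y, y)
    g = np.dot(x, y) / e if e > 0 else 0.0
    return min(max(g, 0.0), 1.2)
```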
(6) Find the base layer (layer 0) fixed (algebraic) codebook vector c0(n) by essentially maximizing the correlation of c0(n) filtered by the quantized LP synthesis filter and then PWF with x(n)−gP y(n) as the target in the (sub)frame. That is, remove the adaptive codebook contribution to have a new target. In particular, search over possible algebraic codebook vectors c0(n) to maximize the ratio of the square of the correlation <x−gP y|H|c> divided by the energy <c|HTH|c> where h(n) is the impulse response of the quantized LP synthesis filter (with perceptual filtering) and H is the lower triangular Toeplitz convolution matrix with diagonals h(0), h(1), . . . .
The preferred embodiments use fixed codebook vectors c(n) with 40 positions in the case of 40-sample (5 ms for 8 kHz sampling rate) (sub)frames as the encoding granularity. The 40 samples are partitioned into two interleaved tracks with 1 pulse (which is ±1) positioned within each track. For the base layer each track has 20 samples; whereas for the enhancement layers each track has 8 samples and the tracks are offset. That is, with the 40 positions labeled 0,1,2, . . . ,39, layer 1 has tracks {0,5,10, . . . ,35} and {1,6,11, . . . ,36}; layer 2 has tracks {2,7,12, . . . ,37} and {3,8,13, . . . ,38}; and so forth with rollover.
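The track layout described above can be sketched as follows; the even/odd split for the two base layer tracks is an assumed interleaving consistent with the description, and the function name is illustrative:

```python
def layer_tracks(layer, n=40):
    """Pulse-position tracks for the layered algebraic codebook.

    Layer 0: two interleaved 20-position tracks (assumed even/odd split).
    Layer j >= 1: two 8-position tracks of stride 5, offset by 2*(j-1),
    with rollover modulo n.
    """
    if layer == 0:
        return [list(range(0, n, 2)), list(range(1, n, 2))]
    off = 2 * (layer - 1)
    return [[(off + 5 * i) % n for i in range(8)],
            [(off + 1 + 5 * i) % n for i in range(8)]]
```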
(6) Determine the base layer fixed codebook gain, gC0 by minimizing |x−gPy−gC0z0| where, as in the foregoing description, x(n) is the target in the (sub)frame, gP is the adaptive codebook gain, y(n) is the quantized LP synthesis filter plus PWF applied to v(n), and z0(n) is the signal in the frame generated by applying the quantized LP synthesis filter plus PWF to the algebraic codebook vector c0(n).
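The gain determination above is again a least-squares projection: minimizing |x − gP y − gC0 z0| over gC0 gives gC0 = <x−gP y|z0>/<z0|z0>, sketched below with illustrative names:

```python
import numpy as np

def fixed_gain(x, gP, y, z0):
    """Least-squares g_C0 minimizing |x - gP*y - gC0*z0|."""
    t = x - gP * y        # new target after removing the adaptive contribution
    e = np.dot(z0, z0)
    return np.dot(t, z0) / e if e > 0 else 0.0
```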
(7) Sequentially, determine enhancement layer fixed codebook vectors and gains as illustrated in
FIGS. 3a-3b illustrate the filtering.
In more detail, denote by ŝ(0)(n) the output of the LP synthesis filter applied to the layer 0 excitation, gP v(n)+gC0 c0(n). Thus ŝ(0)(n) estimates the original signal s(n) but was derived from minimizing the error e0′=PWF0[s(n)−ŝ(0)(n)]; that is, minimizing the difference of perceptually weighted versions of the original signal and the LP synthesis filter output. And the strength of PWF0 depends upon the bit rate of the base layer.
For the first enhancement layer the total bit rate is greater than that of the base layer alone, so apply less perceptual weighting to the difference being minimized during the fixed codebook 1 search. In particular, the total excitation for layers 0 plus 1 is gP v(n)+gC0 c0(n)+gC1 c1(n) and thus the total estimate for s(n) output by the LP synthesis filter is ŝ(0)(n)+ŝ(1)(n) where ŝ(1)(n) is the output of the LP synthesis filter applied to the layer 1 fixed codebook excitation contribution gC1 c1(n). Thus minimize the error e1′=PWF1[s(n)−ŝ(0)(n)−ŝ(1)(n)] where PWF1 is the perceptual weighting filter for layer 1.
Analogous to the foregoing description of the first enhancement layer, for the second enhancement layer the total bit rate is greater than that of the first plus base layers, so apply even less perceptual weighting to the difference being minimized during the fixed codebook 2 search. In particular, the total excitation for layers 0 plus 1 plus 2 is gP v(n)+gC0 c0(n)+gC1 c1(n)+gC2 c2(n) and thus the total estimate for s(n) output by the LP synthesis filter is ŝ(0)(n)+ŝ(1)(n)+ŝ(2)(n) where ŝ(2)(n) is the output of the LP synthesis filter applied to the layer 2 fixed codebook excitation contribution gC2 c2(n). Thus minimize the error e2′=PWF2[s(n)−ŝ(0)(n)−ŝ(1)(n)−ŝ(2)(n)] where PWF2 is the perceptual weighting filter for layer 2. Similarly for higher enhancement layers and perceptual filters.
The LP synthesis filter is the same for all enhancement layers.
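The sequential enhancement-layer searches of step (7) can be sketched schematically as follows; gains are omitted for brevity, and synth, pwf_bank, and codebooks are illustrative stand-ins for the shared quantized synthesis filter, the progressively weaker perceptual filters, and the per-layer fixed codebooks:

```python
import numpy as np

def layered_search(s, synth, pwf_bank, codebooks):
    """Sequential enhancement-layer fixed-codebook search (schematic).

    synth(c)     : shared quantized LP synthesis filter applied to c
    pwf_bank[j]  : perceptual weighting for layer j (weaker as j grows)
    codebooks[j] : candidate fixed-codebook vectors for layer j
    """
    estimate = np.zeros_like(s)
    chosen = []
    for pwf, cb in zip(pwf_bank, codebooks):
        # pick the codevector minimizing the perceptually weighted error
        best = min(cb, key=lambda c: np.sum(pwf(s - estimate - synth(c)) ** 2))
        chosen.append(best)
        # each layer only adds to the running synthesis estimate
        estimate = estimate + synth(best)
    return chosen, estimate
```

The key structural point mirrors the text: each layer minimizes the weighted difference between the original signal and the accumulated estimate, with its own weighting filter but the same synthesis filter.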
(8) Quantize the adaptive codebook pitch delay and gain gP and the fixed (algebraic) codebook vectors c0(n), c1(n), c2(n), . . . and gains gC0, gC1, gC2, gC3, . . . to be parts of the layered transmitted codeword. The algebraic codebook gains may be factored and predicted, and the two layer 0 gains may be jointly quantized with a vector quantization codebook. The layer 0 excitation for the (sub)frame is u(n)=gP v(n)+gC0 c0(n), and the excitation memory is updated for use with the next (sub)frame.
Note that all of the items quantized typically would be differential values with the preceding frame's values used as predictors. That is, only the differences between the actual and the predicted values would be encoded.
The final codeword encoding the (sub)frame would include bits for the quantized LSF/LSP coefficients, quantized adaptive codebook pitch delay, algebraic codebook vectors, and the quantized adaptive codebook and algebraic codebook gains.
3. Decoder Details
A first preferred embodiment decoder and decoding method essentially reverses the encoding steps for a bitstream encoded by the preferred embodiment layered encoding method and also applies preferred embodiment short-term postfiltering and preferred embodiment long-term postfiltering. In particular, for a coded (sub)frame in the bitstream presume layers 0 through N are being used for the (sub)frame:
(1) Decode the quantized LP coefficients; these are in layer 0 and always present unless the frame has been erased. The coefficients may be in differential LSP form, so a moving average of prior frames' decoded coefficients may be used. The LP coefficients may be interpolated every 40 samples in the LSP domain to reduce switching artifacts.
(2) Decode the adaptive codebook quantized pitch delay, and apply this pitch delay to the prior decoded (sub)frame's excitation to form the decoded adaptive codebook vector v(n). Again, the pitch delay is in layer 0.
(3) Decode the algebraic codebook vectors c0(n), c1(n), c2(n), . . . cN(n).
(4) Decode the quantized adaptive codebook gain, gP, and the algebraic codebook gains gC0, gC1, gC2, . . . , gCN.
(5) Form the excitation for the (sub)frame as u(n)=gP v(n)+gC0 c0(n)+gC1 c1(n)+gC2 c2(n)+ . . . +gCN cN(n) using the decodings from steps (2)-(4).
(6) Synthesize speech by applying the LP synthesis filter from step (1) to the excitation from step (5) to yield ŝ(n).
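Decoder steps (5)-(6) can be sketched as follows, assuming zero synthesis-filter history (a real decoder carries filter and excitation memory across (sub)frames); the function names are illustrative:

```python
import numpy as np

def decode_excitation(gP, v, fixed_gains, fixed_vectors):
    """u(n) = gP v(n) + sum_j gCj cj(n) over the decoded layers."""
    u = gP * np.asarray(v, dtype=float)
    for g, c in zip(fixed_gains, fixed_vectors):
        u = u + g * np.asarray(c, dtype=float)
    return u

def synthesize(u, a):
    """1/A(z) synthesis: s(n) = u(n) - sum_i a_i s(n-i), zero history."""
    s = np.zeros(len(u))
    for n in range(len(u)):
        s[n] = u[n] - sum(a[i] * s[n - 1 - i]
                          for i in range(len(a)) if n - 1 - i >= 0)
    return s
```

Feeding an impulse through 1/A(z) with a₁ = −0.9 produces the decaying exponential 1, 0.9, 0.81, . . . , the inverse behavior of the analysis filter.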
(7) Apply preferred embodiment short-term postfiltering to the synthesized speech with filter PS(z)=Â(z/α1)/Â(z/α2) to sharpen the formant peaks. The factors α1 and α2 depend upon the number of enhancement layers used, and as the number of enhancement layers increases the sharpening decreases. Of course, the short-term postfilter PS(z) has the same form as the perceptual weighting filter but does the opposite: it sharpens formant peaks because α1<α2 rather than γ1>γ2 as in the PWF. Sharpened peaks tend to mask quantization noise.
The following table shows preferred embodiment α1 and α2 dependence on bit rates where layer 0 requires 6.25 kbps and each enhancement layer above layer 0 requires another 2.2 kbps.
FIG. 3c illustrates these filters.
(8) Apply preferred embodiment long-term postfiltering to the short-term postfiltered synthesized speech with filter PL(z)=(1+gγz−T)/(1+gγ) where T is the pitch delay, g is the gain, and γ is a factor controlling the degree of filtering and typically would equal 0.5. Filtering with PL(z) emphasizes periodicity and suppresses noise between pitch harmonic peaks. In more detail, the pitch delay T can be the decoded pitch delay from step (2) or a further refinement of the decoded pitch delay, and the gain can be derived from the refinement computations. Indeed, take the residual ř(n) to be the decoded estimate ŝ(n) from step (6) filtered through Â(z/α1), the analysis part of the short-term postfilter. Then search over fractional k about the integer part of the decoded pitch delay to maximize the normalized correlation:

[Σn ř(n) řk(n)]² / ([Σn řk(n)²][Σn ř(n)²])

where řk(n) is ř(n) delayed by k and found by interpolation for non-integral k. If the correlation is less than 0.5, then take the gain g=0 so there is no long-term postfiltering because the periodicity is small. Otherwise, take

g = Σn ř(n) řk(n) / Σn řk(n)²
This long-term postfilter applies to all bit rates (all numbers of enhancement layers) and compensates for the use of a single pitch determination in the base layer rather than in each enhancement layer.
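A sketch of the long-term postfilter gain computation and filtering of step (8), restricted to integer delays for simplicity (the description above also searches fractional delays by interpolation); the function names are illustrative:

```python
import numpy as np

def ltp_gain(r, k, threshold=0.5):
    """Gain for PL(z) = (1 + g*gamma*z^-T)/(1 + g*gamma).

    r : residual of the decoded speech filtered through A-hat(z/alpha1)
    k : candidate pitch delay (integer here)
    Returns 0 when the normalized squared correlation is below threshold.
    """
    x, y = r[k:], r[:-k]                  # y is r delayed by k
    num = np.dot(x, y)
    corr = num ** 2 / (np.dot(x, x) * np.dot(y, y))
    if corr < threshold:
        return 0.0                        # weak periodicity: no filtering
    return num / np.dot(y, y)

def ltp_filter(s, T, g, gamma=0.5):
    """Apply the long-term postfilter with delay T and gain g."""
    s = np.asarray(s, dtype=float)
    out = s.copy()
    out[T:] += g * gamma * s[:-T]
    return out / (1.0 + g * gamma)
```

For a perfectly periodic residual the correlation is 1 and the gain is 1; with g = 0 the filter reduces to the identity, matching the behavior described above.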
4. System Preferred Embodiments
5. Modifications
The preferred embodiments may be modified in various ways while retaining the features of layered coding with encoders having a weaker perceptual filter for at least one of the enhancement layers than for the base layer, decoders having weaker short-term postfiltering for at least one enhancement layer than for the base layer, or decoders having long-term postfiltering for all layers.
For example, the overall sampling rate, frame size, LP order, codebook bit allocations, prediction methods, and so forth could be varied while retaining a layered coding. Further, the filter parameters γ and α could be varied while enhancement layers are included, provided the filters maintain strength or weaken for each layer for the layered encoding and/or the short-term postfiltering. The long-term postfiltering could vary the correlation threshold at which the gain is taken as zero, and its filter factor could be varied separately.
This application claims priority from provisional application: Ser. No. 60/248,988, filed Nov. 15, 2000, which is incorporated herein by reference.
Number | Date | Country | |
---|---|---|---|
20020107686 A1 | Aug 2002 | US |