Recently, there has been growing interest in developing toll-quality speech coders at rates of 4 kbps and below. The speech quality produced by waveform coders such as code-excited linear prediction (CELP) coders degrades rapidly at rates below 5 kbps [B. S. Atal, and M. R. Schroeder, “Stochastic Coding of Speech at Very Low Bit Rate”, Proc. Int. Conf. Comm, Amsterdam, pp. 1610-1613, 1984]. On the other hand, parametric coders such as the waveform-interpolative (WI) coder, the sinusoidal-transform coder (STC), and the multiband-excitation (MBE) coder produce good quality at low rates, but they do not achieve toll quality [Y. Shoham, “High Quality Speech Coding at 2.4 and 4.0 kbps Based on Time Frequency-Interpolation”, IEEE ICASSP'93, Vol. II, pp. 167-170, 1993; W. B. Kleijn, and J. Haagen, “Waveform Interpolation for Coding and Synthesis”, in Speech Coding and Synthesis by W. B. Kleijn and K. K. Paliwal, Elsevier Science B. V., Chapter 5, pp. 175-207, 1995; I. S. Burnett, and D. H. Pham, “Multi-Prototype Waveform Coding using Frame-by-Frame Analysis-by-Synthesis”, IEEE ICASSP'97, pp. 1567-1570, 1997; R. J. McAulay, and T. F. Quatieri, “Sinusoidal Coding”, in Speech Coding and Synthesis by W. B. Kleijn and K. K. Paliwal, Elsevier Science B. V., Chapter 4, pp. 121-173, 1995; and D. Griffin, and J. S. Lim, “Multiband Excitation Vocoder”, IEEE Trans. ASSP, Vol. 36, No. 8, pp. 1223-1235, August 1988]. This is mainly due to a lack of robustness in parameter estimation, which is commonly done in open loop, and to inadequate modeling of non-stationary speech segments. Also, in parametric coders the phase information is commonly not transmitted, for two reasons: first, the phase is of secondary perceptual significance; and second, no efficient phase quantization scheme is known. WI coders typically use a fixed phase vector for the slowly evolving waveform [Shoham, supra; Kleijn et al, supra; and Burnett et al, supra].
For example, in Kleijn et al, a fixed phase extracted from a male speaker was used. On the other hand, waveform coders such as CELP, by directly quantizing the waveform, implicitly allocate an excessive number of bits to the phase information, more than is perceptually required.
The present invention overcomes the foregoing drawbacks by implementing a paradigm that incorporates analysis-by-synthesis (AbS) for parameter estimation, and a novel pitch search technique that is well suited for the non-stationary segments. In one embodiment, the invention provides a novel, efficient AbS vector quantization (VQ) encoding of the dispersion phase of the excitation signal to enhance the performance of the waveform interpolative (WI) coder at a very low bit rate; this encoding can be used for parametric coders as well as for waveform coders. The enhanced analysis-by-synthesis waveform interpolative (EWI) coder of this invention employs this scheme, which incorporates perceptual weighting and does not require any phase unwrapping.
WI coders use non-ideal low-pass filters for downsampling and upsampling of the slowly evolving waveform (SEW). In another embodiment of the invention, a novel AbS SEW quantization scheme is provided, which takes the non-ideal filters into consideration. An improved match between reconstructed and original SEW is obtained, most notably in the transitions.
Pitch accuracy is crucial for high quality reproduced speech in WI coders. Still another embodiment of the invention provides a novel pitch search technique based on varying segment boundaries; it allows for locking onto the most probable pitch period during transitions or other segments with rapidly varying pitch.
Commonly in speech coding, the gain sequence is downsampled and interpolated. As a result it is often smeared during plosives and onsets. To alleviate this problem, a further embodiment of the invention provides a novel switched-predictive AbS gain VQ scheme based on temporal weighting.
More particularly, the invention provides a method for interpolative coding of input signals at low data rates in which there may be significant pitch transitivity, the signals having an evolving waveform, the method incorporating at least one, and preferably all, of the following steps:
(a) AbS VQ of the SEW whereby to reduce distortion in the signal by obtaining the accumulated weighted distortion between an original sequence of waveforms and a sequence of quantized and interpolated waveforms;
(b) AbS quantization of the dispersion phase;
(c) locking onto the most probable pitch period of the signal using both a spectral domain pitch search and a temporal domain pitch search;
(d) incorporating temporal weighting in the AbS VQ of the signal gain, whereby to emphasize local high energy events in the input signal;
(e) applying both high correlation and low correlation synthesis filters to a vector quantizer codebook in the AbS VQ of the signal gain whereby to add self correlation to the codebook vectors and maximize similarity between the signal waveform and a codebook waveform;
(f) using each value of gain in the AbS VQ of the signal gain to obtain a plurality of shapes, each composed of a predetermined number of values, and comparing said shapes to a vector quantized codebook of shapes, each having said predetermined number of values, e.g., in the range of 2-50, preferably 5-20; and
(g) using a coder in which a plurality of bits, e.g. 4 bits, are allocated to the SEW dispersion phase.
The method of the invention can be used in general with any waveform signal, and is particularly useful with speech signals. In the step of AbS VQ of the SEW, distortion is reduced in the signal by obtaining the accumulated weighted distortion between an original sequence of waveforms and a sequence of quantized and interpolated waveforms. In the step of AbS quantization of the dispersion phase, at least one codebook is provided that contains magnitude and phase information for predetermined waveforms. The linear phase of the input is crudely aligned, then iteratively shifted and compared to a plurality of waveforms reconstructed from the magnitude and phase information contained in one or more codebooks. The reconstructed waveform that best matches one of the iteratively shifted inputs is selected.
In the step of locking onto the most probable pitch period of the signal, the invention includes searching the temporal domain pitch, defining a boundary for a segment of said temporal domain pitch, maximizing the length of the boundary by iteratively shrinking and expanding the segment, and maximizing the similarity by shifting the segment. The searches are preferably conducted respectively at 100 Hz and 500 Hz.
The invention has a number of embodiments, some of which can be used independently of the others to enhance speech and other signal coding systems. The embodiments cooperate to produce a superior coding system, involving AbS SEW optimization and a novel dispersion phase quantizer, pitch search scheme, switched-predictive AbS gain VQ, and bit allocation.
AbS SEW Quantization
Commonly in WI coders the SEW is distorted by downsampling and upsampling with non-ideal low-pass filters. In order to reduce such distortion, an AbS SEW quantization scheme, illustrated in
where the first sum accumulates the distortions of the current frame and the second sum accumulates the lookahead distortions. H denotes the Hermitian (conjugate) transpose, M is the number of waveforms per frame, L is the lookahead number of waveforms, α(t) is an increasing interpolation function in the range 0 ≤ α(t) ≤ 1, and W_m is a diagonal matrix whose elements, w_kk, are the combined spectral weighting and synthesis of the k-th harmonic, given by:
where P is the pitch period, K is the number of harmonics, g is the gain, A(z) and Â(z) are the input and the quantized LPC polynomials respectively, and the spectral weighting parameters satisfy 0 ≤ γ2 < γ1 ≤ 1. It is also possible to leave out the inverse of the number of harmonics (the 1/K factor) or the gain (the g factor), or to use another combination of the input and quantized LPC polynomials A(z) and Â(z).
The interpolated SEW vectors are given by:
$$\hat{r}_m = [1 - \alpha(t_m)]\,\hat{r}_0 + \alpha(t_m)\,\hat{r}_M, \qquad m = 1, \ldots, M \tag{3}$$
where t is time, m is the index of the waveform within the frame, and $\hat{r}_0$ and $\hat{r}_M$ are the quantized SEW at the previous and at the current frame respectively. The parameter α is an increasing linear function from 0 to 1. It can be shown that the accumulated distortion in equation (1) is equal to the sum of modeling distortion and quantization distortion:
where the quantization distortion is given by:
$$D_w(\hat{r}_M, r_{M,\mathrm{opt}}) = (\hat{r}_M - r_{M,\mathrm{opt}})^H\, W_{M,\mathrm{opt}}\, (\hat{r}_M - r_{M,\mathrm{opt}}) \tag{5}$$
The optimal vector, rM,opt, which minimizes the modeling distortion, is given by:
Therefore, VQ with the accumulated distortion of equation (1) can be simplified by using the distortion of equation (5), and:
An improved match between reconstructed and original SEW is obtained, most notably in the transitions.
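The decomposition into modeling and quantization distortion can be checked numerically. The following is a minimal sketch under stated assumptions, not the coder's implementation: the lookahead terms are dropped, the weighting matrices are diagonal with positive real entries, and the optimal vector and weighting are taken as the least-squares solution implied by the interpolation of equation (3). The function names (`sew_optimum`, `accumulated_distortion`, `quantization_distortion`) are hypothetical.

```python
import random

def sew_optimum(r, r0_hat, alpha, w):
    """Return (r_opt, w_opt) for one frame, ignoring lookahead (assumption).

    r      : M waveforms, each a list of K complex harmonics
    r0_hat : quantized SEW of the previous frame (K complex values)
    alpha  : interpolation weights alpha(t_m), m = 1..M
    w      : diagonal weights w_kk per waveform (M lists of K positive floats)
    """
    K = len(r0_hat)
    # Combined weighting: sum over the frame of alpha^2 * w_kk.
    w_opt = [sum(a * a * wm[k] for a, wm in zip(alpha, w)) for k in range(K)]
    r_opt = []
    for k in range(K):
        # Weighted least-squares target after removing the previous-frame part.
        num = sum(a * wm[k] * (rm[k] - (1.0 - a) * r0_hat[k])
                  for a, wm, rm in zip(alpha, w, r))
        r_opt.append(num / w_opt[k])
    return r_opt, w_opt

def accumulated_distortion(r, r0_hat, rM_hat, alpha, w):
    """Weighted distortion between the originals and the interpolated SEW."""
    total = 0.0
    for a, wm, rm in zip(alpha, w, r):
        for k in range(len(r0_hat)):
            interp = (1.0 - a) * r0_hat[k] + a * rM_hat[k]  # equation (3)
            total += wm[k] * abs(rm[k] - interp) ** 2
    return total

def quantization_distortion(rM_hat, r_opt, w_opt):
    """Equation (5): weighted distance between a candidate and r_opt."""
    return sum(wk * abs(x - y) ** 2 for wk, x, y in zip(w_opt, rM_hat, r_opt))

if __name__ == "__main__":
    random.seed(0)
    M, K = 10, 4
    alpha = [m / M for m in range(1, M + 1)]  # linear, increasing to 1

    def rand_c():
        return complex(random.gauss(0, 1), random.gauss(0, 1))

    r = [[rand_c() for _ in range(K)] for _ in range(M)]
    w = [[random.uniform(0.5, 2.0) for _ in range(K)] for _ in range(M)]
    r0_hat = [rand_c() for _ in range(K)]
    candidate = [rand_c() for _ in range(K)]

    r_opt, w_opt = sew_optimum(r, r0_hat, alpha, w)
    total = accumulated_distortion(r, r0_hat, candidate, alpha, w)
    modeling = accumulated_distortion(r, r0_hat, r_opt, alpha, w)
    quant = quantization_distortion(candidate, r_opt, w_opt)
    print(abs(total - (modeling + quant)) < 1e-9)
```

Because the modeling term does not depend on the candidate, a codebook search under these assumptions only needs to evaluate the quantization term for each codevector.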
AbS Phase Quantization
The dispersion-phase vector quantization scheme is illustrated in
$$D_w(r, \hat{r}) = (r - \hat{r})^H\, W\, (r - \hat{r}) \tag{7}$$
The magnitude is perceptually more significant than the phase, and should therefore be quantized first. Furthermore, if the phase were quantized first, the very limited bit allocation available for the phase would lead to an excessively degraded spectral matching of the magnitude in favor of a somewhat improved, but less important, matching of the waveform. For the above distortion, the quantized phase vector is given by:
where i is the running phase codebook index, and $e^{j\hat{\varphi}}$
The AbS search for phase quantization is based on evaluating equation (8) for each candidate phase codevector. Since only trigonometric functions of the phase candidates are used, phase unwrapping is avoided. The EWI coder uses the optimized SEW, $r_{M,\mathrm{opt}}$, and the optimized weighting, $W_{M,\mathrm{opt}}$, for the AbS phase quantization.
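The exhaustive search can be illustrated with a short sketch. It assumes a diagonal weighting, an already quantized magnitude vector, and a small phase codebook held as lists of radians; the names `weighted_distortion` and `abs_phase_search` are hypothetical, and the sketch is not presented as the coder's implementation.

```python
import cmath

def weighted_distortion(r, r_hat, w):
    """Equation (7) for a diagonal weighting matrix with entries w[k]."""
    return sum(wk * abs(x - y) ** 2 for wk, x, y in zip(w, r, r_hat))

def abs_phase_search(r, mag_hat, w, phase_codebook):
    """Return (index, distortion) of the best phase codevector.

    r              : input DFT coefficients (complex)
    mag_hat        : quantized magnitudes, one per harmonic
    w              : diagonal weights, one per harmonic
    phase_codebook : candidate phase vectors in radians
    """
    best_index, best_dist = -1, float("inf")
    for i, phases in enumerate(phase_codebook):
        # Rebuild the candidate waveform from quantized magnitude and phase i.
        # Only exp(j*phase) is used, so wrapped phases need no unwrapping.
        r_hat = [m * cmath.exp(1j * p) for m, p in zip(mag_hat, phases)]
        dist = weighted_distortion(r, r_hat, w)
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index, best_dist

if __name__ == "__main__":
    r = [cmath.rect(1.0, 0.3), cmath.rect(2.0, -1.0)]
    codebook = [[0.0, 0.0], [0.3, -1.0], [3.0, 2.0]]
    print(abs_phase_search(r, [1.0, 2.0], [1.0, 1.0], codebook))
```

Adding any multiple of 2π to a codebook entry leaves the selected index unchanged, which is the practical meaning of avoiding phase unwrapping.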
Equivalently, the quantized phase vector can be simplified to:
where $\hat{\varphi}(k)$ is the phase of $r(k)$, the k-th input DFT coefficient. The average global distortion measure for a set of M vectors is:
The centroid equation [A. Gersho et al, “Vector Quantization and Signal Compression”, Kluwer Academic Publishers, 1992] of the k-th harmonic's phase for the j-th cluster, which minimizes the global distortion in equation (11), is given by:
These centroid equations use trigonometric functions of the phase, and therefore do not require any phase unwrapping. It is possible to use $|r(k)_m|^2$ instead of $|\tilde{r}(k)_m|\,|r(k)_m|$.
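A centroid built from trigonometric functions can be sketched as a weighted circular mean. The sketch assumes that, for one harmonic and one cluster, the phase-dependent part of the distortion is a weighted sum of negative cosines of the phase differences, with per-vector weights supplied by the caller (for example the magnitude products mentioned above). The names `phase_centroid` and `cosine_distortion` are hypothetical.

```python
import math

def phase_centroid(phases, weights):
    """Weighted circular mean: atan2 of the weighted sine and cosine sums."""
    s = sum(wt * math.sin(p) for p, wt in zip(phases, weights))
    c = sum(wt * math.cos(p) for p, wt in zip(phases, weights))
    return math.atan2(s, c)

def cosine_distortion(phases, weights, centroid):
    """Assumed phase-dependent distortion: -sum of w * cos(phi - centroid)."""
    return -sum(wt * math.cos(p - centroid) for p, wt in zip(phases, weights))

if __name__ == "__main__":
    # Two training phases on either side of the +/- pi wrap point.
    cluster = [3.1, -3.1]
    weights = [1.0, 1.0]
    centroid = phase_centroid(cluster, weights)
    naive = sum(cluster) / len(cluster)  # arithmetic mean, for comparison
    print(cosine_distortion(cluster, weights, centroid)
          < cosine_distortion(cluster, weights, naive))
```

The arithmetic mean of the two phases is zero, which is the opposite side of the circle; the trigonometric centroid lands near ±π without any unwrapping step.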
The phase vector's dimension depends on the pitch period and, therefore, a variable dimension VQ has been implemented. In the WI system the range of possible pitch period values was divided into eight ranges, and for each range of pitch period an optimal codebook was designed such that vectors of dimension smaller than the largest pitch period in each range are zero padded.
Pitch changes over time cause the quantizer to switch among the pitch-range codebooks. In order to achieve smooth phase variations whenever such a switch occurs, overlapped training clusters were used.
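The selection of a pitch-range codebook and the zero padding can be sketched as follows. The eight range boundaries in `RANGE_UPPER` are hypothetical placeholders, since the actual boundaries are not given here, and the codebook dimension is assumed to be half the largest pitch period of the range (one entry per harmonic). Overlapped training clusters are not modeled.

```python
import bisect

# Hypothetical upper pitch-period bounds (in samples) of the eight ranges.
RANGE_UPPER = [30, 40, 50, 60, 75, 95, 120, 147]

def select_range(pitch_period):
    """Index of the first range whose upper bound covers the pitch period."""
    idx = bisect.bisect_left(RANGE_UPPER, pitch_period)
    if idx >= len(RANGE_UPPER):
        raise ValueError("pitch period outside the supported ranges")
    return idx

def pad_phase_vector(phases, pitch_period):
    """Zero-pad a phase vector to the dimension of its range's codebook."""
    idx = select_range(pitch_period)
    dim = RANGE_UPPER[idx] // 2  # assumed: one entry per harmonic
    if len(phases) > dim:
        raise ValueError("phase vector longer than the codebook dimension")
    return idx, list(phases) + [0.0] * (dim - len(phases))

if __name__ == "__main__":
    idx, padded = pad_phase_vector([0.1] * 17, pitch_period=35)
    print(idx, len(padded))
```

With fixed-dimension vectors per range, each of the eight codebooks can be trained and searched with an ordinary fixed-dimension VQ.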
The phase-quantization scheme has been implemented as part of the WI coder, and used to quantize the SEW phase. The objective performance of the suggested phase VQ has been tested under the following conditions:
Recent WI coders have used a dispersion phase extracted from a male speaker [Kleijn et al, supra; Y. Shoham, “Very Low Complexity Interpolative Speech Coding at 1.2 to 2.4 KBPS”, IEEE ICASSP '97, pp. 1599-1602, 1997]. A subjective A/B test was conducted to compare the dispersion phase of this invention, using only 4 bits, to a dispersion phase extracted from a male speaker. The test data included 16 MIRS speech sentences, 8 of which are of female speakers, and 8 of male speakers. During the test, all pairs of files were played twice in alternating order, and the listeners could vote for either of the systems, or for no preference. The speech material was synthesized using a WI system in which only the dispersion phase was quantized every 20 ms. Twenty-one listeners participated in the test. The test results, illustrated in
Pitch Search
The pitch search of the EWI coder consists of a spectral domain search employed at 100 Hz and a temporal domain search employed at 500 Hz, as illustrated in
where τ is the shift in the segment, Δ is some incremental segment used in the summations for computational simplicity, and $0 \le N_j \le \lfloor 160/\Delta \rfloor$. Then, every 10 ms a weighted-mean pitch value is calculated by:
where p(ni) is the normalized correlation for P(ni). The above values (160, 10, 5) are specific to this particular coder and are used for illustration. Equation (12) describes the temporal domain pitch search and the temporal domain pitch refinement blocks of
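The two quantities named in this section, a normalized correlation per pitch candidate and a correlation-weighted mean pitch, can be illustrated with a simplified sketch. It assumes a fixed analysis segment (the variation of segment boundaries is omitted) and assumes that the weighted mean uses the normalized correlations directly as weights. All function names are hypothetical.

```python
import math

def normalized_correlation(x, start, length, lag):
    """Normalized correlation between a segment and its copy `lag` samples earlier."""
    a = x[start:start + length]
    b = x[start - lag:start - lag + length]
    num = sum(u * v for u, v in zip(a, b))
    den = math.sqrt(sum(u * u for u in a) * sum(v * v for v in b))
    return num / den if den > 0.0 else 0.0

def best_lag(x, start, length, lags):
    """Lag with the highest normalized correlation, and that correlation."""
    scored = [(normalized_correlation(x, start, length, lag), lag) for lag in lags]
    rho, lag = max(scored)
    return lag, rho

def weighted_mean_pitch(pitches, correlations):
    """Correlation-weighted mean of several pitch candidates (assumed form)."""
    total = sum(correlations)
    if total <= 0.0:
        return sum(pitches) / len(pitches)  # fall back to a plain mean
    return sum(p * c for p, c in zip(pitches, correlations)) / total

if __name__ == "__main__":
    # Synthetic periodic signal with a period of 40 samples.
    x = [math.sin(2.0 * math.pi * n / 40.0) for n in range(400)]
    lag, rho = best_lag(x, start=200, length=80, lags=range(20, 61))
    print(lag, weighted_mean_pitch([40, 50], [0.75, 0.25]))
```

Weighting by correlation lets candidates measured on strongly periodic segments dominate the mean, while weakly correlated candidates contribute little.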
Gain Quantization
The gain trajectory is commonly smeared during plosives and onsets by downsampling and interpolation. This problem is addressed and speech crispness is improved in accordance with an embodiment of the invention that provides a novel switched-predictive AbS gain VQ technique, illustrated in
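A switched-predictive AbS gain search with temporal weighting can be sketched in a few lines. Everything specific here is an assumption made for illustration: the high- and low-correlation synthesis filters are modeled as first-order recursions with hypothetical coefficients 0.9 and 0.3, and the temporal weighting is a hypothetical function that grows with the local gain level. The sketch is not presented as the coder's actual scheme.

```python
def synthesize(codevector, rho, prev_gain):
    """First-order predictive synthesis: g[n] = rho * g[n-1] + c[n]."""
    out, g = [], prev_gain
    for c in codevector:
        g = rho * g + c
        out.append(g)
    return out

def temporal_weights(target, floor=0.1):
    """Hypothetical weighting that emphasizes local high-energy events."""
    peak = max(abs(t) for t in target) or 1.0
    return [floor + abs(t) / peak for t in target]

def abs_gain_vq(target, prev_gain, codebook, predictors=(0.9, 0.3)):
    """Try every (predictor, codevector) pair; return the best match.

    Returns (distortion, predictor_index, codevector_index, reconstruction).
    """
    weights = temporal_weights(target)
    best = None
    for s, rho in enumerate(predictors):
        for i, cv in enumerate(codebook):
            cand = synthesize(cv, rho, prev_gain)
            dist = sum(w * (t - c) ** 2
                       for w, t, c in zip(weights, target, cand))
            if best is None or dist < best[0]:
                best = (dist, s, i, cand)
    return best

if __name__ == "__main__":
    codebook = [[0.0] * 4, [0.1, 2.0, -0.5, -0.5], [0.5, 0.5, 0.5, 0.5]]
    # An onset-like target generated with the low-correlation predictor.
    target = synthesize(codebook[1], 0.3, prev_gain=1.0)
    dist, switch, index, _ = abs_gain_vq(target, 1.0, codebook)
    print(switch, index)
```

Searching both predictors in closed loop lets the quantizer choose the low-correlation filter at onsets and plosives, where the gain changes abruptly, and the high-correlation filter in steady segments.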
Bit Allocation
The bit allocation of the coder is given in Table 1. The frame length is 20 ms, and ten waveforms are extracted per frame. The pitch and the gain are coded twice per frame.
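The frame parameters above fix the total bit budget and the update rates. The short calculation below is a worked arithmetic sketch for the 4 kbps coder; the variable names are illustrative only.

```python
BIT_RATE_BPS = 4000          # 4 kbps coder
FRAME_MS = 20                # frame length in milliseconds
WAVEFORMS_PER_FRAME = 10     # waveforms extracted per frame
UPDATES_PER_FRAME = 2        # pitch and gain are coded twice per frame

bits_per_frame = BIT_RATE_BPS * FRAME_MS // 1000
waveform_rate_hz = WAVEFORMS_PER_FRAME * 1000 // FRAME_MS
update_rate_hz = UPDATES_PER_FRAME * 1000 // FRAME_MS

print(bits_per_frame, waveform_rate_hz, update_rate_hz)  # 80 500 100
```

The resulting 500 Hz and 100 Hz rates coincide with the temporal domain and spectral domain pitch search rates stated in the Pitch Search section.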
Subjective Results
A subjective A/B test was conducted to compare the 4 kbps EWI coder of this invention to MPEG-4 at 4 kbps, and to G.723.1. The test data included 24 MIRS speech sentences, 12 of which are of female speakers, and 12 of male speakers. Fourteen listeners participated in the test. The test results, listed in Tables 2 to 4, indicate that the subjective quality of EWI exceeds that of MPEG-4 at 4 kbps and of G.723.1 at 5.3 kbps, and it is slightly better than that of G.723.1 at 6.3 kbps.
Table 2 shows the results of subjective A/B tests for comparison between the 4 kbps WI coder and the 4 kbps MPEG-4. With 95% certainty the WI preference lies in [58.63%, 68.75%].
Table 3 shows the results of subjective A/B tests for comparison between the 4 kbps WI coder and the 5.3 kbps G.723.1. With 95% certainty the WI preference lies in [54.17%, 64.88%].
Table 4 shows the results of subjective A/B tests for comparison between the 4 kbps WI coder and the 6.3 kbps G.723.1. With 95% certainty the WI preference lies in [48.51%, 59.23%].
The present invention incorporates several new techniques that enhance the performance of the WI coder: analysis-by-synthesis vector quantization of the dispersion phase, AbS optimization of the SEW, a special pitch search for transitions, and switched-predictive analysis-by-synthesis gain VQ. These features improve the algorithm and its robustness. The test results indicate that the performance of the EWI coder slightly exceeds that of G.723.1 at 6.3 kbps, and therefore EWI achieves very close to toll quality, at least under clean speech conditions.
This application claims the benefit of Provisional Patent Application Nos. 60/110,522, filed Dec. 1, 1998 and 60/110,641 filed Dec. 1, 1998.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/US99/28449 | 12/1/1999 | WO | 00 | 8/13/2001 |

Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO00/33297 | 6/8/2000 | WO | A |