1. Field of the Invention
The present invention relates to a method and system for coding low bit rate speech for communication systems. More particularly, the present invention relates to a method and apparatus for performing prototype waveform magnitude quantization using vector quantization.
2. Background of the Invention
Currently, various speech encoding techniques are used to process speech. These techniques do not adequately address the need for a speech encoding technique that improves the modeling and quantization of a speech signal, specifically, the evolving spectral characteristics of a speech prediction residual signal, which includes a prototype waveform (PW) gain vector, a PW magnitude vector, and PW phase information.
In particular, the prior art includes, but is not limited to, the following: L. R. Rabiner and R. W. Schafer, “Digital Processing of Speech Signals”, Prentice-Hall, 1978 (hereinafter known as reference 1); W. B. Kleijn and J. Haagen, “Waveform Interpolation for Coding and Synthesis”, in Speech Coding and Synthesis, edited by W. B. Kleijn and K. K. Paliwal, Elsevier, 1995 (hereinafter known as reference 2); F. Itakura, “Line Spectral Representation of Linear Predictive Coefficients of Speech Signals”, Journal of Acoustical Society of America, vol. 57, no. 1, 1975 (hereinafter known as reference 3); P. Kabal and R. P. Ramachandran, “The Computation of Line Spectral Frequencies Using Chebyshev Polynomials”, IEEE Trans. on ASSP, vol. 34, no. 6, pp. 1419–1426, December 1986 (hereinafter known as reference 4); W. B. Kleijn, “Encoding Speech Using Prototype Waveforms”, IEEE Transactions on Speech and Audio Processing, vol. 1, no. 4, pp. 386–399, 1993 (hereinafter known as reference 5); and W. B. Kleijn, Y. Shoham, D. Sen and R. Hagen, “A Low Complexity Waveform Interpolation Coder”, IEEE International Conference on Acoustics, Speech and Signal Processing, 1996 (hereinafter known as reference 6). All of the references 1 through 6 are herein incorporated in their entirety by reference.
The prototype waveforms are a sequence of complex Fourier transforms evaluated at pitch harmonic frequencies, for pitch period wide segments of the residual, at a series of points along the time axis. Thus, the PW sequence contains information about the spectral characteristics of the residual signal as well as the temporal evolution of these characteristics. A high quality of speech can be achieved at low coding rates by efficiently quantizing the important aspects of the PW sequence.
In PW based coders, the PW is separated into a shape component and a level component by computing the RMS (or gain) value of the PW and normalizing the PW to a unity RMS value. As the pitch frequency varies, the dimensions of the PW vectors also vary, typically in the range of 11–61. Existing VQ techniques, such as direct VQ, split VQ and multi-stage VQ, are not well suited for variable dimension vectors. Adaptation of these techniques for variable dimension is neither practical from an implementation viewpoint nor satisfactory from a performance viewpoint. It is not practical since the worst case high dimensionality results in a high computational cost and a high storage cost.
To address the variable dimensionality problem, prior art in reference 4 uses analytical functions of a fixed order to approximate the variable dimension vectors. The coefficients of the analytical function that provide the best fit to the vectors are used to represent the vectors for quantization. This approach suffers from three disadvantages. First, a modeling error is added to the quantization error, leading to a loss in performance. Second, the analytical function approximation for reasonable orders, on the order of 5–10, deteriorates with increasing frequency. Third, if spectrally weighted distortion metrics are used during VQ, the complexity of these methods becomes formidable.
A PW magnitude vector sequence determines the evolving spectral characteristics of a linear predictive (LP) excitation signal and therefore is important in signal characterization. Prior art techniques separate the PW sequence into slowly evolving (SEW) and rapidly evolving (REW) components. This results in two disadvantages.
First, the algorithmic delay of the coding scheme in prior art is significantly increased as it requires linear low pass and high pass filtering to separate the SEW and REW components. This delay can be noticeable in telephone conversations.
Second, the signal processing in prior art needed for this purpose is complicated due to the filtering that is necessary. This increases the computational complexity of processing the signal, resulting in higher cost.
Additionally, prior art techniques use a non-hierarchical approach in quantizing the PW vectors (see references 2–6). This results in lower CODEC performance and less robustness to channel errors.
Thus, a need exists for a system and method that can accurately recreate perceptually important spectral features of the PW magnitude while maintaining computational and storage efficiency. Specifically, this permits the evolving spectral features of the LP residual signal to be reproduced accurately at the decoder.
An object of the present invention is to provide a system and method for accurately representing the spectral features of the LP residual signal and for reproducing the spectral features accurately at the decoder.
These and other objects are substantially achieved by a system and method employing a frequency domain interpolative CODEC system for low bit rate coding of speech. The CODEC comprises a linear prediction (LP) front end adapted to process an input signal that provides LP parameters which are quantized and encoded over predetermined intervals and used to compute a LP residual signal. An open loop pitch estimator adapted to process the LP residual signal, a pitch quantizer, and a pitch interpolator and provide a pitch contour within the predetermined intervals is also provided. Also provided is a signal processor responsive to the LP residual signal and the pitch contour and adapted to perform the following: provide a voicing measure, where the voicing measure characterizes a degree of voicing of the input speech signal and is derived from several input parameters that are correlated to degrees of periodicity of the signal over the predetermined intervals; extract a prototype waveform (PW) from the LP residual and the open loop pitch contour for a number of equal sub-intervals within the predetermined intervals; normalize the PW by a gain value of the PW; encode a magnitude of the PW; and directly quantize the PW in a magnitude domain without further decomposition of the PW into complex components, where the direct quantization is performed by a hierarchical quantization method based on a voicing classification using fixed dimension vector quantizers (VQ's).
The various objects, advantages and novel features of the present invention will be more readily understood from the following detailed description when read in conjunction with the appended drawings, in which:
Throughout the drawing figures, like reference numerals will be understood to refer to like parts and components.
Specifically, the coder portion 100A illustrates the computation of PW from an input speech signal. Voice activity detection (VAD) 102 is performed on the input speech to determine whether the input speech is actually speech or noise. The VAD 102 provides a VAD flag which indicates whether the input signal was noise or speech. The detected signal is then provided to a noise reduction module 104 where the noise level for the signal is reduced and provided to a linear predictive (LPC) analysis filter module 106.
The LPC module 106 provides filtered and residual signals to the prototype extraction module 108 as well as LPC parameters to decoder 100B. The pitch estimation and interpolation module 110 receives the LPC filtered and residual signals from the LPC analysis filter module 106 and pitch contours from the prototype extraction module 108 and provides a pitch and a pitch gain.
The extracted prototype waveform from prototype extraction module 108 is provided to compute prototype gain module 112, PW magnitude computation and normalization module 114, compute subband nonstationarity measure module 116 and compute voicing measure module 118. Compute voicing measure (VM) module 118 also receives the pitch gain from pitch estimation and interpolation module 110 and computes a voicing measure.
The compute prototype gain module 112 computes a prototype gain and provides the PW gain value to decoder portion 100B. PW magnitude computation and normalization module 114 computes the PW magnitude and normalizes the PW magnitude.
Compute subband nonstationarity measure module 116 computes a subband nonstationarity measure from the extracted prototype waveform. The computed subband nonstationarity measure and computed voicing measure are provided to a subband nonstationarity measure—Vector quantizer (VQ) module 122 which processes the received signals.
A PW magnitude quantization module 120 receives the computed PW magnitude and normalized signal along with the VAD flag indication and quantizes the received signal and provides a PW magnitude value to the decoder 100B.
The decoder 100B further includes a periodic phase model module 124 and aperiodic phase model module 126 which receive the PW magnitude value and subband nonstationarity measure-voicing measure value from coder 100A and compute a periodic phase and an aperiodic phase, respectively, from the received signal. The periodic phase model module 124 provides a complex periodic vector having a periodic component level and the aperiodic phase model module 126 provides a complex aperiodic vector having an aperiodic component level to a summer which provides a complex PW vector to a normalize PW gain module 128. The normalize PW gain module also receives the PW gain value from coder 100A.
A pitch interpolation module 130 performs pitch interpolation on a pitch period provided by encoder 100A. The normalized PW gain signal and the interpolated pitch frequency contour signal are provided to an interpolative synthesis module 132 which performs interpolative synthesis to obtain a reconstructed residual signal from the previously mentioned signals.
The reconstructed residual signal is provided to an all pole LPC synthesis filter module 134 which processes the reconstructed residual signal and provides the filtered signal to an adaptive postfilter and tilt correction module 136. Modules 134 and 136 also receive the VAD flag indication signal and interpolated LPC parameters from the encoder 100A. A reconstructed speech signal is provided by the adaptive postfilter and tilt correction module 136.
Specifically, the FDI codec 100 is based on techniques of linear predictive (LP) analysis, robust pitch estimation and frequency domain encoding of the LP residual signal. The FDI codec operates on a frame size of preferably 20 ms. Every 20 ms, the speech encoder 100A produces 80 bits representing compressed speech. The speech decoder 100B receives the 80 compressed speech bits and reconstructs a 20 ms frame of speech signal. The encoder 100A preferably uses a look ahead buffer of at least 20 ms, resulting in an algorithmic delay comprising buffering delay and look ahead delay of 40 ms.
The speech encoder 100A is equipped with a built-in voice activity detector (VAD) 102 and can operate in continuous transmission (CTX) mode or in discontinuous transmission (DTX) mode. In the DTX mode, comfort noise information (CNI) is encoded as part of the compressed bit stream during silence intervals. At the decoder 100B, the CNI packets are used by a comfort noise generation (CNG) algorithm to regenerate a close approximation of the ambient noise. The VAD information is also used by an integrated front end noise reduction scheme that can provide varying degrees of background noise level attenuation and speech signal enhancement.
A single parity check bit is preferably included in the 80 compressed speech bits of each frame of the input speech signal to detect channel errors in perceptually important compressed speech bits. This enables the codec 100 to operate satisfactorily in links with a random bit error rate up to about 10⁻³. In addition, the decoder 100B uses bad frame concealment and recovery techniques to extend signal processing operations during frame erasures.
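By way of illustration only, a single parity check bit over a protected subset of the frame bits might be computed and verified as in the following sketch. The use of even parity, the list-of-bits representation, and the names `parity_bit` and `check_parity` are assumptions made for this example; the description above does not state which of the 80 bits are protected.

```python
def parity_bit(bits):
    """Return the even-parity bit (0 or 1) for a sequence of 0/1 values."""
    p = 0
    for b in bits:
        p ^= b & 1          # running XOR of all protected bits
    return p


def check_parity(bits, received_parity):
    """Return True when the recomputed parity matches the received parity bit."""
    return parity_bit(bits) == received_parity


# Hypothetical protected bits for one frame (assumption, not from the text).
protected = [1, 0, 1, 1, 0, 0, 1]
sent_parity = parity_bit(protected)

# Simulate a single channel error in the protected bits.
corrupted = protected[:]
corrupted[2] ^= 1
error_detected = not check_parity(corrupted, sent_parity)
```

A single parity bit detects any odd number of bit errors among the protected bits, but not an even number, which is consistent with its use at low random bit error rates.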
In addition to the speech coding functions, the codec 100 also has the ability to transparently pass dual tone multifrequency (DTMF) and signaling tones.
As discussed above, the FDI codec 100 uses the linear predictive analysis technique to model the short term Fourier spectral envelope of the input speech signal. Subsequently, a pitch frequency estimate is used to perform a frequency domain prototype waveform analysis of the LP residual signal. Specifically, the PW analysis provides a characterization of the harmonic or fine structure of the speech spectrum. More specifically, the PW magnitude spectrum provides the correction necessary to refine the short term LP spectral estimate to obtain a more accurate fit to the speech spectrum at the pitch harmonic frequencies. Information about the phase of the signal is implicitly represented by the degree of periodicity of the signal measured across a set of subbands.
In a preferred embodiment of the present invention, the input speech signal is processed in consecutive non-overlapping frames of 20 ms duration, which corresponds to 160 samples at the sampling frequency of 8000 samples/sec. The encoder 100A parameters are quantized and transmitted once for each 20 ms frame. A look-ahead of 20 ms is used for voice activity detection, noise reduction, LP analysis and pitch estimation. This results in an algorithmic delay, defined as the sum of the buffering delay and the look-ahead delay, of 40 ms.
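The frame arithmetic stated above can be checked with a short calculation. The constant names below are chosen for this example only; the bit rate is simply the quotient of the stated 80 bits per frame and the 20 ms frame duration.

```python
SAMPLE_RATE_HZ = 8000      # sampling frequency stated in the text
FRAME_MS = 20              # frame duration
LOOKAHEAD_MS = 20          # look-ahead duration
BITS_PER_FRAME = 80        # compressed speech bits per frame

# Samples in one frame: 8000 samples/s * 0.020 s
samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000

# Bit rate implied by 80 bits every 20 ms
bit_rate_bps = BITS_PER_FRAME * 1000 // FRAME_MS

# Algorithmic delay: buffering delay (one frame) plus look-ahead delay
algorithmic_delay_ms = FRAME_MS + LOOKAHEAD_MS
```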
Referring to
The invention will now be discussed in terms of front end processing, specifically input preprocessing. The new input speech samples are first scaled down by preferably 0.5 to prevent overflow in fixed point implementation of the coder 100A. In another embodiment of the present invention, the scaled speech samples can be high-pass filtered using an infinite impulse response (IIR) filter with a cut-off frequency of 60 Hz, to eliminate undesired low frequency components. The transfer function of the 2nd order high pass filter is given by
In terms of the VAD module 102, the preprocessed signal is analyzed to detect the presence of speech activity. This comprises the following operations: scaling the signal via an automatic gain control (AGC) mechanism to improve VAD performance for low level signals; windowing the AGC scaled speech and computing a set of autocorrelation lags; performing a 10th order autocorrelation LP analysis of the AGC scaled speech to determine a set of LP parameters which are used during pitch estimation; performing a preliminary pitch estimation based on the pitch candidates for the look-ahead part of the buffer; and performing voice activity detection based on the autocorrelation lags, the pitch estimate and the tone detection flag. The tone detection flag is generated by examining the distance between adjacent line spectral frequencies (LSFs), which will be described in greater detail below with respect to conversion to line spectral frequencies.
This series of operations produces a VAD_FLAG and a VID_FLAG that have the following values depending on the detected voice activity:
It should be noted that the VAD_FLAG and the VID_FLAG represent the voice activity status of the look-ahead part of the buffer. A delayed VAD flag, VAD_FLAG_DL1, is also maintained to reflect the voice activity status of the current frame. In a presentation given during an IEEE speech and audio processing workshop in Finland during 1999, the entire contents of the documentation being incorporated by reference herein, the presenters F. Basbug, S. Nandkumar and K. Swaminathan described an AGC front-end for the VAD, which itself is a variation of the voice activity detection algorithms used in the cellular standard “TDMA cellular/PCS Radio Interface—Minimum Objective Standards for IS-136 B, DTX/CNG Voice Activity Detection”, which is also incorporated by reference in its entirety. A by-product of the AGC front-end is the global signal-to-noise ratio, which is used to control the degree of noise reduction.
The VAD flag is encoded explicitly only for unvoiced frames as indicated by the voicing measure flag. Voiced frames are assumed to be active speech. In the present embodiment of the invention, the VAD flag is not coded explicitly. The decoder sets the VAD flag to a one for all voiced frames. However, it will be appreciated by those skilled in the art that the VAD flag can be coded explicitly without departing from the scope of the present invention.
Noise reduction module 104 provides noise reduction to the voice activity detected speech signal. Specifically, the preprocessed speech signal is processed by a noise reduction algorithm to produce a noise reduced speech signal. The following is a series of steps comprising the noise reduction algorithm: A trapezoidal windowing and the computing of the complex discrete Fourier transform (DFT) of the signal is performed.
If the VVAD_FLAG, which is the VAD output prior to hangover, is a one which indicates voice activity, then the smoothed magnitude square of the DFT is taken to be the smoothed power spectrum of noisy speech S(k). However, if the VVAD_FLAG is a zero indicating voice inactivity, the smoothed DFT power spectrum is then used to update a recursive estimate of the average noise power spectrum Nav(k) as follows:
Nav(k)=0.9·Nav(k)+0.1·S(k) if VAD_FLAG=0 (2)
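The recursive update of equation (2) can be sketched as follows. This is a minimal illustration that assumes the spectra are held as plain Python lists of per-bin power values; the function name `update_noise_estimate` and the parameter name `alpha` are chosen for this example. The flag argument stands for the pre-hangover VAD output described in the preceding paragraph.

```python
def update_noise_estimate(n_av, s, vad_flag, alpha=0.9):
    """Return the updated average noise power spectrum per equation (2).

    n_av     : current average noise power spectrum Nav(k), one value per bin
    s        : smoothed power spectrum S(k) of the current frame
    vad_flag : 0 for voice inactivity, nonzero for voice activity
    alpha    : smoothing constant (0.9 in equation (2))
    """
    if vad_flag != 0:
        # Voice activity: leave the noise estimate unchanged.
        return list(n_av)
    # Voice inactivity: first-order recursive average of the noise spectrum.
    return [alpha * n + (1.0 - alpha) * p for n, p in zip(n_av, s)]


# Example with two frequency bins.
updated = update_noise_estimate([1.0, 2.0], [2.0, 2.0], vad_flag=0)
held = update_noise_estimate([1.0, 2.0], [2.0, 2.0], vad_flag=1)
```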
A spectral gain function is then computed based on the average noise power spectrum and the smoothed power spectrum of the noisy speech. The gain function Gnr(k) takes the following form:
Here, the factor Fnr is a factor that depends on the global signal-to-noise-ratio SNRglobal that is generated by the AGC front-end for the VAD. The factor Fnr can be expressed as an empirically derived piecewise linear function of SNRglobal that is monotonically non-decreasing. The gain function is close to unity when the smoothed power spectrum S(k) is much larger than the average noise power spectrum Nav(k). Conversely, the gain function becomes small when S(k) is comparable to or much smaller than Nav(k). The factor Fnr controls the degree of noise reduction by providing for a higher degree of noise reduction when the global signal-to-noise ratio is high (i.e., the risk of spectral distortion is low since the VAD and the average noise estimate are fairly accurate). Conversely, the factor restricts the amount of noise reduction when the global signal-to-noise ratio is low (i.e., the risk of spectral distortion is high due to increased VAD inaccuracies and a less accurate average noise power spectral estimate).
The spectral amplitude gain function is further clamped to a floor which is a monotonically non-increasing function of the global signal-to-noise ratio. This kind of clamping reduces the fluctuations in the residual background noise after noise reduction making the speech sound smoother. The clamping action is expressed as:
G′nr(k)=MAX(Gnr(k), Tglobal(SNRglobal)) (4)
Thus, at high global signal-to-noise ratios, the spectral gain functions will be clamped to a lower floor since there is less risk of spectral distortion due to inaccuracies in the VAD or the average noise power spectral estimate Nav(k). But at lower global signal-to-noise ratio, the risks of spectral distortion outweigh the benefits of reduced noise and therefore a higher floor would be appropriate.
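The clamping of equation (4) can be illustrated with the following sketch. The floor function `gain_floor` is purely hypothetical: the text states only that the floor is a monotonically non-increasing function of the global signal-to-noise ratio, so the piecewise linear shape and every numeric value below (knee points and floor levels) are placeholder assumptions.

```python
def gain_floor(snr_global_db, lo_snr=10.0, hi_snr=30.0,
               hi_floor=0.5, lo_floor=0.1):
    """Hypothetical floor Tglobal: non-increasing in SNR, linear between knees."""
    if snr_global_db <= lo_snr:
        return hi_floor                      # low SNR: higher floor
    if snr_global_db >= hi_snr:
        return lo_floor                      # high SNR: lower floor
    t = (snr_global_db - lo_snr) / (hi_snr - lo_snr)
    return hi_floor + t * (lo_floor - hi_floor)


def clamp_gain(g_nr, snr_global_db):
    """Apply equation (4): G'nr(k) = MAX(Gnr(k), Tglobal(SNRglobal))."""
    floor = gain_floor(snr_global_db)
    return [max(g, floor) for g in g_nr]


# At high SNR the small gain is raised only to the low floor.
clamped_high_snr = clamp_gain([0.05, 0.9], snr_global_db=40.0)
# At low SNR the same gain is raised to the higher floor.
clamped_low_snr = clamp_gain([0.05, 0.9], snr_global_db=0.0)
```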
In order to reduce the frame-to-frame variation in the spectral amplitude gain function, a gain limiting device is used which limits the gain to a range that depends on the previous frame's gain for the same frequency. The limiting action can be expressed as follows:
Gnrnew(k)=MAX(SnrL·Gnrold(k), MIN(SnrH·Gnrold(k), G′nr(k))) (5)
The scale factors SnrL and SnrH are updated using a state machine whose actions depend on whether the frame is active, inactive or transient.
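The limiting action of equation (5) can be sketched as below. The function name `limit_gain` and the example scale factor values are assumptions for illustration; in the described system the scale factors are supplied by the state machine rather than fixed.

```python
def limit_gain(g_clamped, g_old, s_lo, s_hi):
    """Apply equation (5) per frequency bin.

    g_clamped : clamped gains G'nr(k) of the current frame
    g_old     : final gains Gnrold(k) of the previous frame
    s_lo      : lower scale factor SnrL (at most 1)
    s_hi      : upper scale factor SnrH (at least 1)
    """
    return [max(s_lo * go, min(s_hi * go, gc))
            for gc, go in zip(g_clamped, g_old)]


# Previous gain 0.5 in every bin; allowed range is [0.4, 0.625].
limited = limit_gain([1.0, 0.1, 0.5], [0.5, 0.5, 0.5], s_lo=0.8, s_hi=1.25)
```

In this example the first gain is pulled down to the upper limit, the second is raised to the lower limit, and the third passes through unchanged. Scale factors close to unity therefore allow very little frame-to-frame change, while scale factors far from unity allow the gain to track the current frame more freely.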
At step 308 a determination is made as to whether the VAD_FLAG was zero for the last two frames. If the determination is affirmative the method proceeds to step 310 where the scale factors are limited to be very close to unity. However, if the determination was negative, the method 300 then proceeds to step 312 where the scale factors are limited to be away from unity.
If the determination at step 304 was negative, the method 300 then proceeds to step 314 where the scale factors are adjusted to be away from unity. The method 300 then proceeds to step 316 where the scale factors are limited to be far away from unity.
The steps 310, 312 and 316 proceed to step 318 where the updated scale factors are outputted.
The final spectral gain function Gnrnew(k) is multiplied with the complex DFT of the preprocessed speech, attenuating the noise dominant frequencies and preserving signal dominant frequencies. An overlap-and-add inverse DFT is then performed on the spectral gain scaled DFT to compute a noise reduced speech signal over the interval of the noise reduction window.
Since the noise reduction is carried out in the frequency domain, the availability of the complex DFT of the preprocessed speech is taken advantage of in order to carry out DTMF and Signaling tone detection. These detection schemes are based on examination of the strength of the power spectra at the tone frequencies, the out-of-band energy, the signal strength, and validity of the bit duration pattern. It should be noted that the incremental cost of having such detection schemes to facilitate transparent transmission of these signals is negligible since the power spectrum of the preprocessed speech is already available.
An embodiment of the invention will now be described in terms of LPC analysis filtering module 106. The noise reduced speech signal is subjected to a 10th order autocorrelation method of LP analysis, where {snr(n),0≦n<400} denotes the noise reduced speech buffer, {snr(n),80≦n<240} is the current frame being encoded and {snr(n),240≦n<400} is the look-ahead buffer 280 as shown in
Here, {am,0≦m≦M} are the LP parameters for the current frame and M=10 is the LP order. LP analysis is performed using the autocorrelation method with a modified Hanning window of size 40 ms (320 samples) which includes the 20 ms current frame and the 20 ms lookahead frame as shown in
The noise reduced speech signal over the LP analysis window {snr(n),80≦n<400} is windowed using a modified Hanning window function {wlp(n),0≦n<320} defined as follows:
The windowed speech buffer is computed by multiplying the noise reduced speech buffer with the window function as follows:
sw(n)=snr(80+n)·wlp(n), 0≦n<320. (8)
Normalized autocorrelation lags are computed from the windowed speech by
The autocorrelation lags are windowed by a binomial window with a bandwidth expansion of 60 Hz. The binomial window is given by the following recursive rule:
Lag windowing is performed by multiplying the autocorrelation lags by the binomial window:
rlpw(m)=rlp(m)·lw(m), 1≦m≦10. (11)
The zeroth windowed lag rlpw(0) is obtained by multiplying by a white noise correction factor of about 1.0001, which is equivalent to adding a noise floor at −40 dB:
rlpw(0)=1.0001rlp(0). (12)
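Equations (11) and (12) can be illustrated together as follows. The lag window values are taken as an input list here; the sample values used in the example are arbitrary placeholders and are not the binomial window of the described embodiment. The function name `apply_lag_window` is an assumption for this sketch.

```python
import math


def apply_lag_window(r, lw, wnc=1.0001):
    """Return windowed autocorrelation lags.

    r   : normalized autocorrelation lags r[0..M]
    lw  : lag window values lw[0..M] (lw[0] is not used)
    wnc : white noise correction factor applied to the zeroth lag
    """
    out = [wnc * r[0]]                                  # equation (12)
    out += [r[m] * lw[m] for m in range(1, len(r))]     # equation (11)
    return out


# Placeholder lags and window values (illustrative only).
r = [1.0, 0.9, 0.7]
lw = [1.0, 0.99, 0.96]
r_w = apply_lag_window(r, lw)

# The correction adds 0.0001 of the signal power, i.e. a floor at about -40 dB.
noise_floor_db = 10.0 * math.log10(1.0001 - 1.0)
```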
Lag windowing and white noise correction are techniques used to address problems that arise in the case of periodic or nearly periodic signals. For such signals, the all-pole LP filter is marginally stable, with its poles very close to the unit circle. It is necessary to prevent such a condition to ensure that the LP quantization and signal synthesis at the decoder 100B can be performed satisfactorily.
The LP parameters that define a minimum phase spectral model to the short term spectrum of the current frame are determined by applying Levinson-Durbin recursions to the windowed autocorrelation lags {rlpw(m),0≦m≦10}. The resulting 10th order LP parameters for the current frame are {a′m,0≦m≦10}, with a′0=1. Since the LP analysis window is centered around the sample index of about 240 in the buffer, the LP parameters represent the spectral characteristics of the signal in the vicinity of this point.
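For reference, a textbook form of the Levinson-Durbin recursion is sketched below. It assumes the convention that the prediction error filter is A(z) = 1 + a1·z^-1 + ... + aM·z^-M, so that the returned coefficients satisfy a0 = 1; the sign convention of the described embodiment is not stated in this section and may differ. The function name `levinson_durbin` is chosen for this example.

```python
def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations by Levinson-Durbin recursion.

    r     : autocorrelation lags r[0..order]
    order : LP order M
    Returns (a, err) where a[0] == 1.0 and err is the final prediction error.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            raise ValueError("non-positive prediction error; lags are not valid")
        # Correlation of the current predictor with lag i.
        acc = r[i]
        for j in range(1, i):
            acc += a[j] * r[i - j]
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]  # order-update of the coefficients
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)                # update of the prediction error
    return a, err


# Lags of a first-order autoregressive process with coefficient 0.5.
coeffs, pred_err = levinson_durbin([1.0, 0.5, 0.25], order=2)
```

For the example lags the recursion recovers a first-order model: the second coefficient is zero because the second lag is already fully predicted by the first.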
For highly periodic signals, the spectral fit provided by the LP model tends to be excessively peaky in the low formant regions, resulting in audible distortions. To overcome this problem, a bandwidth broadening scheme has been employed in this embodiment of the present invention, where the formant bandwidth of the model is broadened adaptively, depending on the degree of peakiness of the spectral model. The LP spectrum is given by
where ωm denotes the pitch frequency estimate of the mth subframe (1≦m≦8) of the current frame in radians/sample. Given this pitch frequency, the index of the highest frequency pitch harmonic that falls within the frequency band of the signal (0–4000 Hz or 0–π radians) for the mth subframe is given by
where ⌊x⌋ denotes the largest integer less than or equal to x. The magnitude of the LPC spectrum is evaluated at the pitch harmonics by
It should be noted that ω8, which corresponds to the 8th subframe, has been used here since the LP parameters have been evaluated for a window centered around a sample of about 240 as shown in
The peak-to-average ratio ranges from 0 dB (for flat spectra) to values exceeding 20 dB (for highly peaky spectra). The expansion in formant bandwidth (expressed in Hz) is then determined based on the log peak-to-average ratio according to a piecewise linear characteristic:
The expansion in bandwidth ranges from a minimum of about 10 Hz for flat spectra to a maximum of about 120 Hz for highly peaky spectra. Thus, the bandwidth expansion is adapted to the degree of peakiness of the spectra. The above piecewise linear characteristic has been experimentally optimized to provide the right degree of bandwidth expansion for a range of spectral characteristics. A bandwidth expansion factor αbw to apply this bandwidth expansion to the LP spectrum is obtained by
The LP parameters representing the bandwidth expanded LP spectrum are determined by
am=a′m·(αbw)^m, 0≦m≦10. (19)
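Equation (19) scales the m-th LP coefficient by the m-th power of the bandwidth expansion factor. A minimal sketch follows; the function name `expand_bandwidth` and the example factor of 0.9 are assumptions for illustration, since the mapping from the bandwidth expansion in Hz to αbw is not reproduced here.

```python
def expand_bandwidth(a_prime, alpha_bw):
    """Apply equation (19): a[m] = a_prime[m] * alpha_bw**m for m = 0..M."""
    return [c * alpha_bw ** m for m, c in enumerate(a_prime)]


# Example: A'(z) = 1 - 1.8 z^-1 + 0.81 z^-2 has a double pole at z = 0.9.
expanded = expand_bandwidth([1.0, -1.8, 0.81], alpha_bw=0.9)
# The expanded filter 1 - 1.62 z^-1 + 0.6561 z^-2 has its double pole at 0.81.
```

Because the scaling is equivalent to evaluating the filter at z/αbw, every pole is moved radially by the factor αbw. A factor below one moves the poles toward the origin, which broadens the formant bandwidths and leaves a0 = 1 unchanged.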
The bandwidth expanded LP filter coefficients are converted to line spectral frequencies (LSFs) for quantization and interpolation purposes, as described in “Line Spectral Representation of Linear Predictive Coefficients of Speech Signals”, Journal of Acoustical Society of America, vol. 57, no. 1, 1975 by F. Itakura, which is incorporated by reference in its entirety. An efficient approach to computing LSFs from LP parameters using Chebyshev polynomials is described in “The Computation of Line Spectral Frequencies Using Chebyshev Polynomials,” IEEE Trans. On ASSP, vol. 34, no. 6, pages 1419–1426, December 1986 by P. Kabal and R. P. Ramachandran, which is herein incorporated by reference in its entirety. The resulting LSFs for the current frame are denoted by {λ(m),0≦m≦10}.
The LSF domain also lends itself to detection of highly periodic or resonant inputs. For such signals, the LSFs located near the signal frequency have very small separations. If the minimum difference between adjacent LSF values falls below a threshold for a number of consecutive frames, it is highly probable that the input signal is a tone.
If the determination at step 404 is affirmative, i.e., the minimum LSF difference is below the threshold, the tone counter is incremented by one. Steps 406 and 412 proceed to step 408.
At step 408 a determination is made as to whether the tone counter is at its maximum value. If the determination at step 408 is negative, the method 400 proceeds to step 410 where a tone flag equals false indication is provided. If the determination at step 408 is affirmative, the method 400 then proceeds to step 414 where a tone flag equals true indication is provided.
The steps 410 and 414 proceed to step 416 where the method 400 continues checking for tones. Specifically, method 400 provides a tone flag indication which is a one if a tone has been detected and a zero otherwise. This flag is also used in voice activity detection.
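The tone detection logic described above can be sketched as a small stateful detector. The threshold and maximum counter value below are placeholders, and resetting the counter to zero when the separation is not below the threshold is an assumption of this example; the text specifies only that the counter is incremented when the minimum separation falls below the threshold and that the flag is set when the counter reaches its maximum. The class name `ToneDetector` is likewise hypothetical.

```python
class ToneDetector:
    """Flag a tone when adjacent LSFs stay very close for consecutive frames."""

    def __init__(self, threshold, max_count):
        self.threshold = threshold    # minimum-separation threshold (placeholder)
        self.max_count = max_count    # frames required to declare a tone
        self.count = 0

    def update(self, lsfs):
        """Process the LSFs of one frame and return the tone flag."""
        min_gap = min(b - a for a, b in zip(lsfs, lsfs[1:]))
        if min_gap < self.threshold:
            # Saturating increment of the tone counter.
            self.count = min(self.count + 1, self.max_count)
        else:
            # Assumption: the counter is reset when the condition fails.
            self.count = 0
        return self.count >= self.max_count


detector = ToneDetector(threshold=0.01, max_count=3)
tone_like = [0.10, 0.30, 0.305, 0.60]     # two LSFs nearly coincide
speech_like = [0.10, 0.30, 0.50, 0.70]
flags = [detector.update(tone_like) for _ in range(3)]
flag_after_speech = detector.update(speech_like)
```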
The invention will now be described in reference to the pitch estimation and interpolation module 110. Pitch estimation is performed based on an autocorrelation analysis of a spectrally flattened low pass filtered speech signal. Spectral flattening is accomplished by filtering the AGC scaled speech signal using a pole-zero filter, constructed using the LP parameters of AGC scaled speech signal. If {amagc,0≦m≦10} are the LP parameters of AGC scaled speech signal, the pole-zero filter is given by
The spectrally flattened signal is low-pass filtered by a 2nd order IIR filter with a 3 dB cutoff frequency of 1000 Hz. The transfer function of this filter is
The resulting signal is subjected to an autocorrelation analysis in two stages. In the first stage, a set of four raw normalized autocorrelation functions (ACF) are computed over the current frame. The windows for the raw ACFs are staggered by 40 samples as shown in
In each frame, raw ACFs corresponding to windows 2, 3, 4 and 5 as shown in
In the second stage, each raw ACF is reinforced by the preceding and the succeeding raw ACF, resulting in a composite ACF. For each lag l in the raw ACF in the range 20≦l≦120, peak values within a small range of lags [(l−wc(l)),(l+wc(l))] are determined in the preceding and the succeeding raw ACFs. These peak values reinforce the raw ACF at each lag l, via a weighted combination:
Here, wc(l) determines the window length based on the lag index l:
Also, mpeak(l) and npeak(l) are the locations of the peaks within the window. The weighting attached to the peak values from the adjacent ACFs ensures that the reinforcement diminishes with increasing difference between the peak location and the lag l. The reinforcement boosts a peak value if peaks also occur at nearby lags in the adjacent raw ACFs. This increases the probability that such a peak location is selected as the pitch period. ACF peak locations due to an underlying periodicity do not change significantly across a frame. Consequently, such peaks are strengthened by the above process. On the other hand, spurious peaks are unlikely to have such a property and consequently are diminished. This improves the accuracy of pitch estimation.
Within each composite ACF, the locations of the two strongest peaks are obtained. These locations are the candidate pitch lags for the corresponding pitch window, and take values in the range 20–120, inclusive. Together with the two peaks from the last composite ACF of the previous frame, i.e., for window 5 in the previous frame, this results in a set of 5 peak pairs, leading to 32 possible pitch tracks through the current frame. A pitch metric that maximizes the continuity of the pitch track as well as the value of the ACF peaks along the pitch track is used to select one of these pitch tracks. The end point of the optimal pitch track determines the pitch period p8 and a pitch gain βpitch for the current frame. Note that due to the position of the pitch windows, the pitch period and pitch gain are aligned with the right edge of the current frame. The pitch period is integer valued and takes on values in the range 20–120. It is mapped to a 7-bit pitch index l*p in the range of about 0–101.
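The exhaustive search over the 32 candidate pitch tracks can be illustrated as follows. The pitch metric used here, the sum of ACF peak values minus a penalty proportional to the lag jumps along the track, is a hypothetical stand-in chosen only to reward both peak strength and continuity; the actual metric is not given in this section. The function name `best_pitch_track`, the penalty weight and the candidate values are all assumptions.

```python
from itertools import product


def best_pitch_track(candidates, penalty=0.01):
    """Select one (lag, acf_value) candidate per window.

    candidates : list of 5 pairs, each pair holding two (lag, acf_value) tuples
    penalty    : weight of the lag-jump penalty (hypothetical)
    """
    best, best_score = None, float("-inf")
    for track in product(*candidates):          # 2**5 = 32 tracks for 5 pairs
        strength = sum(v for _, v in track)
        jumps = sum(abs(b[0] - a[0]) for a, b in zip(track, track[1:]))
        score = strength - penalty * jumps      # hypothetical pitch metric
        if score > best_score:
            best, best_score = track, score
    return best


# Placeholder candidates: the first entry is the pair from the previous frame.
candidates = [
    [(40, 0.90), (80, 0.60)],
    [(41, 0.80), (82, 0.85)],
    [(40, 0.90), (80, 0.50)],
    [(42, 0.88), (84, 0.60)],
    [(41, 0.90), (82, 0.70)],
]
track = best_pitch_track(candidates)
pitch_period, pitch_gain = track[-1]            # end point of the chosen track
num_tracks = len(list(product(*candidates)))
```

In this example the second window has a stronger peak at lag 82 than at lag 41, but the continuity penalty keeps the track on the lags near 40, which mirrors the intent of rejecting isolated spurious peaks.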
In respect to the prototype extraction module 108 and the pitch estimation and interpolation module 110, the pitch period is converted to the radian pitch frequency corresponding to the right edge of the frame by
A subframe pitch frequency contour is created by linearly interpolating between the pitch frequency of the left edge ω0 and the pitch frequency of the right edge ω8:
If there are abrupt discontinuities between the left edge and the right edge pitch frequencies, the above interpolation is modified to make a switch from the pitch frequency to its integer multiple or submultiple at one of the subframe boundaries. It should be noted that the left edge pitch frequency ω0 is the right edge pitch frequency of the previous frame. The index of the highest pitch harmonic within the 4000 Hz band is computed for each subframe by
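The conversion from pitch period to radian pitch frequency, the linear subframe contour and the harmonic count described above can be sketched as follows. The relations used, ω = 2π/p for the pitch frequency, equal-step linear interpolation over 8 subframes, and ⌊π/ω⌋ for the highest harmonic index, are the standard forms implied by the text and are stated here as assumptions; the discontinuity handling is not shown. The small constant added before taking the floor only guards against floating point rounding.

```python
import math


def pitch_frequency(period):
    """Radian pitch frequency for a pitch period in samples (assumed 2*pi/p)."""
    return 2.0 * math.pi / period


def subframe_contour(w_left, w_right, n_sub=8):
    """Linearly interpolated pitch frequencies for subframes 1..n_sub."""
    return [w_left + (w_right - w_left) * m / n_sub
            for m in range(1, n_sub + 1)]


def highest_harmonic(w):
    """Index of the highest pitch harmonic within 0..pi radians."""
    return math.floor(math.pi / w + 1e-9)


w0 = pitch_frequency(40)          # left edge (right edge of previous frame)
w8 = pitch_frequency(50)          # right edge of current frame
contour = subframe_contour(w0, w8)
harmonics = [highest_harmonic(w) for w in contour]
```

With the stated pitch period range of 20–120 samples, this count runs from 10 to 60 harmonics.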
The LSFs are quantized by a hybrid scalar-vector quantization scheme. The first 6 LSFs are scalar quantized using a combination of intraframe and interframe prediction using 4 bits/LSF. The last 4 LSFs are vector quantized using 7 bits. Thus, a total of 31 bits are used for the quantization of the 10-dimensional LSF vector.
The 16 level scalar quantizers for the first 6 LSFs in a preferred embodiment of the present invention are designed using a Linde-Buzo-Gray algorithm. An LSF estimate is obtained by adding each quantizer level to a weighted combination of the previous quantized LSF of the current frame and the adjacent quantized LSFs of the previous frame:
Here, {{circumflex over (λ)}(m),0≦m<6} are the first 6 quantized LSFs of the current frame and {{circumflex over (λ)}prev(m),0≦m≦10} are the quantized LSFs of the previous frame. {SL,m(l),0≦m<6,0≦l≦15} are the 16 level scalar quantizer tables for the first 6 LSFs. The squared distortion between the LSF and its estimate is minimized to determine the optimal quantizer level:
If l*L
The last 4 LSFs are vector quantized using a weighted mean squared error (WMSE) distortion measure. The weight vector {WL(m),6≦m≦9} is computed by the following procedure:
A set of predetermined mean values {λdc(m),6≦m≦9} is used to remove the DC bias in the last 4 LSFs prior to quantization. These LSFs are estimated based on the mean removed quantized LSFs of the previous frame:
{tilde over (λ)}(l,m)=VL(l,m−6)+λdc(m)+0.5({circumflex over (λ)}prev(m)−λdc(m)),0≦l≦127,6≦m≦9. (33)
Here {VL(l,m),0≦l≦127,0≦m≦3} is the 128 level, 4-dimensional codebook for the last 4 LSFs. The optimal code vector is determined by minimizing the WMSE between the estimated and the original LSF vectors:
If l*L
{circumflex over (λ)}(m)=VL(l*L
The stability of the quantized LSFs is checked by ensuring that the LSFs are monotonically increasing and are separated by a minimum value of about 0.008. If these criteria are not satisfied, stability is enforced by reordering the LSFs in a monotonically increasing order. If the minimum separation is still not achieved, the most recent stable quantized LSF vector from a previous frame is substituted for the unstable LSF vector. The 6 4-bit SQ indices {l*L
The inverse quantized LSFs are interpolated for each subframe, preferably by linear interpolation, between the current LSFs {{circumflex over (λ)}(m),0≦m≦10} and the previous LSFs {{circumflex over (λ)}prev(m),0≦m≦10}. The interpolated LSFs at each subframe are converted to LP parameters {âm(l),0≦m≦10,1≦l≦8}.
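The stability check applied to the quantized LSFs before interpolation can be sketched as follows. This is a minimal illustration, assuming that the fallback to the last stable vector is taken when the reordered LSFs still violate the minimum separation; the function names `is_stable` and `enforce_stability` are hypothetical.

```python
MIN_SEP = 0.008  # minimum LSF separation quoted in the text ("about 0.008")

def is_stable(lsfs, min_sep=MIN_SEP):
    """True if the LSFs increase monotonically with at least min_sep between neighbours."""
    return all(b - a >= min_sep for a, b in zip(lsfs, lsfs[1:]))

def enforce_stability(lsfs, last_stable):
    """Return a stable LSF vector following the three cases described in the text."""
    if is_stable(lsfs):
        return list(lsfs)
    reordered = sorted(lsfs)          # enforce monotonically increasing order
    if is_stable(reordered):
        return reordered
    return list(last_stable)          # ASSUMED fallback order: reuse last stable vector

previous = [0.10, 0.20, 0.30]
kept = enforce_stability([0.12, 0.25, 0.40], previous)
fixed = enforce_stability([0.25, 0.12, 0.40], previous)
replaced = enforce_stability([0.12, 0.121, 0.40], previous)
```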
The prediction residual signal for the current frame is computed using the noise reduced speech signal {snr(n)} and the interpolated LP parameters. Residual is computed from the midpoint of a subframe to the midpoint of the next subframe, using the interpolated LP parameters corresponding to the center of this interval. This ensures that the residual is computed using locally optimal LP parameters. The residual for the past data as shown in
Further, residual computation extends 93 samples into the look-ahead part of the buffer to facilitate PW extraction. LP parameters of the last subframe are used in computing the look-ahead part of the residual. By denoting the interpolated LP parameters for the jth subframe (0≦j≦8) of the current frame by {âm(j),0≦m≦10}, residual computation can be represented by:
The residual for past data, {elp(n),0≦n<80} is preserved from the previous frame.
The invention will now be discussed in reference to PW extraction. The prototype waveform in the time domain is essentially the waveform of a single pitch cycle, which contains information about the characteristics of the glottal excitation. A sequence of PWs contains information about the manner in which the excitation is changing across the frame. A time-domain PW is obtained for each subframe by extracting a pitch period long segment approximately centered at each subframe boundary. The segment is centered with an offset of up to ±10 samples relative to the subframe boundary, so that the segment edges occur at low energy regions of the pitch cycle. This minimizes discontinuities between adjacent PWs. For the mth subframe, the following region of the residual waveform is considered to extract the PW:
where pm is the interpolated pitch period (in samples) for the mth subframe. The PW is selected from within the above region of the residual, so as to minimize the sum of the energies at the beginning and at the end of the PW. The energies are computed as sums of squares within a 5-point window centered at each end point of the PW, as the center of the PW ranges over the center offset of about ±10 samples:
The center offset resulting in the smallest energy sum determines the PW. If imin(m) is the center offset at which the segment end energy is minimized, i.e.,
Eend(imin(m))≦Eend(i), −10≦i≦10, (39)
the time-domain PW vector for the mth subframe is
This is transformed by a pm-point discrete Fourier transform (DFT) into a complex valued frequency-domain PW vector:
Here ωm is the radian pitch frequency and Km is the highest in-band harmonic index for the mth subframe (see equation 17). The frequency domain PW is used in all subsequent operations in the encoder. The above PW extraction process is carried out for each of the 8 subframes within the current frame, so that the residual signal in the current frame is characterized by the complex PW vector sequence {P′m(k), 0≦k≦Km, 1≦m≦8}. In addition, an approximate PW is computed for subframe 1 of the look ahead frame, to facilitate a 3-point smoothing of PW gain and magnitude. Since the pitch period is not available for the look-ahead part of the buffer, the pitch period at the end of the current frame, i.e., p8, is used in extracting this PW. The region of the residual used to extract this extra PW is
By minimizing the end energy sum as before, the time-domain PW vector is obtained as
The frequency-domain PW vector is designated by P9 and is computed by the following DFT:
It should be noted that the approximate PW is only used for smoothing operations and not as the PW for subframe 1 during the encoding of the next frame. Rather, it is replaced by the exact PW computed during the next frame.
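The PW extraction described above can be sketched as follows. This is an illustrative sketch only: the exact residual region, the placement of the 5-point end windows, and the DFT scaling are assumptions, since the corresponding equations are not reproduced in this passage. The name `extract_pw` and its parameters are hypothetical, and the highest harmonic index is taken here as `period // 2`.

```python
import numpy as np

def extract_pw(residual, boundary, period, max_offset=10, half_win=2):
    """Select a pitch-cycle segment near a subframe boundary and transform it.

    The segment is centred at boundary + i for the offset i in
    [-max_offset, max_offset] that minimises the summed energy of two
    5-point windows centred on the first and last samples of the segment.
    """
    best = None
    for i in range(-max_offset, max_offset + 1):
        start = boundary + i - period // 2
        end = start + period - 1                       # last sample of the cycle
        e_start = np.sum(residual[start - half_win:start + half_win + 1] ** 2)
        e_end = np.sum(residual[end - half_win:end + half_win + 1] ** 2)
        energy = e_start + e_end
        if best is None or energy < best[0]:
            best = (energy, i, start)
    _, offset, start = best
    segment = residual[start:start + period]           # time-domain PW
    spectrum = np.fft.fft(segment)[: period // 2 + 1]  # harmonics 0..K (unscaled)
    return offset, segment, spectrum

# Synthetic residual whose end energy is smallest for a zero centre offset.
residual = np.arange(400, dtype=float) - 199.5
offset, segment, spectrum = extract_pw(residual, boundary=200, period=40)
```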
Each complex PW vector can be further decomposed into a scalar gain component representing the level of the PW vector and a normalized complex PW vector representing the shape of the PW vector. Such a decomposition permits vector quantization that is efficient in terms of computation and storage with minimal degradation in quantization performance. The PW gain is the root-mean-square (RMS) value of the complex PW vector. It is obtained by
PW gain is also computed for the extra PW by
A normalized PW vector sequence is obtained by dividing the PW vectors by the corresponding gains:
And for the extra PW:
For a majority of frames, especially during stationary intervals, gain values change slowly from one subframe to the next. This makes it possible to decimate the gain sequence by a factor of about 2, thereby reducing the number of values that need to be quantized. Prior to decimation, the gain sequence is smoothed by a 3-point window, to eliminate excessive variations across the frame. The smoothing operation is in the logarithmic gain domain and is represented by
g″pw(m)=0.3 log10 g′pw(m−1)+0.4 log10 g′pw(m)+0.3 log10 g′pw(m+1) 1≦m≦8. (47)
Conversion to logarithmic domain is advantageous since it corresponds to the scale of loudness of sound perceived by the human ear. The smoothed gain values are transformed by the following transformation:
This transformation limits extreme (very low or very high) values of the gain and thereby improves quantizer performance, especially for low-level signals. The transformed gains are decimated by a factor of 2, requiring that only the even indexed values, i.e., {gpw(2), gpw(4), gpw(6), gpw(8)}, are quantized.
At the decoder 100B, the odd indexed values are obtained by linearly interpolating between the inverse quantized even indexed values.
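The smoothing of equation (47), the decimation by 2, and the decoder-side interpolation can be sketched as follows. The intermediate gain transformation is omitted because its equation is not reproduced in this passage. The use of the previous frame's last value as the left anchor for subframe 1 is an assumption, and all function names are hypothetical.

```python
import numpy as np

def smooth_log_gains(gains):
    """Equation (47). gains[0..9] holds g'(0) from the previous frame,
    g'(1..8) of the current frame, and g'(9) from the extra look-ahead PW."""
    log_g = np.log10(np.asarray(gains, dtype=float))
    return 0.3 * log_g[:-2] + 0.4 * log_g[1:-1] + 0.3 * log_g[2:]   # m = 1..8

def decimate(values):
    """Keep only the even indexed subframes m = 2, 4, 6, 8."""
    return values[1::2]

def interpolate_odd(even_values, prev_last):
    """Decoder side: rebuild m = 1..8 from the even indexed values.
    prev_last is the m = 8 value of the previous frame (ASSUMED left anchor)."""
    anchors = np.concatenate(([prev_last], even_values))    # m = 0, 2, 4, 6, 8
    full = np.empty(8)
    full[1::2] = even_values                                # m = 2, 4, 6, 8
    full[0::2] = 0.5 * (anchors[:-1] + anchors[1:])         # m = 1, 3, 5, 7
    return full

smoothed = smooth_log_gains([10.0] * 10)
kept = decimate(smoothed)
rebuilt = interpolate_odd([2.0, 4.0, 6.0, 8.0], prev_last=0.0)
```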
A 256 level, 4-dimensional vector quantizer is used to quantize the above gain vector. The design of the vector quantizer is one of the novel aspects of this algorithm. The PW gain sequence can exhibit two distinct modes of behavior. During stationary signals, such as voiced intervals, variations of the gain sequence across a frame are small.
On the other hand, during non-stationary signals such as voicing onsets, the gain sequence can exhibit large variations across a frame. The vector quantizer used must be able to represent both types of behavior. On the average, stationary frames far outnumber the non-stationary frames.
If a vector quantizer is trained using a database that does not distinguish between the two types, the training is dominated by stationary frames, leading to poor performance for non-stationary frames. To overcome this problem, the vector quantizer design was modified by classifying the PW gain vectors into a stationary class and a non-stationary class.
For the 256 level codebook, 192 levels were allocated to represent stationary frames and the remaining 64 were allocated for non-stationary frames. The 192 level codebook is trained using the stationary frames, and the 64 level codebook is trained using the non-stationary frames. The training algorithm with a binary split and random perturbation is based on the generalized Lloyd algorithm disclosed in “An algorithm for Vector Quantization Design”, by Y. Linde, A. Buzo and R. Gray, pages 84–95 of IEEE Transactions on Communications, VOL. COM-28, No. 1, January 1980 which is incorporated by reference in its entirety. In the case of the stationary codebook, a ternary split is used to derive the 192 level codebook from a 64 level codebook in the final stage of the training process. The 192 level codebook and the 64 level codebook are concatenated to obtain the 256-level gain codebook. The stationary/non-stationary classification is used only during the training phase. During quantization, stationary/non-stationary classification is not performed. Instead, the entire 256-level codebook is searched to locate the optimal quantized gain vector. The quantizer uses a mean squared error (MSE) distortion metric:
where {Vg(l,m),0≦l≦255,1≦m≦4} is the 256 level, 4-dimensional gain codebook and Dg(l) is the MSE distortion for the lth codevector. In another embodiment of the present invention the optimal codevector {Vg(l*g, m), 1≦m≦4} is the one which minimizes the distortion measure over the entire codebook, i.e.,
Dg(l*g)≦Dg(l), 0≦l≦255. (50)
The 8-bit index of the optimal code-vector l*g is transmitted to the decoder as the gain index.
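The full search of the 256 level gain codebook can be sketched as follows. The codebook contents below are random placeholders rather than trained values, and the name `search_gain_codebook` is hypothetical. Whether the distortion is normalized as a mean or as a sum does not affect the selected index.

```python
import numpy as np

def search_gain_codebook(gain_vector, codebook):
    """Return the index and distortion of the codevector with the smallest MSE."""
    distortion = np.mean((codebook - gain_vector) ** 2, axis=1)   # one value per level
    index = int(np.argmin(distortion))                            # condition (50)
    return index, float(distortion[index])

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 4))      # PLACEHOLDER for the trained 256 x 4 codebook
target = codebook[137].copy()             # a gain vector that matches level 137 exactly
gain_index, gain_distortion = search_gain_codebook(target, codebook)
```

No stationary/non-stationary decision appears in the search, consistent with the statement that the classification is used only during training.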
The generation of the phase spectrum at the decoder 100B is facilitated by measuring pitch cycle stationarity at the encoder as a ratio of the energy of the non-stationary component to that of the stationary component in the PW sequence. Further, this energy ratio is measured over 5 subbands spanning the frequency band of interest, resulting in a 5-dimensional vector nonstationarity measure in each frame. This vector is quantized and transmitted to the decoder, where it is used to generate phase spectra that lead to the correct degree of periodicity across the band. The first step in measuring the stationarity of PW is to align the PW sequence.
In order to measure the degree of stationarity of the PW sequence, it is necessary to align each PW to the preceding PW. The alignment process applies a circular shift to the pitch cycle to remove apparent differences in adjacent PWs that are due to temporal shifts or variations in pitch frequency. Let {tilde over (P)}m−1 denote the aligned PW corresponding to subframe m-1 and let {tilde over (θ)}m−1 be the phase shift that was applied to Pm−1 to derive {tilde over (P)}m−1. In other words,
{tilde over (P)}m−1(k)=Pm−1(k)ej{tilde over (θ)}m−1k 0≦k≦Km−1. (51)
For the alignment of Pm to {tilde over (P)}m−1, if the residual signal is perfectly periodic with the pitch period being an integer number of samples, Pm and Pm−1 are identical except for a circular shift. In this case, the pitch cycle for the mth subframe is identical to the pitch cycle for the m-1th subframe, except that the starting point for the former is at a later point in the pitch cycle compared to the latter. The difference in starting point arises due to the advance by a subframe interval and differences in center offsets at subframes m and m-1. With the subframe interval of 20 samples and with center offsets of imin(m) and imin(m−1), it can be seen that the mth pitch cycle is ahead of the m-1th pitch cycle by 20+imin(m)−imin(m−1) samples. If the pitch frequency is ωm, a phase shift of −ωm(20+imin(m)−imin(m−1)) is necessary to correct for this phase difference and align Pm with Pm−1. In addition, since Pm−1 has been circularly shifted by {tilde over (θ)}m−1 to derive {tilde over (P)}m−1, it follows that the phase shift needed to align Pm with {tilde over (P)}m−1 is a sum of these two phase shifts and is given by
{tilde over (θ)}m−1−ωm(20+imin(m)−imin(m−1)). (52)
In practice, the residual signal is not perfectly periodic and the pitch period can be non-integer valued. In such a case, the above cannot be used as the phase shift for optimal alignment. However, for quasi-periodic signals, the above phase angle can be used as a nominal shift and a small range of angles around this nominal shift angle is evaluated to find a locally optimal shift angle. Satisfactory results have been obtained with an angle range of about ±0.2π centered around the nominal shift angle, searched in steps of about 0.04π. For each shift within this range, the shifted version of Pm is correlated against {tilde over (P)}m−1. The shift angle that results in the maximum correlation is selected as the locally optimal shift. This correlation maximization can be represented by
where * represents complex conjugation and Re[ ] is the real part of a complex vector. If i=imax maximizes the above correlation, then the locally optimal shift angle is
{tilde over (θ)}m={tilde over (θ)}m−1−ωm(20+imin(m)−imin(m−1))+0.04 πimax (54)
and the aligned PW for the mth subframe is obtained from
{tilde over (P)}m(k)=Pm(k)ej{tilde over (θ)}mk 0≦k≦Km. (55)
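The alignment search of equations (52), (54) and (55) can be sketched as follows. The correlation expression used here, the real part of the sum of the shifted PW multiplied by the conjugate of the previous aligned PW, is an assumption consistent with the description, since the correlation equation itself is not reproduced in this passage. The name `align_pw` and its parameters are hypothetical, and the correlation is restricted to the harmonics common to both vectors.

```python
import numpy as np

def align_pw(P_cur, P_prev_aligned, theta_prev, omega, imin_cur, imin_prev,
             subframe=20, step=0.04 * np.pi, n_steps=5):
    """Search +/- n_steps * step around the nominal shift of equation (52)."""
    k = np.arange(len(P_cur))
    n = min(len(P_cur), len(P_prev_aligned))        # harmonics present in both PWs
    nominal = theta_prev - omega * (subframe + imin_cur - imin_prev)
    best_corr, best_theta = -np.inf, nominal
    for i in range(-n_steps, n_steps + 1):
        theta = nominal + step * i                  # candidate angle, as in (54)
        shifted = P_cur * np.exp(1j * theta * k)
        corr = np.real(np.sum(shifted[:n] * np.conj(P_prev_aligned[:n])))
        if corr > best_corr:
            best_corr, best_theta = corr, theta
    return best_theta, P_cur * np.exp(1j * best_theta * k)   # equation (55)

# Synthetic check: build the current PW as a shifted copy of the previous aligned PW.
rng = np.random.default_rng(1)
P_prev = rng.normal(size=21) + 1j * rng.normal(size=21)
omega = 2 * np.pi / 40
true_theta = (0.3 - omega * (20 + 2 - (-1))) + (0.04 * np.pi) * 3
P_cur = P_prev * np.exp(-1j * true_theta * np.arange(21))
theta_hat, P_aligned = align_pw(P_cur, P_prev, 0.3, omega, imin_cur=2, imin_prev=-1)
```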
The process of alignment results in a sequence of aligned PWs from which any apparent dissimilarities due to shifts in the PW extraction window, pitch period etc. have been removed. Only dissimilarities due to the shape of the pitch cycle or equivalently the residual spectral characteristics are preserved. Thus, the sequence of aligned PWs provides a means of measuring the degree of change taking place in the residual spectral characteristics i.e., the degree of stationarity of the residual spectral characteristics. The basic premise of the FDI algorithm is that it is important to encode and reproduce the degree of stationarity of the residual in order to produce natural sounding speech at the decoder. Consider the temporal sequence of aligned PWs along the kth harmonic track, i.e.,
{{tilde over (P)}m(k),1≦m≦8}. (56)
If the signal is perfectly periodic, the kth harmonic is identical for all subframes, and the above sequence is a constant as a function of m. If the signal is quasi-periodic, the sequence exhibits slow variations across the frame, but is still a predominantly low frequency waveform. It should be noted that here frequency refers to evolutionary frequency, related to the rate at which the PW changes across a frame. This is in contrast to harmonic frequency, which is the frequency of the pitch harmonic. Thus, a high frequency harmonic component changing slowly across the frame is said to have low evolutionary frequency content. Conversely, a low frequency harmonic component changing rapidly across the frame is said to have high evolutionary frequency content.
As the signal periodicity decreases, variations in the above PW sequence increase, with decreasing energy at lower frequencies and increasing energy at higher frequencies. At the other extreme, if the signal is aperiodic, the PW sequence exhibits large variations across the frame, with a near uniform energy distribution across frequency. Thus, by determining the spectral energy distribution of aligned PW sequences along a harmonic track, it is possible to obtain a measure of the periodicity of the signal at that harmonic frequency. By repeating this analysis at all the harmonics within the band of interest, a frequency dependent measure of periodicity can be constructed.
The relative distribution of spectral energy of variations of PW between low and high frequencies can be determined by passing the aligned PW sequence along each harmonic track through a low pass filter and a high pass filter. In an embodiment of the present invention, the low pass filter used is a 3rd order Chebyshev filter with a 3 dB cutoff at 35 Hz (for the PW sampling frequency of 400 Hz), with the following transfer function:
The high pass filter used is also a 3rd order Chebyshev filter with a 3 dB cutoff at 18 Hz with the following transfer function:
The output of the low pass filter is the stationary component of the PW that gives rise to pitch cycle periodicity and is denoted by {Sm(k), 0≦k≦Km, 1≦m≦8}. The output of the high pass filter is the nonstationary component of PW that gives rise to pitch cycle aperiodicity and is denoted by {Rm(k), 0≦k≦Km, 1≦m≦8}. The energies of these components are computed in subbands and then averaged across the frame.
The harmonics of the stationary and nonstationary components are grouped into 5 subbands spanning the frequency band of interest, where the band-edges in Hz are defined by the array
Brs=[1 400 800 1600 2400 3400]. (59)
The subband edges in Hz can be translated to subband edges in terms of harmonic indices such that the ith subband contains harmonics with indices {ηm(i−1)≦k<ηm(i),1≦i≦5} as follows:
The energy in each subband is computed by averaging the squared magnitude of each harmonic within the subband. For the stationary component, the subband energy distribution for the mth subframe is computed by
For the nonstationary component, the subband energy distribution for the mth subframe is computed by
Next, these subframe energies are averaged across the frame:
The subband nonstationarity measure is computed as the ratio of the energy of the nonstationary component to that of the stationary component in each subband:
If this ratio is very low, it indicates that the PW sequence has much higher energy at low evolutionary frequencies than at high evolutionary frequencies, corresponding to a predominantly periodic signal or stationary PW sequence. On the other hand, if this ratio is very high, it indicates that the PW sequence has much higher energy at high evolutionary frequencies than at low evolutionary frequencies, corresponding to a predominantly aperiodic signal or nonstationary PW sequence. Intermediate values of the ratio indicate different mixtures of periodic and aperiodic components in the signal or different degrees of stationarity of the PW sequence. This information can be used at the decoder to create the correct degree of variation from one PW to the next, as a function of frequency and thereby realize the correct degree of periodicity in the signal.
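The computation of the subband nonstationarity measure can be sketched as follows. The assignment of harmonics to subbands by frequency, the 8000 Hz sampling rate, and the small constant `eps` guarding against division by zero are assumptions made for this sketch; the stationary and nonstationary components are taken as given inputs, since the filter transfer functions are not reproduced in this passage. All function names are hypothetical.

```python
import numpy as np

BAND_EDGES_HZ = np.array([1, 400, 800, 1600, 2400, 3400])   # equation (59)
FS = 8000.0   # ASSUMED sampling rate, implied by the 4000 Hz band of interest

def subband_energy(pw, omega):
    """Average squared magnitude of the harmonics falling in each of the 5 subbands."""
    freqs = np.arange(len(pw)) * omega * FS / (2 * np.pi)    # harmonic frequencies in Hz
    energy = np.zeros(5)
    for i in range(5):
        mask = (freqs >= BAND_EDGES_HZ[i]) & (freqs < BAND_EDGES_HZ[i + 1])
        if mask.any():
            energy[i] = np.mean(np.abs(pw[mask]) ** 2)
    return energy

def nonstationarity(stationary, nonstationary, omegas, eps=1e-12):
    """Ratio of frame-averaged nonstationary to stationary energy per subband."""
    e_s = np.mean([subband_energy(s, w) for s, w in zip(stationary, omegas)], axis=0)
    e_r = np.mean([subband_energy(r, w) for r, w in zip(nonstationary, omegas)], axis=0)
    return e_r / (e_s + eps)

# Eight subframes with a pitch period of 80 samples (harmonics spaced 100 Hz apart).
omegas = [2 * np.pi / 80] * 8
S = [np.ones(41, dtype=complex)] * 8          # stationary component, unit magnitude
R = [0.5 * np.ones(41, dtype=complex)] * 8    # nonstationary component, half magnitude
ratio = nonstationarity(S, R, omegas)
```

In this synthetic case the nonstationary component carries one quarter of the stationary energy in every subband, so the measure is close to 0.25 throughout, which would indicate a predominantly periodic signal.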
In case of nonstationary voiced signals, where the pitch cycle is changing rapidly across the frame, the nonstationarity measure may have high values even in low frequency bands. This is usually a characteristic of unvoiced signals and usually translates to a noise-like excitation at the decoder. However, it is important that non-stationary voiced frames are reconstructed at the decoder with glottal pulse-like excitation rather than with noise-like excitation. This information is conveyed by a scalar parameter called a voicing measure, which is a measure of the degree of voicing of the frame. During stationary voiced and unvoiced frames, there is some correlation between the nonstationarity measure and the voicing measure. However, while the voicing measure indicates if the excitation pulse should be a glottal pulse or a noise-like waveform, the nonstationarity measure indicates how much this excitation pulse should change from subframe to subframe. The correlation between the voicing measure and the nonstationarity measure is exploited by vector quantizing these jointly.
The voicing measure is estimated for each frame based on certain characteristics correlated with the voiced/unvoiced nature of the frame. It is a heuristic measure that assigns a degree of voicing to each frame in the range 0–1, with a zero indicating a perfectly voiced frame and a one indicating a completely unvoiced frame.
The voicing measure is determined based on six measured characteristics of the current frame: the average of the nonstationarity measure in the 3 low frequency subbands; a relative signal power, which is computed as the difference between the signal power of the current frame and a long term average signal power; the pitch gain; the average correlation between adjacent aligned PWs; the 1st reflection coefficient obtained during LP analysis; and the variance of the candidate pitch lags computed during pitch estimation.
The (squared) normalized correlation between the aligned PWs of the mth and m−1th subframes is obtained by
It should be noted that the upper limit of the summations is limited to 6 rather than Km to reduce computational complexity. This subframe correlation is averaged across the frame to obtain an average PW correlation:
The average PW correlation is a measure of pitch cycle to pitch cycle correlation after variations due to signal level, pitch period and PW extraction offset have been removed. It exhibits a strong correlation to the nature of glottal excitation. As mentioned earlier, the nonstationarity measure, especially in the low frequency subbands, has a strong correlation to the voicing of the frame. An average of the nonstationarity measure for the 3 lowest subbands provides a useful parameter in inferring the nature of the glottal excitation. This average is computed as
It will be appreciated by those skilled in the art that subbands other than the three lowest subbands can be used without departing from the scope of the present invention.
The pitch gain is a parameter that is computed as part of the pitch analysis function. It is essentially the value of the peak of the autocorrelation function (ACF) of the residual signal at the pitch lag. To avoid spurious peaks, the ACF used in the embodiment of this invention is a composite autocorrelation function, computed as a weighted average of adjacent residual raw autocorrelation functions.
The pitch gain, denoted by βpitch, is the value of the peak of a composite autocorrelation function. The composite ACFs are evaluated once every 40 samples within each frame at 80, 120, 160, 200 and 240 samples as shown in
The variation is computed by the average of the absolute deviations from this mean:
This parameter exhibits a moderate degree of correlation to the voicing of the signal.
The signal power also exhibits a moderate degree of correlation to the voicing of the signal. However, it is important to use a relative signal power rather than an absolute signal power, to achieve robustness to input signal level deviations from nominal values. The signal power in dB is defined as
An average signal power can be obtained by exponentially averaging the signal power during active frames. Such an average can be computed recursively using the following equation:
Esigavg=0.95Esigavg+0.05Esig. (72)
A relative signal power can be obtained as the difference between the signal power and the average signal power:
Esigrel=Esig−Esigavg. (73)
The relative signal power measures the signal power of the frame relative to a long term average. Voiced frames exhibit moderate to high values of relative signal power, whereas unvoiced frames exhibit low values.
The 1st reflection coefficient ρ1 is obtained as a byproduct of LP analysis during Levinson-Durbin recursion. Conceptually it is equivalent to the 1st order normalized autocorrelation coefficient of the noise reduced speech. During voiced speech segments, the speech spectrum tends to have a low pass characteristic, which results in a ρ1 close to 1. During unvoiced frames, the speech spectrum tends to have a flatter or high pass characteristic, resulting in smaller or even negative values for ρ1.
To derive the voicing measure, each of these six parameters is nonlinearly transformed using sigmoidal functions such that it maps to the range 0–1, close to 0 for voiced frames and close to 1 for unvoiced frames. The parameters for the sigmoidal transformation have been selected based on an analysis of the distribution of these parameters. The following are the transformations for each of these parameters:
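The recursion of equations (72) and (73) can be sketched as follows. Whether the relative power is formed with the average before or after the update is not stated; the sketch assumes the updated average is used. The function name `update_relative_power` and the `active` argument are hypothetical.

```python
def update_relative_power(e_sig, e_avg, active=True):
    """Return (relative power, updated average), all quantities in dB."""
    if active:
        e_avg = 0.95 * e_avg + 0.05 * e_sig   # equation (72), active frames only
    e_rel = e_sig - e_avg                     # equation (73), ASSUMED to use the updated average
    return e_rel, e_avg

rel_active, avg_active = update_relative_power(60.0, 40.0)
rel_idle, avg_idle = update_relative_power(60.0, 40.0, active=False)
```

With a frame power of 60 dB and a running average of 40 dB, an active frame moves the average to 41 dB and gives a relative power of 19 dB.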
The voicing measure of the previous frame νprev determines the weighted sum of the transformed parameters which results in the voicing measure:
The weights used in the above sum are in accordance with the degree of correlation of the parameter to the voicing of the signal. Thus, the pitch gain receives the highest weight since it is most strongly correlated, followed by the PW correlation. The 1st reflection coefficient and low-band nonstationarity measure receive moderate weights. The weights also depend on whether the previous frame was strongly voiced, in which case more weight is given to the low-band nonstationarity measure. The pitch variation and relative signal power receive smaller weights since they are only moderately correlated to voicing.
If the resulting voicing measure ν is clearly in the voiced region (ν<0.45) or clearly in the unvoiced region (ν>0.6), it is not modified further. However, if it lies outside the clearly voiced or unvoiced regions, the parameters are examined to determine if there is a moderate bias towards a voiced frame. In such a case, the voicing measure is modified so that its value lies in the voiced region.
The resulting voicing measure ν takes on values in the range 0–1, with lower values for more voiced signals. In addition, a binary voicing measure flag is derived from the voicing measure as follows:
Thus, νflag is 0 for voiced signals and 1 for unvoiced signals. This flag is used in selecting the quantization mode for PW magnitude and the subband nonstationarity vector. The voicing measure ν is concatenated to the subband nonstationarity measure vector and the resulting 6-dimensional vector is vector quantized.
The subband nonstationarity measure can have occasional spurious large values, mainly due to the approximations and the averaging used during its computation. If this occurs during voiced frames, the signal is reproduced with excessive roughness and the voice quality is degraded. To prevent this, large values of the nonstationarity measure are attenuated. The attenuation characteristic has been determined experimentally and is specified as follows for each of the five subbands:
Additionally, for voiced frames, it is necessary to ensure that the values of the nonstationarity measure in the low frequency subbands are in a monotonically nondecreasing order. This condition is enforced for the 3 lower subbands according to the flow chart in
At step 604 a determination is made as to whether the voicing measure is less than 0.6. If the determination is answered negatively, the method proceeds to step 622. If the determination is answered affirmatively the method proceeds to step 606.
At step 606 a determination is made as to whether R1 is greater than R2. If the determination is answered negatively, the method proceeds to step 614. If the determination is answered affirmatively, the method proceeds to step 608.
At step 614 a determination is made as to whether R2 is greater than R3. If the determination is answered negatively the method proceeds to step 622. If the determination is answered affirmatively, the method proceeds to step 616.
At step 608 a determination is made as to whether 0.5(R1+R2) is less than or equal to R3. If the determination is answered affirmatively the method proceeds to step 610 where a formula is used to calculate R1 and R2. The method then proceeds to step 614.
If the determination at step 608 is answered negatively, the method proceeds to step 612 where a series of calculations is used to calculate R1, R2 and R3. The method then proceeds to step 614.
At step 616 a determination is made as to whether 0.5(R1+R3) is greater than or equal to R1. If the determination is answered affirmatively, the method proceeds to step 618 where a series of calculations is used to calculate R2 and R3. If the determination is answered negatively, the method proceeds to step 620 where a series of calculations is used to calculate R1, R2 and R3.
From steps 614, 618 and 620 the method proceeds to step 622 where the adjustment of the R vector ends.
The nonstationarity measure vector is vector quantized using a spectrally weighted quantization. The spectral weights are derived from the LPC parameters. First, the LPC spectral estimate corresponding to the end point of the current frame is estimated at the pitch harmonic frequencies. This estimate employs tilt correction and a slight degree of bandwidth broadening. These measures are needed to ensure that the quantization of formant valleys or high frequencies is not compromised by attaching excessive weight to formant regions or low frequencies.
This harmonic spectrum is converted to a subband spectrum by averaging across the 5 subbands used for the computation of the nonstationarity measure.
This is averaged with the subband spectrum at the end of the previous frame to derive a subband spectrum corresponding to the center of the current frame. This average serves as the spectral weight vector for the quantization of the nonstationarity vector.
{overscore (W)}4(l)=0.5({overscore (W)}0(l)+{overscore (W)}8(l)) 1≦l≦5. (88)
The voicing measure is concatenated to the end of the nonstationarity measure vector, resulting in a 6-dimensional composite vector. This permits the exploitation of the considerable correlation that exists between these quantities. The composite vector is denoted by
c={ℜ(1) ℜ(2) ℜ(3) ℜ(4) ℜ(5) ν}. (89)
The spectral weight for the voicing measure is derived from the spectral weight for the nonstationarity measure depending on the voicing measure flag. If the frame is voiced (νflag=0), the weight is computed as
In other words, it is lower than the average weight for the nonstationary component. This ensures that the nonstationary component is quantized more accurately than the voicing measure. This is desirable since for voiced frames, it is important to preserve the nonstationarity in the various bands to achieve the right degree of periodicity. On the other hand, for unvoiced frames, the voicing measure is more important. In this case, its weight is larger than the maximum weight for the nonstationary component.
A 64 level, 6-dimensional vector quantizer is used to quantize the composite nonstationarity measure-voicing measure vector. The first 8 codevectors (indices 0–7) are assigned to represent unvoiced frames and the remaining 56 codevectors (indices 8–63) are assigned to represent voiced frames. The voiced/unvoiced decision is made based on the voicing measure flag. The following weighted MSE distortion measure is used:
Here, {VR(l,m), 0≦l≦63,1≦m≦6} is the 64 level, 6-dimensional composite nonstationarity measure-voicing measure codebook and DR(l) is the weighted MSE distortion for the lth codevector. If the frame is unvoiced (νflag=1), this distortion is minimized over the indices 0–7. If the frame is voiced (νflag=0), the distortion is minimized over the indices 8–63. Thus,
This partitioning of the codebook reflects the higher importance given to the representation of the nonstationarity measure during voiced frames. The 6-bit index of the optimal codevector l*R is transmitted to the decoder as the nonstationarity measure index. It should be noted that the voicing measure flag, which is used in the decoder 100B for the inverse quantization of the PW magnitude vector, can be detected by examining the value of this index.
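The partitioned codebook search, and the recovery of the voicing measure flag from the transmitted index, can be sketched as follows. The weighted distortion is assumed here to be a weighted sum of squared differences, since the distortion equation is not reproduced in this passage. The codebook contents are random placeholders, and the function names are hypothetical.

```python
import numpy as np

def search_composite_codebook(vector, weights, codebook, v_flag):
    """Search indices 0-7 for unvoiced frames (flag 1) and 8-63 for voiced frames (flag 0)."""
    lo, hi = (0, 8) if v_flag == 1 else (8, 64)
    diff = codebook[lo:hi] - vector
    distortion = np.sum(weights * diff ** 2, axis=1)   # ASSUMED weighted squared error
    return lo + int(np.argmin(distortion))

def flag_from_index(index):
    """Decoder side: the partition containing the index reveals the voicing flag."""
    return 1 if index < 8 else 0

rng = np.random.default_rng(2)
codebook = rng.normal(size=(64, 6))        # PLACEHOLDER for the trained 64 x 6 codebook
weights = np.ones(6)                       # PLACEHOLDER spectral weights
idx_voiced = search_composite_codebook(codebook[20], weights, codebook, v_flag=0)
idx_unvoiced = search_composite_codebook(codebook[3], weights, codebook, v_flag=1)
idx_forced = search_composite_codebook(codebook[3], weights, codebook, v_flag=0)
```

Because the search is confined to one partition, a voiced frame never selects an index below 8, which is why no separate bit is needed to convey the flag.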
Up to this point, the PW vectors are processed in Cartesian (i.e., real-imaginary) form. The FDI codec 100 at 4.0 kbit/s encodes only the PW magnitude information to make the most efficient use of available bits. PW phase spectra are not encoded explicitly. Further, in order to avoid the computationally intensive square-root operation in computing the magnitude of a complex number, the PW magnitude-squared vector is used during the quantization process.
The PW magnitude vector is quantized using a hierarchical approach, which allows the use of fixed dimension VQ with a moderate number of levels and precise quantization of perceptually important components of the magnitude spectrum. In this approach, the PW magnitude is viewed as the sum of two components: a PW mean component, which is obtained by averaging the PW magnitude across frequencies within a 7 band sub-band structure, and a PW deviation component, which is the difference between the PW magnitude and the PW mean. The PW mean component captures the average level of the PW magnitude across frequency, which is important to preserve during encoding. The PW deviation contains the finer structure of the PW magnitude spectrum and is not important at all frequencies. It is only necessary to preserve the PW deviation at a small set of perceptually important frequencies. The remaining elements of PW deviation can be discarded, leading to a small, fixed dimensionality of the PW deviation component.
The PW magnitude vector is quantized differently for voiced and unvoiced frames as determined by the voicing measure flag. Since the quantization index of the nonstationarity measure is determined by the voicing measure flag, the PW magnitude quantization mode information is conveyed without any additional overhead.
During voiced frames, the spectral characteristics of the residual are relatively stationary. Since the PW mean component is almost constant across the frame, it is adequate to transmit it once per frame. The PW deviation is transmitted twice per frame, at the 4th and 8th subframes. Further, interframe predictive quantization can be used in the voiced mode. On the other hand, unvoiced frames tend to be nonstationary. To track the variations in PW spectra, both mean and deviation components are transmitted twice per frame, at the 4th and 8th subframes. Prediction is not employed in the unvoiced mode.
The PW magnitude vectors at subframes 4 and 8 are smoothed by a 3-point window. This smoothing can be viewed as an approximate form of decimation filtering to down sample the PW vector from 8 vectors/frame to 2 vectors/frame.
$$\bar{P}_m(k) = 0.3\,P_{m-1}(k) + 0.4\,P_m(k) + 0.3\,P_{m+1}(k), \qquad 0 \le k \le K_m,\ m = 4, 8. \tag{94}$$
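The 3-point smoothing of equation (94) can be sketched in Python as shown below. This is a minimal illustration under the assumption that the neighbouring subframe vectors contain at least as many harmonics as the subframe being smoothed; the function name is hypothetical.

```python
def smooth_pw_magnitude(prev_vec, cur_vec, next_vec):
    """Apply the 0.3 / 0.4 / 0.3 window across three adjacent subframes.

    prev_vec, cur_vec and next_vec hold the PW magnitude vectors of
    subframes m-1, m and m+1. Assumption: the neighbours are at least
    as long as cur_vec, so every harmonic index k of subframe m exists
    in all three vectors.
    """
    return [
        0.3 * prev_vec[k] + 0.4 * cur_vec[k] + 0.3 * next_vec[k]
        for k in range(len(cur_vec))
    ]
```

Since the window weights sum to one, a vector that is constant across the three subframes is left unchanged by the smoothing.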
The subband mean vector is computed by averaging the PW magnitude vector across 7 subbands. The subband edges in Hz are
Bpw=[1 400 800 1200 1600 2000 2600 3400] (95)
To average the PW vector across frequencies, it is necessary to translate the subband edges in Hz to subband edges in terms of harmonic indices. The band-edges in terms of harmonic indices for subframes 4 and 8 can be computed by
The mean vectors are computed at subframes 4 and 8 by averaging over the harmonic indices of each subband. It should be noted that, as mentioned earlier, since the PW vector is available in magnitude-squared form, the mean vector is in reality an RMS vector. This is reflected by the following equation.
The mean vector quantization is spectrally weighted. The spectral weight vector is computed for subframe 8 from LP parameters as follows:
The spectral weight vector is attenuated outside the band of interest, so that out-of-band PW components do not influence the selection of the optimal code-vector.
$$W_8(k) \leftarrow W_8(k) \cdot 10^{-10}, \qquad 0 \le k \le \kappa_8(0)\ \text{or}\ \kappa_8(7) \le k \le K_8. \tag{99}$$
The spectral weight vector for subframe 4 is approximated as an average of the spectral weight vectors of subframes 0 and 8. This approximation is used to reduce computational complexity of the encoder.
W4(k)=0.5(W0(k)+W8(k)),0≦k≦K4. (100)
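As an illustrative sketch, the out-of-band attenuation of equation (99) and the averaging of equation (100) may be written in Python as follows. The harmonic band-edge indices are passed in as arguments, and the function names are assumptions made for this example only.

```python
def attenuate_out_of_band(w8, low_edge, high_edge):
    """Scale weights by 1e-10 for k <= low_edge or k >= high_edge.

    low_edge and high_edge stand for the harmonic band edges of the
    lowest and highest subband edges at subframe 8 (assumed precomputed).
    """
    return [
        w * 1e-10 if (k <= low_edge or k >= high_edge) else w
        for k, w in enumerate(w8)
    ]


def average_spectral_weights(w0, w8, k4):
    """Approximate the subframe-4 weights as the mean of subframes 0 and 8.

    Assumption: both input vectors have at least k4 + 1 elements.
    """
    return [0.5 * (w0[k] + w8[k]) for k in range(k4 + 1)]
```

Multiplying by such a small factor, rather than setting the weights to zero, keeps the weight vector strictly positive while making the out-of-band contribution to the distortion negligible.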
The spectral weight vectors at subframes 4 and 8 are averaged over subbands to serve as spectral weights for quantizing the subband mean vectors:
The mean vectors at subframes 4 and 8 are vector quantized using a 7 bit codebook. A precomputed DC vector {PDC
Here, {VPWM
The quantized subband mean vectors are given by adding the optimal codevectors to the DC vector:
{overscore (P)}mq(i)=PDC
The quantized subband mean vectors are used to derive the PW deviations vectors. This makes it possible to compensate for the quantization error in the mean vectors during the quantization of the deviations vectors. Deviations vectors are computed for subframes 4 and 8 by subtracting fullband vectors constructed using quantized mean vectors from original PW magnitude vectors. The fullband vectors are obtained by piecewise-constant approximation across each subband:
The deviation vector is quantized only for a small subset of the harmonics, which are perceptually important. There are a number of approaches to selecting the harmonics, by taking into account the signal characteristics, spectral energy distribution, etc. This embodiment of the present invention uses a simple approach where harmonics 1–10 are selected. This ensures that the low frequency part of the speech spectrum, which is perceptually important, is reproduced more accurately. Taking into account the fact that the PW vector is available in magnitude-squared form, harmonics 1–10 of the deviation vector are computed as follows:
$$F_m(k) = \sqrt{P_m(k_{start_m} + k)} - S_m(k_{start_m} + k), \qquad 1 \le k \le 10,\ m = 4, 8. \tag{106}$$
Here, kstartm is computed so that harmonics below 200 Hz are not selected for computing the deviations vector:
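For illustration, the deviation computation of equation (106) can be sketched in Python as below. The starting offset is taken as a precomputed input, and the function and argument names are hypothetical.

```python
import math


def deviation_vector(p_squared, fullband_mean, kstart, count=10):
    """Deviation at harmonics kstart+1 .. kstart+count.

    p_squared is the PW magnitude-squared vector, so a square root is
    taken before the fullband mean is subtracted. kstart is assumed to
    have been computed already so that harmonics below 200 Hz are skipped.
    """
    return [
        math.sqrt(p_squared[kstart + k]) - fullband_mean[kstart + k]
        for k in range(1, count + 1)
    ]
```

Only these selected elements are passed on to the deviations quantizer, which keeps its dimension fixed regardless of the pitch-dependent length of the PW vector.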
The quantization of deviations vectors is carried out by a 6-bit vector quantizer using a spectrally weighted MSE distortion measure.
Here, {VPWD
The quantized deviations vectors are the optimal code-vectors:
Fmq(i)=VPWD
The two 7-bit mean quantization indices l*PWM
In the voiced mode, the PW magnitude vector smoothing, the computation of harmonic subband edges and the PW subband mean vector at subframe 8 take place as in the case of unvoiced frames. In contrast to the unvoiced case, a predictive VQ approach is used where the quantized PW subband mean vector at subframe 0 (i.e., subframe 8 of previous frame) is used to predict the PW subband mean vector at subframe 8. A prediction coefficient of 0.5 is used. A predetermined DC vector is subtracted prior to prediction. The resulting vectors are quantized by a 7-bit codebook using a spectrally weighted MSE distortion measure. The subband spectral weight vector is computed for subframe 8 as in the case of unvoiced frames. The distortion computation is summarized by
Here, {VPWM
The quantized subband mean vector at subframe 8 is given by adding the optimal code-vector to the predicted vector and the DC vector:
{overscore (P)}8q(i)=MAX(0.1,PDC
Since the mean vector is an average of PW magnitudes, it should be a nonnegative value. This is enforced by the maximization operation in the above equation 113.
A fullband mean vector {S8(k),0≦k≦K8} is constructed at subframe 8 using the quantized subband mean vector, as in the unvoiced mode. A subband mean vector is constructed for subframe 4 by linearly interpolating between the quantized subband mean vectors of subframes 0 and 8:
$$\bar{P}_4(i) = 0.5\left(\bar{P}_{0q}(i) + \bar{P}_{8q}(i)\right), \qquad 0 \le i \le 6. \tag{114}$$
A fullband mean vector {S4(k),0≦k≦K4} is constructed at subframe 4 using this interpolated subband mean vector. By subtracting these fullband mean vectors from the corresponding magnitude vectors, deviations vectors {F4(k),1≦k≦10} and {F8(k),1≦k≦10} are computed at subframes 4 and 8. Note that these deviations vectors are computed only for selected harmonics, i.e., harmonics (kstartm+1)−(kstartm+10) as in the unvoiced case. The deviations vectors are predictively quantized based on prediction from the quantized deviation vector from 4 subframes ago, i.e., subframe 4 is predicted using subframe 0 and subframe 8 is predicted using subframe 4. A prediction coefficient of 0.55 is preferably used.
The deviations prediction error vectors are quantized using a multi-stage vector quantizer with 2 stages. The 1st stage uses a 64-level codebook and the 2nd stage uses a 16-level codebook. To reduce complexity, another embodiment of the present invention considers only the 8 best candidates from the 1st codebook when searching the 2nd codebook. The distortion measures are spectrally weighted. The spectral weight vectors {W4(k),0≦k≦10} and {W8(k),0≦k≦10} are computed as in the unvoiced case. The 1st codebook uses the following distortion to find the 8 codevectors with the smallest distortion:
where {jPWD
where l1=l*PWD
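A two-stage search that retains only the best first-stage candidates can be sketched in Python as follows. This is a simplified illustration that assumes a plain weighted squared-error distortion and list-based codebooks; the function names and the default of 8 retained candidates follow the description above but are otherwise hypothetical.

```python
import heapq


def weighted_mse(target, approx, weights):
    """Weighted squared error between two equal-length vectors."""
    return sum(w * (t - a) ** 2 for t, a, w in zip(target, approx, weights))


def two_stage_search(target, weights, stage1, stage2, n_best=8):
    """Return (l1, l2) minimising the distortion of stage1[l1] + stage2[l2].

    Only the n_best first-stage codevectors with the smallest first-stage
    distortion are carried into the second-stage search.
    """
    survivors = heapq.nsmallest(
        n_best,
        range(len(stage1)),
        key=lambda l: weighted_mse(target, stage1[l], weights),
    )
    best = None
    for l1 in survivors:
        # Residual left after removing the first-stage contribution.
        residual = [t - c for t, c in zip(target, stage1[l1])]
        for l2, c2 in enumerate(stage2):
            d = weighted_mse(residual, c2, weights)
            if best is None or d < best[0]:
                best = (d, l1, l2)
    return best[1], best[2]
```

With a 64-level first stage and a 16-level second stage, keeping 8 survivors means 64 + 8 × 16 = 192 distortion evaluations instead of the 64 × 16 = 1024 required by an exhaustive joint search.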
In the unvoiced mode, the VAD flag is explicitly encoded using a binary index l*VAD
l*VAD
In the voiced mode, it is implicitly assumed that the frame is active speech. Consequently, it is not necessary to explicitly encode the VAD information.
In a preferred embodiment, at 4 kb/s, the following table 1 summarizes the bits allocated to the quantization of the encoder parameters under voiced and unvoiced modes. As indicated in the table, a single parity bit is included as part of the 80 bit compressed speech packet. This bit is intended to detect channel errors in a set of 24 critical (Class 1) bits. Class 1 bits consist of the 6 most significant bits (MSB) of the PW gain bits, 3 MSBs of 1st LSF, 3 MSBs of 2nd LSF, 3 MSBs of 3rd LSF, 2 MSBs of 4th LSF, 2 MSBs of 5th LSF, MSB of 6th LSF, 3 MSBs of the pitch index and MSB of the nonstationarity measure index. The single parity bit is obtained by an exclusive OR operation of the Class 1 bit sequence. It will be appreciated by those skilled in the art that other bit allocations can be used and still fall within the scope of the present invention.
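The parity computation over the Class 1 bits can be illustrated by the following Python sketch. The ordering of the 24 bits within the list and the function names are assumptions for this example; only the exclusive OR operation itself is taken from the description.

```python
from functools import reduce
from operator import xor

CLASS1_BIT_COUNT = 24  # 6 + 3 + 3 + 3 + 2 + 2 + 1 + 3 + 1


def class1_parity(class1_bits):
    """Exclusive OR of all Class 1 bits (each bit is 0 or 1)."""
    if len(class1_bits) != CLASS1_BIT_COUNT:
        raise ValueError("expected 24 Class 1 bits")
    return reduce(xor, class1_bits, 0)


def parity_is_consistent(class1_bits, received_parity):
    """Decoder-side check of the received parity bit."""
    return class1_parity(class1_bits) == received_parity
```

A single parity bit detects any odd number of bit errors among the Class 1 bits, but an even number of errors passes undetected.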
The present invention will now be discussed with reference to decoder 100B. The decoder receives the 80 bit packet of compressed speech produced by the encoder and reconstructs a 20 ms segment of speech. The received bits are unpacked to obtain quantization indices for the LSF parameter vector, the pitch period, the PW gain vector, the nonstationarity measure vector and the PW magnitude vector. A cyclic redundancy check (CRC) flag is set if the frame is marked as a bad frame. For example, a frame may be marked as bad because of a frame erasure, or because the parity bit which is part of the 80 bit compressed speech packet is not consistent with the class 1 bits comprising the gain, LSF, pitch and nonstationarity measure bits. Otherwise, the CRC flag is cleared. If the CRC flag is set, the received information is discarded and bad frame masking techniques are employed to approximate the missing information.
Based on the quantization indices, LSF parameters, pitch, PW gain vector, nonstationarity measure vector and the PW magnitude vector are decoded. The LSF vector is converted to LPC parameters and linearly interpolated for each subframe. The pitch frequency is interpolated linearly for each sample. The decoded PW gain vector is linearly interpolated for odd indexed subframes. The PW magnitude vector is reconstructed depending on the voicing measure flag, obtained from the nonstationarity measure index. The PW magnitude vector is interpolated linearly across the frame at each subframe. For unvoiced frames (voicing measure flag=1), the VAD flag corresponding to the look-ahead frame is decoded from the PW magnitude index. For voiced frames, the VAD flag is set to 1 to represent active speech.
Based on the voicing measure and the nonstationarity measure, a phase model is used to derive a PW phase vector for each subframe. The interpolated PW magnitude vector at each subframe is combined with a phase vector from the phase model to obtain a complex PW vector for each subframe.
Out-of-band components of the PW vector are attenuated. The level of the PW vector is restored to the RMS value represented by the PW gain vector. The PW vector, which is a frequency domain representation of the pitch cycle waveform of the residual, is transformed to the time domain by an interpolative sample-by-sample pitch cycle inverse DFT operation. The resulting signal is the excitation that drives the LP synthesis filter, constructed using the interpolated LP parameters. Prior to synthesis, the LP parameters are bandwidth broadened to eliminate sharp spectral resonances during background noise conditions. The excitation signal is filtered by the all-pole LP synthesis filter to produce reconstructed speech. Adaptive postfiltering with tilt correction is used to mask coding noise and improve the perceptual quality of speech.
The pitch period is inverse quantized by a simple table lookup operation using the pitch index. It is converted to the radian pitch frequency corresponding to the right edge of the frame by
where {circumflex over (p)} is the decoded pitch period. A sample by sample pitch frequency contour is created by interpolating between the pitch frequency of the left edge {circumflex over (ω)}(0) and the pitch frequency of the right edge {circumflex over (ω)}(160):
If there are abrupt discontinuities between the left edge and the right edge pitch frequencies, the above interpolation is modified as in the case of the encoder. Note that the left edge pitch frequency {circumflex over (ω)}(0) is the right edge pitch frequency of the previous frame.
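The sample-by-sample pitch frequency contour can be sketched in Python as shown below. The sketch assumes the conventional relation between pitch period in samples and radian pitch frequency (2π divided by the period) and covers only the plain linear case, without the modification applied for abrupt pitch discontinuities. All names are hypothetical.

```python
import math

FRAME_LENGTH = 160  # samples per 20 ms frame


def radian_pitch_frequency(pitch_period):
    """Assumed conversion: radians per sample for a period given in samples."""
    return 2.0 * math.pi / pitch_period


def pitch_frequency_contour(w_left, w_right, frame_length=FRAME_LENGTH):
    """Linearly interpolate from the left edge (n=0) to the right edge (n=160)."""
    step = (w_right - w_left) / frame_length
    return [w_left + step * n for n in range(frame_length + 1)]
```

For example, the right edge value of one frame would be stored and reused as `w_left` when the next frame is decoded.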
The index of the highest pitch harmonic within the 4000 Hz band is computed for each subframe by
The LSFs are quantized by a hybrid scalar-vector quantization scheme. The first 6 LSFs are scalar quantized using a combination of intraframe and interframe prediction using 4 bits/LSF. The last 4 LSFs are vector quantized using 7 bits.
The inverse quantization of the first 6 LSFs can be described by the following equations:
When the received frame is inactive, the decoded LSF's are used to update an estimate for background LSF's using the following recursive relationship:
{circumflex over (λ)}bgn(m)=0.98λbgn(m)+0.02{circumflex over (λ)}(m), 0≦m≦9 (123)
In order to improve the performance of the codec 100 in the presence of background noise, the current decoded LSF's are replaced by an interpolated version of the inverse quantized LSF's, background noise LSF's, and a DC value of the background noise LSF's during frames that are not only active but which follow another active frame, i.e.,
{circumflex over (λ)}(m)=0.25{circumflex over (λ)}(m)+0.25λbgn(m)+0.5λbgn,dc(m), 0≦m≦9 (124)
For transitional frames, i.e., frames which are transitioning from active to inactive or vice-versa, the interpolation weights are altered to favor the inverse quantized LSF's, i.e.,
{circumflex over (λ)}(m)=0.5{circumflex over (λ)}(m)+0.25λbgn(m)+0.25λbgn,dc(m), 0≦m≦9 (125)
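Equations (123) through (125) can be illustrated by the following Python sketch. The decision as to which rule applies to a given frame is left to the caller; the function names and the boolean argument are assumptions made for this example.

```python
def update_background_lsf(background, decoded):
    """Recursive background estimate of equation (123)."""
    return [0.98 * b + 0.02 * d for b, d in zip(background, decoded)]


def blend_lsf(decoded, background, background_dc, transitional):
    """Blend decoded LSFs with the background estimate and its DC value.

    transitional=False uses the weights of equation (124): 0.25, 0.25, 0.5.
    transitional=True uses the weights of equation (125): 0.5, 0.25, 0.25.
    """
    if transitional:
        w_dec, w_bg, w_dc = 0.5, 0.25, 0.25
    else:
        w_dec, w_bg, w_dc = 0.25, 0.25, 0.5
    return [
        w_dec * d + w_bg * b + w_dc * c
        for d, b, c in zip(decoded, background, background_dc)
    ]
```

In both cases the three weights sum to one, so the blended LSFs remain within the range spanned by the three input vectors.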
The inverse quantized LSFs are interpolated each subframe by linear interpolation between the current LSFs {{circumflex over (λ)}(m),0≦m≦9} and the previous LSFs {{circumflex over (λ)}prev(m),0≦m≦9}. The interpolated LSFs at each subframe are converted to LP parameters {âm(l),0≦m≦10,1≦l≦8}.
Inverse quantization of the PW nonstationarity measure and the voicing measure is a table lookup operation. If l*R is the index of the composite nonstationarity measure and the voicing measure, the decoded nonstationarity measure is
$$\hat{R}(i) = V_R(l^*_R, i), \qquad 1 \le i \le 5. \tag{126}$$
Here, {VR(l,m), 0≦l≦63,1≦m≦6} is the 64 level, 6-dimensional codebook used for the vector quantization of the composite nonstationarity measure vector. The decoded voicing measure is
{circumflex over (ν)}=VR(l*R,6). (127)
A voicing measure flag is also created based on l*R as follows:
This flag determines the mode of inverse quantization used for PW magnitude.
The decoded nonstationarity measure may have excessive values due to the small number of bits used in encoding this vector. This leads to excessive roughness during highly periodic frames, which is undesirable. To control this problem, during sustained intervals of highly periodic frames the decoded nonstationarity measure is subjected to upper limits, determined based on the decoded voicing measure. If l*R
In addition, for sustained intervals of highly periodic frames, it is desirable to prevent excessive changes in the nonstationarity measure from one frame to the next. This is achieved by allowing a maximum amount of permissible change for each component of the nonstationarity measure. The changes that result in a decrease of the nonstationarity measure are not limited. Rather, the changes that increase the nonstationarity measure are limited by this procedure. If prev denotes the modified nonstationarity measure of the preceding frame, this procedure can be summarized as follows:
The gain vector is inverse quantized by a table look-up operation. It is then linearly transformed to reverse the transformation at the encoder. If l*g is the gain index, the gain values for the even indexed subframes are obtained by
where, {Vg(l,m), 0≦l≦255,1≦m≦4} is the 256 level, 4-dimensional gain codebook.
The gain values for the odd indexed subframes are obtained by linearly interpolating between the even indexed values:
ĝpw(2m−1)=0.5(ĝpw(2m−2)+ĝpw(2m)),1≦m≦4. (136)
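The interpolation of equation (136) can be sketched in Python as follows. The sketch assumes that the gain at subframe 0 is the decoded gain of subframe 8 of the previous frame; the function name and argument layout are hypothetical.

```python
def interpolate_gains(prev_frame_gain, even_gains):
    """Return log-domain gains for subframes 1..8.

    even_gains holds the decoded gains of subframes 2, 4, 6 and 8.
    prev_frame_gain is assumed to be the gain at subframe 0, i.e. the
    value decoded for subframe 8 of the previous frame.
    """
    gains = [prev_frame_gain] + [0.0] * 8
    for m in range(1, 5):
        gains[2 * m] = even_gains[m - 1]
    for m in range(1, 5):
        # Odd subframes are the midpoint of the neighbouring even subframes.
        gains[2 * m - 1] = 0.5 * (gains[2 * m - 2] + gains[2 * m])
    return gains[1:]
```

Because the interpolation is carried out before the conversion to linear units, it is linear in the logarithmic domain.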
The gain values are now expressed in logarithmic units. They are converted to linear units by
ĝ′pw(m)=10ĝ
This gain vector is used to restore the level of the PW vector during the generation of the excitation signal.
Based on the decoded gain vector in the log domain, long term average gain values for inactive frames and active unvoiced frames are computed. These gain averages are useful in identifying inactive frames that were marked as active by the VAD. This can occur due to the hangover employed in the VAD or in the case of certain background noise conditions such as babble noise. By identifying such frames, it is possible to improve the performance of the codec 100 for background noise conditions.
At step 712, a determination is made as to whether rvad_flag_final equals one, lR is less than 8, and the bad frame flag equals false. If the determination is negative, the method proceeds to step 720. If the determination is affirmative, the method proceeds to step 714.
At step 714, a determination is made as to whether nuv is less than 50. If the determination is answered negatively, the method proceeds to step 716 where Gavguv is calculated using a first equation. If the determination is answered affirmatively, the method proceeds to step 718 where a second equation is used to calculate Gavguv.
If the determination at step 704 is negative, the method proceeds to step 706 where a determination is made as to whether nbg is less than 50. If the determination is answered negatively, the method proceeds to step 708 where Gavg-tmpbg is calculated using a first equation. If the determination is answered affirmatively, the method proceeds to step 710 where Gavg-tmpbg is calculated using a second equation.
The steps 708, 710, 716, 718 and 712 proceed to step 720 where Gavgbg is calculated. The method then proceeds to step 722 where the computation ends for Gavgbg and Gavguv.
First an average gain is computed for the entire frame:
Long term average gains for inactive frames which represent the background signal and unvoiced frames are computed according to the method 700.
The decoded voicing measure flag determines the mode of inverse quantization of the PW magnitude vector. If {circumflex over (θ)}flag is a zero, voiced mode is used and if {circumflex over (θ)}flag is a one, unvoiced mode is used.
In the voiced mode, the PW mean is transmitted once per frame and the PW deviation is transmitted twice per frame. Further, interframe predictive quantization is used in this mode. In the unvoiced mode, mean and deviation components are transmitted twice per frame. Prediction is not employed in the unvoiced mode.
In the unvoiced mode, the VAD flag is explicitly encoded using a binary index l*VAD
In the voiced mode, it is implicitly assumed that the frame is active speech. Consequently, it is not necessary to explicitly encode the VAD information. The VAD flag is set to 1, indicating active speech, in the voiced mode:
RVAD_FLAG=1. (140)
It should be noted that the RVAD_FLAG is the VAD flag corresponding to the look-ahead frame where RVAD_FLAG,RVAD_FLAG_DL1,RVAD_FLAG_DL2 denote the VAD flags of the look-ahead frame, current frame and the previous frame respectively. A composite VAD value, RVAD_FLAG_FINAL, is determined for the current frame, based on the above VAD flags, according to the following table 2:
The RVAD_FLAG_FINAL is zero for frames in inactive regions, three for frames in active regions, one prior to onsets and two prior to offsets. Isolated active frames are treated as inactive frames and vice versa.
In the unvoiced mode, the mean vectors for subframes 4 and 8 are inverse quantized as follows:
{circumflex over (D)}m(i)=PDC
Here, {{circumflex over (D)}4 (i),0≦i≦6} and {{circumflex over (D)}8(i),0≦i≦6} are the inverse quantized 7-band subband PW mean vectors, {VPWM
Due to the limited accuracy of PW mean quantization in the unvoiced mode, it is possible to have high values of PW mean at high frequencies. This, in conjunction with an LP synthesis filter which emphasizes high frequencies, can cause excessive high frequency content in the reconstructed speech, leading to poor voice quality. To control this condition, the PW mean values in the uppermost two subbands are attenuated if they are found to be high and the LP synthesis filter has a frequency response with a high frequency emphasis.
The magnitude squared frequency response of the LP synthesis filter is averaged across two bands, 0–2 kHz and 2–4 kHz:
Here, {â8(m)} are the decoded, interpolated LP parameters for the 8th subframe of the current frame, ŵ(160) is the decoded pitch frequency in radians for the 160th sample of the current frame and └ ┘ denotes truncation to integer. A comparison of the low band sum Slb against the high band sum Shb can reveal the degree of high frequency emphasis in the LP synthesis filter.
An average of the PW magnitude in the 1st 5 subbands is computed, for subframes 4 and 8, as follows:
The attenuation of the PW mean in the 6th and 7th subbands is performed according to the flowchart 800 in
At step 808, a determination is made as to whether Slb is less than 0.0724Shb. If the determination is answered negatively the method proceeds to step 810 where a determination is made as to whether 1*R
At step 814, the GavgTh is computed. The method then proceeds to step 816 where a determination is made as to whether nbg is greater than or equal to 50, nuv is greater than or equal to 50, and Gavg is less than GavgTh. If the determination is answered negatively the method proceeds to step 812. If the determination is answered affirmatively the method proceeds to step 818.
At step 818, the slope is calculated. The method then proceeds to step 820 where Gα, Dm (5) and Dm (6) are calculated.
If the determination at step 808 is answered affirmatively, the method proceeds to step 822 where Dm (5) and Dm (6) are calculated. The method then proceeds to step 824.
Steps 806, 812, 820 and 822 all proceed to step 824 where the adjustment for the PW mean ends for subframes 4 and 8.
The deviation vectors for subframes 4 and 8 are inverse quantized as follows:
{circumflex over (F)}m(k)=VPWD
Here, {{circumflex over (F)}4 (k),1≦k≦10} and {{circumflex over (F)}8 (k),1≦k≦10} are the inverse quantized PW deviation vectors. {VPWD
The subband mean vectors are converted to fullband vectors by a piecewise constant approximation across frequency. This requires that the subband edges in Hz are translated to subband edges in terms of harmonic indices. Let the band edges in Hz be defined by the array
Bpw=[1 400 800 1200 1600 2000 2600 3400]. (146)
The band edges can be computed by
The full band PW mean vectors are constructed at subframes 4 and 8 by
The PW magnitude vector can then be reconstructed for subframes 4 and 8 by adding the full band PW mean vector to the deviations vector. In the unvoiced mode, the deviations vector is assumed to be zero at the unselected harmonic indices.
Here, kstartm is computed in the same manner as in the encoder in equation (107).
The PW magnitude vector is reconstructed for the remaining subframes by linearly interpolating between subframes 0 and 4 (for subframes 1, 2 and 3) and between subframes 4 and 8 (for subframes 5, 6 and 7):
In the voiced mode, the mean vector for subframe 8 is inverse quantized based on interframe prediction:
{circumflex over (D)}8 (i)=MAX(0.1, PDC
Here, {{circumflex over (D)}8 (i),0≦i≦6} is the 7-band subband PW mean vector, {VPWM
As in the case of unvoiced frames, if the values of PW mean in the highest two bands are excessive, and this occurs in conjunction with an LP synthesis filter with a high frequency emphasis, attenuation is applied to the PW mean values in the highest two bands. The magnitude squared frequency response of the LP synthesis filter is averaged across two bands, 0–2 kHz and 2–4 kHz, as in the unvoiced mode. An average of the PW magnitude in the 1st 5 subbands is computed for subframe 8, as in the unvoiced mode. Based on these values, the PW mean in the upper two bands is attenuated according to the flowchart shown in
At step 904, a determination is made as to whether Slb is less than 1.33Shb. If the determination is answered negatively, the method proceeds to step 906 where Dm (5) and Dm (6) are calculated using a first equation. If the determination at step 904 is answered affirmatively, the method proceeds to step 908 where Dm (5) and Dm (6) are calculated using a second equation.
Steps 906 and 908 proceed to step 910 where the adjustment of the PW mean for high frequency bands for subframe 8 ends.
A subband mean vector is constructed for subframe 4 by linearly interpolating between subframes 0 and 8:
$$\hat{D}_4(i) = 0.5\left(\hat{D}_0(i) + \hat{D}_8(i)\right), \qquad 0 \le i \le 6. \tag{152}$$
The full band PW mean vectors are constructed at subframes 4 and 8 by
The harmonic band edges {{circumflex over (κ)}m(i), 0≦i≦7} are computed as in the case of unvoiced mode.
The voiced deviation vectors for subframes 4 and 8 are predictively quantized by a multistage vector quantizer with 2 stages. These prediction error vectors are inverse quantized by adding the contributions of the 2 codebooks:
{circumflex over (B)}m(k)=VPWD
Here, {{circumflex over (B)}4 (i),0≦i≦9} and {{circumflex over (B)}8 (i),0≦i≦9} are the PW deviation prediction error vectors for subframes 4 and 8 respectively. {VPWD
$$\hat{F}_m(k) = \hat{B}_m(k) + 0.55\,\hat{F}_{m-4}(k), \qquad 1 \le k \le 10,\ m = 4, 8. \tag{155}$$
It should be noted that {{circumflex over (F)}0(k),1≦k≦10} is the decoded deviations vector from subframe 8 of the previous frame. If the previous frame was unvoiced, this vector is set to zero. The PW magnitude vector can then be reconstructed for subframes 4 and 8 by adding the full band PW mean vector to the deviations vector. The deviations vector is assumed to be zero at the unselected harmonic indices.
Here, kstartm is computed in the same manner as in the encoder in equation (107).
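The predictive reconstruction of the voiced deviations vectors can be sketched in Python as follows. The sketch follows the prediction chain described for the encoder, in which subframe 4 is predicted from subframe 0 and subframe 8 is predicted from subframe 4; the function and argument names are assumptions made for illustration.

```python
PREDICTION_COEFFICIENT = 0.55


def decode_voiced_deviations(error4, error8, prev_deviation, prev_frame_voiced):
    """Return the decoded deviations vectors for subframes 4 and 8.

    error4 and error8 are the inverse quantized prediction error vectors.
    prev_deviation is the decoded deviations vector of subframe 8 of the
    previous frame; it is replaced by zeros if that frame was unvoiced.
    """
    if prev_frame_voiced:
        dev0 = list(prev_deviation)
    else:
        dev0 = [0.0] * len(error4)
    dev4 = [b + PREDICTION_COEFFICIENT * f for b, f in zip(error4, dev0)]
    dev8 = [b + PREDICTION_COEFFICIENT * f for b, f in zip(error8, dev4)]
    return dev4, dev8
```

Resetting the predictor memory after an unvoiced frame keeps the decoder state identical to that of the encoder, which does not use prediction in the unvoiced mode.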
The PW magnitude vector is reconstructed for the remaining subframes by linearly interpolating between subframes 0 and 4 (for subframes 1, 2 and 3) and between subframes 4 and 8 (for subframes 5, 6 and 7):
It should be noted that {IP (i),0<i<60} is the decoded PW magnitude vector from subframe 8 of the previous frame.
In the FDI codec 100, there is no explicit coding of PW phase. The salient characteristics related to the phase, such as the degree of stationarity of the PW (i.e., periodicity of the time domain residual) and the variation of the stationarity as a function of frequency are encoded in the form of the quantized voicing measure {circumflex over (ν)} and the vector nonstationarity measure respectively. A PW phase vector is constructed for each subframe based on this information by a two step process. In this process, the phase of the PW is modeled as the phase of a weighted complex vector sum of a stationary component and a nonstationary component.
In the first step, a stationary component is constructed using the decoded voicing measure {circumflex over (ν)}. First a complex vector is constructed, by a weighted combination of the following: the phase vector of the stationary component of the previous, i.e., m−1th, sub-frame {{overscore (φ)}m−1(k),0≦k≦{circumflex over (K)}m−1}, a random phase vector {γm(k),0≦k≦{circumflex over (K)}m}, and a fixed phase vector that is obtained from a residual voiced pitch pulse waveform {φfix(k),0≦k≦{circumflex over (K)}m}.
In order to combine the previous phase vector, which has {circumflex over (K)}m−1 components, with the random phase vector, which has {circumflex over (K)}m components, it may be necessary to use a modified version of the previous phase vector. If there is no pitch discontinuity between the previous and the current subframes, this modification is simply a truncation (if {circumflex over (K)}m−1>{circumflex over (K)}m) or padding by random phase values (if {circumflex over (K)}m−1<{circumflex over (K)}m). If there is a pitch discontinuity, it is necessary to align the two phase vectors such that the harmonic frequencies corresponding to the vector elements are as close as possible. This may require either interlacing or decimating the previous phase vector. For example, if the pitch period of the current subframe is roughly l-times that of the previous subframe, l{circumflex over (K)}m−1≅{circumflex over (K)}m. In this case, each element of the previous phase vector is interlaced with l−1 random phase values. On the other hand, if the pitch period of the previous subframe is roughly l-times that of the current subframe, {circumflex over (K)}m−1≅l{circumflex over (K)}m. In this case, for each element of the previous phase vector that is retained, the next l−1 elements are dropped. In either case, the modified previous phase vector will have the same dimension as the vector for the current subframe. The modified previous phase vector will be denoted by {ψm−1(k),0≦k≦{circumflex over (K)}m}.
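The modification of the previous phase vector can be illustrated by the following Python sketch. It is a simplified example under several assumptions: vectors are indexed from 0 to K inclusive, the caller decides whether interlacing or decimation applies and supplies the integer factor, and random phase values are drawn uniformly from 0 to π as a placeholder for the sub-interval selection described later. All names are hypothetical.

```python
import math
import random


def modify_previous_phase(prev_phase, k_cur, mode="none", factor=1, rng=None):
    """Return the previous phase vector resized to k_cur + 1 elements.

    mode="interlace": each element is followed by factor - 1 random phases.
    mode="decimate":  after each retained element, factor - 1 are dropped.
    mode="none":      no pitch discontinuity, only truncation or padding.
    """
    rng = rng or random.Random()
    if mode == "interlace":
        out = []
        for phase in prev_phase:
            out.append(phase)
            out.extend(rng.uniform(0.0, math.pi) for _ in range(factor - 1))
    elif mode == "decimate":
        out = list(prev_phase[::factor])
    else:
        out = list(prev_phase)
    target_length = k_cur + 1
    out = out[:target_length]                     # truncate if too long
    missing = target_length - len(out)
    out.extend(rng.uniform(0.0, math.pi) for _ in range(missing))  # pad if short
    return out
```

In both the interlacing and the decimation case, the element that ends up at index k of the result comes from the previous harmonic whose frequency is closest to the k-th harmonic of the current subframe.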
The random phase vector provides a method of controlling the degree of stationarity of the phase of the stationary component. However, to prevent excessive randomization of the phase, the random phase component is not allowed to change every subframe, but is changed after several sub-frames depending on the pitch period. Also, the random phase component at a given harmonic index alternates in sign in successive changes. At the 1st sub-frame in every frame, the rate of randomization for the current frame is determined based on the pitch period. For highly aperiodic frames, the highest rate of randomization is used regardless of the pitch period. The subframes for which the random vector is updated can be summarized as follows:
In addition, abrupt changes in the update rate of the random phase, i.e., from rate 1 in the previous frame to the rate 3 in the current frame or vice-versa are not permitted. Such cases are modified to the rate 2 in the current frame. Controlling the rate at which the phase is randomized is quite important to prevent artifacts in the reproduced signal, especially in the presence of background noise. If the phase is randomized every subframe, it leads to a fluttering of the reproduced signal. This is due to the fact that such a randomization is not representative of natural signals.
The random phase value is determined by a random number generator, which generates uniformly distributed random numbers over a sub-interval of 0–π radians. The sub-interval is determined based on the decoded voicing measure {circumflex over (ν)} and a stationarity measure ζ(m). A weighted sum of the elements of the nonstationarity measure vector for the current frame is computed by
This is a scalar measure of the nonstationarity of the current frame. If θprev is the corresponding value for the previous frame, an interpolated stationarity measure for each subframe is obtained by:
The sub-interval of [0−π] used for phase randomization is [πμ1/2−πμ1], where μ1 is determined based on the following rule depending on the stationarity of the subframe:
As the subframe becomes more stationary (ζ(m) relatively high valued), μ1 takes on lower values, thereby creating smaller values of random phase perturbation. As the stationarity of the subframe decreases, μ1 takes on higher values, resulting in higher values of random phase perturbation. Uniformly distributed random numbers in the interval
[πμ1/2, πμ1]
are used as random phases. In addition, the sign of the random phase at any given harmonic index is alternated from one update to the next, to remove any bias in phase randomization. The weighted phase combination of the random phase, previous phase and fixed phase is performed in two steps. In the 1st step, the random phase and the previous phase are added directly resulting in a randomized previous phase vector:
$$\xi_m(k) = \psi_{m-1}(k) + \gamma_m(k), \qquad 0 \le k \le \hat{K}_m. \tag{161}$$
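The random phase draw and the first combination step can be sketched as follows. This is an illustrative sketch only: the function names and the per-harmonic `signs` list are hypothetical, and the value of μ1 is taken as an input because its selection rule is not reproduced here.

```python
import math
import random


def random_phases(mu1, num_harmonics, signs, rng=random):
    """Draw one random phase per harmonic, uniform in [pi*mu1/2, pi*mu1].

    `signs` holds +1 or -1 for each harmonic. It is flipped in place after
    use so that the sign alternates from one update to the next.
    """
    lo, hi = 0.5 * math.pi * mu1, math.pi * mu1
    gamma = []
    for k in range(num_harmonics):
        gamma.append(signs[k] * rng.uniform(lo, hi))
        signs[k] = -signs[k]
    return gamma


def randomize_previous_phase(prev_phase, gamma):
    """Equation (161): add the random phase to the previous phase."""
    return [p + g for p, g in zip(prev_phase, gamma)]
```

Keeping the sign state outside the function makes the alternation explicit: two successive calls with the same `signs` list produce perturbations of opposite sign at every harmonic index.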
In the 2nd step, the randomized phase vector as well as the fixed phase vector are combined with unity magnitude and a weighted vector addition is performed. This results in a complex vector, which in general does not have unity magnitude:
where α1 is a weighting factor determined based on the quantized voicing measure ν̂ and the stationarity measure ζ(m) computed by:
As the subframe becomes more stationary (ζ(m) relatively high valued), α1 takes on lower values, increasing the contribution of the fixed phase vector. Conversely, as the stationarity of the subframe decreases, α1 takes on higher values, increasing the contribution of the randomized phase. The resulting vector is normalized to unity magnitude as follows:
Also, the phase of this vector is computed to serve as the previous phase during the next subframe:
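The second combination step, the normalization and the phase update can be illustrated with the sketch below. It assumes that the randomized phase is weighted by α1 and the fixed phase by 1 − α1; the text only states that a larger α1 increases the contribution of the randomized phase, so the complementary weight is an assumption, as are the function name and the fallback used when the two unit vectors cancel.

```python
import cmath


def combine_phases(xi, fixed_phase, alpha1, eps=1e-12):
    """Combine randomized and fixed phases into unit-magnitude harmonics.

    Returns (unit_vector, next_previous_phase).
    Assumed weighting: alpha1 * exp(j*xi) + (1 - alpha1) * exp(j*fixed).
    """
    unit, next_prev = [], []
    for x, f in zip(xi, fixed_phase):
        v = alpha1 * cmath.exp(1j * x) + (1.0 - alpha1) * cmath.exp(1j * f)
        mag = abs(v)
        # Opposite unit vectors with equal weights cancel; in that case
        # this sketch falls back to the fixed phase (an assumption).
        u = v / mag if mag > eps else cmath.exp(1j * f)
        unit.append(u)
        next_prev.append(cmath.phase(u))
    return unit, next_prev
```

With `alpha1 = 0` the result follows the fixed phase exactly, and with `alpha1 = 1` it follows the randomized phase, which matches the behaviour described for stationary and nonstationary subframes respectively.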
The above normalized vector is passed through an evolutionary low pass filter (i.e., low pass filtering along each harmonic track) to limit excessive variations, so that a signal having stationary characteristics (in the evolutionary sense) is obtained. Stationarity implies that variations faster than 25 Hz are minimal. However, due to the phase models used and the random phase component, excessive variations are possible. This is undesirable since it produces speech that is rough and lacks naturalness during voiced sounds. The low pass filtering operation overcomes this problem. Delay constraints preclude the use of linear phase FIR filters. Consequently, second order IIR filters are employed. The filter transfer function is given by
The filter parameters are obtained by interpolating between two sets of filter parameters. One set of filter parameters corresponds to a low evolutionary bandwidth and the other to a much wider evolutionary bandwidth. The interpolation factor is selected based on the stationarity measure (ζ(m)), so that the bandwidth of the LPF constructed by interpolation between these two extremes allows the right degree of stationarity in the filtered signal. The filter parameters corresponding to low evolutionary bandwidth are:
The filter parameters corresponding to high evolutionary bandwidth are:
$$a_0^{ap} = 1, \quad a_1^{ap} = -1.523326, \quad a_2^{ap} = 0.6494950,$$
$$b_0^{ap} = 0.395304917, \quad b_1^{ap} = -0.367045695, \quad b_2^{ap} = 0.146146091.$$
The interpolation parameter is computed based on the stationarity measure as follows:
It is desirable to prevent excessive variations in α2 from one subframe to the next, as this would result in large variations in the filter characteristics. A modified interpolation parameter β2 is computed by introducing hysteresis as follows:
Here, β2prev is the modified interpolation parameter β2 computed during the preceding subframe. The interpolated filter parameters are computed by:
The evolutionary low pass filtering operation is represented by
$$\hat{U}_m(k) = U''_m(k) + b_1 U''_{m-1}(k) + b_2 U''_{m-2}(k) - a_1 \hat{U}_{m-1}(k) - a_2 \hat{U}_{m-2}(k), \qquad 0 \le k \le \hat{K}_m,\ 0 \le m \le 8. \tag{172}$$
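One subframe of the evolutionary filtering can be sketched as a second-order recursion applied independently to every harmonic track. The function name is hypothetical, and the numerator coefficient b0 is kept as an explicit parameter; setting b0 = 1 reproduces equation (172) as printed. The sketch assumes that all state vectors have already been aligned to the same length.

```python
def evolutionary_lpf_step(x0, x1, x2, y1, y2, b, a):
    """Filter one subframe along each harmonic track.

    x0      : current input vector (one complex value per harmonic)
    x1, x2  : input vectors of the two preceding subframes
    y1, y2  : output vectors of the two preceding subframes
    b, a    : (b0, b1, b2) and (a0, a1, a2), with a0 assumed equal to 1
    """
    b0, b1, b2 = b
    _, a1, a2 = a
    return [
        b0 * x0[k] + b1 * x1[k] + b2 * x2[k] - a1 * y1[k] - a2 * y2[k]
        for k in range(len(x0))
    ]
```

After each call the caller would shift the state, i.e. the current input and output become the first delayed vectors for the next subframe.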
It should be noted that, if there is a pitch discontinuity, the filter state vectors, (i.e., U″m−1(k),U″m-2(k),Ûm−1(k),Ûm-2(k)) can require truncation, interlacing and/or decimation to align the vector elements such that the harmonic frequencies are paired with minimal discontinuity. This procedure is similar to that described for the previous phase vector above.
The phase spectrum of the resulting stationary component vector Ûm(k) has the desired evolutionary characteristics, consistent with the stationary component of the residual signal at the encoder 100A.
In the second step of phase construction, a nonstationary PW component is constructed, also using the decoded voicing measure ν̂. The nonstationary component is expected to have some correlation with the stationary component. The correlation is higher for periodic signals and lower for aperiodic signals. To take this into account, the nonstationary component is constructed by a weighted addition of the stationary component and a complex random signal. The random signal has unity magnitude at all the harmonics.
In other words, only the phase of the random signal is randomized. In addition, the RMS value of the random signal is normalized such that it is equal to the RMS value of the stationary component, computed by:
The weighting factor used in combining the stationary and noise components is computed based on the voicing measure and the nonstationarity measure quantization index by:
The weighting factor increases with the periodicity of the signal. Thus, for periodic frames, the correlation between the stationary and nonstationary components is higher than for aperiodic frames. In addition, this correlation is expected to decrease with increasing frequency. This is incorporated by decreasing the weighting factor with increasing harmonic index:
Thus, the weighting factor decreases linearly from β3 at k=0 to β3−(0.5+0.5ν̂)β3 at k=K̂m. The slope of this decrease is higher for aperiodic frames; i.e., for aperiodic frames the correlation with the stationary component starts at a lower value and decreases more rapidly than for periodic frames. The nonstationary component is then computed by:
$$\hat{R}_m(k) = \partial_3(k)\,\hat{U}_m(k) + [1 - \partial_3(k)]\,G'_S\,N'_m(k), \qquad 0 \le k \le \hat{K}_m. \tag{176}$$
Here {N′m(k), 0≦k≦K̂m} is the unity magnitude complex random signal and {R̂m(k), 0≦k≦K̂m} is the nonstationary PW component.
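The construction of the nonstationary component can be sketched as follows. The linear weight profile follows the end points stated above; the RMS definition (root of the mean squared magnitude over the harmonics) and the uniform distribution of the random phase over [−π, π) are assumptions, and all names are hypothetical.

```python
import cmath
import math
import random


def harmonic_weights(beta3, v_hat, K):
    """Weights falling linearly from beta3 at k = 0 to
    beta3 - (0.5 + 0.5 * v_hat) * beta3 at k = K (K >= 1)."""
    slope = 0.5 + 0.5 * v_hat
    return [beta3 * (1.0 - slope * k / K) for k in range(K + 1)]


def nonstationary_component(U, beta3, v_hat, rng=random):
    """Weighted sum of the stationary component U and a unit-magnitude
    random signal scaled to the RMS value of U, as in equation (176)."""
    K = len(U) - 1
    g_s = math.sqrt(sum(abs(u) ** 2 for u in U) / len(U))  # assumed RMS
    w = harmonic_weights(beta3, v_hat, K)
    noise = [cmath.exp(1j * rng.uniform(-math.pi, math.pi)) for _ in U]
    return [w[k] * U[k] + (1.0 - w[k]) * g_s * noise[k] for k in range(K + 1)]
```

A weight of 1 at a harmonic reproduces the stationary component there, while a weight of 0 leaves only the scaled random signal.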
The stationary and nonstationary PW components are combined by a weighted sum to construct the complex PW vector. The subband nonstationarity measure determines the frequency dependent weights that are used in this weighted sum. The weights are determined such that the ratio of the RMS value of the nonstationary component to that of the stationary component is equal to the decoded nonstationarity measure within each subband. From equation 90, the band edges in Hz are defined by the array
Brs=[1 400 800 1600 2400 3400].
As in the case of the encoder 100A, the subband edges in Hz are translated to subband edges in terms of harmonic indices such that the ith subband contains harmonics with indices {θ̂(i−1)≦k<θ̂(i), 1≦i≦5}:
The energy in each subband is computed by averaging the squared magnitude of each harmonic within the subband. For the stationary component, the subband energy distribution for the mth subframe is computed by
For the nonstationary component, the subband energy distribution for the mth subframe is computed by
The subband weighting factors are computed by {θ̂(i−1)≦k<θ̂(i), 1≦i≦5}
Since the bandedges exclude out-of-band components, it is necessary to explicitly initialize the weighting factors for the out-of-band components:
The complex PW vector can now be constructed as a weighted combination of the complex stationary and complex nonstationary components:
$$\hat{V}'_m(k) = \hat{U}_m(k) + \hat{R}_m(k)\,G_{sb}(k), \qquad 0 \le k \le \hat{K}_m,\ 1 \le m \le 8. \tag{182}$$
However, it should be noted that this vector will have the desired phase characteristics, but not the decoded PW magnitude. To obtain a PW vector with the decoded magnitude and the desired phase, it is necessary to normalize the above vector to unity magnitude and multiply it with the decoded magnitude vector:
This vector is the reconstructed (normalized) PW magnitude vector for subframe m.
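The subband weighting, the combination of equation (182) and the magnitude restoration can be illustrated with the sketch below. The gain in each subband is derived directly from the stated requirement that the RMS ratio of the weighted nonstationary component to the stationary component equal the decoded nonstationarity measure. The harmonic-index band edges are taken as an input, and the default out-of-band gain of 0 is a placeholder assumption; all names are hypothetical.

```python
import math


def subband_energy(x, lo, hi):
    """Average squared magnitude of the harmonics with lo <= k < hi."""
    band = x[lo:hi]
    return sum(abs(v) ** 2 for v in band) / len(band) if band else 0.0


def subband_gains(U, R, edges, nonstat, out_of_band_gain=0.0):
    """Per-harmonic gains G_sb(k). edges[i-1] <= k < edges[i] is subband i."""
    g = [out_of_band_gain] * len(U)
    for i in range(1, len(edges)):
        lo, hi = edges[i - 1], edges[i]
        e_s = subband_energy(U, lo, hi)
        e_n = subband_energy(R, lo, hi)
        gain = nonstat[i - 1] * math.sqrt(e_s / e_n) if e_n > 0.0 else 0.0
        for k in range(lo, min(hi, len(U))):
            g[k] = gain
    return g


def combine(U, R, g):
    """Equation (182): stationary plus weighted nonstationary component."""
    return [u + r * gk for u, r, gk in zip(U, R, g)]


def restore_magnitude(V, magnitude, eps=1e-12):
    """Normalize each harmonic to unit magnitude, then apply the decoded one."""
    return [m * v / abs(v) if abs(v) > eps else 0j for v, m in zip(V, magnitude)]
```

The last step keeps only the phase of the combined vector, so the decoded magnitude vector fully determines the spectral envelope of the result.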
The inverse quantized PW vector may have high valued components outside the band of interest. Such components can deteriorate the quality of the reconstructed signal and should be attenuated. At the high frequency end, harmonics above 3400 Hz are attenuated. At the low frequency end, only the DC component (i.e., the 0 Hz component) is attenuated. The attenuation characteristic is linear from 1 at the bandedge to 0 at 4000 Hz. The attenuation process can be specified by:
where, kum is the index of the lowest pitch harmonic that falls above 3400 Hz. It is obtained by
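The attenuation described above can be sketched as follows. The sketch assumes an 8 kHz sampling rate (consistent with 160 samples per 20 ms frame), takes the harmonic frequency of index k as k·fs divided by the pitch period in samples, and treats the DC attenuation as a caller-supplied gain because its value is not given here. The function and parameter names are hypothetical.

```python
import math


def attenuate_out_of_band(pw, pitch_period, dc_gain,
                          fs=8000.0, f_edge=3400.0, f_stop=4000.0):
    """Attenuate the DC harmonic and the harmonics above f_edge.

    Returns (attenuated_vector, k_um), where k_um is the index of the
    lowest harmonic that falls above f_edge.
    """
    k_um = math.floor(f_edge * pitch_period / fs) + 1
    out = list(pw)
    out[0] = dc_gain * pw[0]
    for k in range(k_um, len(pw)):
        f = k * fs / pitch_period
        # Linear roll-off: 1 at f_edge, 0 at f_stop.
        out[k] = pw[k] * max(0.0, (f_stop - f) / (f_stop - f_edge))
    return out, k_um
```

For a pitch period of 40 samples the harmonics are spaced 200 Hz apart, so the harmonic at 3400 Hz (index 17) is left untouched and index 18 at 3600 Hz is the first one attenuated.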
Certain types of background noise can result in LP parameters that correspond to sharp spectral peaks. Examples of such noise are babble noise and an interfering talker. Peaky spectra during background noise are undesirable since they lead to a highly dynamic reconstructed noise that interferes with the speech signal. This can be mitigated by a mild degree of bandwidth broadening that is adapted based on the RVAD_FLAG_FINAL computed according to table 3.6.3-3. Bandwidth broadening is also controlled by the nonstationarity index. If the index takes on values above 7, indicating a voiced frame, no bandwidth broadening is applied. For values of the nonstationarity index of 7 or lower, a bandwidth broadening factor is selected jointly with the RVAD_FLAG_FINAL according to the following equation:
φ = Φ(2·RVAD_FLAG_FINAL + VM_INDEX) (186)
where VM_INDEX is related to l*R as follows:
VM_INDEX = MIN(3, MAX(0, (l*R − 5))) (187)
and the 9-dimensional array Φ is defined as follows in Table 3:
Bandwidth broadening is performed only during intervals of voice inactivity. Bandwidth expansion increases as the frame becomes more unvoiced. Onset and offset frames have a lower degree of bandwidth broadening compared to frames during voice inactivity. Bandwidth expansion is applied to interpolated LPC parameters as follows:
$$\hat{a}'_m(j) = \hat{a}_m(j)\,\varphi^{m}, \qquad 0 \le m \le 10,\ 1 \le j \le 8. \tag{188}$$
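The selection of the broadening factor and its application can be sketched as follows. The table contents are not reproduced here, so `phi_table` is a caller-supplied 9-element placeholder. The expansion scales each LP coefficient by the factor raised to the power of its coefficient index, which is the conventional form of bandwidth expansion and is assumed here. The function names are hypothetical.

```python
def vm_index(l_r):
    """Equation (187): clamp (l_r - 5) to the range 0..3."""
    return min(3, max(0, l_r - 5))


def broadening_factor(phi_table, rvad_flag_final, l_r):
    """Equation (186): look up the factor in the 9-element table."""
    return phi_table[2 * rvad_flag_final + vm_index(l_r)]


def expand_bandwidth(a, phi):
    """Scale coefficient a[i] by phi**i; a[0] is the leading coefficient."""
    return [c * phi ** i for i, c in enumerate(a)]
```

A factor of exactly 1 leaves the coefficients unchanged, which corresponds to the case where no broadening is applied.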
The level of the PW vector is restored to the RMS value represented by the decoded PW gain. Due to the quantization process, the RMS value of the decoded PW vector is not guaranteed to be unity. To ensure that the right level is achieved, it is necessary to first normalize the PW by its RMS value and then scale it by the PW gain. The RMS value is computed by
The PW vector sequence is scaled by the ratio of the PW gain and the RMS value for each subframe:
The excitation signal is constructed from the PW using an interpolative frequency domain synthesis process. This process is equivalent to linearly interpolating the PW vectors bordering each subframe to obtain a PW vector for each sample instant, and performing a pitch cycle inverse DFT of the interpolated PW to compute a single time-domain excitation sample at that sample instant.
The interpolated PW represents an aligned pitch cycle waveform. This waveform is to be evaluated at a point in the pitch cycle (i.e., pitch cycle phase), advanced from the phase of the previous sample by the radian pitch frequency. The pitch cycle phase of the excitation signal at the sample instant determines the time sample to be evaluated by the inverse DFT. Phases of successive excitation samples advance within the pitch cycle by phase increments determined by the linearized pitch frequency contour.
The computation of the nth sample of the excitation signal in the mth sub-frame of the current frame can be conceptually represented by
where, θ(20(m−1)+n) is the pitch cycle phase at the nth sample of the excitation in the mth sub-frame. It is recursively computed as the sum of the pitch cycle phase at the previous sample instant and the pitch frequency at the current sample instant:
$$\theta(20(m-1)+n) = \theta(20(m-1)+n-1) + \hat{\omega}(20(m-1)+n), \qquad 0 \le n \le 20 \tag{192}$$
This is essentially a numerical integration of the sample-by-sample pitch frequency track to obtain the sample-by-sample pitch cycle phase. It is also possible to use trapezoidal integration of the pitch frequency track to get a more accurate and smoother phase track by
$$\theta(20(m-1)+n) = \theta(20(m-1)+n-1) + 0.5\,[\hat{\omega}(20(m-1)+n-1) + \hat{\omega}(20(m-1)+n)], \qquad 0 \le n \le 20 \tag{193}$$
In either case, the first term circularly shifts the pitch cycle so that the desired pitch cycle phase occurs at the current sample instant. The second term results in the exponential basis functions for the pitch cycle inverse DFT.
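The two phase-integration rules of equations (192) and (193) can be sketched as a running sum over the pitch frequency track. The function name and the convention that `omega[0]` holds the pitch frequency at the previous sample instant are assumptions made for this illustration.

```python
def phase_track(theta_prev, omega, trapezoidal=False):
    """Integrate a sample-by-sample pitch frequency track into phase.

    theta_prev : pitch cycle phase at the previous sample instant
    omega      : omega[0] is the frequency at the previous sample,
                 omega[1:] are the frequencies at the samples to compute
    Returns one phase value per entry of omega[1:].
    """
    theta = []
    t = theta_prev
    for n in range(1, len(omega)):
        if trapezoidal:
            t += 0.5 * (omega[n - 1] + omega[n])   # equation (193)
        else:
            t += omega[n]                          # equation (192)
        theta.append(t)
    return theta
```

For a constant pitch frequency both rules give the same phase track; they differ only while the frequency contour is changing.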
The approach above is a conceptual description of the excitation synthesis operation. Direct implementation of this approach is possible, but is computationally intensive. The process can be simplified by using a radix-2 FFT to compute an oversampled pitch cycle and by performing interpolations in the time domain. These techniques have been employed to achieve a computationally efficient implementation.
The resulting excitation signal {ê(n),0≦n≦160} is processed by an all-pole LP synthesis filter, constructed using the decoded and interpolated LP parameters. The first half of each sub-frame is synthesized using the LP parameters at the left edge of the sub-frame and the second half by the LP parameters at the right edge of the sub-frame. This ensures that locally optimal LP parameters are used to reconstruct the speech signal. The transfer function of the LP synthesis filter for the first half of the mth subframe is given by
and for the second half
The signal reconstruction is expressed by
The resulting signal {ŝ(n),0≦n≦160} is the reconstructed speech signal.
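The half-subframe switching of the LP parameters can be illustrated with the sketch below. It assumes the common convention in which the synthesis filter is 1/A(z) with A(z) = 1 + a(1)z⁻¹ + … + a(P)z⁻ᴾ, so that each output sample is the excitation minus a weighted sum of past outputs; the sign convention and the function name are assumptions.

```python
def lp_synthesize_subframe(exc, a_left, a_right, history):
    """All-pole synthesis of one subframe.

    exc      : excitation samples of the subframe
    a_left   : [1, a(1), ..., a(P)] used for the first half
    a_right  : [1, a(1), ..., a(P)] used for the second half
    history  : at least P past output samples, most recent last
    """
    order = len(a_left) - 1
    mem = list(history[-order:]) if order > 0 else []
    half = len(exc) // 2
    out = []
    for n, e in enumerate(exc):
        a = a_left if n < half else a_right
        s = e - sum(a[j] * mem[-j] for j in range(1, order + 1))
        out.append(s)
        mem.append(s)
    return out
```

The filter memory is carried across the half-subframe boundary, so only the coefficients change at the midpoint and the output remains continuous.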
The reconstructed speech signal is processed by an adaptive postfilter to reduce the audibility of the effects of modeling and quantization. A pole-zero postfilter with an adaptive tilt correction is employed as disclosed in “Adaptive Postfiltering for Quality Enhancement of Coded Speech”, IEEE Transactions on Speech and Audio Processing, Vol. 3, No. 1, pages 59–71, January 1995 by J. H. Chen and A. Gersho which is incorporated by reference in its entirety.
The postfilter emphasizes the formant regions and attenuates the valleys between formants. As during speech reconstruction, the first half of the sub-frame is postfiltered by parameters derived from the LPC parameters at the left edge of the sub-frame. The second half of the sub-frame is postfiltered by the parameters derived from the LPC parameters at the right edge of the sub-frame. For the mth sub-frame, these two postfilter transfer functions are specified respectively by
and
The pole-zero postfiltering operation for the first half of the sub-frame is represented by
The pole-zero postfiltering operation for the second half of the sub-frame is represented by
where, αpf and βpf are the postfilter parameters. These satisfy the constraint 0≦βpf<αpf≦1. A typical choice for these parameters is αpf=0.875 and βpf=0.6.
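A pole-zero postfilter of this kind can be sketched as follows. The sketch assumes the conventional form A(z/βpf)/A(z/αpf) from the cited Chen and Gersho paper, in which the jth LP coefficient is scaled by βpf to the power j in the numerator and by αpf to the power j in the denominator. It starts from a zero filter state, and the function name is hypothetical.

```python
def postfilter(x, a, alpha_pf=0.875, beta_pf=0.6):
    """Apply the assumed pole-zero postfilter A(z/beta)/A(z/alpha).

    x : input samples
    a : LP coefficients [1, a(1), ..., a(P)]
    """
    order = len(a) - 1
    num = [a[j] * beta_pf ** j for j in range(order + 1)]
    den = [a[j] * alpha_pf ** j for j in range(order + 1)]
    y = []
    for n in range(len(x)):
        acc = sum(num[j] * x[n - j] for j in range(order + 1) if n - j >= 0)
        acc -= sum(den[j] * y[n - j] for j in range(1, order + 1) if n - j >= 0)
        y.append(acc)
    return y
```

Because βpf is smaller than αpf, the zeros lie closer to the origin than the poles, so the filter retains a reduced version of the formant peaks of 1/A(z).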
The postfilter introduces a frequency tilt with a mild low pass characteristic to the spectrum of the filtered speech, which leads to a muffling of postfiltered speech. This is corrected by a tilt-correction mechanism, which estimates the spectral tilt introduced by the postfilter and compensates for it by a high frequency emphasis. A tilt correction factor is estimated as the first normalized autocorrelation lag of the impulse response of the postfilter. Let νpf1 and νpf2 be the two tilt correction factors computed for the two postfilters in equations 197 and 198, respectively. Then the tilt correction operations for the two half sub-frames are as follows:
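The tilt correction factor itself can be estimated as sketched below, given a truncated impulse response of the postfilter. The truncation length is left to the caller and the function name is hypothetical.

```python
def tilt_factor(h):
    """First normalized autocorrelation lag r(1)/r(0) of the sequence h."""
    r0 = sum(v * v for v in h)
    r1 = sum(h[n] * h[n + 1] for n in range(len(h) - 1))
    return r1 / r0 if r0 > 0.0 else 0.0
```

A positive value indicates a low pass tilt, which is the case that calls for high frequency emphasis.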
The postfilter alters the energy of the speech signal. Hence it is desirable to restore the RMS value of the speech signal at the postfilter output to the RMS value of the speech signal at the postfilter input. The RMS value of the postfilter input speech for the mth sub-frame is computed by:
The RMS value of the postfilter output speech for the mth sub-frame is computed by:
An adaptive gain factor is computed by low pass filtering the ratio of the RMS value at the post filter input to the RMS value at the post filter output:
The postfiltered speech is scaled by the gain factor as follows:
$$s_{out}(20(m-1)+n) = g_{pf}(20(m-1)+n)\,\hat{s}_{pf}(20(m-1)+n), \qquad 0 \le n < 20,\ 0 < m$$
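The gain restoration for one sub-frame can be illustrated with the sketch below. It assumes that the low pass filtering of the RMS ratio is a first-order recursion updated at every sample; the recursion form, the smoothing constant of 0.9 and the function names are all assumptions made for illustration.

```python
import math


def rms(x):
    """Root mean square value of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def scale_postfiltered(s_in, s_pf, g_prev, smoothing=0.9):
    """Scale postfiltered samples so their level tracks the input level.

    s_in   : postfilter input samples of the sub-frame
    s_pf   : postfilter output samples of the sub-frame
    g_prev : gain factor at the end of the previous sub-frame
    Returns (scaled_samples, last_gain).
    """
    r_out = rms(s_pf)
    target = rms(s_in) / r_out if r_out > 0.0 else 1.0
    g = g_prev
    out = []
    for v in s_pf:
        # Assumed first-order low pass filtering of the RMS ratio.
        g = smoothing * g + (1.0 - smoothing) * target
        out.append(g * v)
    return out, g
```

Smoothing the gain rather than applying the ratio directly avoids level discontinuities at sub-frame boundaries.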
The resulting scaled postfiltered speech signal {sout(n),0<n<160} constitutes one frame (20 ms) of output speech of the decoder corresponding to the received 80-bit packet.
Those skilled in the art can now appreciate from the foregoing description that the broad teachings of the present invention can be implemented in a variety of forms. Therefore, while this invention has been described in connection with particular examples thereof, the true scope of the invention should not be so limited since other modifications will become apparent to the skilled practitioner upon a study of the drawings, specification and the following claims.
This application claims benefit under 35 U.S.C. §119(e) from U.S. Provisional Patent Application Ser. No. 60/268,327 filed on Feb. 13, 2001, and from U.S. Provisional Patent Application Ser. No. 60/314,288 filed on Aug. 23, 2001, the entire contents of both of said provisional applications being incorporated herein by reference.