The technical field of this invention is speech data coding and decoding.
CCITT Recommendation G.726 is a widely used, early speech coding standard for telephony. Packet loss handling has become common in digital and packet communication systems such as VoIP (Voice over Internet Protocol) and other packet networks. However, the current CCITT Recommendation G.726 does not provide any mechanism for packet loss recovery. Speech quality therefore degrades upon packet loss, with objectionable artifacts and glitches in the speech. These glitches and artifacts are hard to compensate for in any subsequent packet loss algorithm or system such as G.711. There is therefore a need to minimize these glitches so that a G.726 codec functions properly in packet loss scenarios.
In a CCITT Recommendation G.726 system the encoder and decoder states are coupled. During packet loss, the decoder loses its ability to track the encoder state. In addition, the tone detector is somewhat ad-hoc and further degrades the state tracking ability of the decoder. Upon tone detection, the predictor poles and zeros are set to zero. The tone detector also falsely detects tones in normal speech signals. A frame loss thus makes it very difficult for the decoder to track the encoder, because the tone detector may have set the predictor poles and zeros to zero. In this state, the codec exhibits glitch artifacts in the output speech.
A G.726 codec is based on Adaptive Differential Pulse Code Modulation (ADPCM) and operates at 16, 24, 32 or 40 Kbits/sec. The codec converts 64 Kbits/sec A-law or μ-law pulse code modulated (PCM) channels to and from 16, 24, 32 or 40 Kbits/sec channels using ADPCM transcoding. The heart of the codec is the sign-sign (SS) and leaky LMS algorithm.
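The relationship between these channel rates and the ADPCM code word size can be checked with a little arithmetic. The sketch below assumes the usual 8 kHz telephony sampling rate, which the text does not state explicitly; the names `SAMPLE_RATE_HZ`, `bits_per_sample` and `CODE_BITS` are illustrative only.

```python
# Assumption: 8 kHz telephony sampling, so 64 kbit/s PCM carries 8 bits per sample.
SAMPLE_RATE_HZ = 8_000
ADPCM_RATES_BPS = (16_000, 24_000, 32_000, 40_000)


def bits_per_sample(rate_bps, sample_rate_hz=SAMPLE_RATE_HZ):
    """Return the number of bits carried per sample at a given channel rate."""
    if rate_bps % sample_rate_hz != 0:
        raise ValueError("rate is not a whole number of bits per sample")
    return rate_bps // sample_rate_hz


# ADPCM code word width for each of the four rates named in the text.
CODE_BITS = {rate: bits_per_sample(rate) for rate in ADPCM_RATES_BPS}
```

Under that assumption the four rates correspond to code words of 2, 3, 4 and 5 bits per sample, against 8 bits per sample for the PCM channel.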
This invention changes the G.726 decoding process to control glitches in the output speech upon packet loss. This invention does not change the encoder, thus maintaining compatibility with existing deployed encoders. This invention has only a minor impact on data processing load and memory, handles the glitches upon packet loss to a great extent, maintains the perceived quality of the output speech and minimizes glitch artifacts. This invention controls the dynamics of the decoder, such as excitation, step size and leak factors, during packet loss. This controls these artifacts and produces a better Mean Opinion Score (MOS) for the output speech.
The G.726 standard uses a sign-sign algorithm (SSA). In the sign-sign algorithm the adaptation is based on the sign of the regressor and the sign of the error signal. The SSA is given by:
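$$H(n+1) = H(n) + \mu\,\operatorname{sgn}\{X(n)\}\,\operatorname{sgn}\{e(n)\}, \tag{1}$$
$$e(n) = d(n) - H(n)^{T} X(n), \tag{2}$$
$$X(n) = [\,x(n)\;\; x(n-1)\;\; \ldots\;\; x(n-N+1)\,]^{T}, \tag{3}$$
$$\operatorname{sgn}\{X(n)\} = [\,\operatorname{sgn}\{x(n)\}\;\; \operatorname{sgn}\{x(n-1)\}\;\; \ldots\;\; \operatorname{sgn}\{x(n-N+1)\}\,]^{T}, \tag{4}$$
where: $x(n)$ is the reference input at time $n$; $d(n)$ is the desired response; $N$ is the number of filter taps; $X(n) \in \mathbb{R}^{N}$ is the input regressor; $H(n) \in \mathbb{R}^{N}$ is the vector of filter coefficients; $e(n)$ is the estimation error; and $\mu$ is the step size. $\operatorname{sgn}\{\cdot\}$ is the sign function.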
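One iteration of the sign-sign update in equations (1) to (4) can be sketched as follows. This is an illustrative floating-point sketch, not the fixed-point arithmetic of G.726; the function names are hypothetical, and the choice sgn(0) = 0 is an assumption of this sketch.

```python
def sgn(v):
    """Sign function. Assumption of this sketch: sgn(0) = 0."""
    return (v > 0) - (v < 0)


def ssa_step(h, x_reg, d, mu):
    """One sign-sign update.

    h     -- current coefficients H(n), a list of N floats
    x_reg -- regressor X(n) = [x(n), x(n-1), ..., x(n-N+1)]
    d     -- desired response d(n)
    mu    -- step size
    Returns (H(n+1), e(n)).
    """
    # Equation (2): estimation error from the inner product H(n)^T X(n).
    e = d - sum(hi * xi for hi, xi in zip(h, x_reg))
    s_e = sgn(e)
    # Equation (1): each tap moves by +/- mu depending only on two signs.
    h_next = [hi + mu * sgn(xi) * s_e for hi, xi in zip(h, x_reg)]
    return h_next, e
```

Because only the signs of the regressor and the error enter the update, every tap changes by exactly `mu`, `-mu` or zero per step, regardless of the signal magnitudes.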
The sign-sign and leaky least mean squares (LMS) algorithms are the hardest of the LMS family to analyze because of their two sign nonlinearities. The signed regressor algorithm is very sensitive to the persistence-of-excitation conditions, which are not equivalent to the persistence-of-excitation conditions for non-sign LMS. There is no excitation during packet loss, so these algorithms tend to diverge upon packet loss. Because of these complexities, divergence and stability issues are more prominent in the G.726 ADPCM codec than with the usual LMS algorithm.
Tone detection is based on a threshold of the predictor pole amplitude (a2) and the quantization error. This frequently produces false detections. According to the prior art, after tone detection the poles and zeros of the predictor are set to zero. It is very difficult to resynchronize the encoder and decoder states after packet loss if this reset to zero happened during the lost frame.
A significant improvement in the glitch appearance occurs with removal of this tone detection and reset of the predictors to zero. But this change would require new tone detections at both decoder and encoder. Encoder changes would not preserve compatibility with existing installations.
The current form of the G.726 codec does not support any packet loss concealment procedure. Because of the encoder-decoder state coupling and the ad-hoc tone detector that resets the predictor upon tone detection, the decoder loses state tracking synchronization with the encoder on packet loss. In this non-synchronous operation of the codec, the predictor at the decoder generally takes several frames to resynchronize with the encoder. The decoder also typically hits the hard limits on the parameters used to control codec stability. This process causes glitches in the output speech supplied to the end user.
This invention controls the regressor and some internal states of the decoding process to minimize the glitches in the output speech upon packet loss. This invention produces glitch minimization and better output speech quality, in terms of Mean Opinion Score (MOS), for the CCITT Recommendation G.726 ADPCM-based speech coding standard upon packet loss.
The least mean squares (LMS) algorithm in the G.726 standard is a sign-sign, leaky algorithm adapting a predictor with two poles and six zeros. This prior art predictor needs persistent excitation to operate stably. In this invention, during packet loss the decoder is excited by the pitch quantized inputs of the previous packet. The leak factor and the step size of the predictor are controlled in two steps for better performance and stability during and just after packet loss. In this two-step control: step one changes the leak factor and step size during the packet loss; and step two changes the leak factor and step size upon reception of the very first good packet, for the duration of one pitch period of overlap. Similarly, the scale factor of the speed control adaptation is controlled in two steps during the packet loss.
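The text does not spell out how the excitation is derived from the previous packet. One plausible reading, sketched here purely as an assumption, is to repeat the last pitch period of the previous packet's quantized difference signal for the duration of the lost packet. The function name, the argument names and the idea that the pitch period is already known are all hypothetical.

```python
def repeat_last_pitch_period(prev_dq, pitch_period, n_lost):
    """Build a substitute excitation for a lost packet.

    prev_dq      -- quantized difference samples of the last good packet
    pitch_period -- pitch period in samples (assumed to be estimated elsewhere)
    n_lost       -- number of samples to synthesize for the lost packet
    """
    if not 0 < pitch_period <= len(prev_dq):
        raise ValueError("pitch period must fit inside the previous packet")
    # Take the most recent pitch cycle and repeat it periodically.
    cycle = prev_dq[-pitch_period:]
    return [cycle[i % pitch_period] for i in range(n_lost)]
```

Feeding such a periodic signal to the decoder keeps the sign-sign adaptation excited during the loss instead of leaving the regressor empty.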
These changes to the existing G.726 decoder add very marginally to the data processing and the memory requirements of the existing algorithm. The MOS results of this invention are better than the existing G.726 decoder upon packet loss.
These and other aspects of this invention are illustrated in the drawings, in which:
The G.726 standard predictor algorithm is sign-sign and hence its stability and operating conditions are sensitive to the persistency of the excitation. The standard typically uses regressor excitation.
Block 504 adaptively operates employing the first alternative parameters. Decision block 505 determines whether a first good packet is received. If a first good packet has not been received (No in decision block 505), then the invention repeats the adaptive predictor operation of block 504 using the first alternative parameters as before.
This loop repeats until decision block 505 detects the first good packet following the packet loss (decision block 501). If the current packet is the first good packet following packet loss (Yes at decision block 505), then block 506 sets the second alternate parameters. Values for these parameters for a preferred embodiment are shown in Table 1 below. For this first good packet, the parameters are set to values intermediate between the first alternate values and the default values for one pitch period, to smooth the transition from lost packet to good packet.
Block 507 adaptively operates using the second alternative parameters for this first good packet following packet loss. Block 508 then sets the default (normal execution value) parameters. Values for these parameters for a preferred embodiment are shown in Table 1. Normal operation continues via continue block 509.
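The selection of parameter sets described for blocks 504 to 508 can be sketched as a small per-packet schedule. This is a simplified illustration: it works at packet granularity, whereas the text applies the second alternate parameters for one pitch period of the first good packet, and it uses symbolic labels rather than the Table 1 values. The names `ParamSet` and `schedule` are hypothetical.

```python
from enum import Enum


class ParamSet(Enum):
    DEFAULT = "default"                    # normal execution values (block 508)
    FIRST_ALTERNATE = "first_alternate"    # used during packet loss (block 504)
    SECOND_ALTERNATE = "second_alternate"  # first good packet after loss (blocks 506-507)


def schedule(loss_flags):
    """Map a sequence of per-packet loss flags to the parameter set used for each packet."""
    selected = []
    prev_lost = False
    for lost in loss_flags:
        if lost:
            selected.append(ParamSet.FIRST_ALTERNATE)
        elif prev_lost:
            # First good packet following a loss: intermediate values.
            selected.append(ParamSet.SECOND_ALTERNATE)
        else:
            selected.append(ParamSet.DEFAULT)
        prev_lost = lost
    return selected
```

For example, a good packet, two lost packets and two good packets select the default, first alternate, first alternate, second alternate and default parameter sets in that order.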
The G.726 standard has a predictor with two poles and six zeros, which is adapted by the sign-sign leaky least mean squares algorithm. In this invention, the parameters of this adaptation are controlled during packet loss and are changed as shown in Table 1. As shown in Table 1, the quantizer scale factor has a smaller value during the packet loss and during the one pitch period of the first good packet received. The reduction in the quantizer scale factor helps reduce the quantization error and drift. The values of the quantizer scale factor and the adaptation speed filters for one example of the two steps are shown in Table 1.
In the preferred embodiment these quantities are computed using the following equations. The quantization scale factor adaptation:
$$y_u'(k) = (1 - 2^{-5})\,y(k) + 2^{-5}\,W[I(k)] \tag{6}$$
Adaptation Speed Control:
$$d_{ms}'(k) = (1 - 2^{-5})\,d_{ms}(k-1) + 2^{-5}\,F[I(k)] \tag{7}$$
$$d_{ml}'(k) = (1 - 2^{-7})\,d_{ml}(k-1) + 2^{-7}\,F[I(k)] \tag{8}$$
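Equations (6) to (8) share one form: a first-order leaky average with a power-of-two gain. The floating-point sketch below illustrates that form only. W[I(k)] and F[I(k)] are passed in as already-evaluated numbers because their definitions are not given here, and the function names are hypothetical.

```python
def leaky_average(prev, target, shift):
    """Return (1 - 2**-shift) * prev + 2**-shift * target."""
    gain = 2.0 ** -shift
    return (1.0 - gain) * prev + gain * target


def update_scale_and_speed(y, dms_prev, dml_prev, w_i, f_i):
    """Evaluate equations (6), (7) and (8) for one sample.

    y        -- current scale factor y(k)
    dms_prev -- short-term average d_ms(k-1)
    dml_prev -- long-term average d_ml(k-1)
    w_i, f_i -- values of W[I(k)] and F[I(k)], supplied by the caller
    """
    yu = leaky_average(y, w_i, 5)          # equation (6)
    dms = leaky_average(dms_prev, f_i, 5)  # equation (7)
    dml = leaky_average(dml_prev, f_i, 7)  # equation (8)
    return yu, dms, dml
```

The smaller gain in equation (8) makes the long-term average respond about four times more slowly than the short-term average to the same input.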
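Adaptation Poles Predictor:
$$a_1(k) = (1 - \text{leak\_factor})\,a_1(k-1) + (\text{step\_size})\,\operatorname{sgn}[p(k)]\,\operatorname{sgn}[p(k-1)] \tag{9}$$
$$a_2(k) = (1 - \text{leak\_factor})\,a_2(k-1) + (\text{step\_size})\,\{\operatorname{sgn}[p(k)]\,\operatorname{sgn}[p(k-2)] - f[a_2(k-1)]\,\operatorname{sgn}[p(k)]\,\operatorname{sgn}[p(k-1)]\} \tag{10}$$
Adaptive Zero Prediction:
$$b_i(k) = (1 - \text{leak\_factor})\,b_i(k-1) + (\text{step\_size})\,\operatorname{sgn}[d_q(k)]\,\operatorname{sgn}[d_q(k-i)] \tag{11}$$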
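Equations (9) and (11) can be sketched with the leak factor and step size exposed as arguments, which is what allows the decoder to switch between the default and alternate values. This is a floating-point illustration with hypothetical function names; it assumes sgn(0) = 0 and omits equation (10) and any stability limits on the coefficients.

```python
def sgn(v):
    """Sign function. Assumption of this sketch: sgn(0) = 0."""
    return (v > 0) - (v < 0)


def update_a1(a1_prev, p_k, p_km1, leak_factor, step_size):
    """Equation (9): leaky sign-sign update of the first pole coefficient."""
    return (1.0 - leak_factor) * a1_prev + step_size * sgn(p_k) * sgn(p_km1)


def update_zeros(b_prev, dq_hist, leak_factor, step_size):
    """Equation (11): leaky sign-sign update of the zero coefficients.

    b_prev  -- [b_1(k-1), ..., b_6(k-1)]
    dq_hist -- [dq(k), dq(k-1), ..., dq(k-6)], newest sample first
    """
    s_now = sgn(dq_hist[0])
    return [
        (1.0 - leak_factor) * b + step_size * s_now * sgn(dq_hist[i])
        for i, b in enumerate(b_prev, start=1)
    ]
```

A larger leak factor pulls the coefficients toward zero faster, while a smaller step size limits how far a run of unreliable signs can push them, which is the lever the two-step control acts on.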
Glitches reduce the quality of the output speech. Listening tests were conducted on the Harvard Speech database (clean and noisy speech) to evaluate the performance of the algorithm. These listening tests used five listeners. All five listeners were asked to compare outputs from a prior art G.726 decoder with no glitch removal to outputs with the glitch removal of this invention on the Car 22 dB Harvard Database with 3% random packet loss. The listeners compared the prior art speech REF_OUT with the inventive speech PLC_OUT using the scale shown in Table 2.
Table 3 shows the results of the listening tests for 32 test vectors for the case of 40 Kbps. Similar results were obtained for the cases of 32, 24 and 16 Kbps.
Table 4 summarizes the results of the comparative listening tests for the five listeners. A Good result means the listener judged the inventive processed speech better than the prior art processed speech. A Bad result means the listener judged the prior art processed speech better than the inventive processed speech. A Neutral result means the listener judged the speech as having the same quality.
The following results are drawn from the listening tests. The average improvement was 0.18, and the improvement ranged from 0.03 to 0.37. This is quite a significant improvement for a speech codec. In these tests the MOS results indicated: the invention performed better than the prior art in 34.2% of cases; the invention performed worse in 19.5% of cases; and performance was the same in 46.1% of cases.
In the listening tests, some test cases that are better in subjective listening have lower Perceptual Evaluation of Speech Quality (PESQ) scores than the reference speech. PESQ thus appears to be an unreliable indicator of subjective quality wherever glitches are present in the signal. Because of the glitch removal and adaptation, the signal energy is lower around the lost frame, and hence the PESQ score is slightly lower in the inventive cases. However, the average bound and the variation around the mean of the PESQ scores are better for the inventive cases than for the cases without glitch removal.
These proposed changes to the existing G.726 decoder marginally add to the data processing load and memory used in decoding. The additional data processing load is only some decision code and pitch calculation overheads as shown in
The MOS and PESQ results show that the new algorithm performs better than the existing G.726 decoder upon packet loss. Glitches in the output speech are minimized, though not eliminated completely.
| Number | Date | Country | Kind |
|---|---|---|---|
| 1894/CHE/2007 | Aug 2007 | IN | national |
| Number | Date | Country | |
|---|---|---|---|
| 20090125302 A1 | May 2009 | US |