The present invention relates to the field of encoding and decoding sound signals, in particular but not exclusively in a multilayer embedded codec interoperable with the ITU-T (International Telecommunication Union) Recommendation G.711. More specifically, the present invention relates to a device and method for noise shaping in the encoder and/or decoder of a sound signal codec.
For example, the device and method according to the present invention are applicable in the narrowband part (usually the first, or lower, layers) of a multilayer embedded codec operating at a sampling frequency of 8 kHz. Unlike ITU-T Recommendation G.711, which has been optimized for signals in the telephony bandwidth, i.e. 200-3400 Hz, the device and method of the invention significantly improve quality for signals in the range 50-4000 Hz. Such signals are ordinarily generated, for example, by down-sampling a wideband signal whose bandwidth is 50-7000 Hz or even wider. Without the device and method of the invention, the quality of these signals, when encoded and synthesized by the legacy G.711 codec, would be much worse, with audible artefacts.
The demand for efficient digital wideband speech/audio encoding techniques with a good subjective quality/bit rate trade-off is increasing for numerous applications such as audio/video teleconferencing, multimedia, wireless applications and IP (Internet Protocol) telephony. Until recently, speech coding systems were able to process only signals in the telephony frequency bandwidth, i.e. 200-3400 Hz. Today, there is an increasing demand for wideband systems able to process signals in the frequency bandwidth 50-7000 Hz. These systems offer significantly higher quality than narrowband systems since they increase the intelligibility and naturalness of the sound. The frequency bandwidth 50-7000 Hz was found sufficient to deliver face-to-face speech quality during conversation. For audio signals such as music, this frequency bandwidth provides an acceptable audio quality, but still lower than that of CD audio, which operates in the frequency bandwidth 20-20000 Hz.
ITU-T Recommendation G.711 [1] at 64 kbps and G.729 at 8 kbps are two codecs widely used in packet-switched telephony applications. Thus, in the transition from narrowband to wideband telephony, there is an interest in developing wideband codecs backward interoperable with these two standards. To this effect, the ITU-T approved in 2006 Recommendation G.729.1, an embedded multi-rate coder with a core interoperable with ITU-T Recommendation G.729 at 8 kbps. Similarly, a new activity was launched in March 2007 for an embedded wideband codec based on a narrowband core interoperable with ITU-T Recommendation G.711 (both μ-law and A-law) at 64 kbps. This new G.711-based standard is known as the ITU-T G.711 wideband extension (G.711 WBE).
In G.711 WBE, the input sound signal, sampled at 16 kHz, is split into two bands using a QMF (Quadrature Mirror Filter) filter: a lower band from 0 to 4000 Hz and an upper band from 4000 to 8000 Hz. If the bandwidth of the input signal is 50-8000 Hz, the lower and upper bands are 50-4000 Hz and 4000-8000 Hz, respectively. In the G.711 WBE, the input wideband signal is encoded in three (3) Layers. The first Layer (Layer 1; the core) encodes the lower band of the signal in a G.711-compatible format at 64 kbps. Then, the second Layer (Layer 2; narrowband enhancement layer) adds 2 bits per sample (16 kbps) in the lower band to enhance the signal quality in this band. Finally, the third Layer (Layer 3; wideband extension layer) encodes the upper band with another 2 bits per sample (16 kbps) to produce a wideband synthesis. The structure of the bitstream is embedded. In other words, there is always a Layer 1, after which come either Layer 2 or Layer 3, or both (Layer 2 and Layer 3). In this manner, a synthesized signal of gradually improved quality may be obtained when decoding more layers. For example,
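The layer bit allocation described above can be illustrated with a little arithmetic. This is a sketch only; the per-layer rates follow from the text (8, 2 and 2 bits per sample at an 8 kHz band rate):

```python
# Illustrative bit-budget arithmetic for the embedded G.711 WBE layering.
FS_LOWER = 8000                          # per-band sampling rate (Hz)
FRAME_MS = 5                             # 5 ms frame (N = 40 samples)
SAMPLES = FS_LOWER * FRAME_MS // 1000    # 40 samples per frame

layer_bits_per_sample = {"Layer 1 (G.711 core)": 8,
                         "Layer 2 (narrowband enhancement)": 2,
                         "Layer 3 (wideband extension)": 2}

for name, bits in layer_bits_per_sample.items():
    kbps = bits * FS_LOWER / 1000        # bits/sample * samples/s -> kbit/s
    print(f"{name}: {bits} bits/sample = {kbps:.0f} kbit/s")

# Embedded property: a receiver may decode any prefix of the layers.
total = sum(layer_bits_per_sample.values()) * FS_LOWER / 1000
print(f"All layers: {total:.0f} kbit/s")
```

With all three layers decoded, the bit rate is 64 + 16 + 16 = 96 kbit/s, while a legacy G.711 decoder uses only the 64 kbps core.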
ITU-T Recommendation G.711, also known as companded pulse code modulation (PCM), quantizes each input sample using 8 bits. The amplitude of the input signal is first compressed using a logarithmic law, uniformly quantized with 7 bits (plus 1 bit for the sign), and then expanded to bring it back to the linear domain. The G.711 standard defines two compression laws, the μ-law and the A-law. ITU-T Recommendation G.711 was designed specifically for narrowband input signals in the telephony bandwidth, i.e. 200-3400 Hz. When it is applied to signals in the bandwidth 50-4000 Hz, the quantization noise is annoying and audible, especially at high frequencies (see
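The level-dependent resolution of companded PCM can be sketched with the continuous μ-law characteristic. This is an approximation for illustration only; the actual G.711 standard uses segmented 8-bit companding tables, not this closed-form curve:

```python
import math

MU = 255.0  # mu-law parameter (continuous approximation of G.711 mu-law)

def mu_compress(x):
    # x in [-1, 1] -> companded value in [-1, 1]
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_expand(y):
    return math.copysign((math.exp(abs(y) * math.log1p(MU)) - 1.0) / MU, y)

def mu_law_quantize(x, bits=8):
    # uniform quantization of the companded value (1 sign + 7 magnitude bits)
    levels = 1 << (bits - 1)
    y = mu_compress(x)
    q = round(abs(y) * (levels - 1)) / (levels - 1)
    return mu_expand(math.copysign(q, x))

# Companding makes the error roughly proportional to the signal level:
for x in (0.9, 0.1, 0.01):
    err = abs(mu_law_quantize(x) - x)
    print(f"x={x:<5} error={err:.6f} relative={err / x:.4f}")
```

The roughly constant relative error is what makes the noise tolerable in the 200-3400 Hz telephony band but audible on 50-4000 Hz material, where low- and high-frequency content the law was not tuned for is present.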
An object of the present invention is therefore to provide a device and method for noise shaping, in particular but not exclusively in a multilayer embedded codec interoperable with the ITU-T Recommendation G.711.
More specifically, in accordance with the present invention, there is provided a method for shaping noise during encoding of an input sound signal, the method comprising: pre-emphasizing the input sound signal to produce a pre-emphasized sound signal; computing a filter transfer function in relation to the pre-emphasized sound signal; and shaping the noise by filtering the noise through the computed filter transfer function to produce a shaped noise signal, wherein the noise shaping comprises producing a noise feedback representative of noise generated by processing of the input sound signal through a given sound signal codec.
The present invention also relates to a method for shaping noise during encoding of an input sound signal, the method comprising: receiving a decoded signal from an output of a given sound signal codec supplied with the input sound signal; pre-emphasizing the decoded signal to produce a pre-emphasized signal; computing a filter transfer function in relation to the pre-emphasized signal; and shaping the noise by filtering the noise through the computed filter transfer function, wherein the noise shaping further comprises producing a noise feedback representative of noise generated by processing of the input sound signal through a given sound signal codec.
The present invention is also concerned with a method for noise shaping in a multilayer encoder and decoder, including at least Layer 1 and Layer 2, the method comprising:
at the encoder: producing an encoded sound signal in Layer 1, wherein producing an encoded sound signal comprises shaping noise in Layer 1; producing an enhancement signal in Layer 2; and
at the decoder: decoding the encoded sound signal from Layer 1 of the encoder to produce a synthesis sound signal; decoding the enhancement signal from Layer 2; computing a filter transfer function in relation to the synthesis sound signal; filtering the decoded enhancement signal of Layer 2 through the computed filter transfer function to produce a filtered enhancement signal of Layer 2; and adding the filtered enhancement signal of Layer 2 to the synthesis sound signal to produce an output signal including contributions from both Layer 1 and Layer 2.
The present invention further relates to a device for shaping noise during encoding of an input sound signal, the device comprising: means for pre-emphasizing the input sound signal so as to produce a pre-emphasized signal; means for computing a filter transfer function in relation to the pre-emphasized sound signal; means for producing a noise feedback representative of noise generated by processing of the input sound signal through a given sound signal codec; and means for shaping the noise by filtering the noise feedback through the computed filter transfer function to produce a shaped noise signal.
The present invention is further concerned with a device for shaping noise during encoding of an input sound signal, the device comprising: a first filter for pre-emphasizing the input sound signal so as to produce a pre-emphasized signal; a feedback loop for producing a noise feedback representative of noise generated by processing of the input sound signal through a given sound signal codec; and a second filter having a transfer function determined in relation to the pre-emphasized signal, this second filter processing the noise feedback to produce a shaped noise signal.
The present invention still further relates to a device for shaping noise during encoding of an input sound signal, the device comprising: means for receiving a decoded signal from an output of a given sound codec supplied with the input sound signal; means for pre-emphasizing the decoded signal so as to produce a pre-emphasized signal; means for calculating a filter transfer function in relation to the pre-emphasized signal; means for producing a noise feedback representative of noise generated by processing of the input sound signal through the given sound signal codec; and means for shaping the noise by filtering the noise feedback through the computed filter transfer function.
The present invention is still further concerned with a device for shaping noise during encoding of an input sound signal, the device comprising: a receiver of a decoded signal from an output of a given sound signal codec; a first filter for pre-emphasizing the decoded signal to produce a pre-emphasized signal; a feedback loop for producing a noise feedback representative of noise generated by processing of the sound signal through the given sound signal codec; and a second filter having a transfer function determined in relation to the pre-emphasized signal, this second filter processing the noise feedback to produce a shaped noise signal.
The present invention further relates to a device for shaping noise in a multilayer encoder and decoder, including at least Layer 1 and Layer 2, the device comprising:
at the encoder: means for encoding a sound signal, wherein the means for encoding the sound signal comprises means for shaping noise in Layer 1; and means for producing an enhancement signal from Layer 2;
at the decoder: means for decoding the encoded sound signal from Layer 1 so as to produce a synthesis signal from Layer 1; means for decoding the enhancement signal from Layer 2; means for calculating a filter transfer function in relation to the synthesis sound signal; means for filtering the enhancement signal to produce a filtered enhancement signal of Layer 2; and means for adding the filtered enhancement signal of Layer 2 to the synthesis sound signal so as to produce an output signal including contributions of both Layer 1 and Layer 2.
The present invention is further concerned with a device for shaping noise in a multilayer encoding device and decoding device, including at least Layer 1 and Layer 2, the device comprising:
at the encoding device: a first encoder of a sound signal in Layer 1, wherein the first encoder comprises a filter for shaping noise in Layer 1; and a second encoder of an enhancement signal in Layer 2; and
at the decoding device: a decoder of the encoded sound signal to produce a synthesis sound signal; a decoder of the enhancement signal in Layer 2; a filter having a transfer function determined in relation to the synthesis sound signal from Layer 1, this filter processing the decoded enhancement signal to produce a filtered enhancement signal of Layer 2; and an adder for adding the synthesis sound signal and the filtered enhancement signal to produce an output signal including contributions of both Layer 1 and Layer 2.
The foregoing and other objects, advantages and features of the present invention will become more apparent upon reading of the following non restrictive description of illustrative embodiments thereof, given by way of example only with reference to the accompanying drawings.
In the appended drawings:
FIGS. 10a, 10b, 10c and 10d are schematic block diagrams illustrating transformation of the noise shaping scheme interoperable with the legacy G.711 decoder;
FIGS. 16a and 16b are graphs illustrating the A-law quantizer levels in the G.711 WBE codec with and without a dead-zone quantizer;
FIGS. 17a and 17b are graphs illustrating the μ-law quantizer levels in the G.711 WBE codec with and without the dead-zone quantizer;
Generally stated, a first non-restrictive illustrative embodiment of the present invention allows for encoding the lower-band signal with significantly better quality than would be obtained using the legacy G.711 codec alone. The idea behind the disclosed first non-restrictive illustrative embodiment is to shape the G.711 residual noise according to perceptual criteria and masking effects so that this residual noise is far less annoying for listeners. The disclosed device and method are applied in the encoder and do not affect interoperability with G.711. More specifically, the part of the encoded bitstream corresponding to Layer 1 can be decoded by a legacy G.711 decoder with increased quality due to proper noise shaping. The disclosed device and method also provide a mechanism to shape the quantization noise when decoding both Layer 1 and Layer 2. This is accomplished by introducing a complementary part of the noise shaping device and method in the decoder when decoding the information of Layer 2.
In the first non-restrictive illustrative embodiment, noise shaping similar to that of the 3GPP AMR-WB standard [2] and ITU-T Recommendation G.722.2 [3] is used. In AMR-WB, a perceptual weighting filter is used at the encoder in the error-minimization procedure to obtain the desired shaping of the error signal.
Furthermore, in the first non-restrictive illustrative embodiment, the perceptual weighting filter is optimized for a multilayer embedded codec interoperable with the legacy ITU-T Recommendation G.711 codec and has a transfer function directly related to the input signal. This transfer function is updated on a frame-by-frame basis. The noise shaping method has a built-in protection against instability of the closed loop caused by signals whose energy is concentrated in frequencies close to half the sampling frequency. The first non-restrictive illustrative embodiment also incorporates a dead-zone quantizer which is applied to signals with very low energy. When decoded, these low-energy signals would otherwise create an unpleasant coarse noise, since the dynamics of the disclosed device and method are not sufficient at very low levels. In a multilayer codec, there is also a second layer (Layer 2) which is used to refine the quantization steps of the legacy G.711 quantizer from the first layer (Layer 1). Because of the disclosed device and method, the signal coming from the second layer needs to be properly shaped in the decoder in order to keep the quantization noise under control. This is accomplished by applying a modified noise shaping algorithm also in the decoder. In this manner, both layers produce a signal with a properly shaped spectrum which is more pleasant to the human ear than it would have been using the legacy ITU-T G.711 codec. The last feature of the disclosed device and method is a noise gate which is used to suppress the output signal whenever its level decreases below a certain threshold. With the noise gate, the output signal sounds cleaner between active passages, thus reducing the listener's concentration burden.
Before further describing the first non-restrictive illustrative embodiment of the present invention, the AMR-WB (Adaptive Multi Rate—Wideband) standard will be described.
AMR-WB uses an analysis-by-synthesis coding paradigm where the optimum pitch and innovation parameters of an excitation signal are searched by minimizing the mean-squared error between the input sound signal, for example speech, and the synthesized sound signal (filtered excitation) in a perceptually weighted domain (
As illustrated in
This is equivalent to minimizing the error e(n) between the weighted input sound signal s(n) and the weighted synthesis sound signal s̃(n). The weighting filter 501 has a transfer function W′(z) in the form:

W′(z)=A(z/γ1)/A(z/γ2) (1)
where A(z) represents a linear prediction (LP) filter, and γ2,γ1 are weighting factors. Since the sound signal is quantized in the weighted domain, the spectrum of the quantization noise in the weighted domain is flat, which can be written as:
Ew(z)=W′(z)E(z) (2)
where E(z) is the spectrum of the error signal e(n) between the input sound signal and the synthesized sound signal s̃(n), and Ew(z) is the "flat" spectrum of the weighted error signal ew(n). From Equation (2), it can be seen that the error E(z) between the input sound signal and the synthesis sound signal is shaped by the inverse of the weighting filter, that is E(z)=W′(z)−1Ew(z). This result is described in Reference [4]. The transfer function W′(z)−1 exhibits some of the formant structure of the input sound signal. Thus, the masking property of the human ear is exploited by shaping the quantization error so that it has more energy in the formant regions, where it will be masked by the strong signal energy present in these regions. The amount of weighting is controlled by the factors γ1 and γ2 in Equation (1).
The above described traditional perceptual weighting filter works well with signals in the telephony frequency bandwidth 300-3400 Hz. However, it was found that this traditional perceptual weighting filter is not suitable for efficient perceptual weighting of wideband signals in the frequency bandwidth 50-7000 Hz. It was also found that the traditional perceptual weighting filter has inherent limitations in modelling the formant structure and the required spectral tilt concurrently. The spectral tilt is more pronounced in wideband signals due to the wide dynamic range between low and high frequencies. Prior techniques have suggested adding a tilt filter into W′(z) in order to control the tilt and formant weighting of the wideband input sound signal separately.
A solution to this problem as described in Reference [5] has been introduced in the AMR-WB standard and comprises applying a pre-emphasis filter at the input, computing the LP filter A(z) on the basis of the sound signal pre-emphasized for example by the filter 1-μz−1, where μ is a pre-emphasis factor, and using a modified filter W′(z) by fixing its denominator. In this particular case the CELP (Code-Excited Linear Prediction) model of
W′(z)=A(z/γ1)/(1−γ2z−1) (3)

In Equation (3), a first-order filter is used at the denominator. Alternatively, a higher order filter can also be used. This structure substantially decouples the formant weighting from the spectral tilt. Because A(z) is computed on the basis of the pre-emphasized speech signal s(n), the tilt of the filter 1/A(z/γ1) is less pronounced compared to the case when A(z) is computed on the basis of the original sound signal. A de-emphasis is performed at the decoder using a filter having a transfer function:

1/P(z)=1/(1−μz−1) (4)
where μ is a pre-emphasis factor and P(z)=1−μz−1 is the pre-emphasis filter. Using a noise shaping approach as in Equation (3), the quantization error spectrum is shaped by a filter having a transfer function 1/W′(z)P(z). When γ2 is set equal to μ, which is typically the case, the weighting filter becomes:

W(z)=W′(z)P(z)=A(z/γ1) (5)
and the spectrum of the quantization error is shaped by a filter whose transfer function is 1/A(z/γ), with A(z) computed on the basis of the pre-emphasized sound signal. Subjective listening showed that this structure, which achieves the error shaping by a combination of pre-emphasis and modified weighting filtering, is very efficient for encoding wideband signals, in addition to offering ease of fixed-point implementation.
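The weighted filter A(z/γ) used in this scheme is obtained by simply scaling the k-th LP coefficient by γ^k, as the following sketch shows (the coefficient values and γ are hypothetical):

```python
# A(z/gamma) is A(z) with its k-th coefficient scaled by gamma**k.
GAMMA = 0.92                       # example weighting factor (an assumption)

def weight_lp(a, gamma=GAMMA):
    """Coefficients of A(z/gamma), given a = [1, a1, ..., aP] for A(z)."""
    return [ak * gamma**k for k, ak in enumerate(a)]

a = [1.0, -1.6, 0.7, -0.1, 0.05]   # hypothetical 4th-order LP filter A(z)
print(weight_lp(a))                # the "bandwidth-expanded" weighted filter
```

Because γ < 1, the roots of A(z/γ) are pulled toward the origin, giving a smoothed version of the formant envelope along which the noise is shaped.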
Although the noise shaping described above is used in AMR-WB with wideband signals whose frequency bandwidth is 50-7000 Hz, it also works well when the bandwidth is limited to 50-4000 Hz, which is the case of the first non-restrictive illustrative embodiment and the G.711 WBE codec (Layer 1 and Layer 2).
2. Perceptual Weighting in a Multilayer Embedded Codec Interoperable with the ITU-T G.711 Standard
In
X(z)=S(z)W(z)−Y(z)F(z) (6a)
Y(z)=X(z)+Q(z) (6b)
where X(z) is the input sound signal of the G.711 quantizer 802, S(z) is the original sound signal, Y(z) is the output signal of the G.711 quantizer 802, Q(z) is the G.711 quantization error with flat spectrum and W(z) is the transfer function of the weighting filter 804. The above Equations 6a and 6b yield:
Y(z)=S(z)W(z)−Y(z)F(z)+Q(z) (7)
which leads to:
Y(z)[1+F(z)]=S(z)W(z)+Q(z) (8)
This is equivalent to:

Y(z)=[S(z)W(z)+Q(z)]/[1+F(z)] (9)
Therefore, by choosing F(z)=W(z)−1, the following relation can be obtained:

Y(z)=S(z)+Q(z)/W(z) (10)
Thus, the error between the output (synthesis) sound signal Y(z) and the input sound signal S(z) is shaped by the inverse of the weighting filter W(z).
X(z)=S(z)+F(z)[S(z)−Y(z)] (11a)
Y(z)=X(z)+Q(z) (11b)
Thus,
Y(z)=S(z)+F(z)[S(z)−Y(z)]+Q(z) (12)
which leads to:
Y(z)[1+F(z)]=S(z)[1+F(z)]+Q(z) (13)
Thus, by setting F(z)=W(z)−1, the same error shaping as in
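The noise-feedback structure of Equations (11a) and (11b) can be simulated with a short sketch. The coarse uniform quantizer and the first-order weighting filter below are illustrative stand-ins (not the G.711 quantizer), chosen so that F(z)=W(z)−1 contains only past samples and the loop is realizable:

```python
import math

def quantize(x, step=0.25):
    # coarse uniform quantizer, a stand-in for the G.711 quantizer
    return step * round(x / step)

def noise_feedback_encode(s, w):
    """w = [1, w1, ..., wP], the coefficients of W(z); returns y[n]."""
    P = len(w) - 1
    err = [0.0] * P                  # past values of s[n] - y[n]
    y = []
    for sn in s:
        fb = sum(w[k] * err[k - 1] for k in range(1, P + 1))  # F(z){S-Y}
        x = sn + fb                  # Equation (11a)
        yn = quantize(x)             # Equation (11b): Y = X + Q
        y.append(yn)
        err = [sn - yn] + err[:-1]   # shift the error memory
    return y

s = [math.sin(0.3 * n) for n in range(200)]
y = noise_feedback_encode(s, [1.0, -0.9])     # hypothetical 1st-order W(z)
print(max(abs(a - b) for a, b in zip(s, y)))  # the error stays bounded
```

The output differs from plain quantization only in how the error is distributed over frequency: the spectrum of s[n]−y[n] follows 1/W(z) instead of being flat.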
The pre-emphasis factor μ which is used in
where N is the size or length of the frame.
The pre-emphasis factor μ is given by the following relation:
This results in the range 0.38<μ<1.0. In this manner, the pre-emphasis is stronger for harmonic signals and weaker for noise.
In summary, the noise shaping filter W(z) is given by W(z)=A(z/γ), with A(z) computed on the basis of the pre-emphasized sound signal, where the pre-emphasis is performed using an adaptive pre-emphasis factor μ as described in Equations (15) and (16).
In the foregoing first non-restrictive illustrative embodiment, the computation of the filter W(z)=A(z/γ) (pre-emphasis and LP analysis) is based on the input sound signal. In a second non-restrictive illustrative embodiment, the filter is computed based on the decoded signal from Layer 1. As will be described herein below, in an embedded coding structure, in order to perform the same noise shaping on the second narrowband enhancement layer, Layer 2 for example, a device and method are disclosed whereby the decoded signal from the second layer is filtered through the filter 1/W(z). Thus, pre-emphasis and LP analysis should also be performed at the decoder, where only the past decoded signal is available. Therefore, in order to minimize the difference with the noise-shaping filter calculated in the decoder, the filter calculated at the encoder can be based on the past decoded signal from Layer 1, which is available at both the encoder and the decoder. This second non-restrictive illustrative embodiment is employed in the ITU-T Recommendation G.711 WBE standard (see
where the negative index represents past signal. LP analysis is then performed on the pre-emphasized past signal 1803.
In the second non-restrictive illustrative embodiment, for example, a 4th order LP analysis is conducted once per frame using an asymmetric window. The window is divided into two parts: the length of the first part is 60 samples and the length of the second part is 20 samples. The window is given by the relation:
where the values L1=60 and L2=20 are used (L1+L2=2N=80). The past decoded signal y(n) is pre-emphasized and windowed to obtain the signal s′ (n), n=0, . . . , 2N−1. The autocorrelations r(k) of the windowed signal s′(n), n=0, . . . , 79 are computed using the following relation:
and a 120 Hz bandwidth expansion is used by lag-windowing the autocorrelations using the window:

wlag(k)=exp[−(1/2)(2πf0k/fs)2], k=1, . . . , 4
where f0=120 Hz is the bandwidth expansion and fs=8000 Hz is the sampling frequency. Furthermore, r(0) is multiplied by the white noise correction factor 1.0001 which is equivalent to adding a noise floor at −40 dB.
The modified autocorrelations are used in the LPC analyser 1804 to obtain the LP filter coefficients ak, k=1, . . . , 4 by solving the following set of equations:

a1r(|k−1|)+a2r(|k−2|)+a3r(|k−3|)+a4r(|k−4|)=−r(k), k=1, . . . , 4
The above set of equations is solved using the Levinson-Durbin algorithm well-known to those of ordinary skill in the art.
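The LP analysis steps above (autocorrelation, lag windowing, white-noise correction, Levinson-Durbin) can be sketched as follows. The asymmetric window is omitted for brevity, and the Gaussian lag-window form is the common one used in related ITU codecs; both are assumptions for illustration:

```python
import math

def lp_analysis(s, order=4, f0=120.0, fs=8000.0):
    """Sketch of the LP analysis: autocorrelation, 120 Hz lag windowing,
    white-noise correction, then the Levinson-Durbin recursion.
    (The asymmetric analysis window is omitted for brevity.)"""
    N = len(s)
    r = [sum(s[n] * s[n - k] for n in range(k, N)) for k in range(order + 1)]
    for k in range(1, order + 1):            # 120 Hz bandwidth expansion
        r[k] *= math.exp(-0.5 * (2.0 * math.pi * f0 * k / fs) ** 2)
    r[0] *= 1.0001                           # white-noise correction (-40 dB)
    r[0] = max(r[0], 1e-9)                   # guard against all-zero input
    # Levinson-Durbin recursion solving the normal equations
    a = [1.0]
    e = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / e
        a = [1.0] + [a[i] + k * a[m - i] for i in range(1, m)] + [k]
        e *= (1.0 - k * k)
    return a                                 # coefficients of A(z)

s = [math.sin(0.4 * n) for n in range(80)]   # 2N = 80 samples, as in the text
print(lp_analysis(s))
```

For a near-sinusoidal input, the order-2 coefficients approach 1, −2cos(ω), 1, slightly pulled in by the lag windowing, which is exactly the 120 Hz bandwidth-expansion effect the text describes.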
The foregoing describes how the coding noise in a single-layer G.711-compatible encoder is shaped. To ensure proper noise shaping when multiple layers are used, the noise shaping algorithm is distributed between the encoder (for the first or core layer) in
Conceptually,
X(z)=S(z)+F(z)D(z) (17)
As before, filter F(z) 1402 is defined as F(z)=W(z)−1, where for example W(z)=A(z/γ) is the weighted LP filter, with A(z) calculated on the pre-emphasized sound signal (speech or audio). The difference signal d[n] from Operation 2 in
D(z)=S(z)−Ŷ8(z) (18)
Here, Ŷ8(z) (or ŷ8 [n] in the time domain) is the quantized output from the first Layer (8-bit PCM in the G.711 WBE codec). Thus, the noise feedback in
Y10(z)=X(z)+Q(z) (19)
where Q(z) (or q[n] in the time domain) is the quantization noise from block Q. This is a quantization noise from a 10-bit PCM quantizer, since both Layer 1 and Layer 2 bits are obtained from Q. In a multilayer encoder, such as the G.711 WBE encoder, these 10 bits actually correspond to 8 bits from Layer 1 (PCM-compatible) plus 2 bits from Layer 2 (enhancement Layer).
In
Ŷ8(z)=Y10(z)−Ê(z) (20)
In practice, Operation 4 would not be performed explicitly. The bits from the Layer 1 part of box Q in
Ŷ8(z)=S(z)+Q8(z)/W(z) (21)

where Q8(z) is the quantization noise from Layer 1 only (core 8-bit PCM). This is the desired noise-shaping result for the core Layer (Layer 1).
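Operations 1 to 4 above can be sketched per sample as follows. Uniform quantizers stand in for the G.711 and 10-bit PCM quantizers, and the core/enhancement split is idealized (in the real codec, ŷ8[n] comes from the Layer 1 bits of the box Q); all values are illustrative:

```python
# Per-sample sketch of the multilayer noise-feedback encoder.
def q_uniform(x, step):
    return step * round(x / step)

def wbe_like_encode(s, w, core_step=0.125):
    """w = [1, w1, ..., wP] = W(z); returns (core y8[n], enhancement e[n])."""
    P = len(w) - 1
    d = [0.0] * P                          # past d[n] = s[n] - y8[n]
    core, enh = [], []
    for sn in s:
        fb = sum(w[k] * d[k - 1] for k in range(1, P + 1))  # F(z)D(z)
        x = sn + fb                        # Operation 1: Equation (17)
        y10 = q_uniform(x, core_step / 4)  # 10-bit quantizer Q (finer steps)
        y8 = q_uniform(x, core_step)       # Layer 1 levels (coarser steps)
        e_hat = y10 - y8                   # Layer 2 refinement, cf. Eq. (20)
        core.append(y8)
        enh.append(e_hat)
        d = [sn - y8] + d[:-1]             # Operation 2: feedback on the core
    return core, enh

core, enh = wbe_like_encode([0.3, -0.2, 0.7, 0.1], [1.0, -0.68])
print(core, enh)
```

Note that the feedback uses d[n]=s[n]−ŷ8[n] (the core-layer error, Equation (18)), not the 10-bit error; this is what lets a legacy G.711 decoder reproduce the Layer 1 bitstream with properly shaped noise.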
This section describes how the noise is shaped if both Layer 1 and Layer 2 are decoded, i.e. if the signal y10[n] in
X(z)=S(z)+F(z){S(z)−Ŷ8(z)} (22)
In Equation (19), the relationship between X(z) and Y10(z) is provided. By substituting X(z) in Equation (22) the following relation is obtained:
Y10(z)−Q(z)=S(z)+F(z){S(z)−Ŷ8(z)} (23)
Now, using Equation (20) to substitute Ŷ8(z) in the above relation yields the following relation:
Y10(z)−Q(z)=S(z)+F(z){S(z)−Y10(z)+Ê(z)} (24)
Isolating all terms in Y10(z) on the left hand side of the above Equation (24) yields the following relation:
{F(z)+1}Y10(z)={F(z)+1}S(z)+Q(z)+F(z)Ê(z) (25)
Dividing both sides by F(z)+1, the following relation is obtained:

Y10(z)=S(z)+[Q(z)+F(z)Ê(z)]/[1+F(z)] (26)
Since we have F(z)=W(z)−1, it can be written:

Y10(z)=S(z)+Q(z)/W(z)+[(W(z)−1)/W(z)]Ê(z) (27)
Recall that Q(z) is the coding noise from the 10-bit quantizer Q. When both Layer 1 and Layer 2 are decoded, the only noise that should remain is the properly shaped term Q(z)/W(z) from the right hand side of Equation (27). The term [(W(z)−1)/W(z)]Ê(z) is therefore undesirable and should be eliminated. It can be written:

YD(z)=Y10(z)−[(W(z)−1)/W(z)]Ê(z)=S(z)+Q(z)/W(z) (29)
In the equation above YD(z) denotes the desired signal when decoding both Layer 1 and Layer 2. Now, Y10(z) is related to Ŷ8(z) (the Layer 1 synthesis signal) and Ê(z) (the transmitted 2-bit enhancement from Layer 2) in the following manner:
Y10(z)=Ŷ8(z)+Ê(z) (30)
Using this relationship for Y10(z) and replacing it in the definition of YD(z) above yields the following relation:

YD(z)=Ŷ8(z)+Ê(z)−[(W(z)−1)/W(z)]Ê(z) (31)
The last term in the above Equation (31) can be expanded as follows:

[(W(z)−1)/W(z)]Ê(z)=Ê(z)−Ê(z)/W(z) (32)
This finally yields:

YD(z)=Ŷ8(z)+Ê(z)/W(z) (33)
Equation (33) indicates the operations that have to be performed at the decoder to obtain the Layer 1+Layer 2 synthesis with proper noise shaping. At the encoder side, noise shaping is applied as described in
filtering the decoded enhancement signal ê[n] of Layer 2 through the filter 1/W(z) to form signal ê2[n] (see filter 1503); and adding ê2[n] to the Layer 1 synthesis to form the output signal.
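A minimal sketch of the decoder-side combination described by Equation (33): the Layer 2 refinement ê[n] is filtered through 1/W(z) and added to the Layer 1 synthesis (the first-order W(z) and the sample values are hypothetical):

```python
# Decoder-side sketch: filter the enhancement through 1/W(z), then add it.
def filter_all_pole(x, w):
    """y[n] such that W(z)Y(z) = X(z), with w = [1, w1, ..., wP]."""
    y = []
    for n, xn in enumerate(x):
        acc = xn - sum(w[k] * y[n - k] for k in range(1, len(w)) if n - k >= 0)
        y.append(acc)
    return y

def decode_layers(y8, e_hat, w):
    e2 = filter_all_pole(e_hat, w)          # ê2[n] = (1/W(z)) ê[n]
    return [a + b for a, b in zip(y8, e2)]  # Y_D = Ŷ8 + Ê/W, Equation (33)

print(decode_layers([0.25, 0.25, 0.5, 0.25],    # Layer 1 synthesis (toy)
                    [0.05, -0.03, 0.02, 0.0],   # Layer 2 refinement (toy)
                    [1.0, -0.68]))
```

Decoding Layer 1 alone skips this step entirely, which is why the scheme stays interoperable with a legacy G.711 decoder.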
Although the present invention has been described hereinabove by way of non-restrictive illustrative embodiments thereof, these embodiments can be modified without departing from the spirit and nature of the subject invention. For instance, instead of using two (2) bits per sample scalar quantization to quantize the second layer (Layer 2), other quantization strategies can be used, such as vector quantization. Furthermore, other weighting filter formulations can be used. In the above illustrative embodiment, the noise shaping is given by W−1(z)=1/A(z/γ). In general, if it is desired to shape the quantization noise by W−1(z), the filter F(z) at the encoder (
In some limited cases, e.g. for certain music genres, the energy of a signal may be concentrated in a single frequency peak near 4000 Hz (half of the sampling frequency in the lower band). In this specific case, the noise-shaping feedback becomes unstable since the filter is highly resonant. As a consequence, the shaped noise is incorrect and the synthesized signal is clipped. This creates an audible artefact whose duration may be several frames, until the noise-shaping loop returns to its stable state. To prevent this problem, the noise-shaping feedback is attenuated whenever a signal whose energy is concentrated in higher frequencies is detected in the encoder.
Specifically, a ratio:
is calculated where r0 and r1 are, respectively, the first and second autocorrelation coefficients. The first autocorrelation coefficient is given by the relation:
and the second autocorrelation coefficient is calculated using the following relation:
The ratio r may be used as information about the spectral tilt of the signal. In order to reduce the noise-shaping, the following condition must be fulfilled:
The noise-shaping feedback is then modified by attenuating the coefficients of the weighting filter by a factor α in the following manner:
The attenuation factor α is a function of the ratio r and is given by the relation:
The attenuation of the perceptual filter for signals whose energy is concentrated in higher frequencies is not activated if there is an active attenuation for signals with very low level. This will be explained in the next section.
When the input signal has a very low energy, the noise-shaping device and method may prevent the proper masking of the coding noise. The reason is that the resolution of the G.711 decoder is level-dependent. When the signal level is too low the quantization noise has approximately the same energy as the input signal and the distortion is close to 100%. Therefore, it may even happen that the energy of the input signal is increased when the filtered noise is added thereto. This in turn increases the energy of the decoded signal, etc. The noise feedback soon becomes saturated for several frames, which is not desirable. To prevent this saturation, the noise-shaping filter is attenuated for very-low level signals.
To detect the conditions for filter attenuation, the energy of the past decoded signal ŷ8[n] can be checked to determine whether it is below a certain threshold. Note that the correlation r0 in Equation (35) represents this energy. Thus, if the condition
r0<θ, (40)
is fulfilled, the attenuation for very low level signals is performed, where θ is a given threshold. Alternatively, a normalization factor ηL can be calculated from the correlation r0 in Equation (35). The normalization factor represents the maximum number of left shifts that can be performed on a 16-bit value r0 to keep the result below 32767. When ηL fulfils the condition:
ηL≧16, (41)
the attenuation for very low level signals is performed.
The attenuation is carried out on the weighting filter by setting the weighting factor γ=0.5. That is:
Attenuating the noise-shaping filter for very-low level input sound signals avoids the case where the noise feedback loop would increase the objective noise level without bringing the benefit of having a perceptually lower noise floor. It also helps to reduce the effects of filter mismatch between the encoder and the decoder.
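The normalization-factor test of condition (41) can be sketched as follows. The function mimics a fixed-point normalization operation (such as the ITU basic operator `norm_l`); the 32-bit accumulator width used here is an assumption, since the text speaks of a 16-bit value and the limit 32767:

```python
# Sketch of the very-low-level detection via a normalization factor.
def norm_factor(r0, width=31):
    """Maximum number of left shifts keeping r0 below 2**width
    (width=31, i.e. a 32-bit accumulator, is an assumption)."""
    if r0 <= 0:
        return width
    n = 0
    while (r0 << (n + 1)) < (1 << width):
        n += 1
    return n

def very_low_level(r0, eta_threshold=16):
    # condition (41): eta_L >= 16 flags a very low energy r0
    return norm_factor(r0) >= eta_threshold

print(very_low_level(100), very_low_level(10**6))  # → True False
```

Counting leading shifts instead of comparing against a fixed threshold θ, as in condition (40), keeps the test cheap in fixed-point implementations.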
The perceptual filter attenuations described above (protection against instability or against very low level signals) are mutually exclusive, which means they cannot be active at the same time. This is expressed by the following logic:
Do attenuation of the perceptual filter yielding Equation (42).
else if
Do attenuation of the perceptual filter yielding (38).
else
No attenuation.
end.
Since the noise shaping disclosed in the first and second non-restrictive illustrative embodiments of the invention addresses the problem of noise in PCM encoders, which have fixed (non-adaptive) quantization levels, some very small signal conditions can actually produce a synthesis signal with higher energy than the input. This occurs when the input signal to the quantizer oscillates around the mid-point of two quantization levels.
In A-law PCM, the lowest quantization levels are 0 and ±16. Before the quantization, every input sample is offset by the value of +8. If a signal oscillates around the value of 8, every sample with amplitude below 8 will be quantized as 0 and every sample equal to or above 8 will be quantized to 16. Then, the quantized signal will toggle between 0 and 16 even though the input sound signal varies only between, say, 6 and 12. This can be further amplified by the recursive nature of the noise shaping. One solution is to increase the region around the origin (0 value) of the quantizer of Layer 1. For example, all values between −11 and +11 inclusively (instead of −7 and +7) will be set to zero by the quantizer in Layer 1. This effectively increases the dead zone of the quantizer, thereby increasing the number of low-level samples which will be set to zero. However, in a multilayer G.711-interoperable encoding scheme, such as the G.711 WBE encoder, there is an extension layer which is used to refine the coarse quantization levels of the core layer (or Layer 1). Therefore, when a dead-zone quantizer is used in Layer 1, it is also necessary to modify the quantization levels of the quantizer in Layer 2. These levels are modified in a way that minimizes the error. One possible configuration of the dead-zone quantization levels for A-law is shown in
For μ-law, the same principle is followed but with different quantization thresholds (see
The dead-zone quantizer is activated only when the following condition is satisfied:
where k=ηL is the same normalization factor as the one used to normalize the value of r0 in Equation (35). When the condition above is true, neither the embedded low-band quantizers nor the core-layer decoder is used. Instead, a different quantization technique, explained below, is applied. Note that the condition in Equation (40) can also be used to activate the dead-zone quantizer.
As seen in condition (43), the dead-zone quantizer is activated only for extremely low-level input signals s(n). The interval of activity is called the dead zone, and within this interval the locally decoded core-layer signal y(n) is suppressed to zero. In this dead-zone quantizer, the samples s(n) are quantized according to the following set of equations:
u(n)=0
where in the above relations u(n)=ŷ8(n) is the quantized core layer and v(n)=ê(n) is the quantized second layer.
To further increase the cleanness of the synthesis signal during quasi-silent periods, a noise gate is added at the decoder. The noise gate attenuates the output signal when the frame energy is very low. This attenuation is progressive in both level and time. The level of attenuation is signal-dependent and is gradually modified on a sample-by-sample basis. In a non-limitative example, the noise gate operates in the G.711 WBE decoder as described below.
Before calculating its energy, the synthesised signal in Layer 1 is first filtered by a first-order high-pass FIR filter:
yf(n)=y(n)−0.768y(n−1), n=0, 1, . . . , N−1, (34)
where y(n), n=0, . . . , N−1, corresponds to the synthesised signal in the current frame and N=40 is the length of the frame. The energy of the filtered signal is calculated by
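A minimal sketch of this pre-filtering step is given below, assuming the frame energy is a plain sum of squares of the filtered samples (the exact energy formula is not reproduced above); the function name and the previous-frame memory handling are assumptions:

```python
def filtered_frame_energy(y, y_prev_last=0.0):
    # First-order high-pass FIR filter of Equation (34):
    #   y_f(n) = y(n) - 0.768*y(n-1),  n = 0, ..., N-1
    prev = y_prev_last            # last sample of the previous frame
    energy = 0.0
    for sample in y:
        yf = sample - 0.768 * prev
        energy += yf * yf         # assumed sum-of-squares frame energy
        prev = sample
    return energy

# A constant (DC) frame is strongly attenuated by the high-pass filter:
print(filtered_frame_energy([1.0] * 40, y_prev_last=1.0))
```

The high-pass filter removes the DC component before the energy measurement, so a constant offset in the synthesis does not keep the noise gate open.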
In order to avoid fast switching of the noise gate, the energy of the previous frame is added to the energy of the current frame, which gives the total energy
Et=E0+E−1. (36)
Note that E−1 is updated by E0 at the end of decoding each frame.
Based on the information about the signal energy, a target gain gt is calculated as the square root of Et in Equation (36), multiplied by a factor of 1/2⁷, i.e.
gt=√Et/2⁷,
bounded by
0.25≤gt≤1.0. (37)
The target gain is lower-limited by a value of 0.25 and upper-limited by 1.0. Thus, the noise gate is activated whenever the gain gt is less than 1.0. The factor 1/2⁷ has been chosen such that a signal whose RMS value is ≈20 results in a target gain gt≈1.0 and a signal whose RMS value is ≈5 results in a target gain gt≈0.25. These values have been optimized for the G.711 WBE codec, and it is possible to modify them in a different framework.
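The target-gain computation can be sketched as follows; the function name is hypothetical, and Et is assumed to be the total energy of Equation (36):

```python
import math

def target_gain(Et):
    # g_t = sqrt(E_t) / 2^7, bounded to [0.25, 1.0] per Equation (37).
    g = math.sqrt(Et) / 128.0
    return min(1.0, max(0.25, g))

print(target_gain(4096.0))   # sqrt(4096)/128 = 0.5
print(target_gain(0.0))      # clipped up to the lower bound 0.25
print(target_gain(1.0e9))    # clipped down to the upper bound 1.0
```

Because of the upper bound, the gate has no effect at all on signals of ordinary level; only very quiet frames drive gt below 1.0.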
When the synthesized signal in the decoder has its energy concentrated in the higher band, i.e. 4000-8000 Hz, the noise gate is progressively deactivated by setting the target gain to 1.0. Therefore, a power measure of the lower-band and the higher-band synthesized signals is calculated for the current frame. Specifically, the power of the lower-band signal (synthesized in Layer 1+Layer 2) is given by the following relation:
The power of the higher-band signal (synthesized in Layer 3) is given by
where z(n), n=0, . . . , N−1, denotes the synthesized higher-band signal. If Layer 3 is not implemented, the noise gate is not conditioned on the band powers and is activated every time gt is less than 1.0. When Layer 3 is used, the target gain is set to 1.0 whenever PHB>4×10⁻⁷ and PHB>16·PLB.
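The deactivation logic above can be sketched as a small predicate; the function name and the boolean return convention are assumptions for illustration:

```python
def noise_gate_active(g_t, p_lb, p_hb, layer3_present=True):
    # With Layer 3, the gate is bypassed (target gain forced to 1.0)
    # when the higher band dominates: P_HB > 4e-7 and P_HB > 16*P_LB.
    if layer3_present and p_hb > 4e-7 and p_hb > 16.0 * p_lb:
        return False
    # Otherwise the gate is active whenever g_t < 1.0.
    return g_t < 1.0

print(noise_gate_active(0.5, p_lb=1.0, p_hb=0.0))    # True: low band dominates
print(noise_gate_active(0.5, p_lb=1e-9, p_hb=1e-6))  # False: high band dominates
```

This prevents the gate from attenuating frames whose content lies almost entirely in the 4000-8000 Hz band, where the Layer 1 energy measure would wrongly report near-silence.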
Finally, each sample of the output synthesized signal (i.e. the combination of the lower-band and the higher-band synthesized signals) is multiplied by a gain:
g(n)=0.99g(n−1)+0.01gt, n=0, 1, . . . , N−1, (40)
which is updated on a sample-by-sample basis. It can be seen that the gain converges slowly towards the target gain gt.
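The per-sample smoothing of Equation (40) can be sketched as below; returning the last gain so that it can seed the next frame is an assumption about how the state is carried over:

```python
def apply_noise_gate(samples, g_prev, g_t):
    # Equation (40): g(n) = 0.99*g(n-1) + 0.01*g_t, applied sample by sample.
    out = []
    g = g_prev
    for s in samples:
        g = 0.99 * g + 0.01 * g_t
        out.append(g * s)
    return out, g  # last gain seeds g(-1) of the next frame

out, g_last = apply_noise_gate([1.0] * 40, g_prev=1.0, g_t=0.25)
# The applied gain decays slowly from 1.0 towards the target 0.25.
print(out[0], out[-1])
```

With a smoothing constant of 0.99, the gain covers only about 1% of the remaining distance to gt per sample, so a full 40-sample frame moves it roughly a third of the way; this is what makes the attenuation progressive in time rather than an abrupt mute.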
Although the present invention has been described in the foregoing description by means of a non-restrictive illustrative embodiment, this illustrative embodiment can be modified at will within the scope of the appended claims, without departing from the spirit and nature of the subject invention.
Filing Document | Filing Date | Country | Kind | 371c Date
---|---|---|---|---
PCT/CA2007/002373 | 12/28/2007 | WO | 00 | 6/11/2010

Number | Date | Country
---|---|---
60929124 | Jun 2007 | US
60960057 | Sep 2007 | US