The present disclosure relates to quantization of the gain of a fixed contribution of an excitation in a coded sound signal. The present disclosure also relates to joint quantization of the gains of the adaptive and fixed contributions of the excitation.
In a coder of a codec structure, for example a CELP (Code-Excited Linear Prediction) codec structure such as ACELP (Algebraic Code-Excited Linear Prediction), an input speech or audio signal (sound signal) is processed in short segments, called frames. In order to capture rapidly varying properties of an input sound signal, each frame is further divided into sub-frames. A CELP codec structure also produces adaptive codebook and fixed codebook contributions of an excitation that are added together to form a total excitation. Gains related to the adaptive and fixed codebook contributions of the excitation are quantized and transmitted to a decoder along with other encoding parameters. The adaptive codebook contribution and the fixed codebook contribution of the excitation will be referred to as “the adaptive contribution” and “the fixed contribution” of the excitation throughout the document.
In the appended drawings:
According to a first aspect, the present disclosure relates to a device for quantizing a gain of a fixed contribution of an excitation in a frame, including sub-frames, of a coded sound signal, comprising: an input for a parameter representative of a classification of the frame; an estimator of the gain of the fixed contribution of the excitation in a sub-frame of the frame, wherein the estimator is supplied with the parameter representative of the classification of the frame; and a predictive quantizer of the gain of the fixed contribution of the excitation, in the sub-frame, using the estimated gain.
The present disclosure also relates to a method for quantizing a gain of a fixed contribution of an excitation in a frame, including sub-frames, of a coded sound signal, comprising: receiving a parameter representative of a classification of the frame; estimating the gain of the fixed contribution of the excitation in a sub-frame of the frame, using the parameter representative of the classification of the frame; and predictive quantizing the gain of the fixed contribution of the excitation, in the sub-frame, using the estimated gain.
According to a third aspect, there is provided a device for jointly quantizing gains of adaptive and fixed contributions of an excitation in a frame of a coded sound signal, comprising: a quantizer of the gain of the adaptive contribution of the excitation; and the above described device for quantizing the gain of the fixed contribution of the excitation.
The present disclosure further relates to a method for jointly quantizing gains of adaptive and fixed contributions of an excitation in a frame of a coded sound signal, comprising: quantizing the gain of the adaptive contribution of the excitation; and quantizing the gain of the fixed contribution of the excitation using the above described method.
According to a fifth aspect, there is provided a device for retrieving a quantized gain of a fixed contribution of an excitation in a sub-frame of a frame, comprising: a receiver of a gain codebook index; an estimator of the gain of the fixed contribution of the excitation in the sub-frame, wherein the estimator is supplied with a parameter representative of a classification of the frame; a gain codebook for supplying a correction factor in response to the gain codebook index; and a multiplier of the estimated gain by the correction factor to provide a quantized gain of the fixed contribution of the excitation in the sub-frame.
The present disclosure is also concerned with a method for retrieving a quantized gain of a fixed contribution of an excitation in a sub-frame of a frame, comprising: receiving a gain codebook index; estimating the gain of the fixed contribution of the excitation in the sub-frame, using a parameter representative of a classification of the frame; supplying, from a gain codebook and for the sub-frame, a correction factor in response to the gain codebook index; and multiplying the estimated gain by the correction factor to provide a quantized gain of the fixed contribution of the excitation in said sub-frame.
The present disclosure is still further concerned with a device for retrieving quantized gains of adaptive and fixed contributions of an excitation in a sub-frame of a frame, comprising: a receiver of a gain codebook index; an estimator of the gain of the fixed contribution of the excitation in the sub-frame, wherein the estimator is supplied with a parameter representative of the classification of the frame; a gain codebook for supplying the quantized gain of the adaptive contribution of the excitation and a correction factor for the sub-frame in response to the gain codebook index; and a multiplier of the estimated gain by the correction factor to provide a quantized gain of fixed contribution of the excitation in the sub-frame.
According to a further aspect, the disclosure describes a method for retrieving quantized gains of adaptive and fixed contributions of an excitation in a sub-frame of a frame, comprising: receiving a gain codebook index; estimating the gain of the fixed contribution of the excitation in the sub-frame, using a parameter representative of a classification of the frame; supplying, from a gain codebook and for the sub-frame, the quantized gain of the adaptive contribution of the excitation and a correction factor in response to the gain codebook index; and multiplying the estimated gain by the correction factor to provide a quantized gain of fixed contribution of the excitation in the sub-frame.
There is a need for a technique for quantizing the gains of the adaptive and fixed excitation contributions that improves the robustness of the codec against frame erasures or packet losses that can occur during transmission of the encoding parameters from the coder to the decoder.
The foregoing and other features will become more apparent upon reading of the following non-restrictive description of illustrative embodiments, given by way of example only with reference to the accompanying drawings.
In the following, there is described quantization of a gain of a fixed contribution of an excitation in a coded sound signal, as well as joint quantization of gains of adaptive and fixed contributions of the excitation. The quantization can be applied to any number of sub-frames and deployed with any input speech or audio signal (input sound signal) sampled at any arbitrary sampling frequency. Also, the gains of the adaptive and fixed contributions of the excitation are quantized without the need of inter-frame prediction. The absence of inter-frame prediction results in improvement of the robustness against frame erasures or packet losses that can occur during transmission of encoded parameters.
The gain of the adaptive contribution of the excitation is quantized directly whereas the gain of the fixed contribution of the excitation is quantized through an estimated gain. The estimation of the gain of the fixed contribution of the excitation is based on parameters that exist both at the coder and the decoder. These parameters are calculated during processing of the current frame. Thus, no information from a previous frame is required in the course of quantization or decoding which, as mentioned hereinabove, improves the robustness of the codec against frame erasures.
Although the following description will refer to a CELP (Code-Excited Linear Prediction) codec structure, for example ACELP (Algebraic Code-Excited Linear Prediction), it should be kept in mind that the subject matter of the present disclosure may be applied to other types of codec structures.
In the art of CELP coding, the excitation is composed of two contributions: the adaptive contribution (adaptive codebook excitation) and the fixed contribution (fixed codebook excitation). The adaptive codebook is based on long-term prediction and is therefore related to the past excitation. The adaptive contribution of the excitation is found by means of a closed-loop search around an estimated value of a pitch lag. The estimated pitch lag is found by means of a correlation analysis. The closed-loop search consists of minimizing the mean square weighted error (MSWE) between a target signal (in CELP coding, a perceptually filtered version of the input speech or audio signal (input sound signal)) and the filtered adaptive contribution of the excitation scaled by an adaptive codebook gain. The filter in the closed-loop search corresponds to the weighted synthesis filter known in the art of CELP coding. A fixed codebook search is also carried out by minimizing the mean squared error (MSE) between an updated target signal (after removing the adaptive contribution of the excitation) and the filtered fixed contribution of the excitation scaled by a fixed codebook gain. The construction of the total filtered excitation is shown in
Assuming the knowledge of the target signal x(i), the filtered adaptive contribution of the excitation y(i) and the filtered fixed contribution of the excitation z(i), the optimal set of unquantized gains gp and gc is found by minimizing the energy of the error signal e(i) given by the following relation:
\[ e(i) = x(i) - g_p\, y(i) - g_c\, z(i), \quad i = 0, \ldots, L-1 \tag{1} \]
Equation (1) can be given in vector form as
\[ \mathbf{e} = \mathbf{x} - g_p \mathbf{y} - g_c \mathbf{z} \tag{2} \]
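and minimizing the energy of the error signal,
\[ \min\{\mathbf{e}^t \mathbf{e}\} = \min\{(\mathbf{x} - g_p \mathbf{y} - g_c \mathbf{z})^t (\mathbf{x} - g_p \mathbf{y} - g_c \mathbf{z})\}, \]
where t denotes vector transpose, results in optimum unquantized gains
\[ g_{p,opt} = \frac{c_1 c_2 - c_3 c_4}{c_0 c_2 - c_4^2}, \qquad g_{c,opt} = \frac{c_0 c_3 - c_1 c_4}{c_0 c_2 - c_4^2} \tag{3} \]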
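where the constants or correlations c0, c1, c2, c3, c4 and c5 are calculated as
\[ c_0 = \mathbf{y}^t \mathbf{y}, \quad c_1 = \mathbf{x}^t \mathbf{y}, \quad c_2 = \mathbf{z}^t \mathbf{z}, \quad c_3 = \mathbf{x}^t \mathbf{z}, \quad c_4 = \mathbf{y}^t \mathbf{z}, \quad c_5 = \mathbf{x}^t \mathbf{x}. \tag{4} \]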
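As an illustration only, the joint minimization of the error energy of Equation (2) over the two gains can be sketched in Python as a two-unknown least-squares problem built on the correlations of Equation (4). The function names, the collinearity guard and the toy vectors are assumptions made for this sketch; they are not part of the disclosure.

```python
def correlations(x, y, z):
    """Return (c0, ..., c5) as defined in Equation (4)."""
    def dot(a, b):
        return sum(p * q for p, q in zip(a, b))
    return dot(y, y), dot(x, y), dot(z, z), dot(x, z), dot(y, z), dot(x, x)


def optimal_gains(x, y, z):
    """Solve the 2x2 normal equations for the unquantized gains (g_p, g_c)."""
    c0, c1, c2, c3, c4, _ = correlations(x, y, z)
    det = c0 * c2 - c4 * c4
    if abs(det) < 1e-12:  # assumed guard: y and z (nearly) collinear
        raise ValueError("filtered contributions are (nearly) collinear")
    g_p = (c1 * c2 - c3 * c4) / det
    g_c = (c0 * c3 - c1 * c4) / det
    return g_p, g_c


# Toy data (hypothetical): the target is built from known gains 0.8 and 2.0.
y = [1.0, 0.0, 1.0, 0.0]   # filtered adaptive contribution
z = [0.0, 1.0, 1.0, 0.0]   # filtered fixed contribution
x = [0.8 * a + 2.0 * b for a, b in zip(y, z)]
g_p, g_c = optimal_gains(x, y, z)
```

With a target constructed exactly from the two contributions, the sketch recovers the gains used to build it, which is a convenient sanity check before running on real sub-frames.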
The optimum gains in Equation (3) are not quantized directly; instead, they are used in training a gain codebook, as will be described later. The gains are quantized jointly, after applying prediction to the gain of the fixed contribution of the excitation. The prediction is performed by computing an estimated value gc0 of the gain of the fixed contribution of the excitation. The gain of the fixed contribution of the excitation is given by gc=gc0·γ, where γ is a correction factor. Therefore, each codebook entry contains two values. The first value corresponds to the quantized gain gp of the adaptive contribution of the excitation. The second value corresponds to the correction factor γ, which is used to multiply the estimated gain gc0 of the fixed contribution of the excitation. The optimum index in the gain codebook (gp and γ) is found by minimizing the mean squared error between the target signal and the filtered total excitation. Estimation of the gain of the fixed contribution of the excitation is described in detail below.
Each frame contains a certain number of sub-frames. Let us denote the number of sub-frames in a frame as K and the index of the current sub-frame as k. The estimation gc0 of the gain of the fixed contribution of the excitation is performed differently in each sub-frame.
The estimator 200 first calculates an estimation of the fixed codebook gain in response to a parameter t representative of the classification of the current frame.
The energy of the filtered innovation codevector from the fixed codebook is then subtracted, in logarithmic domain, from the estimated fixed codebook gain to take this energy into consideration. The resulting estimated fixed codebook gain is multiplied by a correction factor selected from a gain codebook to produce the quantized fixed codebook gain gc.
In one embodiment, the estimator 200 comprises a calculator 201 of a linear estimation of the fixed codebook gain in logarithmic domain. The fixed codebook gain is estimated assuming unity-energy of the innovation codevector 202 from the fixed codebook. Only one estimation parameter is used by the calculator 201, the parameter t representative of the classification of the current frame. A subtractor 203 then subtracts the energy of the filtered innovation codevector 202 from the fixed codebook in logarithmic domain from the linear estimated fixed codebook gain in logarithmic domain at the output of the calculator 201. A converter 204 converts the estimated fixed codebook gain in logarithmic domain from the subtractor 203 to linear domain. The output in linear domain from the converter 204 is the estimated fixed codebook gain gc0. A multiplier 205 multiplies the estimated gain gc0 by the correction factor 206 selected from the gain codebook. As described in the preceding paragraph, the output of the multiplier 205 constitutes the quantized fixed codebook gain gc.
The quantized gain gp of the adaptive contribution of the excitation (hereinafter the adaptive codebook gain) is selected directly from the gain codebook. A multiplier 207 multiplies the filtered adaptive excitation 208 from the adaptive codebook by the quantized adaptive codebook gain gp to produce the filtered adaptive contribution 209 of the filtered excitation. Another multiplier 210 multiplies the filtered innovation codevector 202 from the fixed codebook by the quantized fixed codebook gain gc to produce the filtered fixed contribution 211 of the filtered excitation. Finally, an adder 212 sums the filtered adaptive 209 and fixed 211 contributions of the excitation to form the total filtered excitation 214.
In the first sub-frame of the current frame, the estimated fixed codebook gain in logarithmic domain at the output of the subtractor 203 is given by
\[ G_{c0}^{(1)} = a_0 + a_1 t - \log_{10}\left(\sqrt{E_i}\right) \tag{5} \]
where $G_{c0}^{(1)} = \log_{10}(g_{c0}^{(1)})$.
The inner term inside the logarithm of Equation (5) corresponds to the square root of the energy of the filtered innovation vector 202 (Ei is the energy of the filtered innovation vector in the first sub-frame of frame n). This inner term (square root of the energy Ei) is determined by a first calculator 215 of the energy Ei of the filtered innovation vector 202 and a calculator 216 of the square root of that energy Ei. A calculator 217 then computes the logarithm of the square root of the energy Ei for application to the negative input of the subtractor 203. The inner term (square root of the energy Ei) has non-zero energy; the energy is incremented by a small amount in case of all-zero frames to avoid log(0).
The estimation of the fixed codebook gain in calculator 201 is linear in logarithmic domain, with estimation coefficients a0 and a1 which are found for each sub-frame by means of a mean square minimization on a large signal database (training), as will be explained in the following description. The only estimation parameter in the equation, t, denotes the classification parameter for frame n (in one embodiment, this value is constant for all sub-frames in frame n). Details about classification of the frames are given below. Finally, the estimated value of the gain in logarithmic domain is converted back to the linear domain ($g_{c0}^{(1)} = 10^{G_{c0}^{(1)}}$).
The superscript (1) denotes the first sub-frame of the current frame n.
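The first sub-frame estimation described above may be sketched as follows. This is a minimal illustration, assuming a hypothetical function name and a small offset `eps` standing in for the "small amount" added to the energy to avoid log(0); the coefficient values in the example are arbitrary and not trained.

```python
import math


def estimate_gc0_first(a0, a1, t, z, eps=1e-10):
    """Estimated fixed codebook gain g_c0 in the first sub-frame, per Equation (5).

    z   : filtered innovation codevector of the sub-frame
    eps : assumed small offset keeping the energy non-zero
    """
    e_i = sum(s * s for s in z) + eps            # energy of the filtered innovation
    g_log = a0 + a1 * t - math.log10(math.sqrt(e_i))
    return 10.0 ** g_log                         # back to linear domain


# Arbitrary illustrative values (not trained coefficients).
g_c0 = estimate_gc0_first(a0=1.0, a1=0.1, t=2, z=[3.0, 4.0])
gamma = 0.9                                      # correction factor from the gain codebook
g_c = g_c0 * gamma                               # quantized fixed codebook gain
```

Because the subtraction is done in logarithmic domain, the result is equivalent to dividing 10^(a0 + a1·t) by the square root of the innovation energy.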
As explained in the foregoing description, the parameter t representative of the classification of the current frame is used in the calculation of the estimated fixed codebook gain gc0. Different codebooks can be designed for different classes of voice signals. However, this will increase memory requirements. Also, estimation of the fixed codebook gain in the sub-frames following the first sub-frame can be based on the frame classification parameter t and the available adaptive and fixed codebook gains from previous sub-frames in the current frame. The estimation is confined to the frame boundary to increase robustness against frame erasures.
For example, frames can be classified as unvoiced, voiced, generic, or transition frames. Different alternatives can be used for classification. An example is given later below as a non-limitative illustrative embodiment. Further, the number of voice classes can be different from the one used hereinabove. For example the classification can be only voiced or unvoiced in one embodiment. In another embodiment more classes can be added such as strongly voiced and strongly unvoiced.
The values for the classification estimation parameter t can be chosen arbitrarily. For example, for narrowband signals, the values of parameter t are set to 1, 3, 5, and 7 for unvoiced, voiced, generic, and transition frames, respectively, and for wideband signals, they are set to 0, 2, 4, and 6, respectively. However, other values for the estimation parameter t can be used for each class. Including this classification parameter t in the design and training used to determine the estimation coefficients results in a better estimate gc0 of the fixed codebook gain.
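For illustration, the example values of t quoted above can be held in two small lookup tables. The table and function names are hypothetical; only the numeric values come from the text.

```python
# Example values of the classification parameter t given in the description.
T_NARROWBAND = {"unvoiced": 1, "voiced": 3, "generic": 5, "transition": 7}
T_WIDEBAND = {"unvoiced": 0, "voiced": 2, "generic": 4, "transition": 6}


def classification_parameter(frame_class, wideband):
    """Return t for a frame class; raises KeyError for an unknown class."""
    table = T_WIDEBAND if wideband else T_NARROWBAND
    return table[frame_class]


t_nb = classification_parameter("generic", wideband=False)
t_wb = classification_parameter("voiced", wideband=True)
```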
The sub-frames following the first sub-frame in a frame use a slightly different estimation scheme. The difference is that, in these sub-frames, both the quantized adaptive codebook gain and the quantized fixed codebook gain from the previous sub-frame(s) in the current frame are used as auxiliary estimation parameters to increase the efficiency.
In one embodiment, a calculator 302 computes a linear estimation of the fixed codebook gain again in logarithmic domain and a converter 303 converts the gain estimation back to linear domain. The quantized adaptive codebook gains gp(1), gp(2), etc. from the previous sub-frames are supplied to the calculator 302 directly while the quantized fixed codebook gains gc(1), gc(2), etc. from the previous sub-frames are supplied to the calculator 302 in logarithmic domain through a logarithm calculator 304. A multiplier 305 then multiplies the estimated fixed codebook gain gc0 (which is different from that of the first sub-frame) from the converter 303 by the correction factor 306, selected from the gain codebook. As described in the preceding paragraph, the multiplier 305 then outputs a quantized fixed codebook gain gc, forming the gain of the fixed contribution of the excitation.
A first multiplier 307 multiplies the filtered adaptive excitation 308 from the adaptive codebook by the quantized adaptive codebook gain gp selected directly from the gain codebook to produce the adaptive contribution 309 of the excitation. A second multiplier 310 multiplies the filtered innovation codevector 311 from the fixed codebook by the quantized fixed codebook gain gc to produce the fixed contribution 312 of the excitation. An adder 313 sums the filtered adaptive 309 and filtered fixed 312 contributions of the excitation together so as to form the total filtered excitation 314 for the current frame.
The estimated fixed codebook gain from the calculator 302 in the kth sub-frame of the current frame in logarithmic domain is given by
\[ G_{c0}^{(k)} = a_0 + a_1 t + \sum_{j=1}^{k-1} \left( b_{2j-2}\, G_c^{(j)} + b_{2j-1}\, g_p^{(j)} \right), \quad k = 2, \ldots, K \tag{6} \]
where Gc(k)=log10(gc(k)) is the quantized fixed codebook gain in logarithmic domain in sub-frame k, and gp(k) is the quantized adaptive codebook gain in sub-frame k.
For example, in one embodiment, four (4) sub-frames are used (K=4) so the estimated fixed codebook gains, in logarithmic domain, in the second, third, and fourth sub-frames from the calculator 302 are given by the following relations:
\[ G_{c0}^{(2)} = a_0 + a_1 t + b_0 G_c^{(1)} + b_1 g_p^{(1)}, \]
\[ G_{c0}^{(3)} = a_0 + a_1 t + b_0 G_c^{(1)} + b_1 g_p^{(1)} + b_2 G_c^{(2)} + b_3 g_p^{(2)}, \text{ and} \]
\[ G_{c0}^{(4)} = a_0 + a_1 t + b_0 G_c^{(1)} + b_1 g_p^{(1)} + b_2 G_c^{(2)} + b_3 g_p^{(2)} + b_4 G_c^{(3)} + b_5 g_p^{(3)}. \]
The above estimation of the fixed codebook gain is based on both the quantized adaptive and fixed codebook gains of all previous sub-frames of the current frame. There is also another difference between this estimation scheme and the one used in the first sub-frame. The energy of the filtered innovation vector from the fixed codebook is not subtracted from the linear estimation of the fixed codebook gain in the logarithmic domain from the calculator 302. The reason comes from the use of the quantized adaptive codebook and fixed codebook gains from the previous sub-frames in the estimation equation. In the first sub-frame, the linear estimation is performed by the calculator 201 assuming unit energy of the innovation vector. Subsequently, this energy is subtracted to bring the estimated fixed codebook gain to the same energetic level as its optimal value (or at least close to it). In the second and subsequent sub-frames, the previous quantized values of the fixed codebook gain are already at this level so there is no need to take the energy of the filtered innovation vector into consideration. The estimation coefficients ai and bi are different for each sub-frame and they are determined offline using a large training database as will be described later below.
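The estimation used in the second and subsequent sub-frames (Equation (6)) may be sketched as below. The function name and argument layout are assumptions for this illustration; note that, consistent with the explanation above, no innovation energy is subtracted. The previous fixed codebook gains are assumed to be strictly positive so that their logarithm exists.

```python
import math


def estimate_gc0_next(a0, a1, t, b, prev_gc, prev_gp):
    """Estimated fixed codebook gain g_c0 in sub-frame k >= 2, per Equation (6).

    b       : [b0, b1, ..., b_{2k-3}] for the current sub-frame
    prev_gc : quantized fixed codebook gains of sub-frames 1..k-1 (linear domain)
    prev_gp : quantized adaptive codebook gains of sub-frames 1..k-1
    """
    if len(prev_gc) != len(prev_gp) or len(b) != 2 * len(prev_gc):
        raise ValueError("inconsistent number of coefficients and previous gains")
    g_log = a0 + a1 * t
    for j, (gc, gp) in enumerate(zip(prev_gc, prev_gp)):
        # zero-based j here corresponds to sub-frame j+1 in Equation (6)
        g_log += b[2 * j] * math.log10(gc) + b[2 * j + 1] * gp
    return 10.0 ** g_log


# Arbitrary illustrative coefficients for the second sub-frame (k = 2).
g_c0_2 = estimate_gc0_next(a0=0.0, a1=0.0, t=2, b=[1.0, 0.0],
                           prev_gc=[4.0], prev_gp=[0.5])
```

With b0 = 1 and all other coefficients zero, the estimate simply repeats the previous quantized fixed codebook gain, which makes the role of the auxiliary parameters easy to see.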
An optimal set of estimation coefficients is found on a large database containing clean, noisy and mixed speech signals in various languages and levels and with male and female talkers.
The estimation coefficients are calculated by running the codec with optimal unquantized values of adaptive and fixed codebook gains on the large database. It is reminded that the optimal unquantized adaptive and fixed codebook gains are found according to Equations (3) and (4).
In the following description it is assumed that the database comprises N+1 frames, and the frame index is n=0, . . . , N. The frame index n is added to the parameters used in the training which vary on a frame basis (classification, first sub-frame innovation energy, and optimum adaptive and fixed codebook gains).
The estimation coefficients are found by minimizing the mean square error between the estimated fixed codebook gain and the optimum gain in the logarithmic domain over all frames in the database.
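For the first sub-frame, the mean square error energy is given by
\[ E_{est} = \sum_{n=0}^{N} \left[ G_{c0}^{(1)}(n) - \log_{10}\left(g_{c,opt}^{(1)}(n)\right) \right]^2. \tag{7} \]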
From Equation (5), the estimated fixed codebook gain in the first sub-frame of frame n is given by
\[ G_{c0}^{(1)}(n) = a_0 + a_1 t(n) - \log_{10}\left(\sqrt{E_i(n)}\right), \]
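then the mean square error energy is given by
\[ E_{est} = \sum_{n=0}^{N} \left[ a_0 + a_1 t(n) - \log_{10}\left(\sqrt{E_i(n)}\right) - \log_{10}\left(g_{c,opt}^{(1)}(n)\right) \right]^2. \tag{8} \]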
In Equation (8) above, Eest is the total energy (over the whole database) of the error between the estimated and optimal fixed codebook gains, both in logarithmic domain. The optimal fixed codebook gain in the first sub-frame is denoted $g_{c,opt}^{(1)}$. As mentioned in the foregoing description, Ei(n) is the energy of the filtered innovation vector from the fixed codebook and t(n) is the classification parameter of frame n. The upper index (1) is used to denote the first sub-frame and n is the frame index.
The minimization problem may be simplified by defining a normalized gain of the innovation vector in logarithmic domain. That is
\[ G_i^{(1)}(n) = \log_{10}\left(\sqrt{E_i^{(1)}(n)}\right) + \log_{10}\left(g_{c,opt}^{(1)}(n)\right), \quad n = 0, \ldots, N. \tag{9} \]
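The total error energy then becomes
\[ E_{est} = \sum_{n=0}^{N} \left[ a_0 + a_1 t(n) - G_i^{(1)}(n) \right]^2. \tag{10} \]
The solution of the above defined MSE (Mean Square Error) problem is found by the following pair of partial derivatives
\[ \frac{\partial E_{est}}{\partial a_0} = 0, \qquad \frac{\partial E_{est}}{\partial a_1} = 0. \tag{11} \]
The optimal values of estimation coefficients resulting from the above equations are given by
\[ a_0 = \frac{\sum_n t^2(n) \sum_n G_i^{(1)}(n) - \sum_n t(n) \sum_n t(n)\, G_i^{(1)}(n)}{(N+1) \sum_n t^2(n) - \left[\sum_n t(n)\right]^2}, \qquad a_1 = \frac{(N+1) \sum_n t(n)\, G_i^{(1)}(n) - \sum_n t(n) \sum_n G_i^{(1)}(n)}{(N+1) \sum_n t^2(n) - \left[\sum_n t(n)\right]^2}, \]
where all sums run over n = 0, . . . , N.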
Estimation of the fixed codebook gain in the first sub-frame is performed in logarithmic domain and the estimated fixed codebook gain should be as close as possible to the normalized gain of the innovation vector in logarithmic domain, Gi(1)(n).
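Setting the two partial derivatives of the squared error to zero amounts to an ordinary two-parameter linear regression of the normalized gain Gi(1)(n) on the classification parameter t(n). A minimal sketch of that training step, under the assumption of a hypothetical function name and synthetic data, could look as follows.

```python
def train_first_subframe(t, g_i):
    """Least-squares fit of G_i(n) ~ a0 + a1 * t(n) over all training frames.

    t   : classification parameter of each frame
    g_i : normalized gain of the innovation vector (log domain) of each frame
    """
    n = len(t)
    if n != len(g_i) or n == 0:
        raise ValueError("t and g_i must be non-empty and of equal length")
    s_t = sum(t)
    s_g = sum(g_i)
    s_tt = sum(v * v for v in t)
    s_tg = sum(v * g for v, g in zip(t, g_i))
    det = n * s_tt - s_t * s_t
    if det == 0:
        raise ValueError("all frames share the same class; a1 is undetermined")
    a0 = (s_tt * s_g - s_t * s_tg) / det
    a1 = (n * s_tg - s_t * s_g) / det
    return a0, a1


# Synthetic data (hypothetical) generated from a0 = 0.5 and a1 = 0.25.
t_train = [0, 2, 4, 6]
g_train = [0.5 + 0.25 * v for v in t_train]
a0, a1 = train_first_subframe(t_train, g_train)
```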
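For the second and other subsequent sub-frames, the estimation scheme is slightly different. The error energy is given by
\[ E_{est}^{(k)} = \sum_{n=0}^{N} \left[ G_{c0}^{(k)}(n) - G_{c,opt}^{(k)}(n) \right]^2, \quad k = 2, \ldots, K, \tag{12} \]
where $G_{c,opt}^{(k)} = \log_{10}(g_{c,opt}^{(k)})$. Substituting Equation (6) into Equation (12) the following is obtained
\[ E_{est}^{(k)} = \sum_{n=0}^{N} \left[ a_0 + a_1 t(n) + \sum_{j=1}^{k-1} \left( b_{2j-2}\, G_c^{(j)}(n) + b_{2j-1}\, g_p^{(j)}(n) \right) - G_{c,opt}^{(k)}(n) \right]^2. \tag{13} \]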
For the calculation of the estimation coefficients in the second and subsequent sub-frames of each frame, the quantized values of both the fixed and adaptive codebook gains of previous sub-frames are used in the above Equation (13). Although it is possible to use the optimal unquantized gains in their place, the usage of quantized values leads to the maximum estimation efficiency in all sub-frames and consequently to better overall performance of the gain quantizer.
Thus, the number of estimation coefficients increases as the index of the current sub-frame is advanced. The gain quantization itself is described in the following description. The estimation coefficients ai and bi are different for each sub-frame, but the same symbols were used for the sake of simplicity. Normally, they would either have the superscript (k) associated therewith or they would be denoted differently for each sub-frame, wherein k is the sub-frame index.
The minimization of the error function in Equation (13) leads to the following system of linear equations
The solution of this system, i.e. the optimal set of estimation coefficients a0, a1, b0, . . . , b2k-3, is not provided here as it leads to complicated formulas. It is usually solved by mathematical software equipped with a linear equation solver, for example MATLAB. This is advantageously done offline and not during the encoding process.
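As a hedged illustration of this offline step, the coefficients of a given sub-frame can be obtained by forming the normal equations of the least-squares problem and solving them numerically. The sketch below uses a small Gaussian-elimination routine in place of a numerical package; the function names, the row layout and the synthetic data are assumptions made for the example.

```python
def solve_linear(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                   # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def train_subframe(rows, targets):
    """rows[n] = [1, t(n), G_c^(1)(n), g_p^(1)(n), ...]; targets[n] = G_c,opt^(k)(n).

    Returns [a0, a1, b0, b1, ...] minimizing the squared error over all frames.
    """
    p = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    atb = [sum(r[i] * g for r, g in zip(rows, targets)) for i in range(p)]
    return solve_linear(ata, atb)


# Synthetic second sub-frame data (hypothetical) built from known coefficients.
true_coeffs = [0.2, 0.05, 0.6, -0.1]                 # a0, a1, b0, b1
rows = [[1, 0, 0.1, 0.3], [1, 2, 0.5, 0.9], [1, 4, 0.2, 0.4],
        [1, 6, 0.9, 0.7], [1, 2, 0.7, 0.2], [1, 4, 0.3, 1.0]]
targets = [sum(c * v for c, v in zip(true_coeffs, r)) for r in rows]
coeffs = train_subframe(rows, targets)
```

For a real training database, a library least-squares solver would normally be preferred for numerical robustness; the elimination routine here only keeps the sketch self-contained.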
For the second sub-frame, Equation (14) reduces to
As mentioned hereinabove, calculation of the estimation coefficients is alternated with gain quantization as depicted in
Before gain quantization it is assumed that both the filtered adaptive excitation 501 from the adaptive codebook and the filtered innovation codevector 502 from the fixed codebook are already known. The gain quantization at the coder is performed by searching the designed gain codebook 503 in the MMSE (Minimum Mean Square Error) sense. As described in the foregoing description, each entry in the gain codebook 503 includes two values: the quantized adaptive codebook gain gp and the correction factor γ for the fixed contribution of the excitation. The estimation of the fixed codebook gain is performed beforehand and the estimated fixed codebook gain gc0 is used to multiply the correction factor γ selected from the gain codebook 503. In each sub-frame, the gain codebook 503 is searched completely, i.e. for indices q=0, . . . , Q−1, Q being the number of indices of the gain codebook. It is possible to limit the search range in case the quantized adaptive codebook gain gp is mandated to be below a certain threshold. To allow reducing the search range, the codebook entries may be sorted in ascending order according to the value of the adaptive codebook gain gp.
Referring to
The gain quantization can be performed by minimizing the energy of the error in Equation (2). The energy is given by
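\[ E = \mathbf{e}^t \mathbf{e} = (\mathbf{x} - g_p \mathbf{y} - g_c \mathbf{z})^t (\mathbf{x} - g_p \mathbf{y} - g_c \mathbf{z}). \tag{15} \]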
Substituting gc by γgc0 the following relation is obtained
\[ E = c_5 + g_p^2 c_0 - 2 g_p c_1 + \gamma^2 g_{c0}^2 c_2 - 2 \gamma g_{c0} c_3 + 2 g_p \gamma g_{c0} c_4 \tag{16} \]
where the constants or correlations c0, c1, c2, c3, c4 and c5 are calculated as in Equation (4) above. The constants or correlations c0, c1, c2, c3, c4 and c5, and the estimated gain gc0 are computed before the search of the gain codebook 503, and then the energy in Equation (16) is calculated for each codebook index (each set of entry values gp and γ).
The codevector from the gain codebook 503 leading to the lowest energy 515 of the error signal ei is chosen as the winning codevector and its entry values correspond to the quantized values gp and γ. The quantized value of the fixed codebook gain is then calculated as
\[ g_c = g_{c0} \cdot \gamma. \]
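The exhaustive search described above may be sketched as follows: the correlations and the estimated gain are computed once, and the error energy of Equation (16) is then evaluated for every codebook entry. The function name, the three-entry codebook and the numeric values are hypothetical and serve only to illustrate the procedure.

```python
def search_gain_codebook(codebook, c, g_c0):
    """Return (index, g_p, g_c) of the entry minimizing the energy of Equation (16).

    codebook : list of (g_p, gamma) entries
    c        : tuple (c0, c1, c2, c3, c4, c5) from Equation (4)
    g_c0     : estimated fixed codebook gain for the current sub-frame
    """
    c0, c1, c2, c3, c4, c5 = c
    best_q, best_e = -1, float("inf")
    for q, (g_p, gamma) in enumerate(codebook):
        g_c = gamma * g_c0
        e = (c5 + g_p * g_p * c0 - 2.0 * g_p * c1
             + g_c * g_c * c2 - 2.0 * g_c * c3 + 2.0 * g_p * g_c * c4)
        if e < best_e:
            best_q, best_e = q, e
    g_p, gamma = codebook[best_q]
    return best_q, g_p, gamma * g_c0


# Toy values (hypothetical): correlations of a target equal to 0.8*y + 2.0*z.
c = (2.0, 3.6, 2.0, 4.8, 1.0, 12.48)
codebook = [(0.5, 1.0), (0.8, 1.25), (1.0, 0.5)]
index, g_p_q, g_c_q = search_gain_codebook(codebook, c, g_c0=1.6)
```

Only the index is transmitted; the decoder recomputes gc0 in the same way and looks up the same entry.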
In the gain quantizer 600 of
In the decoder, the received index is used to retrieve the values of quantized adaptive codebook gain gp and correction factor γ from the gain codebook. The estimation of the fixed codebook gain is performed in the same manner as in the coder, as described in the foregoing description. The quantized value of the fixed codebook gain is calculated by the equation gc=gc0·γ. Both the adaptive codevector and the innovation codevector are decoded from the bitstream and they become adaptive and fixed excitation contributions that are multiplied by the respective adaptive and fixed codebook gains. Both excitation contributions are added together to form the total excitation. The synthesis signal is found by filtering the total excitation through a LP synthesis filter as known in the art of CELP coding.
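The decoder-side steps just described can be illustrated with the short sketch below. It assumes hypothetical function names and toy codevectors; the estimation of gc0 is taken as already performed in the same manner as in the coder.

```python
def decode_gains(index, codebook, g_c0):
    """Retrieve (g_p, g_c) from the received gain codebook index."""
    g_p, gamma = codebook[index]
    return g_p, g_c0 * gamma          # g_c = g_c0 * gamma


def total_excitation(adaptive, innovation, g_p, g_c):
    """Scale and add the adaptive and fixed excitation contributions."""
    return [g_p * v + g_c * w for v, w in zip(adaptive, innovation)]


# Toy values (hypothetical).
codebook = [(0.5, 1.0), (0.8, 1.25)]
g_p_dec, g_c_dec = decode_gains(1, codebook, g_c0=1.6)
excitation = total_excitation([1.0, 0.0], [0.0, 1.0], g_p_dec, g_c_dec)
```

The total excitation would then be passed through the LP synthesis filter, which is outside the scope of this sketch.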
Different methods can be used for determining classification of a frame, for example parameter t of
Signal classification can be performed in three steps, where each step discriminates a specific signal class. First, a signal activity detector (SAD) discriminates between active and inactive speech frames. If an inactive speech frame is detected (background noise signal) then the classification chain ends and the frame is encoded with comfort noise generation (CNG). If an active speech frame is detected, the frame is subjected to a second classifier to discriminate unvoiced frames. If the classifier classifies the frame as unvoiced speech signal, the classification chain ends, and the frame is encoded using a coding method optimized for unvoiced signals. Otherwise, the frame is processed through a “stable voiced” classification module. If the frame is classified as stable voiced frame, then the frame is encoded using a coding method optimized for stable voiced signals. Otherwise, the frame is likely to contain a non-stationary signal segment such as a voiced onset or rapidly evolving voiced signal. These frames typically require a general purpose coder and high bit rate for sustaining good subjective quality. The disclosed gain quantization technique has been developed and optimized for stable voiced and general-purpose frames. However, it can be easily extended for any other signal class.
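The three-step classification chain can be pictured as a simple cascade of decisions. In the sketch below the three detectors are passed in as callables, because their internals are not specified at this point; the function name, the class labels and the toy detectors are all assumptions made for the illustration.

```python
def classify_frame(frame, is_active, is_unvoiced, is_stable_voiced):
    """Cascade of three classifiers; each step may end the chain."""
    if not is_active(frame):
        return "inactive"        # background noise, encoded with CNG
    if is_unvoiced(frame):
        return "unvoiced"        # coding method optimized for unvoiced signals
    if is_stable_voiced(frame):
        return "stable_voiced"   # coding method optimized for stable voiced signals
    return "generic"             # e.g. voiced onset or rapidly evolving voiced signal


# Toy detectors (hypothetical) reading pre-computed flags from a dict.
def active(f):
    return f.get("active", False)


def unvoiced(f):
    return f.get("unvoiced", False)


def stable(f):
    return f.get("stable", False)


label_noise = classify_frame({"active": False}, active, unvoiced, stable)
label_onset = classify_frame({"active": True}, active, unvoiced, stable)
label_voiced = classify_frame({"active": True, "stable": True}, active, unvoiced, stable)
```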
In the following, the classification of unvoiced and voiced signal frames will be described.
The unvoiced parts of the sound signal are characterized by a missing periodic component and can be further divided into unstable frames, where the energy and spectrum change rapidly, and stable frames, where these characteristics remain relatively stable. The classification of unvoiced frames uses the following parameters:
Voicing Measure
The normalized correlation, used to determine the voicing measure, is computed as part of the open-loop pitch analysis. In the art of CELP coding, the open-loop search module usually outputs two estimates per frame. Here, it is also used to output the normalized correlation measures. These normalized correlations are computed on a weighted signal and a past weighted signal at the open-loop pitch delay. The weighted speech signal sw(n) is computed using a perceptual weighting filter. For example, a perceptual weighting filter with fixed denominator, suited for wideband signals, is used. An example of a transfer function of the perceptual weighting filter is given by the following relation:
where A(z) is a transfer function of linear prediction (LP) filter computed by means of the Levinson-Durbin algorithm and is given by the following relation
LP analysis and open-loop pitch analysis are well known in the art of CELP coding and, accordingly, will not be further described in the present description.
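The voicing measure is given by the average normalized correlation
\[ \bar{C}_{norm} = \frac{1}{3}\left( C_{norm}(d_0) + C_{norm}(d_1) + C_{norm}(d_2) \right) \]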
where Cnorm(d0), Cnorm(d1) and Cnorm(d2) are, respectively, the normalized correlation of the first half of the current frame, the normalized correlation of the second half of the current frame, and the normalized correlation of the look-ahead (the beginning of the next frame). The arguments to the correlations are the open-loop pitch lags.
The spectral tilt contains information about a frequency distribution of energy. The spectral tilt can be estimated in the frequency domain as a ratio between the energy concentrated in low frequencies and the energy concentrated in high frequencies. However, it can be also estimated in different ways such as a ratio between the two first autocorrelation coefficients of the signal.
The energy in high frequencies and low frequencies is computed following the perceptual critical bands as described in [J. D. Johnston, “Transform Coding of Audio Signals Using Perceptual Noise Criteria,” IEEE Journal on Selected Areas in Communications, vol. 6, no. 2, pp. 314-323, February 1988] of which the full contents is herein incorporated by reference. The energy in high frequencies is calculated as the average energy of the last two critical bands using the following relation:
$$\bar{E}_h = 0.5\left[E_{CB}(b_{max}-1) + E_{CB}(b_{max})\right]$$
where ECB(i) is the critical band energy of the i-th band and bmax is the last critical band. The energy in low frequencies is computed as the average energy of the first 10 critical bands using the following relation:
$$\bar{E}_l = \frac{1}{10}\sum_{i=b_{min}}^{b_{min}+9} E_{CB}(i)$$
where bmin is the first critical band.
The middle critical bands are excluded from the calculation as they do not tend to improve the discrimination between frames with high energy concentration in low frequencies (generally voiced) and with high energy concentration in high frequencies (generally unvoiced). In between, the energy content is not characteristic for any of the classes discussed further and increases the decision confusion.
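The spectral tilt is given by
$$e_t = \frac{\bar{E}_l}{\bar{E}_h}$$
and the average spectral tilt used in the classification is
$$\bar{e}_t = \tfrac{1}{3}\left(e_{old} + e_t(0) + e_t(1)\right),$$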
where eold is the spectral tilt in the second half of the previous frame.
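By way of a non-limiting sketch, the band averaging and the tilt averaging described above may be written as follows. The sketch takes the tilt as the ratio of low-frequency to high-frequency energy, as stated for the frequency-domain estimate, and assumes that the list of critical band energies starts at the first critical band and ends at the last one, with a non-zero high-band energy. The function names and the number of bands in any example are hypothetical.

```python
def band_energies(e_cb, n_low=10, n_high=2):
    """Average energy of the first `n_low` and of the last `n_high`
    critical bands. e_cb[0] is assumed to be band b_min and e_cb[-1]
    band b_max; the middle bands are ignored."""
    e_low = sum(e_cb[:n_low]) / n_low
    e_high = sum(e_cb[-n_high:]) / n_high
    return e_low, e_high

def spectral_tilt(e_cb):
    """Tilt as the ratio of low-band to high-band energy.
    ASSUMPTION: energies are in the linear domain and e_high > 0."""
    e_low, e_high = band_energies(e_cb)
    return e_low / e_high

def average_tilt(e_old, e_t0, e_t1):
    """Mean of the tilt from the second half of the previous frame and
    the two tilt estimates of the current frame."""
    return (e_old + e_t0 + e_t1) / 3.0
```

A frame whose energy is concentrated in the low bands yields a tilt well above 1, whereas a frame dominated by high-frequency energy yields a tilt below 1.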
The maximum short-time energy increase at low level dE0 is evaluated on the input sound signal s(n), where n=0 corresponds to the first sample of the current frame. Signal energy is evaluated twice per sub-frame. Assuming for example the scenario of four sub-frames per frame, the energy is calculated 8 times per frame. If the total frame length is, for example, 256 samples, each of these short segments may have 32 samples. In the calculation, short-term energies of the last 32 samples from the previous frame and the first 32 samples from the next frame are also taken into consideration. The short-time energies are calculated using the following relations:
where j=−1 and j=8 correspond to the end of the previous frame and the beginning of the next frame, respectively. Another set of nine short-term energies is calculated by shifting the signal indices in the previous equation by 16 samples using the following relation:
For energies that are sufficiently low, i.e. those that fulfill the condition $10\log(E_{st}^{(\cdot)}(j)) < 37$, the following ratio is calculated
for the first set of energies, and the same calculation is repeated for $E_{st}^{(2)}(j)$ with j=0, . . . , 7 to obtain two sets of ratios, rat(1) and rat(2). The overall maximum over these two sets is then found as
$$dE0 = \max\left(rat^{(1)}, rat^{(2)}\right)$$
which is the maximum short-time energy increase at low level.
The parameter dE, the maximum short-time energy increase, is similar to the maximum short-time energy increase at low level dE0, except that the low-level condition is not applied. Thus, the parameter is computed as the maximum of the following four values:
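The classification of unvoiced signal frames is based on the parameters described above, namely: the voicing measure $\bar{C}_{norm}$, the average spectral tilt $\bar{e}_t$, the maximum short-time energy increase at low level dE0, and the maximum short-time energy increase dE.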
The relative frame energy is given by
$$E_{rel} = E_t - \bar{E}_f$$
where Et is the total frame energy (in dB) and Ēf is the long-term average frame energy, updated during each active frame by Ēf=0.99Ēf+0.01Et.
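As a minimal sketch, the relative frame energy and the update of the long-term average may be expressed as below. The sketch assumes that the long-term average is maintained as a conventional exponential moving average, in which the weights 0.99 and 0.01 sum to one, and that the update is skipped for inactive frames. The function names and the `active` flag are hypothetical.

```python
def relative_frame_energy(e_t_db, e_f_db):
    """Total frame energy minus long-term average frame energy, in dB."""
    return e_t_db - e_f_db

def update_long_term_energy(e_f_db, e_t_db, active, alpha=0.99):
    """Update the long-term average frame energy during active frames.
    ASSUMPTION: exponential moving average with weights alpha and
    (1 - alpha); inactive frames leave the average unchanged."""
    if not active:
        return e_f_db
    return alpha * e_f_db + (1.0 - alpha) * e_t_db
```

With a long-term average of 40 dB, an active frame at 60 dB moves the average to about 40.2 dB, so isolated loud frames have only a small effect on the reference level.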
The rules for unvoiced classification of wideband signals are summarized below
The first line of this condition is related to low-energy signals and signals with low correlation concentrating their energy in high frequencies. The second line covers voiced offsets, the third line covers explosive signal segments and the fourth line is related to voiced onsets. The last line discriminates music signals that would be otherwise declared as unvoiced.
If the combined conditions are fulfilled the classification ends by declaring the current frame as unvoiced.
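If a frame is not classified as an inactive frame or as an unvoiced frame, it is then tested to determine whether it is a stable voiced frame. The decision rule is based on the normalized correlation in each sub-frame, the average spectral tilt, and the open-loop pitch estimates in all sub-frames.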
The open-loop pitch estimation procedure calculates three open-loop pitch lags: d0, d1 and d2, corresponding to the first half-frame, the second half-frame and the look-ahead (first half-frame of the following frame). To obtain precise pitch information in all four sub-frames, a fractional pitch refinement with ¼-sample resolution is calculated. This refinement is calculated on a perceptually weighted input signal swd(n) (for example the input sound signal s(n) filtered through the above described perceptual weighting filter). At the beginning of each sub-frame, a short correlation analysis (40 samples) with a resolution of 1 sample is performed in the interval (−7,+7) around the following delays: d0 for the first and second sub-frames and d1 for the third and fourth sub-frames. The correlations are then interpolated around their maxima at the fractional positions dmax−¾, dmax−½, dmax−¼, dmax, dmax+¼, dmax+½, dmax+¾. The value yielding the maximum correlation is chosen as the refined pitch lag.
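The refinement procedure may, purely as an example, be sketched as follows. The description does not specify the correlation formula or the interpolation method, so the sketch assumes a conventional normalized correlation and a parabola fitted through the maximum and its two integer neighbours, evaluated on the ¼-sample grid. The interval is treated as including both end points. All function names are hypothetical.

```python
import math

def norm_corr(x, start, length, lag):
    """ASSUMED conventional normalized correlation at delay `lag`."""
    num = e0 = e1 = 0.0
    for n in range(start, start + length):
        num += x[n] * x[n - lag]
        e0 += x[n] * x[n]
        e1 += x[n - lag] * x[n - lag]
    den = math.sqrt(e0 * e1)
    return num / den if den > 0.0 else 0.0

def refine_pitch_lag(swd, start, d_open, length=40, span=7):
    """Integer search in d_open-span .. d_open+span, followed by a
    quarter-sample refinement around the integer maximum."""
    lags = list(range(d_open - span, d_open + span + 1))
    corr = [norm_corr(swd, start, length, d) for d in lags]
    k = max(range(len(corr)), key=corr.__getitem__)
    d_max = lags[k]
    if k == 0 or k == len(corr) - 1:
        return float(d_max)  # no neighbour on one side, keep integer lag
    c_m, c_0, c_p = corr[k - 1], corr[k], corr[k + 1]
    # ASSUMPTION: parabola a*x^2 + b*x + c_0 through x = -1, 0, +1
    a = 0.5 * (c_m + c_p) - c_0
    b = 0.5 * (c_p - c_m)
    offsets = [q / 4.0 for q in range(-3, 4)]  # -3/4 ... +3/4
    best = max(offsets, key=lambda x: a * x * x + b * x + c_0)
    return d_max + best
```

For a sinusoid with a period of 50.25 samples and an open-loop lag of 52, the integer search selects a lag near 50 and the quarter-sample step moves the estimate toward the true period.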
Let the refined open-loop pitch lags in all four sub-frames be denoted as T(0), T(1), T(2) and T(3) and their corresponding normalized correlations as C(0), C(1), C(2) and C(3). Then, the voiced signal classification condition is given by
[ēt>4] AND
The above voiced signal classification condition indicates that the normalized correlation must be sufficiently high in all sub-frames, the pitch estimates must not diverge throughout the frame, and the energy must be concentrated in low frequencies. If this condition is fulfilled, the classification ends by declaring the current frame as voiced. Otherwise, the current frame is declared as generic.
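The structure of this decision may be sketched, without limitation, as follows. Only the spectral tilt threshold of 4 is taken from the condition above; the correlation threshold `corr_thr` and the maximum allowed lag difference `max_lag_jump` are hypothetical placeholders, since their values are not reproduced in this description, as are the function names.

```python
def is_stable_voiced(corr, lags, avg_tilt,
                     corr_thr=0.6, max_lag_jump=3.0, tilt_thr=4.0):
    """Voiced test over the four sub-frames.
    corr: normalized correlations C(0)..C(3)
    lags: refined open-loop pitch lags T(0)..T(3)
    ASSUMPTION: corr_thr and max_lag_jump are illustrative values only."""
    high_corr = all(c > corr_thr for c in corr)
    stable_pitch = all(abs(t1 - t0) < max_lag_jump
                       for t0, t1 in zip(lags, lags[1:]))
    low_freq_energy = avg_tilt > tilt_thr
    return high_corr and stable_pitch and low_freq_energy

def classify_voiced_or_generic(corr, lags, avg_tilt):
    """Declare the frame voiced if the condition holds, generic otherwise."""
    return "voiced" if is_stable_voiced(corr, lags, avg_tilt) else "generic"
```

Because the three sub-conditions are combined with a logical AND, a single sub-frame with weak correlation or a single abrupt change in pitch lag is sufficient to classify the frame as generic.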
Although the present invention has been described in the foregoing description with reference to non-restrictive illustrative embodiments thereof, these embodiments can be modified at will within the scope of the appended claims without departing from the spirit and nature of the present invention.
Number | Date | Country
---|---|---
61442960 | Feb 2011 | US