The present invention relates to the field of the coding/decoding and the processing of audio frequency signals (such as speech, music or other such signals) for their transmission or their storage.
More particularly, the invention relates to a method and a device for determining an optimized scale factor that can be used to adjust the level of an excitation signal or, in an equivalent manner, of a filter as part of a frequency band extension in a decoder or a processor enhancing an audio frequency signal.
Numerous techniques exist for compressing (with loss) an audio frequency signal such as speech or music.
The conventional coding methods for conversational applications are generally classified as waveform coding (PCM for “Pulse Code Modulation”, ADPCM for “Adaptive Differential Pulse Code Modulation”, transform coding, etc.), parametric coding (LPC for “Linear Predictive Coding”, sinusoidal coding, etc.) and parametric hybrid coding with a quantization of the parameters by “analysis by synthesis”, of which CELP (“Code Excited Linear Prediction”) coding is the best known example.
For the non-conversational applications, the prior art for (mono) audio signal coding consists of perceptual coding by transform or in subbands, with a parametric coding of the high frequencies by band replication.
A review of the conventional speech and audio coding methods can be found in the works by W. B. Kleijn and K. K. Paliwal (eds.), Speech Coding and Synthesis, Elsevier, 1995; M. Bosi, R. E. Goldberg, Introduction to Digital Audio Coding and Standards, Springer 2002; J. Benesty, M. M. Sondhi, Y. Huang (Eds.), Handbook of Speech Processing, Springer 2008.
The focus here is more particularly on the 3GPP standardized AMR-WB (“Adaptive Multi-Rate Wideband”) codec (coder and decoder), which operates at an input/output frequency of 16 kHz and in which the signal is divided into two subbands, the low band (0-6.4 kHz) which is sampled at 12.8 kHz and coded by CELP model and the high band (6.4-7 kHz) which is reconstructed parametrically by “band extension” (or BWE, for “Bandwidth Extension”) with or without additional information depending on the mode of the current frame. It can be noted here that the limitation of the coded band of the AMR-WB codec at 7 kHz is essentially linked to the fact that the frequency response in transmission of the wideband terminals was approximated at the time of standardization (ETSI/3GPP then ITU-T) according to the frequency mask defined in the standard ITU-T P.341 and more specifically by using a so-called “P341” filter defined in the standard ITU-T G.191 which cuts the frequencies above 7 kHz (this filter observes the mask defined in P.341). However, in theory, it is well known that a signal sampled at 16 kHz can have a defined audio band from 0 to 8000 Hz; the AMR-WB codec therefore introduces a limitation of the high band by comparison with the theoretical bandwidth of 8 kHz.
The 3GPP AMR-WB speech codec was standardized in 2001 mainly for the circuit mode (CS) telephony applications on GSM (2G) and UMTS (3G). This same codec was also standardized in 2003 by the ITU-T in the form of recommendation G.722.2 “Wideband coding of speech at around 16 kbit/s using Adaptive Multi-Rate Wideband (AMR-WB)”.
It comprises nine bit rates, called modes, from 6.6 to 23.85 kbit/s, and comprises discontinuous transmission mechanisms (DTX, for “Discontinuous Transmission”) with voice activity detection (VAD) and comfort noise generation (CNG) from silence description frames (SID, for “Silence Insertion Descriptor”), and lost frame correction mechanisms (FEC for “Frame Erasure Concealment”, sometimes called PLC, for “Packet Loss Concealment”).
The details of the AMR-WB coding and decoding algorithm are not repeated here; a detailed description of this codec can be found in the 3GPP specifications (TS 26.190, 26.191, 26.192, 26.193, 26.194, 26.204), in ITU-T G.722.2 (and the corresponding annexes and appendix), in the article by B. Bessette et al. entitled “The adaptive multirate wideband speech codec (AMR-WB)”, IEEE Transactions on Speech and Audio Processing, vol. 10, no. 8, 2002, pp. 620-636, and in the source code of the associated 3GPP and ITU-T standards.
The principle of band extension in the AMR-WB codec is fairly rudimentary. Indeed, the high band (6.4-7 kHz) is generated by shaping a white noise through a time (applied in the form of gains per subframe) and frequency (by the application of a linear prediction synthesis filter or LPC, for “Linear Predictive Coding”) envelope. This band extension technique is illustrated in
A white noise uHB1(n), n=0, …, 79, is generated at 16 kHz for each 5 ms subframe by a linear congruential generator (block 100). This noise uHB1(n) is shaped in time by application of gains for each subframe; this operation is broken down into two processing steps (blocks 102, 106 or 109):
It can be noted here that the normalization of the energies is done by comparing blocks of different size (64 for u(n) and 80 for uHB1(n)) without compensation of the differences in sampling frequencies (12.8 or 16 kHz).
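The remark above can be illustrated with a minimal Python sketch, given under stated assumptions. The generator constants, the seed and the helper names (lcg_noise, normalize_noise) are illustrative only and are not those of the AMR-WB codec. The sketch equalizes the total energy of an 80-sample noise block with that of a 64-sample excitation block; since both blocks last 5 ms, the mean power per sample of the noise ends up at 64/80 = 0.8 times that of the excitation.

```python
import math

def lcg_noise(seed, count, a=1103515245, c=12345, m=2 ** 31):
    """Return (samples in [-1, 1), new_seed) from a linear congruential generator.

    The constants a, c and m are illustrative assumptions, not the codec's.
    """
    out = []
    for _ in range(count):
        seed = (a * seed + c) % m
        out.append(2.0 * seed / m - 1.0)
    return out, seed

def energy(x):
    """Sum of squares of a block of samples."""
    return sum(v * v for v in x)

def normalize_noise(noise, excitation):
    """Scale the noise block so that its total energy equals that of the excitation block."""
    e_noise = energy(noise)
    if e_noise == 0.0:
        return list(noise)
    gain = math.sqrt(energy(excitation) / e_noise)
    return [gain * v for v in noise]

# Hypothetical 64-sample excitation block (5 ms at 12.8 kHz).
u = [math.sin(0.3 * n) for n in range(64)]
# 80-sample noise block (5 ms at 16 kHz).
noise, _ = lcg_noise(seed=21845, count=80)
u_hb = normalize_noise(noise, u)

# Equal block energies over unequal block lengths give unequal mean powers.
power_ratio = (energy(u_hb) / len(u_hb)) / (energy(u) / len(u))
```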
At 23.85 kbit/s, a correction information item is transmitted by the AMR-WB coder and decoded (blocks 107, 108) in order to refine the gain estimated for each subframe (4 bits every 5 ms, or 0.8 kbit/s). The artificial excitation uHB(n) is then filtered by an LPC synthesis filter (block 111) of transfer function 1/AHB(z) operating at the sampling frequency of 16 kHz. The construction of this filter depends on the bit rate of the current frame:
A number of drawbacks in the band extension technique of the AMR-WB codec can be identified, in particular:
while recalling that the filter ÂHF(z) models a spectrally aliased high band (because of the spectral properties of the filter bank separating the low and high bands). Since the filters are interpolated by subframes, the gain gmatch is computed only once per frame, and it is interpolated by subframes. The band extension gain coding technique in AMR-WB+, and more particularly the compensation of levels of the LPC filters at their junction is an appropriate method in the context of a band extension by LPC models in low and high band, and it can be noted that such a level compensation between LPC filters is not present in the band extension of the AMR-WB codec. However, it is in practice possible to verify that the direct equalization of the level between the two LPC filters at the separation frequency is not an optimal method and can provoke an overestimation of energy in high band and audible artifacts in certain cases; it will be recalled that an LPC filter represents a spectral envelope, and the principle of equalization of the level between two LPC filters for a given frequency amounts to adjusting the relative level of two LPC envelopes. Now, such an equalization performed at a precise frequency does not ensure a complete continuity and overall consistency of the energy (in frequency) in the vicinity of the equalization point when the frequency envelope of the signal fluctuates significantly in this vicinity. A mathematical way of positing the problem consists in noting that the continuity between two curves can be ensured by forcing them to meet at one and the same point, but there is nothing to guarantee that the local properties (successive derivatives) coincide so as to ensure a more global consistency. 
The risk in ensuring a spot continuity between low and high band LPC envelopes is of setting the LPC envelope in high band at a relative level that is too strong or too weak, the case of a level that is too strong being more damaging because it results in more annoying artifacts.
Moreover, the gain compensation in AMR-WB+ is primarily a prediction of the gain known to the coder and to the decoder and which serves to reduce the bit rate necessary for the transmission of gain information scaling the high-band excitation signal. Now, in the context of an interoperable enhancement of the AMR-WB coding/decoding, it is not possible to modify the existing coding of the gains by subframes (0.8 kbit/s) of the band extension in the AMR-WB 23.85 kbit/s mode. Furthermore, for the bit rates strictly less than 23.85 kbit/s, the compensation of levels of LPC filters in low and high bands can be applied in the band extension of a decoding compatible with AMR-WB, but experience shows that this sole technique derived from the AMR-WB+ coding, applied without optimization, can cause problems of overestimation of energy of the high band (>6 kHz).
There is therefore a need to improve the compensation of gains between linear prediction filters of different frequency bands for the frequency band extension in a codec of AMR-WB type or an interoperable version of this codec without in any way overestimating the energy in a frequency band and without requiring additional information from the coder.
The present invention improves the situation.
To this end, the invention targets a method for determining an optimized scale factor to be applied to an excitation signal or to a filter in an audio frequency signal frequency band extension method, the band extension method comprising a step of decoding or of extraction, in a first frequency band, of an excitation signal and of parameters of the first frequency band comprising coefficients of a linear prediction filter, a step of generation of an extended excitation signal on at least one second frequency band and a step of filtering, by a linear prediction filter, for the second frequency band. The determination method is such that it comprises the following steps:
Thus, the use of an additional filter of lower order than the filter of the first frequency band to be equalized makes it possible to avoid the overestimations of energy in the high frequencies which could result from local fluctuations of the envelope and which can disrupt the equalization of the prediction filters.
The equalization of gains between the linear prediction filters of the first and second frequency bands is thus enhanced.
In an advantageous application of the duly obtained optimized scale factor, the band extension method comprises a step of application of the optimized scale factor to the extended excitation signal.
In an appropriate embodiment, the application of the optimized scale factor is combined with the step of filtering in the second frequency band.
Thus, the steps of filtering and of application of the optimized scale factor are combined in a single filtering step to reduce the processing complexity.
In a particular embodiment, the coefficients of the additional filter are obtained by truncation of the transfer function of the linear prediction filter of the first frequency band to obtain a lower order.
This lower order additional filter is therefore obtained in a simple manner.
Furthermore, so as to obtain a stable filter, the coefficients of the additional filter are modified as a function of a stability criterion of the additional filter.
In a particular embodiment, the computation of the optimized scale factor comprises the following steps:
Thus, the optimized scale factor is computed in such a way as to avoid the annoying artifacts which could occur should the higher order filter frequency response of the first band in proximity to the common frequency show a signal peak or trough.
In a particular embodiment, the method further comprises the following steps, implemented for a predetermined decoding bit rate:
Thus, additional information can be used to enhance the quality of the extended signal for a predetermined operating mode.
The invention also targets a device for determining an optimized scale factor to be applied to an excitation signal or to a filter in an audio frequency signal frequency band extension device, the band extension device comprising a module for decoding or extracting, in a first frequency band, an excitation signal and parameters of the first frequency band comprising coefficients of a linear prediction filter, a module for generating an extended excitation signal on at least one second frequency band and a module for filtering, by a linear prediction filter, for the second frequency band. The determination device is such that it comprises:
The invention targets a decoder comprising a device as described.
It targets a computer program comprising code instructions for implementing the steps of the method for determining an optimized scale factor as described, when these instructions are executed by a processor.
Finally, the invention relates to a storage medium, that can be read by a processor, incorporated or not in the device for determining an optimized scale factor, possibly removable, storing a computer program implementing a method for determining an optimized scale factor as described previously.
Other features and advantages of the invention will become more clearly apparent on reading the following description, given purely as a nonlimiting example and with reference to the attached drawings, in which:
Unlike the AMR-WB decoding which operates with an output sampling frequency of 16 kHz, a decoder is considered here which can operate with an output signal (synthesis) at the frequency fs=8, 16, 32 or 48 kHz. It should be noted that it is assumed here that the coding has been performed according to the AMR-WB algorithm with an internal frequency of 12.8 kHz for the CELP coding in low band and at 23.85 kbit/s a gain coding per subframe at the frequency of 16 kHz; even though the invention is described here at the decoding level, it is assumed here that the coding can also operate with an input signal at the frequency fs=8, 16, 32 or 48 kHz and suitable resampling operations, beyond the context of the invention, are implemented in coding as a function of the value of fs. It can be noted that, when fs=8 kHz, in the case of a decoding compatible with AMR-WB, it is not necessary to extend the 0-6.4 kHz low band, because the audio band reconstructed at the frequency fs is limited to 0-4000 Hz.
In
The decoding according to
In the embodiment described here and with reference to
At 23.85 kbit/s, the HF gain correction information (0.8 kbit/s) is decoded here. Its use is detailed later, with reference to
In order to align the decoded low and high bands, a delay (block 310) is introduced to synchronize the outputs of the blocks 306 and 307 and the high band synthesized at 16 kHz is resampled from 16 kHz to the frequency fs (output of block 311). The value of the delay T depends on how the high band signal is synthesized, and on the frequency fs as in the post-processing of the low frequencies. Thus, generally, the value of T in the block 310 will have to be adjusted according to the specific implementation.
The low and high bands are then combined (added) in the block 312 and the synthesis obtained is post-processed by 50 Hz high-pass filtering (of IIR type) of order 2, the coefficients of which depend on the frequency fs (block 313) and output post-processing with optional application of the “noise gate” in a manner similar to G.718 (block 314).
Referring to
Thus, the block 400, from an excitation signal decoded in a first frequency band u(n), performs a band extension to obtain an extended excitation signal uHB (n) on at least one second frequency band.
It will be noted here that the optimized scale factor estimation according to the invention is independent of how the signal uHB(n) is obtained. One condition concerning its energy is, however, important. Indeed, the energy of the high band from 6000 to 8000 Hz must be at a level similar to the energy of the band from 4000 to 6000 Hz of the decoded excitation signal at the output of the block 302. Furthermore, since the low-band signal is de-emphasized (block 305), the de-emphasis must also be applied to the high-band excitation signal, either by using a specific de-emphasis filter, or by multiplying by a constant factor which corresponds to an average attenuation of the filter mentioned. This condition does not apply to the case of the 23.85 kbit/s bit rate which uses the additional information transmitted by the coder. In this case, the energy of the high-band excitation signal must be consistent with the energy of the signal corresponding to the coder, as explained later.
The frequency band extension can, for example, be implemented in the same way as for the decoder of AMR-WB type described with reference to
In another embodiment, this band extension can be performed from a combination of a white noise and of a decoded excitation signal as illustrated and described later for the blocks 700 to 707 in
Other frequency band extension methods with conservation of the energy level between the decoded excitation signal and the extended excitation signal as described below, can of course be envisaged for the block 400.
Furthermore, the band extension module can also be independent of the decoder and can perform a band extension for an existing audio signal stored or transmitted to the extension module, with an analysis of the audio signal to extract an excitation and an LPC filter therefrom. In this case, the excitation signal at the input of the extension module is no longer a decoded signal but a signal extracted after analysis, like the coefficients of the linear prediction filter of the first frequency band used in the method for determining the optimized scale factor in an implementation of the invention.
In the example illustrated in
In this case, an optimized scale factor denoted gHB2(m) is computed. In one embodiment, this computation is performed preferentially for each subframe and it consists in equalizing the levels of the frequency responses of the LPC filters 1/Â(z) and 1/Â(z/γ) used in low and high frequencies, as described later with reference to
In an alternative embodiment, it will be possible to keep the extrapolated HF synthesis filter 1/Âext(z/γ) as implemented in the AMR-WB decoder or a decoder that can interwork with the AMR-WB coder/decoder, for example according to the ITU-T recommendation G.718, in place of the filter 1/Â(z/γ). The compensation according to the invention is then performed from the filters 1/Â(z) and 1/Âext(z/γ).
The determination of the optimized scale factor is also performed by the determination (in 401a) of a linear prediction filter called additional filter, of lower order than the linear prediction filter of the first frequency band 1/Â(z), the coefficients of the additional filter being obtained from the parameters decoded or extracted from the first frequency band. The optimized scale factor is then computed (in 401b) as a function at least of these coefficients to be applied to the extended excitation signal uHB(n).
The principle of the determination of the optimized scale factor, implemented in the block 401, is illustrated in
The first step consists in computing the frequency responses R and P respectively of the linear prediction filter of the first frequency band (low band) and of the second frequency band (high band) at the frequency of 6000 Hz. The following is first computed:
in which M=16 is the order of the decoded LPC filter, 1/Â(z), and θ corresponds to the frequency of 6000 Hz normalized for the sampling frequency of 12.8 kHz, that is:
Then, similarly, the following is computed:
in which
In a preferred embodiment, the quantities P and R are computed according to the following pseudo-code:
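As an illustration only (a floating-point sketch, not the codec's pseudo-code), the magnitude response of an LPC synthesis filter at 6000 Hz can be evaluated as follows. The function name lpc_response, the example coefficients and the value of gamma are assumptions made for the example. The normalized frequency is taken as 2π·6000/12800 for the low-band filter and, by assumption, as 2π·6000/16000 for the high-band filter operating at 16 kHz.

```python
import cmath
import math

def lpc_response(a, freq_hz, fs_hz, gamma=1.0):
    """Magnitude of 1/A(z/gamma) at freq_hz.

    a is the coefficient list [1, a1, ..., aM] of A(z) = sum_i a[i] * z**-i.
    Weighting by gamma replaces a[i] with a[i] * gamma**i.
    """
    theta = 2.0 * math.pi * freq_hz / fs_hz
    denom = sum(c * (gamma ** i) * cmath.exp(-1j * theta * i)
                for i, c in enumerate(a))
    return 1.0 / abs(denom)

# Hypothetical low-order coefficients; the text uses a decoded filter of order M=16.
a_hat = [1.0, -1.2, 0.8, -0.3]

R = lpc_response(a_hat, 6000.0, 12800.0)             # low-band filter 1/A(z)
P = lpc_response(a_hat, 6000.0, 16000.0, gamma=0.9)  # weighted filter; gamma value assumed
```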
The additional prediction filter is obtained for example by suitably truncating the polynomial Â(z) to the order 2.
In fact, the direct truncation to the order 2 leads to the filter 1+â1z⁻¹+â2z⁻², which can pose a problem because there is generally nothing to guarantee that this filter of order 2 is stable. In a preferred embodiment, the stability of the filter 1+â1z⁻¹+â2z⁻² is therefore detected and a filter 1+â1′z⁻¹+â2′z⁻² is used, the coefficients of which are drawn from 1+â1z⁻¹+â2z⁻² as a function of the instability detection. More specifically, the following are initialized:
âi′=âi, i=1,2
The stability of the filter 1+â1z⁻¹+â2z⁻² can be verified in different ways; here, a conversion to the domain of the PARCOR coefficients (or reflection coefficients) is used, by computing:
k1=â1′/(1+â2′)
k2=â2′
The stability is verified if |ki|<1, i=1, 2. The value of ki is therefore conditionally modified so as to ensure the stability of the filter, with the following steps:
in which min(.,.) and max(.,.) respectively give the minimum and the maximum of 2 operands.
It should be noted that the threshold values, 0.99 for k1 and 0.6 for k2, will be able to be adjusted in variants of the invention. It will be recalled that the first reflection coefficient, k1, characterizes the spectral slope (or tilt) of the signal modeled to the order 1; in the invention the value of k1 is saturated at a value close to the stability limit, in order to preserve this slope and retain a tilt similar to that of 1/Â(z). It will also be recalled that the second reflection coefficient, k2, characterizes the resonance level of the signal modeled to the order 2; since the use of a filter of order 2 aims to eliminate the influence of such resonances around the frequency of 6000 Hz, the value of k2 is more strongly limited; this limit is set at 0.6.
The coefficients of 1+â1′z⁻¹+â2′z⁻² are then obtained by:
â1′=(1+k2)k1
â2′=k2
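The truncation and stabilization steps can be sketched as follows, as one possible reading rather than a definitive implementation. The function name stabilized_order2 is hypothetical. The symmetric clamping of k1 to ±0.99 and of k2 to ±0.6 is an assumption based on the thresholds and on the min and max operators mentioned in the text, and the guard against a zero denominator is an addition made for the example.

```python
def stabilized_order2(a1, a2, k1_max=0.99, k2_max=0.6):
    """Return (a1', a2') of a stable order-2 filter derived from (a1, a2).

    The coefficients are converted to reflection coefficients, limited,
    and converted back. The symmetric limits are an assumption.
    """
    denom = 1.0 + a2
    if abs(denom) < 1e-12:
        # Assumption: saturate k1 when the conversion is undefined.
        k1 = k1_max if a1 >= 0.0 else -k1_max
    else:
        k1 = a1 / denom
    k2 = a2

    # Limit the reflection coefficients (0.99 for k1, 0.6 for k2).
    k1 = min(k1_max, max(-k1_max, k1))
    k2 = min(k2_max, max(-k2_max, k2))

    # Back to direct-form coefficients.
    return (1.0 + k2) * k1, k2

# A stable pair is left unchanged; an unstable pair is pulled inside the limits.
stable = stabilized_order2(-0.5, 0.2)
fixed = stabilized_order2(-2.5, 1.2)
```

With these limits, the second example gives k1=−0.99 and k2=0.6, so the returned coefficients are (1+0.6)·(−0.99) and 0.6.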
The frequency response of the additional filter is therefore finally computed:
with
This quantity is computed preferentially according to the following pseudo-code:
To avoid artifacts due to excessively abrupt variations of energy of the high band, a smoothing is applied to the value of R. In a preferred embodiment, an exponential smoothing is performed with a factor that is fixed in time (0.5), in the form of:
R=0.5R+0.5Rprev
Rprev=R
in which Rprev corresponds to the value of R in the preceding subframe and the factor 0.5 is optimized empirically. Obviously, the factor 0.5 can be changed to another value, and other smoothing methods are also possible. It should be noted that the smoothing makes it possible to reduce the temporal variations and therefore avoid artifacts.
The optimized scale factor is then given by:
gHB2(m)=max(min(R,Q),P)/P
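A minimal sketch of this case is given below, assuming that R, P and Q are the frequency responses at 6000 Hz of the low-band filter, of the high-band filter and of the additional filter respectively. The function name and the returned pair are choices made for the example.

```python
def scale_factor_fixed_smoothing(R, P, Q, R_prev, beta=0.5):
    """Return (g_hb2, smoothed_R) for the fixed-factor smoothing case.

    R is smoothed with the previous subframe value, then the factor
    max(min(R, Q), P) / P is formed. The smoothed R is returned so that
    the caller can store it as Rprev for the next subframe.
    """
    R_smoothed = beta * R + (1.0 - beta) * R_prev
    g_hb2 = max(min(R_smoothed, Q), P) / P
    return g_hb2, R_smoothed

g_a, r_a = scale_factor_fixed_smoothing(R=2.0, P=1.0, Q=2.5, R_prev=4.0)
g_b, r_b = scale_factor_fixed_smoothing(R=0.2, P=1.0, Q=0.5, R_prev=0.2)
```

Because the outer operator is a maximum with P, the factor obtained with this formula is never below 1.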
In an alternative embodiment, it will be possible to replace the smoothing of R with a smoothing of gHB2(m) such that:
gHB2(m)←0.5gHB2(m)+0.5gHB2(m−1)
If the tilt (computed as in AMR-WB in the block 104) is positive (tilt>0 as in
the quantity R is smoothed adaptively in time, with a stronger smoothing when R is low. As in the preceding case, this smoothing makes it possible to reduce the temporal variations and therefore avoids artifacts:
R=(1−α)R+αRprev with α=1−R²
Rprev=R
Then, the optimized scale factor is given by:
gHB2(m)=min(R,P,Q)/P
In an alternative embodiment, it will be possible to replace the smoothing of R with a smoothing of gHB2(m) as computed above.
gHB(m)=(1−α)gHB(m)+αgHB(m−1), m=0, . . . ,3, α=1−gHB2(m)
where gHB(−1) is the scale or gain factor computed for the last subframe of the preceding frame.
The minimum of R, P, Q is taken here in order to avoid overestimating the scale factor.
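The positive-tilt case can be sketched in the same way, under the same assumptions about R, P and Q. The function name is hypothetical, and the smoothing factor is applied exactly as written, without any clamping of α.

```python
def scale_factor_adaptive_smoothing(R, P, Q, R_prev):
    """Return (g_hb2, smoothed_R) for the adaptive smoothing case (tilt > 0).

    alpha = 1 - R**2 gives a stronger smoothing when R is low.
    alpha is used as written in the text and is not clamped here.
    """
    alpha = 1.0 - R * R
    R_smoothed = (1.0 - alpha) * R + alpha * R_prev
    g_hb2 = min(R_smoothed, P, Q) / P
    return g_hb2, R_smoothed

g_c, r_c = scale_factor_adaptive_smoothing(R=0.5, P=1.0, Q=0.8, R_prev=1.0)
```

Since the minimum includes P itself, the factor obtained with this formula never exceeds 1.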
In a variant, the above condition depending only on the tilt will be able to be extended to take account not only of the tilt parameter but also of other parameters in order to refine the decision. Furthermore, the computation of gHB2 (m) will be able to be adjusted according to these said additional parameters.
An example of additional parameter is the number of zero crossings (ZCR, zero crossing rate) which can be defined as:
in which
The parameter zcr generally gives results similar to the tilt. A good classification criterion is the ratio between zcrs computed for the synthesized signal s(n) and zcru computed for the excitation signal u(n) at 12 800 Hz. This ratio is between 0 and 1, where 0 means that the signal has a decreasing spectrum and 1 that the spectrum is increasing (which corresponds to (1−tilt)/2). In this case, a ratio zcrs/zcru>0.5 corresponds to the case tilt<0, and a ratio zcrs/zcru<0.5 corresponds to tilt>0. In a variant, it will be possible to use a function of a parameter tilthp, where tilthp is the tilt computed for the synthesized signal s(n) filtered by a high-pass filter with a cut-off frequency for example at 4800 Hz; in this case, the response 1/Â(z/γ) from 6 to 8 kHz (applied at 16 kHz) corresponds to the weighted response of 1/Â(z) from 4.8 to 6.4 kHz. Since 1/Â(z/γ) has a more flattened response, it is necessary to compensate this change of tilt. The scale factor function according to tilthp is then given in an embodiment by: (1−tilthp)²+0.6. Q and R are therefore multiplied by min(1, (1−tilthp)²+0.6) when tilt>0 or by max(1, (1−tilthp)²+0.6) when tilt<0.
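The adjustment of Q and R by the tilthp function can be sketched as follows. The function names are hypothetical, and leaving Q and R unchanged when tilt is exactly zero is an assumption, since the text does not cover that case.

```python
def tilt_hp_factor(tilt_hp):
    """Scale factor function (1 - tilt_hp)**2 + 0.6."""
    return (1.0 - tilt_hp) ** 2 + 0.6

def adjust_q_r(Q, R, tilt, tilt_hp):
    """Multiply Q and R by min(1, f) when tilt > 0, by max(1, f) when tilt < 0."""
    f = tilt_hp_factor(tilt_hp)
    if tilt > 0.0:
        m = min(1.0, f)
    elif tilt < 0.0:
        m = max(1.0, f)
    else:
        m = 1.0  # assumption: the case tilt == 0 is not specified
    return Q * m, R * m

q1, r1 = adjust_q_r(Q=2.0, R=4.0, tilt=0.3, tilt_hp=0.5)    # f = 0.85
q2, r2 = adjust_q_r(Q=2.0, R=4.0, tilt=-0.3, tilt_hp=-0.5)  # f = 2.85
```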
The case of the 23.85 kbit/s bit rate is now considered, for which a gain correction is performed by the blocks 403 to 408. This gain correction could moreover be the subject of a separate invention. In this particular embodiment according to the invention, the gain correction information, denoted gHBcorr(m), transmitted by the AMR-WB (compatible) coding with a bit rate of 0.8 kbit/s, is used to improve the quality at 23.85 kbit/s.
It is assumed here that the AMR-WB (compatible) coding has performed a correction gain quantization on 4 bits as described in ITU-T clause G.722.2/5.11 or, equivalently, in the 3GPP clause TS 26.190/5.11.
In the AMR-WB coder, the correction gain is computed by comparing the energy of the original signal sampled at 16 kHz and filtered by a 6-7 kHz bandpass filter, s(n), with the energy of the white noise at 16 kHz filtered by a synthesis filter 1/Â(z/γ) and a 6-7 kHz bandpass filter (before the filtering, the energy of the noise is set to a level similar to that of the excitation at 12.8 kHz), sHB2(n). The gain is the root of the ratio of energy of the original signal to the energy of the noise divided by two. In one possible embodiment, it will be possible to change the bandpass filter for a filter with a wider band (for example from 6 to 7.6 kHz).
To be able to apply the gain information received at 23.85 kbit/s (in the block 407), it is important to bring the excitation to a level similar to that expected of the AMR-WB (compatible) coding. Thus, the block 404 performs the scaling of the excitation signal according to the following equation:
uHB1(n)=gHB3(m)uHB(n), n=80m, …, 80(m+1)−1
in which gHB3(m) is a gain per subframe computed in the block 403 in the form:
in which the factor 5 in the denominator serves to compensate the bandwidth difference between the signal u (n) and the signal uHB(n), given that, in the AMR-WB coding, the HF excitation is a white noise over the 0-8000 Hz band.
The index of 4 bits per subframe, denoted indexHF_gain(m), sent at 23.85 kbit/s is demultiplexed from the bit stream (block 405) and decoded by the block 406 as follows:
gHBcorr(m)=2·HP_gain(indexHF_gain(m))
in which HP_gain(⋅) is the HF gain quantization dictionary defined in the AMR-WB coding and recalled below:
The block 407 performs the scaling of the excitation signal according to the following equation:
uHB2(n)=gHBcorr(m)uHB1(n), n=80m, …, 80(m+1)−1
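The decoding of the correction gains and their application per subframe can be sketched as follows. The table used in the example is a placeholder with made-up values, not the HP_gain dictionary of the AMR-WB coding, and the function names are hypothetical.

```python
def decode_correction_gains(indices, hp_gain_table):
    """g_HBcorr(m) = 2 * HP_gain(index(m)) for each subframe index m."""
    return [2.0 * hp_gain_table[i] for i in indices]

def apply_subframe_gains(signal, gains, subframe_len=80):
    """Scale samples n = 80m, ..., 80(m+1)-1 by gains[m]."""
    if len(signal) != subframe_len * len(gains):
        raise ValueError("signal length must equal subframe_len * number of gains")
    return [gains[n // subframe_len] * x for n, x in enumerate(signal)]

# Placeholder 16-entry table (4-bit index); NOT the actual codec dictionary.
placeholder_table = [0.1 * (i + 1) for i in range(16)]

indices = [0, 4, 9, 15]        # one 4-bit index per 5 ms subframe
g_corr = decode_correction_gains(indices, placeholder_table)
u_hb1 = [1.0] * 320            # hypothetical 20 ms frame at 16 kHz
u_hb2 = apply_subframe_gains(u_hb1, g_corr)
```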
Finally, the energy of the excitation is adjusted to the level of the current subframe with the following conditions (block 408). The following is computed:
The numerator here represents the high-band signal energy which would be obtained in the 23.05 kbit/s mode. As explained before, for the bit rates<23.85 kbit/s, it is necessary to retain the level of energy between the decoded excitation signal and the extended excitation signal uHB(n), but this constraint is not necessary in the case of the 23.85 kbit/s bit rate, since uHB(n) is in this case scaled by the gain gHB3(m). To avoid double multiplications, certain multiplication operations applied to the signal in the block 400 are applied in the block 402 by multiplying by g(m). The value of g(m) depends on the uHB(n) synthesis algorithm and must be adjusted such that the energy level between the decoded excitation signal in low band and the signal g(m)uHB(n) is retained.
In a particular embodiment, which will be described in detail later with reference to
It is assumed that, in the block 408, there is information on the tilt of the low-band signal. In a preferred embodiment, this tilt is computed as in the AMR-WB codec according to the blocks 103 and 104, but other methods for estimating the tilt are possible without changing the principle of the invention.
If fac(m)>1 or tilt<0, the following is assumed:
uHB′(n)=uHB2(n), n=80m, …, 80(m+1)−1
Otherwise:
uHB′(n)=max(√(1−tilt), fac(m))·uHB2(n), n=80m, …, 80(m+1)−1
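The two cases above amount to choosing one multiplicative gain per subframe, as in the following sketch. The function name is hypothetical, fac(m) is taken as an input since its computation is described elsewhere, and the protection against a negative argument of the square root is an addition made for the example.

```python
import math

def final_subframe_gain(fac, tilt):
    """Gain applied to uHB2(n) over one subframe to obtain uHB'(n).

    Returns 1 when fac > 1 or tilt < 0, otherwise max(sqrt(1 - tilt), fac).
    """
    if fac > 1.0 or tilt < 0.0:
        return 1.0
    # Assumption: guard against tilt slightly above 1.
    return max(math.sqrt(max(0.0, 1.0 - tilt)), fac)

def adjust_subframe(u_hb2_subframe, fac, tilt):
    """Apply the gain to the samples of one subframe."""
    g = final_subframe_gain(fac, tilt)
    return [g * x for x in u_hb2_subframe]
```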
It will be noted that the optimized scale factor computation described here, notably in the blocks 401 and 402, is distinguished from the abovementioned equalization of filter levels performed in the AMR-WB+ codec by a number of aspects:
The main steps are implemented by the block 401.
Thus, an extended excitation signal uHB(n) is obtained in a frequency band extension method E601 which comprises a step of decoding or of extraction, in a first frequency band called low band, of an excitation signal and of parameters of the first frequency band such as, for example, the coefficients of the linear prediction filter of the first frequency band.
A step E602 determines a linear prediction filter called additional filter, of lower order than that of the first frequency band. To determine this filter, the parameters of the first frequency band decoded or extracted are used.
In one embodiment, this step is performed by truncation of the transfer function of the linear prediction filter of the low band to obtain a lower filter order, for example 2. These coefficients can then be modified as a function of a stability criterion as explained previously with reference to
From the coefficients of the additional filter thus determined, a step E603 is implemented to compute the optimized scale factor to be applied to the extended excitation signal. This optimized scale factor is, for example, computed from the frequency response of the additional filter at a common frequency between the low band (first frequency band) and the high band (second frequency band). A minimum value can be chosen between the frequency response of this filter and those of the low-band and high-band filters.
This therefore avoids the overestimations of energy which could exist in the methods of the prior art.
This step of computation of the optimized scale factor is, for example, described previously with reference to
The step E604, performed by the block 402 or 409 (depending on the decoding bit rate) for the band extension, applies the duly computed optimized scale factor to the extended excitation signal so as to obtain an optimized extended excitation signal uHB′(n).
In a particular embodiment, the device for determining the optimized scale factor 708 is incorporated in a band extension device now described with reference to
In this embodiment, the band extension block 400 of
Thus, at the input of the band extension device, a low-band excitation signal decoded or estimated by analysis is received (u(n)). The band extension here uses the excitation decoded at 12.8 kHz (exc2 or u(n)) at the output of the block 302 of
It will be noted that, in this embodiment, the generation of the oversampled and extended excitation is performed in a frequency band ranging from 5 to 8 kHz therefore including a second frequency band (6.4-8 kHz) above the first frequency band (0-6.4 kHz).
Thus, the generation of an extended excitation signal is performed at least over the second frequency band but also over a part of the first frequency band.
Obviously, the values defining these frequency bands can be different depending on the decoder or the processing device in which the invention is applied.
For this exemplary embodiment, this signal is transformed to obtain an excitation signal spectrum U(k) by the time-frequency transformation module 500. In a particular embodiment, the transform uses a DCT-IV (for “Discrete Cosine Transform” type IV) (block 700) on the current frame of 20 ms (256 samples), without windowing, which amounts to directly transforming u(n) with n=0, L, 255 according to the following formula:
in which N=256 and k=0, …, 255.
It should be noted here that the transformation without windowing (or, equivalently, with an implicit rectangular window of the length of the frame) is possible because the processing is performed in the excitation domain, and not the signal domain so that no artifact (block effects) is audible, which constitutes an important advantage of this embodiment of the invention.
In this embodiment, the DCT-IV transformation is implemented by FFT according to the so-called “Evolved DCT (EDCT)” algorithm described in the article by D. M. Zhang, H. T. Li, A Low Complexity Transform—Evolved DCT, IEEE 14th International Conference on Computational Science and Engineering (CSE), August 2011, pp. 144-149, and implemented in the ITU-T standards G.718 Annex B and G.729.1 Annex E.
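For illustration only, the transform can be sketched by direct evaluation. This is not the FFT-based EDCT algorithm cited above but a plain O(N²) computation, and the unnormalized scaling is an assumption; the function names are hypothetical.

```python
import math

def dct_iv(x):
    """Direct O(N^2) DCT-IV of a frame, without windowing or normalization."""
    n_len = len(x)
    return [
        sum(x[n] * math.cos(math.pi / n_len * (n + 0.5) * (k + 0.5))
            for n in range(n_len))
        for k in range(n_len)
    ]

def dct_iv_inverse(spec):
    """The unnormalized DCT-IV applied twice returns (N/2) times the input,
    so the inverse is the same transform scaled by 2/N."""
    n_len = len(spec)
    return [2.0 / n_len * v for v in dct_iv(spec)]
```

The same routine serves for the forward transform on 256 samples and, with a different length, for an inverse transform on 320 samples.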
In variants of the invention, and without loss of generality, the DCT-IV transformation will be able to be replaced by other short-term time-frequency transformations of the same length and in the excitation domain, such as an FFT (for “Fast Fourier Transform”) or a DCT-II (Discrete Cosine Transform-type II). Alternatively, it will be possible to replace the DCT-IV on the frame by a transformation with overlap-addition and windowing of length greater than the length of the current frame, for example by using an MDCT (for “Modified Discrete Cosine Transform”). In this case, the delay T in the block 310 of
The DCT spectrum, U(k), of 256 samples covering the 0-6400 Hz band (at 12.8 kHz), is then extended (block 701) into a spectrum of 320 samples covering the 0-8000 Hz band (at 16 kHz) in the following form:
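A piecewise form consistent with the band mapping detailed in the paragraphs that follow (zeroed indices below 200, a direct copy for k=200 to 239, and a shifted copy starting at start_band for k=240 to 319) would be:

```latex
U_{HB1}(k) =
\begin{cases}
0, & k = 0, \dots, 199 \\
U(k), & k = 200, \dots, 239 \\
U(k + \mathrm{start\_band} - 240), & k = 240, \dots, 319
\end{cases}
```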
in which it is preferentially taken that start_band=160.
The block 701 operates as a module for generating an oversampled and extended excitation signal and performs a resampling from 12.8 to 16 kHz in the frequency domain, by adding ¼ of samples (k=240, …, 319) to the spectrum, the ratio between 16 and 12.8 being 5/4.
Furthermore, the block 701 performs an implicit high-pass filtering in the 0-5000 Hz band since the first 200 samples of UHB1(k) are set to zero; as explained later, this high-pass filtering is also complemented by a part of progressive attenuation of the spectral values of indices k=200, …, 255 in the 5000-6400 Hz band; this progressive attenuation is implemented in the block 704 but could be performed separately outside of the block 704. Equivalently, and in variants of the invention, the implementation of the high-pass filtering, separated into a block of coefficients of index k=0, …, 199 set to zero and a block of attenuated coefficients k=200, …, 255 in the transformed domain, will be able to be performed in a single step.
In this exemplary embodiment and according to the definition of UHB1(k), it will be noted that the 5000-6000 Hz band of UHB1(k) (which corresponds to the indices k=200, …, 239) is copied from the 5000-6000 Hz band of U(k). This approach makes it possible to retain the original spectrum in this band and avoids introducing distortions in the 5000-6000 Hz band upon the addition of the HF synthesis with the LF synthesis; in particular, the phase of the signal (implicitly represented in the DCT-IV domain) in this band is preserved.
The 6000-8000 Hz band of UHB1(k) is here defined by copying the 4000-6000 Hz band of U(k) since the value of start_band is preferentially set at 160.
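The construction of UHB1(k) described above can be sketched as follows. The function name and the bounds check on start_band are hypothetical; the index ranges are those given in the text.

```python
def extend_spectrum(u_spec, start_band=160):
    """Build the 320-bin spectrum UHB1(k) from the 256-bin spectrum U(k).

    k = 0..199   : set to zero (implicit high-pass)
    k = 200..239 : copied from U(k), i.e. the 5000-6000 Hz band
    k = 240..319 : copied from U(start_band .. start_band + 79)
    """
    if len(u_spec) != 256:
        raise ValueError("expected a 256-bin low-band spectrum")
    if not 0 <= start_band <= 176:
        raise ValueError("start_band + 79 must stay below 256")
    out = [0.0] * 320
    out[200:240] = u_spec[200:240]
    out[240:320] = u_spec[start_band:start_band + 80]
    return out
```

With start_band=160, the last 80 bins are a copy of bins 160 to 239 of U(k), that is, of the 4000-6000 Hz band.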
In a variant of the embodiment, the value of start_band will be able to be made adaptive around the value of 160. The details of the adaptation of the start_band value are not described here because they go beyond the framework of the invention without changing its scope.
For certain wide-band signals (sampled at 16 kHz), the high band (>6 kHz) may be noisy, harmonic or comprise a mixture of noise and harmonics. Furthermore, the level of harmonicity in the 6000-8000 Hz band is generally correlated with that of the lower frequency bands. Thus, the noise generation block 702 performs a noise generation in the frequency domain, UHBN(k) for k=240, …, 319 (80 samples) corresponding to a second frequency band called high frequency in order to then combine this noise with the spectrum UHB1(k) in the block 703.
In a particular embodiment, the noise (in the 6000-8000 Hz band) is generated pseudo-randomly with a linear congruential generator on 16 bits:
with the convention that UHBN (239) in the current frame corresponds to the value UHBN (319) of the preceding frame. In variants of the invention, it will be possible to replace this noise generation by other methods.
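A 16-bit linear congruential generator of this kind can be sketched as follows. The multiplier 31821 and the increment 13849 are an assumption, borrowed from the pseudo-random generator commonly found in AMR-WB style reference code; they are not values stated in this text, and the function name is hypothetical.

```python
def lcg16_noise(prev, count=80, mult=31821, inc=13849):
    """Generate `count` pseudo-random values on 16 signed bits.

    `prev` plays the role of UHBN(239): per the convention above, it is
    the last value (UHBN(319)) produced for the preceding frame.
    """
    out = []
    state = prev
    for _ in range(count):
        state = (mult * state + inc) & 0xFFFF   # keep the low 16 bits
        if state >= 0x8000:                     # reinterpret as signed
            state -= 0x10000
        out.append(state)
    return out
```

Passing the last value of one frame as `prev` for the next frame keeps the sequence continuous across frames.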
The combination block 703 can be produced in different ways. Preferentially, an adaptive additive mixing of the following form is considered:
UHB2(k)=βUHB1(k)+αGHBNUHBN(k), k=240, …, 319
in which GHBN is a normalization factor serving to equalize the level of energy between the two signals,
with ε=0.01, and the coefficient α (between 0 and 1) is adjusted as a function of parameters estimated from the decoded low band and the coefficient β (between 0 and 1) depends on α.
In a preferred embodiment, the energy of the noise is computed in three bands: 2000-4000 Hz, 4000-6000 Hz and 6000-8000 Hz, with
in which
and N(k1, k2) is the set of the indices k for which the coefficient of index k is classified as being associated with the noise. This set can, for example, be obtained by detecting the local peaks in U′(k) that verify |U′(k)|≥|U′(k−1)| and |U′(k)|≥|U′(k+1)| and by considering that these spectral lines are not associated with the noise, i.e. (by applying the negation of the preceding condition):
N(a,b) = {a≤k≤b : |U′(k)|<|U′(k−1)| or |U′(k)|<|U′(k+1)|}
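The classification rule above can be sketched as follows. The function names are hypothetical, and the handling of the first and last index of the spectrum (a missing neighbour is simply ignored) is an assumption, since the text does not specify it.

```python
def noise_indices(spec, a, b):
    """Indices a <= k <= b that are NOT local peaks of |spec(k)|."""
    mag = [abs(v) for v in spec]
    last = len(mag) - 1
    result = []
    for k in range(a, b + 1):
        below_left = k > 0 and mag[k] < mag[k - 1]
        below_right = k < last and mag[k] < mag[k + 1]
        if below_left or below_right:
            result.append(k)
    return result

def noise_energy(spec, a, b):
    """Energy summed over the indices classified as noise in [a, b]."""
    return sum(spec[k] * spec[k] for k in noise_indices(spec, a, b))
```

Note that a flat plateau is treated as a peak, because the peak condition uses non-strict inequalities.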
It can be noted that other methods for computing the energy of the noise are possible, for example by taking the median value of the spectrum on the band considered or by applying a smoothing to each spectral line before computing the energy per band. The coefficient α is set such that the ratio between the energy of the noise in the 4-6 kHz and 6-8 kHz bands is the same as between the 2-4 kHz and 4-6 kHz bands:
in which
In variants of the invention, the computation of α will be able to be replaced by other methods. For example, in a variant, it will be possible to extract (compute) different parameters (or “features”) characterizing the signal in low band, including a “tilt” parameter similar to that computed in the AMR-WB codec, and the factor α will be estimated as a function of a linear regression from these different parameters by limiting its value between 0 and 1. The linear regression will, for example, be able to be estimated in a supervised manner by estimating the factor α from the original high band available in a learning base. It will be noted that the way in which α is computed does not limit the nature of the invention.
In a preferred embodiment, the following is taken
β=√(1−α²)
in order to preserve the energy of the extended signal after mixing.
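The adaptive additive mixing can be sketched as follows. The exact expression of GHBN is not reproduced here; the sketch assumes the square root of the ratio of the two band energies, each offset by ε=0.01, which matches its stated role of equalizing the energy levels. The function name is hypothetical.

```python
import math

EPS = 0.01

def mix_high_band(u_hb1, u_hbn, alpha):
    """Return UHB2: UHB1 with noise mixed into bins k = 240..319.

    u_hb1 : 320-bin extended spectrum
    u_hbn : 80 noise values for k = 240..319
    alpha : noise proportion in [0, 1]; beta = sqrt(1 - alpha^2)
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    beta = math.sqrt(1.0 - alpha * alpha)
    band = u_hb1[240:320]
    e_sig = sum(v * v for v in band) + EPS
    e_noise = sum(v * v for v in u_hbn) + EPS
    g_hbn = math.sqrt(e_sig / e_noise)      # assumed normalization factor
    out = list(u_hb1)
    for i in range(80):
        out[240 + i] = beta * band[i] + alpha * g_hbn * u_hbn[i]
    return out
```

With α=0 the spectrum is unchanged; with α=1 the band contains only noise rescaled to approximately the energy of the original band.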
In a variant, the factors β and α will be able to be adapted to take account of the fact that a noise injected into a given band of the signal is generally perceived as stronger than a harmonic signal with the same energy in the same band. Thus, it will be possible to modify the factors β and α as follows:
β←β·f(α)
α←α·f(α)
in which f(α) is a decreasing function of α, for example f(α)=b−a√α, with b=1.1 and a=1.2, f(α) being limited to the range from 0.3 to 1. It must be noted that, after multiplication by f(α), α²+β²<1, so that the energy of the signal UHB2(k)=βUHB1(k)+αGHBNUHBN(k) is lower than the energy of UHB1(k) (the energy difference depends on α: the more noise is added, the more the energy is attenuated).
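As a small worked sketch of this variant, with the example values b=1.1 and a=1.2 and the limits 0.3 and 1 given above (the function name is hypothetical, and β=√(1−α²) is taken as the starting point):

```python
import math

def attenuate_mix_factors(alpha, b=1.1, a=1.2):
    """Return (beta, alpha) after multiplication by f(alpha)."""
    beta = math.sqrt(1.0 - alpha * alpha)
    f = min(1.0, max(0.3, b - a * math.sqrt(alpha)))   # f limited to [0.3, 1]
    return beta * f, alpha * f
```

Since α²+β²=1 before the modification, the sum of squares afterwards equals f(α)²; for α=0.25, f(α)=0.5 and the sum of squares is 0.25.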
In other variants of the invention, it will be possible to take:
β=1−α
which makes it possible to preserve the amplitude level (when the combined signals are of the same sign); however, this variant has the disadvantage of resulting in an overall energy (at the level of UHB2(k)) which is not monotonic as a function of α.
It should therefore be noted here that the block 703 performs the equivalent of the block 101 of
In a simple variant, it is possible to consider an implementation of the block 703 in which the spectra, UHB1(k) or GHBNUHBN(k), are selected (switched) adaptively, which amounts to allowing only the values 0 or 1 for α; this approach amounts to classifying the type of excitation to be generated in the 6000-8000 Hz band.
The block 704 optionally performs a double operation of application of bandpass filter frequency response and of de-emphasis filtering in the frequency domain.
In a variant of the invention, the de-emphasis filtering will be able to be performed in the time domain, after the block 705, or even before the block 700; however, in this case, the bandpass filtering performed in the block 704 may leave certain low-frequency components of very low levels which are amplified by de-emphasis, which can modify, in a slightly perceptible manner, the decoded low band. For this reason, it is preferred here to perform the de-emphasis in the frequency domain. In the preferred embodiment, the coefficients of index k=0, …, 199 are set to zero, so the de-emphasis is limited to the higher coefficients.
The excitation is first de-emphasized according to the following equation:
in which Gdeemph(k) is the frequency response of the filter 1/(1−0.68z⁻¹) over a restricted discrete frequency band. By taking into account the discrete (odd) frequencies of the DCT-IV, Gdeemph(k) is defined here as:
in which
In the case where a transformation other than DCT-IV is used, the definition of θk will be able to be adjusted (for example for even frequencies).
It should be noted that the de-emphasis is applied in two phases: for k=200, …, 255, corresponding to the 5000-6400 Hz frequency band, where the response 1/(1−0.68z⁻¹) is applied as at 12.8 kHz, and for k=256, …, 319, corresponding to the 6400-8000 Hz frequency band, where the response is extended at 16 kHz, here to a constant value in the 6.4-8 kHz band.
It can be noted that, in the AMR-WB codec, the HF synthesis is not de-emphasized.
In the embodiment presented here, the high frequency signal is, on the contrary, de-emphasized so as to bring it into a domain consistent with the low frequency signal (0-6.4 kHz) which leaves the block 305 of
In a variant of the embodiment, in order to reduce the complexity, it will be possible to set Gdeemph(k) at a constant value independent of k, by taking for example Gdeemph(k)=0.6, which corresponds approximately to the average value of Gdeemph(k) for k=200, …, 319 in the conditions of the embodiment described above.
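The de-emphasis response can be sketched as follows, under two assumptions that are not stated in this text: the discrete frequencies are taken as θk=π(k+½)/256 (the odd frequencies of a 256-point DCT-IV at 12.8 kHz), and the constant value used above 6.4 kHz is the value reached at k=255. The function name is hypothetical.

```python
import cmath
import math

def deemph_gain(k, mu=0.68):
    """Magnitude of 1/(1 - mu*z^-1) at the assumed DCT-IV frequency of bin k.

    For k = 256..319 the value of bin 255 is held constant (assumption).
    """
    kk = min(k, 255)
    theta = math.pi * (kk + 0.5) / 256.0
    return 1.0 / abs(1.0 - mu * cmath.exp(-1j * theta))

def deemphasize(u_hb2):
    """Apply the response to bins 200..319; lower bins are left unchanged."""
    return [v if k < 200 else deemph_gain(k) * v for k, v in enumerate(u_hb2)]
```

Under these assumptions the gain decreases from about 0.63 at k=200 to about 0.595 at k=255, and its average over k=200, …, 319 is close to 0.6, in line with the constant value suggested in the variant above.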
In another variant of the embodiment of the extension device, the de-emphasis will be able to be performed in an equivalent manner in the time domain after inverse DCT.
In addition to the de-emphasis, a bandpass filtering is applied with two separate parts: one, high-pass, fixed, the other, low-pass, adaptive (function of the bit rate).
This filtering is performed in the frequency domain.
In the preferred embodiment, the low-pass filter partial response is computed in the frequency domain as follows:
in which Nlp=60 at 6.6 kbit/s, 40 at 8.85 kbit/s, and 20 at the bit rates above 8.85 kbit/s. Then, a bandpass filter is applied in the form:
The definition of Ghp(k), k=0, …, 55, is given, for example, in table 1 below.
It will be noted that, in variants of the invention, the values of Ghp(k) will be able to be modified while keeping a progressive attenuation. Similarly, the low-pass filtering with variable bandwidth, Glp(k), will be able to be adjusted with different values or a different frequency support, without changing the principle of this filtering step.
It will also be noted that the bandpass filtering will be able to be adapted by defining a single filtering step combining the high-pass and low-pass filtering.
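The bit-rate-dependent low-pass part can be illustrated with the following sketch. Only the values of Nlp come from the text; the linear shape of the roll-off, its floor value, and its placement on the last Nlp bins of the 320-bin spectrum are assumptions made for illustration, and the function names are hypothetical. The high-pass part Ghp(k) is not sketched.

```python
def n_lp_for_bitrate(bitrate_kbps):
    """Width of the low-pass roll-off in bins, as a function of the bit rate."""
    if bitrate_kbps <= 6.6:
        return 60
    if bitrate_kbps <= 8.85:
        return 40
    return 20

def lowpass_taper(n_lp, floor=0.001):
    """Assumed shape: linear ramp from 1 down to `floor` over n_lp bins."""
    return [1.0 - (1.0 - floor) * k / (n_lp - 1) for k in range(n_lp)]

def apply_lowpass(spec, n_lp):
    """Attenuate the last n_lp bins of the spectrum with the taper."""
    out = list(spec)
    start = len(spec) - n_lp
    for i, g in enumerate(lowpass_taper(n_lp)):
        out[start + i] *= g
    return out
```

A lower bit rate gives a wider roll-off, so that less energy is kept near 8 kHz when the decoded low band is of lower quality.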
In another embodiment, the bandpass filtering will be able to be performed in an equivalent manner in the time domain (as in the block 112 of
It will also be noted that, in the case of the 23.85 kbit/s bit rate, the de-emphasis of the excitation UHB2(k) is not performed to remain in agreement with the way in which the correction gain is computed in the AMR-WB coder and to avoid double multiplications. In this case, block 704 performs only the low-pass filtering.
The inverse transform block 705 performs an inverse DCT on 320 samples to find the high-frequency excitation sampled at 16 kHz. Its implementation is identical to the block 700, because the DCT-IV is orthonormal, except that the length of the transform is 320 instead of 256, and the following is obtained:
in which N16k=320 and k=0, …, 319.
This excitation sampled at 16 kHz is then, optionally, scaled by gains defined per subframe of 80 samples (block 707).
In a preferred embodiment, a gain gHB1(m) is first computed (block 706) per subframe by energy ratios of the subframes such that, in each subframe of index m=0, 1, 2 or 3 of the current frame:
in which
with ε=0.01. The gain per subframe gHB1(m) can be written in the form:
which shows that, in the signal uHB, the same ratio between energy per subframe and energy per frame as in the signal u(n) is assured.
The block 707 performs the scaling of the combined signal according to the following equation:
uHB(n)=gHB1(m)uHB0(n), n=80m, …, 80(m+1)−1
It will be noted that the implementation of the block 706 differs from that of the block 101 of
Thus, this scaling step makes it possible to retain, in the high band, the energy ratio between the subframe and the frame in the same way as in the low band.
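The gain computation and scaling per subframe can be sketched as follows. The sketch assumes that the low-band frame of 256 samples is split into four subframes of 64 samples, matching the four subframes of 80 samples at 16 kHz, and that ε=0.01 is added to every energy; the function names are hypothetical.

```python
import math

EPS = 0.01

def subframe_gains(u, u_hb0):
    """Gains gHB1(m), m = 0..3, matching the subframe-to-frame energy ratios.

    u     : low-band excitation, 256 samples at 12.8 kHz
    u_hb0 : high-band excitation before scaling, 320 samples at 16 kHz
    """
    e_frame_lo = sum(v * v for v in u) + EPS
    e_frame_hi = sum(v * v for v in u_hb0) + EPS
    gains = []
    for m in range(4):
        e_sub_lo = sum(v * v for v in u[64 * m:64 * (m + 1)]) + EPS
        e_sub_hi = sum(v * v for v in u_hb0[80 * m:80 * (m + 1)]) + EPS
        ratio = (e_sub_lo / e_frame_lo) / (e_sub_hi / e_frame_hi)
        gains.append(math.sqrt(ratio))
    return gains

def scale_by_subframe(u_hb0, gains):
    """uHB(n) = gHB1(m) * uHB0(n) for n in subframe m (80 samples each)."""
    return [gains[n // 80] * v for n, v in enumerate(u_hb0)]
```

For a flat high-band excitation and a low-band excitation whose amplitude doubles from one subframe to the next, the gain also roughly doubles, so the temporal envelope of the low band is carried over to the high band.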
It will be noted here that, in the case of the 23.85 kbit/s bit rate, the gains gHB1(m) are computed but applied in the next step, as explained with reference to
According to the invention, the block 708 then performs a scale factor computation per subframe of the signal (steps E602 to E603 of
Finally, the corrected excitation uHB′(n) is filtered by the filtering module 710 which can be performed here by taking as transfer function 1/Â(z/γ), in which γ=0.9 at 6.6 kbit/s and γ=0.6 at the other bit rates, which limits the order of the filter to the order 16.
In a variant, this filtering will be able to be performed in the same way as is described for the block 111 of
In a variant embodiment, the step of filtering by a linear prediction filter 710 for the second frequency band is combined with the application of the optimized scale factor, which makes it possible to reduce the processing complexity. Thus, the steps of filtering 1/Â(z/γ) and of application of the optimized scale factor gHB2 are combined in a single step of filtering gHB2/Â(z/γ) to reduce the processing complexity.
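The combination of the scale factor with the synthesis filter can be sketched as follows. The sketch assumes the convention Â(z)=1+a1·z⁻¹+…+ap·z⁻ᵖ, so that Â(z/γ) has coefficients ai·γⁱ, and starts from a zero filter memory; the function names are hypothetical.

```python
def weighted_lpc(a, gamma):
    """Coefficients of A(z/gamma) from those of A(z), with a[0] == 1."""
    return [c * gamma ** i for i, c in enumerate(a)]

def synth_filter(x, a, gain=1.0):
    """All-pole filtering by gain/A(z): y(n) = gain*x(n) - sum_i a[i]*y(n-i)."""
    order = len(a) - 1
    mem = [0.0] * order            # mem[i-1] holds y(n-i)
    y = []
    for v in x:
        acc = gain * v - sum(a[i] * mem[i - 1] for i in range(1, order + 1))
        y.append(acc)
        if order:
            mem = [acc] + mem[:-1]
    return y
```

Since the filter is linear, passing the scale factor as `gain` gives the same output as scaling the excitation first and filtering afterwards, while saving one multiplication pass over the subframe.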
In variant embodiments of the invention, the coding of the low band (0-6.4 kHz) will be able to be replaced by a CELP coder other than that used in AMR-WB, such as, for example, the CELP coder in G.718 at 8 kbit/s. With no loss of generality, other wide-band coders or coders operating at frequencies above 16 kHz, in which the coding of the low band operates with an internal frequency at 12.8 kHz, could be used. Moreover, the invention can obviously be adapted to sampling frequencies other than 12.8 kHz, when a low-frequency coder operates with a sampling frequency lower than that of the original or reconstructed signal. When the low-band decoding does not use linear prediction, there is no excitation signal to be extended, in which case it will be possible to perform an LPC analysis of the signal reconstructed in the current frame and an LPC excitation will be computed so as to be able to apply the invention.
Finally, in another variant of the invention, the excitation (u(n)) is resampled, for example by linear interpolation or cubic “spline”, from 12.8 to 16 kHz before transformation (for example DCT-IV) of length 320. This variant has the defect of being more complex, because the transform (DCT-IV) of the excitation is then computed over a greater length and the resampling is not performed in the transform domain.
Furthermore, in variants of the invention, all the computations necessary for the estimation of the gains (GHBN, gHB1(m), gHB2(m), gHBN, …) will be able to be performed in a logarithmic domain.
In variants of the band extension, the excitation in low band u(n) and the LPC filter 1/Â(z) will be estimated per frame, by LPC analysis of a low-band signal for which the band has to be extended. The low-band excitation signal is then extracted by analysis of the audio signal.
In a possible embodiment of this variant, the low-band audio signal is resampled before the step of extracting the excitation, so that the excitation extracted from the audio signal (by linear prediction) is already resampled.
The band extension illustrated in
This type of device comprises a processor PROC cooperating with a memory block BM comprising a storage and/or working memory MEM.
Such a device comprises an input module E suitable for receiving an excitation audio signal decoded or extracted in a first frequency band called low band (u(n) or U(k)) and the parameters of a linear prediction synthesis filter (Â(z)). It comprises an output module S suitable for transmitting the synthesized and optimized high-frequency signal (uHB′(n)) for example to a filtering module like the block 710 of
The memory block can advantageously comprise a computer program comprising code instructions for implementing the steps of the method for determining an optimized scale factor to be applied to an excitation signal or to a filter within the meaning of the invention, when these instructions are executed by the processor PROC, and notably the steps of determination (E602) of a linear prediction filter, called additional filter, of lower order than the linear prediction filter of the first frequency band, the coefficients of the additional filter being obtained from parameters decoded or extracted from the first frequency band, and of computation (E603) of an optimized scale factor as a function at least of the coefficients of the additional filter.
Typically, the description of
The memory MEM stores, generally, all the data necessary for the implementation of the method.
In a possible embodiment, the device thus described can also comprise functions for application of the optimized scale factor to the extended excitation signal, of frequency band extension, of low-band decoding and other processing functions described for example in
Number | Date | Country | Kind |
---|---|---|---|
13 56909 | Jul 2013 | FR | national |
This application is a continuation of application Ser. No. 14/904,555, filed Jan. 12, 2016, which is a U.S. National Phase application under 35 U.S.C. § 371 of International Application No. PCT/FR2014/051720, filed Jul. 4, 2014, which claims priority to French application no. 1356909, filed Jul. 12, 2013, the content of which is incorporated herein by reference in its entirety.
Number | Date | Country |
---|---|---|
20190385625 A1 | Dec 2019 | US |
Relation | Number | Country |
---|---|---|
Parent | 14904555 | US |
Child | 16553595 | US |