The present invention relates to the field of the transmission of coded audio signals, more specifically to a method and an apparatus for obtaining, or acquiring, spectrum coefficients for a replacement frame of an audio signal, to an audio decoder, to an audio receiver and to a system for transmitting audio signals. Embodiments relate to an approach for constructing a spectrum for a replacement frame based on previously received frames.
In conventional technology, several approaches are described dealing with a frame-loss at an audio receiver. For example, when a frame is lost on the receiver side of an audio or speech codec, simple methods for the frame-loss-concealment as described in P. Lauber and R. Sperschneider, “Error Concealment for Compressed Digital Audio,” in AES 111th Convention, New York, USA, 2001 (hereinafter “the Lauber reference”) may be used, such as:
Additionally, in the Lauber reference, an advanced technique using predictors in sub-bands is presented. The predictor technique is then combined with sign scrambling, and the prediction gain is used as a sub-band wise decision criterion to determine which method will be used for the spectral coefficients of this sub-band.
In U.S. Pat. No. 6,351,730 B2 (C. J. Hwey, “Low-complexity, low-delay, scalable and embedded speech and audio coding with adaptive frame loss concealment,” hereinafter “the '730 Patent”), a waveform signal extrapolation in the time domain is used for a MDCT (Modified Discrete Cosine Transform) domain codec. This kind of approach may be good for monophonic signals including speech.
If one frame delay is allowed, an interpolation of the surrounding frames can be used for the construction of the lost frame. Such an approach is described in US Patent Application Publication No. 2007/094009 A1 (S. K. Gupta, E. Choy and S.-U. Ryu, “Encoder-assisted frame loss concealment techniques for audio coding,” hereinafter “the '009 Publication”), where the magnitudes of the tonal components in the lost frame with an index m are interpolated using the neighboring frames indexed m−1 and m+1. The side information that defines the MDCT coefficient signs for the tonal components is transmitted in the bit-stream. Sign scrambling is used for the other, non-tonal MDCT coefficients. The tonal components are determined as a predetermined fixed number n of spectral coefficients with the highest magnitudes.
Cm*(k)=½(Cm−1(k)+Cm+1(k))
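For illustration only, the interpolation-plus-sign-scrambling scheme of the '009 Publication might be sketched as follows; the function and array names are hypothetical, and the use of a seeded random generator for the sign scrambling is an assumption, not part of the reference.

```python
import numpy as np

def interpolate_tonal_bins(c_prev, c_next, tonal_idx, signs):
    """Construct MDCT coefficients for a lost frame m (illustrative).

    Tonal bins: magnitudes interpolated from frames m-1 and m+1,
    signs taken from transmitted side information.
    Non-tonal bins: sign scrambling of the previous frame's magnitudes.
    """
    rng = np.random.default_rng(0)  # fixed seed for reproducibility
    c_m = rng.choice([-1.0, 1.0], size=c_prev.shape) * np.abs(c_prev)
    c_m[tonal_idx] = signs * 0.5 * (np.abs(c_prev[tonal_idx])
                                    + np.abs(c_next[tonal_idx]))
    return c_m
```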
The operation of this FLC (Frame Loss Concealment) technique can be described by a block diagram. The approach is based on the following general formula:
Cm(k)=Cm*(k)·α*(k)·s*(k), 0≤k<M
Cm*(k) is derived by a bin-wise interpolation (see block 700 “Frame Interpolation”):
Cm*(k)=½(Cm−1(k)+Cm+1(k))
α*(k) is derived by an energy interpolation using the geometric mean:
For tonal components it can be shown that α=cos(πfl), with fl being the frequency of the tonal component.
The energies E are derived based on a pseudo power spectrum, derived by a simple smoothing operation:
P(k)≅C²(k)+{C(k+1)−C(k−1)}²
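A minimal sketch of this pseudo power spectrum; the zero padding at the edge bins is an assumption, since the edge handling is not specified above.

```python
import numpy as np

def pseudo_power_spectrum(c):
    """P(k) ~ C(k)^2 + (C(k+1) - C(k-1))^2; edges padded with zeros."""
    c = np.asarray(c, dtype=float)
    c_pad = np.pad(c, 1)               # zero padding (assumption)
    diff = c_pad[2:] - c_pad[:-2]      # C(k+1) - C(k-1)
    return c ** 2 + diff ** 2
```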
s*(k) is set randomly to ±1 for non-tonal components (see block 712 “Random Sign Change”), and to either +1 or −1 for tonal components (see block 714 “Sign Correction”).
The peak detection is performed by searching for local maxima in the pseudo power spectrum to detect the exact positions of the spectral peaks corresponding to the underlying sinusoids. It is based on the tone identification process adopted in the MPEG-1 psychoacoustic model described in ISO/IEC JTC1/SC29/WG11, Information technology—Coding of moving pictures and associated audio, International Organization for Standardization, 1993. From this, an index sub-set is defined having the bandwidth of an analysis window's main-lobe in terms of MDCT bins and the detected peak at its center. Those bins are treated as tone-dominant MDCT bins of a sinusoid, and the index sub-set is treated as an individual tonal component.
The sign correction s*(k) flips either the signs of all bins of a certain tonal component, or none. The determination is performed using analysis by synthesis, i.e., the SFM (Spectral Flatness Measure) is derived for both versions and the version with the lower SFM is chosen. For the SFM derivation the power spectrum is needed, which in turn requires the MDST (Modified Discrete Sine Transform) coefficients. To keep the complexity manageable, only the MDST coefficients for the tonal component are derived, using only the MDCT coefficients of this tonal component.
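The analysis-by-synthesis sign decision can be sketched as follows; the filter coefficients a1..a3 and the input data are placeholders, and the spectral flatness measure used here is the usual geometric-to-arithmetic mean ratio.

```python
import numpy as np

def spectral_flatness(power):
    """SFM: geometric mean over arithmetic mean of the power spectrum."""
    power = np.maximum(power, 1e-12)  # guard against log(0)
    return np.exp(np.mean(np.log(power))) / np.mean(power)

def choose_sign(c_prev, c_tonal, c_next, a1, a2, a3):
    """Pick the sign of a tonal component by analysis by synthesis.

    The MDST of the component is estimated from the MDCT coefficients
    of the neighboring frames (filter coefficients a1..a3 are
    placeholders); the sign hypothesis whose power spectrum has the
    lower spectral flatness, i.e. is more tonal, wins.
    """
    best_sfm, best_sign = None, 1.0
    for sign in (1.0, -1.0):
        c = sign * c_tonal
        s = a1 * c_prev + a2 * c + a3 * c_next   # estimated MDST
        sfm = spectral_flatness(c * c + s * s)
        if best_sfm is None or sfm < best_sfm:
            best_sfm, best_sign = sfm, sign
    return best_sign
```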
The following modifications are of interest with respect to the Ryu 2006/Paris reference:
Pm(k)=Cm−1²(k)+Cm+1²(k)
Cm(k)=Cm*(k)·α(k), 0≤k<M
Em(α)=½{Em−1(α)+Em+1(α)}
Em−1(α)≅|cm−1|²+|sm−1|²=|cm−1|²+|ξ1+αξ1′|²
Em(α)≅α²|cm|²+|sm|²=α²|cm|²+|ξ2+αξ2′|²
Em+1(α)≅|cm+1|²+|sm+1|²=|cm+1|²+|ξ3+αξ3′|²
sm−1≅A1cm−2+A2cm−1+αA3cm=ξ1+αξ1′
sm≅A1cm−1+αA2cm+A3cm+1=ξ2+αξ2′
sm+1≅αA1cm+A2cm+1+A3cm+2=ξ3+αξ3′
A delay-less version of this approach is suggested in S.-U. Ryu, Source Modeling Approaches to Enhanced Decoding in Lossy Audio Compression and Communication, UNIVERSITY of CALIFORNIA Santa Barbara, 2006 (hereinafter “Ryu 2006/California”):
Cm*(k)=2Cm−1(k)−Cm−2(k)
Cm(k)=αCm−1(k)−Cm−2(k)
sm−1≅(A1−A3)cm−2+A2cm−1+αA3cm−1=ξ0+αξ0′
Em−1(α)≅|cm−1|²+|sm−1|²=|cm−1|²+|ξ0+αξ0′|²
Em−1(α)=Em−2
Another delay-less frame-loss-concealment in the frequency domain is described in European Patent No. EP 0574288 B1 (M. Yannick, “Method and apparatus for transmission error concealment of frequency transform coded digital audio signals,” hereinafter “the '288 Patent”). The teachings of the '288 Patent can be simplified, without loss of generality, as:
Cm′(k)=Qm(k)cos(φm(k)+χ)
Sm′(k)=Qm(k)sin(φm(k)+χ)
Qm+1(k)=2Qm(k)−Qm−1(k)
φm+1(k)=2φm(k)−φm−1(k)
Cm+1(k)=Qm+1(k)cos(φm+1(k))
The selection of spectrum coefficients to be predicted is mentioned in the '288 Patent but is not described in detail.
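The magnitude/phase extrapolation above can be illustrated with a per-bin sketch (scalar version for one bin k; the offset χ is omitted, and the function name is hypothetical):

```python
import numpy as np

def extrapolate_bin(q_prev, q_curr, phi_prev, phi_curr):
    """Q_{m+1} = 2 Q_m - Q_{m-1}; phi_{m+1} = 2 phi_m - phi_{m-1};
    the replacement coefficient is C_{m+1} = Q_{m+1} cos(phi_{m+1})."""
    q_next = 2.0 * q_curr - q_prev
    phi_next = 2.0 * phi_curr - phi_prev
    return q_next * np.cos(phi_next)
```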
In Y. Mahieux, J.-P. Petit and A. Charbonnier, “Transform coding of audio signals using correlation between successive transform blocks,” in Proc. IEEE ICASSP, 1989, it has been recognized that, for quasi-stationary signals, the phase difference between successive frames is almost constant and depends only on the fractional frequency. However, only a linear extrapolation from the last two complex spectra is used.
In AMR-WB+ (see 3GPP; Technical Specification Group Services and System Aspects, Extended Adaptive Multi-Rate—Wideband (AMR-WB+) codec, 2009) a method described in U.S. Pat. No. 7,356,748 B2 (A. Taleb, “Partial Spectral Loss Concealment in Transform Codecs,” hereinafter “the '748 Patent”) is used. The method in the '748 Patent is an extension of the method described in the '288 Patent in the sense that it also uses the available spectral coefficients of the current frame, assuming that only a part of the current frame is lost. However, the situation of a complete loss of a frame is not considered in the '748 Patent.
Another delay-less frame-loss-concealment in the MDCT domain is described in US Patent Application Publication No. 2012/109659 A1 (C. Guoming, D. Zheng, H. Yuan, J. Li, J. Lu, K. Liu, K. Peng, L. Zhibin, M. Wu and Q. Xiaojun, “Compensator and Compensation Method for Audio Frame Loss in Modified Discrete Cosine Transform Domain,” hereinafter “the '659 Publication”). In the '659 Publication, it is first determined if the lost Pth frame is a multiple-harmonic frame. The lost Pth frame is a multiple-harmonic frame if more than K0 frames among the K frames before the Pth frame have a spectral flatness smaller than a threshold value. If the lost Pth frame is a multiple-harmonic frame, then the (P−K)th to (P−2)nd frames in the MDCT-MDST domain are used to predict the lost Pth frame. A spectral coefficient is a peak if its power spectrum is bigger than the two adjacent power spectrum coefficients. A pseudo spectrum as described in L. S. M. Dauder, “MDCT Analysis of Sinusoids: Exact Results and Applications to Coding Artifacts Reduction,” IEEE Transactions on Speech and Audio Processing, pp. 302-312, 2004 (hereinafter “Dauder”) is used for the (P−1)st frame.
A set of spectral coefficients Sc is constructed from L1 power spectrum frames as follows.
First, L1 sets S1, . . . , SL1, composed of the peaks in each of the L1 frames, are obtained, the number of peaks in each set being N1, . . . , NL1, respectively. A set Si is selected from the L1 sets S1, . . . , SL1. For each peak coefficient mj, j=1 . . . Ni, in the set Si, it is judged whether there is any frequency coefficient among mj, mj±1, . . . , mj±k belonging to all other peak sets. If there is, all the frequencies mj, mj±1, . . . , mj±k are put into the frequency set Sc. If there is no frequency coefficient belonging to all other peak sets, all the frequency coefficients in a frame are put directly into the frequency set Sc. Said k is a nonnegative integer. For all spectral coefficients in the set Sc, the phase is predicted using L2 frames among the (P−K)th to (P−2)nd MDCT-MDST frames. The prediction is done using a linear extrapolation (when L2=2) or a linear fit (when L2>2). For the linear extrapolation:
φp(k)=φt1(k)+(p−t1)·(φt1(k)−φt2(k))/(t1−t2), where p, t1 and t2 are frame indices.
The spectral coefficients not in the set Sc are obtained using a plurality of frames before the (P−1)st frame, without specifically explaining how.
According to one embodiment, a method for acquiring spectrum coefficients for a replacement frame of an audio signal may have the steps of: detecting a tonal component of a spectrum of an audio signal based on a peak that exists in the spectra of frames preceding a replacement frame; for the tonal component of the spectrum, predicting spectrum coefficients for the peak and its surrounding in the spectrum of the replacement frame; and for the non-tonal component of the spectrum, using a non-predicted spectrum coefficient for the replacement frame or a corresponding spectrum coefficient of a frame preceding the replacement frame. Optionally, a non-transitory computer program product may have a computer readable medium storing instructions which, when executed on a computer, carry out the method.
According to another embodiment, an apparatus for acquiring spectrum coefficients for a replacement frame of an audio signal may have: a detector configured to detect a tonal component of a spectrum of an audio signal based on a peak that exists in the spectra of frames preceding a replacement frame; and a predictor configured to predict for the tonal component of the spectrum the spectrum coefficients for the peak and its surrounding in the spectrum of the replacement frame; wherein for the non-tonal component of the spectrum a non-predicted spectrum coefficient for the replacement frame or a corresponding spectrum coefficient of a frame preceding the replacement frame is used. In one configuration, an apparatus for acquiring spectrum coefficients for a replacement frame of an audio signal is configured to operate according to the method. In one alternative, an audio decoder may contain the apparatus for acquiring spectrum coefficients. Furthermore, an audio receiver may have the audio decoder.
According to another embodiment, a system for transmitting audio signals may have: an encoder configured to generate a coded audio signal; and a decoder configured to receive the coded audio signal and to decode the coded audio signal.
Embodiments of a method for obtaining spectrum coefficients for a replacement frame of an audio signal include detecting a tonal component of a spectrum of an audio signal based on a peak that exists in the spectra of frames preceding a replacement frame; for the tonal component of the spectrum, predicting spectrum coefficients for the peak and its surrounding in the spectrum of the replacement frame; and for the non-tonal component of the spectrum, using a non-predicted spectrum coefficient for the replacement frame or a corresponding spectrum coefficient of a frame preceding the replacement frame.
Embodiments of an apparatus for obtaining spectrum coefficients for a replacement frame of an audio signal include a detector configured to detect a tonal component of a spectrum of an audio signal based on a peak that exists in the spectra of frames preceding a replacement frame; and a predictor configured to predict for the tonal component of the spectrum the spectrum coefficients for the peak and its surrounding in the spectrum of the replacement frame; wherein for the non-tonal component of the spectrum a non-predicted spectrum coefficient for the replacement frame or a corresponding spectrum coefficient of a frame preceding the replacement frame is used.
Embodiments of an apparatus for obtaining spectrum coefficients for a replacement frame of an audio signal include the apparatus being configured to operate according to the inventive method for obtaining spectrum coefficients for a replacement frame of an audio signal.
Embodiments of an apparatus include an audio decoder, comprising the inventive apparatus for obtaining spectrum coefficients for a replacement frame of an audio signal.
Embodiments of an audio receiver may include the inventive audio decoder.
Embodiments of a system for transmitting audio signals include an encoder configured to generate a coded audio signal; and the inventive decoder configured to receive the coded audio signal and to decode the coded audio signal.
Embodiments of a non-transitory computer program product include a computer readable medium storing instructions which, when executed on a computer, carry out the inventive method for obtaining spectrum coefficients for a replacement frame of an audio signal.
Embodiments of the systems, methods and apparatuses are advantageous as they provide for a good frame-loss concealment of tonal signals without introducing any additional delay. Embodiments of a low delay codec are advantageous as they perform well on both speech and audio signals and benefit, for example in an error prone environment, from the good frame-loss concealment that is achieved especially for stationary tonal signals. A delay-less frame-loss-concealment of monophonic and polyphonic signals is disclosed, which delivers good results for tonal signals without degradation of the non-tonal signals.
In many embodiments, an improved concealment of tonal components in the MDCT domain is provided. Embodiments relate to audio and speech coding that incorporate a frequency domain codec or a switched speech/frequency domain codec, in particular to a frame-loss concealment in the MDCT (Modified Discrete Cosine Transform) domain. In many embodiments, a delay-less method for constructing an MDCT spectrum for a lost frame based on the previously received frames is provided, where the last received frame is coded in the frequency domain using the MDCT.
In one embodiment, a method includes detection of the parts of the spectrum which are tonal, for example using the second to last complex spectrum to get the correct location or place of the peak, using the last real spectrum to refine the decision if a bin is tonal, and using pitch information for a better detection either of a tone onset or offset. The pitch information is either already existing in the bit-stream or is derived at the decoder side. Further, embodiments of a method include a provision of a signal adaptive width of a harmonic to be concealed. The calculation of the phase shift or phase difference between frames of each spectral coefficient that is part of a harmonic is also provided, wherein this calculation is based on the last available spectrum, for example the CMDCT spectrum, without the need for the second to last CMDCT. In accordance with embodiments, the phase difference is refined using the last received MDCT spectrum, and the refinement may be adaptive, dependent on the number of consecutively lost frames. The CMDCT spectrum may be constructed from the decoded time domain signal which is advantageous as it avoids the need for any alignment with the codec framing, and it allows for the construction of the complex spectrum to be as close as possible to the lost frame by exploiting the properties of low-overlap windows. Embodiments provide a per frame decision to use either time domain or frequency domain concealment.
Embodiments of the inventive approach are advantageous, as they operate fully on the basis of information already available at the receiver side when it is determined that a frame has been lost or needs to be replaced. There is no need for additional side information to be received, and thus also no source for the additional delays that occur in conventional-technology approaches, which have to either receive the additional side information or derive it from the information at hand.
Embodiments of the inventive approach are advantageous when compared to the above described conventional-technology approaches, as the subsequently outlined drawbacks of such approaches, which were recognized by the inventors, are avoided when applying the inventive approach.
The methods for the frame-loss-concealment described in the Lauber reference are not robust enough and do not produce sufficiently good results for tonal signals.
The waveform signal extrapolation in the time domain, as described in the '730 Patent, cannot handle polyphonic signals and requires increased complexity for the concealment of very stationary tonal signals, as a precise pitch lag has to be determined.
In the '009 Publication, an additional delay is introduced and significant side information may be used. The tonal component selection is very simple and will choose many peaks among non-tonal components.
The method described in the Ryu 2006/Paris reference may use a look-ahead on the decoder side and hence introduces an additional delay of one frame. Using the smoothed pseudo power spectrum for the peak detection reduces the precision of the location of the peaks. It also reduces the reliability of the detection since it will detect peaks from noise that appear in just one frame.
The method described in the Ryu 2007 reference may use a look-ahead on the decoder side and hence introduces an additional delay of two frames. The tonal component selection does not check for tonal components in the two frames separately, but relies on an averaged spectrum; thus it will have either too many false positives or too many false negatives, making it impossible to tune the peak detection thresholds. The location of the peaks will not be precise because the pseudo power spectrum is used. The limited spectral range for the peak search appears to be a workaround for the described problems that arise because the pseudo power spectrum is used.
The method described in the Ryu 2006/California reference is based on the method described in the Ryu 2007 reference and hence shares the same drawbacks; it merely avoids the additional delay.
In the '288 Patent, there is no detailed description of the decision whether a spectral coefficient belongs to the tonal part of the signal. However, the synergy between the detection of tonal spectral coefficients and the concealment is important, and thus a good detection of tonal components is important. Further, it has not been recognized to use filters dependent on both Cm and Cm−1 (that is Cm, Cm−1 and Sm−1, as Sm−1 can be calculated when Cm and Cm−1 are available) to calculate Cm and Sm. Also, it was not recognized to use the possibility of calculating a complex spectrum that is not aligned to the coded signal framing, which is possible with low-overlap windows. In addition, it was not recognized to use the possibility of calculating the phase difference between frames based only on the second to last complex spectrum.
In the '659 Publication, at least three previous frames are stored in memory, thereby significantly increasing the memory requirements. The decision whether to use tonal concealment may be wrong, and a frame with one or more harmonics may be classified as a frame without multiple harmonics. The last received MDCT frame is not directly used to improve the prediction of the lost MDCT spectrum, but only in the search for the tonal components. The number of MDCT coefficients to be concealed for a harmonic is fixed; however, depending on the noise level, it is desirable to have a variable number of MDCT coefficients that constitute one harmonic.
Embodiments of the present invention will be detailed subsequently referring to the appended drawings, in which:
In the following, embodiments of the inventive approach will be described in further detail, and it is noted that in the accompanying drawings elements having the same or similar functionality are denoted by the same reference signs. Embodiments will be described in accordance with which a concealment is done in the frequency domain only if the last two received frames are coded using the MDCT. Details about the decision whether to use time or frequency domain concealment upon a frame loss after receiving two MDCT frames will also be described. With regard to the embodiments described in the following, it is noted that the requirement that the last two frames are coded in the frequency domain does not reduce the applicability of the inventive approach, as in a switched codec the frequency domain will be used for stationary tonal signals.
The system further comprises a decoder 120 having an input 122 at which the encoded audio signal provided by the encoder 106 is received. The decoder 120 may comprise, in accordance with an embodiment, an antenna 124 for receiving a wireless transmission 110 from the encoder 100. In another embodiment, the input 122 may provide for a connection to the wired transmission 112 for receiving the encoded audio signal. The audio signal received at the input 122 of the decoder 120 is applied to a detector 126 which determines whether a coded frame of the received audio signal that is to be decoded by the decoder 120 needs to be replaced. For example, in accordance with embodiments, this may be the case when the detector 126 determines that a frame that should follow a previous frame has not been received at the decoder, or when it is determined that the received frame has errors which prevent decoding it at the decoder 120. In case it is determined at the detector 126 that a frame presented for decoding is available, the frame is forwarded to the decoding block 128 where a decoding of the encoded frame is carried out, so that at the output 130 of the decoder a stream of decoded audio frames or a decoded audio signal 132 can be output.
In case it is determined at block 126 that the frame to be currently processed needs a replacement, the frames preceding the current frame, which may be buffered in the detector circuitry 126, are provided to a tonal detector 134 that determines whether the spectrum of the replacement frame includes tonal components or not. In case no tonal components are detected, this is indicated to the noise generator/memory block 136, which generates non-predicted spectral coefficients, for example by using a noise generator or another conventional noise generating method such as sign scrambling. Alternatively, predefined spectrum coefficients for the non-tonal components of the spectrum may be obtained from a memory, for example a look-up table. Alternatively, when it is determined that the spectrum does not include tonal components, instead of generating non-predicted spectral coefficients, corresponding spectral coefficients of one of the frames preceding the replacement frame may be selected.
In case the tonal detector 134 detects that the spectrum includes tonal components, a respective signal is indicated to the predictor 138 predicting, in accordance with embodiments of the present invention described later, the spectral coefficients for the replacement frame. The respective coefficients determined for the replacement frame are provided to the decoding block 128 where, on the basis of these spectral coefficients, a decoding of the lost or replacement frame is carried out.
As is shown in
In step S202 it is determined whether or not a current frame to be processed by the decoder 120 needs to be replaced. A replacement frame may be used at the decoder side, for example in case the frame cannot be processed due to an error in the received data or the like, or in case the frame was lost during transmission to the receiver/decoder 120, or in case the frame was not received in time at the audio signal receiver 120, for example due to a delay during transmission of the frame from the encoder side towards the decoder side.
In case it is determined in step S202, for example by the detector 126 in decoder 120, that the frame to be currently processed by the decoder 120 needs to be replaced, the method proceeds to step S204 at which a further determination is made whether or not a frequency domain concealment may be used. In accordance with an embodiment, if the pitch information is available for the last two received frames and if the pitch is not changing, it is determined at step S204 that a frequency domain concealment is desired. Otherwise, it is determined that a time domain concealment should be applied. In an alternative embodiment, the pitch may be calculated on a sub-frame basis using the decoded signal, and again using the decision that in case the pitch is present and in case it is constant in the sub-frames, the frequency domain concealment is used, otherwise the time domain concealment is applied.
In yet another embodiment of the present invention, a detector, for example the detector 126 in decoder 120, may be provided and may be configured in such a way that it additionally analyzes the spectrum of the second to last frame or the last frame or both of these frames preceding the replacement frame and to decide, based on the peaks found, whether the signal is monophonic or polyphonic. In case the signal is polyphonic, the frequency domain concealment is to be used, regardless of the presence of pitch information. Alternatively, the detector 126 in decoder 120, may be configured in such a way that it additionally analyzes the one or more frames preceding the replacement frame so as to indicate whether a number of tonal components in the signal exceeds a predefined threshold or not. In case the number of tonal components in the signal exceeds the threshold the frequency domain concealment will be used.
In case it is determined in step S204 that a frequency domain concealment is to be used, for example by applying the above mentioned criteria, the method proceeds to step S206, where a tonal part or a tonal component of a spectrum of the audio signal is detected based on one or more peaks that exist in the spectra of the preceding frames, namely one or more peaks that are present at substantially the same location in the spectrum of the second to last frame and the spectrum of the last frame preceding the replacement frame. In step S208 it is determined whether there is a tonal part of the spectrum. In case there is a tonal part of the spectrum, the method proceeds to step S210, where one or more spectrum coefficients for the one or more peaks and their surroundings in the spectrum of the replacement frame are predicted, for example on the basis of information derivable from the preceding frames, namely the second to last frame and the last frame. The spectrum coefficient(s) predicted in step S210 is (are) forwarded, for example to the decoding block 128 shown in
In case it is determined in step S208 that there is no tonal part of the spectrum, the method proceeds to step S214, using a non-predicted spectrum coefficient for the replacement frame or a corresponding spectrum coefficient of a frame preceding the replacement frame which are provided to step S212 for decoding the frame.
In case it is determined in step S204 that no frequency domain concealment is desired, the method proceeds to step S216 where a conventional time domain concealment of the frame to be replaced is performed and on the basis of the spectrum coefficients generated by the process in step S216 the frame of the encoded signal is decoded in step S212.
In case it is determined at step S202 that there is no replacement frame in the audio signal currently processed, i.e. the currently processed frame can be fully decoded using the conventional approaches, the method directly proceeds to step S212 for decoding the frame of the encoded audio signal.
In the following, further details in accordance with embodiments of the present invention will be described.
Power Spectrum Calculation
For the second-last frame, indexed m−2, the MDST coefficients Sm−2 are calculated directly from the decoded time domain signal.
For the last frame an estimated MDST spectrum is used which is calculated from the MDCT coefficients Cm−1 of the last received frame (see e.g., the Dauder reference):
|Sm−1(k)|=|Cm−1(k+1)−Cm−1(k−1)|
The power spectra for the frames m−2 and m−1 are calculated as follows:
Pm−2(k)=|Sm−2(k)|²+|Cm−2(k)|²
Pm−1(k)=|Sm−1(k)|²+|Cm−1(k)|²
with:
The obtained power spectra are smoothed as follows:
Psmoothedm−2(k)=0.75·Pm−2(k−1)+Pm−2(k)+0.75·Pm−2(k+1)
Psmoothedm−1(k)=0.75·Pm−1(k−1)+Pm−1(k)+0.75·Pm−1(k+1).
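The power spectrum calculation and smoothing above can be sketched for one frame as follows; the zero padding at the edges is an assumption, as the edge handling is not specified in the text.

```python
import numpy as np

def smoothed_power_spectrum(c, s=None):
    """Smoothed power spectrum of one frame from its MDCT coefficients c.

    If the true MDST coefficients s are unavailable (as for the last
    received frame m-1), |S(k)| is estimated as |C(k+1) - C(k-1)|.
    Edges are zero padded (an assumption; not specified in the text).
    """
    c = np.asarray(c, dtype=float)
    if s is None:
        c_pad = np.pad(c, 1)
        s_abs = np.abs(c_pad[2:] - c_pad[:-2])   # estimated |MDST|
    else:
        s_abs = np.abs(np.asarray(s, dtype=float))
    p = s_abs ** 2 + c ** 2                      # P(k) = |S(k)|^2 + |C(k)|^2
    p_pad = np.pad(p, 1)
    # Psmoothed(k) = 0.75*P(k-1) + P(k) + 0.75*P(k+1)
    return 0.75 * p_pad[:-2] + p + 0.75 * p_pad[2:]
```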
Detection of Tonal Components
Peaks existing in the last two frames (m−2 and m−1) are considered as representatives of tonal components. The continuous existence of the peaks allows for a distinction between tonal components and randomly occurring peaks in noisy signals.
Pitch Information
It is assumed that the pitch information is available:
The pitch information is used only if all of the following conditions are met:
The fundamental frequency is calculated from the pitch lag:
If there is F0′=n·F0 for which N>5 harmonics are the strongest in the spectrum then F0 is set to F0′. F0 is not reliable if there are not enough strong peaks at the positions of the harmonics n·F0.
In accordance with an embodiment, the pitch information is calculated on the framing aligned to the right border of the MDCT window shown in
In another embodiment, the pitch information may be transferred in the bit-stream and used by the codec in the clean channel and thus comes at no additional cost for the concealment.
Envelope
In the following a procedure is described for obtaining a spectrum envelope, which is needed for the peak picking described later.
The envelope of each power spectrum in the last two frames is calculated using a moving average filter of length L:
The filter length depends on the fundamental frequency (and may be limited to the range [7,23]):
This connection between L and F0 is similar to the procedure described in D. B. Paul, “The Spectral Envelope Estimation Vocoder,” IEEE Transactions on Acoustics, Speech, and Signal Processing, pp. 786-794, 1981 (hereinafter “the Paul reference”); however, the present invention uses the pitch information from the current frame, which includes a look-ahead, whereas the Paul reference uses an average pitch specific to a talker. If the fundamental frequency is not available or not reliable, the filter length L is set to 15.
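A sketch of the envelope computation follows. The exact mapping from F0 to the filter length L is not given in the excerpt above, so the rule used here (roughly one harmonic spacing in bins, forced odd) is an assumption; only the clamping range [7, 23] and the fallback L = 15 come from the text.

```python
import numpy as np

def spectral_envelope(power, f0_bins=None):
    """Moving-average envelope of a power spectrum.

    The filter length follows the fundamental frequency: the assumed
    rule is one harmonic spacing in bins, forced odd and clamped to
    [7, 23]; without a reliable F0 the fallback L = 15 is used.
    """
    if f0_bins is None:
        length = 15
    else:
        length = int(round(f0_bins)) | 1   # force an odd filter length
        length = min(23, max(7, length))
    kernel = np.ones(length) / length
    # mode="same" keeps the envelope aligned with the input bins
    return np.convolve(np.asarray(power, dtype=float), kernel, mode="same")
```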
Peak Picking
The peaks are first searched in the power spectrum of the frame m−1 based on predefined thresholds. Based on the location of the peaks in the frame m−1, the thresholds for the search in the power spectrum of the frame m−2 are adapted. Thus, the peaks that exist in both frames (m−1 and m−2) are found, but the exact location is based on the power spectrum of the frame m−2. This order is important because the power spectrum of the frame m−1 is calculated using only an estimated MDST, and thus the location of a peak in it is not precise. It is equally important that the MDCT of the frame m−1 is used, as it is undesirable to continue tones that exist only in the frame m−2 and not in the frame m−1.
In accordance with an embodiment, finding peaks in the power spectrum Pm−1 of the last frame m−1 preceding the replacement frame is done using the following steps (step S400 in
The thresholds for the peak search in the power spectrum Pm−2 of the second last frame m−2 are set as follows (step S402 in
Threshold(k) = (Psmoothed,m−1(k) > Envelopem−1(k)) ? 9.21 dB : 10.56 dB,
Threshold(k) = 8.8 dB + 10·log10(0.35),
Threshold(k−1) = 8.8 dB + 10·log10(0.35 + 2·frac),
Threshold(k+1) = 8.8 dB + 10·log10(0.35 + 2·(1−frac)),
if k ∈ [i−1, i+1] around a peak at index i in Pm−1, then the thresholds set in the first step are overwritten,
Threshold(k) = 20.8 dB.
Tonal peaks are found in the power spectrum Pm−2 of the second last frame m−2 by the following steps (step S404 in
The above described method reveals that the right peak 506 in
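The two-pass search above can be sketched as follows. The dB values (10.56 dB, 8.8 dB, 20.8 dB) follow the ones quoted in the text; the simple candidate test in the first pass and the envelope-relative comparison are assumptions, since the full step lists are not reproduced here.

```python
import numpy as np

def find_peaks_two_pass(P_m1, P_m2, env_m1, env_m2):
    """Two-pass tonal peak search (a sketch).  Candidates are located in
    P[m-1] first; the thresholds for P[m-2] are then relaxed around those
    candidates, so only peaks present in both frames survive, with the
    final peak locations taken from P[m-2]."""
    def local_maxima(P):
        return [k for k in range(1, len(P) - 1)
                if P[k] >= P[k - 1] and P[k] > P[k + 1]]

    # pass 1: candidate peaks in frame m-1 must clearly exceed the envelope
    cand = [k for k in local_maxima(P_m1)
            if 10 * np.log10(P_m1[k] / env_m1[k]) > 10.56]

    # pass 2: strict default threshold, relaxed near the pass-1 candidates
    thr = np.full(len(P_m2), 20.8)
    for i in cand:
        thr[max(i - 1, 0):i + 2] = 8.8
    return [k for k in local_maxima(P_m2)
            if 10 * np.log10(P_m2[k] / env_m2[k]) > thr[k]]

env = np.ones(20)
P_m1 = np.ones(20); P_m1[5] = 1000.0
P_m2 = np.ones(20); P_m2[5] = 100.0; P_m2[12] = 100.0
peaks = find_peaks_two_pass(P_m1, P_m2, env, env)
```

In the toy example, the 20 dB peak at bin 12 of frame m−2 is rejected because no matching candidate exists in frame m−1, while the peak at bin 5 passes the relaxed threshold.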
Sinusoidal Parameter Extraction
For a sinusoidal signal
a shift by N/2 (the MDCT hop size) results in the signal
Thus, there is the phase shift Δφ=π·(l+Δl), where l is the index of a peak. Hence the phase shift depends on the fractional part of the input frequency, plus an additional π for odd spectral coefficients.
The fractional part of the frequency Δl can be derived using a method described, e.g., in A. Ferreira, “Accurate estimation in the ODFT domain of the frequency, phase and magnitude of stationary sinusoids,” 2001 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 47-50, 2001:
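A sketch of a fractional-frequency estimator is given below. It uses parabolic interpolation of the log power spectrum as a common stand-in for the ODFT-domain estimator of the Ferreira reference; it is not that method itself.

```python
import numpy as np

def fractional_bin_offset(P, l):
    """Fractional frequency offset Δl of a spectral peak at bin l.
    Sketch: parabolic interpolation of the log power spectrum, a
    stand-in for the cited ODFT-domain estimator."""
    a, b, c = np.log(P[l - 1]), np.log(P[l]), np.log(P[l + 1])
    return 0.5 * (a - c) / (a - 2 * b + c)

# a windowed sinusoid lying 0.3 bins off the FFT grid
N = 1024
n = np.arange(N)
x = np.cos(2 * np.pi * 50.3 * n / N) * np.hanning(N)
P = np.abs(np.fft.rfft(x)) ** 2
l = int(np.argmax(P))
delta_l = fractional_bin_offset(P, l)
```

For a windowed stationary sinusoid, the estimate recovers the off-grid offset of the tone to within a small bias.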
MDCT Prediction
For all spectrum peaks found and their surroundings, the MDCT prediction is used. For all other spectrum coefficients sign scrambling or a similar noise generating method may be used.
All spectrum coefficients belonging to the found peaks and their surroundings belong to the set that is denoted as K. For example, in
In accordance with embodiments, the surrounding of the peak is defined by a predefined number of coefficients around the peak 502. The surrounding of the peak may comprise a first number of coefficients on the left from the peak 502 and a second number of coefficients on the right from the peak 502. The first number of coefficients on the left from the peak 502 and the second number of coefficients on the right from the peak 502 may be equal or different.
In accordance with embodiments applying the EVS standard, the predefined number of neighboring coefficients may be set or fixed in a first step, e.g. prior to detecting the tonal component. In the EVS standard, three coefficients on the left from the peak 502, three coefficients on the right and the peak 502 may be used, i.e., altogether seven coefficients (this number was chosen for complexity reasons; however, any other number may be used as well).
In accordance with embodiments, the size of the surrounding of the peak is adaptive. The surroundings of the peaks identified as representing a tonal component may be modified such that the surroundings around two peaks do not overlap. In accordance with embodiments, a peak is usually considered only together with its surrounding, and together they define a tonal component.
For the prediction of the MDCT coefficients in a lost frame, the power spectrum (the magnitude of the complex spectrum) in the second last frame is used:
Qm−2(k)=√(Pm−2(k))=√(|Sm−2(k)|²+|Cm−2(k)|²).
The lost MDCT coefficient in the replacement frame is estimated as:
Cm(k)=Qm−2(k)·cos(φm(k)).
In the following, a method for calculating the phase φm(k) in accordance with an embodiment will be described.
Phase Prediction
For every spectrum peak found, the fractional frequency Δl is calculated as described above and the phase shift is:
Δφ=π·(l+Δl).
Δφ is the phase shift between the frames. It is equal for the coefficients in a peak and its surrounding.
The phase for each spectrum coefficient at the peak position and the surroundings (k∈K) is calculated in the second last received frame using the expression:
The phase in the lost frame is predicted as:
φm(k)=φm−2(k)+2·Δφ.
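The basic prediction above can be sketched as follows. Writing the frame m−2 phase as atan2(S, C) is an assumption, since the exact expression for φm−2(k) is not reproduced in the text.

```python
import numpy as np

def predict_mdct(C_m2, S_m2, l, delta_l, K):
    """Predict the lost-frame MDCT coefficients for a peak at bin l and
    its surrounding bins K (a sketch).  Q is the magnitude of the
    complex spectrum in frame m-2; the phase is advanced by two hops of
    dphi = pi*(l + delta_l)."""
    C_m2 = np.asarray(C_m2, float)
    S_m2 = np.asarray(S_m2, float)
    Q = np.sqrt(C_m2 ** 2 + S_m2 ** 2)      # magnitude in frame m-2
    phi_m2 = np.arctan2(S_m2, C_m2)         # assumed phase of frame m-2
    dphi = np.pi * (l + delta_l)            # phase shift per N/2 hop
    C_m = np.zeros_like(C_m2)
    for k in K:                             # peak and its surrounding only
        C_m[k] = Q[k] * np.cos(phi_m2[k] + 2 * dphi)
    return C_m

# stationary tone exactly on even bin 4: two hops add a multiple of 2*pi,
# so the prediction reproduces the frame m-2 MDCT coefficient
C_m2 = np.zeros(8); C_m2[4] = 0.6
S_m2 = np.zeros(8); S_m2[4] = 0.8
C_m = predict_mdct(C_m2, S_m2, l=4, delta_l=0.0, K=[4])
```

Coefficients outside K would be filled by sign scrambling or a similar noise-generating method, as described above.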
In accordance with an embodiment, a refined phase shift may be used. Using the calculated phase φm−2(k) for each spectrum coefficient at the peak position and the surroundings allows for an estimation of the MDST in the frame m−1 which can be derived as:
Sm−1(k)=Qm−2(k)·sin(φm−2(k)+Δφ(k))
with:
Qm−2(k)—magnitude of the complex spectrum (square root of the power spectrum Pm−2(k)) in frame m−2.
From this MDST estimation and from the received MDCT an estimation of the phase in the frame m−1 is derived:
The estimated phase is used to refine the phase shift:
Δφ(k)=φm−1(k)−φm−2(k)
with:
φm−1(k)—phase of the complex spectrum in frame m−1, and
φm−2(k)—phase of the complex spectrum in frame m−2.
The phase in the lost frame is predicted as:
φm(k)=φm−1(k)+Δφ(k).
The phase shift refinement in accordance with this embodiment improves the prediction of sinusoids in the presence of a background noise or if the frequency of the sinusoid is changing. For non-overlapping sinusoids with constant frequency and without background noise the phase shift is the same for all of the MDCT coefficients that surround the peak.
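The refinement can be sketched as follows. The atan2 phases and the plain (wrapped) phase differences are assumptions; the structure (estimate the MDST of frame m−1, derive its phase from the received MDCT, refine Δφ per bin) follows the steps above.

```python
import numpy as np

def refined_phase(C_m1, C_m2, S_m2, dphi):
    """Phase-shift refinement (a sketch).  C_m1 is the received MDCT of
    frame m-1; C_m2, S_m2 form the complex spectrum of frame m-2; dphi
    is the per-hop phase shift of the peak."""
    Q_m2 = np.sqrt(C_m2 ** 2 + S_m2 ** 2)
    phi_m2 = np.arctan2(S_m2, C_m2)
    S_m1 = Q_m2 * np.sin(phi_m2 + dphi)     # estimated MDST of frame m-1
    phi_m1 = np.arctan2(S_m1, C_m1)         # estimated phase of frame m-1
    dphi_ref = phi_m1 - phi_m2              # refined per-bin phase shift
    return phi_m1 + dphi_ref                # predicted phase in frame m

# stationary sinusoid: the refined prediction equals phi + 2*dphi
phi, dphi, Q = 0.5, 0.3, 2.0
pred = refined_phase(Q * np.cos(phi + dphi),   # true MDCT of frame m-1
                     Q * np.cos(phi), Q * np.sin(phi), dphi)
```

For a clean stationary sinusoid the refined shift equals the original Δφ, so the refinement only changes the prediction when noise or frequency change makes the received MDCT deviate from the sinusoidal model.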
The concealment that is used may have different fade out speeds for the tonal part and for the noise part. If the fade-out speed for the tonal part of the signal is slower, after multiple frame losses, the tonal part becomes dominant. The fluctuations in the sinusoid, which are due to the different phase shifts of the sinusoid components, produce unpleasant artifacts.
To overcome this problem, in accordance with embodiments, starting from the third lost frame, the phase difference of the peak (with index k) is used for all spectral coefficients surrounding it (k−l is the index of the left foot and k+u is the index of the right foot):
Δφm+2(i)=Δφ(k), i∈[k−l,k+u].
In accordance with further embodiments, a transition is provided. The spectral coefficients in the second lost frame with a high attenuation use the phase difference of the peak, and coefficients with small attenuation use the corrected phase difference:
Magnitude Refinement
In accordance with other embodiments, instead of applying the above described phase shift refinement, another approach may be applied which uses a magnitude refinement:
where l is the index of a peak and the fractional frequency Δl is calculated as described above. The phase shift is:
Δφ=π·(l+Δl).
To avoid an increase in energy, the refined magnitude, in accordance with further embodiments, may be limited by the magnitude from the second last frame:
Qm−1(k)=min(Qm−1(k),Qm−2(k)).
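The magnitude refinement with its limit can be sketched as follows. Deriving the refined magnitude as |C(m−1)| / |cos(φ(m−1))| is an assumption, since the exact refinement formula is not reproduced in the text; the limit is written as a min(), which matches the stated goal of avoiding an increase in energy.

```python
import numpy as np

def refine_magnitude(C_m1, Q_m2, phi_m1, eps=1e-6):
    """Magnitude refinement (a sketch).  The refined magnitude is
    derived from the received MDCT of frame m-1 and the estimated phase
    (an assumption), then limited by the frame m-2 magnitude so the
    refinement cannot increase the energy."""
    Q_m1 = np.abs(C_m1) / np.maximum(np.abs(np.cos(phi_m1)), eps)
    return np.minimum(Q_m1, Q_m2)

# |0.5| / cos(60 degrees) = 1.0, then limited by the frame m-2 magnitude
q = refine_magnitude(0.5, 2.0, np.pi / 3)
```

The eps guard keeps the division stable when the estimated phase puts cos(φ) near zero, which is exactly where an unlimited refinement would blow up.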
Further, in accordance with yet further embodiments, the decrease in magnitude may be used to fade the tonal component:
Phase Prediction Using the “Frame In-Between”
Instead of basing the prediction of the spectral coefficients on the frames preceding the replacement frame, in accordance with other embodiments, the phase prediction may use a “frame in-between” (also referred to as “intermediate” frame).
If the MDCT window overlap is less than 50% it is possible to get the CMDCT spectrum closer to the lost frame. In
In this embodiment, the calculation of both the MDST coefficients Sm−1.5 and the MDCT coefficients Cm−1.5 is done directly from the decoded time domain signal, with the MDST and MDCT constituting the CMDCT. Alternatively the CMDCT can be derived using matrix operations from the neighboring existing MDCT coefficients.
The power spectrum calculation is done as described above, and the detection of tonal components is done as described above, with frame m−2 being replaced by frame m−1.5.
For a sinusoidal signal
a shift by N/4 (half the MDCT hop size) results in the signal
This results in the phase shift
Hence the phase shift depends on the fractional part of the input frequency, plus an additional term of
where l is the index of a peak. The detection of the fractional frequency is done as described above.
For the prediction of the MDCT coefficients in a lost frame, the magnitude from the m−1.5 frame is used:
Qm−1.5(k)=√(Pm−1.5(k))=√(|Sm−1.5(k)|²+|Cm−1.5(k)|²).
The lost MDCT coefficient is estimated as:
Cm(k)=Qm−1.5(k)·cos(φm(k)).
The phase φm(k) can be calculated using:
Further, in accordance with embodiments, the phase shift refinement described above may be applied:
Further, the convergence of the phase shift for all spectral coefficients surrounding a peak to the phase shift of the peak can be used as described above.
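The intermediate-frame prediction can be sketched as follows. Both the per-hop shift of (π/2)·(l+Δl) for an N/4 step and the use of three such steps from frame m−1.5 to frame m are assumptions, since the exact phase formula is not reproduced in the text.

```python
import numpy as np

def predict_mdct_inbetween(C_m15, S_m15, l, delta_l, K):
    """MDCT prediction from the intermediate frame m-1.5 (a sketch).
    Q and phi are taken from the CMDCT at m-1.5; the phase is advanced
    over the assumed 1.5-frame distance (three N/4 steps) to frame m."""
    C_m15 = np.asarray(C_m15, float)
    S_m15 = np.asarray(S_m15, float)
    Q = np.sqrt(C_m15 ** 2 + S_m15 ** 2)     # magnitude at m-1.5
    phi = np.arctan2(S_m15, C_m15)           # phase at m-1.5
    dphi = 0.5 * np.pi * (l + delta_l)       # assumed shift per N/4 step
    C_m = np.zeros_like(C_m15)
    for k in K:                              # peak and surrounding only
        C_m[k] = Q[k] * np.cos(phi[k] + 3 * dphi)
    return C_m

# tone exactly on bin 4: three steps add a multiple of 2*pi, so the
# predicted coefficient reproduces the m-1.5 MDCT coefficient
C_m15 = np.zeros(8); C_m15[4] = 0.6
S_m15 = np.zeros(8); S_m15[4] = 0.8
C_m = predict_mdct_inbetween(C_m15, S_m15, l=4, delta_l=0.0, K=[4])
```

Because the CMDCT at m−1.5 is computed directly from the decoded time-domain signal, no MDST estimation error enters Q, which is the advantage of this variant over prediction from frames m−2 and m−1.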
Although some aspects of the described concept have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus.
Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein.
A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are advantageously performed by any hardware apparatus.
While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations and equivalents as fall within the true spirit and scope of the present invention.
Number | Date | Country | Kind |
---|---|---|---|
EP13173161.4 | Jun 2013 | EP | regional |
EP14167072.9 | May 2014 | EP | regional |
This application is a continuation of co-pending U.S. patent application Ser. No. 15/844,004 filed Dec. 15, 2017 which is a continuation of U.S. Pat. No. 9,916,834 granted Mar. 13, 2018, which is a continuation of International Application No. PCT/EP2014/063058, filed Jun. 20, 2014, all of which are incorporated herein by reference in their entirety, and additionally claims priority from European Applications Nos. EP13173161.4, filed Jun. 21, 2013, and EP 14167072.9, filed May 5, 2014, both of which are incorporated herein by reference in their entirety.
Number | Date | Country | |
---|---|---|---|
Parent | 15844004 | Dec 2017 | US |
Child | 16584645 | US | |
Parent | 14977207 | Dec 2015 | US |
Child | 15844004 | US | |
Parent | PCT/EP2014/063058 | Jun 2014 | US |
Child | 14977207 | US |