The present invention relates to coding audio signals.
Referring now to
In the sinusoidal analyser 130, the 1st residual signal x2 for each segment is modelled using a number of sinusoids represented by amplitude, frequency and phase parameters. Once the sinusoids for a segment are estimated, a tracking algorithm is initiated. This algorithm links sinusoids with each other on a segment-to-segment basis to obtain so-called tracks. The tracking algorithm thus results in sinusoidal codes CS comprising sinusoidal tracks that start at a specific time instance, evolve for a certain amount of time over a plurality of time segments and then stop.
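By way of illustration only, the segment-to-segment linking described above can be sketched as follows. The greedy nearest-frequency rule, the function name `link_tracks` and the parameter `max_jump_hz` are assumptions made for this sketch; the specification does not prescribe a particular linking rule, and amplitude and phase are omitted for brevity.

```python
def link_tracks(segments, max_jump_hz):
    """Link per-segment sinusoid frequencies into tracks.

    segments: list of lists of frequencies (Hz), one inner list per segment.
    Returns a list of tracks, each a list of (segment_index, frequency).
    Hypothetical rule: a track continues with the nearest unclaimed frequency
    in the next segment if it lies within max_jump_hz, otherwise it stops.
    """
    finished = []
    active = []  # tracks whose last point lies in the previous segment
    for seg_index, freqs in enumerate(segments):
        unclaimed = list(freqs)
        still_active = []
        for track in active:
            last_freq = track[-1][1]
            best = min(unclaimed, key=lambda f: abs(f - last_freq), default=None)
            if best is not None and abs(best - last_freq) <= max_jump_hz:
                track.append((seg_index, best))
                unclaimed.remove(best)
                still_active.append(track)
            else:
                finished.append(track)  # the track stops here
        # Any frequency not claimed by an existing track starts a new track.
        for f in unclaimed:
            still_active.append([(seg_index, f)])
        active = still_active
    return finished + active
```

In this sketch a track starts at the segment where its first frequency appears, evolves while a close enough frequency is found in each following segment, and stops otherwise.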
A number of coding methods can be employed in the noise coder to model the 2nd residual signal x3. For transparent audio quality, the noise coder can be a waveform coder in the form of a filter bank. Alternatively, for good quality and low bit-rate, the noise coder can employ a synthetic noise model to produce, for example, Autoregressive Moving Average (ARMA) or Linear Predictive Coding (LPC) filter parameters.
It is also possible to derive other components of the input audio signal such as harmonic complexes. The present specification relates only to sinusoidal and noise components, but the extension to harmonic complexes does not affect the invention in any way.
The extraction of sinusoids from a segment of an audio signal can be problematic. Within segments, sinusoidal amplitudes and frequencies can vary and this is referred to as instationarity. Furthermore, inaccuracies can occur in the estimation of the sinusoids. As a result, the spectral suppression achieved using the coded sinusoids is not always satisfactory or ideal. This results in the presence of sinusoidal-like components especially at or near the positions of the coded sinusoids in the 2nd residual signal.
In addition, at low bit rates, where there are only enough bits to code a few sinusoids, sinusoidal components will still be present in the 2nd residual.
Noise coders in general model the temporal and spectral envelope of the residual signal x3 rather coarsely, i.e. they have a limited spectral resolution and artefacts can appear when a noise coder models sinusoidal components. Even if tonal components remaining in the residual are masked, audible artefacts can occur, due to the limited spectral resolution of the noise model. This is especially likely to occur at low frequencies where the auditory system has a good spectral resolution and spectral resolution of the noise coder is usually worse. Also, in contrast to a stationary, tonal signal, the energy of the noisy component will always fluctuate over time. These fluctuations may make a previously masked tonal component audible. Energy fluctuations will be biggest in regions where spectral resolution should be good, i.e. at low frequencies. Thus, apart from the fact that in trying to model the sinusoidal-like components in the residual signal x3, the noise coder requires additional bits for the noise codes CN, modelling these components as noise may result in audible artefacts, particularly at low frequencies.
The present invention attempts to mitigate this problem.
According to the present invention there is provided a method according to claim 1.
The invention includes a re-analysis stage prior to the noise coder. In one embodiment, tonal components are removed from the residual by, for example, matching pursuit in combination with an energy-based stopping criterion which determines when to stop extracting tonal components.
In another embodiment, the residual signal is additionally suppressed at the frequencies of the coded sinusoids and their surroundings. The number of surrounding frequencies can be fixed or dependent on the frequency. A psycho-acoustical frequency division (e.g. Bark/Erb bands) can also be used. The amount of suppression can for example depend on the number of sinusoids, or the energy of the sinusoids. As a result, the noise coder does not need to model these sinusoidal regions any more.
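The suppression around coded sinusoids can be sketched, under stated assumptions, as follows. The function name, the bin-mapping rule (nearest FFT bin), the fixed neighbourhood `halfwidth_bins` and the single `gain` factor are all hypothetical choices for illustration; a frequency-dependent or Bark/ERB-based neighbourhood would replace the fixed half-width.

```python
def suppress_coded_sinusoids(spectrum, sinusoid_freqs_hz, sample_rate_hz,
                             fft_length, halfwidth_bins, gain=0.0):
    """Attenuate spectrum bins at and around the coded sinusoid frequencies.

    spectrum: values for the non-negative frequency bins 0..fft_length // 2.
    gain: factor applied to the affected bins (0.0 sets them to zero).
    """
    conditioned = list(spectrum)
    for freq in sinusoid_freqs_hz:
        # Assumed mapping: nearest FFT bin to the coded frequency.
        centre = round(freq * fft_length / sample_rate_hz)
        lo = max(0, centre - halfwidth_bins)
        hi = min(len(conditioned) - 1, centre + halfwidth_bins)
        for k in range(lo, hi + 1):
            conditioned[k] *= gain
    return conditioned
```

A gain between 0 and 1 gives partial suppression, which could be made to depend on the number or energy of the sinusoids as mentioned above.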
Preferred embodiments of the invention will now be described with reference to the accompanying drawings wherein like components have been accorded like reference numerals and, unless otherwise stated perform a like function. In a preferred embodiment of the present invention,
In both the prior art and the preferred embodiment, the audio coder 1′ samples an input audio signal at a certain sampling frequency resulting in a digital representation x(t) of the audio signal. The coder 1′ then separates the sampled input signal into three components: transient signal components, sustained deterministic components, and sustained stochastic components. The audio coder 1′ comprises a transient coder 11, a sinusoidal coder 13 and a noise coder 14.
The transient coder 11 comprises a transient detector (TD) 110, a transient analyzer (TA) 111 and a transient synthesizer (TS) 112. First, the signal x(t) enters the transient detector 110. This detector 110 estimates if there is a transient signal component and its position. This information is fed to the transient analyzer 111. If the position of a transient signal component is determined, the transient analyzer 111 tries to extract (the main part of) the transient signal component. It matches a shape function to a signal segment preferably starting at an estimated start position, and determines content underneath the shape function, by employing for example a (small) number of sinusoidal components. This information is contained in the transient code CT and more detailed information on generating the transient code CT is provided in PCT Patent Application No. WO 01/69593.
The transient code CT is furnished to the transient synthesizer 112. The synthesized transient signal component is subtracted from the input signal x(t) in subtractor 16, resulting in a signal x2.
The signal x2 is furnished to the sinusoidal coder 13 where it is analyzed in a sinusoidal analyzer (SA) 130, which determines the (deterministic) sinusoidal components. It will therefore be seen that while the presence of the transient analyser is desirable, it is not necessary and the invention can be implemented without such an analyser. Alternatively, as mentioned above, the invention can be implemented with for example a harmonic complex analyser. In any case, the end result of sinusoidal coding is a sinusoidal code CS and a more detailed example illustrating the conventional generation of an exemplary sinusoidal code CS is provided in PCT Patent Application No. WO 00/79519.
In brief, however, such a sinusoidal coder encodes the input signal x2 as tracks of sinusoidal components linked from one frame segment to the next. From the sinusoidal code CS generated with the sinusoidal coder, the sinusoidal signal component is reconstructed by a sinusoidal synthesizer (SS) 131. This signal is subtracted in subtractor 17 from the input x2 to the sinusoidal coder 13, resulting in a remaining signal x3.
According to the present invention, there is provided a re-analyser 18, which conditions the residual signal x3 prior to encoding by a noise coder 14. In each of the embodiments of the invention, the re-analyser 18 selectively removes or suppresses spectral regions at or near the positions of tonal components from the residual signal x3 and provides a conditioned residual signal x3′ to the noise coder 14.
Referring now to
In a first embodiment, in the re-analyser 18, conditioning of the spectrum generated by the FFT, step 46, comprises applying a conventional type matching pursuit algorithm to iteratively remove peaks from the spectrum. In the first embodiment, the algorithm iteratively removes those peaks that result in the greatest reduction of energy. In general this will mean that the matching pursuit algorithm first extracts peaks corresponding to tonal components and then tends to extract noisy peaks, because the reduction in energy is, on average, bigger for the extraction of tonal peaks than for the extraction of noisy ones. Thus, the extraction should stop just after the extraction of all tonal components and just before the extraction of noisy ones. On the one hand, if not all tonal components are removed, when synthesised in a decoder, the signal may be too noisy, because tonal components will have been modelled by the noise coder 14. On the other hand, if too many and thus some noisy components are removed, the synthesised signal may sound metallic, because of resulting gaps in unsuitable regions of the spectrum of the residual signal x3′ provided to the noise coder 14.
In one implementation of the first embodiment, a stopping criterion indicates when to stop extracting components. This criterion is based on the energy of the residual before and after the extraction of a peak. Thus, when the reduction in energy after removal of a peak is less than a certain percentage, this indicates that all tonal peaks have been extracted and that the conditioned residual x3′ will be free of tonal components.
Since the reduction in energy depends on the length of the analysis window, the energy criterion is inversely proportional to the window length. For example, for a window length of 1024 sample points at 48 kHz (≈21.3 ms), a useful value for the criterion is a reduction in energy of 5%, whereas for a window length of 512 sample points at 48 kHz (≈10.7 ms), it is 10%.
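A minimal sketch of matching pursuit with the energy-based stopping criterion is given below, purely as an illustration. It assumes a dictionary of cosine/sine pairs at the DFT bin frequencies and uses a naive DFT for self-containment; the function names, the `max_peaks` safety cap and the scaling rule in `criterion_for_window` (extrapolated from the two example values above) are assumptions, not the implementation of the specification.

```python
import cmath
import math


def criterion_for_window(window_length):
    """Assumed scaling: 5% at 1024 samples, inversely proportional to length."""
    return 0.05 * 1024 / window_length


def extract_tonal_peaks(segment, min_reduction, max_peaks=50):
    """Iteratively remove the peak giving the greatest energy reduction.

    Stops when removing the best remaining peak would reduce the residual
    energy by less than the fraction min_reduction.
    Returns (extracted_bins, residual).
    """
    n = len(segment)
    residual = list(segment)
    extracted = []
    while len(extracted) < max_peaks:
        energy = sum(v * v for v in residual)
        if energy <= 1e-12:
            break
        # Naive DFT over the bins strictly between DC and Nyquist.
        best_bin, best_coeff = None, 0j
        for k in range(1, n // 2):
            coeff = sum(v * cmath.exp(-2j * math.pi * k * i / n)
                        for i, v in enumerate(residual))
            if abs(coeff) > abs(best_coeff):
                best_bin, best_coeff = k, coeff
        if best_bin is None:
            break
        # Energy removed by projecting onto the cos/sin pair at this bin.
        reduction = (2.0 / n) * abs(best_coeff) ** 2 / energy
        if reduction < min_reduction:
            break  # remaining peaks are treated as noise
        a = 2.0 * best_coeff.real / n
        b = -2.0 * best_coeff.imag / n
        for i in range(n):
            phase = 2 * math.pi * best_bin * i / n
            residual[i] -= a * math.cos(phase) + b * math.sin(phase)
        extracted.append(best_bin)
    return extracted, residual
```

Passing a very large `min_reduction` disables extraction, while setting `min_reduction` to zero and limiting `max_peaks` corresponds to running a fixed number of iterations.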
In another implementation of the first embodiment, a fixed number of peaks are extracted, i.e. matching pursuit runs through a fixed number of iterations.
As an alternative to the iterative matching pursuit approach of the first embodiment, in a second embodiment, the conditioning step 46 picks and removes a number of the highest energy peaks from the spectrum generated in step 44 in a single step. The number can be fixed or variable and may, for example, cover all peaks in the spectrum. Being performed in a single iteration, this technique is faster than matching pursuit. However, it may lose the benefit of picking up peaks masked by more powerful peaks, which matching pursuit may detect.
In the cases above where a fixed number of peaks are removed either iteratively or in a single step, it has been found experimentally that the extraction of 5 peaks or fewer resulted in better, less noisy signals, while the extraction of more than 5 peaks resulted in a less noisy but metallic-sounding signal.
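The single-step variant can be sketched as follows, assuming that a peak is a local maximum of the magnitude spectrum and that removal means zeroing the peak bin only. Both assumptions, and the function name, are illustrative.

```python
def remove_highest_peaks(magnitudes, count):
    """Zero the `count` largest local maxima of a magnitude spectrum in one pass.

    Returns (conditioned_magnitudes, removed_bin_indices).
    """
    # A bin is taken to be a peak if it exceeds its left neighbour and is
    # not smaller than its right neighbour (edge bins are ignored).
    peaks = [k for k in range(1, len(magnitudes) - 1)
             if magnitudes[k - 1] < magnitudes[k] >= magnitudes[k + 1]]
    strongest = sorted(peaks, key=lambda k: magnitudes[k], reverse=True)[:count]
    conditioned = list(magnitudes)
    for k in strongest:
        conditioned[k] = 0.0
    return conditioned, sorted(strongest)
```

Because all peaks are selected from the same, unmodified spectrum, a weaker peak hidden under the skirt of a stronger one is not found, which is the trade-off noted above.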
In all of the above implementations, the re-analyser 18 takes an inverse FFT of the residual spectrum when matching pursuit has completed to obtain a time domain signal, step 48. By applying overlap-add for successive conditioned time domain signals, step 50, the conditioned residual x3′ is created and this is fed through the noise module 14. It will be seen that the conditioned segments s1′, s2′. . . of the residual x3′ correspond to the segments s1, s2 . . . in the time domain and as such no loss of synchronisation occurs as a result of the re-analysis.
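The overlap-add of step 50 can be illustrated with the following sketch, which assumes equal-length conditioned segments spaced a fixed number of samples (`hop`) apart; the function name and parameters are hypothetical.

```python
def overlap_add(segments, hop):
    """Sum equal-length segments, each shifted by `hop` samples from the last."""
    if not segments:
        return []
    seg_len = len(segments[0])
    out = [0.0] * (hop * (len(segments) - 1) + seg_len)
    for index, segment in enumerate(segments):
        start = index * hop
        for offset, value in enumerate(segment):
            out[start + offset] += value  # overlapping samples accumulate
    return out
```

Each conditioned segment is written back at the position of the segment it was derived from, so the time alignment of the residual is preserved.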
It will be seen that where the residual signal x3 is not an overlapping signal but rather is a continuous time signal, then the windowing step 42 will not be required. Similarly, if the noise coder 14 expects a continuous time signal rather than an overlapping signal, the overlap-add step 50 will not be required. Nonetheless, it will also be seen that the first embodiment can be implemented without requiring any changes to be made to the conventional sinusoidal coder 13 or the noise coder 14. Also, in both of the above implementations psycho-acoustic considerations do not have to be taken into account when conditioning the signal x3 to produce the signal x3′.
In third and fourth embodiments of the invention, while no changes need to be made to the internal operation of the sinusoidal coder 13, the re-analyser 18 is provided with the sinusoidal codes Cs for each segment s1, s2 . . . as indicated by the dashed line 52 of
In the fourth embodiment of the invention, the re-analyser 18 is provided with the original signal for each segment s1, s2 . . . as indicated by the dashed line 56 of
However, by setting frequency bands to zero, noise parameters can be encoded very efficiently resulting in a considerable coding gain. Thus, if the conditioned frequency spectra generated at step 46 were fed directly to an adapted noise coder, the noise coder may be able to apply for example, run-length coding to take advantage of the gain of a number of consecutive frequency bands being zero. In existing state-of-the-art noise coders run-length coding is not applied, because without conditioning it only rarely occurs that parts of the residual spectrum are zero. However, by applying spectral blanking, run-length encoding will result in a considerable bit-rate reduction. Corresponding changes would of course need to be made to the decoder to take account of any changes in the coding of noise information.
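As an illustration of how runs of zeroed frequency bands might be exploited, the following sketch run-length codes a list of per-band noise levels. The token format and function names are hypothetical; an actual adapted noise coder and decoder would define their own bit-stream syntax.

```python
def run_length_encode_bands(band_levels):
    """Collapse consecutive zero bands into ("zeros", run_length) tokens."""
    tokens = []
    zero_run = 0
    for level in band_levels:
        if level == 0:
            zero_run += 1
            continue
        if zero_run:
            tokens.append(("zeros", zero_run))
            zero_run = 0
        tokens.append(("level", level))
    if zero_run:
        tokens.append(("zeros", zero_run))
    return tokens


def run_length_decode_bands(tokens):
    """Inverse of run_length_encode_bands, as a decoder would need."""
    levels = []
    for kind, value in tokens:
        if kind == "zeros":
            levels.extend([0] * value)
        else:
            levels.append(value)
    return levels
```

The longer the blanked regions, the fewer tokens are needed, which is the source of the bit-rate reduction described above.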
In a fifth embodiment of the invention, rather than providing the sinusoidal codes Cs to the re-analyser 18, the sinusoidal coder 13 is adapted to provide to the re-analyser 18 the parameters for sinusoidal components which were detected by the sinusoidal analyser 130 but dropped during the coding process as indicated by the line 54 in
In the case of types M and B, it will be seen that these components are more likely to be tonal than in the case of type S. Therefore in the fifth embodiment, the conditioning step 46 comprises removing a number (fixed or variable) of the highest energy peaks corresponding to M and B type frequencies before providing the conditioned spectrum for processing as before in steps 48, 50.
While each of the above embodiments has been described independently, it will be seen that one or more of these techniques may be combined in the conditioning step 46. For example, the steps of the fifth embodiment may be performed to remove a limited number of M or B type components before the steps of the first embodiment are performed to remove other peaks.
It will also be seen that while each of the embodiments has been described in terms of conditioning the residual signal x3 in the frequency domain, the re-analyser 18 could equally operate in the time domain.
In any case, the conditioned signal x3′ produced by the re-analyser 18 can now more properly be assumed to comprise only noise and the noise analyzer 14 of the preferred embodiment produces a noise code CN representative of this noise, as described in, for example, PCT patent application No. PCT/EP00/04599.
Finally, in a multiplexer 15, an audio stream AS is constituted which includes the codes CT, CS and CN. The audio stream AS is furnished to e.g. a data bus, an antenna system, a storage medium etc.
The sinusoidal code CS is used to generate signal yS, described as a sum of sinusoids on a given segment. At the same time, as the sinusoidal components of the signal are being synthesized, the noise code CN is fed to a noise synthesizer NS 33, which is mainly a filter, having a frequency response approximating the spectrum of the noise. The NS 33 generates reconstructed noise yN by filtering a white noise signal with the noise code CN.
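One way such a noise synthesizer could operate is sketched below, assuming that the noise code carries all-pole (LPC) filter coefficients and a gain. The coefficient convention A(z) = 1 + a1·z⁻¹ + ... , the function name and the seeded Gaussian excitation are assumptions for illustration only.

```python
import random


def synthesize_noise(lpc_coeffs, gain, num_samples, excitation=None, seed=0):
    """Filter white noise with the all-pole filter gain / A(z).

    Implements y[n] = gain * e[n] - sum_k a_k * y[n - k], k = 1..len(lpc_coeffs).
    If no excitation is supplied, seeded Gaussian white noise is generated.
    """
    if excitation is None:
        rng = random.Random(seed)
        excitation = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]
    out = []
    for n in range(num_samples):
        acc = gain * excitation[n]
        for k, a in enumerate(lpc_coeffs, start=1):
            if n - k >= 0:
                acc -= a * out[n - k]
        out.append(acc)
    return out
```

The optional `excitation` argument exists only so that the filter can be checked with a known input such as a unit impulse.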
In the player of
The total signal y(t) comprises the sum of the transient signal yT and the product of any amplitude decompression (g) and the sum of the sinusoidal signal yS and the noise signal yN′. The audio player comprises two adders 36 and 37 to sum respective signals. The total signal is furnished to an output unit 35, which is e.g. a speaker.
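Written out as an equation, using the symbols of the preceding paragraph, the total signal is:

```latex
y(t) = y_T(t) + g \cdot \bigl( y_S(t) + y_N'(t) \bigr)
```

Here g is the amplitude decompression factor, and g = 1 where no amplitude decompression is applied.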
Number | Date | Country | Kind |
---|---|---|---|
02079939.1 | Nov 2002 | EP | regional |

Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/IB03/04869 | 10/29/2003 | WO | | 5/24/2005 |