The present invention relates to the field of encoding digital audio data using lossy compression algorithms, such as Advanced Audio Coding, in order to achieve lower bit rates while keeping the quality of the audio data high.
The modern digital lifestyle owes much to the principle of perceptual digital audio compression, such as MPEG-4 AAC (MPEG = Moving Picture Experts Group, AAC = Advanced Audio Coding) or MP3 (MPEG-1 Audio Layer 3). Typical state of the art audio compression systems use a time-to-frequency transform, such as the modified discrete cosine transform (MDCT), to sub-divide the signal into frequency bands formed of pluralities of spectral coefficients. These grouped coefficients are quantized with appropriate quantization algorithms and then coded with an entropy coding method, for example Huffman coding.
The modified discrete cosine transform is a Fourier-related transform with the additional property of being lapped, i.e. it is designed to be performed on consecutive blocks of a larger dataset, where subsequent blocks are overlapped so that the last half of one block coincides with the first half of the next block. This overlapping, in addition to the energy-compaction qualities of the discrete cosine transform, makes the modified discrete cosine transform especially attractive for signal compression applications, since it helps to avoid artifacts stemming from block boundaries. For these reasons, the modified discrete cosine transform is employed, for example, in MP3 and AAC.
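The lapped property can be illustrated with a minimal Python sketch, assuming a direct O(N^2) MDCT, a sine window and 50 % overlap (hop size N, block length 2N). Each inverse transform alone contains time-domain aliasing; the aliasing cancels when the windowed halves of neighbouring blocks are overlap-added. The function names are hypothetical, and the IMDCT is scaled by 2/N here so that, with a window satisfying the Princen-Bradley condition applied at analysis and synthesis, overlap-add returns the input. Real codecs use fast FFT-based algorithms and window switching, which are omitted.

```python
import math

def sine_window(n_half):
    # Length-2N sine window; w[n]^2 + w[n+N]^2 = 1 (Princen-Bradley condition).
    two_n = 2 * n_half
    return [math.sin(math.pi * (n + 0.5) / two_n) for n in range(two_n)]

def mdct(block):
    # 2N (windowed) time samples -> N spectral coefficients, direct formula.
    two_n = len(block)
    n_half = two_n // 2
    n0 = 0.5 + n_half / 2.0
    return [sum(block[n] * math.cos(math.pi / n_half * (n + n0) * (k + 0.5))
                for n in range(two_n))
            for k in range(n_half)]

def imdct(coeffs):
    # N coefficients -> 2N time samples that still contain aliasing terms.
    n_half = len(coeffs)
    n0 = 0.5 + n_half / 2.0
    return [2.0 / n_half * sum(coeffs[k] * math.cos(math.pi / n_half * (n + n0) * (k + 0.5))
                               for k in range(n_half))
            for n in range(2 * n_half)]

def analyze_synthesize(signal, n_half):
    # Window, transform, inverse transform, window again, overlap-add with hop N.
    w = sine_window(n_half)
    padded = [0.0] * n_half + list(signal) + [0.0] * (2 * n_half)
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * n_half + 1, n_half):
        block = [padded[start + n] * w[n] for n in range(2 * n_half)]
        rec = imdct(mdct(block))
        for n in range(2 * n_half):
            out[start + n] += rec[n] * w[n]
    # Drop the leading zero padding; aliasing has cancelled in every overlapped half.
    return out[n_half:n_half + len(signal)]
```

Calling `analyze_synthesize(x, 4)` on any short list `x` returns `x` up to floating-point rounding, even though every single block produces only N coefficients for 2N input samples.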
Unfortunately, at very low bit rates, i.e. at high compression demands, coding systems have no option but to shut down frequency bands, i.e. to replace them with silence, in order to meet the coding demands imposed on the codec. This introduces holes in the spectrum, which are especially annoying and are the biggest contributor to audio coding artifacts.
In parallel, the perceptual model 815 evaluates the input signal by mathematically modeling the human auditory system and outputs a measure such as the just noticeable distortion (JND), expressed, for example, as a signal-to-mask ratio (SMR), i.e. the ratio of the input signal energy to the just noticeable distortion or noise energy.
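As a small worked example of the signal-to-mask ratio, the following Python sketch computes the SMR of one band in dB, assuming the band energy is the sum of squared spectral coefficients and that the JND energy for the band is already supplied by a perceptual model (the masking computation itself is not shown). The function names are hypothetical.

```python
import math

def band_energy(coeffs):
    # Energy of one frequency band: sum of squared spectral coefficients.
    return sum(c * c for c in coeffs)

def smr_db(band_coeffs, jnd_energy):
    # Signal-to-mask ratio in dB: band signal energy relative to the
    # just-noticeable-distortion energy reported by the perceptual model.
    return 10.0 * math.log10(band_energy(band_coeffs) / jnd_energy)
```

For a band with coefficients `[3.0, 4.0]` (energy 25) and a JND energy of 0.25, the ratio is 100, i.e. 20 dB: the quantization noise in this band may be up to 20 dB below the signal before it becomes audible.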
The perceptual model block 815 and the remaining blocks in the state of the art encoder, as it is depicted in
The target compression demand is met by quantization of the frequency coefficients. Before quantization, the coefficients are scaled by so-called scaling factors, which determine the eventual precision of the quantization process. The bit/noise allocation block 820 is responsible for estimating or calculating the scaling factors such that the reconstruction of the quantized values yields quantization noise just below the masking threshold estimated by the perceptual model. Under certain circumstances, the perceptual model 815 indicates that certain frequency bands are noise-like and may be modeled by generating noise with a certain energy on the decoder side. For these frequency bands, there is no need to determine scaling factors or frequency coefficients; instead, parameters for a noise generator at the decoder side are inserted. Since the parameters for the noise generator require less data than scaling factors and frequency coefficients, the data rate can be reduced by replacing frequency bands with generated noise. The impact of the replacement on the quality of the decoded audio signal is kept within bounds determined by the perceptual model. For example, a frequency band that is to be replaced must not exceed a certain tonality threshold, nor may it contain a transient signal. The thresholds that determine noise substitution depend on the perceptual model. Perceptual noise substitution as a feature of AAC is described, for example, in ISO/IEC 14496.
An advanced coding method used in some perceptual codecs is the so-called perceptual noise substitution (PNS), of which a good summary can be found in Herre, Jürgen, Schulz, Donald, “Extending the MPEG-4 AAC Codec by Perceptual Noise Substitution”, AES preprint 4720.
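How a scaling factor sets the precision can be sketched with the power-law quantizer commonly used in AAC encoders. This is a minimal Python sketch under stated assumptions: the rounding offset 0.4054 is the value typically used by AAC encoders, the fixed scale-factor offset defined by the standard is ignored, and `quantize`/`dequantize` are hypothetical names.

```python
import math

ROUNDING_OFFSET = 0.4054  # typical AAC encoder rounding offset (assumption of this sketch)

def quantize(x, sf):
    # A larger scale factor sf means a larger quantizer step: coarser values, fewer bits,
    # more quantization noise.
    q = int((abs(x) * 2.0 ** (-sf / 4.0)) ** 0.75 + ROUNDING_OFFSET)
    return q if x >= 0 else -q

def dequantize(q, sf):
    # Decoder-side reconstruction: sign(q) * |q|^(4/3) * 2^(sf/4).
    return math.copysign(abs(q) ** (4.0 / 3.0) * 2.0 ** (sf / 4.0), q)
```

With x = 100.0, a scale factor of 0 gives the quantized value 32 and a scale factor of 8 gives 11, while a scale factor of 40 drives the value to 0. That last case is how a band disappears from the spectrum when the scale factor is raised too far.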
After the bit allocation block 820 in
In order to achieve the target coding requirements, for example, a given bit rate for the compressed signal, state of the art codecs are able to reduce the coding requirements by increasing the allowed amount of noise specified by the psycho-acoustic model or perceptual model. Referring to
If the coding requirement is still not met, the bit demand is reduced further by introducing additional noise into the signal. As the allowed noise is increased, the scaling factors are increased as well and the resolution of the quantized signal is decreased, which in turn decreases the bit demand. The quantization resolution can be decreased up to the point where the noise becomes greater than the signal itself, meaning that the output of the quantization block for that scaling factor will be zero. This effectively inserts a hole in the spectrum where the signal of the scale factor band should be present. This operation can be repeated iteratively until the transmission/storage demand of the coded quantized coefficients falls below the constraints imposed on the encoder. This operation always terminates successfully, even if it sets all quantized outputs to zero, cf. the flowchart in
The above-described state of the art method meets the coding requirements effectively and works quite well, provided that the constraints imposed on the codec can be achieved without eliminating too many scale factor bands in the constraint-reduction phase. However, the method can fail badly if the coding demands are set too high for the encoder.
This usually happens if the required bit rate is well below the requirements of the perceptual model. Non-optimized codecs would usually introduce a large number of holes, because too many scale factor bands are shut down in order to meet the coding constraints. Spectral holes or shut-downs are usually easily detected by listeners and have a huge impact on the degradation of the sound quality. The resulting artifacts are usually described as ringing, a swishy sound, "birdies", etc.
Optimized state of the art codecs, as they can be found, for example, in 3GPP (Third Generation Partnership Project) TS (Technical Specification) 26.403, employ more advanced strategies for coding-constraint reduction, usually called hole avoidance. This strategy imposes a maximum constraint-reduction limit on each scaling factor. This ensures that no holes are introduced as long as the coding constraints imposed on the encoder can be met by reducing the coding requirements of all scale factor bands without violating this limit. However, even with this advanced strategy, it is quite possible that the coding constraints will not be met, in which case the encoder has no other option but to start introducing spectral holes by eliminating scale factor bands.
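The iterative noise increase described above can be sketched as a simple rate loop. This Python sketch assumes the AAC-style quantizer shown earlier in this description (repeated here so the block stands alone), raises all scale factors together in each iteration as a simplification, and uses a hypothetical bit-count estimate (`estimate_bits`) instead of real Huffman coding.

```python
def quantize(x, sf):
    # AAC-style power-law quantizer with a typical rounding offset (sketch).
    q = int((abs(x) * 2.0 ** (-sf / 4.0)) ** 0.75 + 0.4054)
    return q if x >= 0 else -q

def estimate_bits(qbands):
    # Hypothetical cost model: each non-zero value costs its magnitude bits plus a sign bit.
    return sum(abs(q).bit_length() + 1 for band in qbands for q in band if q != 0)

def rate_loop(bands, scalefactors, budget):
    # Allow more noise (larger scale factors) until the bit demand fits the budget.
    # Weak bands quantize to all zeros first, i.e. spectral holes appear.
    sfs = list(scalefactors)
    while True:
        qbands = [[quantize(x, sf) for x in band] for band, sf in zip(bands, sfs)]
        if estimate_bits(qbands) <= budget:
            return sfs, qbands
        sfs = [sf + 1 for sf in sfs]
```

The loop always terminates for a non-negative budget, because sufficiently large scale factors set every quantized value to zero, which costs no bits in this model. This mirrors the observation above that the procedure succeeds even if it silences the whole spectrum.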
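The idea of hole avoidance can be sketched by capping the scale factor increase of each band and refusing any step that would quantize a band entirely to zero. This is a simplified Python sketch of the general idea only, not the procedure of 3GPP TS 26.403; `max_delta`, `estimate_bits` and the quantizer are assumptions carried over from the previous sketches.

```python
def quantize(x, sf):
    # AAC-style power-law quantizer with a typical rounding offset (sketch).
    q = int((abs(x) * 2.0 ** (-sf / 4.0)) ** 0.75 + 0.4054)
    return q if x >= 0 else -q

def estimate_bits(qbands):
    # Hypothetical cost model: magnitude bits plus one sign bit per non-zero value.
    return sum(abs(q).bit_length() + 1 for band in qbands for q in band if q != 0)

def reduce_with_hole_avoidance(bands, scalefactors, budget, max_delta):
    # Returns (scale factors, quantized bands, holes_needed).
    sfs = list(scalefactors)
    limits = [sf + max_delta for sf in sfs]  # per-band maximum constraint reduction
    while True:
        qbands = [[quantize(x, sf) for x in band] for band, sf in zip(bands, sfs)]
        if estimate_bits(qbands) <= budget:
            return sfs, qbands, False
        raised = False
        for i, band in enumerate(bands):
            if sfs[i] >= limits[i]:
                continue
            # Only take the step if the band keeps at least one non-zero value.
            if any(quantize(x, sfs[i] + 1) != 0 for x in band):
                sfs[i] += 1
                raised = True
        if not raised:
            # Every band is at its limit: the budget cannot be met without holes.
            return sfs, qbands, True
```

The third return value corresponds to the failure case named above: once it is `True`, a conventional encoder has no choice left but to eliminate scale factor bands.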
According to an embodiment, an apparatus for encoding digital audio data with a reduced bit rate may have a provider of psycho-acoustically quantized digital audio data with a bit rate higher than the reduced bit rate, and an identifier for identifying a frequency band according to a selection criterion, the selection criterion being such that the impact on the quality of the digital audio data when the data in the identified frequency band is replaced by generated noise is smaller than the impact on the quality of the digital audio data that would arise if data in a different frequency band were replaced by generated noise. The apparatus may further have a replacer for replacing data in the identified frequency band of the digital audio data by a noise synthesis parameter, the noise synthesis parameter requiring a smaller amount of data than the data in the identified frequency band, so that the digital audio data has the reduced bit rate.
According to another embodiment, a method for encoding digital audio data with a reduced bit rate may have the steps of providing psycho-acoustically quantized digital audio data with a bit rate higher than the reduced bit rate, and identifying a frequency band according to a selection criterion, the selection criterion being such that the impact on the quality of the digital audio data when the data in the identified frequency band is replaced by generated noise is smaller than the impact on the quality of the digital audio data that would arise if data in a different frequency band were replaced by generated noise. The method may further have the step of replacing data in the identified frequency band of the digital audio data by a noise synthesis parameter, the noise synthesis parameter requiring a smaller amount of data than the data in the identified frequency band, so that the digital audio data has the reduced bit rate.
According to another embodiment, a computer program may have a program code for performing the method mentioned above when the program code runs on a computer.
The present invention is based on the finding that the human auditory system is not able to distinguish between different kinds of narrow-band signals and noise signals as long as their average energy is the same or comparable. Therefore, under some circumstances where high data compression is needed, the quality of digital audio data can be preserved more effectively if noise generators are used instead of shutting down frequency bands completely. This effectively means that it is sufficient to generate noise at the decoder stage, without transmitting the quantized spectral coefficients of a scale factor band that is found to be noise-like. The only information that needs to be transmitted is the average energy value or a noise generator parameter, for example a noise synthesis parameter, of the scale factor band, which some codecs, such as MPEG-4 AAC, transmit instead of scaling factor values for such bands if the perceptual model indicates that this is suitable. However, if higher compression rates are required, these codecs shut down frequency bands, even where further introduction of generated noise would yield a better quality of the digital audio data.
Embodiments of the present invention will be described using the Figs. attached in which:
An embodiment of an apparatus 100 for encoding digital audio data with reduced bit rate is depicted in
A further embodiment of the apparatus 100 for digital audio data is depicted in
In another embodiment of the present invention, the provider 110 acquires already-encoded data, for example an MP3 file or AAC encoded data, and then uses a decoder to remove the entropy coding. Once the entropy coding is removed, psycho-acoustically quantized data, which may already contain noise-replaced frequency bands, is available to be passed on by the provider 110 to the identifier 120. It is then the task of the identifier 120 to identify the frequency bands and to pass the psycho-acoustically quantized data on to the replacer 130, where the corresponding frequency bands are replaced.
In another embodiment, the apparatus 100 is required to reduce the bit rate of digital audio data to a certain target bit rate. An embodiment for this inventive apparatus 100 is depicted in
A flowchart of the iteration carried out to achieve the target bit rate is depicted in
In one embodiment, a post-analyzer can operate at the identifier 120 in order to analyze the data according to a selection criterion. The post-analyzer operates similarly to the pre-analyzer described for one embodiment of the inventive provider 110. Again, analysis-by-synthesis can be carried out by the post-analyzer.
The criterion for noise substitution during the encoding process, as indicated in
Measures that can be evaluated by the pre-analyzer and the post-analyzer and used as a pre-selection criterion or selection criterion are, for example, the lowest tonality, the lowest or highest signal-to-noise ratio, the lowest or highest signal-to-mask ratio (i.e. taking into account the properties of the human auditory system), the lowest energy in a frequency band, the highest center frequency of a frequency band, or the best stability in the time domain, i.e. the lowest variability over a time period.
In another embodiment, the replacer 130 is adapted to replace several consecutive frequency bands together by a single noise synthesis parameter; by replacing the data of several frequency bands at once, a higher bit rate reduction of the digital audio data is achieved.
While state of the art codecs use perceptual noise substitution to replace scale factor bands judged to be noise-like before the actual quantization and coding step, embodiments of the present invention use noise substitution to reduce the bit rate. There are more useful cases for perceptual noise substitution than merely replacing scale factor bands found to be noise-like by the perceptual model, as is currently done in the state of the art. In embodiments of the present invention, perceptual noise substitution is employed as part of a constraint-reduction apparatus or bit rate reduction apparatus in a more advanced constraint-reduction method.
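Two of these measures need more than a single number per band. A minimal Python sketch follows, assuming that tonality is approximated by the spectral flatness measure (geometric mean over arithmetic mean of the power values), a common proxy that the text above does not prescribe, and that time-domain stability is measured as the variance of a band's energy over consecutive frames. Energy and center frequency are direct band properties and need no helper. The function names are hypothetical.

```python
import math

def spectral_flatness(coeffs, eps=1e-12):
    # Geometric mean / arithmetic mean of the band's power values.
    # Close to 1 for flat, noise-like bands (low tonality); close to 0 for tonal bands.
    power = [c * c + eps for c in coeffs]  # eps avoids log(0) for zero coefficients
    geometric = math.exp(sum(math.log(p) for p in power) / len(power))
    arithmetic = sum(power) / len(power)
    return geometric / arithmetic

def temporal_variability(energies):
    # Population variance of one band's energy over consecutive frames;
    # the lowest value corresponds to the best stability in the time domain.
    mean = sum(energies) / len(energies)
    return sum((e - mean) ** 2 for e in energies) / len(energies)
```

A band whose coefficients all have equal magnitude gets a flatness of about 1, while a band dominated by a single peak gets a value near 0 and would therefore be a poor candidate for noise substitution.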
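Merging consecutive replaced bands can be sketched by grouping the selected band indices into runs and emitting one noise synthesis parameter per run. In this Python sketch, the parameter is assumed to be the total energy of the run, stored together with the first and last band index; this representation is hypothetical, and how a decoder would distribute the energy over the merged bands is left open.

```python
def consecutive_runs(indices):
    # Group band indices into runs of consecutive values, e.g. [2, 3, 4, 5], [9].
    runs = []
    for i in sorted(indices):
        if runs and i == runs[-1][-1] + 1:
            runs[-1].append(i)
        else:
            runs.append([i])
    return runs

def merged_noise_parameters(band_energies, indices):
    # One (first_band, last_band, total_energy) tuple per run of replaced bands.
    return [(run[0], run[-1], sum(band_energies[i] for i in run))
            for run in consecutive_runs(indices)]
```

Replacing bands 1, 2 and 4 thus produces two parameters instead of three, which is where the additional bit rate reduction comes from.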
A full flow chart of the state of the art encoding process extended by an embodiment of the inventive method is shown in
This state of the art procedure is extended by an embodiment of an inventive method within the box 755 in
As can be seen from
This selection can be made by various means, for example by selecting, according to one or several of the following metrics, a scale factor band with the lowest tonality, a scale factor band with the lowest or highest signal-to-noise ratio, a scale factor band with the lowest or highest signal-to-mask ratio, a scale factor band with the lowest energy, a scale factor band with the highest center frequency, a scale factor band with the best stability in the time domain, or any grouping of frequency coefficients fulfilling one or more of the metrics just mentioned.
It is noted that these means are merely illustrative; other means known to a person skilled in the art are also within the scope and spirit of this invention.
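Selecting a band by one or several of these metrics can be sketched as a lexicographic minimum over per-band sort keys. In this Python sketch, `BandInfo`, the criterion names and the use of spectral flatness for tonality are assumptions for illustration; bands already substituted are passed in `excluded`, and later criteria only break ties of earlier ones.

```python
from dataclasses import dataclass

@dataclass
class BandInfo:
    energy: float     # band energy
    center_hz: float  # center frequency of the band
    flatness: float   # spectral flatness; 1.0 = most noise-like (lowest tonality)

# Each criterion maps a band to a sort key; a smaller key means a better candidate.
CRITERIA = {
    "lowest_energy": lambda b: b.energy,
    "highest_center_frequency": lambda b: -b.center_hz,
    "lowest_tonality": lambda b: -b.flatness,
}

def select_band(bands, criteria, excluded=frozenset()):
    # Returns the index of the best candidate band, or None if none is left.
    candidates = [i for i in range(len(bands)) if i not in excluded]
    if not candidates:
        return None
    return min(candidates, key=lambda i: tuple(CRITERIA[c](bands[i]) for c in criteria))
```

For example, `select_band(bands, ["lowest_tonality", "highest_center_frequency"])` prefers the most noise-like band and, among equally noise-like bands, the one at the highest frequency.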
After the selection has been carried out, the selected scale factor bands or other groupings of frequency coefficients are coded, for example, with the perceptual noise substitution tool. This means that embodiments of the present invention remove the spectral content of the band from the digital audio data and, instead of the band's scaling factor, transmit, for example, its approximate average energy together with an appropriate flag telling the decoder to reconstruct the band with artificially generated noise of approximately the energy transmitted in the bit stream.
In another embodiment of the present invention, following the perceptual noise substitution coding, the bit demand of the replaced spectral coefficients is removed from the bit demand of the quantized spectrum, and the total bit demand is compared with the encoder constraints. If the constraints are still not met, the procedure continues until the constraints are either met or all bands are coded with perceptual noise substitution. The lowest constraint that can be reached in this way is therefore the one at which the perceptual noise substitution energy factors can still be transmitted for all bands. If it is desirable to go beyond this limit, the perceptual noise substitution scale factors themselves can be removed in order to meet even very stringent coding constraints. This can be achieved by iteratively removing the most suitable perceptual noise substitution factors, where methods for evaluating such factors are known to a person skilled in the art, for example selecting the lowest-energy scale factor or the highest-frequency scale factor. The bit demand is then re-evaluated, and the process is repeated until the constraints are satisfied or all factors are set to zero.
Embodiments of the present invention provide the advantage that the introduction of spectral holes is effectively prevented: the artifacts associated with spectral band shut-downs or spectral holes in a modern perceptual audio codec are avoided, yielding a better quality of the digital audio data as perceived by the human auditory system.
One embodiment of the present invention is an audio coding apparatus based on frequency-domain perceptual audio coding with a perceptual model, a time-to-frequency mapping, quantization and an entropy coding block. Furthermore, coding can be based on grouping a plurality of frequency domain spectral coefficients into scale factor bands and quantizing them with irrelevancy reduction. In another embodiment, the plurality of frequency domain spectral coefficients can be grouped in proportion to the critical bands of the human auditory system and quantized with irrelevancy reduction. Another embodiment of the present invention comprises the transmission of said coefficients in a coded bit stream.
Moreover, an embodiment can substitute a scale factor band with artificially generated narrow-band noise in the decoder, without the need to transmit the spectral content of said scale factor band, where the methods for evaluating the coding constraints can be based on just noticeable distortion measures calculated by a perceptual model and on the values of the spectral coefficients. Embodiments of the present invention reduce the coding requirements in order to meet the coding constraints by substituting scale factor bands using one of the methods described above. For example, a suitable scale factor band can be selected for reduction of the coding requirements by determining the scale factor band with the greatest similarity to white noise, the scale factor band with the highest center frequency, the scale factor band with the lowest energy, the scale factor band with the highest signal-to-noise ratio, the scale factor band with the lowest signal-to-noise ratio, the scale factor band with the highest ratio of signal energy to just noticeable distortion energy, or the scale factor band with the lowest ratio of signal energy to just noticeable distortion energy.
Depending on certain implementation requirements, the inventive methods can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, in particular a disc, DVD or CD, having electronically readable control signals stored thereon, which cooperate with a programmable computer system such that the inventive methods are performed. Generally, the present invention is therefore a computer program product with a program code stored on a machine-readable carrier, the program code being operative for performing the inventive methods when the computer program product runs on a computer. In other words, the inventive methods are therefore a computer program having a program code for performing at least one of the inventive methods when the computer program runs on a computer.
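The encoder and decoder sides of this substitution can be sketched in a few lines of Python. Assumptions of this sketch: the transmitted parameter is the band's total energy without quantization (a real bit stream would quantize it), the band width is stored in the parameter only to keep the example self-contained (in a codec it follows from the band layout), and Gaussian noise is rescaled to exactly that energy. The names `pns_encode` and `pns_decode` are hypothetical.

```python
import math
import random

def pns_encode(coeffs):
    # Drop the spectral content; keep only a noise flag and the band energy.
    return {"noise_flag": True, "energy": sum(c * c for c in coeffs), "width": len(coeffs)}

def pns_decode(param, rng):
    # Generate random noise and scale it to the transmitted band energy.
    noise = [rng.gauss(0.0, 1.0) for _ in range(param["width"])]
    noise_energy = sum(n * n for n in noise)
    scale = math.sqrt(param["energy"] / noise_energy) if noise_energy > 0 else 0.0
    return [scale * n for n in noise]
```

The decoded band has a different waveform than the original but the same energy, which, according to the finding underlying the invention, is what matters perceptually for noise-like bands.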
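The two phases of this procedure can be sketched as a loop over band states. This Python sketch assumes a fixed bit cost per coded band (`band_bits`), a fixed cost per noise substitution parameter (`pns_cost`, assumed smaller than the cost of a coded band), and uses the lowest band energy as the selection criterion in both phases; any of the criteria listed above could be used instead.

```python
def constrain_with_pns(band_bits, band_energies, budget, pns_cost):
    # Each band is "coded" (quantized spectrum), "pns" (noise parameter only) or "off".
    state = ["coded"] * len(band_bits)

    def demand():
        return sum(band_bits[i] if s == "coded" else pns_cost if s == "pns" else 0
                   for i, s in enumerate(state))

    while demand() > budget:
        coded = [i for i, s in enumerate(state) if s == "coded"]
        if coded:
            # Phase 1: replace the lowest-energy coded band by noise substitution.
            state[min(coded, key=lambda j: band_energies[j])] = "pns"
            continue
        pns = [i for i, s in enumerate(state) if s == "pns"]
        if not pns:
            break  # every band is already set to zero
        # Phase 2: all bands use noise substitution; drop the least important parameter.
        state[min(pns, key=lambda j: band_energies[j])] = "off"
    return state
```

With `band_bits=[40, 30, 20, 10]`, `band_energies=[5.0, 1.0, 3.0, 2.0]` and `pns_cost=4`, a budget of 80 replaces only band 1, a budget of 30 replaces all four bands, and only a budget of 10 forces the removal of noise parameters (bands 1 and 3), i.e. real spectral holes appear only after noise substitution is exhausted.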
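Grouping spectral coefficients into scale factor bands can be sketched with a table of band start offsets. In this Python sketch the offsets are supplied by the caller (real codecs define fixed tables per sampling rate and transform length), and the critical-band rate is approximated with the Zwicker and Terhardt formula, which is one common choice and not something prescribed by the text above. Bands spanning equal widths on this Bark scale become wider towards high frequencies.

```python
import math

def hz_to_bark(f_hz):
    # Zwicker & Terhardt approximation of the critical-band rate in Bark.
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

def group_into_bands(coeffs, band_offsets):
    # band_offsets[b] is the first coefficient index of band b;
    # the last entry closes the final band.
    return [coeffs[band_offsets[b]:band_offsets[b + 1]]
            for b in range(len(band_offsets) - 1)]
```

For example, `group_into_bands(list(range(10)), [0, 2, 5, 10])` yields bands of 2, 3 and 5 coefficients, i.e. band widths increasing with frequency as in a critical-band layout.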
While this invention has been described in terms of several preferred embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations, and equivalents as fall within the true spirit and scope of the present invention.
This application is a continuation of copending International Application No. PCT/EP2006/009601, filed Oct. 4, 2006, which designated the United States, and is incorporated herein by reference in its entirety. In addition, this application claims priority from U.S. Provisional Application No. 60/745,499, filed Apr. 24, 2006, and is incorporated herein by reference in its entirety.
Number | Date | Country
---|---|---
60745499 | Apr 2006 | US

Relation | Number | Date | Country
---|---|---|---
Parent | PCT/EP06/09601 | Oct 2006 | US
Child | 11739562 | Apr 2007 | US