Low-bit-rate digital speech coders based on the code-excited linear prediction (CELP) coding principle generally suffer from signal sparseness artifacts when the bit-rate falls below about 0.5 to 1 bit per sample, leading to a somewhat artificial, metallic sound. Especially when the input speech has environmental noise in the background, the low-rate artifacts are clearly audible: the background noise will be attenuated during active speech sections. The present invention describes a noise insertion scheme for (A)CELP coders such as AMR-WB [1] and G.718 [4, 7] which, analogous to the noise filling techniques used in transform based coders such as xHE-AAC [5, 6], adds the output of a random noise generator to the decoded speech signal to reconstruct the background noise.
The International publication WO 2012/110476 A1 shows an encoding concept which is linear prediction based and uses spectral domain noise shaping. A spectral decomposition of an audio input signal into a spectrogram comprising a sequence of spectra is used for both linear prediction coefficient computation as well as the input for frequency-domain shaping based on the linear prediction coefficients. According to the cited document an audio encoder comprises a linear prediction analyzer for analyzing an input audio signal so as to derive linear prediction coefficients therefrom. A frequency-domain shaper of an audio encoder is configured to spectrally shape a current spectrum of the sequence of spectra of the spectrogram based on the linear prediction coefficients provided by linear prediction analyzer. A quantized and spectrally shaped spectrum is inserted into a data stream along with information on the linear prediction coefficients used in spectral shaping so that, at the decoding side, the de-shaping and de-quantization may be performed. A temporal noise shaping module can also be present to perform a temporal noise shaping.
In view of conventional technology there remains a demand for an improved audio decoder, an improved method, an improved computer program for performing such a method and an improved audio signal or a storage medium having stored such an audio signal, the audio signal having been treated with such a method. More specifically, it is desirable to find solutions improving the sound quality of the audio information transferred in the encoded bitstream.
The reference signs in the claims and in the detailed description of embodiments of the invention were added to merely improve readability and are in no way meant to be limiting.
According to an embodiment, an audio decoder for providing a decoded audio information on the basis of an encoded audio information including linear prediction coefficients (LPC) may have:
According to another embodiment, an audio decoder for providing a decoded audio information on the basis of an encoded audio information including linear prediction coefficients (LPC), may have:
wherein the audio decoder is adapted to decode an excitation signal of the current frame and to compute its root mean square erms;
wherein the audio decoder is adapted to compute a peak level p of a transfer function of an LPC filter of the current frame;
wherein the audio decoder is adapted to compute a spectral minimum mf of the current audio frame by computing the quotient of the root mean square erms and the peak level p to acquire the noise level information;
wherein the noise level estimator is adapted to estimate the noise level on the basis of two or more quotients of different audio frames;
wherein the audio decoder includes a decoder core configured to decode an audio information of the current frame using linear prediction coefficients of the current frame to acquire a decoded core coder output signal and wherein the noise inserter adds the noise depending on linear prediction coefficients used in decoding the audio information of the current frame and used in decoding the audio information of one or more previous frames.
According to another embodiment, a method for providing a decoded audio information on the basis of an encoded audio information including linear prediction coefficients (LPC) may have the steps of:
Another embodiment may have a computer program for performing a method according to claim 16, wherein the computer program runs on a computer.
According to another embodiment, a method for providing a decoded audio information on the basis of an encoded audio information including linear prediction coefficients (LPC) may have the steps of:
Another embodiment may have a computer program for performing a method according to claim 18, wherein the computer program runs on a computer.
The suggested solutions avoid having to provide side information in the CELP bitstream in order to adjust the noise provided on the decoder side during a noise filling process. This means that the amount of data to be transported with the bitstream may be reduced while the quality of the inserted noise can be increased merely on the basis of linear prediction coefficients of the currently or previously decoded frames. In other words, side information concerning the noise, which would increase the amount of data to be transferred with the bitstream, may be omitted. The invention makes it possible to provide a low-bit-rate digital coder and a method which may consume less bandwidth for the bitstream and provide an improved quality of the background noise in comparison to conventional-technology solutions.
It is advantageous that the audio decoder comprises a frame type determinator for determining a frame type of the current frame, the frame type determinator being configured to activate the tilt adjuster to adjust the tilt of the noise when the frame type of the current frame is detected to be of a speech type. In some embodiments, the frame type determinator is configured to recognize a frame as being a speech type frame when the frame is ACELP or CELP coded. Shaping the noise according to the tilt of the current frame may provide a more natural background noise and may reduce unwanted effects of audio compression with regard to the background noise of the wanted signal encoded in the bitstream. As those unwanted compression effects and artifacts often become noticeable with respect to background noise of speech information, it can be advantageous to enhance the quality of the noise to be added to such speech type frames by adjusting the tilt of the noise before adding the noise to the current frame. Accordingly, the noise inserter may be configured to add the noise to the current frame only if the current frame is a speech frame, since it may reduce the workload on the decoder side if only speech frames are treated by noise filling. In an advantageous embodiment of the invention, the tilt adjuster is configured to use a result of a first-order analysis of the linear prediction coefficients of the current frame to obtain the tilt information. By using such a first-order analysis of the linear prediction coefficients it becomes possible to omit side information for characterizing the noise in the bitstream. Moreover, the adjustment of the noise to be added can be based on the linear prediction coefficients of the current frame which have to be transferred with the bitstream anyway to allow a decoding of the audio information of the current frame. 
This means that the linear prediction coefficients of the current frame are advantageously re-used in the process of adjusting the tilt of the noise. Furthermore, a first-order analysis is reasonably simple so that the computational complexity of the audio decoder does not increase significantly.
In some embodiments of the invention, the tilt adjuster is configured to obtain the tilt information using a calculation of a gain g of the linear prediction coefficients of the current frame as the first-order analysis. More advantageously, the gain g is given by the formula g=Σ[a_k·a_{k+1}]/Σ[a_k·a_k], wherein a_k are LPC coefficients of the current frame. In some embodiments, two or more LPC coefficients a_k are used in the calculation. Advantageously, a total of 16 LPC coefficients are used, so that k=0 . . . 15. In embodiments of the invention, the bitstream may be coded with more or fewer than 16 LPC coefficients. As the linear prediction coefficients of the current frame are readily present in the bitstream, the tilt information can be obtained without making use of side information, thus reducing the amount of data to be transferred in the bitstream. The noise to be added may be adjusted merely by using linear prediction coefficients which may be used for decoding the encoded audio information.
Advantageously, the tilt adjuster is configured to obtain the tilt information using a calculation of a transfer function of the direct form filter x(n)−g·x(n−1) for the current frame. This type of calculation is reasonably simple and does not need high computing power on the decoder side. The gain g may be calculated easily from the LPC coefficients of the current frame, as shown above. This makes it possible to improve noise quality for low-bit-rate digital coders while using purely bitstream data essential for decoding the encoded audio information.
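The tilt computation and the direct form filter described above can be sketched as follows. This is a minimal illustration and not the decoder's actual implementation: the function names, the example coefficients, the assumption that the numerator sum runs over all pairs of adjacent coefficients, and the zero initial filter state are all choices made here for illustration only.

```python
import random


def tilt_gain(a):
    """First-order analysis of the LPC coefficients.

    Computes g = sum(a[k] * a[k+1]) / sum(a[k] * a[k]).
    Assumption: the numerator runs over all adjacent coefficient pairs.
    """
    num = sum(a[k] * a[k + 1] for k in range(len(a) - 1))
    den = sum(c * c for c in a)
    return num / den if den > 0.0 else 0.0


def apply_tilt(x, g):
    """Direct form filter y(n) = x(n) - g * x(n-1).

    Assumption: the sample before the frame start, x(-1), is taken as 0.
    """
    y = []
    prev = 0.0
    for s in x:
        y.append(s - g * prev)
        prev = s
    return y


# Hypothetical LPC coefficients, not taken from any real frame.
lpc = [1.0, -0.9, 0.2]
g = tilt_gain(lpc)

# White noise from a seeded generator, then shaped with the tilt.
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(8)]
shaped = apply_tilt(noise, g)
```

Because g is derived only from coefficients that are already decoded for the frame, the sketch needs no input beyond the LPC coefficients and the generated noise.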
In an advantageous embodiment of the invention, the noise inserter is configured to apply the tilt information of the current frame to the noise in order to adjust the tilt of the noise before adding the noise to the current frame. If the noise inserter is configured accordingly, a simplified audio decoder may be provided. By first applying the tilt information and then adding the adjusted noise to the current frame, a simple and effective method of an audio decoder may be provided.
In an embodiment of the invention, the audio decoder furthermore comprises a noise level estimator configured to estimate a noise level for a current frame using a linear prediction coefficient of at least one previous frame to obtain a noise level information, and a noise inserter configured to add a noise to the current frame in dependence on the noise level information provided by the noise level estimator. By this, the quality of the background noise and thus the quality of the whole audio transmission may be enhanced as the noise to be added to the current frame can be adjusted according to the noise level which is probably present in the current frame. For example, if a high noise level is expected in the current frame because a high noise level was estimated from previous frames, the noise inserter may be configured to increase the level of the noise to be added to the current frame before adding it to the current frame. Thus, the noise to be added can be adjusted to be neither too quiet nor too loud in comparison with the expected noise level in the current frame. This adjustment, again, is not based on dedicated side information in the bitstream but merely uses information of useful data transferred in the bitstream, in this case a linear prediction coefficient of at least one previous frame which also provides information about a noise level in a previous frame. Thus, it is advantageous that the noise to be added to the current frame is shaped using the g-derived tilt and scaled in view of a noise level estimate. Most advantageously, the tilt and the noise level of the noise to be added to the current frame are adjusted when the current frame is of a speech type. In some embodiments, the tilt and/or the noise level to be added to the current frame are adjusted also when the current frame is of a general audio type, for example a TCX or a DTX type.
Advantageously, the audio decoder comprises a frame type determinator for determining a frame type of the current frame, the frame type determinator being configured to identify whether the frame type of the current frame is speech or general audio, so that the noise level estimation can be performed depending on the frame type of the current frame. For example, the frame type determinator can be configured to detect whether the current frame is a CELP or ACELP frame, which is a type of speech frame, or a TCX/MDCT or DTX frame, which are types of general audio frames. Since those coding formats follow different principles, it is desirable to determine the frame type before performing the noise level estimation so that suitable calculations can be chosen, depending on the frame type.
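A frame type determinator of this kind can be sketched as a simple classification of the coding mode. The enumeration, its member names and the helper function below are hypothetical and only illustrate the grouping given in the text (CELP and ACELP as speech types; TCX/MDCT and DTX as general audio types).

```python
from enum import Enum


class FrameType(Enum):
    """Hypothetical coding modes, grouped as described in the text."""
    CELP = "celp"
    ACELP = "acelp"
    TCX = "tcx"
    DTX = "dtx"


# Assumption: speech-type frames are exactly the CELP/ACELP-coded ones.
SPEECH_TYPES = {FrameType.CELP, FrameType.ACELP}


def is_speech_frame(frame_type):
    """Return True for speech-type frames, False for general audio frames."""
    return frame_type in SPEECH_TYPES


def choose_excitation_domain(frame_type):
    """Pick the domain from which the excitation RMS would be computed."""
    return "time" if is_speech_frame(frame_type) else "spectral"
```

The return value of `choose_excitation_domain` mirrors the distinction drawn in the text between a time domain computation for speech frames and a spectral domain computation for general audio frames.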
In some embodiments of the invention the audio decoder is adapted to compute a first information representing a spectrally unshaped excitation of the current frame and to compute a second information regarding spectral scaling of the current frame to compute a quotient of the first information and the second information to obtain the noise level information. By this, the noise level information may be obtained without making use of any side information. Thus, the bit rate of the coder may be kept low.
Advantageously, the audio decoder is adapted to decode an excitation signal of the current frame and to compute its root mean square erms from the time domain representation of the current frame as the first information to obtain the noise level information under the condition that the current frame is of a speech type. It is advantageous for this embodiment that the audio decoder is adapted to perform accordingly if the current frame is of a CELP or ACELP type. The spectrally flattened excitation signal (in perceptual domain) is decoded from the bitstream and used to update a noise level estimate. The root mean square erms of the excitation signal for the current frame is computed after the bitstream is read. This type of computation may not need high computing power and thus may even be performed by audio decoders with low computing power.
In an advantageous embodiment the audio decoder is adapted to compute a peak level p of a transfer function of an LPC filter of the current frame as a second information, thus using a linear prediction coefficient to obtain the noise level information under the condition that the current frame is of a speech type. Again, it is advantageous that the current frame is of the CELP or ACELP type. Computing the peak level p is rather inexpensive, and by re-using linear prediction coefficients of the current frame, which are also used to decode the audio information contained in that frame, side information may be omitted and still background noise may be enhanced without increasing the data rate of the bitstream.
In an advantageous embodiment of the invention, the audio decoder is adapted to compute a spectral minimum mf of the current audio frame by computing the quotient of the root mean square erms and the peak level p to obtain the noise level information under the condition that the current frame is of the speech type. This computation is rather simple and may provide a numerical value that can be useful in estimating the noise level over a range of multiple audio frames. Thus, the spectral minimum mf of a series of current audio frames may be used to estimate the noise level during the time period covered by that series of audio frames. This may make it possible to obtain a good estimate of the noise level of a current frame while keeping the complexity reasonably low. The peak level p is advantageously calculated using the formula p=Σ|a_k|, wherein a_k are linear prediction coefficients with, advantageously, k=0 . . . 15. Thus, if the frame comprises 16 linear prediction coefficients, p is in some embodiments calculated by summing the magnitudes of the 16 coefficients a_k.
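The quotient described above can be sketched in a few lines. The function names and the example values are illustrative assumptions; only the formulas (the root mean square of the excitation, p as the sum of coefficient magnitudes, and mf as their quotient) come from the text.

```python
import math


def rms(excitation):
    """Root mean square (erms) of the decoded excitation signal."""
    return math.sqrt(sum(s * s for s in excitation) / len(excitation))


def peak_level(a):
    """Peak level p of the LPC filter transfer function, p = sum(|a_k|)."""
    return sum(abs(c) for c in a)


def spectral_minimum(excitation, a):
    """Spectral minimum mf = erms / p for one frame.

    Assumption: a frame with p == 0 (no usable coefficients) yields 0.0.
    """
    p = peak_level(a)
    return rms(excitation) / p if p > 0.0 else 0.0


# Hypothetical excitation and LPC coefficients for a single frame.
m_f = spectral_minimum([1.0, -1.0, 1.0, -1.0], [1.0, -0.5, 0.5])
```

In this example the excitation has erms = 1 and the coefficients give p = 2, so the resulting spectral minimum is 0.5.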
Advantageously the audio decoder is adapted to decode an unshaped MDCT-excitation of the current frame and to compute its root mean square erms from the spectral domain representation of the current frame to obtain the noise level information as the first information if the current frame is of a general audio type. This is the advantageous embodiment of the invention whenever the current frame is not a speech frame but a general audio frame. A spectral domain representation in MDCT or DTX frames is largely equivalent to the time domain representation in speech frames, for example CELP or (A)CELP frames. A difference is that the MDCT does not obey Parseval's theorem. Thus, advantageously the root mean square erms for a general audio frame is computed in a similar manner as the root mean square erms for speech frames. It is then advantageous to calculate the LPC coefficient equivalents of the general audio frame as laid out in WO 2012/110476 A1, for example using an MDCT power spectrum which refers to the square of MDCT values on a Bark scale. In an alternative embodiment, the frequency bands of the MDCT power spectrum can have a constant width so that the scale of the spectrum corresponds to a linear scale. With such a linear scale the calculated LPC coefficient equivalents are similar to the LPC coefficients in the time domain representation of the same frame, as, for example, calculated for an ACELP or CELP frame. Furthermore, it is advantageous that, if the current frame is of a general audio type, the peak level p of the transfer function of an LPC filter of the current frame, calculated from the MDCT frame as laid out in WO 2012/110476 A1, is computed as a second information, thus using a linear prediction coefficient to obtain the noise level information under the condition that the current frame is of a general audio type.
Then, if the current frame is of a general audio type, it is advantageous to compute the spectral minimum of the current audio frame by computing the quotient of the root mean square erms and the peak level p to obtain the noise level information under the condition that the current frame is of a general audio type. Thus, a quotient describing the spectral minimum mf of a current audio frame can be obtained regardless of whether the current frame is of a speech type or of a general audio type.
In an advantageous embodiment, the audio decoder is adapted to enqueue the quotient obtained from the current audio frame in the noise level estimator regardless of the frame type, the noise level estimator comprising a noise level storage for two or more quotients obtained from different audio frames. This can be advantageous if the audio decoder is adapted to switch between decoding of speech frames and decoding of general audio frames, for example when applying a low-delay unified speech and audio decoding (LD-USAC, EVS). By this, an average noise level over multiple frames may be obtained, disregarding the frame type. Advantageously a noise level storage can hold ten or more quotients obtained from ten or more previous audio frames. For example, the noise level storage may contain room for the quotients of 30 frames. Thus, the noise level may be calculated for an extended time preceding the current frame. In some embodiments, the quotient may only be enqueued in the noise level estimator when the current frame is detected to be of a speech type. In other embodiments, the quotient may only be enqueued in the noise level estimator when the current frame is detected to be of a general audio type.
It is advantageous that the noise level estimator is adapted to estimate the noise level on the basis of statistical analysis of two or more quotients of different audio frames. In an embodiment of the invention, the audio decoder is adapted to use a minimum mean squared error based noise power spectral density tracking to statistically analyse the quotients. This tracking is described in the publication of Hendriks, Heusdens and Jensen [2]. If the method according to [2] shall be applied, the audio decoder is adapted to use a square root of a track value in the statistical analysis, as in the present case the amplitude spectrum is searched directly. In another embodiment of the invention, minimum statistics as known from [3] are used to analyze the two or more quotients of different audio frames.
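The noise level storage can be sketched as a bounded queue of per-frame quotients. Note that the estimate below is deliberately simplified: it takes the plain minimum over the stored quotients, which is neither the minimum mean squared error based tracking of [2] nor the minimum statistics method of [3]. The class name, method names and the default capacity of 30 frames (the example figure mentioned in the text) are assumptions made for illustration.

```python
from collections import deque


class NoiseLevelEstimator:
    """Holds the most recent spectral-minimum quotients mf.

    Simplification: the estimate is the smallest stored quotient, used
    here only as a stand-in for the statistical analysis in the text.
    """

    def __init__(self, capacity=30):
        # Oldest quotients are dropped automatically once capacity is hit.
        self.quotients = deque(maxlen=capacity)

    def enqueue(self, m_f):
        """Store the quotient of the current frame, whatever its type."""
        self.quotients.append(m_f)

    def estimate(self):
        """Return the current noise level estimate (0.0 if empty)."""
        return min(self.quotients) if self.quotients else 0.0


# Hypothetical quotients from three consecutive frames.
estimator = NoiseLevelEstimator()
for m_f in (0.5, 0.2, 0.9):
    estimator.enqueue(m_f)
level = estimator.estimate()
```

A bounded queue keeps memory use fixed and lets the estimate follow slow changes in the background noise, since quotients older than the window no longer influence the result.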
In an advantageous embodiment, the audio decoder comprises a decoder core configured to decode an audio information of the current frame using a linear prediction coefficient of the current frame to obtain a decoded core coder output signal and the noise inserter adds the noise depending on a linear prediction coefficient used in decoding the audio information of the current frame and/or used when decoding the audio information of one or more previous frames. Thus, the noise inserter makes use of the same linear prediction coefficients that are used for decoding the audio information of the current frame. Side information in order to instruct the noise inserter may be omitted.
Advantageously, the audio decoder comprises a de-emphasis filter to de-emphasize the current frame, the audio decoder being adapted to apply the de-emphasis filter on the current frame after the noise inserter added the noise to the current frame. Since the de-emphasis is a first order IIR boosting low frequencies, this allows for low-complexity, steep IIR high-pass filtering of the added noise avoiding audible noise artifacts at low frequencies.
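The ordering described above, noise insertion first and de-emphasis last, can be sketched as follows. The first-order recursion y(n) = x(n) + α·y(n−1) is one common form of a de-emphasis filter; the coefficient value ALPHA = 0.68, the function names and the example signals are placeholders assumed for this sketch and are not taken from the text.

```python
ALPHA = 0.68  # Placeholder de-emphasis coefficient (assumption).


def insert_noise(decoded, noise, level):
    """Add level-scaled noise to the decoded core coder output."""
    return [s + level * n for s, n in zip(decoded, noise)]


def de_emphasis(x, alpha=ALPHA):
    """First-order IIR y(n) = x(n) + alpha * y(n-1), boosting low frequencies.

    Assumption: the filter state before the frame start is 0.
    """
    y = []
    prev = 0.0
for_placeholder = None
```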
Advantageously, the audio decoder comprises a noise generator, the noise generator being adapted to generate the noise to be added to the current frame by the noise inserter. Having a noise generator included in the audio decoder can provide a more convenient audio decoder as no external noise generator is necessary. In the alternative, the noise may be supplied by an external noise generator, which may be connected to the audio decoder via an interface. For example, special types of noise generators may be applied, depending on the background noise which is to be enhanced in the current frame.
Advantageously, the noise generator is configured to generate a random white noise. Such a noise resembles common background noises adequately and such a noise generator may be provided easily.
In an advantageous embodiment of the invention, the noise inserter is configured to add the noise to the current frame under the condition that the bit rate of the encoded audio information is smaller than 1 bit per sample. Advantageously the bit rate of the encoded audio information is smaller than 0.8 bit per sample. It is even more advantageous that the noise inserter is configured to add the noise to the current frame under the condition that the bit rate of the encoded audio information is smaller than 0.5 bit per sample.
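The bit-rate condition can be expressed as a simple check of bits per sample, that is, the bit rate divided by the sampling rate. The function names and the example rates below are hypothetical; the thresholds of 1, 0.8 and 0.5 bit per sample are the ones named in the text.

```python
def bits_per_sample(bit_rate_bps, sample_rate_hz):
    """Average number of coded bits available per audio sample."""
    return bit_rate_bps / sample_rate_hz


def noise_insertion_enabled(bit_rate_bps, sample_rate_hz, threshold=1.0):
    """Enable noise insertion only below the given bits-per-sample threshold."""
    return bits_per_sample(bit_rate_bps, sample_rate_hz) < threshold


# Hypothetical operating point: 8000 bit/s at a 12800 Hz sampling rate
# gives 0.625 bit per sample.
enabled_default = noise_insertion_enabled(8000, 12800)
enabled_strict = noise_insertion_enabled(8000, 12800, threshold=0.5)
```

With these example values the condition holds for the 1 bit per sample threshold but not for the stricter 0.5 bit per sample threshold.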
In an advantageous embodiment, the audio decoder is configured to use a coder based on one or more of the coders AMR-WB, G.718 or LD-USAC (EVS) in order to decode the coded audio information. Those are well-known and wide spread (A)CELP coders in which the additional use of such a noise filling method may be highly advantageous.
Embodiments of the present invention will be detailed subsequently referring to the appended drawings, in which:
The invention is described in detail with regards to the
In addition, the audio decoder according to
In other words, according to
MDCT or DTX frame. Regardless of the frame type, the spectrally flattened excitation signal (in perceptual domain) is decoded and used to update the noise level estimate as described below in detail. Then the signal is fully reconstructed up to the de-emphasis, which is the last step.
2. If the frame is ACELP-coded, the tilt (overall spectral shape) for the noise insertion is computed by first-order LPC analysis of the LPC filter coefficients. The tilt is derived from the gain g of the 16 LPC coefficients a_k, which is given by g=Σ[a_k·a_{k+1}]/Σ[a_k·a_k].
The noise level estimation in step 1 is performed by computing the root mean square erms of the excitation signal for the current frame (or in case of an MDCT-domain excitation the time domain equivalent, meaning the erms which would be computed for that frame if it were an ACELP frame) and by then dividing it by the peak level p of the transfer function of the LPC analysis filter. This yields the level mf of the spectral minimum of frame f as in mf=erms/p.
Although some aspects have been described in the context of an audio decoder, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding audio decoder. Some or all of the method steps may be executed by (or using) a hardware apparatus, for example a microprocessor, a programmable computer or an electronic circuit. In some embodiments, one or more of the most important method steps may be executed by such an apparatus.
The inventive encoded audio signal can be stored on a digital storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.
Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. The data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitory.
A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
A further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.
In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are advantageously performed by any hardware apparatus.
The apparatus described herein may be implemented using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.
The methods described herein may be performed using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.
While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations and equivalents as fall within the true spirit and scope of the present invention.
This application is a continuation of U.S. patent application Ser. No. 16/286,445, filed Feb. 26, 2019, which is a continuation of U.S. patent application Ser. No. 14/811,778, filed Jul. 28, 2015, now U.S. Pat. No. 10,269,365, issued Apr. 23, 2019, which is a continuation of International Application No. PCT/EP2014/051649, filed Jan. 28, 2014, which claims priority from U.S. Provisional Application No. 61/758,189, filed Jan. 29, 2013, each of which is incorporated herein in its entirety by this reference thereto. Embodiments of the invention refer to an audio decoder for providing a decoded audio information on the basis of an encoded audio information comprising linear prediction coefficients (LPC), to a method for providing a decoded audio information on the basis of an encoded audio information comprising linear prediction coefficients (LPC), to a computer program for performing such a method, wherein the computer program runs on a computer, and to an audio signal or a storage medium having stored such an audio signal, the audio signal having been treated with such a method.
Number | Date | Country
---|---|---
61758189 | Jan 2013 | US

Relation | Number | Date | Country
---|---|---|---
Parent | 16286445 | Feb 2019 | US
Child | 17103609 | | US
Parent | 14811778 | Jul 2015 | US
Child | 16286445 | | US
Parent | PCT/EP2014/051649 | Jan 2014 | US
Child | 14811778 | | US