The present invention relates to audio coding and, particularly, to audio coding in the context of frequency enhancement, i.e., audio coding in which a decoder output signal has a higher number of frequency bands than the encoded signal. Such procedures comprise bandwidth extension, spectral band replication or intelligent gap filling.
Contemporary speech coding systems are capable of encoding wideband (WB) digital audio content, that is, signals with frequencies of up to 7-8 kHz, at bitrates as low as 6 kbit/s. The most widely discussed examples are the ITU-T recommendations G.722.2 [1] as well as the more recently developed G.718 [4, 10] and MPEG-D Unified Speech and Audio Coding (USAC) [8]. Both G.722.2, also known as AMR-WB, and G.718 employ bandwidth extension (BWE) techniques between 6.4 and 7 kHz to allow the underlying ACELP core-coder to “focus” on the perceptually more relevant lower frequencies (particularly the ones at which the human auditory system is phase-sensitive), and thereby achieve sufficient quality especially at very low bitrates. In the USAC eXtended High Efficiency Advanced Audio Coding (xHE-AAC) profile, enhanced spectral band replication (eSBR) is used for extending the audio bandwidth beyond the core-coder bandwidth, which is typically below 6 kHz at 16 kbit/s. Current state-of-the-art BWE processes can generally be divided into two conceptual approaches:
This subband envelope is computed by selective linear prediction, i.e., computation of the wideband power spectrum followed by an IDFT of its upper-band components and a subsequent Levinson-Durbin recursion of order 8. The resulting subband LPC coefficients are converted into the cepstral domain and are finally quantized by a vector quantizer with a codebook of size M=2^N. For a frame length of 20 ms, this results in a side information data rate of 300 bit/s. A combined estimation approach extends a calculation of a posteriori probabilities and reintroduces dependencies on the narrowband feature. Thus, an improved form of error concealment is obtained which utilizes more than one source of information for its parameter estimation.
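For illustration only, a minimal Python sketch of such a side-information chain is given below; the windowing, FFT size, crossover frequency and the codebook are placeholder assumptions and not parameters taken from the cited scheme.

```python
import numpy as np

def levinson_durbin(r, order):
    """Levinson-Durbin recursion: autocorrelation r[0..order] -> prediction-error filter a (a[0] = 1)."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a

def lpc_to_cepstrum(a, n_ceps):
    """Standard LPC-to-cepstrum recursion; the gain term c[0] is omitted."""
    p = len(a) - 1
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        c[n] = -(a[n] if n <= p else 0.0)
        for k in range(1, n):
            if n - k <= p:
                c[n] -= (k / n) * c[k] * a[n - k]
    return c[1:]

def upper_band_cepstral_params(frame, fs, f_lo=6400.0, order=8, n_fft=512):
    """Selective linear prediction of the band above f_lo: wideband power spectrum, IDFT of its
    upper-band bins, order-8 Levinson-Durbin, conversion to cepstral coefficients."""
    power = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft)) ** 2
    k_lo = int(round(f_lo / (fs / 2.0) * (len(power) - 1)))
    r = np.fft.irfft(power[k_lo:])[: order + 1]    # pseudo-autocorrelation of the upper band
    r[0] += 1e-9                                   # regularization against silent frames
    return lpc_to_cepstrum(levinson_durbin(r, order), order)

def vq_index(cepstral_params, codebook):
    """Nearest-neighbour vector quantization against a codebook of size M = 2**N."""
    return int(np.argmin(np.sum((codebook - cepstral_params) ** 2, axis=1)))
```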
A certain quality dilemma in WB codecs can be observed at low bitrates, typically below 10 kbit/s. On the one hand, such rates are already too low to justify the transmission of even moderate amounts of BWE data, ruling out typical guided BWE systems with 1 kbit/s or more of side information. On the other hand, a feasible blind BWE is found to sound significantly worse on at least some types of speech or music material, due to the inability to properly predict the parameters from the core signal. This is particularly true for some vocal sounds, such as fricatives, which exhibit low correlation between HF and LF. It is therefore desirable to reduce the side information rate of a guided BWE scheme to a level far below 1 kbit/s, which would allow its adoption even in very-low-bitrate coding.
Numerous BWE approaches have been documented in recent years [1-10]. In general, all of these are either fully blind or fully guided at a given operating point, regardless of the instantaneous characteristics of the input signal. Furthermore, many blind BWE systems [1, 3, 4, 5, 9, 10] are optimized particularly for speech signals rather than for music and may therefore yield unsatisfactory results for music. Finally, most of the BWE realizations are relatively computationally complex, employing Fourier transforms, LPC filter computations, or vector quantization of the side information (Predictive Vector Coding in MPEG-D USAC [8]). This can be a disadvantage for the adoption of new coding technology in mobile telecommunication markets, given that the majority of mobile devices provide very limited computational power and battery capacity.
An approach which extends blind BWE by small side information is presented in [12] and is illustrated in
A further problem of the procedure illustrated in
According to an embodiment, a decoder for generating a frequency enhanced audio signal may have: a feature extractor for extracting a feature from a core signal; a side information extractor for extracting a selection side information associated with the core signal; a parameter generator for generating a parametric representation for estimating a spectral range of the frequency enhanced audio signal not defined by the core signal, wherein the parameter generator is configured to provide a number of parametric representation alternatives in response to the feature, and wherein the parameter generator is configured to select one of the parametric representation alternatives as the parametric representation in response to the selection side information; and a signal estimator for estimating the frequency enhanced audio signal using the parametric representation selected.
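Purely as an illustration of the data flow defined by this embodiment, the following Python sketch wires the four elements together; the interfaces and names are hypothetical and only meant to show how the selection side information resolves the ambiguity left by the feature-driven statistical model.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

@dataclass
class FrequencyEnhancementDecoder:
    feature_extractor: Callable[[Any], Any]            # core signal -> feature
    side_info_extractor: Callable[[Any], int]          # coded frame -> selection index
    parameter_generator: Callable[[Any], Sequence]     # feature -> parametric representation alternatives
    signal_estimator: Callable[[Any, Any], Any]        # (core signal, parameters) -> enhanced signal

    def decode_frame(self, core_signal, coded_frame):
        feature = self.feature_extractor(core_signal)
        selection = self.side_info_extractor(coded_frame)
        alternatives = self.parameter_generator(feature)   # statistical model yields several candidates
        parameters = alternatives[selection]                # selection side information resolves the ambiguity
        return self.signal_estimator(core_signal, parameters)
```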
According to another embodiment, an encoder for generating an encoded signal may have: a core encoder for encoding an original signal to acquire an encoded audio signal including information on a smaller number of frequency bands compared to an original signal; a selection side information generator for generating selection side information indicating a defined parametric representation alternative provided by a statistical model in response to a feature extracted from the original signal or from the encoded audio signal or from a decoded version of the encoded audio signal; and an output interface for outputting the encoded signal, the encoded signal including the encoded audio signal and the selection side information.
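A corresponding encoder-side sketch, again with hypothetical names (statistical_model, original_hf_envelope), illustrates one way the selection side information generator could determine the defined parametric representation alternative: it runs the same statistical model as the decoder and signals the index of the candidate closest to the true high-frequency content of the original.

```python
import numpy as np

def generate_selection_side_info(original_hf_envelope, feature, statistical_model, n_bits=3):
    """Run the decoder's statistical model on the feature and signal the index of the
    parametric representation alternative closest to the true HF envelope of the original."""
    alternatives = statistical_model(feature)               # identical candidates on both ends
    errors = [np.sum((np.asarray(alt) - original_hf_envelope) ** 2) for alt in alternatives]
    index = int(np.argmin(errors))
    if index >= 2 ** n_bits:
        raise ValueError("selection index does not fit into the N-bit side information")
    return index
```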
According to another embodiment, a method for generating a frequency enhanced audio signal may have the steps of: extracting a feature from a core signal; extracting a selection side information associated with the core signal; generating a parametric representation for estimating a spectral range of the frequency enhanced audio signal not defined by the core signal, wherein a number of parametric representation alternatives is provided in response to the feature, and wherein one of the parametric representation alternatives is selected as the parametric representation in response to the selection side information; and estimating the frequency enhanced audio signal using the parametric representation selected.
According to another embodiment, a method of generating an encoded signal may have the steps of: encoding an original signal to acquire an encoded audio signal including information on a smaller number of frequency bands compared to an original signal; generating selection side information indicating a defined parametric representation alternative provided by a statistical model in response to a feature extracted from the original signal or from the encoded audio signal or from a decoded version of the encoded audio signal; and outputting the encoded signal, the encoded signal including the encoded audio signal and the selection side information.
Another embodiment may have a computer program for performing, when running on a computer or a processor, the method of claim 15.
Another embodiment may have a computer program for performing, when running on a computer or a processor, the method of claim 16.
According to another embodiment, an encoded signal may have: an encoded audio signal; and selection side information indicating a defined parametric representation alternative provided by a statistical model in response to a feature extracted from an original signal or from the encoded audio signal or from a decoded version of the encoded audio signal.
The present invention is based on the finding that, in order to further reduce the amount of side information and, additionally, to keep the overall encoder/decoder from becoming overly complex, the conventional-technology parametric encoding of a highband portion has to be replaced or at least enhanced by selection side information that actually relates to the statistical model used together with a feature extractor in a frequency enhancement decoder. Because the feature extraction in combination with a statistical model provides parametric representation alternatives which are ambiguous specifically for certain speech portions, it has been found that signaling to the statistical model within the parameter generator on the decoder side which of the provided alternatives is the best one is superior to parametrically coding a certain characteristic of the signal, specifically in very-low-bitrate applications where the side information for the bandwidth extension is limited.
Thus, a blind BWE that exploits a source model for the coded signal is improved by extending it with a small amount of additional side information, particularly when the signal itself does not allow a reconstruction of the HF content at an acceptable perceptual quality level. The procedure therefore complements the parameters of the source model, which are generated from the decoded core-coder content, with extra information. This is advantageous particularly for enhancing the perceptual quality of sounds which are difficult to code within such a source model. Such sounds typically exhibit a low correlation between the HF and the LF content.
The present invention addresses the problems of conventional BWE in very-low-bitrate audio coding and the shortcomings of the existing, state-of-the-art BWE techniques. A solution to the above-described quality dilemma is provided by proposing a minimally guided BWE as a signal-adaptive combination of a blind and a guided BWE. The inventive BWE adds a small amount of side information to the signal that allows for a further discrimination of otherwise problematic coded sounds. In speech coding, this particularly applies to sibilants or fricatives.
It was found that, in WB codecs, the spectral envelope of the HF region above the core-coder region represents the most critical data for performing BWE with acceptable perceptual quality. All other parameters, such as the spectral fine structure and the temporal envelope, can often be derived from the decoded core signal quite accurately or are of little perceptual importance. Fricatives, however, are often not properly reproduced in the BWE signal. Side information may therefore include additional information distinguishing between different sibilants or fricatives such as “f”, “s”, “ch” and “sh”.
Other acoustical information is also problematic for bandwidth extension when plosives or affricates such as “t” or “tsch” occur.
The present invention allows this side information to be used and transmitted only where it is useful, and to be omitted when no ambiguity is expected in the statistical model.
Furthermore, advantageous embodiments of the present invention use only a very small amount of side information, such as three or fewer bits per frame; a combined voice activity detection/speech/non-speech detection for controlling a signal estimator; different statistical models determined by a signal classifier; or parametric representation alternatives that refer not only to an envelope estimation but also to other bandwidth extension tools, to the improvement of bandwidth extension parameters, or to the addition of new parameters to already existing and actually transmitted bandwidth extension parameters.
Embodiments of the present invention will be detailed subsequently referring to the appended drawings, in which:
Furthermore, a side information extractor 110 for extracting a selection side information 114 associated with the core signal 100 is provided. In addition, a parameter generator 108 is connected to the feature extractor 104 via a feature transmission line 112 and to the side information extractor 110 via the selection side information 114. The parameter generator 108 is configured for generating a parametric representation for estimating a spectral range of the frequency enhanced audio signal not defined by the core signal. The parameter generator 108 is configured to provide a number of parametric representation alternatives in response to the feature 112 and to select one of the parametric representation alternatives as the parametric representation in response to the selection side information 114. The decoder furthermore comprises a signal estimator 118 for estimating the frequency enhanced audio signal using the selected parametric representation, i.e., the parametric representation 116.
Particularly, the feature extractor 104 can be implemented to either extract from the decoded core signal as illustrated in
Alternatively, however, the feature extractor can also operate on, or extract a feature from, the encoded core signal. Typically, the encoded core signal comprises a representation of scale factors for frequency bands or any other representation of audio information. Depending on the kind of feature extraction, the encoded representation of the audio signal is representative of the decoded core signal and, therefore, features can be extracted from it. Alternatively or additionally, a feature can be extracted not only from a fully decoded core signal but also from a partly decoded core signal. In frequency domain coding, the encoded signal represents a frequency domain representation comprising a sequence of spectral frames. The encoded core signal can, therefore, be only partly decoded to obtain a decoded representation of a sequence of spectral frames, before actually performing a spectrum-time conversion. Thus, the feature extractor 104 can extract features either from the encoded core signal, from a partly decoded core signal or from a fully decoded core signal. The feature extractor 104 can be implemented, with respect to its extracted features, as known in the art and may, for example, be implemented as in audio fingerprinting or audio ID technologies.
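Since the text leaves the concrete feature set open, the following sketch shows one plausible low-complexity choice (coarse log sub-band energies plus a few cepstral coefficients); it is an assumption for illustration, not the feature set mandated by the invention.

```python
import numpy as np

def extract_core_features(core_frame, n_fft=256):
    """Illustrative feature set for the core signal: coarse log sub-band energies of the
    core band plus the first real-cepstrum coefficients."""
    spectrum = np.abs(np.fft.rfft(core_frame * np.hanning(len(core_frame)), n_fft))
    power = spectrum ** 2 + 1e-12
    log_energies = np.log([band.mean() for band in np.array_split(power, 8)])
    cepstrum = np.fft.irfft(np.log(power))[:10]
    return np.concatenate([log_energies, cepstrum])
```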
Advantageously, the selection side information 114 comprises a number N of bits per frame of the core signal.
Furthermore, the parameter generator is configured to provide at most 2^N parametric representation alternatives. On the other hand, when the parameter generator 108 provides, for example, only five parametric representation alternatives, then three bits of selection side information may nevertheless be used.
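As a small illustration of this relation, the number of side-information bits follows from the number of alternatives as N = ceil(log2(M)); five alternatives therefore still occupy three bits per frame.

```python
import math

def side_info_bits(num_alternatives):
    """Side-information bits per frame needed to address M alternatives: N = ceil(log2(M))."""
    return max(1, math.ceil(math.log2(num_alternatives)))

assert side_info_bits(5) == 3    # five alternatives still need three bits ...
assert side_info_bits(8) == 3    # ... which address at most 2**3 = 8 alternatives
```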
Furthermore, the parameter generator 108 is configured for retrieving the selection side information 114 from the side information extractor as outlined in step 404. Then, in step 406, a specific parametric representation alternative is selected using the selection side information 114. Finally, in step 408, the selected parametric representation alternative is output to the signal estimator 118.
Advantageously, the parameter generator 108 is configured to use, when selecting one of the parametric representation alternatives, a predefined order of the parametric representation alternatives or, alternatively, an encoder-signaled order of the representation alternatives. To this end, reference is made to
The predefined order of the parametric representation alternatives can, therefore, be the order in which the statistical model actually delivers the alternatives in response to an extracted feature. Alternatively, if the individual alternatives have different associated probabilities which are, however, quite close to each other, then the predefined order could be that the parametric representation with the highest probability comes first, and so on. Alternatively, the order could be signaled, for example, by a single bit, but in order to save even this bit, a predefined order is advantageous.
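Assuming the probability-ranked predefined order described above, a short sketch of the decoder-side selection could look as follows; the same ordering routine would run on the encoder side so that both ends index an identical list.

```python
import numpy as np

def predefined_order(alternatives, probabilities):
    """Rank the model's candidates by descending probability; the stable sort guarantees
    that encoder and decoder derive the identical order without any extra signalling."""
    ranking = np.argsort(-np.asarray(probabilities), kind="stable")
    return [alternatives[i] for i in ranking]

def select_alternative(alternatives, probabilities, selection_index):
    """Decoder-side selection: pick the signalled entry from the commonly known order."""
    return predefined_order(alternatives, probabilities)[selection_index]
```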
Subsequently, reference is made to
In an embodiment according to
Particularly, the selection side information 114 is also termed "fricative information", since this selection side information distinguishes between problematic sibilants or fricatives such as "f", "s" or "sh". Thus, the selection side information provides a clear indication of one of three problematic alternatives which are, for example, provided by the statistical model 904 in the process of the envelope estimation 902, both of which are performed in the parameter generator 108. The envelope estimation results in a parametric representation of the spectral envelope of the spectral portions not included in the core signal.
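The following toy example illustrates how the fricative information resolves such an ambiguity; the envelope templates are invented placeholder values, whereas a real system would obtain the candidates from the trained statistical model 904.

```python
import numpy as np

# Invented placeholder HF envelope templates (dB) for three fricative classes.
FRICATIVE_ENVELOPES = {
    "f":  np.array([-12.0, -14.0, -17.0, -21.0]),
    "s":  np.array([ -4.0,  -2.0,  -1.0,  -3.0]),
    "sh": np.array([ -2.0,  -5.0,  -9.0, -14.0]),
}

def resolve_fricative(candidates, fricative_info):
    """The low-band feature alone cannot separate these classes; the fricative information
    (selection side information 114) names the intended one from a commonly agreed order."""
    labels = sorted(candidates)                       # predefined order: "f", "s", "sh"
    return candidates[labels[fricative_info]]

hf_envelope = resolve_fricative(FRICATIVE_ENVELOPES, fricative_info=1)   # selects the "s" envelope
```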
Block 104 can, therefore, correspond to block 1510 of
Furthermore, it is advantageous that the signal estimator 118 comprises an analysis filter 910, an excitation extension block 912 and a synthesis filter 914. Thus, blocks 910, 912, 914 may correspond to blocks 1600, 1700 and 1800 of
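One common realization of such a chain, sketched here under the assumption of an LPC-based core and simple spectral folding as the excitation extension, is the following; the concrete operations of blocks 910, 912 and 914 may of course differ.

```python
import numpy as np
from scipy.signal import lfilter

def estimate_enhanced_signal(core_frame, core_lpc_a, hf_lpc_a, upsample=2):
    """Analysis filter (910) whitens the core signal, the excitation is extended to the new
    bandwidth by zero insertion, i.e. spectral folding (912), and the synthesis filter (914)
    imposes the selected high-band envelope on the extended excitation."""
    residual = lfilter(core_lpc_a, [1.0], core_frame)    # prediction-error (analysis) filtering
    extended = np.zeros(len(residual) * upsample)
    extended[::upsample] = residual                       # images of the residual fill the HF range
    return lfilter([1.0], hf_lpc_a, extended)             # all-pole synthesis with the selected envelope
```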
Thus, other signals different from speech can also be coded as illustrated in
A further embodiment is illustrated in
Subsequently,
As discussed before,
While
The selection side information 1210 generated by the selection side information generator 1202 can have any of the characteristics as discussed in the context of the earlier Figures.
Although the present invention has been described in the context of block diagrams where the blocks represent actual or logical hardware components, the present invention can also be implemented by a computer-implemented method. In the latter case, the blocks represent corresponding method steps where these steps stand for the functionalities performed by corresponding logical or physical hardware blocks.
Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus. Some or all of the method steps may be executed by (or using) a hardware apparatus, like, for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, one or more of the most important method steps may be executed by such an apparatus.
The inventive transmitted or encoded signal can be stored on a digital storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.
Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disc, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may, for example, be stored on a machine readable carrier.
Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
A further embodiment of the inventive method is, therefore, a data carrier (or a non-transitory storage medium such as a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. The data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitory.
A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may, for example, be configured to be transferred via a data communication connection, for example, via the internet.
A further embodiment comprises a processing means, for example, a computer or a programmable logic device, configured to, or adapted to, perform one of the methods described herein.
A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
A further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.
In some embodiments, a programmable logic device (for example, a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are advantageously performed by any hardware apparatus.
While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations and equivalents as fall within the true spirit and scope of the present invention.
This application is a continuation of copending International Application No. PCT/EP2014/051591, filed Jan. 28, 2014, which claims priority from U.S. Application No. 61/758,092, filed Jan. 29, 2013, each of which is incorporated herein in its entirety by this reference thereto.
Number | Name | Date | Kind |
---|---|---|---|
7751572 | Villemoes | Jul 2010 | B2 |
8275626 | Neuendorf et al. | Sep 2012 | B2 |
8484020 | Krishnan et al. | Jul 2013 | B2 |
8731950 | Herre et al. | May 2014 | B2 |
8929558 | Engdegard | Jan 2015 | B2 |
9094754 | Engdegard | Jul 2015 | B2 |
9191045 | Purnhagen | Nov 2015 | B2 |
20060140412 | Villemoes | Jun 2006 | A1 |
20070019813 | Hilpert | Jan 2007 | A1 |
20070094027 | Vasilache | Apr 2007 | A1 |
20070208557 | Li | Sep 2007 | A1 |
20070255572 | Miyasaka et al. | Nov 2007 | A1 |
20080154583 | Goto | Jun 2008 | A1 |
20090282298 | Zopf | Nov 2009 | A1 |
20100046762 | Henn | Feb 2010 | A1 |
20100080397 | Suzuki et al. | Apr 2010 | A1 |
20110004479 | Ekstrand | Jan 2011 | A1 |
20110054885 | Nagel et al. | Mar 2011 | A1 |
20110173006 | Nagel et al. | Jul 2011 | A1 |
20110202353 | Neuendorf | Aug 2011 | A1 |
20110295598 | Yang | Dec 2011 | A1 |
20120002818 | Heiko | Jan 2012 | A1 |
20120263308 | Herre et al. | Oct 2012 | A1 |
20130101032 | Wittmann | Apr 2013 | A1 |
20130121411 | Robillard | May 2013 | A1 |
20130170391 | Feiten | Jul 2013 | A1 |
Number | Date | Country |
---|---|---|
102027537 | Apr 2011 | CN |
102089814 | Jun 2011 | CN |
102177545 | Sep 2011 | CN |
102714035 | Oct 2012 | CN |
0720148 | Jul 1996 | EP |
2239732 | Oct 2010 | EP |
2007328268 | Dec 2007 | JP |
2010122640 | Jun 2010 | JP |
2011527449 | Oct 2011 | JP |
2455710 | Jul 2012 | RU |
2011101616 | Jul 2012 | RU |
201009808 | Mar 2010 | TW |
201104674 | Feb 2011 | TW |
201140563 | Nov 2011 | TW |
2010058518 | May 2010 | WO |
2010115845 | Oct 2010 | WO |
2011047886 | Apr 2011 | WO |
Entry |
---|
Bauer, P. et al., “A Statistical Framework for Artificial Bandwidth Extension Exploiting Speech Waveform and Phonetic Transcription”, retrieved online on Apr. 2, 2014 from url: http://www.researchgate.net/publication/228336475_A_Statistical_Framework_for_Artificail_Bandwidth_Extension_Exploiting_Speech_Waveform_and_Phonetic_Transcription/file/e0b495225068409423.pdf, Jan. 2009, 6 pages. |
Bessette, Bruno et al., “The Adaptive Multirate Wideband Speech Codec (AMR-WB)”, IEEE Transactions on Speech and Audio Processing, vol. 10, No. 8, Nov. 8, 2002, pp. 620-636. |
Geiser, B et al., “Bandwidth Extension for Hierarchical Speech and Audio Coding in ITU-T Rec. G.729.1”, IEEE Transactions on Audio, Speech and Language Processing, IEEE Service Center, vol. 15, No. 8, Nov. 2007, pp. 2496-2509. |
Geiser, B. et al., “Robust Wideband Enhancement of Speech by Combined Coding and Artificial Bandwidth Extension”, Proceedings of IWAENC; Eindhoven, Netherlands, Sep. 15, 2005, pp. 21-24. |
Iser, Bernd et al., “Bandwidth Extension of Speech Signals”, Springer Science + Business Media, LLC, 2008, pp. 53-66. |
Makinen, Jari et al., “AMR-WB+: A New Audio Coding Standard for 3rd Generation Mobile Audio Services”, Multimedia Technologies Laboratory, Nokia Research Center, Finland; VoiceAge Corp., Montreal, Qc, Canada; University of Sherbrooke, Qc, Canada; Multimedia Technologies, Ericsson Research, Sweden, Mar. 2005, pp. II-1109-1112. |
Jelinek, Milan, “Wideband Speech Coding Advances in VMR-WB Standard”, IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, No. 4, May 4, 2007, pp. 1167-1179. |
Katsir, I et al., “Speech Bandwidth Extension Based on Speech Phonetic Content and Speaker Vocal Tract Shape Estimation”, in Proc. EUSIPCO 2011, Barcelona, Spain, Aug. 29-Sep. 2, 2011, pp. 461-465. |
Larsen, Erik et al., “Audio Bandwidth Extension”, Application of Psychoacoustics, Signal Processing and Loudspeaker Design, John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, 2004, pp. 171-236. |
Miao, Lei, “G.711.1 Annex D and G.722 Annex B—New ITU-T Superwideband Codecs”, In the proceedings of ICASSP; Prague, Czech Republic, May 2011, pp. 5232-5235. |
Neuendorf, M et al., “MPEG Unified Speech and Audio Coding—The ISO/MPEG Standard for High-Efficiency Audio Coding of all Content Types”, Audio Engineering Society Convention Paper 8654, Presented at the 132nd Convention, Apr. 26-29, 2012, pp. 1-22. |
Pulakka, Hannu et al., “Bandwidth Extension of Telephone Speech Using a Neural Network and a Filter Bank Implementation for Highband Mel Spectrum”, IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, No. 7, Sep. 7, 2011, pp. 2170-2183. |
Sanna, M. et al., “A codebook design method for fricative enhancement in Artificial Bandwidth Extension”, Proceedings of the 5th Int'l Mobile Multimedia Communications Conference, London, UK, Sep. 9, 2009, 7 pages. |
Vaillancourt, T et al., “ITU-T EV-VBR: A Robust 8-32 kbit/s Scalable Coder for Error Prone Telecommunications Channels”, in Proc. EUSIPCO 2008, Lausanne, Switzerland, Aug. 2008, 5 pages. |
Number | Date | Country | |
---|---|---|---|
20150332701 A1 | Nov 2015 | US |
Number | Date | Country | |
---|---|---|---|
61758092 | Jan 2013 | US |
Number | Date | Country | |
---|---|---|---|
Parent | PCT/EP2014/051591 | Jan 2014 | US |
Child | 14811722 | US |