The present invention relates to coding of audio signals, and in particular to high frequency reconstruction methods including a frequency domain transposer such as a harmonic transposer.
Conventionally, there are several methods for high frequency reconstruction using harmonic transposition, time-stretching, or similar techniques. One such method is based on phase vocoders. Phase vocoders perform a frequency analysis with sufficiently high frequency resolution and modify the signal in the frequency domain before synthesizing it. The resulting time stretch or transposition depends on the combination of analysis window, analysis window stride, synthesis window and synthesis window stride, as well as on the phase adjustments applied to the analyzed signal.
One of the problems that inevitably arises with these methods is the conflict between the frequency resolution needed for high-quality transposition of stationary sounds and the transient response of the system for transient sounds.
An algorithm that employs phase vocoders for patch generation, as described, for example, in M. Puckette, “Phase-locked Vocoder,” IEEE ASSP Conference on Applications of Signal Processing to Audio and Acoustics, Mohonk, 1995; A. Röbel, “Transient detection and preservation in the phase vocoder,” citeseer.ist.psu.edu/679246.html; J. Laroche and M. Dolson, “Improved phase vocoder timescale modification of audio,” IEEE Trans. Speech and Audio Processing, vol. 7, no. 3, pp. 323-332; and U.S. Pat. No. 6,549,884 (J. Laroche and M. Dolson, “Phase-vocoder pitch-shifting”), has been presented in Frederik Nagel and Sascha Disch, “A harmonic bandwidth extension method for audio codecs,” ICASSP International Conference on Acoustics, Speech and Signal Processing, IEEE CNF, Taipei, Taiwan, April 2009. However, this method, called “harmonic bandwidth extension” (HBE), is prone to quality degradations of transients contained in the audio signal, as described in Frederik Nagel, Sascha Disch and Nikolaus Rettelbach, “A phase vocoder driven bandwidth extension method with novel transient handling for audio codecs,” 126th AES Convention, Munich, Germany, May 2009. The reason is that vertical coherence over subbands is not guaranteed to be preserved in the standard phase vocoder algorithm and, moreover, the Discrete Fourier Transform (DFT) phases have to be recalculated on isolated time blocks of a transform, which implicitly assumes circular periodicity.
Specifically, two kinds of artifacts caused by block-based phase vocoder processing are known to occur: dispersion of the waveform, and temporal aliasing caused by temporal cyclic convolution of the signal when the newly calculated phases are applied.
In other words, because a phase modification is applied to the spectral values of the audio signal in the bandwidth extension (BWE) algorithm, a transient contained in a block of the audio signal may be wrapped around the block, i.e., cyclically convolved back into the block. This results in temporal aliasing and, consequently, degrades the audio signal.
Therefore, signal parts containing transients should receive special treatment. However, since the BWE algorithm is performed on the decoder side of a codec chain, computational complexity is a serious issue. Accordingly, countermeasures against the audio signal degradation just mentioned should not come at the price of a greatly increased computational complexity.
According to an embodiment, an apparatus for generating a high frequency audio signal may have: an analyzer for analyzing an input signal to determine transient information, wherein a first portion of the input signal has the transient information associated with it and a second, later portion of the input signal does not have the transient information; a spectral converter for converting the input signal into an input spectral representation; a spectral processor for processing the input spectral representation to generate a processed spectral representation including values for higher frequencies than the input spectral representation; and a time converter for converting the processed spectral representation to a time representation, wherein the spectral converter or the time converter is controllable to perform a frequency domain oversampling for the first portion of the input signal having the associated transient information, and either to not perform the frequency domain oversampling for the second portion of the input signal or to perform, for the second portion, a frequency domain oversampling with a smaller oversampling factor than for the first portion of the input signal.
According to another embodiment, a method of generating a high frequency audio signal may have the steps of: analyzing an input signal to determine transient information, wherein a first portion of the input signal has the transient information associated with it and a second, later portion of the input signal does not have the transient information; converting the input signal into an input spectral representation; processing the input spectral representation to generate a processed spectral representation including values for higher frequencies than the input spectral representation; and converting the processed spectral representation to a time representation, wherein, in the step of converting into an input spectral representation or in the step of converting to a time representation, a controllable frequency domain oversampling is performed for the first portion of the input signal having the transient information, and wherein, for the second portion of the input signal, the frequency domain oversampling is either not performed or performed with a smaller oversampling factor than for the first portion of the input signal.
Another embodiment may have a computer program for performing, when running on a computer, the inventive method for generating a high-frequency audio signal.
The present invention relies on treating transients separately, i.e., differently from non-transient portions of the audio signal. To this end, an apparatus for generating a high frequency audio signal comprises an analyzer for analyzing the input signal to determine transient information, where a first portion of the input signal has the transient information associated with it and a second, later time portion of the input signal does not. The analyzer can analyze the audio signal itself, e.g., its energy distribution or its change in energy, to determine a transient portion. This necessitates a certain look-ahead: for example, a core coder output signal is analyzed a certain time in advance so that the result of the analysis is available when the high frequency audio signal is generated from the core coder output signal. Alternatively, transient detection can be performed on the encoder side, and side information such as a certain bit in the bitstream can be associated with a time portion of the signal that has transient characteristics. The analyzer is then configured to extract this transient information bit from the bitstream in order to determine whether a certain portion of the input audio signal is transient or not. Additionally, the apparatus for generating a high frequency audio signal comprises a spectral converter for converting the input signal into the input spectral representation. The high frequency reconstruction is performed within the filterbank domain, i.e., after the spectral conversion by the spectral converter. To this end, a spectral processor processes the input spectral representation to generate a processed spectral representation comprising values for higher frequencies than the input spectral representation.
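As an illustration of the decoder-side alternative, the following minimal Python sketch flags blocks whose energy rises sharply relative to the preceding block. The function name, the block size, the energy-ratio criterion and the threshold of 8 are assumptions chosen for the example, not values from the description, and a practical detector would also need the look-ahead mentioned above.

```python
def detect_transients(x, block_len, ratio_threshold=8.0, eps=1e-12):
    """Flag each block whose energy exceeds ratio_threshold times the previous block's energy.

    Hypothetical helper: the energy-ratio criterion and the default threshold
    are illustrative assumptions, not values taken from the description.
    """
    flags = []
    prev_energy = None
    for start in range(0, len(x) - block_len + 1, block_len):
        block = x[start:start + block_len]
        energy = sum(s * s for s in block)
        # The first block has no predecessor, so it is never flagged.
        is_transient = prev_energy is not None and energy > ratio_threshold * (prev_energy + eps)
        flags.append(is_transient)
        prev_energy = energy
    return flags


# Quiet background followed by a sudden onset in the third block.
signal = [0.01] * 32 + [1.0] * 32
print(detect_transients(signal, block_len=16))  # [False, False, True, False]
```

Only the onset block is flagged; the following block of equal energy is treated as non-transient, which matches the idea that only the first portion carries the transient information.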
A conversion back into the time domain is performed by a subsequently connected time converter for converting the processed spectral representation to a time representation. In accordance with the present invention, the spectral converter and/or the time converter are controllable to perform a frequency domain oversampling for the first portion of the input signal, which has the associated transient information, and to not perform the frequency domain oversampling for the second portion of the input signal, which has no associated transient information.
The present invention is advantageous in that it reduces complexity while nevertheless retaining good transient performance for transpositions such as harmonic transpositions in combined filterbanks. In accordance with an embodiment, the present invention therefore comprises an apparatus and a method with adaptive oversampling in frequency for combined transposers in a filterbank, where the oversampling is controlled by a transient detector.
In an embodiment, the spectral processor performs a harmonic transposition from a base band into a first high band portion and into additional high band portions, such as three or four high band portions. In one embodiment, each high band portion has a separate synthesis filterbank such as an inverse FFT. In another, computationally more efficient embodiment, a single synthesis filterbank such as a single inverse FFT of size 1024 is used. In both cases, the frequency domain oversampling is obtained by increasing the transform size by an oversampling factor such as 1.5. The additional FFT input is obtained by zero padding, i.e., by adding a certain number of zeros before the first value of a windowed frame and another number of zeros after its last value. In response to an FFT control signal, the FFT size is increased by the oversampling factor and zero padding is performed; instead of zeros, other values, such as certain noise values, can also be padded to the windowed frames.
The spectral processor can additionally be controlled by the analyzer output signal, i.e., by the transient information. For a transient portion, where the FFT is longer than in the non-transient, non-padded case, the start index values for mapping lines in the filterbank, i.e., for the different transposition “rounds” or transposition iterations, are changed depending on the oversampling factor: the transform domain index used in the regular case is multiplied by the oversampling factor to obtain the new start index for the patching operation in the frequency domain oversampled case.
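The adaptive zero padding can be sketched as follows, assuming a 1024-sample windowed frame and an oversampling factor of 1.5. The function name is hypothetical, and splitting the extra samples evenly between the start and the end of the frame is an assumption consistent with the symmetric padding described for the DFT kernels; padding with noise values instead of zeros is not shown.

```python
def pad_windowed_frame(frame, transient, factor=1.5):
    """Return the FFT input for one windowed frame (hypothetical helper).

    Non-transient frames are returned unchanged. Transient frames are extended
    to round(factor * len(frame)) samples by symmetric zero padding.
    """
    n = len(frame)
    if not transient:
        return list(frame)
    n_oversampled = int(round(factor * n))
    extra = n_oversampled - n
    left = extra // 2        # zeros inserted before the first sample
    right = extra - left     # zeros appended after the last sample
    return [0.0] * left + list(frame) + [0.0] * right


frame = [1.0] * 1024                      # stand-in for a windowed frame
padded = pad_windowed_frame(frame, transient=True)
print(len(padded), padded[:256].count(0.0), padded[-256:].count(0.0))  # 1536 256 256
```

The FFT size then simply follows the length of the returned list, so the same transform code serves both the regular and the oversampled case.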
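A minimal sketch of the start index adaptation, with hypothetical function names: the index is multiplied by the oversampling factor and, as an assumption, rounded to the nearest integer. The second function shows why the scaling is needed: in the longer transform, bin F·k has the same centre frequency as bin k in the regular transform. The start index 40 and the 48 kHz sampling rate are illustrative values only.

```python
import math


def oversampled_start_index(start_index, factor):
    """Scale a patch start index from the regular to the oversampled transform grid.

    Rounding to the nearest integer is an assumption; the text only specifies
    the multiplication by the oversampling factor.
    """
    return int(math.floor(start_index * factor + 0.5))


def bin_frequency(k, fft_size, sample_rate):
    """Centre frequency in Hz of bin k of an fft_size-point DFT."""
    return k * sample_rate / fft_size


k_regular = 40                                   # hypothetical start index for a 1024-point FFT
k_oversampled = oversampled_start_index(k_regular, 1.5)
print(k_oversampled)                             # 60
# Both indices refer to the same physical frequency (48 kHz is an assumed rate).
print(bin_frequency(k_regular, 1024, 48000), bin_frequency(k_oversampled, 1536, 48000))  # 1875.0 1875.0
```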
Embodiments of the present invention will be detailed subsequently referring to the appended drawings, in which:
a is an embodiment of the apparatus for generating a high frequency audio signal;
b illustrates a spectral band replication processor, which comprises the apparatus for generating a high frequency audio signal of
a illustrates the transient stretching of a transient event close to the center of a window;
b illustrates the stretching of a transient close to the edge of a window; and
c illustrates a transient stretch with oversampling occurring in the first portion of the input signal having associated transient information.
The spectral converter 14 is configured for converting the input signal into an input spectral representation output on line 11. The spectral processor 13 is connected to the spectral converter via the line 11.
The spectral processor 13 is configured for processing the input spectral representation to generate a processed spectral representation comprising values for higher frequencies than the input spectral representation. Stated differently, the spectral processor 13 performs the transposition, in particular a harmonic transposition, although other transpositions could be performed in the spectral processor 13 as well. The processed spectral representation is output from the spectral processor 13 via a line 15 to a time converter 17, which is configured for converting the processed spectral representation to a time representation. The spectral representation is a frequency domain or filterbank domain representation, and the time representation is a straightforward full bandwidth time domain representation. Alternatively, the time converter can be configured to transform the processed spectral representation 15 directly into a filterbank domain with individual subband signals, each having a higher bandwidth than a line of the FFT filterbank. Therefore, the time representation on output line 18 can also comprise one or several subband signals, each subband signal having a higher bandwidth than a frequency line or value in the processed spectral representation.
The spectral converter 14, the time converter 17, or both are controllable with respect to the size of the spectral conversion algorithm so as to perform a frequency domain oversampling for the first portion of the audio signal, which has the associated transient information, and to not perform the frequency domain oversampling for the second portion of the input signal, which does not have the transient information. This provides high efficiency and reduced complexity without a loss of audio quality.
The spectral converter is configured to perform the frequency domain oversampling by applying, for the first portion having the associated transient information, a longer transform length than the transform length applied to the second portion, wherein the longer transform comprises padded data. The ratio between the two transform lengths is the frequency domain oversampling factor, which can be in the range of 1.3 to 3 and is chosen as low as possible but sufficiently large to make sure that “bad transients” as illustrated in
Subsequently,
The spectral converter 14 comprises an analysis windower 14a and an FFT processor 14b. Additionally, the time converter comprises an inverse FFT module 17a, a synthesis windower 17b and an overlap-add processor 17c. An inventive apparatus may comprise a single time converter 17 as, for example, illustrated with respect to
b illustrates an SBR (spectral band replication) for a high frequency reconstruction processor. On an input line 10 a core decoder output signal which can, for example, be a time domain output signal is provided to block 20, which symbolizes the
At the output of block 14, there is the input spectral representation, which is then processed by the phase processors 41, 42, 43 arranged in parallel. Phase processor 41, which is part of the spectral processor 13 in
The higher frequencies are generated by feeding the signals output by the spectral processors 41, 42, 43 into the corresponding frequency channels of the respective time converters 170a, 170b, 170c. Additionally, the time converters 170a, 170b, 170c have an increased frequency spacing compared to the input filterbank 14, so that, despite the same size of these processors, i.e., the same FFT size, the signal they generate represents a higher spectral content or, stated differently, a higher maximum frequency.
The analyzer 12 is configured to retrieve the transient information from the input signal and to control the processors 14, 170a, 170b, 170c to use a larger transform size and to pad values before the beginning and after the end of the windowed frame, so that the frequency domain oversampling is performed adaptively. In an alternative embodiment illustrated in
Additionally, the first portion on the left-hand side of
In order to actually implement or approximate the third order transposition, the target bins extend from 3/2 k upwards in frequency. The result for the target bins 3/2 k and 3/2 (k+2) is again straightforward, since the corresponding spectral lines in the source bins k and k+2 can be taken as they are, with their phases multiplied by 3, as illustrated by the phase multiplication arrows 63. However, the target bin 3/2 k+1 has no direct counterpart in the source bins. Consider, for example, k equal to 4 and k+1 equal to 5: then 3/2 k corresponds to target bin 6, which, divided by 1.5, results in source bin k=4. The next target bin, however, is 7, and 7 divided by 1.5 is approximately 4.67. A source bin with index 4.67 does not exist, since only integer source bins exist. Therefore, an interpolation between the adjacent source bins k and k+1 is performed. Since 4.67 is closer to 5 (k+1) than to 4 (k), the phase of source bin k+1 is multiplied by two, as indicated by arrow 62, and the phase of source bin k (in the example, 4) is multiplied by one, as shown by phase arrow 61; a multiplication by one, of course, corresponds to taking the phase as it is. The phases obtained by the operations symbolized by arrows 61 and 62 are combined, e.g., added together, so that the two arrows together amount to a phase multiplication factor of 3, as required for the third order transposition. Analogously, the phase values for the target bins 3/2 k+2 and 3/2 (k+2)+1 are calculated.
A similar calculation is performed for the fourth order transposition, where the interpolated values are calculated, as illustrated by arrows 62, from two adjacent source bins, the phase of each source bin being multiplied by two. The phases of the directly corresponding target bins, on the other hand, do not need to be interpolated, but are calculated by multiplying the phases of the corresponding source bins by four.
It is to be noted that, in an embodiment, where a target bin is calculated directly from a source bin, only the phase is modified, and the amplitude of the source bin is maintained as it is. For the interpolated values, it is advantageous to interpolate between the amplitudes of the two adjacent source bins, but the two source bins can also be combined in other ways, such as by taking the higher or the lower of the two adjacent source bin amplitudes, their geometric or arithmetic mean, or any other combination of the adjacent source bin amplitudes.
Then, in step 34, the target bin amplitude is determined by interpolating the source bin amplitudes. In an alternative embodiment, the target bin amplitudes can be selected randomly depending on the source bin amplitudes or on an average amplitude of the directly calculated target bins. When a random selection is applied, the average value or one of the two source bin amplitude values can be prescribed as the mean value of the random process.
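The bin mapping of the third and fourth order examples can be written compactly. The sketch below assumes, as the 3/2 k indexing implies, that the synthesis bin spacing is twice the analysis bin spacing, so target bin n corresponds to position 2n on the analysis grid; the lower and upper neighbouring source bins then receive integer phase multipliers that sum to the transposition order T. The names `source_bins` and `target_phase` are hypothetical. For target bin 7 and T=3 it reproduces the worked example: source bin 4 with multiplier 1 and source bin 5 with multiplier 2 (1·4 + 2·5 = 14 = 2·7).

```python
def source_bins(n, order):
    """Return [(source_bin, phase_multiplier), ...] for target bin n (hypothetical helper).

    Assumes the synthesis bin spacing is twice the analysis spacing, so target
    bin n sits at position 2 * n on the analysis grid. Multipliers sum to `order`.
    """
    position = 2 * n                          # target frequency in analysis-bin units
    lower = position // order                 # lower neighbouring source bin
    upper_weight = position - order * lower   # 0 <= upper_weight < order
    if upper_weight == 0:
        return [(lower, order)]               # direct counterpart: phase times the order
    return [(lower, order - upper_weight), (lower + 1, upper_weight)]


def target_phase(source_phases, n, order):
    """Combine source phases by adding the multiplied phases (cf. arrows 61, 62, 63)."""
    return sum(mult * source_phases[m] for m, mult in source_bins(n, order))


# Third order, k = 4: target bins 6 to 9 as in the worked example.
for n in range(6, 10):
    print(n, source_bins(n, 3))
# 6 [(4, 3)]
# 7 [(4, 1), (5, 2)]
# 8 [(5, 2), (6, 1)]
# 9 [(6, 3)]
```

For T=4 the same function yields either one source bin with multiplier 4 (even n) or two adjacent source bins with multiplier 2 each (odd n), matching the fourth order description. Because all multipliers are integers, the result does not depend on how the source phases are wrapped.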
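The amplitude options for an interpolated target bin can be sketched as below. The function name and the `mode` strings are hypothetical; `frac` is the fractional source position (2/3 for the source index 4.67 in the third order example), and linear interpolation is the variant described as advantageous. The random selection alternative is not shown.

```python
import math


def interpolated_amplitude(a_lower, a_upper, frac, mode="linear"):
    """Amplitude for a target bin lying between two adjacent source bins (hypothetical helper).

    frac is the fractional source position between the lower and upper bin.
    'linear' interpolates; the other modes are the alternatives listed in the text.
    """
    if mode == "linear":
        return (1.0 - frac) * a_lower + frac * a_upper
    if mode == "max":
        return max(a_lower, a_upper)
    if mode == "min":
        return min(a_lower, a_upper)
    if mode == "geometric":
        return math.sqrt(a_lower * a_upper)
    if mode == "arithmetic":
        return 0.5 * (a_lower + a_upper)
    raise ValueError("unknown mode: " + mode)


print(interpolated_amplitude(1.0, 4.0, 2 / 3))                 # approximately 3.0
print(interpolated_amplitude(1.0, 4.0, 0.5, mode="geometric"))  # 2.0
```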
The improved transient response of the transposer is obtained by means of frequency domain oversampling, which is implemented by using DFT kernels of length 1024 F and by zero padding the analysis and synthesis windows symmetrically to that length. Here, F is the frequency domain oversampling factor.
For complexity reasons, it is important to keep the amount of oversampling to a minimum. The underlying theory is therefore explained in the following with reference to a sequence of figures.
Consider the prototype transient signal, a Dirac pulse at time t=t0. Its Fourier transform has the phase −ωt0, which is linear in frequency; hence, multiplying the phase by T seems like the correct thing to do in order to obtain the transform of a pulse at t=Tt0. Indeed, a theoretical transposer with a window of infinite duration would give the correct stretch of a pulse. For a windowed analysis of finite duration, however, the situation is complicated by the fact that each analysis block has to be interpreted as one period of a periodic signal whose period equals the DFT size.
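The circular-periodicity problem can be reproduced with a small DFT experiment. This Python sketch (naive DFT, hypothetical function names) places a unit pulse at sample t0 of an n-point block, multiplies every DFT phase by T and resynthesizes; for simplicity, sample 0 is taken as the block origin instead of the window centre used in the text. The stretched pulse lands at T·t0 modulo n, so it wraps around when T·t0 ≥ n, while a 1.5 times longer block, standing in for a zero-padded one, keeps it in place.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) DFT; the inverse includes the 1/N scaling."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out


def transposed_pulse_position(t0, n, order):
    """Put a unit pulse at sample t0 of an n-point block, multiply all DFT phases
    by `order`, resynthesize, and return the index of the largest output sample."""
    x = [0.0] * n
    x[t0] = 1.0
    spectrum = dft(x)
    # An integer `order` makes the result independent of 2*pi phase wrapping.
    modified = [abs(c) * cmath.exp(1j * order * cmath.phase(c)) for c in spectrum]
    y = dft(modified, inverse=True)
    return max(range(n), key=lambda t: abs(y[t]))


print(transposed_pulse_position(3, 16, 2))  # 6: the pulse is moved to T*t0
print(transposed_pulse_position(5, 8, 2))   # 2: T*t0 = 10 wraps around the 8-point block
print(transposed_pulse_position(5, 12, 2))  # 10: a 1.5 times longer block avoids the wrap
```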
In
The problem occurs for the situation of
The beneficial effect of frequency domain oversampling is demonstrated by
Now, the period of the pulse trains is FL, and the undesired contributions to the pulse stretch can be cancelled by selecting a sufficiently large value of F. For any pulse at position t=t0<L/2, the undesired image at t=Tt0−FL has to be located to the left of the left edge of the synthesis window at t=−L/2. Since t0 can approach L/2, this is equivalent to TL/2−FL≦−L/2, leading to the rule F≧(T+1)/2.
A more quantitative analysis reveals that pre-echoes are still reduced when using a frequency domain oversampling factor slightly smaller than the value imposed by the inequality, simply because the windows consist of small values near their edges.
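The rule can be checked numerically. In this sketch (hypothetical function names), L is the window length, the synthesis window spans −L/2 to L/2, and the DFT size for a 1024-sample window is 1024·F. It merely evaluates the inequality; as just noted, slightly smaller factors can still reduce pre-echoes in practice.

```python
def min_oversampling_factor(order):
    """Smallest F satisfying T*L/2 - F*L <= -L/2, i.e. F >= (T + 1) / 2."""
    return (order + 1) / 2


def image_outside_window(t0, order, factor, window_len):
    """True if the unwanted image at T*t0 - F*L lies left of the window edge at -L/2."""
    return order * t0 - factor * window_len <= -window_len / 2


for T in (2, 3, 4):
    print(T, min_oversampling_factor(T), int(1024 * min_oversampling_factor(T)))
# 2 1.5 1536
# 3 2.0 2048
# 4 2.5 2560

print(image_outside_window(500, 2, 1.5, 1024))  # True
print(image_outside_window(500, 2, 1.0, 1024))  # False
```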
In the transpose as in
Since the oversampling is only needed in transient parts of the signal, transient detection is performed in the encoder, and a transient flag is sent to the decoder for each core coder frame to control the amount of oversampling in the decoder. When the oversampling is active, the factor F=1.5 is used at least for all transposer granules whose analysis window starts in the current core coder frame.
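The per-frame control can be sketched as follows. The function name, the frame length and the granule hop are hypothetical (sizes in samples), and the sketch implements only the stated minimum, i.e., oversampling for the granules whose analysis window starts in a flagged core coder frame.

```python
def granule_oversampling(transient_flags, frame_len, hop, factor=1.5):
    """Oversampling factor for each transposer granule (hypothetical helper).

    A granule uses `factor` when its analysis window starts inside a core coder
    frame whose transient flag is set, and 1.0 otherwise.
    """
    total = len(transient_flags) * frame_len
    factors = []
    for start in range(0, total, hop):
        frame_index = start // frame_len   # core coder frame containing the window start
        factors.append(factor if transient_flags[frame_index] else 1.0)
    return factors


print(granule_oversampling([False, True, False], frame_len=8, hop=4))
# [1.0, 1.0, 1.5, 1.5, 1.0, 1.0]
```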
In
Similarly, on the synthesis side, one could apply a specifically designed longer synthesis window in case of a transient event, which would set to zero the leading and trailing values of a frame generated by the inverse FFT processor 17a. However, it is advantageous to apply the same synthesis window and simply to discard values at the beginning and at the end of the block output by the inverse FFT (FFT−1) processor 17a, where the number of values discarded at each end corresponds to the number of zero-padded values.
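The bookkeeping on the synthesis side can be illustrated with a round trip through a naive DFT. In this sketch (hypothetical names, a 4-sample frame padded to 6 samples for F=1.5), no spectral modification is applied; in the transposer, the phase processing would take place between the forward and inverse transforms, and the unchanged synthesis window and the overlap-add would follow the trimming.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) DFT; the inverse includes the 1/N scaling."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out


def discard_padding(block, frame_len):
    """Drop the values corresponding to symmetric padding, keeping frame_len samples."""
    extra = len(block) - frame_len
    left = extra // 2          # must match the split used when padding
    return block[left:left + frame_len]


frame = [0.2, 0.7, 1.0, 0.7]              # hypothetical windowed frame
padded = [0.0] + frame + [0.0]            # F = 1.5: one zero on each side
resynth = [v.real for v in dft(dft(padded), inverse=True)]
print([round(v, 6) for v in discard_padding(resynth, len(frame))])
# [0.2, 0.7, 1.0, 0.7]
```

The number of samples discarded at the start must equal the number of zeros inserted there during padding; otherwise the resynthesized frame is shifted relative to the synthesis window.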
Additionally, the detection of a transient event performs a start index control via a start index control line 29 in
The transient is signaled for a frame that is used for generating the high frequency enhanced signal, i.e., a so-called SBR frame. The first portion is then an SBR frame containing a transient event, and the second portion of the input signal is a later SBR frame not containing a transient. Each window containing at least one sample value of this transient frame is therefore zero-padded. For example, when a frame has the length of one window and the transient event is a single sample, eight windows are transformed using the longer transform with padded values.
The present invention can also be considered as an apparatus for frequency domain transposition, where an adaptive frequency domain oversampling in a filterbank of combined transposers is performed, which is controlled by a transient detector.
Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus.
Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed.
Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein.
A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are performed by any hardware apparatus.
While this invention has been described in terms of several advantageous embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations, and equivalents as fall within the true spirit and scope of the present invention.
This application is a National Phase entry claiming priority to International Application No. PCT/EP2010/057130, filed May 25, 2010, which is incorporated herein by reference in its entirety, and additionally claims priority from U.S. Application No. 61/253,776, filed Oct. 21, 2009, which is also incorporated herein by reference in its entirety.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/EP2010/057130 | 5/25/2010 | WO | 00 | 7/12/2012 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2011/047886 | 4/28/2011 | WO | A |
Number | Name | Date | Kind |
---|---|---|---|
7835915 | Kim et al. | Nov 2010 | B2 |
8843378 | Herre et al. | Sep 2014 | B2 |
20040078194 | Liljeryd et al. | Apr 2004 | A1 |
20040125878 | Liljeryd et al. | Jul 2004 | A1 |
20090252356 | Goodwin et al. | Oct 2009 | A1 |
20090259906 | Garudadri et al. | Oct 2009 | A1 |
Number | Date | Country |
---|---|---|
1510662 | Jul 2004 | CN |
2012-501273 | Jan 2012 | JP |
2345506 | Jan 2009 | RU |
9013887 | Nov 1990 | WO |
WO 2009095169 | Aug 2009 | WO |
2009115211 | Sep 2009 | WO |
WO-2010108895 | Sep 2010 | WO |
Entry |
---|
Dietz, M., L. Liljeryd, K. Kjoerling, and O. Kunz, “Spectral Band Replication, a Novel Approach in Audio Coding”, 112th AES Convention, Munich, May 2002. |
Laroche, J., and M. Dolson, “Improved phase vocoder timescale modification of audio”, IEEE Trans. Speech and Audio Processing, vol. 7, no. 3, pp. 323-332. |
Nagel, F., and S. Disch, “A harmonic bandwidth extension method for audio codecs”, ICASSP International Conference on Acoustics, Speech and Signal Processing, IEEE CNF, Taipei, Taiwan, Apr. 2009. |
Puckette, M., “Phase-locked Vocoder”, IEEE ASSP Conference on Applications of Signal Processing to Audio and Acoustics, Mohonk, 1995. |
Nagel, F., et al., “A Phase Vocoder Driven Bandwidth Extension Method with Novel Transient Handling for Audio Codecs”, 126th AES Convention, Preprints, Munich, Germany, May 2009. |
Number | Date | Country |
---|---|---|
20120281859 A1 | Nov 2012 | US |
Number | Date | Country |
---|---|---|
61253776 | Oct 2009 | US |