The present application concerns noise filling in multichannel audio coding.
Modern frequency-domain speech/audio coding systems such as the Opus/Celt codec of the IETF [1], MPEG-4 (HE-)AAC [2] or, in particular, MPEG-D xHE-AAC (USAC) [3], offer means to code audio frames using either one long transform—a long block—or eight sequential short transforms—short blocks—depending on the temporal stationarity of the signal. In addition, for low-bitrate coding these schemes provide tools to reconstruct frequency coefficients of a channel using pseudorandom noise or lower-frequency coefficients of the same channel. In xHE-AAC, these tools are known as noise filling and spectral band replication, respectively.
However, for very tonal or transient stereophonic input, noise filling and/or spectral band replication alone limit the achievable coding quality at very low bitrates, mostly since too many spectral coefficients of both channels need to be transmitted explicitly.
Thus, it is an object of the present application to provide a concept for noise filling in multichannel audio coding that enables more efficient coding, especially at very low bitrates.
An embodiment may have a parametric frequency-domain audio decoder configured to: identify first scale factor bands of a spectrum of a first channel of a current frame of a multichannel audio signal, within which all spectral lines are quantized to zero, and second scale factor bands of the spectrum, within which at least one spectral line is quantized to non-zero; fill the spectral lines within a predetermined scale factor band of the first scale factor bands with noise generated using spectral lines of a previous frame of, or a different channel of the current frame of, the multichannel audio signal, with adjusting a level of the noise using a scale factor of the predetermined scale factor band; dequantize the spectral lines within the second scale factor bands using scale factors of the second scale factor bands; and inverse transform the spectrum obtained from the first scale factor bands filled with the noise the level of which is adjusted using the scale factors of the first scale factor bands, and the second scale factor bands dequantized using the scale factors of the second scale factor bands, so as to obtain a time domain portion of the first channel of the multichannel audio signal.
Another embodiment may have a parametric frequency-domain audio encoder configured to: quantize spectral lines of a spectrum of a first channel of a current frame of a multichannel audio signal using preliminary scale factors of scale factor bands within the spectrum; identify first scale factor bands in the spectrum within which all spectral lines are quantized to zero, and second scale factor bands of the spectrum within which at least one spectral line is quantized to non-zero, within a prediction and/or rate control loop, fill the spectral lines within a predetermined scale factor band of the first scale factor bands with noise generated using spectral lines of a previous frame of, or a different channel of the current frame of, the multichannel audio signal, with adjusting a level of the noise using an actual scale factor of the predetermined scale factor band; and signal the actual scale factor for the predetermined scale factor band instead of the preliminary scale factor.
According to still another embodiment, a parametric frequency-domain audio decoding method may have the steps of: identifying first scale factor bands of a spectrum of a first channel of a current frame of a multichannel audio signal, within which all spectral lines are quantized to zero, and second scale factor bands of the spectrum, within which at least one spectral line is quantized to non-zero; filling the spectral lines within a predetermined scale factor band of the first scale factor bands with noise generated using spectral lines of a previous frame of, or a different channel of the current frame of, the multichannel audio signal, with adjusting a level of the noise using a scale factor of the predetermined scale factor band; dequantizing the spectral lines within the second scale factor bands using scale factors of the second scale factor bands; and inverse transforming the spectrum obtained from the first scale factor bands filled with the noise the level of which is adjusted using the scale factors of the first scale factor bands, and the second scale factor bands dequantized using the scale factors of the second scale factor bands, so as to obtain a time domain portion of the first channel of the multichannel audio signal.
According to another embodiment, a parametric frequency-domain audio encoding method may have the steps of: quantizing spectral lines of a spectrum of a first channel of a current frame of a multi-channel audio signal using preliminary scale factors of scale factor bands within the spectrum; identifying first scale factor bands in the spectrum within which all spectral lines are quantized to zero, and second scale factor bands of the spectrum within which at least one spectral line is quantized to non-zero, within a prediction and/or rate control loop, filling the spectral lines within a predetermined scale factor band of the first scale factor bands with noise generated using spectral lines of a previous frame of, or a different channel of the current frame of, the multi-channel audio signal, with adjusting a level of the noise using an actual scale factor of the predetermined scale factor band; signaling the actual scale factor for the predetermined scale factor band instead of the preliminary scale factor.
Another embodiment may have a non-transitory digital storage medium, having stored thereon a computer program for performing a parametric frequency-domain audio decoding method having the steps of: identifying first scale factor bands of a spectrum of a first channel of a current frame of a multichannel audio signal, within which all spectral lines are quantized to zero, and second scale factor bands of the spectrum, within which at least one spectral line is quantized to non-zero; filling the spectral lines within a predetermined scale factor band of the first scale factor bands with noise generated using spectral lines of a previous frame of, or a different channel of the current frame of, the multichannel audio signal, with adjusting a level of the noise using a scale factor of the predetermined scale factor band; dequantizing the spectral lines within the second scale factor bands using scale factors of the second scale factor bands; and inverse transforming the spectrum obtained from the first scale factor bands filled with the noise the level of which is adjusted using the scale factors of the first scale factor bands, and the second scale factor bands dequantized using the scale factors of the second scale factor bands, so as to obtain a time domain portion of the first channel of the multichannel audio signal, when said computer program is run by a computer.
Another embodiment may have a non-transitory digital storage medium, having stored thereon a computer program for performing a parametric frequency-domain audio encoding method having the steps of: quantizing spectral lines of a spectrum of a first channel of a current frame of a multi-channel audio signal using preliminary scale factors of scale factor bands within the spectrum; identifying first scale factor bands in the spectrum within which all spectral lines are quantized to zero, and second scale factor bands of the spectrum within which at least one spectral line is quantized to non-zero, within a prediction and/or rate control loop, filling the spectral lines within a predetermined scale factor band of the first scale factor bands with noise generated using spectral lines of a previous frame of, or a different channel of the current frame of, the multi-channel audio signal, with adjusting a level of the noise using an actual scale factor of the predetermined scale factor band; signaling the actual scale factor for the predetermined scale factor band instead of the preliminary scale factor, when said computer program is run by a computer.
The present application is based on the finding that in multichannel audio coding, an improved coding efficiency may be achieved if the noise filling of zero-quantized scale factor bands of a channel is performed using noise filling sources other than artificially generated noise or spectral replica of the same channel. In particular, multichannel audio coding may be rendered more efficient by performing the noise filling based on noise generated using spectral lines from a previous frame of, or a different channel of the current frame of, the multichannel audio signal.
By using spectrally co-located spectral lines of a previous frame or spectrotemporally co-located spectral lines of other channels of the multichannel audio signal, it is possible to attain a more pleasant quality of the reconstructed multichannel audio signal, especially at very low bitrates, where the encoder's need to zero-quantize spectral lines approaches the point of zero-quantizing scale factor bands as a whole. Owing to the improved noise filling, an encoder may then, with less quality penalty, choose to zero-quantize more scale factor bands, thereby improving the coding efficiency.
In accordance with an embodiment of the present application, the source for performing the noise filling partially overlaps with a source used for performing complex-valued stereo prediction. In particular, the downmix of a previous frame may be used as the source for noise filling and co-used as a source for performing, or at least enhancing, the imaginary part estimation for performing the complex inter-channel prediction.
In accordance with embodiments, an existing multichannel audio codec is extended in a backward-compatible fashion so as to signal, on a frame-by-frame basis, the use of inter-channel noise filling. Specific embodiments outlined below, for example, extend xHE-AAC by a signalization in a backward-compatible manner, with the signalization switching on and off inter-channel noise filling exploiting un-used states of the conditionally coded noise filling parameter.
Embodiments of the present application are described below with respect to the figures, among which:
The frequency-domain audio decoder 10 of
The multichannel audio signal represented by data stream 30 may comprise two or more channels. In the following, the description of the embodiments of the present application concentrates on the stereo case, where the multichannel audio signal comprises merely two channels, but in principle the embodiments brought forward in the following may be readily transferred to alternative embodiments concerning multichannel audio signals, and their coding, comprising more than two channels.
As will further become clear from the description of
In order to account for the possibly time-varying relationship between the channels of the audio signal, the audio codec underlying decoder 10 of
Complex stereo prediction—another inter-channel redundancy exploitation tool—enables, in the spectral domain, predicting one channel's frequency-domain coefficients or spectral lines using spectrally co-located lines of another channel. More details concerning this are described below.
In order to facilitate the understanding of the subsequent description of
The transform windows for the transforms into which the frames are subdivided are illustrated in
Although the embodiments described further below could also be implemented in another fashion,
Using arrows in
In order to efficiently code the spectral line coefficients representing the spectrograms 40 and 42 via data stream 30 passed to decoder 10, same are quantized. In order to control the quantization noise spectrotemporally, the quantization step size is controlled via scale factors which are set in a certain spectrotemporal grid. In particular, within each of the sequence of spectra of each spectrogram, the spectral lines are grouped into spectrally consecutive non-overlapping scale factor groups.
That is, by way of the coding in data stream 30, the spectrograms 40 and 42 are each subdivided into a temporal sequence of spectra and each of these spectra is spectrally subdivided into scale factor bands, and for each scale factor band the data stream 30 codes or conveys information about a scale factor corresponding to the respective scale factor band. The spectral line coefficients falling into a respective scale factor band 50 are quantized using the respective scale factor or, as far as decoder 10 is concerned, may be dequantized using the scale factor of the corresponding scale factor band.
Before changing back again to
While the spectral line extractor 20 is configured to extract the spectral line data, i.e. the spectral line coefficients for frames 44 from data stream 30, the scale factor extractor 22 is configured to extract for each frame 44 the corresponding scale factors. To this end, extractors 20 and 22 may use entropy decoding. In accordance with an embodiment, the scale factor extractor 22 is configured to sequentially extract the scale factors of, for example, spectrum 46 in
For the sake of completeness only, it is noted that the spectral line extractor 20 extracts the spectral line coefficients with which the scale factor bands 50 are populated likewise using, for example, entropy coding and/or predictive coding. The entropy coding may use context-adaptivity based on spectral line coefficients in a spectrotemporal neighborhood of a currently decoded spectral line coefficient, and likewise, the prediction may be a spectral prediction, a temporal prediction or a spectrotemporal prediction predicting a currently decoded spectral line coefficient based on previously decoded spectral line coefficients in a spectrotemporal neighborhood thereof. For the sake of an increased coding efficiency, spectral line extractor 20 may be configured to perform the decoding of the spectral lines or line coefficients in tuples, which collect or group spectral lines along the frequency axis.
Thus, at the output of spectral line extractor 20 the spectral line coefficients are provided such as, for example, in units of spectra such as spectrum 46 collecting, for example, all of the spectral line coefficients of a corresponding frame, or alternatively collecting all of the spectral line coefficients of certain short transforms of a corresponding frame. At the output of scale factor extractor 22, in turn, corresponding scale factors of the respective spectra are output.
Scale factor band identifier 12 as well as dequantizer 14 have spectral line inputs coupled to the output of spectral line extractor 20, and dequantizer 14 and noise filler 16 have scale factor inputs coupled to the output of scale factor extractor 22. The scale factor band identifier 12 is configured to identify so-called zero-quantized scale factor bands within a current spectrum 46, i.e. scale factor bands within which all spectral lines are quantized to zero, such as scale factor band 50c in
The scale factor band identifier 12 informs the noise filler 16 about those scale factor bands that are zero-quantized scale factor bands. The dequantizer 14 uses the scale factors associated with an inbound spectrum 46 so as to dequantize, or scale, the spectral line coefficients of the spectral lines of spectrum 46 according to the associated scale factors, i.e. the scale factors associated with the scale factor bands 50. In particular, dequantizer 14 dequantizes and scales spectral line coefficients falling into a respective scale factor band with the scale factor associated with the respective scale factor band.
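The interplay of scale factor band identifier 12 and dequantizer 14 may be illustrated by the following minimal Python sketch. It is not the codec's actual procedure: a plain linear gain per band is assumed here for simplicity, whereas practical codecs typically use a nonlinear quantizer and logarithmically coded scale factors. The names band_offsets and gains are hypothetical and chosen for illustration only.

```python
import numpy as np

def identify_zero_bands(quantized, band_offsets):
    """Return the indices of bands in which every quantized line is zero."""
    return [b for b in range(len(band_offsets) - 1)
            if not np.any(quantized[band_offsets[b]:band_offsets[b + 1]])]

def dequantize(quantized, band_offsets, gains):
    """Scale each band's quantized lines by that band's gain.

    Assumption: a linear gain per band stands in for the real
    scale-factor-to-gain mapping, which is codec specific.
    """
    spectrum = np.zeros(len(quantized), dtype=float)
    for b in range(len(band_offsets) - 1):
        lo, hi = band_offsets[b], band_offsets[b + 1]
        spectrum[lo:hi] = gains[b] * quantized[lo:hi]
    return spectrum

# Hypothetical example: 8 lines grouped into 3 bands.
quantized = np.array([1, 0, 0, 0, 0, -2, 0, 0])
band_offsets = [0, 2, 5, 8]
zero_bands = identify_zero_bands(quantized, band_offsets)
dequantized = dequantize(quantized, band_offsets, [2.0, 3.0, 0.5])
```

In this example only the middle band is reported as zero-quantized; the third band contains zero lines as well, but is not a zero-quantized band because one of its lines is non-zero.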
The noise filler 16 obtains the information on the zero-quantized scale factor bands which form the subject of the following noise filling, the dequantized spectrum as well as the scale factors of at least those scale factor bands identified as zero-quantized scale factor bands and a signalization obtained from data stream 30 for the current frame revealing whether inter-channel noise filling is to be performed for the current frame.
The inter-channel noise filling process described in the following example actually involves two types of noise filling, namely the insertion of a noise floor 54 pertaining to all spectral lines having been quantized to zero irrespective of their potential membership to any zero-quantized scale factor band, and the actual inter-channel noise filling procedure. Although this combination is described hereinafter, it is to be emphasized that the noise floor insertion may be omitted in accordance with an alternative embodiment. Moreover, the signalization concerning the noise filling switch-on and switch-off relating to the current frame and obtained from data stream 30 could relate to the inter-channel noise filling only, or could control the combination of both noise filling sorts together.
As far as the noise floor insertion is concerned, noise filler 16 could operate as follows. In particular, noise filler 16 could employ artificial noise generation such as a pseudorandom number generator or some other source of randomness in order to fill spectral lines, the spectral line coefficients of which were zero. The level of the noise floor 54 thus inserted at the zero-quantized spectral lines could be set according to an explicit signaling within data stream 30 for the current frame or the current spectrum 46. The “level” of noise floor 54 could be determined using a root-mean-square (RMS) or energy measure for example.
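As a rough illustration of such a noise floor insertion, the following Python sketch fills all zero-quantized lines with pseudorandom values and normalizes them to a signaled RMS level. The function name, the uniform distribution, and the parameter floor_rms are assumptions made for this example; the source only states that a pseudorandom source and an explicitly signaled level could be used.

```python
import numpy as np

def insert_noise_floor(spectrum, quantized, floor_rms, seed=0):
    """Fill lines whose quantized value is zero with pseudorandom noise.

    Assumptions: uniform noise and RMS normalization over all filled
    lines; floor_rms stands in for the level signaled in the data stream.
    """
    rng = np.random.default_rng(seed)
    out = np.asarray(spectrum, dtype=float).copy()
    zero_lines = np.asarray(quantized) == 0
    count = int(zero_lines.sum())
    if count == 0 or floor_rms <= 0:
        return out
    noise = rng.uniform(-1.0, 1.0, count)
    rms = np.sqrt(np.mean(noise ** 2))
    if rms > 0:
        noise *= floor_rms / rms  # bring the filled lines to the target RMS
    out[zero_lines] = noise
    return out

# Hypothetical example: non-zero lines stay untouched.
quantized = np.array([3, 0, 0, -1, 0, 0, 0, 2])
spectrum = 0.5 * quantized
with_floor = insert_noise_floor(spectrum, quantized, floor_rms=0.25)
```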
The noise floor insertion thus represents a kind of pre-filling for those scale factor bands having been identified as zero-quantized ones such as scale factor band 50d in
In particular, in order to perform the inter-channel noise filling 56, noise filler 16 obtains a spectrally co-located portion of the other channel's spectrum 48, in a state already largely or fully decoded, and copies the obtained portion of spectrum 48 into the zero-quantized scale factor band to which this portion was spectrally co-located, scaled in such a manner that the resulting overall noise level within that zero-quantized scale factor band—derived by an integration over the spectral lines of the respective scale factor band—equals the aforementioned fill-up target level obtained from the zero-quantized scale factor band's scale factor. By this measure, the tonality of the noise filled into the respective zero-quantized scale factor band is improved in comparison to artificially generated noise such as the one forming the basis of the noise floor 54, and is also better than an uncontrolled spectral copying/replication from very-low-frequency lines within the same spectrum 46.
To be even more precise, the noise filler 16 locates, for a current band such as 50d, a spectrally co-located portion within spectrum 48 of the other channel, scales the spectral lines thereof depending on the scale factor of the zero-quantized scale factor band 50d in a manner just described involving, optionally, some additional offset or noise factor parameter contained in data stream 30 for the current frame or spectrum 46, so that the result thereof fills up the respective zero-quantized scale factor band 50d up to the desired level as defined by the scale factor of the zero-quantized scale factor band 50d. In the present embodiment, this means that the filling-up is done in an additive manner relative to the noise floor 54.
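The additive filling of one zero-quantized scale factor band may be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the normative procedure: the band level is measured as the sum of squared lines, target_rms stands in for the fill-up target level derived from the band's scale factor, and the final renormalization step is an assumption added so that the target is met even though the noise floor and the copied lines are generally correlated to some degree.

```python
import numpy as np

def inter_channel_fill(band, source, target_rms):
    """Add scaled co-located source lines to a pre-filled band.

    band:       lines of the zero-quantized band, possibly holding a noise floor
    source:     spectrally co-located lines of the other channel's spectrum
    target_rms: hypothetical target level derived from the band's scale factor
    """
    band = np.asarray(band, dtype=float).copy()
    source = np.asarray(source, dtype=float)
    target_energy = len(band) * target_rms ** 2
    energy = np.sum(band ** 2)
    energy_src = np.sum(source ** 2)
    if energy < target_energy and energy_src > 0:
        # Add the source on top of the noise floor (additive filling).
        band += source * np.sqrt((target_energy - energy) / energy_src)
    filled = np.sum(band ** 2)
    if filled > 0:
        # Assumed correction: rescale so the band meets the target exactly.
        band *= np.sqrt(target_energy / filled)
    return band

noise_floor = np.array([0.1, -0.1, 0.1, -0.1])
other_channel = np.array([1.0, 2.0, -1.0, 0.5])
filled_band = inter_channel_fill(noise_floor, other_channel, target_rms=0.5)
```

If the co-located source portion is itself empty, the sketch falls back to rescaling the noise floor alone, which is one possible design choice among several.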
In accordance with a simplified embodiment, the resulting noise-filled spectrum 46 would directly be input into the input of inverse transformer 18 so as to obtain, for each transform window to which the spectral line coefficients of spectrum 46 belong, a time-domain portion of the respective channel audio time-signal, whereupon (not shown in
However, for the sake of completeness it is to be noted that further processing may be performed onto the noise-filled spectrum. As shown in
With or without inverse TNS filtering, complex stereo predictor 24 could then treat the spectrum as a prediction residual of an inter-channel prediction. More specifically, inter-channel predictor 24 could use a spectrally co-located portion of the other channel to predict the spectrum 46 or at least a subset of the scale factor bands 50 thereof. The complex prediction process is illustrated in
The source of inter-channel prediction may, as indicated in
As already noted in the preceding description, the channel to which spectrum 46 belongs may be an MS coded channel, or may be a loudspeaker related channel, such as a left or right channel of a stereo audio signal. Accordingly, optionally an MS decoder 26 subjects the optionally inter-channel predicted spectrum 46 to MS decoding, in that same performs, per spectral line or spectrum 46, an addition or subtraction with spectrally corresponding spectral lines of the other channel corresponding to spectrum 48. For example, although not shown in
It is noted that, optionally, the MS decoding may be performed in a manner globally concerning the whole spectrum 46, or being individually activatable by data stream 30 in units of, for example, scale factor bands 50. In other words, MS decoding may be switched on or off using respective signalization in data stream 30 in units of, for example, frames or some finer spectrotemporal resolution such as, for example, individually for the scale factor bands of the spectra 46 and/or 48 of the spectrograms 40 and/or 42, wherein it is assumed that identical boundaries of both channels' scale factor bands are defined.
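A band-wise switchable MS decoding of this kind could, for instance, look like the following Python sketch. The plain sum/difference without any normalization factor, as well as the names ms_used and band_offsets, are assumptions made for illustration; actual codecs may apply different scaling conventions.

```python
import numpy as np

def ms_decode(ch0, ch1, band_offsets, ms_used):
    """Convert mid/side lines to left/right lines in bands flagged by ms_used.

    Assumption: left = mid + side and right = mid - side, with identical
    band boundaries in both channels. Unflagged bands pass through.
    """
    ch0 = np.asarray(ch0, dtype=float)
    ch1 = np.asarray(ch1, dtype=float)
    left, right = ch0.copy(), ch1.copy()
    for b, used in enumerate(ms_used):
        if used:
            lo, hi = band_offsets[b], band_offsets[b + 1]
            left[lo:hi] = ch0[lo:hi] + ch1[lo:hi]
            right[lo:hi] = ch0[lo:hi] - ch1[lo:hi]
    return left, right

# Hypothetical example: MS decoding active in the first of two bands only.
left, right = ms_decode([1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 1.0, 1.0],
                        band_offsets=[0, 2, 4], ms_used=[True, False])
```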
As illustrated in
Thus, the spectrum 46 arriving at the input of inverse transformer 18 may have been subject to further processing as just described. Again, the above description is not meant to be understood in such a manner that all of these optional tools are to be present either concurrently or not. These tools may be present in decoder 10 partially or collectively.
In any case, the resulting spectrum at the inverse transformer's input represents the final reconstruction of the channel's output signal and forms the basis of the aforementioned downmix for the current frame which serves, as described with respect to the complex prediction 58, as the basis for the potential imaginary part estimation for the next frame to be decoded. It may further serve as the final reconstruction for inter-channel predicting another channel than the one which the elements except 34 in
The respective downmix is formed by downmix provider 31 by combining this final spectrum 46 with the respective final version of spectrum 48. The latter entity, i.e. the respective final version of spectrum 48, formed the basis for the complex inter-channel prediction in predictor 24.
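One simple way to combine the final spectra into a downmix is a sum on a spectral line by spectral line basis, optionally divided by the number of channels. The following Python sketch shows this under the assumption that both spectra have the same number of lines; the function name and the average switch are hypothetical.

```python
import numpy as np

def downmix(spectra, average=True):
    """Sum temporally co-located spectra line by line.

    Assumption: all spectra share the same length. With average=True the
    sum is divided by the number of channels (two in the stereo case).
    """
    stacked = np.stack([np.asarray(s, dtype=float) for s in spectra])
    total = stacked.sum(axis=0)
    return total / len(spectra) if average else total

spectrum_a = np.array([1.0, -2.0, 0.0, 4.0])
spectrum_b = np.array([3.0, 2.0, 1.0, 0.0])
dmx = downmix([spectrum_a, spectrum_b])
```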
The downmix provider 31 is co-used by both portions 70 and 34 and receives temporally co-located spectra 48 and 46 of spectrograms 40 and 42 so as to form a downmix based thereon by summing up these spectra on a spectral line by spectral line basis, potentially with forming the average therefrom by dividing the sum at each spectral line by the number of channels downmixed, i.e. two in the case of
That is, while in
Concluding the above discussion of embodiments describing inter-channel noise filling in an audio decoder, it should be evident to readers skilled in the art that, before adding the grabbed-out spectrally or temporally co-located portion of the “source” spectrum to the spectral lines of the “target” scale factor band, a certain pre-processing may be applied to the “source” spectral lines without digressing from the general concept of the inter-channel filling. In particular, it may be beneficial to apply a filtering operation such as, for example, a spectral flattening, or tilt removal, to the spectral lines of the “source” region to be added to the “target” scale factor band, like 50d in
Thus, the above embodiments concerned a concept of inter-channel noise filling. In the following, it is described how this concept may be built into an existing codec, namely xHE-AAC, in a semi-backward-compatible manner. In particular, an advantageous implementation of the above embodiments is described hereinafter, according to which a stereo filling tool is built into an xHE-AAC based audio codec with semi-backward-compatible signaling. With the implementation described further below, stereo filling of transform coefficients in either one of the two channels becomes feasible for certain stereo signals in an audio codec based on MPEG-D xHE-AAC (USAC), thereby improving the coding quality of certain audio signals, especially at low bitrates. The stereo filling tool is signaled semi-backward-compatibly such that legacy xHE-AAC decoders can parse and decode the bitstreams without obvious audio errors or drop-outs. As was already described above, a better overall quality can be attained if an audio coder can use a combination of previously decoded/quantized coefficients of two stereo channels to reconstruct zero-quantized (non-transmitted) coefficients of either one of the currently decoded channels. It is therefore desirable to allow such stereo filling (from previous to present channel coefficients) in addition to spectral band replication (from low- to high-frequency channel coefficients) and noise filling (from an uncorrelated pseudorandom source) in audio coders, especially xHE-AAC or coders based on it.
To allow coded bitstreams with stereo filling to be read and parsed by legacy xHE-AAC decoders, the desired stereo filling tool shall be used in a semi-backward compatible way: its presence should not cause legacy decoders to stop—or not even start—decoding. Readability of the bitstream by xHE-AAC infrastructure can also facilitate market adoption.
To achieve the desired semi-backward compatibility for a stereo filling tool in the context of xHE-AAC or its potential derivatives, the following implementation involves the functionality of stereo filling as well as the ability to signal the same via syntax in the data stream that is actually concerned with noise filling. The stereo filling tool would work in line with the above description. In a channel pair with common window configuration, a coefficient of a zero-quantized scale factor band is, when the stereo filling tool is activated, as an alternative (or, as described, in addition) to noise filling, reconstructed by a sum or difference of the previous frame's coefficients in either one of the two channels, advantageously the right channel. Stereo filling is performed similarly to noise filling. The signaling would be done via the noise filling signaling of xHE-AAC. Stereo filling is conveyed by means of the 8-bit noise filling side information. This is feasible because the MPEG-D USAC standard [4] states that all 8 bits are transmitted even if the noise level to be applied is zero. In that situation, some of the noise-fill bits can be reused for the stereo filling tool.
Semi-backward compatibility regarding bitstream parsing and playback by legacy xHE-AAC decoders is ensured as follows. Stereo filling is signaled via a noise level of zero (i.e. the first three noise-fill bits all having a value of zero) followed by five non-zero bits (which traditionally represent a noise offset) containing side information for the stereo filling tool as well as the missing noise level. Since a legacy xHE-AAC decoder disregards the value of the 5-bit noise offset if the 3-bit noise level is zero, the presence of the stereo filling tool signaling only has an effect on the noise filling in the legacy decoder: noise filling is turned off since the first three bits are zero, and the remainder of the decoding operation runs as intended. In particular, stereo filling is not performed because it is operated like the noise-fill process, which is deactivated. Hence, a legacy decoder still offers “graceful” decoding of the enhanced bitstream 30 because it does not need to mute the output signal or even abort the decoding upon reaching a frame with stereo filling switched on. Naturally, however, it is unable to provide a correct, intended reconstruction of stereo-filled line coefficients, leading to a deteriorated quality in affected frames in comparison with decoding by an appropriate decoder capable of dealing with the new stereo filling tool. Nonetheless, assuming the stereo filling tool is used as intended, i.e. only on stereo input at low bitrates, the quality through legacy xHE-AAC decoders should be better than if the affected frames dropped out due to muting or led to other obvious playback errors.
In the following, a detailed description is presented of how a stereo filling tool may be built into the xHE-AAC codec as an extension.
When built into the standard, the stereo filling tool could be described as follows. In particular, such a stereo filling (SF) tool would represent a new tool in the frequency-domain (FD) part of MPEG-H 3D-audio. In line with the above discussion, the aim of such a stereo filling tool would be the parametric reconstruction of MDCT spectral coefficients at low bitrates, similar to what already can be achieved with noise filling according to section 7.2 of the standard described in [4]. However, unlike noise filling, which employs a pseudorandom noise source for generating MDCT spectral values of any FD channel, SF would be available also to reconstruct the MDCT values of the right channel of a jointly coded stereo pair of channels using a downmix of the left and right MDCT spectra of the previous frame. SF, in accordance with the implementation set forth below, is signaled semi-backward-compatibly by means of the noise filling side information which can be parsed correctly by a legacy MPEG-D USAC decoder.
The tool description could be as follows. When SF is active in a joint-stereo FD frame, the MDCT coefficients of empty (i.e. fully zero-quantized) scale factor bands of the right (second) channel, such as 50d, are replaced by a sum or difference of the corresponding decoded left and right channels' MDCT coefficients of the previous frame (if FD). If legacy noise filling is active for the second channel, pseudorandom values are also added to each coefficient. The resulting coefficients of each scale factor band are then scaled such that the RMS (root of the mean coefficient square) of each band matches the value transmitted by way of that band's scale factor. See section 7.3 of the standard in [4].
Some operational constraints could be provided for the use of the new SF tool in the MPEG-D USAC standard. For example, the SF tool may be available for use only in the right FD channel of a common FD channel pair, i.e. a channel pair element transmitting a StereoCoreToolInfo() with common_window==1. Besides, due to the semi-backward-compatible signaling, the SF tool may be available for use only when noiseFilling==1 in the syntax container UsacCoreConfig(). If either of the channels in the pair is in LPD core_mode, the SF tool may not be used, even if the right channel is in the FD mode.
The following terms and definitions are used hereafter in order to more clearly describe the extension of the standard as described in [4].
In particular, as far as the data elements are concerned, the following data element is newly introduced:
Further, new help elements are introduced:
The decoding process of the standard would be extended in the following manner. In particular, the decoding of a joint-stereo coded FD channel with the SF tool being activated is executed in three sequential steps as follows:
First of all, the decoding of the stereo_filling flag would take place. stereo_filling does not represent an independent bit-stream element but is derived from the noise-fill elements, noise_offset and noise_level, in a UsacChannelPairElement() and the common_window flag in StereoCoreToolInfo(). If noiseFilling==0 or common_window==0 or the current channel is the left (first) channel in the element, stereo_filling is 0, and the stereo filling process ends. Otherwise,
In other words, if noise_level==0, noise_offset contains the stereo_filling flag followed by 4 bits of noise filling data, which are then rearranged. Since this operation alters the values of noise_level and noise_offset, it needs to be performed before the noise filling process of section 7.2. Moreover, the above pseudo-code is not executed in the left (first) channel of a UsacChannelPairElement() or any other element.
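The derivation of the flag described in this step may be sketched as follows. The masks `14` and `1` for the rearranged noise_level and noise_offset are those quoted elsewhere in this description; the mask `16` for the flag is an assumption inferred from the statement that the flag is followed by 4 bits of noise filling data within the 5-bit noise_offset field. The function name and argument list are hypothetical.

```python
def decode_stereo_filling(noise_filling, common_window, is_right_channel,
                          noise_level, noise_offset):
    """Return (stereo_filling, noise_level, noise_offset) after rearrangement.

    Sketch only: assumes the flag occupies the most significant bit
    (mask 16) of a 5-bit noise_offset field.
    """
    # Tool not usable: left channel, no common window, or noise filling off.
    if not (noise_filling and common_window and is_right_channel):
        return 0, noise_level, noise_offset
    # A non-zero noise_level means plain legacy noise filling data.
    if noise_level != 0:
        return 0, noise_level, noise_offset
    # noise_level == 0: noise_offset carries the flag plus 4 bits of data.
    stereo_filling = (noise_offset & 16) // 16
    new_level = (noise_offset & 14) // 2     # 3 bits of noise level
    new_offset = (noise_offset & 1) * 16     # 1 bit of noise offset
    return stereo_filling, new_level, new_offset
```

Because the returned noise_level and noise_offset replace the transmitted ones, such a function would have to run before the noise filling process, as stated above.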
Then, the calculation of downmix_prev would take place.
downmix_prev[ ], the spectral downmix which is to be used for stereo filling, is identical to the dmx_re_prev[ ] used for the MDST spectrum estimation in complex stereo prediction (section 7.7.2.3). This means that
Consequently, the previous downmix only has to be computed once for both tools, saving complexity. The only difference between downmix_prev[ ] and dmx_re_prev[ ] in section 7.7.2 is the behavior when complex stereo prediction is not currently used, or when it is active but use_prev_frame==0. In that case, downmix_prev[ ] is computed for stereo filling decoding according to section 7.7.2.3 even though dmx_re_prev[ ] is not needed for complex stereo prediction decoding and is, therefore, undefined/zero.
Thereafter, the stereo filling of empty scale factor bands would be performed.
If stereo_filling==1, the following procedure is carried out after the noise filling process in all initially empty scale factor bands sfb[ ] below max_sfb_ste, i.e. all bands in which all MDCT lines were quantized to zero. First, the energies of the given sfb[ ] and the corresponding lines in downmix_prev[ ] are computed via sums of the line squares. Then, given sfbWidth containing the number of lines per sfb[ ],
for the spectrum of each group window. Then the scale factors are applied onto the resulting spectrum as in section 7.3, with the scale factors of the empty bands being processed like regular scale factors.
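One possible realization of the band filling step is sketched below for a single band of a single window. It rests on an assumption that is not stated explicitly in the text above: that each filled band is brought to unit RMS (an energy equal to sfbWidth) before the scale factors are applied, which would be consistent with the scale factor subsequently setting the band's RMS. The function name, the flat list layout and the small `eps` guard are hypothetical.

```python
import math


def stereo_fill_band(spectrum, downmix_prev, start, width, eps=1e-8):
    """Fill one initially empty band in place (sketch, assumed unit-RMS target).

    spectrum     -- MDCT lines of the current window (after noise filling)
    downmix_prev -- co-located lines of the previous frame's downmix
    start, width -- first line index and number of lines (sfbWidth) of the band
    """
    stop = start + width
    # Energies via sums of the line squares, as described above.
    energy_sfb = sum(x * x for x in spectrum[start:stop])
    energy_dmx = sum(x * x for x in downmix_prev[start:stop])
    if energy_sfb >= width:
        return  # band already at or above the assumed target energy
    if energy_dmx > 0.0:
        # Add scaled downmix lines to supply the missing energy.
        fac_dmx = math.sqrt((width - energy_sfb) / energy_dmx)
        for i in range(start, stop):
            spectrum[i] += fac_dmx * downmix_prev[i]
    # Noise and downmix may be correlated, so renormalize if needed.
    energy = sum(x * x for x in spectrum[start:stop])
    if energy > 0.0 and energy != width:
        gain = math.sqrt(width / (energy + eps))
        for i in range(start, stop):
            spectrum[i] *= gain
```

In this sketch a band for which both the noise-filled lines and the previous downmix are zero is left untouched, so that it remains empty after scale factor application.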
An alternative to the above extension of the xHE-AAC standard would use an implicit semi-backward-compatible signaling method.
The above implementation in the xHE-AAC code framework describes an approach which employs one bit in a bitstream to signal usage of the new stereo filling tool, contained in stereo_filling, to a decoder in accordance with
In cases where strict independence between the legacy and the inventive bitstream data is not required and the inventive signal is a binary decision, the explicit transmission of a signaling bit can be avoided, and said binary decision can be signaled by the presence or absence of what may be called implicit semi-backward-compatible signaling. Taking again the above embodiment as an example, the usage of stereo filling could be transmitted by simply employing the new signaling: if noise_level is zero and, at the same time, noise_offset is not zero, the stereo_filling flag is set equal to 1. If both noise_level and noise_offset are not zero, stereo_filling is equal to 0. A dependency of this implicit signal on the legacy noise-fill signal occurs when both noise_level and noise_offset are zero. In this case, it is unclear whether legacy or new SF implicit signaling is being used. To avoid such ambiguity, the value of stereo_filling is defined in advance. In the present example, it is appropriate to define stereo_filling=0 if the noise filling data consists of all zeros, since this is what legacy encoders without stereo filling capability signal when noise filling is not to be applied in a frame.
The issue which remains to be solved in the case of implicit semi-backward-compatible signaling is how to signal stereo_filling==1 and no noise filling at the same time. As explained, the noise filling data must not be all-zero, and if a noise magnitude of zero is requested, noise_level ((noise_offset & 14)/2 as mentioned above) equals 0. This leaves only a noise_offset ((noise_offset & 1)*16 as mentioned above) greater than 0 as a solution. The noise_offset, however, is considered in case of stereo filling when applying the scale factors, even if noise_level is zero. Fortunately, an encoder can compensate for the fact that a noise_offset of zero might not be transmittable by altering the affected scale factors such that upon bitstream writing, they contain an offset which is undone in the decoder via noise_offset. This allows said implicit signaling in the above embodiment at the cost of a potential increase in scale factor data rate. Hence, the signaling of stereo filling in the pseudo-code of the above description could be changed as follows, using the saved SF signaling bit to transmit noise_offset with 2 bits (4 values) instead of 1 bit:
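The implicit variant may be illustrated by the following sketch. The text only fixes that stereo_filling is 1 when noise_level is zero and noise_offset is non-zero, and that noise_offset is then carried with 2 bits instead of 1. The concrete bit layout used here (upper 3 bits for noise_level via mask `28`, lower 2 bits for noise_offset via mask `3`, and a step size of 8) is an assumption made for illustration, as are the function name and argument list.

```python
def decode_stereo_filling_implicit(noise_filling, common_window,
                                   is_right_channel, noise_level, noise_offset):
    """Return (stereo_filling, noise_level, noise_offset), implicit signaling.

    Sketch only: the masks 28 and 3 and the factor 8 are assumed, not
    taken from the text.
    """
    if (noise_filling and common_window and is_right_channel
            and noise_level == 0 and noise_offset > 0):
        # No explicit flag bit: all 5 bits carry noise filling data.
        new_level = (noise_offset & 28) // 4   # assumed 3-bit noise level
        new_offset = (noise_offset & 3) * 8    # assumed 2-bit noise offset
        return 1, new_level, new_offset
    # All-zero noise filling data, or legacy data, means no stereo filling.
    return 0, noise_level, noise_offset
```

Under this assumed layout, stereo filling without noise filling is requested with a zero level field and a non-zero offset field, for example a transmitted noise_offset of 1, which yields a level of 0 and an offset of 8 that the encoder would compensate in the affected scale factors.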
For the sake of completeness,
At the output of transformer 102 the spectral lines and scale factors are unquantized and substantially no coding loss has occurred yet. The spectrogram output by transformer 102 enters a quantizer 108, which is configured to quantize the spectral lines of the spectrogram output by transformer 102, spectrum by spectrum, setting and using preliminary scale factors of the scale factor bands. That is, at the output of quantizer 108, preliminary scale factors and corresponding spectral line coefficients result, and a sequence of a noise filler 16′, an optional inverse TNS filter 28a′, inter-channel predictor 24′, MS decoder 26′ and inverse TNS filter 28b′ are sequentially connected so as to provide the encoder 100 of
The encoder 100 may use the information on the available reconstructed, final version of the spectra in order to perform inter-frame spectral prediction such as the aforementioned possible version of performing inter-channel prediction using an imaginary part estimation, and/or in order to perform rate control, i.e. in order to determine, within a rate control loop, that the possible parameters finally coded into data stream 30 by encoder 100 are set in a rate/distortion optimal sense.
For example, one such parameter set in such a prediction loop and/or rate control loop of encoder 100 is, for each zero-quantized scale factor band identified by identifier 12′, the scale factor of the respective scale factor band which has merely been preliminarily set by quantizer 108. In a prediction and/or rate control loop of encoder 100, the scale factor of the zero-quantized scale factor bands is set in some psychoacoustically or rate/distortion optimal sense so as to determine the aforementioned target noise level along with, as described above, an optional modification parameter also conveyed by the data stream for the corresponding frame to the decoder side. It should be noted that this scale factor may be computed using only the spectral lines of the spectrum and channel to which it belongs (i.e. the “target” spectrum, as described earlier) or, alternatively, may be determined using both the spectral lines of the “target” channel spectrum and, in addition, the spectral lines of the other channel spectrum or the downmix spectrum from the previous frame (i.e. the “source” spectrum, as introduced earlier) obtained from downmix provider 31′. In particular to stabilize the target noise level and to reduce temporal level fluctuations in the decoded audio channels onto which the inter-channel noise filling is applied, the target scale factor may be computed using a relation between an energy measure of the spectral lines in the “target” scale factor band, and an energy measure of the co-located spectral lines in the corresponding “source” region. Finally, as noted above, this “source” region may originate from a reconstructed, final version of another channel or the previous frame's downmix, or if the encoder complexity is to be reduced, the original, unquantized version of same other channel or the downmix of original, unquantized versions of the previous frame's spectra.
Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. The data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitory.
A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
A further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.
In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods may be performed by any hardware apparatus.
While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which will be apparent to others skilled in the art and which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations, and equivalents as fall within the true spirit and scope of the present invention.
References
[1] Internet Engineering Task Force (IETF), RFC 6716, “Definition of the Opus Audio Codec,” Int. Standard, September 2012. Available online at http://tools.ietf.org/html/rfc6716.
[2] International Organization for Standardization, ISO/IEC 14496-3:2009, “Information Technology—Coding of audio-visual objects—Part 3: Audio,” Geneva, Switzerland, August 2009.
[3] M. Neuendorf et al., “MPEG Unified Speech and Audio Coding—The ISO/MPEG Standard for High-Efficiency Audio Coding of All Content Types,” in Proc. 132nd AES Convention, Budapest, Hungary, April 2012. Also to appear in the Journal of the AES, 2013.
[4] International Organization for Standardization, ISO/IEC 23003-3:2012, “Information Technology—MPEG audio—Part 3: Unified speech and audio coding,” Geneva, January 2012.
Number | Date | Country | Kind |
---|---|---|---|
13177356 | Jul 2013 | EP | regional |
13189450 | Oct 2013 | EP | regional |
This application is a continuation of copending U.S. patent application Ser. No. 16/594,867, filed Oct. 7, 2019, which is a continuation of copending U.S. patent application Ser. No. 16/277,941, filed Feb. 15, 2019, which in turn is a continuation of copending U.S. patent application Ser. No. 15/002,375, filed Jan. 20, 2016, which in turn is a continuation of copending International Application No. PCT/EP2014/065550, filed Jul. 18, 2014, which are both incorporated herein by reference in their entirety, and additionally claims priority from European Application No. 13177356.6, filed Jul. 22, 2013, and from European Application No. 13189450.3, filed Oct. 18, 2013, which are also incorporated herein by reference in their entirety.
Number | Name | Date | Kind |
---|---|---|---|
5692102 | Pan | Nov 1997 | A |
7539612 | Thumpudi et al. | May 2009 | B2 |
20040028125 | Sato | Feb 2004 | A1 |
20090006103 | Koishida et al. | Jan 2009 | A1 |
20100228556 | Bahn | Sep 2010 | A1 |
20110015768 | Lim et al. | Jan 2011 | A1 |
20110170711 | Rettelbach et al. | Jul 2011 | A1 |
20120226505 | Lin et al. | Sep 2012 | A1 |
20130013321 | Oh et al. | Jan 2013 | A1 |
Number | Date | Country |
---|---|---|
101310328 | Nov 2008 | CN |
101933086 | Dec 2010 | CN |
102341846 | Feb 2012 | CN |
2002156998 | May 2002 | JP |
20080092823 | Oct 2008 | KR |
20120098755 | Sep 2012 | KR |
2011104006 | Aug 2012 | RU |
2005096508 | Oct 2005 | WO |
2011042464 | Apr 2011 | WO |
2011114933 | Sep 2011 | WO |
2012037515 | Mar 2012 | WO |
Entry |
---|
Helmrich, Christian R, et al., “Efficient transform coding of two-channel audio signals by means of complex-valued stereo prediction”, Acoustics, Speech and Signal Processing (ICASSP), 2011, IEEE International Conference ON, IEEE, XP032000783, DOI: 10.1109/ICASSP.2011.5946449, ISBN: 978-1-4577-0538-0, May 22, 2011, pp. 497-500. |
ISO/IEC, FDIS 23003-3:2011 (E), “Information technology—MPEG audio technologies—Part 3: Unified speech and audio coding”, ISO/IEC JTC 1/SC 29/WG 11, Sep. 20, 2011, Part 1 of 3. |
ISO/IEC, FDIS 23003-3:2011 (E), “Information technology—MPEG audio technologies—Part 3: Unified speech and audio coding”, ISO/IEC JTC 1/SC 29/WG 11, Sep. 20, 2011, Part 2 of 3. |
ISO/IEC, FDIS 23003-3:2011 (E), “Information technology—MPEG audio technologies—Part 3: Unified speech and audio coding”, ISO/IEC JTC 1/SC 29/WG 11, Sep. 20, 2011, Part 3 of 3. |
ISO/IEC 14496-3, “Information technology—Coding of audio-visual objects/ Part 3: Audio”, ISO/IEC 2009, 2009, 1416 pp. |
ISO/IEC 23003-3, “Information Technology—MPEG audio technologies—Part 3: Unified Speech and Audio Coding”, International Standard, ISO/IEC FDIS 23003-3, Nov. 23, 2011, 286 pp. |
Neuendorf, Max, et al., “MPEG Unified Speech and Audio Coding—The ISO/MPEG Standard for High-Efficiency Audio Coding of all Content Types”, Audio Engineering Society Convention Paper 8654, Presented at the 132nd Convention, pp. 1-22. |
Pan, Davis, “A Tutorial on MPEG/Audio Compression”, IEEE Multimedia Journal, 12 pp. |
Valin, JM, et al., “Definition of the Opus Audio Codec”, IETF, pp. 1-326. |
Yang, Dai Tracy, et al., “High-Fidelity Multichannel Audio Coding”, EURASIP Book Series on Signal Processing and Communications. Hindawi Publishing Corporation., 2006, pp. 42-50. |
Number | Date | Country | |
---|---|---|---|
20210358508 A1 | Nov 2021 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 16594867 | Oct 2019 | US |
Child | 17217121 | US | |
Parent | 16277941 | Feb 2019 | US |
Child | 16594867 | US | |
Parent | 15002375 | Jan 2016 | US |
Child | 16277941 | US | |
Parent | PCT/EP2014/065550 | Jul 2014 | US |
Child | 15002375 | US |