The invention relates to an audio decoder, in particular, but not exclusively, to an MPEG Surround decoder or an object-oriented decoder.
In (parametric) spatial audio (en)coders, parameters are extracted from the original audio signals so as to produce a reduced number of down-mix audio signals (for example only a single down-mix signal corresponding to a mono down-mix, or two down-mix signals for a stereo down-mix), and a corresponding set of parameters describing the spatial properties of the original audio signal. In (parametric) spatial audio decoders, the spatial properties described by the transmitted spatial parameters are used to recreate a spatial multi-channel signal, which closely resembles the original multi-channel audio signal.
Recently, techniques for processing and manipulating individual audio objects at the decoding side have attracted significant interest. For example, within the MPEG framework, a workgroup has been started on object-based spatial audio coding. The aim of this workgroup is to “explore new technology and reuse of current MPEG Surround components and technologies for the bit rate efficient coding of multiple sound sources or objects into a number of down-mix channels and corresponding spatial parameters”. In other words, the aim is to encode multiple audio objects in a limited set of down-mix channels with corresponding parameters. At the decoder side, users can interact with the content, for example by repositioning the individual objects.
Such interaction with the content is easily realized in object-oriented decoders. It is then realized by including a rendering that follows the decoding. Said rendering is combined with the decoding to avoid the need to determine individual objects. The currently available dedicated rendering comprises positioning of objects, volume adjustment, or equalization of the rendered audio signals.
One disadvantage of the known object-oriented decoders with the incorporated rendering is that they permit only a limited set of manipulations of objects, because they do not produce or operate on the individual objects. On the other hand, explicit decoding of the individual audio objects is very costly and inefficient.
It is an object of the invention to provide an enhanced decoder for decoding audio objects that allows a wider range of manipulations of objects without a need for decoding the individual audio objects for this purpose.
This object is achieved by an audio decoder according to the invention. It is assumed that a set of objects, each with its corresponding waveform, has previously been encoded in an object-oriented encoder, which generates a down-mix audio signal (a single signal in case of a single channel), said down-mix audio signal being a down-mix of a plurality of audio objects and corresponding parametric data. The parametric data comprises a set of object parameters for each of the different audio objects. The receiver receives said down-mix audio signal and said parametric data. This down-mix audio signal is further fed into effect means that generate a modified down-mix audio signal by applying effects to estimates of audio signals corresponding to selected audio objects comprised in the down-mix audio signal. Said estimates of audio signals are derived based on the parametric data. The modified down-mix audio signal is further fed into decoding means, or rendering means, or combined with the output of rendering means, depending on the type of the applied effect, e.g. an insert or send effect. The decoding means decode the audio objects from the down-mix audio signal fed into the decoding means, said down-mix audio signal being the originally received down-mix audio signal or the modified down-mix audio signal. Said decoding is performed based on the parametric data. The rendering means generate a spatial output audio signal from the audio objects obtained from the decoding means and optionally from the effect means, depending on the type of the applied effect.
The advantage of the decoder according to the invention is that various types of effects can be applied without the object to which the effect is to be applied being available. Instead, the invention proposes to apply the effect to the estimated audio signals corresponding to the objects before, or in parallel to, the actual decoding. Therefore, explicit object decoding is not required, and the rendering merged into the decoder is preserved.
In an embodiment, the decoder further comprises modifying means for modifying the parametric data when a spectral or temporal envelope of an estimated audio signal corresponding to the object or plurality of objects is modified by the insert effect.
An example of such an effect is a non-linear distortion that generates additional high frequency spectral components, or a multi-band compressor. If the spectral characteristic of the modified audio signal has changed, applying the unmodified parameters comprised in the parametric data, as received, might lead to undesired and possibly annoying artifacts. Therefore, adapting the parameters to match the new spectral or temporal characteristics improves the quality of the resulting rendered audio signal.
In an embodiment, the generation of the estimated audio signals corresponding to an audio object or plurality of objects comprises time/frequency dependent scaling of the down-mix audio signals based on the power parameters corresponding to audio objects, said power parameters being comprised in the received parametric data.
The advantage of this estimation is that it comprises a multiplication of the down-mix audio signal. This makes the estimation process simple and efficient.
In an embodiment, the decoding means comprise a decoder in accordance with the MPEG Surround standard and conversion means for converting the parametric data into parametric data in accordance with the MPEG Surround standard.
The advantage of using the MPEG Surround decoder is that this type of decoder is used as a rendering engine for an object-oriented decoder. In this case, the object-oriented parameters are combined with user-control data and converted to MPEG Surround parameters, such as level differences and correlation parameters between channels (pairs). Hence the MPEG Surround parameters result from the combined effect of object-oriented parameters, i.e. transmitted information, and the desired rendering properties, i.e. user-controllable information set at the decoder side. In such a case no intermediate object signals are required.
The invention further provides a receiver and a communication system, as well as corresponding methods.
In an embodiment, the insert and send effects are applied simultaneously. The use of, for example, insert effects does not exclude the use of send effects, and vice versa.
The invention further provides a computer program product enabling a programmable device to perform the method according to the invention.
These and other aspects of the invention will be apparent from and elucidated with reference to the embodiments shown in the drawings, in which:
Throughout the figures, the same reference numerals indicate similar or corresponding features. Some of the features indicated in the drawings are typically implemented in software, and as such represent software entities, such as software modules or objects.
The signal fed into the receiver 200 is a single signal that corresponds to the stream of multiplexed down-mix audio data that corresponds to the down-mix audio signal and the parametric data. The function of the receiver is then demultiplexing of the two data streams. If the down-mix audio signal is provided in compressed form (such as MPEG-1 layer 3), receiver 200 also performs decompression or decoding of the compressed audio signal into a time-domain audio down-mix signal.
Although the input of the receiver 200 is depicted as a single signal/data path, it could also comprise multiple data paths for separate down-mix signals and/or parametric data. Subsequently, the down-mix signals and the parametric data are fed into decoding means 300 that decode the audio objects from the down-mix audio signals based on the parametric data. The decoded audio objects are further fed into rendering means 400 for generating at least one output audio signal from the decoded audio objects. Although the decoding means and rendering means are drawn as separate units, they are very often merged together. As a result of such a merger of the decoding and rendering processing means, there is no need for explicit decoding of individual audio objects. Instead, rendered audio signals are provided at a much lower computational cost, and with no loss of audio quality.
Examples of insert effects are, among others, dynamic range compression, generation of distortion (e.g. to simulate guitar amplifiers), or a vocoder. This type of effect is preferably applied to a limited (preferably single) set of audio objects.
The insert effects used in the units 531 and 532 are either of the same type or they differ. The insert effect used by the unit 532 is for example a non-linear distortion that generates additional high frequency spectral components, or a multi-band compressor. If the spectral characteristic of the modified audio signal has changed, applying the unmodified parameters comprised in the parametric data as received in the decoding means 300, might lead to undesired and possibly annoying artifacts. Therefore, adapting the parametric data to match the new spectral characteristics improves the quality of the resulting audio signal. This adaptation of the parametric data is performed in the unit 600. The adapted parametric data 504 is fed into the decoding means 300 and is used for decoding of the modified down-mix signal(s) 503.
It should be noted that the two units 531 and 532 comprised in the insert means 530 are just an example. The number of the units can vary depending on the number of insert effects to be applied. Further, the units 531 and 532 can be implemented in hardware or software.
An adder 540 adds up the audio signals provided from the gain means 560, and a unit 570 applies the send effect. The resulting signal 505, also called the “wet” output, is fed into the rendering means, or alternatively, is mixed with (or added to) the output of the rendering means.
Examples of send effects are, among others, reverberation and modulation effects such as chorus, flanger, or phaser.
It should be noted that the two units 561 and 562 comprised in the gain means 560 are just an example. The number of the units can vary depending on the number of signals corresponding to audio objects or pluralities of audio objects for which the level of the send effect is to be set.
The estimation means 510 and the gain means 560 can be combined in a single processing step that estimates a weighted combination of multiple object signals. The gains 561 and 562 can be incorporated in the estimation means 511 and 512, respectively. This is also described in the equations below, where Q is an (estimate of a) weighted combination of object signals and is obtained by one single scaling operation per time/frequency tile.
The gains per object or combination of objects can be interpreted as ‘effect send levels’. In several applications, the amount of effect is preferably user-controllable per object. For example, the user might desire one of the objects without reverberation, another object with a small amount of reverberation, and yet another object with full reverberation. In such an example, the gains per object could be equal to 0, 0.5 and 1.0, for each of the respective objects.
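By way of illustration only, the effect send levels of the example above can be sketched as a weighted sum of estimated object signals feeding a single send bus. The function name `send_bus` and the toy sample values are hypothetical and do not appear in the description; only the gains 0, 0.5 and 1.0 are taken from the text.

```python
def send_bus(object_estimates, send_levels):
    """Mix estimated object signals into one effect-send signal.

    object_estimates: list of equally long sample lists, one per object.
    send_levels: one gain per object (the 'effect send level').
    """
    if len(object_estimates) != len(send_levels):
        raise ValueError("one send level per object estimate is required")
    length = len(object_estimates[0])
    bus = [0.0] * length
    for signal, gain in zip(object_estimates, send_levels):
        for n in range(length):
            bus[n] += gain * signal[n]
    return bus


# Three toy object estimates (hypothetical values) with the gains from the text:
# no reverberation, a small amount of reverberation, full reverberation.
estimates = [[1.0, 1.0], [2.0, -2.0], [0.5, 0.25]]
levels = [0.0, 0.5, 1.0]
wet_input = send_bus(estimates, levels)
```

The resulting `wet_input` plays the role of the summed signal that would be handed to the send effect; the first object does not contribute to it at all.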
In an embodiment, the generation of the estimated audio signals corresponding to an audio object or plurality of objects comprises time/frequency dependent scaling of the down-mix audio signals based on the power parameters corresponding to audio objects, said power parameters being comprised in the parametric data.
This embodiment is explained for the following example. At the encoder, I object signals s_i[n], i=0, . . . , I−1, with n the sample index, are down-mixed to create a down-mix signal x[n] by summation of the object signals:
$$x[n] = \sum_{i=0}^{I-1} s_i[n].$$
The down-mix signal is accompanied by object-oriented parameters that describe the (relative) signal power of each object within individual time/frequency tiles of the down-mix signal x[n]. The object signals si[n] are e.g. first windowed using overlapping analysis windows w[n]:
$$s_i[n,m] = s_i[n + mL/2]\, w[n],$$
With L the length of the window and e.g. L/2 the corresponding hop size (assuming 50% overlap), and m the window index. A typical form of the analysis window is a Hanning window:
The resulting segmented signals si[n,m] are subsequently transformed to the frequency domain using an FFT:
With k the FFT bin index. The FFT bin indices k are subsequently grouped into parameter bands b. In other words, each parameter band b corresponds to a set of adjacent frequency bin indices k. For each parameter band b, and each segment m of each object signal S_i[k,m], a power value σ_i²[b,m] is computed:
$$\sigma_i^2[b,m] = \sum_{k \in b} S_i[k,m]\, S_i^{*}[k,m],$$
with (*) being the complex conjugation operator. These parameters σ_i²[b,m] are comprised in the parametric data (preferably quantized in the logarithmic domain).
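The encoder-side parameter extraction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the encoder of the invention: the periodic Hanning form `0.5 - 0.5*cos(2*pi*n/L)`, the unnormalized forward DFT, the helper names, and the band layout given as `(first_bin, last_bin_exclusive)` pairs are all assumptions introduced here.

```python
import cmath
import math


def hann(L):
    # Periodic Hanning window; this exact form is an assumption.
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / L) for n in range(L)]


def dft(frame):
    # Plain unnormalized forward DFT (an FFT would give the same values faster).
    L = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * math.pi * k * n / L) for n in range(L))
            for k in range(L)]


def band_powers(signal, L, bands):
    """Return powers[m][b]: the power of one object signal per segment m and band b.

    bands is a hypothetical list of (first_bin, last_bin_exclusive) pairs.
    """
    w = hann(L)
    hop = L // 2                                   # 50% overlap
    num_frames = (len(signal) - L) // hop + 1
    powers = []
    for m in range(num_frames):
        frame = [signal[n + m * hop] * w[n] for n in range(L)]
        spectrum = dft(frame)
        powers.append([sum((spectrum[k] * spectrum[k].conjugate()).real
                           for k in range(lo, hi))
                       for lo, hi in bands])
    return powers


# Toy object signal: a cosine centred on bin 2 of an 8-point transform.
L = 8
s = [math.cos(2.0 * math.pi * 2 * n / L) for n in range(16)]
powers = band_powers(s, L, bands=[(0, 2), (2, 5)])
```

In this toy case most of the power falls in the second band, which contains the bin of the cosine; the smaller value in the first band is leakage introduced by the window.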
The estimation process of an object or plurality of objects at the object-oriented audio decoder comprises time/frequency dependent scaling of the down-mix audio signal. A discrete-time down-mix signal x[n], with n the sample index, is split into time/frequency tiles X[k,m], with k a frequency index and m a frame (temporal segment) index. This is achieved by e.g. windowing the signal x[n] with an analysis window w[n]:
$$x[n,m] = x[n + mL/2]\, w[n],$$
With L the window length and L/2 the corresponding hop size. In this case, a preferred analysis window is given by the square root of the Hanning window:
Subsequently, the windowed signal x[n,m] is transformed to the frequency domain using an FFT:
The frequency-domain components of X[k,m] are subsequently grouped into so-called parameter bands b (b=0, . . . , B−1). These parameter bands coincide with the parameter bands at the encoder. The decoder-side estimate Ŝ_i[k,m] of segment m of object i is given by:
With b(k) the parameter band that was associated with frequency index k.
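The time/frequency dependent scaling can be illustrated with the sketch below. The gain rule used here, the power of object i divided by the summed power of all objects in band b(k), is an assumption chosen for illustration (other rules, such as the square root of that ratio, are equally conceivable); the function name, the `eps` guard and the toy values are likewise hypothetical.

```python
def estimate_object(X, sigma2, i, band_of_bin, eps=1e-12):
    """Estimate object i for one frame by scaling the down-mix bins X[k].

    sigma2[j][b] is the transmitted power of object j in band b for this frame.
    band_of_bin[k] is b(k), the parameter band of frequency index k.
    """
    estimate = []
    for k, x in enumerate(X):
        b = band_of_bin[k]
        total = sum(obj[b] for obj in sigma2)
        gain = sigma2[i][b] / (total + eps)    # assumed power-ratio gain
        estimate.append(gain * x)
    return estimate


# Toy frame with four bins grouped into two parameter bands.
X = [1 + 0j, 2 + 0j, 0 + 4j, 8 + 0j]
band_of_bin = [0, 0, 1, 1]
sigma2 = [[3.0, 1.0], [1.0, 3.0]]              # two objects, two bands
est0 = estimate_object(X, sigma2, 0, band_of_bin)
est1 = estimate_object(X, sigma2, 1, band_of_bin)
```

With this particular gain rule the object estimates add up to the down-mix again, so no signal energy is created or lost by the split; that property belongs to the assumed rule, not necessarily to the estimator of the invention.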
A weighted combination Q of object signals S_i with weights g_i is given by:
$$Q[k,m] = \sum_{i} g_i\, S_i[k,m].$$
In the object-oriented decoder, Q can be estimated according to:
In other words, an object signal or any linear combination of a plurality of audio object signals can be estimated at the proposed object-oriented audio decoder by a time-frequency dependent scaling of the down-mix signal X[k,m].
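The statement that a weighted combination needs only one scaling operation per time/frequency tile can be sketched as follows. As before, the power-ratio gain is an assumed estimator used purely for illustration, and `estimate_mix`, `eps` and the toy values are hypothetical.

```python
def estimate_mix(X, sigma2, weights, band_of_bin, eps=1e-12):
    """Estimate Q = sum_i g_i * S_i for one frame with a single gain per bin.

    sigma2[i][b]: transmitted power of object i in band b for this frame.
    weights[i]:   weight g_i of object i in the combination.
    """
    out = []
    for k, x in enumerate(X):
        b = band_of_bin[k]
        total = sum(obj[b] for obj in sigma2) + eps
        # All per-object gains are folded into one combined gain for this tile.
        gain = sum(g * obj[b] for g, obj in zip(weights, sigma2)) / total
        out.append(gain * x)
    return out


X = [2 + 0j, 0 + 4j]
band_of_bin = [0, 1]
sigma2 = [[1.0, 2.0], [1.0, 2.0], [2.0, 4.0]]   # three objects, two bands
weights = [0.0, 0.5, 1.0]
Q = estimate_mix(X, sigma2, weights, band_of_bin)
```

Because the scaling is linear, the result equals the weighted sum of the individually estimated objects, while only one multiplication per tile is applied to the down-mix.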
In order to result in time-domain output signals, each estimated object signal is transformed to the time domain (using an inverse FFT), multiplied by a synthesis window (identical to the analysis window), and combined with previous frames using overlap-add.
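The windowing and overlap-add chain can be checked with a small sketch. It assumes a square-root periodic Hanning window of the form `sqrt(0.5 - 0.5*cos(2*pi*n/L))` for both analysis and synthesis, and it omits the FFT and inverse FFT, since without per-tile processing they cancel; the names used are hypothetical.

```python
import math


def sqrt_hann(L):
    # Square root of a periodic Hanning window; the exact form is an assumption.
    return [math.sqrt(0.5 - 0.5 * math.cos(2.0 * math.pi * n / L)) for n in range(L)]


def analyse_synthesise(x, L):
    """Window x into 50%-overlapping frames and rebuild it by overlap-add."""
    w = sqrt_hann(L)
    hop = L // 2
    y = [0.0] * len(x)
    m = 0
    while m * hop + L <= len(x):
        frame = [x[n + m * hop] * w[n] for n in range(L)]   # analysis window
        # Per-tile scaling in the frequency domain would take place here.
        for n in range(L):
            y[n + m * hop] += frame[n] * w[n]               # synthesis window, overlap-add
        m += 1
    return y


L = 8
x = [math.sin(0.3 * n) for n in range(32)]
y = analyse_synthesise(x, L)
```

Away from the first and last half frame, where only one window contributes, `y` reproduces `x`, because the product of the analysis and synthesis windows is a Hanning window whose 50%-overlapped copies sum to one.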
In an embodiment, the generation of the estimated audio signals comprises weighting an object or a combination of a plurality of objects by means of time/frequency dependent scaling of the down-mix audio signals based on the power parameters corresponding to audio objects, said power parameters being comprised in the received parametric data.
It should be noted that a send effect unit might have more output signals than input signals. For example, a stereo or multi-channel reverberation unit may have a mono input signal.
In an embodiment, the down-mixed signal and the parametric data are in accordance with an MPEG Surround standard. The existing MPEG Surround decoder, next to its decoding functionality, also functions as a rendering device. In such a case, no intermediate audio signals corresponding to the decoded objects are required. The object decoding and rendering are combined into a single device.
According to one of the embodiments, the method comprises the steps of receiving at least one down-mix audio signal and parametric data, generating modified down-mix audio signals, decoding the audio objects from the down-mix audio signals, and generating at least one output audio signal from the decoded audio objects. In the method each down-mix audio signal comprises a down-mix of a plurality of audio objects. The parametric data comprises a plurality of object parameters for each of the plurality of audio objects. The modified down-mix audio signals are obtained by applying effects to estimated audio signals corresponding to audio objects comprised in said down-mix audio signals. The estimated audio signals are derived from the down-mix audio signals based on the parametric data. Depending on the type of the applied effect, the modified down-mix audio signals are decoded by decoding means 300 or rendered by rendering means 400. The decoding step is performed by the decoding means 300 for the down-mix audio signals or the modified down-mix audio signals based on the parametric data.
The last step of generating at least one output audio signal from the decoded audio objects, which can be called a rendering step, can be combined with the decoding step into one processing step.
In an embodiment, a receiver for receiving audio signals comprises a receiver element, effect means, decoding means, and rendering means. The receiver element receives from a transmitter at least one down-mix audio signal and parametric data. Each down-mix audio signal comprises a down-mix of a plurality of audio objects. The parametric data comprises a plurality of object parameters for each of the plurality of audio objects.
The effect means generate modified down-mix audio signals. These modified down-mix audio signals are obtained by applying effects to estimated audio signals corresponding to audio objects comprised in said down-mix audio signals. The estimated audio signals are derived from the down-mix audio signals based on the parametric data. Depending on the type of the applied effect, the modified down-mix audio signals are decoded by decoding means or rendered by rendering means.
The decoding means decode the audio objects from the down-mix audio signals or the modified down-mix audio signals based on the parametric data. The rendering means generate at least one output audio signal from the decoded audio objects.
The transmitter 700 is for example a signal recording device and the receiver 900 is for example a signal player device. In the specific example when a signal recording function is supported, the transmitter 700 comprises means 710 for receiving a plurality of audio objects. Subsequently, these objects are encoded by encoding means 720 for encoding the plurality of audio objects in at least one down-mix audio signal and parametric data. An embodiment of such encoding means 720 is given in Faller, C., “Parametric joint-coding of audio sources”, Proc. 120th AES Convention, Paris, France, May 2006. Each down-mix audio signal comprises a down-mix of a plurality of audio objects. Said parametric data comprises a plurality of object parameters for each of the plurality of audio objects. The encoded audio objects are transmitted to the receiver 900 by means 730 for transmitting down-mix audio signals and the parametric data. Said means 730 have an interface with the network 800, and may transmit the down-mix signals through the network 800.
The receiver 900 comprises a receiver element 910 for receiving from the transmitter 700 at least one down-mix audio signal and parametric data. Each down-mix audio signal comprises a down-mix of a plurality of audio objects. Said parametric data comprises a plurality of object parameters for each of the plurality of audio objects. The effect means 920 generate modified down-mix audio signals. Said modified down-mix audio signals are obtained by applying effects to estimated audio signals corresponding to audio objects comprised in said down-mix audio signals. Said estimated audio signals are derived from the down-mix audio signals based on the parametric data. Depending on the type of the applied effect, said modified down-mix audio signals are decoded by decoding means, or rendered by rendering means, or combined with the output of rendering means. The decoding means decode the audio objects from the down-mix audio signals or the modified down-mix audio signals based on the parametric data. The rendering means generate at least one output audio signal from the decoded audio objects.
In an embodiment, the insert and send effects are applied simultaneously.
In an embodiment, the effects are applied in response to user input. The user can, by means of e.g. a button, slider, knob, or graphical user interface, set the effects according to his or her own preferences.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims.
In the accompanying claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word “comprising” does not exclude the presence of elements or steps other than those listed in a claim. The word “a” or “an” preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer.
| Number | Date | Country | Kind |
|---|---|---|---|
| 07100339.6 | Jan 2007 | EP | regional |

| Filing Document | Filing Date | Country | Kind | 371c Date |
|---|---|---|---|---|
| PCT/IB2008/050029 | 1/7/2008 | WO | 00 | 7/1/2009 |