The present invention relates to audio signal processing and, in particular, to a decoder, an encoder, a system, methods and a computer program for audio object coding employing audio object adaptive individual time-frequency resolution.
Embodiments according to the invention are related to an audio decoder for decoding a multi-object audio signal consisting of a downmix signal and an object-related parametric side information (PSI). Further embodiments according to the invention are related to an audio decoder for providing an upmix signal representation in dependence on a downmix signal representation and an object-related PSI. Further embodiments of the invention are related to a method for decoding a multi-object audio signal consisting of a downmix signal and a related PSI. Further embodiments according to the invention are related to a method for providing an upmix signal representation in dependence on a downmix signal representation and an object-related PSI.
Further embodiments of the invention are related to an audio encoder for encoding a plurality of audio object signals into a downmix signal and a PSI. Further embodiments of the invention are related to a method for encoding a plurality of audio object signals into a downmix signal and a PSI.
Further embodiments according to the invention are related to a computer program corresponding to the method(s) for decoding, encoding, and/or providing an upmix signal.
Further embodiments of the invention are related to audio object adaptive individual time-frequency resolution switching for signal mixture manipulation.
In modern digital audio systems, it is a major trend to allow for audio-object related modifications of the transmitted content on the receiver side. These modifications include gain modifications of selected parts of the audio signal and/or spatial re-positioning of dedicated audio objects in case of multi-channel playback via spatially distributed speakers. This may be achieved by individually delivering different parts of the audio content to the different speakers.
In other words, in the art of audio processing, audio transmission, and audio storage, there is an increasing desire to allow for user interaction on object-oriented audio content playback, and also a demand to utilize the extended possibilities of multi-channel playback to individually render audio contents or parts thereof in order to improve the hearing impression. The usage of multi-channel audio content thereby brings significant improvements for the user. For example, a three-dimensional hearing impression can be obtained, which improves user satisfaction in entertainment applications. However, multi-channel audio content is also useful in professional environments, for example in telephone conferencing applications, because talker intelligibility can be improved by using multi-channel audio playback. Another possible application is to let a listener of a musical piece individually adjust the playback level and/or spatial position of different parts (also termed “audio objects”) or tracks, such as a vocal part or different instruments. The user may perform such an adjustment for reasons of personal taste, for easier transcription of one or more parts of the musical piece, for educational purposes, karaoke, rehearsal, etc.
The straightforward discrete transmission of all digital multi-channel or multi-object audio content, e.g., in the form of pulse code modulation (PCM) data or even compressed audio formats, demands very high bitrates. However, it is also desirable to transmit and store audio data in a bitrate efficient way. Therefore, one is willing to accept a reasonable tradeoff between audio quality and bitrate requirements in order to avoid an excessive resource load caused by multi-channel/multi-object applications.
Recently, in the field of audio coding, parametric techniques for the bitrate-efficient transmission/storage of multi-channel/multi-object audio signals have been introduced by, e.g., the Moving Picture Experts Group (MPEG) and others. One example is MPEG Surround as a channel-oriented approach [ISO/IEC 23003-1:2007, MPEG-D (MPEG audio technologies), Part 1: MPEG Surround, 2007, and C. Faller and F. Baumgarte, “Binaural Cue Coding - Part II: Schemes and applications,” IEEE Trans. on Speech and Audio Proc., vol. 11, no. 6, November 2003], or MPEG Spatial Audio Object Coding (SAOC) as an object-oriented approach [C. Faller, “Parametric Joint-Coding of Audio Sources,” 120th AES Convention, Paris, 2006; ISO/IEC, “MPEG audio technologies - Part 2: Spatial Audio Object Coding (SAOC)”, ISO/IEC JTC1/SC29/WG11 (MPEG) International Standard 23003-2; J. Herre, S. Disch, J. Hilpert, O. Hellmuth: “From SAC To SAOC - Recent Developments in Parametric Coding of Spatial Audio,” 22nd Regional UK AES Conference, Cambridge, UK, April 2007; J. Engdegård, B. Resch, C. Falch, O. Hellmuth, J. Hilpert, A. Holzer, L. Terentiev, J. Breebaart, J. Koppens, E. Schuijers and W. Oomen: “Spatial Audio Object Coding (SAOC) - The Upcoming MPEG Standard on Parametric Object Based Audio Coding,” 124th AES Convention, Amsterdam 2008]. Another object-oriented approach is termed “informed source separation” [M. Parvaix and L. Girin: “Informed Source Separation of underdetermined instantaneous Stereo Mixtures using Source Index Embedding,” IEEE ICASSP, 2010; M. Parvaix, L. Girin, J.-M. Brassier: “A watermarking-based method for informed source separation of audio signals with a single sensor,” IEEE Transactions on Audio, Speech and Language Processing, 2010; A. Liutkus, J. Pinel, R. Badeau, L. Girin and G. Richard: “Informed source separation through spectrogram coding and data embedding,” Signal Processing Journal, 2011; A. Ozerov, A. Liutkus, R. Badeau, G. Richard: “Informed source separation: source coding meets source separation,” IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2011; Shuhua Zhang and Laurent Girin: “An Informed Source Separation System for Speech Signals,” INTERSPEECH, 2011; L. Girin and J. Pinel: “Informed Audio Source Separation from Compressed Linear Stereo Mixtures,” AES 42nd International Conference: Semantic Audio, 2011]. These techniques aim at reconstructing a desired output audio scene or a desired audio source object on the basis of a downmix of channels/objects and additional side information describing the transmitted/stored audio scene and/or the audio source objects in the audio scene.
The estimation and the application of channel/object related side information in such systems is done in a time-frequency selective manner. Therefore, such systems employ time-frequency transforms such as the Discrete Fourier Transform (DFT), the Short Time Fourier Transform (STFT) or filter banks like Quadrature Mirror Filter (QMF) banks, etc. The basic principle of such systems is depicted in
In case of the STFT, the temporal dimension is represented by the time-block number and the spectral dimension is captured by the spectral coefficient (“bin”) number. In case of QMF, the temporal dimension is represented by the time-slot number and the spectral dimension is captured by the sub-band number. If the spectral resolution of the QMF is improved by subsequent application of a second filter stage, the entire filter bank is termed hybrid QMF and the fine resolution sub-bands are termed hybrid sub-bands.
As already mentioned above, in SAOC the general processing is carried out in a time-frequency selective way and can be described as follows within each frequency band:
Time-frequency based systems may utilize a time-frequency (t/f) transform with static temporal and frequency resolution. Choosing a certain fixed t/f-resolution grid typically involves a trade-off between time and frequency resolution.
The effect of a fixed t/f-resolution can be demonstrated using typical object signals in an audio signal mixture as an example. For example, the spectra of tonal sounds exhibit a harmonically related structure with a fundamental frequency and several overtones. The energy of such signals is concentrated in certain frequency regions. For such signals, a high frequency resolution of the utilized t/f-representation is beneficial for separating the narrowband tonal spectral regions from a signal mixture. In contrast, transient signals, like drum sounds, often have a distinct temporal structure: substantial energy is only present for short periods of time and is spread over a wide range of frequencies. For these signals, a high temporal resolution of the utilized t/f-representation is advantageous for separating the transient signal portion from the signal mixture.
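The trade-off can be made concrete for a block-based transform. The following is a minimal sketch, assuming a plain block transform of N samples at a sampling rate of 48 kHz (the function name and the sampling rate are illustrative assumptions, not taken from the description): the block duration is N/fs, the bin spacing is fs/N, and their product is fixed, so improving one necessarily degrades the other.

```python
def tf_resolution(block_length, sample_rate_hz):
    """Return (time_resolution_s, frequency_resolution_hz) of a block transform.

    Hypothetical helper for illustration only.
    """
    time_res = block_length / sample_rate_hz   # duration covered by one block
    freq_res = sample_rate_hz / block_length   # spacing between adjacent bins
    return time_res, freq_res


# Short blocks suit transients, long blocks suit tonal content.
for n in (64, 256, 2048):
    t, f = tf_resolution(n, 48000.0)
    print(f"N={n:5d}: {t * 1e3:6.2f} ms per block, {f:7.2f} Hz per bin")
```

For any N, the product of the two returned values is 1, which is the fixed-grid trade-off described above.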
It would be desirable to take into account the different needs of different types of audio objects regarding their representation in the time-frequency domain when generating and/or evaluating object-specific side information at the encoder side or at the decoder side, respectively.
According to an embodiment, an audio decoder for decoding a multi-object audio signal including a downmix signal and side information, the side information including object-specific side information for at least one audio object in at least one time/frequency region, and object-specific time/frequency resolution information indicative of an object-specific time/frequency resolution of the object-specific side information for the at least one audio object in the at least one time/frequency region, may have: an object-specific time/frequency resolution determiner configured to determine the object-specific time/frequency resolution information from the side information for the at least one audio object; and an object separator configured to separate the at least one audio object from the downmix signal using the object-specific side information in accordance with the object-specific time/frequency resolution.
According to another embodiment, an audio encoder for encoding a plurality of audio objects into a downmix signal and side information may have: a time-to-frequency transformer configured to transform the plurality of audio objects at least to a first plurality of corresponding transformations using a first time/frequency resolution and to a second plurality of corresponding transformations using a second time/frequency resolution; a side information determiner configured to determine at least a first side information for the first plurality of corresponding transformations and a second side information for the second plurality of corresponding transformations, the first and second side information indicating a relation of the plurality of audio objects to each other in the first and second time/frequency resolutions, respectively, in a time/frequency region; and a side information selector configured to select, for at least one audio object of the plurality of audio objects, one object-specific side information from at least the first and second side information on the basis of a suitability criterion indicative of a suitability of at least the first or second time/frequency resolution for representing the audio object in the time/frequency domain, the object-specific side information being inserted into the side information output by the audio encoder.
According to another embodiment, a method for decoding a multi-object audio signal including a downmix signal and side information, the side information including object-specific side information for at least one audio object in at least one time/frequency region, and object-specific time/frequency resolution information indicative of an object-specific time/frequency resolution of the object-specific side information for the at least one audio object in the at least one time/frequency region, may have the steps of: determining the object-specific time/frequency resolution information from the side information for the at least one audio object; and separating the at least one audio object from the downmix signal using the object-specific side information in accordance with the object-specific time/frequency resolution.
According to another embodiment, a method for encoding a plurality of audio objects to a downmix signal and side information may have the steps of: transforming the plurality of audio objects at least to a first plurality of corresponding transformations using a first time/frequency resolution and to a second plurality of corresponding transformations using a second time/frequency resolution; determining at least a first side information for the first plurality of corresponding transformations and a second side information for the second plurality of corresponding transformations, the first and second side information indicating a relation of the plurality of audio objects to each other in the first and second time/frequency resolutions, respectively, in a time/frequency region; and selecting, for at least one audio object of the plurality of audio objects, one object-specific side information from at least the first and second side information on the basis of a suitability criterion indicative of a suitability of at least the first or second time/frequency resolution for representing the audio object in the time/frequency domain, the object-specific side information being inserted into the side information output by the audio encoder.
According to another embodiment, an audio decoder for decoding a multi-object audio signal including a downmix signal and side information, the side information including object-specific side information for at least one audio object in at least one time/frequency region, and object-specific time/frequency resolution information indicative of an object-specific time/frequency resolution of the object-specific side information for the at least one audio object in the at least one time/frequency region, may have: an object-specific time/frequency resolution determiner configured to determine the object-specific time/frequency resolution information from the side information for the at least one audio object; and an object separator configured to separate the at least one audio object from the downmix signal using the object-specific side information in accordance with the object-specific time/frequency resolution, wherein object-specific side information for at least one other audio object within the downmix signal has a different object-specific time/frequency resolution.
According to another embodiment, a method for decoding a multi-object audio signal including a downmix signal and side information, the side information including object-specific side information for at least one audio object in at least one time/frequency region, and object-specific time/frequency resolution information indicative of an object-specific time/frequency resolution of the object-specific side information for the at least one audio object in the at least one time/frequency region, may have the steps of: determining the object-specific time/frequency resolution information from the side information for the at least one audio object; and separating the at least one audio object from the downmix signal using the object-specific side information in accordance with the object-specific time/frequency resolution, wherein object-specific side information for at least one other audio object within the downmix signal has a different object-specific time/frequency resolution.
Another embodiment may have a computer program for performing any of the methods when the computer program runs on a computer.
According to another embodiment, an audio decoder for decoding a multi-object audio signal including a downmix signal and side information, the side information including object-specific side information for at least one audio object in at least one time/frequency region, and object-specific time/frequency resolution information indicative of an object-specific time/frequency resolution of the object-specific side information for the at least one audio object in the at least one time/frequency region, may have: an object-specific time/frequency resolution determiner configured to determine the object-specific time/frequency resolution information from the side information for the at least one audio object; and an object separator configured to separate the at least one audio object from the downmix signal using the object-specific side information in accordance with the object-specific time/frequency resolution, wherein the object-specific side information is a fine structure object-specific side information for the at least one audio object in the at least one time/frequency region, and wherein the side information further includes coarse object-specific side information for the at least one audio object in the at least one time/frequency region, the coarse object-specific side information being constant within the at least one time/frequency region, or wherein the fine structure object-specific side information describes a difference between the coarse object-specific side information and the at least one audio object.
According to another embodiment, a method for decoding a multi-object audio signal including a downmix signal and side information, the side information including object-specific side information for at least one audio object in at least one time/frequency region, and object-specific time/frequency resolution information indicative of an object-specific time/frequency resolution of the object-specific side information for the at least one audio object in the at least one time/frequency region, may have the steps of: determining the object-specific time/frequency resolution information from the side information for the at least one audio object; and separating the at least one audio object from the downmix signal using the object-specific side information in accordance with the object-specific time/frequency resolution, wherein the object-specific side information is a fine structure object-specific side information for the at least one audio object in the at least one time/frequency region, and wherein the side information further includes coarse object-specific side information for the at least one audio object in the at least one time/frequency region, the coarse object-specific side information being constant within the at least one time/frequency region, or wherein the fine structure object-specific side information describes a difference between the coarse object-specific side information and the at least one audio object.
According to at least some embodiments, an audio decoder for decoding a multi-object audio signal is provided. The multi-object audio signal consists of a downmix signal and side information. The side information comprises object-specific side information for at least one audio object in at least one time/frequency region. The side information further comprises object-specific time/frequency resolution information indicative of an object-specific time/frequency resolution of the object-specific side information for the at least one audio object in the at least one time/frequency region. The audio decoder comprises an object-specific time/frequency resolution determiner configured to determine the object-specific time/frequency resolution information from the side information for the at least one audio object. The audio decoder further comprises an object separator configured to separate the at least one audio object from the downmix signal using the object-specific side information in accordance with the object-specific time/frequency resolution.
Further embodiments provide an audio encoder for encoding a plurality of audio objects into a downmix signal and side information. The audio encoder comprises a time-to-frequency transformer configured to transform the plurality of audio objects at least to a first plurality of corresponding transformations using a first time/frequency resolution and to a second plurality of corresponding transformations using a second time/frequency resolution. The audio encoder further comprises a side information determiner configured to determine at least a first side information for the first plurality of corresponding transformations and a second side information for the second plurality of corresponding transformations. The first and second side information indicate a relation of the plurality of audio objects to each other in the first and second time/frequency resolutions, respectively, in a time/frequency region. The audio encoder also comprises a side information selector configured to select, for at least one audio object of the plurality of audio objects, one object-specific side information from at least the first and second side information on the basis of a suitability criterion. The suitability criterion is indicative of a suitability of at least the first or second time/frequency resolution for representing the audio object in the time/frequency domain. The selected object-specific side information is inserted into the side information output by the audio encoder.
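The selection step performed by the side information selector can be sketched as follows. This is only an illustration: the description at this point does not specify the suitability criterion, so the sketch assumes a hypothetical one, namely the ratio of the L2 norm to the L1 norm of an object's coefficients in the region (a higher value means the energy is concentrated in fewer coefficients). All function names are likewise assumptions.

```python
import math


def compactness(coeffs):
    """Hypothetical suitability score: L2/L1 norm ratio of the coefficients.

    The score is 1.0 when all energy sits in one coefficient and decreases
    as the energy spreads out. This criterion is an assumption.
    """
    l1 = sum(abs(c) for c in coeffs)
    l2 = math.sqrt(sum(abs(c) ** 2 for c in coeffs))
    return l2 / l1 if l1 > 0 else 0.0


def select_resolution(representations):
    """Pick the index of the best-scoring t/f resolution for one object.

    representations[h] holds the coefficients of the same object and the
    same t/f region under candidate resolution h.
    """
    scores = [compactness(r) for r in representations]
    return max(range(len(scores)), key=scores.__getitem__)


# A tonal object: compact under a fine spectral resolution (index 0),
# smeared under a fine temporal resolution (index 1).
fine_frequency = [0, 0, 8, 0, 0, 0, 0, 0]
fine_time = [1, 1, 1, 1, 1, 1, 1, 1]
chosen = select_resolution([fine_frequency, fine_time])
```

The side information computed under the chosen resolution would then be the object-specific side information inserted into the encoder output, together with an indication of which resolution was selected.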
Further embodiments of the present invention provide a method for decoding a multi-object audio signal consisting of a downmix signal and side information. The side information comprises object-specific side information for at least one audio object in at least one time/frequency region, and object-specific time/frequency resolution information indicative of an object-specific time/frequency resolution of the object-specific side information for the at least one audio object in the at least one time/frequency region. The method comprises determining the object-specific time/frequency resolution information from the side information for the at least one audio object. The method further comprises separating the at least one audio object from the downmix signal using the object-specific side information in accordance with the object-specific time/frequency resolution.
Further embodiments of the present invention provide a method for encoding a plurality of audio objects to a downmix signal and side information. The method comprises transforming the plurality of audio objects at least to a first plurality of corresponding transformations using a first time/frequency resolution and to a second plurality of corresponding transformations using a second time/frequency resolution. The method further comprises determining at least a first side information for the first plurality of corresponding transformations and a second side information for the second plurality of corresponding transformations. The first and second side information indicate a relation of the plurality of audio objects to each other in the first and second time/frequency resolutions, respectively, in a time/frequency region. The method further comprises selecting, for at least one audio object of the plurality of audio objects, one object-specific side information from at least the first and second side information on the basis of a suitability criterion. The suitability criterion is indicative of a suitability of at least the first or second time/frequency resolution for representing the audio object in the time/frequency domain. The object-specific side information is inserted into the side information output by the audio encoder.
The performance of audio object separation typically decreases if the utilized t/f-representation does not match with the temporal and/or spectral characteristics of the audio object to be separated from the mixture. Insufficient performance may lead to crosstalk between the separated objects. Said crosstalk is perceived as pre- or post-echoes, timbre modifications, or, in the case of human voice, as so-called double-talk. Embodiments of the invention offer several alternative t/f-representations from which the most suited t/f-representation can be selected for a given audio object and a given time/frequency region when determining the side information at an encoder side, or when using the side information at a decoder side. This provides improved separation performance for the separation of the audio objects and an improved subjective quality of the rendered output signal compared to the state of the art.
Compared to other schemes for encoding/decoding spatial audio objects, the amount of side information may be substantially the same or slightly higher. According to embodiments of the invention, the side information is used in an efficient manner, as it is applied in an object-specific way taking into account the object-specific properties of a given audio object regarding its temporal and spectral structure. In other words, the t/f-representation of the side information is tailored to the various audio objects.
Embodiments of the present invention will be detailed subsequently referring to the appended drawings, in which:
In the case of a stereo downmix, the channels of the downmix signal 18 are denoted L0 and R0; in the case of a mono downmix, the single channel is simply denoted L0. In order to enable the SAOC decoder 12 to recover the individual objects s1 to sN, side information estimator 17 provides the SAOC decoder 12 with side information including SAOC-parameters. For example, in case of a stereo downmix, the SAOC parameters comprise object level differences (OLD), inter-object cross correlation parameters (IOC), downmix gain values (DMG) and downmix channel level differences (DCLD). The side information 20 including the SAOC-parameters, along with the downmix signal 18, forms the SAOC output data stream received by the SAOC decoder 12.
The SAOC decoder 12 comprises an upmixer which receives the downmix signal 18 as well as the side information 20 in order to recover and render the audio signals s1 to sN onto any user-selected set of channels ŷ1 to ŷM, with the rendering being prescribed by rendering information 26 input into SAOC decoder 12.
The audio signals s1 to sN may be input into the encoder 10 in any coding domain, such as in the time or spectral domain. In case the audio signals s1 to sN are fed into the encoder 10 in the time domain, e.g., PCM coded, encoder 10 may use a filter bank, such as a hybrid QMF bank, in order to transfer the signals into a spectral domain, in which the audio signals are represented in several sub-bands associated with different spectral portions, at a specific filter bank resolution. If the audio signals s1 to sN are already in the representation expected by encoder 10, the encoder does not have to perform the spectral decomposition.
As outlined above, side information extractor 17 computes SAOC-parameters from the input audio signals s1 to sN. According to the currently implemented SAOC standard, encoder 10 performs this computation in a time/frequency resolution which may be decreased relative to the original time/frequency resolution as determined by the filter bank time slots 34 and sub-band decomposition, by a certain amount, with this certain amount being signaled to the decoder side within the side information 20. Groups of consecutive filter bank time slots 34 may form a SAOC frame 41. Also the number of parameter bands within the SAOC frame 41 is conveyed within the side information 20. Hence, the time/frequency domain is divided into time/frequency tiles exemplified in
The side information extractor 17 calculates SAOC parameters according to the following formulas. In particular, side information extractor 17 computes object level differences for each object i as

$$\mathrm{OLD}_i^{l,m} = \frac{\sum_{n \in l} \sum_{k \in m} x_i^{n,k} \left(x_i^{n,k}\right)^*}{\max_j \left( \sum_{n \in l} \sum_{k \in m} x_j^{n,k} \left(x_j^{n,k}\right)^* \right)}$$

wherein the sums and the indices n and k, respectively, go through all temporal indices 34, and all spectral indices 30 which belong to a certain time/frequency tile 42, referenced by the indices l for the SAOC frame (or processing time slot) and m for the parameter band. Thereby, the energies of all sub-band values $x_i$ of an audio signal or object i are summed up and normalized to the highest energy value of that tile among all objects or audio signals.
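The energy summation and normalization just described can be sketched in a few lines. This is a minimal illustration with hypothetical function names; the handling of a completely silent tile is an assumption and is not taken from the description.

```python
def tile_energy(x):
    """Sum |x[n][k]|^2 over all time slots n and sub-bands k of one tile."""
    return sum(abs(v) ** 2 for row in x for v in row)


def object_level_differences(tiles):
    """Compute one OLD value per object for a single t/f tile (l, m).

    tiles[i][n][k] is the complex sub-band value of object i at time slot n
    and sub-band k inside the tile. Each energy is normalized to the highest
    energy among all objects, so the strongest object gets OLD = 1.
    """
    energies = [tile_energy(x) for x in tiles]
    peak = max(energies)
    if peak == 0:
        return [0.0] * len(energies)  # silent tile: assumed convention
    return [e / peak for e in energies]


# Two objects, one tile of 2 time slots x 2 sub-bands.
loud = [[1 + 0j, 1j], [1 + 0j, 1 + 0j]]       # energy 4
quiet = [[0.5, 0.5], [0.5, 0.5]]              # energy 1
olds = object_level_differences([loud, quiet])
```

Here `olds` evaluates to `[1.0, 0.25]`, i.e., the quiet object lies about 6 dB below the loudest object in this tile.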
Further, the SAOC side information extractor 17 is able to compute a similarity measure of the corresponding time/frequency tiles of pairs of different input objects s1 to sN. Although the SAOC downmixer 16 may compute the similarity measure between all the pairs of input objects s1 to sN, downmixer 16 may also suppress the signaling of the similarity measures or restrict the computation of the similarity measures to audio objects s1 to sN which form left or right channels of a common stereo channel. In any case, the similarity measure is called the inter-object cross-correlation parameter $\mathrm{IOC}_{i,j}^{l,m}$. The computation is as follows
with again indices n and k going through all sub-band values belonging to a certain time/frequency tile 42, and i and j denoting a certain pair of audio objects s1 to sN.
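A minimal sketch of such a similarity measure follows. It assumes the commonly used form, the real part of the cross-correlation of the two objects over the tile normalized by the geometric mean of their energies; this exact form and the small regularization constant `eps` are assumptions of the sketch.

```python
import math


def inter_object_cross_correlation(xi, xj, eps=1e-9):
    """Normalized cross-correlation of objects i and j over one t/f tile.

    xi[n][k], xj[n][k]: complex sub-band values of the two objects at time
    slot n and sub-band k. eps (an assumption) avoids division by zero.
    """
    cross = sum(a * b.conjugate()
                for row_i, row_j in zip(xi, xj)
                for a, b in zip(row_i, row_j))
    energy_i = sum(abs(a) ** 2 for row in xi for a in row)
    energy_j = sum(abs(b) ** 2 for row in xj for b in row)
    return (cross / (math.sqrt(energy_i * energy_j) + eps)).real


x = [[1 + 1j, 2 + 0j], [0.5j, -1 + 0j]]
negated = [[-v for v in row] for row in x]
ioc_same = inter_object_cross_correlation(x, x)            # close to +1
ioc_opposite = inter_object_cross_correlation(x, negated)  # close to -1
```

An object compared with itself yields a value close to 1, which matches the diagonal of the estimated covariance matrix discussed for the decoder.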
The downmixer 16 downmixes the objects s1 to sN by use of gain factors applied to each object s1 to sN. That is, a gain factor Di is applied to object i and then all thus weighted objects s1 to sN are summed up to obtain a mono downmix signal, which is exemplified in
This downmix prescription is signaled to the decoder side by means of downmix gains $\mathrm{DMG}_i$ and, in case of a stereo downmix signal, downmix channel level differences $\mathrm{DCLD}_i$.
The downmix gains are calculated according to:
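$$\mathrm{DMG}_i = 20 \log_{10}\left(D_i + \varepsilon\right) \quad \text{(mono downmix)},$$

$$\mathrm{DMG}_i = 10 \log_{10}\left(D_{1,i}^2 + D_{2,i}^2 + \varepsilon\right) \quad \text{(stereo downmix)},$$

where $\varepsilon$ is a small number such as $10^{-9}$.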
For the DCLDs the following formula applies:
In the normal mode, downmixer 16 generates the downmix signal according to:
for a mono downmix, or
for a stereo downmix, respectively.
Thus, in the abovementioned formulas, parameters OLD and IOC are a function of the audio signals and parameters DMG and DCLD are a function of D. Note that D may vary over time.
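The dependence of DMG and DCLD on the downmix gains alone can be illustrated as follows. The two DMG functions follow the downmix gain formulas given above. The DCLD function is an assumption of this sketch: it takes the DCLD to be the level ratio of the two channel gains in dB, with the small constant added to both gains so that a zero gain does not cause a domain error.

```python
import math

EPS = 1e-9  # the small constant named in the text


def dmg_mono(d):
    """Downmix gain in dB for a mono downmix with gain factor d."""
    return 20.0 * math.log10(d + EPS)


def dmg_stereo(d1, d2):
    """Downmix gain in dB for a stereo downmix with channel gains d1, d2."""
    return 10.0 * math.log10(d1 ** 2 + d2 ** 2 + EPS)


def dcld(d1, d2):
    """Downmix channel level difference in dB (assumed form, see lead-in)."""
    return 20.0 * math.log10((d1 + EPS) / (d2 + EPS))


# An object mixed equally into both channels versus one panned to the left.
centered = (dmg_stereo(1.0, 1.0), dcld(1.0, 1.0))   # about (3.01 dB, 0 dB)
left_heavy = (dmg_stereo(2.0, 1.0), dcld(2.0, 1.0)) # about (6.99 dB, 6.02 dB)
```

None of these functions takes a signal argument, which reflects the statement that DMG and DCLD are functions of D only.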
Thus, in the normal mode, downmixer 16 mixes all objects s1 to sN without preference, i.e., it handles all objects s1 to sN equally.
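The normal-mode downmix, in which each object is weighted by its gain factor and the weighted objects are summed, can be sketched as a matrix-vector product per sample. The function name and the data layout are assumptions of this sketch; one row of gains gives a mono downmix, two rows give a stereo downmix.

```python
def downmix(objects, gains):
    """Mix N objects into one (mono) or two (stereo) downmix channels.

    objects[i][t]: sample (or sub-band value) t of object i.
    gains[c][i]:   gain of object i in downmix channel c.
    Returns channels[c][t].
    """
    length = len(objects[0])
    return [
        [sum(g * obj[t] for g, obj in zip(row, objects)) for t in range(length)]
        for row in gains
    ]


s = [[1.0, 2.0], [3.0, 4.0]]                   # two objects, two samples each
mono = downmix(s, [[0.5, 0.5]])                # single channel L0
stereo = downmix(s, [[1.0, 0.0], [0.0, 1.0]])  # channels L0 and R0
```

In this example, `mono` is `[[2.0, 3.0]]`, and the stereo gains simply route the first object to L0 and the second to R0.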
At the decoder side, the upmixer performs the inversion of the downmix procedure and the implementation of the “rendering information” 26 represented by a matrix R (in the literature sometimes also called A) in one computation step, namely, in case of a two-channel downmix
where matrix E is a function of the parameters OLD and IOC. The matrix E is an estimated covariance matrix of the audio objects s1 to sN. In current SAOC implementations, the computation of the estimated covariance matrix E is typically performed in the spectral/temporal resolution of the SAOC parameters, i.e., for each (l,m), so that the estimated covariance matrix may be written as $E^{l,m}$. The estimated covariance matrix $E^{l,m}$ is of size N×N, with its coefficients being defined as

$$e_{i,j}^{l,m} = \sqrt{\mathrm{OLD}_i^{l,m}\,\mathrm{OLD}_j^{l,m}}\;\mathrm{IOC}_{i,j}^{l,m}.$$

Thus, the matrix $E^{l,m}$ with

$$E^{l,m} = \begin{pmatrix} e_{1,1}^{l,m} & \cdots & e_{1,N}^{l,m} \\ \vdots & \ddots & \vdots \\ e_{N,1}^{l,m} & \cdots & e_{N,N}^{l,m} \end{pmatrix}$$

has along its diagonal the object level differences, i.e., $e_{i,j}^{l,m} = \mathrm{OLD}_i^{l,m}$ for i=j, since $\mathrm{OLD}_i^{l,m} = \mathrm{OLD}_j^{l,m}$ and $\mathrm{IOC}_{i,j}^{l,m} = 1$ for i=j. Outside its diagonal the estimated covariance matrix E has matrix coefficients representing the geometric mean of the object level differences of objects i and j, respectively, weighted with the inter-object cross correlation measure $\mathrm{IOC}_{i,j}^{l,m}$.
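The construction of the estimated covariance matrix from OLD and IOC, and its use in the upmixer, can be sketched as follows. The covariance function follows the coefficient definition above. The upmix function is an assumption of this sketch: it uses the commonly cited estimate G = R E Dᵀ (D E Dᵀ)⁻¹ for a two-channel downmix with real-valued downmix gains, which may differ in detail from the computation intended here. All function names are hypothetical.

```python
import math


def covariance(old, ioc):
    """E[i][j] = sqrt(OLD_i * OLD_j) * IOC_ij for one tile (l, m)."""
    n = len(old)
    return [[math.sqrt(old[i] * old[j]) * ioc[i][j] for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if abs(det) < 1e-12:
        raise ValueError("D E D^T is singular for this tile")
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def upmix_matrix(r, e, d):
    """Assumed estimate G = R E D^T (D E D^T)^-1 (two-channel downmix)."""
    dt = transpose(d)
    return matmul(matmul(matmul(r, e), dt),
                  inverse_2x2(matmul(matmul(d, e), dt)))


identity = [[1.0, 0.0], [0.0, 1.0]]
e = covariance([1.0, 0.25], identity)       # two uncorrelated objects
g = upmix_matrix(identity, e, identity)     # each object in its own channel
```

With each object occupying its own downmix channel and an identity rendering matrix, `g` reduces to the identity matrix, i.e., the downmix channels are passed through unchanged.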
Some limitations of the current SAOC concept are described now: In order to keep the amount of data associated with the side information relatively small, the side information for the different audio objects is determined in an advantageously coarse manner for time/frequency regions that span several time-slots and several (hybrid) sub-bands of the input signals corresponding to the audio objects. As stated above, the separation performance observed at the decoder side might be sub-optimal if the utilized t/f-representation is not adapted to the temporal or spectral characteristics of the object signal to be separated from the mixture signal (downmix signal) in each processing block (i.e., t/f region or t/f-tile). The side information for tonal parts of an audio object and transient parts of an audio object are determined and applied on the same time/frequency tiling, regardless of current object characteristics. This typically leads to the side information for the primarily tonal audio object parts being determined at a spectral resolution that is somewhat too coarse, and also the side information for the primarily transient audio object parts being determined at a temporal resolution that is somewhat too coarse. Similarly, applying this non-adapted side information in a decoder leads to sub-optimal object separation results that are impaired by object crosstalk in form of, e.g., spectral roughness and/or audible pre- and post-echoes.
For improving the separation performance at the decoder side, it would be desirable to enable the decoder or a corresponding method for decoding to individually adapt the t/f-representation used for processing the decoder input signals (“side information and downmix”) according to the characteristics of the desired target signal to be separated. For each target signal (object) the most suitable t/f-representation is individually selected for processing and separating, for example, out of a given set of available representations. The decoder is thereby driven by side information that signals the t/f-representation to be used for each individual object at a given time span and a given spectral region. This information is computed at the encoder and conveyed in addition to the side information already transmitted within SAOC.
The E-SIE may comprise two modules. One module computes for each object signal up to H t/f-representations, which differ in temporal and spectral resolution and meet the following requirement: time/frequency-regions R(tR,fR) can be defined such that the signal content within these regions can be described by any of the H t/f-representations.
Accordingly, an audio encoder for encoding a plurality of audio object signals si into a downmix signal X and side information PSI is provided. The audio encoder comprises an enhanced side information estimator E-SIE schematically illustrated in
Note that the grouping of the t/f-plane into t/f-regions R(tR,fR) may not necessarily be equidistantly spaced, as
The adaptation of the t/f-resolution is not only limited to specifying a differing parameter-tiling for different objects, but the transform the SAOC scheme is based on (i.e., typically presented by the common time/frequency resolution used in state-of-the-art systems for SAOC processing) can also be modified to better fit the individual target objects. This is especially useful, e.g., when a higher spectral resolution than provided by the common transform the SAOC scheme is based on is needed. In the example case of MPEG SAOC, the raw resolution is limited to the (common) resolution of the (hybrid) QMF bank. By the inventive processing, it is possible to increase the spectral resolution, but as a trade-off, some of the temporal resolution is lost in the process. This is accomplished using a so-called (spectral) zoom-transform applied on the outputs of the first filter-bank. Conceptually, a number of consecutive filter bank output samples are handled as a time-domain signal and a second transform is applied on them to obtain a corresponding number of spectral samples (with only one temporal slot). The zoom transform can be based on a filter bank (similar to the hybrid filter stage in the MPEG SAOC), or a block-based transform such as DFT or Complex Modified Discrete Cosine Transform (CMDCT). In a similar manner, it is also possible to increase the temporal resolution at the cost of the spectral resolution (temporal zoom transform): A number of concurrent outputs of several filters of the (hybrid) QMF bank are sampled as a frequency-domain signal and a second transform is applied to them to obtain a corresponding number of temporal samples (with only one large spectral band covering the spectral range of the several filters).
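The spectral zoom transform described above can be sketched for the block-based case. The sketch assumes a plain unitary DFT as the second transform and operates on the samples of a single sub-band; the function names `spectral_zoom` and `inverse_spectral_zoom` and the test signal are hypothetical and serve only as an illustration, not as the transform used in any actual codec.

```python
import cmath
import math

def spectral_zoom(subband_samples):
    """Unitary DFT over L consecutive samples of one filter-bank sub-band.

    The L input samples (L time slots) become L finer spectral samples
    that share a single, L times longer time slot.
    """
    L = len(subband_samples)
    scale = 1.0 / math.sqrt(L)
    return [scale * sum(x * cmath.exp(-2j * math.pi * q * t / L)
                        for t, x in enumerate(subband_samples))
            for q in range(L)]

def inverse_spectral_zoom(zoomed):
    """Inverse of spectral_zoom: back to L time slots of the sub-band."""
    L = len(zoomed)
    scale = 1.0 / math.sqrt(L)
    return [scale * sum(z * cmath.exp(2j * math.pi * q * t / L)
                        for q, z in enumerate(zoomed))
            for t in range(L)]

# Eight consecutive samples of one sub-band holding a single stationary partial.
samples = [cmath.exp(2j * math.pi * 2 * t / 8) for t in range(8)]
zoomed = spectral_zoom(samples)
restored = inverse_spectral_zoom(zoomed)
```

For this stationary input, the energy that was spread over eight time slots is concentrated in one zoomed spectral sample, which illustrates the gain in spectral resolution at the cost of temporal resolution. The inverse transform returns the original sub-band samples.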
For each object, the H t/f-representations are fed together with the mixing parameters into the second module, the Side Information Computation and Selection module SI-CS. The SI-CS module determines, for each of the object signals, which of the H t/f-representations should be used for which t/f-region R(tR,fR) at the decoder to estimate the object signal.
For each of the H different t/f-representations, the corresponding side information (SI) is computed. For example, the t/f-SIE module within SAOC can be utilized. The computed H side information data are fed into the Side Information Assessment and Selection module (SI-AS). For each object signal, the SI-AS module determines the most appropriate t/f-representation for each t/f-region for estimating the object signal from the signal mixture.
Besides the usual mixing scene parameters, the SI-AS outputs, for each object signal and for each t/f-region, side information that refers to the individually selected t/f-representation. An additional parameter denoting the corresponding t/f-representation may also be output.
Two methods for selecting the most suitable t/f-representation for each object signal are presented:
The parametric estimation of the SDR for the case of SAOC-based object estimation is now described.
Notations:
S: matrix of N original audio object signals
X: matrix of M mixture signals
D: downmix matrix of size M×N
X=DS: calculation of the downmix scene
Sest: matrix of N estimated audio object signals
Within SAOC, the object signals are conceptually estimated from the mixture signals with the formula:
$$S_{est} = E D^{*}\,(D E D^{*})^{-1} X \quad \text{with} \quad E = S S^{*}$$
Replacing X with DS gives:
$$S_{est} = E D^{*}\,(D E D^{*})^{-1} D S = T S$$
The energy of original object signal parts in the estimated object signals can be computed as:
$$E_{est} = S_{est} S_{est}^{*} = T S S^{*} T^{*} = T E T^{*}$$
The distortion terms in the estimated signal can then be computed by:
$$E_{dist} = \operatorname{diag}(E) - E_{est},$$
with diag(E) denoting a diagonal matrix that contains the energies of the original object signals. The SDR can then be computed by relating diag(E) to Edist. For estimating the SDR in a manner relative to the target source energy in a certain t/f-region R(tR,fR), the distortion energy calculation is carried out on each processed t/f-tile in the region R(tR,fR), and the target and the distortion energies are accumulated over all t/f-tiles within the t/f-region R(tR,fR).
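The parametric SDR estimation above can be sketched as follows. The sketch rests on several simplifying assumptions that are not taken from the text: the downmix is mono (so that DED* is a scalar and no matrix inversion is needed), all quantities are real-valued, only the diagonal of Eest is compared with diag(E), and the SDR is expressed in dB. The function names `tile_energies` and `region_sdr_db` are hypothetical.

```python
import math

def tile_energies(E, d):
    """Target and distortion energies of every object for one t/f-tile.

    E: N x N estimated covariance matrix (real-valued nested list)
    d: downmix gains of a mono downmix (length N), so D E D* is a scalar
    """
    n = len(d)
    g = [sum(E[i][j] * d[j] for j in range(n)) for i in range(n)]   # E D*
    c = sum(d[i] * g[i] for i in range(n))                          # D E D*
    T = [[g[i] * d[j] / c for j in range(n)] for i in range(n)]     # E D* (D E D*)^-1 D
    TE = [[sum(T[i][a] * E[a][j] for a in range(n)) for j in range(n)]
          for i in range(n)]
    e_est = [sum(TE[i][b] * T[i][b] for b in range(n)) for i in range(n)]  # diag(T E T*)
    target = [E[i][i] for i in range(n)]                            # diag(E)
    dist = [target[i] - e_est[i] for i in range(n)]                 # diag(E) - E_est
    return target, dist

def region_sdr_db(tiles):
    """Accumulate energies over all tiles of a region, then form the SDR in dB."""
    n = len(tiles[0][1])
    target_sum = [0.0] * n
    dist_sum = [0.0] * n
    for E, d in tiles:
        target, dist = tile_energies(E, d)
        for i in range(n):
            target_sum[i] += target[i]
            dist_sum[i] += dist[i]
    return [10.0 * math.log10(target_sum[i] / dist_sum[i])
            if dist_sum[i] > 0 else math.inf
            for i in range(n)]

# Hypothetical region consisting of one tile with two uncorrelated objects.
tiles = [([[4.0, 0.0], [0.0, 1.0]], [1.0, 1.0])]
sdr = region_sdr_db(tiles)
```

Because the estimate uses only the covariance model and the downmix gains, it can be evaluated for each candidate t/f-representation without reconstructing any object signal.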
Therefore, the suitability criterion may be based on a source estimation. In this case the side information selector (SI-AS) 56 may further comprise a source estimator configured to estimate at least a selected audio object signal of the plurality of audio object signals si using the downmix signal X and at least the first information and the second information corresponding to the first and second time/frequency resolutions TFR1, TFR2, respectively. The source estimator thus provides at least a first estimated audio object signal si, estim1 and a second estimated audio object signal si, estim2 (possibly up to H estimated audio object signals si, estim H). The side information selector 56 also comprises a quality assessor configured to assess a quality of at least the first estimated audio object signal si, estim1 and the second estimated audio object signal si, estim2. Moreover, the quality assessor may be configured to assess the quality of at least the first estimated audio object signal si, estim1 and the second estimated audio object signal si, estim2 on the basis of a signal-to-distortion ratio SDR as a source estimation performance measure, the signal-to-distortion ratio SDR being determined solely on the basis of the side information PSI, in particular the estimated covariance matrix Eest.
The audio encoder according to some embodiments may further comprise a downmix signal processor that is configured to transform the downmix signal X to a representation that is sampled in the time/frequency domain into a plurality of time-slots and a plurality of (hybrid) sub-bands. The time/frequency region R(tR,fR) may extend over at least two samples of the downmix signal X. An object-specific time/frequency resolution TFRh specified for at least one audio object may be finer than the time/frequency region R(tR,fR). As mentioned above, in relation to the uncertainty principle of time/frequency representation the spectral resolution of a signal can be increased at the cost of the temporal resolution, or vice versa. Although the downmix signal sent from the audio encoder to an audio decoder is typically analyzed in the decoder by a time-frequency transform with a fixed predetermined time/frequency resolution, the audio decoder may still transform the analyzed downmix signal within a contemplated time/frequency region R(tR,fR) object-individually to another time/frequency resolution that is more appropriate for extracting a given audio object si from the downmix signal. Such a transform of the downmix signal at the decoder is called a zoom transform in this document. The zoom transform can be a temporal zoom transform or a spectral zoom transform.
In principle, in simple embodiments of the inventive system, side information for up to H t/f-representations has to be transmitted for every object and for every t/f-region R(tR,fR) as separation at the decoder side is carried out by choosing from up to H t/f-representations. This large amount of data can be drastically reduced without significant loss of perceptual quality. For each object, it is sufficient to transmit for each t/f-region R(tR,fR) the following information:
At the decoder, the estimation of a desired audio object from the mixture can be carried out as described in the following for each t/f-region R(tR,fR).
The object-specific side information (PSIi) may comprise a fine structure object-specific side information fsliη,κ, fsci,jη,κ for the at least one audio object si in at least one time/frequency region R(tR,fR). The fine structure object-specific side information fsliη,κ may be a fine structure level information describing how the level (e.g., signal energy, signal power, amplitude, etc. of the audio object) varies within the time/frequency region R(tR,fR). The fine structure object-specific side information fsci,jη,κ may be an inter-object correlation information of the audio objects i and j, respectively. Here, the fine structure object-specific side information fsliη,κ, fsci,jη,κ is defined on a time/frequency grid according to the object-specific time/frequency resolution TFRi, with fine-structure time-slots η and fine-structure (hybrid) sub-bands κ. This topic will be described below in the context of
The side information may further comprise coarse object-specific side information OLDi, IOCi,j, and/or an absolute energy level NRGi for at least one audio object si in the considered time/frequency region R(tR,fR). The coarse object-specific side information OLDi, IOCi,j, and/or NRGi is constant within the at least one time/frequency region R(tR,fR).
Briefly, according to the embodiment shown in
The downmix signal X is provided to a plurality of object separators 1201 to 120H. Each of the object separators 1201 to 120H is configured to perform the separation task for one specific t/f-representation. To this end, each object separator 1201 to 120H further receives the side information of the N different audio objects s1 to sN in the specific t/f-representation that the object separator is associated with. Note that
The object separators 1201 to 120H provide N×H estimated separated audio objects ŝ1,1 . . . ŝN,H which may be fed to an optional t/f-resolution converter 130 in order to bring the estimated separated audio objects ŝ1,1 . . . ŝN,H to a common t/f-representation, if this is not already the case. Typically, the common t/f-resolution or representation may be the true t/f-resolution of the filter bank or transform the general processing of the audio signals is based on, i.e., in case of MPEG SAOC the common resolution is the granularity of QMF time-slots and (hybrid) sub-bands. For illustrative purposes it may be assumed that the estimated audio objects are temporarily stored in a matrix 140. In an actual implementation, estimated separated audio objects that will not be used later may be discarded immediately or are not even calculated in the first place. Each row of the matrix 140 comprises H different estimations of the same audio object, i.e., the estimated separated audio object determined on the basis of H different t/f-representations. The middle portion of the matrix 140 is schematically denoted with a grid. Each matrix element ŝ1,1 . . . ŝN,H corresponds to the audio signal of the estimated separated audio object. In other words, each matrix element comprises a plurality of time-slot/sub-band samples within the target t/f-region R(tR,fR) (e.g., 7 time-slots×3 sub-bands=21 time-slot/sub-band samples in the example of
The audio decoder is further configured to receive the object-specific time/frequency resolution information TFRI1 to TFRIN for the different audio objects and for the current t/f-region R(tR,fR). For each audio object i, the object-specific time/frequency resolution information TFRIi indicates which of the estimated separated audio objects ŝi,1 . . . ŝi,H should be used to approximately reproduce the original audio object. The object-specific time/frequency resolution information has typically been determined by the encoder and provided to the decoder as part of the side information. In
The selector 112 outputs N selected audio object signals that may be further processed. For example, the N selected audio object signals may be provided to a renderer 150 configured to render the selected audio object signals to an available loudspeaker setup, e.g., stereo or 5.1 loudspeaker setup. To this end, the renderer 150 may receive preset rendering information and/or user rendering information that describes how the audio signals of the estimated separated audio objects should be distributed to the available loudspeakers. The renderer 150 is optional and the estimated separated audio objects ŝi,1 . . . ŝi,H at the output of the selector 112 may be used and processed directly. In alternative embodiments, the renderer 150 may be set to extreme settings such as “solo mode” or “karaoke mode.” In the solo mode, a single estimated audio object is selected to be rendered to the output signal. In the karaoke mode, all but one estimated audio object are selected to be rendered to the output signal. Typically the lead vocal part is not rendered, but the accompaniment parts are. Both modes are highly demanding in terms of separation performance, as even little crosstalk is perceivable.
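The selection driven by the object-specific time/frequency resolution information, combined with the remark that unused estimates need not be calculated in the first place, can be sketched as follows. The interface is an assumption made purely for illustration: `separators[h]` stands for a hypothetical callable that estimates object i from the downmix in t/f-representation h, and `tfri[i]` is the index signaled for object i in the current t/f-region.

```python
def separate_selected(separators, downmix, tfri):
    """Return one estimate per object, computed only in the signaled representation.

    separators: list of H callables; separators[h](downmix, i) is assumed to
                return the estimate of object i in t/f-representation h
    tfri:       list of N indices; tfri[i] is the representation signaled
                for object i in the current t/f-region
    """
    return [separators[h](downmix, i) for i, h in enumerate(tfri)]

# Stand-in separators that only label their output, for illustration.
separators = [lambda x, i, h=h: ("object", i, "representation", h)
              for h in range(3)]
selected = separate_selected(separators, downmix=None, tfri=[2, 0, 1])
```

With this arrangement only N of the N×H possible estimates are evaluated per t/f-region, which corresponds to the case in which the full matrix of estimates is never materialized.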
When determining the side information for the audio object i at the audio encoder side, the audio encoder analyzes the audio object i within the t/f-region R(tR,fR) and determines a coarse side information and a fine structure side information. The coarse side information may be the object level difference OLDi, the inter-object correlation IOCi,j and/or an absolute energy level NRGi, as defined in, among others, the SAOC standard ISO/IEC 23003-2. The coarse side information is defined on a t/f-region basis and typically provides backward compatibility as existing SAOC decoders use this kind of side information. The fine structure object-specific side information fslin,k for the object i provides three further values indicating how the energy of the audio object i is distributed among three spectral sub-regions. In the illustrated case, each of the three spectral sub-regions corresponds to one (hybrid) sub-band, but other distributions are also possible. It may even be envisaged to make one spectral sub-region smaller than another spectral sub-region in order to have a particularly fine spectral resolution available in the smaller spectral sub-band. In a similar manner, the same t/f-region R(tR,fR) may be subdivided into several temporal sub-regions for more adequately representing the content of audio object j in the t/f-region R(tR,fR).
The fine structure object-specific side information fslin,k may describe a difference between the coarse object-specific side information (e.g., OLDi, IOCi,j, and/or NRGi) and the at least one audio object si.
The lower part of
The object separator 120 may be configured to determine the estimated covariance matrix En,k with elements ei,jn,k of the at least one audio object si and at least one further audio object sj according to
$$e_{i,j}^{n,k} = \sqrt{fsl_i^{n,k}\,fsl_j^{n,k}}\;fsc_{i,j}^{n,k},$$
wherein
At least one of fslin,k, fsljn,k, and fsci,jn,k varies within the time/frequency region R(tR,fR) according to the object-specific time/frequency resolution TFRh for the audio objects i or j indicated by the object-specific time/frequency resolution information TFRIi, TFRIj, respectively. The object separator 120 may be further configured to separate the at least one audio object si from the downmix signal X using the estimated covariance matrix En,k in the manner described above.
An alternative to the approach described above has to be taken when the spectral or temporal resolution is increased from the resolution of the underlying transform, e.g., with a subsequent zoom transform. In such a case, the estimation of the object covariance matrix needs to be done in the zoomed domain, and the object reconstruction takes place also in the zoomed domain. The reconstruction result can then be inverse transformed back to the domain of the original transform, e.g., (hybrid) QMF, and the interleaving of the tiles into the final reconstruction takes place in this domain. In principle, the calculations operate in the same way as they would in the case of utilizing a differing parameter tiling with the exception of the additional transforms.
The processing is performed at the object-specific time/frequency resolution TFRh by the object separator 121 which also receives the side information of at least one of the audio objects in the object-specific time/frequency resolution TFRh. In the example of
The object separator 121 outputs at least one extracted audio object ŝi for the time/frequency region R(tR,fR) at the object-specific time/frequency resolution (zoom t/f-resolution). The at least one extracted audio object ŝi is then inverse zoom transformed by an inverse zoom transformer 132 to obtain the extracted audio object ŝi in R(tR,fR) at the time/frequency resolution of the downmix signal or at another desired time/frequency resolution. The extracted audio object ŝi in R(tR,fR) is then combined with the extracted audio object ŝi in other time/frequency regions, e.g., R(tR−1,fR−1), R(tR−1,fR), . . . R(tR+1,fR+1), in order to assemble the extracted audio object ŝi.
According to corresponding embodiments, the audio decoder may comprise a downmix signal time/frequency transformer 115 configured to transform the downmix signal X within the time/frequency region R(tR,fR) from a downmix signal time/frequency resolution to at least the object-specific time/frequency resolution TFRh of the at least one audio object si to obtain a re-transformed downmix signal Xη,κ. The downmix signal time/frequency resolution is related to downmix time-slots n and downmix (hybrid) sub-bands k. The object-specific time/frequency resolution TFRh is related to object-specific time-slots η and object-specific (hybrid) sub-bands κ. The object-specific time-slots η may be finer or coarser than the downmix time-slots n of the downmix time/frequency resolution. Likewise, the object-specific (hybrid) sub-bands κ may be finer or coarser than the downmix (hybrid) sub-bands of the downmix time/frequency resolution. As explained above in relation to the uncertainty principle of time/frequency representation, the spectral resolution of a signal can be increased at the cost of the temporal resolution, and vice versa. The audio decoder may further comprise an inverse time/frequency transformer 132 configured to time/frequency transform the at least one audio object si within the time/frequency region R(tR,fR) from the object-specific time/frequency resolution TFRh back to the downmix signal time/frequency resolution. The object separator 121 is configured to separate the at least one audio object si from the downmix signal X at the object-specific time/frequency resolution TFRh.
In the zoomed domain, the estimated covariance matrix Eη,κ is defined for the object-specific time-slots η and the object-specific (hybrid) sub-bands κ. The abovementioned formula for the elements of the estimated covariance matrix of the at least one audio object si and at least one further audio object sj may be expressed in the zoomed domain as:
$$e_{i,j}^{\eta,\kappa} = \sqrt{fsl_i^{\eta,\kappa}\,fsl_j^{\eta,\kappa}}\;fsc_{i,j}^{\eta,\kappa},$$
wherein
As explained above, the further audio object j might not be defined by side information that has the object-specific time/frequency resolution TFRh of the audio object i so that the parameters fsljη,κ and fsci,jη,κ may not be available or determinable at the object-specific time/frequency resolution TFRh. In this case, the coarse side information of audio object j in R(tR,fR) or temporally averaged values or spectrally averaged values may be used to approximate the parameters fsljη,κ and fsci,jη,κ in the time/frequency region R(tR,fR) or in sub-regions thereof.
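The fallback described above, in which coarse side information of object j stands in for missing fine structure parameters, can be sketched for a pair of objects. The sketch assumes that the fine structure levels of object i are given as absolute levels on the same scale as the coarse level of object j, and that the coarse values are used unchanged in every fine-structure tile; both are illustrative assumptions, as is the function name `zoomed_covariance`.

```python
import math

def zoomed_covariance(fsl_i, old_j, ioc_ij):
    """2 x 2 covariance per fine-structure tile for objects i and j.

    fsl_i:  dict mapping (eta, kappa) -> fine-structure level of object i
    old_j:  coarse level of object j, constant over the region (stands in for fsl_j)
    ioc_ij: coarse correlation, constant over the region (stands in for fsc_ij)
    """
    cov = {}
    for tile, level_i in fsl_i.items():
        # e_{i,j} = sqrt(fsl_i * fsl_j) * fsc_{i,j}, with coarse values for object j
        cross = math.sqrt(level_i * old_j) * ioc_ij
        cov[tile] = [[level_i, cross],
                     [cross, old_j]]
    return cov

# Hypothetical region with two fine-structure sub-bands in one time slot.
fsl_i = {(0, 0): 4.0, (0, 1): 1.0}
cov = zoomed_covariance(fsl_i, old_j=1.0, ioc_ij=0.5)
```

Temporally or spectrally averaged values could be substituted for `old_j` and `ioc_ij` in the same way if they are available for sub-regions of R(tR,fR).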
Also at the encoder side, the fine structure side information should typically be considered. In an audio encoder according to embodiments the side information determiner (t/f-SIE) 55-1 . . . 55-H is further configured to provide fine structure object-specific side information fslin,k or fsliη,κ and coarse object-specific side information OLDi as a part of at least one of the first side information and the second side information. The coarse object-specific side information OLDi is constant within the at least one time/frequency region R(tR,fR). The fine structure object-specific side information fslin,k, fsliη,κ may describe a difference between the coarse object-specific side information OLDi and the at least one audio object si. The inter-object correlations IOCi,j and fsci,jn,k, fsci,jη,κ may be processed in an analogous manner, as well as other parametric side information.
Backward Compatibility with SAOC
The proposed solution advantageously improves the perceptual audio quality, possibly even in a fully decoder-compatible way. By defining the t/f-regions R(tR,fR) to be congruent to the t/f-grouping within state-of-the-art SAOC, existing standard SAOC decoders can decode the backward compatible portion of the PSI and produce reconstructions of the objects on a coarse t/f-resolution level. If the added information is used by an enhanced SAOC decoder, the perceptual quality of the reconstructions is considerably improved. For each audio object, this additional side information comprises information indicating which individual t/f-representation should be used for estimating the object, together with a description of the object fine structure based on the selected t/f-representation.
Additionally, if an enhanced SAOC decoder is running on limited resources, the enhancements can be ignored, and a basic quality reconstruction can still be obtained requiring only low computational complexity.
The concept of object-specific t/f-representations and its associated signaling to the decoder can be applied to any SAOC scheme. It can be combined with any current and also future audio formats. The concept allows for enhanced perceptual audio object estimation in SAOC applications by an audio object adaptive choice of an individual t/f-resolution for the parametric estimation of audio objects.
Although some aspects have been described in the context of an apparatus, it is clear that these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus. Some or all of the method steps may be executed by (or using) a hardware apparatus, for example, a microprocessor, a programmable computer, or an electronic circuit. In some embodiments, some single or multiple method steps may be executed by such an apparatus.
The inventive encoded audio signal can be stored on a digital storage medium or can be transmitted on a transmission medium such as a wireless transmission medium or a wired transmission medium such as the Internet.
Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, for example, a floppy disk, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine readable carrier.
Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine readable carrier.
In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. The data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transmitting.
A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
In some embodiments, a programmable logic device (for example, a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are advantageously performed by any hardware apparatus.
The above described embodiments are merely illustrative of the principles of the present invention. It is understood that modifications and variations of the arrangements and the details described herein will be apparent to others skilled in the art. It is the intent, therefore, to be limited only by the scope of the appended patent claims and not by the specific details presented by way of description and explanation of the embodiments herein.
Number | Date | Country | Kind |
---|---|---|---|
13167484.8 | May 2013 | EP | regional |
This application is a continuation of copending International Application No. PCT/EP2014/059570, filed May 9, 2014, which is incorporated herein by reference in its entirety, and additionally claims priority from European Application No. EP 13167484.8, filed May 13, 2013, which is also incorporated herein by reference in its entirety.
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/EP2014/059570 | May 2014 | US |
Child | 14939677 | | US |