The present invention relates to audio signal processing and, in particular, to an apparatus and a method for extracting a direct/ambience signal from a downmix signal and spatial parametric information. Further embodiments of the present invention relate to a utilization of direct/ambience separation for enhancing binaural reproduction of audio signals. Yet further embodiments relate to binaural reproduction of multi-channel sound, where multi-channel audio means audio having two or more channels. Typical examples of audio content having multi-channel sound are movie soundtracks and multi-channel music recordings.
The human spatial hearing system tends to process sound roughly in two parts. These are, on the one hand, a localizable or direct part and, on the other hand, an unlocalizable or ambient part. There are many audio processing applications, such as binaural sound reproduction and multi-channel upmixing, where it is desirable to have access to these two audio components.
In the art, methods of direct/ambience separation are known, which may be used for various applications, as described in “Primary-ambient signal decomposition and vector-based localization for spatial audio coding and enhancement”, Goodwin, Jot, IEEE Intl. Conf. on Acoustics, Speech and Signal Proc., April 2007; “Correlation-based ambience extraction from stereo recordings”, Merimaa, Goodwin, Jot, AES 123rd Convention, New York, 2007; “Multiple-loudspeaker playback of stereo signals”, C. Faller, Journal of the AES, October 2007; “Primary-ambient decomposition of stereo audio signals using a complex similarity index”, Goodwin et al., Pub. No: US2009/0198356 A1, August 2009; the patent application “Method to Generate Multi-Channel Audio Signal from Stereo Signals”, Inventors: Christof Faller, Agents: FISH & RICHARDSON P. C., Assignees: LG ELECTRONICS, INC., Origin: MINNEAPOLIS, Minn. US, IPC8 Class: AH04R500FI, USPC Class: 381 1; and “Ambience generation for stereo signals”, Avendano et al., Date Issued: Jul. 28, 2009, Application: Ser. No. 10/163,158, Filed: Jun. 4, 2002. The state-of-the-art direct/ambience separation algorithms are based on inter-channel signal comparison of stereo sound in frequency bands.
Moreover, in “Binaural 3-D Audio Rendering Based on Spatial Audio Scene Coding”, Goodwin, Jot, AES 123rd Convention, New York 2007, binaural playback with ambience extraction is addressed. Ambience extraction in connection with binaural reproduction is also mentioned in J. Usher and J. Benesty, “Enhancement of spatial sound quality: a new reverberation-extraction audio upmixer,” IEEE Trans. Audio, Speech, Language Processing, vol. 15, pp. 2141-2150, September 2007. The latter paper focuses on ambience extraction in stereo microphone recordings, using adaptive least-mean-square cross-channel filtering of the direct component in each channel. Spatial audio codecs, e.g. MPEG Surround, typically consist of a one- or two-channel audio stream in combination with spatial side information, which extends the audio into multiple channels, as described in ISO/IEC 23003-1, MPEG Surround; and Breebaart, J., Herre, J., Villemoes, L., Jin, C., Kjörling, K., Plogsties, J., Koppens, J. (2006). “Multi-channel goes mobile: MPEG Surround binaural rendering”. Proc. 29th AES conference, Seoul, Korea.
However, modern parametric audio coding technologies, such as MPEG Surround (MPS) and parametric stereo (PS), only provide a reduced number of audio downmix channels (in some cases only one) along with additional spatial side information. The comparison between the “original” input channels is then only possible after first decoding the sound into the intended output format.
Therefore, a concept for extracting a direct signal portion or an ambient signal portion from a downmix signal and spatial parametric information is needed. However, there are no existing solutions to the direct/ambience extraction using the parametric side information.
According to an embodiment, an apparatus for extracting a direct and/or ambience signal from a downmix signal and spatial parametric information, the downmix signal and the spatial parametric information representing a multi-channel audio signal having more channels than the downmix signal, wherein the spatial parametric information has inter-channel relations of the multi-channel audio signal, may have a direct/ambience estimator for estimating a direct level information of a direct portion of the multi-channel audio signal and/or for estimating an ambience level information of an ambient portion of the multi-channel audio signal based on the spatial parametric information; and a direct/ambience extractor for extracting a direct signal portion and/or an ambient signal portion from the downmix signal based on the estimated direct level information of the direct portion or based on the estimated ambience level information of the ambient portion.
According to another embodiment, a method for extracting a direct and/or ambience signal from a downmix signal and spatial parametric information, the downmix signal and the spatial parametric information representing a multi-channel audio signal having more channels than the downmix signal, wherein the spatial parametric information has inter-channel relations of the multi-channel audio signal, may have the steps of estimating a direct level information of a direct portion of the multi-channel audio signal and/or estimating an ambience level information of an ambient portion of the multi-channel audio signal based on the spatial parametric information; and extracting a direct signal portion and/or an ambient signal portion from the downmix signal based on the estimated direct level information of the direct portion or based on the estimated ambience level information of the ambient portion.
According to another embodiment, a computer program may have a program code for performing, when the computer program is executed on a computer, the method of extracting a direct and/or ambience signal from a downmix signal and spatial parametric information, the downmix signal and the spatial parametric information representing a multi-channel audio signal comprising more channels than the downmix signal, wherein the spatial parametric information comprises inter-channel relations of the multi-channel audio signal, the method having the steps of estimating a direct level information of a direct portion of the multi-channel audio signal and/or estimating an ambience level information of an ambient portion of the multi-channel audio signal based on the spatial parametric information; and extracting a direct signal portion and/or an ambient signal portion from the downmix signal based on the estimated direct level information of the direct portion or based on the estimated ambience level information of the ambient portion.
The basic idea underlying the present invention is that the above-mentioned direct/ambience extraction can be achieved when a level information of a direct portion or an ambient portion of a multi-channel audio signal is estimated based on the spatial parametric information and a direct signal portion or an ambient signal portion is extracted from a downmix signal based on the estimated level information. Here, the downmix signal and the spatial parametric information represent the multi-channel audio signal having more channels than the downmix signal. This measure enables a direct and/or ambience extraction from a downmix signal having one or more input channels by using spatial parametric side information.
According to an embodiment of the present invention, an apparatus for extracting a direct/ambience signal from a downmix signal and spatial parametric information comprises a direct/ambience estimator and a direct/ambience extractor. The downmix signal and the spatial parametric information represent a multi-channel audio signal having more channels than the downmix signal. Moreover, the spatial parametric information comprises inter-channel relations of the multi-channel audio signal. The direct/ambience estimator is configured for estimating a level information of a direct portion or an ambient portion of the multi-channel audio signal based on the spatial parametric information. The direct/ambience extractor is configured for extracting a direct signal portion or an ambient signal portion from the downmix signal based on the estimated level information of the direct portion or the ambient portion.
According to another embodiment of the present invention, the apparatus for extracting a direct/ambience signal from a downmix signal and spatial parametric information further comprises a binaural direct sound rendering device, a binaural ambient sound rendering device and a combiner. The binaural direct sound rendering device is configured for processing the direct signal portion to obtain a first binaural output signal. The binaural ambient sound rendering device is configured for processing the ambient signal portion to obtain a second binaural output signal. The combiner is configured for combining the first and the second binaural output signals to obtain a combined binaural output signal. Therefore, a binaural reproduction of an audio signal, wherein the direct signal portion and the ambience signal portion of the audio signal are processed separately, may be provided.
In the following, embodiments of the present invention are explained with reference to the accompanying drawings in which:
a is a schematic illustration of the spectral decomposition of a multi-channel audio signal according to an embodiment of the present invention;
b is a schematic illustration for calculating inter-channel relations of a multi-channel audio signal based on the spectral decomposition of
a is a block diagram of an embodiment of a direct/ambience estimator using a stereo ambience estimation formula;
b is a graph of an exemplary direct-to-total energy ratio versus inter-channel coherence;
a is a block diagram of an overview of binaural direct sound rendering according to an embodiment of the present invention;
b is a block diagram of details of the binaural direct sound rendering of
a is a block diagram of an overview of binaural ambient sound rendering according to an embodiment of the present invention;
b is a block diagram of details of the binaural ambient sound rendering of
a is a block diagram of an embodiment of an apparatus for extracting a direct/ambient signal from a mono downmix signal in a filterbank domain;
b is a block diagram of an embodiment of a direct/ambience extraction block of
In practice, the spatial parameters (spatial parametric information 105) in the
Specifically, the embodiments of
The estimation of direct and/or ambience levels (level information 113) is based on information about the inter-channel relations or inter-channel differences, such as level differences and/or correlation. These values can be calculated from a stereo or multi-channel signal.
wherein Chi is the inspected channel and R the linear combination of remaining channels, while < . . . > denotes a time average. An example of a linear combination R of remaining channels is their energy-normalized sum. Furthermore, the channel level difference (CLDi) is typically a decibel value of the parameter σi.
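The quantities named above can be illustrated with a minimal sketch. It assumes the usual definitions, which are not reproduced in this text: σ as the square root of the ratio of time-averaged energies, and ICC as the real part of the normalized time-averaged cross-correlation. The function names, the reading of "energy-normalized sum", and the 20·log10 convention for the decibel value are assumptions made for illustration only.

```python
import numpy as np


def rest_combination(channels, i):
    """Energy-normalized sum R of all channels except channel i.

    Assumption: "energy-normalized" is read here as scaling the plain sum so
    that its energy equals the summed energies of the remaining channels.
    """
    others = np.delete(np.asarray(channels), i, axis=0)
    total = others.sum(axis=0)
    target = np.sum(np.mean(np.abs(others) ** 2, axis=1))
    actual = np.mean(np.abs(total) ** 2)
    return total * np.sqrt(target / actual) if actual > 0 else total


def sigma_and_icc(ch, r):
    """Return (sigma, ICC) between an inspected channel Ch and a combination R."""
    p_ch = np.mean(np.abs(ch) ** 2)        # time average <Ch Ch*>
    p_r = np.mean(np.abs(r) ** 2)          # time average <R R*>
    cross = np.mean(ch * np.conj(r))       # time average <Ch R*>
    sigma = np.sqrt(p_ch / p_r)            # assumed amplitude-type level ratio
    icc = np.real(cross) / np.sqrt(p_ch * p_r)
    return float(sigma), float(icc)


def cld_db(sigma):
    """Decibel value of sigma, assuming sigma is an amplitude ratio."""
    return 20.0 * np.log10(sigma)
```

In practice the time averages would be taken per frequency band over a short window; the sketch averages over the whole input for brevity.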
With reference to the above equations, the channel level difference (CLDi) or parameter σi may correspond to a level Pi of channel Chi normalized to a level PR of the linear combination R of the rest of the channels. Here, the levels Pi or PR can be derived from the inter-channel level difference parameter ICLDi of channel Chi and a linear combination ICLDR of inter-channel level difference parameters ICLDj (j≠i) of the rest of the channels.
Here, ICLDi and ICLDj may be related to a reference channel Chref, respectively. In further embodiments, the inter-channel level difference parameters ICLDi and ICLDj may also be related to any other channel of the multi-channel audio signal (Ch1 . . . ChN) being the reference channel Chref. This, eventually, will lead to the same result for the channel level difference (CLDi) or parameter σi.
According to further embodiments, the inter-channel relations 335 of
According to further embodiments, the direct/ambience extractor 420 may also be configured to perform a downmix of the estimated level information 113 of the direct portion or the ambient portion of the multi-channel audio signal 101 by combining the estimated level information of the direct portion with coherent summation and the estimated level information of the ambient portion with incoherent summation.
It is pointed out that the estimated level information may represent energy levels or power levels of the direct portion or the ambient portion, respectively.
In particular, the downmixing of the energies (i.e. level information 113) of the estimated direct/ambient part may be performed by assuming full incoherence or full coherence between the channels. The two formulas that may be applied in case of downmixing based on incoherent or coherent summation, respectively, are as follows.
For incoherent signals, the downmixed energy or downmixed level information can be calculated by
For coherent signals, the downmixed energy or downmixed level information can be calculated by
Here, g is the downmix gain, which may be obtained from the downmixing information, while E(Chi) denotes the energy of the direct/ambient portion of a channel Chi of the multi-channel audio signal. As a typical example of incoherent downmixing, in case of downmixing 5.1 channels into two, the energy of the left downmix can be:
EL
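The two summation rules described above can be sketched as follows. The formulas used are the standard energy-sum and amplitude-sum forms and are an assumption here, since the equations themselves are not reproduced in this text; the downmix gains in the usage lines are hypothetical values, not taken from any particular downmix specification.

```python
import numpy as np


def downmix_energy_incoherent(gains, energies):
    """Incoherent summation: energies add, each weighted by its squared gain."""
    g = np.asarray(gains, dtype=float)
    e = np.asarray(energies, dtype=float)
    return float(np.sum(g ** 2 * e))


def downmix_energy_coherent(gains, energies):
    """Coherent summation: amplitudes add first, then the sum is squared."""
    g = np.asarray(gains, dtype=float)
    e = np.asarray(energies, dtype=float)
    return float(np.sum(g * np.sqrt(e)) ** 2)


# Hypothetical gains for a left downmix built from L, Ls and C.
gains_left = [1.0, 1.0, np.sqrt(0.5)]
ambient_energy = [0.2, 0.4, 0.1]   # made-up ambient energies of L, Ls, C
direct_energy = [1.0, 0.1, 0.5]    # made-up direct energies of L, Ls, C

ambient_left = downmix_energy_incoherent(gains_left, ambient_energy)
direct_left = downmix_energy_coherent(gains_left, direct_energy)
```

Following the text, the ambient energies are combined incoherently and the direct energies coherently.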
Here, it is to be noted that in the embodiments of
In further embodiments, the direct/ambience extractor 520 is configured to determine a direct-to-total (DTT) or an ambient-to-total (ATT) energy ratio from the downmixed level information 555-1, 555-2 of the direct portion or the ambient portion and use as the gain parameters 565-1, 565-2 extraction parameters based on the determined DTT or ATT energy ratio.
In yet further embodiments, the direct/ambience extractor 520 is configured to multiply the downmix signal 115 with a first extraction parameter sqrt (DTT) to obtain the direct signal portion 125-1 and with a second extraction parameter sqrt (ATT) to obtain the ambient signal portion 125-2. Here, the downmix signal 115 may correspond to the mono downmix signal 215 as shown in the
In the mono downmix case, the ambience extraction can be done by applying sqrt(ATT) and sqrt(DTT). However, the same approach is valid also for multichannel downmix signals, in particular, by applying sqrt(ATTi) and sqrt(DTTi) for each channel Chi.
According to further embodiments, in case the downmix signal 115 comprises a plurality of channels (‘multichannel downmix case’), the direct/ambience extractor 520 may be configured to apply a first plurality of extraction parameters, e.g. sqrt(DTTi), to the downmix signal 115 to obtain the direct signal portion 125-1 and a second plurality of extraction parameters, e.g. sqrt(ATTi), to the downmix signal 115 to obtain the ambient signal portion 125-2. Here, the first and the second plurality of extraction parameters may constitute a diagonal matrix.
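The diagonal extraction described above may be sketched as follows. The sketch assumes that DTT and ATT are formed from the downmixed direct and ambient energies as DTT = E_direct / (E_direct + E_ambient) and ATT = 1 − DTT; the function names and array layout are illustrative only.

```python
import numpy as np


def dtt_att(direct_energy, ambient_energy):
    """Direct-to-total and ambient-to-total ratios per downmix channel."""
    d = np.atleast_1d(np.asarray(direct_energy, dtype=float))
    a = np.atleast_1d(np.asarray(ambient_energy, dtype=float))
    total = d + a
    # Where the total energy is zero the ratio is undefined; 1.0 is used
    # there as an arbitrary placeholder.
    dtt = np.divide(d, total, out=np.ones_like(d), where=total > 0)
    return dtt, 1.0 - dtt


def extract_diagonal(downmix, dtt):
    """Apply sqrt(DTT_i) and sqrt(ATT_i) to each downmix channel i.

    downmix has shape (M, T): M downmix channels, T samples or subband values.
    """
    x = np.atleast_2d(np.asarray(downmix))
    dtt = np.atleast_1d(np.asarray(dtt, dtype=float))
    direct = np.sqrt(dtt)[:, None] * x
    ambient = np.sqrt(1.0 - dtt)[:, None] * x
    return direct, ambient
```

Because DTT_i + ATT_i = 1, the energies of the two outputs add up to the energy of the corresponding downmix channel. The mono case is the same call with M = 1.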
In general, the direct/ambience extractor 120; 420; 520 can also be configured to extract the direct signal portion 125-1 or the ambient signal portion 125-2 by applying a quadratic M-by-M extraction matrix to the downmix signal 115, wherein a size (M) of the quadratic M-by-M extraction matrix corresponds to a number (M) of downmix channels (Ch1 . . . ChM).
The application of ambience extraction can therefore be described by applying a quadratic M-by-M extraction matrix, where M is the number of downmix channels (Ch1 . . . ChM). This may include all possible ways to manipulate the input signal to get the direct/ambience output, including the relatively simple approach based on the sqrt(ATTi) and sqrt(DTTi) parameters representing main elements of a quadratic M-by-M extraction matrix being configured as a diagonal matrix, or an LMS crossmixing approach as a full matrix. The latter will be described in the following. Here, it is to be noted that the above approach of applying the M-by-M extraction matrix covers any number of channels, including one.
According to further embodiments, the extraction matrix need not be a quadratic matrix of size M-by-M, because there may be fewer output channels than downmix channels. In that case, the extraction matrix has a reduced number of rows. An example of this would be extracting a single direct signal instead of M.
It is also not necessary to take all M downmix channels as the input corresponding to having M columns of the extraction matrix. This, in particular, could be relevant to applications where it is not required to have all channels as inputs.
The symbols used in the LMS solution for the crossmixing weights for direct/ambience extraction are:
$Ch_i$: channel i
$a_i$: gain of the direct sound in channel i
$D$ and $\hat{D}$: direct part of the sound and its estimate
$A_i$ and $\hat{A}_i$: ambient part of channel i and its estimate
$P_X = E[XX^*]$: estimated energy of X
$E[\,]$: expectation
$E_{\hat{X}}$: estimation error of X
$w_{\hat{D}i}$: LMS crossmixing weights for channel i to the direct part
$w_{\hat{A}i,n}$: LMS crossmixing weights for channel n to ambience of channel i
In this context, it is to be noted that the derivation of the LMS solution may be based on a spectral representation of respective channels of the multi-channel audio signal, which means that everything functions in frequency bands.
The signal model is given by
$Ch_i = a_i D + A_i$
The derivation first deals with a) the direct part and then b) with the ambient part. Finally, the solution for the weights is derived and the method for a normalization of the weights is described.
a) Direct Part
The estimate of the direct part, expressed through the weights, is
The estimation error reads
To obtain the LMS solution, the estimation error $E_{\hat{D}}$ needs to be orthogonal to the input signals:
$E[E_{\hat{D}}\,Ch_k] = 0$, for all k
In matrix form, the above relation reads
b) Ambience Part
We start from the same signal model and estimate the weights from
The estimation error is
and the orthogonality
$E[E_{\hat{A}_i}\,Ch_k] = 0$, for all k
In matrix form, the above relation reads
Solution for the Weights
The weights can be solved by inverting matrix A, which is identical in the calculation of both the direct part and the ambient part. In the case of stereo signals, the solution is:
where div is the divisor $a_2^2 P_D P_{A_1} + a_1^2 P_D P_{A_2} + P_{A_1} P_{A_2}$.
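The LMS solution can be checked numerically with a small sketch. It assumes the signal model Ch_i = a_i D + A_i with real gains, and with D and all A_i mutually uncorrelated, so that the channel covariance (taken here to be the matrix A of the text) is a_i a_k P_D plus P_Ai on the diagonal. The function name and the test values are illustrative; the closed form used in the check is derived from these assumptions and shares the divisor div given in the text.

```python
import numpy as np


def lms_crossmix_weights(a, p_d, p_a):
    """Solve the orthogonality conditions for the crossmixing weights.

    a   : direct-sound gains a_i, shape (N,)
    p_d : energy P_D of the direct part
    p_a : ambient energies P_Ai, shape (N,)
    Returns (w_direct, w_ambient); w_ambient[i, n] is the weight applied to
    channel n when estimating the ambience of channel i.
    """
    a = np.asarray(a, dtype=float)
    p_a = np.asarray(p_a, dtype=float)
    # Channel covariance E[Ch_i Ch_k] under the assumed signal model.
    cov = p_d * np.outer(a, a) + np.diag(p_a)
    # Direct part: right-hand side is E[D Ch_k] = a_k * P_D.
    w_direct = np.linalg.solve(cov, p_d * a)
    # Ambience of channel i: right-hand side is P_Ai for k == i, else 0.
    w_ambient = np.linalg.solve(cov, np.diag(p_a)).T
    return w_direct, w_ambient


# Stereo check against a closed form that uses the divisor "div" of the text.
a1, a2, p_d, p_a1, p_a2 = 1.0, 0.5, 2.0, 0.3, 0.7
div = a2 * a2 * p_d * p_a1 + a1 * a1 * p_d * p_a2 + p_a1 * p_a2
w_d, w_a = lms_crossmix_weights([a1, a2], p_d, [p_a1, p_a2])
closed_form_direct = np.array([a1 * p_d * p_a2, a2 * p_d * p_a1]) / div
print(np.allclose(w_d, closed_form_direct))
```

Solving the linear system directly avoids forming the inverse explicitly and works for any number of channels, not only stereo.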
Normalization of the Weights
The weights above are the LMS solution, but because the energy levels should be preserved, the weights are normalized. This also makes the division by the term div in the above formulas unnecessary. The normalization ensures that the energies of the output direct and ambient channels are PD and PAi, where i is the channel index.
This is straightforward, assuming that the inter-channel coherences, mixing factors and channel energies are known. For simplicity, we focus on the two-channel case and specifically on one weight pair, wÂ1,1 and wÂ1,2, which are the gains used to produce the first ambience channel from the first and second input channels. The steps are as follows:
Step 1: Calculate the output signal energy (wherein coherent part adds up amplitudewise, and incoherent part energywise)
$P_{\hat{A}_1} = \left(w_{\hat{A}1,1}\sqrt{|ICC| \cdot P_1} + \mathrm{sign}(ICC)\, w_{\hat{A}1,2}\sqrt{|ICC| \cdot P_2}\right)^2 + (1-|ICC|)\, P_1\, w_{\hat{A}1,1}^2 + (1-|ICC|)\, P_2\, w_{\hat{A}1,2}^2$
Step 2: Calculate the normalization gain factor
and apply the result to the crossmixing weight factors wÂ1,1 and wÂ1,2. In step 1, the absolute values and the sign-operators for the ICC are included to take into account also the case that the input channels are negatively coherent. The remaining weight factors are also normalized in the same fashion.
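The two normalization steps can be sketched as follows. Step 1 follows the energy expression given above. The gain in step 2 is assumed here to be the square root of the ratio of the target energy to the computed output energy, since the formula itself is not reproduced in this text. Function and parameter names are illustrative.

```python
import numpy as np


def output_energy(w1, w2, p1, p2, icc):
    """Step 1: energy of w1*Ch1 + w2*Ch2 for channel energies p1, p2.

    The coherent part adds up amplitude-wise, the incoherent part energy-wise.
    """
    coherent = (w1 * np.sqrt(abs(icc) * p1)
                + np.sign(icc) * w2 * np.sqrt(abs(icc) * p2)) ** 2
    incoherent = (1.0 - abs(icc)) * (p1 * w1 ** 2 + p2 * w2 ** 2)
    return float(coherent + incoherent)


def normalize_weights(w1, w2, p1, p2, icc, target_energy):
    """Step 2 (assumed form): scale both weights to reach the target energy."""
    gain = np.sqrt(target_energy / output_energy(w1, w2, p1, p2, icc))
    return float(gain * w1), float(gain * w2)
```

For nonzero ICC, the step 1 expression is algebraically equal to w1²·P1 + w2²·P2 + 2·ICC·w1·w2·sqrt(P1·P2), which is the energy of a weighted sum of two channels with coherence ICC.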
In particular, referring to the above, the direct/ambience extractor 620 may be configured to derive the LMS solution by assuming a stable multi-channel signal model, such that the LMS solution will not be restricted to a stereo channel downmix signal.
a shows a block diagram of an embodiment 700 of a direct/ambience estimator 710, which is based on a stereo ambience estimation formula. The direct/ambience estimator 710 of
$DTT_i = f_{DTT}[\sigma_i(Ch_i, R),\, ICC_i(Ch_i, R)]$,
$ATT_i = 1 - DTT_i$
explicitly showing a dependency on a channel level difference (CLDi) or parameter σi and an inter-channel coherence (ICCi) parameter of the channel Chi. As depicted in
In particular, the direct/ambience ratio estimation can be performed in that the ratio (DTT) of the direct energy in a channel in comparison to the total energy of that channel may be formulated by
where
Ch is the inspected channel and R is the linear combination of the rest of the channels, and < . . . > is the time average. This formula follows when the ambience level is assumed to be equal in the channel and in the linear combination of the rest of the channels, and the coherence of the ambience between the two is assumed to be zero.
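As an illustration of how such a ratio can follow from the stated assumptions, the sketch below uses one closed form derived here, not copied from the source. The model is Ch = aD + A_Ch and R = bD + A_R, with equal ambient energy P_A in both and mutually incoherent ambience. Then ICC²·P_Ch·P_R = (P_Ch − P_A)(P_R − P_A), which is a quadratic in P_A; taking its smaller root and forming 1 − P_A/P_Ch gives the expression in the code. The function name is hypothetical.

```python
import numpy as np


def direct_to_total(sigma, icc):
    """DTT of the inspected channel from sigma = sqrt(P_Ch / P_R) and ICC.

    Derived under the assumptions named in the lead-in; only ICC**2 enters,
    so positive and negative coherence are treated alike.
    """
    inv = 1.0 / sigma ** 2                     # P_R / P_Ch
    dtt = 0.5 * (1.0 - inv + np.sqrt((1.0 - inv) ** 2 + 4.0 * icc ** 2 * inv))
    return float(np.clip(dtt, 0.0, 1.0))


def ambient_to_total(sigma, icc):
    """ATT = 1 - DTT."""
    return 1.0 - direct_to_total(sigma, icc)
```

Two limiting cases match intuition: full coherence (ICC = 1) gives DTT = 1, and zero coherence with equal levels (σ = 1) gives DTT = 0.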
b shows a graph 750 of an exemplary DTT (direct-to-total) energy ratio 760 as a function of the inter-channel coherence parameter ICC 770. In the
On the encoder side of the encoder/decoder system 800, an embodiment of an encoder 810 is shown, which may comprise a downmixer 815 for downmixing the multi-channel audio signal (Ch1 . . . ChN) into the downmix signal 115 having the plurality Ch1 . . . ChM of downmix channels, wherein the number of channels is reduced from N to M. The downmixer 815 may also be configured to output the spatial parametric information 105 by calculating inter-channel relations from the multi-channel audio signal 101. In the encoder/decoder system 800 of
On the one hand, the inter-channel relation parameters σi(Chi, R) and ICCi(Chi, R) may be calculated between channel Chi and the linear combination R of the rest of the channels in the encoder 810 and transmitted within the encoded signal. The decoder 820 may in turn receive the encoded signal and be operative on the transmitted inter-channel relation parameters σi(Chi, R) and ICCi(Chi, R).
On the other hand, the encoder 810 may also be configured to calculate the inter-channel coherence parameters ICCi,j between pairs of different channels (Chi, Chj) to be transmitted. In this case, the decoder 820 should be able to derive the parameters ICCi(Chi, R) between channel Chi and the linear combination R of the rest of the channels from the transmitted pairwise calculated ICCi,j(Chi, Chj) parameters, so that the embodiments described earlier may be realized. It is to be noted in this context that the decoder 820 cannot reconstruct the parameters ICCi(Chi, R) from the knowledge of the downmix signal 115 alone.
In embodiments, the transmitted spatial parameters are not only about pairwise channel comparisons.
For example, the most typical MPS case is that there are two downmix channels. The first set of spatial parameters in MPS decoding makes the two channels into three: Center, Left and Right. The parameters that guide this mapping are the center prediction coefficient (CPC) and an ICC parameter that is specific to this two-to-three configuration.
The second set of spatial parameters divides each of these three channels into two: the side channels into corresponding front and rear channels, and the center channel into a center and an LFE channel. This mapping is governed by the ICC and CLD parameters introduced before.
It is not practical to formulate calculation rules for all kinds of downmixing configurations and all kinds of spatial parameters. It is, however, practical to follow the downmixing steps virtually. Since it is known how the two channels are made into three, and the three into six, an input-output relation can be found that describes how the two input channels are routed to the six outputs. The outputs are only linear combinations of the downmix channels, plus linear combinations of decorrelated versions of them. It is not necessary to actually decode the output signal and measure it; since this “decoding matrix” is known, the ICC and CLD parameters between any channels or combinations of channels can be calculated efficiently in the parametric domain.
Regardless of the downmix- and the multichannel signal configuration, each output of the decoded signal is a linear combination of the downmix signals plus a linear combination of a decorrelated version of each of them.
where operator D[ ] corresponds to a decorrelator, i.e. a process which makes an incoherent duplicate of the input signal. The factors a and b are known, since they are directly derivable from the parametric side information. This is because by definition, the parametric information is the guide for the decoder how to create the multichannel output from the downmix signals. The above formula can be simplified to
since all the decorrelated parts can be combined for the energetic/coherence comparison. The energy of D is known, since the factors b are known from the first formula.
From this point, it is to be noted that we can do any kind of coherence and energy comparison between the output channels, or between different linear combinations of the output channels. In case of a simple example of two downmix channels, and a set of output channels, of which, for example, channels number 3 and 5 are compared against each other, the sigma is calculated as follows:
where E[ ] is the expectation (in practice: average) operator. Both of the terms can be formulated as follows
All parameters above are known or measurable from the downmix signals. The cross terms E[Ch_dmx*D] are zero by definition and therefore do not appear in the lower row of the formula. Similarly, the coherence formula is
Again, since all parts of the above formula are linear combination of the inputs plus decorrelated signal, the solution is straightforwardly available.
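The parametric-domain comparison described above can be sketched as a covariance computation. The sketch assumes that each output is A·Ch_dmx plus a combined decorrelated part that is incoherent with the downmix channels, as stated in the text. It additionally assumes that the decorrelated parts of different outputs are mutually incoherent, which is a simplification made here and may not hold for every decoder structure. The mixing matrix, energies and channel indices in the usage lines are hypothetical.

```python
import numpy as np


def output_covariance(a, c_dmx, p_dec):
    """Covariance of the virtually decoded outputs, without decoding audio.

    a     : (N, M) factors of the downmix channels for each output
    c_dmx : (M, M) measured covariance of the downmix channels
    p_dec : (N,) energies of the combined decorrelated part of each output
    """
    a = np.asarray(a)
    dry = a @ np.asarray(c_dmx) @ a.conj().T
    # Assumption: decorrelated parts only contribute on the diagonal.
    return dry + np.diag(np.asarray(p_dec, dtype=float))


def sigma_icc_from_cov(cov, i, j):
    """Level ratio sigma and coherence ICC between outputs i and j."""
    p_i, p_j = cov[i, i].real, cov[j, j].real
    sigma = np.sqrt(p_i / p_j)
    icc = cov[i, j].real / np.sqrt(p_i * p_j)
    return float(sigma), float(icc)


# Hypothetical example: two downmix channels, three outputs.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
C = np.eye(2)                      # uncorrelated, unit-energy downmix channels
cov = output_covariance(A, C, p_dec=[0.0, 0.0, 2.0])
sigma_02, icc_02 = sigma_icc_from_cov(cov, 0, 2)
```

A comparison between linear combinations of outputs can be handled the same way by first transforming the covariance with the combination weights.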
The above examples were with comparing two output channels, but similarly one can make a comparison between linear combinations of output channels, such as with an exemplary process that will be described later.
In summary of the previous embodiments, the presented technique/concept may comprise the following steps:
The usage of spatial parametric side information is best explained and summarized by the embodiment of
Referring to the
In reproduction of audio, there often arises a need to reproduce the sound over headphones. Headphone listening has a specific feature which makes it drastically different from loudspeaker listening and also from any natural sound environment: the audio is fed directly to the left and right ears. Audio content is typically produced for loudspeaker playback. Therefore, the audio signals do not contain the properties and cues that our hearing system uses in spatial sound perception, unless binaural processing is introduced into the system.
Binaural processing, fundamentally, may be said to be a process that takes in input sound and modifies it so that it contains only such inter-aural and monaural properties that are perceptually correct (with respect to the way that our hearing system processes spatial sound). Binaural processing is not a straightforward task, and the existing state-of-the-art solutions have many shortcomings.
There is a large number of applications where binaural processing for music and movie playback is already included, such as media players and processing devices that are designed to transform multi-channel audio signals into the binaural counterpart for headphones. A typical approach is to use head-related transfer functions (HRTFs) to make virtual loudspeakers and add a room effect to the signal. This, in theory, could be equivalent to listening with loudspeakers in a specific room.
Practice has, however, repeatedly shown that this approach has not consistently satisfied the listeners. There seems to be a trade-off: good spatialization with this straightforward method comes at the price of losing audio quality, such as disadvantageous changes in sound color or timbre, an annoying perception of room effect and loss of dynamics. Further problems include inaccurate localization (e.g. in-head localization, front-back confusion), lack of spatial distance of the sound sources and inter-aural mismatch, i.e. auditory sensation near the ears due to wrong inter-aural cues.
Different listeners may judge the problems very differently. The sensitivity also varies depending on the input material, such as music (strict quality criteria in terms of sound color), movies (less strict) and games (even less strict, but localization is important). There are also typically different design goals depending on the content.
Therefore, the following description deals with an approach of overcoming the above problems as successfully as possible to maximize the averaged perceived overall quality.
a shows a block diagram of an overview 900 of a binaural direct sound rendering device 910 according to further embodiments of the present invention. As shown in
Here, the binaural direct sound rendering device 910 may be configured to feed the direct signal portion 125-1 through head related transfer functions (HRTFs) to obtain a transformed direct signal portion. The binaural direct sound rendering device 910 may furthermore be configured to apply room effect to the transformed direct signal portion to finally obtain the first binaural output signal 915.
b shows a block diagram of details 905 of the binaural direct sound rendering device 910 of
Specifically, referring to
In embodiments, therefore, room effect can advantageously be applied in parallel to the HRTFs, and not serially (i.e. by applying room effect after feeding the signal through HRTFs). Specifically, only the sound that propagates directly from the source goes through or is transformed by the corresponding HRTFs. The indirect/reverberated sound can be approximated to enter the ears from all around, i.e. in a statistical fashion (by employing coherence control instead of HRTFs). There may also be serial implementations, but the parallel method is advantageous.
a shows a block diagram of an overview 1000 of a binaural ambience sound rendering device 1010 according to further embodiments of the present invention. As shown in
b shows a block diagram of details 1005 of the binaural ambient sound rendering device 1010 of
According to a further embodiment, the binaural ambient sound rendering device 1010 is configured to apply room effect and/or a filter to the ambient signal portion 125-2 for providing the second binaural output signal 1015, so that the second binaural output signal 1015 will be adapted to inter-aural coherence of real diffuse sound fields.
In the above embodiments, decorrelation and coherence control may be performed in two consecutive steps, but this is not a requirement. It is also possible to achieve the same result with a single-step process, without an intermediate formulation of incoherent signals. Both methods are equally valid.
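As an illustration of coherence control, the following sketch produces a left/right pair with a chosen coherence from two signals. It is a generic mixing technique and not necessarily the one used in the embodiments. It assumes that the two inputs are already mutually incoherent and of equal power (for example, an ambient signal and a decorrelated copy of it) and uses a single broadband target value, whereas the coherence of real diffuse sound fields depends on frequency.

```python
import numpy as np


def set_coherence(s1, s2, target):
    """Mix two incoherent, equal-power signals into a pair with given coherence.

    left  = s1
    right = target * s1 + sqrt(1 - target**2) * s2
    Both outputs keep the input power, and their normalized correlation
    equals `target` (exactly so only if s1 and s2 are exactly incoherent).
    """
    if not -1.0 <= target <= 1.0:
        raise ValueError("target coherence must lie in [-1, 1]")
    left = np.asarray(s1, dtype=float)
    right = target * left + np.sqrt(1.0 - target ** 2) * np.asarray(s2, dtype=float)
    return left, right


def measured_coherence(x, y):
    """Normalized zero-lag correlation between two real signals."""
    return float(np.mean(x * y) / np.sqrt(np.mean(x ** 2) * np.mean(y ** 2)))
```

A frequency-dependent target could be realized by applying the same mixing per subband.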
The frequency transform operation of the
The above direct/ambience separation process can be subdivided into two different parts. In the direct/ambience estimation part, the levels and/or ratios of the direct and ambient parts are estimated based on a combination of a signal model and the properties of the audio signal. In the direct/ambience extraction part, the known ratios and the input signal can be used in creating the output direct and ambience signals.
Finally,
a shows a block diagram of an embodiment of an apparatus 1300 for extracting a direct/ambient signal from a mono downmix signal in a filterbank domain. As shown in
In particular, the analysis filterbank 1310 of the apparatus 1300 may be implemented to perform a short-time Fourier transform (STFT) or may, for example, be configured as an analysis QMF filterbank, while the synthesis filterbanks 1320, 1322 of the apparatus 1300 may be implemented to perform an inverse short-time Fourier transform (ISTFT) or may, for example, be configured as synthesis QMF filterbanks.
The analysis filterbank 1310 is configured for receiving a mono downmix signal 1315, which may correspond to the mono downmix signal 215 as shown in the
The DTTmono-, ATTmono-based parameters 1333, 1335 may be supplied from a DTTmono, ATTmono calculator 1330 as shown in
As a result of the application of the DTTmono- or ATTmono-based parameters 1333, 1335, a plurality 1353, 1355 of modified filterbank subbands will be obtained, respectively. Subsequently, the plurality 1353, 1355 of modified filterbank subbands is fed into the synthesis filterbanks 1320, 1322, respectively, which are configured to synthesize the plurality 1353, 1355 of modified filterbank subbands so as to obtain the direct signal portion 1325-1 or the ambient signal portion 1325-2 of the mono downmix signal 1315, respectively. Here, the direct signal portion 1325-1 of
Referring to
According to embodiments, the spatial parameters and the derived parameters are given in a frequency resolution according to the critical bands of the human auditory system, e.g. 28 bands, which is normally less than the resolution of the filterbank.
Therefore, the direct/ambience extraction according to the
Here, a dividing of the left channel (L) into the corresponding output channels L, LS, the right channel (R) into the corresponding output channels R, RS and the center channel (C) into the corresponding output channels C, LFE, respectively, may be represented by a one-to-two (OTT) configuration having a respective input for the corresponding ICC, CLD parameters.
The exemplary MPEG Surround decoding scheme 1400 which specifically corresponds to a “5-2-5 configuration” may, for example, comprise the following steps. In a first step, the spatial parameters or parametric side information may be formulated into the decoding matrices 1430, 1440, which are shown in
Before going further, it is to be pointed out that the just-mentioned exemplary process needs the measurement of
$$E[|L_{dmx}|^2],\quad E[|R_{dmx}|^2]$$
which are the mean powers of the downmix channels, and
$$E[L_{dmx} R_{dmx}^*]$$
which may be referred to as the cross-spectrum, from the downmix channels. Here, the mean powers of the downmix channels are purposefully referred to as energies, since “mean power” is not that common a term.
The expectation operator indicated by the square brackets can be replaced in practical applications by a time-average, recursive or non-recursive. The energies and the cross-spectrum are straight-forwardly measurable from the downmix signal.
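By way of illustration only, a recursive time-average of the energies and of the cross-spectrum may be sketched as follows. The function names and the smoothing constant alpha are hypothetical and are not taken from the described embodiments; the inputs are assumed to be complex samples of one frequency band of the two downmix channels.

```python
def recursive_average(prev, new, alpha=0.9):
    """One-pole recursive average: alpha weights the past, (1 - alpha) the new value."""
    return alpha * prev + (1.0 - alpha) * new


def track_downmix_statistics(l_samples, r_samples, alpha=0.9):
    """Estimate E[|Ldmx|^2], E[|Rdmx|^2] and E[Ldmx Rdmx*] for one frequency band.

    l_samples, r_samples: sequences of complex subband samples (assumed input format).
    Returns the estimates after the last sample has been processed.
    """
    e_l, e_r, cross = 0.0, 0.0, 0j
    for l, r in zip(l_samples, r_samples):
        e_l = recursive_average(e_l, abs(l) ** 2, alpha)
        e_r = recursive_average(e_r, abs(r) ** 2, alpha)
        # Cross-spectrum: left sample times the complex conjugate of the right sample.
        cross = recursive_average(cross, l * complex(r).conjugate(), alpha)
    return e_l, e_r, cross
```

A non-recursive average over a sliding window of samples could be substituted for the recursive update without changing the remainder of the process.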
It is also to be noted that the energy of a linear combination of two channels can be formulated from the energies of the channels, the mixing factors and the cross-spectrum (all in parametric domain, where no signal operations are needed).
The linear combination
$$Ch = a L_{dmx} + b R_{dmx}$$
has the following energy:
$$E[|Ch|^2] = E[|a L_{dmx} + b R_{dmx}|^2] = a^2 E[|L_{dmx}|^2] + b^2 E[|R_{dmx}|^2] + ab\,(E[L_{dmx} R_{dmx}^*] + E[R_{dmx} L_{dmx}^*]) = a^2 E[|L_{dmx}|^2] + b^2 E[|R_{dmx}|^2] + 2ab\,\mathrm{Re}\{E[L_{dmx} R_{dmx}^*]\}$$
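The relation above may be evaluated entirely from the measured energies and the cross-spectrum, without forming the combined signal. The following minimal sketch assumes real-valued mixing factors a and b; the function name is hypothetical.

```python
def combination_energy(a, b, e_l, e_r, cross_lr):
    """Energy of Ch = a*Ldmx + b*Rdmx, computed in the parametric domain.

    a, b      : real mixing factors
    e_l, e_r  : energies E[|Ldmx|^2] and E[|Rdmx|^2]
    cross_lr  : cross-spectrum E[Ldmx Rdmx*] (complex)
    """
    return a * a * e_l + b * b * e_r + 2.0 * a * b * complex(cross_lr).real
```

For any averaging operator that is linear, the result equals the energy that would be measured from the actually mixed signal, which is why no signal operations are needed at this stage.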
The following describes the individual steps of the exemplary process (i.e. decoding scheme).
First Step (Spatial Parameters to Mixing Matrices)
As described before, the M1 and M2 matrices are created according to the MPEG Surround standard. The element in the a-th row and b-th column of M1 is denoted M1(a,b).
Second Step (Mixing Matrices with Energies and Cross-Spectra of the Downmix to Inter-Channel Information of the Upmixed Channels)
Now we have the mixing matrices M1 and M2. We need to formulate how the output channels are created from the left downmix channel (Ldmx) and the right downmix channel (Rdmx). We assume that the decorrelators are used (
$$L = a_L L_{dmx} + b_L R_{dmx} + c_L D_1[S_1] + d_L D_2[S_2] + e_L D_3[S_3]$$
The above is exemplary for the upmixed front left channel. The other channels can be formulated in the same way. The D-elements are the decorrelators, a-e are weights that are calculable from the M1 and M2 matrix entries.
In particular, the factors a-e are straight-forwardly formulable from the matrix entries:
and for the other channels accordingly.
The S-signals are
$$S_n = M1_{n+3,1} L_{dmx} + M1_{n+3,2} R_{dmx}$$
These S-signals are the inputs to the decorrelators from the left hand side matrix in
$$E[|D[S_n]|^2] = E[|S_n|^2]$$
can be calculated as was explained above. The decorrelator does not affect the energy. A perceptually motivated way to do multichannel ambience extraction is by comparing a channel against the sum of all other channels. (Note that this is one option of many.) Now, if we exemplarily consider the case of the channel L, the rest of the channels reads:
We use the symbol “X” here because using “R” for “rest of the channels” might be confusing.
Then the energy of the channel L is
$$E[|L|^2] = a_L^2 E[|L_{dmx}|^2] + b_L^2 E[|R_{dmx}|^2] + c_L^2 E[|S_1|^2] + d_L^2 E[|S_2|^2] + e_L^2 E[|S_3|^2] + 2 a_L b_L\,\mathrm{Re}\{E[L_{dmx} R_{dmx}^*]\}$$
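As a hedged illustration of the energy of the upmixed channel L, the following sketch evaluates the expression from the five weights, the downmix statistics and the energies of the S-signals. It assumes real weights and assumes that the decorrelator outputs are mutually incoherent and incoherent with the downmix channels, so that only the downmix pair contributes a cross term. The function name and argument layout are hypothetical.

```python
def upmixed_channel_energy(weights, e_l, e_r, cross_lr, s_energies):
    """Energy of one upmixed channel in the parametric domain.

    weights    : (a, b, c, d, e), real weights derived from the M1 and M2 entries
    e_l, e_r   : energies of the left and right downmix channels
    cross_lr   : cross-spectrum E[Ldmx Rdmx*] (complex)
    s_energies : (E[|S1|^2], E[|S2|^2], E[|S3|^2]); a decorrelator keeps the energy
    """
    a, b, c, d, e = weights
    # Contribution of the two downmix channels, including their cross term.
    downmix_part = a * a * e_l + b * b * e_r + 2.0 * a * b * complex(cross_lr).real
    # Contribution of the three decorrelated signals (assumed incoherent, no cross terms).
    decorrelated_part = (c * c * s_energies[0]
                         + d * d * s_energies[1]
                         + e * e * s_energies[2])
    return downmix_part + decorrelated_part
```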
Then the energy of the channel X is
And the cross-spectrum is:
Now we can formulate the ICC
and sigma
Third Step (Inter-Channel Information in the Upmixed Channels to DTT Parameters of the Upmixed Channels)
Now we can calculate the DTT of channel L according to
The direct energy of L is
$$E[|D_L|^2] = \mathrm{DTT} \cdot E[|L|^2]$$
The ambience energy of L is
$$E[|A_L|^2] = (1 - \mathrm{DTT}) \cdot E[|L|^2]$$
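The split of a channel energy into its direct and ambience parts may be sketched as follows, assuming that the DTT value of the channel has already been obtained and lies between 0 and 1. The function name is hypothetical.

```python
def split_energy(total_energy, dtt):
    """Split a channel energy into (direct, ambience) using the direct-to-total ratio."""
    if not 0.0 <= dtt <= 1.0:
        raise ValueError("DTT is an energy ratio and must lie in [0, 1]")
    direct = dtt * total_energy
    ambience = (1.0 - dtt) * total_energy
    return direct, ambience
```

By construction, the two returned energies add up to the total energy of the channel.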
Fourth Step (Downmixing the Direct/Ambient Energies)
If exemplarily using an incoherent downmixing rule, the left downmix channel ambience energy is
and similarly for the direct part and the right channel direct and ambient part. Note that the above is just one downmixing rule. There can be other downmixing rules as well.
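One possible incoherent downmixing rule is sketched below for illustration. It assumes that a downmix channel is a weighted sum of upmixed channels with real gains and that the respective parts of the contributing channels are mutually incoherent, so that their energies add with squared gains. The channel labels and gain values in the usage example are hypothetical and do not reproduce a particular downmix specification.

```python
def incoherent_downmix_energy(energies, gains):
    """Downmix per-channel energies under an incoherence assumption.

    energies : dict mapping channel name -> ambience (or direct) energy
    gains    : dict mapping channel name -> real downmix gain of that channel
    Returns the sum of gain^2 * energy over the channels listed in gains.
    """
    return sum(gains[name] ** 2 * energies[name] for name in gains)


# Hypothetical usage: ambience energies of three upmixed channels feeding the left downmix.
ambience = {"L": 1.0, "Ls": 2.0, "C": 4.0}
left_gains = {"L": 1.0, "Ls": 1.0, "C": 0.5}
left_downmix_ambience = incoherent_downmix_energy(ambience, left_gains)
```

The same function can be applied to the direct energies and to the right downmix channel by exchanging the dictionaries.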
Fifth Step (Calculating the Weights for Ambience Extraction in Downmix Channels)
The left downmix DTT ratio is
The weight factors can then be calculated as described in the
Basically, the above described exemplary process relates the CPC, ICC, and CLD parameters in the MPS stream to the ambience ratios of the downmix channels.
According to further embodiments, there are typically other means to achieve similar goals, and other conditions as well. For example, there may be other rules for downmixing, other loudspeaker layouts, other decoding methods and other ways to make the multi-channel ambience estimation than the one described previously, wherein a specific channel is compared to the remaining channels.
Although the present invention has been described in the context of block diagrams where the blocks represent actual or logical hardware components, the present invention can also be implemented by a computer-implemented method. In the latter case, the blocks represent corresponding method steps where these steps stand for the functionalities performed by corresponding logical or physical hardware blocks.
The described embodiments are merely illustrative of the principles of the present invention. It is understood that modifications and variations of the arrangements and the details described herein will be apparent to others skilled in the art. It is the intent, therefore, to be limited only by the scope of the appended patent claims and not by the specific details presented by way of description and explanation of the embodiments herein.
Depending on certain implementation requirements of the inventive methods, the inventive methods can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, in particular a disc, a DVD or a CD having electronically readable control signals stored thereon, which co-operate with programmable computer systems such that the inventive methods are performed. Generally, the present invention can, therefore, be implemented as a computer program product with the program code stored on a machine-readable carrier, the program code being operative for performing the inventive methods when the computer program product runs on a computer. In other words, the inventive methods are, therefore, a computer program having a program code for performing at least one of the inventive methods when the computer program runs on a computer. The inventive encoded audio signal can be stored on any machine-readable storage medium, such as a digital storage medium.
An advantage of the novel concept and technique is that the above-mentioned embodiments, i.e. apparatus, method or computer program, described in this application allow for estimating and extracting the direct and/or ambient components from an audio signal with the aid of parametric spatial information. In particular, the novel processing of the present invention functions in frequency bands, as is typical in the field of ambience extraction. The presented concept is relevant to audio signal processing, since there are a number of applications that need separation of direct and ambient components from an audio signal.
In contrast to standard ambience extraction methods, the present concept is not based on stereo input signals only and may also be applied to mono downmix situations. For a single-channel downmix, in general no inter-channel differences can be computed. However, by taking the spatial side information into account, ambience extraction becomes possible in this case as well.
The present invention is advantageous in that it utilizes the spatial parameters to estimate the ambience levels of the “original” signal. It is based on the concept that the spatial parameters already contain information about the inter-channel differences of the “original” stereo or multi-channel signal.
Once the original stereo or multi-channel ambience levels are estimated, one can also derive the direct and ambience levels in the provided downmix channel(s). This may be done by linear combinations (i.e. weighted summation) of the ambience energies for the ambience part, and of the direct energies or amplitudes for the direct part. Therefore, embodiments of the present invention provide ambience estimation and extraction with the aid of spatial side information.
Extending from this concept of side information-based processing, the following beneficial properties or advantages exist.
Embodiments of the present invention provide ambience estimation with the aid of spatial side information and the provided downmix channels. Such an ambience estimation is important in cases where more than one downmix channel is provided along with the side information. The side information, and the information that is measured from the downmix channels, can be used together in ambience estimation. In MPEG Surround with a stereo downmix, these two information sources together provide the complete information on the inter-channel relations of the original multi-channel sound, and the ambience estimation is based on these relations.
Embodiments of the present invention also provide downmixing of the direct and ambient energies. In the described situation of side-information based ambience extraction, there is an intermediate step of estimating the ambience in a number of channels higher than the provided downmix channels. Therefore, this ambience information has to be mapped to the number of downmix audio channels in a valid way. This process can be referred to as downmixing due to its correspondence to audio channel downmixing. This may be most straightforwardly done by combining the direct and ambience energy in the same way as the provided downmix channels were downmixed.
The downmixing rule does not have one ideal solution, but is likely to depend on the application. For instance, in MPEG Surround it can be beneficial to treat the channels differently (center, front loudspeakers, rear loudspeakers) due to their typically different signal content.
Moreover, embodiments provide a multi-channel ambience estimation that is performed independently in each channel with respect to the other channels. This approach allows the presented stereo ambience estimation formula to be applied simply to each channel relative to all other channels. By this measure, it is not necessary to assume an equal ambience level in all channels. The presented approach is based on the assumption about spatial perception that the ambient component in each channel is that component which has an incoherent counterpart in some or all of the other channels. An example that suggests the validity of this assumption is that one of two channels emitting noise (ambience) can be divided further into two channels with half the energy each, without significantly affecting the perceived sound scene.
In terms of signal processing, it is advantageous that the actual direct/ambience ratio estimation happens by applying the presented ambience estimation formula to each channel versus the linear combination of all other channels.
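The comparison of one channel against the linear combination of all other channels may be illustrated by the following sketch. It operates directly on channel samples for clarity, whereas the described process obtains the same quantities in the parametric domain. The coherence measure used here is the conventional normalized magnitude of the cross-spectrum, which is an assumption and is not necessarily identical to the ICC formulation of the embodiments. All names are hypothetical.

```python
import math


def channel_vs_rest_statistics(channels, index):
    """Energies and cross-spectrum of channels[index] versus the sum of all other channels.

    channels : list of equally long sample sequences (real or complex), one per channel
    Returns (energy of the channel, energy of the rest, cross-spectrum).
    """
    target = channels[index]
    n = len(target)
    # "Rest" signal X: plain sum of every channel except the one under test.
    rest = [sum(ch[k] for i, ch in enumerate(channels) if i != index) for k in range(n)]
    e_target = sum(abs(x) ** 2 for x in target) / n
    e_rest = sum(abs(x) ** 2 for x in rest) / n
    cross = sum(x * complex(y).conjugate() for x, y in zip(target, rest)) / n
    return e_target, e_rest, cross


def coherence(e_a, e_b, cross):
    """Normalized cross-spectrum magnitude in [0, 1]; 0.0 if either energy vanishes."""
    denom = math.sqrt(e_a * e_b)
    return abs(cross) / denom if denom > 0.0 else 0.0
```

Repeating the call for every channel index yields one coherence value per channel, without requiring equal ambience levels across channels.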
Finally, embodiments provide an application of the estimated direct and ambience energies to extract the actual signals. Once the ambience levels in the downmix channels are known, one may apply one of two inventive methods for obtaining the ambience signals. The first method is based on a simple multiplication, wherein the direct and ambient parts for each downmix channel can be generated by multiplying the signal with sqrt (direct-to-total-energy-ratio) and sqrt (ambient-to-total-energy-ratio), respectively. This provides for each downmix channel two signals that are coherent to each other, but have the energies that the direct and ambient parts were estimated to have.
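The first, multiplication-based method may be sketched as follows for the samples of one frequency band of one downmix channel. The sketch assumes that the ambient-to-total ratio equals one minus the direct-to-total ratio; the function name is hypothetical.

```python
import math


def extract_by_scaling(subband_samples, dtt):
    """Return (direct, ambient) sample lists obtained by scaling one subband signal.

    subband_samples : sequence of real or complex samples of one frequency band
    dtt             : direct-to-total energy ratio of that band, in [0, 1]
    """
    if not 0.0 <= dtt <= 1.0:
        raise ValueError("DTT must lie in [0, 1]")
    att = 1.0 - dtt                      # ambient-to-total energy ratio (assumed complement)
    direct_gain = math.sqrt(dtt)
    ambient_gain = math.sqrt(att)
    direct = [direct_gain * x for x in subband_samples]
    ambient = [ambient_gain * x for x in subband_samples]
    return direct, ambient
```

Because both outputs are scaled copies of the same input, they are fully coherent with each other, while their energies add up to the energy of the input.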
The second method is based on a least-mean-square solution with crossmixing of the channels, wherein the channel crossmixing (also possible with negative signs) allows better estimation of the direct and ambience signals than the above solution. In contrast to the least-mean-square solution for stereo input and equal ambience levels in the channels provided in “Multiple-loudspeaker playback of stereo signals”, C. Faller, Journal of the AES, October 2007 and “Patent application title: Method to Generate Multi-Channel Audio Signal from Stereo Signals”, Inventors: Christof Faller, Agents: FISH & RICHARDSON P.C., Assignees: LG ELECTRONICS, INC., Origin: MINNEAPOLIS, Minn. US, IPC8 Class: AH04R500FI, USPC Class: 381 1, the present invention provides a least-mean-square solution that does not require equal ambience levels and is also extendable to any number of channels.
Additional properties of the novel processing are the following. In the ambience processing for binaural rendering, the ambience can be processed with a filter that has the property of providing inter-aural coherence in frequency bands that is similar to the inter-aural coherence in real diffuse sound fields, wherein the filter may also include room effect. In the direct part processing for binaural rendering, the direct part can be fed through head related transfer functions (HRTFs) with possible addition of room effect, such as early reflections and/or reverberation.
Besides this, a “level-of-separation” control corresponding to a dry/wet control may be realized in further embodiments. In particular, full separation may not be desirable in many applications as it may lead to audible artifacts, like abrupt changes, modulation effects, etc. Therefore, all the relevant parts of the described processes can be implemented with a “level-of-separation” control for controlling the amount of desired and useful separation. With regard to
The main benefits of the presented solution are the following. The system works in all situations, also with parametric stereo and MPEG surround with mono downmix, unlike previous solutions that rely on downmix information only. The system is furthermore able to utilize spatial side information conveyed together with the audio signal in spatial audio bitstreams to more accurately estimate direct and ambience energies than with simple inter-channel analysis of the downmix channels. Therefore, many applications, such as binaural processing, may benefit by applying different processing for direct and ambient parts of the sound.
Embodiments are based on the following psychoacoustic assumptions. The human auditory system localizes sources based on inter-aural cues in time-frequency tiles (areas restricted to a certain frequency and time range). If two or more incoherent sources that overlap in time and frequency are presented simultaneously in different locations, the hearing system is not able to perceive the location of the sources. This is because the sum of these sources does not produce reliable inter-aural cues for the listener. The hearing system may thus be described as picking up from the audio scene closed time-frequency tiles that provide reliable localization information, and treating the rest as unlocalizable. By these means, the hearing system is able to localize sources in complex sound environments. Simultaneous coherent sources have a different effect: they form approximately the same inter-aural cues that a single source located between the coherent sources would form.
This is also the property that embodiments take advantage of. The levels of localizable (direct) and unlocalizable (ambience) sound can be estimated, and these components can then be extracted. The spatialization signal processing is applied only to the localizable/direct part, while the diffuseness/spaciousness/envelopment processing is applied to the unlocalizable/ambient part. This gives a significant benefit in the design of a binaural processing system, since many processes may be applied only where they are needed, leaving the remaining signal unaffected. All processing happens in frequency bands that approximate the frequency resolution of human hearing.
Embodiments are based on a decomposition of the signal to maximize the perceptual quality, but minimize the perceived problems. By such a decomposition, it is possible to obtain the direct and the ambience component of an audio signal separately. The two components can then be further processed to achieve a desired effect or representation.
Specifically, embodiments of the present invention allow ambience estimation with aid of the spatial side information in the coded domain.
The present invention is also advantageous in that typical problems of headphone reproduction of audio signals can be reduced by separating the signals into a direct signal and an ambient signal. Embodiments allow existing direct/ambience extraction methods to be improved and applied to binaural sound rendering for headphone reproduction.
The main use case of the spatial side information based processing is naturally MPEG surround and parametric stereo (and similar parametric coding techniques). Typical applications which benefit from ambience extraction are binaural playback due to the ability to apply a different extent of room effect to different parts of the sound, and upmixing to a higher number of channels due to the ability to position and process different components of the sound differently. There may also be applications where the user would need modification of the direct/ambience level, e.g. for purpose of enhancing speech intelligibility.
While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations and equivalents as fall within the true spirit and scope of the present invention.
Number | Date | Country | Kind |
---|---|---|---|
10174230 | Aug 2010 | EP | regional |
This application is a continuation of copending International Application No. PCT/EP2011/050265, filed Jan. 11, 2011, which is incorporated herein by reference in its entirety, and additionally claims priority from U.S. Application No. 61/295,278, filed Jan. 15, 2010 and European Application No. EP 10174230.2, filed Aug. 26, 2010, all of which are incorporated herein by reference in their entirety.
Number | Name | Date | Kind |
---|---|---|---|
7567845 | Avendano et al. | Jul 2009 | B1 |
8781133 | Walther et al. | Jul 2014 | B2 |
20070236858 | Disch et al. | Oct 2007 | A1 |
20090198356 | Goodwin et al. | Aug 2009 | A1 |
Number | Date | Country |
---|---|---|
1264264 | Aug 2000 | CN |
1 761 110 | Mar 2007 | EP |
2009-531724 | Sep 2009 | JP |
2005101905 | Oct 2005 | WO |
WO 2005101905 | Oct 2005 | WO |
2007110101 | Oct 2007 | WO |
Entry |
---|
Official Communication issued in corresponding Japanese Patent Application No. 2012-548400, mailed on Sep. 25, 2013. |
Norimatsu, “Low Bit Rate High Sound Quality Multi-Channel Audio Encoding Technique”, MPEG Surround Panasonic Technical Publication, vol. 54, No. 4, Jan. 15, 2009, 6 pages. |
Official Communication issued in corresponding Chinese Patent Application No. 201180014038.9, mailed on Aug. 2, 2013. |
Official Communication issued in International Patent Application No. PCT/EP2011/050265, mailed on Mar. 15, 2011. |
Goodwin et al., “Primary-Ambient Signal Decomposition and Vector-Based Localization for Spatial Audio Coding and Enhancement,” IEEE Intl. Conf. on Acoustics, Speech and Signal Proc, Apr. 2007, pp. I-9 to I-12. |
Merimaa et al., “Correlation-Based Ambience Extraction from Stereo Recordings,” AES 123rd Convention, Convention Paper 7282, Oct. 5-8, 2007, pp. 1-5, New York, New York. |
Faller, “Multiple-Loudspeaker Playback of Stereo Signals,” J. Audio Eng. Soc., vol. 54, No. 11, Nov. 2006, pp. 1051-1064. |
Goodwin et al., “Binaural 3-D Audio Rendering Based on Spatial Audio Scene Coding,” AES 123rd Convention, Convention Paper 7277, Oct. 5-8, 2007, pp. 1-12, New York, New York. |
Usher et al., “Enhancement of Spatial Sound Quality: A New Reverberation-Extraction Audio Upmixer,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, No. 7, Sep. 2007, pp. 2141-2150. |
“Text of ISO/IEC FDIS 23003-1, MPEG Surround,” ISO/IEC 2006, Jul. 2006, 293 pages. |
Breebaart et al., “Multi-Channel Goes Mobile: MPEG Surround Binaural Rendering,” AES 29th International Conference, Sep. 2-4, 2006, pp. 1-13, Seoul, Korea. |
English translation of Official Communication issued in corresponding Japanese Patent Application No. 2012-548400, mailed on Oct. 22, 2014. |
Number | Date | Country | |
---|---|---|---|
20120314876 A1 | Dec 2012 | US |
Number | Date | Country | |
---|---|---|---|
61295278 | Jan 2010 | US |
Number | Date | Country | |
---|---|---|---|
Parent | PCT/EP2011/050265 | Jan 2011 | US |
Child | 13546048 | US |