The present application is concerned with audio coding using down-mixing of signals.
Many audio encoding algorithms have been proposed in order to effectively encode or compress audio data of one channel, i.e., mono audio signals. Using psychoacoustics, audio samples are appropriately scaled, quantized or even set to zero in order to remove irrelevancy from, for example, the PCM coded audio signal. Redundancy removal is also performed.
As a further step, the similarity between the left and right channel of stereo audio signals has been exploited in order to effectively encode/compress stereo audio signals.
However, upcoming applications pose further demands on audio coding algorithms. For example, in teleconferencing, computer games, music performance and the like, several audio signals which are partially or even completely uncorrelated have to be transmitted in parallel. In order to keep the bit rate for encoding these audio signals low enough to be compatible with low-bit rate transmission applications, audio codecs have recently been proposed which downmix the multiple input audio signals into a downmix signal, such as a stereo or even mono downmix signal. For example, the MPEG Surround standard downmixes the input channels into the downmix signal in a manner prescribed by the standard. The downmixing is performed by use of so-called OTT⁻¹ and TTT⁻¹ boxes for downmixing two signals into one and three signals into two, respectively. In order to downmix more than three signals, a hierarchic structure of these boxes is used. Each OTT⁻¹ box outputs, besides the mono downmix signal, channel level differences between the two input channels, as well as inter-channel coherence/cross-correlation parameters representing the coherence or cross-correlation between the two input channels. The parameters are output along with the downmix signal of the MPEG Surround coder within the MPEG Surround data stream. Similarly, each TTT⁻¹ box transmits channel prediction coefficients enabling recovery of the three input channels from the resulting stereo downmix signal. The channel prediction coefficients are also transmitted as side information within the MPEG Surround data stream. The MPEG Surround decoder upmixes the downmix signal by use of the transmitted side information and recovers the original channels input into the MPEG Surround encoder.
However, MPEG Surround, unfortunately, does not fulfill all requirements posed by many applications. For example, the MPEG Surround decoder is dedicated for upmixing the downmix signal of the MPEG Surround encoder such that the input channels of the MPEG Surround encoder are recovered as they are. In other words, the MPEG Surround data stream is dedicated to be played back by use of the loudspeaker configuration having been used for encoding.
However, in some applications, it would be favorable if the loudspeaker configuration could be changed at the decoder's side.
In order to address the latter needs, the spatial audio object coding (SAOC) standard is currently being designed. Each channel is treated as an individual object, and all objects are downmixed into a downmix signal. In addition, the individual objects may also comprise individual sound sources such as instruments or vocal tracks. Differing from the MPEG Surround decoder, the SAOC decoder is free to individually upmix the downmix signal to replay the individual objects onto any loudspeaker configuration. In order to enable the SAOC decoder to recover the individual objects having been encoded into the SAOC data stream, object level differences and, for objects forming together a stereo (or multi-channel) signal, inter-object cross correlation parameters are transmitted as side information within the SAOC bitstream. Besides this, the SAOC decoder/transcoder is provided with information revealing how the individual objects have been downmixed into the downmix signal. Thus, on the decoder's side, it is possible to recover the individual SAOC channels and to render these signals onto any loudspeaker configuration by utilizing user-controlled rendering information.
However, although the SAOC codec has been designed for individually handling audio objects, some applications are even more demanding. For example, Karaoke applications necessitate a complete separation of the background audio signal from the foreground audio signal or foreground audio signals. Vice versa, in the solo mode, the foreground objects have to be separated from the background object. However, owing to the equal treatment of the individual audio objects it was not possible to completely remove the background objects or the foreground objects, respectively, from the downmix signal.
According to an embodiment, an audio decoder for decoding a multi-audio-object signal having an audio signal of a first type and an audio signal of a second type encoded therein, the multi-audio-object signal having a downmix signal and side information, the side information having level information of the audio signal of the first type and the audio signal of the second type in a first predetermined time/frequency resolution, and a residual signal specifying residual level values in a second predetermined time/frequency resolution, may have a processor for computing prediction coefficients based on the level information; and an up-mixer for up-mixing the downmix signal based on the prediction coefficients and the residual signal to acquire a first up-mix audio signal approximating the audio signal of the first type and/or a second up-mix audio signal approximating the audio signal of the second type.
According to another embodiment, an audio object encoder may have: a processor for computing level information of an audio signal of the first type and an audio signal of the second type in a first predetermined time/frequency resolution; a processor for computing prediction coefficients based on the level information; a downmixer for downmixing the audio signal of the first type and the audio signal of the second type to acquire a downmix signal; a setter for setting a residual signal specifying residual level values at a second predetermined time/frequency resolution such that up-mixing the downmix signal based on both the prediction coefficients and the residual signal results in a first up-mix audio signal approximating the audio signal of the first type and a second up-mix audio signal approximating the audio signal of the second type, the approximation being improved compared to the absence of the residual signal, the level information and the residual signal being included by a side information forming, along with the downmix signal, a multi-audio-object signal.
According to another embodiment, a method for decoding a multi-audio-object signal having an audio signal of a first type and an audio signal of a second type encoded therein, the multi-audio-object signal having a downmix signal and side information, the side information having level information of the audio signal of the first type and the audio signal of the second type in a first predetermined time/frequency resolution, and a residual signal specifying residual level values in a second predetermined time/frequency resolution, may have the steps of computing prediction coefficients based on the level information; and up-mixing the downmix signal based on the prediction coefficients and the residual signal to acquire a first up-mix audio signal approximating the audio signal of the first type and/or a second up-mix audio signal approximating the audio signal of the second type.
According to another embodiment, a multi-audio-object encoding method may have the steps of: computing level information of an audio signal of the first type and an audio signal of the second type in a first predetermined time/frequency resolution; computing prediction coefficients based on the level information; downmixing the audio signal of the first type and the audio signal of the second type to acquire a downmix signal; setting a residual signal specifying residual level values at a second predetermined time/frequency resolution such that up-mixing the downmix signal based on both the prediction coefficients and the residual signal results in a first up-mix audio signal approximating the audio signal of the first type and a second up-mix audio signal approximating the audio signal of the second type, the approximation being improved compared to the absence of the residual signal, the level information and the residual signal being included by a side information forming, along with the downmix signal, a multi-audio-object signal.
According to another embodiment, a program may have a program code for executing, when running on a processor, a method for decoding a multi-audio-object signal having an audio signal of a first type and an audio signal of a second type encoded therein, the multi-audio-object signal having a downmix signal and side information, the side information having level information of the audio signal of the first type and the audio signal of the second type in a first predetermined time/frequency resolution, and a residual signal specifying residual level values in a second predetermined time/frequency resolution, wherein the method may have the steps of computing prediction coefficients based on the level information; and up-mixing the downmix signal based on the prediction coefficients and the residual signal to acquire a first up-mix audio signal approximating the audio signal of the first type and/or a second up-mix audio signal approximating the audio signal of the second type.
According to another embodiment, a program may have a program code for executing, when running on a processor, a multi-audio-object encoding method, wherein the method may have the steps of: computing level information of an audio signal of the first type and an audio signal of the second type in a first predetermined time/frequency resolution; computing prediction coefficients based on the level information; downmixing the audio signal of the first type and the audio signal of the second type to acquire a downmix signal; setting a residual signal specifying residual level values at a second predetermined time/frequency resolution such that up-mixing the downmix signal based on both the prediction coefficients and the residual signal results in a first up-mix audio signal approximating the audio signal of the first type and a second up-mix audio signal approximating the audio signal of the second type, the approximation being improved compared to the absence of the residual signal, the level information and the residual signal being included by a side information forming, along with the downmix signal, a multi-audio-object signal.
According to another embodiment, a multi-audio-object signal may have an audio signal of a first type and an audio signal of a second type encoded therein, the multi-audio-object signal having a downmix signal and side information, the side information having level information of the audio signal of the first type and the audio signal of the second type in a first predetermined time/frequency resolution, and a residual signal specifying residual level values in a second predetermined time/frequency resolution, wherein the residual signal is set such that computing prediction coefficients based on the level information and up-mixing the downmix signal based on the prediction coefficients and the residual signal results in a first up-mix audio signal approximating the audio signal of the first type and a second up-mix audio signal approximating the audio signal of the second type.
Embodiments of the present invention will be detailed subsequently referring to the appended drawings, in which:
a shows a block diagram of an audio encoder for a Karaoke/Solo mode application, according to a comparison embodiment;
b shows a block diagram of an audio encoder for a Karaoke/Solo mode application, according to an embodiment;
a and b show plots of quality measurement results;
a to h show tables reflecting a possible syntax for the SAOC bitstream according to an embodiment of the present invention;
Before embodiments of the present invention are described in more detail below, the SAOC codec and the SAOC parameters transmitted in an SAOC bitstream are presented in order to ease the understanding of the specific embodiments outlined in further detail below.
The SAOC decoder 12 comprises an upmixer 22 which receives the downmix signal 18 as well as the side information 20 in order to recover and render the audio signals 141 and 14N onto any user-selected set of channels 241 to 24M, with the rendering being prescribed by rendering information 26 input into SAOC decoder 12.
The audio signals 141 to 14N may be input into the downmixer 16 in any coding domain, such as, for example, in time or spectral domain. In case the audio signals 141 to 14N are fed into the downmixer 16 in the time domain, such as PCM coded, downmixer 16 uses a filter bank, such as a hybrid QMF bank, i.e., a bank of complex exponentially modulated filters with a Nyquist filter extension for the lowest frequency bands to increase the frequency resolution therein, in order to transfer the signals into the spectral domain in which the audio signals are represented in several subbands associated with different spectral portions, at a specific filter bank resolution. If the audio signals 141 to 14N are already in the representation expected by downmixer 16, the downmixer does not have to perform the spectral decomposition.
As outlined above, downmixer 16 computes SAOC-parameters from the input audio signals 141 to 14N. Downmixer 16 performs this computation in a time/frequency resolution which may be decreased relative to the original time/frequency resolution as determined by the filter bank time slots 34 and subband decomposition, by a certain amount, with this certain amount being signaled to the decoder side within the side information 20 by respective syntax elements bsFrameLength and bsFreqRes. For example, groups of consecutive filter bank time slots 34 may form a frame 40. In other words, the audio signal may be divided-up into frames overlapping in time or being immediately adjacent in time, for example. In this case, bsFrameLength may define the number of parameter time slots 41, i.e. the time unit at which the SAOC parameters such as OLD and IOC, are computed in an SAOC frame 40 and bsFreqRes may define the number of processing frequency bands for which SAOC parameters are computed. By this measure, each frame is divided-up into time/frequency tiles exemplified in
The downmixer 16 calculates SAOC parameters according to the following formulas. In particular, downmixer 16 computes object level differences for each object i as
wherein the sums and the indices n and k, respectively, go through all filter bank time slots 34, and all filter bank subbands 30 which belong to a certain time/frequency tile 42. Thereby, the energies of all subband values xi of an audio signal or object i are summed up and normalized to the highest energy value of that tile among all objects or audio signals.
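The normalization described above can be sketched in a few lines of Python. This is only an illustration of the wording (sum the energies of the subband values of each object over one tile, then divide by the largest such energy); the function name, the array layout and the small constant `eps` are assumptions made for the sketch and are not taken from the specification.

```python
import numpy as np

def object_level_differences(tile, eps=1e-12):
    """Return one OLD value per object for a single time/frequency tile.

    tile: complex array of shape (num_objects, num_time_slots, num_subbands)
          holding the filter bank subband values of every object in the tile
          (hypothetical layout chosen for this sketch).
    """
    # Energy of each object: sum of |x|^2 over all slots n and subbands k.
    energies = np.sum(np.abs(tile) ** 2, axis=(1, 2))
    # Normalize to the highest energy among all objects in this tile.
    return energies / (np.max(energies) + eps)

rng = np.random.default_rng(0)
tile = rng.standard_normal((3, 4, 2)) + 1j * rng.standard_normal((3, 4, 2))
tile[1] *= 0.5  # make the second object quieter than it would otherwise be
old = object_level_differences(tile)
```

With this normalization the loudest object in a tile has a value close to 1 and every other object has a value between 0 and 1.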
Further the SAOC downmixer 16 is able to compute a similarity measure of the corresponding time/frequency tiles of pairs of different input objects 141 to 14N. Although the SAOC downmixer 16 may compute the similarity measure between all the pairs of input objects 141 to 14N, downmixer 16 may also suppress the signaling of the similarity measures or restrict the computation of the similarity measures to audio objects 141 to 14N which form left or right channels of a common stereo channel. In any case, the similarity measure is called the inter-object cross-correlation parameter IOCi, j. The computation is as follows
with again indexes n and k going through all subband values belonging to a certain time/frequency tile 42, and i and j denoting a certain pair of audio objects 141 to 14N.
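A similarity measure of this kind can be sketched as a normalized cross-correlation over the subband values of one tile. The exact expression is not reproduced in the text above, so the form used here (real part of the cross term divided by the geometric mean of the two energies) is an assumption, as are the function name and `eps`.

```python
import numpy as np

def inter_object_cross_correlation(x_i, x_j, eps=1e-12):
    """Normalized cross-correlation of two objects within one tile.

    x_i, x_j: complex arrays with the subband values of objects i and j
              for all time slots n and subbands k of the tile.
    """
    cross = np.sum(x_i * np.conj(x_j))
    norm = np.sqrt(np.sum(np.abs(x_i) ** 2) * np.sum(np.abs(x_j) ** 2))
    # Assumed convention: keep the real part so the result lies in [-1, 1].
    return float(np.real(cross) / (norm + eps))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 2)) + 1j * rng.standard_normal((4, 2))
y = rng.standard_normal((4, 2)) + 1j * rng.standard_normal((4, 2))
ioc_same = inter_object_cross_correlation(x, x)
ioc_inverted = inter_object_cross_correlation(x, -x)
ioc_other = inter_object_cross_correlation(x, y)
```

Identical signals give a value near 1, a sign-inverted copy gives a value near -1, and unrelated signals fall in between.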
The downmixer 16 downmixes the objects 141 to 14N by use of gain factors applied to each object 141 to 14N. That is, a gain factor Di is applied to object i and then all thus weighted objects 141 to 14N are summed up to obtain a mono downmix signal. In the case of a stereo downmix signal, which case is exemplified in
This downmix prescription is signaled to the decoder side by means of down mix gains DMGi and, in case of a stereo downmix signal, downmix channel level differences DCLDi.
The downmix gains are calculated according to:
$\mathrm{DMG}_i = 20 \log_{10}(D_i + \varepsilon)$ (mono downmix),
$\mathrm{DMG}_i = 10 \log_{10}(D_{1,i}^2 + D_{2,i}^2 + \varepsilon)$ (stereo downmix),
where $\varepsilon$ is a small number such as $10^{-9}$.
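As a quick numerical illustration, the two downmix gain formulas can be evaluated directly. The function names are hypothetical; the constant follows the example value of ε given above.

```python
import math

EPS = 1e-9  # the small number epsilon from the text

def dmg_mono(d_i):
    """Downmix gain in dB for object i in a mono downmix."""
    return 20.0 * math.log10(d_i + EPS)

def dmg_stereo(d1_i, d2_i):
    """Downmix gain in dB for object i in a stereo downmix."""
    return 10.0 * math.log10(d1_i ** 2 + d2_i ** 2 + EPS)

unity_mono = dmg_mono(1.0)          # approximately 0 dB
muted_mono = dmg_mono(0.0)          # bounded by epsilon, about -180 dB
both_channels = dmg_stereo(1.0, 1.0)  # approximately 3.01 dB
```

The role of ε is visible in the muted case: a gain of zero maps to a finite value instead of minus infinity.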
For the DCLDs the following formula applies:
In the normal mode, downmixer 16 generates the downmix signal according to:
for a mono downmix, or
for a stereo downmix, respectively.
Thus, in the abovementioned formulas, parameters OLD and IOC are a function of the audio signals and parameters DMG and DCLD are a function of D. By the way, it is noted that D may be varying in time.
Thus, in the normal mode, downmixer 16 mixes all objects 141 to 14N with no preferences, i.e., with handling all objects 141 to 14N equally.
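The normal-mode downmix described above amounts to a weighted sum of the objects, which can be written as a matrix product. The sketch below assumes a gain matrix with one row per downmix channel (one row for mono, two rows for stereo); the function name, the array layout and the gain values are illustrative only.

```python
import numpy as np

def downmix(objects, gains):
    """Weighted sum of objects.

    objects: array of shape (num_objects, num_samples)
    gains:   array of shape (num_downmix_channels, num_objects);
             a single row gives a mono downmix, two rows a stereo downmix.
    """
    return np.asarray(gains, dtype=float) @ np.asarray(objects, dtype=float)

objects = np.array([[1.0, 2.0],
                    [0.5, 0.5],
                    [-1.0, 0.0]])
mono = downmix(objects, [[1.0, 1.0, 1.0]])
stereo = downmix(objects, [[1.0, 0.0, 0.7],
                           [0.0, 1.0, 0.7]])
```

In the stereo example the third object is placed equally in both downmix channels, while the first two objects each feed only one channel.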
The upmixer 22 performs the inversion of the downmix procedure and the implementation of the “rendering information” represented by matrix A in one computation step, namely
where matrix E is a function of the parameters OLD and IOC.
In other words, in the normal mode, no classification of the objects 141 to 14N into BGO, i.e., background object, or FGO, i.e., foreground object, is performed. The information as to which object shall be presented at the output of the upmixer 22 is to be provided by the rendering matrix A. If, for example, object with index 1 was the left channel of a stereo background object, the object with index 2 was the right channel thereof, and the object with index 3 was the foreground object, then rendering matrix A would be
to produce a Karaoke-type of output signal.
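The following sketch illustrates how inversion of the downmix and rendering can be combined in one step. The upmix expression is not reproduced in the text above, so the least-squares form A·E·Dᴴ·(D·E·Dᴴ)⁻¹·d used here is an assumption, and E is approximated by a sample covariance rather than being rebuilt from OLD and IOC. The rendering matrix is the one implied by the example (objects 1 and 2 routed to the left and right output, object 3 muted); all names are hypothetical.

```python
import numpy as np

def render(d, D, E, A):
    """Assumed one-step upmix and rendering: A E D^H (D E D^H)^-1 d."""
    DH = D.conj().T
    G = E @ DH @ np.linalg.inv(D @ E @ DH)  # estimates objects from the downmix
    return A @ G @ d

rng = np.random.default_rng(1)
S = rng.standard_normal((3, 256))   # rows: BGO left, BGO right, FGO
D = np.array([[1.0, 0.0, 0.7],
              [0.0, 1.0, 0.7]])     # stereo downmix prescription
d = D @ S
E = (S @ S.T) / S.shape[1]          # stand-in for E derived from OLD and IOC

# Karaoke-type rendering: keep both background channels, drop the foreground.
A_karaoke = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])
karaoke_out = render(d, D, E, A_karaoke)
all_objects = render(d, D, E, np.eye(3))
```

Because three objects are estimated from only two downmix channels, the estimates are consistent with the downmix but not exact, which is the kind of leakage that motivates the residual signal discussed later.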
However, as already indicated above, transmitting BGO and FGO by use of this normal mode of the SAOC codec does not achieve acceptable results.
The audio decoder 50 of
The multi-audio-object signal consists of a downmix signal 56 and side information 58. The side information 58 comprises level information 60 describing, for example, spectral energies of the audio signal of the first type and the audio signal of the second type in a first predetermined time/frequency resolution such as, for example, the time/frequency resolution 42. In particular, the level information 60 may comprise a normalized spectral energy scalar value per object and time/frequency tile. The normalization may be related to the highest spectral energy value among the audio signals of the first and second type at the respective time/frequency tile. The latter possibility results in OLDs for representing the level information, also called level difference information herein. Although the following embodiments use OLDs, they may, although not explicitly stated there, use an otherwise normalized spectral energy representation.
The side information 58 also comprises a residual signal 62 specifying residual level values in a second predetermined time/frequency resolution which may be equal to or different from the first predetermined time/frequency resolution.
The means 52 for computing prediction coefficients is configured to compute prediction coefficients based on the level information 60. Additionally, means 52 may compute the prediction coefficients further based on inter-correlation information also comprised by side information 58. Even further, means 52 may use time varying downmix prescription information comprised by side information 58 to compute the prediction coefficients. The prediction coefficients computed by means 52 are needed for retrieving or upmixing the original audio objects or audio signals from the downmix signal 56.
Accordingly, means 54 for upmixing is configured to upmix the downmix signal 56 based on the prediction coefficients 64 received from means 52 and the residual signal 62. By using the residual signal 62, decoder 50 is able to better suppress cross-talk from the audio signal of one type to the audio signal of the other type. In addition to the residual signal 62, means 54 may use the time varying downmix prescription to upmix the downmix signal. Further, means 54 for upmixing may use user input 66 in order to decide which of the audio signals recovered from the downmix signal 56 is actually output at output 68, or to what extent. As a first extreme, the user input 66 may instruct means 54 to merely output the first up-mix signal approximating the audio signal of the first type. The opposite is true for the second extreme, according to which means 54 is to output merely the second up-mix signal approximating the audio signal of the second type. Intermediate options are possible as well, according to which a mixture of both up-mix signals is rendered at output 68.
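The two extremes and the intermediate options of the user input can be pictured as a simple cross-fade between the two up-mix signals. The scalar parameter `mix` and the linear blending rule are assumptions made for this sketch; the text above does not prescribe how intermediate mixtures are formed.

```python
import numpy as np

def select_output(first_upmix, second_upmix, mix):
    """Blend the two up-mix signals according to a user setting.

    mix = 0.0 outputs only the first up-mix signal,
    mix = 1.0 outputs only the second up-mix signal,
    values in between give a mixture of both (hypothetical linear rule).
    """
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must lie in [0, 1]")
    first = np.asarray(first_upmix, dtype=float)
    second = np.asarray(second_upmix, dtype=float)
    return (1.0 - mix) * first + mix * second

first = np.array([1.0, 2.0, 3.0])
second = np.array([-1.0, 0.0, 1.0])
only_first = select_output(first, second, 0.0)
only_second = select_output(first, second, 1.0)
half_mix = select_output(first, second, 0.5)
```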
The audio encoder 80 further comprises means 86 for computing level information, means 88 for downmixing, means 90 for computing prediction coefficients and means 92 for setting a residual signal. Additionally, audio encoder 80 may comprise means for computing inter-correlation information, namely means 94. Means 86 computes level information describing the level of the audio signal of the first type and the audio signal of the second type in the first predetermined time/frequency resolution from the audio signal as optionally output by means 82. Similarly, means 88 downmixes the audio signals. Means 88 thus outputs the downmix signal 56. Means 86 also outputs the level information 60. Means 90 for computing prediction coefficients acts similarly to means 52. That is, means 90 computes prediction coefficients from the level information 60 and outputs the prediction coefficients 64 to means 92. Means 92, in turn, sets the residual signal 62 based on the downmix signal 56, the prediction coefficients 64 and the original audio signals at a second predetermined time/frequency resolution such that up-mixing the downmix signal 56 based on both the prediction coefficients 64 and the residual signal 62 results in a first up-mix audio signal approximating the audio signal of the first type and a second up-mix audio signal approximating the audio signal of the second type, the approximation being improved compared to the absence of the residual signal 62.
The residual signal 62 and the level information 60 are comprised by the side information 58 which forms, along with the downmix signal 56, the multi-audio-object signal to be decoded by decoder
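The interplay of level information, prediction coefficients and residual signal can be illustrated with a deliberately simplified toy model: one mono background signal, one mono foreground signal, a mono downmix with unit gains, and a single broadband prediction coefficient. All of these simplifications, the assumption of uncorrelated signals, and the function names are choices made for the sketch and do not reproduce the computation of the embodiments.

```python
import numpy as np

def encode(bgo, fgo):
    """Toy encoder: downmix, level information, and residual."""
    d = bgo + fgo                                # mono downmix, unit gains
    levels = (np.mean(bgo ** 2), np.mean(fgo ** 2))
    c = prediction_coefficient(levels)
    res = fgo - c * d                            # what prediction alone misses
    return d, levels, res

def prediction_coefficient(levels):
    """Predict the foreground from the downmix using levels only
    (assumes background and foreground are uncorrelated)."""
    e_bgo, e_fgo = levels
    return e_fgo / (e_bgo + e_fgo)

def decode(d, levels, res=None):
    """Toy decoder: prediction from the downmix, optionally refined by res."""
    c = prediction_coefficient(levels)
    fgo_hat = c * d if res is None else c * d + res
    bgo_hat = d - fgo_hat
    return bgo_hat, fgo_hat

rng = np.random.default_rng(2)
bgo = rng.standard_normal(512)
fgo = 0.5 * rng.standard_normal(512)
d, levels, res = encode(bgo, fgo)
bgo_pred, fgo_pred = decode(d, levels)           # without residual
bgo_hat, fgo_hat = decode(d, levels, res)        # with residual
```

Without the residual, the foreground estimate is just a scaled copy of the downmix and still contains the background. With the unquantized residual of this toy model the separation becomes exact; in practice the residual is coded at a limited resolution and bitrate, so the improvement is partial.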
As shown in
Again, it is noted that the audio signal of the first type may be a mono or stereo audio signal. The same applies for the audio signal of the second type. The residual signal 62 may be signaled within the side information in the same time/frequency resolution as the parameter time/frequency resolution used to compute, for example, the level information, or a different time/frequency resolution may be used. Further, it may be possible that the signaling of the residual signal is restricted to a sub-portion of the spectral range occupied by the time/frequency tiles 42 for which level information is signaled. For example, the time/frequency resolution at which the residual signal is signaled, may be indicated within the side information 58 by use of syntax elements bsResidualBands and bsResidualFramesPerSAOCFrame. These two syntax elements may define another sub-division of a frame into time/frequency tiles than the sub-division leading to tiles 42.
By the way, it is noted that the residual signal 62 may or may not reflect information loss resulting from a potentially used core encoder 96 optionally used to encode the downmix signal 56 by audio encoder 80. As shown in
The ability to set, within the multiple-audio-object signal, the time/frequency resolution used for the residual signal 62 differently from the time/frequency resolution used for computing the level information 60 makes it possible to achieve a good compromise between audio quality on the one hand and compression ratio of the multiple-audio-object signal on the other hand. In any case, the residual signal 62 makes it possible to better suppress cross-talk from one audio signal to the other within the first and second up-mix signals to be output at output 68 according to the user input 66.
As will become clear from the following embodiment, more than one residual signal 62 may be transmitted within the side information in case more than one foreground object or audio signal of the second type is encoded. The side information may allow for an individual decision as to whether a residual signal 62 is transmitted for a specific audio signal of a second type or not. Thus, the number of residual signals 62 may vary from one up to the number of audio signals of the second type.
In the audio decoder of
where the “1” denotes, depending on the number of channels of d, a scalar or an identity matrix, D⁻¹ is a matrix uniquely determined by a downmix prescription according to which the audio signal of the first type and the audio signal of the second type are downmixed into the downmix signal, and which is also comprised by the side information, and H is a term that is independent of d but dependent on the residual signal.
As noted above and described further below, the downmix prescription may vary in time and/or may spectrally vary within the side information. If the audio signal of the first type is a stereo audio signal having a first (L) and a second input channel (R), the level information, for example, describes normalized spectral energies of the first input channel (L), the second input channel (R) and the audio signal of the second type, respectively, at the time/frequency resolution 42.
The aforementioned computation according to which the means 54 for up-mixing performs the up-mixing may even be representable by
wherein $\hat{L}$ is a first channel of the first up-mix signal, approximating L, and $\hat{R}$ is a second channel of the first up-mix signal, approximating R, and the “1” is a scalar in case d is mono, and a 2×2 identity matrix in case d is stereo. If the downmix signal 56 is a stereo audio signal having a first (L0) and a second output channel (R0), the computation according to which the means 54 for up-mixing performs the up-mixing may be representable by
As far as the term H being dependent on the residual signal res is concerned, the computation according to which the means 54 for up-mixing performs the up-mixing may be representable by
The multi-audio-object signal may even comprise a plurality of audio signals of the second type and the side information may comprise one residual signal per audio signal of the second type. A residual resolution parameter may be present in the side information defining a spectral range over which the residual signal is transmitted within the side information. It may even define a lower and an upper limit of the spectral range.
Further, the multi-audio-object signal may also comprise spatial rendering information for spatially rendering the audio signal of the first type onto a predetermined loudspeaker configuration. In other words, the audio signal of the first type may be a multi channel (more than two channels) MPEG Surround signal downmixed down to stereo.
In the following, embodiments will be described which make use of the above residual signal signaling. However, it is noted that the term “object” is often used in a double sense. Sometimes, an object denotes an individual mono audio signal. Thus, a stereo object may have a mono audio signal forming one channel of a stereo signal. In other situations, however, a stereo object may in fact denote two objects, namely an object concerning the right channel and a further object concerning the left channel of the stereo object. The actual sense will become apparent from the context.
Before describing the next embodiment, same is motivated by deficiencies realized with the baseline technology of the SAOC standard selected as reference model 0 (RM0) in 2007. The RM0 allowed the individual manipulation of a number of sound objects in terms of their panning position and amplification/attenuation. A special scenario has been presented in the context of a “Karaoke” type application. In this case
As it is visible from subjective evaluation procedures, and could be expected from the underlying technology principle, manipulations of the object position lead to high-quality results, while manipulations of the object level are generally more challenging. Typically, the higher the additional signal amplification/attenuation is, the more potential artefacts arise. In this sense, the Karaoke scenario is extremely demanding since an extreme (ideally: total) attenuation of the FGO is necessitated.
The dual usage case is the ability to reproduce only the FGO without the background/MBO, and is referred to in the following as the solo mode.
It is noted, however, that if a surround background scene is involved, it is referred to as a Multi-Channel Background Object (MBO). The handling of the MBO is the following, which is shown in
In the transcoder 116, the downmix signal 112 is preprocessed and the SAOC and MPS side information streams 106, 114 are transcoded into a single MPS output side information stream 118. This currently happens in a discontinuous way, i.e. either only full suppression of the FGO(s) is supported or full suppression of the MBO.
Finally, the resulting downmix 120 and MPS side information 118 are rendered by an MPEG Surround decoder 122.
In
Assuming one FGO (e.g. one lead vocal), the key observation used by the following embodiment of
When comparing the embodiment of
The advantages resulting from the introduction of the TTT box of
The handling of the three TTT output signals L, R and C is performed in the “mixing” box 128 of the SAOC transcoder 116.
The processing structure of
The processing structure of
In summary, the embodiment of
Thus, a regular summation is replaced by a “TTT summation” (which can be cascaded when desired).
In order to emphasize the just-mentioned difference between the normal mode of the SAOC encoder and the enhanced mode, reference is made to
Problematically, the processing according to
A possible bitstream format for the one with cascaded TTTs could be as follows:
An addition to the SAOC bitstream that needs to be able to be skipped if to be digested in “regular decode mode”:
As to complexity and memory requirements, the following can be stated. As can be seen from the previous explanations, the enhanced Karaoke/Solo mode of
The relation of this additional structure to the complexity of an MPEG Surround system can be appreciated by looking at the structure of an entire MPEG Surround decoder which for the relevant stereo downmix case (5-2-5 configuration) consists of one TTT element and 2 OTT elements. This already shows that the added functionality comes at a moderate price in terms of computational complexity and memory consumption (note that conceptual elements using residual coding are on average no more complex than their counterparts which include decorrelators instead).
This extension of
A subjective evaluation procedure reveals the improvement in terms of audio quality of the output signal for a Karaoke or solo application. The conditions evaluated are:
The bitrate for the proposed enhanced mode is similar to RM0 if used without residual coding. All other enhanced modes necessitate about 10 kbit/s for every 6 bands of residual coding.
a shows the results for the mute/Karaoke test with 10 listening subjects. The proposed solution has an average MUSHRA score which is higher than RM0 and increases with each step of additional residual coding. A statistically significant improvement over the performance of RM0 can be clearly observed for modes with 6 and more bands of residual coding.
The results for the solo test with 9 subjects in
Overall, for a Karaoke application good quality is achieved at the cost of a bitrate ca. 10 kbit/s higher than RM0. Excellent quality is possible when adding ca. 40 kbit/s on top of the bitrate of RM0. In a realistic application scenario where a maximum fixed bitrate is given, the proposed enhanced mode allows the “unused bitrate” to be spent on residual coding until the permissible maximum rate is reached. Therefore, the best possible overall audio quality is achieved. A further improvement over the presented experimental results is possible through a more intelligent usage of residual bitrate: while the presented setup used residual coding from DC to a certain upper border frequency, an enhanced implementation would spend bits only for the frequency range that is relevant for separating FGO and background objects.
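The bitrate trade-off described above can be illustrated with a small arithmetic sketch. It assumes the rough figure quoted in the text (about 10 kbit/s for every 6 bands of residual coding) holds linearly; the function name and the example bitrates are hypothetical and are not taken from the listening test.

```python
def residual_bands_within_budget(rm0_kbps, max_kbps,
                                 kbps_per_step=10.0, bands_per_step=6):
    """Return how many residual bands fit into the unused bitrate.

    Assumption: residual coding costs roughly `kbps_per_step` kbit/s
    per `bands_per_step` bands, as quoted in the text, and only whole
    steps are spent.
    """
    spare = max_kbps - rm0_kbps
    if spare <= 0:
        return 0  # no unused bitrate, plain enhanced mode without residual
    steps = int(spare // kbps_per_step)
    return steps * bands_per_step


# Hypothetical example: 64 kbit/s base rate, 104 kbit/s permitted maximum.
print(residual_bands_within_budget(64, 104))
```

With a spare budget of 40 kbit/s, the sketch spends four steps of residual coding, which corresponds to the "ca. 40 kbit/s" operating point mentioned for excellent quality.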
In the foregoing description, an enhancement of the SAOC technology for the Karaoke-type applications has been described. Additional detailed embodiments of an application of the enhanced Karaoke/solo mode for multi-channel FGO audio scene processing for MPEG SAOC are presented.
In contrast to the FGOs, which are reproduced with alterations, the MBO signals have to be reproduced without alteration, i.e. every input channel signal is reproduced through the same output channel at an unchanged level. Consequently, the preprocessing of the MBO signals by an MPEG Surround encoder had been proposed yielding a stereo downmix signal that serves as a (stereo) background object (BGO) to be input to the subsequent Karaoke/solo mode processing stages comprising an SAOC encoder, an MBO transcoder and an MPS decoder.
As can be seen, according to the Karaoke/solo mode coder structure, the input objects are classified into a stereo background object (BGO) 104 and foreground objects (FGO) 110.
While in RM0 the handling of these application scenarios is performed by an SAOC encoder/transcoder system, the enhancement of
Since the straightforward implementation of the TTT building block involves three input signals at encoder side,
As can be seen from
In case of an FGO mono downmix as is the case with
which provides the downmix $(L_0\ R_0)^T$ and a signal $F_0$:
The 3rd signal obtained through this linear system is discarded, but can be reconstructed at transcoder side incorporating two prediction coefficients $c_1$ and $c_2$ (CPC) according to:
$\hat{F}_0 = c_1 L_0 + c_2 R_0.$
The inverse process at the transcoder is given by:
The parameters $m_1$ and $m_2$ correspond to:
$m_1 = \cos(\mu)$ and $m_2 = \sin(\mu)$
and $\mu$ is responsible for panning the FGO in the common TTT downmix $(L_0\ R_0)^T$. The prediction coefficients $c_1$ and $c_2$ necessitated by the TTT upmix unit at transcoder side can be estimated using the transmitted SAOC parameters, i.e. the object level differences (OLDs) for all input audio objects and inter-object correlation (IOC) for BGO downmix (MBO) signals. Assuming statistical independence of FGO and BGO signals the following relationship holds for the CPC estimation:
The variables $P_{Lo}$, $P_{Ro}$, $P_{LoRo}$, $P_{LoFo}$ and $P_{RoFo}$ can be estimated as follows, where the parameters $OLD_L$, $OLD_R$ and $IOC_{LR}$ correspond to the BGO, and $OLD_F$ is an FGO parameter:
$P_{Lo} = OLD_L + m_1^2\, OLD_F,$
$P_{Ro} = OLD_R + m_2^2\, OLD_F,$
$P_{LoRo} = IOC_{LR} + m_1 m_2\, OLD_F,$
$P_{LoFo} = m_1 (OLD_L - OLD_F) + m_2\, IOC_{LR},$
$P_{RoFo} = m_2 (OLD_R - OLD_F) + m_1\, IOC_{LR}.$
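As a minimal sketch, the five power terms listed above can be evaluated directly from the transmitted parameters and the panning angle. The function and argument names are illustrative only, and the inputs are assumed to be linear (not dB) values for a single parameter band.

```python
import math


def downmix_power_terms(old_l, old_r, ioc_lr, old_f, mu):
    """Evaluate P_Lo, P_Ro, P_LoRo, P_LoFo and P_RoFo as given in the text.

    old_l, old_r, ioc_lr describe the stereo BGO; old_f is the FGO OLD;
    mu is the panning angle with m1 = cos(mu) and m2 = sin(mu).
    """
    m1 = math.cos(mu)
    m2 = math.sin(mu)
    return {
        "P_Lo": old_l + m1 ** 2 * old_f,
        "P_Ro": old_r + m2 ** 2 * old_f,
        "P_LoRo": ioc_lr + m1 * m2 * old_f,
        "P_LoFo": m1 * (old_l - old_f) + m2 * ioc_lr,
        "P_RoFo": m2 * (old_r - old_f) + m1 * ioc_lr,
    }


# Hypothetical parameter values; mu = 0 pans the FGO fully into L0.
terms = downmix_power_terms(old_l=1.0, old_r=2.0, ioc_lr=0.5, old_f=0.25, mu=0.0)
```

Because $m_1^2 + m_2^2 = 1$ for any panning angle, the FGO energy added to the two downmix channels in this sketch always sums to $OLD_F$.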
Additionally, the error introduced by the application of the CPCs is represented by the residual signal 132 that can be transmitted within the bitstream, such that:
$res = F_0 - \hat{F}_0.$
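The prediction and residual steps can be sketched on short sample sequences. In this illustration the two CPCs are obtained by a plain least-squares fit on the signals themselves, which is an assumption made for the example and not the parameter-based estimation of the text; all function names are hypothetical.

```python
def least_squares_cpcs(l0, r0, f0):
    """Fit c1, c2 minimising the energy of f0 - (c1*l0 + c2*r0).

    Assumption for illustration: the coefficients are fitted on the
    waveforms via the 2x2 normal equations.
    """
    p_ll = sum(a * a for a in l0)
    p_rr = sum(b * b for b in r0)
    p_lr = sum(a * b for a, b in zip(l0, r0))
    p_lf = sum(a * f for a, f in zip(l0, f0))
    p_rf = sum(b * f for b, f in zip(r0, f0))
    det = p_ll * p_rr - p_lr ** 2
    if abs(det) < 1e-12:
        raise ValueError("L0 and R0 are (nearly) linearly dependent")
    c1 = (p_lf * p_rr - p_rf * p_lr) / det
    c2 = (p_rf * p_ll - p_lf * p_lr) / det
    return c1, c2


def predict_and_residual(l0, r0, f0, c1, c2):
    """Return the prediction F0_hat = c1*L0 + c2*R0 and res = F0 - F0_hat."""
    f_hat = [c1 * a + c2 * b for a, b in zip(l0, r0)]
    res = [f - fh for f, fh in zip(f0, f_hat)]
    return f_hat, res


# Hypothetical samples: F0 is only partly predictable from L0 and R0.
l0 = [1.0, 0.0, 2.0, -1.0]
r0 = [0.0, 1.0, 1.0, 3.0]
f0 = [0.6, -0.2, 0.7, -1.3]
c1, c2 = least_squares_cpcs(l0, r0, f0)
f_hat, res = predict_and_residual(l0, r0, f0, c1, c2)
```

Adding the residual back to the prediction recovers $F_0$ sample by sample, which is why transmitting the residual improves the reconstruction at the transcoder.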
In some application scenarios the restriction of a single mono downmix of all FGOs is inappropriate, hence needs to be overcome. For example, the FGOs can be divided into two or more independent groups with different positions in the transmitted stereo downmix and/or individual attenuation. Therefore, the cascaded structure shown in
The detailed mathematics involved with the two-stage cascade shown in
Without loss of generality, and for simplicity of illustration, the following explanation is based on a cascade consisting of two TTT elements as shown in
Here, the two sets of CPCs result in the following signal reconstruction:
$\hat{F}_{01} = c_{11} L_{01} + c_{12} R_{01}$ and $\hat{F}_{02} = c_{21} L_{02} + c_{22} R_{02}.$
The inverse process is represented by:
A special case of the two-stage cascade comprises one stereo FGO with its left and right channel being summed properly to the corresponding channels of the BGO, yielding μ=0 and
For this particular panning style, and neglecting the inter-object correlation ($OLD_{LR}=0$), the estimation of the two sets of CPCs reduces to:
with OLDFL and OLDFR denoting the OLDs of the left and right FGO signal, respectively.
The general N-stage cascade case refers to a multi-channel FGO downmix according to:
where each stage features its own CPCs and residual signal.
At the transcoder side, the inverse cascading steps are given by:
To remove the need to preserve the order of the TTT elements, the cascaded structure can easily be converted into an equivalent parallel structure by rearranging the N matrices into one single symmetric TTN matrix, thus yielding a general TTN style:
where the first two lines of the matrix denote the stereo downmix to be transmitted. On the other hand, the term TTN—two-to-N—refers to the upmixing process at transcoder side.
Using this description the special case of the particularly panned stereo FGO reduces the matrix to:
Accordingly this unit can be termed two-to-four element or TTF.
It is also possible to yield a TTF structure reusing the SAOC stereo preprocessor module.
For the limitation of N=4 an implementation of the two-to-four (TTF) structure which reuses parts of the existing SAOC system becomes feasible. The processing is described in the following paragraphs.
The SAOC standard text describes the stereo downmix preprocessing for the “stereo-to-stereo transcoding mode”. Precisely, the output stereo signal $Y$ is calculated from the input stereo signal $X$ together with a decorrelated signal $X_d$ as follows:
$Y = G_{Mod} X + P_2 X_d$
The decorrelated component Xd is a synthetic representation of parts of the original rendered signal which have already been discarded in the encoding process. According to
The nomenclature is defined as:
Note that $G_{Mod}$ is a function of D, A and E.
To calculate the residual signal $X_{Res}$ the decoder processing may be mimicked in the encoder, i.e. to determine $G_{Mod}$. In general scenarios A is not known, but in the special case of a Karaoke scenario (e.g. with one stereo background and one stereo foreground object, N=4) it is assumed that
which means that only the BGO is rendered.
For an estimation of the foreground object the reconstructed background object is subtracted from the downmix signal X. This and the final rendering is performed in the “Mix” processing block. Details are presented in the following.
The rendering matrix A is set to
where it is assumed that the first 2 columns represent the 2 channels of the FGO and the second 2 columns represent the 2 channels of the BGO.
The BGO and FGO stereo output is calculated according to the following formulas.
$Y_{BGO} = G_{Mod} X + X_{Res}$
As the downmix weight matrix D is defined as
the FGO object can be set to
As an example, this reduces to
$Y_{FGO} = X - Y_{BGO}$
for a downmix matrix of
XRes are the residual signals obtained as described above. Please note that no decorrelated signals are added.
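For the stereo FGO case, the two output formulas above can be sketched per sample (or per time/frequency tile) as a 2x2 matrix product followed by a subtraction. The values chosen for $G_{Mod}$ and $X_{Res}$ are hypothetical, and the simple difference $Y_{FGO} = X - Y_{BGO}$ is only the reduced form the text gives for one particular downmix matrix.

```python
def mat_vec(m, v):
    """Multiply a 2x2 matrix (list of rows) with a 2-element vector."""
    return [sum(m[i][j] * v[j] for j in range(2)) for i in range(2)]


def split_bgo_fgo(g_mod, x, x_res):
    """Return (Y_BGO, Y_FGO) for one stereo downmix sample x.

    Y_BGO = G_Mod * X + X_Res, then Y_FGO = X - Y_BGO.
    No decorrelated signal is added, in line with the text.
    """
    y_bgo = [a + b for a, b in zip(mat_vec(g_mod, x), x_res)]
    y_fgo = [a - b for a, b in zip(x, y_bgo)]
    return y_bgo, y_fgo


# Hypothetical values for one sample of the stereo downmix.
g_mod = [[0.5, 0.0], [0.0, 0.25]]
x = [1.0, -2.0]
x_res = [0.125, 0.0]
y_bgo, y_fgo = split_bgo_fgo(g_mod, x, x_res)
```

By construction the two outputs of this sketch sum back to the downmix sample, so whatever is not attributed to the background is attributed to the foreground.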
The final output Y is given by
The above embodiments can also be applied if a mono FGO instead of a stereo FGO is used. The processing is then altered according to the following.
The rendering matrix A is set to
where it is assumed that the first column represents the mono FGO and the subsequent columns represent the 2 channels of the BGO.
The BGO and FGO stereo output is calculated according to the following formulas.
$Y_{FGO} = G_{Mod} X + X_{Res}$
As the downmix weight matrix D is defined as
the BGO object can be set to
As an example, this reduces to
for a downmix matrix of
XRes are the residual signals obtained as described above. Please note that no decorrelated signals are added.
The final output Y is given by
For the handling of more than 4 FGO objects, the above embodiments can be extended by assembling parallel stages of the processing steps just described.
The just-described embodiments provided the detailed description of the enhanced Karaoke/solo mode for the case of a multi-channel FGO audio scene. This generalization aims to enlarge the class of Karaoke application scenarios for which the sound quality of the MPEG SAOC reference model can be further improved by application of the enhanced Karaoke/solo mode. The improvement is achieved by introducing a general NTT structure into the downmix part of the SAOC encoder and the corresponding counterparts into the SAOCtoMPS transcoder. The use of residual signals enhances the resulting quality.
a to 13h show a possible syntax of the SAOC side information bit stream according to an embodiment of the present invention.
After having described some embodiments concerning an enhanced mode for the SAOC codec, it should be noted that some of the embodiments concern application scenarios where the audio input to the SAOC encoder contains not only regular mono or stereo sound sources but multi-channel objects. This was explicitly described with respect to
A further embodiment for an enhanced Karaoke/Solo mode is described below. It allows the individual manipulation of a number of audio objects in terms of their level amplification/attenuation without significant decrease in the resulting sound quality. A special “Karaoke-type” application scenario necessitates a total suppression of the specific objects, typically the lead vocal, (in the following called ForeGround Object FGO) keeping the perceptual quality of the background sound scene unharmed. It also entails the ability to reproduce the specific FGO signals individually without the static background audio scene (in the following called BackGround Object BGO), which does not necessitate user controllability in terms of panning. This scenario is referred to as a “Solo” mode. A typical application case contains a stereo BGO and up to four FGO signals, which can, for example, represent two independent stereo objects.
According to this embodiment and
An MBO can be treated the same way as explained above, i.e. it is preprocessed by an MPEG Surround encoder yielding a mono or stereo downmix signal that serves as BGO to be input to the subsequent enhanced SAOC encoder. In this case the transcoder has to be provided with an additional MPEG Surround bitstream next to the SAOC bitstream.
Next, the calculation performed by the TTN (OTN) element is explained. The TTN/OTN matrix expressed in a first predetermined time/frequency resolution 42, M, is the product of two matrices
$M = D^{-1} C,$
where $D^{-1}$ comprises the downmix information and C implies the channel prediction coefficients (CPCs) for each FGO channel. C is computed by means 52 and box 152, respectively, and $D^{-1}$ is computed and applied, along with C, to the SAOC downmix by means and box 152, respectively. The computation is performed according to
for the TTN element, i.e. a stereo downmix and
for the OTN element, i.e. a mono downmix.
The CPCs are derived from the transmitted SAOC parameters, i.e. the OLDs, IOCs, DMGs and DCLDs. For one specific FGO channel j the CPCs can be estimated by
The parameters OLDL, OLDR and IOCLR correspond to the BGO, the remainder are FGO values.
The coefficients $m_j$ and $n_j$ denote the downmix values for every FGO j for the left and right downmix channel, respectively, and are derived from the downmix gains DMG and downmix channel level differences DCLD
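One way such a derivation can look is sketched below. It rests on assumptions that are not confirmed by the surrounding text: DMG and DCLD are taken to be dB values, DMG is treated as an overall amplitude gain, and DCLD as the left-to-right power ratio of the object in the downmix. The weight $m_j$ is taken as the left/mono weight and $n_j$ as the right weight, matching the later description of $m_i$ and $n_i$. The function name is hypothetical.

```python
import math


def fgo_downmix_weights(dmg_db, dcld_db):
    """Map (DMG, DCLD) in dB to linear downmix weights (m_j, n_j).

    Assumed convention: total gain = 10^(0.05*DMG) and
    m_j^2 / n_j^2 = 10^(0.1*DCLD).
    """
    gain = 10.0 ** (0.05 * dmg_db)
    ratio = 10.0 ** (0.1 * dcld_db)
    m_j = gain * math.sqrt(ratio / (1.0 + ratio))  # left / mono weight
    n_j = gain * math.sqrt(1.0 / (1.0 + ratio))    # right weight
    return m_j, n_j


# Hypothetical example: unity gain, object centred between both channels.
m_j, n_j = fgo_downmix_weights(dmg_db=0.0, dcld_db=0.0)
```

Under this convention the squared weights always sum to the squared gain, so DCLD only redistributes the object between the two downmix channels.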
With respect to the OTN element, the computation of the second CPC values cj2 becomes redundant.
To reconstruct the two object groups BGO and FGO, the downmix information is exploited by the inverse of the downmix matrix D that is extended to further prescribe the linear combination for signals F01 to F0N, i.e.
In the following, the downmix at encoder's side is recited: Within the TTN−1 element, the extended downmix matrix is
for a stereo BGO,
for a mono BGO,
and for the OTN−1 element it is
for a stereo BGO,
for a mono BGO.
The output of the TTN/OTN element yields
for a stereo BGO and a stereo downmix. In case the BGO and/or downmix is a mono signal, the linear system changes accordingly.
The residual signal $res_i$ corresponds to FGO object i and, if it is not transferred in the SAOC stream (because, for example, it lies outside the residual frequency range, or it is signalled that no residual signal is transferred at all for FGO object i), $res_i$ is inferred to be zero. $\hat{F}_i$ is the reconstructed/up-mixed signal approximating FGO object i. After computation, it may be passed through a synthesis filter bank to obtain a time-domain version, such as a PCM coded version, of FGO object i. It is recalled that $L_0$ and $R_0$ denote the channels of the SAOC downmix signal and are available/signalled in an increased time/frequency resolution compared to the parameter resolution underlying indices (n,k). $\hat{L}$ and $\hat{R}$ are the reconstructed/up-mixed signals approximating the left and right channels of the BGO object. Along with the MPS side bitstream, the BGO may be rendered onto the original number of channels.
According to an embodiment, the following TTN matrix is used in an energy mode.
The energy-based encoding/decoding procedure is designed for non-waveform-preserving coding of the downmix signal. Thus the TTN upmix matrix for the corresponding energy mode does not rely on specific waveforms, but only describes the relative energy distribution of the input audio objects. The elements of this matrix $M_{Energy}$ are obtained from the corresponding OLDs according to
for a stereo BGO,
and
for a mono BGO,
so that the output of the TTN element yields
or respectively
Accordingly, for a mono downmix the energy-based upmix matrix MEnergy becomes
for a stereo BGO, and
for a mono BGO,
so that the output of the OTN element results in
or respectively
Thus, according to the just mentioned embodiment, the classification of all objects (Obj1 . . . ObjN) into BGO and FGO, respectively, is done at encoder's side. The BGO may be a mono ($L$) or stereo $(L\ R)^T$ object. The downmix of the BGO into the downmix signal is fixed. As far as the FGOs are concerned, the number thereof is theoretically not limited. However, for most applications a total of four FGO objects seems adequate. Any combinations of mono and stereo objects are feasible. By way of parameters $m_i$ (weighting in left/mono downmix signal) and $n_i$ (weighting in right downmix signal), the FGO downmix is variable both in time and frequency. As a consequence, the downmix signal may be mono ($L_0$) or stereo $(L_0\ R_0)^T$.
Again, the signals (F01 . . . F0N)T are not transmitted to the decoder/transcoder. Rather, same are predicted at decoder's side by means of the aforementioned CPCs.
In this regard, it is again noted that the residual signals res may even be disregarded by a decoder. In this case, a decoder (means 52, for example) predicts the virtual signals merely based on the CPCs, according to:
Then, BGO and/or FGO are obtained (by means 54, for example) by inversion of one of the four possible linear combinations of the encoder,
for example,
where again $D^{-1}$ is a function of the parameters DMG and DCLD.
Thus, in total, a residual neglecting TTN (OTN) Box 152 computes both just-mentioned computation steps
for example:
It is noted that the inverse of D can be obtained straightforwardly if D is square (quadratic). For a non-square matrix D, the inverse of D shall be the pseudo-inverse, i.e. $\mathrm{pinv}(D) = D^{*}(DD^{*})^{-1}$ or $\mathrm{pinv}(D) = (D^{*}D)^{-1}D^{*}$. In either case, an inverse for D exists.
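The first pseudo-inverse form, $D^{*}(DD^{*})^{-1}$, can be sketched for a real-valued downmix matrix with two rows and N columns, where $D^{*}$ is simply the transpose. The sketch assumes D has full row rank (otherwise $DD^{*}$ is singular and this form does not apply); the function name and the example matrix are hypothetical.

```python
def right_pseudo_inverse(d):
    """Return pinv(D) = D^T (D D^T)^-1 for a real 2 x N matrix D.

    Assumption: D has full row rank, so the 2x2 Gram matrix D D^T
    can be inverted in closed form.
    """
    row0, row1 = d
    a = sum(x * x for x in row0)
    b = sum(x * y for x, y in zip(row0, row1))
    c = sum(y * y for y in row1)
    det = a * c - b * b
    if abs(det) < 1e-12:
        raise ValueError("D does not have full row rank")
    inv = [[c / det, -b / det], [-b / det, a / det]]  # (D D^T)^-1
    # Row k of the result is (column k of D) times (D D^T)^-1.
    return [[row0[k] * inv[0][j] + row1[k] * inv[1][j] for j in range(2)]
            for k in range(len(row0))]


# Hypothetical 2 x 4 downmix matrix: two objects per downmix channel.
d = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0]]
p = right_pseudo_inverse(d)
```

Multiplying D by the result gives the 2x2 identity, which is the property that makes this form usable as an inverse of a wide downmix matrix.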
Finally,
Depending on an actual implementation, the inventive encoding/decoding methods can be implemented in hardware or in software. Therefore, the present invention also relates to a computer program, which can be stored on a computer-readable medium such as a CD, a disk or any other data carrier. The present invention is, therefore, also a computer program having a program code which, when executed on a computer, performs the inventive method of encoding or the inventive method of decoding described in connection with the above figures.
While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations and equivalents as fall within the true spirit and scope of the present invention.
This application claims priority from Provisional U.S. Patent Application No. 60/980,571, which was filed on Oct. 17, 2007, and from Provisional U.S. Patent Application No. 60/991,335, which was filed on Nov. 30, 2007, which are both incorporated herein in their entirety by reference.
Number | Date | Country | |
---|---|---|---|
60980571 | Oct 2007 | US | |
60991335 | Nov 2007 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 13451649 | Apr 2012 | US |
Child | 13747502 | US | |
Parent | 12253515 | Oct 2008 | US |
Child | 13451649 | US |