The present invention relates to a transformation of multi-channel parameters, and in particular to the generation of coherence parameters and level parameters, which indicate spatial properties between two audio signals, based on an object-parameter based representation of a spatial audio scene.
There are several approaches for parametric coding of multi-channel audio signals, such as ‘Parametric Stereo (PS)’, ‘Binaural Cue Coding (BCC) for Natural Rendering’ and ‘MPEG Surround’, which aim at representing a multi-channel audio signal by means of a down-mix signal (which could be either monophonic or comprise several channels) and parametric side information (‘spatial cues’) characterizing its perceived spatial sound stage.
Those techniques could be called channel-based, i.e. the techniques try to transmit a multi-channel signal already present or generated in a bitrate-efficient manner. That is, a spatial audio scene is mixed to a predetermined number of channels before transmission of the signal to match a predetermined loudspeaker set-up and those techniques aim at the compression of the audio channels associated to the individual loudspeakers.
The parametric coding techniques rely on a down-mix channel carrying audio content together with parameters, which describe the spatial properties of the original spatial audio scene and which are used on the receiving side to reconstruct the multi-channel signal or the spatial audio scene.
A closely related group of techniques, e.g. ‘BCC for Flexible Rendering’, are designed for efficient coding of individual audio objects rather than channels of the same multi-channel signal for the sake of interactively rendering them to arbitrary spatial positions and independently amplifying or suppressing single objects without any a priori encoder knowledge thereof. In contrast to common parametric multi-channel audio coding techniques (which convey a given set of audio channel signals from an encoder to a decoder), such object coding techniques allow rendering of the decoded objects to any reproduction setup, i.e. the user on the decoding side is free to choose a reproduction setup (e.g. stereo, 5.1 surround) according to his preference.
Following the object coding concept, parameters can be defined, which identify the position of an audio object in space, to allow for flexible rendering on the receiving side. Rendering at the receiving side has the advantage, that even non-ideal loudspeaker set-ups or arbitrary loudspeaker set-ups can be used to reproduce the spatial audio scene with high quality. In addition, an audio signal, such as, for example, a down-mix of the audio channels associated with the individual objects, has to be transmitted, which is the basis for the reproduction on the receiving side.
Both discussed approaches rely on a multi-channel speaker set-up at the receiving side, to allow for a high-quality reproduction of the spatial impression of the original spatial audio scene.
As previously outlined, there are several state-of-the-art techniques for parametric coding of multi-channel audio signals which are capable of reproducing a spatial sound image, which is—dependent on the available data rate—more or less similar to that of the original multi-channel audio content.
However, given some pre-coded audio material (i.e. spatial sound described by a given number of reproduction channel signals), such a codec does not offer any means for a-posteriori and interactive rendering of single audio objects according to the liking of the listener. On the other hand, there are spatial audio object coding techniques which are specially designed for the latter purpose, but since the parametric representations used in such systems are different from those for multi-channel audio signals, separate decoders are needed in case one wants to benefit from both techniques in parallel. The drawback that results from this situation is that, although the back-ends of both systems fulfill the same task, which is rendering of spatial audio scenes on a given loudspeaker setup, they have to be implemented redundantly, i.e. two separate decoders are necessitated to provide both functionalities.
Another limitation of the prior-art object coding technology is the lack of a means for storing and/or transmitting pre-rendered spatial audio object scenes in a backwards compatible way. The feature of enabling interactive positioning of single audio objects provided by the spatial audio object coding paradigm turns out to be a drawback when it comes to identical reproduction of a readily rendered audio scene.
Summarizing, one is confronted with the unfortunate situation that, although a multi-channel playback environment implementing one of the above approaches may be present, a further playback environment may be needed to also implement the second approach. It may be noted that, owing to their longer history, channel-based coding schemes are much more common, for example the well-known 5.1 or 7.1/7.2 multi-channel signals stored on DVD or the like.
That is, even if a multi-channel audio decoder and associated playback equipment (amplifier stages and loudspeakers) are present, a user needs an additional complete set-up, i.e. at least an audio decoder, when he wants to play back object-based coded audio data. Normally, the multi-channel audio decoders are directly associated to the amplifier stages and a user does not have direct access to the amplifier stages used for driving the loudspeakers. This is, for example, the case in most of the commonly available multi-channel audio or multimedia receivers. Based on existing consumer electronics, a user desiring to be able to listen to audio content encoded with both approaches would even need a complete second set of amplifiers, which is, of course, an unsatisfying situation.
According to an embodiment, a multi-channel parameter transformer for generating a level parameter indicating an energy relation between a first audio signal and a second audio signal of a representation of a multi-channel spatial audio signal, may have an object parameter provider for providing object parameters for a plurality of audio objects associated to a down-mix channel depending on the object audio signals associated to the audio objects, the object parameters having an energy parameter for each audio object indicating an energy information of the object audio signal; and a parameter generator for deriving the level parameter by combining the energy parameters and object rendering parameters related to a rendering configuration.
According to another embodiment, a method for generating a level parameter indicating an energy relation between a first audio signal and a second audio signal of a representation of a multi-channel spatial audio signal, may have the steps of providing object parameters for a plurality of audio objects associated to a down-mix channel depending on the object audio signals associated to the audio objects, the object parameters having an energy parameter for each audio object indicating an energy information of the object audio signal; and deriving the level parameter by combining the energy parameters and object rendering parameters related to a rendering configuration.
According to another embodiment, a computer program may have a program code for performing, when running on a computer, a method for generating a level parameter indicating an energy relation between a first audio signal and a second audio signal of a representation of a multi-channel spatial audio signal, which may have the steps of: providing object parameters for a plurality of audio objects associated to a down-mix channel depending on the object audio signals associated to the audio objects, the object parameters having an energy parameter for each audio object indicating an energy information of the object audio signal; and deriving the level parameter by combining the energy parameters and object rendering parameters related to a rendering configuration.
It is therefore desirable to be able to provide a method to reduce the complexity of systems, which are capable of both decoding of parametric multi-channel audio streams as well as parametrically coded spatial audio object streams.
An embodiment of the invention is a multi-channel parameter transformer for generating a level parameter indicating an energy relation between a first audio signal and a second audio signal of a representation of a multi-channel spatial audio signal, comprising: an object parameter provider for providing object parameters for a plurality of audio objects associated to a down-mix channel depending on the object audio signals associated to the audio objects, the object parameters comprising an energy parameter for each audio object indicating an energy information of the object audio signal; and a parameter generator for deriving the level parameter by combining the energy parameters and object rendering parameters related to a rendering configuration.
According to a further embodiment of the present invention, the parameter transformer generates a coherence parameter and a level parameter, indicating a correlation or coherence and an energy relation between a first and a second audio signal of a multi-channel audio signal associated to a multi-channel loudspeaker configuration. The correlation- and level parameters are generated based on provided object parameters for at least one audio object associated to a down-mix channel, which is itself generated using an object audio signal associated to the audio object, wherein the object parameters comprise an energy parameter indicating an energy of the object audio signal. To derive the coherence and the level parameter, a parameter generator is used, which combines the energy parameter and additional object rendering parameters, which are influenced by a playback configuration. According to some embodiments, the object rendering parameters comprise loudspeaker parameters indicating the location of the playback loudspeakers with respect to a listening position. According to some embodiments, the object rendering parameters comprise object location parameters indicating the location of the objects with respect to a listening position. To this end, the parameter generator takes advantage of synergy effects resulting from both spatial audio coding paradigms.
According to a further embodiment of the present invention, the multi-channel parameter transformer is operative to derive MPEG Surround compliant coherence and level parameters (ICC and CLD), which can furthermore be used to steer an MPEG Surround decoder. It is noted that the inter-channel coherence/cross-correlation (ICC) represents the coherence or cross-correlation between the two input channels. When time differences are not included, coherence and correlation are the same. Stated differently, both terms point to the same characteristic when inter-channel time differences or inter-channel phase differences are not used.
In this way, a multi-channel parameter transformer together with a standard MPEG Surround decoder can be used to reproduce an object-based encoded audio signal. This has the advantage that only an additional parameter transformer is needed, which receives a spatial audio object coded (SAOC) audio signal and transforms the object parameters such that they can be used by a standard MPEG Surround decoder to reproduce the multi-channel audio signal via the existing playback equipment. Therefore, common playback equipment can be used without major modifications to also reproduce spatial audio object coded content.
According to a further embodiment of the present invention, the generated coherence and level parameters are multiplexed with the associated down-mix channel into an MPEG Surround compliant bitstream. Such a bitstream can then be fed to a standard MPEG Surround decoder without requiring any further modifications to the existing playback environment.
According to a further embodiment of the present invention, the generated coherence and level parameters are directly transmitted to a slightly modified MPEG Surround-decoder, such that the computational complexity of a multi-channel parameter transformer can be kept low.
According to a further embodiment of the present invention, the generated multi-channel parameters (coherence parameter and level parameter) are stored after the generation, such that a multi-channel parameter transformer can also be used as a means for preserving the spatial information gained during scene rendering. Such scene rendering can, for example, also be performed at the music-studio while generating the signals, such that a multi-channel compatible signal can be generated without any additional effort, using a multi-channel parameter transformer as described in more detail in the following paragraphs. Thus, pre-rendered scenes could be reproduced using legacy equipment.
Prior to a more detailed description of several embodiments of the present invention, a short review of the multi-channel audio coding and object audio coding techniques and spatial audio object coding techniques will be given. To this end, reference will also be made to the enclosed Figures.
a shows a prior art multi-channel audio coding scheme;
b shows a prior art object coding scheme;
a and 6b show application scenarios for spatial audio object coded content;
a shows a schematic view of a multi-channel audio encoding and decoding scheme, whereas
When decoding, this information can be used to redistribute the audio channels comprised in the down-mix signal to the reconstructed audio channels 12a to 12d. It may be noted that the generic multi-channel audio scheme is implemented to reproduce the same number of reconstructed channels 12a to 12d as the number of original audio channels 2a to 2d input into the multi-channel audio encoder 4. However, other decoding schemes can also be implemented, reproducing more or fewer channels than the number of the original audio channels 2a to 2d.
In a way, the multi-channel audio techniques schematically sketched in
b details the prior art approach to object-based audio coding. As an example, coding of sound objects and the ability of “content-based interactivity” is part of the MPEG-4 concept. The conventional audio object coding technique schematically sketched in
A conventional audio object decoder 28 aims at reproducing the audio objects 22a to 22d, to derive reconstructed audio objects 28a to 28d. A scene composer 30 within a conventional audio object decoder allows for a discrete positioning of the reconstructed audio objects 28a to 28d (sources) and the adaptation to various loudspeakers set-ups. A scene is fully defined by a scene description 34 and associated audio objects. Some conventional scene composers 30 expect a scene description in a standardized language, e.g. BIFS (binary format for scene description). On the decoder side, arbitrary loudspeaker set-ups may be present and the decoder provides audio channels 32a to 32e to individual loudspeakers, which are optimally tailored to the reconstruction of the audio scene, as the full information on the audio scene is available on the decoder side. For example, binaural rendering is feasible, which results in two audio channels generated to provide a spatial impression when listened to via headphones.
An optional user interaction to the scene composer 30 enables a repositioning/repanning of the individual audio objects on the reproduction side. Additionally, positions or levels of specifically selected audio objects can be modified, to, for example, increase the intelligibility of a talker, when ambient noise objects or other audio objects related to different talkers in a conference are suppressed, i.e. decreased in level.
In other words, conventional audio object coders encode a number of audio objects into elementary streams, each stream associated to one single audio object. The conventional decoder decodes these streams and composes an audio scene under the control of a scene description (BIFS) and optionally based on user interaction. In terms of practical application, this approach suffers from several disadvantages:
Due to the separate encoding of each individual audio (sound) object, the necessitated bitrate for transmission of the whole scene is significantly higher than rates used for a monophonic/stereophonic transmission of compressed audio. Obviously, the necessitated bitrate grows approximately proportionally with the number of transmitted audio objects, i.e. with the complexity of the audio scene.
Consequently, due to the separate decoding of each sound object, the computational complexity for the decoding process significantly exceeds that of a regular mono/stereo audio decoder. The necessitated computational complexity for decoding grows approximately proportionally with the number of transmitted objects as well (assuming a low complexity composition procedure). When using advanced composition capabilities, i.e. using different computational nodes, these disadvantages are further increased by the complexity associated with the synchronization of corresponding audio nodes and with the overall complexity in running a structured audio engine.
Furthermore, since the total system involves several audio decoder components and a BIFS-based composition unit, the complexity of the necessitated structure is an obstacle to the implementation in real-world applications. Advanced composition capabilities furthermore necessitate the implementation of a structured audio engine with the above-mentioned complications.
As it will become apparent from the discussion of
Utilizing existing multi-channel audio coding structures, such as MPEG Surround, the inventive concept evolves into a bitrate-efficient and compatible extension of existing audio distribution infrastructure towards the capability of using an object-based representation. To distinguish from the prior approaches of audio object coding (AOC) and spatial audio coding (multi-channel audio coding), embodiments of the present invention will in the following be referred to using the term spatial audio object coding or its abbreviation SAOC.
The spatial audio object coding scheme shown in
The SAOC-decoder 56 receives the down-mix signal 54 together with the side information 55. Based on the down-mix signal 54 and the side information 55, the spatial audio object decoder 56 reconstructs a set of audio objects 58a to 58d. Reconstructed audio objects 58a to 58d are input into a mixer/rendering stage 60, which mixes the audio content of the individual audio objects 58a to 58d to generate a desired number of output channels 62a and 62b, which normally correspond to a multi-channel loudspeaker set-up intended to be used for playback.
Optionally, the parameters of the mixer/renderer 60 can be influenced according to a user interaction or control 64, to allow interactive audio composition and thus maintain the high flexibility of audio object coding.
The concept of spatial audio object coding shown in
The transmission is extremely bitrate-efficient due to the use of down-mix signals and accompanying object parameters. That is, object based side information is transmitted together with a down-mix signal, which is composed of audio signals associated to individual audio objects. Therefore, the bit rate demand is significantly decreased as compared to approaches, where the signal of each individual audio object is separately encoded and transmitted. Furthermore, the concept is backwards compatible to already existing transmission structures. Legacy devices would simply render (compose) the downmix signal.
The reconstructed audio objects 58a to 58d can be directly transferred to a mixer/renderer 60 (scene composer). In general, the reconstructed audio objects 58a to 58d could be connected to any external mixing device (mixer/renderer 60), such that the inventive concept can be easily implemented into already existing playback environments. The individual audio objects 58a to 58d could in principle be used as a solo presentation, i.e. be reproduced as a single audio stream, although they are usually not intended to serve as a high quality solo reproduction.
In contrast to separate SAOC decoding and subsequent mixing, a combined SAOC-decoder and mixer/renderer is extremely attractive because it leads to very low implementation complexity. As compared to the straight forward approach, a full decoding/reconstruction of the objects 58a to 58d as an intermediate representation can be avoided. The computation is mainly related to the number of intended output rendering channels 62a and 62b. As it becomes apparent from
Generally, transmitting down-mix signals 54 associated with corresponding audio object information 55 can be combined with arbitrary multi-channel audio coding techniques, such as, for example, parametric stereo, binaural cue coding or MPEG Surround.
The embodiment shown in
In
The MPEG Surround decoder 100 additionally receives spatial cues 104 of an MPEG Surround bitstream, such as coherence parameters ICC and level parameters CLD, both representing the signal characteristics between two audio signals within the MPEG Surround encoding/decoding scheme, which is shown in
A multi-channel parameter transformer 106 receives SAOC parameters (object parameters) 122 related to audio objects, which indicate properties of associated audio objects contained within Downmix Signal 102. Furthermore, the transformer 106 receives object rendering parameters via an object rendering parameters input. These parameters can be the parameters of a rendering matrix or can be parameters useful for mapping audio objects into a rendering scenario. Depending on the object positions exemplarily adjusted by the user and input into block 112, the rendering matrix will be calculated by block 112. The output of block 112 is then input into block 106 and particularly into the parameter generator 108 for calculating the spatial audio parameters. When the loudspeaker configuration changes, the rendering matrix or generally at least some of the object rendering parameters change as well. Thus, the rendering parameters depend on the rendering configuration, which comprises the loudspeaker configuration/playback configuration or the transmitted or user-selected object positions, both of which can be input into block 112.
A parameter generator 108 derives the MPEG Surround spatial cues 104 based on the object parameters, which are provided by object parameter provider (SAOC parser) 110. The parameter generator 108 additionally makes use of rendering parameters provided by a weighting factor generator 112. Some or all of the rendering parameters are weighting parameters describing the contribution of the audio objects contained in the down-mix signal 102 to the channels created by the spatial audio object decoder 120. The weighting parameters could, for example, be organized in a matrix, since these serve to map a number of N audio objects to a number M of audio channels, which are associated to individual loudspeakers of a multi-channel loudspeaker set-up used for playback. There are two types of input data to the multi-channel parameter transformer (SAOC 2 MPS transcoder). The first input is an SAOC bitstream 122 having object parameters associated to individual audio objects, which indicate spatial properties (e.g. energy information) of the audio objects associated to the transmitted multi-object audio scene. The second input is the rendering parameters (weighting parameters) 124 used for mapping the N objects to the M audio-channels.
As previously discussed, the SAOC bitstream 122 contains parametric information about the audio objects that have been mixed together to create the down-mix signal 102 input into the MPEG Surround decoder 100. The object parameters of the SAOC bitstream 122 are provided for at least one audio object associated to the down-mix channel 102, which was in turn generated using at least an object audio signal associated to the audio object. A suitable parameter is, for example, an energy parameter, indicating an energy of the object audio signal, i.e. the strength of the contribution of the object audio signal to the down-mix 102. In case a stereo downmix is used, a direction parameter might be provided, indicating the location of the audio object within the stereo downmix. However, other object parameters are obviously also suited and could therefore be used for the implementation.
The transmitted downmix does not have to be a monophonic signal. It could, for example, also be a stereo signal. In that case, two energy parameters might be transmitted per object as object parameters, each indicating that object's contribution to one of the two channels of the stereo signal. That is, for example, if 20 audio objects are used for the generation of the stereo downmix signal, 40 energy parameters would be transmitted as the object parameters.
The SAOC bit stream 122 is fed into an SAOC parsing block, i.e. into object parameter provider 110, which regains the parametric information, the latter comprising, besides the actual number of audio objects dealt with, mainly object level envelope (OLE) parameters which describe the time-variant spectral envelopes of each of the audio objects present.
The SAOC parameters will typically be strongly time dependent, as they convey how the multi-channel audio scene changes with time, for example when certain objects enter or others leave the scene. In contrast, the weighting parameters of rendering matrix 124 often do not have a strong time or frequency dependency. Of course, if objects enter or leave the scene, the number of necessitated parameters changes abruptly, to match the number of the audio objects of the scene. Furthermore, in applications with interactive user control, the matrix elements may be time variant, as they then depend on the actual input of a user.
In a further embodiment of the present invention, parameters steering a variation of the weighting parameters or the object rendering parameters or time-varying object rendering parameters (weighting parameters) themselves may be conveyed in the SAOC bitstream, to cause a variation of rendering matrix 124. The weighting factors or the rendering matrix elements may be frequency dependent, if frequency dependent rendering properties are desired (as for example when a frequency-selective gain of a certain object is desired).
In the embodiment of
The parameter generator (scene rendering engine) 108 receives both the weighting factors and the object parameters (for example the energy parameter OLE) to calculate a mapping of the N audio objects to M output channels, wherein M may be larger than, less than or equal to N and may even vary with time. When using a standard MPEG Surround decoder 100, the resulting spatial cues (for example, coherence and level parameters) may be transmitted to the MPEG Surround decoder 100 by means of a standards-compliant surround bitstream matching the down-mix signal transmitted together with the SAOC bitstream.
Using a multi-channel parameter transformer 106, as previously described, allows using a standard MPEG Surround decoder to process the down-mix signal and the transformed parameters provided by the parameter transformer 106 to play back the reconstruction of the audio scene via the given loudspeakers. This is achieved with the high flexibility of the audio object coding-approach, i.e. by allowing serious user interaction on the playback side.
As an alternative to the playback of a multi-channel loudspeaker set-up, a binaural decoding mode of the MPEG Surround decoder may be utilized to play back the signal via headphones.
However, if minor modifications to the MPEG Surround decoder 100 are acceptable, e.g. within a software-implementation, the transmission of the spatial cues to the MPEG Surround decoder could also be performed directly in the parameter domain. That is, the computational effort of multiplexing the parameters into an MPEG Surround compatible bitstream can be omitted. Apart from the decrease in computational complexity, a further advantage is that the quality degradation introduced by the MPEG-conforming parameter quantization is avoided, since such quantization of the generated spatial cues would in this case no longer be needed. As already mentioned, this benefit calls for a more flexible MPEG Surround decoder implementation, offering the possibility of a direct parameter feed rather than a pure bitstream feed.
In another embodiment of the present invention, an MPEG Surround compatible bitstream is created by multiplexing the generated spatial cues and the down-mix signal, thus offering the possibility of a playback via legacy equipment. Multi-channel parameter transformer 106 could thus also serve the purpose of transforming audio object coded data into multi-channel coded data at the encoder side. Further embodiments of the present invention, based on the multi-channel parameter transformer of
The following examples will furthermore be based on 5.1-channel representations of multi-channel audio signals as specified in the MPEG Surround standard, which defines two possible parameterisations, that can be visualized by the tree-structures shown in
In case of the transmission of a mono-down-mix 160, the MPEG Surround decoder employs a tree-structure parameterization. The tree is populated by so-called OTT elements (boxes) 162a to 162e for the first parameterization and 164a to 164e for the second parameterization.
Each OTT element up-mixes a mono-input into two output audio signals. To perform the up-mix, each OTT element uses an ICC parameter describing the desired cross-correlation between the output signals and a CLD parameter describing the relative level differences between the two output signals of each OTT element.
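As a rough illustration of how a CLD value can steer the level split inside a single OTT element, the following sketch derives two output gains from one CLD value. It rests on assumptions not stated in the text above: the CLD is taken to be 10·log10 of the power ratio between the first and the second output, the split is taken to be energy preserving, and the ICC-controlled decorrelation is left out entirely. The function name `ott_gains_from_cld` is hypothetical.

```python
import math

def ott_gains_from_cld(cld_db):
    """Return (g1, g2) so that g1^2 / g2^2 matches cld_db and g1^2 + g2^2 = 1.

    Assumption: cld_db = 10 * log10(power of output 1 / power of output 2).
    The ICC parameter (decorrelation between the outputs) is not modelled here.
    """
    ratio = 10.0 ** (cld_db / 10.0)          # linear power ratio of the two outputs
    g1 = math.sqrt(ratio / (1.0 + ratio))    # gain applied for the first output
    g2 = math.sqrt(1.0 / (1.0 + ratio))      # gain applied for the second output
    return g1, g2

# Example: the first output is 6 dB stronger than the second one.
g1, g2 = ott_gains_from_cld(6.0)
```

A CLD of 0 dB yields two equal gains, and large positive or negative values push nearly all of the input power into one of the two outputs.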
Even though structurally similar, the two parameterizations of
However, a multi-channel parameter transformer can be implemented based on either of the two implementations. Once the inventive concept is understood, it may also be applied to other multi channel configurations than the ones described below. For the sake of conciseness, the following embodiments of the present invention focus on the left parameterization of
In the embodiments briefly discussed in the following paragraphs, an SAOC bitstream comprises (relative) levels of each audio object in the down-mixed signal (for each time-frequency tile separately, as is common practice within a frequency-domain framework using, for example, a filterbank or a time-to-frequency transformation).
Furthermore, the present invention is not limited to a specific level representation of the objects, the description below merely illustrates one method to calculate the spatial cues for the MPEG Surround bitstream based on an object power measure that can be derived from the SAOC object parameterization.
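One conceivable object power measure of this kind can be sketched as follows. The sketch assumes that each object is available as a list of (possibly complex) subband samples for one time-frequency tile, takes the mean squared magnitude as the power of that tile, and normalizes to the strongest object to obtain relative levels. Both the normalization and the function names `tile_power` and `relative_levels` are illustrative assumptions and are not taken from the SAOC parameterization itself.

```python
def tile_power(samples):
    """Mean squared magnitude of the subband samples of one time-frequency tile."""
    return sum(abs(x) ** 2 for x in samples) / len(samples)

def relative_levels(object_tiles):
    """Per-object powers of one tile, normalized to the strongest object.

    Assumption: normalizing to the maximum is only one of several possible
    conventions for expressing relative object levels.
    """
    powers = [tile_power(tile) for tile in object_tiles]
    reference = max(powers)
    if reference == 0.0:
        return powers                      # silent tile: nothing to normalize
    return [p / reference for p in powers]

# Two objects, one tile each (complex subband samples are allowed).
levels = relative_levels([[1 + 0j, 1j], [0.5, 0.5]])
```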
As is apparent from
y=Wo.
The parameter generator (the rendering engine 108) utilizes the rendering matrix W to estimate all CLD and ICC parameters based on SAOC data $\sigma_i^2$. With respect to the visualizations of
As it can be observed, the first output signal 166a of OTT element 162a is processed further by OTT elements 162b, 162c and 162d, finally resulting in output channels LF, RF, C and LFE. The second output channel 166b is processed further by OTT element 162e, resulting in output channels LS and RS. Substituting the OTT elements of
The number N of the columns of matrix W is not fixed, as N is the number of audio objects, which might be varying.
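The mapping y = Wo can be sketched for a 5.1 set-up with a variable number of objects as follows. The row order (lf, rf, c, lfe, ls, rs), the example weights and the helper name `render` are assumptions made for illustration only.

```python
CHANNELS = ["lf", "rf", "c", "lfe", "ls", "rs"]   # assumed row order of W

def render(W, o):
    """Return y = W o for an M x N matrix W (list of rows) and N object samples o."""
    if any(len(row) != len(o) for row in W):
        raise ValueError("each row of W needs exactly one weight per object")
    return [sum(w * x for w, x in zip(row, o)) for row in W]

# Hypothetical rendering matrix for N = 3 objects (one column per object).
W = [
    [1.0, 0.0, 0.5],   # lf
    [0.0, 0.0, 0.5],   # rf
    [0.0, 0.0, 0.0],   # c
    [0.0, 0.0, 0.0],   # lfe
    [0.0, 0.7, 0.0],   # ls
    [0.0, 0.7, 0.0],   # rs
]
o = [0.2, -0.1, 0.4]                     # one sample per object
y = dict(zip(CHANNELS, render(W, o)))    # one output sample per loudspeaker channel
```

Adding or removing an object only adds or removes a column of W; the number of rows stays tied to the loudspeaker set-up.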
One possibility to derive the spatial cues (CLD and ICC) for the OTT element 162a is that the respective contribution of each object to the two outputs of OTT element 0 is obtained by summation of the corresponding elements in W. This summation gives a sub-rendering matrix W0 of OTT element 0:
The problem is now simplified to estimating the level difference and correlation for sub-rendering matrix W0 (and for similarly defined sub-rendering matrices W1, W2, W3 and W4 related to the OTT elements 1, 2, 3 and 4, respectively).
Assuming fully incoherent (i.e. mutually independent) object signals, the estimated power of the first output of OTT element 0, p0,1², is given by:
Similarly, the estimated power of the second output of OTT element 0, p0,2², is given by:
The cross-power R0 is given by:
The CLD parameter for OTT element 0 is then given by:
and the ICC parameter is given by:
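The equations referred to in the preceding paragraphs are not reproduced in this text. As a minimal sketch, assuming the usual definitions (for incoherent objects, each output power is the sum of squared weights times object powers; the CLD is the power ratio in dB; the ICC is the cross-power normalized by both output levels), the computation for OTT element 0 might look as follows. The row order lf, rf, c, lfe, ls, rs, the function names and all numeric values are illustrative assumptions and are not taken from the specification.

```python
import math

# Assumed row order of the rendering matrix W (hypothetical, for illustration).
CHANNELS = ["lf", "rf", "c", "lfe", "ls", "rs"]


def sub_rendering_matrix(W, group1, group2):
    """Sum the rows of W belonging to each output group of an OTT element."""
    n_objects = len(W[0])
    rows = []
    for group in (group1, group2):
        idx = [CHANNELS.index(ch) for ch in group]
        rows.append([sum(W[s][i] for s in idx) for i in range(n_objects)])
    return rows


def cld_icc(W_sub, sigma2):
    """Estimate CLD (dB) and ICC for one OTT element, assuming incoherent objects."""
    w1, w2 = W_sub
    p1_sq = sum(a * a * s for a, s in zip(w1, sigma2))     # power of first output
    p2_sq = sum(b * b * s for b, s in zip(w2, sigma2))     # power of second output
    r = sum(a * b * s for a, b, s in zip(w1, w2, sigma2))  # cross-power
    eps = 1e-12  # guards against division by zero and log of zero
    cld = 10.0 * math.log10((p1_sq + eps) / (p2_sq + eps))
    icc = r / (math.sqrt(p1_sq * p2_sq) + eps)
    return cld, icc


# Two objects: one rendered to lf only, one spread over front and surround.
W = [
    [1.0, 0.5],  # lf
    [0.0, 0.5],  # rf
    [0.0, 0.0],  # c
    [0.0, 0.0],  # lfe
    [0.0, 0.5],  # ls
    [0.0, 0.5],  # rs
]
sigma2 = [1.0, 1.0]  # object powers for one time/frequency tile

W0 = sub_rendering_matrix(W, ["lf", "rf", "c", "lfe"], ["ls", "rs"])
cld0, icc0 = cld_icc(W0, sigma2)
```

In this made-up example the front group receives twice the power of the surround group, so the CLD comes out at roughly 3 dB, and the shared second object produces a partial correlation between the two outputs.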
When
Below, the grouping or identification of channels for the left configuration of
For box 162a, the first virtual signal is the signal representing a combination of the loudspeaker signals lf, rf, c, lfe. The second virtual signal is the virtual signal representing a combination of ls and rs.
For box 162b, the first audio signal is a virtual signal and represents a group including a left front channel and a right front channel, and the second audio signal is a virtual signal and represents a group including a center channel and an lfe channel.
For box 162e, the first audio signal is a loudspeaker signal for the left surround channel and the second audio signal is a loudspeaker signal for the right surround channel.
For box 162c, the first audio signal is a loudspeaker signal for the left front channel and the second audio signal is a loudspeaker signal for the right front channel.
For box 162d, the first audio signal is a loudspeaker signal for the center channel and the second audio signal is a loudspeaker signal for the low frequency enhancement channel.
In these boxes, the weighting parameters for the first audio signal or the second audio signal are derived by combining object rendering parameters associated to the channels represented by the first audio signal or the second audio signal as will be outlined later on.
Below, the grouping or identification of channels for the right configuration of
For box 164a, the first audio signal is a virtual signal and represents a group including a left front channel, a left surround channel, a right front channel, and a right surround channel, and the second audio signal is a virtual signal and represents a group including a center channel and a low frequency enhancement channel.
For box 164b, the first audio signal is a virtual signal and represents a group including a left front channel and a left surround channel, and the second audio signal is a virtual signal and represents a group including a right front channel and a right surround channel.
For box 164e, the first audio signal is a loudspeaker signal for the center channel and the second audio signal is a loudspeaker signal for the low frequency enhancement channel.
For box 164c, the first audio signal is a loudspeaker signal for the left front channel and the second audio signal is a loudspeaker signal for the left surround channel.
For box 164d, the first audio signal is a loudspeaker signal for the right front channel and the second audio signal is a loudspeaker signal for the right surround channel.
In these boxes, the weighting parameters for the first audio signal or the second audio signal are derived by combining object rendering parameters associated to the channels represented by the first audio signal or the second audio signal as will be outlined later on.
The above-mentioned virtual signals are called virtual because they do not necessarily occur in an embodiment. These virtual signals are used to illustrate the generation of power values or the distribution of energy which is determined by CLD for all boxes, e.g. by using different sub-rendering matrices Wi. Again, the left side of
Above, the sub-rendering matrix W0 for box 162a has been shown.
For box 162b, the sub-rendering matrix is defined as:
For box 162e, the sub-rendering matrix is defined as:
For box 162c, the sub-rendering matrix is defined as:
For box 162d, the sub-rendering matrix is defined as:
For the right configuration in
For box 164a, the sub-rendering matrix is defined as:
For box 164b, the sub-rendering matrix is defined as:
For box 164e, the sub-rendering matrix is defined as:
For box 164c, the sub-rendering matrix is defined as:
For box 164d, the sub-rendering matrix is defined as:
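The defining equations for the individual boxes are not reproduced in this text. Assuming that each sub-rendering matrix is formed in the same way as W0, i.e. by summing the rows of W over the channels of each of the two groups listed above, the matrices for both tree configurations can be sketched as follows. The row order of W, the dictionary layout and the example weights are hypothetical.

```python
# Assumed row order of the rendering matrix W (hypothetical, for illustration).
CHANNELS = ["lf", "rf", "c", "lfe", "ls", "rs"]

# Channel groups (first output, second output) per OTT box, as listed in the text.
LEFT_TREE = {
    "162a": (["lf", "rf", "c", "lfe"], ["ls", "rs"]),
    "162b": (["lf", "rf"], ["c", "lfe"]),
    "162c": (["lf"], ["rf"]),
    "162d": (["c"], ["lfe"]),
    "162e": (["ls"], ["rs"]),
}
RIGHT_TREE = {
    "164a": (["lf", "ls", "rf", "rs"], ["c", "lfe"]),
    "164b": (["lf", "ls"], ["rf", "rs"]),
    "164c": (["lf"], ["ls"]),
    "164d": (["rf"], ["rs"]),
    "164e": (["c"], ["lfe"]),
}


def sub_rendering_matrices(W, tree):
    """Return {box: 2 x N matrix} by summing the rows of W in each group."""
    n_objects = len(W[0])
    result = {}
    for box, groups in tree.items():
        result[box] = [
            [sum(W[CHANNELS.index(ch)][i] for ch in group) for i in range(n_objects)]
            for group in groups
        ]
    return result


# One object panned between lf and ls (weights are made-up example values).
W = [[0.5], [0.0], [0.0], [0.0], [0.25], [0.0]]
left = sub_rendering_matrices(W, LEFT_TREE)
right = sub_rendering_matrices(W, RIGHT_TREE)
```

The same rendering matrix thus yields different sub-rendering matrices depending on the tree: here the object is split across the two outputs of box 162a in the left configuration, but falls entirely into the first output of box 164a in the right configuration.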
Depending on the implementation, the respective CLD and ICC parameter may be quantized and formatted to fit into an MPEG Surround bitstream, which could be fed into MPEG Surround decoder 100. Alternatively, the parameter values could be passed to the MPEG Surround decoder on a parameter level, i.e. without quantization and formatting into a bitstream. To not only achieve repanning of the objects, i.e. distributing these signal energies appropriately, which can be achieved using the above approach utilizing the MPEG-2 structure of
and it is assumed that the power of the input down-mix signal is equal to the sum of the object powers (i=object index, k=channel index).
As previously discussed, the computation of the CLD and ICC-parameters utilizes weighting parameters indicating a portion of the energy of the object audio signal associated to loudspeakers of the multi-channel loudspeaker configuration. These weighting factors will generally be dependent on scene data and playback configuration data, i.e. on the relative location of audio objects and loudspeakers of the multi-channel loudspeaker set-up. The following paragraphs will provide one possibility to derive the weighting parameters, based on the object audio parameterization introduced in
As already outlined above, there are independent rendering matrices for each time/frequency tile; however, in the following, only a single time/frequency tile is considered for the sake of clarity. The rendering matrix W has M rows (one for each output channel) and N columns (one for each audio object), where the matrix element in row s and column i represents the mixing weight with which the particular audio object contributes to the respective output channel:
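Written out with the indexing just described, the matrix has the following generic form. This is a sketch with placeholder entries only; the symbols y_s for the output channel signals and o_i for the object signals follow the relation y=Wo stated earlier.

```latex
W =
\begin{pmatrix}
w_{1,1} & \cdots & w_{1,N} \\
\vdots  & \ddots & \vdots  \\
w_{M,1} & \cdots & w_{M,N}
\end{pmatrix},
\qquad
y_s = \sum_{i=1}^{N} w_{s,i}\, o_i , \quad s = 1, \dots, M .
```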
The matrix elements are calculated from the following scene description and loudspeaker configuration parameters: Scene description (these parameters can vary over time):
Loudspeaker configuration (usually these parameters are time-invariant):
The elements of the mixing matrix are derived from these parameters by pursuing the following scheme for each audio object i:
With respect to the above equations, it may be noted that in the two-dimensional case, an object audio signal associated to an audio object of the spatial audio scene is distributed between the two loudspeakers of the multi-channel loudspeaker configuration that are closest to the audio object. However, the object parameters chosen for the above implementation are not the only object parameters that can be used to implement further embodiments of the present invention. For example, in a three-dimensional case, object parameters indicating the location of the loudspeakers or the audio objects may be three-dimensional vectors. Generally, two parameters are required in the two-dimensional case and three parameters in the three-dimensional case if the location is to be defined unambiguously. However, even in the two-dimensional case, different parameterizations may be used, for example transmitting two coordinates within a rectangular coordinate system.
It may furthermore be noted that the optional panning rule parameter p, which lies within a range of 1 to 2, is an arbitrary panning rule parameter that is set to reflect room acoustic properties of a reproduction system/room and that is, according to some embodiments of the present invention, additionally applicable.
Finally, after the panning weights v1,i and v2,i have been derived according to the above equations, the weighting parameters ws,i (the matrix elements) are given by the following equations:
The previously introduced gain factor gi, which is optionally associated to each audio object, may be used to emphasize or suppress individual objects. This may, for example, be performed on the receiving side, i.e. in the decoder, to improve the intelligibility of individually chosen audio objects.
The following example of audio object 152 of
After some mathematics, this leads to the solution:
v1,i≈0.8374; v2,i≈0.5466.
Therefore, according to the above instructions, the weighting parameters (matrix elements) associated to the specific audio object located in direction αi are derived to be:
w1=w2=w3=0; w4=0.8374; w5=0.5466.
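The panning equations themselves are not reproduced in this text. As a hedged sketch, the following assumes the tangent panning law between the two loudspeakers enclosing the object, together with the normalization (v1^p + v2^p)^(1/p) = 1 for the panning rule parameter p. The function name, the argument names and the geometry used in the example (loudspeakers at 30° and 110°, object at 60°, p = 2) are assumptions of this sketch; with these values it reproduces the weights quoted above.

```python
import math


def panning_weights(alpha_deg, theta1_deg, theta2_deg, p=2.0, gain=1.0):
    """Pan an object at alpha between two speakers at theta1 <= alpha <= theta2.

    Uses the tangent panning law (an assumption of this sketch) and the
    normalization (v1**p + v2**p)**(1/p) == 1, with 1 <= p <= 2.
    Returns the gain-scaled weights (w1, w2) for the two speakers.
    """
    if not theta1_deg < theta2_deg:
        raise ValueError("speaker angles must satisfy theta1 < theta2")
    if not theta1_deg <= alpha_deg <= theta2_deg:
        raise ValueError("object must lie between the two loudspeakers")
    centre = math.radians((theta1_deg + theta2_deg) / 2.0)
    half_span = math.radians((theta2_deg - theta1_deg) / 2.0)
    # Tangent law: (v1 - v2) / (v1 + v2) = tan(centre - alpha) / tan(half_span)
    r = math.tan(centre - math.radians(alpha_deg)) / math.tan(half_span)
    v1, v2 = 1.0 + r, 1.0 - r  # unnormalized weights with the required ratio
    norm = (v1 ** p + v2 ** p) ** (1.0 / p)
    return gain * v1 / norm, gain * v2 / norm


# Hypothetical geometry: speakers at 30 and 110 degrees, object at 60 degrees.
w4, w5 = panning_weights(60.0, 30.0, 110.0, p=2.0)
```

All other loudspeakers receive a weight of zero for this object. The optional object gain gi enters as a plain scale factor on both weights.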
The above paragraphs detail embodiments of the present invention utilizing only audio objects that can be represented by a monophonic signal, i.e. point-like sources. However, the flexible concept is not restricted to the application with monophonic audio sources. On the contrary, one or more objects that are to be regarded as spatially “diffuse” also fit well into the inventive concept. Multi-channel parameters have to be derived in an appropriate manner when non-point-like sources or audio objects are to be represented. An appropriate measure to quantify the amount of diffuseness of one or more audio objects is an object-related cross-correlation parameter ICC.
In the SAOC system discussed so far, all audio objects were assumed to be point sources, i.e. pair-wise uncorrelated mono sound sources without any spatial extent. However, there are also application scenarios in which it is desirable to allow audio objects that comprise more than one audio channel and exhibit a certain degree of pair-wise (de)correlation. The simplest and probably most important of these cases is the stereo object, i.e. an object consisting of two more or less correlated channels that belong together. As an example, such an object could represent the spatial image produced by a symphony orchestra.
In order to smoothly integrate stereo objects into a mono audio object based system as described above, both channels of a stereo object are treated as individual objects. The interrelationship of both part objects is reflected by an additional cross-correlation parameter, which is calculated on the same time/frequency grid as is applied for the derivation of the sub-band power values σi². In other words, a stereo object is defined by a set of parameter triplets {σi², σj², ICCi,j} per time/frequency tile, where ICCi,j denotes the pair-wise correlation between the two realizations of one object. These two realizations are denoted by the individual objects i and j.
For the correct rendering of stereo objects an SAOC decoder provides means for establishing the correct correlation between those playback channels that participate in the rendering of the stereo object, such that the contribution of that stereo object to the respective channels exhibits a correlation as claimed by the corresponding ICCi,j parameter. An SAOC to MPEG Surround transcoder which is capable of handling stereo objects, in turn, derives ICC parameters for the OTT boxes that are involved in reproducing the related playback signals, such that the amount of decorrelation between the output channels of the MPEG Surround decoder fulfills this condition.
In order to do so, compared to the example given in the previous section of this document, the calculation of the powers p0,1 and p0,2 and of the cross-power R0 has to be changed. Assuming the indices of the two audio objects that together build a stereo object to be i1 and i2, the formulas change in the following manner:
It can be observed easily that in case of ICCi
Having the capability of using stereo objects has the obvious advantage, that the reproduction quality of the spatial audio scene can be significantly enhanced, when audio sources other than point sources can be treated appropriately. Furthermore, the generation of a spatial audio scene may be performed more efficiently, when one has the capability of using premixed stereo signals, which are widely available for a great number of audio objects.
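The modified formulas for stereo objects are not reproduced in this text. One way to express the idea, assuming real-valued correlations and that each pair of objects contributes a cross term of the form weight × weight × σi × σj × ICCi,j (the standard expression for the covariance of a linear mix), is sketched below. The function name and the input format for the correlated pairs are hypothetical.

```python
import math


def ott_powers(w1, w2, sigma2, icc_pairs=None):
    """Powers and cross-power of the two OTT outputs with correlated object pairs.

    w1, w2    : weights of each object for the first and second OTT output
    sigma2    : object powers for one time/frequency tile
    icc_pairs : {(i, j): ICC_ij} for correlated pairs; all other pairs are
                treated as uncorrelated (hypothetical input format)
    """
    icc_pairs = icc_pairs or {}
    n = len(sigma2)
    sigma = [math.sqrt(s) for s in sigma2]

    def corr(i, j):
        if i == j:
            return 1.0
        return icc_pairs.get((i, j), icc_pairs.get((j, i), 0.0))

    def bilinear(a, b):
        # sum_i sum_j a_i * b_j * sigma_i * sigma_j * ICC_ij
        return sum(a[i] * b[j] * sigma[i] * sigma[j] * corr(i, j)
                   for i in range(n) for j in range(n))

    return bilinear(w1, w1), bilinear(w2, w2), bilinear(w1, w2)


# Stereo object made of part objects 0 and 1: left part to output 1, right to output 2.
w1, w2, sigma2 = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
p1_sq, p2_sq, r0 = ott_powers(w1, w2, sigma2, {(0, 1): 0.5})
icc0 = r0 / math.sqrt(p1_sq * p2_sq)
```

In this example the correlation between the two OTT outputs equals the correlation of the stereo object itself. When no correlated pairs are given, the cross terms vanish and the expressions reduce to those for fully incoherent objects.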
The following considerations will furthermore show that the inventive concept allows for the integration of point-like sources that have an “inherent” diffuseness. Instead of objects representing point sources, as in the previous examples, one or more objects may also be regarded as spatially “diffuse”. The amount of diffuseness can be characterized by an object-related cross-correlation parameter ICCi,j. For ICCi,j=1, the object i represents a point source, while for ICCi,j=0, the object is maximally diffuse. The object-dependent diffuseness can be integrated in the equations given above by filling in the correct ICCi,j values.
When stereo objects are utilized, the derivation of the weighting factors of the rendering matrix W has to be adapted. However, the adaptation can be performed without inventive skill, since, for the handling of stereo objects, two azimuth positions (representing the azimuth values of the left and the right “edge” of the stereo object) are converted into rendering matrix elements.
As already mentioned, regardless of the type of audio objects used, the rendering matrix elements are generally defined individually for different time/frequency tiles and in general differ from each other. A variation over time may, for example, reflect a user interaction, through which the panning angles and gain values for every individual object may be arbitrarily altered over time. A variation over frequency allows for different features influencing the spatial perception of the audio scene, such as equalization.
Implementing the inventive concept using a multi-channel parameter transformer allows for a number of completely new, previously unfeasible applications. Since, in a general sense, the functionality of SAOC can be characterized as efficient coding and interactive rendering of audio objects, numerous applications requiring interactive audio can benefit from the inventive concept, i.e. the implementation of an inventive multi-channel parameter transformer or an inventive method for a multi-channel parameter transformation.
As an example, completely new interactive teleconferencing scenarios become feasible. Current telecommunication infrastructures (telephone, teleconferencing etc.) are monophonic. That is, classical object audio coding cannot be applied, since this necessitates the transmission of one elementary stream per audio object to be transmitted. However, these conventional transmission channels can be extended in their functionality by introducing SAOC with a single down-mix channel. Telecommunication terminals equipped with an SAOC extension, that is mainly with a multi-channel parameter transformer or an inventive object parameter transcoder, are able to pick up several sound sources (objects) and mix them into a single monophonic down-mix signal which is transmitted in a compatible way by using the existing coders (for example speech coders). The side information (spatial audio object parameters or object parameters) may be conveyed in a hidden, backwards compatible way. While such advanced terminals produce an output object stream containing several audio objects, the legacy terminals will reproduce the downmix signal. Conversely, the output produced by legacy terminals (i.e. a downmix signal only) will be considered by SAOC transcoders as a single audio object.
The principle is illustrated in
This allows, for example, a user to emphasize one specific talker of interest by applying object-related gain values gi, thus making the remaining talkers nearly inaudible. This would not be possible when using conventional multi-channel audio techniques, since these would try to reproduce the original spatial audio scene as naturally as possible, without the possibility of allowing a user interaction to emphasize selected audio objects.
b illustrates a more complex scenario, in which teleconferencing is performed among three teleconferencing sites 200, 202 and 208. Since each site is only capable of receiving and sending one audio signal, the infrastructure uses so-called multi-point control units MCU 210. Each site 200, 202 and 208 is connected to the MCU 210. From each site to the MCU 210, a single upstream contains the signal from the site. The downstream for each site is a mix of the signals of all other sites, possibly excluding the site's own signal (the so-called “N-1 signal”).
According to the previously discussed concept and the inventive parameter transcoders, the SAOC bitstream format supports the ability to combine two or more object streams, i.e. two streams having a down-mix channel and associated audio object parameters into a single stream in a computationally efficient way, i.e. in a way not requiring a preceding full reconstruction of the spatial audio scene of the sending site. Such a combination is supported without decoding/re-encoding of the objects according to the present invention. Such a spatial audio object coding scenario is particularly attractive when using low delay MPEG communication coders, such as, for example low delay AAC.
Another field of interest for the inventive concept is interactive audio for gaming and the like. Due to its low computational complexity and independency from a particular rendering set-up, SAOC is ideally suited to represent sound for interactive audio, such as gaming applications. The audio could furthermore be rendered depending on the capabilities of the output terminal. As an example, a user/player could directly influence the rendering/mixing of the current audio scene. Moving around in a virtual scene is reflected by an adaptation of the rendering parameters. Using a flexible set of SAOC sequences/bitstreams would enable the reproduction of a non-linear game story controlled by user interaction.
According to a further embodiment of the present invention, inventive SAOC coding is applied within a multi-player game, in which a user interacts with other players in the same virtual world/scene. For each user, the video and audio scene is based on his position and orientation in the virtual world and rendered accordingly on his local terminal. General game parameters and specific user data (position, individual audio, chat etc.) are exchanged between the different players using a common game server. With legacy techniques, every individual audio source in a game scene that is not available by default on each client gaming device (particularly user chat, special audio effects) has to be encoded and sent to each player of the game scene as an individual audio stream. Using SAOC, the relevant audio stream for each player can easily be composed/combined on the game server, transmitted as a single audio stream to the player (containing all relevant objects) and rendered at the correct spatial position for each audio object (=other game players' audio).
According to a further embodiment of the present invention, SAOC is used to play back object soundtracks with a control similar to that of a multi-channel mixing desk, using the possibility to adjust relative level, spatial position and audibility of instruments according to the listener's liking. Thus, a user can:
As the above examples have shown, the application of the inventive concept opens the field for a wide variety of new, previously unfeasible applications. These applications become possible, when using an inventive multi-channel parameter transformer of
In a transformation step 312, the coherence parameter and the level parameter are derived combining the direction parameter and the energy parameter with additional loudspeaker parameters indicating a location of loudspeakers of the multi-channel loudspeaker configuration intended to be used for playback.
Further embodiments comprise an object parameter transcoder for generating a coherence parameter indicating a correlation between two audio signals of a representation of a multi-channel audio signal associated to a multi-channel loudspeaker configuration and for generating a level parameter indicating an energy relation between the two audio signals based on a spatial audio object coded bit stream. This device includes a bit stream decomposer for extracting a down-mix channel and associated object parameters from the spatial audio object coded bit stream and a multi-channel parameter transformer as described before.
Alternatively or additionally, the object parameter transcoder comprises a multi-channel bit stream generator for combining the down-mix channel, the coherence parameter and the level parameter to derive the multi-channel representation of the multi-channel signal or an output interface for directly outputting the level parameter and the coherence parameter without any quantization and/or entropy encoding.
Another object parameter transcoder has an output interface that is further operative to output the down-mix channel in association with the coherence parameter and the level parameter, or has a storage interface connected to the output interface for storing the level parameter and the coherence parameter on a storage medium.
Furthermore, the object parameter transcoder has a multi-channel parameter transformer as described before, which is operative to derive multiple coherence parameter and level parameter pairs for different pairs of audio signals representing different loudspeakers of the multi-channel loudspeaker configuration.
Depending on certain implementation requirements of the inventive methods, the inventive methods can be implemented in hardware or in software. The implementation can be performed using a digital storage medium, in particular a disk, DVD or a CD having electronically readable control signals stored thereon, which cooperate with a programmable computer system such that the inventive methods are performed. Generally, the present invention is, therefore, a computer program product with a program code stored on a machine readable carrier, the program code being operative for performing the inventive methods when the computer program product runs on a computer. In other words, the inventive methods are, therefore, a computer program having a program code for performing at least one of the inventive methods when the computer program runs on a computer.
While the foregoing has been particularly shown and described with reference to particular embodiments thereof, it will be understood by those skilled in the art that various other changes in the form and details may be made without departing from the spirit and scope thereof. It is to be understood that various changes may be made in adapting to different embodiments without departing from the broader concepts disclosed herein and comprehended by the claims that follow.
While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations and equivalents as fall within the true spirit and scope of the present invention.
This application is a U.S. national entry of PCT Patent Application Serial No. PCT/EP2007/008682 filed 5 Oct. 2007, and claims priority to U.S. Patent Application No. 60/829,653 filed 16 Oct. 2006, each of which is incorporated herein by reference.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/EP2007/008682 | 10/5/2007 | WO | 00 | 10/5/2010 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2008/046530 | 4/24/2008 | WO | A |
Number | Date | Country | |
---|---|---|---|
20110013790 A1 | Jan 2011 | US |
Number | Date | Country | |
---|---|---|---|
60829653 | Oct 2006 | US |