The present disclosure relates to the field of audio coding, and in particular to an audio decoder having at least two decoding modes, and associated decoding methods and decoding software for such audio decoder. The present disclosure further relates to a corresponding audio encoder, and associated encoding methods and encoding software for such audio encoder.
An audio scene may generally comprise audio objects. An audio object is an audio signal which has an associated spatial position. If the spatial position of an audio object can vary with time, the audio object is typically called a dynamic audio object. If the position is static, the audio object is typically called a static audio object, or a bed object. A bed object is typically an audio signal which corresponds directly to a channel of a multichannel speaker configuration, such as a classical stereo configuration with a left and a right speaker, or a so-called 5.1 speaker configuration with three front speakers, two surround speakers, and a low frequency effects speaker, etc. A bed can contain one or more bed objects; it is a set of bed objects which can thus match a multichannel speaker configuration.
Since the number of audio objects typically may be very large, for instance in the order of tens or hundreds of audio objects, there is a need for encoding methods which allow the audio objects to be efficiently compressed at an encoder side, e.g. for transmission as a bitstream (data stream, etc.), especially when targeting low bit rates for the transmission. Clusters of dynamic audio objects, i.e. dynamic audio objects downmixed at the encoder side, may then, in certain decoding modes in an audio decoder, be parametrically reconstructed into individual audio objects again to be rendered into a set of output audio signals depending on the configuration of the output device (e.g. speakers, headphones, etc.) employed for playback of the audio signal. However, in some cases, the decoder is forced to work in a core mode, meaning that parametric reconstruction of individual dynamic audio objects from clusters of dynamic audio objects is not possible, e.g. due to restrictions of processing power of the decoder, or for other reasons. This may cause a problem, especially when an immersive audio experience (e.g. 3D audio) is expected by a user who is listening to the output audio.
There is thus a need for improvements in this context.
In view of the above, it is thus an object of the present invention to overcome or mitigate at least some of the problems discussed above. In particular, it is an object of the present disclosure to provide a, preferably immersive, audio output from received dynamic audio objects in a decoder in a core decoding mode. Moreover, it is an object of the present disclosure to provide an encoder for encoding an audio bitstream from a set of dynamic audio objects in a way that may allow for the decoding of the audio bitstream into a, preferably immersive, audio output according to the above. Further and/or alternative objects of the present invention will be clear for a reader of this disclosure.
According to a first aspect of the invention, there is provided an audio decoder comprising one or more buffers for storing a received audio bitstream, and a controller coupled to the one or more buffers.
The controller is configured to operate in a decoding mode selected from a plurality of different decoding modes, the plurality of different decoding modes comprising a first decoding mode and a second decoding mode, wherein of the first and second decoding modes only the first decoding mode allows full decoding of one or more encoded dynamic audio objects in the bitstream, into reconstructed individual audio objects.
When the selected decoding mode is the second decoding mode, the controller is configured to access the received audio bitstream, to determine whether the received audio bitstream includes one or more dynamic audio objects, and responsive at least to determining that the received audio bitstream includes one or more dynamic audio objects, to map at least one of the one or more dynamic audio objects to a set of static audio objects, the set of static audio objects corresponding to a predefined speaker configuration.
By including the step of mapping at least one of the one or more dynamic audio objects to a set of static audio objects, immersive audio output can be achieved from a low bit rate bitstream, for example restricted to only include up to 10 audio objects (dynamic and static), or up to 7, 5, etc., audio objects, even in a decoder operating in a low complexity decoding mode (core decoding) where parametric reconstruction of individual dynamic audio objects from clusters of dynamic audio objects is not possible (full decoding is not possible).
By the term “immersive audio output” should, in the context of the present specification, be understood a channel output configuration which contains channels for top speakers.
By the term “immersive speaker configuration” a similar meaning should be understood, i.e., a speaker configuration which contains top speakers.
Furthermore, the present embodiment provides a flexible decoding method, since not all received dynamic audio objects are necessarily mapped to the set of static audio objects corresponding to a predefined speaker configuration. This allows, for example, for inclusion of additional dynamic audio objects in the audio bitstream which serve a different purpose, for example dialogue or associated audio.
Moreover, the present embodiment allows for a flexible process of providing and later rendering the set of static audio objects, which will be further discussed below, to achieve for example a lower computational complexity, or permitting reuse of existing software code/functions used for implementing a decoder.
Generally, the present embodiment enables decoder-side flexibility in a low bit-rate, low-complexity scenario.
The step of determining, by the controller, that the received audio bitstream includes one or more dynamic audio objects may be accomplished in different ways. According to some embodiments, this is determined from the bitstream, e.g. metadata such as integer values or flag values etc. In other embodiments, this may be determined by analysis of the audio object, or associated object metadata.
The controller may select the decoding mode in different ways. For example, the selection may be done using a bitstream parameter, and/or in view of the output configuration for the rendered output audio signals, and/or by checking the number of dynamic audio objects (downmix audio objects, clusters, etc.) in the audio bitstream, and/or based on a user parameter, etc.
It should be noted that the decision to map at least one of the one or more dynamic audio objects to a set of static audio objects may be made using more information than just determining whether the received audio bitstream includes one or more dynamic audio objects. According to some embodiments, the controller bases such decision also on further data such as bitstream parameters.
By way of example, if it is determined that the received audio bitstream does not comprise dynamic audio objects, or otherwise determined that the mapping of dynamic audio objects discussed above should not be performed, the controller may decide to render the received static audio objects (bed objects) directly to a set of output audio channels, using e.g. received rendering coefficients (e.g. downmix coefficients) applicable to the configuration of the output audio channels. In this operational mode of the controller, any received dynamic audio objects are conventionally rendered to the output audio channels.
According to some embodiments, when the selected decoding mode is the second decoding mode, the controller is further configured to render the set of static audio objects to a set of output audio channels. Any other static audio objects received in the audio bitstream (such as an LFE) are also rendered to the set of output audio channels, advantageously in the same rendering step.
According to some embodiments, the configuration of the set of output audio channels differs from the predefined speaker configuration used for mapping the dynamic audio objects to a set of static audio objects as described above. Since the predefined speaker configuration is not limited to the configuration of the output audio channels, increased flexibility is achieved.
According to some embodiments, the audio bitstream comprises a first set of downmix coefficients, wherein the controller is configured to utilize the first set of downmix coefficients for rendering the set of static audio objects to a set of output audio channels. In case of further received static audio objects in the bitstream, the downmix coefficients will be applied to both the set of static audio objects and the further static audio objects.
The controller may in some embodiments use the received first set of downmix coefficients as is for rendering the set of static audio objects to a set of output audio channels. However, in other embodiments, the first set of downmix coefficients first needs to be processed based on the type of downmix operation on the encoder side that resulted in the one or more dynamic audio objects received in the bitstream.
In some embodiments, the controller is further configured to receive information pertaining to attenuation applied in at least one of the one or more dynamic audio objects on an encoder side. The information may be received in the bitstream, or may be predefined in the decoder. The controller may then be configured to modify the first set of downmix coefficients accordingly when utilizing the first set of downmix coefficients for rendering the set of static audio objects to a set of output audio channels. Consequently, attenuation included in the downmix coefficients but already having been applied on the encoder side is not applied twice, resulting in a better listening experience.
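By way of a non-limiting illustration, the modification of the downmix coefficients may be sketched as follows. The sketch assumes linear gains, and assumes that compensation amounts to dividing each coefficient by the attenuation already applied on the encoder side; the function name, the dictionary layout and the channel labels are hypothetical and are not taken from the bitstream syntax.

```python
def compensate_attenuation(downmix_coefficients, encoder_attenuation):
    """Remove encoder-side attenuation from decoder-side downmix coefficients.

    Both arguments are dicts of linear gains keyed by a (hypothetical)
    channel label. Channels without an entry in encoder_attenuation are
    assumed not to have been attenuated at the encoder.
    """
    adjusted = {}
    for channel, coefficient in downmix_coefficients.items():
        applied = encoder_attenuation.get(channel, 1.0)
        if applied <= 0.0:
            raise ValueError("attenuation must be a positive linear gain")
        # Divide out the gain that the encoder already applied, so that
        # it is not applied a second time during rendering.
        adjusted[channel] = coefficient / applied
    return adjusted


# Assumed example values: the coefficient for "Ls" contains a gain of 0.5
# which the encoder has already applied to that signal.
coefficients = {"L": 1.0, "Ls": 0.5}
attenuation = {"Ls": 0.5}
modified = compensate_attenuation(coefficients, attenuation)
```

In this assumed example, the coefficient for "Ls" becomes unity while the coefficient for "L" is left unchanged.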
In some embodiments, the controller is further configured to receive information pertaining to a downmix operation performed on an encoder side, wherein the information defines an original channel configuration of an audio signal, wherein the downmix operation results in downmixing the audio signal to the one or more dynamic audio objects. In this case, the controller may be configured to select a subset of the first set of downmix coefficients based on the information pertaining to the downmix operation, wherein the utilizing of the first set of downmix coefficients for rendering the set of static audio objects to a set of output audio channels comprises utilizing the subset of the first set of downmix coefficients for rendering the set of static audio objects to a set of output audio channels. This may result in a more flexible decoding method which handles all types of downmix operations performed on the encoder side and resulting in the received one or more dynamic audio objects.
According to some embodiments, the controller is configured to perform the mapping of the at least one of the one or more dynamic audio objects and the rendering of the set of static audio objects in a combined calculation using a single matrix. Advantageously, this may reduce the computational complexity of the rendering of the audio objects in the received audio bitstream.
According to some embodiments, the controller is configured to perform the mapping of the at least one of the one or more dynamic audio objects and the rendering of the set of static audio objects in individual calculations using respective matrices. In this embodiment, the one or more dynamic audio objects are pre-rendered into a set of static audio objects, i.e. defining an intermediate bed representation of the one or more dynamic audio objects. Advantageously, this permits reuse of existing software code/functions used for implementing a decoder which is adapted to render a bed representation of the audio scene into a set of output audio channels. Moreover, this embodiment reduces the additional complexity of implementation of the invention described herein in a decoder.
According to some embodiments, the received audio bitstream comprises metadata identifying the at least one of the one or more dynamic audio objects. This allows for an increased flexibility of the decoder method, since not all of the received one or more dynamic audio objects need to be mapped to the set of static audio objects, and the controller can easily determine, using said metadata, which of the received one or more dynamic objects should be mapped, and which should be forwarded directly to the rendering of the set of output audio channels.
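The relationship between the single-matrix variant and the variant using respective matrices may be illustrated by the following minimal sketch. The matrix sizes, the gain values and the helper names `matmul` and `apply` are assumptions chosen for illustration only. A first gain matrix (M×N) maps N dynamic audio objects to M static audio objects, a second gain matrix (Ch×M) renders the M static audio objects to Ch output channels, and their product (Ch×N) performs both steps in one calculation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def apply(matrix, samples):
    """Apply a gain matrix to one sample per input signal."""
    return [sum(g * s for g, s in zip(row, samples)) for row in matrix]


# Hypothetical first gain matrix: N=2 dynamic objects -> M=3 static objects.
first_gain = [[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]]
# Hypothetical second gain matrix: M=3 static objects -> Ch=2 output channels.
second_gain = [[1.0, 0.0, 0.5],
               [0.0, 1.0, 0.5]]

dynamic_samples = [0.25, -0.5]

# Individual calculations: pre-render to the intermediate bed, then render.
bed_samples = apply(first_gain, dynamic_samples)
out_two_step = apply(second_gain, bed_samples)

# Combined calculation: a single Ch x N matrix.
combined_gain = matmul(second_gain, first_gain)
out_one_step = apply(combined_gain, dynamic_samples)
```

Both paths yield the same output samples. The two-step path exposes the intermediate bed (here `bed_samples`), which is what allows an existing bed renderer to be reused, whereas the combined path needs only one matrix multiplication per sample.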
According to some embodiments, the metadata indicates that N of the one or more dynamic audio objects are to be mapped to the set of static audio objects, wherein responsive to the metadata the controller is configured to map, to the set of static audio objects, N of the one or more dynamic audio objects selected from a predefined location or predefined locations in the received audio bitstream. For example, the N dynamic audio objects may be the first N received dynamic audio objects, or the last N received dynamic audio objects. Consequently, in some embodiments, responsive to the metadata the controller is configured to map, to the set of static audio objects, the first N of the one or more dynamic audio objects in the received audio bitstream. This allows for less metadata to identify the at least one of the one or more dynamic audio objects, e.g. an integer value.
According to some embodiments, the one or more dynamic audio objects included in the received audio bitstream comprise more than N dynamic audio objects. As mentioned above, for example for audio comprising dialogue in different languages, it may be advantageous to provide a dynamic audio object for each of the supported languages.
According to some embodiments, the one or more dynamic audio objects included in the received audio bitstream comprise the N dynamic audio objects and K further dynamic audio objects, wherein the controller is configured to render the set of static audio objects and the K further audio objects to a set of output audio channels. Accordingly, the selected language (i.e. the corresponding dynamic audio object) according to the above example may be rendered along with the set of static audio objects to the set of output audio signals.
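The selection of the N dynamic audio objects from a predefined location, leaving K further dynamic audio objects, may be sketched as follows. The function name, its parameters and the object labels are hypothetical; the sketch only assumes that the objects are available in bitstream order.

```python
def split_dynamic_objects(dynamic_objects, n_mapped, from_start=True):
    """Return (N objects to be mapped to the static set, K further objects).

    dynamic_objects is assumed to be in bitstream order. With
    from_start=True the first N objects are selected, otherwise the last N.
    """
    if not 0 <= n_mapped <= len(dynamic_objects):
        raise ValueError("n_mapped must lie between 0 and the object count")
    if from_start:
        return dynamic_objects[:n_mapped], dynamic_objects[n_mapped:]
    split = len(dynamic_objects) - n_mapped
    return dynamic_objects[split:], dynamic_objects[:split]


# Assumed example: four objects to be mapped plus two dialogue objects.
objects = ["me_0", "me_1", "me_2", "me_3", "dialogue_en", "dialogue_fr"]
mapped, further = split_dynamic_objects(objects, n_mapped=4)
```

Here `mapped` holds the N=4 objects to be mapped to the set of static audio objects and `further` holds the K=2 objects that are rendered alongside them.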
According to some embodiments, the set of static audio objects consists of M static audio objects, and M>N>0. Advantageously, bitrate may be saved since the number of dynamic audio objects to be mapped can be reduced. Alternatively, the number (K) of further dynamic audio objects in the audio bitstream may be increased.
According to some embodiments, the received audio bitstream further comprises one or more further static audio objects. The further static objects may comprise an LFE, or other bed or Intermediate Spatial Format (ISF) objects.
According to some embodiments, the set of output audio channels is one of: stereo output channels; 5.1 surround sound output channels; 5.1.2 immersive sound output channels; or 5.1.4 immersive sound output channels.
According to some embodiments, the predefined speaker configuration is a 5.0.2 speaker configuration. In this embodiment, N may be equal to 5.
According to a second aspect of the invention, at least some of the above objects are achieved by a method in a decoder comprising the steps of:
According to a third aspect of the invention, at least some of the above objects are obtained by a computer program product comprising a computer-readable medium with computer code instructions adapted to carry out the method of the second aspect when executed by a device having processing capability.
The second and third aspects may generally have the same features and advantages as the first aspect.
According to a fourth aspect of the invention, at least some of the above objects are obtained by an audio encoder comprising:
a receiving component configured for receiving a set of audio objects;
a downmixing component configured for downmixing the set of audio objects to one or more downmixed dynamic audio objects, wherein at least one of the one or more downmixed dynamic audio objects is intended to, in at least one of a plurality of decoding modes on a decoder side, be mapped to a set of static audio objects, the set of static audio objects corresponding to a predefined speaker configuration;
a downmix coefficients-providing component configured for determining a first set of downmix coefficients to be utilized for rendering the set of static audio objects corresponding to the predefined speaker configuration to a set of output audio channels at the decoder side;
a bitstream multiplexer configured for multiplexing the at least one downmixed dynamic audio object and the first set of downmix coefficients into an audio bitstream.
According to some embodiments, the downmixing component further is configured for providing metadata identifying the at least one of the one or more downmixed dynamic audio objects to the bitstream multiplexer, wherein the bitstream multiplexer is further configured for multiplexing the metadata into the audio bitstream.
According to some embodiments, the encoder is further adapted to determine information pertaining to attenuation applied in at least one of the one or more dynamic audio objects when downmixing the set of audio objects to one or more downmixed dynamic audio objects, wherein the bitstream multiplexer is further configured for multiplexing the information pertaining to attenuation into the audio bitstream.
According to some embodiments, the bitstream multiplexer is further configured for multiplexing information pertaining to a channel configuration of the audio objects received by the receiving component.
According to a fifth aspect of the invention, at least some of the above objects are obtained by a method in an encoder comprising the steps of:
According to a sixth aspect of the invention, at least some of the above objects are obtained by a computer program product comprising a computer-readable medium with computer code instructions adapted to carry out the method of the fifth aspect when executed by a device having processing capability.
The fifth and sixth aspects may generally have the same features and advantages as the fourth aspect. Moreover, the fourth, fifth and sixth aspects may generally have the corresponding features (but from an encoder side) as the first, second and third aspects. For example, the encoder may be adapted to include static audio objects (such as an LFE) in the audio bitstream.
It is further noted that the invention relates to all possible combinations of features unless explicitly stated otherwise.
The above, as well as additional objects, features and advantages of the present invention, will be better understood through the following illustrative and non-limiting detailed description of preferred embodiments of the present invention, with reference to the appended drawings, where the same reference numerals will be used for similar elements, wherein:
The present invention will now be described more fully hereinafter with reference to the accompanying drawings, in which embodiments of the invention are shown. The systems and devices disclosed herein will be described during operation.
In the below, the Dolby AC-4 audio format (as published in document ETSI TS 103 190-2 V1.2.1 (2018-02)) will be used as context for exemplifying the present invention. However, it should be noted that the scope of the invention is not limited to AC-4, and the different embodiments described herein may be employed for any suitable audio format.
Due to computational restrictions in some audio decoders, parametric reconstruction of individual dynamic audio objects from clusters of dynamic audio objects is not possible. Moreover, restrictions in the target bitrate for an audio bitstream may set restriction of the content of the audio bitstream, for example limiting the number of transmitted audio objects/audio channels to 10. A further restriction may originate from the encoding standard used, for example restricting the use of certain coding tools in some specific cases. For example, an AC-4 decoder is configured at different levels, where a level three decoder restricts the use of coding tools such as A-JCC (Advanced Joint Channel Coding) and A-CPL (Advanced Coupling) which otherwise may advantageously be used for achieving an immersive audio experience under certain circumstances. Such circumstances may include an essential channel encoding mode, but where the decoder does not have the coding tools to decode such content (e.g. the use of A-JCC is not permitted). In this case, the present invention may be used to “imitate” channel based immersive as described below. Further possible restrictions comprise the possibility to include both channel based content and dynamic/static audio objects (discrete audio objects) in the same bitstream, which may not be allowed under certain circumstances.
In this document the term ‘clusters’ refers to audio objects which are downmixed in the encoder, as will be described later with reference to
The present invention is aimed at circumventing some of the above restrictions, and providing an advantageous listening experience to the listener of audio output at low bitrate and decoder complexity.
Advanced Joint Object Coding (A-JOC) is a parametric coding tool to code a set of objects efficiently. A-JOC relies on a parametric model of the object-based content. This coding tool may determine dependencies among audio objects and utilize a perceptually based parametric model to achieve high coding efficiency.
The audio decoder 100 further comprises a controller 104 coupled to the one or more buffers 102. The controller 104 can thus extract at least parts of the audio bitstream 110 from the buffer(s) 102, to decode the encoded audio bitstream into a set of audio output channels 118. The set of audio output channels 118 may then be used for playback by a set of speakers 120.
As described above, the audio decoder 100, or the controller 104, can operate in different decoding modes. In the following, two decoding modes will exemplify this. However, further decoding modes may be employed.
In a first decoding mode (full decoding mode, complex decoding mode, etc.) the parametric reconstruction of individual dynamic audio objects from clusters of dynamic audio objects is possible. In the context of AC-4, the first decoding mode may be called A-JOC full decoding. In the non-limiting example given above with 10 individual dynamic objects and 5 clusters (dynamic downmixed audio objects), full decoding mode allows to reconstruct the 10 original individual dynamic objects (or an approximation thereof) from the 5 clusters.
In a second decoding mode (core decoding, low complexity decoding, etc.), such reconstruction is not carried out due to restrictions in the decoder 100. In the context of AC-4, the second decoding mode may be called A-JOC core decoding. In the non-limiting example given above with 10 individual dynamic objects and 5 clusters (dynamic downmixed audio objects), core decoding mode is not able to reconstruct the 10 original individual dynamic objects (or approximation thereof) from the 5 clusters.
The controller is thus configured to select a decoding mode, either the first or the second decoding mode. Such decision may be made based on internal parameters 116 of the decoder 100, for example stored in a memory of the decoder. Alternatively, or additionally, the decision may also be made based on input 114 from e.g. a user. Alternatively, or additionally, the decision may further be based on the content of the audio bitstream 110. For example, if the received audio bitstream comprises more than a threshold number of dynamic downmixed audio objects (e.g. more than 6, or more than 10, or any other suitable number depending on the context), the controller may select the second decoding mode. The audio bitstream 110 may in some embodiments comprise a flag value indicating to the controller which decoding mode to select.
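One possible way of combining these inputs is sketched below. The priority order (bitstream flag first, then user input, then the object-count threshold), the default of falling back to full decoding, and all names are assumptions made for illustration; the description above does not prescribe a particular order.

```python
from enum import Enum


class DecodingMode(Enum):
    FULL = "first decoding mode (full decoding)"
    CORE = "second decoding mode (core decoding)"


def select_decoding_mode(bitstream_flag=None, user_choice=None,
                         num_downmix_objects=0, threshold=6):
    """Select a decoding mode from the available inputs.

    bitstream_flag and user_choice are optional DecodingMode values;
    None means that the respective input is not available.
    """
    if bitstream_flag is not None:
        return bitstream_flag
    if user_choice is not None:
        return user_choice
    # More than a threshold number of dynamic downmixed audio objects
    # leads to the second (core) decoding mode.
    if num_downmix_objects > threshold:
        return DecodingMode.CORE
    return DecodingMode.FULL
```

For example, `select_decoding_mode(num_downmix_objects=7)` returns `DecodingMode.CORE` with the assumed threshold of 6.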
For example, in the context of AC-4, according to one embodiment, the selection of the first decoding mode may be one or many of the following:
In the following, the second decoding mode (core decoding) will be exemplified in conjunction with
The controller 104 is configured to determine whether the received audio bitstream 110 includes one or more dynamic audio objects (which in this embodiment are all mapped to a set of static audio objects), and to base the decision, how to decode the received audio bitstream, thereon. According to some embodiments, the controller bases such decision also on further data such as bitstream parameters. For example, in AC-4, the controller may determine to decode the received audio bitstream as described in
1. “num_bed_obj_ajoc” is greater than zero (e.g. 1 to 7) or
2. “num_bed_obj_ajoc” is not present in the bitstream and “n_fullband_dmx_signals” is smaller than 6.
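The two conditions above may be expressed as a small predicate. In this sketch, a value of `None` is an assumed way of modelling a parameter that is not present in the bitstream; the function name is hypothetical.

```python
def should_map_to_static_set(num_bed_obj_ajoc=None,
                             n_fullband_dmx_signals=None):
    """Return True if either of the two conditions above holds.

    None models a parameter that is absent from the bitstream.
    """
    if num_bed_obj_ajoc is not None:
        # Condition 1: the parameter is present and greater than zero.
        return num_bed_obj_ajoc > 0
    # Condition 2: the parameter is absent and fewer than six
    # full-band downmix signals are signalled.
    return (n_fullband_dmx_signals is not None
            and n_fullband_dmx_signals < 6)
```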
In case the controller 104 determines that one or more dynamic audio objects 210 should be taken into account, and optionally also in view of other data as described above, the controller is configured to map at least one 210 of the one or more dynamic audio objects to a set of static audio objects. In
According to some embodiments, M>N>0.
An advantage of actually rendering the N dynamic audio objects 210 into a bed 222 is that the remaining operations of the decoder 100 (i.e. producing a set of output audio signals 118) may be achieved by reusing existing software code/functions used for implementing a decoder which is adapted to render a bed 222 (and optionally further dynamic audio objects as described in
The decoder produces a set of further OAMD 214. These OAMD 214 define the positions and the gains for the intermediately rendered bed 222.
The OAMD 214 is thus not conveyed in the bitstream but instead locally “generated” in the decoder to describe the (typically 5.0.2) channel configuration generated at the output of the pre-rendering 202. For example, if the intermediate bed 222 is configured as a 5.0.2, the OAMD 214 define the positions (L, R, C, Ls, Rs, Ltm, Rtm) and the gains for the 5.0.2 bed 222. If another configuration of the intermediate bed is employed, e.g. 3.0.0, the positions would be L, R, C. The number of OAMD 214 in this embodiment thus corresponds to the number of static audio objects 222, for example 7 in the case of 5.0.2 bed 222. In some embodiments, the gain in each of the OAMD 214 is unity (1). The OAMD 214 thus comprise properties for the set of static audio objects 222, e.g. gain and position for each static audio object 222. In other words, the OAMD 214 indicate the predefined configuration of the bed 222.
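The local generation of the OAMD for the intermediate bed may be sketched as follows. The class name, the table `BED_POSITIONS` and the function name are hypothetical; only the position labels, the two example configurations and the unity gain are taken from the description above.

```python
from dataclasses import dataclass

# Position labels per intermediate bed configuration, as given above.
BED_POSITIONS = {
    "5.0.2": ("L", "R", "C", "Ls", "Rs", "Ltm", "Rtm"),
    "3.0.0": ("L", "R", "C"),
}


@dataclass(frozen=True)
class BedObjectMetadata:
    position: str
    gain: float = 1.0  # unity gain, as in some embodiments


def generate_bed_oamd(config="5.0.2"):
    """Create one metadata entry per static audio object of the bed."""
    return [BedObjectMetadata(position=p) for p in BED_POSITIONS[config]]


oamd = generate_bed_oamd("5.0.2")
```

For the 5.0.2 bed this yields seven entries, one per static audio object, none of which needs to be conveyed in the bitstream.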
The audio bitstream 110 further comprises downmix coefficients 216. Depending on the configuration of the set of output channels 118, the controller selects the corresponding downmix coefficients 216 to be utilized when calculating a second gain matrix 220. By way of example, the set of output audio channels is one of: stereo output channels; 5.1 surround sound output channels; 5.1.2 immersive sound output channels (immersive audio output configuration); 5.1.4 immersive sound output channels (immersive audio output configuration); 7.1 surround sound output channels; or 9.1 surround sound output channels. The resulting gain matrix is thus Ch (number of output channels)×M in size. The selected downmix coefficients may be used as is when calculating the second gain matrix 220. However, as will be described further below in conjunction with
The second gain matrix is used at a rendering stage 204 of the decoder 100, to render the set of static audio objects 222 to the set of output audio channels 118.
It should be noted that in
In
Accordingly, in some embodiments, the dynamic audio objects included in the received audio bitstream number more than N dynamic audio objects 210. For example, the dynamic audio objects included in the received audio bitstream comprise the N dynamic audio objects and K further dynamic audio objects. According to some embodiments, the received audio bitstream comprises music and effects plus dialogue (M&E+D). In that case, if a separate dialogue is to be added when rendering the set of output channels 118, this may cause a problem in the low rate case where only 10 audio objects may be included in the received audio bitstream 110. If the set of output channels 118 is in a 5.1.2 configuration and bed objects were used (i.e. the legacy solution), 8 bed objects would need to be transmitted. This would leave only two possible audio objects representing the dialogue, which may be too few, e.g. if five different dialogue objects should be supported. Using the invention herein, immersive output audio may be achieved in this case by e.g. transmitting four (N) dynamic audio objects for M&E, which are mapped 202 to the set of static audio objects 222, one additional static object 302 for the LFE, and five (K) additional dynamic objects for the dialogue.
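The object budget in this example can be checked with a few lines of arithmetic. The variable and function names are hypothetical; the numbers are those of the example above.

```python
MAX_OBJECTS = 10  # limit on audio objects in the low rate case


def bed_objects_needed(config):
    """Count the channels of a layout string such as '5.1.2'."""
    return sum(int(part) for part in config.split("."))


# Legacy solution: one bed object per channel of the 5.1.2 output.
legacy_bed = bed_objects_needed("5.1.2")
legacy_dialogue_slots = MAX_OBJECTS - legacy_bed

# Approach described herein: N mapped dynamic objects, one LFE object,
# and K dialogue objects.
n_mapped, n_lfe, k_dialogue = 4, 1, 5
proposed_total = n_mapped + n_lfe + k_dialogue
```

The legacy solution uses 8 of the 10 available objects for the bed and leaves 2 for dialogue, whereas the mapping approach fits five dialogue objects within the same limit of 10.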
In the embodiment of
For the rendering 204, a set of OAMD 214 is employed. The received audio bitstream comprises, in this example, 6 OAMD 214, one for each additional audio object 302. These 6 OAMD are thus included in the audio bitstream on an encoder side, to be used at the decoder 100 for the decoding process described herein. Moreover, as described above in conjunction with
The audio bitstream 110 further comprises downmix coefficients 216 which are utilized for rendering the set of output channels 118 similar to what was described above in conjunction with
The second gain matrix 220 is used at a rendering stage 204 of the decoder 100, to render the set of static audio objects 222, and the set of further audio objects 302 (which may include dynamic audio objects and/or static audio objects and/or ISF objects as defined above) to the set of output audio channels 118.
In the case described in
In one embodiment, the metadata indicates that N of the one or more dynamic audio objects are to be mapped to the set of static audio objects, whereby the controller knows that these N dynamic audio objects should be selected from a predefined location or predefined locations in the received audio bitstream. The dynamic audio objects 210 to be mapped may for example be the first, or the last, N audio objects in the audio bitstream 110. The number of audio objects to be mapped may be indicated by the flag value num_bed_obj_ajoc (may also be called num_obj_with_bed_render_info) and/or n_fullband_dmx_signals in the AC-4 standard (as published in document ETSI TS 103 190-2 V1.2.1 (2018-02)). In other standards, other names of the flag values may be used. It should also be noted that flag values may be renamed for newer versions of the AC-4 standard referred to above. According to some embodiments, if num_bed_obj_ajoc is greater than zero, this means that num_bed_obj_ajoc dynamic objects are mapped to the set of static audio objects. According to some embodiments, if num_bed_obj_ajoc is not present and n_fullband_dmx_signals is smaller than six, this means that all dynamic objects are mapped to the set of static audio objects.
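How the two flag values may determine the number of dynamic audio objects to be mapped can be sketched as follows. The function name is hypothetical, `None` is an assumed way of modelling a value that is not present in the bitstream, and returning zero when neither rule applies is an assumption of this sketch.

```python
def num_objects_to_map(num_dynamic_objects, num_bed_obj_ajoc=None,
                       n_fullband_dmx_signals=None):
    """Return how many dynamic objects are mapped to the static set."""
    if num_bed_obj_ajoc is not None:
        # A value greater than zero gives the count directly.
        return max(num_bed_obj_ajoc, 0)
    if n_fullband_dmx_signals is not None and n_fullband_dmx_signals < 6:
        # Flag absent and fewer than six signals: map all dynamic objects.
        return num_dynamic_objects
    return 0


# Assumed example: nine dynamic objects, of which the first four are mapped.
dynamic_objects = [f"obj_{i}" for i in range(9)]
n = num_objects_to_map(len(dynamic_objects), num_bed_obj_ajoc=4)
to_map = dynamic_objects[:n]
```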
In some embodiments, dynamic audio objects are received prior to any static audio objects in the received bitstream 110. In other embodiments, the LFE is received first in the bitstream 110, prior to the dynamic audio objects and any further static audio objects.
The encoder 500 further comprises a downmix coefficients providing component 504 configured for determining a first set of downmix coefficients to be utilized for rendering the set of static audio objects corresponding to the predefined speaker configuration to a set of output audio channels at the decoder side. As described later in conjunction with
The encoder further comprises a bitstream multiplexer 506 configured for multiplexing the at least one downmixed dynamic audio object 510 and the first set of downmix coefficients 516 into an audio bitstream 110.
In some embodiments, the downmixing component 502 also provides metadata 514 identifying the at least one downmixed audio object 510 of the one or more downmixed dynamic audio objects to the bitstream multiplexer 506. In this case, the bitstream multiplexer 506 is further configured for multiplexing the metadata 514 into the audio bitstream 110.
In some embodiments, the downmixing component 502 receives a target bit rate 509 to determine specifics of the downmixing operation, e.g. how many downmixed audio objects should be computed from the set of dynamic audio objects 508. In other words, the target bit rate may determine a clustering parameter for the downmix operation.
As understood, in case the one or more downmixed dynamic audio objects 510 comprise more objects than the dynamic audio object(s) intended to be mapped to the set of static audio objects on a decoder side, downmix coefficients need to be computed for those further objects as well. Furthermore, static audio objects (e.g. LFE, etc.) may also be transmitted by the bitstream multiplexer 506 for inclusion in the audio bitstream 110, along with corresponding downmix coefficients. Moreover, each audio object included in the audio bitstream 110 will have an associated OAMD, for example OAMD associated with all dynamic audio objects 510 which are intended to be mapped to the set of static audio objects at a decoder side, which will be multiplexed into the audio bitstream 110.
The functionality of the DC selection and modification unit 606 will now be exemplified for particular setups of encoded and decoded audio.
In some embodiments, attenuation is applied to some of the transmitted audio objects 210 by the encoder. Such attenuation is the result of a downmixing process of an original audio signal to a downmix audio signal in the encoder. For example, if the format of the original audio signal is 7.1.4 (L, R, C, LFE, Ls, Rs, Lb, Rb, Tfl, Tfr, Tbl, Tbr), which is downmixed to a 5.1.2 (Ld, Rd, Cd, LFE, Lsd, Rsd, Tld, Trd) format in the encoder, the Lsd signal is determined in the encoder as:
NdB(Ls+Lb),
and the Tld signal is determined in the encoder as:
MdB(Tfl+Tbl)
Typically, N=M=3, but other attenuation levels may be applied.
In this setup, a 3 dB attenuation is thus already applied in the Lsd and the Tld. In these examples, only the channels on the left side are described, while the channels on the right side are handled correspondingly.
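As an illustration of the encoder-side attenuation, the left-side part of the 7.1.4 to 5.1.2 downmix might be sketched as below. The sketch assumes that the dB values are amplitude attenuations (linear factor 10^(-dB/20)), that Ld equals L, and that each channel is represented by a single sample value; the function and dictionary key names are hypothetical.

```python
from typing import Dict


def db_to_linear(db: float) -> float:
    """Convert an amplitude gain in dB to a linear factor (assumed 20*log10)."""
    return 10.0 ** (db / 20.0)


def downmix_left_714_to_512(ch: Dict[str, float],
                            n_db: float = 3.0,
                            m_db: float = 3.0) -> Dict[str, float]:
    """Downmix the left-side channels of a 7.1.4 frame to 5.1.2.

    N dB of attenuation is applied to the sum of Ls and Lb, and M dB to
    the sum of Tfl and Tbl, as described in the text above.
    """
    att_n = db_to_linear(-n_db)  # attenuation for the surround pair
    att_m = db_to_linear(-m_db)  # attenuation for the top pair
    return {
        "Ld": ch["L"],  # assumption: front left passes through unchanged
        "Lsd": att_n * (ch["Ls"] + ch["Lb"]),
        "Tld": att_m * (ch["Tfl"] + ch["Tbl"]),
    }
```

The right-side channels would be handled in the same way with the corresponding right-side inputs.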
It should be noted that the downmix (e.g. 5.1.2 channel audio) is then further reduced in the encoder to for example five dynamic audio objects (210 in
The relevant downmix coefficients 216 transmitted in the bitstream in this case are
However, if the above downmix coefficients are applied directly when the audio format of the output channels 118 is 5.1, the top channels Tfl and Tbl will be attenuated by 6 dB in the surround output, i.e. by the M=3 dB already applied in the encoder plus the 3 dB of the gain_t2b downmix coefficient received in the bitstream. The same goes for the lower channels Ls and Lb, which will also be attenuated by 6 dB in the surround output, i.e. by the N=3 dB already applied in the encoder plus the 3 dB of the gain_b4_to_b2 downmix coefficient received in the bitstream. To compensate for the attenuation already made on the encoder side, the DC selection and modification unit 606 is configured to, in this case, determine downmix coefficients 608 such that the output channels will be rendered as:
Lout=Ld+(+MdB+gain_t2a)Tld=L+gain_t2a(Tfl+Tbl), and
Lsout=(+NdB+gain_b4_to_b2)Lsd+(+MdB+gain_t2b)Tld=gain_b4_to_b2(Ls+Lb)+gain_t2b(Tfl+Tbl).
In this embodiment, the decoder selects gain_t2a and gain_t2b, which are the gains for the top front channels to the respective front and surround channels. These may thus be preferred over gain_t2d and gain_t2e, which are the gains for the top back channels. It should also be noted that the above equations are intended to convey the idea of compensating, at the decoder, for attenuation made by the encoder, and that in reality the equations to achieve this would be designed to make sure that, e.g., the conversion from gains/attenuations in the logarithmic dB domain to linear gains is handled correctly.
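One way the compensation could be carried out with the dB-to-linear conversion handled explicitly is sketched below. It assumes that the downmix coefficients are given as amplitude gains in dB (negative values for attenuation) and that each channel is a single sample value; the function names and the dictionary layout are hypothetical and not part of the disclosure.

```python
from typing import Dict, Tuple


def db_to_linear(db: float) -> float:
    """Convert an amplitude gain in dB to a linear factor (assumed 20*log10)."""
    return 10.0 ** (db / 20.0)


def compensated_gain(coeff_db: float, encoder_att_db: float) -> float:
    """Add back the encoder attenuation in the dB domain, then go linear."""
    return db_to_linear(coeff_db + encoder_att_db)


def render_left_to_51(dmx: Dict[str, float],
                      gain_t2a_db: float,
                      gain_t2b_db: float,
                      gain_b4_to_b2_db: float,
                      n_db: float,
                      m_db: float) -> Tuple[float, float]:
    """Render the left-side 5.1.2 downmix channels to 5.1 (Lout, Lsout)."""
    lout = dmx["Ld"] + compensated_gain(gain_t2a_db, m_db) * dmx["Tld"]
    lsout = (compensated_gain(gain_b4_to_b2_db, n_db) * dmx["Lsd"]
             + compensated_gain(gain_t2b_db, m_db) * dmx["Tld"])
    return lout, lsout
```

Summing the coefficient and the encoder attenuation in the dB domain before a single conversion to a linear factor keeps the arithmetic equivalent to multiplying the two linear gains, which is the property the equations above rely on.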
To achieve the above, the decoder needs to be aware of the attenuation made by the encoder. In some embodiments, the values of N (dB) and M (dB) are indicated in the bitstream as additional metadata 602. The additional metadata 602 thus define information pertaining to attenuation applied in at least one of the one or more dynamic audio objects on an encoder side. In other embodiments, the decoder is preconfigured (in a memory 604) with the attenuation 603 applied in the encoder. For example, the decoder may be aware that 3 dB attenuation is always performed in the case of the 7.1.4 (or 5.1.4) to 5.1.2 downmix in the encoder. In these embodiments, the decoder receives information 602, 603 pertaining to attenuation applied in at least one of the one or more dynamic audio objects on an encoder side. This information 602, 603, in conjunction with the received data 612 indicating what type of downmix has been performed in the encoder, may be used to select and/or adjust the downmix coefficients in the DC selection and modification unit 606. The selected and/or adjusted coefficients 608 will, as mentioned above, be used by the gain matrix calculation unit 610, in conjunction with the OAMD 214 and the configuration of the output audio signal 118, to form the second gain matrix 220.
In another exemplary setup, the original audio signal at the encoder is 5.1.2 with top front channels (L, R, C, LFE, Ls, Rs, Tfl, Tfr), which is downmixed to a 5.1.2 format with top middle channels instead (Ld, Rd, Cd, LFE, Lsd, Rsd, Tld, Trd). In this embodiment, no attenuation is made at the encoder. However, in this case, the DC selection and modification unit 606 needs to know the original signal configuration at the encoder side in order to select the appropriate downmix coefficients for the 5.1 output signal 118. The relevant downmix coefficients 216 transmitted in the bitstream in this case are gain_t2a and gain_t2b, which are the gains for the top front channels to the respective front and surround channels. The DC selection and modification unit 606 is configured to, in this case, determine downmix coefficients 608 such that the output channels 118 will be rendered as:
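The two sources of attenuation information described above (additional metadata 602 in the bitstream, or a value 603 preconfigured in the decoder memory 604) could be resolved as in the following sketch. The table contents, the metadata key names, and the precedence given to bitstream metadata over the preconfigured value are assumptions for illustration; only the 3 dB value for the 7.1.4 to 5.1.2 case is taken from the text.

```python
from typing import Dict, Optional, Tuple

# Hypothetical preconfigured table (memory 604): maps a downmix type
# (original format, downmix format) to the (N dB, M dB) attenuation.
PRECONFIGURED_ATTENUATION_DB: Dict[Tuple[str, str], Tuple[float, float]] = {
    ("7.1.4", "5.1.2"): (3.0, 3.0),
}


def resolve_attenuation(downmix_type: Tuple[str, str],
                        metadata: Optional[Dict[str, float]] = None
                        ) -> Optional[Tuple[float, float]]:
    """Return (N dB, M dB) applied by the encoder, or None if unknown.

    Assumption: values signalled in the bitstream take precedence over
    the preconfigured table.
    """
    if metadata is not None and "n_db" in metadata and "m_db" in metadata:
        return metadata["n_db"], metadata["m_db"]
    return PRECONFIGURED_ATTENUATION_DB.get(downmix_type)
```

The returned pair would then be passed, together with the received downmix coefficients, to whatever routine adjusts the coefficients before the gain matrix is formed.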
Lout=Ld+gain_t2a(Tld)=L+gain_t2a(Tfl) and
Lsout=Lsd+gain_t2b(Tld)=Ls+gain_t2b(Tfl).
Further embodiments of the present disclosure will become apparent to a person skilled in the art after studying the description above. Even though the present description and drawings disclose embodiments and examples, the disclosure is not restricted to these specific examples. Numerous modifications and variations can be made without departing from the scope of the present disclosure, which is defined by the accompanying claims. Any reference signs appearing in the claims are not to be understood as limiting their scope.
Additionally, variations to the disclosed embodiments can be understood and effected by the skilled person in practicing the disclosure, from a study of the drawings, the disclosure, and the appended claims. In the claims, the word “comprising” does not exclude other elements or steps, and the indefinite article “a” or “an” does not exclude a plurality. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
The systems and methods disclosed hereinabove may be implemented as software, firmware, hardware or a combination thereof. In a hardware implementation, the division of tasks between functional units referred to in the above description does not necessarily correspond to the division into physical units; to the contrary, one physical component may have multiple functionalities, and one task may be carried out by several physical components in cooperation. Certain components or all components may be implemented as software executed by a digital signal processor or microprocessor, or be implemented as hardware or as an application-specific integrated circuit. Such software may be distributed on computer readable media, which may comprise computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to a person skilled in the art, the term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Further, it is well known to the skilled person that communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
Various aspects of the present invention may be appreciated from the following enumerated example embodiments (EEEs):
Number | Date | Country | Kind |
---|---|---|---|
18204046 | Nov 2018 | EP | regional |
This application claims priority of the following priority applications: U.S. provisional application 62/754,758 (reference: D18053USP1), filed Nov. 2, 2018, EP application 18204046.9 (reference: D18053EP), filed Nov. 2, 2018, and U.S. provisional application 62/793,073 (reference D18053USP2), filed Jan. 16, 2019, which are hereby incorporated by reference.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2019/079683 | 10/30/2019 | WO |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2020/089302 | 5/7/2020 | WO | A |
Entry |
---|
“Dolby AC-4: Audio Delivery for Next-Generation Entertainment Services” Jun. 2015. |
ETSI “Digital Audio Compression (AC-4) Standard Part 2: Immersive and Personalized Audio” Sep. 2015, ETSI TS 103 190-2. |
Poers, Peter “Metadata Based Audio Production for Next Generation Audio Formats” SMPTE 2017 Annual Technical Conference and Exhibition, 2017. |
Purnhagen, H. et al “Immersive Audio Delivery Using Joint Object Coding” AES presented at the 140th Convention, Jun. 4-7, 2016, Paris, France. |
Riedmiller, J. “Dolby AC-4 Next-Generation Audio” Mar. 22, 2016. |
Riedmiller, J. et al. “Delivering Scalable Audio Experiences using AC-4” IEEE Transactions on Broadcasting, vol. 63, Issue 1, 2017. |
Number | Date | Country | |
---|---|---|---|
20220005484 A1 | Jan 2022 | US |
Number | Date | Country | |
---|---|---|---|
62793073 | Jan 2019 | US | |
62754758 | Nov 2018 | US |