This disclosure relates generally to audio signal processing, including channel-based audio to object-based audio conversion.
In channel-based audio (CBA) coding, a set of tracks is implicitly assigned to specific loudspeakers by associating the set of tracks with a channel configuration. If the playback speaker configuration is different from the coded channel configuration, downmixing or upmixing specifications are required to redistribute audio to the available speakers. This paradigm is well known and works when the channel configuration at the decoding end can be predetermined, or assumed with reasonable certainty to be 2.0, 5.X or 7.X. However, with the popularity of new speaker setups, no assumption can be made about the speaker setup used for playback. Therefore, CBA does not offer a sufficient method for adapting a representation where the source speaker layout does not match the speaker layout at the decoding end. This presents a challenge when trying to author content that plays back well independently of the speaker configuration.
In object-based audio (OBA) coding, rendering is applied to objects that comprise the object audio essence in conjunction with metadata that contains individually assigned object properties. The properties (e.g., x, y, z position or channel location) more explicitly specify how the content creator intends the audio content to be rendered (that is, they place constraints on how to render the essence into speakers). Because individual sound elements can be associated with a much richer set of metadata, giving meaning to the elements, the method of adaptation to the speaker configuration reproducing the audio can provide better information regarding how to render to fewer speakers.
There are several standardized formats for transmission of CBA content, such as enhanced AC-3 (E-AC-3) defined in ETSI TS 102 366 [1]. To ensure compatibility with pre-existing devices, joint object coding (JOC) can be used in conjunction with standardized CBA formats to transport OBA. JOC delivers immersive audio at low bitrates, achieved by conveying a multi-channel downmix of the immersive content using perceptual audio coding algorithms together with parametric side information that enables the reconstruction of the audio objects from the downmix in the decoder. In some applications, such as television broadcasts, it is desired to represent CBA content as OBA content so that the content is compatible with an installed base of OBA playback devices. However, the standardized bitstream formats for CBA and OBA are not entirely compatible.
Embodiments are disclosed for converting CBA content to OBA content, and in a particular embodiment converting 22.2-channel content to OBA content for playback on OBA compatible playback devices.
In an embodiment, a method comprises: receiving, by one or more processors of an audio processing apparatus, a bitstream including channel-based audio and associated channel-based audio metadata; the one or more processors configured to: parse a signaling parameter from the channel-based audio metadata, the signaling parameter indicating one of a plurality of different object audio metadata (OAMD) representations; each one of the OAMD representations mapping one or more audio channels of the channel-based audio to one or more audio objects; convert the channel-based metadata into OAMD associated with the one or more audio objects using the OAMD representation that is indicated by the signaling parameter; generate channel shuffle information based on channel ordering constraints of the OAMD; reorder the audio channels of the channel-based audio based on the channel shuffle information to generate reordered, channel-based audio; and render the reordered, channel-based audio into rendered audio using the OAMD; or encode the reordered channel-based audio and the OAMD into an object-based audio bitstream and transmit the object-based audio bitstream to a playback device or source device.
In an embodiment, the channel-based audio and metadata are included in a native audio bitstream, and the method further comprises decoding the native audio bitstream to recover (i.e. determine, or extract) the channel-based audio and metadata.
In an embodiment, the channel-based audio and metadata are N.M channel-based audio and metadata, where N is a positive integer greater than nine and M is a positive integer greater than or equal to zero.
In an embodiment, the method further comprises: determining a first set of channels of the channel-based audio that are capable of being represented by OAMD bed channels; assigning OAMD bed channel labels to the first set of channels; determining a second set of channels of the channel-based audio that are not capable of being represented by OAMD bed channels; and assigning static OAMD position coordinates to the second set of channels.
In an embodiment, a method comprises: receiving, by one or more processors of an audio processing apparatus, a bitstream including channel-based audio and metadata; the one or more processors configured to: encode the channel-based audio into a native audio bitstream; parse a signaling parameter from the metadata, the signaling parameter indicating one of a plurality of different object audio metadata (OAMD) representations; convert the channel-based metadata into OAMD using the OAMD representation that is indicated by the signaling parameter; generate channel shuffle information based on channel ordering constraints of the OAMD; generate a bitstream package that includes the native audio bitstream, the channel shuffle information and the OAMD; multiplex the package into a transport layer bitstream; and transmit the transport layer bitstream to a playback device or source device.
In an embodiment, the channel-based audio and metadata are N.M channel-based audio and metadata, where N is a positive integer greater than seven and M is a positive integer greater than or equal to zero.
In an embodiment, the channels in the channel-based audio that can be represented by OAMD bed channel labels use the OAMD bed channel labels, and the channels in the channel-based audio that cannot be represented by OAMD bed channel labels use static object positions, where each static object position is described in OAMD position coordinates.
In an embodiment, the transport bitstream is a moving pictures experts group (MPEG) audio bitstream that includes a signal that indicates the presence of OAMD in an extension field of the MPEG audio bitstream.
In an embodiment, the signal that indicates the presence of OAMD in the MPEG audio bitstream is included in a reserved field of metadata in the MPEG audio bitstream for signaling a surround sound mode.
In an embodiment, a method comprises: receiving, by one or more processors of an audio processing apparatus, a transport layer bitstream including a package; the one or more processors configured to: demultiplex the transport layer bitstream to recover (i.e. determine or extract) the package; decode the package to recover (i.e. determine or extract) a native audio bitstream, channel shuffle information and an object audio metadata (OAMD); decode the native audio bitstream to recover a channel-based audio bitstream and metadata; reorder the channels of the channel-based audio based on the channel shuffle information; and render the reordered, channel-based audio into rendered audio using the OAMD; or encode the channel-based audio and OAMD into an object-based audio bitstream and transmit the object-based audio bitstream to a source device.
In an embodiment, the channel-based audio and metadata are N.M channel-based audio and metadata, where N is a positive integer greater than seven and M is a positive integer greater than or equal to zero.
In an embodiment, a method further comprises: determining a first set of channels of the channel-based audio that are capable of being represented by OAMD bed channels; assigning OAMD bed channel labels to the first set of channels; determining a second set of channels of the channel-based audio that are not capable of being represented by OAMD bed channels; and assigning static OAMD position coordinates to the second set of channels.
In an embodiment, the transport bitstream is a moving pictures experts group (MPEG) audio bitstream that includes a signal that indicates the presence of OAMD in an extension field of the MPEG audio bitstream.
In an embodiment, the signal that indicates the presence of OAMD in the MPEG audio bitstream is included in a reserved field of a data structure in metadata of the MPEG audio bitstream for signaling a surround sound mode.
In an embodiment, an apparatus comprises: one or more processors; and a non-transitory, computer-readable storage medium having instructions stored thereon that when executed by the one or more processors, cause the one or more processors to perform the methods described herein.
Other embodiments disclosed herein are directed to systems, apparatus and computer-readable media. The details of the disclosed implementations are set forth in the accompanying drawings and the description below. Other features, objects and advantages are apparent from the description, drawings and claims.
Particular embodiments disclosed herein provide one or more of the following advantages. An existing installed base of OBA compatible playback devices can convert CBA content to OBA content using existing standards-based native audio and transport bitstream formats without replacing hardware components of the playback devices.
In the accompanying drawings referenced below, various embodiments are illustrated in block diagrams, flow charts and other diagrams. Each block in the flowcharts or block diagrams may represent a module, a program, or a part of code, which contains one or more executable instructions for performing specified logic functions. Although these blocks are illustrated in particular sequences for performing the steps of the methods, they may not necessarily be performed strictly in accordance with the illustrated sequence. For example, they might be performed in reverse sequence or simultaneously, depending on the nature of the respective operations. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations thereof, may be implemented by a dedicated software-based or hardware-based system for performing specified functions/operations or by a combination of dedicated hardware and computer instructions.
The same reference symbol used in various drawings indicates like elements.
Object Audio Metadata (OAMD) is the coded bitstream representation of the metadata for OBA processing, such as, for example, the metadata described in ETSI TS 103 420 v1.2.1 (2018 October). The OAMD bitstream may be carried inside an Extensible Metadata Delivery Format (EMDF) container, such as, for example, as specified in ETSI TS 102 366 [1]. OAMD is used for rendering an audio object. The rendering information (e.g., gain and position) may change dynamically. The OAMD bitstream elements may include content description metadata, object properties metadata, property update metadata and other metadata.
In an embodiment, the content description metadata includes the version of OAMD payload syntax, the total number of objects, the types of objects and the program composition. The object properties metadata includes object position in room-anchored, screen-anchored or speaker-anchored coordinates, object size (width, depth, height), priority (imposes an ordering by importance on objects where higher priority indicates higher importance for an object), gain (used to apply a custom gain value to an object), channel lock (used to constrain rendering of an object to a single speaker, providing a non-diffuse, timbre-neutral reproduction of the audio), zone constraints (specifies zones or sub-volume in the listening environment where an object is excluded or included), object divergence (used to convert object into two objects, where the energy is spread along the X-axis) and object trim (used to lower the level of out-of-screen elements that are indicated in the mix).
In an embodiment, the property update metadata signals timing data applicable to updates for all transmitted objects. The timing data of a transmitted property update specifies a start time for the update, along with the update context with preceding or subsequent updates and the temporal duration for an interpolation process between successive updates. The OAMD bitstream syntax supports up to eight property updates per object in each codec frame. The number of signaled updates and the start and stop time of each property update are identical for all objects. The metadata includes a ramp duration value that specifies a time period in audio samples for an interpolation from the signaled object property values of the previous property update to the values of the current update.
In an embodiment, the timing data also includes a sample offset value and a block offset value which are used by the decoder to calculate a start sample value offset and a frame offset. The sample offset is a temporal offset in samples to the first pulse code modulated (PCM) audio sample that the data in the OAMD payload applies to, such as, for example, as specified in ETSI TS 102 366 [1], clauses H.2.2.3.1 and H.2.2.3.2. The block offset value indicates a time period in samples as offset from the sample offset common for all property updates.
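The timing relationships described above can be illustrated with a minimal sketch. The class and function names are hypothetical, the start sample is assumed to be the sample offset plus the block offset, and linear interpolation is assumed because the text specifies only "an interpolation process"; none of this is taken from the OAMD syntax itself.

```python
from dataclasses import dataclass


@dataclass
class PropertyUpdate:
    """Hypothetical container for one property update (field names are assumptions)."""
    block_offset: int   # offset in samples relative to the shared sample offset
    ramp_duration: int  # interpolation period in audio samples
    gain: float         # example object property value carried by this update


def update_start(sample_offset: int, update: PropertyUpdate) -> int:
    """Assumed start sample: shared sample offset plus the update's block offset."""
    return sample_offset + update.block_offset


def interpolate(prev_value: float, update: PropertyUpdate, start: int, n: int) -> float:
    """Ramp from prev_value to update.gain over ramp_duration samples.

    Linear interpolation is an assumption made for this sketch.
    """
    if n <= start:
        return prev_value
    if update.ramp_duration <= 0 or n >= start + update.ramp_duration:
        return update.gain
    t = (n - start) / update.ramp_duration
    return prev_value + t * (update.gain - prev_value)
```

With a sample offset of 32 and a block offset of 256, for example, this sketch places the start of the update at sample 288, and the property value reaches its new target one ramp duration later.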
In an embodiment, a decoder provides an interface for the OBA comprising object audio essence audio data and time-stamped metadata updates for the corresponding object properties. At the interface the decoder provides the decoded per-object metadata in time stamped updates. For each update the decoder provides the data specified in a metadata update structure.
In the following disclosure, techniques are disclosed for converting CBA content into OBA using OAMD. In an exemplary embodiment, 22.2-channel (“22.2-ch”) content is converted to OBA using OAMD. In this embodiment, the 22.2-ch content has two defined methods by which channels are positioned and hence downmixed/rendered. The choice of method may depend on the value of a parameter, such as the dmix_pos_adj_idx parameter embedded in the 22.2-ch bitstream. The format converter that converts 22.2-ch locations to an OAMD representation selects one of two OAMD representations based on the value of this parameter. The selected representation is carried in an OBA bitstream (e.g., Dolby® MAT bitstream) that is input to the playback device (e.g., a Dolby® Atmos® playback device). An example 22.2-ch system is Hamasaki 22.2. Hamasaki 22.2 is the surround sound component of Super Hi-Vision, which is a television standard developed by NHK Science & Technical Research Laboratories that uses 24 speakers (including two subwoofers) arranged in three layers.
Although the following disclosure is directed to an embodiment where 22.2-ch content is converted to OBA content using OAMD, the disclosed embodiments are applicable to any CBA or OBA bitstream format, including standardized or proprietary bitstream formats, and any playback device or system. Additionally, the following disclosure is not limited to 22.2-ch to OBA conversion but is also applicable to conversion of any N.M channel-based audio, where N is a positive integer greater than seven and M is a positive integer greater than or equal to zero.
As used herein, the term “includes” and its variants are to be read as open-ended terms that mean “includes, but is not limited to.” The term “or” is to be read as “and/or” unless the context clearly indicates otherwise. The term “based on” is to be read as “based at least in part on.” The terms “one example embodiment” and “an example embodiment” are to be read as “at least one example embodiment.” The term “another embodiment” is to be read as “at least one other embodiment.” In addition, in the following description and claims, unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
In this application, 22.2-ch content 305 (e.g., a file or live stream) is received by format converter 301. The content 305 includes audio and associated metadata. The metadata includes the dmix_pos_adj_idx parameter for selecting one of two OAMD representations based on the value of this parameter. Channels that can be represented by OAMD bed channel labels use the OAMD bed channel labels. Channels that cannot be represented by OAMD bed channel labels use static object positions, where each static object position is described in OAMD [x, y, z] position coordinates, such as, for example, as described in ETSI TS 103 420 v1.2.1 (2018 October). As used herein, a “bed channel” is a group of multiple bed objects and a “bed object” is a static object whose spatial position is fixed by an assignment to a loudspeaker of a playback system.
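The selection and labeling logic described above can be sketched as follows. This is only an illustration: the channel names, bed labels and [x, y, z] coordinates in the two tables are placeholder values invented for the sketch, and the function name is hypothetical. The actual mappings are those of the referenced tables.

```python
from typing import Dict, List, Tuple, Union

Position = Tuple[float, float, float]
Assignment = Union[str, Position]  # a bed channel label, or a static [x, y, z] position

# Placeholder tables, one per value of the signaling parameter.
# All labels and coordinates here are illustrative assumptions.
OAMD_REPRESENTATIONS: Dict[int, Dict[str, Assignment]] = {
    0: {"FL": "L", "FR": "R", "FC": "C", "TpC": (0.5, 0.5, 1.0)},
    1: {"FL": "L", "FR": "R", "FC": "C", "TpC": (0.5, 0.4, 1.0)},
}


def convert_channels(dmix_pos_adj_idx: int, channels: List[str]) -> List[Dict[str, object]]:
    """Map each channel to a bed label when one exists, else to a static position."""
    table = OAMD_REPRESENTATIONS[dmix_pos_adj_idx]
    objects: List[Dict[str, object]] = []
    for name in channels:
        assignment = table[name]
        if isinstance(assignment, str):
            # Channel is representable by an OAMD bed channel label.
            objects.append({"channel": name, "bed_label": assignment})
        else:
            # Channel is not representable by a bed label: use a static object position.
            objects.append({"channel": name, "position": assignment})
    return objects
```

Keying the tables by the parameter value keeps the two representations side by side, so the only runtime decision is which table to read.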
Referring to the table in
OAMD assumes that bed objects precede dynamic objects. Additionally, bed objects appear in a specific order. For these reasons, the audio for the 22.2-ch content is reordered by audio channel shuffler 303 to satisfy the OAMD order constraints. Audio channel shuffler 303 receives channel shuffle information from metadata generator 304 and uses the channel shuffle information to reorder the 22.2 channels.
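A minimal sketch of how channel shuffle information might be derived and applied is given below. The bed ordering in `BED_ORDER` is a hypothetical placeholder and the function names are assumptions. The sketch illustrates only the two stated constraints: bed objects come first, and they appear in a specific order.

```python
from typing import List, Optional, Sequence

# Hypothetical required order of bed labels; the real order is defined by OAMD.
BED_ORDER = ["L", "R", "C", "LFE"]


def channel_shuffle(bed_labels: Sequence[Optional[str]]) -> List[int]:
    """Return source channel indices in output order.

    bed_labels[i] is the bed label of input channel i, or None for a
    channel represented as a non-bed object.
    """
    beds = [i for i, label in enumerate(bed_labels) if label is not None]
    others = [i for i, label in enumerate(bed_labels) if label is None]
    # Beds are sorted into the required order; other objects keep their input order.
    beds.sort(key=lambda i: BED_ORDER.index(bed_labels[i]))
    return beds + others


def reorder(channels: Sequence, shuffle: Sequence[int]) -> list:
    """Apply the shuffle information to a sequence of audio channels."""
    return [channels[i] for i in shuffle]
```

Representing the shuffle as a list of source indices lets the same information be computed once by the metadata generator and applied to every audio frame by the shuffler.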
Referring to the Table in
In an embodiment, a first metadata field includes the parameter warp_mode which if set to the value “0” indicates normal rendering (i.e., no warping) of objects in 5.1.X output configurations. If the warp_mode is set to the value “1” warping is applied to the objects in the 5.1.X output configuration. Warp refers to how the renderer deals with content that is panned between the midpoint and rear of a listening environment (e.g., a room). With warp, the content is presented at a constant level in the surround speakers between the rear and midpoint of the listening environment, avoiding any need for phantom imaging until it is in the front half of the listening environment.
A second metadata field in the dimensional trim metadata table includes per-configuration trims/balance controls for nine speaker configurations (e.g., 2.0, 5.1.0, 7.1.0, 2.1.2, 5.1.2, 7.1.2, 2.1.4, 5.1.4, 7.1.4), as shown in
With reference to the table of
OAMD allows each object to have an individual object gain (described by an object_gain field). This gain is applied by the object audio renderer 302. Object gain allows compensation of differences between downmix values of the 22.2-ch content and the rendering of the OAMD representations of the 22.2-ch content. In an embodiment, the object gain is set to −3 dB for objects with a bed channel assignment of LFE1 or LFE2 and 0 dB for all other objects. Other values for object gain can be used depending on the application.
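The gain assignment in this embodiment can be sketched as follows. The function names are hypothetical, and the conversion from decibels to a linear amplitude factor uses the standard relation of 10 raised to the power dB/20.

```python
from typing import Optional


def object_gain_db(bed_label: Optional[str]) -> float:
    """Return -3 dB for LFE1/LFE2 bed assignments and 0 dB for all other objects."""
    return -3.0 if bed_label in ("LFE1", "LFE2") else 0.0


def db_to_linear(gain_db: float) -> float:
    """Convert a gain in decibels to a linear amplitude factor."""
    return 10.0 ** (gain_db / 20.0)
```

A gain of −3 dB corresponds to a linear factor of about 0.708, so an LFE object would be scaled to roughly 71% of its amplitude before rendering.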
System 300 includes format converter 301 and object audio renderer 302. Format converter 301 further includes audio channel shuffler 303 and OAMD metadata generator 304. Some examples of OAMD metadata include but are not limited to content description metadata, property update metadata and trim data. The 22.2-ch content 305 (e.g., a file or live stream) includes 22.2-ch audio and metadata which is input into format converter 301. OAMD metadata generator 304 maps the 22.2-ch metadata to OAMD, such as, for example, in conformance with principles as described in reference to
System 400 includes format converter 401 and OBA encoder 402. Format converter 401 further includes OAMD metadata generator 404 and audio channel shuffler 403. Some examples of OAMD metadata include but are not limited to content description metadata, property update metadata and trim data. The 22.2-ch content 405 (e.g., a file or live stream) includes 22.2-ch audio and metadata which is input into format converter 401. OAMD metadata generator 404 maps the 22.2-ch metadata to OAMD, such as, for example, in conformance with principles as described in reference to
The output of format converter 401 is the reordered channels of audio and OAMD, which is input into OBA encoder 402. OBA encoder 402 encodes the audio using the OAMD (e.g., using JOC) to generate an OBA bitstream 406, which can be sent to an OBA playback device downstream, where it is rendered by an object audio renderer that processes the audio to adapt it to a particular loudspeaker layout.
System 500 includes format converter 501, object audio renderer 502 and decoder 506. Format converter 501 further includes OAMD metadata generator 504 and audio channel shuffler 503. Some examples of OAMD metadata include but are not limited to content description metadata, property update metadata and trim data. The audio bitstream 505 (e.g., AAC/MP4) includes 22.2-ch audio and metadata which is input into decoder 506 (e.g., an AAC/MP4 decoder). The output of decoder 506 is the 22.2-ch audio and metadata, which is input into format converter 501. OAMD metadata generator 504 maps the 22.2-ch metadata to OAMD, such as, for example, in conformance with principles as described in reference to
Referring to
Referring to
Referring to
The native audio bitstream 707 (e.g., AAC/MP4) includes 22.2-ch audio and metadata. The audio is input into core encoder 702 of encoder 701 which encodes the audio into the native audio format and outputs the encoded audio to bitstream packager 705. The OAMD metadata generator 704 maps the 22.2-ch metadata to OAMD, such as, for example, in conformance with principles as described in reference to
Referring to
Referring to
Referring to
The native audio bitstream 806 (e.g., AAC/MP4) includes 22.2-ch audio and metadata. The audio is input into core encoder 803 of encoder 801 which encodes the audio into the native audio format and outputs the encoded audio to bitstream packager 805. The OAMD metadata generator 804 maps the 22.2-ch metadata to OAMD, such as, for example, in conformance with principles as described in reference to
Referring to
The OAMD used to represent 22.2-ch content is static for a program. For this reason, it is desirable to avoid sending OAMD frequently, which would increase the data rate of the audio bitstream. This can be achieved by sending the static OAMD and channel shuffle information within a transport layer. When received, the OAMD and channel shuffle information are used by the OBA encoder for subsequent transmission over HDMI. An example transport layer is the base media file format (BMFF) described in ISO/IEC 14496-12 (MPEG-4 Part 12), which defines a general structure for time-based multimedia files, such as video and audio. In an embodiment that uses MPEG-DASH, the OAMD is included in a manifest.
Referring to
The native audio bitstream 901 (e.g., AAC/MP4) includes 22.2-ch audio and metadata. The audio is input into encoder 902 which encodes the audio into the native audio format and outputs the encoded audio to transport layer multiplexer 903. The OAMD metadata generator 904 maps the 22.2-ch metadata to OAMD, such as, for example, in conformance with principles as described in reference to
Referring to
Referring to
Referring to
The native audio bitstream 1005 (e.g., AAC/MP4) includes 22.2-ch audio and metadata. The audio is input into encoder 1001 which encodes the audio into the native audio format and outputs the encoded audio to transport layer multiplexer 1004. The OAMD metadata generator 1003 maps the 22.2-ch metadata to OAMD, such as, for example, in conformance with principles as described in reference to
Referring to
Transmitting Pre-Computed OAMD within MPEG-4 Audio or MPEG-D Audio Bitstreams
In an embodiment, OAMD representing 22.2 content is carried within a native audio bitstream, such as an MPEG-4 audio (ISO/IEC 14496-3) bitstream. An example syntax for three embodiments is provided below.
In the above example syntax, the element element_instance_tag is a number to identify the data stream element, and the element extension_payload(int) may be contained inside a fill element (ID_FIL). Each of the above three syntax embodiments describes a “tag” or “extension type” to indicate the meaning of additional data. In an embodiment, a signal can be inserted in the bitstream to indicate that additional OAMD and channel shuffle information are present in one of the three extension areas of the bitstream, so that the decoder does not have to check those areas when the signal is absent. For example, the MPEG4_ancillary_data field contains a dolby_surround_mode field with the following semantics. A similar signaling syntax can be used to indicate to a decoder that OAMD is present in the bitstream.
In an embodiment, the reserved field in the table above is used to indicate that a pre-computed OAMD payload is embedded somewhere in the extension data of the bitstream. The reserved value of (dolby_surround_mode=“11”) is used to indicate to a decoder that the extension data fields contain the required OAMD and channel information needed to convert 22.2 to OBA (e.g., Dolby® Atmos®). Alternatively, the reserved field indicates that the content is OBA compatible (e.g., Dolby® Atmos® compatible), and converting the 22.2-ch content to OBA is possible. Thus, if the dolby_surround_mode signal is set to the reserved value “11”, the decoder will know that the content is OBA compatible and convert the 22.2-ch content to OBA for further encoding and/or rendering.
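The decoder-side check implied by this signaling can be sketched as follows. The sketch assumes that dolby_surround_mode is a two-bit field, as the value “11” suggests. The extension type tag `EXT_OAMD`, the function names and the representation of extension data as (type, payload) pairs are all hypothetical.

```python
from typing import Iterable, Optional, Tuple

OAMD_SIGNAL = 0b11          # reserved value "11" of dolby_surround_mode
OAMD_EXT_TYPE = "EXT_OAMD"  # hypothetical extension-type tag, not from any standard


def oamd_is_signaled(dolby_surround_mode: int) -> bool:
    """Return True when the reserved value signals a pre-computed OAMD payload."""
    if not 0 <= dolby_surround_mode <= 0b11:
        raise ValueError("dolby_surround_mode is assumed to be a two-bit field")
    return dolby_surround_mode == OAMD_SIGNAL


def find_oamd(dolby_surround_mode: int,
              extensions: Iterable[Tuple[str, bytes]]) -> Optional[bytes]:
    """Scan the extension data for OAMD only when the signal is set."""
    if not oamd_is_signaled(dolby_surround_mode):
        return None  # no signal: skip the extension areas entirely
    for ext_type, payload in extensions:
        if ext_type == OAMD_EXT_TYPE:
            return payload
    return None
```

Checking the signal first means a decoder handling ordinary content never has to walk the extension areas, which is the benefit the signaling is intended to provide.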
In an embodiment, OAMD representing 22.2 content is carried within a native audio bitstream, such as MPEG-D USAC (ISO/IEC 23003-3) audio bitstream. An example syntax for such an embodiment is provided below.
In an embodiment, a low-noise block collects radio waves from a satellite dish and converts them to an analog signal that is sent through a coaxial cable to input port 1701 of STB/AVR 1700. The analog signal is converted to a digital signal by ADC 1702. The digital signal is demodulated by demodulator 1703 (e.g., QPSK demodulator) and synchronized and decoded by synchronizer/decoder 1704 (e.g., synchronizer plus Viterbi decoder) to recover the MPEG transport bitstream, which is demultiplexed by MPEG demultiplexer 1707 and decoded by MPEG decoder 1706 to recover channel-based audio and video bitstreams and metadata, including channel shuffle information and OAMD. Audio channel shuffler 1705 reorders the audio channels in accordance with the channel shuffle information, such as, for example, in conformance with principles as described in reference to
Note that the architecture described in reference to
While this document contains many specific implementation details, these should not be construed as limitations on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable sub combination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can, in some cases, be excised from the combination, and the claimed combination may be directed to a sub combination or variation of a sub combination. Logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. In addition, other steps may be provided, or steps may be eliminated, from the described flows, and other components may be added to, or removed from, the described systems. Accordingly, other implementations are within the scope of the following claims.
Various aspects of the present invention may be appreciated from the following enumerated example embodiments (EEEs):
EEE 1. A method comprising:
receiving, by one or more processors of an audio processing apparatus, a bitstream including channel-based audio and metadata;
the one or more processors configured to:
determining a first set of channels of the channel-based audio that are capable of being represented by OAMD bed channels;
assigning OAMD bed channel labels to the first set of channels;
determining a second set of channels of the channel-based audio that are not capable of being represented by OAMD bed channels; and
assigning static OAMD position coordinates to the second set of channels.
EEE 7. The method of any of EEEs 1-6, wherein the OAMD includes dimensional trim data to lower loudness levels of one or more out-of-screen audio objects in the rendered audio.
EEE 8. The method of any of EEEs 1-7, wherein the OAMD includes object gains used to compensate for differences between downmix values of the channel-based audio and rendering of OAMD representations of the channel-based audio.
EEE 9. A method comprising:
receiving, by one or more processors of an audio processing apparatus, a bitstream including channel-based audio and metadata;
the one or more processors configured to:
receiving, by one or more processors of an audio processing apparatus, a transport layer bitstream including a package;
the one or more processors configured to:
determining a first set of channels of the channel-based audio that are capable of being represented by OAMD bed channels;
assigning OAMD bed channel labels to the first set of channels;
determining a second set of channels of the channel-based audio that are not capable of being represented by OAMD bed channels; and
assigning static OAMD position coordinates to the second set of channels.
EEE 22. The method of any of EEEs 18-21, wherein the OAMD includes dimensional trim data to lower loudness levels of one or more out-of-screen objects in the rendered audio.
EEE 23. The method of any of EEEs 18-22, wherein the OAMD includes object gains used to compensate for differences between downmix values of the channel-based audio and rendering of OAMD representations of the channel-based audio.
EEE 24. The method of any of EEEs 18-23, wherein the transport bitstream is a moving pictures experts group (MPEG) audio bitstream that includes a signal that indicates the presence of OAMD in an extension field of the MPEG audio bitstream.
EEE 25. The method of any of EEEs 18-24, wherein the signal that indicates the presence of OAMD in the MPEG audio bitstream is included in a reserved field of a data structure in metadata of the MPEG audio bitstream for signaling a surround sound mode.
EEE 26. An apparatus comprising:
one or more processors; and
a non-transitory, computer-readable storage medium having instructions stored thereon that when executed by the one or more processors, cause the one or more processors to perform the methods of any of the preceding EEEs 1-25.
EEE 27. A non-transitory, computer-readable storage medium having instructions stored thereon that when executed by one or more processors, cause the one or more processors to perform the methods of any of the preceding EEEs 1-25.
Number | Date | Country | Kind |
---|---|---|---|
19212906.2 | Dec 2019 | EP | regional |
This application claims priority of U.S. Provisional Patent Application No. 62/942,322, filed Dec. 2, 2019, and EP Patent Application No. 19212906.2, filed Dec. 2, 2019, both of which are hereby incorporated by reference in their entireties.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2020/062873 | 12/2/2020 | WO |
Number | Date | Country | |
---|---|---|---|
62942322 | Dec 2019 | US |