The present disclosure generally relates to processing audio coded in different formats. More particularly, embodiments of the present disclosure relate to a method that generates a plurality of output frames by performing rendering based on audio coded in an object-based format and audio coded in a channel-based format.
Media content is deliverable via one or more communication networks (e.g., Wi-Fi, Bluetooth, LTE, USB) to many different types of playback systems/devices (e.g., televisions, computers, tablets, smartphones, home audio systems, streaming devices, automotive infotainment systems, portable audio systems, and the like) where it is consumed by a user (e.g., viewed or heard by one or more users of a media playback system). In the media delivery chain, adaptive streaming (adaptive bit rate or ABR streaming) allows for improved resource management through adaptive selection of a bit rate on a media ladder based on network conditions, playback buffer status, shared network capacity, and other factors influenced by the network.
In a typical ABR streaming application, as network conditions degrade during playback of a media asset (e.g., a video or audio file), the playback device adapts by requesting lower bit rate frames of content (e.g., to maintain Quality of Experience, avoid buffering, etc.). In certain streaming applications, bit rate may be adjusted by delivering lower resolution portions of content (e.g., frames of audio) or by delivering content in a different format which preserves bandwidth (e.g., frames of a lower bit rate audio file format are delivered in place of frames of a higher bit rate format).
It is an object of the present disclosure to provide methods for processing object-based audio and channel-based audio content.
According to one aspect of the present disclosure, such a method enables switching between object-based audio content (such as Dolby Atmos) and channel-based audio content (such as 5.1 or 7.1 content). This is, for example, advantageous in the context of adaptive streaming. As an example, while object-based audio content (e.g., Dolby Atmos content) is being streamed to a compatible playback system, such as an automotive playback system or mobile device, the playback system may request and begin receiving lower bit rate channel-based audio frames in response to a reduction in available network bandwidth. Conversely, while channel-based audio content (e.g., 5.1 content) is being streamed to a compatible playback system, the playback system may request and begin receiving object-based audio frames in response to an improvement in available network bandwidth.
However, the inventors have found that, without any special handling of the transitions, discontinuities, mixing of unrelated channels, and unwanted gaps may occur when switching from channel-based audio to object-based audio and vice versa. For example, when transitioning from object-based audio (e.g., Dolby Digital Plus (DD+) with Dolby Atmos content, e.g., DD+ Joint Object Coding (JOC)) to channel-based audio (e.g., Dolby Digital Plus 5.1, 7.1, etc.), a hard end of rear surround/height signals and a hard start of mixed-in signals may occur. Likewise, when transitioning from channel-based audio (e.g., Dolby Digital Plus 5.1, 7.1, etc.) to object-based audio (e.g., Dolby Digital Plus with Dolby Atmos content), a hard end of the mixed-in rear surround/height signals in the 5.1 subset of speakers and a hard start of the rear surround/height speaker feeds may occur. Additionally, when switching from channel-based audio to object-based audio, the channels may not be ordered correctly, leading to audio being rendered in the wrong positions and a mix of unrelated channels for a brief time period.
The present disclosure describes strategies for handling switches between object-based audio and channel-based audio that address some of the issues described above and provide several advantages.
The method of the present disclosure is advantageous when switching between an object-based audio format and a channel-based audio format, particularly in the context of adaptive streaming of object-based audio. However, the invention is not limited to adaptive streaming and can also be applied in other scenarios wherein switching between object-based audio and channel-based audio is desirable.
According to an embodiment of the invention, a method is provided that comprises: receiving a first frame of audio of a first format and receiving a second frame of audio of a second format different from the first format. The second frame is for playback subsequent to the first frame. The first format is an object-based audio format and the second format is a channel-based audio format or vice versa. The first frame of audio is decoded into a decoded first frame and the second frame of audio is decoded into a decoded second frame. A plurality of output frames of a third format is generated by performing rendering based on the decoded first frame and the decoded second frame.
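By way of illustration only, the overall flow of this embodiment can be sketched as follows in Python; the function and object names (process_format_switch, decoder, renderer) are hypothetical placeholders and not part of the disclosure:

```python
# Hypothetical, simplified sketch of the claimed processing flow.
def process_format_switch(first_frame, second_frame, decoder, renderer):
    """Decode two consecutive frames of different formats and render output frames
    of a third format (e.g., PCM)."""
    decoded_first = decoder.decode(first_frame)    # e.g., object-based (DD+ JOC) frame
    decoded_second = decoder.decode(second_frame)  # e.g., channel-based (5.1/7.1) frame
    # Rendering is based on both decoded frames so the transition can be handled.
    return renderer.render(decoded_first, decoded_second)
```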
The present disclosure further relates to an electronics device comprising one or more processors and a memory storing one or more programs configured to be executed by the one or more processors, the one or more programs including instructions for performing the method of the invention. The present disclosure further relates to a vehicle comprising said electronics device, such as a car comprising said electronics device.
The following description sets forth exemplary methods, parameters and the like. It should be recognized, however, that such description is not intended as a limitation on the scope of the present disclosure but is instead provided as a description of exemplary embodiments.
Artifacts occurring at the switch points between object-based (Atmos) and channel-based decoding/playback are mitigated by modifying both the audio data and the object audio metadata (OAMD).
The timing diagram 300 includes three columns that indicate the content type of the input frames: either object-based content (first and last column) or channel-based content (middle column). In the example, six input frames 302 are indicated, with input frames 302-1 and 302-2 comprising object-based content, input frames 302-3 and 302-4 comprising channel-based content and input frames 302-5 and 302-6 comprising object-based content. In the example, the object-based content comprises Dolby Atmos content. However, the invention can be used with other object-based formats as well.
The input frames are extracted from one or more bitstreams. For example, a single bitstream that supports an object-based audio format and a channel-based audio format is used, such as a DD+ JOC (Dolby Digital Plus Joint Object Coding) bitstream or an AC-4 bitstream. In an example, the input frames 302 are received in accordance with an adaptive streaming protocol, such as MPEG-DASH, HTTP Live Streaming (HLS), or Low-Latency HLS. In such an example, the decoder may request audio in a channel-based format when available bandwidth is relatively low, while requesting audio in an object-based format when available bandwidth is relatively high.
The decoder generates output frames 304 based on the input frames 302.
Each input frame 302 and output frame 304 includes L samples. In the example, L is equal to 1536 samples, corresponding to the number of samples used per input frame in a Dolby Digital Plus (DD+) bitstream or a DD+JOC bitstream. However, the invention is not limited to this specific number of samples or to a specific bitstream format.
The timing diagram 300 indicates the decoder delay as D.
In diagram 300, the output frames 304 have been shifted to the left by D samples with respect to their actual timing, to better illustrate the relation between input frames 302 and output frames 304.
The first D samples of output frame 304-2 are generated based on the last D samples of input frame 302-1. The remaining R samples of output frame 304-2 are generated based on the first R samples of input frame 302-2, wherein R=L-D. In the example, output frame 304-2 is an object-based output frame generated based on object-based input frames 302-1 and 302-2.
For output frame 304-1, the diagram 300 shows the last R samples only. The last R samples of output frame 304-1 are generated from the first R samples of input frame 302-1.
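By way of illustration, and assuming L = 1536 samples per frame and an arbitrary decoder delay D (the value of D below is illustrative only), the assembly of an output frame from the tail of the previous decoded input frame and the head of the current one can be sketched as:

```python
import numpy as np

L = 1536   # samples per frame (DD+/DD+ JOC example from the description)
D = 256    # decoder delay in samples; illustrative value only
R = L - D  # remaining samples taken from the current input frame

def assemble_output_frame(prev_decoded: np.ndarray, curr_decoded: np.ndarray) -> np.ndarray:
    """Output frame = last D samples of the previous decoded input frame
    followed by the first R samples of the current decoded input frame."""
    assert prev_decoded.shape[-1] == L and curr_decoded.shape[-1] == L
    return np.concatenate((prev_decoded[..., -D:], curr_decoded[..., :R]), axis=-1)
```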
When decoding the first channel-based frame (frame 302-3), the decoder output switches from “OBJ_OUT” to “DMX_OUT”. In the context of the present application, DMX_OUT indicates output that corresponds to a channel-based format, such as 5.1 or 7.1. DMX_OUT may or may not involve downmixing at the decoder. In particular, “DMX_OUT” may be obtained (a) by downmixing object-based input at the decoder or (b) directly from channel-based input (without downmixing at the decoder).
The decoder generates output frame 304-3 using the object-based input frame 302-2 and channel-based input frame 302-3. The first D samples of frame 304-3 are still generated from object-based content, but already rendered to channels, for example 5.1 or 7.1, i.e. by downmixing the object-based content. The last R samples of output frame 304-3 are generated directly from channel-based input 302-3, i.e. without downmixing by the decoder.
Optionally, an object audio renderer (OAR) is used to render both object-based audio (e.g., frame 304-2) and channel-based audio (e.g., frame 304-3), instead of switching between an OAR and a dedicated channel-based renderer. Using an OAR for both object-based audio and channel-based audio avoids artifacts due to switching between different renderers. When using an OAR for rendering channel-based content, no object audio metadata (OAMD) is available, so the decoder creates an artificial payload 306 with an offset pointing to the beginning of the frame 304-3 and no ramp (the ramp length is set to zero). The artificial payload 306 comprises OAMD with position data that reflects the position of the channels, e.g., according to a 5.1 or 7.1 format. In other words, the decoder generates OAMD for mapping the audio data of frame 304-3 to the positions of the channels of a channel-based format (“bed objects”), e.g., standard speaker positions. In this example, DMX_OUT may thus be considered as channel-based output wrapped in an object-based format, to enable using an OAR to render both channel-based content and object-based content. The artificial payload 306 for channel-based audio generally differs from the preceding OAMD corresponding to object-based audio (“OAMD2” in the example).
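A minimal sketch of how such an artificial payload might be synthesized, assuming a simple dictionary representation of the metadata; the field names and the 5.1 channel labels are illustrative only and do not reflect the actual OAMD bitstream syntax:

```python
# Illustrative only: real OAMD is a binary payload with its own defined syntax.
BED_CHANNELS_5_1 = ["L", "R", "C", "LFE", "Ls", "Rs"]  # assumed 5.1 channel labels

def make_artificial_oamd(channel_labels=BED_CHANNELS_5_1):
    """Create OAMD-like metadata that maps each channel to a fixed speaker position
    ("bed objects"), with an offset at the start of the frame and no ramp."""
    return {
        "sample_offset": 0,   # points to the beginning of the frame
        "ramp_duration": 0,   # no ramp: new mixing coefficients apply immediately
        "objects": [
            {"type": "bed", "position": label, "gain_db": 0.0}
            for label in channel_labels
        ],
    }
```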
Finally, PCM output frames 310 are generated from the OAR output frames. In the case of object-based audio frames 304-2 and 304-6, generating the PCM output frames may comprise rendering object-based audio to channels of PCM audio including height channels for driving overhead speakers, such as 5.1.2 PCM audio (8 channels) or 7.1.4 PCM audio (12 channels).
Discontinuities in both the OAMD data and the PCM data are at least partially concealed by multiplying the signal with a “notch window” 312 consisting of a short fade-out followed by a short fade-in around the switching point. In the example, 32 samples prior to the switching point are still available from the last output 308-2 due to the limiter delay dL; therefore a ramp length of 32 samples (33 including the 0) is used. The output 308-2 is faded out over 32 samples, while the output 308-3 is faded in over 32 samples. The invention is not limited to 32 samples: shorter or longer ramp lengths can be considered. For example, the fade-in and fade-out may have a ramp length of at least 32 samples, such as between 32 and 64 samples or between 32 and 128 samples.
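A possible realization of the notch window, sketched with NumPy under the assumption of a linear ramp shape and the 32-sample length used in the example (other shapes and lengths may equally be used):

```python
import numpy as np

def apply_notch_window(prev_out: np.ndarray, next_out: np.ndarray, ramp: int = 32):
    """Fade out the last `ramp` samples of the previous output (e.g., 308-2) and fade in
    the first `ramp` samples of the next output (e.g., 308-3) around the switching point."""
    fade_out = np.linspace(1.0, 0.0, ramp)  # reaches 0 at the switching point
    fade_in = np.linspace(0.0, 1.0, ramp)
    prev_out = prev_out.copy()
    next_out = next_out.copy()
    prev_out[..., -ramp:] *= fade_out
    next_out[..., :ramp] *= fade_in
    return prev_out, next_out
```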
When decoding the first object-based frame (frame 302-5) after some channel-based frames (302-3, 302-4), the decoder output for frame 304-5 is still in “DMX_OUT” format (e.g. 5.1 or 7.1). In the example, the first D samples of output frame 304-5 are generated from the last D samples of channel-based input frame 302-4, while the last R samples of output frame 304-5 are generated by downmixing the first R samples of object-based input frame 302-5. The next output frame 304-6 is in object-based format. The first D samples of 304-6 are generated from the last D samples of input frame 302-5, while the last R samples are generated from the first R samples of input frame 302-6. Both input frames 302-5 and 302-6 are in an object-based format, so no downmixing is applied for generating frame 304-6.
The OAMD data from the bitstream is modified such that it starts at the beginning of the next frame 302-6 (offset D) and indicates a ramp duration of 0 so that no unwanted crosstalk can occur due to ramping towards the incompatible “OBJ_OUT” channel order.
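Continuing the illustrative dictionary representation used above (field names are hypothetical), the modification of the bitstream OAMD at this switch point could be sketched as:

```python
def shift_oamd_past_switch(oamd: dict, decoder_delay: int) -> dict:
    """Delay the OAMD so that it takes effect at the beginning of the next frame
    (offset D) and disable ramping to avoid crosstalk across the channel-order change."""
    modified = dict(oamd)
    modified["sample_offset"] = decoder_delay  # start of the next frame
    modified["ramp_duration"] = 0              # no ramp towards the incompatible order
    return modified
```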
When generating output frame 310-6, a fading notch 314 (similar to fading notch 312) is applied in order to at least partially conceal discontinuities in the signal and the metadata.
OAMD in a bitstream delivering frames of object-based audio (e.g., Atmos content) contains positional data and gain data for each object at a certain point in time. In addition, the OAMD contains a ramp duration that indicates to the renderer how far in advance the mixer (mixing input objects to output channels) should start transitioning from the previous mixing coefficients towards the mixing coefficients calculated from the (new) OAMD. Disabling the OAMD ramp is done by manipulating the ramp duration in the OAMD from the bitstream (e.g., setting the ramp duration to 0 (zero)).
At block 402, the device (e.g., head unit 220 and/or amplifier 230) receives a first frame of audio of a first format (e.g., object-based audio).
At block 404, the device receives a second frame of audio of a second format different from the first format (e.g., channel-based audio, such as 5.1 or 7.1, such as DD+ 5.1), the second frame being for playback subsequent to the first frame (e.g., immediately subsequent or adjacent to the first frame of audio with respect to an intended playback sequence, following the first frame of audio for playback, subsequent in playback order or sequence). In some embodiments, the first format is an object-based audio format and the second format is a channel-based audio format. In some embodiments, the first format is a channel-based audio format and the second format is an object-based audio format.
In some embodiments, the first frame of audio and the second frame of audio are received by the device in a first bitstream (e.g., a DD+ bitstream or a DD+ JOC bitstream). In some embodiments, the first frame of audio and the second frame of audio are delivered in accordance with an adaptive streaming protocol (e.g., via a bitstream managed by an adaptive streaming protocol). In some embodiments, the adaptive streaming protocol is MPEG-DASH, HTTP Live Streaming (HLS), Low-Latency HLS (LL-HLS), or the like.
At blocks 406 and 408, the device decodes the first frame of audio into a decoded first frame and the second frame of audio into a decoded second frame, respectively (e.g., using decoder 104).
In some embodiments, decoding an object-based audio frame includes modifying object audio metadata (OAMD) associated with said frame of object-based audio.
In some embodiments, modifying object audio metadata (OAMD) includes modifying one or more values associated with object positional data. For example, when switching from object-based to channel-based, i.e. when the first frame is in an object-based format and the second frame is in a channel-based format, modifying the OAMD may include: providing OAMD that includes position data specifying the positions of the channels of the channel-based format. In other words, the OAMD specifies bed objects. For example, the OAMD of the object-based format is replaced, for a downmixed portion of the object-based format, by OAMD that specifies bed objects.
In some embodiments, modifying object audio metadata (OAMD) includes setting a ramp duration to zero. The ramp duration is provided in the OAMD for specifying a transition duration from previous rendering parameters (such as mixing coefficients) to current rendering parameters, wherein the previous rendering parameters are derived from previous OAMD and the current rendering parameters are derived from said OAMD. The transition may for example be performed by interpolation of rendering parameters over a time span corresponding to the ramp duration. In a further example, the ramp duration is set to zero when switching from channel-based to object-based, i.e. when the first frame is in the channel-based format and the second frame is in the object-based format.
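To illustrate the effect of the ramp duration (a sketch only, not the actual renderer implementation), per-sample mixing coefficients may be interpolated from their previous values to the new values over the ramp; with a ramp of zero, the new coefficients apply immediately:

```python
import numpy as np

def mixing_gains_over_time(prev_gains, new_gains, ramp, num_samples):
    """Return per-sample mixing gains transitioning from prev_gains to new_gains
    over `ramp` samples; gains are held at new_gains afterwards."""
    prev_gains = np.asarray(prev_gains, dtype=float)
    new_gains = np.asarray(new_gains, dtype=float)
    gains = np.tile(new_gains, (num_samples, 1))
    ramp = min(ramp, num_samples)
    if ramp > 0:
        t = np.linspace(0.0, 1.0, ramp)[:, None]  # interpolation factor per sample
        gains[:ramp] = (1.0 - t) * prev_gains + t * new_gains
    # ramp == 0: the new gains take effect from the very first sample.
    return gains
```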
In some embodiments, setting an object audio metadata (OAMD) ramp duration associated with the second frame of audio to zero is performed while the renderer maintains a non-reset state (e.g., while refraining from resetting the renderer).
In some embodiments, modifying object audio metadata (OAMD) includes applying a time offset (e.g., to align associated OAMD with a frame boundary). The time offset for example corresponds to the latency of the decoding process. In a further example, the offset is applied to the OAMD when switching from channel-based to object-based, i.e. when the first frame is in the channel-based format and the second frame is in the object-based format.
At block 410, the device generates a plurality of output frames of a third format (e.g., PCM 5.1.4, PCM 7.1.4, etc.) by performing rendering (412) based on the decoded first frame and the decoded second frame (e.g., using an object audio renderer).
In some embodiments, after rendering, the device performs one or more fading operations (e.g., fade-ins and/or fade-outs) to resolve output discontinuities (e.g., hard starts, hard ends, pops, glitches, etc.). In some embodiments, the one or more fading operations (e.g., fade-ins and/or fade-outs) have a fixed length (e.g., 32 samples, less than 32 samples, more than 32 samples). In some embodiments, the one or more fading operations are performed on non-LFE (low frequency effects) channels, i.e. the one or more fading operations are not performed on the LFE channel. In a further embodiment, the fading operations are combined with modifying the OAMD of the object-based audio to set a ramp duration to zero.
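A sketch of such a fading operation that leaves the LFE channel untouched; the channel ordering and the LFE index are assumptions made for illustration only:

```python
import numpy as np

def fade_non_lfe(pcm: np.ndarray, lfe_index: int = 3, ramp: int = 32, fade_in: bool = True):
    """Apply a linear fade to all channels except the LFE channel.
    `pcm` has shape (num_channels, num_samples); lfe_index is an assumed layout choice."""
    ramp = min(ramp, pcm.shape[1])
    curve = np.linspace(0.0, 1.0, ramp) if fade_in else np.linspace(1.0, 0.0, ramp)
    region = slice(0, ramp) if fade_in else slice(pcm.shape[1] - ramp, pcm.shape[1])
    out = pcm.copy()
    for ch in range(out.shape[0]):
        if ch != lfe_index:
            out[ch, region] *= curve
    return out
```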
In some embodiments, generating a plurality of output frames of a third format includes downmixing the frame of audio of the object-based audio format. In some embodiments, generating a plurality of output frames of a third format includes generating a hybrid output frame that includes two portions, wherein said generating the hybrid output frame comprises: obtaining one portion of the hybrid output frame by downmixing a portion of the frame of audio of the object-based audio format while optionally foregoing downmixing on a remaining portion of the frame of audio of the object-based format; and obtaining the other portion of the hybrid output frame from a portion of the frame of audio of the channel-based audio format.
In a first example, the first frame is of an object-based audio format and the second frame is of a channel-based format. In other words, the input switches from object-based to channel-based. In such example, the hybrid output frame starts with a portion that is generated from downmixing a final portion of the first (object-based) frame and ends with a portion that is obtained from a first portion of the second (channel-based) frame. In a more specific example, the hybrid output frame, the first frame and the second frame each include L samples. The first D samples of the hybrid output frame are obtained from the downmixed last D samples of the first (object-based) frame, while the last L-D samples of the hybrid output frame are obtained from the first L-D samples of the second (channel-based) frame.
In a second example, the first frame is of a channel-based audio format and the second frame is of an object-based format. In other words, the input switches from channel-based to object-based. In such an example, the hybrid output frame starts with a portion that is generated from the first (channel-based) frame and ends with a portion that is obtained from downmixing a first portion of the second (object-based) frame. In a more specific example, the hybrid output frame includes L samples, of which the first D samples are obtained from the last D samples of the first (channel-based) frame, while the last L-D samples of the hybrid output frame are obtained from the downmixed first L-D samples of the second (object-based) frame.
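The two examples can be sketched as follows, assuming decoded frames with shape (channels, L), a decoder delay D, and a hypothetical downmix() helper that renders object-based samples to the channel-based layout:

```python
import numpy as np

def hybrid_frame_obj_to_chan(decoded_obj, decoded_chan, D, downmix):
    """First example: downmixed last D samples of the object-based frame, followed by
    the first L-D samples of the channel-based frame."""
    L = decoded_obj.shape[-1]
    return np.concatenate((downmix(decoded_obj)[..., -D:],
                           decoded_chan[..., :L - D]), axis=-1)

def hybrid_frame_chan_to_obj(decoded_chan, decoded_obj, D, downmix):
    """Second example: last D samples of the channel-based frame, followed by the
    downmixed first L-D samples of the object-based frame."""
    L = decoded_chan.shape[-1]
    return np.concatenate((decoded_chan[..., -D:],
                           downmix(decoded_obj)[..., :L - D]), axis=-1)
```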
In some embodiments, a duration of the portion of the frame of audio of the object-based format, i.e. the portion that is downmixed, is based on a latency of an associated decoding process.
For example, in the examples above, D may represent a latency or delay of an associated decoding process, and the portion to be downmixed may correspond to e.g. D or L-D.
In some embodiments, the plurality of output frames includes PCM audio. In some embodiments, the PCM audio is subsequently processed by the device to generate speaker signals appropriate for a specific reproduction environment (e.g., a particular speaker configuration in a particular acoustic space). For example, a system for reproduction comprises multiple speakers for playback of a 5.1.2 format, a 7.1.4 format or other immersive audio format. A speaker system for playback of a 5.1.2 format may for example include a left (L) speaker, a center (C) speaker and a right (R) speaker, a right surround (Rs) speaker and a left surround (Ls) speaker, a subwoofer (low-frequency effects, LFE) and two height speakers in the form of a Top Left (TL) and a Top Right (TR) speaker. However, the present disclosure is not limited to a specific audio system or a specific number of speakers or speaker configuration.
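For illustration, the 5.1.2 layout described above may be represented as a simple channel list; the labels and ordering are conventional assumptions, not a normative mapping:

```python
# Assumed channel labels/order for a 5.1.2 reproduction layout (8 channels).
SPEAKER_LAYOUT_5_1_2 = ["L", "R", "C", "LFE", "Ls", "Rs", "TL", "TR"]
assert len(SPEAKER_LAYOUT_5_1_2) == 8  # matches the 8-channel PCM example above
```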
It should be understood that the particular order in which the operations have been described is exemplary and is not intended to indicate that the described order is the only order in which the operations could be performed.
The method described in the present disclosure may be implemented in hardware or software. In an example, an electronics device is provided that comprises one or more processors and a memory storing one or more programs configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods described in the present disclosure. Such an electronics device may be used for implementing the invention in a vehicle, such as a car.
In an embodiment, the vehicle may comprise a loudspeaker system for playback of audio. For example, the loudspeaker system includes surround loudspeakers and optionally height speakers, for playback. The electronics device implemented in the vehicle is configured to receive an audio stream by means of adaptive streaming, wherein the electronics device requests audio in a channel-based format when available bandwidth is relatively low, while requesting audio in an object-based format when available bandwidth is relatively high. For example, when the available bandwidth is lower than a first threshold, the electronics device requests audio in a channel-based format (e.g. 5.1 audio), while when the available bandwidth exceeds a second threshold, the electronics device requests audio in an object-based format (e.g. DD+ JOC). The electronics device implements the method of the present disclosure for switching between object-based and channel-based audio, and the speaker system of the vehicle is provided with the output frames generated by the method. The output frames may be provided directly to the speaker system of the vehicle, or further audio processing steps may be performed. Such further audio processing steps may for example include speaker mapping or cabin tuning.
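A sketch of the requesting logic described above, with two bandwidth thresholds providing simple hysteresis; the threshold values and the function name are illustrative assumptions only:

```python
def select_requested_format(bandwidth_kbps: float, current_format: str,
                            first_threshold_kbps: float = 256.0,
                            second_threshold_kbps: float = 768.0) -> str:
    """Request channel-based audio below the first threshold, object-based audio above
    the second threshold, and keep the current format in between."""
    if bandwidth_kbps < first_threshold_kbps:
        return "channel-based"   # e.g., DD+ 5.1
    if bandwidth_kbps > second_threshold_kbps:
        return "object-based"    # e.g., DD+ JOC
    return current_format
```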
As used herein, unless otherwise specified the use of the ordinal adjectives “first”, “second”, “third”, etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
In the claims below and the description herein, any one of the terms comprising, comprised of or which comprises is an open term that means including at least the elements/features that follow, but not excluding others. Thus, the term comprising, when used in the claims, should not be interpreted as being limitative to the means or elements or steps listed thereafter. For example, the scope of the expression a device comprising A and B should not be limited to devices consisting only of elements A and B. Any one of the terms including or which includes or that includes as used herein is also an open term that also means including at least the elements/features that follow the term, but not excluding others. Thus, including is synonymous with and means comprising.
As used herein, the term “exemplary” is used in the sense of providing examples, as opposed to indicating quality. That is, an “exemplary embodiment” is an embodiment provided as an example, as opposed to necessarily being an embodiment of exemplary quality.
It should be appreciated that in the above description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate embodiment of this invention.
Furthermore, while some embodiments described herein include some, but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention, and form different embodiments, as would be understood by those skilled in the art. For example, in the following claims, any of the claimed embodiments can be used in any combination.
Furthermore, some of the embodiments are described herein as a method or combination of elements of a method that can be implemented by a processor of a computer system or by other means of carrying out the function. Thus, a processor with the necessary instructions for carrying out such a method or element of a method forms a means for carrying out the method or element of a method. Furthermore, an element described herein of an apparatus embodiment is an example of a means for carrying out the function performed by the element for the purpose of carrying out the invention.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In other instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it is to be noticed that the term coupled, when used in the claims, should not be interpreted as being limited to direct connections only. The terms “coupled” and “connected,” along with their derivatives, may be used. It should be understood that these terms are not intended as synonyms for each other. Thus, the scope of the expression a device A coupled to a device B should not be limited to devices or systems wherein an output of device A is directly connected to an input of device B. It means that there exists a path between an output of A and an input of B which may be a path including other devices or means. “Coupled” may mean that two or more elements are either in direct physical or electrical contact, or that two or more elements are not in direct contact with each other but yet still co-operate or interact with each other.
Thus, while there have been described specific embodiments of the invention, those skilled in the art will recognize that other and further modifications may be made thereto without departing from the spirit of the invention, and it is intended to claim all such changes and modifications as falling within the scope of the invention. For example, any formulas given above are merely representative of procedures that may be used. Functionality may be added or deleted from the block diagrams and operations may be interchanged among functional blocks. Steps may be added or deleted to methods described within the scope of the present invention.
This application claims the benefit of priority from U.S. Provisional Application No. 63/227,222 (reference: D21069USP1) filed on 29 Jul. 2021, which is hereby incorporated by reference in its entirety.