The present disclosure is generally related to audio rendering.
Advances in technology have resulted in smaller and more powerful computing devices. For example, a variety of portable personal computing devices, including wireless telephones such as mobile and smart phones, tablets, and laptop computers, are small, lightweight, and easily carried by users. These devices can communicate voice and data packets over wireless networks. Further, many such devices incorporate additional functionality such as a digital still camera, a digital video camera, a digital recorder, and an audio file player. Also, such devices can process executable instructions, including software applications, such as a web browser application, that can be used to access the Internet. As such, these devices can include significant computing and networking capabilities.
A content provider may provide encoded multimedia streams to a decoder of a user device. For example, the content provider may provide encoded audio streams and encoded video streams to the decoder of the user device. The decoder may decode the encoded multimedia streams to generate decoded video and decoded audio. A multimedia renderer may render the decoded video to generate rendered video, and the multimedia renderer may render the decoded audio to generate rendered audio. The rendered audio may be projected (e.g., output) using an output audio device. For example, the rendered audio may be projected using speakers, sound bars, headphones, etc. The rendered video may be displayed using a display device. For example, the rendered video may be displayed using a television, a monitor, a mobile device screen, etc.
However, the rendered audio and the rendered video may be sub-optimal based on user preferences, user location, or both. As a non-limiting example, a user of the user device may move to a location where a listening experience associated with the rendered audio is sub-optimal, a viewing experience associated with the rendered video is sub-optimal, or both. Further, the user device may not provide the user with the capability to adjust the audio to the user's preference via an intuitive interface, such as by modifying the location and audio level of individual sound sources within the rendered audio. As a result, the user may have a reduced user experience.
According to one implementation of the present disclosure, an apparatus includes a network interface configured to receive a media stream from an encoder. The media stream includes encoded audio and metadata associated with the encoded audio. The metadata is usable to determine three-dimensional audio rendering information for different portions of the encoded audio. The apparatus also includes an audio decoder configured to decode the encoded audio to generate decoded audio. The audio decoder is also configured to detect a sensor input and modify the metadata based on the sensor input to generate modified metadata. The apparatus further includes an audio renderer configured to render the decoded audio based on the modified metadata to generate rendered audio having three-dimensional sound attributes. The apparatus also includes an output device configured to output the rendered audio.
According to another implementation of the present disclosure, a method of rendering audio includes receiving a media stream from an encoder. The media stream includes encoded audio and metadata associated with the encoded audio. The metadata is usable to determine three-dimensional audio rendering information for different portions of the encoded audio. The method also includes decoding the encoded audio to generate decoded audio. The method further includes detecting a sensor input and modifying the metadata based on the sensor input to generate modified metadata. The method also includes rendering the decoded audio based on the modified metadata to generate rendered audio having three-dimensional sound attributes. The method also includes outputting the rendered audio.
According to another implementation of the present disclosure, a non-transitory computer-readable medium includes instructions for rendering audio. The instructions, when executed by a processor within a rendering device, cause the processor to perform operations including receiving a media stream from an encoder. The media stream includes encoded audio and metadata associated with the encoded audio. The metadata is usable to determine three-dimensional audio rendering information for different portions of the encoded audio. The operations also include decoding the encoded audio to generate decoded audio. The operations further include detecting a sensor input and modifying the metadata based on the sensor input to generate modified metadata. The operations also include rendering the decoded audio based on the modified metadata to generate rendered audio having three-dimensional sound attributes. The operations also include outputting the rendered audio.
According to another implementation of the present disclosure, an apparatus includes means for receiving a media stream from an encoder. The media stream includes encoded audio and metadata associated with the encoded audio. The metadata is usable to determine three-dimensional audio rendering information for different portions of the encoded audio. The apparatus also includes means for decoding the encoded audio to generate decoded audio. The apparatus further includes means for detecting a sensor input and means for modifying the metadata based on the sensor input to generate modified metadata. The apparatus also includes means for rendering the decoded audio based on the modified metadata to generate rendered audio having three-dimensional sound attributes. The apparatus also includes means for outputting the rendered audio.
According to another implementation of the present disclosure, an apparatus includes a network interface configured to receive an audio bitstream. The audio bitstream includes encoded audio associated with one or more audio objects and audio metadata indicating one or more sound attributes of the one or more audio objects. The apparatus also includes a memory configured to store the encoded audio and the audio metadata. The apparatus further includes a controller configured to receive an indication to adjust a particular sound attribute of the one or more sound attributes. The particular sound attribute is associated with a particular audio object of the one or more audio objects. The controller is also configured to modify the audio metadata, based on the indication, to generate modified audio metadata.
According to another implementation of the present disclosure, a method of processing an encoded audio signal includes receiving an audio bitstream. The audio bitstream includes encoded audio associated with one or more audio objects and audio metadata indicating one or more sound attributes of the one or more audio objects. The method also includes storing the encoded audio and the audio metadata. The method further includes receiving an indication to adjust a particular sound attribute of the one or more sound attributes. The particular sound attribute is associated with a particular audio object of the one or more audio objects. The method also includes modifying the audio metadata, based on the indication, to generate modified audio metadata.
According to another implementation of the present disclosure, a non-transitory computer-readable medium includes instructions for processing an encoded audio signal. The instructions, when executed by a processor, cause the processor to perform operations including receiving an audio bitstream. The audio bitstream includes encoded audio associated with one or more audio objects and audio metadata indicating one or more sound attributes of the one or more audio objects. The operations also include receiving an indication to adjust a particular sound attribute of the one or more sound attributes. The particular sound attribute is associated with a particular audio object of the one or more audio objects. The operations also include modifying the audio metadata, based on the indication, to generate modified audio metadata.
According to another implementation of the present disclosure, an apparatus includes means for receiving an audio bitstream. The audio bitstream includes encoded audio associated with one or more audio objects and audio metadata indicating one or more sound attributes of the one or more audio objects. The apparatus also includes means for storing the encoded audio and the audio metadata. The apparatus also includes means for receiving an indication to adjust a particular sound attribute of the one or more sound attributes. The particular sound attribute is associated with a particular audio object of the one or more audio objects. The apparatus also includes means for modifying the audio metadata, based on the indication, to generate modified audio metadata.
Other aspects, advantages, and features of the present disclosure will become apparent after review of the entire application, including the following sections: Brief Description of the Drawings, Detailed Description, and the Claims.
Particular implementations of the present disclosure are described below with reference to the drawings. In the description, common features are designated by common reference numbers throughout the drawings. As used herein, various terminology is used for the purpose of describing particular implementations only and is not intended to be limiting. For example, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It may be further understood that the terms “comprise,” “comprises,” and “comprising” may be used interchangeably with “include,” “includes,” or “including.” Additionally, it will be understood that the term “wherein” may be used interchangeably with “where.” As used herein, “exemplary” may indicate an example, an implementation, and/or an aspect, and should not be construed as limiting or as indicating a preference or a preferred implementation. As used herein, an ordinal term (e.g., “first,” “second,” “third,” etc.) used to modify an element, such as a structure, a component, an operation, etc., does not by itself indicate any priority or order of the element with respect to another element, but rather merely distinguishes the element from another element having a same name (but for use of the ordinal term). As used herein, the term “set” refers to a grouping of one or more elements, and the term “plurality” refers to multiple elements.
As used herein, “coupled” may include “communicatively coupled,” “electrically coupled,” or “physically coupled,” and may also (or alternatively) include any combinations thereof. Two devices (or components) may be coupled (e.g., communicatively coupled, electrically coupled, or physically coupled) directly or indirectly via one or more other devices, components, wires, buses, networks (e.g., a wired network, a wireless network, or a combination thereof), etc. Two devices (or components) that are electrically coupled may be included in the same device or in different devices and may be connected via electronics, one or more connectors, or inductive coupling, as illustrative, non-limiting examples. In some implementations, two devices (or components) that are communicatively coupled, such as in electrical communication, may send and receive electrical signals (digital signals or analog signals) directly or indirectly, such as via one or more wires, buses, networks, etc. As used herein, “directly coupled” may include two devices that are coupled (e.g., communicatively coupled, electrically coupled, or physically coupled) without intervening components.
Multimedia content may be transmitted in an encoded format from a first device to a second device. The first device may include an encoder that encodes the multimedia content, and the second device may include a decoder that decodes the multimedia content prior to rendering the multimedia content for one or more users. To illustrate, the multimedia content may include encoded audio. Different sound-producing objects may be represented in the encoded audio. For example, a first audio object may produce first audio that is encoded into the encoded audio, and a second audio object may produce second audio that is encoded into the encoded audio. The encoded audio may be transmitted to the second device in an audio bitstream. Audio metadata indicating sound attributes (e.g., location, orientation, volume, etc.) of the first audio and the second audio may also be included in the audio bitstream. For example, the metadata may indicate first sound attributes of the first audio and second sound attributes of the second audio.
Upon reception of the audio bitstream, the second device may decode the encoded audio to generate the first audio and the second audio. The second device may also modify the metadata to change the sound attributes of the first audio and the second audio upon rendering. Thus, the metadata may be modified at a rendering stage (as opposed to an authoring stage) to generate modified metadata. According to one implementation, the metadata may be modified based on a sensor input. An audio renderer of the second device may render the first audio based on the modified metadata to produce first rendered audio having first modified sound attributes and may render the second audio based on the modified metadata to produce second rendered audio having second modified sound attributes. The first rendered audio and the second rendered audio may be output (e.g., played) by an output device. For example, the first rendered audio and the second rendered audio may be output by a virtual reality headset, an augmented reality headset, a mixed reality headset, sound bars, one or more speakers, headphones, a mobile device, a motor vehicle, a wearable device, etc.
Referring to
The content provider 102 includes a media stream generator 103 and a transmitter 115. The content provider 102 may be configured to provide media content to the user device 120 via the network 116. For example, the media stream generator 103 may be configured to generate a media stream 104 (e.g., an encoded bit stream) that is provided to the user device 120 via the network 116. According to one implementation, the media stream 104 includes an audio stream 106 and a video stream 108. For example, the media stream generator 103 may combine the audio stream 106 and the video stream 108 to generate the media stream 104.
According to another implementation, the media stream 104 may be an audio-based media stream. For example, the media stream 104 may include only the audio stream 106, and the transmitter 115 may transmit the audio stream 106 to the user device 120. According to yet another implementation, the media stream 104 may be a video-based media stream. For example, the media stream 104 may include only the video stream 108, and the transmitter 115 may transmit the video stream 108 to the user device 120. It should be noted that the techniques described herein may be applied to audio-based media streams, video-based media streams, or a combination thereof (e.g., media streams including audio and video).
The audio stream 106 may include a plurality of compressed audio frames and metadata corresponding to each compressed audio frame. To illustrate, the audio stream 106 includes a compressed audio frame 110 (e.g., encoded audio) and metadata 112 corresponding to the compressed audio frame 110. The compressed audio frame 110 may be one frame of the plurality of compressed audio frames in the audio stream 106. The metadata 112 includes binary data indicative of characteristics of the sound-producing objects represented in the audio of the compressed audio frame 110, as further described with respect to
The video stream 108 may include a plurality of compressed video frames. According to one implementation, each compressed video frame of the plurality of compressed video frames may provide video, upon decompression, for corresponding audio frames of the plurality of compressed audio frames. To illustrate, the video stream 108 includes a compressed video frame 114 that provides video, upon decompression, for the compressed audio frame 110. For example, the compressed video frame 114 may represent a video depiction of the audio environment represented by the compressed audio frame 110.
Referring to
The scene 200 includes multiple sound-producing objects that produce the audio associated with the compressed audio frame 110. For example, the scene 200 includes a first object 210, a second object 220, and a third object 230. The first object 210 may be a foreground object, and the other objects 220, 230 may be background objects. Each object 210, 220, 230 may include different sub-objects. For example, the first object 210 includes a man and a woman. The second object 220 includes two women dancing, two speakers, and a tree. The third object 230 includes a tree and a plurality of birds. It should be understood that the techniques described herein may be implemented using characteristics of each sub-object (e.g., the man, the woman, each speaker, each dancing woman, each bird, etc.); however, for ease of illustration and description, the techniques described herein are implemented using characteristics of each object 210, 220, 230. For example, the metadata 112 may be usable to determine how to spatially pan decoded audio associated with different objects 210, 220, 230, how to adjust the audio level for decoded audio associated with different objects 210, 220, 230, etc.
The metadata 112 may include information associated with each object 210, 220, 230. As a non-limiting example, the metadata 112 may include positioning information (e.g., x-coordinate, y-coordinate, z-coordinate) of each object 210, 220, 230, audio level information associated with each object 210, 220, 230, orientation information associated with each object 210, 220, 230, frequency spectrum information associated with each object 210, 220, 230, etc. It should be understood that the metadata 112 may include alternative or additional information and should not be limited to the information described above. As described below, the metadata 112 may be usable to determine 3D audio rendering information for different encoded portions (e.g., different objects 210, 220, 230) of the compressed audio frame 110.
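As a non-limiting illustration, the per-object information described above may be organized along the following lines; the field names, units, and values in this sketch are assumptions for illustration only and do not correspond to a standardized metadata syntax.

```python
# Illustrative sketch only: field names, units, and values are assumptions
# chosen to mirror the kinds of per-object information described above
# (position, orientation, level, frequency spectrum).
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ObjectAudioMetadata:
    object_id: int                       # identifies the sound-producing object
    position: Tuple[float, float, float] # (x, y, z) coordinates of the object
    orientation_deg: float               # direction the object faces, in degrees
    level_db: float                      # playback level of the object's audio
    band_weights: List[float] = field(default_factory=list)  # coarse spectrum info

# Example metadata for the three objects of the scene (values are invented).
frame_metadata = [
    ObjectAudioMetadata(1, (0.0, 2.0, 0.0), 180.0, -6.0),    # foreground conversation
    ObjectAudioMetadata(2, (-3.0, 1.0, 0.0), 90.0, -18.0),   # background music
    ObjectAudioMetadata(3, (3.0, 2.0, 1.0), 270.0, -20.0),   # birds in the tree
]
```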
Referring to
To illustrate, the decoded audio identifier 304 for the first object 210 is binary number “01”, the positioning identifier 306 of the first object 210 is binary number “00001”, the orientation identifier 308 of the first object 210 is binary number “0110”, the level identifier 310 of the first object 210 is binary number “1101”, and the spectrum identifier 312 of the first object 210 is binary number “110110”. The decoded audio identifier 304 for the second object 220 is binary number “10”, the positioning identifier 306 for the second object 220 is binary number “00101”, the orientation identifier 308 for the second object 220 is binary number “0011”, the level identifier 310 for the second object 220 is binary number “0011”, and the spectrum identifier 312 for the second object 220 is binary number “010010”. The decoded audio identifier 304 for the third object 230 is binary number “11”, the positioning identifier 306 for the third object 230 is binary number “00111”, the orientation identifier 308 for the third object 230 is binary number “1100”, the level identifier 310 for the third object 230 is binary number “0011”, and the spectrum identifier 312 for the third object 230 is binary number “101101”. As described with respect to
Although the metadata 112 is shown to include five fields 304-312, in other implementations, the metadata 112 may include additional or fewer fields.
The metadata 312a includes a position azimuth identifier 314, a position elevation identifier 316, a position radius identifier 318, a gain factor identifier 320, and a spread identifier 322. The metadata 312b includes an object priority identifier 324, a flag azimuth identifier 326, an azimuth difference identifier 328, a flag elevation identifier 330, and an elevation difference identifier 332. The metadata 312c includes a flag radius identifier 334, a position radius difference identifier 336, a flag gain identifier 338, a gain factor difference identifier 340, and a flag spread identifier 342. The metadata 312d includes a spread difference identifier 344, a flag object priority identifier 346, and an object priority difference identifier 348.
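A minimal sketch of how the five example fields of the metadata 112 could be unpacked is shown below, assuming the bit widths implied by the example values above (2, 5, 4, 4, and 6 bits per object); an actual bitstream syntax may allocate or order the fields differently, and the extended fields 312a-312d are not covered by this sketch.

```python
# Minimal sketch of unpacking the five example fields per object from a
# binary metadata string. Bit widths follow the example values above
# (2 + 5 + 4 + 4 + 6 bits per object); this is not a standardized syntax.
FIELD_WIDTHS = [("audio_id", 2), ("position", 5), ("orientation", 4),
                ("level", 4), ("spectrum", 6)]

def unpack_object_fields(bits: str) -> dict:
    """Split one object's metadata record into named integer fields."""
    fields, offset = {}, 0
    for name, width in FIELD_WIDTHS:
        fields[name] = int(bits[offset:offset + width], 2)
        offset += width
    return fields

# First object of the example: "01" "00001" "0110" "1101" "110110"
print(unpack_object_fields("010000101101101110110"))
```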
Referring back to
The network interface 132 may be configured to receive the media stream 104 from the content provider 102. Upon reception of the media stream 104, the decoder 122 of the user device 120 may extract different components of the media stream 104. For example, the decoder 122 includes a media stream decoder 136 and a spatial decoder 138. The media stream decoder 136 may be configured to decode the encoded audio (e.g., the compressed audio frame 110) to generate decoded audio 142, decode the compressed video frame 114 to generate decoded video 144, and extract the metadata 112 of the media stream 104. According to a scene-based audio implementation, the media stream decoder 136 may be configured to generate an audio frame 146, such as a spatially uncompressed audio frame, from the compressed audio frame 110 of the media stream 104 and configured to generate spatial metadata 148 from the media stream 104. The audio frame 146 may include spatially uncompressed audio, such as higher order ambisonics (HOA) audio signals that are not processed by spatial compression.
To enhance user experience, the metadata 112 (or the spatial metadata 148) may be modified based on one or more user inputs. For example, the input device 124 may detect one or more user inputs. According to one implementation, the input device 124 may include a sensor to detect movements (or gestures) of a user. As a non-limiting example, the input device 124 may detect a location of the user, a head orientation of the user, an eye gaze of the user, hand gestures of the user, body movements of the user, etc. According to some implementations, the sensor (e.g., the input device 124) may be attached to a wearable device (e.g., the user device 120) or integrated into the wearable device. The wearable device may include a virtual reality headset, an augmented reality headset, a mixed reality headset, or headphones.
Referring to
Referring back to
The input device 124 may provide the input information 150 to the controller 126. The controller 126 (e.g., a metadata modifier) may be configured to modify the metadata 112 based on the input information 150 indicative of the detected user input. For example, the controller 126 may modify the binary numbers in the metadata 112 based on the user input to generate modified metadata 152. To illustrate, the controller 126 may determine, based on the input information 150 indicating that the user moved from the first location 402 to the second location 404, to change the binary numbers in the metadata 112 so that upon rendering, the user's experience at the second location 404 is enhanced. For example, playback of 3D audio and playback of video may be modified based on the detected input to accommodate the user, as described below.
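As a non-limiting sketch of one possible modification, the controller could re-express each object position relative to the listener's new location and trim the level with distance; the specific rule shown below is an assumption, since the disclosure leaves the exact mapping from sensor input to metadata changes to the implementation.

```python
# Sketch of a metadata modifier reacting to a change in listener location.
# Re-centering positions on the listener and applying a simple distance trim
# are illustrative assumptions, not a prescribed behavior.
import math

def modify_metadata_for_location(objects, listener_xyz):
    """Return per-object metadata re-centered on the listener's new location."""
    lx, ly, lz = listener_xyz
    modified = []
    for obj in objects:
        x, y, z = obj["position"]
        rel = (x - lx, y - ly, z - lz)                   # position relative to listener
        dist = max(math.sqrt(sum(c * c for c in rel)), 0.1)
        modified.append({
            **obj,
            "position": rel,
            # simple 1/r distance attenuation relative to a 1-meter reference
            "level_db": obj["level_db"] - 20.0 * math.log10(dist),
        })
    return modified

objects = [{"id": 1, "position": (0.0, 2.0, 0.0), "level_db": -6.0},
           {"id": 2, "position": (-3.0, 1.0, 0.0), "level_db": -18.0}]
print(modify_metadata_for_location(objects, listener_xyz=(1.0, 0.0, 0.0)))
```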
Referring to
Referring back to
According to the scene-based audio implementation, the controller 126 may generate instructions 154 (e.g., codes) that indicate how to modify the spatial metadata 148 based on the input information 150. The spatial decoder 138 may be configured to process the audio frame 146 based on the spatial metadata 148 (modified by the instructions 154) to generate a scene-based audio frame 156. The scene-based audio renderer 172 may be configured to render the scene-based audio frame 156 to generate a rendered scene-based audio frame 164.
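As an illustrative sketch of a scene-level modification, a first-order ambisonic frame can be rotated about the vertical axis by a yaw angle derived from the detected input; higher-order frames, as contemplated above, would require full spherical-harmonic rotation matrices, and the first-order case is shown only for brevity.

```python
# Sketch of a scene-level modification on a first-order ambisonic (B-format)
# frame: rotating the whole audio scene about the vertical axis by a yaw angle.
# Higher-order ambisonic frames would require full spherical-harmonic rotation.
import numpy as np

def rotate_foa_yaw(frame_wxyz: np.ndarray, yaw_rad: float) -> np.ndarray:
    """Rotate a (4, num_samples) W/X/Y/Z ambisonic frame about the Z axis."""
    w, x, y, z = frame_wxyz
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    x_rot = c * x - s * y
    y_rot = s * x + c * y
    return np.stack([w, x_rot, y_rot, z])    # W and Z are unaffected by yaw

frame = np.random.randn(4, 1024)             # one spatially uncompressed frame
rotated = rotate_foa_yaw(frame, yaw_rad=np.deg2rad(30.0))
```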
The output device 130 may be configured to output the rendered audio 162, the rendered scene-based audio frame 164, or both. According to one implementation, the output device 130 may be an audio-video playback device (e.g., a television, a smartphone, etc.).
According to one implementation, the input device 124 is a standalone device that communicates with another device (e.g., a decoding-rendering device) that includes the decoder 122, the controller 126, the rendering unit 128, the output device 130, and the memory 134. For example, the input device 124 detects the user input (e.g., the gesture) and generates the input information 150 based on the user input. The input device 124 sends the input information 150 to the other device, and the other device modifies the metadata 112 according to the techniques described above.
The techniques described with respect to
Referring to
The media stream decoder 136 may decode the media stream 104 to generate the decoded audio 142, the metadata 112 associated with the decoded audio 142, the spatial metadata 148, and the spatially uncompressed audio frame 146. The metadata 112 may be usable to determine 3D audio rendering information for different sound-producing objects (e.g., the objects 210, 220, 230) associated with sounds of the decoded audio 142. The metadata 112 is provided to the controller 126, the decoded audio 142 is provided to the object-based renderer 170, the spatial metadata 148 is provided to the spatial decoder 138, and the spatially uncompressed audio frame 146 is also provided to the spatial decoder 138.
The input device 124 may detect a user input 602 and generate the input information 150 based on the user input 602. As a non-limiting example, the input device 124 may detect one of the user inputs described with respect to
Thus, the controller 126 may adjust the metadata 112 to generate the modified metadata 152 to account for the change in the user's head orientation. The modified metadata 152 is provided to the object-based renderer 170. The object-based renderer 170 may render the decoded audio 142 based on the modified metadata 152 to generate the rendered audio 162 having 3D sound attributes. For example, the object-based renderer 170 may spatially pan the decoded audio 142 according to the modified metadata 152 and may adjust the level for the decoded audio 142 according to the modified metadata 152.
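As a simplified stand-in for this object-based rendering step, the sketch below pans each object signal to two channels with a constant-power law and applies its level from the (modified) metadata; an actual renderer might instead target a loudspeaker layout or binaural output.

```python
# Simplified stand-in for object-based rendering: each decoded object signal is
# panned and level-adjusted according to its (modified) metadata. A two-channel
# constant-power pan is used purely for illustration.
import numpy as np

def render_objects_stereo(object_signals, object_metadata, num_samples):
    """Mix mono object signals into a stereo buffer using azimuth and level."""
    out = np.zeros((2, num_samples))
    for sig, meta in zip(object_signals, object_metadata):
        gain = 10.0 ** (meta["level_db"] / 20.0)
        # map azimuth (-90 deg = hard left, +90 deg = hard right) to a pan angle
        pan = (np.clip(meta["azimuth_deg"], -90.0, 90.0) + 90.0) / 180.0 * np.pi / 2.0
        out[0] += gain * np.cos(pan) * sig    # left channel
        out[1] += gain * np.sin(pan) * sig    # right channel
    return out

signals = [np.random.randn(1024), np.random.randn(1024)]
metadata = [{"azimuth_deg": -30.0, "level_db": -6.0},
            {"azimuth_deg": 45.0, "level_db": -18.0}]
stereo = render_objects_stereo(signals, metadata, num_samples=1024)
```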
The controller 126 may also generate the instructions 154 that are used to modify the spatial metadata 148. The spatial decoder 138 may process the audio frame 146 based on the spatial metadata 148 (modified by the instructions 154) to generate the scene-based audio frame 156. The scene-based audio renderer 172 may render the scene-based audio frame 156 to generate the rendered scene-based audio frame 164 having 3D sound attributes.
The audio generator 610 may combine the rendered audio 162 and the rendered scene-based audio frame 164 to generate rendered audio 606, and the rendered audio 606 may be output at the output device 130.
Referring to
The media stream decoder 136 may decode the media stream 104 to generate the decoded audio 142 and the metadata 112 associated with the decoded audio 142. The metadata 112 may be usable to determine 3D audio rendering information for different sound-producing objects (e.g., the objects 210, 220, 230) associated with sounds of the decoded audio 142. As a non-limiting example, if the decoded audio 142 includes the conversation associated with the first object 210, the music associated with the second object 220, and the bird sounds associated with the third object 230, the metadata 112 may include positioning information for each object 210, 220, 230, level information associated with each object 210, 220, 230, orientation information of each object 210, 220, 230, frequency spectrum information associated with each object 210, 220, 230, etc.
If the metadata 112 is provided to the object-based renderer 170, the object-based renderer 170 may render the decoded audio 142 such that the conversation associated with the first object 210 is output at a position in front of the user at a relatively loud volume, the music associated with the second object 220 is output at a position behind the user at a relatively low volume, and the bird sounds associated with the third object 230 are output at a position behind the user at a relatively low volume. For example, referring to
To adjust the way the sounds are projected when the user rotates his body (e.g., in the event of the user input 602), the metadata 112 may be modified to adjust how the decoded audio 142 is rendered. For example, referring back to
The modified metadata 152 and the decoded audio 142 may be provided to the object-based renderer 170. The object-based renderer 170 may render the decoded audio 142 based on the modified metadata 152 to generate the rendered audio 162 having 3D sound attributes. For example, the object-based renderer 170 may spatially pan the different portions of the decoded audio 142 according to the modified metadata 152 and may adjust the levels for the different portions of the decoded audio 142 according to the modified metadata 152. The output device 130 may output the rendered audio 162.
For example, referring to
According to one implementation, the input device 124 may detect a location of the user as a user input 602, and the controller 126A may modify the metadata 112 based on the location of the user to generate the modified metadata 152. In this scenario, the object-based renderer 170 may render the decoded audio 142 to generate the rendered audio 162 having 3D sound attributes centered around the location. For example, a sweet spot of the rendered audio 162 (as output by the output device 130) may be projected at the location of the user such that the sweet spot follows the user.
Referring to
Referring to
The media stream decoder 136 may receive the media stream 104 and generate the audio frame 146 and the spatial metadata 148 associated with the audio frame 146. The audio frame 146 and the spatial metadata 148 are provided to the spatial decoder 138.
The input device 124 may detect the user input 602 and generate the input information 150 based on the user input 602. The controller 126B may generate one or more instructions 154 (e.g., codes/commands) based on the input information 150. The instructions 154 may instruct the spatial decoder 138 to modify the spatial metadata 148 (e.g., modify the data of an entire audio scene at once) based on the user input 602. The spatial decoder 138 may be configured to process the audio frame 146 based on the spatial metadata 148 (modified by the instructions 154) to generate the scene-based audio frame 156. The scene-based audio renderer 172 may render the scene-based audio frame 156 to generate the rendered scene-based audio frame 164 having 3D sound attributes. The output device 130 may output the rendered scene-based audio frame 164.
Referring to
The media stream decoder 136 may decode the video stream 108 to generate the decoded video 144, and the video renderer 1102 may render the decoded video 144 to generate rendered video 1112. The rendered video 1112 may be provided to the selection unit 1104.
The input device 124 may detect a location of the user as the user input 602 and may generate the input information 150 indicating the location of the user. For example, the input device 124 may detect whether the user is at the first location 402, the second location 404, or the third location 406. The input device 124 may generate the input information 150 that indicates the user's location. The controller 126C may determine which display device 1106, 1108 is proximate to the user's location and may generate instructions 1154 for the selection unit 1104 based on the determination. The selection unit 1104 may provide the rendered video 1112 to the display device 1106, 1108 that is proximate to the user based on the instructions 1154.
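A minimal sketch of such display selection is shown below, assuming invented display identifiers and room coordinates; the controller simply picks the display nearest the detected user location and routes the rendered video there.

```python
# Sketch of display-selection logic: route the rendered video to whichever
# display is closest to the detected user location. Display identifiers and
# coordinates are invented for illustration.
import math

DISPLAYS = {
    "display_1106": (0.0, 0.0),    # e.g., a television in the living room
    "display_1108": (8.0, 3.0),    # e.g., a monitor in another room
}

def select_display(user_xy):
    """Return the identifier of the display nearest to the user's location."""
    return min(DISPLAYS, key=lambda name: math.dist(user_xy, DISPLAYS[name]))

print(select_display((6.5, 2.0)))  # -> "display_1108"
```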
To illustrate, referring to
The techniques described with respect to
Referring to
The input information 150 may be provided to the input mapping unit 1302. According to one implementation, the input information 150 may undergo a smoothing operation and then may be provided to the input mapping unit 1302. The input mapping unit 1302 may be configured to generate mapping information 1350 based on the input information 150. The mapping information 1350 may map one or more sounds (e.g., one or more sounds 810, 820, 830 associated with the objects 210, 220, 230) to a detected input indicated by the input information 150. As a non-limiting example, the mapping information 1350 may map a detected hand gesture of the user to one or more of the sounds 810, 820, 830. To illustrate, if the user moves his hand to the right, the mapping information 1350 may correspondingly map at least one of the sounds 810, 820, 830 to the right. The mapping information 1350 is provided to the state machine 1304, to the transform computation unit 1306, and to the graphical user interface 1308. According to one implementation, the graphical user interface 1308 may provide a graphical representation of the detected input (e.g., the gesture) to the user based on the mapping information 1350.
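As a non-limiting sketch of the kind of mapping the input mapping unit 1302 could produce, a detected hand displacement may be scaled into a position offset for a selected sound; the scale factor and the notion of a "selected" object are assumptions for illustration.

```python
# Sketch of mapping a detected hand displacement to a position offset for a
# selected sound object. The scale factor and selected-object concept are
# illustrative assumptions.
def map_gesture_to_sound(gesture_delta_xy, selected_object_id, scale=0.5):
    """Translate a hand displacement (meters) into a sound-position offset."""
    dx, dy = gesture_delta_xy
    return {
        "object_id": selected_object_id,
        "position_offset": (scale * dx, scale * dy, 0.0),  # elevation unchanged
    }

# The user drags a hand 0.4 m to the right: the mapped sound moves 0.2 m right.
mapping_info = map_gesture_to_sound((0.4, 0.0), selected_object_id=2)
print(mapping_info)
```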
The transform computation unit 1306 may be configured to generate transform information 1354 to rotate an audio scene associated with a scene-based audio frame based on the mapping information 1350. For example, the transform information 1354 may indicate how to rotate an audio scene associated with the scene-based audio frame 156 to generate the modified scene-based audio frame 604. The transform information 1354 is provided to the metadata modification unit 1310.
The state machine 1304 may be configured to generate, based on the mapping information 1350, state information 1352 that indicates modifications of different objects 210, 220, 230. For example, the state information 1352 may indicate how characteristics (e.g., locations, orientations, frequencies, etc.) of different objects 210, 220, 230 may be modified based on the mapping information 1350 associated with the detected input. The state information 1352 is provided to the metadata modification unit 1310.
The metadata modification unit 1310 may be configured to modify the metadata 112 to generate the modified metadata 152. For example, the metadata modification unit 1310 may modify the metadata 112 based on the state information 1352 (e.g., object-based audio modification), the transform information 1354 (e.g., scene-based audio modification), or both, to generate the modified metadata 152.
Referring to
The media stream decoder 136 may decode the media stream 104 to generate the decoded audio 142 and the metadata 112 associated with the decoded audio 142. The metadata 112 may be usable to determine 3D audio rendering information for different sound-producing objects (e.g., the objects 210, 220, 230) associated with sounds of the decoded audio 142. As a non-limiting example, if the decoded audio 142 includes the conversation associated with the first object 210, the music associated with the second object 220, and the bird sounds associated with the third object 230, the metadata 112 may include positioning information for each object 210, 220, 230, level information associated with each object 210, 220, 230, orientation information of each object 210, 220, 230, frequency spectrum information associated with each object 210, 220, 230, etc.
If the metadata 112 is provided to the object-based renderer 170, the object-based renderer 170 may render the decoded audio 142 such that the conversation associated with the first object 210 is output at a position in front of the user at a relatively loud volume, the music associated with the second object 220 is output at a position behind the user at a relatively low volume, and the bird sounds associated with the third object 230 are output at a position behind the user at a relatively low volume. For example, referring to
For example, referring back to
If the gesture unit 1408 finds a stored gesture (having similar properties to the user input 602) in one of the databases 1410, 1412, the gesture unit 1408 may provide the stored gesture to the compare unit 1406. The compare unit 1406 may compare properties of the stored gesture to properties of the user input 602 to determine whether the user input 602 is substantially similar to the stored gesture. If the compare unit 1406 determines that the stored gesture is substantially similar to the user input 602, the compare unit 1406 instructs the gesture unit 1408 to provide the stored gesture to the metadata modification information generator 1414. The metadata modification information generator 1414 may generate the input information 150 based on the stored gesture. The input information 150 is provided to the controller 126A. The controller 126A may modify the metadata 112 based on the input information 150 associated with the detected user input 602 to generate the modified metadata 152. Thus, the controller 126A may adjust the metadata 112 to account for the change in the user's orientation. The modified metadata 152 is provided to the object-based renderer 170.
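A minimal sketch of such gesture matching is shown below, assuming each stored gesture is summarized as a small feature vector and matched by distance against a threshold; the feature definitions and the threshold are illustrative assumptions, and the databases could store richer templates.

```python
# Sketch of comparing a detected input against stored gestures: each gesture is
# summarized as a small feature vector and matched by Euclidean distance.
import math

STORED_GESTURES = {
    "rotate_right": (1.0, 0.0, 0.2),   # (dx, dy, duration_s) feature summary
    "raise_volume": (0.0, 1.0, 0.3),
}

def match_gesture(detected_features, threshold=0.25):
    """Return the closest stored gesture name, or None if nothing is close enough."""
    best_name, best_dist = None, float("inf")
    for name, template in STORED_GESTURES.items():
        dist = math.dist(detected_features, template)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= threshold else None

print(match_gesture((0.95, 0.05, 0.22)))  # close to "rotate_right"
```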
A buffer 1420 may buffer the decoded audio 142 to generate buffered decoded audio 1422, and the buffered decoded audio 1422 is provided to the object-based renderer 170. In other implementations, buffering operations may be bypassed and the decoded audio 142 may be provided to the object-based renderer 170. The object-based renderer 170 may render the buffered decoded audio 1422 (or the decoded audio 142) based on the modified metadata 152 to generate the rendered audio 162 having 3D sound attributes. For example, the object-based renderer 170 may spatially pan the different portions of the buffered decoded audio 1422 according to the modified metadata 152 and may adjust the levels for the different portions of the buffered decoded audio 1422 according to the modified metadata 152.
Referring to
The input device 124A may operate in a substantially similar manner as described with respect to
The media stream decoder 136 may receive the media stream 104 and generate the audio frame 146 and the spatial metadata 148 associated with the audio frame 146. The audio frame 146 and the spatial metadata 148 are provided to the spatial decoder 138. The controller 126B may generate one or more instructions 154 (e.g., codes/commands) based on the input information 150. The instructions 154 may instruct the spatial decoder 138 to modify the spatial metadata 148 (e.g., modify the data of an entire audio scene at once) based on the user input 602. The spatial decoder 138 may be configured to process the audio frame 146 based on the spatial metadata 148 (modified by the instructions 154) to generate the scene-based audio frame 156. The scene-based audio renderer 172 may render the scene-based audio frame 156 to generate the rendered scene-based audio frame 164 having 3D sound attributes.
Referring to
According to the process diagram 1600, a custom gesture 1602 may be added to a gesture database 1604. For example, the user of the user device 120 may add the custom gesture 1602 to the gesture database 1604 to update the gesture database 1604. According to one implementation, the custom gesture 1602 may be one of the user inputs described with respect to
For object-based audio rendering, one or more audio channels 1612 (e.g., audio channels associated with each object 210, 220, 230) are provided to control logic 1614. For example, the one or more audio channels 1612 may include a first audio channel associated with the first object 210, a second audio channel associated with the second object 220, and a third audio channel associated with the third object 230. For scene-based audio rendering, a global audio scene 1610 is provided to the control logic 1614. The global audio scene 1610 may audibly depict the scene 200 of
The control logic 1614 may select one or more particular audio channels of the one or more audio channels 1612. As a non-limiting example, the control logic 1614 may select the first audio channel associated with the first object 210. Additionally, the control logic 1614 may select a time marker or a time loop associated with the particular audio channel. As a result, metadata associated with the particular audio channel may be modified at the time marker or during the time loop. The particular audio channel (e.g., the first audio channel) and the time marker may be provided to the dictionary of translations 1616.
A sensor 1606 may detect one or more user inputs (e.g., gestures). For example, the sensor 1606 may detect the user input 602 and provide the detected user input 602 to a smoothing unit 1608. The smoothing unit 1608 may be configured to smooth the detected input 602 and provide the smoothed detected input 602 to the dictionary of translations 1616.
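As an illustrative assumption, the smoothing operation may be an exponential moving average over the raw sensor readings, as sketched below; other smoothing filters could equally be used.

```python
# Sketch of a smoothing operation over raw sensor samples; an exponential
# moving average is assumed here for simplicity.
def smooth(samples, alpha=0.3):
    """Exponentially smooth a sequence of scalar sensor readings."""
    smoothed, state = [], None
    for s in samples:
        state = s if state is None else alpha * s + (1.0 - alpha) * state
        smoothed.append(state)
    return smoothed

# Jittery azimuth readings (degrees) from a hand-tracking sensor.
print(smooth([10.0, 14.0, 9.0, 30.0, 28.0, 31.0]))
```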
The dictionary of translations 1616 may be configured to determine whether the smoothed detected input 602 corresponds to a gesture in the gesture database 1604. Additionally, the dictionary of translations 1616 may translate data associated with the smoothed detected input 602 into control parameters that are usable to modify metadata (e.g., the metadata 112). The control parameters may be provided to a global audio scene modification unit 1618 and to an object-based audio unit 1620. The global audio scene modification unit 1618 may be configured to modify the global audio scene 1610 based on the control parameters associated with the smoothed detected input 602 to generate a modified global audio scene. The object-based audio unit 1620 may be configured to attach the metadata modified by the control parameters (e.g., the modified metadata 152) to the particular audio channel. A rendering unit 1622 may perform 3D audio rendering on the modified global audio scene, perform 3D audio rendering on the particular audio channel using the modified metadata 152, or a combination thereof, as described above.
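A minimal sketch of such a dictionary of translations is shown below, assuming invented gesture names and control-parameter fields; it simply looks up parameter offsets for a recognized gesture and tags them with the selected audio channel and time marker.

```python
# Sketch of a "dictionary of translations" turning a recognized gesture into
# control parameters for the selected audio channel at the selected time marker.
# Gesture names, parameter fields, and magnitudes are illustrative assumptions.
TRANSLATIONS = {
    "rotate_right": {"azimuth_offset_deg": -15.0},
    "raise_volume": {"level_offset_db": +3.0},
}

def translate(gesture_name, channel_id, time_marker_s):
    """Build control parameters for one gesture applied to one audio channel."""
    params = dict(TRANSLATIONS.get(gesture_name, {}))
    params.update({"channel_id": channel_id, "time_marker_s": time_marker_s})
    return params

# Apply the recognized gesture to the first audio channel at the 12.5 s marker.
print(translate("raise_volume", channel_id=1, time_marker_s=12.5))
```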
Referring to
The method 1700 includes receiving a media stream from an encoder, at 1702. The media stream may include encoded audio and metadata associated with the encoded audio, and the metadata may be usable to determine three-dimensional audio rendering information for different portions of the encoded audio. For example, referring to
The method 1700 also includes decoding the encoded audio to generate decoded audio, at 1704. For example, referring to
The method 1700 also includes detecting a sensor input, at 1706. For example, referring to
The method 1700 also includes modifying the metadata based on the sensor input to generate modified metadata, at 1708. For example, referring to
The method 1700 also includes rendering decoded audio based on the modified metadata to generate rendered audio having three-dimensional sound attributes, at 1710. For example, referring to
The method 1700 also includes outputting the rendered audio, at 1712. For example, referring to
According to one implementation of the method 1700, the sensor input may include a user location, and the three-dimensional sound attributes of the rendered audio (e.g., the rendered audio 162) may be centered around the user location. Thus, the output device 130 may output the rendered audio 162 in such a manner that the rendered audio 162 appears to “follow” the user as the user moves. Additionally, according to one implementation of the method 1700, the media stream 104 may include encoded video (e.g., the video stream 108), and the method 1700 may include decoding the encoded video to generate decoded video. For example, the media stream decoder 136 may decode the video stream 108 to generate decoded video 144. The method 1700 may also include rendering the decoded video to generate rendered video and selecting, based on the user location, a particular display device to display the rendered video from a plurality of display devices. For example, the selection unit 1104 may select a particular display device to display the rendered video from a plurality of display devices 1106, 1108, 1202. The method 1700 may also include displaying the rendered video on the particular display device.
According to one implementation, the method 1700 includes detecting a second sensor input. The second sensor input may include audio content from a remote device. For example, the input device 124 may detect the second sensor input (e.g., audio content) from a mobile phone, a radio, a television, or a computer. According to some implementations, the audio content may include an audio advertisement or an audio emergency message. The method 1700 may also include generating additional metadata for the audio content. For example, the controller 126 may generate metadata that indicates a potential location for the audio content to output based on the rendering. The method 1700 may also include rendering audio associated with the audio content to generate second rendered audio having second three-dimensional sound attributes that are different from the three-dimensional sound attributes of the rendered audio 162. For example, the three-dimensional sound attributes of the rendered audio 162 may enable sound reproduction according to a first angular position, and the second three-dimensional sound attributes of the second rendered audio may enable sound reproduction according to a second angular position. The method 1700 may also include outputting the second rendered audio concurrently with the rendered audio 162.
The method 1700 of
Referring to
The external device 1802 may generate an audio bitstream 1804. The audio bitstream 1804 may include audio content 1806. Non-limiting examples of the audio content 1806 may include virtual object audio 1810 associated with a virtual audio object, an audio emergency message 1812, an audio advertisement 1814, etc. The external device 1802 may transmit the audio bitstream 1804 to the user device 120.
The network interface 132 may be configured to receive the audio bitstream 1804 from the external device 1802. The decoder 122 may be configured to decode the audio content 1806 to generate decoded audio 1820. For example, the decoder 122 may decode the virtual object audio 1810, the audio emergency message 1812, the audio advertisement 1814, or a combination thereof.
The controller 126 may be configured to generate second metadata 1822 (e.g., second audio metadata) associated with the audio bitstream 1804. The second metadata 1822 may indicate one or more locations of the audio content 1806 upon rendering. For example, the rendering unit 128 may render the decoded audio 1820 (e.g., the decoded audio content 1806) to generate rendered audio 1824 having sound attributes based on the second metadata 1822. As a non-limiting example, the virtual object audio 1810 associated with the virtual audio object (or the other audio content 1806) may be inserted in a different spatial location than the other objects 210, 220, 230.
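As a non-limiting sketch of how the second metadata 1822 could place the inserted content away from the existing objects, the controller might choose the azimuth with the largest angular gap to every existing object; this gap-finding rule is an assumption, since the disclosure only requires a different spatial location.

```python
# Sketch of generating second metadata so that inserted content (e.g., a
# virtual object, advertisement, or emergency message) is rendered away from
# the existing objects. The gap-finding rule is an illustrative assumption.
def choose_insert_azimuth(existing_azimuths_deg):
    """Pick the azimuth farthest from every existing object, in 5-degree steps."""
    best_az, best_gap = 0.0, -1.0
    for candidate in range(0, 360, 5):
        gap = min(min(abs(candidate - a) % 360, 360 - abs(candidate - a) % 360)
                  for a in existing_azimuths_deg)
        if gap > best_gap:
            best_az, best_gap = float(candidate), gap
    return best_az

# Existing objects at 0, 90, and 180 degrees: the widest gap is around 270.
second_metadata = {"azimuth_deg": choose_insert_azimuth([0.0, 90.0, 180.0]),
                   "level_db": -12.0}
print(second_metadata)
```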
Referring to
The method 1900 includes receiving an audio bitstream, at 1902. The audio bitstream may include encoded audio associated with one or more audio objects. The audio bitstream may also include audio metadata indicating one or more sound attributes of the one or more audio objects. For example, referring to
The method 1900 also includes storing the encoded audio and the audio metadata, at 1904. For example, referring to
The method 1900 also includes receiving an indication to adjust a particular sound attribute of the one or more sound attributes, at 1906. The particular sound attribute may be associated with a particular audio object of the one or more audio objects. For example, referring to
According to some implementations, the method 1900 includes detecting a sensor movement, a sensor location, or both. The method 1900 may also include generating the indication to adjust the particular sound attribute based on the detected sensor movement, the detected sensor location, or both. For example, referring to
The method 1900 also includes modifying the audio metadata based on the indication to generate modified audio metadata, at 1908. For example, the controller 126 may modify the metadata 112 (e.g., the audio metadata) based on the input information 150 to generate the modified metadata 152 (e.g., the modified audio metadata).
According to one implementation, the method 1900 may include rendering the decoded audio based on the modified audio metadata to generate loudspeaker feeds. For example, the rendering unit 128 (e.g., an audio renderer) may render the decoded audio based on the modified metadata to generate the rendered audio 162. According to one implementation, the rendered audio 162 may include loudspeaker feeds that are played by the output device 130. According to another implementation, the rendered audio 162 may include binauralized audio, and the output device 130 may include at least two loudspeakers that output the binauralized audio.
According to one implementation, the method 1900 includes receiving audio content from an external device. For example, referring to
The techniques described with respect to
Referring to
In a particular implementation, the user device 120 includes a processor 2006, such as a central processing unit (CPU), coupled to the memory 134. The memory 134 includes instructions 2060 (e.g., executable instructions) such as computer-readable instructions or processor-readable instructions. The instructions 2060 may include one or more instructions that are executable by a computer, such as the processor 2006. The user device 120 may include one or more additional processors 2010 (e.g., one or more digital signal processors (DSPs)). The processors 2010 may include a speech and music coder-decoder (CODEC) 2008. The speech and music CODEC 2008 may include a vocoder encoder 2014, a vocoder decoder 2012, or both. In a particular implementation, the speech and music CODEC 2008 may be an enhanced voice services (EVS) CODEC that communicates in accordance with one or more standards or protocols, such as a 3rd Generation Partnership Project (3GPP) EVS protocol.
The processor 2006 includes the media stream decoder 136, the controller 126, and the rendering unit 128. The media stream decoder 136 may be configured to decode audio received by the network interface 132 to generate the decoded audio 142. The media stream decoder 136 may also be configured to extract the metadata 112 that indicates one or more sound attributes of the audio objects 210, 220, 230. The controller 126 may be configured to receive an indication to adjust a particular sound attribute of the one or more sound attributes. The controller 126 may also modify the metadata 112 based on the indication to generate the modified metadata 152. The rendering unit 128 may render the decoded audio based on the modified metadata 152 to generate rendered audio.
The device 2000 may include a display controller 2026 that is coupled to the processor 2006 and to a display 2028. A coder/decoder (CODEC) 2034 may also be coupled to the processor 2006 and the processors 2010. The output device 130 (e.g., one or more loudspeakers) and a microphone 2048 may be coupled to the CODEC 2034. The CODEC 2034 may include a DAC 2002 and an ADC 2004. In a particular implementation, the CODEC 2034 may receive analog signals from the microphone 2048, convert the analog signals to digital signals using the ADC 2004, and provide the digital signals to the speech and music CODEC 2008. The speech and music CODEC 2008 may process the digital signals. In a particular implementation, the speech and music CODEC 2008 may provide digital signals to the CODEC 2034. The CODEC 2034 may convert the digital signals to analog signals using the DAC 2002 and may provide the analog signals to the output device 130.
In some implementations, the processor 2006, the processors 2010, the display controller 2026, the memory 2032, the CODEC 2034, the network interface 132, and the transceiver 2050 are included in a system-in-package or system-on-chip device 2022. In some implementations, the input device 124 and a power supply 2044 are coupled to the system-on-chip device 2022. Moreover, in a particular implementation, as illustrated in
The user device 120 may include a virtual reality headset, a mixed reality headset, an augmented reality headset, headphones, a headset, a mobile communication device, a smart phone, a cellular phone, a laptop computer, a computer, a tablet, a personal digital assistant, a display device, a television, a gaming console, a music player, a radio, a digital video player, a digital video disc (DVD) player, a tuner, a camera, a navigation device, a vehicle, a component of a vehicle, or any combination thereof.
In an illustrative implementation, the memory 134 includes or stores the instructions 2060 (e.g., executable instructions), such as computer-readable instructions or processor-readable instructions. For example, the memory 134 may include or correspond to a non-transitory computer readable medium storing the instructions 2060. The instructions 2060 may include one or more instructions that are executable by a computer, such as the processor 2006 or the processors 2010. The instructions 2060 may cause the processor 2006 or the processors 2010 to perform the method 1700 of
In conjunction with the described implementations, a first apparatus includes means for receiving an audio bitstream. The audio bitstream may include encoded audio associated with one or more audio objects and audio metadata indicating one or more sound attributes of the one or more audio objects. For example, the means for receiving the audio bitstream may include the network interface 132 of
The first apparatus may also include means for storing the encoded audio and the audio metadata. For example, the means for storing may include the memory 134 of
The first apparatus may also include means for receiving an indication to adjust a particular sound attribute of the one or more sound attributes. The particular sound attribute may be associated with a particular audio object of the one or more audio objects. For example, the means for receiving the indication may include the controller 126 of
The first apparatus may also include means for modifying the audio metadata based on the indication to generate modified audio metadata. For example, the means for modifying the audio metadata may include the controller 126 of
In conjunction with the described implementations, a second apparatus includes means for receiving a media stream from an encoder. The media stream may include encoded audio and metadata associated with the encoded audio. The metadata may be usable to determine 3D audio rendering information for different portions of the encoded audio. For example, the means for receiving may include the network interface 132 of
The second apparatus may also include means for decoding the encoded audio to generate decoded audio. For example, the means for decoding may include the media stream decoder 136 of
The second apparatus may also include means for detecting a sensor input. For example, the means for detecting the sensor input may include the input device 124 of
The second apparatus may also include means for modifying the metadata based on the sensor input to generate modified metadata. For example, the means for modifying the metadata may include the controller 126 of
The second apparatus may also include means for rendering the decoded audio based on the modified metadata to generate rendered audio having 3D sound attributes. For example, the means for rendering the decoded audio may include the rendering unit 128 of
The second apparatus may also include means for outputting the rendered audio. For example, the means for outputting the rendered audio may include the output device 130 of
One or more of the disclosed aspects may be implemented in a system or an apparatus, such as the user device 120, that may include a communications device, a fixed location data unit, a mobile location data unit, a mobile phone, a cellular phone, a satellite phone, a computer, a tablet, a portable computer, a display device, a media player, or a desktop computer. Alternatively or additionally, the user device 120 may include a set top box, an entertainment unit, a navigation device, a personal digital assistant (PDA), a monitor, a computer monitor, a television, a tuner, a radio, a satellite radio, a music player, a digital music player, a portable music player, a video player, a digital video player, a digital video disc (DVD) player, a portable digital video player, a satellite, a vehicle, a component integrated within a vehicle, any other device that includes a processor or that stores or retrieves data or computer instructions, or a combination thereof. As another illustrative, non-limiting example, the system or the apparatus may include remote units, such as hand-held personal communication systems (PCS) units, portable data units such as global positioning system (GPS) enabled devices, meter reading equipment, a virtual reality headset, a mixed reality headset, an augmented reality headset, sound bars, headphones, or any other device that includes a processor or that stores or retrieves data or computer instructions, or any combination thereof.
A base station may be part of a wireless communication system and may be operable to perform the techniques described herein. The wireless communication system may include multiple base stations and multiple wireless devices. The wireless communication system may be a Long Term Evolution (LTE) system, a Code Division Multiple Access (CDMA) system, a Global System for Mobile Communications (GSM) system, a wireless local area network (WLAN) system, or some other wireless system. A CDMA system may implement Wideband CDMA (WCDMA), CDMA 1X, Evolution-Data Optimized (EVDO), Time Division Synchronous CDMA (TD-SCDMA), or some other version of CDMA.
Those of skill would further appreciate that the various illustrative logical blocks, configurations, modules, circuits, and algorithm steps described in connection with the implementations disclosed herein may be implemented as electronic hardware, computer software executed by a processor, or combinations of both. Various illustrative components, blocks, configurations, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or processor executable instructions depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The steps of a method or algorithm described in connection with the disclosure herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, hard disk, a removable disk, a compact disc read-only memory (CD-ROM), or any other form of non-transient storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). The ASIC may reside in a computing device or a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a computing device or user terminal.
The previous description is provided to enable a person skilled in the art to make or use the disclosed implementations. Various modifications to these implementations will be readily apparent to those skilled in the art, and the principles defined herein may be applied to other implementations without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the implementations shown herein but is to be accorded the widest scope possible consistent with the principles and novel features as defined by the following claims.