AUDIO PROCESSING

Information

  • Publication Number
    20250031004
  • Date Filed
    July 02, 2024
  • Date Published
    January 23, 2025
Abstract
A device includes a memory configured to store data associated with an immersive audio environment and one or more processors configured to obtain contextual movement estimate data associated with a portion of the immersive audio environment. The processor(s) are configured to set a pose update parameter based on the contextual movement estimate data. The processor(s) are configured to obtain pose data based on the pose update parameter. The processor(s) are configured to obtain rendered assets associated with the immersive audio environment based on the pose data. The processor(s) are configured to generate an output audio signal based on the rendered assets.
Description
II. FIELD

The present disclosure is generally related to audio processing and, more specifically, to processing immersive audio.


III. DESCRIPTION OF RELATED ART

Advances in technology have resulted in smaller and more powerful computing devices. For example, there currently exist a variety of portable personal computing devices, including wireless telephones such as mobile and smart phones, tablets, and laptop computers that are small, lightweight, and easily carried by users. These devices can communicate voice and data packets over wireless networks. Further, many such devices incorporate additional functionality such as a digital still camera, a digital video camera, a digital recorder, and an audio file player. Also, such devices can process executable instructions, including software applications, such as a web browser application, that can be used to access the Internet. As such, these devices can include significant computing capabilities.


One application of such devices includes providing immersive audio to a user. As an example, a headphone device worn by a user can receive streaming audio data from a remote server for playback to the user. Conventional multi-source spatial audio systems are often designed to use a relatively high complexity rendering of audio streams from multiple audio sources with the goal of ensuring that a worst-case performance of the headphone device still results in an acceptable quality of the immersive audio that is provided to the user. However, real-time local rendering of immersive audio is resource intensive (e.g., in terms of processor cycles, time, power, and memory utilization).


Another conventional approach is to offload local rendering of the immersive audio to the streaming device. For example, the headphone device can detect a rotation of the user's head and transmit head tracking information to a remote server. The remote server updates an audio scene based on the head tracking information, generates binaural audio data based on the updated audio scene, and transmits the binaural audio data to the headphone device for playback to the user.


Performing audio scene updates and binauralization at the remote server enables the user to experience an immersive audio experience via a headphone device that has relatively limited processing resources. However, due to latencies associated with transmitting the head tracking information to the remote server, updating the audio data based on the head rotation, and transmitting the updated binaural audio data to the headphone device, such a system can result in an unnaturally high motion-to-sound latency. In other words, the time delay between a rotation of the user's head and the corresponding modified spatial audio being played out at the user's ears can be unnaturally long, which may diminish the user's immersive audio experience.


IV. SUMMARY

According to one or more aspects of the present disclosure, a device includes a memory configured to store data associated with an immersive audio environment and one or more processors configured to obtain contextual movement estimate data associated with a portion of the immersive audio environment. The one or more processors are configured to set a pose update parameter based on the contextual movement estimate data. The one or more processors are configured to obtain pose data based on the pose update parameter. The one or more processors are configured to obtain rendered assets associated with the immersive audio environment based on the pose data. The one or more processors are configured to generate an output audio signal based on the rendered assets.


According to one or more aspects of the present disclosure, a method includes obtaining contextual movement estimate data associated with a portion of an immersive audio environment. The method includes setting a pose update parameter based on the contextual movement estimate data. The method includes obtaining pose data based on the pose update parameter. The method includes obtaining rendered assets associated with the immersive audio environment based on the pose data. The method includes generating an output audio signal based on the rendered assets.


According to one or more aspects of the present disclosure, a non-transitory computer-readable device stores instructions that are executable by one or more processors to cause the one or more processors to obtain contextual movement estimate data associated with a portion of an immersive audio environment. The instructions cause the one or more processors to set a pose update parameter based on the contextual movement estimate data. The instructions cause the one or more processors to obtain pose data based on the pose update parameter. The instructions cause the one or more processors to obtain rendered assets associated with the immersive audio environment based on the pose data. The instructions cause the one or more processors to generate an output audio signal based on the rendered assets.


According to one or more aspects of the present disclosure, an apparatus includes means for obtaining contextual movement estimate data associated with a portion of an immersive audio environment. The apparatus includes means for setting a pose update parameter based on the contextual movement estimate data. The apparatus includes means for obtaining pose data based on the pose update parameter. The apparatus includes means for obtaining rendered assets associated with the immersive audio environment based on the pose data. The apparatus includes means for generating an output audio signal based on the rendered assets.


Other aspects, advantages, and features of the present disclosure will become apparent after review of the entire application, including the following sections: Brief Description of the Drawings, Detailed Description, and the Claims.





V. BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 is a block diagram of aspects of a system operable to process data associated with an immersive audio environment in accordance with some examples of the present disclosure.



FIG. 2 is a block diagram of aspects of the system of FIG. 1 in accordance with some examples of the present disclosure.



FIG. 3 is a block diagram of aspects of the system of FIG. 1 in accordance with some examples of the present disclosure.



FIG. 4 is a block diagram of aspects of the system of FIG. 1 in accordance with some examples of the present disclosure.



FIG. 5 is a block diagram of aspects of the system of FIG. 1 in accordance with some examples of the present disclosure.



FIG. 6 is a diagram of an illustrative aspect of operation of components of the system of FIG. 1, in accordance with some examples of the present disclosure.



FIG. 7 illustrates an example of an integrated circuit operable to process data associated with an immersive audio environment in accordance with some examples of the present disclosure.



FIG. 8 is a block diagram illustrating an illustrative implementation of a system for processing data associated with an immersive audio environment and including external speakers.



FIG. 9 is a diagram of a mobile device operable to process data associated with an immersive audio environment in accordance with some examples of the present disclosure.



FIG. 10 is a diagram of a headset operable to process data associated with an immersive audio environment in accordance with some examples of the present disclosure.



FIG. 11 is a diagram of earbuds that are operable to process data associated with an immersive audio environment in accordance with some examples of the present disclosure.



FIG. 12 is a diagram of a mixed reality or augmented reality glasses device that is operable to process data associated with an immersive audio environment in accordance with some examples of the present disclosure.



FIG. 13 is a diagram of a headset, such as a virtual reality, mixed reality, or augmented reality headset, operable to process data associated with an immersive audio environment in accordance with some examples of the present disclosure.



FIG. 14 is a diagram of a first example of a vehicle operable to process data associated with an immersive audio environment in accordance with some examples of the present disclosure.



FIG. 15 is a diagram of a second example of a vehicle operable to process data associated with an immersive audio environment in accordance with some examples of the present disclosure.



FIG. 16 is a diagram of a particular implementation of a method of processing data associated with an immersive audio environment that may be performed by the device of FIG. 1, in accordance with some examples of the present disclosure.



FIG. 17 is a block diagram of a particular illustrative example of a device that is operable to process data associated with an immersive audio environment in accordance with some examples of the present disclosure.





VI. DETAILED DESCRIPTION

Systems and methods for providing immersive audio assets are described. The described systems and methods conserve computing resources and power of a user device by setting pose update parameters based on context. For example, a contextual movement estimate can be used to set the pose update parameters such that pose data is updated more frequently during periods with relatively high rates of movement and is updated less frequently during periods with relatively low rates of movement.


Particular aspects of the present disclosure are described below with reference to the drawings. In the description, common features are designated by common reference numbers. As used herein, various terminology is used for the purpose of describing particular implementations only and is not intended to be limiting of implementations. For example, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. Further, some features described herein are singular in some implementations and plural in other implementations. To illustrate, FIG. 2 depicts a system 200 including one or more processors (“processor(s)” 210 of FIG. 2), which indicates that in some implementations the system 200 includes a single processor 210 and in other implementations the system 200 includes multiple processors 210. For ease of reference herein, such features are generally introduced as “one or more” features and are subsequently referred to in the singular or optional plural (as indicated by “(s)”) unless aspects related to multiple of the features are being described.


In some drawings, multiple instances of a particular type of feature are used. Although these features are physically and/or logically distinct, the same reference number is used for each, and the different instances are distinguished by addition of a letter to the reference number. When the features as a group or a type are referred to herein (e.g., when no particular one of the features is being referenced), the reference number is used without a distinguishing letter. However, when one particular feature of multiple features of the same type is referred to herein, the reference number is used with the distinguishing letter. For example, referring to FIG. 5, multiple pose sensors are illustrated and associated with reference numbers 108A and 108B. When referring to a particular one of these pose sensors, such as a pose sensor 108A, the distinguishing letter “A” is used. However, when referring to any arbitrary one of these pose sensors or to these pose sensors as a group, the reference number 108 is used without a distinguishing letter.


As used herein, the terms “comprise,” “comprises,” and “comprising” may be used interchangeably with “include,” “includes,” or “including.” Additionally, the term “wherein” may be used interchangeably with “where.” As used herein, “exemplary” indicates an example, an implementation, and/or an aspect, and should not be construed as limiting or as indicating a preference or a preferred implementation. As used herein, an ordinal term (e.g., “first,” “second,” “third,” etc.) used to modify an element, such as a structure, a component, an operation, etc., does not by itself indicate any priority or order of the element with respect to another element, but rather merely distinguishes the element from another element having a same name (but for use of the ordinal term). As used herein, the term “set” refers to one or more of a particular element, and the term “plurality” refers to multiple (e.g., two or more) of a particular element.


As used herein, “coupled” may include “communicatively coupled,” “electrically coupled,” or “physically coupled,” and may also (or alternatively) include any combinations thereof. Two devices (or components) may be coupled (e.g., communicatively coupled, electrically coupled, or physically coupled) directly or indirectly via one or more other devices, components, wires, buses, networks (e.g., a wired network, a wireless network, or a combination thereof), etc. Two devices (or components) that are electrically coupled may be included in the same device or in different devices and may be connected via electronics, one or more connectors, or inductive coupling, as illustrative, non-limiting examples. In some implementations, two devices (or components) that are communicatively coupled, such as in electrical communication, may send and receive signals (e.g., digital signals or analog signals) directly or indirectly, via one or more wires, buses, networks, etc. As used herein, “directly coupled” may include two devices that are coupled (e.g., communicatively coupled, electrically coupled, or physically coupled) without intervening components.


In the present disclosure, terms such as “determining,” “calculating,” “estimating,” “shifting,” “adjusting,” etc. may be used to describe how one or more operations are performed. It should be noted that such terms are not to be construed as limiting and other techniques may be utilized to perform similar operations. Additionally, as referred to herein, “obtaining,” “generating,” “calculating,” “estimating,” “using,” “selecting,” “accessing,” and “determining” may be used interchangeably. For example, “obtaining,” “generating,” “calculating,” “estimating,” or “determining” a parameter (or a signal) may refer to actively generating, estimating, calculating, or determining the parameter (or the signal) or may refer to using, obtaining, selecting, reading, receiving, retrieving, or accessing the parameter (or signal) (e.g., from a memory, buffer, container, data structure, lookup table, transmission channel, etc.) that is already generated, such as by another component or device.



FIG. 1 is a block diagram of aspects of a system 100 operable to process data associated with an immersive audio environment in accordance with some examples of the present disclosure. The system 100 includes one or more media output devices 102 coupled to or including an immersive audio renderer 122. Each of the media output device(s) 102 is configured to output media content to a user. For example, each of the media output device(s) 102 includes one or more speakers 104, one or more displays 106, or both. The media content can include sound (e.g., binaural or multichannel audio content) based on an output audio signal 180. Optionally, the media content can also include video content, game content, or other visual content.


The system 100 also includes one or more pose sensors 108. The pose sensor(s) 108 are configured to generate pose data 110 associated with a pose of a user of at least one of the media output device(s) 102. As used herein, a “pose” indicates a location and an orientation of the media output device(s) 102, a location and an orientation of the user of the media output device(s) 102, or both. In some implementations, at least one of the pose sensor(s) 108 is integrated within a wearable device, such that when the wearable device is worn by a user of a media output device 102, the pose data 110 indicates the pose of the user. In some such implementations, the wearable device can include the pose sensor 108 and at least one of the media output device(s) 102. To illustrate, the pose sensor 108 and at least one of the media output device(s) 102 can be combined in a head-mounted wearable device that includes the speaker(s) 104, the display(s) 106, or both. Examples of sensors that can be used as wearable pose sensors include, without limitation, inertial sensors (e.g., accelerometers or gyroscopes), compasses, positioning sensors (e.g., a global positioning system (GPS) receiver), magnetometers, inclinometers, optical sensors, and one or more other sensors to detect location, velocity, acceleration, angular orientation, angular velocity, angular acceleration, or any combination thereof. To illustrate, the pose sensor(s) 108 can include GPS, electronic maps, and electronic compasses that use inertial and magnetic sensor technology to determine direction, such as a 3-axis magnetometer to measure the Earth's geomagnetic field and a 3-axis accelerometer to provide, based on a direction of gravitational pull, a horizontality reference to the Earth's magnetic field vector.


In some implementations, at least one of the pose sensor(s) 108 is not configured to be worn by the user. For example, at least one of the pose sensor(s) 108 can include one or more optical sensors (e.g., cameras) to track movement of the user or the media output device(s) 102. In some implementations, the pose sensor(s) 108 can include a combination of sensor(s) worn by the user and sensor(s) that are not worn by the user, where the combination of sensors is configured to cooperate to generate the pose data 110.


The pose data 110 indicates the pose of the user or the media output device(s) 102 or indicates movement (e.g., changes in pose) of the user or the media output device(s) 102. In this context, “movement” includes rotation (e.g., a change in orientation without a change in location, such as a change in roll, tilt, or yaw), translation (e.g., non-rotational movement), or a combination thereof.


In FIG. 1, the immersive audio renderer 122 is configured to process immersive audio data to generate the output audio signal 180 based on the pose data 110. The immersive audio data corresponds to or is included within a plurality of immersive audio assets (“assets” in FIG. 1). In this context, an “asset” refers to a data structure (such as a file) that stores data representing at least a portion of an immersive audio environment. Generating the output audio signal 180 based on the pose data 110 includes generating a sound field representation of the immersive audio data in a manner that accounts for a current or predicted listener pose in the immersive audio environment. For example, the immersive audio renderer 122 is configured to perform a rendering operation on an asset (e.g., a remote asset 144, a local asset 142, or both) to generate a rendered asset 126. A rendered asset (whether pre-rendered or rendered as needed, e.g., in real-time) can include, for example, data describing sound from a plurality of sound sources of the immersive audio environment as such sound sources would be perceived by a listener at a particular position in the immersive audio environment or at the particular position and a particular orientation in the immersive audio environment. For example, for a particular listener pose, the rendered asset can include data representing sound field characteristics such as: an azimuth (θ) and an elevation (φ) of a direction of an average intensity vector associated with a set of sources of the immersive audio environment; a signal energy (e) associated with the set of sources of the immersive audio environment; a direct-to-total energy ratio (r) associated with the set of sources of the immersive audio environment; and an interpolated audio signal (ŝ) for the set of sources of the immersive audio environment. In this example, each of these sound field characteristics can be calculated for each frame (f), sub-frame (k), and frequency bin (b).
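
As an illustrative, non-limiting sketch, the sound field characteristics listed above can be organized as a small per-frame container indexed by sub-frame (k) and frequency bin (b). The class and field names below are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RenderedAssetFrame:
    """Hypothetical container for the sound field characteristics of one
    frame (f) of a rendered asset, indexed by sub-frame k and frequency bin b."""
    azimuth: np.ndarray              # theta[k, b]: direction of the average intensity vector (radians)
    elevation: np.ndarray            # phi[k, b] (radians)
    energy: np.ndarray               # e[k, b]: signal energy of the set of sources
    direct_to_total: np.ndarray      # r[k, b]: direct-to-total energy ratio in [0, 1]
    interpolated_signal: np.ndarray  # s_hat[k, b]: interpolated audio signal (complex bins)

def empty_frame(num_subframes: int, num_bins: int) -> RenderedAssetFrame:
    """Allocate an all-zero frame with one value per (sub-frame, bin) pair."""
    shape = (num_subframes, num_bins)
    return RenderedAssetFrame(
        azimuth=np.zeros(shape),
        elevation=np.zeros(shape),
        energy=np.zeros(shape),
        direct_to_total=np.zeros(shape),
        interpolated_signal=np.zeros(shape, dtype=complex),
    )

# A rendered asset can then be represented as a sequence of such frames, one per frame f.
rendered_asset = [empty_frame(num_subframes=4, num_bins=64) for _ in range(100)]
```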


The immersive audio renderer 122 includes a binauralizer 128 that is configured to binauralize an output of the rendering operation (e.g., the rendered asset 126) to generate the output audio signal 180. According to an aspect, the output audio signal 180 includes an output binaural signal that is provided to the speaker(s) 104 for playout. The rendering operation and binauralization can include sound field rotation (e.g., three degrees of freedom (3 DOF)), rotation and limited translation (e.g., 3 DOF+), or rotation and translation (e.g., 6 DOF) based on the listener pose.


In FIG. 1, the immersive audio renderer 122 includes or is coupled to an audio asset selector 124 that is configured to select one or more assets based on the pose data 110. In some implementations, the audio asset selector 124 selects, based on a current listener pose indicated by the pose data 110, one or more assets for rendering to generate one of the rendered asset(s) 126. The “current listener pose” refers to the listener's position, the listener's orientation, or both, in the immersive audio environment as indicated by the pose data 110. In another example, the audio asset selector 124 can select, based on a current listener pose indicated by the pose data 110, one or more previously rendered assets 126 for output. To illustrate, the audio asset selector 124 selects one of the rendered assets 126 for binauralization and output via the output audio signal 180 based on the current listener pose indicated by the pose data 110.


In the same or different implementations, the audio asset selector 124 is configured to select one or more assets for rendering based on a predicted listener pose. As explained further below, a pose predictor can determine the predicted listener pose based on, among other things, the pose data 110. One benefit of selecting an asset based on a predicted listener pose is that the immersive audio renderer 122 can retrieve and/or process (e.g., render) the asset before the asset is needed, thereby avoiding delays due to asset retrieval and processing.


After selecting a target asset, the audio asset selector 124 generates an asset retrieval request 138. The asset retrieval request 138 identifies at least one target asset that is to be retrieved for processing by the immersive audio renderer 122. In implementations in which assets are stored in two or more locations, such as at a remote memory 112 and a local memory 170, the system 100 includes an asset location selector 130 configured to receive the target asset retrieval request 138 and determine which of the available memories to retrieve the asset from. In some circumstances, a particular asset may only be available from one of the memories. For example, assets 172 stored at the local memory 170 may include a subset of the assets 114 stored at the remote memory 112. To illustrate, as described further below, some of the assets 114 can be retrieved (e.g., pre-fetched) from the remote memory 112 and stored among the assets 172 at the local memory 170 before such assets are to be processed by the immersive audio renderer 122.


In some implementations, the asset location selector 130 is configured to retrieve a target asset from the local memory 170 if the target asset is among the assets 172 stored at the local memory 170. In such implementations, based on a determination that the target asset is not stored at the local memory 170, the asset location selector 130 selects to obtain the target asset from the remote memory 112. For example, the asset location selector 130 may send the asset retrieval request 138 to the client 120, and the client 120 may initiate retrieval of the target asset from the remote memory 112 via an asset request 136. Otherwise, based on a determination that the target asset is stored at the local memory 170, the asset location selector 130 selects to obtain the target asset from the local memory 170. For example, the asset location selector 130 may send the asset retrieval request 138 to the local memory 170 to initiate retrieval of the target asset to the immersive audio renderer 122 as a local asset 142.
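
As an illustrative, non-limiting sketch of the local-first retrieval policy described above, the snippet below stands in for the asset location selector 130; the `local_assets`, `fetch_local`, and `fetch_remote` names are hypothetical placeholders for the local memory 170, local retrieval, and the client 120 request path.

```python
def retrieve_asset(asset_id, local_assets, fetch_local, fetch_remote):
    """Local-first retrieval policy: return the asset from the local memory
    when it is cached there (local asset 142); otherwise ask the client to
    request it from the remote memory (remote asset 144 via asset request 136)."""
    if asset_id in local_assets:
        return fetch_local(asset_id)
    return fetch_remote(asset_id)
```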


In the example illustrated in FIG. 1, a remote device 116 includes the remote memory 112, which stores multiple assets 114 that correspond to representations of audio content associated with the immersive audio environment. For example, the assets 114 stored at the remote memory 112 can include one or more scene-based assets 114A, one or more object-based assets 114B, one or more channel-based assets 114C, one or more pre-rendered assets 114D, or a combination thereof. The remote memory 112 is configured to provide, to the client 120, a manifest of assets 134 that are available at the remote memory 112, such as a stream manifest. The remote memory 112 is configured to receive a request for one or more particular assets, such as the asset request 136 from the client 120, and to provide the target asset, such as an audio asset 132, to the client 120 in response to the request.


The pre-rendered assets 114D of FIG. 1 can include assets that have been subjected to rendering operations (e.g., as described further with reference to FIG. 6) to generate a sound field representation for a particular listener location or for particular listener location and orientation. The scene-based assets 114A of FIG. 1 can include various versions, such as a first ambisonics representation 114AA, a second ambisonics representation 114AB, a third ambisonics representation 114AC, and one or more additional ambisonics representations including an Nth ambisonics representation 114AN. One or more of the ambisonics representations 114AA-114AN can correspond to a full set of ambisonics coefficients corresponding to a particular ambisonics order, such as first order ambisonics, second order ambisonics, third order ambisonics, etc. Alternatively, or in addition, one or more of the ambisonics representations 114AA-114AN can correspond to a set of mixed order ambisonics coefficients that provides an enhanced resolution for particular listener orientations (e.g., for higher resolution in the listener's viewing direction as compared to away from the listener's viewing direction) while using less bandwidth than a full set of ambisonics coefficients corresponding to the enhanced resolution.


In some implementations, the assets 172 can include the same types of assets as the assets 114. For example, the assets 172 can include scene-based assets, object-based assets, channel-based assets, pre-rendered assets, or a combination thereof. As noted above, in some implementations, one or more of the assets 114 can be retrieved from the remote memory 112 and stored among the assets 172 at the local memory 170 before such assets are to be processed by the immersive audio renderer 122. When the remote memory 112 provides an asset to the client 120, the asset can be encoded and/or compressed for transmission (e.g., over one or more networks). In some implementations, the client 120 includes or is coupled to a decoder 121 that is configured to decode and/or decompress the asset for storage at the local memory 170, for communication to the immersive audio renderer 122 as a remote asset 144, or both. In some such implementations, one or more of the assets 172 are stored at the local memory 170 in an encoded and/or compressed format, and the decoder 121 is operable to decode and/or decompress a selected one of the asset(s) 172 before the selected asset is communicated to the immersive audio renderer 122 as a local asset 142. To illustrate, when the target asset identified in the asset retrieval request 138 is among the assets 172 stored at the local memory 170, the asset location selector 130 can determine whether the asset is stored in an encoded and/or compressed format. The asset location selector 130 can selectively cause the decoder 121 to decode and/or decompress the asset based on the determination.
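
The decode-on-retrieval path can be sketched similarly. In the illustrative, non-limiting snippet below, the `is_compressed` flag and the `decode` callable are hypothetical stand-ins for the stored-format indication and the decoder 121.

```python
def load_local_asset(asset_id, local_assets, decode):
    """Return a locally stored asset ready for the immersive audio renderer,
    decoding/decompressing it first when it is cached in an encoded format."""
    entry = local_assets[asset_id]        # e.g., {"data": ..., "is_compressed": bool}
    if entry.get("is_compressed", False):
        return decode(entry["data"])      # pass through the decoder 121
    return entry["data"]
```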


In FIG. 1, the system 100 includes a movement estimator 150 configured to generate contextual movement estimate data 152. The contextual movement estimate data 152 indicates how much movement, or what type of movement, of a listener's pose is expected at a particular time. The movement estimator 150 can base the contextual movement estimate data 152 on various types of information. For example, the movement estimator 150 can generate the movement estimate data 152 based on the pose data 110. To illustrate, the pose data 110 can indicate a current listener pose, and the movement estimator 150 can generate the movement estimate data 152 based on the current listener pose or a recent set of changes in the current listener pose over time. As one example, the movement estimator 150 can generate the contextual movement estimate data 152 based on a recent rate and/or type of change of the listener pose, as determined from a set of recent listener pose data, where “recent” is determined based on some specified time limit (e.g., the last one minute, the last five minutes, etc.) or based on a specified number of samples of the pose data 110 (e.g., the most recent ten samples, the most recent one hundred samples, etc.).
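
As one illustrative, non-limiting way to derive such an estimate from recent pose samples, the sketch below averages pose changes over a sliding time window; the window length and the six-component pose layout are assumptions made for illustration.

```python
import numpy as np

def recent_movement_rates(poses: np.ndarray, timestamps: np.ndarray, window_s: float = 60.0):
    """Estimate recent translational and rotational movement rates from pose
    samples. `poses` has shape (N, 6): x, y, z, roll, tilt, yaw; `timestamps`
    has shape (N,) in seconds, in increasing order."""
    recent = timestamps >= (timestamps[-1] - window_s)
    p, t = poses[recent], timestamps[recent]
    if len(p) < 2 or t[-1] <= t[0]:
        return 0.0, 0.0
    duration = t[-1] - t[0]
    translation_rate = np.linalg.norm(np.diff(p[:, :3], axis=0), axis=1).sum() / duration  # m/s
    rotation_rate = np.abs(np.diff(p[:, 3:], axis=0)).sum() / duration                     # rad/s
    return float(translation_rate), float(rotation_rate)
```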


As another example, the movement estimator 150 can generate the movement estimate data 152 based on a predicted pose. For example, a pose predictor can generate a predicted listener pose based at least partially on the pose data 110. The predicted listener pose can indicate a location and/or orientation of the listener in the immersive audio environment at some future time. In this example, the movement estimator 150 can generate the movement estimate data 152 based on movement that will occur (e.g., that is predicted to occur) to change from the current listener pose to the predicted listener pose.


As another example, the movement estimator 150 can generate the movement estimate data 152 based on historical interaction data 158 associated with an asset, associated with an immersive audio environment, associated with a scene of the immersive audio environment, or a combination thereof. The historical interaction data 158 can be indicative of interaction of a current user of the media output device(s) 102, interaction of other users who have consumed specific assets or interacted with the immersive audio environment, or a combination thereof. For example, the historical interaction data 158 can include movement trace data descriptive of movements of a set of users (which may include the current user) who have interacted with the immersive audio environment. In this example, the movement estimator 150 can use the historical interaction data 158 to estimate how much the current user is likely to move in the near future (e.g., during consumption of a portion of an asset or scene that the user is currently consuming). To illustrate, when the immersive audio environment is related to game content, a scene of the game content can depict (in sound, video, or both) a startling event (e.g., an explosion, a crash, a jump scare, etc.) that historically has caused users to quickly look in a particular direction or to pan around the environment, as indicated by the historical interaction data 158. In this illustrative example, the contextual movement estimate data 152 can indicate, based on the historical interaction data 158, that a rate of movement and/or a type of movement of the listener pose is likely to increase when the startling event occurs.
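
As an illustrative, non-limiting sketch, if the historical interaction data takes the form of per-user movement traces of (time index, movement rate) pairs, the movement expected near an upcoming playout time can be estimated by pooling the traced rates around that time. The trace format used here is an assumption made for illustration.

```python
import numpy as np

def expected_movement_at(time_index: float, traces: list, window_s: float = 2.0) -> float:
    """Average the traced movement rates observed within `window_s` seconds of
    `time_index` across prior users; returns 0.0 when no trace covers it.
    `traces` is a list (one entry per user) of (time_index, movement_rate) pairs."""
    rates = [rate
             for trace in traces
             for t, rate in trace
             if abs(t - time_index) <= window_s]
    return float(np.mean(rates)) if rates else 0.0
```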


As another example, the movement estimator 150 can generate the movement estimate data 152 based on one or more context cues 154 (also referred to herein as “movement cues”) associated with the immersive audio environment. One or more of the context cue(s) 154 can be explicitly provided in metadata of the asset(s) representing the immersive audio environment. For example, metadata associated with an asset can include a field that indicates the contextual movement estimate data 152. To illustrate, a game creator or distributor can indicate in metadata associated with a particular asset that the asset or a portion of the asset is expected to result in a change in the rate of listener movement. As one example, if a scene of a game includes an event that is likely to cause the user to move more (or less), metadata of the game can indicate when the event occurs during playout of an asset, where the event occurs (e.g., a sound source location in the immersive audio environment), a type of event, an expected result of the event (e.g., increased or decreased translation in a particular direction, increased or decreased head rotation, etc.), a duration of the event, etc.


In some implementations, one or more of the context cue(s) 154 are implicit rather than explicit. For example, metadata associated with an asset can indicate a genre of the asset, and the movement estimator 150 can generate the contextual movement estimate data 152 based on the genre of the asset. To illustrate, the movement estimator 150 may expect less rapid head movement during play out of an immersive audio environment representing a classical music genre than is expected during play out of an immersive audio environment representing a first-person shooter game.
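
Taken together, explicit and implicit context cues can be reduced to a simple precedence rule: an explicit metadata field, when present, overrides a genre-level default. The metadata keys and genre table in this illustrative, non-limiting sketch are assumptions rather than fields defined by the disclosure.

```python
# Hypothetical genre-level defaults for the expected rate of listener movement.
GENRE_DEFAULTS = {
    "classical_concert": "low",
    "first_person_shooter": "high",
}

def movement_estimate_from_cues(asset_metadata: dict) -> str:
    """Prefer an explicit movement cue in the asset metadata; otherwise fall
    back to an implicit, genre-based default; otherwise assume 'medium'."""
    if "expected_movement" in asset_metadata:                         # explicit context cue
        return asset_metadata["expected_movement"]
    return GENRE_DEFAULTS.get(asset_metadata.get("genre"), "medium")  # implicit context cue
```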


The movement estimator 150 is configured to set one or more pose update parameters 156 based on the contextual movement estimate data 152. In a particular aspect, the pose update parameter(s) 156 indicate a pose data update rate for the pose data 110. For example, the movement estimator 150 can set the pose update parameter(s) 156 by sending the pose update parameter(s) 156 to the pose sensor(s) 108 to cause the pose sensor(s) 108 to provide the pose data 110 at a rate associated with the pose update parameter(s) 156. In some implementations, the system 100 includes two or more pose sensor(s) 108. In such implementations, the movement estimator 150 can send the same pose update parameter(s) 156 to each of the two or more pose sensor(s) 108, or the movement estimator 150 can send different pose update parameter(s) 156 to different pose sensor(s) 108. To illustrate, the system 100 can include a first pose sensor 108 configured to generate pose data 110 indicating a translational position of a listener in the immersive audio environment and a second pose sensor 108 configured to generate pose data 110 indicating a rotational orientation of the listener in the immersive audio environment. In this example, the movement estimator 150 can send different pose update parameter(s) 156 to the first and second pose sensors 108. For example, the contextual movement estimate data 152 can indicate that a rate of head rotation is expected to increase whereas a rate of translation is expected to remain unchanged. In this example, the movement estimator 150 can send first pose update parameter(s) 156 to cause the second pose sensor to increase the rate of generation of the pose data 110 indicating the rotational orientation of the listener and can refrain from sending pose update parameter(s) 156 to the first pose sensor (or can send second pose update parameter(s) 156) to cause the first pose sensor to continue generation of the pose data 110 indicating the translational position at the same rate as before.
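
The per-sensor example above (raising the rotation sensor's update rate while leaving the translation sensor unchanged) can be captured in a short routine. The rate values and the `set_update_rate_hz` interface in this illustrative, non-limiting sketch are assumptions.

```python
def apply_pose_update_parameters(rotation_sensor, translation_sensor, estimate: dict) -> None:
    """Send per-sensor pose update parameters based on the contextual movement
    estimate. `estimate` holds expected activity per movement type, e.g.,
    {"rotation": "high", "translation": "unchanged"}."""
    if estimate.get("rotation") == "high":
        rotation_sensor.set_update_rate_hz(200.0)    # faster rotational pose updates
    elif estimate.get("rotation") == "low":
        rotation_sensor.set_update_rate_hz(20.0)     # slower updates to save power
    # "unchanged" (or absent): refrain from sending a new parameter to the sensor.
    if estimate.get("translation") == "high":
        translation_sensor.set_update_rate_hz(200.0)
    elif estimate.get("translation") == "low":
        translation_sensor.set_update_rate_hz(20.0)
```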


One technical advantage of using the contextual movement estimate data 152 to set the pose update parameter(s) 156 is that pose data 110 update rates can be set based on user movement rates, which can enable conservation of resources and improved user experience. For example, when relatively high movement rates are expected (as indicated by the contextual movement estimate data 152), the pose update parameter(s) 156 can be set to increase the rate at which the pose data 110 is updated. The increased update rate for the pose data 110 reduces motion-to-sound latency of the output audio signal 180. To illustrate, in this example, user movement (e.g., head rotation) is reflected in the output audio signal 180 more quickly because pose data 110 reflecting the user movement is available to the immersive audio renderer 122 more quickly. Conversely, when relatively low movement rates are expected (as indicated by the contextual movement estimate data 152), the pose update parameter(s) 156 can be set to decrease the rate at which the pose data 110 is updated. The decreased update rate for the pose data 110 conserves resources (e.g., computing cycles, power, memory) associated with rendering and binauralization by the immersive audio renderer 122, resources (e.g., bandwidth, power) associated with transmission of the pose data 110, or a combination thereof.


Although FIG. 1 illustrates both the local memory 170 storing assets 172 and the remote memory 112 storing assets 114, in other implementations, only one or the other of the memories 112, 170 stores assets for playout. For example, in some implementations or in some modes of operation of the system 100, the assets 172 are downloaded to the local memory 170 for use, and the asset location selector 130 always retrieves assets 172 from the local memory 170. To illustrate, the system 100 can operate in a local-only mode when a network connection to the remote memory 112 is not available (e.g., when a device is in “airplane mode”). As another example, in some implementations or in some modes of operation of the system 100, the assets 114 are downloaded from the remote memory 112 for use, and the asset location selector 130 always retrieves assets 114 from the remote memory 112 via the client 120. To illustrate, the system 100 can operate in a remote-only mode when streaming content from a streaming service associated with the remote memory 112.



FIG. 1 illustrates one particular, non-limiting, arrangement of the components of the system 100. In other implementations, the components can be arranged and interconnected in a different manner than illustrated in FIG. 1. For example, the decoder 121 can be distinct from and external to the client 120. As another example, the audio asset selector 124 can be distinct from and external to the immersive audio renderer 122. To illustrate, the audio asset selector 124 and the asset location selector 130 can be combined. As another example, the movement estimator 150 can be combined with the audio asset selector 124.


In some implementations, many of the components of the system 100 are integrated within the media output device(s) 102. For example, the media output device(s) 102 can include a head-mounted wearable device, such as a headset, a helmet, earbuds, etc., that include the client 120, the local memory 170, the asset location selector 130, the immersive audio renderer 122, the movement estimator 150, the pose sensor(s) 108, or any combination thereof. As another example, the media output device(s) 102 can include a head-mounted wearable device and a separate player device, such as a game console, a computer, or a smart phone. In this example, at least one pair of the speaker(s) 104 and at least one of the pose sensor(s) 108 can be integrated within the head-mounted wearable device and other components of the system 100 can be integrated into the player device, or divided between the player device and the head-mounted wearable device.



FIG. 2 is a block diagram of a system 200 that includes aspects of the system 100 of FIG. 1. For example, the system 200 includes the media output device(s) 102, the immersive audio renderer 122, the asset location selector 130, the client 120, the local memory 170, and the remote memory 112 of FIG. 1, each of which operates as described with reference to FIG. 1. In the system 200, the immersive audio renderer 122, the asset location selector 130, the client 120, and the local memory 170 are included in an immersive audio player 202 that is configured to communicate (e.g., via a modem 220) with the pose sensor(s) 108 and the media output device(s) 102. In other examples, the media output device(s) 102 and the immersive audio player 202 are integrated within a single device, such as a wearable device, which can include the speaker(s) 104, the display(s) 106, the pose sensor(s) 108, or a combination thereof.


In FIG. 2, the immersive audio player 202 includes one or more processors 210 configured to execute instructions (e.g., instructions 174 from the local memory 170) to perform the operations of the immersive audio renderer 122, the asset location selector 130, the client 120, the movement estimator 150, the audio asset selector 124, a pose predictor 250, or a combination thereof.



FIG. 2 illustrates an example of the system 100 of FIG. 1 in which the movement estimator 150 is an aspect of the immersive audio renderer 122. For example, the movement estimator 150 can be integrated within or coupled to the pose predictor 250.


In some implementations, the movement estimator 150 is configured to determine the contextual movement estimate data 152 based on one or more prior poses 252, a current pose 254, one or more predicted poses 256, or a combination thereof. For example, the movement estimator 150 can determine the contextual movement estimate data 152 based on a historical movement rate, where the historical movement rate is determined based on differences between the prior pose(s) 252, between the prior pose(s) 252 and the current pose 254, between the prior pose(s) 252 or current pose 254 and the predicted pose(s) 256, or combinations thereof. In this context, the prior pose(s) 252 can include historical pose data 110, whereas the current pose 254 refers to a pose indicated by a most recent set of samples of the pose data 110.


In a particular aspect, the pose predictor 250 is configured to determine the predicted pose(s) 256 using predictive techniques such as extrapolation based on the prior pose(s) 252 and/or the current pose 254; inference using one or more artificial intelligence models; probability-based estimates based on the prior pose(s) 252 and/or the current pose 254; probability-based estimates based on the historical interaction data 158 of FIG. 1; the context cues 154 of FIG. 1; or combinations thereof. The predicted pose(s) 256 can be used to reduce motion-to-sound latency of the output audio signal 180. For example, the immersive audio renderer 122 can generate asset retrieval requests 138 for one or more assets associated with the predicted pose(s) 256. In this example, the immersive audio renderer 122 can process an asset associated with a particular predicted pose 256 to generate a rendered asset. The rendered asset represents a sound field of the immersive audio environment as the sound field would be perceived by a listener having the particular predicted pose 256. In this example, the rendered asset is used to generate the output audio signal 180 when (or if) the pose data 110 indicates that the particular predicted pose 256 used to render the asset is the current pose 254. By using the predicted pose(s) 256 to select and/or render assets, the immersive audio renderer 122 is able to perform many complex rendering operations in advance, leading to reduced latency of providing the output audio signal 180 representing a particular asset and pose as compared to selecting, requesting, receiving, and rendering assets on an as-needed basis (e.g., rendering an asset exclusively based on the current pose 254).
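
Of the predictive techniques listed above, extrapolation from the prior pose(s) and the current pose is the simplest to sketch. The constant-velocity assumption below is an illustrative, non-limiting choice; model-based and probability-based prediction are equally contemplated above.

```python
import numpy as np

def extrapolate_pose(prior_pose: np.ndarray, prior_t: float,
                     current_pose: np.ndarray, current_t: float,
                     future_t: float) -> np.ndarray:
    """Constant-velocity extrapolation of a 6-DOF pose vector
    (x, y, z, roll, tilt, yaw) from two samples to a future time."""
    dt = current_t - prior_t
    if dt <= 0.0:
        return current_pose
    velocity = (current_pose - prior_pose) / dt
    return current_pose + velocity * (future_t - current_t)
```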


In some implementations, the immersive audio renderer 122 can render two or more assets based on the predicted pose(s) 256. For example, in some circumstances, there can be significant uncertainty as to which of a set of possible poses the user will move to in the future. To illustrate, in a game environment, the user can be faced with several choices, and the specific choice the user makes can change the asset to be rendered, a future listener pose, or both. In this example, the predicted pose(s) 256 can include multiple poses for a particular future time, and the immersive audio renderer 122 can render one asset based on two or more predicted pose(s) 256, can render two or more different assets based on the two or more predicted poses 256, or both. In this example, when the current pose 254 aligns with one of the predicted pose(s) 256, the corresponding rendered asset is used to generate the output audio signal 180.
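
When several assets have been rendered ahead of time for different predicted poses, playout can select the rendered asset whose predicted pose best matches the eventual current pose and fall back to real-time rendering when none is close enough. The distance metric and threshold in this illustrative, non-limiting sketch are assumptions.

```python
import numpy as np

def select_prerendered(current_pose: np.ndarray, prerendered: dict, max_distance: float = 0.1):
    """`prerendered` maps a predicted pose (stored as a tuple) to its rendered
    asset. Returns the asset for the closest predicted pose within
    `max_distance`, or None to indicate that real-time rendering is needed."""
    best_asset, best_dist = None, max_distance
    for pose, asset in prerendered.items():
        dist = float(np.linalg.norm(np.asarray(pose) - current_pose))
        if dist < best_dist:
            best_asset, best_dist = asset, dist
    return best_asset
```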


In some implementations, the immersive audio renderer 122 can render assets in stages, as described with reference to FIG. 6. For example, the immersive audio renderer 122 can perform a first set of operations to localize a sound field representation of the immersive audio environment to a listener location and a second set of operations to rotate the sound field representation of the immersive audio environment to a listener orientation. In some such implementations, the immersive audio renderer 122 can perform only the first set of operations (e.g., localization operations) or only the second set of operations (e.g., rotation operations) based on the predicted pose(s) 256. In such implementations, the remaining operations (e.g., localization or rotation operations) are performed to generate the output audio signal 180 when (or if) one of the predicted pose(s) 256 becomes the current pose 254.
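
As one illustrative, non-limiting example of staged rendering, a first-order ambisonics representation can be localized to a predicted listener location in advance, with only a rotation applied at playout once the actual orientation is known. The ACN channel ordering (W, Y, Z, X) and the yaw-only rotation below are simplifying assumptions and do not reflect a complete rotation implementation.

```python
import numpy as np

def rotate_foa_yaw(foa: np.ndarray, yaw_rad: float) -> np.ndarray:
    """Rotate a first-order ambisonics signal (channels in ACN order W, Y, Z, X;
    shape (4, num_samples)) about the vertical axis. Only the horizontal dipole
    channels X and Y change; W and Z are invariant under a yaw rotation."""
    w, y, z, x = foa
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    return np.stack([w, s * x + c * y, z, c * x - s * y])

# Stage 1 (in advance): localize the sound field to the predicted listener
# location to obtain foa_at_predicted_location (localization not shown here).
# Stage 2 (at playout): apply the listener's actual yaw before binauralization:
#   rotated = rotate_foa_yaw(foa_at_predicted_location, yaw_rad=current_yaw)
```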


As described with reference to FIG. 1, the movement estimator 150 is configured to set the pose update parameter(s) 156 based on the contextual movement estimate data 152. In a particular aspect, a mode or rate of pose prediction by the pose predictor 250 can be related to the pose update parameter(s) 156. For example, the pose predictor 250 can be turned off when the pose update parameter(s) 156 have a particular value. To illustrate, when the contextual movement estimate data 152 indicates that little or no user movement is expected for a particular period of time, the pose update parameter(s) 156 can be set such that the pose sensor(s) 108 are turned off or provide pose data 110 at a low rate and the pose predictor 250 is turned off. Conversely, when the contextual movement estimate data 152 indicates that rapid user movement is expected for a particular period of time, the pose update parameter(s) 156 can be set such that the pose sensor(s) 108 provide pose data 110 at a high rate and the pose predictor 250 generates predicted poses 256. Additionally, or alternatively, the pose predictor 250 can generate predicted pose(s) 256 for times a first distance in the future in a first mode and a second distance in the future in a second mode, where the mode is selected based on the pose update parameter(s) 156.
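
The coupling between the contextual movement estimate, the pose update parameter(s), and the predictor's mode can be expressed as a small policy table. The mode names, rates, and prediction horizons in this illustrative, non-limiting sketch are assumptions.

```python
def pose_update_policy(expected_movement: str) -> dict:
    """Map a contextual movement estimate to a sensor update rate, a pose
    predictor on/off state, and a prediction horizon (how far ahead to predict)."""
    if expected_movement == "low":
        # Little or no movement expected: slow (or idle) sensing, predictor off.
        return {"sensor_rate_hz": 5.0, "predictor_enabled": False, "horizon_s": 0.0}
    if expected_movement == "high":
        # Rapid movement expected: fast sensing, predictor on with a short horizon.
        return {"sensor_rate_hz": 200.0, "predictor_enabled": True, "horizon_s": 0.1}
    # Default/medium case.
    return {"sensor_rate_hz": 50.0, "predictor_enabled": True, "horizon_s": 0.5}
```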


A technical advantage of adjusting the mode or rate of pose prediction by the pose predictor 250 based on the pose update parameter(s) 156 is that the pose predictor 250 can generate more predicted pose(s) 256 for periods when more movement is expected and fewer predicted pose(s) 256 for periods when less movement is expected. Generating more predicted pose(s) 256 for periods when more movement is expected enables the immersive audio renderer 122 to have a higher likelihood of rendering in advance an asset that will be used to generate the output audio signal 180. For example, the immersive audio renderer 122 can render assets associated with the predicted pose(s) 256 and use a particular one of the rendered assets to generate the output audio signal 180 when the current pose 254 corresponds to the predicted pose used to render the particular asset. In this example, having more predicted poses 256 and corresponding rendered assets means that there is a higher likelihood that the current pose 254 at some point in the future will correspond to one of the predicted poses 256, enabling use of the corresponding rendered asset to generate the output audio signal 180 rather than performing real-time rendering operations. On the other hand, pose prediction and rendering of assets based on the predicted poses 256 are resource intensive and can be wasteful if the assets rendered based on the predicted poses 256 are not used. Accordingly, generating fewer predicted pose(s) 256 for periods when less movement is expected enables the immersive audio renderer 122 to conserve resources.



FIG. 3 is a block diagram of a system 300 that includes aspects of the system 100 of FIG. 1. For example, the system 300 includes the immersive audio player 202, the media output device(s) 102, and the remote memory 112. Further, the immersive audio player 202 includes the processor(s) 210, the modem 220, and the local memory 170, and the processor(s) 210 are configured to execute the instructions 174 to perform the operations of the immersive audio renderer 122, the asset location selector 130, and the client 120. Except as described below, the immersive audio player 202, the media output device(s) 102, the remote memory 112, the processor(s) 210, the modem 220, the local memory 170, the immersive audio renderer 122, the asset location selector 130, and the client 120 of FIG. 3 each operate as described with reference to FIGS. 1 and 2.


In the system 300, the pose sensor(s) 108, the pose predictor 250, and the movement estimator 150 are onboard (e.g., integrated within) the media output device(s) 102. To enable the immersive audio renderer 122 to render certain assets before they are needed (e.g., based on predicted pose(s) 256), the pose data 110 of FIG. 3 includes the predicted pose(s) 256 and the current pose 254. As described with reference to FIG. 2, the movement estimator 150 can determine contextual movement estimate data 152, which can be used to set pose update parameter(s) 156 that affect the rate at which the pose sensor(s) 108 send updated pose data 110 to the immersive audio player 202, and optionally can affect operation of the pose predictor 250. As described with reference to FIGS. 1 and 2, setting the pose update parameter(s) 156 based on the contextual movement estimate data 152 can enable conservation of resources and improved user experience.



FIG. 4 is a block diagram of a system 400 that includes aspects of the system 100 of FIG. 1. For example, the system 400 includes the immersive audio player 202, the media output device(s) 102, and the remote memory 112. Further, the immersive audio player 202 includes the processor(s) 210, the modem 220, and the local memory 170, and the processor(s) 210 are configured to execute the instructions 174 to perform the operations of the immersive audio renderer 122, the asset location selector 130, and the client 120. Except as described below, the immersive audio player 202, the media output device(s) 102, the remote memory 112, the processor(s) 210, the modem 220, the local memory 170, the immersive audio renderer 122, the asset location selector 130, and the client 120 of FIG. 4 each operate as described with reference to FIGS. 1 and 2.



FIG. 4 illustrates an example of the system 100 of FIG. 1 in which the movement estimator 150 and the pose predictor 250 are aspects of the client 120. FIG. 4 also illustrates an example of the system 100 in which the historical interaction data 158 is based on movement trace data 402, movement trace data 406, or both.


As described with reference to FIGS. 1 and 2, the movement estimator 150 is configured to determine the contextual movement estimate data 152 and to set the pose update parameter(s) 156 based on the contextual movement estimate data 152. In the example illustrated in FIG. 4, the pose update parameter(s) 156 are provided to the pose predictor 250, to the immersive audio renderer 122, to the pose sensor(s) 108, or a combination thereof. In a particular aspect, a mode or rate of pose prediction by the pose predictor 250 can be related to the pose update parameter(s) 156.


In a particular aspect, the pose predictor 250 is configured to determine the predicted pose(s) 256 using the predictive technique(s) described with reference to FIG. 2. The client 120 provides the predicted pose(s) 256 to the immersive audio renderer 122. The immersive audio renderer 122 issues asset retrieval requests 138 for assets associated with one or more of the predicted pose(s) 256 and processes retrieved asset(s) associated with the predicted pose(s) 256 to generate rendered asset(s) 126. By using the predicted pose(s) 256 to select and/or render assets, the immersive audio renderer 122 is able to perform many complex rendering operations in advance, leading to reduced latency of providing, via the output audio signal 180, audio representing a particular asset and pose as compared to rendering assets on an as-needed basis (e.g., rendering an asset exclusively based on the current pose 254).


In FIG. 4, the movement estimator 150 can use the historical interaction data 158 (optionally, with other information) to determine the contextual movement estimate data 152. Additionally, or alternatively, the pose predictor 250 can use the historical interaction data 158 (optionally, with other information) to determine the predicted pose(s) 256. The historical interaction data 158 can include or correspond to movement trace data associated with the immersive audio environment. For example, the local memory 170 can store the movement trace data 406, which can indicate how a user (or users) of the immersive audio player 202 have moved during playback of the immersive audio environment, during playback of other immersive audio environments, or both. In this example, the movement trace data 406 can include information describing the immersive audio environment (e.g., by title, genre, etc.), specific movements or listener poses detected during playback along with time indices at which such movements or poses were detected, other user interactions (e.g., game inputs) detected during playback and associated time indices, etc.


In some implementations, the movement trace data 402 stored at the remote memory 112 is a copy of (e.g., the same as) the movement trace data 406 stored at the local memory 170. In some implementations, the movement trace data 402 stored at the remote memory 112 includes the same types of information (e.g., data fields) as the movement trace data 406 stored at the local memory 170, but includes information describing how users of other immersive audio player devices have interacted with the immersive audio environment. For example, the movement trace data 402 can aggregate historical user interaction associated with the immersive audio environment across a plurality of users of the immersive audio player 202 and other immersive audio players.


In implementations in which the movement estimator 150 determines the contextual movement estimate data 152 based on the historical interaction data 158, the historical interaction data 158 can indicate, or be used to determine, movement probability information associated with a particular scene or a particular asset of the immersive audio environment. For example, the movement probability information can indicate how likely a particular movement rate is during a particular portion of the immersive audio environment based on how the user or other users have moved during playback of the particular portion. As another example, the movement probability information can indicate how likely movement of a particular type (e.g., translation in a particular direction, rotation in a particular direction, etc.) is during a particular portion of the immersive audio environment based on how the user or other users have moved during playback of the particular portion. As a result, the movement estimator 150 can set the pose update parameter(s) 156 to prepare for expected movement associated with playback of the immersive audio environment. For example, when the historical interaction data 158 indicates that an upcoming portion of the immersive audio environment has historically been associated with rapid rotation of the listener pose, the movement estimator 150 can set the pose update parameter(s) 156 to increase the rate at which rotation-related pose data 110 is provided by the pose sensor(s) 108 to decrease the motion-to-sound latency associated with the playout of the upcoming portion. Conversely, when the historical interaction data 158 indicates that an upcoming portion of the immersive audio environment has historically been associated with little or no change of the listener pose, the movement estimator 150 can set the pose update parameter(s) 156 to decrease the rate at which the pose data 110 is provided by the pose sensor(s) 108 to conserve power and computing resources.


In implementations in which the pose predictor 250 determines the predicted pose(s) 256 based on the historical interaction data 158, the historical interaction data 158 can indicate, or be used to determine, pose probability information associated with a particular scene or a particular asset of the immersive audio environment. For example, the pose probability information can indicate the likelihood of particular listener locations, particular listener orientations, or particular listener poses during playback of a particular portion of the immersive audio environment based on historical listener poses during playback of the particular portion.


A technical benefit of determining the historical interaction data 158 based on the movement trace data 402, 406 is that the movement trace data 402, 406 provides an accurate estimate of how real users interact with the immersive audio environment, thereby enabling more accurate pose prediction, more accurate contextual movement estimation, or both. Further, the movement trace data 402, 406 can be captured readily. To illustrate, during use of the immersive audio player 202 to playout content associated with a particular immersive audio environment, the immersive audio player 202 can store the movement trace data 406 at the local memory 170. The immersive audio player 202 can send the movement trace data 406 to the remote memory 112 to update the movement trace data 402 at any convenient time, such as after playout of the content associated with the particular immersive audio environment is complete or when the immersive audio player 202 is connected to the remote memory 112 and the connection to the remote memory 112 has available bandwidth. The movement trace data 402 can include an aggregation of historical interaction data from a user of the immersive audio player 202 and other users.



FIG. 5 is a block diagram of a system 500 that includes aspects of the system 100 of FIG. 1. For example, the system 500 includes the immersive audio player 202, the media output device(s) 102, and the remote memory 112. Further, the immersive audio player 202 includes the processor(s) 210, the modem 220, and the local memory 170, and the processor(s) 210 are configured to execute the instructions 174 to perform the operations of the immersive audio renderer 122, the asset location selector 130, and the client 120. Except as described below, the immersive audio player 202, the media output device(s) 102, the remote memory 112, the processor(s) 210, the modem 220, the local memory 170, the immersive audio renderer 122, the asset location selector 130, and the client 120 of FIG. 5 each operate as described with reference to any of FIGS. 1-4.



FIG. 5 illustrates an example of the system 100 of FIG. 1 that includes at least two pose sensors 108, e.g., a pose sensor 108A and a pose sensor 108B. In the example illustrated in FIG. 5, the pose sensor 108A is integrated within one of the media output device(s) 102 and the pose sensor 108B is shown external to the media output device(s) 102; however, in other implementations, the pose sensor 108A is integrated within a first of the media output device(s) 102 and the pose sensor 108B is integrated within a second of the media output device(s) 102. For example, the pose sensor 108A can be included in a head-mounted media output device, such as a headset or earbuds, and the pose sensor 108B can be included in a non-head-mounted media output device, such as a game console, a computer, or a smartphone.


In a particular aspect, the pose sensors 108A and 108B are used together to determine a listener pose. For example, in some implementations, the pose sensor 108A provides pose data 110A representing rotation (e.g., a user's head rotation), and the pose sensor 108B provides pose data 110B indicating translation (e.g., a user's body movement). As another example, the pose data 110A can include first translation data, and the pose data 110B can include second translation data. In this example, the first and second translation data can be combined (e.g., subtracted) to determine a change in the listener pose in the immersive audio environment. Additionally, or alternatively, the pose data 110A can include first rotation data, and the pose data 110B can include second rotation data. In this example, the first and second rotation data can be combined (e.g., subtracted) to determine a change in the listener pose in the immersive audio environment.
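

As a simplified illustration, assuming yaw-only rotation and vector positions, data from a head-mounted sensor and a body-worn or handheld sensor could be combined as follows; the specific fusion used by a given implementation may differ.

```python
import numpy as np

def combine_pose(head_yaw_deg: float, body_yaw_deg: float,
                 head_pos: np.ndarray, body_pos: np.ndarray):
    """Combine readings from a head-mounted sensor and a body/handheld sensor.

    The difference of the two rotation readings isolates head-on-body rotation,
    and the difference of the two translation readings isolates head movement
    relative to the body.
    """
    relative_yaw_deg = (head_yaw_deg - body_yaw_deg + 180.0) % 360.0 - 180.0
    relative_translation = head_pos - body_pos
    return relative_yaw_deg, relative_translation

# Example: head sensor reports 95 degrees of yaw, body sensor reports 90 degrees.
yaw, offset = combine_pose(95.0, 90.0,
                           np.array([0.1, 0.0, 1.7]),
                           np.array([0.0, 0.0, 1.0]))
```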


In the example illustrated in FIG. 5, the movement estimator 150 can set pose update parameter(s) 156A for the pose sensor 108A separately from pose update parameter(s) 156B for the pose sensor(s) 108B. For example, a user's perception of a sound field may change more rapidly due to head rotation than due to translation while seated or walking. Accordingly, it may be desirable to set a higher update rate for pose data 110 indicating rotation than for pose data 110 that indicates translation.
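

A short sketch of keeping separate, independently set update rates for a rotation sensor and a translation sensor; the rate values are assumptions chosen only to show the asymmetry discussed above.

```python
from dataclasses import dataclass

@dataclass
class PoseUpdateParams:
    rotation_rate_hz: float     # update rate for the rotation sensor (e.g., head-mounted)
    translation_rate_hz: float  # update rate for the translation sensor (e.g., handheld)

def set_update_params(expect_fast_rotation: bool,
                      expect_fast_translation: bool) -> PoseUpdateParams:
    """Rotation changes the perceived sound field faster than translation,
    so the rotation sensor is given the higher of the two update rates."""
    return PoseUpdateParams(
        rotation_rate_hz=100.0 if expect_fast_rotation else 20.0,
        translation_rate_hz=30.0 if expect_fast_translation else 5.0,
    )
```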



FIG. 6 depicts an example of operations 600 that may be implemented in the immersive audio renderer 122 of any of FIGS. 1-5. In FIG. 6, the operations are divided between rendering operations 620 and mixing and binauralization operations 622.


In a particular aspect, the mixing and binauralization operations 622 can be performed by a mixer and binauralizer 614, which includes, corresponds to, or is included within the binauralizer 128 of any of FIGS. 1-5. In FIG. 6, the rendering operations 620 can be performed by one or more of a pre-processing module 602, a position pre-processing module 604, a spatial analysis module 606, a spatial metadata interpolation module 608, and a signal interpolation module 610. In a particular implementation, the operations 600 generate the output audio signal 180, which in FIG. 6 corresponds to a binaural output signal s_out(j), based on processing an asset that represents an immersive audio environment using ambisonics representations.


When an asset is received for rendering, the pre-processing module 602 is configured to receive head-related impulse responses (HRIRs) and audio source position information p_i (where boldface lettering indicates a vector, and where i is an audio source index), such as (x, y, z) coordinates of the location of each audio source in an audio scene. The pre-processing module 602 is configured to generate HRTFs and a representation of the audio source locations as a set of triangles T_1 . . . T_NT (where NT denotes the number of triangles) having an audio source at each triangle vertex.
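

Purely as an illustration, when the sources surround the listening area, one way to obtain such a triangle set is from the facets of the convex hull of the source positions (here using SciPy); this is not necessarily the construction applied by the pre-processing module 602.

```python
import numpy as np
from scipy.spatial import ConvexHull

# Hypothetical (x, y, z) positions p_i of six audio sources around the scene.
source_positions = np.array([
    [ 2.0,  0.0,  0.1],
    [-2.0,  0.0, -0.1],
    [ 0.0,  2.0,  0.2],
    [ 0.0, -2.0, -0.2],
    [ 0.0,  0.0,  2.0],
    [ 0.0,  0.0, -2.0],
])

hull = ConvexHull(source_positions)
triangles = hull.simplices   # each row lists the indices of the three audio
                             # sources forming one triangle of the hull
```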


The position pre-processing module 604 is configured to receive the representation of the audio source locations T_1 . . . T_NT, the audio source position information p_i, and listener position information p_L(j) (e.g., x, y, z coordinates) that indicates a listener location for a frame j of the audio data to be rendered. The position pre-processing module 604 is configured to generate an indication of the location of the listener relative to the audio sources, such as an active triangle T_A(j), of the set of triangles, that includes the listener location; an audio source selection indication m_C(j) (e.g., an index of a chosen source (e.g., a higher order ambisonics (HOA) source) for signal interpolation); and spatial metadata interpolation weights w̃_c(j, k) (e.g., chosen spatial metadata interpolation weights for a subframe k of frame j).
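

A simplified two-dimensional sketch of locating the triangle that contains the listener and deriving interpolation weights from barycentric coordinates; a full implementation would operate on the actual triangle set and three-dimensional positions.

```python
import numpy as np

def barycentric_weights(tri: np.ndarray, p: np.ndarray) -> np.ndarray:
    """Barycentric coordinates of point p in triangle tri (a 3 x 2 array)."""
    a, b, c = tri
    m = np.column_stack((b - a, c - a))   # 2 x 2 basis spanning the triangle
    v, w = np.linalg.solve(m, p - a)      # coordinates along the two edges
    return np.array([1.0 - v - w, v, w])

def find_active_triangle(triangles, p):
    """Return the index and weights of the first triangle containing p."""
    for idx, tri in enumerate(triangles):
        w = barycentric_weights(np.asarray(tri, dtype=float), p)
        if np.all(w >= -1e-9):            # inside the triangle (or on an edge)
            return idx, w
    return None, None

# Listener at (0.2, 0.3) inside a triangle whose vertices are three sources.
tri_set = [np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])]
active, weights = find_active_triangle(tri_set, np.array([0.2, 0.3]))
```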


The spatial analysis module 606 receives the audio signals of the audio streams, illustrated as s_ESD(i, j) (e.g., an equivalent spatial domain representation of the signals for each source i and frame j), and also receives the indication of the active triangle T_A(j) that includes the listener. The spatial analysis module 606 converts the input audio signals to an HOA format and generates orientation information for the HOA sources (e.g., θ(i, j, k, b) representing an azimuth parameter for HOA source i for sub-frame k of frame j and frequency bin b, and φ(i, j, k, b) representing an elevation parameter) and energy information (e.g., r(i, j, k, b) representing a direct-to-total energy ratio parameter and e(i, j, k, b) representing an energy value). The spatial analysis module 606 also generates a frequency domain representation of the input audio, such as S(i, j, k, b) representing a time-frequency domain signal of HOA source i.


The spatial metadata interpolation module 608 performs spatial metadata interpolation based on source orientation information o_i, listener orientation information o_L(j), the HOA source orientation information and energy information from the spatial analysis module 606, and the spatial metadata interpolation weights from the position pre-processing module 604. The spatial metadata interpolation module 608 generates energy and orientation information including ẽ(i, j, b) representing an average (over sub-frames) energy for HOA source i and audio frame j for frequency bin b, θ̃(i, j, b) representing an azimuth parameter for HOA source i for frame j and frequency bin b, φ̃(i, j, b) representing an elevation parameter for HOA source i for frame j and frequency bin b, and r̃(i, j, b) representing a direct-to-total energy ratio parameter for HOA source i for frame j and frequency bin b.
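

As a rough illustration of one interpolation strategy, direction parameters can be averaged as unit vectors (to avoid wrap-around near ±π) using the interpolation weights; the actual module's computation may differ.

```python
import numpy as np

def interpolate_directions(azimuths_rad, elevations_rad, weights):
    """Weighted average of directions, done on unit vectors so that angles
    near +/- pi do not average incorrectly."""
    az = np.asarray(azimuths_rad)
    el = np.asarray(elevations_rad)
    w = np.asarray(weights) / np.sum(weights)
    vecs = np.column_stack((np.cos(el) * np.cos(az),
                            np.cos(el) * np.sin(az),
                            np.sin(el)))
    mean = w @ vecs
    mean /= np.linalg.norm(mean)
    return np.arctan2(mean[1], mean[0]), np.arcsin(mean[2])

# Two directions near +/- pi average to roughly pi rather than zero.
az, el = interpolate_directions([3.1, -3.1], [0.1, 0.2], [0.5, 0.5])
```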


The signal interpolation module 610 receives energy information (e.g., ẽ(i, j, b)) from the spatial metadata interpolation module 608, energy information (e.g., e(i, j, k, b)) and a frequency domain representation of the input audio (e.g., S(i, j, k, b)) from the spatial analysis module 606, and the audio source selection indication m_C(j) from the position pre-processing module 604. The signal interpolation module 610 generates an interpolated audio signal Ŝ(j, k, b). Completion of the rendering operations 620 results in a rendered asset (e.g., the rendered asset 126 of any of FIGS. 1-5) corresponding to the source orientation information o_i, the interpolated audio signal Ŝ(j, k, b), and interpolated orientation and energy parameters from the signal interpolation module 610 and the spatial metadata interpolation module 608.


The mixer and binauralizer 614 receives the source orientation information o_i, the listener orientation information o_L(j), the HRTFs, and the interpolated audio signal Ŝ(j, k, b) and interpolated orientation and energy parameters from the signal interpolation module 610 and the spatial metadata interpolation module 608, respectively. When the asset is a pre-rendered asset 624, the mixer and binauralizer 614 receives the source orientation information o_i, the HRTFs, and the interpolated audio signal Ŝ(j, k, b) and interpolated orientation and energy parameters as part of the pre-rendered asset 624. Optionally, if the listener pose associated with a pre-rendered asset 624 is specified in advance, the pre-rendered asset 624 also includes the listener orientation information o_L(j). Alternatively, if the listener pose associated with a pre-rendered asset 624 is not specified in advance, the mixer and binauralizer 614 receives the listener orientation information o_L(j) based on the listener pose.


The mixer and binauralizer 614 is configured to apply one or more rotation operations based on an orientation of each interpolated signal and the listener's orientation; to binauralize the signals using the HRTFs; if multiple interpolated signals are received, to combine the signals (e.g., after binauralization); to perform one or more other operations; or any combination thereof, to generate the output audio signal 180.
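

A bare-bones sketch of that final stage: rotate each source direction into the listener's frame, select a nearby HRIR pair, convolve, and sum. The HRIR lookup, the 15-degree quantization, and the data layout are assumptions made only for illustration.

```python
import numpy as np

def binauralize(signals, source_azimuths_rad, listener_yaw_rad, hrir_bank):
    """Mix mono source signals into a binaural (left, right) pair.

    hrir_bank maps a quantized azimuth in degrees to a (left, right) pair of
    impulse responses; this lookup is a placeholder for real HRTF handling.
    """
    out_len = max(len(s) for s in signals) + max(
        len(h) for pair in hrir_bank.values() for h in pair) - 1
    out = np.zeros((2, out_len))
    for sig, az in zip(signals, source_azimuths_rad):
        relative_az = az - listener_yaw_rad                     # rotate into listener frame
        key = (int(np.round(np.degrees(relative_az) / 15.0)) * 15) % 360
        hrir_l, hrir_r = hrir_bank[key]
        left = np.convolve(sig, hrir_l)
        right = np.convolve(sig, hrir_r)
        out[0, :len(left)] += left                              # sum after binauralization
        out[1, :len(right)] += right
    return out
```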



FIG. 7 is a block diagram illustrating an implementation 700 of an integrated circuit 702. The integrated circuit 702 includes one or more processors 720, such as the one or more processors 210. The one or more processors 720 include immersive audio components 722. In FIG. 7, the immersive audio components 722 include the immersive audio renderer 122 and the movement estimator 150. Optionally, the immersive audio components 722 can include the pose predictor 250, the client 120, the decoder 121, the asset location selector 130, or a combination thereof. Further, optionally, the immersive audio renderer 122 or the movement estimator 150 can include the audio asset selector 124, the pose predictor 250, or both. In some implementations, the integrated circuit 702 also includes one or more of the pose sensor(s) 108.


The integrated circuit 702 also includes a signal input 704, such as bus interfaces and/or the modem 220, to enable the processor(s) 720 to receive input data 706, such as a target asset (e.g., a local asset 142 or a remote asset 144), the pose data 110, the historical interaction data 158, the context cue(s) 154, contextual movement estimate data 152, pose update parameter(s) 156, the manifest of assets 134, the audio assets 132, or combinations thereof. The integrated circuit 702 also includes a signal output 712, such as one or more bus interfaces and/or the modem 220, to enable the processor(s) 720 to provide output data 714 to one or more other devices. For example, the output data 714 can include the output audio signal 180, the pose update parameter(s) 156, the asset retrieval request 138, the asset request 136, or combinations thereof.


The integrated circuit 702 enables implementation of immersive audio processing as a component in one of a variety of devices, such as a speaker array as depicted in FIG. 8, a mobile device as depicted in FIG. 9, a headset device as depicted in FIG. 10, earbuds as depicted in FIG. 11, extended reality glasses as depicted in FIG. 12, an extended reality headset as depicted in FIG. 13, or a vehicle as depicted in FIG. 14 or 15.



FIG. 8 is a block diagram illustrating an implementation of a system 800 for immersive audio processing in which the immersive audio components 722 are integrated within a speaker array, such as a soundbar device 802. The soundbar device 802 is configured to perform a beam steering operation to steer binaural signals to a location associated with a user. The soundbar device 802 may receive audio assets 132 (e.g., ambisonics representations of an immersive audio environment) from a remote streaming server via a wireless network 806. The soundbar device 802 may include the one or more processors 720 of FIG. 7 (e.g., including the immersive audio renderer 122, the movement estimator 150, or both). Optionally, in FIG. 8, the soundbar device 802 includes or is coupled to the pose sensor(s) 108 to generate pose data 110, which is used to render and binauralize the one or more assets to generate a sound field of the immersive audio environment and to output binaural audio using a beam steering operation.


The soundbar device 802 includes or is coupled to the pose sensors 108 (e.g., cameras, structured light sensors, ultrasound, lidar, etc.) to enable detection of a pose of the listener 820 and generation of head-tracker data of the listener 820. For example, the soundbar device 802 may detect a pose of the listener 820 at a first location 822 (e.g., at a first angle from a reference 824), adjust the sound field based on the pose of the listener 820, and perform a beam steering operation to cause emitted sound 804 to be perceived by the listener 820 as a pose-adjusted binaural signal. In an example, the beam steering operation is based on the first location 822 and a first orientation of the listener 820 (e.g., facing the soundbar device 802). In response to a change in the pose of the listener 820, such as movement of the listener 820 to a second location 832, the soundbar device 802 adjusts the sound field (e.g., according to a 3DOF/3DOF+ or a 6DOF operation) and performs a beam steering operation to cause the resulting emitted sound 804 to be perceived by the listener 820 as a pose-adjusted binaural signal at the second location 832.
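

As an illustration only, a delay-and-sum beamformer for a linear driver array could steer emitted sound toward the tracked listener angle as follows; the array geometry and names are assumed.

```python
import numpy as np

SPEED_OF_SOUND_M_S = 343.0

def steering_delays(num_drivers: int, spacing_m: float,
                    listener_angle_rad: float) -> np.ndarray:
    """Per-driver delays (in seconds) that steer a linear array's beam toward
    the listener's angle, measured from the array's broadside reference."""
    positions = (np.arange(num_drivers) - (num_drivers - 1) / 2.0) * spacing_m
    delays = positions * np.sin(listener_angle_rad) / SPEED_OF_SOUND_M_S
    return delays - delays.min()          # keep all delays non-negative

# Listener detected 25 degrees off the reference axis of an 8-driver soundbar.
delays = steering_delays(8, 0.05, np.radians(25.0))
```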



FIG. 9 depicts an implementation 900 in which a mobile device 902 is configured to perform immersive audio processing. In FIG. 9, the mobile device 902 can include, as non-limiting examples, a phone or tablet. In the example illustrated in FIG. 9, the mobile device 902 includes a microphone 904, multiple speakers 104, and a display 106. The immersive audio components 722 and optionally one or more pose sensors 108 are integrated in the mobile device 902 and are illustrated using dashed lines to indicate internal components that are not generally visible to a user of the mobile device 902. In a particular example, the immersive audio components 722 include the immersive audio renderer 122, the movement estimator 150, and optionally other components described with reference to FIGS. 1-8. For example, in some implementations, the mobile device 902 is configured to perform operations described with reference to the immersive audio player 202 of any of FIGS. 2-5. To illustrate, the mobile device 902 can obtain contextual movement estimate data associated with a portion of the immersive audio environment; set a pose update parameter based on the contextual movement estimate data; obtain pose data based on the pose update parameter; obtain rendered assets associated with the immersive audio environment based on the pose data; and generate an output audio signal based on the rendered assets. In such implementations, the output audio signal can be provided to another device, such as the soundbar device 802 of FIG. 8, the headset of FIG. 10, the earbuds of FIG. 11, the extended reality glasses of FIG. 12, the extended reality headset of FIG. 13, or speakers of one of the vehicles of FIG. 14 or 15. The mobile device 902 obtains the pose data used to render an asset from the pose sensor 108 of the mobile device 902, from a pose sensor of another device, or from a combination of pose data from the pose sensor 108 of the mobile device 902 and pose data from another device.



FIG. 10 depicts an implementation 1000 in which a headset device 1002 is configured to perform immersive audio processing. In the example illustrated in FIG. 10, the headset device 1002 includes the speakers 104, and optionally includes a microphone 1004. The immersive audio components 722 and optionally one or more pose sensors 108 are integrated in the headset device 1002. In a particular example, the immersive audio components 722 include the immersive audio renderer 122, the movement estimator 150, and optionally other components described with reference to FIGS. 1-8. For example, in some implementations, the headset device 1002 is configured to perform operations described with reference to the immersive audio player 202 of any of FIGS. 2-5. To illustrate, the headset device 1002 can obtain contextual movement estimate data associated with a portion of the immersive audio environment; set a pose update parameter based on the contextual movement estimate data; obtain pose data based on the pose update parameter; obtain rendered assets associated with the immersive audio environment based on the pose data; and generate an output audio signal based on the rendered assets.


In some implementations, the headset device 1002 is configured to perform operations described with reference to the media output device(s) 102 of any of FIGS. 1-5. For example, the headset device 1002 can generate pose data 110 based on pose update parameter(s) 156 received from another device (e.g., the immersive audio player 202) and receive an output audio signal 180 representing immersive audio content rendered based on the pose data. In this example, the headset device 1002 can output sound based on the output audio signal 180.



FIG. 11 depicts an implementation 1100 in which a pair of earbuds 1106 (including a first earbud 1102 and a second earbud 1104) are configured to perform immersive audio processing. Although earbuds are described, it should be understood that the present technology can be applied to other in-ear or over-ear playback devices.


In the example illustrated in FIG. 11, the first earbud 1102 includes a first microphone 1120, such as a high signal-to-noise microphone positioned to capture the voice of a wearer of the first earbud 1102, an array of one or more other microphones configured to detect ambient sounds and spatially distributed to support beamforming, illustrated as microphones 1122A, 1122B, and 1122C, an “inner” microphone 1124 proximate to the wearer's ear canal (e.g., to assist with active noise cancelling), and a self-speech microphone 1126, such as a bone conduction microphone configured to convert sound vibrations of the wearer's ear bone or skull into an audio signal. The second earbud 1104 can be configured in a substantially similar manner as the first earbud 1102.


The immersive audio components 722 and optionally one or more pose sensors 108 are integrated in at least one of the earbuds 1106 (e.g., in the first earbud 1102, the second earbud 1104, or both). In a particular example, the immersive audio components 722 include the immersive audio renderer 122, the movement estimator 150, and optionally other components described with reference to FIGS. 1-8. For example, in some implementations, the earbuds 1106 are configured to perform operations described with reference to the immersive audio player 202 of any of FIGS. 2-5. To illustrate, the earbuds 1106 can obtain contextual movement estimate data associated with a portion of the immersive audio environment; set a pose update parameter based on the contextual movement estimate data; obtain pose data based on the pose update parameter; obtain rendered assets associated with the immersive audio environment based on the pose data; and generate an output audio signal based on the rendered assets.


In some implementations, the earbuds 1106 are configured to perform operations described with reference to the media output device(s) 102 of any of FIGS. 1-5. For example, the earbuds 1106 can generate pose data 110 based on pose update parameter(s) 156 received from another device (e.g., the immersive audio player 202) and receive an output audio signal 180 representing immersive audio content rendered based on the pose data. In this example, the earbuds 1106 can output sound based on the output audio signal 180 via the speakers 104.



FIG. 12 depicts an implementation 1200 in which extended reality (e.g., augmented reality or mixed reality) glasses 1202 are configured to perform immersive audio processing. The glasses 1202 include a holographic projection unit 1204 configured to project visual data onto a surface of a lens 1206 or to reflect the visual data off of a surface of the lens 1206 and onto the wearer's retina. The immersive audio components 722 and optionally one or more pose sensors 108 are integrated in the glasses 1202. In a particular example, the immersive audio components 722 include the immersive audio renderer 122, the movement estimator 150, and optionally other components described with reference to FIGS. 1-8. For example, in some implementations, the glasses 1202 are configured to perform operations described with reference to the immersive audio player 202 of any of FIGS. 2-5. To illustrate, the glasses 1202 can obtain contextual movement estimate data associated with a portion of the immersive audio environment; set a pose update parameter based on the contextual movement estimate data; obtain pose data based on the pose update parameter; obtain rendered assets associated with the immersive audio environment based on the pose data; and generate an output audio signal based on the rendered assets.


In some implementations, the glasses 1202 are configured to perform operations described with reference to the media output device(s) 102 of any of FIGS. 1-5. For example, the glasses 1202 can generate pose data 110 based on pose update parameter(s) 156 received from another device (e.g., the immersive audio player 202) and receive an output audio signal 180 representing immersive audio content rendered based on the pose data. In this example, the glasses 1202 can output sound based on the output audio signal 180 via the speakers 104.



FIG. 13 depicts an implementation 1300 in which an extended reality (e.g., a virtual reality, mixed reality, or augmented reality) headset 1302 is configured to perform immersive audio processing. In the example illustrated in FIG. 13, the headset 1302 includes the speakers 104 and the display(s) 106. The immersive audio components 722 and optionally one or more pose sensors 108 are integrated in the headset 1302. In a particular example, the immersive audio components 722 include the immersive audio renderer 122, the movement estimator 150, and optionally other components described with reference to FIGS. 1-8. For example, in some implementations, the headset 1302 is configured to perform operations described with reference to the immersive audio player 202 of any of FIGS. 2-5. To illustrate, the headset 1302 can obtain contextual movement estimate data associated with a portion of the immersive audio environment; set a pose update parameter based on the contextual movement estimate data; obtain pose data based on the pose update parameter; obtain rendered assets associated with the immersive audio environment based on the pose data; and generate an output audio signal based on the rendered assets.


In some implementations, the headset 1302 is configured to perform operations described with reference to the media output device(s) 102 of any of FIGS. 1-5. For example, the headset 1302 can generate pose data 110 based on pose update parameter(s) 156 received from another device (e.g., the immersive audio player 202) and receive an output audio signal 180 representing immersive audio content rendered based on the pose data. In this example, the headset 1302 can output sound based on the output audio signal 180.



FIG. 14 depicts another implementation 1400 in which a vehicle 1402 is configured to perform immersive audio processing. In FIG. 14, the vehicle 1402 is illustrated as a car. The immersive audio components 722 are integrated in the vehicle 1402. In a particular example, the immersive audio components 722 include the immersive audio renderer 122, the movement estimator 150, and optionally other components described with reference to FIGS. 1-8. For example, in some implementations, the vehicle 1402 is configured to perform operations described with reference to the immersive audio player 202 of any of FIGS. 2-5. To illustrate, the vehicle 1402 can obtain contextual movement estimate data associated with a portion of the immersive audio environment; set a pose update parameter based on the contextual movement estimate data; obtain pose data based on the pose update parameter; obtain rendered assets associated with the immersive audio environment based on the pose data; and generate an output audio signal based on the rendered assets.


In some implementations, the vehicle 1402 is configured to perform operations described with reference to the media output device(s) 102 of any of FIGS. 1-5. For example, the vehicle 1402 can generate pose data 110 based on pose update parameter(s) 156 received from another device (e.g., the immersive audio player 202) and receive an output audio signal 180 representing immersive audio content rendered based on the pose data. In this example, the vehicle 1402 can output sound based on the output audio signal 180 via a set of speakers.



FIG. 15 depicts an implementation 1500 in which a vehicle 1502 is configured to perform immersive audio processing. In FIG. 15, the vehicle 1502 is illustrated as an unmanned aerial vehicle, such as a personal drone or a package delivery drone. The immersive audio components 722 are integrated in the vehicle 1502. In a particular example, the immersive audio components 722 include the immersive audio renderer 122, the movement estimator 150, and optionally other components described with reference to FIGS. 1-8. For example, in some implementations, the vehicle 1502 is configured to perform operations described with reference to the immersive audio player 202 of any of FIGS. 2-5. To illustrate, the vehicle 1502 can obtain contextual movement estimate data associated with a portion of the immersive audio environment; set a pose update parameter based on the contextual movement estimate data; obtain pose data based on the pose update parameter; obtain rendered assets associated with the immersive audio environment based on the pose data; and generate an output audio signal based on the rendered assets.


In some implementations, the vehicle 1502 is configured to perform operations described with reference to the media output device(s) 102 of any of FIGS. 1-5. For example, the vehicle 1502 can generate pose data 110 based on pose update parameter(s) 156 received from another device (e.g., the immersive audio player 202) and receive an output audio signal 180 representing immersive audio content rendered based on the pose data. In this example, the vehicle 1502 can output sound based on the output audio signal 180 via the speakers 104.


Referring to FIG. 16, a particular implementation of a method 1600 of processing immersive audio data is shown. In a particular aspect, one or more operations of the method 1600 are performed by one or more of the components of any of the systems 100-500 of FIGS. 1-5.


The method 1600 includes, at block 1602, obtaining contextual movement estimate data associated with a portion of the immersive audio environment. For example, the movement estimator 150 of FIG. 1 can determine the contextual movement estimate data 152 based on the historical interaction data 158, the context cue(s) 154, metadata associated with the immersive audio environment, a genre associated with the immersive audio environment, the pose data 110 (e.g., the prior pose(s) 252, the current pose 254, and/or the predicted pose(s) 256), the movement trace data 402 or 406, other data associated with an immersive audio environment, or a combination thereof.


The method 1600 includes, at block 1604, setting a pose update parameter based on the contextual movement estimate data. For example, the movement estimator 150 of FIG. 1 can set the pose update parameter(s) 156 based on the contextual movement estimate data 152. The pose update parameter(s) 156 can indicate, for example, a pose data update rate, an operational mode associated with a pose sensor, etc. In a particular aspect, setting the pose update parameter(s) 156 includes sending the pose update parameter(s) 156 to the pose sensor(s) 108 to cause the pose sensor(s) 108 to provide the pose data 110 at a rate associated with the pose update parameter(s) 156.


The method 1600 includes, at block 1606, obtaining pose data based on the pose update parameter. For example, the pose sensor(s) 108 can send the pose data 110 to the immersive audio renderer 122, the audio asset selector 124, the movement estimator 150, the immersive audio player 202, the pose predictor 250, the client 120, the processor(s) 210, or a combination thereof. The pose data 110 can be used to determine a current listener pose or a predicted listener pose. Further, in some implementations, the current listener pose, the predicted listener pose, or both, can be used to determine or update the contextual movement estimate data 152.


In some implementations, the pose data includes first data indicating a translational position of a listener in the immersive audio environment and second data indicating a rotational orientation of the listener in the immersive audio environment. In such implementations, the first data and the second data can be received from the same device, from different devices, or combinations thereof. For example, the first data can be received from a first device and the second data can be received from a second device distinct from the first device. As another example, first translation data can be obtained from a first device, second translation data can be obtained from a second device distinct from the first device, and the first data indicating the translational position of the listener in the immersive audio environment can be determined based on the first translation data and the second translation data.


The method 1600 includes, at block 1608, obtaining rendered assets associated with the immersive audio environment based on the pose data. For example, the immersive audio renderer 122 can perform rendering operations to generate a rendered asset based on a local asset 142 or a remote asset 144. To illustrate, the rendering operations can include one or more of the rendering operations 620 described with reference to FIG. 6. In some implementations, a local asset 142 or a remote asset 144 can include a pre-rendered asset (e.g., one of the pre-rendered assets 114D).


As one example, obtaining rendered assets can include determining a target asset based on the pose data (e.g., a predicted pose or the current pose) and generating an asset retrieval request to retrieve the target asset from a storage location. The target asset can include a pre-rendered asset associated with a particular listener pose or an asset that has not been pre-rendered. For example, when the target asset is a pre-rendered asset, generating the output audio signal can include applying head related transfer functions to the target asset to generate a binaural output signal. When the target asset has not been pre-rendered, obtaining the rendered assets can include rendering the target asset based on the pose data to generate a rendered asset, and applying head related transfer functions to the rendered asset to generate a binaural output signal.
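

A schematic sketch of the branching just described, with hypothetical helpers (retrieve_asset, render, apply_hrtfs) standing in for the actual components; the dict-based asset representation is an assumption for this sketch.

```python
def obtain_binaural_output(pose, retrieve_asset, render, apply_hrtfs):
    """Fetch a target asset for the given pose and produce a binaural signal.

    If the target asset is already pre-rendered for a listener pose, only the
    HRTF stage is applied; otherwise the asset is rendered first.
    """
    target = retrieve_asset(pose)             # e.g., issue an asset retrieval request
    if target.get("pre_rendered", False):
        rendered = target                     # skip local rendering
    else:
        rendered = render(target, pose)       # render based on the pose data
    return apply_hrtfs(rendered)              # binauralize the rendered asset
```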


The method 1600 includes, at block 1610, generating an output audio signal based on the rendered assets. For example, the immersive audio renderer 122 can generate the output audio signal 180 based on a rendered asset (e.g., the rendered asset(s) 126). To illustrate, generating the output audio signal 180 can include performing binauralization operations (e.g., by the binauralizer 128), such as one or more of the mixing and binauralization operations 622 described with reference to FIG. 6.


In some implementations, after an asset is rendered (or a pre-rendered asset is obtained) and used to generate an output audio signal, movement trace data can be updated. For example, the movement trace data can be updated to indicate a listener pose associated with output of the asset, movement or a rate of change of the listener pose prior to output of the asset, a time index or other identifier of position of the asset in playout of the immersive audio environment, user interactions (e.g., user inputs) associated with playout of the asset, etc. The movement trace data can be stored locally (e.g., at the local memory 170) and/or sent to a remote device (e.g., the remote memory 112) for aggregation with movement trace data of other users.
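

For illustration, a trace update could be appended to a local log after playout and later sent for aggregation; the record fields and file format are assumptions.

```python
import json
import time

def append_trace_update(local_log_path: str, listener_pose: dict,
                        asset_time_index_s: float, user_inputs: list) -> None:
    """Append one movement-trace update to a local log file; the record can
    later be sent to a remote store for aggregation with other users' traces."""
    record = {
        "timestamp": time.time(),
        "asset_time_index_s": asset_time_index_s,
        "listener_pose": listener_pose,        # e.g., {"position": ..., "yaw": ...}
        "user_inputs": user_inputs,            # e.g., game inputs during playout
    }
    with open(local_log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```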


In some implementations, the pose data obtained at block 1606 is associated with a first time, and the method 1600 includes determining a predicted listener pose associated with a second time subsequent to the first time. In such implementations, the rendered asset, at block 1608, can include at least one of the rendered assets associated with the predicted listener pose. In some implementations, more than one predicted listener pose for a particular time can be predicted and each predicted listener pose can be used to obtain a rendered asset. For example, the pose data obtained at block 1606 is associated with a first time, and the method 1600 can include determining two or more predicted listener poses associated with a second time subsequent to the first time, obtaining a first rendered asset associated with a first predicted listener pose, and obtaining a second rendered asset associated with a second predicted listener pose. In this example, the method 1600 can also include selectively generating the output audio signal based on either the first rendered asset or the second rendered asset. To illustrate, selectively generating the output audio signal based on either the first rendered asset or the second rendered asset can include obtaining a first target asset associated with the first predicted listener pose, rendering the first target asset to generate the first rendered asset, obtaining a second target asset associated with the second predicted listener pose, rendering the second target asset to generate the second rendered asset, obtaining pose data associated with the second time, and selecting, based on the pose data associated with the second time, the first rendered asset or the second rendered asset for further processing.
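

A minimal sketch of the render-ahead selection described above, using a crude pose distance (it mixes position and yaw units) purely for illustration.

```python
import numpy as np

def select_rendered_asset(predicted_poses, rendered_assets, actual_pose):
    """Render-ahead selection: given assets rendered for several predicted
    listener poses, pick the one whose pose is closest to the pose actually
    measured at playout time."""
    distances = [np.linalg.norm(np.asarray(p) - np.asarray(actual_pose))
                 for p in predicted_poses]
    return rendered_assets[int(np.argmin(distances))]

# Two hypothetical predicted poses (x, y, yaw) and the pose measured later.
predicted = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.8)]
assets = ["asset_rendered_for_pose_0", "asset_rendered_for_pose_1"]
chosen = select_rendered_asset(predicted, assets, (0.0, 0.1, 0.7))  # -> pose_1
```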


The method 1600 of FIG. 16 may be implemented by a field-programmable gate array (FPGA) device, an application-specific integrated circuit (ASIC), a processing unit such as a central processing unit (CPU), a DSP, a controller, another hardware device, firmware device, or any combination thereof. As an example, the method 1600 of FIG. 16 may be performed by a processor that executes instructions, such as described with reference to FIG. 17.


Referring to FIG. 17, a block diagram of a particular illustrative implementation of a device is depicted and generally designated 1700. In various implementations, the device 1700 may have more or fewer components than illustrated in FIG. 17. In an illustrative implementation, the device 1700 may correspond to one or more of the media output device(s) 102, to the immersive audio player 202, or a combination thereof. In an illustrative implementation, the device 1700 may perform one or more operations described with reference to FIGS. 1-16.


In a particular implementation, the device 1700 includes a processor 1706 (e.g., a central processing unit (CPU)). The device 1700 may include one or more additional processors 1710 (e.g., one or more DSPs). In a particular aspect, the processor(s) 210 of any of FIGS. 2-5 correspond to the processor 1706, the processors 1710, or a combination thereof. The processors 1710 may include a speech and music coder-decoder (CODEC) 1708 that includes a voice coder (“vocoder”) encoder 1736, a vocoder decoder 1738, the immersive audio components 722, or a combination thereof. The immersive audio components 722 can include, for example, the immersive audio renderer 122 and the movement estimator 150. Optionally, the immersive audio components 722 can include the pose predictor 250, the client 120, the decoder 121, the asset location selector 130, or a combination thereof. The immersive audio renderer 122 or the movement estimator 150 can include the audio asset selector 124, the pose predictor 250, or both. Optionally, the pose sensor(s) 108 can be included within or coupled to the device 1700.


The device 1700 may include a memory 1786 and a CODEC 1734. The memory 1786 may include instructions 1756 that are executable by the one or more additional processors 1710 (or the processor 1706) to implement the functionality described with reference to any of FIGS. 1-16. In FIG. 17, the device 1700 also includes the modem 220 coupled, via a transceiver 1750, to an antenna 1752.


The device 1700 may include the display(s) 106 coupled to a display controller 1726. The speaker(s) 104 and a microphone 1794 may be coupled to the CODEC 1734. The CODEC 1734 may include a digital-to-analog converter (DAC) 1702, an analog-to-digital converter (ADC) 1704, or both. In a particular implementation, the CODEC 1734 may receive analog signals from the microphone 1794, convert the analog signals to digital signals using the analog-to-digital converter 1704, and provide the digital signals to the speech and music codec 1708. The speech and music codec 1708 may process the digital signals, and the digital signals or other digital signals (e.g., one or more assets associated with an immersive audio environment) may further be processed by the immersive audio components 722. In a particular implementation, the speech and music codec 1708 may provide digital signals to the CODEC 1734. The CODEC 1734 may convert the digital signals to analog signals using the digital-to-analog converter 1702 and may provide the analog signals to the speaker(s) 104.


In a particular implementation, the device 1700 may be included in a system-in-package or system-on-chip device 1722. In a particular implementation, the memory 1786, the processor 1706, the processors 1710, the display controller 1726, the CODEC 1734, and the modem 220 are included in the system-in-package or system-on-chip device 1722. In a particular implementation, the pose sensor(s) 108, an input device 1730, and a power supply 1744 are coupled to the system-in-package or the system-on-chip device 1722. Moreover, in a particular implementation, as illustrated in FIG. 17, the display(s) 106, the input device 1730, the speaker(s) 104, the microphone 1794, the pose sensor(s) 108, the antenna 1752, and the power supply 1744 are external to the system-in-package or the system-on-chip device 1722. In a particular implementation, each of the display(s) 106, the input device 1730, the speaker(s) 104, the microphone 1794, the pose sensor(s) 108, the antenna 1752, and the power supply 1744 may be coupled to a component of the system-in-package or the system-on-chip device 1722, such as an interface (e.g., the signal input 704 or the signal output 712) or a controller.


The device 1700 may include a smart speaker, a speaker bar, a mobile communication device, a smart phone, a cellular phone, a laptop computer, a computer, a tablet, a personal digital assistant, a display device, a television, a gaming console, a music player, a radio, a digital video player, a digital video disc (DVD) player, a tuner, a camera, a navigation device, a vehicle, a headset, an augmented reality headset, a mixed reality headset, a virtual reality headset, an aerial vehicle, a home automation system, a voice-activated device, a wireless speaker and voice activated device, a portable electronic device, a car, a computing device, a communication device, an internet-of-things (IoT) device, a virtual reality (VR) device, a base station, a mobile device, or any combination thereof.


In conjunction with the described implementations, an apparatus includes means for obtaining contextual movement estimate data associated with a portion of an immersive audio environment. For example, the means for obtaining contextual movement estimate data can correspond to the movement estimator 150, the immersive audio renderer 122, the client 120, the immersive audio player 202, the processor(s) 210, the modem 220, the media output device(s) 102, the processor 1706, the one or more processor(s) 1710, one or more other circuits or components configured to obtain contextual movement estimate data, or any combination thereof.


The apparatus includes means for setting a pose update parameter based on the contextual movement estimate data. For example, the means for setting a pose update parameter can correspond to the movement estimator 150, the immersive audio renderer 122, the client 120, the immersive audio player 202, the processor(s) 210, the modem 220, the media output device(s) 102, the processor 1706, the one or more processor(s) 1710, one or more other circuits or components configured to set a pose update parameter, or any combination thereof.


The apparatus includes means for obtaining pose data based on the pose update parameter. For example, the means for obtaining pose data can correspond to the pose sensor(s) 108, the movement estimator 150, the immersive audio renderer 122, the audio asset selector 124, the client 120, the immersive audio player 202, the processor(s) 210, the modem 220, the pose predictor 250, the binauralizer 128, the media output device(s) 102, the processor 1706, the one or more processor(s) 1710, one or more other circuits or components configured to obtain pose data, or any combination thereof.


The apparatus includes means for obtaining rendered assets associated with the immersive audio environment based on the pose data. For example, the means for obtaining rendered assets can correspond to the immersive audio renderer 122, the audio asset selector 124, the asset location selector 130, the client 120, the decoder 121, the immersive audio player 202, the processor(s) 210, the modem 220, the binauralizer 128, the media output device(s) 102, the processor 1706, the one or more processor(s) 1710, one or more other circuits or components configured to obtain rendered assets, or any combination thereof.


The apparatus includes means for generating an output audio signal based on the rendered assets. For example, the means for generating an output audio signal can correspond to the immersive audio renderer 122, the immersive audio player 202, the processor(s) 210, the binauralizer 128, the media output device(s) 102, the processor 1706, the one or more processor(s) 1710, one or more other circuits or components configured to generate an output audio signal, or any combination thereof.


In some implementations, a non-transitory computer-readable medium (e.g., a computer-readable storage device, such as the memory 1786 or the local memory 170) includes instructions (e.g., the instructions 1756 or the instructions 174) that, when executed by one or more processors (e.g., the one or more processors 210, the one or more processors 1710 or the processor 1706), cause the one or more processors to obtain contextual movement estimate data associated with a portion of an immersive audio environment; set a pose update parameter based on the contextual movement estimate data; obtain pose data based on the pose update parameter; obtain rendered assets associated with the immersive audio environment based on the pose data; and generate an output audio signal based on the rendered assets.


Particular aspects of the disclosure are described below in sets of interrelated Examples:

    • According to Example 1, a device includes a memory configured to store data associated with an immersive audio environment; and one or more processors configured to obtain contextual movement estimate data associated with a portion of the immersive audio environment; set a pose update parameter based on the contextual movement estimate data; obtain pose data based on the pose update parameter; obtain rendered assets associated with the immersive audio environment based on the pose data; and generate an output audio signal based on the rendered assets.
    • Example 2 includes the device of Example 1, wherein the pose update parameter indicates a pose data update rate.
    • Example 3 includes the device of Example 1 or Example 2, wherein the pose update parameter indicates an operational mode associated with a pose sensor.
    • Example 4 includes the device of any of Examples 1 to 3, wherein, to set the pose update parameter, the one or more processors are configured to send the pose update parameter to a pose sensor to cause the pose sensor to provide the pose data at a rate associated with the pose update parameter.
    • Example 5 includes the device of any of Examples 1 to 4, wherein the one or more processors are configured to determine a current listener pose in the immersive audio environment, and wherein the contextual movement estimate data is based on the current listener pose.
    • Example 6 includes the device of any of Examples 1 to 5, wherein the one or more processors are configured to determine a predicted listener pose in the immersive audio environment, and wherein the contextual movement estimate data is based on the predicted listener pose.
    • Example 7 includes the device of any of Examples 1 to 6, wherein the one or more processors are configured to obtain movement trace data associated with the immersive audio environment, and wherein the contextual movement estimate data is based on the movement trace data.
    • Example 8 includes the device of Example 7, wherein the movement trace data is based on historical user interactions associated with the immersive audio environment.
    • Example 9 includes the device of Example 7 or Example 8, wherein the movement trace data is obtained from a remote device that aggregates user interaction associated with the immersive audio environment across a plurality of users.
    • Example 10 includes the device of Example 9, wherein the one or more processors are configured to, after obtaining rendered assets, provide updated movement trace data to the remote device.
    • Example 11 includes the device of any of Examples 1 to 10, wherein the one or more processors are configured to obtain metadata associated with the immersive audio environment, and wherein the contextual movement estimate data is based on the metadata.
    • Example 12 includes the device of Example 11, wherein the metadata indicates a genre associated with the immersive audio environment, and wherein the one or more processors are configured to determine the contextual movement estimate data based on the genre.
    • Example 13 includes the device of Example 11 or Example 12, wherein the metadata includes one or more movement cues associated with the immersive audio environment and wherein the one or more processors are configured to determine the contextual movement estimate data based on the one or more movement cues.
    • Example 14 includes the device of any of Examples 1 to 13, wherein the one or more processors are configured to, based on pose data associated with a first time, determine a predicted listener pose associated with a second time subsequent to the first time, and wherein at least one of the rendered assets is associated with the predicted listener pose.
    • Example 15 includes the device of any of Examples 1 to 14, wherein the one or more processors are configured to, based on pose data associated with a first time: determine two or more predicted listener poses associated with a second time subsequent to the first time; obtain a first rendered asset associated with a first predicted listener pose; obtain a second rendered asset associated with a second predicted listener pose; and selectively generate the output audio signal based on either the first rendered asset or the second rendered asset.
    • Example 16 includes the device of Example 15, wherein, to selectively generate the output audio signal based on either the first rendered asset or the second rendered asset, the one or more processors are configured to obtain a first target asset associated with the first predicted listener pose; render the first target asset to generate the first rendered asset; obtain a second target asset associated with the second predicted listener pose; render the second target asset to generate the second rendered asset; obtain pose data associated with the second time; and select, based on the pose data associated with the second time, the first rendered asset or the second rendered asset for further processing.
    • Example 17 includes the device of any of Examples 1 to 16, wherein, to obtain the rendered assets, the one or more processors are configured to determine a target asset based on the pose data; and generate an asset retrieval request to retrieve the target asset from a storage location.
    • Example 18 includes the device of Example 17, wherein the memory includes the storage location.
    • Example 19 includes the device of Example 17, wherein the storage location is at a remote device.
    • Example 20 includes the device of any of Examples 17 to 19, wherein the target asset is a pre-rendered asset and wherein, to generate the output audio signal, the one or more processors are configured to apply head related transfer functions to the target asset to generate a binaural output signal.
    • Example 21 includes the device of any of Examples 17 to 20, wherein, to obtain the rendered assets, the one or more processors are configured to render the target asset based on the pose data to generate a rendered asset, and wherein, to generate the output audio signal, the one or more processors are configured to apply head related transfer functions to the rendered asset to generate a binaural output signal.
    • Example 22 includes the device of any of Examples 1 to 21, wherein the pose data includes first data indicating a translational position of a listener in the immersive audio environment and second data indicating a rotational orientation of the listener in the immersive audio environment.
    • Example 23 includes the device of Example 22, wherein the one or more processors are configured to receive the first data from a first device and to receive the second data from a second device distinct from the first device.
    • Example 24 includes the device of Example 22 or Example 23, wherein the one or more processors are configured to obtain first translation data from a first device; obtain second translation data from a second device distinct from the first device; and determine the first data based on the first translation data and the second translation data.
    • Example 25 includes the device of any of Examples 1 to 24 and further includes a pose sensor coupled to the one or more processors.
    • Example 26 includes the device of Example 25, wherein the pose sensor and the one or more processors are integrated within a head-mounted wearable device.
    • Example 27 includes the device of any of Examples 1 to 26, wherein the one or more processors are integrated within an immersive audio player device.
    • Example 28 includes the device of any of Examples 1 to 27 and further includes a modem coupled to the one or more processors and configured to send the pose update parameter to a device that includes a pose sensor.
    • According to Example 29, a method includes obtaining contextual movement estimate data associated with a portion of an immersive audio environment; setting a pose update parameter based on the contextual movement estimate data; obtaining pose data based on the pose update parameter; obtaining rendered assets associated with the immersive audio environment based on the pose data; and generating an output audio signal based on the rendered assets.
    • Example 30 includes the method of Example 29, wherein the pose update parameter indicates a pose data update rate.
    • Example 31 includes the method of Example 29 or Example 30, wherein the pose update parameter indicates an operational mode associated with a pose sensor.
    • Example 32 includes the method of any of Examples 29 to 31, wherein setting the pose update parameter includes sending the pose update parameter to a pose sensor to cause the pose sensor to provide the pose data at a rate associated with the pose update parameter.
    • Example 33 includes the method of any of Examples 29 to 32 and further includes determining a current listener pose in the immersive audio environment, and wherein the contextual movement estimate data is based on the current listener pose.
    • Example 34 includes the method of any of Examples 29 to 33 and further includes determining a predicted listener pose in the immersive audio environment, and wherein the contextual movement estimate data is based on the predicted listener pose.
    • Example 35 includes the method of any of Examples 29 to 34 and further includes obtaining movement trace data associated with the immersive audio environment, and wherein the contextual movement estimate data is based on the movement trace data.
    • Example 36 includes the method of Example 35, wherein the movement trace data is based on historical user interactions associated with the immersive audio environment.
    • Example 37 includes the method of Example 35 or Example 36, wherein the movement trace data is obtained from a remote device that aggregates user interaction associated with the immersive audio environment across a plurality of users.
    • Example 38 includes the method of Example 37 and further includes, after obtaining rendered assets, providing updated movement trace data to the remote device.
    • Example 39 includes the method of any of Examples 29 to 38 and further includes obtaining metadata associated with the immersive audio environment, and wherein the contextual movement estimate data is based on the metadata.
    • Example 40 includes the method of Example 39, wherein the metadata indicates a genre associated with the immersive audio environment, and further comprising determining the contextual movement estimate data based on the genre.
    • Example 41 includes the method of Example 39 or Example 40, wherein the metadata includes one or more movement cues associated with the immersive audio environment and further comprising determining the contextual movement estimate data based on the one or more movement cues.
    • Example 42 includes the method of any of Examples 29 to 41 and further includes, based on pose data associated with a first time, determining a predicted listener pose associated with a second time subsequent to the first time, and wherein at least one of the rendered assets is associated with the predicted listener pose.
    • Example 43 includes the method of any of Examples 29 to 42 and further includes, based on pose data associated with a first time: determining two or more predicted listener poses associated with a second time subsequent to the first time; obtaining a first rendered asset associated with a first predicted listener pose; obtaining a second rendered asset associated with a second predicted listener pose; and selectively generating the output audio signal based on either the first rendered asset or the second rendered asset.
    • Example 44 includes the method of Example 43, wherein selectively generating the output audio signal based on either the first rendered asset or the second rendered asset comprises: obtaining a first target asset associated with the first predicted listener pose; rendering the first target asset to generate the first rendered asset; obtaining a second target asset associated with the second predicted listener pose; rendering the second target asset to generate the second rendered asset; obtaining pose data associated with the second time; and selecting, based on the pose data associated with the second time, the first rendered asset or the second rendered asset for further processing.
    • Example 45 includes the method of any of Examples 29 to 44, wherein obtaining the rendered assets comprises: determining a target asset based on the pose data; and generating an asset retrieval request to retrieve the target asset from a storage location.
    • Example 46 includes the method of Example 45, wherein the target asset is a pre-rendered asset and wherein generating the output audio signal comprises applying head related transfer functions to the target asset to generate a binaural output signal.
    • Example 47 includes the method of Example 45 or Example 46, wherein obtaining the rendered assets comprises rendering the target asset based on the pose data to generate a rendered asset, and wherein generating the output audio signal comprises applying head related transfer functions to the rendered asset to generate a binaural output signal.
    • Example 48 includes the method of any of Examples 29 to 47, wherein the pose data includes first data indicating a translational position of a listener in the immersive audio environment and second data indicating a rotational orientation of the listener in the immersive audio environment.
    • Example 49 includes the method of Example 48 and further includes receiving the first data from a first device and receiving the second data from a second device distinct from the first device.
    • Example 50 includes the method of Example 48 or Example 49, and further includes obtaining first translation data from a first device; obtaining second translation data from a second device distinct from the first device; and determining the first data based on the first translation data and the second translation data.
    • According to Example 51, a non-transitory computer-readable device stores instructions that are executable by one or more processors to cause the one or more processors to obtain contextual movement estimate data associated with a portion of an immersive audio environment; set a pose update parameter based on the contextual movement estimate data; obtain pose data based on the pose update parameter; obtain rendered assets associated with the immersive audio environment based on the pose data; and generate an output audio signal based on the rendered assets.
    • Example 52 includes the non-transitory computer-readable device of Example 51, wherein the pose update parameter indicates a pose data update rate.
    • Example 53 includes the non-transitory computer-readable device of Example 51 or Example 52, wherein the pose update parameter indicates an operational mode associated with a pose sensor.
    • Example 54 includes the non-transitory computer-readable device of any of Examples 51 to 53, wherein, to set the pose update parameter, the instructions cause the one or more processors to send the pose update parameter to a pose sensor to cause the pose sensor to provide the pose data at a rate associated with the pose update parameter.
    • Example 55 includes the non-transitory computer-readable device of any of Examples 51 to 54, wherein the instructions cause the one or more processors to determine a current listener pose in the immersive audio environment, and wherein the contextual movement estimate data is based on the current listener pose.
    • Example 56 includes the non-transitory computer-readable device of any of Examples 51 to 55, wherein the instructions cause the one or more processors to determine a predicted listener pose in the immersive audio environment, and wherein the contextual movement estimate data is based on the predicted listener pose.
    • Example 57 includes the non-transitory computer-readable device of any of Examples 51 to 56, wherein the instructions cause the one or more processors to obtain movement trace data associated with the immersive audio environment, and wherein the contextual movement estimate data is based on the movement trace data.
    • Example 58 includes the non-transitory computer-readable device of Example 57, wherein the movement trace data is based on historical user interactions associated with the immersive audio environment.
    • Example 59 includes the non-transitory computer-readable device of Example 57 or Example 58, wherein the movement trace data is obtained from a remote device that aggregates user interaction associated with the immersive audio environment across a plurality of users.
    • Example 60 includes the non-transitory computer-readable device of Example 59, wherein the instructions cause the one or more processors to, after obtaining rendered assets, provide updated movement trace data to the remote device.
    • Example 61 includes the non-transitory computer-readable device of any of Examples 51 to 60, wherein the instructions cause the one or more processors to obtain metadata associated with the immersive audio environment, and wherein the contextual movement estimate data is based on the metadata.
    • Example 62 includes the non-transitory computer-readable device of Example 61, wherein the metadata indicates a genre associated with the immersive audio environment, and wherein the instructions cause the one or more processors to determine the contextual movement estimate data based on the genre.
    • Example 63 includes the non-transitory computer-readable device of Example 61 or Example 62, wherein the metadata includes one or more movement cues associated with the immersive audio environment and wherein the instructions cause the one or more processors to determine the contextual movement estimate data based on the one or more movement cues.
    • Example 64 includes the non-transitory computer-readable device of any of Examples 51 to 63, wherein the instructions cause the one or more processors to, based on pose data associated with a first time, determine a predicted listener pose associated with a second time subsequent to the first time, and wherein at least one of the rendered assets is associated with the predicted listener pose.
    • Example 65 includes the non-transitory computer-readable device of any of Examples 51 to 64, wherein the instructions cause the one or more processors to, based on pose data associated with a first time: determine two or more predicted listener poses associated with a second time subsequent to the first time; obtain a first rendered asset associated with a first predicted listener pose; obtain a second rendered asset associated with a second predicted listener pose; and selectively generate the output audio signal based on either the first rendered asset or the second rendered asset.
    • Example 66 includes the non-transitory computer-readable device of Example 65, wherein, to selectively generate the output audio signal based on either the first rendered asset or the second rendered asset, the instructions cause the one or more processors to obtain a first target asset associated with the first predicted listener pose; render the first target asset to generate the first rendered asset; obtain a second target asset associated with the second predicted listener pose; render the second target asset to generate the second rendered asset; obtain pose data associated with the second time; and select, based on the pose data associated with the second time, the first rendered asset or the second rendered asset for further processing.
    • Example 67 includes the non-transitory computer-readable device of any of Examples 51 to 66, wherein, to obtain the rendered assets, the instructions cause the one or more processors to determine a target asset based on the pose data; and generate an asset retrieval request to retrieve the target asset from a storage location.
    • Example 68 includes the non-transitory computer-readable device of Example 67, wherein the target asset is a pre-rendered asset and wherein, to generate the output audio signal, the instructions cause the one or more processors to apply head related transfer functions to the target asset to generate a binaural output signal.
    • Example 69 includes the non-transitory computer-readable device of Example 67 or Example 68, wherein, to obtain the rendered assets, the instructions cause the one or more processors to render the target asset based on the pose data to generate a rendered asset, and wherein, to generate the output audio signal, the instructions cause the one or more processors to apply head related transfer functions to the rendered asset to generate a binaural output signal.
    • Example 70 includes the non-transitory computer-readable device of any of Examples 51 to 69, wherein the pose data includes first data indicating a translational position of a listener in the immersive audio environment and second data indicating a rotational orientation of the listener in the immersive audio environment.
    • Example 71 includes the non-transitory computer-readable device of Example 70, wherein the instructions cause the one or more processors to receive the first data from a first device and to receive the second data from a second device distinct from the first device.
    • Example 72 includes the non-transitory computer-readable device of Example 70 or Example 71, wherein the instructions cause the one or more processors to obtain first translation data from a first device; obtain second translation data from a second device distinct from the first device; and determine the first data based on the first translation data and the second translation data.
    • According to Example 73, an apparatus includes means for obtaining contextual movement estimate data associated with a portion of an immersive audio environment; means for setting a pose update parameter based on the contextual movement estimate data; means for obtaining pose data based on the pose update parameter; means for obtaining rendered assets associated with the immersive audio environment based on the pose data; and means for generating an output audio signal based on the rendered assets.
    • Example 74 includes the apparatus of Example 73, wherein the pose update parameter indicates a pose data update rate.
    • Example 75 includes the apparatus of Example 73 or Example 74, wherein the pose update parameter indicates an operational mode associated with a pose sensor.
    • Example 76 includes the apparatus of any of Examples 73 to 75, wherein the means for setting the pose update parameter is configured to send the pose update parameter to a pose sensor to cause the pose sensor to provide the pose data at a rate associated with the pose update parameter.
    • Example 77 includes the apparatus of any of Examples 73 to 76 and further includes means for determining a current listener pose in the immersive audio environment, and wherein the contextual movement estimate data is based on the current listener pose.
    • Example 78 includes the apparatus of any of Examples 73 to 77 and further includes means for determining a predicted listener pose in the immersive audio environment, and wherein the contextual movement estimate data is based on the predicted listener pose.
    • Example 79 includes the apparatus of any of Examples 73 to 78 and further includes means for obtaining movement trace data associated with the immersive audio environment, and wherein the contextual movement estimate data is based on the movement trace data.
    • Example 80 includes the apparatus of Example 79, wherein the movement trace data is based on historical user interactions associated with the immersive audio environment.
    • Example 81 includes the apparatus of Example 79 or Example 80, wherein the movement trace data is obtained from a remote device that aggregates user interaction associated with the immersive audio environment across a plurality of users.
    • Example 82 includes the apparatus of Example 81 and further includes means for providing updated movement trace data to the remote device after obtaining rendered assets.
    • Example 83 includes the apparatus of any of Examples 73 to 82 and further includes means for obtaining metadata associated with the immersive audio environment, and wherein the contextual movement estimate data is based on the metadata.
    • Example 84 includes the apparatus of Example 83, wherein the metadata indicates a genre associated with the immersive audio environment, and further comprising means for determining the contextual movement estimate data based on the genre.
    • Example 85 includes the apparatus of Example 83 or Example 84, wherein the metadata includes one or more movement cues associated with the immersive audio environment and further comprising means for determining the contextual movement estimate data based on the one or more movement cues.
    • Example 86 includes the apparatus of any of Examples 73 to 85 and further includes means for determining, based on pose data associated with a first time, a predicted listener pose associated with a second time subsequent to the first time, and wherein at least one of the rendered assets is associated with the predicted listener pose.
    • Example 87 includes the apparatus of any of Examples 73 to 86 and further includes means for determining, based on pose data associated with a first time, two or more predicted listener poses associated with a second time subsequent to the first time; means for obtaining a first rendered asset associated with a first predicted listener pose; means for obtaining a second rendered asset associated with a second predicted listener pose; and means for selectively generating the output audio signal based on either the first rendered asset or the second rendered asset.
    • Example 88 includes the apparatus of Example 87, wherein the means for selectively generating the output audio signal based on either the first rendered asset or the second rendered asset comprises: means for obtaining a first target asset associated with the first predicted listener pose; means for rendering the first target asset to generate the first rendered asset; means for obtaining a second target asset associated with the second predicted listener pose; means for rendering the second target asset to generate the second rendered asset; means for obtaining pose data associated with the second time; and means for selecting, based on the pose data associated with the second time, the first rendered asset or the second rendered asset for further processing.
    • Example 89 includes the apparatus of any of Examples 73 to 88, wherein the means for obtaining the rendered assets comprises: means for determining a target asset based on the pose data; and means for generating an asset retrieval request to retrieve the target asset from a storage location.
    • Example 90 includes the apparatus of Example 89, wherein the target asset is a pre-rendered asset and wherein the means for generating the output audio signal comprises means for applying head related transfer functions to the target asset to generate a binaural output signal.
    • Example 91 includes the apparatus of Example 89 or Example 90, wherein the means for obtaining the rendered assets comprises means for rendering the target asset based on the pose data to generate a rendered asset, and wherein the means for generating the output audio signal comprises means for applying head related transfer functions to the rendered asset to generate a binaural output signal.
    • Example 92 includes the apparatus of any of Examples 73 to 91, wherein the pose data includes first data indicating a translational position of a listener in the immersive audio environment and second data indicating a rotational orientation of the listener in the immersive audio environment.
    • Example 93 includes the apparatus of Example 92 and further includes means for receiving the first data from a first device and means for receiving the second data from a second device distinct from the first device.
    • Example 94 includes the apparatus of Example 92 or Example 93, and further includes means for obtaining first translation data from a first device; means for obtaining second translation data from a second device distinct from the first device; and means for determining the first data based on the first translation data and the second translation data.
    • Example 95 includes the apparatus of any of Examples 73 to 94 and further includes means for generating the pose data.
    • Example 96 includes the apparatus of any of Examples 73 to 95, wherein the means for obtaining contextual movement estimate data, the means for setting a pose update parameter, the means for obtaining pose data, the means for obtaining rendered assets, and the means for generating an output audio signal are integrated into a wearable device.
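The pose-update control flow recited in Examples 29 to 41 (and in claim 1 below) can be illustrated with a brief sketch. The following Python code is a minimal, hypothetical illustration only: the genre table, the movement-cue handling, the threshold values, and the update rates are assumptions chosen for readability and are not taken from the disclosure.

```python
# Hypothetical sketch (not from the disclosure): setting a pose update
# parameter from contextual movement estimate data, in the spirit of
# Examples 29-41.

from dataclasses import dataclass

# Assumed mapping from content genre to an expected amount of listener movement.
GENRE_MOVEMENT_ESTIMATE = {
    "audiobook": 0.1,   # listener mostly stationary
    "concert": 0.5,     # moderate head rotation expected
    "game": 0.9,        # frequent translation and rotation expected
}

@dataclass
class PoseUpdateParameter:
    update_rate_hz: float      # how often the pose sensor reports pose data
    sensor_mode: str           # e.g. "low_power" or "high_rate"

def estimate_contextual_movement(metadata: dict) -> float:
    """Derive a contextual movement estimate (0..1) from scene metadata.

    Uses the genre and any explicit movement cues carried in the metadata,
    as described in Examples 39-41.
    """
    estimate = GENRE_MOVEMENT_ESTIMATE.get(metadata.get("genre", ""), 0.5)
    for cue in metadata.get("movement_cues", []):
        estimate = max(estimate, cue)        # a cue can only raise the estimate
    return min(max(estimate, 0.0), 1.0)

def set_pose_update_parameter(movement_estimate: float) -> PoseUpdateParameter:
    """Map the movement estimate to a pose data update rate and sensor mode."""
    if movement_estimate < 0.3:
        return PoseUpdateParameter(update_rate_hz=10.0, sensor_mode="low_power")
    if movement_estimate < 0.7:
        return PoseUpdateParameter(update_rate_hz=30.0, sensor_mode="normal")
    return PoseUpdateParameter(update_rate_hz=100.0, sensor_mode="high_rate")

if __name__ == "__main__":
    metadata = {"genre": "concert", "movement_cues": [0.8]}   # illustrative values
    param = set_pose_update_parameter(estimate_contextual_movement(metadata))
    print(param)   # PoseUpdateParameter(update_rate_hz=100.0, sensor_mode='high_rate')
```

In this sketch, a richer movement estimate (for example, one raised by a movement cue in the metadata) maps to a higher pose data update rate, while a low estimate allows a lower rate or a lower-power sensor mode, consistent with Examples 30 to 32.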

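The render-then-select behavior recited in Examples 43 and 44 (and in claims 9, 10, and 18) can likewise be sketched. The simple predictor, the pose distance metric, and the placeholder rendering function below are assumptions made for illustration; an actual implementation could predict poses from velocity, movement trace data, or other context.

```python
# Hypothetical sketch (not from the disclosure): rendering assets for two
# predicted listener poses and selecting one once pose data for the second
# time is available, in the spirit of Examples 43-44.

import math
from dataclasses import dataclass

@dataclass
class Pose:
    x: float
    y: float
    yaw_deg: float

def predict_poses(pose_t1: Pose, dt: float) -> list[Pose]:
    """Return two candidate poses for time t2 = t1 + dt.

    This sketch simply assumes the listener keeps turning either left or right.
    """
    return [
        Pose(pose_t1.x, pose_t1.y, pose_t1.yaw_deg - 15.0 * dt),
        Pose(pose_t1.x, pose_t1.y, pose_t1.yaw_deg + 15.0 * dt),
    ]

def render_for_pose(pose: Pose) -> str:
    """Stand-in for obtaining a target asset and rendering it for a pose."""
    return f"rendered_asset(yaw={pose.yaw_deg:.1f})"

def pose_distance(a: Pose, b: Pose) -> float:
    return math.hypot(a.x - b.x, a.y - b.y) + abs(a.yaw_deg - b.yaw_deg)

def select_rendered_asset(actual_pose: Pose, candidates: list[tuple[Pose, str]]) -> str:
    """Pick the pre-rendered asset whose predicted pose best matches the actual pose."""
    return min(candidates, key=lambda c: pose_distance(actual_pose, c[0]))[1]

if __name__ == "__main__":
    pose_t1 = Pose(0.0, 0.0, 0.0)
    predicted = predict_poses(pose_t1, dt=0.1)
    candidates = [(p, render_for_pose(p)) for p in predicted]
    actual_t2 = Pose(0.0, 0.0, 1.2)              # pose data obtained at the second time
    print(select_rendered_asset(actual_t2, candidates))   # rendered_asset(yaw=1.5)
```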

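Finally, the binauralization step recited in Examples 46 and 47 (and in claims 12 and 13) amounts to filtering a rendered asset with left-ear and right-ear head related transfer functions. The sketch below uses toy time-domain impulse responses and a direct-form convolution purely for illustration; practical systems typically use measured HRTF sets and faster frequency-domain filtering.

```python
# Hypothetical sketch (not from the disclosure): applying a pair of
# head-related impulse responses (time-domain HRTFs) to a rendered mono asset
# to produce a binaural output signal, in the spirit of Examples 46-47.

def convolve(signal: list[float], impulse_response: list[float]) -> list[float]:
    """Direct-form FIR convolution (adequate for short impulse responses)."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, s in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += s * h
    return out

def binauralize(rendered_asset: list[float],
                hrir_left: list[float],
                hrir_right: list[float]) -> tuple[list[float], list[float]]:
    """Generate a two-channel (binaural) output from a single rendered asset."""
    return convolve(rendered_asset, hrir_left), convolve(rendered_asset, hrir_right)

if __name__ == "__main__":
    # Toy impulse responses: the right ear is attenuated and delayed by one
    # sample relative to the left ear, mimicking a source to the listener's left.
    hrir_left = [1.0, 0.3]
    hrir_right = [0.0, 0.6, 0.2]
    rendered_asset = [0.5, -0.25, 0.125]         # a few samples of a rendered asset
    left, right = binauralize(rendered_asset, hrir_left, hrir_right)
    print([round(v, 4) for v in left])    # [0.5, -0.1, 0.05, 0.0375]
    print([round(v, 4) for v in right])
```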
Those of skill would further appreciate that the various illustrative logical blocks, configurations, modules, circuits, and algorithm steps described in connection with the implementations disclosed herein may be implemented as electronic hardware, computer software executed by a processor, or combinations of both. Various illustrative components, blocks, configurations, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or processor-executable instructions depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.


The steps of a method or algorithm described in connection with the implementations disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, hard disk, a removable disk, a compact disc read-only memory (CD-ROM), or any other form of non-transient storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor may read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). The ASIC may reside in a computing device or a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a computing device or user terminal.


The previous description of the disclosed aspects is provided to enable a person skilled in the art to make or use the disclosed aspects. Various modifications to these aspects will be readily apparent to those skilled in the art, and the principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope possible consistent with the principles and novel features as defined by the following claims.

Claims
  • 1. A device comprising: a memory configured to store data associated with an immersive audio environment; and one or more processors configured to: obtain contextual movement estimate data associated with a portion of the immersive audio environment; set a pose update parameter based on the contextual movement estimate data; obtain pose data based on the pose update parameter; obtain rendered assets associated with the immersive audio environment based on the pose data; and generate an output audio signal based on the rendered assets.
  • 2. The device of claim 1, wherein the pose update parameter indicates a pose data update rate or an operational mode associated with a pose sensor.
  • 3. The device of claim 1, wherein, to set the pose update parameter, the one or more processors are configured to send the pose update parameter to a pose sensor to cause the pose sensor to provide the pose data at a rate associated with the pose update parameter.
  • 4. The device of claim 1, wherein the one or more processors are configured to determine a listener pose associated with the immersive audio environment, and wherein the contextual movement estimate data is based on the listener pose.
  • 5. The device of claim 1, wherein the one or more processors are configured to obtain movement trace data associated with the immersive audio environment, wherein the contextual movement estimate data is based on the movement trace data, and wherein the movement trace data is based on historical user interactions of one or more users associated with the immersive audio environment.
  • 6. The device of claim 1, wherein the one or more processors are configured to obtain metadata associated with the immersive audio environment, and wherein the contextual movement estimate data is based on the metadata.
  • 7. The device of claim 6, wherein the metadata indicates a genre associated with the immersive audio environment, and wherein the one or more processors are configured to determine the contextual movement estimate data based on the genre.
  • 8. The device of claim 6, wherein the metadata includes one or more movement cues associated with the immersive audio environment and wherein the one or more processors are configured to determine the contextual movement estimate data based on the one or more movement cues.
  • 9. The device of claim 1, wherein the one or more processors are configured to, based on pose data associated with a first time: determine two or more predicted listener poses associated with a second time subsequent to the first time; obtain a first rendered asset associated with a first predicted listener pose; obtain a second rendered asset associated with a second predicted listener pose; and selectively generate the output audio signal based on either the first rendered asset or the second rendered asset.
  • 10. The device of claim 9, wherein, to selectively generate the output audio signal based on either the first rendered asset or the second rendered asset, the one or more processors are configured to: obtain a first target asset associated with the first predicted listener pose; render the first target asset to generate the first rendered asset; obtain a second target asset associated with the second predicted listener pose; render the second target asset to generate the second rendered asset; obtain pose data associated with the second time; and select, based on the pose data associated with the second time, the first rendered asset or the second rendered asset for further processing.
  • 11. The device of claim 1, wherein, to obtain the rendered assets, the one or more processors are configured to: determine a target asset based on the pose data; and generate an asset retrieval request to retrieve the target asset from a storage location.
  • 12. The device of claim 11, wherein the target asset is a pre-rendered asset and wherein, to generate the output audio signal, the one or more processors are configured to apply head related transfer functions to the target asset to generate a binaural output signal.
  • 13. The device of claim 11, wherein, to obtain the rendered assets, the one or more processors are configured to render the target asset based on the pose data to generate a rendered asset, and wherein, to generate the output audio signal, the one or more processors are configured to apply head related transfer functions to the rendered asset to generate a binaural output signal.
  • 14. The device of claim 1, wherein the pose data includes first data indicating a translational position of a listener in the immersive audio environment and second data indicating a rotational orientation of the listener in the immersive audio environment.
  • 15. The device of claim 1, further comprising a pose sensor coupled to the one or more processors, wherein the pose sensor and the one or more processors are integrated within a head-mounted wearable device.
  • 16. The device of claim 1, further comprising a modem coupled to the one or more processors and configured to send the pose update parameter to a device that includes a pose sensor.
  • 17. A method comprising: obtaining contextual movement estimate data associated with a portion of an immersive audio environment; setting a pose update parameter based on the contextual movement estimate data; obtaining pose data based on the pose update parameter; obtaining rendered assets associated with the immersive audio environment based on the pose data; and generating an output audio signal based on the rendered assets.
  • 18. The method of claim 17, further comprising, based on pose data associated with a first time: determining two or more predicted listener poses associated with a second time subsequent to the first time; obtaining a first rendered asset associated with a first predicted listener pose; obtaining a second rendered asset associated with a second predicted listener pose; and selectively generating the output audio signal based on either the first rendered asset or the second rendered asset, wherein selectively generating the output audio signal based on either the first rendered asset or the second rendered asset comprises: obtaining a first target asset associated with the first predicted listener pose; rendering the first target asset to generate the first rendered asset; obtaining a second target asset associated with the second predicted listener pose; rendering the second target asset to generate the second rendered asset; obtaining pose data associated with the second time; and selecting, based on the pose data associated with the second time, the first rendered asset or the second rendered asset for further processing.
  • 19. The method of claim 17, wherein the pose data includes first data indicating a translational position of a listener in the immersive audio environment and second data indicating a rotational orientation of the listener in the immersive audio environment, and further comprising: receiving first translation data from a first device; receiving second translation data from a second device distinct from the first device; and determining the first data based on the first translation data and the second translation data.
  • 20. A non-transitory computer-readable device storing instructions that are executable by one or more processors to cause the one or more processors to: obtain contextual movement estimate data associated with a portion of an immersive audio environment; set a pose update parameter based on the contextual movement estimate data; obtain pose data based on the pose update parameter; obtain rendered assets associated with the immersive audio environment based on the pose data; and generate an output audio signal based on the rendered assets.
I. CROSS-REFERENCE TO RELATED APPLICATIONS

The present application claims priority from Provisional Patent Application No. 63/514,053, filed Jul. 17, 2023, entitled “AUDIO PROCESSING,” the content of which is incorporated herein by reference in its entirety.
