The present disclosure is generally related to audio processing and, more particularly, to processing immersive audio.
Advances in technology have resulted in smaller and more powerful computing devices. For example, there currently exist a variety of portable personal computing devices, including wireless telephones such as mobile and smart phones, tablets and laptop computers that are small, lightweight, and easily carried by users. These devices can communicate voice and data packets over wireless networks. Further, many such devices incorporate additional functionality such as a digital still camera, a digital video camera, a digital recorder, and an audio file player. Also, such devices can process executable instructions, including software applications, such as a web browser application, that can be used to access the Internet. As such, these devices can include significant computing capabilities.
One application of such devices includes providing immersive audio to a user. As an example, a headphone device worn by a user can receive streaming audio data from a remote server for playback to the user. Conventional multi-source spatial audio systems are often designed to use a relatively high complexity rendering of audio streams from multiple audio sources with the goal of ensuring that a worst-case performance of the headphone device still results in an acceptable quality of the immersive audio that is provided to the user. However, real-time local rendering of immersive audio is resource intensive (e.g., in terms of processor cycles, time, power, and memory utilization).
Another conventional approach is to offload local rendering of the immersive audio to the streaming device. For example, the headphone device can detect a rotation of the user's head and transmit head tracking information to a remote server. The remote server updates an audio scene based on the head tracking information, generates binaural audio data based on the updated audio scene, and transmits the binaural audio data to the headphone device for playback to the user.
Performing audio scene updates and binauralization at the remote server enables the user to have an immersive audio experience via a headphone device that has relatively limited processing resources. However, due to latencies associated with transmitting the head tracking information to the remote server, updating the audio data based on the head rotation, and transmitting the updated binaural audio data to the headphone device, such a system can result in an unnaturally high motion-to-sound latency. In other words, the time delay between a rotation of the user's head and the corresponding modified spatial audio being played out at the user's ears can be unnaturally long, which may diminish the user's immersive audio experience.
Conventionally, immersive audio environments are generated by rendering streaming audio data corresponding to one or more audio sources in the audio environment based on the listener's pose, and the listener's pose is based on pose data that is generated by one or more sensors of the listener's playback device. Inaccuracies in the pose data cause the listener's pose to be inaccurate. An audio playback system that uses an inaccurate listener's pose to initiate updates to the immersive audio environment can waste resources of the audio playback system, such as by requesting, transmitting, and initiating rendering of an unneeded audio stream based on an inaccurate estimate of the listener's location.
According to one or more aspects of the present disclosure, a device includes a memory configured to store data associated with an immersive audio environment and one or more processors configured to obtain pose data for a listener in the immersive audio environment. The one or more processors are configured to determine a current listener pose based on the pose data and one or more pose constraints. The one or more processors are configured to obtain, based on the current listener pose, a rendered asset associated with the immersive audio environment. The one or more processors are configured to generate an output audio signal based on the rendered asset.
According to one or more aspects of the present disclosure, a method includes obtaining, at one or more processors, pose data for a listener in an immersive audio environment. The method includes determining, at the one or more processors, a current listener pose based on the pose data and one or more pose constraints. The method includes obtaining, at the one or more processors and based on the current listener pose, a rendered asset associated with the immersive audio environment. The method includes generating, at the one or more processors, an output audio signal based on the rendered asset.
According to one or more aspects of the present disclosure, a non-transitory computer-readable device stores instructions that are executable by one or more processors to cause the one or more processors to obtain pose data for a listener in an immersive audio environment. The instructions cause the one or more processors to determine a current listener pose based on the pose data and one or more pose constraints. The instructions cause the one or more processors to obtain, based on the current listener pose, a rendered asset associated with the immersive audio environment. The instructions cause the one or more processors to generate an output audio signal based on the rendered asset.
According to one or more aspects of the present disclosure, an apparatus includes means for obtaining pose data for a listener in an immersive audio environment. The apparatus includes means for determining a current listener pose based on the pose data and one or more pose constraints. The apparatus includes means for obtaining, based on the current listener pose, a rendered asset associated with the immersive audio environment. The apparatus includes means for generating an output audio signal based on the rendered asset.
Other aspects, advantages, and features of the present disclosure will become apparent after review of the entire application, including the following sections: Brief Description of the Drawings, Detailed Description, and the Claims.
Systems and methods for providing an immersive audio environment based on a listener's pose are described. Often, conventional immersive audio environments are generated by rendering streaming audio data corresponding to one or more audio sources in the audio environment based on the listener's pose, and the listener's pose is based on pose data that is generated by one or more sensors of the listener's playback device. Inaccuracies in the pose data cause the listener's pose to be inaccurate. An audio playback system that uses an inaccurate listener's pose to initiate updates to an immersive audio environment can waste resources of the audio playback system, such as by requesting, transmitting, and/or initiating rendering of an unneeded audio stream based on an inaccurate estimate of the listener's location.
The described systems and methods improve the accuracy of the immersive audio environment and improve efficiency by identifying and mitigating, based on one or more pose constraints, outlier values of the listener's pose. For example, one or more pose sensors may generate pose data of the listener that indicates a pose of the listener, and a determination is made as to whether the pose violates one or more constraints based on human body movement, or one or more spatial constraints, or a combination thereof. According to an aspect, the constraints based on human body movement include one or more body pose constraints, such as a constraint on a pose of the listener's head relative to the listener's hand and/or torso, a velocity constraint, an acceleration constraint, or a combination thereof. The spatial constraints can include one or more spatial boundaries associated with the immersive audio environment, such as location limits corresponding to a 6 degrees-of-freedom (6 DOF) rendering operation.
When an outlier value of the listener's pose is detected that violates one or more of the human body movement constraints or spatial constraints, rather than use the outlier value, the disclosed techniques include determining a value for the listener's pose that does not violate any of the constraints. For example, the listener's pose may be set to the most recent (non-outlier) prior pose of the listener. As another example, the listener's pose may be determined by adjusting the outlier pose to not violate any of the constraints. Detection and mitigation of such pose outliers reduces the inefficiencies experienced by conventional systems as a result of processing inaccurate listener poses, including wasted resources due to requesting, transmitting, and initiating rendering of an unneeded audio stream arising from an inaccurate estimation of the listener's location. In addition, detection and mitigation of such pose outliers improves the listener's experience by preventing audio rendering based on an estimate of the listener's movement that is likely erroneous and/or beyond a spatial boundary associated with the immersive audio environment.
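As an illustrative, non-limiting sketch of this detect-and-mitigate flow (the function and parameter names below, such as select_listener_pose and violates_constraints, are hypothetical, and the constraint test is treated as a pluggable function), pose selection can be summarized as follows:

    def select_listener_pose(measured_pose, last_valid_pose, violates_constraints):
        """Return the pose used for asset selection and rendering: the measured pose when it
        satisfies all pose constraints, otherwise the most recent non-outlier prior pose."""
        if violates_constraints(measured_pose):
            return last_valid_pose   # outlier detected: reuse the prior non-outlier pose
        return measured_pose         # non-outlier: use the measured pose as-is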
According to some aspects, in addition to detecting and mitigating outlier values of the listener's current pose, the disclosed techniques also include detecting and mitigating outlier values of a predicted listener pose. For example, a predicted listener pose can be determined based on the listener's current pose and used to pre-fetch assets, such as audio data associated with one or more audio sources, based on a predicted future location of the listener in the immersive audio environment. Detection and mitigation of outliers in predicted listener poses improves the efficiency of an audio rendering system by reducing mis-predictions associated with pre-fetching assets for rendering the immersive audio environment, such as by reducing the consumption of processing resources associated with fetching and processing assets based on incorrect predictions, reducing transmission bandwidth usage associated with pre-fetching of assets based on incorrect predictions, etc.
Particular aspects of the present disclosure are described below with reference to the drawings. In the description, common features are designated by common reference numbers. As used herein, various terminology is used for the purpose of describing particular implementations only and is not intended to be limiting of implementations. For example, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. Further, some features described herein are singular in some implementations and plural in other implementations. To illustrate,
In some drawings, multiple instances of a particular type of feature are used. Although these features are physically and/or logically distinct, the same reference number is used for each, and the different instances are distinguished by addition of a letter to the reference number. When the features as a group or a type are referred to herein (e.g., when no particular one of the features is being referenced), the reference number is used without a distinguishing letter. However, when one particular feature of multiple features of the same type is referred to herein, the reference number is used with the distinguishing letter. For example, referring to
As used herein, the terms “comprise,” “comprises,” and “comprising” may be used interchangeably with “include,” “includes,” or “including.” Additionally, the term “wherein” may be used interchangeably with “where.” As used herein, “exemplary” indicates an example, an implementation, and/or an aspect, and should not be construed as limiting or as indicating a preference or a preferred implementation. As used herein, an ordinal term (e.g., “first,” “second,” “third,” etc.) used to modify an element, such as a structure, a component, an operation, etc., does not by itself indicate any priority or order of the element with respect to another element, but rather merely distinguishes the element from another element having a same name (but for use of the ordinal term). As used herein, the term “set” refers to one or more of a particular element, and the term “plurality” refers to multiple (e.g., two or more) of a particular element.
As used herein, “coupled” may include “communicatively coupled,” “electrically coupled,” or “physically coupled,” and may also (or alternatively) include any combinations thereof. Two devices (or components) may be coupled (e.g., communicatively coupled, electrically coupled, or physically coupled) directly or indirectly via one or more other devices, components, wires, buses, networks (e.g., a wired network, a wireless network, or a combination thereof), etc. Two devices (or components) that are electrically coupled may be included in the same device or in different devices and may be connected via electronics, one or more connectors, or inductive coupling, as illustrative, non-limiting examples. In some implementations, two devices (or components) that are communicatively coupled, such as in electrical communication, may send and receive signals (e.g., digital signals or analog signals) directly or indirectly, via one or more wires, buses, networks, etc. As used herein, “directly coupled” may include two devices that are coupled (e.g., communicatively coupled, electrically coupled, or physically coupled) without intervening components.
In the present disclosure, terms such as “determining,” “calculating,” “estimating,” “shifting,” “adjusting,” etc. may be used to describe how one or more operations are performed. It should be noted that such terms are not to be construed as limiting and other techniques may be utilized to perform similar operations. Additionally, as referred to herein, “obtaining,” “generating,” “calculating,” “estimating,” “using,” “selecting,” “accessing,” and “determining” may be used interchangeably. For example, “obtaining,” “generating,” “calculating,” “estimating,” or “determining” a parameter (or a signal) may refer to actively generating, estimating, calculating, or determining the parameter (or the signal) or may refer to using, obtaining, selecting, reading, receiving, retrieving, or accessing the parameter (or signal) (e.g., from a memory, buffer, container, data structure, lookup table, transmission channel, etc.) that is already generated, such as by another component or device.
The system 100 also includes one or more pose sensors 108. The pose sensor(s) 108 are configured to generate pose data 110 associated with a pose of a user of at least one of the media output device(s) 102. As used herein, a “pose” indicates a location and an orientation of the media output device(s) 102, a location and an orientation of the user of the media output device(s) 102, or both. In some implementations, at least one of the pose sensor(s) 108 is integrated within a wearable device, such that when the wearable device is worn by a user of a media output device 102, the pose data 110 indicates the pose of the user. In some such implementations, the wearable device can include the pose sensor 108 and at least one of the media output device(s) 102. To illustrate, the pose sensor 108 and at least one of the media output device(s) 102 can be combined in a head-mounted wearable device that includes the speaker(s) 104, the display(s) 106, or both. Examples of sensors that can be used as wearable pose sensors include, without limitation, inertial sensors (e.g., accelerometers or gyroscopes), compasses, positioning sensors (e.g., a global positioning system (GPS) receiver), magnetometers, inclinometers, optical sensors, or one or more other sensors to detect location, velocity, acceleration, angular orientation, angular velocity, angular acceleration, or any combination thereof. To illustrate, the pose sensor(s) 108 can include GPS, electronic maps, and electronic compasses that use inertial and magnetic sensor technology to determine direction, such as a 3-axis magnetometer to measure the Earth's geomagnetic field and a 3-axis accelerometer to provide, based on a direction of gravitational pull, a horizontality reference to the Earth's magnetic field vector.
In some implementations, at least one of the pose sensor(s) 108 is not configured to be worn by the user. For example, at least one of the pose sensor(s) 108 can include one or more optical sensors (e.g., cameras) to track movement of the user or the media output device(s) 102. In some implementations, the pose sensor(s) 108 can include a combination of sensor(s) worn by the user and sensor(s) that are not worn by the user, where the combination of sensors is configured to cooperate to generate the pose data 110.
The pose data 110 indicates the pose of the user or the media output device(s) 102 or indicates movement (e.g., changes in pose) of the user or the media output device(s) 102. In this context, “movement” includes rotation (e.g., a change in orientation without a change in location, such as a change in roll, tilt, or yaw), translation (e.g., non-rotational movement), or a combination thereof.
In
The immersive audio renderer 122 includes a binauralizer 128 that is configured to binauralize an output of the rendering operation (e.g., the rendered asset 126) to generate the output audio signal 180. According to an aspect, the output audio signal 180 includes an output binaural signal that is provided to the speaker(s) 104 for playout. The rendering operation and binauralization can include sound field rotation (e.g., three degrees of freedom (3 DOF)), rotation and limited translation (e.g., 3 DOF+), or rotation and translation (e.g., 6 DOF) based on the listener pose.
In
In the same or different implementations, the audio asset selector 124 is configured to select one or more assets for rendering based on a predicted listener pose. As explained further below, a pose predictor can determine the predicted listener pose based on, among other things, the pose data 110. One benefit of selecting an asset based on a predicted listener pose is that the immersive audio renderer 122 can retrieve and/or process (e.g., render) the asset before the asset is needed, thereby avoiding delays due to asset retrieval and processing.
After selecting a target asset, the audio asset selector 124 generates an asset retrieval request 138. The asset retrieval request 138 identifies at least one target asset that is to be retrieved for processing by the immersive audio renderer 122. In implementations in which assets are stored in two or more locations, such as at a remote memory 112 and a local memory 170, the system 100 includes an asset location selector 130 configured to receive the target asset retrieval request 138 and determine which of the available memories to retrieve the asset from. In some circumstances, a particular asset may only be available from one of the memories. For example, assets 172 stored at the local memory 170 may include a subset of the assets 114 stored at the remote memory 112. To illustrate, as described further below, some of the assets 114 can be retrieved (e.g., pre-fetched) from the remote memory 112 and stored among the assets 172 at the local memory 170 before such assets are to be processed by the immersive audio renderer 122.
In some implementations, the asset location selector 130 is configured to retrieve a target asset from the local memory 170 if the target asset is among the assets 172 stored at the local memory 170. In such implementations, based on a determination that the target asset is not stored at the local memory 170, the asset location selector 130 selects to obtain the target asset from the remote memory 112. For example, the asset location selector 130 may send the asset retrieval request 138 to the client 120, and the client 120 may initiate retrieval of the target asset from the remote memory 112 via an asset request 136. Otherwise, based on a determination that the target asset is stored at the local memory 170, the asset location selector 130 selects to obtain the target asset from the local memory 170. For example, the asset location selector 130 may send the asset retrieval request 138 to the local memory 170 to initiate retrieval of the target asset to the immersive audio renderer 122 as a local asset 142.
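As an illustrative, non-limiting sketch of the asset-location selection described above (the names retrieve_asset, local_assets, and fetch_remote are hypothetical, and local caching of a remotely fetched asset is shown as an optional step), the logic can be summarized in Python as follows:

    def retrieve_asset(asset_id, local_assets, fetch_remote):
        """Prefer a locally stored copy of the target asset; otherwise request the asset from
        the remote memory via the client and optionally cache it locally for later reuse."""
        if asset_id in local_assets:           # target asset is among the local assets
            return local_assets[asset_id]      # provided to the renderer as a local asset
        asset = fetch_remote(asset_id)         # asset request toward the remote memory
        local_assets[asset_id] = asset         # optional caching, akin to pre-fetching
        return asset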
In the example illustrated in
The pre-rendered assets 114D of
In some implementations, the assets 172 can include the same types of assets as the assets 114. For example, the assets 172 can include scene-based assets, object-based assets, channel-based assets, pre-rendered assets, or a combination thereof. For example, as noted above, in some implementations, one or more of the assets 114 can be retrieved from the remote memory 112 and stored among the assets 172 at the local memory 170 before such assets are to be processed by the immersive audio renderer 122. When the remote memory 112 provides an asset to the client 120, the asset can be encoded and/or compressed for transmission (e.g., over one or more networks). In some implementations, the client 120 includes or is coupled to a decoder 121 that is configured to decode and/or decompress the asset for storage at the local memory 170, for communication to the immersive audio renderer 122 as a remote asset 144, or both. In some such implementations, one or more of the assets 172 are stored at the local memory 170 in an encoded and/or compressed format, and the decoder 121 is operable to decode and/or decompress a selected one of the asset(s) 172 before the selected asset is communicated to the immersive audio renderer 122 as a local asset 142. To illustrate, when the target asset identified in the asset retrieval request 138 is among the assets 172 stored at the local memory 170, the asset location selector 130 can determine whether the asset is stored in an encoded and/or compressed format. The asset location selector 130 can selectively cause the decoder 121 to decode and/or decompress the asset based on the determination.
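As an illustrative, non-limiting sketch of selectively decoding a locally stored asset (the name load_local_asset is hypothetical, and zlib is used purely as a stand-in for whatever codec the decoder 121 actually implements), the decision can be expressed as follows:

    import zlib

    def load_local_asset(payload, is_compressed):
        """Decompress a locally stored asset only when it is stored in a compressed format;
        an uncompressed asset is passed to the renderer unchanged."""
        return zlib.decompress(payload) if is_compressed else payload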
In
For example, as described in further detail with reference to
According to some aspects, the pose constraint(s) 158 include a human body movement constraint. For example, the human body movement constraint can correspond to a velocity constraint or an acceleration constraint, and the pose outlier detector and mitigator 150 can determine a velocity and/or acceleration of the listener based on the listener's current pose indicated by the pose data 110 and based on one or more prior poses 152 of the listener. The pose outlier detector and mitigator 150 can determine whether the human body movement constraint is violated based on comparing the determined velocity to the velocity constraint and/or by comparing the determined acceleration to the acceleration constraint. In another example, the human body movement constraint corresponds to a constraint on a hand or torso pose of the listener relative to a head pose of the listener, and the pose outlier detector and mitigator 150 can determine whether the human body movement constraint is violated based on determining a relationship (e.g., a rotational offset, a location difference, etc.) between the listener's head and the listener's hand and/or the listener's torso, and comparing the determined head/body relationship to the constraint.
According to some aspects, the pose constraint(s) 158 include a boundary constraint that indicates a boundary associated with the immersive audio environment. The pose outlier detector and mitigator 150 can compare a location of the listener indicated by the pose data 110 to the boundary to determine if the boundary constraint is violated.
If the pose outlier detector and mitigator 150 determines that one or more of the pose constraint(s) 158 are violated, the pose outlier detector and mitigator 150 can generate or update a value of the current listener pose such that the current listener pose satisfies the pose constraint(s) 158. For example, a most recent prior pose 152 that did not violate any of the pose constraint(s) 158 can be used as the current pose 154, as described further with reference to
Similarly, the pose outlier detector and mitigator 150 can obtain a predicted pose 156 that corresponds to a predicted listener pose from a pose predictor, such as described further with reference to
A technical advantage of detecting and mitigating listener pose outliers is that audio rendering based on a listener's movement that is likely erroneous and/or beyond a spatial boundary associated with the immersive audio environment can be reduced or eliminated, which can conserve processing resources of the system 100 as well as improve the listener's experience. Similarly, in addition to detecting and mitigating outlier values of the listener's current pose, detection and mitigation of outliers in predicted listener poses improves the efficiency of the system 100 by reducing mis-predictions associated with pre-fetching assets for rendering the immersive audio environment, such as by reducing the consumption of processing resources associated with fetching and processing assets based on incorrect predictions, reducing usage of bandwidth associated with pre-fetching of assets based on incorrect predictions, etc.
The technical advantages described above can be attained even when the pose sensor(s) 108 perform filtering of pose sensor data to remove outliers during generation of the pose data 110. For example, one or more of the pose sensor(s) 108 may implement filtering (e.g., Kalman filtering) to remove outliers in the pose sensor data and/or in the pose data 110 itself. However, such filtering is conventionally performed without access to the specific pose constraint(s) 158 associated with rendering of the audio scene at the system 100, such as human body movement constraints and audio scene boundaries. Thus, the pose data 110 can still include listener poses that are determined to be outliers by the pose outlier detector and mitigator 150.
Although
In some implementations, many of the components of the system 100 are integrated within the media output device(s) 102. For example, the media output device(s) 102 can include a head-mounted wearable device, such as a headset, a helmet, earbuds, etc., that include the client 120, the local memory 170, the asset location selector 130, the immersive audio renderer 122, the movement estimator 460, the pose sensor(s) 108, or any combination thereof. As another example, the media output device(s) 102 can include a head-mounted wearable device and a separate player device, such as a game console, a computer, or a smart phone. In this example, at least one pair of the speaker(s) 104 and at least one of the pose sensor(s) 108 can be integrated within the head-mounted wearable device and other components of the system 100 can be integrated into the player device, or divided between the player device and the head-mounted wearable device.
For example, the pose outlier detector and mitigator 150 processes pose information 204 that includes the prior pose(s) 152, the current pose 154, the predicted pose(s) 156, and one or more hand/torso poses 214. To illustrate, one or more of the pose sensor(s) 108 may be configured to track movement of the listener's hand, such as pose sensor(s) 108 included in (or coupled to) a handheld controller device, a virtual reality and/or haptic glove, a smart watch or other hand-based wearable device, etc. Additionally or alternatively, one or more of the pose sensor(s) 108 may be configured to track movement of the listener's torso, such as pose sensor(s) 108 included in (or coupled to) a portable electronic device such as a smart phone or tablet device, a virtual reality and/or haptic vest, etc.
The determination of whether one or more pose constraints are violated, at block 210, is based on human body movement constraints 206 and physical constraints on space 208. For example, the human body movement constraints 206 can be included in the pose constraints 158 and can include a velocity constraint 216, an acceleration constraint 218, and a constraint on a hand or torso pose of the listener relative to a head pose of the listener, illustrated as a relative head/body constraint 220.
The physical constraints on space 208 include one or more boundary constraints 222, such as 6 DOF boundaries associated with rendering of the immersive audio environment. For example, the physical constraints on space 208 can include scene boundary distances along 6 directions, such as boundary distances in a +x direction, a −x direction, a +y direction, a −y direction, a +z direction, and a −z direction, where (x, y, z) correspond to a coordinate system used by the immersive audio renderer 122 to represent locations of the listener and audio sources associated with the sound scene.
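As an illustrative, non-limiting sketch of how the boundary constraints 222 might be represented (the class name, field names, and units are assumptions), the six scene boundary distances and a containment check that can be applied to a current or predicted listener location can be expressed as follows:

    from dataclasses import dataclass

    @dataclass
    class BoundaryConstraints:
        """Scene boundary distances along six directions, measured from the scene origin."""
        pos_x: float
        neg_x: float
        pos_y: float
        neg_y: float
        pos_z: float
        neg_z: float

        def contains(self, x, y, z):
            """True when a listener location lies on or inside the 6 DOF boundary box."""
            return (-self.neg_x <= x <= self.pos_x
                    and -self.neg_y <= y <= self.pos_y
                    and -self.neg_z <= z <= self.pos_z)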
Determining whether the one or more constraints are violated, at block 210, can include determining a velocity associated with the current pose 154 and comparing the velocity to the velocity constraint 216. For example, the velocity can correspond to a rotational velocity associated with a difference in the listener's head orientation between a selected prior pose 152 (e.g., a most recent prior pose 152) and the current pose 154 over a time period between a first timestamp associated with the selected prior pose 152 and a second timestamp associated with the current pose 154. The velocity constraint 216 can include a rotational velocity threshold, and the determined rotational velocity can be compared to the rotational velocity threshold to determine if the velocity constraint 216 is violated. As another example, the velocity can correspond to a translational velocity associated with a difference in the listener's location between a selected prior pose 152 (e.g., a most recent prior pose 152) and the current pose 154 over a time period between a first timestamp associated with the selected prior pose 152 and a second timestamp associated with the current pose 154. The velocity constraint 216 can include a translational velocity threshold, and the determined translational velocity can be compared to the translational velocity threshold to determine if the velocity constraint 216 is violated. Similar comparisons can be made to determine if the predicted pose 156 violates a rotational and/or translational velocity threshold of the velocity constraint 216 by determining the rotational velocity based on the change of head orientation between the current pose 154 and the predicted pose 156 over the time period between the second timestamp associated with the current pose 154 and a predicted third timestamp associated with the predicted pose 156 and/or by determining the translational velocity based on the change of the listener's location between the current pose 154 and the predicted pose 156 over the time period between the second timestamp associated with the current pose 154 and the predicted third timestamp associated with the predicted pose 156.
Determining whether the one or more constraints are violated, at block 210, can include determining an acceleration associated with the current pose 154 and comparing the acceleration to the acceleration constraint 218. For example, the acceleration can correspond to a rotational acceleration associated with a difference in the listener's head rotational velocity between a selected prior pose 152 (e.g., a most recent prior pose 152) and the current pose 154 over a time period between a first timestamp associated with the selected prior pose 152 and a second timestamp associated with the current pose 154. The acceleration constraint 218 can include a rotational acceleration threshold, and the determined rotational acceleration can be compared to the rotational acceleration threshold to determine if the acceleration constraint 218 is violated. As another example, the acceleration can correspond to a translational acceleration associated with a difference in the listener's translational velocity between a selected prior pose 152 (e.g., a most recent prior pose 152) and the current pose 154 over a time period between a first timestamp associated with the selected prior pose 152 and a second timestamp associated with the current pose 154. The acceleration constraint 218 can include a translational acceleration threshold, and the determined translational acceleration can be compared to the translational acceleration threshold to determine if the acceleration constraint 218 is violated. Similar comparisons can be made to determine if the predicted pose 156 violates a rotational and/or translational acceleration threshold of the acceleration constraint 218 by determining the rotational acceleration based on the change of head rotational velocity between the current pose 154 and the predicted pose 156 over the time period between the second timestamp associated with the current pose 154 and a predicted third timestamp associated with the predicted pose 156 and/or by determining the translational acceleration based on the change of the listener's translational velocity between the current pose 154 and the predicted pose 156 over the time period between the second timestamp associated with the current pose 154 and the predicted third timestamp associated with the predicted pose 156.
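As an illustrative, non-limiting sketch of the velocity and acceleration checks described above (the function names and the dictionary of thresholds are assumptions, and only yaw is used to represent head orientation for brevity), the comparisons can be expressed as follows:

    import math

    def rotational_velocity(yaw_prior, yaw_current, t_prior, t_current):
        """Rotational velocity implied by the change in head yaw between two timestamped poses."""
        return abs(yaw_current - yaw_prior) / (t_current - t_prior)

    def translational_velocity(loc_prior, loc_current, t_prior, t_current):
        """Translational velocity implied by the change in listener location over time."""
        return math.dist(loc_prior, loc_current) / (t_current - t_prior)

    def acceleration(vel_prior, vel_current, t_prior, t_current):
        """Acceleration implied by the change in a rotational or translational velocity."""
        return abs(vel_current - vel_prior) / (t_current - t_prior)

    def violates_motion_constraints(rot_vel, trans_vel, rot_acc, trans_acc, limits):
        """Compare computed velocities and accelerations against their thresholds; limits is
        assumed to be a dict with 'rot_vel', 'trans_vel', 'rot_acc', and 'trans_acc' keys."""
        return (rot_vel > limits["rot_vel"] or trans_vel > limits["trans_vel"]
                or rot_acc > limits["rot_acc"] or trans_acc > limits["trans_acc"])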
Determining whether the one or more constraints are violated, at block 210, can include determining the listener's current hand and/or torso location relative to the listener's head pose indicated by the current pose 154, and comparing the current hand and/or torso location relative to the listener's head pose to the relative head/body constraint 220. For example, the hand/torso pose 214 may include a hand pose that indicates a location of the listener's hand, and the location of the listener's hand relative to the location of the listener's head (indicated by the current pose 154) may be determined and compared to a hand-to-head relative location constraint of the relative head/body constraint 220. As another example, the hand/torso pose 214 may include a body pose that indicates a location of the listener's torso, and the location of the listener's torso relative to the location of the listener's head may be determined and compared to a torso-to-head relative location constraint of the relative head/body constraint 220. Similar operations may be performed to compare a hand-to-head relative rotation and/or a torso-to-head relative rotation to a hand-to-head relative rotation constraint and/or a torso-to-head relative rotation constraint, respectively, of the relative head/body constraint 220. In some implementations, analogous comparisons of predicted hand-to-head relative location and/or rotation, predicted torso-to-head relative location and/or rotation, or a combination thereof, may be made to corresponding constraints of the relative head/body constraint 220 to determine if the predicted pose 156 and a predicted head/torso pose violate the relative head/body constraint 220.
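As an illustrative, non-limiting sketch of the relative head/body check (the function name and the single distance threshold are assumptions; a fuller check could also compare relative rotations against limits), the comparison can be expressed as follows:

    import math

    def violates_head_body_constraint(head_location, hand_location, max_hand_to_head_distance):
        """Flag a hand pose that is implausibly far from the head location indicated by the
        current pose; an analogous check can be applied to the torso location."""
        return math.dist(head_location, hand_location) > max_hand_to_head_distance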
Determining whether the one or more constraints are violated, at block 210, can include determining the listener's location in the audio scene, as indicated by the current pose 154, and comparing the listener's location to physical constraints on space 208. For example, the listener's location can be compared to the boundary constraints 222 to determine whether the listener is within or outside of a spatial boundary defined by the boundary constraints 222. Determining whether the one or more constraints are violated, at block 210, can include determining a predicted listener location in the audio scene, as indicated by the predicted pose 156, and comparing the listener's predicted location to the physical constraints on space 208. For example, the listener's predicted location can be compared to the boundary constraints 222 to determine whether the listener is predicted to be within or outside of a spatial boundary defined by the boundary constraints 222.
In response to determining, based on the pose information 204, that one or more of the human body movement constraints 206 and/or one or more of the physical constraints on space 208 are violated, the pose outlier detector and mitigator 150 sets an outlier detection indicator 224, at block 212. The outlier detection indicator 224 may be used to trigger performance of pose outlier mitigation operations, as described further with reference to
In some implementations, the outlier detection indicator 224 includes an indication of which constraint(s) were violated, an indication of whether the violations were detected for the current pose 154 or for the predicted pose 156, or any combination thereof. Information associated with the determination that one or more constraints have been violated, such as computed velocities, accelerations, locations, relative movements, etc., can also be saved for re-use during outlier mitigation.
Based on a determination that the current pose 154 does not violate the one or more pose constraints (e.g., the outlier detection indicator 224 has not been set), the pose outlier detector and mitigator 150 is configured to use the current pose 154 as the current listener pose for purposes of asset retrieval (e.g., stream selection) at the audio asset selector 124 and/or rendering at the immersive audio renderer 122.
Otherwise, based on a determination that the current pose 154 violates at least one of the one or more pose constraints, the pose outlier detector and mitigator 150 is configured to determine the current listener pose based on a prior listener pose that did not violate the pose constraints. To illustrate, if the outlier detection indicator 224 has been set to indicate that the current pose 154 is an outlier, the current pose 154 is set to a previous pose, at operation 342. For example, the current pose 154 can be set to have the value of a most recent prior pose 152 that was not determined to be an outlier. Setting the current pose 154 to the value of the most recent non-outlier prior pose 152 can include changing one or more values of the current pose 154 to equal corresponding values of the prior pose 152, replacing the current pose 154 with the prior pose 152, or adjusting the current pose 154 to match the prior pose 152, as illustrative, non-limiting examples.
In some implementations, the pose outlier detector and mitigator 150 is further configured to, based on the determination that the current pose 154 violates at least one of the one or more pose constraints, determine the predicted pose 156 based on a prior predicted listener pose associated with the prior listener pose. For example, the prior pose 152 that is selected as the most recent non-outlier prior pose 152 for adjusting the current pose 154 may be associated with a prior predicted listener pose, and the predicted pose 156 is set to the value of that prior predicted listener pose, at operation 344.
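As an illustrative, non-limiting sketch of this mitigation option (the function name and the history format are assumptions), reuse of the most recent non-outlier prior pose and its associated prior predicted pose can be expressed as follows:

    def mitigate_with_prior_pose(current_pose, predicted_pose, history):
        """When the current pose is flagged as an outlier, reuse the most recent non-outlier
        prior pose together with its associated prior predicted pose; history is assumed to
        be a list of (pose, predicted_pose, is_outlier) tuples ordered oldest first."""
        for pose, predicted, is_outlier in reversed(history):
            if not is_outlier:
                return pose, predicted         # repeat the last known-good pose and prediction
        return current_pose, predicted_pose    # no usable history: keep the measured values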
One effect of the operations 342, 344 at block 340 is that the pose(s) used for asset selection and audio rendering may effectively repeat the most recent prior non-outlier pose and predicted pose, as if the listener had not moved since the prior non-outlier pose was indicated in the pose data 110. In certain circumstances, when one or more of the human body movement constraints 206 were detected to be violated, the violation may be indicative of an unusually large or physiologically improbable (or impossible) movement of the listener. For example, the human body movement constraints 206 may be associated with respective thresholds 306, including a velocity threshold 316 corresponding to the velocity constraint 216, an acceleration threshold 318 corresponding to the acceleration constraint 218, and a body pose threshold 320 corresponding to the relative head/body constraint 220. Each of the thresholds 306 may correspond to a maximum limit above which the listener's motion would prevent the listener from being able to track changes in the audio scene, so processing resources that would otherwise be expended by selecting, acquiring, and rendering assets while the listener's motion is greater than one or more of the thresholds 306 can be conserved without adversely impacting the listener's experience. In other circumstances, when one or more of the physical constraints on space 208 were detected to be violated, the violation may be indicative of the listener moving outside of the audio scene boundaries or into a region of the audio scene that has no audio sources, and repeating the most recent prior non-outlier pose and predicted pose prevents generation of an ineffectual target asset retrieval request 138.
Based on a determination that the current pose 154 does not violate the one or more pose constraints (e.g., the outlier detection indicator 224 has not been set), the pose outlier detector and mitigator 150 is configured to use the current pose 154 as the current listener pose for purposes of asset retrieval (e.g., stream selection) at the audio asset selector 124 and/or rendering at the immersive audio renderer 122.
Otherwise, based on a determination that the current pose 154 violates at least one of the one or more pose constraints, the pose outlier detector and mitigator 150 is configured to determine the current listener pose based on an adjustment of the current pose 154 to satisfy the one or more pose constraints. To illustrate, if the outlier detection indicator 224 has been set to indicate that the current pose 154 is an outlier, a value of the current pose 154 is adjusted to match a threshold 306 or a spatial boundary associated with a violated pose constraint at operation 372. For example, when a velocity of the listener that is determined based on the current pose 154 exceeds the velocity threshold 316, the velocity associated with the current pose 154 may be adjusted to match the velocity threshold 316, such as via a clipping operation that clips one or more values associated with the current pose such that none of the thresholds 306 are exceeded, at operation 372. As another example, when a location of the listener based on the current pose 154 is outside of a boundary indicated by the boundary constraints 222, the location associated with the current pose 154 may be clipped so that the boundary is not crossed (e.g., movement of the listener is allowed up to, but not beyond, the boundary).
In some implementations, the pose outlier detector and mitigator 150 is further configured to, based on the determination that the current pose 154 violates at least one of the one or more pose constraints, determine the predicted pose 156 based on adjusting a prior predicted listener pose associated with a prior listener pose. For example, a most recent non-outlier prior pose 152 may be selected, and the prior predicted pose 156 associated with the selected prior pose 152 may instead be used as the predicted pose 156. If use of the selected prior predicted pose 156 causes one or more of the thresholds 306 or the spatial boundaries to be exceeded, the prior predicted pose 156 can be adjusted (e.g., clipped) to match a threshold or a boundary associated with a violated pose constraint, at operation 374, in a similar manner as described for operation 372.
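As an illustrative, non-limiting sketch of the clipping-based mitigation (the function names and the per-axis limit format are assumptions), clipping a listener location to the scene boundary and to a translational velocity threshold can be expressed as follows:

    import math

    def clip_location_to_boundary(location, limits):
        """Clip an (x, y, z) listener location so it stays on or inside per-axis limits given
        as ((x_min, x_max), (y_min, y_max), (z_min, z_max))."""
        return tuple(min(max(v, lo), hi) for v, (lo, hi) in zip(location, limits))

    def clip_translation_to_velocity(prior_location, location, dt, max_speed):
        """If the implied translational speed exceeds its threshold, pull the new location back
        along the direction of motion so that the speed exactly matches the threshold."""
        dist = math.dist(prior_location, location)
        if dt <= 0 or dist == 0 or dist / dt <= max_speed:
            return location
        scale = (max_speed * dt) / dist
        return tuple(p + (c - p) * scale for p, c in zip(prior_location, location))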
One effect of the operations 372, 374 at block 370 is that the pose(s) used for asset selection and audio rendering may track the listener's pose that is indicated by the pose data 110 as closely as possible but without allowing the listener's pose to violate any of the thresholds 306 or boundary constraints 222.
Although a particular implementation of each of the operations 200, 300, and 350 of
In
In some implementations, the movement estimator 460 is configured to determine the contextual movement estimate data 462 based on the prior pose(s) 152, the current pose 154, the predicted pose(s) 156, or a combination thereof. For example, the movement estimator 460 can determine the contextual movement estimate data 462 based on a historical movement rate, where the historical movement rate is determined based on differences between the prior pose(s) 152, between the prior pose(s) 152 and the current pose 154, between the prior pose(s) 152 or current pose 154 and the predicted pose(s) 156, or combinations thereof. In this context, the prior pose(s) 152 can include historical pose data 110; whereas the current pose 154 refers to a pose indicated by a most recent set of samples of the pose data 110.
The movement estimator 460 can base the contextual movement estimate data 462 on various types of information. For example, the movement estimator 460 can generate the movement estimate data 462 based on the pose data 110. To illustrate, the pose data 110 can indicate a current listener pose, and the movement estimator 460 can generate the movement estimate data 462 based on the current listener pose or a recent set of changes in the current listener pose over time. As one example, the movement estimator 460 can generate the contextual movement estimate data 462 based on a recent rate and/or a recent type of change of the listener pose, based on a set of recent listener pose data, where “recent” is determined based on some specified time limit (e.g., the last one minute, the last five minutes, etc.) or based on a specified number of samples of the pose data 110 (e.g., the most recent ten samples, the most recent one hundred samples, etc.).
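As an illustrative, non-limiting sketch of estimating a recent movement rate from a set of recent pose samples (the function name, the window size, and the sample format are assumptions), the computation can be expressed as follows:

    import math

    def recent_movement_rate(pose_samples, window=10):
        """Estimate a recent translational movement rate from the last `window` timestamped
        samples; each sample is assumed to be ((x, y, z), t_seconds)."""
        samples = pose_samples[-window:]
        if len(samples) < 2:
            return 0.0
        path = sum(math.dist(a[0], b[0]) for a, b in zip(samples, samples[1:]))
        elapsed = samples[-1][1] - samples[0][1]
        return path / elapsed if elapsed > 0 else 0.0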
As another example, the movement estimator 460 can generate the movement estimate data 462 based on a predicted pose. For example, a pose predictor can generate a predicted listener pose based at least partially on the pose data 110. The predicted listener pose can indicate a location and/or orientation of the listener in the immersive audio environment at some future time. In this example, the movement estimator 460 can generate the movement estimate data 462 based on movement that will occur (e.g., that is predicted to occur) to change from the current listener pose to the predicted listener pose.
As another example, the movement estimator 460 can generate the movement estimate data 462 based on historical interaction data 458 associated with an asset, associated with an immersive audio environment, associated with a scene of the immersive audio environment, or a combination thereof. The historical interaction data 458 can be indicative of interaction of a current user of the media output device(s) 102, interaction of other users who have consumed specific assets or interacted with the immersive audio environment, or a combination thereof. For example, the historical interaction data 458 can include movement trace data descriptive of movements of a set of users (which may include the current user) who have interacted with the immersive audio environment. In this example, the movement estimator 460 can use the historical interaction data 458 to estimate how much the current user is likely to move in the near future (e.g., during consumption of a portion of an asset or scene that the user is currently consuming). To illustrate, when the immersive audio environment is related to game content, a scene of the game content can depict (in sound, video, or both) a startling event (e.g., an explosion, a crash, a jump scare, etc.) that historically has caused users to quickly look in a particular direction or to pan around the environment, as indicated by the historical interaction data 458. In this illustrative example, the contextual movement estimate data 462 can indicate, based on the historical interaction data 458, that a rate of movement and/or a type of movement of the listener pose is likely to increase when the startling event occurs.
As another example, the movement estimator 460 can generate the movement estimate data 462 based on one or more context cues 454 (also referred to herein as “movement cues”) associated with the immersive audio environment. One or more of the context cue(s) 454 can be explicitly provided in metadata of the asset(s) representing the immersive audio environment. For example, metadata associated with an asset can include a field that indicates the contextual movement estimate data 462. To illustrate, a game creator or distributor can indicate in metadata associated with a particular asset that the asset or a portion of the asset is expected to result in a change in the rate of listener movement. As one example, if a scene of a game includes an event that is likely to cause the user to move more (or less), metadata of the game can indicate when the event occurs during playout of an asset, where the event occurs (e.g., a sound source location in the immersive audio environment), a type of event, an expected result of the event (e.g., increased or decreased translation in a particular direction, increased or decreased head rotation, etc.), a duration of the event, etc.
In some implementations, one or more of the context cue(s) 454 are implicit rather than explicit. For example, metadata associated with an asset can indicate a genre of the asset, and the movement estimator 460 can generate the contextual movement estimate data 462 based on the genre of the asset. To illustrate, the movement estimator 460 may expect less rapid head movement during play out of an immersive audio environment representing a classical music genre than is expected during play out of an immersive audio environment representing a first-person shooter game.
The movement estimator 460 is configured to set one or more pose update parameter(s) 456 based on the contextual movement estimate data 462. In a particular aspect, the pose update parameter(s) 456 indicate a pose data update rate for the pose data 110. For example, the movement estimator 460 can set the pose update parameter(s) 456 by sending the pose update parameter(s) 456 to the pose sensor(s) 108 to cause the pose sensor(s) 108 to provide the pose data 110 at a rate associated with the pose update parameter(s) 456. In some implementations, the system 100 includes two or more pose sensor(s) 108. In such implementations, the movement estimator 460 can send the same pose update parameter(s) 456 to each of the two or more pose sensor(s) 108, or the movement estimator 460 can send different pose update parameter(s) 456 to different pose sensor(s) 108. To illustrate, the system 100 can include a first pose sensor 108 configured to generate pose data 110 indicating a translational position of a listener in the immersive audio environment and a second pose sensor 108 configured to generate pose data 110 indicating a rotational orientation of the listener in the immersive audio environment. In this example, the movement estimator 460 can send different pose update parameter(s) 456 to the first and second pose sensors 108. For example, the contextual movement estimate data 462 can indicate that a rate of head rotation is expected to increase whereas a rate of translation is expected to remain unchanged. In this example, the movement estimator 460 can send first pose update parameter(s) 456 to cause the second pose sensor to increase the rate of generation of the pose data 110 indicating the rotational orientation of the listener and can refrain from sending pose update parameter(s) 456 to the first pose sensor (or can send second pose update parameter(s) 456) to cause the first pose sensor to continue generation of the pose data 110 indicating the translational position at the same rate as before.
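As an illustrative, non-limiting sketch of mapping contextual movement estimates to per-sensor pose data update rates (the function name, the normalized movement estimates, and the specific rates and threshold are assumptions), the selection can be expressed as follows:

    def select_pose_update_rates(rotation_estimate, translation_estimate,
                                 low_hz=20.0, high_hz=120.0, threshold=0.5):
        """Map contextual movement estimates (normalized to the range 0..1 here) to per-sensor
        pose data update rates: faster expected movement yields a higher update rate."""
        return {
            "rotation_sensor_hz": high_hz if rotation_estimate > threshold else low_hz,
            "translation_sensor_hz": high_hz if translation_estimate > threshold else low_hz,
        }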
One technical advantage of using the contextual movement estimate data 462 to set the pose update parameter(s) 456 is that pose data 110 update rates can be set based on user movement rates, which can enable conservation of resources and improved user experience. For example, when relatively high movement rates are expected (as indicated by the contextual movement estimate data 462), the pose update parameter(s) 456 can be set to increase the rate at which the pose data 110 is updated. The increased update rate for the pose data 110 reduces motion/sound latency of the output audio signal 180. To illustrate, in this example, user movement (e.g., head rotation) is reflected in the output audio signal 180 more quickly because pose data 110 reflecting the user movement is available to the immersive audio renderer 122 more quickly. Conversely, when relatively low movement rates are expected (as indicated by the contextual movement estimate data 462) the pose update parameter(s) 456 can be set to decrease the rate at which the pose data 110 is updated. The decreased update rate for the pose data 110 conserves resources (e.g., computing cycles, power, memory) associated with rendering and binauralization by the immersive audio renderer 122, resources (e.g., bandwidth, power) associated with transmission of the pose data 110, or a combination thereof.
In a particular aspect, the pose predictor 450 is configured to determine the predicted pose(s) 156 using predictive techniques such as extrapolation based on the prior pose(s) 152 and/or the current pose 154; inference using one or more artificial intelligence models; probability-based estimates based on the prior pose(s) 152 and/or the current pose 154; probability-based estimates based on the historical interaction data 458 of
In some implementations, the immersive audio renderer 122 can render two or more assets based on the predicted pose(s) 156. For example, in some circumstances, there can be significant uncertainty as to which of a set of possible poses the user will move to in the future. To illustrate, in a game environment, the user can be faced with several choices, and the specific choice the user makes can change the asset to be rendered, a future listener pose, or both. In this example, the predicted pose(s) 156 can include multiple poses for a particular future time, and the immersive audio renderer 122 can render one asset based on two or more predicted pose(s) 156, can render two or more different assets based on the two or more predicted poses 156, or both. In this example, when the current pose 154 aligns with one of the predicted pose(s) 156, the corresponding rendered asset is used to generate the output audio signal 180.
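As an illustrative, non-limiting sketch of one simple predictive technique, linear extrapolation of the listener pose from two timestamped poses (the function name and the pose dictionary format are assumptions, and practical systems may instead use artificial intelligence models or probability-based estimates), the prediction can be expressed as follows:

    def extrapolate_pose(prior_pose, current_pose, horizon_s):
        """Predict a future listener pose by linear extrapolation of location and yaw between
        two timestamped poses; each pose is assumed to be a dict with 'loc', 'yaw', and 't'."""
        dt = current_pose["t"] - prior_pose["t"]
        if dt <= 0:
            return dict(current_pose)
        velocity = [(c - p) / dt for p, c in zip(prior_pose["loc"], current_pose["loc"])]
        yaw_rate = (current_pose["yaw"] - prior_pose["yaw"]) / dt
        return {
            "loc": tuple(c + v * horizon_s for c, v in zip(current_pose["loc"], velocity)),
            "yaw": current_pose["yaw"] + yaw_rate * horizon_s,
            "t": current_pose["t"] + horizon_s,
        }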
In some implementations, the immersive audio renderer 122 can render assets in stages, as described with reference to
In a particular aspect, a mode or rate of pose prediction by the pose predictor 450 can be related to the pose update parameter(s) 456. For example, the pose predictor 450 can be turned off when the pose update parameter(s) 456 have a particular value. To illustrate, when the contextual movement estimate data 462 indicates that little or no user movement is expected for a particular period of time, the pose update parameter(s) 456 can be set such that the pose sensor(s) 108 are turned off or provide pose data 110 at a low rate and the pose predictor 450 is turned off. Conversely, when the contextual movement estimate data 462 indicates that rapid user movement is expected for a particular period of time, the pose update parameter(s) 456 can be set such that the pose sensor(s) 108 provide pose data 110 at a high rate and the pose predictor 450 generates predicted poses 156. Additionally, or alternatively, the pose predictor 450 can generate predicted pose(s) 156 for times a first distance in the future in a first mode and a second distance in the future in a second mode, where the mode is selected based on the pose update parameter(s) 456.
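As an illustrative, non-limiting sketch of gating the mode or rate of pose prediction based on the expected movement (the function name, thresholds, prediction counts, and horizons are assumptions), the selection can be expressed as follows:

    def prediction_settings(expected_movement, off_threshold=0.05, fast_threshold=0.5):
        """Gate the pose predictor on the contextual movement estimate: disable prediction when
        little movement is expected, and generate more predicted poses over a shorter horizon
        when rapid movement is expected."""
        if expected_movement < off_threshold:
            return {"enabled": False, "num_predictions": 0, "horizon_s": 0.0}
        if expected_movement < fast_threshold:
            return {"enabled": True, "num_predictions": 1, "horizon_s": 0.5}
        return {"enabled": True, "num_predictions": 4, "horizon_s": 0.2}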
A technical advantage of adjusting the mode or rate of pose prediction by the pose predictor 450 based on the pose update parameter(s) 456 is that the pose predictor 450 can generate more predicted pose(s) 156 for periods when more movement is expected and fewer predicted pose(s) 156 for periods when less movement is expected. Generating more predicted pose(s) 156 for periods when more movement is expected enables the immersive audio renderer 122 to have a higher likelihood of rendering in advance an asset that will be used to generate the output audio signal 180. For example, the immersive audio renderer 122 can render assets associated with the predicted pose(s) 156 and use a particular one of the rendered assets to generate the output audio signal 180 when the current pose 154 corresponds to the predicted pose used to render the particular asset. In this example, having more predicted poses 156 and corresponding rendered assets means that there is a higher likelihood that the current pose 154 at some point in the future will correspond to one of the predicted poses 156, enabling use of the corresponding rendered asset to generate the output audio signal 180 rather than performing real-time rendering operations. On the other hand, pose prediction and rendering assets based on predicted poses 156 is resource intensive, and can be wasteful if the assets rendered based on the predicted poses 156 are not used. Accordingly, generating fewer predicted pose(s) 156 for periods when less movement is expected enables the immersive audio renderer 122 to conserve resources.
In the system 500, the pose sensor(s) 108, the pose outlier detector and mitigator 150, the pose predictor 450, and the movement estimator 460 are onboard (e.g., integrated within) the media output device(s) 102. To enable the immersive audio renderer 122 to render certain assets before they are needed (e.g., based on predicted pose(s) 156), the pose data 110 of
As described with reference to
In a particular aspect, the pose predictor 450 is configured to determine the predicted pose(s) 156 using the predictive technique(s) described with reference to
In
In some implementations, the movement trace data 602 stored at the remote memory 112 is a copy of (e.g., the same as) the movement trace data 606 stored at the local memory 170. In some implementations, the movement trace data 602 stored at the remote memory 112 includes the same types of information (e.g., data fields) as the movement trace data 606 stored at the local memory 170, but includes information describing how users of other immersive audio player devices have interacted with the immersive audio environment. For example, the movement trace data 602 can aggregate historical user interaction data associated with the immersive audio environment across a plurality of users of the immersive audio player 402 and other immersive audio players.
In implementations in which the movement estimator 460 determines the contextual movement estimate data 462 based on the historical interaction data 458, the historical interaction data 458 can indicate, or be used to determine, movement probability information associated with a particular scene or a particular asset of the immersive audio environment. For example, the movement probability information can indicate how likely a particular movement rate is during a particular portion of the immersive audio environment based on how the user or other users have moved during playback of the particular portion. As another example, the movement probability information can indicate how likely movement of a particular type (e.g., translation in a particular direction, rotation in a particular direction, etc.) is during a particular portion of the immersive audio environment based on how the user or other users have moved during playback of the particular portion. As a result, the movement estimator 460 can set the pose update parameter(s) 456 to prepare for expected movement associated with playback of the immersive audio environment. For example, when the historical interaction data 458 indicates that an upcoming portion of the immersive audio environment has historically been associated with rapid rotation of the listener pose, the movement estimator 460 can set the pose update parameter(s) 456 to increase the rate at which rotation related pose data 110 is provided by the pose sensor(s) 108 to decrease the motion-to-sound latency associated with the playout of the upcoming portion. Conversely, when the historical interaction data 458 indicates that an upcoming portion of the immersive audio environment has historically been associated with little or no change of the listener pose, the movement estimator 460 can set the pose update parameter(s) 456 to decrease the rate at which the pose data 110 is provided by the pose sensor(s) 108 to conserve power and computing resources.
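A brief Python sketch of this idea follows: a historical probability of rapid rotation for an upcoming scene is used to raise or lower the rotation pose-data rate. The probability table layout, scene identifiers, and thresholds are hypothetical; the disclosure does not prescribe a specific data format for the historical interaction data 458.

```python
# Sketch of deriving a pose-data rate from historical interaction data.
# The probability table and thresholds are illustrative placeholders.
from typing import Mapping


def rate_from_history(scene_id: str,
                      rotation_rate_prob: Mapping[str, float],
                      base_rate_hz: float = 30.0) -> float:
    """Pick a rotation pose-data rate for an upcoming scene.

    rotation_rate_prob maps scene ids to the historical probability that
    listeners rotate rapidly during that scene.
    """
    p_rapid = rotation_rate_prob.get(scene_id, 0.0)
    if p_rapid > 0.7:
        # Historically rapid rotation: raise the sensor rate to reduce
        # motion-to-sound latency for the upcoming portion.
        return 4 * base_rate_hz
    if p_rapid < 0.1:
        # Historically little movement: throttle the sensor to save power.
        return base_rate_hz / 4
    return base_rate_hz
```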
In implementations in which the pose predictor 450 determines the predicted pose(s) 156 based on the historical interaction data 458, the historical interaction data 458 can indicate, or be used to determine, pose probability information associated with a particular scene or a particular asset of the immersive audio environment. For example, the pose probability information can indicate the likelihood of particular listener locations, particular listener orientations, or particular listener poses during playback of a particular portion of the immersive audio environment based on historic listener poses during playback of the particular portion.
A technical benefit of determining the historical interaction data 458 based on the movement trace data 602, 606 is that the movement trace data 602, 606 provides an accurate estimate of how real users interact with the immersive audio environment, thereby enabling more accurate pose prediction, more accurate contextual movement estimation, or both. Further, the movement trace data 602, 606 can be captured readily. To illustrate, during use of the immersive audio player 402 to playout content associated with a particular immersive audio environment, the immersive audio player 402 can store the movement trace data 606 at the local memory 170. The immersive audio player 402 can send the movement trace data 606 to the remote memory 112 to update the movement trace data 602 at any convenient time, such as after playout of the content associated with the particular immersive audio environment is complete or when the immersive audio player 402 is connected to the remote memory 112 and the connection to the remote memory 112 has available bandwidth. The movement trace data 602 can include an aggregation of historical interaction data from a user of the immersive audio player 402 and other users.
In a particular aspect, the pose sensors 108A and 108B are used together to determine a listener pose. For example, in some implementations, the pose sensor 108A provides pose data 110A representing rotation (e.g., a user's head rotation), and the pose sensor 108B provides pose data 110B indicating translation (e.g., a user's body movement). As another example, the pose data 110A can include first translation data, and the pose data 110B can include second translation data. In this example, the first and second translation data can be combined (e.g., subtracted) to determine a change in the listener pose in the immersive audio environment. Additionally, or alternatively, the pose data 110A can include first rotation data, and the pose data 110B can include second rotation data. In this example, the first and second rotation data can be combined (e.g., subtracted) to determine a change in the listener pose in the immersive audio environment.
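The following Python sketch illustrates the kind of combination described above, assuming simple translation vectors and yaw angles; the fusion details (frames of reference, filtering, angle wrapping) are simplifications chosen for the sketch rather than requirements of the disclosure.

```python
# Sketch of combining pose data from two sensors (e.g., a head-mounted
# sensor and a body-worn sensor). Representations are illustrative.
import numpy as np


def relative_translation(head_translation: np.ndarray,
                         body_translation: np.ndarray) -> np.ndarray:
    """Combine (here, subtract) two translation estimates, e.g., to isolate
    head movement relative to body movement."""
    return head_translation - body_translation


def relative_yaw(head_yaw_rad: float, body_yaw_rad: float) -> float:
    """Combine two rotation estimates into a head-relative-to-body yaw,
    wrapped to the range [-pi, pi)."""
    delta = head_yaw_rad - body_yaw_rad
    return (delta + np.pi) % (2 * np.pi) - np.pi
```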
In the example illustrated in
In a particular aspect, the mixing and binauralization operations 822 can be performed by a mixer and binauralizer 814 which includes, corresponds to, or is included within the binauralizer 128 of any of
When an asset is received for rendering, the pre-processing module 802 is configured to receive head-related impulse responses (HRIRs) and audio source position information p_i (where p_i is a position vector and i is an audio source index), such as (x, y, z) coordinates of the location of each audio source in an audio scene. The pre-processing module 802 is configured to generate head-related transfer functions (HRTFs) and a representation of the audio source locations as a set of triangles T_1 . . . T_NT (where NT denotes the number of triangles) having an audio source at each triangle vertex.
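The Python sketch below shows one possible organization of such pre-processing: converting HRIRs to HRTFs with a real FFT and triangulating source positions. It assumes sources roughly on a horizontal plane so that a 2D Delaunay triangulation yields triangles; that assumption, and all identifiers, are choices made for the sketch rather than the pre-processing module 802's actual interface.

```python
# Sketch of pre-processing: HRIR -> HRTF conversion and source triangulation.
import numpy as np
from scipy.spatial import Delaunay


def hrirs_to_hrtfs(hrirs: np.ndarray) -> np.ndarray:
    """hrirs: (num_directions, 2, taps) time-domain impulse responses.
    Returns complex HRTFs of shape (num_directions, 2, taps // 2 + 1)."""
    return np.fft.rfft(hrirs, axis=-1)


def triangulate_sources(source_positions_xy: np.ndarray) -> np.ndarray:
    """source_positions_xy: (num_sources, 2) horizontal source coordinates.
    Returns triangle vertex indices of shape (num_triangles, 3), i.e., a set
    of triangles with an audio source at each vertex."""
    return Delaunay(source_positions_xy).simplices
```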
The position pre-processing module 804 is configured to receive the representation of the audio source locations T_1 . . . T_NT, the audio source position information p_i, and listener position information p_L(j) (e.g., x, y, z coordinates) that indicates a listener location for a frame j of the audio data to be rendered. The position pre-processing module 804 is configured to generate an indication of the location of the listener relative to the audio sources, such as an active triangle T_A(j), of the set of triangles, that includes the listener location; an audio source selection indication m_C(j) (e.g., an index of a chosen source (e.g., a higher order ambisonics (HOA) source) for signal interpolation); and spatial metadata interpolation weights w̃_c(j, k) (e.g., chosen spatial metadata interpolation weights for a subframe k of frame j).
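For illustration, locating the listener within the triangulation and computing interpolation weights could be sketched as follows using standard 2D barycentric coordinates. The function and variable names are placeholders, and barycentric weighting is one reasonable choice rather than the weighting mandated by the disclosure.

```python
# Sketch: find the active triangle containing the listener and compute
# barycentric interpolation weights for its three vertex sources.
import numpy as np
from scipy.spatial import Delaunay


def active_triangle_and_weights(tri: Delaunay, listener_xy: np.ndarray):
    """Return (triangle_index, vertex_indices, weights) for the triangle
    containing the listener, or (None, None, None) if the listener lies
    outside the triangulation."""
    t = tri.find_simplex(listener_xy)
    if t < 0:
        return None, None, None
    # Barycentric coordinates from scipy's precomputed affine transform.
    T = tri.transform[t]                    # shape (3, 2)
    b = T[:2].dot(listener_xy - T[2])       # first two barycentric weights
    weights = np.append(b, 1.0 - b.sum())   # third weight completes the sum to 1
    return int(t), tri.simplices[t], weights
```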
The spatial analysis module 806 receives the audio signals of the audio streams, illustrated as S_ESD(i, j) (e.g., an equivalent spatial domain representation of the signals for each source i and frame j), and also receives the indication of the active triangle T_A(j) that includes the listener location. The spatial analysis module 806 can convert the input audio signals to an HOA format and generate orientation information for the HOA sources (e.g., θ(i, j, k, b) representing an azimuth parameter for HOA source i for sub-frame k of frame j and frequency bin b, and φ(i, j, k, b) representing an elevation parameter) and energy information (e.g., r(i, j, k, b) representing a direct-to-total energy ratio parameter and e(i, j, k, b) representing an energy value). The spatial analysis module 806 also generates a frequency domain representation of the input audio, such as S(i, j, k, b) representing a time-frequency domain signal of HOA source i.
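As a rough illustration of the kind of per-bin parameters described above, the sketch below performs a simplified directional analysis on a first-order ambisonics (FOA) frame using the active intensity vector. The channel ordering, scaling, and ratio estimate are simplifications (up to normalization conventions) chosen for the sketch; this is not the disclosure's exact analysis.

```python
# Sketch of a simplified spatial analysis of a first-order ambisonics frame,
# producing per-bin azimuth, elevation, a direct-to-total energy ratio
# estimate, and an energy value.
import numpy as np


def foa_spatial_analysis(foa_frame: np.ndarray, fft_size: int = 512):
    """foa_frame: (4, samples) ACN-ordered FOA channels (W, Y, Z, X).
    Returns per-bin azimuth, elevation, ratio, and energy arrays."""
    W, Y, Z, X = np.fft.rfft(foa_frame, n=fft_size, axis=-1)
    # Active intensity direction (up to a scale factor).
    ix = np.real(np.conj(W) * X)
    iy = np.real(np.conj(W) * Y)
    iz = np.real(np.conj(W) * Z)
    azimuth = np.arctan2(iy, ix)
    elevation = np.arctan2(iz, np.sqrt(ix**2 + iy**2))
    # Energy and a simplified direct-to-total ratio estimate.
    energy = 0.5 * (np.abs(W)**2 + np.abs(X)**2 + np.abs(Y)**2 + np.abs(Z)**2)
    intensity_norm = np.sqrt(ix**2 + iy**2 + iz**2)
    ratio = np.clip(intensity_norm / (energy + 1e-12), 0.0, 1.0)
    return azimuth, elevation, ratio, energy
```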
The spatial metadata interpolation module 808 performs spatial metadata interpolation based on source orientation information o_i, listener orientation information o_L(j), the HOA source orientation information and energy information from the spatial analysis module 806, and the spatial metadata interpolation weights from the position pre-processing module 804. The spatial metadata interpolation module 808 generates energy and orientation information including ẽ(i, j, b) representing an average (over sub-frames) energy for HOA source i and audio frame j for frequency band b, θ̃(i, j, b) representing an azimuth parameter for HOA source i for frame j and frequency bin b, φ̃(i, j, b) representing an elevation parameter for HOA source i for frame j and frequency bin b, and r̃(i, j, b) representing a direct-to-total energy ratio parameter for HOA source i for frame j and frequency bin b.
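A small Python sketch of weighted metadata interpolation is shown below: energies and ratios are averaged linearly with the interpolation weights, while angles are averaged as unit vectors to avoid wrap-around artifacts. The array shapes and the choice of spherical averaging are assumptions for this sketch, not the module 808's defined behavior.

```python
# Sketch of interpolating spatial metadata across sub-frames using
# interpolation weights.
import numpy as np


def interpolate_metadata(azimuth, elevation, ratio, energy, weights):
    """azimuth, elevation, ratio, energy: (num_subframes, num_bins) arrays.
    weights: (num_subframes,) interpolation weights summing to 1.
    Returns per-bin interpolated azimuth, elevation, ratio, and energy."""
    w = np.asarray(weights)[:, None]
    # Convert angles to unit vectors, average, then convert back.
    x = np.cos(elevation) * np.cos(azimuth)
    y = np.cos(elevation) * np.sin(azimuth)
    z = np.sin(elevation)
    mx, my, mz = (w * x).sum(0), (w * y).sum(0), (w * z).sum(0)
    azi = np.arctan2(my, mx)
    ele = np.arctan2(mz, np.sqrt(mx**2 + my**2))
    return azi, ele, (w * ratio).sum(0), (w * energy).sum(0)
```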
The signal interpolation module 810 receives energy information (e.g., ẽ(i, j, b)) from the spatial metadata interpolation module 808, energy information (e.g., e(i, j, k, b)) and a frequency domain representation of the input audio (e.g., S(i, j, k, b)) from the spatial analysis module 806, and the audio source selection indication m_C(j) from the position pre-processing module 804. The signal interpolation module 810 generates an interpolated audio signal Ŝ(j, k, b). Completion of the rendering operation 820 results in a rendered asset (e.g., the rendered asset 126 of any of
The mixer and binauralizer 814 receives the source orientation information o_i, the listener orientation information o_L(j), the HRTFs, and the interpolated audio signal Ŝ(j, k, b) and interpolated orientation and energy parameters from the signal interpolation module 810 and the spatial metadata interpolation module 808, respectively. When the asset is a pre-rendered asset 824, the mixer and binauralizer 814 receives the source orientation information o_i, the HRTFs, and the interpolated audio signal Ŝ(j, k, b) and interpolated orientation and energy parameters as part of the pre-rendered asset 824. Optionally, if the listener pose associated with a pre-rendered asset 824 is specified in advance, the pre-rendered asset 824 also includes the listener orientation information o_L(j). Alternatively, if the listener pose associated with a pre-rendered asset 824 is not specified in advance, the mixer and binauralizer 814 receives the listener orientation information o_L(j) based on the listener pose.
The mixer and binauralizer 814 is configured to apply one or more rotation operations based on an orientation of each interpolated signal and the listener's orientation; to binauralize the signals using the HRTFs; if multiple interpolated signals are received, to combine the signals (e.g., after binauralization); to perform one or more other operations; or any combination thereof, to generate the output audio signal 180.
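The sketch below illustrates the general shape of such mixing and binauralization: each source direction is rotated into the listener's frame, an HRTF pair is applied in the frequency domain, and the filtered signals are summed. Azimuth-only rotation and nearest-neighbour HRTF selection are simplifications made for this sketch; the disclosure does not prescribe them.

```python
# Sketch of rotation, binauralization with HRTFs, and mixing.
import numpy as np


def binauralize(signals, source_azimuths, listener_yaw, hrtf_azimuths, hrtfs):
    """signals: (num_sources, bins) frequency-domain source signals.
    source_azimuths: (num_sources,) azimuths in radians (world frame).
    hrtf_azimuths: (num_hrtfs,) measurement azimuths; hrtfs: (num_hrtfs, 2, bins).
    Returns a (2, bins) binaural frequency-domain output."""
    out = np.zeros((2, signals.shape[-1]), dtype=complex)
    for sig, azi in zip(signals, source_azimuths):
        # Rotate the source direction into the listener's frame.
        rel = (azi - listener_yaw + np.pi) % (2 * np.pi) - np.pi
        # Nearest measured HRTF direction (simplification).
        idx = np.argmin(np.abs((hrtf_azimuths - rel + np.pi) % (2 * np.pi) - np.pi))
        out += hrtfs[idx] * sig   # apply left/right HRTFs, then mix
    return out
```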
The integrated circuit 902 also includes a signal input 904, such as bus interfaces and/or the modem 420, to enable the processor(s) 920 to receive input data 906, such as a target asset (e.g., a local asset 142 or a remote asset 144), the pose data 110, the historical interaction data 458, the context cue(s) 454, contextual movement estimate data 462, pose update parameter(s) 456, the manifest of assets 134, the audio assets 132, or combinations thereof. The integrated circuit 902 also includes a signal output 912, such as one or more bus interfaces and/or the modem 420, to enable the processor(s) 920 to provide output data 914 to one or more other devices. For example, the output data 914 can include the output audio signal 180, the pose update parameter(s) 456, the asset retrieval request 138, the asset request 136, or combinations thereof.
The integrated circuit 902 enables implementation of immersive audio processing as a component in one of a variety of devices, such as a speaker array as depicted in
The soundbar device 1002 includes or is coupled to the pose sensors 108 (e.g., cameras, structured light sensors, ultrasound, lidar, etc.) to enable detection of a pose of the listener 1020 and generation of head-tracker data of the listener 1020. For example, the soundbar device 1002 may detect a pose of the listener 1020 at a first location 1022 (e.g., at a first angle from a reference 1024), adjust the sound field based on the pose of the listener 1020, and perform a beam steering operation to cause emitted sound 1004 to be perceived by the listener 1020 as a pose-adjusted binaural signal. In an example, the beam steering operation is based on the first location 1022 and a first orientation of the listener 1020 (e.g., facing the soundbar device 1002). In response to a change in the pose of the listener 1020, such as movement of the listener 1020 to a second location 1032, the soundbar device 1002 adjusts the sound field (e.g., according to a 3DOF/3DOF+ or a 6DOF operation) and performs a beam steering operation to cause the resulting emitted sound 1004 to be perceived by the listener 1020 as a pose-adjusted binaural signal at the second location 1032.
In some implementations, the headset device 1202 is configured to perform operations described with reference to the media output device(s) 102 of any of
In the example illustrated in
The immersive audio components 922 and optionally one or more pose sensors 108 are integrated in at least one of the earbuds 1306 (e.g., in the first earbud 1302, the second earbud 1304, or both). In a particular example, the immersive audio components 922 include the immersive audio renderer 122, the pose outlier detector and mitigator 150, and optionally other components described with reference to
In some implementations, the earbuds 1306 are configured to perform operations described with reference to the media output device(s) 102 of
In some implementations, the glasses 1402 are configured to perform operations described with reference to the media output device(s) 102 of
In some implementations, the headset 1502 is configured to perform operations described with reference to the media output device(s) 102 of
In some implementations, the vehicle 1602 is configured to perform operations described with reference to the media output device(s) 102 of
In some implementations, the vehicle 1702 is configured to perform operations described with reference to the media output device(s) 102 of
In some implementations, one or both of the vehicles of
Referring to
The method 1800 includes, at block 1802, obtaining, at one or more processors, pose data for a listener in an immersive audio environment. The pose data may be received from one or more pose sensors, such as the pose sensor(s) 108. In some implementations, the pose data includes first pose data associated with a head of the listener and second pose data associated with at least one of a torso of the listener or a hand of the listener.
The method 1800 includes, at block 1804, determining, at the one or more processors, a current listener pose based on the pose data and one or more pose constraints. For example, the pose outlier detector and mitigator 150 determines the current pose 154 based on the pose data 110 and the pose constraints 158.
The method 1800 includes, at block 1806, obtaining, at the one or more processors and based on the current listener pose, a rendered asset associated with the immersive audio environment. For example, the immersive audio renderer 122 can perform rendering operations to generate a rendered asset based on a local asset 142 or a remote asset 144. To illustrate, the rendering operations can include one or more of the rendering operations 820 described with reference to
The method 1800 also includes, at block 1808, generating, at the one or more processors, an output audio signal based on the rendered asset. For example, the immersive audio renderer 122 can generate the output audio signal 180 based on a rendered asset (e.g., the rendered asset(s) 126). To illustrate, generating the output audio signal 180 can include performing binauralization operations (e.g., by the binauralizer 128), such as one or more of the mixing and binauralization operations 822 described with reference to
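Expressed as a minimal Python skeleton, the flow of blocks 1802 through 1808 could be organized as shown below. The injected callables stand in for the sensor, constraint, rendering, and binauralization components described above; they are placeholders rather than APIs defined by the disclosure.

```python
# High-level skeleton of the per-frame flow at blocks 1802-1808.
def process_frame(read_pose_data, apply_constraints, obtain_rendered_asset,
                  binauralize_asset):
    pose_data = read_pose_data()                           # block 1802
    current_pose = apply_constraints(pose_data)            # block 1804
    rendered_asset = obtain_rendered_asset(current_pose)   # block 1806
    return binauralize_asset(rendered_asset, current_pose) # block 1808
```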
According to some aspects, the one or more pose constraints include a human body movement constraint. For example, the human body movement constraint can correspond to a velocity constraint, such as the velocity constraint 216 of
The method 1800 optionally includes obtaining a pose based on the pose data and determining whether the pose violates at least one of the one or more pose constraints, such as described with reference to the pose outlier detection operations 200 of
In some implementations, the method 1800 includes, based on a determination that the pose violates at least one of the one or more pose constraints, determining the current listener pose based on an adjustment of the pose to satisfy the one or more pose constraints, such as described with reference to the operation 372 of
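The following Python sketch illustrates a velocity-style constraint check with two mitigation options: reuse the prior valid pose, or adjust the pose so that its value matches the constraint threshold. The speed limit and the position-only representation are hypothetical values chosen for the sketch.

```python
# Sketch of a velocity constraint check and mitigation for pose positions.
import numpy as np

MAX_SPEED_M_PER_S = 3.0  # hypothetical human-movement velocity constraint


def constrain_position(prev_pos: np.ndarray, new_pos: np.ndarray,
                       dt: float, clamp: bool = True) -> np.ndarray:
    """Return a position that satisfies the velocity constraint."""
    step = new_pos - prev_pos
    dist = float(np.linalg.norm(step))
    if dt <= 0 or dist <= MAX_SPEED_M_PER_S * dt:
        return new_pos                      # constraint satisfied
    if not clamp:
        return prev_pos                     # reuse the prior valid pose
    # Adjust the pose so its value matches the constraint threshold.
    return prev_pos + step * (MAX_SPEED_M_PER_S * dt / dist)
```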
In some implementations, the pose data includes first data indicating a translational position of a listener in the immersive audio environment and second data indicating a rotational orientation of the listener in the immersive audio environment. In such implementations, the first data and the second data can be received from the same device, from different devices, or combinations thereof. For example, the first data can be received from a first device and the second data can be received from a second device distinct from the first device. As another example, first translation data can be obtained from a first device, second translation data can be obtained from a second device distinct from the first device, and the first data indicating the translational position of the listener in the immersive audio environment can be determined based on the first translation data and the second translation data.
In some implementations, the pose data obtained at block 1802 is associated with a first time, and the method 1800 includes determining a predicted listener pose associated with a second time subsequent to the first time. In such implementations, the rendered asset obtained at block 1806 can include at least one rendered asset associated with the predicted listener pose. In some implementations, more than one predicted listener pose can be determined for a particular time, and each predicted listener pose can be used to obtain a rendered asset. For example, the pose data obtained at block 1802 is associated with a first time, and the method 1800 can include determining two or more predicted listener poses associated with a second time subsequent to the first time, obtaining a first rendered asset associated with a first predicted listener pose, and obtaining a second rendered asset associated with a second predicted listener pose. In this example, the method 1800 can also include selectively generating the output audio signal based on either the first rendered asset or the second rendered asset. To illustrate, selectively generating the output audio signal based on either the first rendered asset or the second rendered asset can include obtaining a first target asset associated with the first predicted listener pose, rendering the first target asset to generate the first rendered asset, obtaining a second target asset associated with the second predicted listener pose, rendering the second target asset to generate the second rendered asset, obtaining pose data associated with the second time, and selecting, based on the pose data associated with the second time, the first rendered asset or the second rendered asset for further processing.
The method 1800 of
Referring to
In a particular implementation, the device 1900 includes a processor 1906 (e.g., a central processing unit (CPU)). The device 1900 may include one or more additional processors 1910 (e.g., one or more DSPs). In a particular aspect, the processor(s) 410 of any of
The device 1900 may include a memory 1986 and a CODEC 1934. The memory 1986 may include instructions 1956 that are executable by the one or more additional processors 1910 (or the processor 1906) to implement the functionality described with reference to any of
The device 1900 may include the display(s) 106 coupled to a display controller 1926. The speaker(s) 104 and a microphone 1994 may be coupled to the CODEC 1934. The CODEC 1934 may include a digital-to-analog converter (DAC) 1902, an analog-to-digital converter (ADC) 1904, or both. In a particular implementation, the CODEC 1934 may receive analog signals from the microphone 1994, convert the analog signals to digital signals using the analog-to-digital converter 1904, and provide the digital signals to the speech and music codec 1908. The speech and music codec 1908 may process the digital signals, and the digital signals or other digital signals (e.g., one or more assets associated with an immersive audio environment) may further be processed by the immersive audio components 922. In a particular implementation, the speech and music codec 1908 may provide digital signals to the CODEC 1934. The CODEC 1934 may convert the digital signals to analog signals using the digital-to-analog converter 1902 and may provide the analog signals to the speaker(s) 104.
In a particular implementation, the device 1900 may be included in a system-in-package or system-on-chip device 1922. In a particular implementation, the memory 1986, the processor 1906, the processors 1910, the display controller 1926, the CODEC 1934, and the modem 420 are included in the system-in-package or system-on-chip device 1922. In a particular implementation, the pose sensor(s) 108, an input device 1930, and a power supply 1944 are coupled to the system-in-package or the system-on-chip device 1922. Moreover, in a particular implementation, as illustrated in
The device 1900 may include a smart speaker, a speaker bar, a mobile communication device, a smart phone, a cellular phone, a laptop computer, a computer, a tablet, a personal digital assistant, a display device, a television, a gaming console, a music player, a radio, a digital video player, a digital video disc (DVD) player, a tuner, a camera, a navigation device, a vehicle, a headset, an augmented reality headset, a mixed reality headset, a virtual reality headset, an aerial vehicle, a home automation system, a voice-activated device, a wireless speaker and voice activated device, a portable electronic device, a car, a computing device, a communication device, an internet-of-things (IoT) device, a virtual reality (VR) device, a base station, a mobile device, or any combination thereof.
In conjunction with the described implementations, an apparatus includes means for obtaining pose data for a listener in an immersive audio environment. For example, the means for obtaining pose data can correspond to the pose sensor(s) 108, the pose outlier detector and mitigator 150, the movement estimator 460, the immersive audio renderer 122, the audio asset selector 124, the client 120, the immersive audio player 402, the processor(s) 410, the modem 420, the pose predictor 450, the binauralizer 128, the media output device(s) 102, the processor 1906, the one or more processor(s) 1910, one or more other circuits or components configured to obtain pose data, or any combination thereof.
The apparatus includes means for determining a current listener pose based on the pose data and one or more pose constraints. For example, the means for determining the current listener pose can correspond to the pose outlier detector and mitigator 150, the immersive audio renderer 122, the audio asset selector 124, the asset location selector 130, the client 120, the immersive audio player 402, the processor(s) 410, the modem 420, the movement estimator 460, the media output device(s) 102, the processor 1906, the one or more processor(s) 1910, one or more other circuits or components configured to determine the current listener pose, or any combination thereof.
The apparatus includes means for obtaining, based on the current listener pose, a rendered asset associated with the immersive audio environment. For example, the means for obtaining the rendered asset can correspond to the immersive audio renderer 122, the audio asset selector 124, the asset location selector 130, the client 120, the decoder 121, the immersive audio player 402, the processor(s) 410, the modem 420, the binauralizer 128, the media output device(s) 102, the processor 1906, the one or more processor(s) 1910, one or more other circuits or components configured to obtain rendered assets, or any combination thereof.
The apparatus includes means for generating an output audio signal based on the rendered asset. For example, the means for generating an output audio signal can correspond to the immersive audio renderer 122, the immersive audio player 402, the processor(s) 410, the binauralizer 128, the media output device(s) 102, the processor 1906, the one or more processor(s) 1910, one or more other circuits or components configured to generate an output audio signal, or any combination thereof.
In some implementations, a non-transitory computer-readable medium (e.g., a computer-readable storage device, such as the memory 1986 or the local memory 170) includes instructions (e.g., the instructions 1956 or the instructions 174) that, when executed by one or more processors (e.g., the one or more processors 410, the one or more processors 1910 or the processor 1906), cause the one or more processors to obtain pose data for a listener in an immersive audio environment; determine a current listener pose based on the pose data and one or more pose constraints; obtain, based on the current listener pose, a rendered asset associated with the immersive audio environment; and generate an output audio signal based on the rendered asset.
Particular aspects of the disclosure are described below in sets of interrelated Examples:
According to Example 1, a device includes: a memory configured to store audio data associated with an immersive audio environment; and one or more processors configured to: obtain pose data for a listener in the immersive audio environment; determine a current listener pose based on the pose data and one or more pose constraints; obtain, based on the current listener pose, a rendered asset associated with the immersive audio environment; and generate an output audio signal based on the rendered asset.
Example 2 includes the device of Example 1, wherein the one or more pose constraints include a human body movement constraint.
Example 3 includes the device of Example 2, wherein the human body movement constraint corresponds to a velocity constraint.
Example 4 includes the device of Example 2 or Example 3, wherein the human body movement constraint corresponds to an acceleration constraint.
Example 5 includes the device of any of Examples 2 to 4, wherein the human body movement constraint corresponds to a constraint on a hand or torso pose of the listener relative to a head pose of the listener.
Example 6 includes the device of any of Examples 1 to 5, wherein the one or more pose constraints include a boundary constraint that indicates a boundary associated with the immersive audio environment, and wherein the one or more processors are configured to determine the current listener pose such that the current listener pose is limited by the boundary.
Example 7 includes the device of any of Examples 1 to 6, wherein the one or more processors are configured to: obtain a pose based on the pose data; and determine whether the pose violates at least one of the one or more pose constraints.
Example 8 includes the device of Example 7, wherein the one or more processors are configured to, based on a determination that the pose does not violate the one or more pose constraints, use the pose as the current listener pose.
Example 9 includes the device of Example 7 or Example 8, wherein the one or more processors are configured to, based on a determination that the pose violates at least one of the one or more pose constraints, determine the current listener pose based on a prior listener pose that did not violate the one or more pose constraints.
Example 10 includes the device of Example 9, wherein the one or more processors are configured to, based on the determination that the pose violates at least one of the one or more pose constraints, determine a predicted listener pose based on a prior predicted listener pose associated with the prior listener pose.
Example 11 includes the device of Example 7, wherein the one or more processors are configured to, based on a determination that the pose violates at least one of the one or more pose constraints, determine the current listener pose based on an adjustment of the pose to satisfy the one or more pose constraints.
Example 12 includes the device of Example 11, wherein the one or more processors are configured to adjust a value of the pose to match a threshold associated with a violated pose constraint.
Example 13 includes the device of Example 12, wherein the one or more processors are configured to, based on the determination that the pose violates at least one of the one or more pose constraints, determine a predicted listener pose based on a prior predicted listener pose associated with the prior listener pose.
Example 14 includes the device of Example 13, wherein the one or more processors are configured to, based on a determination that the prior predicted listener pose violates at least one of the one or more pose constraints, determine the predicted listener pose based on an adjustment of the prior predicted listener pose to match a threshold associated with a violated pose constraint.
Example 15 includes the device of any of Examples 1 to 14, wherein the pose data is received from one or more pose sensors.
Example 16 includes the device of any of Examples 1 to 15, wherein the pose data includes first pose data associated with a head of a listener and second pose data associated with at least one of a torso of the listener or a hand of the listener.
Example 17 includes the device of Example 16, wherein the first pose data includes head translation data, head rotation data, or both, and wherein the second pose data includes body translation data, body rotation data, or both.
Example 18 includes the device of Example 16 or Example 17, wherein the first pose data is obtained from a first device and wherein the second pose data is received from a second device that is distinct from the first device.
Example 19 includes the device of any of Examples 1 to 18, wherein, to obtain the rendered asset, the one or more processors are configured to: determine a target asset based on the pose data; and generate an asset retrieval request to retrieve the target asset from a storage location.
Example 20 includes the device of Example 19, wherein the memory includes the storage location.
Example 21 includes the device of Example 19, wherein the storage location is at a remote device.
Example 22 includes the device of any of Examples 19 to 21, wherein the target asset is a pre-rendered asset and wherein, to generate the output audio signal, the one or more processors are configured to apply head related transfer functions to the target asset to generate a binaural output signal.
Example 23 includes the device of any of Examples 19 to 21, wherein, to obtain the rendered asset, the one or more processors are configured to render the target asset based on the current listener pose, and wherein, to generate the output audio signal, the one or more processors are configured to apply head related transfer functions to the rendered asset to generate a binaural output signal.
Example 24 includes the device of any of Examples 1 to 23, and further includes a pose sensor coupled to the one or more processors, and wherein the pose sensor is configured to provide at least a portion of the pose data.
Example 25 includes the device of Example 24, wherein the pose sensor and the one or more processors are integrated within a head-mounted wearable device.
Example 26 includes the device of any of Examples 1 to 25, wherein the one or more processors are integrated within an immersive audio player device.
Example 27 includes the device of any of Examples 1 to 26, and further includes a modem coupled to the one or more processors and configured to receive the pose data from a device that includes a pose sensor.
According to Example 28, a method includes: obtaining, at one or more processors, pose data for a listener in an immersive audio environment; determining, at the one or more processors, a current listener pose based on the pose data and one or more pose constraints; obtaining, at the one or more processors and based on the current listener pose, a rendered asset associated with the immersive audio environment; and generating, at the one or more processors, an output audio signal based on the rendered asset.
Example 29 includes the method of Example 28, wherein the one or more pose constraints include a human body movement constraint.
Example 30 includes the method of Example 29, wherein the human body movement constraint corresponds to a velocity constraint.
Example 31 includes the method of Example 29 or Example 30, wherein the human body movement constraint corresponds to an acceleration constraint.
Example 32 includes the method of any of Examples 29 to 31, wherein the human body movement constraint corresponds to a constraint on a hand or torso pose of the listener relative to a head pose of the listener.
Example 33 includes the method of any of Examples 28 to 32, wherein the one or more pose constraints include a boundary constraint that indicates a boundary associated with the immersive audio environment, and wherein the current listener pose is determined such that the current listener pose is limited by the boundary.
Example 34 includes the method of any of Examples 28 to 33, and further includes: obtaining a pose based on the pose data; and determining whether the pose violates at least one of the one or more pose constraints.
Example 35 includes the method of Example 34, and further includes, based on a determination that the pose does not violate the one or more pose constraints, using the pose as the current listener pose.
Example 36 includes the method of Example 34, and further includes, based on a determination that the pose violates at least one of the one or more pose constraints, determining the current listener pose based on a prior listener pose that did not violate the one or more pose constraints.
Example 37 includes the method of Example 36, and further includes, based on the determination that the pose violates at least one of the one or more pose constraints, determining a predicted listener pose based on a prior predicted listener pose associated with the prior listener pose.
Example 38 includes the method of Example 34, and further includes, based on a determination that the pose violates at least one of the one or more pose constraints, determining the current listener pose based on an adjustment of the pose to satisfy the one or more pose constraints.
Example 39 includes the method of Example 38, wherein determining the current listener pose includes adjusting a value of the pose to match a threshold associated with a violated pose constraint.
Example 40 includes the method of Example 38 or Example 39, and further includes, based on the determination that the pose violates at least one of the one or more pose constraints, determining a predicted listener pose based on a prior predicted listener pose associated with a prior listener pose.
Example 41 includes the method of Example 40, wherein, based on a determination that the prior predicted listener pose violates at least one of the one or more pose constraints, determining the predicted listener pose includes adjusting the prior predicted listener pose to match a threshold associated with a violated pose constraint.
Example 42 includes the method of any of Examples 28 to 41, wherein the pose data is received from one or more pose sensors.
Example 43 includes the method of any of Examples 28 to 42, wherein the pose data includes first pose data associated with a head of the listener and second pose data associated with at least one of a torso of the listener or a hand of the listener.
Example 44 includes the method of Example 43, wherein the first pose data includes head translation data, head rotation data, or both, and wherein the second pose data includes body translation data, body rotation data, or both.
Example 45 includes the method of Example 43 or Example 44, wherein the first pose data is obtained from a first device and wherein the second pose data is received from a second device that is distinct from the first device.
Example 46 includes the method of any of Examples 28 to 41, wherein obtaining the rendered asset includes: determining a target asset based on the pose data; and generating an asset retrieval request to retrieve the target asset from a storage location.
Example 47 includes the method of Example 46, wherein the storage location is at a local memory.
Example 48 includes the method of Example 46, wherein the storage location is at a remote device.
Example 49 includes the method of any of Examples 46 to 48, wherein the target asset is a pre-rendered asset and wherein generating the output audio signal includes applying head related transfer functions to the target asset to generate a binaural output signal.
Example 50 includes the method of any of Examples 46 to 48, wherein obtaining the rendered asset further includes rendering the target asset based on the current listener pose, and wherein generating the output audio signal includes applying head related transfer functions to the rendered asset to generate a binaural output signal.
According to Example 51, a non-transitory computer-readable device stores instructions that are executable by one or more processors to cause the one or more processors to: obtain pose data for a listener in an immersive audio environment; determine a current listener pose based on the pose data and one or more pose constraints; obtain, based on the current listener pose, a rendered asset associated with the immersive audio environment; and generate an output audio signal based on the rendered asset.
Example 52 includes the non-transitory computer-readable device of Example 51, wherein the one or more pose constraints include a human body movement constraint.
Example 53 includes the non-transitory computer-readable device of Example 52, wherein the human body movement constraint corresponds to a velocity constraint.
Example 54 includes the non-transitory computer-readable device of Example 52 or Example 53, wherein the human body movement constraint corresponds to an acceleration constraint.
Example 55 includes the non-transitory computer-readable device of any of Examples 52 to 54, wherein the human body movement constraint corresponds to a constraint on a hand or torso pose of the listener relative to a head pose of the listener.
Example 56 includes the non-transitory computer-readable device of any of Examples 51 to 55, wherein the one or more pose constraints include a boundary constraint that indicates a boundary associated with the immersive audio environment, and wherein the current listener pose is determined such that the current listener pose is limited by the boundary.
Example 57 includes the non-transitory computer-readable device of any of Examples 51 to 56, wherein the instructions cause the one or more processors to: obtain a pose based on the pose data; and determine whether the pose violates at least one of the one or more pose constraints.
Example 58 includes the non-transitory computer-readable device of Example 57, wherein, based on a determination that the pose does not violate the one or more pose constraints, the instructions cause the one or more processors to use the pose as the current listener pose.
Example 59 includes the non-transitory computer-readable device of Example 57 or Example 58, wherein, based on a determination that the pose violates at least one of the one or more pose constraints, the instructions cause the one or more processors to determine the current listener pose based on a prior listener pose that did not violate the one or more pose constraints.
Example 60 includes the non-transitory computer-readable device of Example 59, wherein, based on the determination that the pose violates at least one of the one or more pose constraints, the instructions cause the one or more processors to determine a predicted listener pose based on a prior predicted listener pose associated with the prior listener pose.
Example 61 includes the non-transitory computer-readable device of Example 57, wherein, based on a determination that the pose violates at least one of the one or more pose constraints, the instructions cause the one or more processors to determine the current listener pose based on an adjustment of the pose to satisfy the one or more pose constraints.
Example 62 includes the non-transitory computer-readable device of Example 61, wherein, to determine the current listener pose, the instructions cause the one or more processors to adjust a value of the pose to match a threshold associated with a violated pose constraint.
Example 63 includes the non-transitory computer-readable device of Example 61 or Example 62, wherein, based on the determination that the pose violates at least one of the one or more pose constraints, the instructions cause the one or more processors to determine a predicted listener pose based on a prior predicted listener pose associated with a prior listener pose.
Example 64 includes the non-transitory computer-readable device of Example 63, wherein, based on a determination that the prior predicted listener pose violates at least one of the one or more pose constraints, the instructions cause the one or more processors to determine the predicted listener pose based on adjusting the prior predicted listener pose to match a threshold associated with a violated pose constraint.
Example 65 includes the non-transitory computer-readable device of any of Examples 51 to 64, wherein the pose data is received from one or more pose sensors.
Example 66 includes the non-transitory computer-readable device of any of Examples 51 to 65, wherein the pose data includes first pose data associated with a head of the listener and second pose data associated with at least one of a torso of the listener or a hand of the listener.
Example 67 includes the non-transitory computer-readable device of Example 66, wherein the first pose data includes head translation data, head rotation data, or both, and wherein the second pose data includes body translation data, body rotation data, or both.
Example 68 includes the non-transitory computer-readable device of Example 66 or Example 67, wherein the first pose data is obtained from a first device and wherein the second pose data is received from a second device that is distinct from the first device.
Example 69 includes the non-transitory computer-readable device of any of Examples 51 to 68, wherein to obtain the rendered asset, the instructions cause the one or more processors to: determine a target asset based on the pose data; and generate an asset retrieval request to retrieve the target asset from a storage location.
Example 70 includes the non-transitory computer-readable device of Example 69, wherein the storage location is at a local memory.
Example 71 includes the non-transitory computer-readable device of Example 69, wherein the storage location is at a remote device.
Example 72 includes the non-transitory computer-readable device of any of Examples 69 to 71, wherein the target asset is a pre-rendered asset and wherein, to generate the output audio signal, the instructions cause the one or more processors to apply head related transfer functions to the target asset to generate a binaural output signal.
Example 73 includes the non-transitory computer-readable device of any of Examples 69 to 71, wherein the instructions cause the one or more processors to: render the target asset based on the current listener pose to generate a rendered asset; and apply head related transfer functions to the rendered asset to generate a binaural output signal.
According to Example 74, an apparatus includes: means for obtaining pose data for a listener in an immersive audio environment; means for determining a current listener pose based on the pose data and one or more pose constraints; means for obtaining, based on the current listener pose, a rendered asset associated with the immersive audio environment; and means for generating an output audio signal based on the rendered asset.
Example 75 includes the apparatus of Example 74, wherein the one or more pose constraints include a human body movement constraint.
Example 76 includes the apparatus of Example 75, wherein the human body movement constraint corresponds to a velocity constraint.
Example 77 includes the apparatus of Example 75 or Example 76, wherein the human body movement constraint corresponds to an acceleration constraint.
Example 78 includes the apparatus of any of Examples 75 to 77, wherein the human body movement constraint corresponds to a constraint on a hand or torso pose of the listener relative to a head pose of the listener.
Example 79 includes the apparatus of any of Examples 74 to 78, wherein the one or more pose constraints include a boundary constraint that indicates a boundary associated with the immersive audio environment, and wherein the current listener pose is determined such that the current listener pose is limited by the boundary.
Example 80 includes the apparatus of any of Examples 74 to 78, and further includes: means for obtaining a pose based on the pose data; and means for determining whether the pose violates at least one of the one or more pose constraints.
Example 81 includes the apparatus of Example 80, and further includes means for using the pose as the current listener pose based on a determination that the pose does not violate the one or more pose constraints.
Example 82 includes the apparatus of Example 80 or Example 81, and further includes means for determining the current listener pose based on a prior listener pose that did not violate the one or more pose constraints.
Example 83 includes the apparatus of Example 82, and further includes means for determining a predicted listener pose based on a prior predicted listener pose associated with the prior listener pose.
Example 84 includes the apparatus of Example 80, and further includes means for determining the current listener pose based on an adjustment of the pose to satisfy the one or more pose constraints.
Example 85 includes the apparatus of Example 84, wherein the current listener pose is based on an adjustment of a value of the pose to match a threshold associated with a violated pose constraint.
Example 86 includes the apparatus of Example 84 or Example 85, and further includes means for determining a predicted listener pose based on a prior predicted listener pose associated with a prior listener pose.
Example 87 includes the apparatus of Example 86, wherein the means for determining the predicted listener pose includes means for adjusting the prior predicted listener pose to match a threshold associated with a violated pose constraint.
Example 88 includes the apparatus of any of Examples 74 to 87, wherein the pose data is received from one or more pose sensors.
Example 89 includes the apparatus of any of Examples 74 to 88, wherein the pose data includes first pose data associated with a head of the listener and second pose data associated with at least one of a torso of the listener or a hand of the listener.
Example 90 includes the apparatus of Example 89, wherein the first pose data includes head translation data, head rotation data, or both, and wherein the second pose data includes body translation data, body rotation data, or both.
Example 91 includes the apparatus of Example 89 or Example 90, wherein the first pose data is obtained from a first device and wherein the second pose data is received from a second device that is distinct from the first device.
Example 92 includes the apparatus of any of Examples 74 to 91, wherein the means for obtaining the rendered asset associated with the immersive audio environment includes: means for determining a target asset based on the pose data; and means for generating an asset retrieval request to retrieve the target asset from a storage location.
Example 93 includes the apparatus of Example 92, wherein the storage location is at a local memory.
Example 94 includes the apparatus of Example 92, wherein the storage location is at a remote device.
Example 95 includes the apparatus of any of Examples 92 to 94, wherein the target asset is a pre-rendered asset and wherein the means for generating the output audio signal includes means for applying head related transfer functions to the target asset to generate a binaural output signal.
Example 96 includes the apparatus of any of Examples 92 to 94, wherein the means for obtaining the rendered asset associated with the immersive audio environment further includes means for rendering the target asset based on the current listener pose, and wherein the means for generating the output audio signal includes means for applying head related transfer functions to the rendered asset to generate a binaural output signal.
Those of skill would further appreciate that the various illustrative logical blocks, configurations, modules, circuits, and algorithm steps described in connection with the implementations disclosed herein may be implemented as electronic hardware, computer software executed by a processor, or combinations of both. Various illustrative components, blocks, configurations, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or processor executable instructions depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application; such implementation decisions are not to be interpreted as causing a departure from the scope of the present disclosure.
The steps of a method or algorithm described in connection with the implementations disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, hard disk, a removable disk, a compact disc read-only memory (CD-ROM), or any other form of non-transient storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor may read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). The ASIC may reside in a computing device or a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a computing device or user terminal.
The previous description of the disclosed aspects is provided to enable a person skilled in the art to make or use the disclosed aspects. Various modifications to these aspects will be readily apparent to those skilled in the art, and the principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope possible consistent with the principles and novel features as defined by the following claims.
The present application claims priority from Provisional Patent Application No. 63/515,648, filed Jul. 26, 2023, entitled “AUDIO PROCESSING,” the content of which is incorporated herein by reference in its entirety.