This disclosure relates in general to systems and methods for capturing a sound field and sound field playback, in particular, using a mixed reality device.
It may be desirable to capture a sound field (e.g., recording a multidimensional audio scene) using an augmented reality (AR), mixed reality (MR), or extended reality (XR) device (e.g., a wearable head device). For example, it may be advantageous to use a wearable head device to record a 3-D audio scene surrounding a user of the device (e.g., to create AR, MR, or XR content without additional (and often more expensive) sound recording equipment, to create AR, MR, or XR content in a first person point of view). However, while recording the audio scene, the recording device may not be fixed. For instance, while recording, the user may move his or her head, thereby moving the recording device. The recording device movement may cause the recorded sound field and playback of the sound field to be disoriented. To ensure proper sound field orientation (e.g., to properly align with an AR, MR, or XR environment), it may be desirable to compensate for these movements in the sound field capture. Similarly, it may also be desirable to compensate for movements of a playback device during sound field playback to fix a sound source while a playback device moves relative to the AR, MR, or XR environment.
In some examples, the sound field or 3-D audio scene may be a part of AR/MR/XR content that supports six degrees of freedom for a user accessing the AR/MR/XR content. An entire sound field or 3-D audio scene that supports six degrees of freedom may result in very large and/or complex files, which would require more computing resources to access. Therefore, it may be desirable to reduce the complexity of such sound field or 3-D audio scene.
Examples of the disclosure describe systems and methods for capturing a sound field, in particular and sound field playback, using a mixed reality device. In some embodiments, the systems and methods compensate for movement of a recording device while capturing a sound field. In some embodiments, the systems and methods compensate for movement of a playback device while playing an audio of a sound field. In some embodiments, the systems and methods reduce a complexity of a captured sound field.
In some embodiments, a method comprises: detecting, with a microphone of a first wearable head device, a sound of an environment; determining a digital audio signal based on the detected sound, the digital audio signal associated with a sphere having a position in the environment; concurrently with detecting the sound, detecting, via a sensor of the first wearable head device, a microphone movement with respect to the environment; adjusting the digital audio signal, wherein the adjusting comprises adjusting the position of the sphere based on based on the detected microphone movement; and presenting the adjusted digital audio signal to a user of a second wearable head device via one or more speakers of the second wearable head device.
In some embodiments, the method further comprises: detecting, with a microphone of a third wearable head device, a second sound of the environment; determining a second digital audio signal based on the second detected sound, the second digital audio signal associated with a second sphere having a second position in the environment; concurrently, with detecting the second sound, detecting, via a sensor of the third wearable head device, a microphone movement with respect to the environment; adjusting the second digital audio signal, wherein the adjusting comprises adjusting the second position of the second sphere based on based on the second detected microphone movement; combining the adjusted digital audio signal and the second adjusted digital audio signal; and presenting the combined first adjusted digital audio signal and second adjusted digital audio signal to the user of the second wearable head device via the one or more speakers of the second wearable head device.
In some embodiments, the first adjusted digital audio signal and the second adjusted digital audio signal are combined at a server.
In some embodiments, the digital audio signal comprises an Ambisonic file.
In some embodiments, detecting the microphone movement with respect to the environment comprises one or more of performing simultaneous localization and mapping and visual inertial odometry.
In some embodiments, the sensor comprises one or more of an inertial measurement unit, a camera, a second microphone, a gyroscope, and a LiDAR sensor.
In some embodiments, adjusting the digital audio signal comprises applying a compensation function to the digital audio signal.
In some embodiments, wherein applying the compensation function comprises applying the compensation function based on an inverse of the microphone movement.
In some embodiments, the method further comprises concurrently, with presenting the adjusted digital audio signal, displaying, on a display of the second wearable head device, content associated with the sound of the environment.
In some embodiments, a method comprises: receiving, at a wearable head device, a digital audio signal, the digital audio signal associated with a sphere having a position in the environment; detecting, via a sensor of the wearable head device, a device movement with respect to the environment; adjusting the digital audio signal, wherein the adjusting comprises adjusting the position of the sphere based on based on the detected device movement; and presenting the adjusted digital audio signal to a user of the wearable head device via one or more speakers of the wearable head device.
In some embodiments, the method further comprises: combining a second digital audio signal and a third digital audio signal; and downmixing the combined second and third digital audio signal, wherein the retrieved first digital audio signal is the combined second and third digital audio signal.
In some embodiments, downmixing the combined second and third digital audio signal comprises applying a first gain to the second digital audio signal and a second gain to the second digital audio signal.
In some embodiments, downmixing the combined second and third digital audio signal comprises reducing an Ambisonic order of the second digital audio signal based on a distance of the wearable head device from a recording location of the second digital audio signal.
In some embodiments, the sensor is an inertial measurement unit, a camera, a second microphone, a gyroscope, or a LiDAR sensor.
In some embodiments, detecting the device movement with respect to the environment comprises performing simultaneous localization and mapping or visual inertial odometry.
In some embodiments, adjusting the digital audio signal comprises applying a compensation function to the digital audio signal.
In some embodiments, wherein applying the compensation function comprises applying the compensation function based on an inverse of the microphone movement. In some embodiments, the digital audio signal is in Ambisonics format.
In some embodiments, the method further comprises concurrently, with presenting the adjusted digital audio signal, displaying, on a display of the wearable head device, content associated with a sound of the digital audio signal in the environment.
In some embodiments, a method comprises: detecting sounds of an environment; extracting a sound object from the detected sounds; and combining the sound object and a residual. The sound object comprises a first portion of the detected sounds, the first portion meeting a sound object criterion, and the residual comprises a second portion of the detected sounds, the second portion not meeting the sound object criterion.
In some embodiments, the further comprises: detecting second sounds of the environment; determining whether a portion of the second detected sounds meets the sound object criterion, wherein: a portion of the second detected sounds meeting the sound object criterion comprises a second sound object, and a portion of the second detected sounds not meeting the sound object criterion comprises a second residual; extracting the second sound object from the second detected sounds; and consolidating the first sound object and the second sound object, wherein combining the sound object and the residual comprises combining the consolidated sound object, the first residual, and the second residual.
In some embodiments, the sound object supports six degrees of freedom in the environment, and the residual supports three degrees of freedom in the environment.
In some embodiments, the sound object has a higher spatial resolution than the residual.
In some embodiments, the residual is stored in a lower order Ambisonic file.
In some embodiments, a method, comprises: detecting, via a sensor of a wearable head device, a movement of the wearable head device with respect to the environment; adjusting a sound object, wherein the sound object is associated with a first sphere having a first position in the environment and the adjusting comprises adjusting the first position of the first sphere based on based on the detected device movement; adjusting a residual, wherein the residual is associated with a second sphere having a second position in the environment and the adjusting comprises adjusting the second position of the second sphere based on based on the detected device movement; mixing the adjusted sound object and the adjusted residual; and presenting the mixed adjusted sound object and the adjusted residual to a user of the wearable head device via one or more speakers of the wearable head device.
In some embodiments, a system comprises: a first wearable head device comprising a microphone and a sensor; a second wearable head device comprising a speaker; and one or more processors configured to execute a method comprising: detecting, with the microphone of the first wearable head device, a sound of an environment; determining a digital audio signal based on the detected sound, the digital audio signal associated with a sphere having a position in the environment; concurrently with detecting the sound, detecting, via the sensor of the first wearable head device, a microphone movement with respect to the environment; adjusting the digital audio signal, wherein the adjusting comprises adjusting the position of the sphere based on based on the detected microphone movement; and presenting the adjusted digital audio signal to a user of the second wearable head device via the speaker of the second wearable head device.
In some embodiments, the system further comprises a third wearable head device comprising a microphone and a sensor, wherein the method further comprises: detecting, with the microphone of the third wearable head device, a second sound of the environment; determining a second digital audio signal based on the second detected sound, the second digital audio signal associated with a second sphere having a second position in the environment; concurrently, with detecting the second sound, detecting, via the sensor of the third wearable head device, a second microphone movement with respect to the environment; adjusting the second digital audio signal, wherein the adjusting comprises adjusting the second position of the second sphere based on based on the second detected microphone movement; combining the adjusted digital audio signal and the second adjusted digital audio signal; and presenting the combined first adjusted digital audio signal and second adjusted digital audio signal to the user of the second wearable head device via the one or more speakers of the second wearable head device.
In some embodiments, the first adjusted digital audio signal and the second adjusted digital audio signal are combined at a server.
In some embodiments, the digital audio signal comprises an Ambisonic file.
In some embodiments, detecting the microphone movement with respect to the environment comprises performing one or more of simultaneous localization and mapping and visual inertial odometry.
In some embodiments, the sensor comprises one or more of an inertial measurement unit, a camera, a second microphone, a gyroscope, and a LiDAR sensor.
In some embodiments, adjusting the digital audio signal comprises applying a compensation function to the digital audio signal.
In some embodiments, wherein applying the compensation function comprises applying the compensation function based on an inverse of the microphone movement.
In some embodiments, the method further comprises concurrently, with presenting the adjusted digital audio signal, displaying, on a display of the second wearable head device, content associated with the sound of the environment.
In some embodiments, a system comprises: a wearable head device comprising a sensor and a speaker; and one or more processors configured to execute a method comprising: receiving, at the wearable head device, a digital audio signal, the digital audio signal associated with a sphere having a position in the environment; detecting, via the sensor of the wearable head device, a device movement with respect to the environment; adjusting the digital audio signal, wherein the adjusting comprises adjusting the position of the sphere based on based on the detected device movement; and presenting the adjusted digital audio signal to a user of the wearable head device via the speaker of the wearable head device.
In some embodiments, the method further comprises: combining a second digital audio signal and a third digital audio signal; and downmixing the combined second and third digital audio signal, wherein the retrieved first digital audio signal is the combined second and third digital audio signal.
In some embodiments, downmixing the combined second and third digital audio signal comprises applying a first gain to the second digital audio signal and a second gain to the second digital audio signal.
In some embodiments, downmixing the combined second and third digital audio signal comprises reducing an Ambisonic order of the second digital audio signal based on a distance of the wearable head device from a recording location of the second digital audio signal.
In some embodiments, the sensor is an inertial measurement unit, a camera, a second microphone, a gyroscope, or a LiDAR sensor.
In some embodiments, detecting the device movement with respect to the environment comprises performing simultaneous localization and mapping or visual inertial odometry.
In some embodiments, adjusting the digital audio signal comprises applying a compensation function to the digital audio signal.
In some embodiments, wherein applying the compensation function comprises applying the compensation function based on an inverse of the microphone movement.
In some embodiments, the digital audio signal is in Ambisonics format.
In some embodiments, the wearable head device further comprises a display, and the method further comprises concurrently, with presenting the adjusted digital audio signal, displaying, on the display of the wearable head device, content associated with a sound of the digital audio signal in the environment.
In some embodiments, a system comprises one or more processors configured to execute a method comprising: detecting sounds of an environment; extracting a sound object from the detected sounds; and combining the sound object and a residual. The sound object comprises a first portion of the detected sounds, the first portion meeting a sound object criterion, and the residual comprises a second portion of the detected sounds, the second portion not meeting the sound object criterion.
In some embodiments, the method further comprises: detecting second sounds of the environment; determining whether a portion of the second detected sounds meets the sound object criterion, wherein: a portion of the second detected sounds meeting the sound object criterion comprises a second sound object, and a portion of the second detected sounds not meeting the sound object criterion comprises a second residual; extracting the second sound object from the second detected sounds; and consolidating the first sound object and the second sound object, wherein combining the sound object and the residual comprises combining the consolidated sound object, the first residual, and the second residual.
In some embodiments, the sound object supports six degrees of freedom in the environment, and the residual supports three degrees of freedom in the environment.
In some embodiments, the sound object has a higher spatial resolution than the residual.
In some embodiments, the residual is stored in a lower order Ambisonic file.
In some embodiments, a system comprises: a wearable head device comprising a sensor and a speaker; and one or more processors configured to execute a method comprising: detecting, via the sensor of the wearable head device, a movement of the wearable head device with respect to the environment; adjusting a sound object, wherein the sound object is associated with a first sphere having a first position in the environment and the adjusting comprises adjusting the first position of the first sphere based on based on the detected device movement; adjusting a residual, wherein the residual is associated with a second sphere having a second position in the environment and the adjusting comprises adjusting the second position of the second sphere based on based on the detected device movement; mixing the adjusted sound object and the adjusted residual; and presenting the mixed adjusted sound object and the adjusted residual to a user of the wearable head device via the speaker of the wearable head device.
In some embodiments, a non-transitory computer-readable medium stores one or more instructions, which, when executed by one or more processors of an electronic device, cause the device to perform a method comprising: detecting, with a microphone of a first wearable head device, a sound of an environment; determining a digital audio signal based on the detected sound, the digital audio signal associated with a sphere having a position in the environment; concurrently with detecting the sound, detecting, via a sensor of the first wearable head device, a microphone movement with respect to the environment; adjusting the digital audio signal, wherein the adjusting comprises adjusting the position of the sphere based on based on the detected microphone movement; and presenting the adjusted digital audio signal to a user of a second wearable head device via one or more speakers of the second wearable head device.
In some embodiments, the method further comprises: detecting, with a microphone of a third wearable head device, a second sound of the environment; determining a second digital audio signal based on the second detected sound, the second digital audio signal associated with a second sphere having a second position in the environment; concurrently, with detecting the second sound, detecting, via a sensor of the third wearable head device, a second microphone movement with respect to the environment; adjusting the second digital audio signal, wherein the adjusting comprises adjusting the second position of the second sphere based on based on the second detected microphone movement; combining the adjusted digital audio signal and the second adjusted digital audio signal; and presenting the combined first adjusted digital audio signal and second adjusted digital audio signal to the user of the second wearable head device via the one or more speakers of the second wearable head device.
In some embodiments, the first adjusted digital audio signal and the second adjusted digital audio signal are combined at a server.
In some embodiments, the digital audio signal comprises an Ambisonic file.
In some embodiments, detecting the microphone movement with respect to the environment comprises performing one or more of simultaneous localization and mapping and visual inertial odometry.
In some embodiments, the sensor comprises one or more of an inertial measurement unit, a camera, a second microphone, a gyroscope, and a LiDAR sensor.
In some embodiments, adjusting the digital audio signal comprises applying a compensation function to the digital audio signal.
In some embodiments, wherein applying the compensation function comprises applying the compensation function based on an inverse of the microphone movement.
In some embodiments, the method further comprises concurrently, with presenting the adjusted digital audio signal, displaying, on a display of the second wearable head device, content associated with the sound of the environment.
In some embodiments, a non-transitory computer-readable medium stores one or more instructions, which, when executed by one or more processors of an electronic device, cause the device to perform a method comprising: receiving, at a wearable head device, a digital audio signal, the digital audio signal associated with a sphere having a position in the environment; detecting, via a sensor of the wearable head device, a device movement with respect to the environment; adjusting the digital audio signal, wherein the adjusting comprises adjusting the position of the sphere based on based on the detected device movement; and presenting the adjusted digital audio signal to a user of the wearable head device via one or more speakers of the wearable head device.
In some embodiments, the method further comprises: combining a second digital audio signal and a third digital audio signal; and downmixing the combined second and third digital audio signal, wherein the retrieved first digital audio signal is the combined second and third digital audio signal.
In some embodiments, downmixing the combined second and third digital audio signal comprises applying a first gain to the second digital audio signal and a second gain to the second digital audio signal.
In some embodiments, downmixing the combined second and third digital audio signal comprises reducing an Ambisonic order of the second digital audio signal based on a distance of the wearable head device from a recording location of the second digital audio signal.
In some embodiments, the sensor is an inertial measurement unit, a camera, a second microphone, a gyroscope, or a LiDAR sensor.
In some embodiments, detecting the device movement with respect to the environment comprises performing simultaneous localization and mapping or visual inertial odometry.
In some embodiments, adjusting the digital audio signal comprises applying a compensation function to the digital audio signal.
In some embodiments, wherein applying the compensation function comprises applying the compensation function based on an inverse of the microphone movement.
In some embodiments, the digital audio signal is in Ambisonics format.
In some embodiments, the method further comprises concurrently, with presenting the adjusted digital audio signal, displaying, on a display of the wearable head device, content associated with a sound of the digital audio signal in the environment.
In some embodiments, a non-transitory computer-readable medium stores one or more instructions, which, when executed by one or more processors of an electronic device, cause the device to perform a method comprising: detecting sounds of an environment; extracting a sound object from the detected sounds; and combining the sound objects and a residual. The sound object comprises a first portion of the detected sounds, the first portion meeting a sound object criterion, and the residual comprises a second portion of the detected sounds, the second portion not meeting the sound object criterion.
In some embodiments, the method further comprises: detecting second sounds of the environment; determining whether a portion of the second detected sounds meets the sound object criterion, wherein: a portion of the second detected sounds meeting the sound object criterion comprises a second sound object, and a portion of the second detected sounds not meeting the sound object criterion comprises a second residual; extracting the second sound object from the second detected sounds; and consolidating the first sound object and the second sound object, wherein combining the sound object and the residual comprises combining the consolidated sound object, the first residual, and the second residual.
In some embodiments, the sound object supports six degrees of freedom in the environment, and the residual supports three degrees of freedom in the environment.
In some embodiments, the sound object has a higher spatial resolution than the residual.
In some embodiments, the residual is stored in a lower order Ambisonic file.
In some embodiments, a non-transitory computer-readable medium stores one or more instructions, which, when executed by one or more processors of an electronic device, cause the device to perform a method comprising: detecting, via a sensor of a wearable head device, a device movement with respect to the environment; adjusting a sound object, wherein the sound object is associated with a first sphere having a first position in the environment and the adjusting comprises adjusting the first position of the first sphere based on based on the detected device movement; adjusting a residual, wherein the residual is associated with a second sphere having a second position in the environment and the adjusting comprises adjusting the second position of the second sphere based on based on the detected device movement; mixing the adjusted sound object and the adjusted residual; and presenting the mixed adjusted sound object and adjusted residual to a user of the wearable head device via one or more speakers of the wearable head device.
In the following description of examples, reference is made to the accompanying drawings which form a part hereof, and in which it is shown by way of illustration specific examples that can be practiced. It is to be understood that other examples can be used and structural changes can be made without departing from the scope of the disclosed examples.
Like all people, a user of a MR system exists in a real environment—that is, a three-dimensional portion of the “real world,” and all of its contents, that are perceptible by the user. For example, a user perceives a real environment using one's ordinary human senses—sight, sound, touch, taste, smell—and interacts with the real environment by moving one's own body in the real environment. Locations in a real environment can be described as coordinates in a coordinate space; for example, a coordinate can comprise latitude, longitude, and elevation with respect to sea level; distances in three orthogonal dimensions from a reference point; or other suitable values. Likewise, a vector can describe a quantity having a direction and a magnitude in the coordinate space.
A computing device can maintain, for example in a memory associated with the device, a representation of a virtual environment. As used herein, a virtual environment is a computational representation of a three-dimensional space. A virtual environment can include representations of any object, action, signal, parameter, coordinate, vector, or other characteristic associated with that space. In some examples, circuitry (e.g., a processor) of a computing device can maintain and update a state of a virtual environment; that is, a processor can determine at a first time t0, based on data associated with the virtual environment and/or input provided by a user, a state of the virtual environment at a second time t1. For instance, if an object in the virtual environment is located at a first coordinate at time t0, and has certain programmed physical parameters (e.g., mass, coefficient of friction); and an input received from user indicates that a force should be applied to the object in a direction vector; the processor can apply laws of kinematics to determine a location of the object at time t1 using basic mechanics. The processor can use any suitable information known about the virtual environment, and/or any suitable input, to determine a state of the virtual environment at a time t1. In maintaining and updating a state of a virtual environment, the processor can execute any suitable software, including software relating to the creation and deletion of virtual objects in the virtual environment; software (e.g., scripts) for defining behavior of virtual objects or characters in the virtual environment; software for defining the behavior of signals (e.g., audio signals) in the virtual environment; software for creating and updating parameters associated with the virtual environment; software for generating audio signals in the virtual environment; software for handling input and output; software for implementing network operations; software for applying asset data (e.g., animation data to move a virtual object over time); or many other possibilities.
Output devices, such as a display or a speaker, can present any or all aspects of a virtual environment to a user. For example, a virtual environment may include virtual objects (which may include representations of inanimate objects; people; animals; lights; etc.) that may be presented to a user. A processor can determine a view of the virtual environment (for example, corresponding to a “camera” with an origin coordinate, a view axis, and a frustum); and render, to a display, a viewable scene of the virtual environment corresponding to that view. Any suitable rendering technology may be used for this purpose. In some examples, the viewable scene may include some virtual objects in the virtual environment, and exclude certain other virtual objects. Similarly, a virtual environment may include audio aspects that may be presented to a user as one or more audio signals. For instance, a virtual object in the virtual environment may generate a sound originating from a location coordinate of the object (e.g., a virtual character may speak or cause a sound effect); or the virtual environment may be associated with musical cues or ambient sounds that may or may not be associated with a particular location. A processor can determine an audio signal corresponding to a “listener” coordinate—for instance, an audio signal corresponding to a composite of sounds in the virtual environment, and mixed and processed to simulate an audio signal that would be heard by a listener at the listener coordinate (e.g., using the methods and systems described herein)—and present the audio signal to a user via one or more speakers.
Because a virtual environment exists as a computational structure, a user may not directly perceive a virtual environment using one's ordinary senses. Instead, a user can perceive a virtual environment indirectly, as presented to the user, for example by a display, speakers, haptic output devices, etc. Similarly, a user may not directly touch, manipulate, or otherwise interact with a virtual environment; but can provide input data, via input devices or sensors, to a processor that can use the device or sensor data to update the virtual environment. For example, a camera sensor can provide optical data indicating that a user is trying to move an object in a virtual environment, and a processor can use that data to cause the object to respond accordingly in the virtual environment.
A MR system can present to the user, for example using a transmissive display and/or one or more speakers (which may, for example, be incorporated into a wearable head device), a MR environment (“MRE”) that combines aspects of a real environment and a virtual environment. In some embodiments, the one or more speakers may be external to the wearable head device. As used herein, a MRE is a simultaneous representation of a real environment and a corresponding virtual environment. In some examples, the corresponding real and virtual environments share a single coordinate space; in some examples, a real coordinate space and a corresponding virtual coordinate space are related to each other by a transformation matrix (or other suitable representation). Accordingly, a single coordinate (along with, in some examples, a transformation matrix) can define a first location in the real environment, and also a second, corresponding, location in the virtual environment; and vice versa.
In a MRE, a virtual object (e.g., in a virtual environment associated with the MRE) can correspond to a real object (e.g., in a real environment associated with the MRE). For instance, if the real environment of a MRE comprises a real lamp post (a real object) at a location coordinate, the virtual environment of the MRE may comprise a virtual lamp post (a virtual object) at a corresponding location coordinate. As used herein, the real object in combination with its corresponding virtual object together constitute a “mixed reality object.” It is not necessary for a virtual object to perfectly match or align with a corresponding real object. In some examples, a virtual object can be a simplified version of a corresponding real object. For instance, if a real environment includes a real lamp post, a corresponding virtual object may comprise a cylinder of roughly the same height and radius as the real lamp post (reflecting that lamp posts may be roughly cylindrical in shape). Simplifying virtual objects in this manner can allow computational efficiencies, and can simplify calculations to be performed on such virtual objects. Further, in some examples of a MRE, not all real objects in a real environment may be associated with a corresponding virtual object. Likewise, in some examples of a MRE, not all virtual objects in a virtual environment may be associated with a corresponding real object. That is, some virtual objects may solely in a virtual environment of a MRE, without any real-world counterpart.
In some examples, virtual objects may have characteristics that differ, sometimes drastically, from those of corresponding real objects. For instance, while a real environment in a MRE may comprise a green, two-armed cactus—a prickly inanimate object—a corresponding virtual object in the MRE may have the characteristics of a green, two-armed virtual character with human facial features and a surly demeanor. In this example, the virtual object resembles its corresponding real object in certain characteristics (color, number of arms); but differs from the real object in other characteristics (facial features, personality). In this way, virtual objects have the potential to represent real objects in a creative, abstract, exaggerated, or fanciful manner; or to impart behaviors (e.g., human personalities) to otherwise inanimate real objects. In some examples, virtual objects may be purely fanciful creations with no real-world counterpart (e.g., a virtual monster in a virtual environment, perhaps at a location corresponding to an empty space in a real environment).
In some examples, virtual objects may have characteristics that resemble corresponding real objects. For instance, a virtual character may be presented in a virtual or mixed reality environment as a life-like figure to provide a user an immersive mixed reality experience. With virtual characters having life-like characteristics, the user may feel like he or she is interacting with a real person. In such instances, it is desirable for actions such as muscle movements and gaze of the virtual character to appear natural. For example, movements of the virtual character should be similar to its corresponding real object (e.g., a virtual human should walk or move its arm like a real human). As another example, the gestures and positioning of the virtual human should appear natural, and the virtual human can initial interactions with the user (e.g., the virtual human can lead a collaborative experience with the user). Presentation of virtual characters or objects having life-like audio responses is described in more detail herein.
Compared to VR systems, which present the user with a virtual environment while obscuring the real environment, a mixed reality system presenting a MRE affords the advantage that the real environment remains perceptible while the virtual environment is presented. Accordingly, the user of the mixed reality system is able to use visual and audio cues associated with the real environment to experience and interact with the corresponding virtual environment. As an example, while a user of VR systems may struggle to perceive or interact with a virtual object displayed in a virtual environment—because, as noted herein, a user may not directly perceive or interact with a virtual environment—a user of an MR system may find it more intuitive and natural to interact with a virtual object by seeing, hearing, and touching a corresponding real object in his or her own real environment. This level of interactivity may heighten a user's feelings of immersion, connection, and engagement with a virtual environment. Similarly, by simultaneously presenting a real environment and a virtual environment, mixed reality systems may reduce negative psychological feelings (e.g., cognitive dissonance) and negative physical feelings (e.g., motion sickness) associated with VR systems. Mixed reality systems further offer many possibilities for applications that may augment or alter our experiences of the real world.
Persistent coordinate data may be coordinate data that persists relative to a physical environment. Persistent coordinate data may be used by MR systems (e.g., MR system 112, 200) to place persistent virtual content, which may not be tied to movement of a display on which the virtual object is being displayed. For example, a two-dimensional screen may display virtual objects relative to a position on the screen. As the two-dimensional screen moves, the virtual content may move with the screen. In some embodiments, persistent virtual content may be displayed in a corner of a room. A MR user may look at the corner, see the virtual content, look away from the corner (where the virtual content may no longer be visible because the virtual content may have moved from within the user's field of view to a location outside the user's field of view due to motion of the user's head), and look back to see the virtual content in the corner (similar to how a real object may behave).
In some embodiments, persistent coordinate data (e.g., a persistent coordinate system and/or a persistent coordinate frame) can include an origin point and three axes. For example, a persistent coordinate system may be assigned to a center of a room by a MR system. In some embodiments, a user may move around the room, out of the room, re-enter the room, etc., and the persistent coordinate system may remain at the center of the room (e.g., because it persists relative to the physical environment). In some embodiments, a virtual object may be displayed using a transform to persistent coordinate data, which may enable displaying persistent virtual content. In some embodiments, a MR system may use simultaneous localization and mapping to generate persistent coordinate data (e.g., the MR system may assign a persistent coordinate system to a point in space). In some embodiments, a MR system may map an environment by generating persistent coordinate data at regular intervals (e.g., a MR system may assign persistent coordinate systems in a grid where persistent coordinate systems may be at least within five feet of another persistent coordinate system).
In some embodiments, persistent coordinate data may be generated by a MR system and transmitted to a remote server. In some embodiments, a remote server may be configured to receive persistent coordinate data. In some embodiments, a remote server may be configured to synchronize persistent coordinate data from multiple observation instances. For example, multiple MR systems may map the same room with persistent coordinate data and transmit that data to a remote server. In some embodiments, the remote server may use this observation data to generate canonical persistent coordinate data, which may be based on the one or more observations. In some embodiments, canonical persistent coordinate data may be more accurate and/or reliable than a single observation of persistent coordinate data. In some embodiments, canonical persistent coordinate data may be transmitted to one or more MR systems. For example, a MR system may use image recognition and/or location data to recognize that it is located in a room that has corresponding canonical persistent coordinate data (e.g., because other MR systems have previously mapped the room). In some embodiments, the MR system may receive canonical persistent coordinate data corresponding to its location from a remote server.
With respect to
In the example shown, mixed reality objects comprise corresponding pairs of real objects and virtual objects (e.g., 122A/122B, 124A/124B, 126A/126B) that occupy corresponding locations in coordinate space 108. In some examples, both the real objects and the virtual objects may be simultaneously visible to user 110. This may be desirable in, for example, instances where the virtual object presents information designed to augment a view of the corresponding real object (such as in a museum application where a virtual object presents the missing pieces of an ancient damaged sculpture). In some examples, the virtual objects (122B, 124B, and/or 126B) may be displayed (e.g., via active pixelated occlusion using a pixelated occlusion shutter) so as to occlude the corresponding real objects (122A, 124A, and/or 126A). This may be desirable in, for example, instances where the virtual object acts as a visual replacement for the corresponding real object (such as in an interactive storytelling application where an inanimate real object becomes a “living” character).
In some examples, real objects (e.g., 122A, 124A, 126A) may be associated with virtual content or helper data that may not necessarily constitute virtual objects. Virtual content or helper data can facilitate processing or handling of virtual objects in the mixed reality environment. For example, such virtual content could include two-dimensional representations of corresponding real objects; custom asset types associated with corresponding real objects; or statistical data associated with corresponding real objects. This information can enable or facilitate calculations involving a real object without incurring unnecessary computational overhead.
In some examples, the presentation described herein may also incorporate audio aspects. For instance, in MRE 150, virtual character 132 could be associated with one or more audio signals, such as a footstep sound effect that is generated as the character walks around MRE 150. As described herein, a processor of mixed reality system 112 can compute an audio signal corresponding to a mixed and processed composite of all such sounds in MRE 150, and present the audio signal to user 110 via one or more speakers included in mixed reality system 112 and/or one or more external speakers.
Example mixed reality system 112 can include a wearable head device (e.g., a wearable augmented reality or mixed reality head device) comprising a display (which may comprise left and right transmissive displays, which may be near-eye displays, and associated components for coupling light from the displays to the user's eyes); left and right speakers (e.g., positioned adjacent to the user's left and right ears, respectively); an inertial measurement unit (IMU) (e.g., mounted to a temple arm of the head device); an orthogonal coil electromagnetic receiver (e.g., mounted to the left temple piece); left and right cameras (e.g., depth (time-of-flight) cameras) oriented away from the user; and left and right eye cameras oriented toward the user (e.g., for detecting the user's eye movements). However, a mixed reality system 112 can incorporate any suitable display technology, and any suitable sensors (e.g., optical, infrared, acoustic, LIDAR, EOG, GPS, magnetic). In addition, mixed reality system 112 may incorporate networking features (e.g., Wi-Fi capability, mobile network (e.g., 4G, 5G) capability) to communicate with other devices and systems, including neural networks (e.g., in the cloud) for data processing and training data associated with presentation of elements (e.g., virtual character 132) in the MRE 150 and other mixed reality systems. Mixed reality system 112 may further include a battery (which may be mounted in an auxiliary unit, such as a belt pack designed to be worn around a user's waist), a processor, and a memory. The wearable head device of mixed reality system 112 may include tracking components, such as an IMU or other suitable sensors, configured to output a set of coordinates of the wearable head device relative to the user's environment. In some examples, tracking components may provide input to a processor performing a Simultaneous Localization and Mapping (SLAM) and/or visual odometry algorithm. In some examples, mixed reality system 112 may also include a handheld controller 300, and/or an auxiliary unit 320, which may be a wearable beltpack, as described herein.
In some embodiments, an animation rig is used to present the virtual character 132 in the MRE 150. Although the animation rig is described with respect to virtual character 132, it is understood that the animation rig may be associated with other characters (e.g., a human character, an animal character, an abstract character) in the MRE 150.
In some embodiments, the disclosed asymmetrical microphone arrangements allow the system to record a sound field more independently from a user's movements (e.g., head rotation) (e.g., by allowing head movement along all axes of the environment to be detected acoustically, by allowing a sound field that may be more easily adjusted (e.g., the sound field has more information along different axes of the environment) to compensate these movements). More examples of these features and advantages are described herein.
In some examples involving augmented reality or mixed reality applications, it may be desirable to transform coordinates from a local coordinate space (e.g., a coordinate space fixed relative to headgear device 500A) to an inertial coordinate space, or to an environmental coordinate space. For instance, such transformations may be necessary for a display of headgear device 500A to present a virtual object at an expected position and orientation relative to the real environment (e.g., a virtual person sitting in a real chair, facing forward, regardless of the position and orientation of headgear device 500A), rather than at a fixed position and orientation on the display (e.g., at the same position in the display of headgear device 500A). This can maintain an illusion that the virtual object exists in the real environment (and does not, for example, appear positioned unnaturally in the real environment as the headgear device 500A shifts and rotates). In some examples, a compensatory transformation between coordinate spaces can be determined by processing imagery from the depth cameras 544 (e.g., using a Simultaneous Localization and Mapping (SLAM) and/or visual odometry procedure) in order to determine the transformation of the headgear device 500A relative to an inertial or environmental coordinate system. In the example shown in
In some examples, the depth cameras 544 can supply 3D imagery to a hand gesture tracker 511, which may be implemented in a processor of headgear device 500A. The hand gesture tracker 511 can identify a user's hand gestures, for example by matching 3D imagery received from the depth cameras 544 to stored patterns representing hand gestures. Other suitable techniques of identifying a user's hand gestures will be apparent.
In some examples, one or more processors 516 may be configured to receive data from headgear subsystem 504B, the IMU 509, the SLAM/visual odometry block 506, depth cameras 544, microphones 550; and/or the hand gesture tracker 511. The processor 516 can also send and receive control signals from the 6DOF totem system 504A. The processor 516 may be coupled to the 6DOF totem system 504A wirelessly, such as in examples where the handheld controller 500B is untethered. Processor 516 may further communicate with additional components, such as an audio-visual content memory 518, a Graphical Processing Unit (GPU) 520, and/or a Digital Signal Processor (DSP) audio spatializer 522. The DSP audio spatializer 522 may be coupled to a Head Related Transfer Function (HRTF) memory 525. The GPU 520 can include a left channel output coupled to the left source of imagewise modulated light 524 and a right channel output coupled to the right source of imagewise modulated light 526. GPU 520 can output stereoscopic image data to the sources of imagewise modulated light 524, 526. The DSP audio spatializer 522 can output audio to a left speaker 512 and/or a right speaker 514. The DSP audio spatializer 522 can receive input from processor 519 indicating a direction vector from a user to a virtual sound source (which may be moved by the user, e.g., via the handheld controller 500B). Based on the direction vector, the DSP audio spatializer 522 can determine a corresponding HRTF (e.g., by accessing a HRTF, or by interpolating multiple HRTFs). The DSP audio spatializer 522 can then apply the determined HRTF to an audio signal, such as an audio signal corresponding to a virtual sound generated by a virtual object. This can enhance the believability and realism of the virtual sound, by incorporating the relative position and orientation of the user relative to the virtual sound in the mixed reality environment—that is, by presenting a virtual sound that matches a user's expectations of what that virtual sound would sound like if it were a real sound in a real environment.
In some examples, such as shown in
While
In some embodiments, computation, determination, calculation, or derivation steps of method 600 are performed using a processor (e.g., processor of MR system 112, processor of wearable head device 200A, processor of wearable head device 200B, processor of handheld controller 300, processor of auxiliary unit 400, processor 516, DSP 522) of a wearable head device or an AR/MR/XR system and/or using a server (e.g., in the cloud).
In some embodiments, the method 600 includes detecting a sound (step 602). For example, a sound is detected by a microphone (e.g., microphone 250; microphones 250A, 250B, 250C, and 250D; microphone of handheld controller 300; microphone array 507) of a wearable head device or an AR/MR/XR system. In some embodiments, the sound includes a sound from a sound field or a 3-D audio scene of an environment (an AR, MR, or XR environment) of the wearable head device or the AR/MR/XR system.
In some examples, while the sound is being detected by the microphone, the microphone may not be stationary. For example, a user of the device including the microphone may not be stationary, such that the sound does not appear to be recorded at a fixed location and position. In some instances, the user wears a wearable head device including the microphone, and the user's head is not stationary (e.g., the user's headpose or head orientation changes over time) due to intentional and/or unintentional head movements. By processing the detected sound as described herein, a recording corresponding to the detected sound may be compensated for these movements, as if the sound was detected by a stationary microphone.
In some embodiments, the method 600 includes determining a digital audio signal based on the detected sound (step 604). In some embodiments, the digital audio signal is associated with a sphere having a position (e.g., a location, an orientation) in an environment (e.g., an AR, MR, or XR environment). As used herein, it is understood that “sphere” and “spherical” are not meant to limit an audio signal, a signal representation, or a sound to a strictly spherical pattern or geometry. As used herein, a “sphere” or “spherical” may refer to a pattern or geometry comprising components that span more than three dimensions of an environment.
For example, a spherical signal representation of the detected sound is derived. In some embodiments, the spherical signal representation represents a sound field with respect to a point in space (e.g., the sound field at a location of the recording device). For example, a 3-D spherical signal representation is derived based on the sound detected by the microphone in step 602. In some embodiments, in response to receiving signals corresponding to the detected sound, the 3-D spherical signal representation is determined using a processor (e.g., processor of MR system 112, processor of wearable head device 200A, processor of wearable head device 200B, processor of handheld controller 300, processor of auxiliary unit 400, processor 516, DSP 522) of a wearable head device or an AR/MR/XR system.
In some embodiments, the digital audio signal (e.g., a spherical signal representation) is in an Ambisonics or spherical harmonics format. The Ambisonics format advantageously allow the spherical signal representation to be efficiently edited for headpose compensation (e.g., an orientation associated of the Ambisonics representation may be easily translated to compensate for movement during sound detection).
In some embodiments, the method 600 includes detecting a microphone movement (step 606). In some embodiments, the method 600 includes concurrently with detecting the sound (e.g., from step 602), detecting, via a sensor of a wearable head device, a microphone movement with respect to the environment.
In some embodiments, a movement (e.g., changing headpose) of a recording device (e.g., MR system 112, wearable head device 200A, wearable head device 200B, handheld controller 300, wearable system 501A, wearable system 501B) is determined during sound detection (e.g., from step 602). For example, the movement is determined by a sensor (e.g., IMU (e.g., IMU 509), camera (e.g., cameras 228A, 228B; camera 544), a second microphone, gyroscope, LiDAR sensor, or other suitable sensor) of the device and/or by using AR/MR/XR localization techniques, such as simultaneous localization and mapping (SLAM) and/or visual inertial odometry (VIO). Determined movement may be, for example, three degree-of-freedom (3DOF) movement or six degree-of-freedom (6DOF) movement.
In some embodiments, the method 600 includes adjusting the digital audio signal (step 608). In some embodiments, the adjusting comprises adjusting the position (e.g., a location, an orientation) of the sphere based on based on the detected microphone movement (e.g., magnitude, direction). For example, after the 3-D spherical signal representation is derived (e.g., from step 604), a user's headpose is compensated with the adjustment. In some embodiments, a function for headpose compensation is derived based on the detected movement. For example, the function can represent a translation and/or rotation that corresponds to an opposite of the detected movement. As an example, at a time of sound detection, a headpose rotation of 2 degrees about a Z-axis is determined (e.g., by a method described herein). To compensate for this movement, the function for headpose compensation includes a −2 degrees translation about the Z-axis to counteract the effects of the movement on a sound recording at this time of sound detection. In some embodiments, the function for headpose compensation is determined by applying an inverse transformation on a representation of the detected movement during sound detection.
In some embodiments, the movement is represented by a matrix or vectors in space, which may be used to determine an amount of compensation needed to generate a fixed-orientation recording. For example, the function can include vectors (as a function of sound detection time) in an opposite direction of the movement vectors to represent a translation for counteracting the effects of movement on a recording during sound detection.
In some embodiments, the method 600 includes generating a fixed-orientation recording. The fixed-orientation recording may be an adjusted digital audio signal (e.g., a compensated digital audio signal configured to be presented to a listener). For example, based on headpose compensation (e.g., from step 608), a fixed-orientation recording is generated. In some embodiments, the fixed-orientation recording is unaffected by a user's head orientation and/or movement during recording (e.g., from step 602). In some embodiments, the fixed-orientation recording includes location and/or position information of the recording device in the AR/MR/XR environment, and the location and/or position information indicate a location and orientation of the recorded sound content in the AR/MR/XR environment.
In some embodiments, the digital audio signal (e.g., a spherical signal representation) is in an Ambisonics format, and the Ambisonics format advantageously allows a system to efficiently update coordinates of the spherical signal representation for headpose compensation (e.g., an orientation associated of the Ambisonics representation may be easily translated to compensate for movement during sound detection). After the movement of the recording device is determined (e.g., using a method described herein), a function for headpose compensation is derived, as described herein. Based on the derived function, Ambisonics signal representation may be updated to compensate for the device movement to generate a fixed-orientation recording (e.g., an adjusted digital audio signal).
As an example, at a time of sound detection, a headpose rotation of 2 degrees about a Z-axis is determined (e.g., by a method described herein). To compensate for this movement, the function for headpose compensation includes a −2 degrees translation about the Z-axis to counteract the effects of the movement on a sound recording at this time of sound detection. The function is applied to the Ambisonics spherical signal representation at a corresponding time (e.g., the time of this movement during sound capture) to translate the signal representation by −2 degree translation about the Z-axis, and a fixed-orientation recording of this time is generated. After the application of the function to the spherical signal representation, a fixed-orientation recording unaffected by a user's head orientation and/or movement during recording is generated (e.g., the effects of the 2-degree movement during sound detection are unnoticed by a user listening to the fixed-orientation recording).
In some instances, a user of the device including the microphone may not be stationary, such that the sound does not appear to be recorded at a fixed location and position. For example, the user wears a wearable head device including the microphone, and the user's head is not stationary (e.g., the user's headpose or head orientation changes over time) due to intentional and/or unintentional head movements. By compensating for the headpose and generating a fixed-orientation recording, as described herein, a recording corresponding to the detected sound may be compensated for these movements, as if the sound was detected by a stationary microphone.
In some embodiments, the method 600 advantageously enables producing of a recording of a 3-D audio scene surrounding a user (e.g., of a wearable head device), and the recording is unaffected by the user's head orientation. A recording unaffected by the user's head orientation allows a more accurate audio reproduction of an AR/MR/XR environment, as described in more detail herein.
In some embodiments, computation, determination, calculation, or derivation steps of method 650 are performed using a processor (e.g., processor of MR system 112, processor of wearable head device 200A, processor of wearable head device 200B, processor of handheld controller 300, processor of auxiliary unit 400, processor 516, DSP 522) of a wearable head device or an AR/MR/XR system and/or using a server (e.g., in the cloud).
In some embodiments, the method 650 includes receiving a digital audio signal (step 652). In some embodiments, the method 650 includes receiving, at a wearable head device, a digital audio signal. The digital audio signal is associated with a sphere having a position (e.g., a location, an orientation) in the environment (e.g., an AR, MR, or XR environment). For example, a fixed-orientation recording (e.g., an adjusted digital audio signal) is retrieved by an AR/MR/XR device (e.g., MR system 112, wearable head device 200A, wearable head device 200B, handheld controller 300, wearable system 501A, wearable system 501B). In some embodiments, the recording includes sounds from a sound field or a 3-D audio scene of an AR/MR/XR environment of the wearable head device or an AR/MR/XR system detected and processed using the methods described herein. In some embodiment, the recording is a fixed-orientation recording (as described herein). The fixed-orientation recording can be presented to a listener as if the sound of the recording was detected by a stationary microphone. In some embodiments, the fixed-orientation recording includes location and/or position information of the recording device in the AR/MR/XR environment, and the location and/or position information indicate a location and orientation of the recorded sound content in the AR/MR/XR environment.
In some embodiments, the recording includes sounds from a sound field or a 3-D audio scene of an AR/MR/XR environment (e.g., audio of AR/MR/XR content). In some embodiments, the recording includes sounds from a fixed sound source of an AR/MR/XR environment (e.g., from a fixed object of the AR/MR/XR environment).
In some embodiments, the recording includes a spherical signal representation (e.g., in Ambisonics format). In some embodiments, the recording is converted into a spherical signal representation (e.g., in Ambisonics format). The spherical signal representation may be advantageously updated to compensate for a user's headpose during audio playback of the recording.
In some embodiments, the method 650 includes detecting a device movement (step 654). In some embodiments, the method 650 includes detecting, via a sensor of the wearable head device, a device movement with respect to the environment. For example, in some embodiments, a movement (e.g., changing headpose) of a recording device (e.g., MR system 112, wearable head device 200A, wearable head device 200B, handheld controller 300, wearable system 501A, wearable system 501B) is determined while the user is listening to the audio. For example, the movement is determined by a sensor (e.g., IMU (e.g., IMU 509), camera (e.g., cameras 228A, 228B; camera 544), a second microphone, gyroscope, LiDAR sensor, or other suitable sensor) of the device and/or by using AR/MR/XR localization techniques, such as simultaneous localization and mapping (SLAM) and/or visual inertial odometry (VIO). Determined movement may be, for example, three degree-of-freedom (3DOF) movement or six degree-of-freedom (6DOF) movement.
In some embodiments, the method 650 includes adjusting the digital audio signal (step 656). In some embodiments, the adjusting comprises adjusting the position of the sphere based on based on the detected device movement (e.g., magnitude, direction).
In some embodiments, a function for headpose compensation is derived based on the detected movement. For example, the function can represent a translation and/or rotation that corresponds to an opposite of the detected movement. As an example, at a time of sound detection, a headpose rotation of 2 degrees about a Z-axis is determined (e.g., by a method described herein). To compensate for this movement, the function for headpose compensation includes a −2 degrees translation about the Z-axis to counteract the effects of the movement on a sound recording at this time of sound detection. In some embodiments, the function for headpose compensation is determined by applying an inverse transformation on a representation of the detected movement during sound detection.
In some embodiments, the movement is represented by a matrix or vectors in space, which may be used to determine an amount of compensation needed to generate a fixed-orientation recording. For example, the function can include vectors (as a function of sound detection time) in an opposite direction of the movement vectors to represent a translation for counteracting the effects of movement on a recording during sound detection.
In some embodiments, the function for headpose compensation is applied to the recording or a spherical signal representation of the recording (e.g., a digital audio signal) to compensate for the headpose. In some embodiments, the spherical signal representation is in an Ambisonics format, and the Ambisonics format advantageously allows a system to efficiently update coordinates of the spherical signal representation for headpose compensation (e.g., an orientation associated of the Ambisonics representation may be easily translated to compensate for movement during playback). After the movement of the playback device is determined (e.g., using a method described herein), a function for headpose compensation is derived, as described herein. Based on the derived function, Ambisonics signal representation may be updated to compensate for the device movement.
As an example, during playback, a headpose rotation of 2 degrees about a Z-axis is determined (e.g., by a method described herein). To compensate for this movement, the function for headpose compensation includes a −2 degrees translation about the Z-axis to counteract the effects of the movement at this time of playback. The function is applied to the Ambisonics spherical signal representation at a corresponding time (e.g., the time of this movement during playback) to translate the signal representation by −2 degree translation about the Z-axis. After the application of the function to the spherical signal representation, a second spherical signal representation may be generated (e.g., the effects of the 2-degree movement during playback do not affect a fixed sound source location).
In some embodiments, the method 650 includes presenting the adjusted digital audio signal (step 658). In some embodiments, the method 650 includes presenting the adjusted digital audio signal to a user of the wearable head device via one or more speakers of the wearable head device. For example, after a user's headpose is compensated (e.g., using step 654), the compensated spherical signal representation converts to a binaural signal (e.g., an adjusted digital audio signal). In some embodiments, the binaural signal corresponds to an audio output to the user, and the audio output compensates for the user's movements, using the methods described herein. It is understood that binaural signal is merely an example of this conversion. In some embodiments, more generally, the compensated spherical signal representation converts into an audio signal that corresponds to an audio output being outputted by one or more speakers. In some embodiment, the conversion is performed by a processor (e.g., processor of MR system 112, processor of wearable head device 200A, processor of wearable head device 200B, processor of handheld controller 300, processor of auxiliary unit 400, processor 516, DSP 522) of a wearable head device or an AR/MR/XR system.
The wearable head device or AR/MR/XR system may play the audio output corresponding to the converted binaural signal or audio signal (e.g., an adjusted digital audio signal). In some embodiments, the audio is compensated for movement of the device. That is, the audio playback would appear to originate from a fixed sound source of an AR/MR/XR environment. For example, a user in an AR/MR/XR environment rotates his or her head to the right away from a fixed sound source (e.g., a virtual speaker). After the head rotation, the user's left ear is closer to the fixed sound source. After performing the disclosed compensation, the audio from the fixed sound source to the user's left ear would be louder.
In some embodiments, the method 650 advantageously allows a 3-D sound field representation to be rotated based on a listener's head movement at playback time, before being decoded into a binaural representation for playback. The audio playback would appear to originate from a fixed sound source of an AR/MR/XR environment, providing the user a more realistic AR/MR/XR experience (e.g., a fixed AR/MR/XR object would appear fixed aurally while a user moves relative to the corresponding fixed object (e.g., changes headpose)).
In some embodiments, the method 600 may be performed using more than one device or system. That is, more than one device or system may capture a sound field or audio scene, and the effects of movements of the devices or systems on the sound field or audio scene captures may be compensated.
In some embodiments, computation, determination, calculation, or derivation steps of method 700 are performed using a processor (e.g., processor of MR system 112, processor of wearable head device 200A, processor of wearable head device 200B, processor of handheld controller 300, processor of auxiliary unit 400, processor 516, DSP 522) of a wearable head device or an AR/MR/XR system and/or using a server (e.g., in the cloud).
In some embodiments, the method 700 includes detecting a first sound (step 702A). For example, a sound is detected by a microphone (e.g., microphone 250; microphones 250A, 250B, 250C, and 250D; microphone of handheld controller 300; microphone array 507) of a first wearable head device or a first AR/MR/XR system. In some embodiments, the sound includes a sound from a sound field or a 3-D audio scene of an AR/MR/XR environment of the first wearable head device or the first AR/MR/XR system.
In some embodiments, the method 700 includes determining a first digital audio signal based on the first detected sound (step 704A). In some embodiments, the first digital audio signal is associated with a first sphere having a first position (e.g., a location, an orientation) in an environment (e.g., an AR, MR, or XR environment).
For example, a first spherical signal representation of the first detected sound is derived. In some embodiments, the spherical signal representation represents a sound field with respect to a point in space (e.g., the sound field at a location of the first recording device). For example, a 3-D spherical signal representation is derived based on the sound detected by the microphone in step 702A. In some embodiments, in response to receiving signals corresponding to the detected sound, the 3-D spherical signal representation is determined using a processor (e.g., processor of MR system 112, processor of wearable head device 200A, processor of wearable head device 200B, processor of handheld controller 300, processor of auxiliary unit 400, processor 516, DSP 522) of a first wearable head device or a first AR/MR/XR system. In some embodiments, the spherical signal representation is in an Ambisonics or spherical harmonics format.
In some embodiments, the method 700 includes detecting a first microphone movement (step 706A). In some embodiments, the method 700 includes concurrently with detecting the first sound, detecting, via a sensor of the first wearable head device, a first microphone movement with respect to the environment. In some embodiments, the movement (e.g., changing headpose) of the first recording device (e.g., MR system 112, wearable head device 200A, wearable head device 200B, handheld controller 300, wearable system 501A, wearable system 501B) is determined during sound detection (e.g., from step 702A). For example, the movement is determined by a sensor (e.g., IMU (e.g., IMU 509), camera (e.g., cameras 228A, 228B; camera 544), a second microphone, gyroscope, LiDAR sensor, or other suitable sensor) of the first device and/or by using AR/MR/XR localization techniques, such as simultaneous localization and mapping (SLAM) and/or visual inertial odometry (VIO). Determined movement may be, for example, three degree-of-freedom (3DOF) movement or six degree-of-freedom (6DOF) movement.
In some embodiments, the method 700 includes adjusting the first digital audio signal (step 708A). In some embodiments, the adjusting comprises adjusting the first position (e.g., a location, an orientation) of the first sphere based on based on the detected first microphone movement (e.g., magnitude, direction). For example, after the first 3-D spherical signal representation is derived (e.g., from step 704A), a first user's headpose is compensated with the adjustment. In some embodiments, a first function for first headpose compensation is derived based on the detected movement. For example, the first function can represent a translation and/or rotation that corresponds to an opposite of the detected movement. As an example, at a time of sound detection, a first headpose rotation of 2 degrees about a Z-axis is determined (e.g., by a method described herein). To compensate for this movement, the first function for first headpose compensation includes a −2 degrees translation about the Z-axis to counteract the effects of the movement on a sound recording at this time of sound detection. In some embodiments, the first function for first headpose compensation is determined by applying an inverse transformation on a representation of the detected movement during sound detection.
In some embodiments, the movement is represented by a matrix or vectors in space, which may be used to determine an amount of compensation needed to generate a fixed-orientation recording. For example, the first function can include vectors (as a function of sound detection time) in an opposite direction of the movement vectors to represent a translation for counteracting the effects of movement on a first recording during sound detection.
In some embodiments, the method 700 includes generating a first fixed-orientation recording. The first fixed-orientation recording may be an adjusted first digital audio signal (e.g., a compensated digital audio signal configured to be presented to a listener). For example, based on first headpose compensation (e.g., from step 708A), a first fixed-orientation recording is generated. In some embodiments, the first fixed-orientation recording is unaffected by a first user's head orientation and/or movement during recording (e.g., from step 702A). In some embodiments, the first fixed-orientation recording includes location and/or position information of the first recording device in the AR/MR/XR environment, and the location and/or position information indicate a location and orientation of the first recorded sound content in the AR/MR/XR environment.
In some embodiments, the first digital audio signal (e.g., a spherical signal representation) is in an Ambisonics format. After the movement of the first recording device is determined (e.g., using a method described herein), a first function for headpose compensation is derived, as described herein. Based on the derived first function, the Ambisonics signal representation may be updated to compensate for the first device movement to generate a first fixed-orientation recording.
As an example, at a time of sound detection, a first headpose rotation of 2 degrees about a Z-axis is determined (e.g., by a method described herein). To compensate for this movement, the first function for first headpose compensation includes a −2 degrees translation about the Z-axis to counteract the effects of the movement on a sound recording at this time of sound detection. The first function is applied to the Ambisonics spherical signal representation at a corresponding time (e.g., the time of this movement during sound capture) to translate the signal representation by −2 degree translation about the Z-axis, and a first fixed-orientation recording of this time is generated. After the application of the first function to the first spherical signal representation, a first fixed-orientation recording unaffected by a first user's head orientation and/or movement during recording is generated (e.g., the effects of the 2-degree movement during sound detection are unnoticed by a user listening to the fixed-orientation recording).
In some instances, a first user of the first device including the microphone may not be stationary, such that the first sound does not appear to be recorded at a first fixed location and position. For example, the first user wears a first wearable head device including the microphone, and the first user's head is not stationary (e.g., the user's headpose or head orientation changes over time) due to intentional and/or unintentional head movements. By compensating for the first headpose and generating a first fixed-orientation recording, as described herein, a first recording corresponding to the detected sound may be compensated for these movements, as if the sound was detected by a stationary microphone.
In some embodiments, the method 700 includes detecting a second sound (step 702B). For example, sounds are detected by a microphone (e.g., microphone 250; microphones 250A, 250B, 250C, and 250D; microphone of handheld controller 300; microphone array 507) of a second wearable head device or a second AR/MR/XR system. In some embodiments, the sound includes a sound from a sound field or a 3-D audio scene of an AR/MR/XR environment of the second wearable head device or the second AR/MR/XR system. In some embodiments, the AR/MR/XR environment of the second device or system is the same environment as the first device or system, as described with respect to steps 702A-708A.
In some embodiments, the method 700 includes determining a second digital audio signal based on the second detected sound (step 704B). In some embodiments, the second digital audio signal is associated with a second sphere having a second position (e.g., a location, an orientation) in an environment (e.g., an AR, MR, or XR environment). For example, the second spherical signal representation corresponding to the second sounds is derived similarly to a first spherical signal representation, as described with respect to step 704A. For the sake of brevity, this is not described here.
In some embodiments, the method 700 includes detecting a second microphone movement (step 706B). For example, the second microphone movement is detected similarly to detection of a first microphone movement, as described with respect to step 706A. For the sake of brevity, this is not described here.
In some embodiments, the method 700 includes adjusting the second digital audio signal (step 708B). For example, the second headpose is compensated (e.g., using a second function for the second headpose) similarly to compensation of a first headpose, as described with respect to step 708A. For the sake of brevity, this is not described here.
In some embodiments, the method 700 includes generating a second fixed-orientation recording. For example, the second fixed-orientation recording is generated (e.g., by applying the second function to the second spherical signal representation) similarly to generation of a first fixed-orientation recording, as described with respect to step 708A. For the sake of brevity, this is not described here.
After the application of the second function to the second spherical signal representation, a second fixed-orientation recording unaffected by a second user's head orientation and/or movement during recording is generated (e.g., effects of movement during sound detection are unnoticed by a user listening to the second fixed-orientation recording).
In some instances, a second user of the second device including the microphone may not be stationary, such that the second sound does not appear to be recorded at a second fixed location and position. For example, the second user wears a second wearable head device including the microphone, and the second user's head is not stationary (e.g., the user's headpose or head orientation changes over time) due to intentional and/or unintentional head movements. By compensating for the second headpose and generating a second fixed-orientation recording, as described herein, a second recording corresponding to the detected sound may be compensated for these movements, as if the sound was detected by a stationary microphone.
In some embodiments, steps 702A-708A are performed at the same time as steps 702B-708B (e.g., the first device or system and the second device or system are recording a sound field or a 3-D audio scene at a same time). For example, a first user of a first device or system and a second user of a second device or system are recording a sound field or a 3-D audio scene together in the AR/MR/XR environment at a same time. In some embodiments, steps 702A-708A are performed at a different time than steps 702B-708B (e.g., the first device or system and the second device or system are recording a sound field or a 3-D audio scene at a different times). For example, a first user of a first device or system and a second user of a second device or system are recording a sound field or a 3-D audio scene in the AR/MR/XR environment at different times.
In some embodiments, the method 700 includes combining the adjusted digital audio signal and the second adjusted digital audio signal (step 710). For example, the first fixed-orientation recording and the second fixed-orientation recording are combined. The combined first adjusted digital audio signal and second adjusted digital audio signal may be presented to a listener (e.g., in response to a playback request). In some embodiments, the combined fixed-orientation recording includes location and/or position information of the first and second recording devices in the AR/MR/XR environment, and the location and/or position information indicate respective locations and orientations of the first and second recorded sound contents in the AR/MR/XR environment.
In some embodiments, the recordings are combined at a server (e.g., in the cloud) that communicates with the first device or system and the second device or system (e.g., the devices or systems send the respective sound objects to the server for further processing and storage). In some embodiments, the recordings are combined at a master device (e.g., a first or second wearable head device or AR/MR/XR system).
In some embodiments, combining the first and second fixed-orientation recordings produces a combined fixed-orientation recording corresponding to a combined sound field or 3-D audio scene of an environment (e.g., a larger AR/MR/XR environment that requires more than one device for sound detection; the first and second fixed-orientation recordings comprise sounds from different parts of a AR/MR/XR environment) of the first and second recording devices or systems. In some embodiments, the first fixed-orientation recording is an earlier recording of a AR/MR/XR environment, and the second fixed-orientation recording is a later recording of the AR/MR/XR environment. Combining the first and second fixed-orientation recordings allow a sound field or a 3-D audio scene of the AR/MR/XR environment to be updated with new fixed-orientation recordings while achieving the advantages described herein.
In some embodiments, the method 700 advantageously enables producing of a recording of a 3-D audio scene surrounding more than one user (e.g., of more than one wearable head device), and the combined recording is unaffected by the users' head orientations. A recording unaffected by the users' head orientations allows a more accurate audio reproduction of an AR/MR/XR environment, as described in more detail herein.
In some embodiments, using detected data from multiple devices, as described with respect to method 700, may improve location estimation. For instance, correlating data from multiple devices could help provide distance information that may be harder to estimate from a single device audio capture.
Although method 700 is described as comprising movement or headpose compensation for two recordings and combining the two compensated recordings, it is understood that the method 700 may also comprise movement or headpose compensation for one recording and combining a compensated recording and a non-compensated recording. For example, method 700 may be performed to combine a compensated recording and a recording from a fixed recording device (e.g., detecting a recording that does not require compensation).
In some embodiments, computation, determination, calculation, or derivation steps of method 750 are performed using a processor (e.g., processor of MR system 112, processor of wearable head device 200A, processor of wearable head device 200B, processor of handheld controller 300, processor of auxiliary unit 400, processor 516, DSP 522) of a wearable head device or an AR/MR/XR system and/or using a server (e.g., in the cloud).
In some embodiments, the method 750 includes combining a first digital audio signal and a second digital audio signal (step 752). For example, a first fixed-orientation recording and a second fixed-orientation recording are combined. In some embodiments, the recordings are combined at a server (e.g., in the cloud) that communicates with the first device or system and the second device or system, and the combined fixed-orientation recording is sent to a playback device (e.g., MR system, wearable head device 200A, wearable head device 200B, handheld controller 300, wearable system 501A, wearable system 501B). In some embodiments, a first digital audio signal and a second digital audio signal are not fixed-orientation recordings.
In some embodiments, the recordings are combined by the playback device. For example, the first and second fixed-orientation recordings are stored at the playback device, and the playback device combines the two fixed-orientation recordings. As another example, at least one of the first and second fixed-orientation recording is received by the playback device (e.g., sent by a second device or system, sent by a server), and the first and second fixed-orientation recordings are combined by the playback device after the playback device stores the fixed-orientation recordings.
In some embodiments, the first fixed-orientation recording and second fixed-orientation recording are combined prior to a playback request. For example, the fixed-orientation recordings are combined at step 710 of method 700 prior to the playback request, and in response to a playback request, the playback device receives the combined fixed-orientation recording. For the sake of brevity, similar examples and advantages between steps 710 and 752 are not described here.
In some embodiments, the method 750 includes downmixing the combined second and third digital audio signal (step 754). For example, the combined fixed-orientation recording is downmixed. For example, the combined fixed-orientation recording from step 752 is downmixed into an audio stream suitable for playback at the playback device (e.g., downmixing the combined fixed-orientation recording into an audio stream comprising a suitable number of corresponding channels (e.g., 2, 5.1, 7.1) for playback at the playback device).
In some embodiments, downmixing the combined fixed-orientation recording comprises applying a respective gain to each of the fixed orientation recording. In some embodiments, downmixing the combined fixed-orientation recording comprises reducing an Ambisonic order corresponding of a respective fixed-orientation recording based on distance of a listener from a recording location of the fixed-orientation recording.
In some embodiments, the method 750 includes receiving a digital audio signal (step 756). In some embodiments, the method 750 includes receiving, at a wearable head device, a digital audio signal. The digital audio signal is associated with a sphere having a position (e.g., a location, an orientation) in the environment. For example, a fixed-orientation recording (e.g., a combined digital audio signal from step 710 or 752, a combined and downmixed combined digital audio signal from step 754) is retrieved by an AR/MR/XR device (e.g., MR system 112, wearable head device 200A, wearable head device 200B, handheld controller 300, wearable system 501A, wearable system 501B). In some embodiments, the recording includes sounds from a sound field or a 3-D audio scene of an AR/MR/XR environment of the wearable head device or an AR/MR/XR system captured and processed by more than one device using the methods described herein. In some embodiment, the recording is a combined fixed-orientation recording (as described herein). The combined fixed-orientation recording is presented to a listener as if the sound of the recording was detected by stationary microphones. In some embodiments, the combined fixed-orientation recording includes location and/or position information of the recording devices (e.g., the first and second recording devices, as described with respect to method 700) in the AR/MR/XR environment, and the location and/or position information indicate respective locations and orientations of the combined recorded sound content in the AR/MR/XR environment. In some embodiments, the retrieved digital audio signal is not a fixed-orientation recording.
In some embodiments, the recording includes combined sounds from a sound field or a 3-D audio scene of an AR/MR/XR environment (e.g., audio of AR/MR/XR content). In some embodiments, the recording includes combined sounds from fixed sound sources of an AR/MR/XR environment (e.g., from a fixed object of the AR/MR/XR environment).
In some embodiments, the recording includes a spherical signal representation (e.g., in Ambisonics format). In some embodiments, the recording is converted into a spherical signal representation (e.g., in Ambisonics format). The spherical signal representation may be advantageously updated to compensate for a user's headpose during audio playback of the recording.
In some embodiments, the method 750 includes detecting a device movement (step 758). For example, in some embodiments, movement of the device is detected, as described with respect to step 654. For the sake of brevity, some examples and advantages are not described here.
In some embodiments, the method 750 includes adjusting the digital audio signal (step 760). For example, in some embodiments, effects of headpose (e.g., of a playback device) are compensated, as described with respect to step 656. For the sake of brevity, some examples and advantages are not described here.
In some embodiments, the method 750 includes presenting the adjusted digital audio signal (step 762). For example, in some embodiments, the adjusted digital audio signal (e.g., compensating for a movement of a playback device) is presented, as described with respect to step 658. For the sake of brevity, some examples and advantages are not described here.
The wearable head device or AR/MR/XR system may play the audio output corresponding to the converted binaural signal or audio signal (e.g., corresponding to a combined recording, an adjusted digital audio signal from step 760), as described with respect to step 658. For the sake of brevity, some examples and advantages are not described here.
In some embodiments, the method 750 advantageously allows a combined 3-D sound field representation (e.g., a 3-D sound field captured by more than one recording device) to be rotated based on a listener's head movement at playback time, before being decoded into a binaural representation for playback. The audio playback would appear to originate from fixed sound sources of an AR/MR/XR environment, providing the user a more realistic AR/MR/XR experience (e.g., fixed AR/MR/XR objects would appear fixed aurally while a user moves relative to the corresponding fixed objects (e.g., changes headpose)).
In some embodiments, when capturing a sound field or a 3-D audio scene, it may be advantageous to separate sound objects and a residual (e.g., portions of the sound field or 3-D audio scene that do not include a sound object) in the sound field or the 3-D audio scene. For example, the sound field or the 3-D audio scene may be a part of AR/MR/XR content that supports six degrees of freedom for a user accessing the AR/MR/XR content. An entire sound field or 3-D audio scene that supports six degrees of freedom may result in very large and/or complex files, which would require more computing resources to access. Therefore, it may be advantageous to extract the sound objects (e.g., sounds associated with objects of interest in an AR/MR/XR environment, dominant sounds in an AR/MR/XR environment) from the sound field or 3-D audio scene and render the sound objects with six-degrees-of-freedom support. The remaining portions (e.g., portions that do not include a sound object such as background noise and sounds) of the sound field or 3-D audio scene may be separated as a residual, and the residual may be rendered with three-degrees-of-freedom support. The sound objects (supporting six degrees of freedom) and the residual (supporting three degrees of freedom) may be combined to generate a less complex (e.g., smaller file sizes) and more efficient sound field or audio scene.
In some embodiments, computation, determination, calculation, or derivation steps of method 800 are performed using a processor (e.g., processor of MR system 112, processor of wearable head device 200A, processor of wearable head device 200B, processor of handheld controller 300, processor of auxiliary unit 400, processor 516, DSP 522) of a wearable head device or an AR/MR/XR system and/or using a server (e.g., in the cloud).
In some embodiments, the method 800 includes detecting a sound (step 802). For example, a sound is detected, as described with respect to step 602, 702A, or 702B. For the sake of brevity, some examples and advantages are not described here.
In some embodiments, the method 800 includes determining a digital audio signal based on the detected sound (step 804). In some embodiments, the digital audio signal is associated with a sphere having a position (e.g., a location, an orientation) in an environment (e.g., an AR, MR, or XR environment). For example, a spherical signal representation is derived, as described with respect to step 604, 704A, or 704B. For the sake of brevity, some examples and advantages are not described here.
In some embodiments, the method 800 includes detecting a microphone movement (step 806). For example, movement of a microphone is detected, as described with respect to step 606, 706A, or 706B. For the sake of brevity, some examples and advantages are not described here.
In some embodiments, the method 800 includes adjusting the digital audio signal (step 808). For example, effects of headpose are compensated, as described with respect to step 608, 708A, or 708B. For the sake of brevity, some examples and advantages are not described here.
In some embodiments, the method 800 includes generating a fixed-orientation recording. For example, a fixed-orientation recording is generated, as described with respect to step 608, 708A, or 708B. For the sake of brevity, some examples and advantages are not described here.
In some embodiments, the method 800 includes extracting a sound object (step 810). For example, the sound object correspond to a sound associated with an object of interest in an AR/MR/XR environment or to dominant sounds in an AR/MR/XR environment. In some embodiments, a processor (e.g., processor of MR system 112, processor of wearable head device 200A, processor of wearable head device 200B, processor of handheld controller 300, processor of auxiliary unit 400, processor 516, DSP 522) of a wearable head device or an AR/MR/XR system determines the sound objects in a sound field or an audio scene and extracts the sound objects from the sound field or audio scene. In some embodiments, the extracted sound object comprises audio (e.g., audio signals associated with the sound) and location and position information (e.g., coordinates and orientation of a sound source associated with the sound object in the AR/MR/XR environment).
In some embodiments, a sound object comprises a portion of a detected sounds, and the portion meets a sound object criterion. For example, a sound object determined based on an activity of a sound. In some embodiments, the device or the system determines objects having a sound activity (e.g., frequency change, displacement in the environment, amplitude change) above a threshold sound activity (e.g., above a threshold frequency change, above a threshold displacement in the environment, above a threshold amplitude change). For example, the environment is a virtual concert, and the sound field includes sounds of an electric guitar and noises of the virtual spectators. The device or system may determine that the sounds of an electric guitar, in accordance with a determination that the sounds of the electric guitar have a sound activity above a threshold sound activity (e.g., a speedy musical passage is being played on the electric guitar), are sound objects are extracted accordingly, and the noise of the virtual spectators is part of a residual (as described in more detail herein).
In some embodiments, the sound objects are determined by information of the AR/MR/XR environment (e.g., the information of the AR/MR/XR environment defines the objects of interest or dominant sounds and their corresponding sounds). In some embodiments, the sound objects are user-defined (e.g., while recording a sound field or an audio scene, the user defines the objects of interest or dominant sounds in the environment and their corresponding sounds).
In some embodiments, sounds of a virtual object may be sound objects at a first time and a residue at a second time. For example, at the first time, the device or system determines that a sound of the virtual object is a sound object (e.g., above a threshold sound activity) and extracts the sound object. However, at the second time, the device or system determines that a sound of the virtual object is not a sound object (e.g., below a threshold sound activity) and does not extract the sound object (e.g., the sound of the virtual object is part of the residual at the second time).
In some embodiments, the method 800 includes combining the sound object and a residual (step 812). For example, the wearable head device or AR/MR/XR system combines the extracted sound objects (e.g., from step 810) and the residual (e.g., portions of the sound field or audio scene not extracted as sound objects). In some embodiments, the combined sound objects and residual is a less complex and more efficient sound field or audio scene, compared to a sound field or audio scene without sound object extraction. In some embodiments, the residual is stored with lower spatial resolution (e.g., in a 1st order Ambisonics file). In some embodiments, a sound object is stored with higher spatial resolution (e.g., because the sound object comprises a sound of an object of interest or a dominant sound in an AR/MR/XR environment).
In some examples, the sound field or the 3-D audio scene may be a part of AR/MR/XR content that supports six degrees of freedom for a user accessing the AR/MR/XR content. In some embodiments, the sound objects (e.g., sounds associated with objects of interest in an AR/MR/XR environment, dominant sounds in an AR/MR/XR environment) from the sound field or 3-D audio scene are rendered (e.g., by a processor of a wearable head device or an AR/MR/XR system) with six-degrees-of-freedom support. The remaining portions (e.g., portions that do not include a sound object such as background noise and sounds) of the sound field or 3-D audio scene may be separated as a residual, and the residual may be rendered with three-degrees-of-freedom support. The sound objects (supporting six degrees of freedom) and the residual (supporting three degrees of freedom) may be combined to generate a less complex (e.g., smaller file sizes) and more efficient sound field or audio scene.
In some embodiments, the method 800 advantageously generates a less complex (e.g., smaller file sizes) sound field or audio scene. By extracting sound objects and rendering them at higher spatial resolution, while rendering the residual at a lower spatial resolution, the generated sound field or audio scene is more efficient (e.g., smaller file sizes, less required computational resources) than an entire sound field or audio scene that supports six degrees of freedom. Furthermore, while being more efficient, the generated sound field or audio scene does not compromise a user's AR/MR/XR experience by maintaining the more important qualities of a six-degrees-of-freedom sound field or audio scene while minimizing resources on portions that do not require more degrees of freedom.
In some embodiments, computation, determination, calculation, or derivation steps of method 850 are performed using a processor (e.g., processor of MR system 112, processor of wearable head device 200A, processor of wearable head device 200B, processor of handheld controller 300, processor of auxiliary unit 400, processor 516, DSP 522) of a wearable head device or an AR/MR/XR system and/or using a server (e.g., in the cloud).
In some embodiments, the method 850 includes combining a sound object and a residual (step 852). For example, a sound object and a residual are combined, as described with respect step 812. For the sake of brevity, some examples and advantages are not described here.
In some embodiments, the sound object and the residual are combined prior to a playback request. For example, the sound object and the residual are combined at step 812 while method 800 is performed, prior to the playback request, and in response to a playback request, the playback device receives the combined sound objects and the residual.
In some embodiments, the method 850 includes detecting a device movement (step 854). For example, in some embodiments, movement of a device is detected, as described with respect to step 654 or step 758. For the sake of brevity, some examples and advantages are not described here.
In some embodiments, the method 850 includes adjusting the sound object (step 856). In some embodiments, the sound object is associated with a first sphere having a first position in the environment. For example, in some embodiments, effects of headpose are compensated for a sound object, as described with respect to step 656 or step 760. For the sake of brevity, some examples and advantages are not described here.
As an example, a sound object supports six degrees of freedom. Due to the sound object's high spatial resolution, effects of headpose along these six degrees of freedom may be advantageously compensated. For instance, headpose movement along any of the six degrees of freedom may be compensated, such that the sound object appears to originate from a fixed sound source in an AR/MR/XR environment, even if a headpose moves along any of the six degrees of freedom.
In some embodiments, the method 850 includes converting a sound object to first binaural signal. For example, the playback device (e.g., a wearable head device, an AR/MR/XR system) converts a sound object to a binaural signal. In some embodiments, all of the sound objects (e.g., extracted as described herein) are converted into respective binaural signals. In some embodiments, each sound object is converted one at a time. In some embodiments, more than one sound object are converted at a same time.
In some embodiments, the method 850 includes adjusting the residual (step 858). In some embodiments, the residual is associated with a second sphere having a second position in the environment. For example, in some embodiments, effects of headpose are compensated for a residual, as described with respect to step 654 or step 758. For the sake of brevity, some examples and advantages are not described here. In some embodiments, the residual is stored with lower spatial resolution (e.g., in a 1st order Ambisonics file).
In some embodiments, the method 850 includes converting a residual to second binaural signal. For example, the playback device (e.g., a wearable head device, an AR/MR/XR system) converts a residual (as described herein) to a binaural signal.
In some embodiments, the steps 856 and 858 are performed in parallel (e.g., the sound objects and the residual are converted at a same time). In some embodiments, the steps 856 and 858 are performed sequentially (e.g., the sound objects are converted first, then the residual; the residual is converted first, then the sound objects).
In some embodiments, the method 850 includes mixing the adjusted sound object and the adjusted residual (step 860). For example, the first (e.g., the adjusted sound object) and second binaural signals (e.g., the adjusted residual) are mixed. For example, after the sound objects and the residual are converted into respective binaural signals, the playback device (e.g., a wearable head device, an AR/MR/XR system) mixes the binaural signals into an audio stream for presentation to a listener of the device. In some embodiments, the audio stream comprises sounds in an AR/MR/XR environment of the playback device.
In some embodiments, the method 850 includes presenting the mixed adjusted sound object and residual (step 864). In some embodiments, the method 850 includes presenting the mixed adjusted sound object and residual to a user of the wearable head device via one or more speakers of the wearable head device. For example, the audio stream mixed from the first and second binaural signals is played by a playback device (e.g., a wearable head device, an AR/MR/XR system). In some embodiments, the audio stream comprises sounds in an AR/MR/XR environment of the playback device. For the sake of brevity, some examples and advantages of presenting an adjusted digital audio signal are not described here.
In some embodiments, due to the extraction of sound objects, the audio stream is less complex (e.g., smaller file sizes), compared to an audio stream that does not have corresponding extracted sound objects and a residual. By extracting sound objects and rendering them at higher spatial resolution, while rendering the residual at a lower spatial resolution, the audio stream is more efficient (e.g., smaller file sizes, less required computational resources) than sound field or audio scene including portions that support unnecessary degrees of freedom. Furthermore, while being more efficient, the audio stream does not compromise a user's AR/MR/XR experience by maintaining the more important qualities of a six-degrees-of-freedom sound field or audio scene while minimizing resources on portions that do not require more degrees of freedom.
In some embodiments, the method 800 may be performed using more than one device or system. That is, more than one device or system may capture a sound field or audio scene, and sound objects and residuals may be extracted from the sound fields or audio scenes detected by more than one device or system.
In some embodiments, computation, determination, calculation, or derivation steps of method 900 are performed using a processor (e.g., processor of MR system 112, processor of wearable head device 200A, processor of wearable head device 200B, processor of handheld controller 300, processor of auxiliary unit 400, processor 516, DSP 522) of a wearable head device or an AR/MR/XR system and/or using a server (e.g., in the cloud).
In some embodiments, the method 900 includes detecting a first sound (step 902A). For example, a sound is detected by a microphone (e.g., microphone 250; microphones 250A, 250B, 250C, and 250D; microphone of handheld controller 300; microphone array 507) of a first wearable head device or a first AR/MR/XR system. In some embodiments, the sound includes a sound from a sound field or a 3-D audio scene of an AR/MR/XR environment of the first wearable head device or the first AR/MR/XR system
In some embodiments, the method 900 includes determining a first digital audio signal based on the first detected sound (step 904A). In some embodiments, the first digital audio signal is associated with a first sphere having a first position (e.g., a location, an orientation) in an environment (e.g., an AR, MR, or XR environment). For example, the first spherical signal corresponding to the first sounds is derived similarly to a first spherical signal representation, as described with respect to step 704A. For the sake of brevity, this is not described here.
In some embodiments, the method 900 includes detecting a first microphone movement (step 906A). For example, the first microphone movement is detected similarly to detection of a first microphone movement, as described with respect to step 706A. For the sake of brevity, this is not described here.
In some embodiments, the method 900 includes adjusting the first digital audio signal (step 908A). For example, the first headpose is compensated (e.g., using a first function for the first headpose) similarly to compensation of a first headpose, as described with respect to step 708A. For the sake of brevity, this is not described here.
In some embodiments, the method 900 includes generating a first fixed-orientation recording. For example, the first fixed-orientation recording is generated (e.g., by applying the first function to the first spherical signal representation) similarly to generation of a first fixed-orientation recording, as described with respect to step 708A. For the sake of brevity, this is not described here.
In some embodiments, the method 900 includes extracting a first sound object (step 910A). For example, the first sound object correspond to a sound associated with an object of interest or dominant sounds in an AR/MR/XR environment detected by the first recording device. In some embodiments, a processor (e.g., processor of MR system 112, processor of wearable head device 200A, processor of wearable head device 200B, processor of handheld controller 300, processor of auxiliary unit 400, processor 516, DSP 522) of a first wearable head device or a first AR/MR/XR system determines the first sound objects in a sound field or an audio scene and extracts the sound objects from the sound field or audio scene. In some embodiments, the extracted first sound object comprises audio (e.g., audio signals associated with the sound) and location and position information (e.g., coordinates and orientation of a sound source associated with the first sound object in the AR/MR/XR environment). For the sake of brevity, some examples and advantages of sound object extraction (e.g., described with respect to step 810) are not described here.
In some embodiments, the method 900 includes detecting a second sound (step 902B). For example, a sound are detected by a microphone (e.g., microphone 250; microphones 250A, 250B, 250C, and 250D; microphone of handheld controller 300; microphone array 507) of a second wearable head device or a second AR/MR/XR system. In some embodiments, the sound includes a sound from a sound field or a 3-D audio scene of an AR/MR/XR environment of the second wearable head device or the second AR/MR/XR system. In some embodiments, the AR/MR/XR environment of the second device or system is the same environment as the first device or system, as described with respect to steps 902A-910A.
In some embodiments, the method 900 includes determining a second digital audio signal based on the second detected sound (step 904B). For example, the second spherical signal representation corresponding to the second sounds is derived similarly to a spherical signal representation, as described with respect to step 704A, 704B, or 904A. For the sake of brevity, this is not described here.
In some embodiments, the method 900 includes detecting a second microphone movement (step 906B). For example, the second microphone movement is detected similarly to detection of a second microphone movement, as described with respect to step 706B or 906A. For the sake of brevity, this is not described here.
In some embodiments, the method 900 includes adjusting the second digital audio signal (step 908B). For example, the second headpose is compensated (e.g., using a second function for the second headpose) similarly to compensation of a second headpose, as described with respect to step 708A, 708B, or 908A. For the sake of brevity, this is not described here.
In some embodiments, the method 900 includes generating a second fixed-orientation recording. For example, the second fixed-orientation recording is generated (e.g., by applying the second function to the second spherical signal representation) similarly to generation of a fixed-orientation recording, as described with respect to step 708A, 708B, or 908A. For the sake of brevity, this is not described here.
In some embodiments, the method 900 includes extracting second sound objects (step 910B). For example, a second sound object is extracted similarly to extraction of a first sound object, as described with respect to step 910A. For the sake of brevity, this is not described here.
In some embodiments, steps 902A-910A are performed at the same time as steps 902B-910B (e.g., the first device or system and the second device or system are recording a sound field or a 3-D audio scene at a same time). For example, a first user of a first device or system and a second user of a second device or system are recording a sound field or a 3-D audio scene together in the AR/MR/XR environment at a same time. In some embodiments, steps 902A-910A are performed at a different time than steps 902B-910B (e.g., the first device or system and the second device or system are recording a sound field or a 3-D audio scene at a different times). For example, a first user of a first device or system and a second user of a second device or system are recording a sound field or a 3-D audio scene in the AR/MR/XR environment at different times.
In some embodiments, the method 900 includes consolidating the first sound objects and the second objects (step 912). For example, the first and second sound objects are consolidated by grouping into a single larger group of sound objects. Consolidation of the sound objects allow the sound objects to be more efficiently combined with the residual in the next step.
In some embodiments, the first and second sound objects are consolidated at a server (e.g., in the cloud) that communicates with the first device or system and the second device or system (e.g., the devices or systems send the respective sound objects to the server for further processing and storage). In some embodiments, the first and second sound objects are consolidated at a master device (e.g., a first or second wearable head device or AR/MR/XR system).
In some embodiments, the method 900 includes combining the consolidated sound objects and a residual (step 914). For example, a server (e.g., in the cloud) or a master device (e.g., a first or second wearable head device or AR/MR/XR system) combines the extracted sound objects (e.g., from step 914) and a residual (e.g., portions of the sound field or audio scene not extracted as sound objects; determined from the respective sound object extraction steps 910A and 910B). In some embodiments, the combined sound objects and residual is a less complex and more efficient sound field or audio scene, compared to a sound field or audio scene without sound object extraction. In some embodiments, the residual is stored with lower spatial resolution (e.g., in a 1st order Ambisonics file). In some embodiments, a sound object is stored with higher spatial resolution (e.g., because the sound object comprises a sound of an object of interest or a dominant sound in an AR/MR/XR environment). For the sake of brevity, some examples and advantages of combining sound objects and the residual are not described here.
In some embodiments, the method 900 advantageously generates a less complex (e.g., smaller file sizes) sound field or audio scene. By extracting sound objects and rendering them at higher spatial resolution, while rendering the residual at a lower spatial resolution, the generated sound field or audio scene is more efficient (e.g., smaller file sizes, less required computational resources) than an entire sound field or audio scene that supports six degrees of freedom. Furthermore, while being more efficient, the generated sound field or audio scene does not compromise a user's AR/MR/XR experience by maintaining the more important qualities of a six-degrees-of-freedom sound field or audio scene while minimizing resources on portions that do not require more degrees of freedom. This advantage becomes greater for larger sound fields or audio scenes, which may require more than one recording device for sound detection, such as the exemplary sound fields or audio scenes described with respect to method 900.
In some embodiments, using detected data from multiple devices, as described with respect to method 900, may allow improved extraction of sound objects with more accurate location estimation. For instance, correlating data from multiple devices could help provide distance information that may be harder to estimate from a single device audio capture.
In some embodiments, a wearable head device (e.g., a wearable head device described herein, AR/MR/XR system described herein) includes: a processor; a memory; and a program stored in the memory, configured to be executed by the processor, and including instructions for performing the methods described with respect to
In some embodiments, a non-transitory computer readable storage medium stores one or more programs, and the one or more programs includes instructions. When the instructions are executed by an electronic device (e.g., an electronic device or system described herein) with one or more processors and memory, the instructions cause the electronic device to perform the methods described with respect to
Although examples of the disclosure are described with respect to a wearable head device or an AR/MR/XR system, it is understood that the disclosed sound field recording and playback methods may also be performed using other devices or systems. For example, the disclosed methods may be performed using a mobile device for compensating for effects of movement during recording or playback. As another example, the disclosed methods may be performed using a mobile device for recording a sound field including extracting sound objects and combining the sound objects and a residual.
Although examples of the disclosure are described with respect to headpose compensation, it is understood that the disclosed sound field recording and playback methods may also be performed generally for compensation of any movement. For example, the disclosed methods may be performed using a mobile device for compensating for effects of movement during recording or playback.
With respect to the systems and methods described herein, elements of the systems and methods can be implemented by one or more computer processors (e.g., CPUs or DSPs) as appropriate. The disclosure is not limited to any particular configuration of computer hardware, including computer processors, used to implement these elements. In some cases, multiple computer systems can be employed to implement the systems and methods described herein. For example, a first computer processor (e.g., a processor of a wearable device coupled to one or more microphones) can be utilized to receive input microphone signals, and perform initial processing of those signals (e.g., signal conditioning and/or segmentation). A second (and perhaps more computationally powerful) processor can then be utilized to perform more computationally intensive processing, such as determining probability values associated with speech segments of those signals. Another computer device, such as a cloud server, can host an audio processing engine, to which input signals are ultimately provided. Other suitable configurations will be apparent and are within the scope of the disclosure.
According to some embodiments, a method comprises: detecting, with a microphone of a first wearable head device, a sound of an environment; determining a digital audio signal based on the detected sound, the digital audio signal associated with a sphere having a position in the environment; concurrently with detecting the sound, detecting, via a sensor of the first wearable head device, a microphone movement with respect to the environment; adjusting the digital audio signal, wherein the adjusting comprises adjusting the position of the sphere based on based on the detected microphone movement (e.g., magnitude, direction); and presenting the adjusted digital audio signal to a user of a second wearable head device via one or more speakers of the second wearable head device.
According to some embodiments, the method further comprises: detecting, with a microphone of a third wearable head device, a second sound of the environment; determining a second digital audio signal based on the second detected sound, the second digital audio signal associated with a second sphere having a second position in the environment; concurrently, with detecting the second sound, detecting, via a sensor of the third wearable head device, a second microphone movement with respect to the environment; adjusting the second digital audio signal, wherein the adjusting comprises adjusting the second position of the second sphere based on based on the second detected microphone movement; combining the adjusted digital audio signal and the second adjusted digital audio signal; and presenting the combined first adjusted digital audio signal and second adjusted digital audio signal to the user of the second wearable head device via the one or more speakers of the second wearable head device.
According to some embodiments, the first adjusted digital audio signal and the second adjusted digital audio signal are combined at a server.
According to some embodiments, the digital audio signal comprises an Ambisonic file.
According to some embodiments, detecting the microphone movement with respect to the environment comprises performing one or more of simultaneous localization and mapping and visual inertial odometry.
According to some embodiments, the sensor comprises one or more of an inertial measurement unit, a camera, a second microphone, a gyroscope, and a LiDAR sensor.
According to some embodiments, adjusting the digital audio signal comprises applying a compensation function to the digital audio signal.
According to some embodiments, wherein applying the compensation function comprises applying the compensation function based on an inverse of the microphone movement.
According to some embodiments, the method further comprises concurrently, with presenting the adjusted digital audio signal, displaying, on a display of the second wearable head device, content associated with the sound of the environment.
According to some embodiments, a method comprises: receiving, at a wearable head device, a digital audio signal, the digital audio signal associated with a sphere having a position in the environment; detecting, via a sensor of the wearable head device, a device movement with respect to the environment; adjusting the digital audio signal, wherein the adjusting comprises adjusting the position of the sphere based on based on the detected device movement; and presenting the adjusted digital audio signal to a user of the wearable head device via one or more speakers of the wearable head device.
According to some embodiments, the method further comprises: combining a second digital audio signal and a third digital audio signal; and downmixing the combined second and third digital audio signal, wherein the retrieved first digital audio signal is the combined second and third digital audio signal.
According to some embodiments, downmixing the combined second and third digital audio signal comprises applying a first gain to the second digital audio signal and a second gain to the second digital audio signal.
According to some embodiments, downmixing the combined second and third digital audio signal comprises reducing an Ambisonic order of the second digital audio signal based on a distance of the wearable head device from a recording location of the second digital audio signal.
According to some embodiments, the sensor is an inertial measurement unit, a camera, a second microphone, a gyroscope, or a LiDAR sensor.
According to some embodiments, detecting the device movement with respect to the environment comprises performing simultaneous localization and mapping or visual inertial odometry.
According to some embodiments, adjusting the digital audio signal comprises applying a compensation function to the digital audio signal.
According to some embodiments, wherein applying the compensation function comprises applying the compensation function based on an inverse of the microphone movement.
According to some embodiments, the digital audio signal is in Ambisonics format.
According to some embodiments, the method further comprises concurrently, with presenting the adjusted digital audio signal, displaying, on a display of the wearable head device, content associated with a sound of the digital audio signal in the environment.
According to some embodiments, a method comprises: detecting sounds of an environment; extracting a sound object from the detected sounds; and combining the sound objects and a residual. The sound object comprises a first portion of the detected sounds, the first portion meeting a sound object criterion, and the residual comprises a second portion of the detected sounds, the second portion not meeting the sound object criterion.
According to some embodiments, the further comprises: detecting second sounds of the environment; determining whether a portion of the second detected sounds meets the sound object criterion, wherein: a portion of the second detected sounds meeting the sound object criterion comprises a second sound object, and a portion of the second detected sounds not meeting the sound object criterion comprises a second residual; extracting the second sound object from the second detected sounds; and consolidating the first sound object and the second sound object, wherein combining the sound object and the residual comprises combining the consolidated sound object, the first residual, and the second residual.
According to some embodiments, the sound object supports six degrees of freedom in the environment, and the residual supports three degrees of freedom in the environment.
According to some embodiments, the sound object has a higher spatial resolution than the residual.
According to some embodiments, the residual is stored in a lower order Ambisonic file.
According to some embodiments, a method, comprises: detecting, via a sensor of a wearable head device, a device movement with respect to the environment; adjusting a sound object, wherein the sound object is associated with a first sphere having a first position in the environment and the adjusting comprises adjusting the first position of the first sphere based on based on the detected device movement; adjusting a residual, wherein the residual is associated with a second sphere having a second position in the environment and the adjusting comprises adjusting the second position of the second sphere based on based on the detected device movement; mixing the adjusted sound object and the adjusted residual; and presenting the mixed adjusted sound object and adjusted residual to a user of the wearable head device via one or more speakers of the wearable head device.
According to some embodiments, a system comprises: a first wearable head device comprising a microphone and a sensor; a second wearable head device comprising a speaker; and one or more processors configured to execute a method comprising: detecting, with the microphone of the first wearable head device, a sound of an environment; determining a digital audio signal based on the detected sound, the digital audio signal associated with a sphere having a position in the environment; concurrently with detecting the sound, detecting, via the sensor of the first wearable head device, a microphone movement with respect to the environment; adjusting the digital audio signal, wherein the adjusting comprises adjusting the position of the sphere based on based on the detected microphone movement; and presenting the adjusted digital audio signal to a user of the second wearable head device via the speaker of the second wearable head device.
According to some embodiments, the system further comprises a third wearable head device comprising a microphone and a sensor, wherein the method further comprises: detecting, with the microphone of the third wearable head device, a second sound of the environment; determining a second digital audio signal based on the second detected sound, the second digital audio signal associated with a second sphere having a second position in the environment; concurrently, with detecting the second sound, detecting, via the sensor of the third wearable head device, a second microphone movement with respect to the environment; adjusting the second digital audio signal, wherein the adjusting comprises adjusting the second position of the second sphere based on based on the second detected microphone movement; combining the adjusted digital audio signal and the second adjusted digital audio signal; and presenting the combined first adjusted digital audio signal and second adjusted digital audio signal to the user of the second wearable head device via the speaker of the second wearable head device.
According to some embodiments, the first adjusted digital audio signal and the second adjusted digital audio signal are combined at a server.
According to some embodiments, the digital audio signal comprises an Ambisonic file.
According to some embodiments, detecting the microphone movement with respect to the environment comprises one or more of performing simultaneous localization and mapping and visual inertial odometry.
According to some embodiments, the sensor comprises one of more of an inertial measurement unit, a camera, a second microphone, a gyroscope, and a LiDAR sensor.
According to some embodiments, adjusting the digital audio signal comprises applying a compensation function to the digital audio signal.
According to some embodiments, wherein applying the compensation function comprises applying the compensation function based on an inverse of the microphone movement.
According to some embodiments, the method further comprises concurrently, with presenting the adjusted digital audio signal, displaying, on a display of the second wearable head device, content associated with the sound of the environment.
According to some embodiments, a system comprises: a wearable head device comprising a sensor and a speaker; and one or more processors configured to execute a method comprising: receiving, at the wearable head device, a digital audio signal, the digital audio signal associated with a sphere having a position in the environment; detecting, via the sensor of the wearable head device, a device movement with respect to the environment; adjusting the digital audio signal, wherein the adjusting comprises adjusting the position of the sphere based on based on the detected device movement; and presenting the adjusted digital audio signal to a user of the wearable head device via the speaker of the wearable head device.
According to some embodiments, the method further comprises: combining a second digital audio signal and a third digital audio signal; and downmixing the combined second and third digital audio signal, wherein the retrieved first digital audio signal is the combined second and third digital audio signal.
According to some embodiments, downmixing the combined second and third digital audio signal comprises applying a first gain to the second digital audio signal and a second gain to the second digital audio signal.
According to some embodiments, downmixing the combined second and third digital audio signal comprises reducing an Ambisonic order of the second digital audio signal based on a distance of the wearable head device from a recording location of the second digital audio signal.
According to some embodiments, the sensor is an inertial measurement unit, a camera, a second microphone, a gyroscope, or a LiDAR sensor.
According to some embodiments, detecting the device movement with respect to the environment comprises performing simultaneous localization and mapping or visual inertial odometry.
According to some embodiments, adjusting the digital audio signal comprises applying a compensation function to the digital audio signal.
According to some embodiments, wherein applying the compensation function comprises applying the compensation function based on an inverse of the microphone movement.
According to some embodiments, the digital audio signal is in Ambisonics format.
According to some embodiments, the wearable head device further comprises a display, and the method further comprises concurrently, with presenting the adjusted digital audio signal, displaying, on the display of the wearable head device, content associated with a sound of the digital audio signal in the environment.
According to some embodiments, a system comprises one or more processors configured to execute a method comprising: detecting sounds of an environment; extracting a sound object from the detected sounds; and combining the sound objects and a residual. The sound object comprises a first portion of the detected sounds, the first portion meeting a sound object criterion, and the residual comprises a second portion of the detected sounds, the second portion not meeting the sound object criterion.
According to some embodiments, the method further comprises: detecting second sounds of the environment; determining whether a portion of the second detected sounds meets the sound object criterion, wherein: a portion of the second detected sounds meeting the sound object criterion comprises a second sound object, and a portion of the second detected sounds not meeting the sound object criterion comprises a second residual; extracting the second sound object from the second detected sounds; and consolidating the first sound object and the second sound object, wherein combining the sound object and the residual comprises combining the consolidated sound object, the first residual, and the second residual.
According to some embodiments, the sound object supports six degrees of freedom in the environment, and the residual supports three degrees of freedom in the environment.
According to some embodiments, the sound object has a higher spatial resolution than the residual.
According to some embodiments, the residual is stored in a lower order Ambisonic file.
According to some embodiments, a system comprises: a wearable head device comprising a sensor and a speaker; and one or more processors configured to execute a method comprising: detecting, via the sensor of the wearable head device, a device movement with respect to the environment; adjusting a sound object, wherein the sound object is associated with a first sphere having a first position in the environment and the adjusting comprises adjusting the first position of the first sphere based on based on the detected device movement; adjusting a residual, wherein the residual is associated with a second sphere having a second position in the environment and the adjusting comprises adjusting the second position of the second sphere based on based on the detected device movement; mixing the adjusted sound object and the adjusted residual; and presenting the mixed adjusted sound object and adjusted residual to a user of the wearable head device via the speaker of the wearable head device.
According to some embodiments, a non-transitory computer-readable medium stores one or more instructions, which, when executed by one or more processors of an electronic device, cause the device to perform a method comprising: detecting, with a microphone of a first wearable head device, a sound of an environment; determining a digital audio signal based on the detected sound, the digital audio signal associated with a sphere having a position in the environment; concurrently with detecting the sound, detecting, via a sensor of the first wearable head device, a microphone movement with respect to the environment; adjusting the digital audio signal, wherein the adjusting comprises adjusting the position of the sphere based on based on the detected microphone movement; and presenting the adjusted digital audio signal to a user of a second wearable head device via one or more speakers of the second wearable head device.
According to some embodiments, the method further comprises: detecting, with a microphone of a third wearable head device, a second sound of the environment; determining a second digital audio signal based on the second detected sound, the second digital audio signal associated with a second sphere having a second position in the environment; concurrently, with detecting the second sound, detecting, via a sensor of the third wearable head device, a second microphone movement with respect to the environment; adjusting the second digital audio signal, wherein the adjusting comprises adjusting the second position of the second sphere based on based on the second detected microphone movement; combining the adjusted digital audio signal and the second adjusted digital audio signal; and presenting the combined first adjusted digital audio signal and second adjusted digital audio signal to the user of the second wearable head device via the speaker of the second wearable head device.
According to some embodiments, the first digital audio signal and the second digital audio signal are combined at a server.
According to some embodiments, the digital audio signal comprises an Ambisonic file.
According to some embodiments, detecting the microphone movement with respect to the environment comprises performing one or more of simultaneous localization and mapping and visual inertial odometry.
According to some embodiments, the sensor comprises one of more of an inertial measurement unit, a camera, a second microphone, a gyroscope, and a LiDAR sensor.
According to some embodiments, adjusting the digital audio signal comprises applying a compensation function to the digital audio signal.
According to some embodiments, wherein applying the compensation function comprises applying the compensation function based on an inverse of the microphone movement.
According to some embodiments, the method further comprises concurrently, with presenting the adjusted digital audio signal, displaying, on a display of the second wearable head device, content associated with the sound of the environment.
According to some embodiments, a non-transitory computer-readable medium stores one or more instructions, which, when executed by one or more processors of an electronic device, cause the device to perform a method comprising: receiving, at a wearable head device, a digital audio signal, the digital audio signal associated with a sphere having a position in the environment; detecting, via a sensor of the wearable head device, a device movement with respect to the environment; adjusting the digital audio signal, wherein the adjusting comprises adjusting the position of the sphere based on based on the detected device movement; and presenting the adjusted digital audio signal to a user of the wearable head device via one or more speakers of the wearable head device.
According to some embodiments, the method further comprises: combining a second digital audio signal and a third digital audio signal; and downmixing the combined second and third digital audio signal, wherein the retrieved first digital audio signal is the combined second and third digital audio signal.
According to some embodiments, downmixing the combined second and third digital audio signal comprises applying a first gain to the second digital audio signal and a second gain to the second digital audio signal.
According to some embodiments, downmixing the combined second and third digital audio signal comprises reducing an Ambisonic order of the second digital audio signal based on a distance of the wearable head device from a recording location of the second digital audio signal.
According to some embodiments, the sensor is an inertial measurement unit, a camera, a second microphone, a gyroscope, or a LiDAR sensor.
According to some embodiments, detecting the device movement with respect to the environment comprises performing simultaneous localization and mapping or visual inertial odometry.
According to some embodiments, adjusting the digital audio signal comprises applying a compensation function to the digital audio signal.
According to some embodiments, wherein applying the compensation function comprises applying the compensation function based on an inverse of the microphone movement.
According to some embodiments, the digital audio signal is in Ambisonics format.
According to some embodiments, the method further comprises concurrently, with presenting the adjusted digital audio signal, displaying, on a display of the wearable head device, content associated with a sound of the digital audio signal in the environment.
According to some embodiments, a non-transitory computer-readable medium stores one or more instructions, which, when executed by one or more processors of an electronic device, cause the device to perform a method comprising: detecting sounds of an environment; extracting a sound object from the detected sounds; and combining the sound objects and a residual. The sound object comprises a first portion of the detected sounds, the first portion meeting a sound object criterion, and the residual comprises a second portion of the detected sounds, the second portion not meeting the sound object criterion.
According to some embodiments, the method further comprises: detecting second sounds of the environment; determining whether a portion of the second detected sounds meets the sound object criterion, wherein: a portion of the second detected sounds meeting the sound object criterion comprises a second sound object, and a portion of the second detected sounds not meeting the sound object criterion comprises a second residual; extracting the second sound object from the second detected sounds; and consolidating the first sound object and the second sound object, wherein combining the sound object and the residual comprises combining the consolidated sound object, the first residual, and the second residual.
According to some embodiments, the sound object supports six degrees of freedom in the environment, and the residual supports three degrees of freedom in the environment.
According to some embodiments, the sound object has a higher spatial resolution than the residual.
According to some embodiments, the residual is stored in a lower order Ambisonic file.
According to some embodiments, a non-transitory computer-readable medium stores one or more instructions, which, when executed by one or more processors of an electronic device, cause the device to perform a method comprising: detecting, via a sensor of a wearable head device, a device movement with respect to the environment; adjusting a sound object, wherein the sound object is associated with a first sphere having a first position in the environment and the adjusting comprises adjusting the first position of the first sphere based on based on the detected device movement; adjusting a residual, wherein the residual is associated with a second sphere having a second position in the environment and the adjusting comprises adjusting the second position of the second sphere based on based on the detected device movement; mixing the adjusted sound object and the adjusted residual; and presenting the mixed adjusted sound object and adjusted residual to a user of the wearable head device via one or more speakers of the wearable head device.
Although the disclosed examples have been fully described with reference to the accompanying drawings, it is to be noted that various changes and modifications will become apparent to those skilled in the art. For example, elements of one or more implementations may be combined, deleted, modified, or supplemented to form further implementations. Such changes and modifications are to be understood as being included within the scope of the disclosed examples as defined by the appended claims.
This application claims priority to U.S. Provisional Application No. 63/252,391, filed on Oct. 5, 2021, the contents of which are incorporated by reference herein in their entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2022/077487 | 10/3/2022 | WO |
Number | Date | Country | |
---|---|---|---|
63252391 | Oct 2021 | US |