This disclosure relates generally to systems and methods for audio signal processing, and in particular to systems and methods for presenting audio signals in a mixed reality environment.
Augmented reality and mixed reality systems place unique demands on the presentation of binaural audio signals to a user. On one hand, presentation of audio signals in a realistic manner—for example, in a manner consistent with the user's expectations—is crucial for creating augmented or mixed reality environments that are immersive and believable. On the other hand, the computational expense of processing such audio signals can be prohibitive, particularly for mobile systems that may feature limited processing power and battery capacity.
One particular challenge is the simulation of near-field audio effects. Near-field effects are important for re-creating the impression of a sound source coming very close to a user's head. Near-field effects can be computed using databases of head-related transfer functions (HRTFs). However, typical HRTF databases include HRTFs measured at a single distance in the far-field from the user's head (e.g., more than 1 meter from the user's head), and may lack HRTFs at distances suitable for near-field effects. Even if the HRTF databases included measured or simulated HRTFs for different distances from the user's head (e.g., less than 1 meter from the user's head), it may be computationally expensive to directly use a high number of HRTFs for real-time audio rendering applications. Accordingly, systems and methods are desired for modeling near-field audio effects using far-field HRTFs in a computationally efficient manner.
Examples of the disclosure describe systems and methods for presenting an audio signal to a user of a wearable head device. According to an example method, a source location corresponding to the audio signal is identified. An acoustic axis corresponding to the audio signal is determined. For each of a respective left and right ear of the user, an angle between the acoustic axis and the respective ear is determined. For each of the respective left and right ear of the user, a virtual speaker position, of a virtual speaker array, is determined, the virtual speaker position collinear with the source location and with a position of the respective ear. The virtual speaker array comprises a plurality of virtual speaker positions, each virtual speaker position of the plurality located on the surface of a sphere concentric with the user's head, the sphere having a first radius. For each of the respective left and right ear of the user, a head-related transfer function (HRTF) corresponding to the virtual speaker position and to the respective ear is determined; a source radiation filter is determined based on the determined angle; the audio signal is processed to generate an output audio signal for the respective ear; and the output audio signal is presented to the respective ear of the user via one or more speakers associated with the wearable head device. Processing the audio signal comprises applying the HRTF and the source radiation filter to the audio signal.
In the following description of examples, reference is made to the accompanying drawings which form a part hereof, and in which it is shown by way of illustration specific examples that can be practiced. It is to be understood that other examples can be used and structural changes can be made without departing from the scope of the disclosed examples.
In some examples involving augmented reality or mixed reality applications, it may be desirable to transform coordinates from a local coordinate space (e.g., a coordinate space fixed relative to headgear device 400A) to an inertial coordinate space, or to an environmental coordinate space. For instance, such transformations may be necessary for a display of headgear device 400A to present a virtual object at an expected position and orientation relative to the real environment (e.g., a virtual person sitting in a real chair, facing forward, regardless of the position and orientation of headgear device 400A), rather than at a fixed position and orientation on the display (e.g., at the same position in the display of headgear device 400A). This can maintain an illusion that the virtual object exists in the real environment (and does not, for example, appear positioned unnaturally in the real environment as the headgear device 400A shifts and rotates). In some examples, a compensatory transformation between coordinate spaces can be determined by processing imagery from the depth cameras 444 (e.g., using a Simultaneous Localization and Mapping (SLAM) and/or visual odometry procedure) in order to determine the transformation of the headgear device 400A relative to an inertial or environmental coordinate system. In the example shown in
In some examples, the depth cameras 444 can supply 3D imagery to a hand gesture tracker 411, which may be implemented in a processor of headgear device 400A. The hand gesture tracker 411 can identify a user's hand gestures, for example by matching 3D imagery received from the depth cameras 444 to stored patterns representing hand gestures. Other suitable techniques of identifying a user's hand gestures will be apparent.
In some examples, one or more processors 416 may be configured to receive data from headgear subsystem 404B, the IMU 409, the SLAM/visual odometry block 406, depth cameras 444, microphones 450, and/or the hand gesture tracker 411. The processor 416 can also send control signals to, and receive control signals from, the 6DOF totem system 404A. The processor 416 may be coupled to the 6DOF totem system 404A wirelessly, such as in examples where the handheld controller 400B is untethered. Processor 416 may further communicate with additional components, such as an audio-visual content memory 418, a Graphical Processing Unit (GPU) 420, and/or a Digital Signal Processor (DSP) audio spatializer 422. The DSP audio spatializer 422 may be coupled to a Head Related Transfer Function (HRTF) memory 425. The GPU 420 can include a left channel output coupled to the left source of imagewise modulated light 424 and a right channel output coupled to the right source of imagewise modulated light 426. GPU 420 can output stereoscopic image data to the sources of imagewise modulated light 424, 426. The DSP audio spatializer 422 can output audio to a left speaker 412 and/or a right speaker 414. The DSP audio spatializer 422 can receive input from processor 416 indicating a direction vector from a user to a virtual sound source (which may be moved by the user, e.g., via the handheld controller 400B). Based on the direction vector, the DSP audio spatializer 422 can determine a corresponding HRTF (e.g., by accessing an HRTF, or by interpolating multiple HRTFs). The DSP audio spatializer 422 can then apply the determined HRTF to an audio signal, such as an audio signal corresponding to a virtual sound generated by a virtual object.
This can enhance the believability and realism of the virtual sound, by incorporating the relative position and orientation of the user relative to the virtual sound in the mixed reality environment—that is, by presenting a virtual sound that matches a user's expectations of what that virtual sound would sound like if it were a real sound in a real environment.
In some examples, such as shown in
While
Audio Rendering
The systems and methods described below can be implemented in an augmented reality or mixed reality system, such as described above. For example, one or more processors (e.g., CPUs, DSPs) of an augmented reality system can be used to process audio signals or to implement steps of computer-implemented methods described below; sensors of the augmented reality system (e.g., cameras, acoustic sensors, IMUs, LIDAR, GPS) can be used to determine a position and/or orientation of a user of the system, or of elements in the user's environment; and speakers of the augmented reality system can be used to present audio signals to the user. In some embodiments, external audio playback devices (e.g. headphones, earbuds) could be used instead of the system's speakers for delivering the audio signal to the user's ears.
In augmented reality or mixed reality systems such as described above, one or more processors (e.g., DSP audio spatializer 422) can process one or more audio signals for presentation to a user of a wearable head device via one or more speakers (e.g., left and right speakers 412/414 described above). Processing of audio signals requires tradeoffs between the authenticity of a perceived audio signal—for example, the degree to which an audio signal presented to a user in a mixed reality environment matches the user's expectations of how an audio signal would sound in a real environment—and the computational overhead involved in processing the audio signal.
Modeling near-field audio effects can improve the authenticity of a user's audio experience, but can be computationally prohibitive. In some embodiments, an integrated solution may combine a computationally efficient rendering approach with one or more near-field effects for each ear. The one or more near-field effects for each ear may include, for example, parallax angles in simulation of sound incident for each ear, interaural time differences (ITDs) based on object position and anthropometric data, near-field level changes due to distance, and/or magnitude response changes due to proximity to the user's head and/or source radiation variation due to parallax angles. In some embodiments, the integrated solution may be computationally efficient so as to not excessively increase computational cost.
In a far-field, as a sound source moves closer or farther from a user, changes at the user's ears may be the same for each ear and may be an attenuation of a signal for the sound source. In a near-field, as a sound source moves closer or farther from the user, changes at the user's ears may be different for each ear and may be more than just attenuations of the signal for the sound source. In some embodiments, the near-field and far-field boundaries may be where the conditions change.
In some embodiments, a virtual speaker array (VSA) may be a discrete set of positions on a sphere centered at a center of the user's head. For each position on the sphere, a pair (e.g., left-right pair) of HRTFs is provided. In some embodiments, a near-field may be a region inside the VSA and a far-field may be a region outside the VSA. At the VSA, either a near-field approach or a far-field approach may be used.
A distance from a center of the user's head to a VSA may be a distance at which the HRTFs were obtained. For example, the HRTF filters may be measured or synthesized from simulation. The measured/simulated distance from the VSA to the center of the user's head may be referred to as “measured distance” (MD). A distance from a virtual sound source to the center of the user's head may be referred to as “source distance” (SD).
In the example, the left ear VSA module 510 can pan the left signal 504 over a set of N channels respectively feeding a set of left-ear HRTF filters 550 (L1, . . . LN) in a HRTF filter bank 540. The left-ear HRTF filters 550 may be substantially delay-free. Panning gains 512 (gL1, . . . gLN) of the left ear VSA module may be functions of a left incident angle (angL). The left incident angle may be indicative of a direction of incidence of sound relative to a frontal direction from the center of the user's head. Though shown from a top-down perspective with respect to the user's head in the figure, the left incident angle can comprise an angle in three dimensions; that is, the left incident angle can include an azimuth and/or an elevation angle.
Similarly, in the example, the right ear VSA module 520 can pan the right signal 506 over a set of M channels respectively feeding a set of right-ear HRTF filters 560 (R1, . . . RM) in the HRTF filter bank 540. The right-ear HRTF filters 560 may be substantially delay-free. (Although only one HRTF filter bank is shown in the figure, multiple HRTF filter banks, including those stored across distributed systems, are contemplated.) Panning gains 522 (gR1, . . . gRM) of the right ear VSA module may be functions of a right incident angle (angR). The right incident angle may be indicative of a direction of incidence of sound relative to the frontal direction from the center of the user's head. As above, the right incident angle can comprise an angle in three dimensions; that is, the right incident angle can include an azimuth and/or an elevation angle.
In some embodiments, such as shown, the left ear VSA module 510 may pan the left signal 504 over N channels and the right ear VSA module may pan the right signal over M channels. In some embodiments, N and M may be equal. In some embodiments, N and M may be different. In these embodiments, the left ear VSA module may feed into a set of left-ear HRTF filters (L1, . . . LN) and the right ear VSA module may feed into a set of right-ear HRTF filters (R1, . . . RM), as described above. Further, in these embodiments, panning gains (gL1, . . . gLN) of the left ear VSA module may be functions of a left ear incident angle (angL) and panning gains (gR1, . . . gRM) of the right ear VSA module may be functions of a right ear incident angle (angR), as described above.
The example system illustrates a single encoder 503 and corresponding input signal 501. The input signal may correspond to a virtual sound source. In some embodiments, the system may include additional encoders and corresponding input signals. In these embodiments, the input signals may correspond to virtual sound sources. That is, each input signal may correspond to a virtual sound source.
In some embodiments, when simultaneously rendering several virtual sound sources, the system may include an encoder per virtual sound source. In these embodiments, a mix module (e.g., 530 in
In some embodiments, the left incident angle 652 (angL) used for computing a left ear signal panning may be derived by computing an intersection of a line going from the user's left ear through a location of the virtual sound source 610, and a sphere containing the VSA 650. A panning angle combination (azimuth and elevation) may be computed for 3D environments as a spherical coordinate angle from the center 620 of the user's head to the intersection point.
Similarly, in some embodiments, the right incident angle 654 (angR) used for computing a right ear signal panning may be derived by computing an intersection of a line going from the user's right ear through the location of the virtual sound source 610, and the sphere containing the VSA 650. A panning angle combination (azimuth and elevation) may be computed for 3D environments as a spherical coordinate angle from the center 620 of the user's head to the intersection point.
In some embodiments, an intersection between a line and a sphere may be computed, for example, by combining an equation representing the line and an equation representing the sphere.
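As a non-limiting illustration, the line-sphere intersection and the resulting panning angles may be sketched as follows. The sketch assumes a head coordinate system with its origin at the center of the user's head, x pointing forward, y pointing left, and z pointing up; the function name `incident_angle` and these axis conventions are hypothetical and are not taken from the disclosure.

```python
import math

def incident_angle(ear, source, radius):
    """Return (azimuth_deg, elevation_deg), seen from the head center, of the
    point where the ray from `ear` through `source` crosses a sphere of the
    given radius centered at the origin (the VSA sphere).

    Assumed axes (hypothetical): x forward, y left, z up.
    """
    ex, ey, ez = ear
    dx, dy, dz = source[0] - ex, source[1] - ey, source[2] - ez
    # Substitute p(t) = ear + t * d into |p|^2 = radius^2,
    # which gives the quadratic a*t^2 + b*t + c = 0.
    a = dx * dx + dy * dy + dz * dz
    if a == 0.0:
        raise ValueError("source coincides with the ear; direction undefined")
    b = 2.0 * (ex * dx + ey * dy + ez * dz)
    c = ex * ex + ey * ey + ez * ez - radius * radius
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        raise ValueError("line does not intersect the sphere")
    # With the ear inside the sphere (c < 0) the larger root is the
    # intersection on the source side of the ear.
    t = (-b + math.sqrt(disc)) / (2.0 * a)
    px, py, pz = ex + t * dx, ey + t * dy, ez + t * dz
    azimuth = math.degrees(math.atan2(py, px))
    elevation = math.degrees(math.asin(max(-1.0, min(1.0, pz / radius))))
    return azimuth, elevation
```

For a source close to the head on the median plane, calling the function once per ear yields two different azimuths of opposite sign, which is the parallax effect described above.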
In some embodiments, the left incident angle 612 (angL) used for computing a left ear signal panning may be derived by computing an intersection of a line going from the user's left ear through a location of the virtual sound source 610, and a sphere containing the VSA 650. A panning angle combination (azimuth and elevation) may be computed for 3D environments as a spherical coordinate angle from the center 620 of the user's head to the intersection point.
Similarly, in some embodiments, the right incident angle 614 (angR) used for computing a right ear signal panning may be derived by computing an intersection of a line going from the user's right ear through the location of the virtual sound source 610, and the sphere containing the VSA 650. A panning angle combination (azimuth and elevation) may be computed for 3D environments as a spherical coordinate angle from the center 620 of the user's head to the intersection point.
In some embodiments, an intersection between a line and a sphere may be computed, for example, by combining an equation representing the line and an equation representing the sphere.
In some embodiments, rendering schemes may not differentiate the left incident angle 612 and the right incident angle 614, and instead assume the left incident angle 612 and the right incident angle 614 are equal. However, assuming the left incident angle 612 and the right incident angle 614 are equal may not be applicable or acceptable when reproducing near-field effects as described with respect to
In some embodiments, the geometric model illustrated in
In some embodiments, distances may be clamped. Clamping may include, for example, limiting distance values below a threshold value to another value. In some embodiments, clamping may include using the limited distance values (referred to as clamped distance values), instead of the actual distance values, for computations. Hard clamping may include limiting distance values below a threshold value to the threshold value. For example, if a threshold value is 5 millimeters, then distance values less than the threshold value will be set to the threshold value, and the threshold value, instead of the actual distance value which is less than the threshold value, may be used for computations. Soft clamping may include limiting distance values such that as the distance values approach or go below a threshold value, they asymptotically approach the threshold value. In some embodiments, instead of, or in addition to, clamping, distance values may be increased by a predetermined amount such that the distance values are never less than the predetermined amount.
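As a non-limiting illustration, hard clamping, soft clamping, and the offset alternative may be sketched as follows. The disclosure does not specify a soft-clamping curve; the softplus-shaped curve below, its `knee` parameter, and all function names are assumptions chosen only because the curve approaches the threshold asymptotically for small distances and approaches the actual distance for large ones.

```python
import math

def hard_clamp(distance, threshold):
    """Distances below the threshold are replaced by the threshold."""
    return max(distance, threshold)

def soft_clamp(distance, threshold, knee):
    """Smooth clamp (assumed softplus shape, not from the disclosure).

    Approaches `threshold` asymptotically as `distance` falls below it and
    approaches `distance` when `distance` is well above it. `knee` (same
    unit as the distances) sets how gradual the transition is.
    """
    x = (distance - threshold) / knee
    # Numerically stable softplus: log(1 + exp(x)).
    softplus = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return threshold + knee * softplus

def offset_distance(distance, offset):
    """Alternative: add a fixed amount so the result is never below it."""
    return max(distance, 0.0) + offset
```

With a threshold of 0.005 (5 millimeters expressed in meters), `hard_clamp(0.002, 0.005)` returns 0.005, while `soft_clamp` returns a value slightly above 0.005 that varies smoothly as the distance changes.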
In some embodiments, a first minimum distance from the ears of the listener may be used for computing gains and a second minimum distance from the ears of the listener may be used for computing other sound source position parameters such as, for example, angles used for computing HRTF filters, interaural time differences, and the like. In some embodiments, the first minimum distance and the second minimum distance may be different.
In some embodiments, the minimum distance used for computing gains may be a function of one or more properties of the sound source. In some embodiments, the minimum distance used for computing gains may be a function of a level (e.g., RMS value of a signal over a number of frames) of the sound source, a size of the sound source, or radiation properties of the sound source, and the like.
In some embodiments, gains computed from distance may be limited directly in lieu of limiting minimum distance used to compute gains. In other words, the gain may be computed based on distance as a first step, and in a second step the gain may be clamped to not exceed a predetermined threshold value.
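As a non-limiting illustration, the two-step approach (compute the gain from distance, then clamp the gain) may be sketched as follows. The disclosure does not state the distance-to-gain law; the inverse-distance law referenced to a reference distance, and the names `clamped_distance_gain`, `reference_distance`, and `max_gain`, are assumptions made for the sketch.

```python
def clamped_distance_gain(distance, reference_distance, max_gain):
    """Linear gain from distance, limited to `max_gain`.

    Step 1 (assumed law): gain = reference_distance / distance.
    Step 2: clamp the gain so it never exceeds `max_gain`.
    """
    if distance <= 0.0:
        # The unclamped gain would be unbounded; return the limit directly.
        return max_gain
    gain = reference_distance / distance
    return min(gain, max_gain)
```

Clamping the gain rather than the distance leaves the actual distance available for other position-dependent parameters, such as angles and interaural time differences.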
In some embodiments, as a sound source gets closer to the head of the listener, a magnitude response of the sound source may change. For example, as a sound source gets closer to the head of the listener, low frequencies at an ipsilateral ear may be amplified and/or high frequencies at a contralateral ear may be attenuated. Changes in the magnitude response may lead to changes in interaural level differences (ILDs).
In some embodiments, changes in magnitude response may be taken into account by, for example, considering HRTF filters used in binaural rendering. In the case of a VSA, the HRTF filters may be approximated as HRTFs corresponding to a position used for computing right ear panning and a position used for computing left ear panning (e.g., as illustrated in
In some embodiments, parallax HRTF angles may be computed and then used to compute more accurate compensation filters. For example, referring to
In some embodiments, once attenuations due to distance have been taken into account, magnitude differences may be captured with additional signal processing. In some embodiments, the additional signal processing may consist of a gain, a low shelving filter, and a high shelving filter to be applied to each ear signal.
In some embodiments, a broadband gain may be computed for angles up to 120 degrees, for example, according to equation 1:
gain_db=2.5*sin(angleMD_deg*3/2) (Equation 1)
where angleMD_deg may be an angle of a corresponding HRTF at a MD, for example, relative to a position of an ear of the user. In some embodiments, angles other than 120 degrees may be used. In these embodiments, Equation 1 may be modified per the angle used.
In some embodiments, a broadband gain may be computed for angles greater than 120 degrees, for example, according to equation 2:
gain_db=2.5*sin(180+3*(angleMD_deg−120)) (Equation 2)
In some embodiments, angles other than 120 degrees may be used. In these embodiments, Equation 2 may be modified per the angle used.
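As a non-limiting illustration, Equations 1 and 2 may be combined into a single piecewise function. The sketch assumes that the sine arguments are expressed in degrees, a reading under which both equations evaluate to 0 dB at 120 degrees so that the two pieces join continuously; the function name `broadband_gain_db` is hypothetical.

```python
import math

def broadband_gain_db(angle_md_deg):
    """Broadband gain in dB per Equations 1 and 2.

    Assumption: the sine arguments are in degrees, so they are converted
    to radians before calling math.sin.
    """
    if angle_md_deg <= 120.0:
        # Equation 1
        return 2.5 * math.sin(math.radians(angle_md_deg * 3.0 / 2.0))
    # Equation 2
    return 2.5 * math.sin(math.radians(180.0 + 3.0 * (angle_md_deg - 120.0)))
```

Under this reading the gain peaks at +2.5 dB at 60 degrees, returns to 0 dB at 120 degrees, and reaches -2.5 dB at 150 degrees.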
In some embodiments, a low shelving filter gain may be computed, for example, according to equation 3:
lowshelfgain_db=2.5*(e^(−angleMD_deg/65)−e^(−180/65)) (Equation 3)
In some embodiments, other angles may be used. In these embodiments, Equation 3 may be modified per the angle used.
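As a non-limiting illustration, Equation 3 may be evaluated as follows, reading its two exponential terms as e raised to the power −angleMD_deg/65 and e raised to the power −180/65. That reading and the function name `low_shelf_gain_db` are assumptions of the sketch.

```python
import math

def low_shelf_gain_db(angle_md_deg):
    """Low shelving filter gain in dB per Equation 3.

    The second exponential is a constant offset that makes the gain
    exactly 0 dB at 180 degrees.
    """
    return 2.5 * (math.exp(-angle_md_deg / 65.0) - math.exp(-180.0 / 65.0))
```

Under this reading the low shelf boost is largest (about 2.3 dB) at 0 degrees and decays monotonically to 0 dB at 180 degrees.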
In some embodiments, a high shelving filter gain may be computed for angles larger than 110 degrees, for example, according to equation 4:
highshelfgain_db=3.3*(cos((angle_deg*180/pi−110)*3)−1) (Equation 4)
where angle_deg may be an angle of the source, relative to the position of the ear of the user. In some embodiments, angles other than 110 degrees may be used. In these embodiments, Equation 4 may be modified per the angle used.
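As a non-limiting illustration, Equation 4 may be evaluated as follows. Two assumptions are made and are not confirmed by the disclosure: first, the factor 180/pi in Equation 4 is read as converting a source angle held in radians into degrees, so the sketch takes its input in radians; second, the gain is taken to be 0 dB at or below 110 degrees, where Equation 4 is not stated to apply. The function name `high_shelf_gain_db` is hypothetical.

```python
import math

def high_shelf_gain_db(angle_rad):
    """High shelving filter gain in dB per Equation 4.

    Assumptions: `angle_rad` is in radians (hence the 180/pi conversion),
    the cosine argument is in degrees, and the gain is 0 dB for angles at
    or below 110 degrees.
    """
    angle_in_degrees = angle_rad * 180.0 / math.pi
    if angle_in_degrees <= 110.0:
        return 0.0
    cos_arg_deg = (angle_in_degrees - 110.0) * 3.0
    return 3.3 * (math.cos(math.radians(cos_arg_deg)) - 1.0)
```

Under these assumptions the high shelf gain is continuous at 110 degrees and reaches its largest cut, -6.6 dB, at 170 degrees.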
The aforementioned effects (e.g., gain, low shelving filter, and high shelving filter) may be attenuated as a function of distance. In some embodiments, a distance attenuation factor may be computed, for example, according to equation 5:
distanceAttenuation=(HR/(HR−MD))*(1−MD/sourceDistance_clamped) (Equation 5)
where HR is the head radius, MD is the measured distance, and sourceDistance_clamped is the source distance clamped to be at least as big as the head radius.
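As a non-limiting illustration, Equation 5 may be evaluated as follows. The function and parameter names are hypothetical; the clamping of the source distance to at least the head radius follows the definition given above.

```python
def distance_attenuation(source_distance, head_radius, measured_distance):
    """Distance attenuation factor per Equation 5.

    All three arguments share one unit (for example, meters), and
    `head_radius` must differ from `measured_distance`.
    """
    # Clamp the source distance so it is at least the head radius.
    clamped = max(source_distance, head_radius)
    scale = head_radius / (head_radius - measured_distance)
    return scale * (1.0 - measured_distance / clamped)
```

The factor equals 1 when the source is at the surface of the head (the near-field effects apply in full) and equals 0 when the source is at the measured distance (the effects vanish at the VSA). The expression becomes negative beyond the measured distance; the disclosure does not state how that range is handled, so a caller would presumably apply the factor only in the near-field.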
In some embodiments, a filter (e.g., an EQ filter) may be applied for a sound source placed at the center of the user's head. The EQ filter may be used to reduce abrupt timbre changes as the sound source moves through the user's head. In some embodiments, the EQ filter may be scaled to match a magnitude response at the surface of the user's head as the simulated sound source moves from the center of the user's head to the surface of the user's head, and thus further reduce a risk of abrupt magnitude response changes when the sound source moves in and out of the user's head. In some embodiments, a crossfade between an equalized signal and an unprocessed signal may be used based on a position of the sound source between the center of the user's head and the surface of the user's head.
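As a non-limiting illustration, a position-based crossfade between an equalized signal and an unprocessed signal may be sketched as follows. The linear crossfade law, the choice of the equalized signal at the center and the unprocessed signal at the surface, and the function name are assumptions; the disclosure does not specify the crossfade curve.

```python
def crossfade_inside_head(equalized, unprocessed, source_distance, head_radius):
    """Blend two equal-length sample sequences by source position.

    Assumed linear law: weight 0 at the head center (fully equalized) and
    weight 1 at or beyond the head surface (fully unprocessed).
    """
    weight = min(max(source_distance / head_radius, 0.0), 1.0)
    return [(1.0 - weight) * e + weight * u
            for e, u in zip(equalized, unprocessed)]
```

Because the weight varies continuously with the source distance, the output changes gradually as the sound source moves between the center and the surface of the head.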
In some embodiments, the EQ filter may be automatically computed as an average of the filters used to render a source on a surface of a head of the user. The EQ filter may be exposed to the user as a set of tunable/configurable parameters. In some embodiments, the tunable/configurable parameters may include control frequencies and associated gains.
In some embodiments, to optimize computing resources, a system may automatically switch between the signal flows 1200 and 1300, for example, based on whether the sound source to be rendered is in the far-field or in the near-field. In some embodiments, a filter state may need to be copied between the filters (e.g., the source radiation filter, the left ear near-field and source radiation filter and the right ear near-field and source radiation filter) during transitioning in order to avoid processing artifacts.
In some embodiments, the EQ filters described above may be bypassed when their settings are perceptually equivalent to a flat magnitude response with 0 dB gain. If the response is flat but with a gain different than zero, a broadband gain may be used to efficiently achieve the desired result.
A head coordinate system may be used for computing acoustic propagation from an audio object to ears of a listener. A device coordinate system may be used by a tracking device (such as one or more sensors of a wearable head device in an augmented reality system, such as described above) to track position and orientation of a head of a listener. In some embodiments, the head coordinate system and the device coordinate system may be different. A center of the head of the listener may be used as the origin of the head coordinate system, and may be used to reference a position of the audio object relative to the listener with a forward direction of the head coordinate system defined as going from the center of the head of the listener to a horizon in front of the listener. In some embodiments, an arbitrary point in space may be used as the origin of the device coordinate system. In some embodiments, the origin of the device coordinate system may be a point located in between optical lenses of a visual projection system of the tracking device. In some embodiments, the forward direction of the device coordinate system may be referenced to the tracking device itself, and dependent on the position of the tracking device on the head of the listener. In some embodiments, the tracking device may have a non-zero pitch (i.e. be tilted up or down) relative to a horizontal plane of the head coordinate system, leading to a misalignment between the forward direction of the head coordinate system and the forward direction of the device coordinate system.
In some embodiments, the difference between the head coordinate system and the device coordinate system may be compensated for by applying a transformation to the position of the audio object relative to the head of the listener. In some embodiments, the difference in the origin of the head coordinate system and the device coordinate system may be compensated for by translating the position of the audio objects relative to the head of the listener by an amount equal to the distance between the origin of the head coordinate system and the origin of the device coordinate system reference points in three dimensions (e.g., x, y, and z). In some embodiments, the difference in angles between the head coordinate system axes and the device coordinate system axes may be compensated for by applying a rotation to the position of the audio object relative to the head of the listener. For instance, if the tracking device is tilted downward by N degrees, the position of the audio object could be rotated downward by N degrees prior to rendering the audio output for the listener. In some embodiments, audio object rotation compensation may be applied before audio object translation compensation. In some embodiments, compensations (e.g., rotation, translation, scaling, and the like) may be taken together in a single transformation including all the compensations (e.g., rotation, translation, scaling, and the like).
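As a non-limiting illustration, a pitch rotation followed by a translation may be sketched as follows. The axis convention (x forward, y left, z up, with pitch as a rotation about the y axis), the sign conventions, and the names `device_to_head`, `pitch_down_deg`, and `origin_offset` are assumptions of the sketch.

```python
import math

def device_to_head(position, pitch_down_deg, origin_offset):
    """Map an audio object position from device to head coordinates.

    Assumed axes (hypothetical): x forward, y left, z up.
    The position is first rotated downward by `pitch_down_deg` about the
    y axis, then translated by `origin_offset` (the per-axis distance
    between the two coordinate system origins).
    """
    x, y, z = position
    angle = math.radians(pitch_down_deg)
    # Rotation about y that moves a forward-pointing vector downward.
    xr = x * math.cos(angle) + z * math.sin(angle)
    zr = -x * math.sin(angle) + z * math.cos(angle)
    ox, oy, oz = origin_offset
    return (xr + ox, y + oy, zr + oz)
```

The same two steps could equally be folded into a single homogeneous transformation matrix, consistent with the combined transformation described above.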
In some embodiments, such as in those depicted in
Various exemplary embodiments of the disclosure are described herein. Reference is made to these examples in a non-limiting sense. They are provided to illustrate more broadly applicable aspects of the disclosure. Various changes may be made to the disclosure described and equivalents may be substituted without departing from the true spirit and scope of the disclosure. In addition, many modifications may be made to adapt a particular situation, material, composition of matter, process, process act(s) or step(s) to the objective(s), spirit or scope of the present disclosure. Further, as will be appreciated by those with skill in the art, each of the individual variations described and illustrated herein has discrete components and features which may be readily separated from or combined with the features of any of the other several embodiments without departing from the scope or spirit of the present disclosure. All such modifications are intended to be within the scope of claims associated with this disclosure.
The disclosure includes methods that may be performed using the subject devices. The methods may include the act of providing such a suitable device. Such provision may be performed by the end user. In other words, the “providing” act merely requires the end user obtain, access, approach, position, set-up, activate, power-up or otherwise act to provide the requisite device in the subject method. Methods recited herein may be carried out in any order of the recited events which is logically possible, as well as in the recited order of events.
Exemplary aspects of the disclosure, together with details regarding material selection and manufacture have been set forth above. As for other details of the present disclosure, these may be appreciated in connection with the above-referenced patents and publications as well as generally known or appreciated by those with skill in the art. The same may hold true with respect to method-based aspects of the disclosure in terms of additional acts as commonly or logically employed.
In addition, though the disclosure has been described in reference to several examples optionally incorporating various features, the disclosure is not to be limited to that which is described or indicated as contemplated with respect to each variation of the disclosure. Various changes may be made to the disclosure described and equivalents (whether recited herein or not included for the sake of some brevity) may be substituted without departing from the true spirit and scope of the disclosure. In addition, where a range of values is provided, it is understood that every intervening value, between the upper and lower limit of that range and any other stated or intervening value in that stated range, is encompassed within the disclosure.
Also, it is contemplated that any optional feature of the variations described may be set forth and claimed independently, or in combination with any one or more of the features described herein. Reference to a singular item includes the possibility that there are plural of the same items present. More specifically, as used herein and in claims associated hereto, the singular forms “a,” “an,” “said,” and “the” include plural referents unless specifically stated otherwise. In other words, use of the articles allows for “at least one” of the subject item in the description above as well as claims associated with this disclosure. It is further noted that such claims may be drafted to exclude any optional element. As such, this statement is intended to serve as antecedent basis for use of such exclusive terminology as “solely,” “only” and the like in connection with the recitation of claim elements, or use of a “negative” limitation.
Without the use of such exclusive terminology, the term “comprising” in claims associated with this disclosure shall allow for the inclusion of any additional element—irrespective of whether a given number of elements are enumerated in such claims, or the addition of a feature could be regarded as transforming the nature of an element set forth in such claims. Except as specifically defined herein, all technical and scientific terms used herein are to be given as broad a commonly understood meaning as possible while maintaining claim validity.
The breadth of the present disclosure is not to be limited to the examples provided and/or the subject specification, but rather only by the scope of claim language associated with this disclosure.
This application is a continuation of U.S. patent application Ser. No. 17/401,090, filed on Aug. 12, 2021, which is a continuation of U.S. patent application Ser. No. 16/593,943, filed on Oct. 4, 2019, now U.S. Pat. No. 11,122,383, which claims priority to U.S. Provisional Application No. 62/741,677, filed on Oct. 5, 2018, and to U.S. Provisional Application No. 62/812,734, filed on Mar. 1, 2019, the contents of which are incorporated by reference herein in their entirety.
Number | Date | Country
---|---|---
62812734 | Mar 2019 | US
62741677 | Oct 2018 | US

Relation | Number | Date | Country
---|---|---|---
Parent | 17401090 | Aug 2021 | US
Child | 18061367 | | US
Parent | 16593943 | Oct 2019 | US
Child | 17401090 | | US