This disclosure relates generally to systems and methods for audio signal processing, and in particular to systems and methods for presenting audio signals in a mixed reality environment.
Immersive and believable virtual environments require the presentation of audio signals in a manner that is consistent with a user's expectations—for example, expectations that an audio signal corresponding to an object in a virtual environment will be consistent with that object's location in the virtual environment, and with a visual presentation of that object. Creating rich and complex soundscapes (sound environments) in virtual reality, augmented reality, and mixed-reality environments requires efficient presentation of a large number of digital audio signals, each appearing to come from a different location/proximity and/or direction in a user's environment. The soundscape includes a presentation of objects and is relative to a user; the positions and orientations of the objects and of the user may change quickly, requiring that the soundscape be adjusted accordingly. Adjusting a soundscape to believably reflect the positions and orientations of the objects and of the user can require rapid changes to audio signals that can result in undesirable sonic artifacts, such as “clicking” sounds, that compromise the immersiveness of a virtual environment. However, some techniques for reducing such sonic artifacts may be computationally expensive, particularly for mobile devices commonly used to interact with virtual environments. It is desirable for systems and methods of presenting soundscapes to a user of a virtual environment to accurately reflect the sounds of the virtual environment, while minimizing sonic artifacts and remaining computationally efficient.
Examples of the disclosure describe systems and methods for presenting an audio signal to a user of a wearable head device. According to an example method, a first input audio signal is received. The first input audio signal is processed to generate a first output audio signal. The first output audio signal is presented via one or more speakers associated with the wearable head device. Processing the first input audio signal comprises applying a pre-emphasis filter to the first input audio signal; adjusting a gain of the first input audio signal; and applying a de-emphasis filter to the first input audio signal. Applying the pre-emphasis filter to the first input audio signal comprises attenuating a low frequency component of the first input audio signal. Applying the de-emphasis filter to the first input audio signal comprises attenuating a high frequency component of the first input audio signal.
In the following description of examples, reference is made to the accompanying drawings which form a part hereof, and in which it is shown by way of illustration specific examples that can be practiced. It is to be understood that other examples can be used and structural changes can be made without departing from the scope of the disclosed examples.
Example Wearable System
In some examples involving augmented reality or mixed reality applications, it may be desirable to transform coordinates from a local coordinate space (e.g., a coordinate space fixed relative to headgear device 1200A) to an inertial coordinate space, or to an environmental coordinate space. For instance, such transformations may be necessary for a display of headgear device 1200A to present a virtual object at an expected position and orientation relative to the real environment (e.g., a virtual person sitting in a real chair, facing forward, regardless of the position and orientation of headgear device 1200A), rather than at a fixed position and orientation on the display (e.g., at the same position in the display of headgear device 1200A). This can maintain an illusion that the virtual object exists in the real environment (and does not, for example, appear positioned unnaturally in the real environment as the headgear device 1200A shifts and rotates). In some examples, a compensatory transformation between coordinate spaces can be determined by processing imagery from the depth cameras 1244 (e.g., using a Simultaneous Localization and Mapping (SLAM) and/or visual odometry procedure) in order to determine the transformation of the headgear device 1200A relative to an inertial or environmental coordinate system. In the example shown in
In some examples, the depth cameras 1244 can supply 3D imagery to a hand gesture tracker 1211, which may be implemented in a processor of headgear device 1200A. The hand gesture tracker 1211 can identify a user's hand gestures, for example by matching 3D imagery received from the depth cameras 1244 to stored patterns representing hand gestures. Other suitable techniques of identifying a user's hand gestures will be apparent.
In some examples, one or more processors 1216 may be configured to receive data from headgear subsystem 1204B, the IMU 1209, the SLAM/visual odometry block 1206, depth cameras 1244, microphones 1250, and/or the hand gesture tracker 1211. The processor 1216 can also send control signals to, and receive control signals from, the 6DOF totem system 1204A. The processor 1216 may be coupled to the 6DOF totem system 1204A wirelessly, such as in examples where the handheld controller 1200B is untethered. Processor 1216 may further communicate with additional components, such as an audio-visual content memory 1218, a Graphical Processing Unit (GPU) 1220, and/or a Digital Signal Processor (DSP) audio spatializer 1222. The DSP audio spatializer 1222 may be coupled to a Head Related Transfer Function (HRTF) memory 1225. The GPU 1220 can include a left channel output coupled to the left source of imagewise modulated light 1224 and a right channel output coupled to the right source of imagewise modulated light 1226. GPU 1220 can output stereoscopic image data to the sources of imagewise modulated light 1224, 1226. The DSP audio spatializer 1222 can output audio to a left speaker 1212 and/or a right speaker 1214. The DSP audio spatializer 1222 can receive input from processor 1216 indicating a direction vector from a user to a virtual sound source (which may be moved by the user, e.g., via the handheld controller 1200B). Based on the direction vector, the DSP audio spatializer 1222 can determine a corresponding HRTF (e.g., by accessing a HRTF, or by interpolating multiple HRTFs). The DSP audio spatializer 1222 can then apply the determined HRTF to an audio signal, such as an audio signal corresponding to a virtual sound generated by a virtual object.
This can enhance the believability and realism of the virtual sound, by incorporating the position and orientation of the user relative to the virtual sound in the mixed reality environment; that is, by presenting a virtual sound that matches a user's expectations of what that virtual sound would sound like if it were a real sound in a real environment.
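As a non-limiting illustration, the direction-based HRTF selection described above can be sketched as follows. This is a minimal sketch under stated assumptions: the table HRTF_TABLE, its tap values, and the function names are hypothetical and are not taken from the disclosure; the sketch selects the nearest stored HRTF and does not implement interpolation between multiple HRTFs.

```python
import math

# Hypothetical HRTF memory: unit direction vector -> (left FIR taps, right FIR taps).
# The tap values are placeholders for illustration, not measured HRTF data.
HRTF_TABLE = {
    (1.0, 0.0, 0.0): ([0.3, 0.1], [1.0, 0.2]),     # source to the user's right
    (-1.0, 0.0, 0.0): ([1.0, 0.2], [0.3, 0.1]),    # source to the user's left
    (0.0, 0.0, 1.0): ([0.7, 0.15], [0.7, 0.15]),   # source straight ahead
}


def normalize(vector):
    """Scale a 3D vector to unit length."""
    norm = math.sqrt(sum(c * c for c in vector))
    if norm == 0.0:
        raise ValueError("direction vector must be non-zero")
    return tuple(c / norm for c in vector)


def nearest_hrtf(direction):
    """Return the stored HRTF pair whose direction has the largest dot product."""
    d = normalize(direction)
    best = max(HRTF_TABLE, key=lambda k: sum(a * b for a, b in zip(k, d)))
    return HRTF_TABLE[best]


def fir(signal, taps):
    """Direct-form FIR filter: y[n] = sum_k taps[k] * x[n - k]."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, tap in enumerate(taps):
            if n - k >= 0:
                acc += tap * signal[n - k]
        out.append(acc)
    return out


def spatialize(signal, direction):
    """Apply the selected left and right HRTF filters to a mono signal."""
    left_taps, right_taps = nearest_hrtf(direction)
    return fir(signal, left_taps), fir(signal, right_taps)


left, right = spatialize([1.0, 0.0], (2.0, 0.0, 0.1))
```

In this sketch, a source mostly to the user's right selects the first table entry, so the right channel receives the larger taps.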
In some examples, such as shown in
While
Audio Spatialization
The systems and methods described below can be implemented in an augmented reality or mixed reality system, such as described above. For example, one or more processors (e.g., CPUs, DSPs) of an augmented reality system can be used to process audio signals or to implement steps of computer-implemented methods described below; sensors of the augmented reality system (e.g., cameras, acoustic sensors, IMUs, LIDAR, GPS) can be used to determine a position and/or orientation of a user of the system, or of elements in the user's environment; and speakers of the augmented reality system can be used to present audio signals to the user.
In augmented reality or mixed reality systems such as described above, one or more processors (e.g., DSP audio spatializer 1222) can process one or more audio signals for presentation to a user of a wearable head device via one or more speakers (e.g., left and right speakers 1212/1214 described above). In some embodiments, the one or more speakers may belong to a unit separate from the wearable head device (e.g., a pair of headphones in communication with the wearable head device). Processing of audio signals requires tradeoffs between the authenticity of a perceived audio signal—for example, the degree to which an audio signal presented to a user in a mixed reality environment matches the user's expectations of how an audio signal would sound in a real environment—and the computational overhead involved in processing the audio signal. Realistically spatializing an audio signal in a virtual environment can be critical to creating immersive and believable user experiences.
The system 100A receives one or more input signals 102A-N. The one or more input signals 102A-N may include digital audio signals corresponding to the objects to be presented in the soundscape. In some embodiments, the digital audio signals may be a pulse-code modulated (PCM) waveform of audio data. The total number of input signals (N) may represent the total number of objects to be presented in the soundscape.
Each encoder of the one or more encoders 104A-N receives at least one input signal of the one or more input signals 102A-N and outputs one or more gain adjusted signals. For example, in some embodiments, encoder 104A receives input signal 102A and outputs gain adjusted signals. In some embodiments, each encoder outputs a gain adjusted signal for each speaker of the one or more speakers 108A-M delivering the soundscape. For example, encoder 104A outputs M gain adjusted signals, one for each of the speakers 108A-M. Speakers 108A-M may belong to an augmented reality or mixed reality system such as described above; for example, one or more of speakers 108A-M may belong to a wearable head device such as described above and may be configured to present an audio signal directly to an ear of a user wearing the device. In order to make the objects in the soundscape appear to originate from specific locations/proximities, each encoder of the one or more encoders 104A-N accordingly sets values of control signals input to the gain modules.
Each encoder of the one or more encoders 104A-N includes one or more gain modules. For example, encoder 104A includes gain modules g_A1-AM. In some embodiments, each encoder of the one or more encoders 104A-N in the system 100A may include the same number of gain modules. For example, each of the one or more encoders 104A-N may include M gain modules. In some embodiments, the total number of gain modules in an encoder corresponds to a total number of speakers delivering the soundscape. Each gain module receives at least one input signal of the one or more input signals 102A-N, adjusts a gain of the input signal, and outputs a gain adjusted signal. For example, gain module g_A1 receives input signal 102A, adjusts a gain of the input signal 102A, and outputs a gain adjusted signal. Each gain module adjusts the gain of the input signal based on a value of a control signal of one or more control signals CTRL_A1-NM. For example, gain module g_A1 adjusts the gain of the input signal 102A based on a value of control signal CTRL_A1. Each encoder adjusts values of control signals input to the gain modules based on a location/proximity of the object, to be presented in the soundscape, to which the input signal corresponds. Each gain module may be a multiplier that multiplies the input signal by a factor that is a function of a value of a control signal.
The mixer 106 receives gain adjusted signals from the encoders 104A-N, mixes the gain adjusted signals, and outputs mixed signals to the speakers 108A-M. The speakers 108A-M receive mixed signals from the mixer 106 and output sound. In some embodiments, the mixer 106 may be removed from the system 100A if there is only one input signal (e.g., input 102A).
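As a non-limiting illustration, the encoder and mixer stages described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function names encode and mix are hypothetical, and each gain module is assumed to multiply its input directly by the control value (the disclosure only requires that the factor be a function of the control value).

```python
def encode(input_signal, control_values):
    """One encoder: M gain modules, each scaling the same input signal.

    control_values holds one (assumed) multiplier per speaker.
    Returns M gain adjusted signals.
    """
    return [[g * x for x in input_signal] for g in control_values]


def mix(encoded):
    """Mixer: sum the gain adjusted signals of N encoders, per speaker."""
    n_speakers = len(encoded[0])
    n_samples = len(encoded[0][0])
    mixed = [[0.0] * n_samples for _ in range(n_speakers)]
    for per_speaker_signals in encoded:
        for m, signal in enumerate(per_speaker_signals):
            for i, x in enumerate(signal):
                mixed[m][i] += x
    return mixed


# N = 2 input signals (objects), M = 2 speakers; all values are illustrative.
inputs = [[1.0, 0.5, -0.5], [0.0, 1.0, 1.0]]
controls = [[1.0, 0.0],   # first object fully in the first speaker
            [0.5, 0.5]]   # second object split equally
encoded = [encode(s, c) for s, c in zip(inputs, controls)]
speaker_feeds = mix(encoded)
```

With a single input signal, the mixer in this sketch passes the gain adjusted signals through unchanged, consistent with the observation that the mixer may be removed in that case.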
In some embodiments, to perform this operation, a spatialization system (“spatializer”) processes each input signal (e.g., a digital audio signal, or “source”) with a pair of Head-Related Transfer Function (HRTF) filters that simulate propagation and diffraction of sound through and by an outer ear and head of a user. The pair of HRTF filters include a HRTF filter for a left ear of the user and a HRTF filter for a right ear of the user. The outputs of the left ear HRTF filters for all sources are mixed together and played through a left ear speaker, and the outputs of the right ear HRTF filters for all sources are mixed together and played through a right ear speaker.
In the example, the decoder 110 includes left HRTF filters L_HRTF_1-M and right HRTF filters R_HRTF_1-M. The decoder 110 receives mixed signals from the mixer 106, filters and sums the mixed signals, and outputs filtered signals to the speakers 112. For example, the decoder 110 receives a first mixed signal from the mixer 106 representing a first object to be presented in the soundscape. Continuing the example, the decoder 110 processes the first mixed signal through a first left HRTF filter L_HRTF_1 and a first right HRTF filter R_HRTF_1. Specifically, the first left HRTF filter L_HRTF_1 filters the first mixed signal and outputs a first left filtered signal, and the first right HRTF filter R_HRTF_1 filters the first mixed signal and outputs a first right filtered signal. The decoder 110 sums the first left filtered signal with other left filtered signals, for example, output from the left HRTF filters L_HRTF_2-M, and outputs a left output signal to the left ear speaker 112A. The decoder 110 sums the first right filtered signal with other right filtered signals, for example, output from the right HRTF filters R_HRTF_2-M, and outputs a right output signal to the right ear speaker 112B.
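As a non-limiting illustration, the filter-and-sum operation of the decoder described above can be sketched as follows. This is a minimal sketch under stated assumptions: each HRTF filter is modeled as a short FIR filter, the tap values are placeholders rather than measured HRTF data, and the function names are hypothetical.

```python
def fir_filter(signal, taps):
    """Direct-form FIR filter: y[n] = sum_k taps[k] * x[n - k]."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, tap in enumerate(taps):
            if n - k >= 0:
                acc += tap * signal[n - k]
        out.append(acc)
    return out


def decode(mixed_signals, left_hrtfs, right_hrtfs):
    """Filter each mixed signal with its left/right HRTF pair and sum per ear."""
    n_samples = len(mixed_signals[0])
    left_out = [0.0] * n_samples
    right_out = [0.0] * n_samples
    for signal, left_taps, right_taps in zip(mixed_signals, left_hrtfs, right_hrtfs):
        for i, value in enumerate(fir_filter(signal, left_taps)):
            left_out[i] += value
        for i, value in enumerate(fir_filter(signal, right_taps)):
            right_out[i] += value
    return left_out, right_out


# Two mixed signals (M = 2) and placeholder HRTF taps.
mixed = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
left_hrtfs = [[1.0, 0.5], [0.25]]
right_hrtfs = [[0.25], [1.0, 0.5]]
left_ear, right_ear = decode(mixed, left_hrtfs, right_hrtfs)
```

The per-ear sums in this sketch correspond to the left output signal and the right output signal delivered to the left and right ear speakers.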
In some embodiments, the decoder 110 may include a bank of HRTF filters. Each of the HRTF filters in the bank may model a specific direction relative to a user's head. In some embodiments, computationally efficient rendering methods may be used wherein incremental processing cost per virtual sound source is minimized. These methods may be based on decomposition of HRTF data over a fixed set of spatial functions and a fixed set of basis filters. In these embodiments, each mixed signal from the mixer 106 may be mixed into inputs of the HRTF filters that model directions that are closest to a source's direction. The levels of the signals mixed into each of those HRTF filters are determined by the specific direction of the source.
If directions and/or locations of the objects presented in the soundscape change, the encoders 104A-N can change the value of the control signals CTRL_A1-NM for the gain modules g_A1-NM to appropriately present the objects in the soundscape.
In some embodiments, the encoders 104A-N may change the values of the control signals CTRL_A1-NM for the gain modules g_A1-NM instantaneously. However, changing the values of the control signals CTRL_A1-NM instantaneously for the system 100A of
To reduce such sonic artifacts, in some embodiments, the encoders 104A-N may change the values of the control signals CTRL_A1-NM for the gain modules g_A1-NM over a period of time, rather than instantaneously. In some embodiments, the encoders 104A-N may compute new values for the control signals CTRL_A1-NM for each and every sample of the input signals 102A-N. The new values for the control signals CTRL_A1-NM may be only slightly different than previous values. The new values may follow a linear curve, an exponential curve, etc. This process may repeat until the required mixing levels for the new direction/location are reached. However, computing new values for the control signals CTRL_A1-NM for each and every sample of the input signals 102A-N for the system 100A of
In some embodiments, the encoders 104A-N may compute new values for the control signals CTRL_A1-NM repeatedly, for example, once every several samples, every two samples, every four samples, every ten samples, and the like. This process may repeat until the required mixing levels for the new direction/location are reached. However, computing new values for the control signals CTRL_A1-NM once every several samples for the system 100A of
To reduce sonic artifacts, in some embodiments, an encoder may search an input signal for a zero crossing and, at a point in time of the zero crossing, adjust values of control signals. In some embodiments, it may take many computing cycles for the encoder to search the input signal for a zero crossing and, at the point in time of the zero crossing, adjust the values of the control signals. However, if the input signal has a direct-current (DC) bias, the encoder may never detect or determine a zero crossing in the input signal and so would never adjust the value of the control signals. As such, a high pass filter or a DC blocking filter may be introduced before the encoder to reduce/remove the DC bias and ensure there are enough zero crossings in the signal. In some embodiments of a system (e.g., the system 100A and/or the system 100B), a high pass filter or a DC blocking filter may be introduced before each encoder in the system. Once the DC bias is reduced/removed from the input signal, the encoder may search the input signal without the DC bias for a zero crossing and, at the point in time of the zero crossing, adjust values of control signals. Searching for zero crossings may be time consuming. If the system includes other components or modules that make changes to a signal, those other components or modules would similarly search signals input to the other component or module for a zero crossing and, at a point in time of the zero crossing, adjust values of parameters of various components or modules.
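As a non-limiting illustration, the zero crossing approach described above can be sketched as follows. This is a minimal sketch under stated assumptions: the DC blocking filter is assumed to be the common first-order form y[n] = x[n] - x[n-1] + r*y[n-1], the coefficient r = 0.995 is illustrative, and the function names are hypothetical.

```python
import math


def dc_block(signal, r=0.995):
    """Assumed first-order DC blocking filter: y[n] = x[n] - x[n-1] + r * y[n-1]."""
    out, prev_x, prev_y = [], 0.0, 0.0
    for x in signal:
        y = x - prev_x + r * prev_y
        out.append(y)
        prev_x, prev_y = x, y
    return out


def first_zero_crossing(signal, start=0):
    """Index of the first sample (searching from start) at which the signal
    reaches or crosses zero relative to the previous sample, or None."""
    for n in range(max(start, 1), len(signal)):
        if signal[n - 1] * signal[n] <= 0.0:
            return n
    return None


def change_gain_at_zero_crossing(signal, old_gain, new_gain, request_index=0):
    """Hold old_gain until the first zero crossing, then switch to new_gain.

    If no crossing is found, the gain is never changed.
    """
    n0 = first_zero_crossing(signal, request_index)
    if n0 is None:
        return [old_gain * x for x in signal], None
    return [(old_gain if n < n0 else new_gain) * x
            for n, x in enumerate(signal)], n0


# A sine wave with a DC bias of 2.0 never crosses zero on its own.
biased = [2.0 + math.sin(2.0 * math.pi * n / 16.0) for n in range(2000)]
crossing_with_bias = first_zero_crossing(biased)
crossing_after_filter = first_zero_crossing(dc_block(biased))
```

In this sketch, the search over the biased signal finds no zero crossing, whereas the same search succeeds once the DC blocking filter has been applied; the linear search over samples also illustrates why the approach may be time consuming.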
As a non-limiting example,
The system 200 receives an input signal 202. The input signal 202 may include a digital audio signal corresponding to an object to be presented in a soundscape. The encoder 204 receives the input signal 202 and outputs four gain adjusted signals. The encoder 204 outputs a gain adjusted signal for each speaker of the first through fourth speakers 208A-D delivering the soundscape. In order to make the object in the soundscape appear to originate from a specific location/proximity, the encoder 204 accordingly sets values of control signals input to first through fourth gain modules g_1-4. The encoder 204 includes first through fourth gain modules g_1-4. The total number of gain modules corresponds to a total number of speakers delivering the soundscape. Each gain module of the first through fourth gain modules g_1-4 receives the input signal 202, adjusts a gain of the input signal 202, and outputs a gain adjusted signal. Each gain module of the first through fourth gain modules g_1-4 adjusts the gain of the input signal 202 based on a value of a control signal of first through fourth control signals CTRL_1-4. For example, the first gain module g_1 adjusts the gain of the input signal 202 based on a value of the first control signal CTRL_1. The encoder 204 adjusts the values of the first through fourth control signals CTRL_1-4 input to the first through fourth gain modules g_1-4 based on a location and/or proximity of the object, to be presented in the soundscape, to which the input signal 202 corresponds. The mixer 206 receives gain adjusted signals from the encoder 204, mixes the gain adjusted signals, and outputs mixed signals to the first through fourth speakers 208A-D. In this example, because there is only one input signal 202 and only one encoder 204, the mixer 206 does not mix any gain adjusted signals. The first through fourth speakers 208A-D receive mixed signals from the mixer 206 and output sound.
In the example, each pre-emphasis filter of the one or more pre-emphasis filters 332A-N receives at least one input signal of the one or more input signals 302A-N, filters the input signal, and outputs a filtered signal to an encoder of the one or more encoders 304A-N. Each pre-emphasis filter filters at least one input signal, for example, by reducing low frequency energy from the input signal. An amplitude of a filtered signal output from the pre-emphasis filter may be closer to zero than the amplitude of the input signal. The severity of the sonic artifacts caused by instantaneously changing the values of the control signals may depend on a combination of the amount of gain change and the amplitude of the signal at the time of the gain change. Because the amplitude of the filtered signal is closer to zero than that of the input signal, the severity of the sonic artifacts may be lessened.
In the example, each encoder of the one or more encoders 304A-N can adjust values of control signals input to gain modules based on a location/proximity of an object to be presented in the soundscape that the input signal, and therefore the filtered signal, corresponds to. Each encoder may adjust the values of the control signals instantaneously without resulting in sonic artifacts at the speakers 308A-M. This is because each gain module adjusts a gain of the filtered signal (e.g., the output of pre-emphasis filters 332A-N) rather than adjusting the input signal directly.
In the example, each de-emphasis filter of the one or more de-emphasis filters 334A-N receives a signal, for example a mixed signal of one or more mixed signals output from the mixer 306, reconstructs a signal from the mixed signal, and outputs a reconstructed signal to a speaker of the one or more speakers 308A-M. Each de-emphasis filter can filter a signal, for example, by reducing high frequency energy from the signal. In some embodiments, the de-emphasis filter may turn all abrupt changes in amplitude of the input signal into changes in slopes of the waveform.
Instantaneously changing the values of the control signals can cause a change in the amplitude of the signal's waveform which may introduce predominately high-frequency noise. The pre-emphasis filter reduces the amplitude of the at least one input signal. The de-emphasis filter turns abrupt changes in amplitude of the signal into changes in slopes of the waveform with reduced high-frequency noise.
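As a non-limiting illustration, the effect described above can be sketched as follows. This is a minimal sketch under stated assumptions: the disclosure does not specify the filter forms, so the pre-emphasis filter is assumed here to be a first-order difference y[n] = x[n] - a*x[n-1], the de-emphasis filter is assumed to be its inverse y[n] = x[n] + a*y[n-1], and the coefficient a = 0.95, the test signal, and the function names are hypothetical.

```python
import math

A = 0.95  # assumed filter coefficient, for illustration only


def pre_emphasis(signal, a=A):
    """Assumed pre-emphasis: y[n] = x[n] - a * x[n-1] (attenuates low frequencies)."""
    out, prev_x = [], 0.0
    for x in signal:
        out.append(x - a * prev_x)
        prev_x = x
    return out


def de_emphasis(signal, a=A):
    """Assumed de-emphasis: y[n] = x[n] + a * y[n-1] (inverse of pre_emphasis)."""
    out, prev_y = [], 0.0
    for x in signal:
        y = x + a * prev_y
        out.append(y)
        prev_y = y
    return out


def switch_gain(signal, old_gain, new_gain, n0):
    """Instantaneous gain change at sample index n0."""
    return [(old_gain if n < n0 else new_gain) * x for n, x in enumerate(signal)]


def max_step(signal):
    """Largest sample-to-sample jump, a rough proxy for an audible click."""
    return max(abs(cur - prev) for prev, cur in zip(signal, signal[1:]))


# Low-frequency sine; the gain drops from 1.0 to 0.5 at the waveform peak (n = 50).
x = [math.sin(2.0 * math.pi * n / 200.0) for n in range(400)]
direct = switch_gain(x, 1.0, 0.5, 50)
smoothed = de_emphasis(switch_gain(pre_emphasis(x), 1.0, 0.5, 50))
```

In this sketch, changing the gain directly produces a jump of about half the signal amplitude at the switching sample, whereas changing the gain between the two filters produces an output that bends toward the new level over subsequent samples. When the gain is constant, the assumed filter pair reconstructs the input signal.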
As illustrated in
In some embodiments, the filters 806, clustered reflections 814, reverberation module 816, reverberation panning module 818, and/or reverberation occlusion module 820 may be adjusted based on one or more values of one or more control signals. In embodiments without the pre-emphasis filter 802 and the de-emphasis filter 826, instantaneously and/or repeatedly changing the values of the control signals may result in sonic artifacts. The pre-emphasis filter 802 and the de-emphasis filter 826 may reduce the severity of the sonic artifacts, such as described above.
In the example shown, the pre-emphasis filter 802 receives a 3D source signal, filters the 3D source signal, and outputs a filtered signal to the pre-processing module 804. The 3D source signal may be analogous to the input signals described above, for example, with respect to
The pre-processing module 804 includes one or more filters 806, one or more pre-delay modules 808, one or more panning modules 810, and a switch 812.
The filtered signal received from the pre-emphasis filter 802 is input to the one or more filters 806. The one or more filters 806 may be, for example, distance filters, air absorption filters, source directivity filters, occlusion filters, obstruction filters, and the like. A first filter of the one or more filters 806 outputs a signal to the switch 812, and the remaining filters of the one or more filters 806 output respective signals to pre-delay modules 808.
The switch 812 receives a signal output from the first filter and directs the signal to a first panning module, to a second panning module, or to an interaural time difference (ITD) delay module. The ITD delay module outputs a first delayed signal to a third panning module and a second delayed signal to a fourth panning module.
The one or more pre-delay modules 808 each receive a respective signal, delay the received signal, and output a delayed version of the received signal. A first pre-delay module outputs a first delayed signal to a fifth panning module. The remaining pre-delay modules output delayed signals to various reverberation send buses.
The one or more panning modules 810 each pan a respective input signal to a bus. The first panning module pans the signal into a diffuse bus, the second panning module pans the signal into a standard bus, the third panning module pans the signal into a left bus, the fourth panning module pans the signal into a right bus, and the fifth panning module pans the signal into a clustered reflections bus.
The clustered reflections bus outputs a signal to the clustered reflections module 814. The clustered reflections module 814 generates a cluster of reflections and outputs the cluster of reflections to a clustered reflections occlusion module.
The various reverberation send buses output signals to various reverberation modules 816. The reverberation modules 816 generate reverberations and output the reverberations to various reverberation panning modules 818. The reverberation panning modules 818 pan the reverberations to various reverberation occlusion modules 820. The reverberation occlusion modules 820 model occlusions and other properties similar to the filters 806 and output occluded panned reverberations to the standard bus.
The multi-channel decorrelation filter bank 822 receives the diffuse bus and applies one or more decorrelation filters; for example, the filter bank 822 spreads signals to create sounds of non-point sources and outputs the diffused signals to the standard bus.
The virtualizer 824 receives the left bus, the right bus, and the standard bus and outputs signals to the de-emphasis filter 826. The virtualizer 824 may be analogous to decoders described above, for example, with respect to
Various exemplary embodiments of the disclosure are described herein. Reference is made to these examples in a non-limiting sense. They are provided to illustrate more broadly applicable aspects of the disclosure. Various changes may be made to the disclosure described and equivalents may be substituted without departing from the true spirit and scope of the disclosure. In addition, many modifications may be made to adapt a particular situation, material, composition of matter, process, process act(s) or step(s) to the objective(s), spirit or scope of the present disclosure. Further, as will be appreciated by those with skill in the art, each of the individual variations described and illustrated herein has discrete components and features which may be readily separated from or combined with the features of any of the other several embodiments without departing from the scope or spirit of the present disclosure. All such modifications are intended to be within the scope of claims associated with this disclosure.
The disclosure includes methods that may be performed using the subject devices. The methods may include the act of providing such a suitable device. Such provision may be performed by the end user. In other words, the “providing” act merely requires the end user obtain, access, approach, position, set-up, activate, power-up or otherwise act to provide the requisite device in the subject method. Methods recited herein may be carried out in any order of the recited events which is logically possible, as well as in the recited order of events.
Exemplary aspects of the disclosure, together with details regarding material selection and manufacture have been set forth above. As for other details of the present disclosure, these may be appreciated in connection with the above-referenced patents and publications as well as generally known or appreciated by those with skill in the art. The same may hold true with respect to method-based aspects of the disclosure in terms of additional acts as commonly or logically employed.
In addition, though the disclosure has been described in reference to several examples optionally incorporating various features, the disclosure is not to be limited to that which is described or indicated as contemplated with respect to each variation of the disclosure. Various changes may be made to the disclosure described and equivalents (whether recited herein or not included for the sake of some brevity) may be substituted without departing from the true spirit and scope of the disclosure. In addition, where a range of values is provided, it is understood that every intervening value, between the upper and lower limit of that range and any other stated or intervening value in that stated range, is encompassed within the disclosure.
Also, it is contemplated that any optional feature of the variations described may be set forth and claimed independently, or in combination with any one or more of the features described herein. Reference to a singular item includes the possibility that there are plural of the same items present. More specifically, as used herein and in claims associated hereto, the singular forms “a,” “an,” “said,” and “the” include plural referents unless specifically stated otherwise. In other words, use of the articles allows for “at least one” of the subject item in the description above as well as claims associated with this disclosure. It is further noted that such claims may be drafted to exclude any optional element. As such, this statement is intended to serve as antecedent basis for use of such exclusive terminology as “solely,” “only” and the like in connection with the recitation of claim elements, or use of a “negative” limitation.
Without the use of such exclusive terminology, the term “comprising” in claims associated with this disclosure shall allow for the inclusion of any additional element—irrespective of whether a given number of elements are enumerated in such claims, or the addition of a feature could be regarded as transforming the nature of an element set forth in such claims. Except as specifically defined herein, all technical and scientific terms used herein are to be given as broad a commonly understood meaning as possible while maintaining claim validity.
The breadth of the present disclosure is not to be limited to the examples provided and/or the subject specification, but rather only by the scope of claim language associated with this disclosure.
This application claims priority to U.S. Provisional Application No. 62/742,254, filed on Oct. 5, 2018, to U.S. Provisional Application No. 62/812,546, filed on Mar. 1, 2019, and to U.S. Provisional Application No. 62/742,191, filed on Oct. 5, 2018, the contents of which are incorporated by reference herein in their entirety.
Number | Date | Country |
---|---|---|
20200112816 A1 | Apr 2020 | US |
Number | Date | Country |
---|---|---|
62742191 | Oct 2018 | US |
62742254 | Oct 2018 | US |
62812546 | Mar 2019 | US |