The present disclosure relates generally to the field of stereophony, and more specifically to clustering multiple audio sources for users of virtual-reality systems.
Humans can determine locations of sounds by comparing sounds perceived at each ear. The brain can determine the location of a sound source by utilizing subtle intensity, spectral, and timing differences of the sound perceived in each ear.
The intensity, spectra, and arrival time of the sound at each ear are characterized by a head-related transfer function (HRTF) unique to each user.
In virtual-reality systems, it is advantageous to generate for users an accurate virtual acoustic environment that reproduces sounds for sources at different virtual locations, creating an immersive virtual-reality environment. When there are many distinct sound sources, treating them all as separate sound sources is computationally complex, making it difficult to render all of the sounds in real time. On the other hand, if many sound sources are combined, it creates a disparity between the visual locations of objects in the VR world and the perceived auditory locations of those objects. Because of the complexity of having many audio sources, conventional approaches typically fail to provide a real-time VR environment with accurate source positioning.
One solution to the problem includes determining when it is possible to cluster two or more audio sources of the virtual-reality acoustic environment together. Clustering the two or more audio sources lets the user hear the combined sources as coming from a specific direction while significantly reducing the hardware resources and processing time required.
A primary goal of a sound propagation and rendering system is the ability to handle a large number of sound sources. One way to increase the possible source count is to cluster nearby sources together, then simulate those sources together as a single proxy source. If a suitable clustering heuristic is chosen, the simulation time can be greatly reduced without significantly impacting quality. Clustering can be more aggressive for sources that are farther away or are in a different part of the environment than the listener. Clustering of sources is important for handling complex scenes. For instance, a game character may use several sound sources—footsteps, voice, gun, etc. Large scenes may have dozens of active characters, each with a few sound sources. Clustering enables these sources to be collapsed down to one source per character (or possibly combining multiple characters' sources), except where it would negatively impact the rendered audio.
In accordance with some embodiments, a method of clustering audio sources in virtual environments is performed at a virtual-reality device displaying a virtual environment. The device identifies two audio sources in the virtual environment. For each of the two audio sources, the device determines a respective bounding box in the virtual environment. The respective bounding box includes termination points for a respective plurality of rays emanating from a respective point in the virtual environment corresponding to the respective audio source. In some embodiments, rays are emitted from the surface of the source (e.g., a point, sphere, or other geometric shape). The device applies an overlap test to the bounding boxes to determine whether the two audio sources are in the same room. The device also identifies a location of a listener in the virtual environment, and determines an angle θ (e.g., a solid angle of the bounding box of the sources) according to rays from the location of the listener to the points in the virtual environment corresponding to the two audio sources. When the two audio sources are determined to be in the same room and the angle θ is less than a predetermined threshold angle Tθ, the device clusters the two audio sources together, which includes rendering combined audio for the two audio sources from a single cluster audio location. In some embodiments, the combined sound (e.g., a mixture of the two sources) is treated as if it emanates from both source locations. When the two audio sources are determined not to be in the same room or the angle θ is greater than the threshold angle Tθ, the device does not cluster the two audio sources together, and renders audio for the virtual environment without combining audio for the two audio sources.
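For illustration only, the following Python sketch combines the overlap test and the angle test into a single clustering decision; the AABB class, the should_cluster function, and the default threshold values are assumptions of this sketch, not the claimed implementation.

```python
import math
from dataclasses import dataclass

@dataclass
class AABB:
    """Axis-aligned bounding box given by its minimum and maximum corners."""
    lo: tuple  # (x, y, z)
    hi: tuple  # (x, y, z)

    def volume(self) -> float:
        return math.prod(max(h - l, 0.0) for l, h in zip(self.lo, self.hi))

    def overlap_volume(self, other: "AABB") -> float:
        # Volume of the intersection box; zero when the boxes do not overlap.
        return math.prod(
            max(min(h1, h2) - max(l1, l2), 0.0)
            for l1, h1, l2, h2 in zip(self.lo, self.hi, other.lo, other.hi))

def should_cluster(box1, box2, src1, src2, listener,
                   t_r=0.5, t_theta_deg=10.0) -> bool:
    """Cluster only if (1) the room bounding boxes overlap by more than the
    fraction t_r of each box, and (2) the angle subtended at the listener by
    the two sources is below t_theta_deg. Assumes non-degenerate boxes and a
    listener that is not collocated with either source."""
    overlap = box1.overlap_volume(box2)
    same_room = min(overlap / box1.volume(), overlap / box2.volume()) > t_r

    v1 = tuple(s - l for s, l in zip(src1, listener))
    v2 = tuple(s - l for s, l in zip(src2, listener))
    cos_theta = (sum(a * b for a, b in zip(v1, v2))
                 / (math.hypot(*v1) * math.hypot(*v2)))
    theta_deg = math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))

    return same_room and theta_deg < t_theta_deg
```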
In some embodiments, applying the overlap test includes determining whether the overlap between the two bounding boxes is more than a threshold fraction of each bounding box. In some embodiments, the bounding boxes are R1 and R2, and the device determines respective volumes |R1|v and |R2|v for the bounding boxes R1 and R2. The device also determines the volume |R1∩R2|v of the overlap between the two bounding boxes. Using this data, the device computes the minimum overlap value min(|R1∩R2|v/|R1|v, |R1∩R2|v/|R2|v).
The two audio sources are determined to be in the same room when the minimum overlap value exceeds a threshold fraction TR.
Generally, embodiments recheck the clustering conditions at regular intervals (e.g., every 100 ms). In some embodiments, after clustering the two audio sources together and passage of a time interval Δt, the device determines updated respective volumes |R1′|v and |R2′|v for updated bounding boxes R1′ and R2′ corresponding to the two audio sources. The device also determines an updated volume |R1′∩R2′|v of overlap between the two updated bounding boxes. The device then computes the updated minimum overlap value min(|R1′∩R2′|v/|R1′|v, |R1′∩R2′|v/|R2′|v).
When the updated minimum overlap value is less than a predetermined split threshold fraction TRsplit, the device de-clusters the two audio sources. The predetermined split threshold fraction TRsplit is less than the threshold fraction TR (which prevents rapid switching back and forth between clustering and de-clustering). When the updated minimum overlap value is greater than the split threshold fraction TRsplit, the device maintains clustering of the two audio sources.
In some embodiments, each termination point of a respective ray emanating from a respective point corresponding to a respective audio source is a location in the virtual environment where either (1) the respective ray encounters an object in the virtual environment or (2) the respective ray exits from the virtual environment. In some embodiments, the virtual environment has perpendicular coordinate axes, and each bounding box is a minimal axis-aligned rectangle containing the termination points of its respective plurality of rays. For example, the virtual environment may be surrounded by four walls that meet at right angles. The bounding boxes for the audio sources may be aligned to be parallel to the walls of the virtual environment.
In some embodiments, after clustering the two audio sources together and passage of a time interval Δt, the device reevaluates the angle test. In particular, the device computes an updated angle θ′ according to rays from an updated location of the listener to updated points in the virtual environment corresponding to the two audio sources. When the updated angle θ′ is greater than the predetermined split threshold angle Tθsplit, the device de-clusters the two audio sources. The predetermined split threshold angle Tθsplit is greater than the threshold angle Tθ (which prevents rapid switching back and forth between clustering and de-clustering). When the updated angle θ′ is less than the split threshold angle Tθsplit, the device maintains clustering of the two audio sources.
In accordance with some embodiments, a method is performed at a virtual-reality device displaying a virtual scene. The method determines a bounding box of an acoustic space of the virtual scene. A listener of the virtual scene is located within the determined bounding box. In some embodiments, the bounding box of the listener is computed using the same method described herein to determine a bounding box of an audio source and/or an acoustic space. The method further determines one or more clustering metrics for two or more audio sources (distinct from the listener) of the virtual scene. The two or more audio sources are positioned within the acoustic space. When the one or more clustering metrics for the two or more audio sources satisfy clustering criteria, the method clusters the two or more audio sources together as a single audio source, and renders audio for the virtual scene. At least a portion of the rendered audio combines audio associated with the clustered two or more audio sources.
In accordance with some embodiments, a virtual-reality device includes one or more processors, memory, and one or more programs stored in the memory. The programs are configured to be executed by the one or more processors. The virtual-reality device determines a bounding box of an acoustic space of a virtual scene. The virtual scene is displayed by the virtual-reality device. A listener of the virtual scene is located within the determined bounding box. The virtual-reality device determines one or more clustering metrics for two or more audio sources, distinct from the listener, of the virtual scene. The two or more audio sources are positioned within the acoustic space. When the one or more clustering metrics for the two or more audio sources satisfy clustering criteria, the device clusters the two or more audio sources as a single audio source and renders audio for the virtual scene. At least a portion of the rendered audio combines audio associated with the clustered two or more sources.
In accordance with some embodiments, a head-mounted display device includes one or more processors/cores and memory storing one or more programs configured to be executed by the one or more processors/cores. The one or more programs include instructions for performing the operations of any of the methods described herein. In accordance with some embodiments, a non-transitory computer-readable storage medium stores instructions that, when executed by one or more processors/cores of a head-mounted display device, cause the device to perform the operations of any of the methods described herein.
In another aspect, a head-mounted display device is provided that includes means for performing any of the methods described herein.
Thus, the disclosed embodiments provide an efficient way to cluster certain audio sources within a virtual environment, which enables the virtual reality system to provide a better user experience.
For a better understanding of the various described embodiments, reference should be made to the Description of Embodiments below, in conjunction with the following drawings in which like reference numerals refer to corresponding parts throughout the figures and specification.
The figures depict embodiments of the present disclosure for purposes of illustration only. One skilled in the art will readily recognize from the following description that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles, or benefits, of the disclosure described herein.
Reference will now be made to embodiments, examples of which are illustrated in the accompanying drawings. In the following description, numerous specific details are set forth in order to provide an understanding of the various described embodiments. However, it will be apparent to one of ordinary skill in the art that the various described embodiments may be practiced without these specific details. In other instances, well-known methods, procedures, components, circuits, and networks have not been described in detail so as not to unnecessarily obscure aspects of the embodiments.
It will also be understood that, although the terms first and second are used, in some instances, to describe various elements, these elements should not be limited by these terms. These terms are used only to distinguish one element from another. For example, a first audio source could be termed a second audio source, and, similarly, a second audio source could be termed a first audio source, without departing from the scope of the various described embodiments. The first audio source and the second audio source are both audio sources, but they are not the same audio source, unless specified otherwise.
The terminology used in the description of the various described embodiments herein is for the purpose of describing particular embodiments only and is not intended to be limiting. As used in the description of the various described embodiments and the appended claims, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “includes,” “including,” “comprises,” and/or “comprising,” when used in this specification, specify the presence of stated features, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, and/or groups thereof.
As used herein, the term “if” means “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context. Similarly, the phrase “if it is determined” or “if [a stated condition or event] is detected” means “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.
A virtual-reality (VR) system simulates sounds that a user of the VR system perceives to have originated from sources at desired virtual locations of the virtual environment.
Embodiments of the virtual-reality system 100 may include or be implemented in conjunction with an artificial reality system. Artificial reality is a form of reality that has been adjusted in some manner before presentation to a user, which may include virtual-reality (VR), augmented reality (AR), mixed reality (MR), hybrid reality, or some combination and/or derivative thereof. Artificial reality content may include completely generated content or generated content combined with captured (e.g., real-world) content. The artificial reality content may include video, audio, haptic feedback, or some combination thereof. Any of this may be presented in a single channel or in multiple channels (such as stereo video that produces a three-dimensional effect to the viewer). In some embodiments, artificial reality may be associated with applications, products, accessories, services, or some combination thereof, that are used, for example, to create content in an artificial reality and/or are otherwise used in (e.g., perform activities in) artificial reality. The artificial reality system that provides the artificial reality content may be implemented on various platforms, including a head-mounted display (HMD) connected to a host computer system, a standalone HMD, a mobile device, or a computing system.
While
In some embodiments, the display device 101 is a head-mounted display that presents media to a user of the display device 101. The display device 101 may be referred to herein as a head-mounted display device. Examples of media presented by a display device 101 include one or more images, video, audio, or some combination thereof. In some embodiments, audio is presented via an external output device 140 (e.g., speakers and/or headphones) that receives audio information from the display device 101, the console 150, or both, and presents audio data based on the audio information. In some embodiments, the display device 101 immerses a user in a virtual environment.
In some embodiments, the display device 101 also acts as an augmented reality (AR) headset. In these embodiments, the display device 101 augments views of a physical, real-world environment with computer-generated elements (e.g., images, video, or sound). Moreover, in some embodiments, the display device 101 is able to cycle between different types of operation. Thus, the display device 101 operates as a virtual-reality (VR) device, an AR device, as glasses, or some combination thereof (e.g., glasses with no optical correction, glasses optically corrected for the user, sunglasses, or some combination thereof) based on instructions from the application engine 156.
In some embodiments, the display device 101 includes one or more of each of the following: an electronic display 102, processor(s) 103, an optics block 104, position sensors 106, a focus prediction module 108, an eye tracking module 110, locators 114, an inertial measurement unit 116, head tracking sensors 118, a scene rendering module 120, and memory 122. In some embodiments, the display device 101 includes only a subset of the modules described here. In some embodiments, the display device 101 has different modules than those described here. Similarly, the functions can be distributed among the modules in a different manner than is described here.
One or more processors 103 (e.g., processing units or cores) execute instructions stored in the memory 122. The memory 122 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM or other random access solid state memory devices, and may include non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid state storage devices. The memory 122, or alternatively the non-volatile memory device(s) within the memory 122, comprises a non-transitory computer readable storage medium. In some embodiments, the memory 122 or the computer readable storage medium of the memory 122 stores programs, modules, and data structures, and/or instructions for displaying one or more images on the display 102.
The display 102 displays images to the user in accordance with data received from the console 150 and/or the processors 103. In various embodiments, the display 102 comprises a single adjustable display element or multiple adjustable display elements (e.g., a display for each eye of a user).
The optics block 104 directs light from the display 102 to an exit pupil, for viewing by a user, using one or more optical elements, such as Fresnel lenses, convex lenses, concave lenses, filters, and so forth, and may include combinations of different optical elements. The optics block 104 typically includes one or more lenses. In some embodiments, when the display 102 includes multiple adjustable display elements, the optics block 104 includes multiple optics blocks 104 (one for each adjustable display element).
The optics block 104 may be designed to correct one or more optical errors. Examples of optical errors include: barrel distortion, pincushion distortion, longitudinal chromatic aberration, transverse chromatic aberration, spherical aberration, comatic aberration, field curvature, astigmatism, and so forth. In some embodiments, content provided to the display 102 for display is pre-distorted, and the optics block 104 corrects the distortion when it receives image light from the display 102 generated based on the content.
Each state of the optics block 104 corresponds to a particular location of a focal plane of the display device 101. In some embodiments, the optics block 104 moves in a range of 5-10 mm with a positional accuracy of 5-10 μm. This can lead to 1000 states (e.g., positions) of the optics block 104. Any number of states could be provided. In some embodiments, fewer states are used. For example, in some cases, a first state corresponds to a focal plane located at infinity, a second state corresponds to a focal plane located at 2.0 meters (from a reference plane), a third state corresponds to a focal plane located at 1.0 meter, a fourth state corresponds to a focal plane located at 0.5 meters, a fifth state corresponds to a focal plane located at 0.333 meters, and a sixth state corresponds to a focal plane located at 0.250 meters.
Optional locators 114 are objects located in specific positions on the display device 101 relative to one another and relative to a specific reference point on the display device 101. A locator 114 may be a light emitting diode (LED), a corner cube reflector, a reflective marker, a type of light source that contrasts with an environment in which the display device 101 operates, or some combination thereof. In some embodiments, the locators 114 include active locators (e.g., an LED or other type of light emitting device) configured to emit light in the visible band (e.g., about 400 nm to 750 nm), in the infrared (IR) band (e.g., about 750 nm to 1 mm), in the ultraviolet band (e.g., about 100 nm to 400 nm), some other portion of the electromagnetic spectrum, or some combination thereof.
In some embodiments, the locators 114 are located beneath an outer surface of the display device 101, which is transparent to the wavelengths of light emitted or reflected by the locators 114 or is thin enough to not substantially attenuate the wavelengths of light emitted or reflected by the locators 114. In some embodiments, the outer surface or other portions of the display device 101 are opaque in the visible band of wavelengths of light. Thus, the locators 114 may emit light in the IR band under an outer surface that is transparent in the IR band but opaque in the visible band.
An inertial measurement unit (IMU) 116 is an electronic device that generates first calibration data based on measurement signals received from one or more head tracking sensors 118. One or more head tracking sensors 118 generate one or more measurement signals in response to motion of the display device 101. Examples of head tracking sensors 118 include accelerometers, gyroscopes, magnetometers, sensors suitable for detecting motion, sensors suitable for correcting errors associated with the IMU 116, or some combination thereof. The head tracking sensors 118 may be located external to the IMU 116, internal to the IMU 116, or some combination thereof.
Based on the measurement signals from the head tracking sensors 118, the IMU 116 generates first calibration data indicating an estimated position of the display device 101 relative to an initial position of the display device 101. For example, the head tracking sensors 118 include multiple accelerometers to measure translational motion (forward/back, up/down, left/right) and multiple gyroscopes to measure rotational motion (e.g., pitch, yaw, and roll). The IMU 116 can, for example, rapidly sample the measurement signals and calculate the estimated position of the display device 101 from the sampled data. For example, the IMU 116 integrates the measurement signals received from the accelerometers over time to estimate a velocity vector and integrates the velocity vector over time to determine an estimated position of a reference point on the display device 101. Alternatively, the IMU 116 provides the sampled measurement signals to the console 150, which determines the first calibration data. The reference point is a point that may be used to describe the position of the display device 101. The reference point is generally defined as a point in space. However, in practice the reference point is defined as a point within the display device 101 (e.g., the center of the IMU 116).
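As a rough illustration of the double integration described above (not the IMU 116's actual filtering), the sketch below integrates hypothetical accelerometer samples into a position estimate using simple Euler steps; the function name, sample data, and fixed sample period are assumptions, and a real system would also use gyroscope data and drift correction.

```python
def dead_reckon(accel_samples, dt, v0=(0.0, 0.0, 0.0), p0=(0.0, 0.0, 0.0)):
    """Estimate a position by double integration of acceleration (Euler steps).
    accel_samples: iterable of (ax, ay, az) readings in m/s^2, one every dt
    seconds. Returns the estimated displacement of the reference point."""
    v, p = list(v0), list(p0)
    for a in accel_samples:
        for i in range(3):
            v[i] += a[i] * dt  # acceleration -> velocity
            p[i] += v[i] * dt  # velocity -> position
    return tuple(p)

# Example: constant 0.1 m/s^2 forward acceleration sampled at 1 kHz for 1 s.
print(dead_reckon([(0.1, 0.0, 0.0)] * 1000, dt=0.001))
```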
In some embodiments, the IMU 116 receives one or more calibration parameters from the console 150. As further discussed below, the one or more calibration parameters are used to maintain tracking of the display device 101. Based on a received calibration parameter, the IMU 116 may adjust one or more IMU parameters (e.g., sample rate). In some embodiments, certain calibration parameters cause the IMU 116 to update an initial position of the reference point so it corresponds to a next calibrated position of the reference point. Updating the initial position of the reference point as the next calibrated position of the reference point helps reduce accumulated error associated with the determined estimated position. The accumulated error, also referred to as drift error, causes the estimated position of the reference point to “drift” away from the actual position of the reference point over time.
An optional scene rendering module 120 receives content for the virtual scene from the application engine 156 and provides the content for display on the display 102. Additionally, the scene rendering module 120 can adjust the content based on information from the focus prediction module 108, a vergence processing module 112, the IMU 116, and/or the head tracking sensors 118. For example, upon receiving the content from the engine 156, the scene rendering module 120 adjusts the content based on the predicted state (e.g., a state that corresponds to a particular eye position) of the optics block 104 received from the focus prediction module 108 by adding a correction or pre-distortion into the rendering of the virtual scene to compensate for or correct the distortion caused by the predicted state of the optics block 104. The scene rendering module 120 may also add depth of field blur based on the user's gaze, vergence depth (or accommodation depth) received from the vergence processing module 112, or measured properties of the user's eye (e.g., three-dimensional position of the eye). Additionally, the scene rendering module 120 determines a portion of the content to be displayed on the display 102 based on one or more of the tracking module 154, the head tracking sensors 118, or the IMU 116, as described further below.
The imaging device 160 generates second calibration data in accordance with calibration parameters received from the console 150. The second calibration data includes one or more images showing observed positions of the locators 114 that are detectable by imaging device 160. In some embodiments, the imaging device 160 includes one or more cameras, one or more video cameras, other devices capable of capturing images including one or more locators 114, or some combination thereof. Additionally, the imaging device 160 may include one or more filters (e.g., for increasing signal to noise ratio). The imaging device 160 is configured to detect light emitted or reflected from the locators 114 in a field of view of the imaging device 160. In embodiments where the locators 114 include passive elements (e.g., a retroreflector), the imaging device 160 may include a light source that illuminates some or all of the locators 114, which retro-reflect the light towards the light source in the imaging device 160. The second calibration data is communicated from the imaging device 160 to the console 150, and the imaging device 160 receives one or more calibration parameters from the console 150 to adjust one or more imaging parameters (e.g., focal length, focus, frame rate, ISO, sensor temperature, shutter speed, or aperture).
The input interface 142 is a device that allows a user to send action requests to the console 150. An action request is a request to perform a particular action. For example, an action request may be to start or end an application or to perform a particular action within the application. The input interface 142 may include one or more input devices. Example input devices include a keyboard, a mouse, a game controller, or any other suitable device for receiving action requests and communicating the received action requests to the console 150. An action request received by the input interface 142 is communicated to the console 150, which performs an action corresponding to the action request. In some embodiments, the input interface 142 provides haptic feedback to the user in accordance with instructions received from the console 150. For example, haptic feedback is provided by the input interface 142 when an action request is received, or the console 150 communicates instructions to the input interface 142 causing the input interface 142 to generate haptic feedback when the console 150 performs an action.
The console 150 provides media to the display device 101 for presentation to the user in accordance with information received from the imaging device 160, the display device 101, and/or the input interface 142. In the example shown in
When the application store 152 is included in the console 150, the application store 152 stores one or more applications for execution by the console 150. An application is a group of instructions that, when executed by a processor (e.g., the processors 103), is used for generating content for presentation to the user. Content generated by the processor based on an application may be in response to inputs received from the user via movement of the display device 101 or the input interface 142. Examples of applications include gaming applications, conferencing applications, or video playback applications.
When the tracking module 154 is included in the console 150, the tracking module 154 calibrates the virtual-reality system 100 using one or more calibration parameters and may adjust one or more calibration parameters to reduce error in the determination of the position of the display device 101. For example, the tracking module 154 adjusts the focus of the imaging device 160 to obtain a more accurate position for the observed locators 114 on the display device 101. Moreover, calibration performed by the tracking module 154 also accounts for information received from the IMU 116. Additionally, if tracking of the display device 101 is lost (e.g., the imaging device 160 loses line of sight of at least a threshold number of the locators 114), the tracking module 154 re-calibrates some or all of the system components.
In some embodiments, the tracking module 154 tracks the movement of the display device 101 using calibration data from the imaging device 160. For example, the tracking module 154 determines positions of a reference point on the display device 101 using observed locators from the calibration data from the imaging device 160 and a model of the display device 101. In some embodiments, the tracking module 154 also determines positions of the reference point on the display device 101 using position information from the calibration data from the IMU 116 on the display device 101. In some embodiments, the tracking module 154 uses portions of the first calibration data, the second calibration data, or some combination thereof, to predict a future location of the display device 101. The tracking module 154 provides the estimated or predicted future position of the display device 101 to the application engine 156.
The application engine 156 executes applications within the virtual-reality system 100 and receives position information, acceleration information, velocity information, predicted future positions, or some combination thereof for the display device 101 from the tracking module 154. Based on the received information, the application engine 156 determines content to provide to the display device 101 for presentation to the user, such as a virtual scene. For example, if the received information indicates that the user has looked to the left, the application engine 156 generates content for the display device 101 that mirrors or tracks the user's movement in the virtual environment. Additionally, the application engine 156 performs an action within an application executing on the console 150 in response to an action request received from the input interface 142 and provides feedback to the user that the action was performed. The provided feedback may be visual or audible feedback via the display device 101 or the haptic feedback via the input interface 142.
In some embodiments, the display device 101 includes a front rigid body and a band that goes around a user's head (not shown). The front rigid body includes one or more display elements corresponding to the display 102, the IMU 116, the head tracking sensors 118, and the locators 114. In this example, the head tracking sensors 118 are located within the IMU 116. In some embodiments where the display device 101 is used in AR and/or MR applications, portions of the display device 101 may be at least partially transparent (e.g., an internal display or one or more sides of the display device 101).
A position, an orientation, and/or a movement of the display device 101 is determined by a combination of the locators 114, the IMU 116, the head tracking sensors 118, the imaging device 160, and the tracking module 154, as described above in conjunction with
To determine the location or object within the determined portion of the virtual scene at which the user is looking, the display device 101 may track the position and/or orientation of the user's eyes. Thus, in some embodiments, the display device 101 determines an eye position for each eye of the user. For example, the display device 101 tracks at least a subset of the three-dimensional position, roll, pitch, and yaw of each eye and uses these quantities to estimate a three-dimensional gaze point of each eye. Further, information from past eye positions, information describing a position of the user's head, and information describing a scene presented to the user may also be used to estimate the three-dimensional gaze point of an eye in various embodiments.
For each audio source or listener 206, the process builds a bounding box 210 according to rays 208 originating from the source/listener 206.
Each ray 208 “travels” through the acoustic space 200 of the virtual scene until it hits an object or escapes the scene. When a ray hits an object, the hit point will contribute to the room bounding box 210 (shown with a dashed line pattern). An axis-aligned bounding box 210 encloses all of the ray hit points, and is chosen to be minimal in size. In some embodiments, this axis-aligned bounding box 210 is later used to estimate a smoothed bounding box 312 using the room information from previous simulation updates. This is described in more detail in relation to
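The following is a minimal sketch of this room-estimation step, assuming a hypothetical cast_ray callback that performs the scene intersection; the function name, ray count, and maximum distance are illustrative.

```python
import math
import random

def room_bounding_box(source_pos, cast_ray, num_rays=10, max_distance=100.0):
    """Estimate a room bounding box as the minimal axis-aligned box over the
    hit points of rays cast in random directions from the source.
    cast_ray(origin, direction, max_distance) -> hit point (x, y, z), or None
    when the ray escapes the scene. Escaped rays are discarded here; some
    embodiments instead use the point where the ray exits the environment."""
    hits = []
    for _ in range(num_rays):
        # Uniformly distributed random direction on the unit sphere.
        z = random.uniform(-1.0, 1.0)
        phi = random.uniform(0.0, 2.0 * math.pi)
        r = math.sqrt(1.0 - z * z)
        hit = cast_ray(source_pos, (r * math.cos(phi), r * math.sin(phi), z),
                       max_distance)
        if hit is not None:
            hits.append(hit)
    if not hits:
        return None
    lo = tuple(min(h[i] for h in hits) for i in range(3))
    hi = tuple(max(h[i] for h in hits) for i in range(3))
    return lo, hi
```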
The number of rays influences the quality of the estimated bounding box. If there are not enough rays, the bounding box may not be an accurate representation of the room because of missing features (e.g., open ceilings). However, when there are too many rays, the sampling of the room may become computationally expensive while providing little additional value. In addition, using too many rays can lead to bounding boxes 210 that are too large. Some embodiments use ten rays 208 to balance quality and performance.
In some embodiments, the rays 208 terminate at the first object they encounter. In some embodiments, the rays are allowed to “bounce” one or more times after hitting a first object in the scene. In some embodiments, acoustic properties of objects in the scene are used to determine how much sound (if any) bounces off of an object and the direction of the acoustic bounces.
Some embodiments apply exponential smoothing to the bounding box to ensure that the estimated room bounding box does not change abruptly or vary significantly over time. Exponential smoothing can also improve the quality of the room estimation.
As indicated in
At time t=0, there is no previous iteration, so the dimensions of the smoothed bounding box R0 are the same as the dimensions of the bounding box R̃0, as specified in equations 340 and 342.
For t≥1, some embodiments use the recurrence relations 344 and 346 to compute the dimensions of the smoothed bounding box Rt (e.g., Rt = αR·Rt−1 + (1 − αR)·R̃t, applied separately to the minimum and maximum corners). The recurrence relations use a convex combination of the previous smoothed bounding box 312 and the current bounding box 210. The value of αR (which is in the range of 0.0 to 1.0) influences how quickly the smoothed bounding box adapts to changes in the scene. When αR is close to zero, scene changes are recognized quickly. When αR is close to one, the response is slow because the smoothed bounding box is weighted primarily by the previous smoothed bounding box. Embodiments typically use a value for αR that is not near one of the extremes.
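A minimal sketch of this per-update smoothing follows; the function name, box representation, and the default αR value of 0.5 are assumptions for illustration.

```python
def smooth_box(prev_box, new_box, alpha_r=0.5):
    """Exponential smoothing of an axis-aligned bounding box, applied
    coordinate-by-coordinate to the minimum and maximum corners:
        R_t = alpha_r * R_{t-1} + (1 - alpha_r) * R~_t
    A larger alpha_r weights the previous smoothed box more heavily (slower
    response); a smaller alpha_r tracks scene changes more quickly.
    Each box is a (lo, hi) pair of (x, y, z) tuples."""
    if prev_box is None:  # t = 0: no previous iteration to smooth against
        return new_box
    return tuple(
        tuple(alpha_r * p + (1.0 - alpha_r) * n for p, n in zip(prev_c, new_c))
        for prev_c, new_c in zip(prev_box, new_box))
```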
Some embodiments include data from two or more previous iterations in the calculation of the smoothed bounding box.
Although the discussion with respect to
The overlap test for two bounding boxes R1 and R2 can be expressed as |R1∩R2|v/|R1|v > TR and |R1∩R2|v/|R2|v > TR.
In this equation, |X|v indicates the volume (e.g., area) of any region X. TR is the value of a predefined overlap threshold, which indicates how much the bounding boxes should overlap as a fraction of their volumes. When both of the fractions exceed the threshold TR, the sources are assumed to be in the same room. The value of TR influences how aggressive the clustering will be across different audio sources. A higher threshold value means that two sources must have similar bounding boxes in order to be clustered together. However, if the threshold is set too high, it would limit the amount of clustering performed. A lower threshold value increases the amount of clustering performed, but can create strange behavior when it is too small (e.g., clustering sources together through walls). Some embodiments use a threshold value of 0.5 as a compromise between the two extremes.
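As a concrete, hypothetical example of how the TR = 0.5 threshold behaves, the sketch below computes the minimum overlap fraction for two rectangular (2-D) bounding boxes; the function name and the example boxes are assumptions.

```python
def min_overlap_fraction(box1, box2):
    """Each 2-D box is (x_min, y_min, x_max, y_max). Returns the smaller of
    the fractions |R1∩R2| / |R1| and |R1∩R2| / |R2|."""
    def area(b):
        return max(b[2] - b[0], 0.0) * max(b[3] - b[1], 0.0)
    overlap = area((max(box1[0], box2[0]), max(box1[1], box2[1]),
                    min(box1[2], box2[2]), min(box1[3], box2[3])))
    return min(overlap / area(box1), overlap / area(box2))

# Hypothetical rooms: a 4 x 3 box and a copy of it shifted by 1 along x.
r1 = (0.0, 0.0, 4.0, 3.0)
r2 = (1.0, 0.0, 5.0, 3.0)
print(min_overlap_fraction(r1, r2))  # 0.75 > 0.5, so same room at T_R = 0.5
```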
With a threshold of 0.5, the bounding boxes R1 and R2 in
The overlap test described here can be applied to either the bounding boxes (e.g., computed directly using the rays 208) or the smoothed bounding boxes, as described above in
Some embodiments set the angular threshold at 10 degrees. When the directions from the listener to two audio sources differ by less than 10°, it is difficult for the listener to detect a difference in location; angular differences greater than 10° become noticeable. Other embodiments use different threshold values, such as an angular threshold between 8° and 12°.
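The sketch below, using assumed positions and a hypothetical helper function, shows how the angle subtended at the listener might be computed and compared against such a threshold.

```python
import math

def subtended_angle_deg(listener, src1, src2):
    """Angle, at the listener, between the rays toward the two sources."""
    v1 = [s - l for s, l in zip(src1, listener)]
    v2 = [s - l for s, l in zip(src2, listener)]
    cos_t = (sum(a * b for a, b in zip(v1, v2))
             / (math.hypot(*v1) * math.hypot(*v2)))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

# Two sources 1 m apart, 10 m in front of the listener: roughly 5.7 degrees,
# which is below a 10-degree threshold, so the angle test permits clustering.
print(subtended_angle_deg((0.0, 0.0, 0.0), (10.0, 0.5, 0.0), (10.0, -0.5, 0.0)))
```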
In
In some embodiments, each termination point of a respective ray 208 emanating from a respective point corresponding to a respective audio source 174 comprises (608) a location in the virtual environment where either (1) the respective ray 208 encounters an object in the virtual environment or (2) the respective ray exits from the virtual environment. In some instances, the object is a wall, such as the wall 204-1 in
As illustrated in
The device 100 applies (612) an overlap test to the bounding boxes 210 to determine whether the two audio sources are in the same room. For example, in
In some embodiments, the overlap test determines (614) whether the overlap between the two bounding boxes is more than a threshold fraction of each bounding box. This is described above with respect to
In some embodiments, to apply the overlap test, the device performs these steps: (1) determine respective volumes |R1|v and |R2|v for the bounding boxes R1 and R2; (2) determine the volume |R1∩R2|v of overlap between the two bounding boxes; (3) compute the minimum overlap value min(|R1∩R2|v/|R1|v, |R1∩R2|v/|R2|v);
and (4) determine (624) that the two audio sources are in the same room when the minimum overlap value exceeds a predefined threshold fraction TR. This technique of measuring overlap acts indirectly to determine whether the two audio sources are in the same room. When the overlap test is applied in only two dimensions (as depicted in
Although described with respect to two audio sources, the same methodology can be applied to three or more audio sources. In some embodiments, when there are three or more audio sources, they are all considered to be in the same room when every pair of the three or more sources satisfies the overlap test.
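A minimal sketch of this pairwise check is shown below; same_room is a placeholder for the pairwise overlap test described above.

```python
from itertools import combinations

def all_pairs_same_room(boxes, same_room):
    """Treat three or more sources as sharing a room only when every pair of
    their bounding boxes passes the pairwise overlap test same_room(a, b)."""
    return all(same_room(a, b) for a, b in combinations(boxes, 2))
```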
In addition to the overlap test, embodiments apply an angle test, which measures the angle between the two sources from the perspective of the listener. This is illustrated in
When the two audio sources 174 are determined to be in the same room and the angle θ is less than the predetermined threshold angle Tθ, the device clusters (630) the two audio sources together, including rendering combined audio for the two audio sources, from a single cluster audio location. In some embodiments, the single cluster audio location is (632) a centroid of the points in the virtual environment corresponding to the two audio sources.
Conversely, when the two audio sources are determined not to be in the same room or the angle θ is greater than the threshold angle Tθ, the device forgoes (634) clustering the two audio sources. In this case, the device renders audio for the virtual environment without combining audio for the two audio sources.
Clustering audio sources is not a one-time event. Because the listener and objects in the virtual environment (e.g., audio sources) can move, clustering has to be reevaluated dynamically. The process to recheck clustering is typically applied at regular intervals (e.g., once every 100 milliseconds). Some embodiments apply cluster recalibration more frequently. When the clustering algorithm is reapplied, old clusters can be broken up, and new clusters can be created. A given cluster may lose some audio sources and gain some other audio sources. In general, the clustering algorithm reapplies both the overlap test and the angle test.
In some embodiments, after clustering the two audio sources together and passage of a time interval Δt, the device performs (636) these steps: (1) determine (638) updated respective volumes |R1′|v and |R2′|v for the updated bounding boxes R1′ and R2′ corresponding to the two audio sources; (2) determine (640) an updated volume |R1′∩R2′|v of overlap between the two updated bounding boxes; and (3) compute (642) an updated minimum overlap value min(|R1′∩R2′|v/|R1′|v, |R1′∩R2′|v/|R2′|v).
When the updated minimum overlap value is less than a predetermined split threshold fraction TRsplit, the device de-clusters (644) the two audio sources. The predetermined split threshold fraction TRsplit is less than the threshold fraction TR. On the other hand, when the updated minimum overlap value is greater than the split threshold fraction TRsplit, the device maintains (646) clustering of the two audio sources.
In some embodiments, after (648) clustering the two audio sources together and passage of a time interval Δt, the device measures (650) an updated angle θ′ formed according to rays from an updated location of the listener to updated points in the virtual environment corresponding to the two audio sources. When the updated angle θ′ is greater than a predetermined split threshold angle Tθsplit, the device de-clusters (652) the two audio sources. The predetermined split threshold angle Tθsplit is greater than the threshold angle Tθ. On the other hand, when the updated angle θ′ is less than the split threshold angle Tθsplit, the device maintains (654) clustering of the two audio sources.
While the method 600 includes a number of operations shown in a specific order, the method 600 is not limited to this specific set of operations or this specific order. Some embodiments include more or fewer operations. In some instances, the operations can be executed serially or in parallel, the order of two or more operations may be changed, and/or two or more operations may be combined into a single operation.
In addition to the criteria for clustering, each implementation also has criteria to recognize when sources should be split from their clusters (forming new smaller clusters). A good implementation does not quickly split then merge sources again. To avoid this, the thresholds used for clustering and de-clustering are not the same. For example, when the angle threshold Tθ is 10°, a reasonable value for the split angle threshold Tθsplit is 12°. To form a cluster, the spread of the audio sources has to be less than 10°, but to split up an existing cluster requires an angular spread of 12° or more. Similarly, if the overlap threshold TR is 0.5, a reasonable value for the split overlap threshold TRsplit is 0.25. In this case, there must be at least 50% overlap in order to make a cluster, but the overlap has to fall below 25% to break up an existing cluster. The difference between the splitting and clustering thresholds is sometimes referred to as hysteresis.
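The sketch below illustrates this hysteresis using the example values above (TR = 0.5, TRsplit = 0.25, Tθ = 10°, Tθsplit = 12°); the function name and signature are assumptions.

```python
def update_cluster_state(clustered, min_overlap, theta_deg,
                         t_r=0.5, t_r_split=0.25,
                         t_theta=10.0, t_theta_split=12.0):
    """Hysteresis between clustering and de-clustering: forming a cluster uses
    the stricter thresholds (t_r, t_theta); an existing cluster is split only
    when the overlap falls below t_r_split or the angular spread exceeds
    t_theta_split. Returns True when the pair should be (or remain) clustered."""
    if clustered:
        return min_overlap >= t_r_split and theta_deg <= t_theta_split
    return min_overlap > t_r and theta_deg < t_theta
```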
Some embodiments iterate the following clustering algorithm at the beginning of each simulation update. The main steps of this algorithm are:
This algorithm creates clusters that attempt to minimize perceivable error in the audio while also clustering aggressively. When sources are clustered, the impact is that all sources within a cluster will emit the same sound (a mix of their anechoic audio) from the union of their geometry. All sources within a cluster are treated as a single “source” within the simulation, and are rendered together as a single “source.”
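A minimal sketch of this rendering step is shown below, assuming equal-length anechoic buffers and a hypothetical spatialize renderer; placing the proxy at the centroid of the member positions is one option (the text above also mentions emitting from the union of the sources' geometry).

```python
import numpy as np

def render_cluster(sources, spatialize):
    """Mix the anechoic audio of all sources in a cluster and render the mix
    once as a single proxy source. `sources` is a list of (position,
    audio_buffer) pairs with equal-length float buffers; `spatialize(audio,
    position)` stands in for the system's propagation/HRTF renderer."""
    positions = np.array([pos for pos, _ in sources], dtype=np.float32)
    buffers = np.stack([buf for _, buf in sources])
    centroid = positions.mean(axis=0)  # single cluster audio location
    mix = buffers.sum(axis=0)          # shared anechoic mix for the cluster
    return spatialize(mix, centroid)
```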
The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the discussions above are not intended to be exhaustive or to limit the scope of the claims to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen in order to best explain the principles underlying the claims and their practical applications, to thereby enable others skilled in the art to best use the embodiments with various modifications as are suited to the particular uses contemplated.