 
                 Patent Application
                     20250193623
                    The contemplated embodiments relate generally to audio systems and, more specifically, to proximity-based audio panning of a virtual sound source.
The use of virtual reality (VR) and augmented reality (AR) in many applications is gaining popularity. VR is an interactive experience that typically supplants the real-world environment of a user with an environment simulated via auditory and video feedback, while AR is an interactive experience featuring a combination of real and virtual worlds, real-time interaction, and accurate 3D registration of virtual and real objects. VR and AR are intended to provide a user with an immersive experience of either a virtual world or of the real world augmented with virtual sensory information. VR and AR applications now include entertainment (e.g., video games and multi-media entertainment), education (e.g., medical and military training), and business (e.g., virtual meetings or other interactions).
A significant challenge in VR and AR applications is how to map virtual sound sources to real speakers in a listening area to create an immersive and realistic audio experience. Creating such an accurate spatial audio scene using multiple speakers requires information about where the speakers are in the listening room so that each speaker appropriately processes an input audio signal that simulates the correct physical position of a virtual sound source. Tracking precise locations of speakers can be complicated and requires expensive sensors, such as IR emitters/receivers, UWB emitters/receivers, and/or a game engine-like software system. These approaches significantly increase the cost and complexity of a VR or AR audio system, making these approaches impractical for small-format and low computing-capability products, such as low-cost toys, to generate an immersive audio scene in which the location of the toy or other virtual sound is accurately simulated.
As the foregoing illustrates, what is needed in the art are improved techniques for generating an immersive audio scene.
One embodiment of the present disclosure sets forth a computer-implemented method that includes determining a first distance between a first speaker and a virtual sound source; determining a second distance between a second speaker and the virtual sound source; generating a first audio output signal for the first speaker based on an input audio signal, the first distance, and the second distance; generating a second audio output signal for the second speaker based on the input audio signal, the first distance, and the second distance; transmitting the first audio output signal to the first speaker for output; and transmitting the second audio output signal to the second speaker for output.
At least one technical advantage of the disclosed techniques relative to the prior art is that, with the disclosed techniques, a spatial audio scene can be created that includes a virtual sound source, such as a toy or other object for which an audio signal representing the virtual sound source is generated by speakers that are physically separate from the virtual sound source. With the disclosed techniques, the spatial audio scene can be produced using proximity sensors that measure the distance from the virtual sound source to each speaker that generates the audio signal representing the virtual sound source. As a result, the disclosed techniques can produce an immersive spatial audio mix with fewer hardware components and reduced software processing than other immersive spatial audio approaches. Another advantage of the disclosed techniques is that the techniques are flexible with respect to various characteristics of the sound system producing a spatial audio scene, such as the number of speakers included in the sound system, the locations of the speakers within the listening area, and the number and location of virtual sound sources in the spatial audio scene. A further advantage of the disclosed techniques is that the techniques are readily implemented with real-time audio processing, which is important for immersive applications where audio needs to be processed and distributed quickly to maintain a certain audio experience. These technical advantages represent one or more technological improvements over prior art approaches.
So that the manner in which the above recited features of the various embodiments can be understood in detail, a more particular description of the inventive concepts, briefly summarized above, may be had by reference to various embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of the inventive concepts and are therefore not to be considered limiting of scope in any way, and that there are other equally effective embodiments.
    
    
    
    
    
    
In the following description, numerous specific details are set forth to provide a more thorough understanding of the various embodiments. However, it will be apparent to one skilled in the art that the inventive concepts may be practiced without one or more of these specific details.
According to various embodiments, a spatial audio scene is produced by an audio system using proximity-based audio-object panning. In the embodiments, the spatial audio scene includes at least one object, such as a toy, that operates as a virtual sound source within the spatial audio scene. To produce the spatial audio scene, an audio signal representing the virtual sound source is generated by speakers of the audio system that are physically separate from, and therefore not contained within, the virtual sound source. In the embodiments, the audio signal generated by each speaker varies in volume based on a change in location of the virtual sound source relative to the speakers. As a result, the spatial audio scene provides a user proximate the audio system with an immersive audio experience in which motion of the virtual sound source is reflected by proximity-based audio-object panning. In the embodiments, the spatial audio scene can be produced using proximity sensors that measure a distance from the virtual sound source to each speaker, where a gain value for each speaker is determined based on the measured distance from the virtual sound source to that speaker. Because the gain value for each speaker is based on measured distances between the virtual sound source and the speakers, a complex two- or three-dimensional map of the relative positions of the speakers and the virtual sound source(s) is not required to produce the spatial audio scene. As a result, the audio system produces an immersive spatial audio mix with fewer hardware components and reduced software processing than other immersive spatial audio approaches.
  
In operation, when outputting an audio signal corresponding to virtual sound source 102, audio system 100 uses measured distances between virtual sound source 102 corresponding to a physical device (e.g., a toy or other object that may not have audio output capabilities) and each of the plurality of speakers 160 to generate gain settings for each of the plurality of speakers 160. In particular, audio panning application 120 uses distance sensor(s) 150 to determine the distance between virtual sound source 102 and each of speakers 160. Audio panning application 120 then uses the measured distances and one or more gain curves to determine a gain value for each of the plurality of speakers 160. The gain values are then used to control the volume of the audio signal output by each speaker 160 to create sound associated with virtual sound source 102.
Distance sensor(s) 150 include various types of sensors for measuring a distance between virtual sound source 102 and each of speakers 160. In some embodiments, distance sensor(s) 150 are proximity sensors. Distance sensor(s) 150 can use any technically feasible distance measuring techniques including, but not limited to the use of ultrasonics, infrared light, computer imaging modalities, and/or the like. In some embodiments, one or more distance sensors 150 are disposed within virtual sound source 102. For example, in some embodiments, distance sensors 150 include a microphone disposed within virtual sound source 102 that enables detection of inaudible audio signals generated by each speaker 160 to determine a distance between virtual sound source 102 and each speaker. In another example, in some embodiments, distance sensors 150 include an ultrasonic sensor that measures a distance to a speaker 160 by transmitting sound waves toward the speaker 160 and measuring the time interval required for a portion of the transmitted sound waves to be reflected back to the ultrasonic sensor. Additionally or alternatively, in some embodiments, one or more distance sensors 150 are disposed within each speaker 160. Additionally or alternatively, in some embodiments, one or more distance sensors 150 of audio system 100 are physically separate from both virtual sound source 102 and speakers 160.
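The ultrasonic time-of-flight measurement described above can be illustrated with a minimal sketch. The function name `ultrasonic_distance_m`, the default speed of sound (343 m/s, roughly the value in dry air at 20 °C), and the parameter names are assumptions made for illustration and do not appear in the disclosure.

```python
def ultrasonic_distance_m(round_trip_s, speed_of_sound_m_s=343.0):
    """Estimate the one-way distance to a reflector from an echo delay.

    round_trip_s: time between emitting the pulse and receiving its echo.
    speed_of_sound_m_s: assumed propagation speed (hypothetical default).
    """
    if round_trip_s < 0:
        raise ValueError("round-trip time must be non-negative")
    # The pulse travels to the speaker and back, so halve the path length.
    return speed_of_sound_m_s * round_trip_s / 2.0


if __name__ == "__main__":
    # A 10 ms echo delay corresponds to a bit under two metres.
    print(ultrasonic_distance_m(0.010))
```

In practice the speed of sound varies with temperature, so a real sensor would likely need calibration; the sketch ignores that.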
In some embodiments, the audio system 100 includes other types of sensors in addition to the distance sensor(s) 150 to acquire information about the acoustic environment. Other types of sensors include cameras, a quick response (QR) code tracking system, motion sensors, such as an accelerometer or an inertial measurement unit (IMU) (e.g., a three-axis accelerometer, gyroscopic sensor, and/or magnetometer), pressure sensors, and so forth. In addition, in some embodiments, distance sensor(s) 150 can include wireless sensors, including radio frequency (RF) sensors (e.g., radar), acoustic sensors (e.g., sonar), and/or sensors that use wireless communications protocols, including Bluetooth, Bluetooth low energy (BLE), cellular protocols, and/or near-field communications (NFC).
Each of the plurality of speakers 160 can be any technically feasible type of audio outputting device. For example, in some embodiments, the plurality of speakers 160 includes one or more digital speakers that receive an audio output signal in a digital form and convert the audio output signals into air-pressure variations or sound energy via a transducing process. According to various embodiments, each of the plurality of speakers 160 generates an audio signal (outputs sound) for virtual sound source 102 at a volume or gain level determined by audio panning application 120.
Computing device 110 enables implementation of the various embodiments described herein. In the embodiment illustrated in 
Processing unit 112 can be any suitable processor, such as a central processing unit (CPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), and/or any other type of processing unit, or a combination of different processing units, such as a CPU configured to operate in conjunction with a GPU and/or a DSP. In general, processing unit 112 can be any technically feasible hardware unit capable of processing data and/or executing software applications, such as audio panning application 120.
Memory 114 can include a random-access memory (RAM) module, a flash memory unit, or any other type of memory unit or combination thereof. Processing unit 112 is configured to read data from and write data to memory 114. In various embodiments, memory 114 includes non-volatile memory, such as optical drives, magnetic drives, flash drives, or other storage. In some embodiments, separate data stores, such as external data stores included in a network (“cloud storage”), can supplement memory 114. Audio panning application 120 within memory 114 can be executed by processing unit 112 to implement the overall functionality of computing device 110 and, thus, to coordinate the operation of audio system 100 as a whole. In various embodiments, an interconnect bus (not shown) connects processing unit 112, memory 114, speaker(s) 160, distance sensor(s) 150, and any other components of computing device 110.
In various embodiments, computing device 110 is included in virtual sound source 102, allowing virtual sound source 102 to use audio panning application 120 to generate a suitable sound level for each speaker 160 that generates an audio output representing virtual sound source 102. Alternatively, in some embodiments, computing device 110 is separate from virtual sound source 102 and distance sensor(s) 150 are disposed within virtual sound source 102. One such embodiment is described below in conjunction with 
  
In operation, distance sensor 150 determines a distance D1 between virtual sound source 102 and first speaker 262, a distance D2 between virtual sound source 102 and second speaker 264, and a distance D3 between virtual sound source 102 and third speaker 266. Distance sensor 150 then transmits distances D1, D2, and D3, to computing device 110, so that audio panning application 120 can determine a suitable gain value G1 for first speaker 262, a suitable gain value G2 for second speaker 264, and a suitable gain value G3 for third speaker 266. Techniques for determining gain value G1, gain value G2, and gain value G3 are described below in conjunction with 
  
In operation, distance sensor 352 determines a distance D1 between virtual sound source 102 and first speaker 362, distance sensor 354 determines a distance D2 between virtual sound source 102 and second speaker 364, and distance sensor 356 determines a distance D3 between virtual sound source 102 and third speaker 366. Distance sensor 352 transmits distance D1 to computing device 110, distance sensor 354 transmits distance D2 to computing device 110, and distance sensor 356 transmits distance D3 to computing device 110. Audio panning application 120 can then determine a suitable gain value G1 for first speaker 362 based on distance D1, a suitable gain value G2 for second speaker 364 based on distance D2, and a suitable gain value G3 for third speaker 366 based on distance D3. Once computing device 110 determines gain value G1, gain value G2, and gain value G3, computing device 110 generates a first audio output signal 302 to first speaker 362, a second audio output signal 304 to second speaker 364, and a third audio output signal 306 to third speaker 366. In some embodiments, computing device 110 generates first audio output signal 302, second audio output signal 304, and third audio output signal 306 based on an input audio signal 310 and a particular gain value, where input audio signal 310 represents a sound associated with virtual sound source 102. In such embodiments, computing device 110 generates first audio output signal 302 by modifying input audio signal 310 with gain value G1, second audio output signal 304 by modifying input audio signal 310 with gain value G2, and third audio output signal 306 by modifying input audio signal 310 with gain value G3. Computing device 110 then transmits first audio output signal 302 to first speaker 362, second audio output signal 304 to second speaker 364, and third audio output signal 306 to third speaker 366. First speaker 362 then generates an audio output (not shown) based on first audio output signal 302. 
Similarly, second speaker 364 generates an audio output (not shown) based on second audio output signal 304, and third speaker 366 generates an audio output (not shown) based on third audio output signal 306. In this way, audio from one or more virtual sound sources (e.g., virtual sound source 102) is distributed to an arbitrary number of physical speakers (e.g., first speaker 362, second speaker 364, and third speaker 366) so that localization of the virtual sound source within listening area 200 is perceptually accurate.
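The step in which each audio output signal is generated by modifying the input audio signal with a per-speaker gain value can be sketched as a simple scaling of samples. This is only one plausible reading of "modifying": the function name `render_outputs` and the representation of audio as a list of floating-point samples are assumptions for illustration.

```python
def render_outputs(input_samples, gains):
    """Return one gain-scaled copy of the input signal per speaker.

    input_samples: mono input audio as a list of floats (assumed format).
    gains: one linear gain value per speaker, e.g. [G1, G2, G3].
    """
    # Each speaker receives the same input signal scaled by its own gain.
    return [[gain * sample for sample in input_samples] for gain in gains]


if __name__ == "__main__":
    signal = [0.5, -0.5, 1.0]
    out1, out2, out3 = render_outputs(signal, [1.0, 0.5, 0.0])
    print(out1, out2, out3)
```

A real-time implementation would typically process audio in fixed-size blocks and smooth gain changes between blocks to avoid audible clicks; neither is shown here.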
According to various embodiments, an audio system produces a spatial audio scene using proximity-based audio-object panning. In particular, in an audio system that includes multiple speakers and a virtual sound source, as the virtual sound source is moved away from a first speaker and closer to a second speaker, a gain value by which the first speaker modifies an audio input signal decreases and a gain value by which the second speaker modifies the audio input signal increases. The resultant effect is that the perceived location of the source of sound (sometimes referred to as a “phantom sound image”) moves away from the first speaker and closer to the second speaker, which produces the desired result: the phantom sound image follows the virtual sound source as the virtual sound source moves within a listening area. Thus, as the virtual sound source is moved among and/or around the speakers of the audio system, the audio system generates a different audio output from each speaker, and the perceived location of the virtual sound source matches the physical location of the virtual sound source.
In some embodiments, audio panning application 120 employs a gain computation for determining a gain value for each speaker. In such embodiments, the gain value for each speaker is based on the distance between the virtual sound source and the speaker, where a gain value for a particular speaker is functionally equivalent to a volume knob setting for that particular speaker. In some embodiments, the audio signal representing sound virtually generated by the virtual sound source is assumed to remain approximately constant and each speaker of the audio system is assumed to have approximately equal sensitivity (e.g., change in output volume) to changes in the gain value associated with that speaker. In such embodiments, the sum of the squares of the gain values equals a constant value. Alternatively, when one or more speakers are known to have different sensitivities, an appropriate offset gain value can be applied to one or more speakers to compensate for such a difference in sensitivity to changes in the gain values determined by audio panning application 120.
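The constant-sum-of-squares condition described above can be written compactly. In the sketch below, N (the number of speakers), C (the constant), θ (a panning parameter), and k_i (a per-speaker sensitivity offset) are illustrative symbols that do not appear in the disclosure; the two-speaker sine/cosine pair is one common choice that satisfies the condition, not necessarily the one used in the embodiments.

```latex
\sum_{i=1}^{N} G_i^{2} = C
\qquad \text{e.g., for } N = 2:\quad
G_1 = \sqrt{C}\,\cos\theta,\quad
G_2 = \sqrt{C}\,\sin\theta,\quad
0 \le \theta \le \tfrac{\pi}{2}

% Hypothetical sensitivity compensation applied after the gain computation:
\tilde{G}_i = k_i \, G_i
```

Because cos²θ + sin²θ = 1, the two-speaker pair keeps the summed squared gain at C for every value of θ.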
In some embodiments, the gain computation employed by audio panning application 120 is based on a simple panning algorithm to determine a gain value for each speaker. In such embodiments, the panning algorithm can be represented by a cross-fader curve. For clarity of description, use of a cross-fader curve for determining gain values is described herein with respect to an audio system that includes two speakers (e.g., first speaker 262 and second speaker 264 of 
  
Gain values for the first speaker and the second speaker are determined with cross-fader curve 400 based on distance D1 (between virtual sound source 102 and the first speaker) and distance D2 (between virtual sound source 102 and the second speaker). In some embodiments, a distance ratio of D1 and D2 is computed and used as an input for first gain curve 410 and second gain curve 420. For example, in the embodiment illustrated in 
In the embodiment described in conjunction with 
  
  
  
  
  
In the above-described embodiments, determination of gain values for an audio system that includes two speakers is described using a relatively simple panning algorithm that employs a cross-fader curve. In other embodiments, gain values can be determined for an audio system that includes an arbitrary number of speakers. In such embodiments, a surround-sound panning algorithm can be employed to describe suitable gain curve functions for each of the arbitrary number of speakers. The analytical expressions for surround-sound panning algorithms are easily derived from the cross-fader curves of 
  
As shown, a method 600 begins at step 602, where audio panning application 120 determines the distance between virtual sound source 102 and each speaker, such as distances D1, D2, and D3. Generally, a different distance is determined for each speaker of audio system 100. In some embodiments, one or more distance sensors 150 that are disposed within virtual sound source 102 are employed to determine distances D1, D2, and D3. Alternatively, in some embodiments, a different distance sensor disposed within each speaker of audio system 100 is employed to determine distances D1, D2, and D3.
At step 604, audio panning application 120 determines a distance ratio for the current location of virtual sound source 102. For example, in some embodiments, the distance ratio for the current location of virtual sound source 102 is a ratio of a first distance between one speaker of audio system 100 and virtual sound source 102 and a second distance between a second speaker of audio system 100 and virtual sound source 102.
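A minimal sketch of step 604 follows. The disclosure does not fix the exact form of the distance ratio here, so the normalized form D1 / (D1 + D2) used below, the function name `distance_ratio`, and the handling of the degenerate case are all assumptions chosen so that the result always lies between 0 and 1.

```python
def distance_ratio(d1, d2):
    """Map two source-to-speaker distances to a position in [0, 1].

    Returns 0.0 when the source is at the first speaker (d1 == 0) and
    1.0 when it is at the second speaker (d2 == 0). This normalized
    form is an assumed choice, not taken from the disclosure.
    """
    if d1 < 0 or d2 < 0:
        raise ValueError("distances must be non-negative")
    total = d1 + d2
    if total == 0:
        # Degenerate case: both distances are zero; treat as centred.
        return 0.5
    return d1 / total


if __name__ == "__main__":
    print(distance_ratio(1.0, 3.0))  # source is closer to the first speaker
```

A bounded ratio of this kind is convenient as the horizontal axis of a cross-fader curve, since it does not grow without limit as one distance approaches zero.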
At step 606, audio panning application 120 determines a gain value for each speaker of audio system 100. In embodiments in which audio system 100 includes two speakers, the gain value for each speaker can be determined based on the distance ratio determined in step 604. In embodiments in which audio system 100 includes three or more speakers, a more complex algorithm can be employed for determining gain values instead of using a distance ratio. For example, in such embodiments, a gain value for each speaker can be determined based on distances D1, D2, and D3 and a suitable surround-sound panning algorithm.
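Step 606 can be sketched for both cases. The two-speaker function below uses a standard equal-power (sine/cosine) cross-fade, and the multi-speaker function substitutes a simple inverse-distance weighting for the surround-sound panning algorithm, which the disclosure does not specify. The function names, the `position` input in [0, 1], and the `min_distance` clamp are all hypothetical.

```python
import math


def crossfade_gains(position):
    """Equal-power cross-fade for two speakers.

    position: 0.0 means the source is at the first speaker, 1.0 means
    it is at the second speaker (an assumed normalized input).
    Returns (g1, g2) with g1**2 + g2**2 == 1 up to rounding.
    """
    p = min(max(position, 0.0), 1.0)  # clamp to the valid range
    angle = p * math.pi / 2.0
    return math.cos(angle), math.sin(angle)


def inverse_distance_gains(distances, min_distance=0.05):
    """Assumed stand-in for a surround-sound panning algorithm.

    Weights each speaker by 1/distance, then scales the weights so the
    sum of squared gains equals 1. min_distance avoids division by zero.
    """
    weights = [1.0 / max(d, min_distance) for d in distances]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


if __name__ == "__main__":
    print(crossfade_gains(0.25))
    print(inverse_distance_gains([1.0, 2.0, 4.0]))
```

Both functions keep the sum of squared gains constant, so the overall loudness stays roughly steady while the closest speaker receives the largest gain.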
At step 608, audio panning application 120 generates an audio signal for each speaker based on the gain value for that speaker and on an input audio signal (e.g., input audio signal 210) that represents a sound associated with virtual sound source 102. Generally, the audio signal for a particular speaker is generated by modifying the input audio signal with the gain value associated with that particular speaker.
At step 610, audio panning application 120 transmits the audio output signal for each speaker to the respective speaker. In some embodiments, the audio output signals are transmitted wirelessly, for example via Bluetooth, WiFi, or any other technically feasible wireless protocol.
In sum, techniques are disclosed for producing a spatial audio scene using proximity-based audio-object panning. In the embodiments, the spatial audio scene includes at least one object, such as a toy, that operates as a virtual sound source within the spatial audio scene. To produce the spatial audio scene, an audio signal representing the virtual sound source is generated by speakers of the audio system that are physically separate from, and therefore not contained within, the virtual sound source. In the embodiments, the audio signal generated by each speaker varies in volume based on a change in location of the virtual sound source relative to the speakers. In the embodiments, the spatial audio scene can be produced using proximity sensors that measure a distance from the virtual sound source to each speaker, where a gain value for each speaker is determined based on the measured distance from the virtual sound source to that speaker.
At least one technical advantage of the disclosed techniques relative to the prior art is that, with the disclosed techniques, a spatial audio scene can be created that includes a virtual sound source, such as a toy or other object for which an audio signal representing the virtual sound source is generated by speakers that are physically separate from the virtual sound source. With the disclosed techniques, the spatial audio scene can be produced using proximity sensors that measure the distance from the virtual sound source to each speaker that generates the audio signal representing the virtual sound source. As a result, the disclosed techniques can produce an immersive spatial audio mix with fewer hardware components and reduced software processing than other immersive spatial audio approaches. Another advantage of the disclosed techniques is that they are flexible with respect to various characteristics of the sound system producing a spatial audio scene, such as the number of speakers included in the sound system, the locations of the speakers within the listening area, and the number and location of virtual sound sources in the spatial audio scene. A further advantage of the disclosed techniques is that they are readily implemented with real-time audio processing, which is important for immersive applications where audio needs to be processed and distributed quickly to maintain a certain audio experience.
Aspects of the disclosure are also described according to the following clauses.
The descriptions of the various embodiments have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments.
Aspects of the present embodiments may be embodied as a system, method, or computer program product. Accordingly, aspects of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “module” or “system.” Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
Aspects of the present disclosure are described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, enable the implementation of the functions/acts specified in the flowchart and/or block diagram block or blocks. Such processors may be, without limitation, general purpose processors, special-purpose processors, application-specific processors, or field-programmable processors or gate arrays.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
While the preceding is directed to embodiments of the present disclosure, other and further embodiments of the disclosure may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.
This application claims the benefit of U.S. Provisional patent application titled, “PROXIMITY-BASED AUDIO OBJECT PANNING,” filed on Dec. 8, 2023, and having Ser. No. 63/607,819. The subject matter of this related application is hereby incorporated herein by reference.
| Number | Date | Country |
|---|---|---|
| 63607819 | Dec 2023 | US |