The present disclosure relates generally to sound capturing systems and, more specifically, to systems for capturing volumetric sounds using a plurality of microphones and projecting the generated volumetric sounds.
Audio is an integral part of multimedia content, whether viewed on a television, a personal computing device, a projector, or any other of a variety of viewing means. The importance of audio becomes increasingly significant when the content includes multiple sub-events occurring concurrently. For example, while viewing a sporting event, many viewers appreciate the ability to listen to conversations occurring between players, instructions given by a coach, exchanges of words between a player and an umpire, and similar verbal communications, simultaneously with the audio of the event itself.
The obstacle to providing such concurrent audio content is that currently available sound capturing devices, i.e., microphones, are unable to practically adjust to dynamic and intensive environments, such as a sporting event. Many current audio systems struggle to track a single player or coach as that person moves through space, and fall short of adequately tracking multiple concurrent audio events.
Commonly, a large microphone boom is used to move the microphone around in an attempt to capture the desired sound. This issue is becoming significantly more notable due to the advent of high-definition (HD) television that provides high-quality images on the screen with disproportionately low sound quality.
Demand for lifelike simulation is rapidly increasing, particularly for augmented and virtual reality experiences. Although current visual offerings have made significant progress, the corresponding audio has been left behind. One of the main reasons for this is that simulating an audio experience of a moving sound source requires overcoming various challenges relating to six degrees of freedom (6DoF).
It would therefore be advantageous to provide a solution that would overcome the challenges noted above.
A summary of several example embodiments of the disclosure follows. This summary is provided for the convenience of the reader to provide a basic understanding of such embodiments and does not wholly define the breadth of the disclosure. This summary is not an extensive overview of all contemplated embodiments, and is intended to neither identify key or critical elements of all embodiments nor to delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more embodiments in a simplified form as a prelude to the more detailed description that is presented later. For convenience, the term “certain embodiments” may be used herein to refer to a single embodiment or multiple embodiments of the disclosure.
Certain embodiments disclosed herein include a method for volumetric sound generation, including: generating multiple sound beams from a plurality of microphones within a three-dimensional space; capturing multiple sound signals generated by multiple sound sources located within the three-dimensional space, where the multiple sound signals are captured based on the multiple sound beams; and generating a pattern for each of the multiple sound sources.
Certain embodiments disclosed herein also include a non-transitory computer readable medium having stored thereon instructions for causing a processing circuitry to perform a process, the process including: generating multiple sound beams from a plurality of microphones within a three-dimensional space; capturing multiple sound signals generated by multiple sound sources located within the three-dimensional space, where the multiple sound signals are captured based on the multiple sound beams; and generating a pattern for each of the multiple sound sources.
Certain embodiments disclosed herein also include a system for volumetric sound generation, including: a processing circuitry; and a memory, the memory containing instructions that, when executed by the processing circuitry, configure the system to: generate multiple sound beams from a plurality of microphones within a three-dimensional space; capture multiple sound signals generated by multiple sound sources located within the three-dimensional space, where the multiple sound signals are captured based on the multiple sound beams; and generate a pattern for each of the multiple sound sources.
The subject matter disclosed herein is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the disclosed embodiments will be apparent from the following detailed description taken in conjunction with the accompanying drawings.
It is important to note that the embodiments disclosed herein are only examples of the many advantageous uses of the innovative teachings herein. In general, statements made in the specification of the present application do not necessarily limit any of the various claimed embodiments. Moreover, some statements may apply to some inventive features but not to others. In general, unless otherwise indicated, singular elements may be in plural and vice versa with no loss of generality. In the drawings, like numerals refer to like parts through several views.
The various disclosed embodiments include a method and sound processing system for generating volumetric sounds based on a plurality of sound signals generated by a plurality of sound sources in a three-dimensional space. In an example embodiment, the system includes a plurality of microphones located in proximity to the three-dimensional space. The microphones may be positioned in one or more microphone arrays. The microphones are configured to generate a plurality of receptive sound beams. Responsive to the sound beams, a plurality of sound signals generated within the three-dimensional space by each of the plurality of sound sources are captured. The system is then configured to generate a pattern for each sound source. The pattern indicates directional coordinates of the sound source, volume characteristics, angles, and the like. Based on the patterns, the system is configured to generate volumetric sounds with respect to the various sound signals. According to an embodiment, the volumetric sounds enable simulation of an audio experience at certain locations in the space.
In one embodiment, the sound processing system 100 may further include a storage in the form of a data storage unit 140 or a database (not shown) for storing, for example, sound signals, patterns, metadata, information from filters, and/or other information captured by the sound sensor 110. The data storage 140 may be located on premises, or may be located remotely, e.g., within a networked cloud storage system.
The filters employed may include circuits working within a predetermined audio frequency range that are used to process the sound signals captured by the sound sensor 110. The filters may be preconfigured, or may be dynamically adjusted with respect to the received metadata.
In various embodiments, one or more of the sound sensors 110, a beam synthesizer 120, and a sound analyzer 130 may be coupled to the data storage unit 140. In another embodiment, the sound processing system 100 may further include a controller (not shown) connected to the beam synthesizer 120. The controller may further include a user interface that allows tracking of a sound source as further described herein below.
According to an embodiment, multiple sound beams are generated within a predetermined space, for example, a sports field or court, an avenue, a show, and the like. Responsive thereto, multiple sound signals generated within the three-dimensional space by each of multiple sound sources located therein are captured. Thereafter, a pattern is generated for each sound source based on the captured sound signals. The generation of the pattern may include calculation of a sample of the pattern for each sound signal corresponding to the associated sound beam. The generated samples may be interpolated and, based on the interpolation, a three-dimensional pattern of each source can be generated.
According to an embodiment, metadata associated with each sound signal may further be captured by the sound sensor 110. The synthesizer 120 is configured to project the captured sound signals onto a grid corresponding to the predetermined space. The grid may be adaptive through time and configured to enable characterization of the captured sound signals, as further described herein below. According to an embodiment, the grid may be used for identification of interest points within the predetermined space.
As a non-limiting example, upon identification of multiple sound signals captured at a certain portion of the grid, such a portion may be determined as an interest point. As an example of this embodiment, in a basketball game, an interest point may be determined to be the area near the basket. In an embodiment, the interest point may include an area where sound interaction is above a predefined threshold, e.g., if a conversation or single speaker is speaking above 70 decibels.
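The interest-point rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function name `find_interest_points`, the square two-dimensional cells, the `cell_size` and `min_signals` parameters, and the choice to treat either condition (several signals in one cell, or a level at or above the threshold) as sufficient are all hypothetical. Only the 70 decibel example threshold comes from the text.

```python
import math
from collections import defaultdict


def find_interest_points(events, cell_size=1.0, min_signals=2,
                         level_threshold_db=70.0):
    """Return grid cells that qualify as interest points.

    events: iterable of (x, y, level_db) tuples, one per captured signal.
    A cell qualifies if it holds at least `min_signals` signals, or if
    any signal in it reaches `level_threshold_db` (hypothetical rule).
    """
    counts = defaultdict(int)
    peak_level = defaultdict(lambda: float("-inf"))
    for x, y, level_db in events:
        # Quantize the position onto the grid.
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        counts[cell] += 1
        peak_level[cell] = max(peak_level[cell], level_db)
    return sorted(
        cell for cell in counts
        if counts[cell] >= min_signals
        or peak_level[cell] >= level_threshold_db
    )


events = [(0.2, 0.3, 60.0), (0.7, 0.9, 55.0), (5.1, 5.5, 75.0), (9.0, 9.0, 40.0)]
interest = find_interest_points(events)
```

In this example, cell (0, 0) qualifies because two signals fall in it, and cell (5, 5) qualifies because one signal exceeds the level threshold.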
Following the projection of the sound signals on the grid, the sound signals are analyzed by the sound analyzer 130. The analysis may include one or more beamforming techniques. In an embodiment, the analysis is performed in the time domain. According to this embodiment, an extracted filter is applied to each sound signal. In an embodiment, the filter may be applied by the synthesizer 120. The filtered signals may be summed into a single signal by, e.g., the synthesizer 120.
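The time-domain analysis described above, in which a filter is applied to each sound signal and the filtered signals are summed, corresponds to what is commonly called filter-and-sum beamforming. The sketch below illustrates that general technique only. The function names and the representation of each filter as a list of finite impulse response (FIR) taps are assumptions, and the text does not specify how the filters are extracted.

```python
def fir_filter(signal, taps):
    """Apply a causal FIR filter (direct-form convolution) to one channel."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, tap in enumerate(taps):
            if n - k >= 0:
                acc += tap * signal[n - k]
        out.append(acc)
    return out


def filter_and_sum(channels, filters):
    """Filter each microphone channel with its own taps, then sum them.

    channels: list of equal-length sample lists, one per microphone.
    filters:  list of FIR tap lists, one per microphone (hypothetical).
    """
    if len(channels) != len(filters):
        raise ValueError("one filter is required per channel")
    length = len(channels[0])
    if any(len(ch) != length for ch in channels):
        raise ValueError("all channels must have the same length")
    output = [0.0] * length
    for signal, taps in zip(channels, filters):
        filtered = fir_filter(signal, taps)
        for n in range(length):
            output[n] += filtered[n]
    return output


# Channel 0 hears the pulse one sample earlier than channel 1, so its
# filter delays it by one sample; both are scaled by 0.5 before summing.
channels = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
filters = [[0.0, 0.5], [0.5]]
beam = filter_and_sum(channels, filters)
```

Here the two pulses are aligned by the filters and add coherently at sample index 1.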
In another embodiment, the analysis is performed in the frequency domain, in which the received sound signal is first segmented. In that embodiment, each of the segments is transformed by, for example, a one-dimensional fast Fourier transform (FFT) or a wavelet decomposition transformation.
The transformed segments are multiplied by weighted factors. The output is summed for each decomposition element and transformed by an inverse one-dimensional fast Fourier transform (IFFT) or a corresponding wavelet reconstruction transformation.
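The frequency-domain steps above (segment, transform, weight, sum per decomposition element, inverse transform) can be sketched as follows. This is a simplified illustration under several assumptions: it uses a direct discrete Fourier transform written with the standard library in place of an optimized FFT, it uses non-overlapping rectangular segments with no windowing or overlap-add, and the names `frequency_domain_beamform` and `weights` are hypothetical. Samples that do not fill a complete final segment are dropped.

```python
import cmath


def dft(x):
    """Direct O(N^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft(), returning complex time-domain samples."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


def frequency_domain_beamform(channels, weights, segment_len):
    """Weight each channel per frequency bin, sum, and transform back.

    channels: equal-length sample lists, one per microphone.
    weights:  weights[m][k] is the complex weighted factor for
              microphone m at frequency bin k (hypothetical layout).
    """
    length = len(channels[0])
    output = []
    for start in range(0, length - segment_len + 1, segment_len):
        summed = [0j] * segment_len
        for signal, w in zip(channels, weights):
            spectrum = dft(signal[start:start + segment_len])
            for k in range(segment_len):
                summed[k] += w[k] * spectrum[k]
        # Keep the real part; the imaginary part is rounding error when
        # the weights preserve conjugate symmetry.
        output.extend(sample.real for sample in idft(summed))
    return output


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
uniform = [[0.5] * 4, [0.5] * 4]  # equal weights for two identical channels
result = frequency_domain_beamform([signal, signal], uniform, segment_len=4)
```

With two identical channels and a weight of 0.5 in every bin, the output reproduces the input up to rounding error, which is a convenient sanity check for the transform pair.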
In an embodiment, one or more weighted factors are generated. The weighted factors are generated by a generalized side lobe canceller (GSC) algorithm. According to this embodiment, it is presumed that the direction of the sources from which sounds are received, the direction of the desired signal, and the magnitudes of those sources are known. The weighted factors are generated by determining a unit gain in the direction of the desired signal source while minimizing the overall root mean square (RMS) noise power.
According to another embodiment, the weighted factors are generated by an adaptive method in which the noise strength impinging on each microphone and the noise correlation between the microphones are tracked. In this embodiment, the direction of the desired signal source is received as an input. Based on the received parameters, the expected value of the output noise power is minimized while maintaining a unity gain in the direction of the desired signal. This process is performed separately for each sound interval.
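The constraint described in the two preceding paragraphs, a unity gain toward the desired source with minimum output noise power, matches a standard textbook formulation often called the minimum-variance distortionless-response (MVDR) beamformer. The following is offered only as one conventional way to write that optimization, not as the specific method of the disclosure. The symbols are assumptions: \(\mathbf{w}\) is the vector of weighted factors for one frequency bin, \(\mathbf{R}_n\) is the noise covariance matrix across the microphones, and \(\mathbf{d}(\theta_0)\) is the steering vector toward the desired direction \(\theta_0\).

```latex
\begin{aligned}
\min_{\mathbf{w}} \quad & \mathbf{w}^{H} \mathbf{R}_n \mathbf{w}
  && \text{(expected output noise power)} \\
\text{subject to} \quad & \mathbf{w}^{H} \mathbf{d}(\theta_0) = 1
  && \text{(unity gain toward the desired source)} \\[4pt]
\Longrightarrow \quad & \mathbf{w}
  = \frac{\mathbf{R}_n^{-1}\, \mathbf{d}(\theta_0)}
         {\mathbf{d}(\theta_0)^{H}\, \mathbf{R}_n^{-1}\, \mathbf{d}(\theta_0)}
\end{aligned}
```

In an adaptive setting, \(\mathbf{R}_n\) would be re-estimated for each sound interval from the tracked noise strengths and inter-microphone correlations, and the weights recomputed accordingly.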
When the disclosed embodiment is implemented to capture specific voices (sound signals) produced by an individual, the microphone array is configured to mute sounds that are generated by side lobes, thereby isolating the specific sound generated by the individual. This creates a sound beam, which allows a system to capture voices only existing within the sound beam itself, preferably with emphasis on the voice of the desired individual. In one embodiment the system is capable of identifying nearby sources of unwanted noise, and of muting such sources.
Beamforming techniques, sound signal filters, and weighted factors are described further in U.S. Pat. No. 9,788,108, assigned to the common assignee, which is hereby incorporated by reference.
Based on the captured sound signals and the patterns generated, multiple volumetric sounds are generated. The volumetric sounds can be used to simulate an audio experience from different locations within the three-dimensional space, i.e., six degrees of freedom therein.
According to an embodiment, the generated patterns may be represented in a higher order ambisonic (HOA) decomposition. HOA is a full-sphere surround sound technique that, in addition to the horizontal plane, incorporates sound sources above and below a sound capturing unit. The captured sound signals are transformed into HOA coefficients and can thus be delivered to an end user in a compact representation. The HOA representation of the sound sources may be transferred in an object-based form, i.e., with HOA coefficients for each object or scene, or as a combination thereof.
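To make the idea of ambisonic coefficients concrete, the sketch below encodes a single audio sample arriving from a known direction into first-order coefficients, the lowest order of the ambisonic family. It is an illustration only: the disclosure refers to higher orders and does not specify an ordering or normalization, so the ACN channel order (W, Y, Z, X) with SN3D normalization used here is an assumption, as are the function names. Azimuth is measured counterclockwise from the front and elevation upward from the horizontal plane.

```python
import math


def encode_first_order(sample, azimuth_rad, elevation_rad):
    """Encode one sample into first-order ambisonic coefficients.

    Returns (W, Y, Z, X) in ACN order with SN3D normalization
    (an assumed convention, not taken from the source text).
    """
    cos_el = math.cos(elevation_rad)
    return (
        sample,                                   # W: omnidirectional
        sample * math.sin(azimuth_rad) * cos_el,  # Y: left-right
        sample * math.sin(elevation_rad),         # Z: up-down
        sample * math.cos(azimuth_rad) * cos_el,  # X: front-back
    )


def encode_scene(sources):
    """Sum the coefficients of several (sample, azimuth, elevation) sources."""
    scene = [0.0, 0.0, 0.0, 0.0]
    for sample, azimuth_rad, elevation_rad in sources:
        encoded = encode_first_order(sample, azimuth_rad, elevation_rad)
        for i, coefficient in enumerate(encoded):
            scene[i] += coefficient
    return tuple(scene)


front = encode_first_order(1.0, 0.0, 0.0)
left = encode_first_order(1.0, math.pi / 2, 0.0)
scene = encode_scene([(1.0, 0.0, 0.0), (1.0, math.pi / 2, 0.0)])
```

Encoding each source separately corresponds to an object-based representation, while summing them as in `encode_scene` yields one set of coefficients for the whole scene.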
The processing circuitry 132 may be realized as one or more hardware logic components and circuits. For example, and without limitation, illustrative types of hardware logic components that can be used include field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), system-on-a-chip systems (SOCs), general-purpose microprocessors, microcontrollers, digital signal processors (DSPs), and the like, or any other hardware logic components that can perform calculations or other manipulations of information.
In another embodiment, the memory 134 is configured to store software. Software shall be construed broadly to mean any type of instructions, whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise. Instructions may include code (e.g., in source code format, binary code format, executable code format, or any other suitable format of code). The instructions cause the processing circuitry 132 to perform the sound analysis described herein.
The storage 136 may be magnetic storage, optical storage, and the like, and may be realized, for example, as flash memory or other memory technology, hard-drives, SSD, or any other medium which can be used to store the desired information. The storage 136 may store one or more sound signals, one or more grids associated with an area, interest points and the like.
The network interface 138 is configured to allow the sound analyzer 130 to communicate with the sound sensor 110, the data storage 140, and the beam synthesizer 120. The network interface 138 may include, but is not limited to, a wired interface (e.g., an Ethernet port) or a wireless port (e.g., an 802.11 compliant WiFi card) configured to connect to a network (not shown).
At S310, multiple sound beams are generated within a three-dimensional space. The sound beams may be generated by a plurality of microphones configured in one or more microphone arrays. The microphones in the microphone arrays may be positioned or otherwise arranged in a variety of polygons in order to achieve an appropriate coverage of the multiple sound beams. In yet another embodiment, the microphones in the microphone array are arranged on curved lines. Furthermore, the microphones in the microphone array may be arranged in a three-dimensional shape, for example, on a three-dimensional sphere or a three-dimensional object formed of a plurality of hexagons. The microphone arrays may be positioned or otherwise arranged at a predetermined distance from each other to achieve an appropriate coverage of the multiple sound beams. For example, two microphone arrays can be positioned under respective baskets of opposing teams in a basketball court.
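As a small illustration of one polygonal arrangement mentioned above, the sketch below computes microphone positions at the vertices of a regular polygon around a given center. The function name, the restriction to two dimensions, and the example radius and microphone count are assumptions chosen for illustration; the text does not prescribe any particular geometry or dimensions.

```python
import math


def polygon_array(center, radius, num_mics):
    """Return (x, y) positions of microphones on a regular polygon."""
    if num_mics < 3:
        raise ValueError("a polygon needs at least three vertices")
    cx, cy = center
    positions = []
    for i in range(num_mics):
        angle = 2.0 * math.pi * i / num_mics
        positions.append((cx + radius * math.cos(angle),
                          cy + radius * math.sin(angle)))
    return positions


# A hypothetical hexagonal array with a 2-unit radius around the origin.
mics = polygon_array((0.0, 0.0), 2.0, 6)
```

Calling the function once per array, with different centers, would give the positions for several arrays placed at a predetermined distance from each other.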
At S320, multiple sound signals generated within the three-dimensional space are captured based on the sound beams. The sound signals are generated by one or more sound sources located within the three-dimensional space. Sound sources may include, but are not limited to, individuals, groups of individuals, large crowds, ambient noise, and the like.
At S330, a pattern is generated for each sound source based on the sound signals generated therefrom. The pattern is indicative of characteristics associated with the sound source, for example, direction, volume, location coordinates within the three-dimensional space, and the like. The generation of the patterns may include calculation of a sample of the pattern for each sound signal corresponding to an associated sound beam. The generated samples may be interpolated and, based on the interpolation, a three-dimensional pattern of each source can be generated. In an embodiment, the generated patterns may be represented by higher order ambisonic (HOA) decomposition.
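The interpolation of pattern samples can be illustrated with a reduced, one-dimensional case. The sketch below assumes that each sound beam contributes one sample of the source's level at a known azimuth angle, and fills in the level at any other angle by circular linear interpolation. The function name, the azimuth-only simplification, and the choice of linear interpolation are all assumptions; the disclosure describes a three-dimensional pattern and does not name an interpolation method.

```python
def interpolate_pattern(samples, angle_deg):
    """Estimate the pattern level at angle_deg from per-beam samples.

    samples: list of (angle_deg, level) pairs, one per sound beam.
    Interpolation is linear and wraps around at 360 degrees.
    """
    if not samples:
        raise ValueError("at least one pattern sample is required")
    points = sorted((a % 360.0, level) for a, level in samples)
    angle = angle_deg % 360.0
    # Extend by one point on each side so the search covers the wrap-around.
    extended = ([(points[-1][0] - 360.0, points[-1][1])]
                + points
                + [(points[0][0] + 360.0, points[0][1])])
    for (a0, v0), (a1, v1) in zip(extended, extended[1:]):
        if a0 <= angle <= a1:
            if a1 == a0:
                return v0
            t = (angle - a0) / (a1 - a0)
            return v0 + t * (v1 - v0)
    raise ValueError("angle could not be bracketed")


# Four hypothetical beams sampling a source that is loudest toward 0 degrees.
beam_samples = [(0.0, 1.0), (90.0, 0.5), (180.0, 0.0), (270.0, 0.5)]
level_45 = interpolate_pattern(beam_samples, 45.0)
level_315 = interpolate_pattern(beam_samples, 315.0)
```

A full three-dimensional pattern would interpolate over both azimuth and elevation, for example on a sphere, but the bracketing and blending idea is the same.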
At S340, a grid is generated within the three-dimensional space. The grid may be generated based on the captured multiple sound signals, and may represent spatial positioning of each of the multiple sound signals within a single space. Thus, each sound signal may be placed on the grid relative to each other sound signal, in order to be reproduced in a virtual three-dimensional space.
At S350, volumetric sounds are generated based on the sound signals and patterns. The generation of volumetric sounds includes simulating sound sources within a three-dimensional space so as to virtually emulate an auditory experience. The generation may include placing sound sources at various locations in a virtual space so that a user will hear a realistic auditory experience rather than sound from a single source. As a non-limiting example, the volumetric sounds enable simulating the audio experience of a viewer attending a live basketball game.
At optional S360, the volumetric sounds are provided to one or more user nodes. User nodes may include user devices, such as smartphones, personal computers, tablets, virtual reality headsets, surround sound audio systems, and the like.
A pattern is generated for each sound source 430. The pattern is computed continuously to determine whether each sound source 430 is stationary or moving. Based on the pattern, the system 100 can generate the perception of the audio experience at different locations within the three-dimensional space. According to an embodiment, the system 100 can simulate the audio experience of a viewer 440 sitting in proximity to the basketball court 410. The audio experience is delivered by the system 100 to a user device such as, for example, a virtual reality headset or a surround sound audio system. The volume and direction provided to each side of the headset may be customized separately in order to provide an optimal experience.
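One simple way to picture how the volume for each side of a headset could be derived from a source position and a virtual listener position is sketched below. It is not the disclosed rendering method: it assumes a two-dimensional layout, inverse-distance attenuation, and constant-power stereo panning, and the function name and parameters are hypothetical. The listener heading is the direction the listener faces, measured counterclockwise from the positive x axis.

```python
import math


def ear_gains(source_pos, listener_pos, listener_heading_rad=0.0,
              min_distance=1.0):
    """Return (left_gain, right_gain) for one source and one listener."""
    dx = source_pos[0] - listener_pos[0]
    dy = source_pos[1] - listener_pos[1]
    # Clamp the distance so that very close sources do not blow up the gain.
    distance = max(math.hypot(dx, dy), min_distance)
    attenuation = 1.0 / distance
    # Angle of the source relative to where the listener is facing;
    # positive values are to the listener's left.
    relative = math.atan2(dy, dx) - listener_heading_rad
    lateral = math.sin(relative)      # +1 fully left, -1 fully right
    pan = (1.0 - lateral) / 2.0       # 0 fully left, 1 fully right
    left = attenuation * math.cos(pan * math.pi / 2.0)
    right = attenuation * math.sin(pan * math.pi / 2.0)
    return left, right


# A listener at the origin facing +x: one source to the left, one ahead.
left_source = ear_gains((0.0, 2.0), (0.0, 0.0))
ahead_source = ear_gains((4.0, 0.0), (0.0, 0.0))
```

Recomputing these gains as the tracked source position or the chosen listener position changes gives a rough sense of how the perceived direction and loudness would follow a moving source.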
The various embodiments disclosed herein can be implemented as hardware, firmware, software, or any combination thereof. Moreover, the software is preferably implemented as an application program tangibly embodied on a program storage unit or computer readable medium consisting of parts, or of certain devices and/or a combination of devices. The application program may be uploaded to, and executed by, a machine comprising any suitable architecture. Preferably, the machine is implemented on a computer platform having hardware such as one or more central processing units (“CPUs”), a memory, and input/output interfaces. The computer platform may also include an operating system and microinstruction code. The various processes and functions described herein may be either part of the microinstruction code or part of the application program, or any combination thereof, which may be executed by a CPU, whether or not such a computer or processor is explicitly shown. In addition, various other peripheral units may be connected to the computer platform such as an additional data storage unit and a printing unit. Furthermore, a non-transitory computer readable medium is any computer readable medium except for a transitory propagating signal.
As used herein, the phrase “at least one of” followed by a listing of items means that any of the listed items can be utilized individually, or any combination of two or more of the listed items can be utilized. For example, if a system is described as including “at least one of A, B, and C,” the system can include A alone; B alone; C alone; A and B in combination; B and C in combination; A and C in combination; or A, B, and C in combination.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the principles of the disclosed embodiment and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. Moreover, all statements herein reciting principles, aspects, and embodiments of the disclosed embodiments, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure.
This application claims the benefit of U.S. Provisional Application No. 62/608,580 filed on Dec. 21, 2017, the contents of which are hereby incorporated by reference.

Number | Date | Country |
---|---|---|
20190200157 A1 | Jun 2019 | US |

Number | Date | Country |
---|---|---|
62608580 | Dec 2017 | US |