This invention relates generally to providing two-channel audio signals to a listener that closely correspond to the sounds that arrive at the ears in the vicinity of the original sound's origins and more particularly, to a device that can rotate the apparent direction of such sounds relative to the user's head, so that as the user's head moves, the sound appears to continue coming from the appropriate direction in space.
For many years, people have made binaural recordings because of the realism that is possible. Using microphones placed in simulated or real human ears, such recordings capture many of the nuances that give people the ability to detect the direction of sound. So when listening to such recordings through headphones, the listener receives the same cues, which makes for a realistic experience.
Binaural sound seems well-suited for virtual reality (VR) or augmented reality (AR) because it is similar to the way the visual portion of such systems works: a video scene is placed in front of the eyes to replace or enhance the real world visual scene with the virtual world scene. Similarly, placing headphones on the ears allows the listener to hear the virtual sound that corresponds to the virtual visual scene.
Video games and other techniques exist for generating synthetic virtual environments. Given the objects in the virtual world, as the wearer of the VR viewer moves her head, head-tracking technology sends information to the computer and then graphics routines can render the virtual visual environment for display in front of the eyes. Similarly, techniques for generating binaural or stereo sound can cause the sound to be generated from the apparent direction between the user's head orientation and each of the sound sources. As the user rotates her head, the relative direction of the various visual and sound sources will change, possibly in different ways. For example, objects to the left will tend to move around the back, and thus right-ward as the user rotates her head to the right, whereas objects in front of the viewer in virtual reality will move toward the left.
The problem is somewhat more involved for creating virtual reality audio of real-world scenes, because there is no a priori knowledge of where all the sound sources and objects are.
People involved in the art have developed methods for obtaining the visual scene from stereo-optic cameras that capture a wide angle of a visual field, for example 180 degrees or 360 degrees around the eyes. Then head-tracking technology wearable by the viewer can select the portion of the imagery from the entire field that corresponds to what is viewable in that direction, moving that imagery to the center of the field of view.
Audio recording technology such as above can be used to record the binaural, virtual-reality sound environment. However, current inventions intended for this purpose do poorly when the user turns his or her head, because there is not a good way to rotate the virtual sound sources in response to head motions in a similar fashion, since the sounds from the various sound sources are all mixed together in the sound stream.
Previous inventions have created ways to create sonic environments that appear to correctly maintain direction of origin of sounds, but they typically require several microphones and/or several channels of audio so that the sounds can be appropriately recombined, or in the cases where only two channels of transmission are required, the channels are not the same as standard stereophonic or binaural recordings. For example, U.S. Pat. No. 3,997,725 to Gerzon discloses a multidirection sound reproduction system that uses separate omnidirectional and azimuthal signals to create a surround sound effect with arrays of speakers. U.S. Pat. No. 4,086,433 to Gerzon provides various enhancements for irregular arrays of speakers. U.S. Pat. No. 5,594,800 to Gerzon describes a matrix converter approach. U.S. Pat. No. 5,757,927 to Gerzon similarly describes a surround-sound approach using what is called therein “B-Format” signals or W,X,Y, to achieve a similar function, but with fixed speakers surrounding the user. While providing realistic 3D surround sound, these approaches do not directly address the case of a person wearing headphones, in which case the audio would need to change according to head direction. In “3D Binaural Sound Reproduction using a Virtual Ambisonic Approach” by Noisternig, et al., VECIMS 2003 Conference in Lugano, Switzerland, an approach is presented that rotates the sound in accordance with rotation of the user's head. However, this approach also uses multiple channels of encoded audio, which are combined according to the output of a head-tracking unit. U.S. Pat. No. 6,144,747 to Scofield, et al. discloses an encoding scheme that takes a 4-channel (quadraphonic) signal and combines the four channels into a binaural-like, two channel signal, so that the sound experienced by the user with nearby left and right speakers seems to arrive like the 4-channel signal would arrive from four loudspeakers.
This is a similar surround-sound idea, but it does not appear to address the issue of wearing headphones and rotating the head, and it assumes surround-sound encoding of the audio. In contrast to such approaches, it is preferable for many applications to be able to use existing two-channel recording technology such as is used for binaural and stereophonic audio, rather than prior art multi-channel encoding technology. Using standard two-channel inputs makes it possible to create surround-sound rotation effects from recordings that are recorded and distributed using standard, commonly-available two-channel techniques. It is also preferable for many approaches for the user to wear standard headphones for hearing the sound.
Yet another approach that could be used for surround sound is beam-forming. A series of audio beam-formers, such as are used for surveillance devices or hearing aids, could be used to obtain a signal from each of several directions. Each signal could then be rotated to appear to come from a corrected direction. However, this approach would have the disadvantage that the left and right portions of the signal for each beam are irreversibly combined, so that any nuances about the left and right signals coming to the ear from that source are not present in the output signal.
Therefore, several objects and advantages of the present invention are:
To accept real-world recordings or live streams of dual-channel sound and rotate the sound, so that the various sound sources appear to rotate relative to the user's head.
To rotate the sound in a manner such that, to the extent possible, the unique characteristics of the channels of sound are maintained.
For virtual reality of pre-recorded binaural scenes, to cause the sounds to rotate appropriately while a VR viewer is rotated during playback. This will be possible using as few as two video images corresponding to the total visual field, plus two sound channels corresponding to the two ears.
For binaural recording without the video imagery, as a way to add further realism to playback of music and other recordings, so that a more realistic sonic environment is available with headphones.
For non-binaural, stereo recordings, to give more realism. Even if the exact cues are not available, the sound will appear to rotate as a function of head rotation, still giving more realism than without this effect.
For synthesized music of multiple channels, to produce an effect of the music rotating as the user's head rotates, as an enjoyable and enriching experience for the user, possibly helping reduce the “closed-in” feeling often had after listening to headphones for extended periods of time.
For watching movies, even if the video is not VR, to have the sound correspond to the user's head orientation will allow headphones to be used more effectively for movie watching.
For listening in noisy environments, to spatially filter binaural or stereo sound for focusing on sounds from particular directions.
The subject invention is a system that accepts a standard binaural or stereo audio signal and separates the two-channel signal into a series of signals, each of which appears to originate from a separate direction in space relative to the placement of the microphones that captured the sound. The invention then accepts another input indicating the orientation of the listener's head. Each of the series of signals is then moved so as to arrive from a corrected angle that is a function of the user's head orientation. The rotated series of signals is then re-combined into right and left signals such that the direction of the signals is modified to take into account any changes in the listener's head orientation.
In another embodiment of the invention, the orientation of the microphones is measured and the two-channel signals from the microphones are similarly broken down into a series of signals coming from different directions, then rotated and recombined so as to give the effect that the orientation of the microphones does not change.
In another embodiment of the invention, the signals coming from the microphones or listened-to by the listener are rotated to give special effects that do not necessarily correspond to any rotation of the listener or of the microphones.
In yet another embodiment of the invention the signals coming from the microphones are spatially filtered to focus on particular directions.
It will be apparent to those with skill in the art that the modules 106, 108, 107, and 109, as well as the modules in
A sound sources extractor 106 processes the input sound 101 to create a set of sound source signals 113, consisting of individual sound source signal 113a, sound source signal 113b, sound source signal 113c, and sound source signal 113d. For convenience, only four sound source signals are shown in
Optionally, an input head angle alpha 102 corresponding to the input sound is also provided along with the input sound. Input head angle alpha 102 could conceivably vary with time, for example, if a portable recording device is used with the microphone operator wearing binaural recording earbuds. If input head angle alpha 102 is not available, a default of 0 degrees can be assumed, assuming that the audio sound is produced relative to a reference angle of the head. Other default angles could be used to take into account different microphone angles relative to the sound sources of interest. An angle comparer 107 compares the input head angle alpha 102, if available, to the listener head angle beta 103. Listener head angle beta 103 is measured by a device such as a head tracker, or could be independently derived from some other sensor system.
The reference listener head angle, which is the angle at which listener head angle beta 103 equals zero in the preferred embodiment, may be determined differently in various embodiments of the present invention. In a preferred embodiment, the reference head angle is set to the point at which a listening session begins, such that the virtual sonic environment experienced by the user will be defined as an arbitrary starting direction. In alternate embodiments, the reference head angle may depend on an absolute angle with respect to the earth's surface, if it is relevant to the use of the invention. As discussed later, the reference head angle may also vary with time.
The output of angle comparer 107 is the rotation angle phi 112, indicative of the angle by which the input sound 101 needs to be rotated relative to the listener's head, based on the degree to which listener head angle beta 103 is different from the input head angle alpha 102. Rotation angle phi 112 is also referred to simply as “phi” later in this specification.
If angle comparer 107 is not present, rotation angle phi 112 is alternately supplied by another method, for example, a manual hardware or software input under control of the listener, or under control of another automatic module, or superimposed with input sound 101.
As an example, consider the case where a fixed binaural microphone head is used to make a recording, and assume that a head tracker is used with the playback of the sound. The initial position of the head tracker when starting the playback is preferably used as the reference listener head angle as described above. Then, during playback, as the listener's head moves, the negative of the difference between the listener head angle beta 103 and the zero reference point is used to calculate rotation angle phi 112. For example, if the user turns her head to the left by 30 degrees, the rotation angle phi 112 would be indicative of rotating the sound to the right by 30 degrees to keep the apparent source of the sounds in the same place relative to the virtual environment of the listener.
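The angle comparison in the example above can be illustrated with a minimal sketch, assuming yaw angles in degrees that are negative toward the listener's left and an input head angle alpha that defaults to zero. The function names and the wrap-around handling are hypothetical and are not taken from the specification.

```python
def wrap_degrees(angle):
    """Wrap an angle in degrees into the range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def rotation_angle_phi(listener_beta, reference_beta=0.0, input_alpha=0.0):
    """Return rotation angle phi in degrees.

    Assumed convention: yaw is negative to the left and positive to the right.
    phi is the negative of the listener's deviation from the reference angle,
    offset by the input head angle alpha when one is supplied.
    """
    return wrap_degrees(input_alpha - (listener_beta - reference_beta))


# Head turned 30 degrees to the left: the sound is rotated 30 degrees to the right.
print(rotation_angle_phi(listener_beta=-30.0))
```

Treating the reference angle as a separate argument keeps the session-start calibration independent of the head tracker's own zero point.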
As a further example,
The simplest case, as depicted in the embodiment described above, would have rotation angle phi 112 defined only in the yaw direction, in which heading is measured. However, roll and pitch could also be used for a more fully-immersive playback experience, as is discussed later below, by utilizing vectors of angles instead of scalar angles in the same fundamental methodology as in the embodiment above.
Sound sources rotator 108 takes the bank/set of sound source signals 113 and applies a sound-rotation transformation operation to each, to rotate each of the sound source signals 113 according to rotation angle phi 112, thus outputting rotated sound signals 114. In
Sound combiner 109 takes the rotated sound signals 114 from sound sources rotator 108 and combines them into an output sound signal with left channel output Lout 110 and right channel output Rout 111. Sound combiner 109 can simply implement an addition of the various rotated sound signals 114, for example, by summing together all the left channel signals from rotated sound signals 114 into Lout 110, and all the right channel signals from rotated sound signals 114 into Rout 111, along with scaling to make sure the output level is compatible with the playback equipment, or can be more sophisticated, as is discussed below.
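The simple additive form of the sound combiner can be sketched as follows. This is only an illustration under assumptions: each path is represented as a pair of equal-length Python lists of samples, and the peak-based scaling rule and the name combine_paths are hypothetical choices, since the specification only requires that the output level be compatible with the playback equipment.

```python
def combine_paths(rotated_paths, peak_limit=1.0):
    """Sum per-path (left, right) sample lists into one stereo pair.

    rotated_paths: list of (left_samples, right_samples) tuples, one per path.
    The summed output is scaled down only if its peak exceeds peak_limit
    (an assumed scaling rule).
    """
    length = len(rotated_paths[0][0])
    l_out = [0.0] * length
    r_out = [0.0] * length
    for left, right in rotated_paths:
        for k in range(length):
            l_out[k] += left[k]
            r_out[k] += right[k]
    peak = max(max(abs(v) for v in l_out), max(abs(v) for v in r_out))
    if peak > peak_limit:
        scale = peak_limit / peak
        l_out = [v * scale for v in l_out]
        r_out = [v * scale for v in r_out]
    return l_out, r_out
```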
If more than the horizontal yaw plane is used in these rotations, one or more angles among input head angle alpha 102, listener head angle beta 103, theta.i and rotation angle phi 112 become vectors representing a composite rotation of roll, pitch, and/or yaw, or any combination of one or more of these angles.
Sound Sources Extractor
Sound sources extractor 106 is a central key to the present invention. Its task is to separate out apparent sound sources in the input sound 101 and calculate an apparent angle for each, in other words, the apparent direction from which each is arriving, so that each source can then be correctly rotated. Note that when this discussion speaks of a “source”, it is not necessarily a one-to-one correspondence with a physical sound-producing object, although it can be. A “source” could alternately correspond to several physical objects, or part of the sound coming from a physical object.
One way to perform the task of sound sources extractor 106 would be to implement a series of bandpass filters that are expected to correspond to the spectral extents of various sound sources and calculate the apparent angle of the output of each filter. This approach would work fine if the various sources in the sonic environment had predominantly non-overlapping spectra. However, in frequency ranges where the spectrum overlaps significantly, the apparent angles would be mixed. The audio distortion would be relatively minimal, however, because the output could be the weighted outputs of the bandpass filters, so most of the original phase information would be retained in the output.
Taking this idea further would be to perform a complete spectral analysis into many smaller frequency bands, perhaps going so far as to compute a Fourier or Laplace transform, or other frequency-extraction scheme, and treat each frequency band as a separate sound source, computing its apparent angle for rotating it appropriately. This alternate embodiment still has a similar issue in that sound sources that have overlapping spectra would tend to be added to come from the net angle. For example, if there were a voice on the left side and a trumpet on the right side, for those frequencies where the two coincide, there would be one signal from the front and none from the two sides for that frequency, so parts of the spectrum would be missing from the left and right. Additionally, even if reconstructed properly, the sound sources rotator would not be able to properly modify the sounds to account for the way that sound waveforms are modified as a function of the direction from which they arrive, since the average arrival angle at each frequency would in effect be used.
A preferred embodiment of the present invention uses an approach by which each filter corresponding to a source can extract information from a relatively wide frequency range, in such a way that the parts of a spectrum of the corresponding sound source will tend to be collected together, and thus be rotated together. To avoid interference between sound sources, not all frequencies within the overall frequency range of the filter should be included, instead only selected frequencies that are likely from the associated real-world sound source. By allowing different parts of a frequency band to be associated with different sources, this allows components of overlapping spectra to be extracted and rotated differently. To do so requires defining a series of frequencies for each filter that represent likely components of the corresponding source signal, and then gathering-together the parts of the input signal that occur in that series of frequencies.
An embodiment to accomplish this would be to have a library of the frequency spectra of a variety of known sound sources. Then the Fourier Transform could be taken and, for each item in the library, the amount of energy corresponding to the frequencies in its transform could be summed. For example, the average angle for the spectral components of each known source, preferably weighted by the amplitude of the spectral component, could be computed, and then the signals for all components of that sound source rotated by phi. If spectral components overlap between sources, the highest weighted one could receive all of that component's amplitude in its averaged sum, or the outputs could be included with each source, weighted proportionally.
There is a disadvantage of this embodiment in that it requires a library of known objects, and additionally, that it can be computationally expensive to find the Fourier Transform of the signal over each piece of the sound, and the reconstruction of the waveform is very difficult, since the library might not have phase information, and if it does, would require precise generation of all the spectral lines and a need to piece them together over time.
A preferred embodiment of the present invention is to create a relatively simple filter that has properties similar to those of the library of functions, namely that each filter can cover signals over a wide range, but unlike a bandpass filter, doesn't consider all the frequencies in the range more or less equally. Such a filter should preferably include common patterns of frequencies that are found in real world sounds without relying on extensive libraries with all possible sound types. One useful fact about most natural (and many synthetic) sounds is that they are rich in harmonics. Since mechanical processes that cause sound involve creation of harmonic energy, a filter that has a harmonic frequency response would be ideal for the invention. A simple filter that meets these criteria is a comb filter. The comb filter is based on feeding back the input or output of a filter with a fixed time delay. The fixed time delay in the time domain leads to a periodic response in the frequency domain. So if a comb filter is constructed with the fundamental frequency of a sound in the natural world, it is likely that much of the energy from that sound will be captured in the harmonic responses of that comb filter. Additionally, the frequencies in between the response frequencies of the comb filter are not captured by the filter, so that sounds with different spectral qualities can be detected by other comb filters having different fundamental frequencies and with harmonics that are not all coincident with the filter in question. If comb filters are used that have fundamental frequencies that are roughly harmonics of each other, sound sources with similar fundamental frequencies but different harmonic shapes will respond differently to different comb filters.
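The harmonic response of a comb filter can be illustrated with a minimal sketch of the feedback form, y[n] = x[n] + g·y[n−D], where the delay D is the sample rate divided by the fundamental frequency. The feedback gain of 0.9, the helper names, and the test tones are assumptions chosen only for illustration.

```python
import math


def feedback_comb(samples, fundamental_hz, sample_rate=44100, feedback=0.9):
    """Feedback comb filter: y[n] = x[n] + feedback * y[n - delay].

    The response peaks at multiples of sample_rate / delay, which is
    approximately fundamental_hz. The feedback gain is an assumed value
    and must stay below 1.0 for stability.
    """
    delay = max(1, round(sample_rate / fundamental_hz))
    out = []
    for n, x in enumerate(samples):
        fed_back = out[n - delay] if n >= delay else 0.0
        out.append(x + feedback * fed_back)
    return out


def steady_amplitude(signal, tail=2000):
    """Peak absolute value over the last `tail` samples, after the transient."""
    return max(abs(v) for v in signal[-tail:])


rate = 8000
harmonic = [math.sin(2 * math.pi * 200 * n / rate) for n in range(8000)]
between = [math.sin(2 * math.pi * 250 * n / rate) for n in range(8000)]
# A 100 Hz comb passes its 200 Hz harmonic strongly and suppresses 250 Hz.
print(steady_amplitude(feedback_comb(harmonic, 100.0, rate)))
print(steady_amplitude(feedback_comb(between, 100.0, rate)))
```

A tone on a harmonic of the comb's fundamental is reinforced by each pass through the delay, while a tone midway between harmonics arrives out of phase and is attenuated, which is the property the discussion above relies on.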
To cover the entire audio frequency range appropriately, a preferred embodiment is to use fundamental comb filter frequencies in a roughly geometric progression, such as in steps of 10% to 20% starting at the lowest frequency to be rotated. There are advantages to making sure some of the filters do not overlap in harmonics, so that the greatest portion of the entire audio spectrum can be accommodated. Linear, random, or other sets of fundamental frequencies could also be used in the present invention.
The preferred embodiment of the present invention therefore uses a bank of comb filters, starting with a low frequency, for example 50 Hz, and moving upward to a few thousand Hz. Each comb filter can be considered as being able to detect a simple “sound source”, as it will capture many parts of the spectrum of a real-world object. And if the real-world object has a complex waveform, rather than a simple harmonic, a series of the comb filters may in fact represent the physical sound-producing object. The number of sound sources is a trade-off, but as an example, 10 to 30 comb filters could be used in a preferred embodiment of the present invention.
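A roughly geometric set of fundamental frequencies for such a bank might be generated as in the sketch below. The 15% step and the 3000 Hz upper limit are assumed values picked from within the ranges mentioned above, and comb_fundamentals is a hypothetical helper name.

```python
def comb_fundamentals(lowest_hz=50.0, highest_hz=3000.0, step=0.15):
    """Return fundamental frequencies in a geometric progression.

    Each frequency is (1 + step) times the previous one, starting at
    lowest_hz and stopping before exceeding highest_hz.
    """
    freqs = []
    f = lowest_hz
    while f <= highest_hz:
        freqs.append(f)
        f *= 1.0 + step
    return freqs


bank = comb_fundamentals()
print(len(bank), bank[:3])
```

With these assumed values the bank falls within the 10 to 30 filters suggested above; a larger step or a lower upper limit gives fewer paths at a lower computational cost.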
In the text that follows, the term “path” will be used to refer to the signals detected by sound sources extractor 106 and occurring downstream corresponding to one of the bank of comb filters. For example, if a bank of 5 comb filters is used, there will be 5 paths for signals to flow from the outputs of the sound sources extractor 106 through to the sound combiner 109. The subscript “i” will be used to denote the input or processed signal corresponding to the path i or the “ith” comb filter. For example, when discussing one path among the bank of sound sources 113, the text may refer to angle theta within the context of that path, which corresponds to theta.i in the global view of all the paths.
Instead of a basic comb filter, alternate embodiments of the invention can be created, such as by adding additional feedback loops in the comb filters at sub-intervals of the fundamental feedback interval, using both feedback and feedforward versions of the comb filter, etc. Any such modification that keeps the response of the filter roughly corresponding to elements of one or more fundamentals plus their harmonics could be utilized in embodiments of the present invention, and typically, different higher-frequency responses among the filters will help separate sound sources more, such that multiple filters with similar fundamentals but different harmonic responses could be used for example to detect different musical instruments playing the same fundamental note. One particularly useful alternate embodiment is to put a comb filter in series with a simple low-pass filter, so that the harmonics have decreasing response, similar to many real-world sounds. We will refer to the selected comb filter design or any similar variations on a comb filter with the more general term “source filter” in the discussion below. If a multiple-channel signal is used, the term “source filter” may also imply a pair of similar source filters, one for each channel.
The energy, magnitude, or amplitude output of source filters 401a and 401b is found by one of several methods, such as one embodiment using lowpass magnitude filters 402a and 402b as described above. Another embodiment of the present invention does this by measuring the amplitude of the source filter 401a or 401b output at each sample point (e.g., at 44,100 Hz), or by putting the source filter or its output amplitude through a low-pass filter such as lowpass magnitude filters 402a and 402b, or by a peak- or envelope-detecting filter. Updating the apparent direction of the sound, apparent angle theta.i 408, too quickly results in noise distortion, because small changes in the detected direction may occur due to transient sounds, leading to some switching-like noise downstream in sound sources rotator 108. Too much low-pass filtering, on the other hand, causes unsettling directional shifts as sound sources appear to move around slowly, for example, if a sound source extractor 400 suddenly becomes more representative of (matched to) a sound coming from a different direction, and the apparent angle theta.i 408 slowly moves to the new direction instead of switching immediately. Rather than a fixed filter time constant for all source filters, filtering that varies with the fundamental frequency can be used, for example, using a low-pass filter cutoff frequency proportional to the filter's fundamental frequency. In some situations, filtering of the values will tend to reduce the occurrence of larger angles of theta.i that should be present. This can optionally be accounted for by multiplying the apparent angle theta.i 408 output by a “fudge factor”, such as a value of 1.2.
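One way to obtain a smoothed magnitude whose time constant follows the filter's fundamental is sketched below, assuming a one-pole low-pass filter applied to the rectified source filter output. The cutoff ratio of 0.1 and the name smoothed_magnitude are hypothetical; the specification does not prescribe a particular smoothing filter.

```python
import math


def smoothed_magnitude(samples, fundamental_hz, sample_rate=44100, cutoff_ratio=0.1):
    """Rectify a source filter output and smooth it with a one-pole low-pass.

    The cutoff is cutoff_ratio * fundamental_hz, so paths with higher
    fundamentals respond faster (cutoff_ratio is an assumed tuning value).
    """
    cutoff_hz = cutoff_ratio * fundamental_hz
    coeff = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    level = 0.0
    out = []
    for x in samples:
        level += coeff * (abs(x) - level)
        out.append(level)
    return out
```

Raising cutoff_ratio makes the apparent angle react sooner to new sounds at the cost of more switching-like noise, which is the trade-off described above.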
In any case, a mathematical head model, in other words, a mathematical model of how the sound reaches the listener's ears is used to derive the apparent angle theta.i. For one embodiment of the model, the technique used to obtain amplitudes from source filters will provide a left and right (L and R) amplitude value for each path and source signal, namely L magnitude 403 and R magnitude 404 in
theta=−pi/2+2 atan(R magnitude/L magnitude) (equation 1)
or another similar mapping that relates that at theta=−90 degrees, the L channel will be maximum and the R channel minimum, and vice versa at +90 degrees, with approximately equal L and R values corresponding to theta=0. Of course, alternate mappings of positive and negative or different angle measures, or even simply using ratios or sines and cosines, can be done within the scope of the present invention. We will use the convention of Left ear at −90 degrees for the following discussion. Note that the terms “L”, “Left”, and “amplitude L”, as well as the corresponding R terms, may be used interchangeably and the context will be apparent to those with ordinary skill in the art. Although this simplification may work well for higher frequencies, lower frequency, longer-wavelength signals tend not to show a strong amplitude relationship. To accommodate this shortcoming, the time delay can optionally be computed from a version of source filters 401a and 401b that are high-passed at their input, for example, with a 400 Hz corner frequency, so that the calculation is effectively made only for the higher-frequency portion of the spectrum captured by source filters 401a and 401b.
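A mapping of the kind just described, following the convention of the Left ear at −90 degrees, can be sketched as below. The use of atan2 to avoid division by zero and the choice of 0 degrees for a silent path are assumptions made for the illustration, as is the function name.

```python
import math


def angle_from_magnitudes(l_mag, r_mag):
    """Map non-negative L and R magnitudes to an apparent angle in degrees.

    Returns -90 when only the left channel has energy, +90 when only the
    right channel has energy, and 0 when the two are equal.
    """
    if l_mag == 0.0 and r_mag == 0.0:
        return 0.0  # assumed default for a silent path
    return math.degrees(2.0 * math.atan2(r_mag, l_mag) - math.pi / 2.0)


print(angle_from_magnitudes(1.0, 1.0))
print(angle_from_magnitudes(1.0, 0.0))
```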
The time delay between the two ears of a listener can also be used in the model to derive an apparent angle theta.i 408 of the source corresponding to source extractor channel 400. Using the speed of sound at approximately 343 meters/sec, and given the approximate radius of the head, simple trigonometry can be used to derive an approximate time delay between right ear and left ear sounds for various head pointing angles.
tdelay.left=2r sin(theta)/v.sound (equation 2)
where 2r is the distance between the ears of head 301, theta is the angle theta 302 with which the apparent direction of sound source 303 is rotated with respect to the listener's head, v.sound is the velocity of sound, and tdelay.left is the time delay of the L sound compared to the R sound.
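Equation 2 and its inverse can be sketched as follows. The head radius of 0.0875 meters is an assumed illustrative value, and the function names are hypothetical; the clamping step simply guards against measured delays slightly larger than the model allows.

```python
import math

SPEED_OF_SOUND = 343.0  # meters per second, as given above


def delay_left(theta_deg, head_radius_m=0.0875):
    """Equation 2: delay of the L sound relative to the R sound, in seconds."""
    return 2.0 * head_radius_m * math.sin(math.radians(theta_deg)) / SPEED_OF_SOUND


def angle_from_delay(tdelay_left, head_radius_m=0.0875):
    """Invert equation 2 to recover theta in degrees from a measured delay."""
    s = tdelay_left * SPEED_OF_SOUND / (2.0 * head_radius_m)
    s = max(-1.0, min(1.0, s))  # clamp so asin stays defined
    return math.degrees(math.asin(s))


print(delay_left(90.0))  # largest delay the model produces for this radius
print(angle_from_delay(delay_left(30.0)))
```

A source at −90 degrees gives a negative tdelay.left, meaning the left ear receives the sound first, which agrees with the Left-ear-at-−90-degrees convention.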
The two models depicted in equation 1 and equation 2 are fused in an embodiment of the present invention to arrive at the best answer, such as by averaging, or by weighting each result according to the variances expected in the readings and calculations at the values in question.
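As one possible form of the weighting by expected variance, the sketch below uses inverse-variance weighting, a standard estimation technique that is named here as an assumption rather than as the method of the specification. The variances would come from whatever error model is adopted for each head model; fuse_angles is a hypothetical name.

```python
def fuse_angles(theta_amp, var_amp, theta_delay, var_delay):
    """Combine two angle estimates, weighting each by 1 / variance.

    theta_amp: angle from the amplitude model (equation 1), in degrees.
    theta_delay: angle from the time-delay model (equation 2), in degrees.
    Equal variances reduce this to a simple average.
    """
    w_amp = 1.0 / var_amp
    w_delay = 1.0 / var_delay
    return (w_amp * theta_amp + w_delay * theta_delay) / (w_amp + w_delay)


print(fuse_angles(10.0, 1.0, 20.0, 1.0))
print(fuse_angles(10.0, 1.0, 20.0, 4.0))
```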
As an alternative to the above simple equation models for amplitude and delay, the Head Related Transfer Function (HRTF) can be used to advantage as a mathematical head model. The HRTF is a function used in the art for generating synthetic sound that appears to have a given direction relative to the listener. The HRTF shows the response of the interior of the ear to sounds originating at a distance. The impulse response of the HRTF shows the response in the ear to an impulse sound at a distance. By analyzing an HRTF appropriate for the listener, the ratios of amplitudes and time delays can be computed for a more realistic head than the “ideal”, simple head that doesn't affect the sound as in the head model depicted in
Various other engineering models known in the art can be used to arrive at more or less accurate estimates of the direction of the source within the scope of the present invention, using the outputs of the source filter, or simple modifications of the source filter such as described above.
The observant reader will note that the above simple model equations result in an ambiguity—that the relative amplitudes and time delays will be equal at two different angles—one with the user's head facing the sound and one away from the sound. A method is needed in sound source extractor 400 to make a decision about which angle to choose. One simple method in a preferred embodiment is to assume that most important events will be taking place in front of the recording head or microphone array, so always to choose the angle corresponding to the head aimed relatively toward the sound source. However, the shape of the ears causes a difference in the spectrum and impulse response for sounds coming from the front vs. rear. The HRTF concept can be used in this case. The Fourier Transform or other frequency-extraction method can be used to compare the spectra of the L and R outputs of the source filter. The difference in frequency response that best matches the differences in frequency response between the HRTFs corresponding to the front-facing and rear-facing cases would be chosen. Alternately, without having to use HRTFs explicitly, spectral differences over a wide range of experimental tests with in-ear microphones could be used to experimentally derive the differences in frequency between sounds arriving from the front and the rear. One simple embodiment of the present invention uses an algorithm determining that if the high-frequency amplitude of the output of source filter 401a compared to the source filter 401b is higher by a certain factor, for example 5 percent, relative to the difference in frequency amplitude over all frequencies between source filters 401a and 401b, then the “toward the sound” direction should be chosen, since the ear facing the source tends to induce more high-frequency effects than the ear with the head partially obscuring a direct path to the source for the “toward the sound” case. 
In the “away from sound” case, the sound comes from the rear in both ears, so the difference in high-frequency spectrum should be less. The high-frequency content comparison between the outputs of source filters 401a and 401b can be found by Fourier Transforms, by one or more highpass or bandpass filters, by looking at the sum total of high-frequency energy, by looking at one or more specific frequency values, or by finding statistics over the high frequency range such as maximum difference, average difference, and variance of difference, to make the decision as to whether the high-frequency content differential between the filter outputs is of greater magnitude than a threshold value.
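The simple front-versus-rear decision can be sketched as below, under one possible reading of the rule: the left/right ratio of high-frequency magnitudes is compared with the left/right ratio over all frequencies, and the front-facing angle is kept when the high-frequency ratio is larger by the chosen factor. The inputs are assumed to be magnitudes already measured from source filters 401a and 401b; the function names, the symmetric ratio, and the mirroring of the angle about the ear axis are illustrative assumptions.

```python
def facing_source(hf_l, hf_r, full_l, full_r, factor=1.05):
    """Return True if the 'toward the sound' angle should be chosen.

    hf_l, hf_r: high-frequency magnitudes of the L and R source filter outputs.
    full_l, full_r: magnitudes over all frequencies.
    factor: required excess of the high-frequency ratio (1.05 = 5 percent).
    """
    eps = 1e-12  # avoids division by zero for silent channels
    hf_ratio = max(hf_l, hf_r) / max(min(hf_l, hf_r), eps)
    full_ratio = max(full_l, full_r) / max(min(full_l, full_r), eps)
    return hf_ratio > factor * full_ratio


def resolve_angle(theta_front_deg, facing):
    """Keep the front angle, or mirror it about the ear axis to the rear."""
    if facing:
        return theta_front_deg
    mirrored = 180.0 - theta_front_deg
    return (mirrored + 180.0) % 360.0 - 180.0
```

For example, a front angle of 30 degrees that is judged to be behind the head becomes 150 degrees, which gives the same amplitude ratio and time delay in the simple models.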
To output the L sound-source signal 406 and R sound-source signal 407 for a path in a sound source extractor 400, the outputs of source filters 401a and 401b are used. Optionally, instead of outputting the latest output of source filters 401a and 401b, a time-delayed output from filters 401a and 401b can be used instead. And since comb filters have built-in delay functions, these delayed signals can be extracted from the comb filters instead of from a separate delay module. Since downstream calculations would be computing the amplitudes from a point in time later than the sound being output, it would allow the amplitudes in the theta calculation 405 to in effect consider the input sound 101 characteristics somewhat into the future, and not only the past. This option allows a more timely response of the apparent angle theta.i 408 outputs to the onset of a new sound.
The Sound Sources Rotator
Sound sources rotator 108 takes the extracted sound sources 113 from the sound sources extractor 106 and creates a new version of each sound source that appears to come from a specified direction phi with respect to the angle theta.i of the sound from each source coming from sound sources extractor 106. In other words, the result of sound sources rotator 108 is a sound for each path i that appears to come from angle phi plus theta.i.
In the preferred embodiment of the present invention, sound sources rotator 108 keeps the left and right channels of all sound sources intact as much as possible. This helps to retain as many of input sound 101 original listening properties as possible, which is helpful for maximum fidelity, for example, when listening to music.
The relative contributions of the above three processed signals are determined by factors K1 512, K2 513, and K3 514 and depend on several conditions:
The values for factors K1 512, K2 513, and K3 514 can be found by several means. One is to compute the deviation in angle from the ideal cases expressed by each of the above rules, then weight the factors accordingly, such that closer agreement to the ideal case yields a higher value. Alternately, trigonometric weightings can be used, for example, by using the cosine of the angle between the actual effect of phi and theta.i as compared to the perfect match with one or more rules above and assuming zero for any negative cosine values. For example, in this embodiment, suppose theta.i is 15 degrees and phi is 20 degrees.
A preferred embodiment of the present invention would then take the larger of the values for K1 and K2, then distribute the difference between that value and 1.0 between K3 and the smaller of K1 and K2. In the example, this would result in approximately K1=0.94, K2=0.039, and K3=0.0215. Many other variations on the specific technique of computing the K1, K2, and K3 values, such that they add up to a constant and the best matches have the greatest effect, are possible within the scope of the invention. Ideally, a preferred embodiment will set a factor to 1.0 if there is a perfect match according to the above rules.
Front/Back filters 510a and 510b in the example shown in
Delays 515a and 515b are present to make adjustments to the time of arrival of the Lout 503 and Rout 504 signals for cases where the theta.i+phi term is not extremely close or equal to the ideal cases cited above. Similarly, gain blocks 516a and 516b are provided to adjust the gains of the channels due to such differences. In an embodiment of the present invention, gain blocks 516a and 516b are simply multipliers. In a preferred embodiment of the invention, they are frequency-sensitive gain blocks, for example, frequency-sensitive filters known in the art, that modify the higher frequencies more than the lower frequencies, to implement the differences in low-frequency and high-frequency perception as described above. To control delays 515a and 515b and gain blocks 516a and 516b, equations similar to equation 1 and equation 2 above, or the other alternative models for signal amplitude and delay, would be used to gently rotate the processed L input 501 and R input 502 signals as will be apparent to those of skill in the art. Optionally, Front/Back Filters 510a and 510b can additionally add a relatively large additional delay if theta.i+phi is from behind the user and theta.i is in front of the user, to accentuate the illusion of the sound coming from behind.
Optionally, Front/Back Filters 510a and 510b and/or Delays 515a and 515b and/or Gain Blocks 516a and 516b could be duplicated and repositioned in the design to follow both the K1 512 multipliers 505a and 505b and the K2 513 multipliers 506a and 506b, if it is desired to implement these functions separately for the K1 and K2 cases.
Monaural Converter 507 combines the two inputted channels of sound L input 501 and R input 502 from the Sound Source in question (that originated as the outputs of the source filters in the sound sources extractor) into a monaural signal 518. Binaural Generation Filters 517a and 517b then generate a spatialized multi-channel (e.g., binaural) version of the monaural signal 518 with an apparent angle of theta+phi. The simplest way to generate a monaural signal is to sum or average the two channels of sound. However, a preferred embodiment is to take into account the time delay between the two signals L input 501 and R input 502. Inverting the techniques described above, equation 2 can be used to decide which channel to delay and by how much. After applying this delay, the two signals are mixed by adding them together. Instead of using equation 2, the HRTF approach can alternately be used by observing the time delay indicated by the HRTF impulse (or other) response for the angle theta.i, then applying that delay before averaging. A more sophisticated version would be to take an approximation to the inverse of the HRTF filter for theta, and apply it to each channel to remove effects of the ear anatomy on the sound qualities.
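The delay-then-mix step of Monaural Converter 507 can be illustrated with a short sketch. It assumes the inter-channel lag has already been obtained (from equation 2 or from an HRTF) and converted to a whole number of samples; the function names and the sign convention for `right_lag` are hypothetical.

```python
def delay(signal, n):
    """Delay a signal by n samples, keeping its length (zero-padded at the start)."""
    n = min(max(n, 0), len(signal))
    return [0.0] * n + list(signal[:len(signal) - n])

def to_mono(left, right, right_lag):
    """Time-align two channels and average them into one monaural signal.

    right_lag > 0 means the right channel lags the left by that many samples,
    so the left channel is delayed to match; a negative value delays the right.
    """
    if right_lag >= 0:
        left = delay(left, right_lag)
    else:
        right = delay(right, -right_lag)
    return [0.5 * (l + r) for l, r in zip(left, right)]
```

Aligning before mixing avoids the comb-filter coloration that results from averaging two copies of the same sound offset in time.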
Binaural Generation Filters 517a and 517b generate a binaural or stereo output for left and right, respectively, at an apparent angle of phi+theta.i. To do so, several techniques are possible. The simplest embodiment is to once again use equations 1 and 2. Rearranging equation 1 provides the following expressions for the L and R channel output multiplicative factors to multiply outputs of Binaural Generation Filters 517a and 517b to get signals 509a and 509b:
Right amplitude=1/2K3 sin(phi+theta+pi/2) (equation 3)
Left amplitude=1/2K3 cos(phi+theta+pi/2) (equation 4)
Preferably, rather than a simple multiplication, these amplitudes are applied in a frequency-selective manner, for example, utilizing high-pass filtering as will be apparent to those with skill in the art, so that only the higher audio frequencies are substantially affected, for example, frequencies above 400 Hz. The monaural signal 518 is multiplied by the above-discussed gains to create the right and left outputs. In the preferred embodiment, the amplitude changes are followed with a time delay affecting left signal 509a using a mathematical head model such as:
tdelay.left=2r sin(phi+theta)/v.sound (equation 5)
If tdelay.left is negative, then a delay of the same magnitude can be applied to the right channel as tdelay.right instead. Optionally, for cases where the theta.i+phi corresponds to sound coming from behind, the time delay tdelay.left or tdelay.right can be increased to well beyond the calculated amounts, say by a factor up to 2 or 3, to provide a more convincing experience of the sound coming from behind. An optional embodiment of the invention therefore determines if the phi+theta angle from which the sound is coming is behind the listener (i.e., between 90 and 270 degrees relative to the reference listener head angle), and in such case, increases the time delay for this effect.
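Equation 5, together with the handling of negative delays and the optional exaggeration for rear sources, can be sketched as below. The head radius of 0.0875 m, the speed of sound of 343 m/s, the use of degrees, and the function and parameter names are assumptions for illustration; the text specifies only the form of the equation, the 90 to 270 degree rear range, and a factor of up to 2 or 3.

```python
import math

def interaural_delays(phi_deg, theta_deg, head_radius_m=0.0875,
                      v_sound=343.0, rear_factor=1.0):
    """Return (tdelay_left, tdelay_right) in seconds following equation 5.

    rear_factor > 1 lengthens the delay when phi+theta lies behind the
    listener (strictly between 90 and 270 degrees).
    """
    angle = (phi_deg + theta_deg) % 360.0
    t = 2.0 * head_radius_m * math.sin(math.radians(angle)) / v_sound
    if 90.0 < angle < 270.0:
        t *= rear_factor
    # A negative left delay is applied to the right channel instead.
    return (t, 0.0) if t >= 0.0 else (0.0, -t)
```

For example, `interaural_delays(20, 10)` delays only the left channel, while `interaural_delays(-20, -10)` applies a delay of the same magnitude to the right channel.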
Alternately, an HRTF can again be used in Binaural Generation Filters 517a and 517b. This would be in the same sense that it is used in synthesizing surround sound in the art. The monaural signal 518 is convolved with the HRTF impulse response for a resulting apparent angle of theta+phi. The HRTF automatically takes care of the amplitude and time-delay issues. However, the HRTF is somewhat more computationally intensive, and it works better for listeners whose ears match its characteristics than for others.
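The HRTF alternative amounts to one convolution per ear. The sketch below uses direct time-domain convolution for clarity; the impulse responses passed in are placeholders, since real head-related impulse responses for the angle theta+phi would come from a measured data set that this document does not specify.

```python
def convolve(signal, impulse_response):
    """Direct (time-domain) linear convolution of two sample lists."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def render_binaural(mono, hrir_left, hrir_right):
    """Convolve the monaural signal with the left-ear and right-ear
    impulse responses chosen for the desired apparent angle."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)
```

In the toy call `render_binaural([1.0, 0.5], [1.0], [0.0, 0.0, 0.6])`, the right-ear response both delays and attenuates the signal, which is how a single impulse response carries the amplitude and time-delay cues at once. A practical system would likely use FFT-based convolution for long responses.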
An alternate embodiment of the present invention uses only the Monaural Converter 507 and its downstream components, rather than attempting to preserve the original two-channel content as achieved above with the K1 and K2 terms. The result would essentially be equivalent to setting K1 and K2 to be zero and using a constant K3.
Sound Combiner
Sound Combiner 109 takes the various rotated sounds from the bank of rotated signals from sound sources rotator 108 and combines them into a single two-channel (or however many channels are desired) output. In the preferred embodiment, a summation signal is used to accumulate the rotated sounds from the bank of rotated sounds. Various functions of the summation signal may be utilized in the present invention. The simplest version of sound combiner 109 simply adds the outputs from each of the paths among the rotated sound signals 114 output by sound sources rotator 108 into the summation signal, and scales the resulting summation signal to be consistent with the listener's needs.
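The simplest combiner described above can be sketched in a few lines. Scaling by a peak limit is only one assumed interpretation of making the summation signal "consistent with the listener's needs"; the function name and the `peak` parameter are hypothetical.

```python
def combine(paths, peak=1.0):
    """Sum per-path (left, right) sample lists and scale down if needed.

    paths: list of (left, right) tuples whose lists all have the same length.
    The sum is attenuated only when its largest magnitude exceeds `peak`.
    """
    n = len(paths[0][0])
    left = [sum(p[0][k] for p in paths) for k in range(n)]
    right = [sum(p[1][k] for p in paths) for k in range(n)]
    loudest = max(max((abs(s) for s in left), default=0.0),
                  max((abs(s) for s in right), default=0.0))
    scale = peak / loudest if loudest > peak else 1.0
    return [s * scale for s in left], [s * scale for s in right]
```

Applying the same scale factor to both channels preserves the inter-channel level differences that carry the directional cues.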
In a more complex embodiment of the present invention, sound combiner 109 takes into account the spectral qualities of adding together the rotated sound signals 114. In this case, the summation signal will not be a simple addition, but an addition of scaled versions of the various rotated sound signals 114. If the source filters in sound sources extractor 106 are carefully selected to not overlap substantially in the frequency domain, and to have frequency responses that sum together for a flat overall frequency response, little needs to be done. However, if there is significant overlap between the source filters in sound sources extractor 106, sound combiner 109 preferably will adjust the amplitudes of the individual rotated sound signals 114 accordingly to make a more even spectral response of the overall system. For example, in an embodiment, the frequency responses of all the source filters are added together to obtain the frequency response of the overall system, and an optimization process is used to reduce the contributions of some of the rotated sound signals 114 so as to provide a flatter frequency response. This process preferably includes changing the relative contributions of each of the paths, for example, by multiplying the Lout 503 and Rout 504 values for each sound source rotator 500 by a coefficient, or it could optionally include changing the frequency-decay responses of the source filters, for example by adjusting the cutoff frequencies of low-pass filters that follow the comb filters. The optimization for flatter frequency response can use any known optimization procedure. A preferred embodiment is to use a gradient-descent procedure among the above variables (path contributions, cutoff frequencies), using a figure-of-merit for the overall frequency response of the summation of the frequency response of the source filters of sound sources extractor 106 corresponding to the rotated sound signals 114.
The preferred figure of merit measures how flat (ideal) the response is, for example, by measuring the variance of the amplitude values of the spectrum compared to the mean frequency response across the spectrum. Preferably, this optimization occurs at design-time, and the results are used in the run-time listening software or hardware, but the optimization of modifications to the rotated sound signals 114 could optionally be run in real time on the listening hardware/software setup if desired, particularly if dynamically-changing source filters are used in sound sources extractor 106.
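A design-time optimization of the path contributions might look like the following sketch. It covers only the path-weight variables, not the cutoff frequencies, and makes several assumptions that are not in the text: the figure of merit divides the variance by the squared mean so that setting every weight to zero is not a trivial optimum, the gradient is estimated numerically, weights are kept non-negative, and a step is accepted only when it improves the figure of merit.

```python
def flatness(weights, responses):
    """Variance of the summed magnitude response divided by its squared
    mean; 0.0 means perfectly flat. responses[i][k] is the magnitude of
    source filter i at frequency bin k."""
    bins = range(len(responses[0]))
    total = [sum(w * h[k] for w, h in zip(weights, responses)) for k in bins]
    mean = sum(total) / len(total)
    if mean == 0.0:
        return float("inf")
    return sum((t - mean) ** 2 for t in total) / len(total) / (mean * mean)

def optimize_weights(responses, steps=200, rate=1.0, eps=1e-6):
    """Gradient descent on per-path weights with a numerical gradient.
    The step size is halved whenever a trial step fails to improve."""
    weights = [1.0] * len(responses)
    best = flatness(weights, responses)
    for _ in range(steps):
        grad = []
        for i in range(len(weights)):
            bumped = list(weights)
            bumped[i] += eps
            grad.append((flatness(bumped, responses) - best) / eps)
        trial = [max(0.0, w - rate * g) for w, g in zip(weights, grad)]
        score = flatness(trial, responses)
        if score < best:
            weights, best = trial, score
        else:
            rate *= 0.5
    return weights, best
```

With three overlapping filters such as `[[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1]]`, the procedure lowers the weight of the middle filter, whose passband is already covered by its neighbors.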
Sound Combiner 109 optionally adds bits of filtered Lin 104 and Rin 105 signal from the input sound 101 or bits of monaural combined Lin 104 and Rin 105 input sounds at frequencies where the sum of source filters leaves gaps in the frequency response of the summation of the frequency responses of the source filters in sound sources extractor 106. One special case of this is for low frequencies, such as, for example, below 100 Hz. Since these frequencies are not easy to distinguish by direction, the source filters in sound sources extractor 106 optionally could have fundamental frequencies higher than the cutoff frequency in question, and a low-pass filter with a cutoff near this frequency could be used in sound combiner 109 to add these relatively unprocessed, and hence, very low distortion stereo or binaural signals to the output.
Sound Combiner 109 optionally takes into account that for phi=0 (no rotation required), the existing input sound 101 is already what is needed at outputs Lout 110 and Rout 111, regardless of rotation angle phi 112, because using the original input signals may result in less distortion than separating sound sources and recombining them through the filtering and rotating paths. Taking advantage of this, some or all of the output of Sound Combiner 109 can be the original input sound 101 under such conditions. So that there isn't a discontinuity in sound quality exactly at phi=0, this can be a weighted feature, where a cos(phi) or similar function is used to determine the fraction of the original input signal vs. the fraction of the reconstructed, combined signal. For example, in a preferred embodiment, lobe 701 in front of a user's head 705 in
A related issue arises in reverse if a “hemispheric” assumption is made in sound sources extractor 106, assuming that all sound sources originate in the 180 degrees that are toward the reference direction or reference listener head angle of the system. As a result of this assumption, if the user turns his or her head 705 away from the front, there will be somewhat of a “dead zone”, wherein no sound appears to be coming from the rear. Lobe 704 depicts an example of the degree to which directions appear to have a dead zone from which less sound originates. The dead zone can cause a sense of unnaturalness about the silence from that direction, whereas in the real world, there is seldom such complete silence. It is therefore desirable to “fill in” some sound from the rear to make the auditory experience more interesting and natural if the above hemispheric assumption is made.
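Returning to the weighting of the original input sound 101 against the reconstructed signal as a function of phi, a minimal sketch of a cos(phi) crossfade is given below. Clamping negative cosine values to zero, so that none of the original signal is used once the rotation exceeds 90 degrees, is an assumption, as are the function name and the use of degrees.

```python
import math

def blend_with_original(original, reconstructed, phi_deg):
    """Crossfade one channel between the unprocessed input and the
    reconstructed (extracted, rotated, recombined) signal.

    The fraction of original signal is cos(phi), clamped at zero."""
    w = max(0.0, math.cos(math.radians(phi_deg)))
    return [w * o + (1.0 - w) * r for o, r in zip(original, reconstructed)]
```

At phi=0 the output is exactly the original input, and the reconstructed signal takes over smoothly as the rotation grows, which avoids a discontinuity in sound quality at phi=0.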
In another embodiment of the present invention, Sound combiner 109 implements a type of directional filter by modifying the amplitudes of rotated sound signals 114. The process is implemented in one example embodiment of the present invention as depicted in
Another embodiment of the present invention uses an equivalent mechanism as shown in
This process of shaping the spatial sensitivity to the input sounds enables a listener or other recipient of the output sound such as Lout 110 and Rout 111 to listen to the sound sources corresponding to one or several directions in space. In yet another embodiment of the present invention, Sound sources rotator 108 performs a zero-degree rotation (or equivalently, is omitted). In another embodiment, angle comparer 107 is omitted. In a real-time use of the present invention, this implies that the invention performs its processing relative to the direction the user's head or body is facing, assuming that microphones providing the Lin 104 and Rin 105 signals are attached to the head or body, respectively.
Applications for such embodiments include listening devices such as hearing aids, where it may be desirable to focus on only one direction of sound, treating the sound sources corresponding to other directions as noise. For example, manually input value 1001 would enable a hearing aid user to listen to sounds not originating from straight ahead, without turning the microphones. In one such embodiment, output sound 1003 goes to the earphone elements in a hearing aid.
Angle Comparer
Angle comparer 107 determines the rotation angle phi 112 that should be applied to input sound 101 by sound sources rotator 108. If the original recording or music stream is made by a fixed microphone system, such as a synthetic head with embedded binaural microphones, the initial input head angle alpha 102 in
In a case where the recording microphones are not in a fixed orientation, the input head angle alpha 102 may also vary during a recording or streaming, and thus, the rotation angle phi 112 will also be modified as a function of input head angle alpha 102. In this case, the input head angle alpha 102 should be measured, for example, when a person carries a recording device while engaging in an outdoor activity. If he or she turns the head while recording, the input head angle alpha 102 will change, and thus the rotation angle phi 112 will also be changed to keep the apparent orientation of the sound sources consistent for the listener. So in that case, sound sources rotator 108 will busily be rotating sounds to different angles even if the listener is not moving his or her head.
For some applications, for example portable ones, it may be desirable for the sound to tend to be oriented with a direction aligned with the user's head position, rather than from a direction fixed in space. For example, if the user is riding in a bus and the bus goes around a corner, it may be desirable if the user does not have to rotate her head by 90 degrees, long-term, to get the “normal” sound source orientations. Angle comparer 107 can accomplish this by using a high-pass type of filter or decay filter that slowly returns the rotation angle phi 112 to zero over time, for example, returning most of the way to zero in 20 seconds when the user's head has not turned farther, so that the sound will tend to align itself in that way. In effect, this is equivalent to slowly biasing the reference listener head angle toward the current listener head angle beta 103. Alternately, a software or hardware control button could be added to instantly or gradually reset the alignment between the user and the reference listener head angle. Alternately, a body-referenced reference listener head angle could be implemented by independently measuring the orientation of another part of the user's body, such as the torso, or by measuring the orientation of a vehicle or seating mechanism and utilizing that measurement in the calculations of angle comparer 107, as will be apparent to those with skill in the art. Any of the above would preferably be options settable in hardware or software control inputs for the invention.
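The decay behavior can be sketched as a reference angle that drifts exponentially toward the current listener head angle beta 103. The class name, the interpretation of "most of the way" as 95 percent within 20 seconds, and the choice to return the head angle relative to the drifting reference (whose sign relationship to rotation angle phi 112 depends on a convention not restated here) are all assumptions.

```python
import math

def wrap_deg(angle):
    """Wrap an angle in degrees into the range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

class DriftingReference:
    """Reference listener head angle that slowly follows the head."""

    def __init__(self, settle_seconds=20.0, remaining_fraction=0.05):
        self.reference = 0.0
        # Time constant such that `remaining_fraction` of an offset is
        # left after `settle_seconds` with the head held still.
        self.tau = -settle_seconds / math.log(remaining_fraction)

    def update(self, beta_deg, dt):
        """Advance by dt seconds; return head angle relative to the reference."""
        offset = wrap_deg(beta_deg - self.reference)
        keep = math.exp(-dt / self.tau)
        self.reference = wrap_deg(self.reference + (1.0 - keep) * offset)
        return wrap_deg(beta_deg - self.reference)
```

In the bus example, a 90 degree turn held for 20 seconds leaves about 4.5 degrees of residual offset with these defaults, so the sound scene gradually realigns with the listener. A reset button would simply set `reference` to the current beta.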
Not Only Yaw Angle
The above discussion is for the case where the system considers rotations only in the yaw angle (in other words, input head angle alpha 102, listener head angle beta 103 and rotation angle phi 112 are all for rotations within the horizontal plane). The present invention can also be used for pitch (up/down angle) and roll (tilting the head to the side), using essentially the same concepts as disclosed above. One extension is an embodiment using and extending the simple head model of
Not Only Recordings
The above discussion assumes that the present invention is being used for playing back recordings. However, the essence of the present invention also applies to live-streaming of sounds. Since the present invention works with any multi-channel sound source, and doesn't need to pre-process the entire event, it can receive a real-time or slightly-delayed stream of sound data from the sound source, along with optional alpha updates, and perform the functions as described above.
More than Two Channels
If more than two channels of audio are available from the sound source, the invention can be modified to accommodate them. Sound sources extractor 106 in this embodiment is optionally run on all pairs of sound sources to obtain redundant theta.i values for each path. In addition to reducing errors, this would conceivably also eliminate the ambiguity issue discussed relative to
An optional embodiment of a recording device 600 that provides more than two channels for input sound 101 is shown in
Another embodiment of the present invention is used to combine multi-channel sound into two-channel sound. If more than two microphones are used in the creation of input sound 101, the sound can still be combined into a two-channel stream for compatibility with existing sound distribution and storage mechanisms. In a preferred embodiment, this is done by using a version of the architecture of sound rotation system 100 in
Yet another embodiment of the invention is to use a third microphone on the cable from an earbud, such as is currently used in the art for cellphone conversations. The input from this microphone is used in this embodiment, in effect, to disambiguate the direction of the sound. Even if it is of lower quality than the in-ear microphones, the signal can be useful for sound sources extractor 106 for determining theta.i for each of the sound source signals 113, and can potentially be ignored by sound sources rotator 108 since it is of lower quality. For example, if the microphone is located in front of the user's trunk, sound from the rear will be much more attenuated compared to sound from the front, and this difference can be used within the scope of the algorithms described above to decide whether to use the “facing toward the sound” or “facing away from the sound” angle in the sound source extractor.
Use without Headphones
An embodiment of the present invention is for use without headphones, for example with speaker output. An example of this embodiment is to include a sensor, e.g., infrared or video locating system, that detects where a listener is. Then, similar rotation effects can be used to rotate the apparent stereo direction toward that user. This could be used in gaming, for example, if a tennis ball is being hit, so that the sound of the ball is rotated to be the most realistic in apparent angle for the player that is receiving the ball. This embodiment of the present invention would also be useful for removing the effects of changes to input head angle alpha 102 for sound played back through speakers.
Listening Device
It can be very engaging to listen to the sound of standard stereo or binaural music or other events with the present invention, as a much more realistic, or alternately, interesting, effect is experienced, in that as the listener's head is rotated, the sound experience changes accordingly. To accommodate portability of the approach for use in portable electronics, such as cellphones and mp3 players and the like, a simple, non-obtrusive version of a head tracker to measure listener head angle beta 103 is desirable. One way to do this is shown in
An alternate head tracker for a listening device can be made using the camera in the portable device. If the user's head is in view of one of the cameras, a video-based head tracker similar to, for example, the ViVo Mouse (http://www.vortant.com/vivo-mouse/) can be used to monitor the head pointing relative to the device. Then preferably, the device can measure its own orientation with respect to the external world by using its accelerometer, compass, and rate sensor. This would avoid the need for special head-tracking hardware, but has the disadvantage that the camera would have to be kept roughly pointed in a correct direction to detect the listener's head.
This specification represents the preferred embodiment of the invention. The concepts of the present invention are not necessarily divided into the modules here, such as sound sources extractor, sound sources rotator, sound combiner, and angle comparer, but could be divided into different sections, performed in somewhat different orders, etc. There are many alternate embodiments, such as alternate equations and filtering technique refinements that fall within the scope of the invention that will be apparent to those with skill in the art, once the principles of the invention are understood.
While there has been illustrated and described what is at present considered to be the preferred embodiment of the subject invention, it will be understood by those skilled in the art that various changes and modifications may be made and equivalents may be substituted for elements thereof without departing from the true scope of the invention.
This is a continuation of U.S. application Ser. No. 17/336,583, filed Jun. 2, 2021, which is a continuation in part of U.S. application Ser. No. 16/238,574, filed Jan. 3, 2019, which is a continuation of and claims the benefit of U.S. application Ser. No. 15/613,621, filed Jun. 5, 2017, which claims the benefit of U.S. Provisional Application No. 62/392,731, filed Jun. 7, 2016.
Number | Date | Country | |
---|---|---|---|
62392731 | Jun 2016 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 17336583 | Jun 2021 | US |
Child | 18099950 | | US |
Parent | 15613621 | Jun 2017 | US |
Child | 16238574 | | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 16238574 | Jan 2019 | US |
Child | 17336583 | | US |