The present invention relates generally to sound reproduction. More particularly, the present invention relates to a system and method for providing sound to a listener.
Sound has long been reproduced for listeners using speakers and/or headphones. One method for providing sound to a listener is by binaurally rendering an acoustic scene. Binaural rendering allows for the creation of a three-dimensional sound sensation, giving the listener the impression of actually being in the room with the original sound source.
Rendering binaural scenes is typically done by convolving the left and right ear head-related impulse responses (HRIRs) for a specific spatial direction with a source sound in that direction. For each sound source, a separate convolution operation is needed for both the left ear and the right ear. The output of all of the filtered sources is summed and presented to each ear, resulting in a system where the number of convolution operations grows linearly with the number of sound sources. Furthermore, the HRIR is conventionally measured on a spherical grid of points, so when the direction of the synthesized source is in-between these points a complicated interpolation is necessary.
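By way of a non-limiting illustration, the conventional per-source approach described above can be sketched as follows; all signals and HRIRs here are random placeholders rather than measured data:

```python
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)
# Placeholder data standing in for real recordings: three source signals
# and, for each source direction, a measured left/right HRIR pair.
n_src, sig_len, hrir_len = 3, 4800, 256
sources = rng.standard_normal((n_src, sig_len))
hrirs_left = rng.standard_normal((n_src, hrir_len))
hrirs_right = rng.standard_normal((n_src, hrir_len))

out_len = sig_len + hrir_len - 1
left = np.zeros(out_len)
right = np.zeros(out_len)
# Two convolutions per source (one per ear), then a sum per ear -- the
# linear growth in convolution count noted above.
for s in range(n_src):
    left += fftconvolve(sources[s], hrirs_left[s])
    right += fftconvolve(sources[s], hrirs_right[s])
```

The cost of this direct method motivates the spherical harmonic formulation developed below, in which the number of convolutions is fixed by the expansion order rather than the number of sources.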
Therefore, it would be advantageous to be able to provide rendering of binaural scenes using fewer convolution operations and without the complicated interpolation necessary for points in between the points on the spherical grid. It would also be advantageous to take into account a user's head rotation in reference to the simulated acoustic scene.
The foregoing needs are met, to a great extent, by the present invention, wherein in one aspect, a system for reproducing an acoustic scene for a listener includes a computing device configured to process a sound recording of the acoustic scene to produce a binaurally rendered acoustic scene for the listener. The system also includes a position sensor configured to collect motion and position data for a head of the listener and also configured to transmit said motion and position data to the computing device, and a sound delivery device configured to receive the binaurally rendered acoustic scene from the computing device and configured to transmit the binaurally rendered acoustic scene to a left ear and a right ear of the listener. In the system the computing device is further configured to utilize the motion and position data from the position sensor in order to process the sound recording of the acoustic scene with respect to the motion and position of the listener's head.
In accordance with another aspect of the present invention, the system can include a sound collection device configured to collect an entire acoustic field in a predetermined spatial subspace. The sound collection device can take the form of at least one selected from the group consisting of a microphone array, pre-mixed content, and a software synthesizer. The sound delivery device can take the form of one selected from the group consisting of headphones, earbuds, and speakers. Additionally, the position sensor can take the form of at least one of an accelerometer, gyroscope, three-axis compass, camera, and depth camera. The computing device can be programmed to project head related impulse responses (HRIRs) and the sound recording into the spherical harmonic subspace. The computing device can also be programmed to perform a psychoacoustic approximation, such that rendering of the acoustic scene is done directly from the spherical harmonic subspace. The computing device can be programmed to compute rotations of a sphere in the spherical harmonic subspace by generating a set of sample points on the sphere and calculating the Wigner-D rotation matrix by projecting onto these sample points, rotating the points, and then projecting back to the spherical harmonics, and the computing device can also be programmed to calculate rotation of the sphere using quaternions.
In accordance with another aspect of the present invention, a method for reproducing an acoustic scene for a listener includes collecting sound data from a spherical microphone array and transmitting the sound data to a computing device configured to render the sound data binaurally. The method can also include collecting head position data related to a spatial orientation of the head of the listener and transmitting the head position data to the computing device. The computing device is used to perform an algorithm to render the sound data for an ear of the listener relative to the spatial orientation of the head of the listener. The method can also include transmitting the sound data from the computing device to a sound delivery device configured to deliver sound to the ear of the listener. The method can also include the computing device executing the algorithm.
The method can also include preprocessing the sound data, such as by interpolating an HRTF (head related transfer function) into an appropriate spherical sampling grid, separating the HRTF into a magnitude spectrum and a pure delay, and smoothing a magnitude of the HRTF in frequency. Collecting head position data can be done with at least one of an accelerometer, gyroscope, three-axis compass, camera, and depth camera.
In accordance with yet another aspect of the present invention, a device for transmitting a binaurally rendered acoustic scene to a left ear and a right ear of a listener includes a sound delivery component for transmitting sound to the left ear and to the right ear of the listener and a position sensing device configured to collect motion and position data for a head of the listener. The device for transmitting a binaurally rendered acoustic scene is further configured to transmit head position data to a computing device, and the device is further configured to receive sound data, for transmitting sound to the left ear and to the right ear of the listener, from the computing device, wherein the sound data is rendered relative to the head position data.
In accordance with still another aspect of the present invention, the sound delivery component takes the form of at least one selected from the group consisting of headphones, earbuds, and speakers. The position sensing device can take the form of at least one of an accelerometer, gyroscope, three-axis compass, and depth camera. The computing device is programmed to project head related impulse responses (HRIRs) and the sound recording into the spherical harmonic subspace. The computing device is programmed to perform a psychoacoustic approximation, such that rendering of the acoustic scene is done directly from the spherical harmonic subspace. The computing device can also be programmed to compute rotations of a sphere in the spherical harmonic subspace by generating a set of sample points on the sphere and calculating the Wigner-D rotation matrix via a method of projecting onto these sample points, rotating the points, and then projecting back to the spherical harmonics.
The accompanying drawings provide visual representations which will be used to more fully describe the representative embodiments disclosed herein and can be used by those skilled in the art to better understand them and their inherent advantages. In these drawings, like reference numerals identify corresponding elements and:
The presently disclosed subject matter now will be described more fully hereinafter with reference to the accompanying Drawings, in which some, but not all embodiments of the inventions are shown. Like numbers refer to like elements throughout. The presently disclosed subject matter may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will satisfy applicable legal requirements. Indeed, many modifications and other embodiments of the presently disclosed subject matter set forth herein will come to mind to one skilled in the art to which the presently disclosed subject matter pertains having the benefit of the teachings presented in the foregoing descriptions and the associated Drawings. Therefore, it is to be understood that the presently disclosed subject matter is not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims.
An embodiment in accordance with the present invention provides a system and method for binaural rendering of complex acoustic scenes. The system for reproducing an acoustic scene for a listener includes a computing device configured to process a sound recording of the acoustic scene to produce a binaurally rendered acoustic scene for the listener. The system also includes a position sensor configured to collect motion and position data for a head of the listener and also configured to transmit said motion and position data to the computing device, and a sound delivery device configured to receive the binaurally rendered acoustic scene from the computing device and configured to transmit the binaurally rendered acoustic scene to a left ear and a right ear of the listener. In the system the computing device is further configured to utilize the motion and position data from the position sensor in order to process the sound recording of the acoustic scene with respect to the motion and position of the listener's head.
In one embodiment, illustrated in
The user interface device 10 and the computing module device 20 may communicate with each other over a communication network 30 via their respective communication interfaces as exemplified by element 130 of
Referring now to
Similar to the choice of the processor 100, the configuration of the software of the user interface device 10 and the computing module device 20 (further discussed herein) may affect the choice of the memory 110 used in the user interface device 10 and the computing module device 20. Other factors may also affect the choice of memory 110 type, such as price, speed, durability, size, capacity, and reprogrammability. Thus, the memory 110 of the user interface device 10 and the computing module device 20 may be, for example, volatile, non-volatile, solid state, magnetic, optical, permanent, removable, writable, rewriteable, or read-only memory. If the memory 110 is removable, examples may include a CD, DVD, or USB flash memory which may be inserted into and removed from a CD and/or DVD reader/writer (not shown), or a USB port (not shown). The CD and/or DVD reader/writer, and the USB port may be integral or peripherally connected to the user interface device 10 and the computing module device 20.
In various embodiments, the user interface device 10 and the computing module device 20 may be coupled to the communication network 30.
Working in conjunction with the communication device 120, the communication interface 130 can provide the hardware for either a wired or wireless connection. For example, the communication interface 130 may include a connector or port for an OBD, Ethernet, serial, parallel, or other physical connection. In other embodiments, the communication interface 130 may include an antenna for sending and receiving wireless signals for various protocols, such as Bluetooth, Wi-Fi, ZigBee, cellular telephony, and other radio frequency (RF) protocols. The user interface device 10 and the computing module device 20 can include one or more communication interfaces 130 designed for the same or different types of communication. Further, the communication interface 130 itself can be designed to handle more than one type of communication.
Additionally, an embodiment of the user interface device 10 and the computing module device 20 may communicate information to the user through the display 140, and request user input through the input device 150, by way of an interactive, menu-driven, visual display-based user interface, or graphical user interface (GUI). Alternatively, the communication may be text based only, or a combination of text and graphics. The user interface may be executed, for example, on a personal computer (PC) with a mouse and keyboard, with which the user may interactively input information using direct manipulation of the GUI. Direct manipulation may include the use of a pointing device, such as a mouse or a stylus, to select from a variety of selectable fields, including selectable menus, drop-down menus, tabs, buttons, bullets, checkboxes, text boxes, and the like. Nevertheless, various embodiments of the invention may incorporate any number of additional functional user interface schemes in place of this interface scheme, with or without the use of a mouse or buttons or keys, including for example, a trackball, a scroll wheel, a touch screen or a voice-activated system. Alternately, in order to simplify the system the display 140 and user input device 150 may be omitted or modified as known to or conceivable by one of ordinary skill in the art.
The different components of the user interface device 10, the computing module device 20, and the imaging device 25 can be linked together, to communicate with each other, by the communication bus 160. In various embodiments, any combination of the components can be connected to the communication bus 160, while other components may be separate from the user interface device 10 and the remote database device 20 and may communicate to the other components by way of the communication interface 130.
Some applications of the system and method described herein may not require that all of the elements of the system be separate pieces. For example, in some embodiments, combining the user interface device 10 and the computing module device 20 may be possible. Such an implementation may be useful where an internet connection is not readily available or where portability is essential.
The sound pressure field on a sphere can be expanded in spherical harmonics as

p(θ, φ, ω) = Σ_{n=0}^{∞} Σ_{m=−n}^{n} p_{mn}(ω) Y_{mn}(θ, φ),

with coefficients given by

p_{mn}(ω) = ∫_0^{2π} ∫_0^{π} p(θ, φ, ω) Y*_{mn}(θ, φ) sin θ dθ dφ,   (Equation 1)
where p_{mn}(ω) are a set of coefficients describing the sound field, Y_{mn}(θ, φ) is the spherical harmonic of order n and degree m, and (·)* is the complex conjugate. The spherical coordinate system described in Equation 1 is used in this work, with azimuth angle φ ∈ [0, 2π] and zenith angle θ ∈ [0, π]. The spherical harmonics are defined as

Y_{mn}(θ, φ) = √[(2n + 1)(n − m)! / (4π (n + m)!)] P_{mn}(cos θ) e^{imφ},
where P_{mn}(cos θ) is the associated Legendre function and i = √(−1) is the imaginary unit.
In any practically realizable system, the sound field must be sampled at the discrete locations of the transducers. The number of sampling points, S, needed to describe a band-limited sound field up to maximum order n = N is S ≥ (N + 1)². However, it is not necessarily the case that the minimum bound, S = (N + 1)², can be achieved without some amount of aliasing error.
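A minimal sketch of this analysis/synthesis relationship, assuming a Gauss-Legendre zenith grid with uniform azimuth sampling (one classical, aliasing-free choice; the grid and all names are illustrative). Note that SciPy's `sph_harm` takes its angle arguments in the opposite (azimuth, zenith) order from Equation 1:

```python
import numpy as np
from scipy.special import sph_harm

N = 3                                    # maximum spherical harmonic order
# (N + 1)**2 coefficients describe a field band-limited to order N, so at
# least that many sample points are needed.
gl_x, gl_w = np.polynomial.legendre.leggauss(N + 1)
zen = np.arccos(gl_x)                    # zenith angles (theta)
M = 2 * N + 1
az = 2 * np.pi * np.arange(M) / M        # azimuth angles (phi)

zen_g, az_g = np.meshgrid(zen, az, indexing="ij")
w_g = np.repeat(gl_w, M) * (2 * np.pi / M)       # cubature weight per point

# Matrix of spherical harmonics: one row per sample point, one column per
# (n, m) pair.  NB: scipy's sph_harm signature is (m, n, azimuth, zenith).
cols = [sph_harm(m, n, az_g.ravel(), zen_g.ravel())
        for n in range(N + 1) for m in range(-n, n + 1)]
Y = np.stack(cols, axis=1)

rng = np.random.default_rng(1)
p_mn = rng.standard_normal((N + 1) ** 2) + 1j * rng.standard_normal((N + 1) ** 2)
p = Y @ p_mn                             # synthesis: field at the samples
p_mn_rec = Y.conj().T @ (w_g * p)        # analysis: Equation 1 by cubature
```

Because the quadrature is exact for fields band-limited to order N, the recovered coefficients match the originals to machine precision; an arbitrary grid at the minimum point count generally would not achieve this, which is the aliasing issue noted above.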
In the design of a broadband spherical microphone array, such as could be used in the system described above, it is advantageous to use a spherical baffle or directional microphones to alleviate the issue of nulls in the spherical Bessel function. In this case, the pressure on the sphere due to a unit amplitude plane wave is
p_{mn}(ω) = b_n(kr) Y*_{mn}(θ_s, φ_s)
where k = 2πf/c is the wavenumber, f is the frequency, c is the speed of sound, and b_n(kr) is the modal gain, which is dependent on the baffle and microphone directivity. The modal gain is typically very small at low frequencies for the higher orders, which makes robustness an important consideration.
A beamformer can be used in conjunction with the present invention to spatially filter a sound field by choosing a set of gains for each microphone in the array, w(ω), resulting in an output

y(ω) = w^H(ω) p(ω),
where (•)H is the conjugate transpose and S is the number of microphones.
The beamforming can be performed in the spatial domain; however, in accordance with the present invention it is preferable to perform the beamforming in the spherical harmonics domain. For the purposes of the calculation, it is assumed that each microphone has equal cubature weight, 4π/S, and that the incoming sound field is spatially band-limited. These two assumptions allow the beamformer to be calculated in the spherical harmonics domain, so that the design is independent of the look direction of the listener and can be applied to arrays with different spherical sampling methods.
The robustness of a beamformer, as used in the present invention, can be quantified as the ratio of the array response in the look direction of the listener to the total array response in the presence of a spatially white noise field. This is called the white noise gain (WNG) and is given by

WNG(ω) = |w^H(ω) d(ω)|² / (w^H(ω) w(ω)),

where d(ω) is the array response vector in the look direction.
Assuming unity gain in the look direction, this can be written in the spherical harmonics domain as:
where B(ω) = diag[b_0(ω), b_1(ω), b_1(ω), b_1(ω), . . . , b_N(ω)] is the diagonal (N + 1)² × (N + 1)² matrix of modal gains, in which each b_n(ω) is repeated 2n + 1 times.
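The spatial-domain WNG definition above can be computed directly; the sketch below, with a hypothetical unit-modulus steering vector, shows that delay-and-sum weighting attains WNG = S, the number of microphones:

```python
import numpy as np

S = 32                                   # number of microphones
rng = np.random.default_rng(2)
# Hypothetical steering vector for the look direction: unit-modulus
# phase terms, as for omnidirectional microphones in the free field.
d = np.exp(1j * rng.uniform(0, 2 * np.pi, S))

def wng(w, d):
    """White noise gain: power response in the look direction divided by
    the power response to a spatially white noise field."""
    return np.abs(np.vdot(w, d)) ** 2 / np.real(np.vdot(w, w))

w_das = d / S                            # delay-and-sum weights
wng_db = 10 * np.log10(wng(w_das, d))    # delay-and-sum attains WNG = S
```

More directive weightings trade WNG away for spatial selectivity, which is exactly the trade-off the robust design below constrains.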
In the present invention, it is preferred to calculate the optimum robust beamformer coefficients, w̃_{mn}(ω), given a desired target beam pattern, w_{mn}(ω). For a single frequency this can be computed with the following convex minimization,
minimize over w̃_{mn}:  ‖w̃_{mn} − w_{mn}‖₂²

subject to a unity-gain constraint in the look direction and a minimum white noise gain constraint.
Because there is no specific look direction in an arbitrary pattern, the direction vector, d_{mn} = [Y_{0,0}(θ₁, φ₁), Y_{1,−1}(θ₁, φ₁), . . . , Y_{N,N}(θ₁, φ₁)]^T, is chosen as a point, or set of points, at which a maximum response in the target pattern, w_{mn}(ω), is desired. The gain of the target pattern in this direction is assumed to be unity. The minimum WNG constraint is parameterized by δ = 10^{−WNG/10}.
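Setting the WNG inequality aside for a moment, the unity-gain equality constraint alone admits a closed-form solution, shown in the following sketch (the coefficient vectors are random placeholders; the full problem with the minimum-WNG constraint would typically be handled by a convex solver):

```python
import numpy as np

N = 3
K = (N + 1) ** 2
rng = np.random.default_rng(3)
# Hypothetical target pattern coefficients and look-direction vector d_mn
# (in practice, the spherical harmonics evaluated at the look direction).
w_target = rng.standard_normal(K) + 1j * rng.standard_normal(K)
d = rng.standard_normal(K) + 1j * rng.standard_normal(K)

# Closed-form solution of   min || w - w_target ||_2^2   s.t.  d^H w = 1.
# The Lagrange multiplier shifts the target along d just enough to meet
# the unity-gain constraint.
lam = (1 - np.vdot(d, w_target)) / np.vdot(d, d)
w_opt = w_target + lam * d
```

The result stays as close as possible to the desired pattern while guaranteeing unity gain at the chosen maximum-response point.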
The computer software for the present invention also includes a second software component 230, a general method for steering arbitrary patterns using the Wigner D-matrix. In this method the rotation coefficients, D^n_{mm′}, that represent the original field w_{mn} in the rotated coordinate system, w_{m′n}, are calculated. These rotation coefficients only affect components within the same order of the expansion,
The computation of the Wigner D-matrix coefficients, D^n_{mm′}, can be done directly or in a recursive manner. Both methods can exhibit numerical stability issues when rotating through certain angles. Instead of computing the function directly, a projection method is preferable, which is both efficient and easy to implement. By way of example, given a field that is described by a set of coefficients in the spherical harmonics domain, p_{mn}, we first project into the spatial domain,
p = Y p_{mn},
where Y is the matrix of spherical harmonics evaluated at a set of sample points on the sphere. The sample points are then rotated, and the rotated samples are projected back into the spherical harmonics domain, giving
p_r = Y_R^H Y p_{mn} = D p_{mn}
A major issue with this method is that many sampling geometries exhibit strong aliasing errors that result in the distortion of the rotated beam pattern. There are two options to make sure that aliasing does not affect the rotated pattern: spatial oversampling and numerical optimization. A preferred metric to determine the aliasing contributions from each harmonic for a given spherical sampling grid is the Gram matrix, G = Y^H Y. The aliasing error can then be written as
E = G − I, where I is the identity matrix.
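The Gram-matrix test can be sketched as follows: an aliasing-free quadrature grid yields G equal to the identity to machine precision, while an equal-weight random grid at the minimum point count does not (both grids are illustrative choices):

```python
import numpy as np
from scipy.special import sph_harm

N = 2
K = (N + 1) ** 2

def sh_matrix(zen, az):
    # scipy's sph_harm signature is (m, n, azimuth, zenith)
    return np.stack([sph_harm(m, n, az, zen)
                     for n in range(N + 1) for m in range(-n, n + 1)], axis=1)

# Grid 1: Gauss-Legendre zenith nodes x uniform azimuth, with matching
# cubature weights -- exact for fields band-limited to order N.
gl_x, gl_w = np.polynomial.legendre.leggauss(N + 1)
M = 2 * N + 1
zen1, az1 = np.meshgrid(np.arccos(gl_x),
                        2 * np.pi * np.arange(M) / M, indexing="ij")
w1 = np.repeat(gl_w, M) * (2 * np.pi / M)
Y1 = sh_matrix(zen1.ravel(), az1.ravel())
G1 = Y1.conj().T @ (w1[:, None] * Y1)            # weighted Gram matrix

# Grid 2: K random points with equal weights 4*pi/K -- generally aliased.
rng = np.random.default_rng(5)
zen2 = np.arccos(rng.uniform(-1, 1, K))
az2 = rng.uniform(0, 2 * np.pi, K)
Y2 = sh_matrix(zen2, az2)
G2 = (4 * np.pi / K) * Y2.conj().T @ Y2

err1 = np.abs(G1 - np.eye(K)).max()              # near machine precision
err2 = np.abs(G2 - np.eye(K)).max()              # large aliasing error
```

The off-diagonal structure of G − I shows which harmonics alias into one another for a given grid.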
The sampling theorem for a spherical surface requires S ≥ (N + 1)² sample points for a sound field band-limited to order N. However, in general, it is not always possible to sample the sphere at the band limit, S = (N + 1)², without spatial aliasing errors. Spherical t-designs are also preferred for spatial oversampling since they provide aliasing-free operation for all harmonics below a band limit, t = 2N, as seen in
To reduce the error to negligible levels, an optimization method can be used,
p_r = Y_R^H (Y^H)⁺ p_{mn}
where (·)⁺ indicates the pseudoinverse. In implementation, speedups can be achieved by noting that (Y^H)⁺ is independent of the rotation and D is block diagonal. Rotation of the sampling points, (θ_s, φ_s), should be done using quaternions to avoid issues when rotating through the poles.
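The full projection method, including quaternion-based rotation of the sample points, can be sketched as below; the Fibonacci-spiral grid is an illustrative oversampled point set, not a t-design:

```python
import numpy as np
from scipy.special import sph_harm
from scipy.spatial.transform import Rotation

N = 3                                    # band limit of the field
S = 100                                  # oversampled grid, S > (N + 1)**2

# Illustrative near-uniform sample points (a Fibonacci spiral).
i = np.arange(S)
z = 1 - 2 * (i + 0.5) / S
zen = np.arccos(z)                                   # zenith angles
az = np.mod(np.pi * (1 + 5 ** 0.5) * i, 2 * np.pi)   # azimuth angles

def sh_matrix(zen, az):
    # One row per point, one column per (n, m); scipy's sph_harm takes
    # (m, n, azimuth, zenith).
    return np.stack([sph_harm(m, n, az, zen)
                     for n in range(N + 1) for m in range(-n, n + 1)], axis=1)

Y = sh_matrix(zen, az)

def rotation_operator(rot):
    """Wigner-D-equivalent operator via the projection method: evaluate
    the field at inversely rotated points, then project back to the
    spherical harmonics with a pseudoinverse."""
    pts = np.stack([np.sin(zen) * np.cos(az),
                    np.sin(zen) * np.sin(az),
                    np.cos(zen)], axis=1)
    pr = rot.inv().apply(pts)        # rotating the field = inverse point rotation
    zen_r = np.arccos(np.clip(pr[:, 2], -1.0, 1.0))
    az_r = np.mod(np.arctan2(pr[:, 1], pr[:, 0]), 2 * np.pi)
    return np.linalg.pinv(Y) @ sh_matrix(zen_r, az_r)

# Quaternion-parameterized rotation, avoiding problems at the poles.
rot = Rotation.from_quat([0.3, -0.1, 0.2, 0.9])      # normalized internally
D = rotation_operator(rot)
D_inv = rotation_operator(rot.inv())
```

Because a rotated band-limited field remains band-limited, the pseudoinverse projection recovers the rotated coefficients exactly whenever Y has full column rank.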
This method allows for sampling at the band limit with minimal error, which reduces the computational complexity. However, numerical issues can result if the condition number of the sample grid, κ(Y^H Y), is high. By way of example, choosing the sample points that minimize the condition number of the Gram matrix can ensure that these issues do not cause irregularities in the rotated beam pattern.
In this example, the rotated beam pattern can be calculated exactly by inputting the rotated coordinates in
The error between the exact and rotated beams can then be computed as 10 log₁₀ ‖p_exact − D p_{mn}‖₂². For all the rotations tested (every 1 degree in azimuth and zenith) the error was around −300 dB, showing that no distortion in the rotated pattern occurs.
The following applications are included as examples, and are not meant to be limiting. Any application of the above methods and systems known to or conceivable by one of skill in the art could also be used. When rendering a recorded spatial sound field over a loudspeaker array it is important to consider the available gain of the microphone array at low frequencies. Typical sound field rendering approaches such as mode-matching, or energy and velocity vector optimization, generate a set of loudspeaker beamforms that do not take the microphone robustness into account. Furthermore, these methods are not guaranteed to be axisymmetric, especially for irregular loudspeaker arrangements. The beam patterns generated from either approach can be used to calculate their robust versions for auralizing recorded sound fields.
The robust beamforming and steering method can also be used to design a system to render recordings from spherical microphone arrays binaurally. Here the grid of HRTF measurements at each frequency is considered as a pair of spatial filters, h^l_{mn}(ω) and h^r_{mn}(ω), for the left and right ears.
The output for a single ear is then
A set of preprocessing steps are performed to ensure that the perceptually relevant details can be well approximated when using a low order approximation of the sound field. The HRTF is first interpolated to an equiangular grid, then it is separated into its magnitude spectrum and a pure delay (estimated from the group delay between 500-2000 Hz), and finally the magnitudes are smoothed in frequency using 1.0 ERB filters.
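These preprocessing steps can be sketched on a synthetic HRIR as follows (the filter, its delay, and the rectangular ERB-width smoothing are simplified stand-ins for measured data and true ERB filters):

```python
import numpy as np

fs = 48000
# Synthetic stand-in for a measured HRIR: a 32-tap linear-phase filter
# delayed by 20 samples (true overall delay 20 + 15.5 = 35.5 samples).
h = np.zeros(256)
h[20:52] = np.hanning(32)

H = np.fft.rfft(h)
freqs = np.fft.rfftfreq(len(h), 1 / fs)
mag = np.abs(H)

# Pure delay estimated from the phase slope (group delay) in the
# 500-2000 Hz band, as in the preprocessing described above.
band = (freqs >= 500) & (freqs <= 2000)
phase = np.unwrap(np.angle(H))
slope = np.polyfit(freqs[band], phase[band], 1)[0]   # radians per Hz
delay_samples = -slope * fs / (2 * np.pi)

# Crude stand-in for 1.0-ERB smoothing: rectangular moving average whose
# width follows the ERB scale (Glasberg & Moore: 24.7*(4.37*f_kHz + 1) Hz).
erb = 24.7 * (4.37 * freqs / 1000 + 1)
mag_smooth = np.array([mag[(freqs >= f - b / 2) & (freqs <= f + b / 2)].mean()
                       for f, b in zip(freqs, erb)])
```

Separating the delay before smoothing is what allows a low-order spherical harmonic approximation to preserve the perceptually important interaural time and level cues.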
In current binaural renderers, the interpolation operation must be done in real-time. This severely limits the number of sources that can be synthesized, especially when source motion is desired. It also limits the complexity of the interpolation operation that can be performed. Typically, HRTFs are simply switched (resulting in undesirable transients) or a basic crossfader is used between HRTFs. In this approach, interpolation is done offline, so any type of interpolation is possible, including methods that solve complex optimization problems to determine the spherical harmonic coefficients. Furthermore, since the motion of a source is captured in the source's plane-wave decomposition, the interpolation issue does not exist for moving sources.
The addition of head tracking is also a simple operation in this context. The rotation of a spherical harmonic field was discussed above. This rotation can be applied to the left and right HRTFs individually. However, to eliminate one of the two rotations, the rotation can instead be applied to the acoustic scene, which then rotates in the opposite direction of the head.
Head tracking binaural systems have traditionally been limited to laboratory settings due to the need for expensive electromagnetic-based tracking systems such as the Polhemus FastTrack. However, recent advances in MEMS technology have made it possible to purchase inexpensive 9 degree-of-freedom sensors with similar performance at a fraction of the price. Alternatively, due to the wide proliferation of computing devices with front-facing cameras, a computer-vision based head-tracking approach is also feasible for this type of system.
A head tracking system in this work uses a PNI SpacePoint Fusion 9DOF MEMS sensor. A Kalman filter is used to fuse the data from the 3-axis accelerometer, 3-axis gyroscope, and 3-axis magnetometer and to provide a small amount of smoothing. It should be noted that such audio signals can be generated in a virtual world, such as in gaming, to artificially generate sound images in any direction based on the user's head position relative to the virtual world.
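As a simplified stand-in for the Kalman-filter fusion described above, the following sketch uses a complementary filter that integrates the gyroscope rates and tilt-corrects toward the accelerometer's gravity estimate; the magnetometer (yaw) correction is omitted for brevity:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def fuse_step(q, gyro, accel, dt, k=0.05):
    """One step of a simple complementary filter.  q maps body
    coordinates to world coordinates; gyro is in rad/s."""
    q = q * Rotation.from_rotvec(np.asarray(gyro) * dt)   # integrate rates
    g_body = q.inv().apply([0.0, 0.0, -1.0])              # predicted gravity
    a = np.asarray(accel) / np.linalg.norm(accel)         # measured gravity
    corr = np.cross(g_body, a)                 # axis * sin(tilt error)
    return q * Rotation.from_rotvec(-k * corr)            # nudge toward accel

# An estimate initialized 0.3 rad off about x, with the sensor actually
# level and stationary: the accelerometer correction pulls the estimate
# back toward the true (identity) orientation.
q = Rotation.from_rotvec([0.3, 0.0, 0.0])
for _ in range(400):
    q = fuse_step(q, [0.0, 0.0, 0.0], [0.0, 0.0, -1.0], dt=0.01)
```

The gain k trades gyro smoothness against drift correction speed; a Kalman filter effectively adapts this gain from the sensor noise statistics.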
The method 400 can also include an algorithm executed by the computing device being defined as:
The sound data can be preprocessed, which can include the steps of: interpolating an HRTF into an appropriate spherical sampling grid; separating the HRTF into a magnitude spectrum and a pure delay; and smoothing a magnitude of the HRTF in frequency. Collecting head position data can be done with at least one of an accelerometer, gyroscope, three-axis compass, and depth camera.
Finally, it should be noted that this technique is not limited to headphone playback. As mentioned earlier, binaural scenes can be played back over loudspeakers using crosstalk cancellation filters. In this type of situation it would be preferable to use a vision-based head tracking system, such as a three-dimensional depth camera or any other vision-based head tracking system known to one of skill in the art. Furthermore, as more sophisticated acoustic scene analysis and computer listening devices are created, the desire for binaural processing methods that allow for rotations will become necessary. A spherical microphone array along with this binaural processing method could function as a simple preprocessing model to extract the left and right ear signals while allowing for the computerized steering of the look direction in such a system.
The many features and advantages of the invention are apparent from the detailed specification, and thus, it is intended by the appended claims to cover all such features and advantages of the invention which fall within the true spirit and scope of the invention. Further, since numerous modifications and variations will readily occur to those skilled in the art, it is not desired to limit the invention to the exact construction and operation illustrated and described, and accordingly, all suitable modifications and equivalents may be resorted to, falling within the scope of the invention. It should also be noted that the present invention can be used for a number of different applications known to or conceivable by one of skill in the art, such as, but not limited to gaming, education, remote surveillance, military training, and entertainment.
Although the present invention has been described in connection with preferred embodiments thereof, it will be appreciated by those skilled in the art that additions, deletions, modifications, and substitutions not specifically described may be made without departing from the spirit and scope of the invention as defined in the appended claims.
This application claims the benefit of U.S. Provisional Patent Application No. 61/521,780, filed on Aug. 10, 2011, which is incorporated by reference herein, in its entirety.
This invention was made with government support under ID 0534221 awarded by the National Science Foundation. The government has certain rights in the invention.