The present invention relates to digital audio signal processing, and more particularly to loudspeaker virtualization and cross-talk cancellation devices and methods.
Multi-channel audio inputs designed for multiple loudspeakers can be processed to drive a single pair of loudspeakers and/or headphones to provide a perceived sound field simulating that of the multiple loudspeakers. In addition to creation of such virtual speakers for surround sound effects, signal processing can also provide changes in perceived listening room size and shape by control of effects such as reverberation.
Multi-channel audio is an important feature of DVD players and home entertainment systems. It provides a more realistic sound experience than is possible with conventional stereophonic systems by roughly approximating the speaker configuration found in movie theaters.
Note that the dependence of H1 and H2 on the angle that the speakers are offset from the facing direction of the listener has been omitted.
yields Y1=E1 and Y2=E2.
An efficient implementation of the cross-talk canceller diagonalizes the symmetric 2×2 matrix having diagonal elements H1 and off-diagonal elements H2
where M0(e^jω)=H1(e^jω)+H2(e^jω) and S0(e^jω)=H1(e^jω)−H2(e^jω). Thus the inverse becomes simple to compute:
And the cross-talk cancellation is efficiently implemented as sum/difference detectors with the inverse filters 1/M0(e^jω) and 1/S0(e^jω). This structure is referred to as the “shuffler” cross-talk canceller. U.S. Pat. No. 5,333,200 discloses this plus various other cross-talk signal processing.
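As an illustrative sketch (not the patented implementation), the shuffler structure at a single frequency bin can be written as follows, with E1, E2 the desired ear signals and H1, H2 the ipsilateral and contralateral responses (complex scalars at that bin); the function name is an assumption:

```python
def shuffler_crosstalk_cancel(E1, E2, H1, H2):
    """One frequency bin of a 'shuffler' cross-talk canceller.

    Solves [[H1, H2], [H2, H1]] @ [X1, X2] = [E1, E2] by filtering the
    sum and difference channels with 1/M0 and 1/S0, respectively.
    """
    M0 = H1 + H2            # eigenvalue for the sum (mid) channel
    S0 = H1 - H2            # eigenvalue for the difference (side) channel
    mid = (E1 + E2) / M0    # inverse filter 1/M0 on the sum
    side = (E1 - E2) / S0   # inverse filter 1/S0 on the difference
    return (mid + side) / 2, (mid - side) / 2   # speaker signals X1, X2
```

Applying the head transfer matrix to the returned speaker signals reproduces E1 and E2 at the ears, which is the Y1=E1, Y2=E2 condition above.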
Now with cross-talk cancellation, the
For example, the left surround sound virtual speaker could be at an azimuthal angle of about 250 degrees. Thus with cross-talk cancellation, the corresponding two real speaker inputs to create the virtual left surround sound speaker would be:
where H1, H2 are for the left and right real speaker angles (e.g., 30 and 330 degrees), LSS is the (short-time Fourier transform of the) left surround sound signal, and TF3left=H1(250), TF3right=H2(250) are the HRTFs for the left surround sound speaker angle (250 degrees).
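For illustration, the virtual left-surround rendering described above can be sketched per STFT bin by combining the virtual-angle HRTFs with the shuffler inverse; the function name and the scalar-per-bin treatment are assumptions:

```python
def virtualize_bin(LSS, H1, H2, TF3left, TF3right):
    """Drive signals for the left/right real speakers that make a single
    source LSS (one STFT bin) appear at the virtual speaker angle.

    H1, H2: ipsilateral/contralateral HRTFs for the real speaker angles.
    TF3left, TF3right: HRTFs from the virtual speaker angle to the ears.
    """
    # Desired ear signals: what the virtual speaker would produce.
    E1 = TF3left * LSS
    E2 = TF3right * LSS
    # Invert the symmetric matrix [[H1, H2], [H2, H1]] via sum/difference.
    M0, S0 = H1 + H2, H1 - H2
    X_left = ((E1 + E2) / M0 + (E1 - E2) / S0) / 2
    X_right = ((E1 + E2) / M0 - (E1 - E2) / S0) / 2
    return X_left, X_right
```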
Again,
Unfortunately, the transfer functions from the speakers to the ears depend upon the individual's head-related transfer functions (HRTFs) as well as room effects, and therefore are not completely known. Instead, generalized HRTFs are used to approximate the correct transfer functions. Usually generalized HRTFs are able to create a sweet spot for most listeners, especially when the room is fairly non-reverberant and diffuse.
However, the sweet spot can be quite a small region. That is, to perceive the virtualized sound field properly, a listener's head cannot move much from the central location used for the filter design with HRTFs and cross-talk cancellation. Thus current virtualization filter design methods suffer from the problem of a small sweet spot.
The present invention provides virtualization filter designs and methods which balance interaural intensity difference and interaural time difference. This allows for an expansion of the sweet spot for listening.
1. Overview
Preferred embodiment cross-talk cancellers and virtualizers for multi-channel audio expand the small “sweet spot” for listening locations relative to real speakers into a larger “sweet space” by modifying (as a function of frequency) the relative speaker outputs in accordance with a psychoacoustic trade-off between the Interaural Time Difference and the Interaural Intensity Difference. These modified speaker outputs are used in a virtualizing filter; and this makes direction virtualization more robust.
Preferred embodiment systems implement preferred embodiment virtualizing filters with any of several types of hardware: digital signal processors (DSPs), general purpose programmable processors, application specific circuits, or systems on a chip (SoC) such as combinations of a DSP and a RISC processor together with various specialized programmable accelerators such as for FFTs and variable length coding (VLC). A stored program in an onboard or external flash EEPROM or FRAM could implement the signal processing.
2. Psychoacoustic Basis
The preferred embodiments enlarge the listener's sweet spot by consideration of how directional perception is affected by listener movement within the sound field. Three basic psychoacoustic clues determine perception of the direction of a sound source: (1) Interaural Intensity Difference (IID) which refers to the relative loudness between the two ears of a listener; (2) Interaural Time Difference (ITD) which refers to the difference of times of arrival of a signal at the two ears (generally, people will perceive sounds as coming from the side which is louder and where the signal arrives earlier); and (3) the HRTF, which not only includes IID and ITD, but also frequency dependent filtering which helps clarify direction, because many directions can have the same IID and ITD.
An interesting experiment was performed in the early 1970's by Madsen to determine the effect on perception of direction when IID and ITD do not agree. It turns out that these clues can compensate for each other to a certain degree. For instance if the sound is louder in one ear but arrives earlier in the other ear by the correct amount of time, the sound will be perceived as centered. By finding the IID that compensates for a given ITD, a trade-off function can be established. A very simple approximation to this function is given as
Note that the direction of the trade amount is to the side of the head where the sound arrives first. For example, if a sound reaches the left ear 0.5 ms prior to reaching the right ear, but if the sound intensity at the right ear is about 5.6 dB larger than at the left ear, then this sound will be perceived as originating from a centered source.
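The approximation formula itself is not reproduced above; the sketch below assumes a simple linear trade, with the rate (about 11.2 dB per ms) inferred from the worked example (5.6 dB for 0.5 ms) rather than taken from Madsen's data:

```python
def itd_to_db(itd_ms, trade_db_per_ms=11.2):
    """Convert an interaural time difference (ms, positive = arrives at
    the left ear first) into the equivalent intensity advantage in dB.
    The linear trade rate (~11.2 dB/ms) is an assumption inferred from
    the worked example in the text, not Madsen's exact curve."""
    return trade_db_per_ms * itd_ms

# A sound arriving 0.5 ms earlier at the left ear, but 5.6 dB louder at
# the right ear, yields a combined cue near zero (perceived as centered):
iid = -5.6                        # right ear louder by 5.6 dB
combined = iid + itd_to_db(0.5)   # ~0 dB net lateralization
```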
Since a sweet space is to be located in a typical listening environment, certain assumptions can be made about the positions and orientations of the loudspeakers and listeners. First it is assumed that the speakers are identical uniform point sources; this simplification is not necessary, however. What is important is to have the best possible knowledge of the transfer functions between the speakers and the listener at all relevant locations, so if any a priori knowledge about the directional response of the speakers at individual frequencies is available, it should be used. The assumption of point sources merely keeps things as general as possible. The transfer functions between the speakers and the listener's ears are based on the usual HRTFs; however, the actual transfer functions used are based on angular adjustment and HRTF interpolation. Again, the goal is simply to obtain transfer functions from the speakers to the listener's ears that are as accurate as possible, so in this sense other HRTF interpolation methods could be used, as long as they also work reasonably well. Two scenarios were considered: in the first, the listeners always face directly forward, orthogonal to the line connecting the speaker positions; in the second, the listeners always face the mid-point between the two speakers, as if watching a small TV. Since there is very little difference between the two scenarios, only the facing-forward scenario will be considered.
Since one of the goals of the preferred embodiments is to create a virtual surround speaker environment using two speakers, the virtual speaker is assumed to be located at 110 degrees to the left (250 degrees azimuth), at the target virtual left rear surround speaker position. The actual speaker positions were assumed to be at 30 degrees left and 30 degrees right of the center position.
We begin by examining normal cross-talk cancellation as described above at a particular frequency when simulating the virtual source shown in
Since the listener is not necessarily in a central position, these four complex numbers can all be different. Indeed, H1(e^jω) and H3(e^jω) are the short and long paths from the left speaker to the left and right ears, respectively, and H4(e^jω) and H2(e^jω) are the short and long paths from the right speaker to the right and left ears, respectively.
Thus for each frequency and each head location, the problem is to solve for the ratio of real speaker outputs (i.e., x+jy) which will yield the desired virtual speaker signals at the ears (i.e., zL, zR) where the four complex matrix elements Re{Hk}+jIm{Hk} are determined by the frequency and head location using (interpolated) standard HRTFs.
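A minimal sketch of this per-frequency solve using NumPy's linear solver; the function name and the convention of a unit left drive (so the ratio x+jy is the right-to-left quotient) are assumptions for illustration:

```python
import numpy as np

def speaker_ratio(H1, H2, H3, H4, zL, zR):
    """Solve [[H1, H2], [H3, H4]] @ [xL, xR] = [zL, zR] for the speaker
    drive signals and return the right-to-left ratio x + jy = xR / xL.

    H1: left speaker -> left ear,   H2: right speaker -> left ear,
    H3: left speaker -> right ear,  H4: right speaker -> right ear.
    """
    H = np.array([[H1, H2], [H3, H4]], dtype=complex)
    z = np.array([zL, zR], dtype=complex)
    xL, xR = np.linalg.solve(H, z)
    return xR / xL
```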
First, note that the IID in dB is determined as:
IID=20 log10(|zL|)−20 log10(|zR|)=20 log10(|zL|/|zR|)
Next, the ITD is a little bit trickier because the time difference must be calculated from the phase difference. The ITD in milliseconds (ms) is determined by:
ITD=1000(arg(zL)−arg(zR))/(2πf)
where f is the frequency in Hz and arg denotes the argument of a complex number, lying in the range −π<arg(z)≦π. Note that this formula is only valid at frequencies less than about 1 kHz, because the wavelength has to be at least twice the width of the head. The absolute errors of the IID and ITD are each defined simply as the absolute value of the target value minus the achieved value.
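These two definitions translate directly into code; the sketch below assumes zL and zR are complex STFT values at frequency f_hz, and the function names are illustrative:

```python
import numpy as np

def iid_db(zL, zR):
    """Interaural intensity difference in dB (positive = left louder)."""
    return 20 * np.log10(abs(zL)) - 20 * np.log10(abs(zR))

def itd_ms(zL, zR, f_hz):
    """Interaural time difference in ms from the phase difference.
    Only meaningful below ~1 kHz, where the phase is unambiguous
    across the width of the head."""
    return 1000 * (np.angle(zL) - np.angle(zR)) / (2 * np.pi * f_hz)
```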
A plot of the absolute error in resulting IID as the ratio of right to left speakers varies inside the unit circle in the complex plane for a listener in the center of the setup in
Likewise
As described in the foregoing, the actual perceived direction will be influenced by both the IID and ITD clues. Converting the ITD clue into a compensating factor in dB units, and adding this factor to the IID values for the corresponding speaker value, gives
Of course, the foregoing could be repeated for other listening locations by simply using the corresponding HRTFs as the 2×2 matrix elements.
3. Preferred Embodiment Methods
In order to use CID error to optimize a listening region, first preferred embodiment methods apply the procedure illustrated in the flowchart
More explicitly, for a given listening region perform the nested steps of:
(1) For each frequency fi to be considered (e.g., 4 samples in each Bark band) perform steps (2)-(6);
(2) For each speaker output ratio xm+jym in a (discrete) search space (e.g., a neighborhood of the usual cross-talk cancellation solution for a central head location) perform steps (3)-(5);
(3) For each head location (un, vn) in a listening region about the central head location, compute the resultant perceived signals at the left and right ears using the matrix equation
where the Hk are the HRTFs for frequency fi and head location (un, vn). That is, compute a pair of perceived signals zL, zR for each (un, vn) in the listening region for each given fi and xm+jym.
(4) Compute the CID error for each of the zL, zR pairs from (3); that is, for each location in the listening region, compute the difference between the CID of the computed zL, zR and the CID of the desired signals at the ears (which is the usual cross-talk cancellation solution for a central head location).
(5) From the results of (4), evaluate the CID errors over the listening region for each xm+jym, and thereby find the best xm+jym for the listening region. The “best xm+jym” may be the one which gives the smallest maximum CID error over the listening region, or may be the one which gives the smallest mean square CID error over the listening region, or may be some other measure of CID error over the listening region.
(6) Use the best xm+jym from (5) to define the virtualizing filter for the given frequency fi; and repeat for all other frequencies.
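The nested steps above can be sketched compactly for one frequency as follows. Here hrtf(f, u, v), the function names, and the linear ITD-to-dB trade rate are illustrative assumptions, and the mean-square criterion is just one of the options mentioned in step (5):

```python
import numpy as np

def cid(zL, zR, f_hz, trade_db_per_ms=11.2):
    """Combined Interaural Difference: IID plus the ITD converted to an
    equivalent dB offset (linear trade rate assumed for illustration)."""
    iid = 20 * np.log10(abs(zL) / abs(zR))
    itd = 1000 * (np.angle(zL) - np.angle(zR)) / (2 * np.pi * f_hz)
    return iid + trade_db_per_ms * itd

def best_ratio_for_region(f, hrtf, ratios, locations, z_target):
    """Steps (2)-(5): search the candidate right/left speaker ratios
    x + jy for the one minimizing mean-square CID error over the region.
    hrtf(f, u, v) is assumed to return (H1, H2, H3, H4) from interpolated
    HRTFs; z_target is the desired (zL, zR) ear-signal pair."""
    cid_target = cid(z_target[0], z_target[1], f)
    best, best_err = None, np.inf
    for r in ratios:                      # step (2): candidate ratios
        errs = []
        for (u, v) in locations:          # step (3): head locations
            H1, H2, H3, H4 = hrtf(f, u, v)
            # Perceived ear signals for unit left drive, ratio r right drive.
            zL = H1 + H2 * r
            zR = H3 + H4 * r
            errs.append((cid(zL, zR, f) - cid_target) ** 2)  # step (4)
        err = np.mean(errs)               # step (5): mean-square criterion
        if err < best_err:
            best, best_err = r, err
    return best                           # step (6): defines the filter at f
```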
The total number of combinations of frequencies, right-to-left (or left-to-right) ratios, and locations in a listening region used for the computations can exceed ten thousand. For example, 25 frequencies, 25 ratios, and 25 locations require 15,625 computations.
4. Experimental Results
Using the conventional cross-talk cancellation solution at 516.8 Hz,
As before, the shaded area in
Note, however, that the center CID error is now equivalent to about −1.87 dB, pulling the virtual direction slightly toward the center. Also the total error in the box in
In addition to increasing the space with no reversals, the total error can be minimized over some arbitrary region. For instance, trying to reduce the total CID error over a 0.1 m×0.1 m box around the center, the total error can be reduced by over 50% (approximately 53%). In this case the error at the center is equivalent to −0.334 dB.
Another approach is to constrain the solution to keep the center CID error as small as possible while reducing total error. In this example, the total error in the 0.1 m×0.1 m region can still be reduced by 48.6% while keeping the error in the center at the equivalent of −0.049 dB.
Although these examples have focused on one particular frequency and speaker setup, the technique of using CID to optimize various aspects of the sweet spot can be applied in any situation.
Optimizing the current setup (i.e., setting the cross-talk cancellation filter frequency response) at various frequencies reveals some interesting phenomena. At bin frequencies which are multiples of 86.13 Hz, the largest box around the center position without reversals for the traditional cross-talk cancellation solution was calculated for frequencies less than 1014 Hz (11 bins). Then a search was done at each frequency for better solutions. The results are shown in
Another experiment was done, in which the goal was to minimize the CID error in a box 0.2 m×0.2 m around the center location. The results of this effort are shown in
Additional criteria, such as applying a weighting of error within the region, can also be applied. For instance the error near the center can be given more weight than the error near the edges. Also the weighting over the region can be different for different frequencies. Thus a weighting scheme that takes into account the relative importance of different frequencies for the different HRTFs at different locations could be used.
5. Modifications
The preferred embodiments can be modified in various ways while retaining one or more of the features of evaluating CID error to define virtualizing filters for specified listening regions (“sweet spaces”).
For example, the number of and range of frequencies used for evaluations could be varied, such as evaluations from only 10 frequencies to over 100 frequencies and from ranges as small as 100-400 Hz up to 2 kHz; the number of locations in a candidate listening region evaluated could vary from only 10 locations to over 100 locations and the locations could be uniformly distributed in the region or could be concentrated near the center of the region; the number of ratios for evaluations could vary from only 10 ratios to over 100 ratios; listening regions could be elongated rectangular, oval, or other shapes; the listening regions can also be arbitrary volumes or surfaces and can consist of one or more separate regions. The approximation function used to calculate the CID can be changed for different angles, increased bandwidth, and even for different listeners, to best reflect the psychoacoustic tradeoff between IID and ITD in a given situation. Other audio enhancement technologies can be integrated as well, such as room equalization, other cross-talk cancellation technologies, and so on. Even other psychoacoustic enhancement technologies such as bass boost or bandwidth extension and so on may be integrated. Also more than two speakers can be used with corresponding larger transfer function matrices.
This application claims priority from provisional patent application No. 60/804,486 filed Jun. 12, 2006. The following co-assigned copending patent applications disclose related subject matter: application Ser. Nos. 11/364,117 and 11/364,971, both filed Feb. 28, 2006.