The present invention relates to digital audio signal processing, and more particularly to artificial room impulse responses for virtualization devices and methods.
Multi-channel audio is an important feature of DVD players and home entertainment systems. It provides a more realistic sound experience than is possible with conventional stereophonic systems by roughly approximating the speaker configuration found in movie theaters.
By including HRIRs/HRTFs of paths with reflections and attenuations in addition to the direct path from a (virtual) speaker to a listener's ear, the virtual listening environment can be controlled. Such a combination of HRIRs/HRTFs gives a room impulse response or transfer function. A room impulse response is largely unknown, but the direct path HRTFs can be approximated by use of a library of measured HRTFs. For example, Gardner, Transaural 3-D Audio, MIT Media Laboratory Perceptual Computing Section Technical Report No. 342, Jul. 20, 1995, provides HRTFs for every 5 degrees (azimuthal). Then an artificial room impulse response/transfer function can be generated by the superposition of HRIRs/HRTFs corresponding to multiple reflection paths of the sound wave in a virtual room environment together with factors for absorption and phase change upon virtual wall reflections. A widely accepted method for simulating room acoustics called the “image method” can be used to determine a set of angles and distances of virtual speakers corresponding to wall reflections. Each virtual speaker (described by its angle and distance) can be associated with an HRIR (or its corresponding HRTF) attenuated by an amount that depends on the distance and number of reflections along its path. Therefore, the room impulse response corresponding to a speaker and its wall reflections can be obtained by summing the HRIR corresponding to the location of the original speaker with respect to the listener and the HRIRs corresponding to locations imaged by wall reflections. As the distance and number of reflections increase, the corresponding HRIR suffers a stronger attenuation that causes the room impulse response to decay slowly towards the end. An example of a room impulse response generated using this method is shown in
The signal processing can be more explicitly described as follows.
Note that the dependence of H1 and H2 on the angle that the speakers are offset from the facing direction of the listener has been omitted.
yields Y1=E1 and Y2=E2.
Of course, the implementation of such filters would require considerable dynamic range reduction in order to avoid saturation about frequencies with response peaks. For example, with two real speakers each 30 degrees offset as in
has the form illustrated by
For example, the left surround sound virtual speaker could be at an azimuthal angle of about 225 degrees. Thus with cross-talk cancellation, the corresponding two real speaker inputs to create the virtual left surround sound speaker would be:
where H1, H2 are for the left and right real speaker angles (e.g., 30 and 330 degrees), LSS is the (short-term Fourier transform of the) left surround sound signal, and TF3left=H1(225), TF3right=H2(225) are the HRTFs for the left surround sound speaker angle (225 degrees).
Again,
In the case of headphones, the cross-talk problem disappears, and the filtered channel signals can directly drive the headphones as shown in
Generally in multi-channel audio processing, the filtering with HRIRs or HRTFs and/or room impulse responses takes the form of many convolutions of input audio signals with long filters. Typically, a room impulse response from each (virtual) sound source to each ear is used. Since an artificial room impulse response can be several seconds long, this poses a challenging computational problem even for fast digital signal processors. Further, the spectral characteristics of artificial room impulse responses need to be corrected because of coloration effects introduced by the HRIR filters. External equalizers could provide this correction, but they would add computational overhead and could disrupt phase relations that are important in 3D virtualization systems.
One approach to lowering the computational complexity of the filtering convolutions first transforms the input signal and the filter impulse response into the frequency domain (as by FFT), where the convolution becomes a pointwise multiplication, and then inverse transforms the product back into the time domain (as by IFFT) to recover the convolution result. The overlap-add method uses this approach with zero padding prior to the FFT to avoid circular-convolution wrap-around. Further, for filtering with a long impulse response, the impulse response can be sectioned into shorter filters, the filtering (convolution) by each filter section computed separately, and the results added to give the overall filtering output.
The present invention provides artificial room impulse responses of the form of a superposition of HRIRs with individually modified HRIRs, and/or with omission of the large-delay contra-lateral portions of the responses, and/or with low computational complexity convolution by truncation of middle sections of the response, and/or by Fourier transform with simplified 0 padding for overlap-add.
FIGS. 1a-1d show filters and method flow diagrams.
FIGS. 2a-2j illustrate head-related acoustic transfer function and virtualizer geometries and room impulse response.
1. Overview
Preferred embodiments modify artificial room impulse responses (in the form of a superposition of direct path HRIR plus reflective path attenuated HRIRs) by individual HRIR adjustments prior to superposition;
Preferred embodiment systems (e.g., audio/visual receivers, DVD players, digital television sets, etc.) perform preferred embodiment methods with any of several types of hardware: digital signal processors (DSPs), general purpose programmable processors, application specific circuits, or systems on a chip (SoC) such as combinations of a DSP and a RISC processor together with various specialized programmable accelerators such as for FFTs and variable length coding (VLC). A stored program in an onboard or external flash EEPROM or FRAM could implement the signal processing.
2. Modification of Room Impulse Response
The first preferred embodiments allow direct manipulation of an artificial room impulse response off-line during or after its generation. Manipulating the spectrum of a long impulse response is only possible with careful consideration of the magnitude-phase relations that must hold for real, causal systems (Hilbert relations). Methods exist that permit inferring the phase spectrum from an arbitrary magnitude spectrum for particular situations such as minimum-, linear-, or maximum-phase systems. However, in this case the phase spectrum of the room impulse response must keep its original temporal structure, at least in terms of temporal envelope, which is the basis of the perceived reverberation effect.
FIGS. 1a-1b illustrate a first preferred embodiment. The method consists of conducting spectral modification on each HRIR (
More explicitly, let RIR(.) denote an artificial room impulse response from a sound source to a listener's ear, and presume RIR has the form of a sum of HRIRs corresponding to the direct plus various reflective paths with attenuations:
$$\mathrm{RIR}(\cdot) = \sum_{1 \le i \le K} \mathrm{HRIR}_i(\cdot)$$
The summation index i labels the paths considered, and each HRIR(.) typically has only a few non-zero filter coefficients which are offset from 0 according to the delay along the path from source to ear. Indeed, the spikes visible in the lefthand portion of
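The superposition of delayed, attenuated HRIRs can be sketched as follows. This is a minimal illustration only: the function name `superpose_hrirs`, the representation of each path as a (delay, attenuation, coefficients) triple, and all numeric values are assumptions made for the example, not measured HRIRs or values from the image method.

```python
def superpose_hrirs(paths, length):
    """Sum delayed, attenuated HRIRs into one room impulse response.

    paths  : list of (delay_in_samples, attenuation, hrir_coefficients)
    length : number of coefficients in the resulting response
    """
    rir = [0.0] * length
    for delay, attenuation, hrir in paths:
        for n, coeff in enumerate(hrir):
            if delay + n < length:
                # Each HRIR is placed at the offset given by its path delay.
                rir[delay + n] += attenuation * coeff
    return rir


# Hypothetical paths: a direct path plus two wall reflections.
paths = [
    (0, 1.0, [1.0, 0.5, 0.25]),   # direct path, no attenuation
    (5, 0.6, [1.0, -0.3]),        # first reflection, later and weaker
    (9, 0.3, [0.8, 0.2]),         # second reflection, later and weaker still
]
rir = superpose_hrirs(paths, 12)
```

The result is sparse, with short bursts of coefficients at the path delays, which matches the description of each HRIR contributing only a few non-zero coefficients offset by its delay.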
Then modify each HRIR to correct for spectral coloration as in
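$$\arg\left[H_{i\text{-mod}}(e^{j\omega})\right] = \frac{1}{2\pi} \int_{-\pi<\theta<\pi} \log\left|H_{i\text{-mod}}(e^{j\theta})\right| \cot\left[\frac{\theta-\omega}{2}\right] d\theta$$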
And then inverse transform Hi-mod(ejω) to get hi-mod(n). Of course hi-mod(n) is minimum-phase by construction and thus packs most energy into the lowest coefficients, but hi-mod(n) may have an infinite number of nonzero coefficients and can be truncated.
The alternative is to let Hi(k) be the N-point FFT of hi(n) where N is at least 2L (the factor of 2 is a “causality” condition for finite length (or periodic) sequences). Then perform the desired spectral modification of |Hi(k)|, such as a bass boost by increasing |Hi(k)| by 6 dB for 0≦k<N/8. Denote this bass boosted spectral magnitude by |Hi-boost(k)|. An approximate minimum-phase for the bass-boosted spectrum can then be defined in terms of log|Hi-boost(k)|; namely, the phase is taken as the FFT of the product of the IFFT of log|Hi-boost(k)| with the unit step:
$$\arg H_{i\text{-boost}}(k) = \mathrm{FFT}\{u(n)\,\mathrm{IFFT}[\log|H_{i\text{-boost}}(k)|]\}$$
where
Lastly, the delay Di is attached; see
Compared with performing separate equalization, an obvious advantage of the first preferred embodiment is that the processing is performed off-line, resulting in higher computational efficiency. In addition, the present method avoids possible phase disruptions caused by external equalizers that could severely affect the virtualization effect.
On the other hand, modifying the room impulse response after its generation requires careful manipulation of the phase spectrum to maintain the real and causal characteristics of the impulse response. Using a minimum, linear, or maximum-phase spectrum conversion directly on the entire room impulse response is not possible, since the temporal envelope of the impulse response is an important element that cannot be changed. For example, if the entire impulse response is converted into minimum-phase, most of its energy will concentrate at the beginning of the filter, disrupting the temporal structure corresponding to the virtual speakers and their corresponding delays and attenuations.
The preferred embodiment method can successfully modify the magnitude spectrum of the generated impulse response by changing the magnitude spectrum of each HRIR to be overlapped, and also maintain the original envelope of the phase spectrum, since the modified HRIRs are added at the same positions with the same attenuations.
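The FFT-based route to an approximate minimum-phase HRIR with a modified magnitude spectrum can be sketched as below. This is a hedged illustration, not the embodiment's exact procedure: it uses the standard homomorphic (cepstral folding) construction, the folding window is an assumption about what the unit step u(n) denotes for finite sequences, the names `dft` and `minimum_phase` are hypothetical, and a naive DFT stands in for an FFT. The magnitude must be non-zero at every bin for the logarithm to exist.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(n^2) DFT, standing in for an FFT routine."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[m] * cmath.exp(sign * 2j * math.pi * k * m / n) for m in range(n))
        out.append(acc / n if inverse else acc)
    return out


def minimum_phase(h, n_fft, gain=None):
    """Approximate minimum-phase filter with magnitude |H(k)| * gain[k]."""
    padded = list(h) + [0.0] * (n_fft - len(h))
    mag = [abs(v) for v in dft(padded)]
    if gain is not None:
        mag = [m * g for m, g in zip(mag, gain)]
    # Real cepstrum: inverse transform of the log magnitude.
    cep = dft([math.log(m) for m in mag], inverse=True)
    half = n_fft // 2
    folded = []
    for n, c in enumerate(cep):
        if n == 0 or n == half:
            folded.append(c)        # kept once
        elif n < half:
            folded.append(2 * c)    # causal part doubled
        else:
            folded.append(0j)       # anti-causal part removed
    spectrum = [cmath.exp(v) for v in dft(folded)]
    return [v.real for v in dft(spectrum, inverse=True)]


h = [1.0, -2.5, 1.0]                 # zeros at z = 2 and z = 0.5
n_fft = 64                           # well above 2L for L = 3
h_min = minimum_phase(h, n_fft)      # close to [2, -2, 0.5, 0, ...]

# 6 dB bass boost on the lowest eighth of the bins, mirrored so h stays real.
boost = 10 ** (6 / 20)
gain = [boost if (k < n_fft // 8 or k > n_fft - n_fft // 8) else 1.0
        for k in range(n_fft)]
h_boost = minimum_phase(h, n_fft, gain)
```

In the embodiment the path delay would be re-attached to the modified HRIR before superposition; that step is omitted here.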
3. Room Impulse Response Convolution Shortening
The second preferred embodiments reduce the number of computations required in frequency-domain convolution of (artificial, modified) room impulse responses by skipping the computation of contra-lateral paths (a path from the left side of the head to the right ear or from the right side of the head to the left ear) for the last few filter sections. This results in shorter contra-lateral impulse responses without affecting the perceived quality. The simplification is possible due to the nature of human hearing, which is less sensitive to late reverberation than to early arrivals of the sound wave, and to the fact that late reverberation contains little spatial information. Therefore, the trailing portions of room impulse responses do not need to have well-defined ipsi-lateral (path approaching from the right side to the right ear or from the left side to the left ear) and contra-lateral impulse responses.
FIGS. 2i-2j and 1c illustrate the prior art and the second preferred embodiments as follows.
where the filter length L is the product of a block size B times the number of filter sections (blocks) M, and the m-th filter section fm(n) has non-zero coefficients only in the m-th block, mB≦n<(m+1)B. Typically, the block size is taken as a convenient power of 2 for ease of FFT, such as B=256. Then each of the M convolutions is computed by the steps of: section the input signal into blocks of size B, pad input block and filter section with 0s to size 2B (this avoids circular convolution wrap-around), 2B-point FFT for both input block and filter section (filter section FFT may be precomputed and stored), pointwise multiply transforms, IFFT; and combine the 2B-point results by overlap-add where the overlap is by B samples to give the output of the m-th filter. Lastly, add the M filter outputs. Note that the m-th filter section filtering is equivalent to a length B filter acting on an input signal block which has been delayed by m blocks. That is,
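$$y_m(n) = \sum_{0 \le b < B} s(n-b-mB)\, f_m(b+mB) = \sum_{0 \le b < B} s_m(n-b)\, h_m(b)$$
where $h_m(b)=f_m(b+mB)$ has non-zero coefficients only for $0 \le b < B$ and $s_m(n)=s(n-mB)$ is a version of $s(n)$ delayed by m blocks. Thus the FFT of the zero-padded input signal block needed by the m-th filter section has already been computed m blocks earlier and can be reused.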
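The sectioned overlap-add filtering can be sketched as follows. This is a simplified illustration under stated assumptions: a naive DFT stands in for the 2B-point FFT, the names `dft`, `partitioned_convolve`, and `direct_convolve` are hypothetical, the test signal and filter are arbitrary, and each section's product is inverse transformed separately rather than accumulated in the frequency domain.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(n^2) DFT, standing in for an FFT routine."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[m] * cmath.exp(sign * 2j * math.pi * k * m / n) for m in range(n))
        out.append(acc / n if inverse else acc)
    return out


def partitioned_convolve(signal, filt, block):
    """Overlap-add convolution with the filter split into sections of `block` taps."""
    signal, filt = list(signal), list(filt)
    sections = [filt[i:i + block] for i in range(0, len(filt), block)]
    # Section spectra (zero padded to 2B) can be precomputed and stored.
    section_spectra = [dft(sec + [0.0] * (2 * block - len(sec))) for sec in sections]
    n_blocks = -(-len(signal) // block)  # ceiling division
    out = [0.0] * ((n_blocks + len(sections)) * block)
    for b in range(n_blocks):
        blk = signal[b * block:(b + 1) * block]
        # One transform per input block, shared by all filter sections.
        blk_spectrum = dft(blk + [0.0] * (2 * block - len(blk)))
        for m, sec_spectrum in enumerate(section_spectra):
            product = [p * q for p, q in zip(blk_spectrum, sec_spectrum)]
            segment = dft(product, inverse=True)
            start = (b + m) * block  # section m acts as a delay of m blocks
            for n, value in enumerate(segment):
                out[start + n] += value.real
    return out[:len(signal) + len(filt) - 1]


def direct_convolve(signal, filt):
    """Reference time-domain convolution."""
    out = [0.0] * (len(signal) + len(filt) - 1)
    for i, s in enumerate(signal):
        for j, f in enumerate(filt):
            out[i + j] += s * f
    return out


signal = [1.0, -2.0, 3.0, 0.5, -1.5, 2.5, 4.0, -0.5, 1.0, 2.0]
filt = [0.9, 0.4, -0.3, 0.2, 0.1, -0.05, 0.02, 0.01, -0.01, 0.005, 0.002, 0.001]
fast = partitioned_convolve(signal, filt, block=4)
slow = direct_convolve(signal, filt)
max_error = max(abs(a - b) for a, b in zip(fast, slow))
```

Here `max_error` should be at the level of floating-point rounding, confirming that the sectioned result equals the direct convolution.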
In
The second preferred embodiments address the computational issue related to spectral multiplication taking into consideration the peculiarities of room impulse responses and human hearing. It is well known that human hearing is less sensitive to late reverberation as compared to early arrivals of the sound waves, and therefore the late reverberation portion of a room impulse response can be simplified in several ways without affecting the perceptual quality of the sound. This is also true with respect to the spatiality of the sound, which is dictated by the early arrivals of the sound wave. Therefore, the relation of the ipsi-lateral and contra-lateral for the trailing portion of room impulse responses can be modified for better efficiency. As shown in
In general, if the room impulse response has L coefficients, then the last 0.2L-0.4L (20-40%) of the contra-lateral coefficients are omitted, depending upon the size of L, with larger L implying a larger fraction of the contra-lateral coefficients omitted.
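As a rough sketch of this omission, the trailing filter sections of a contra-lateral response can simply be dropped before the sectioned convolution. The function name `trim_contralateral`, the rounding of the omitted fraction to whole sections, and the example response are assumptions for illustration only.

```python
def trim_contralateral(contra_rir, block, omit_fraction):
    """Drop the trailing fraction of a contra-lateral response, in whole sections.

    Returns (trimmed_response, sections_kept, sections_total).
    """
    if not 0.0 <= omit_fraction < 1.0:
        raise ValueError("omit_fraction must lie in [0, 1)")
    n_sections = -(-len(contra_rir) // block)       # ceiling division
    n_omitted = round(omit_fraction * n_sections)   # whole sections only
    n_kept = n_sections - n_omitted
    return list(contra_rir[:n_kept * block]), n_kept, n_sections


# Hypothetical decaying contra-lateral response: 10 sections of 4 taps.
contra = [0.5 ** n for n in range(40)]
trimmed, kept, total = trim_contralateral(contra, block=4, omit_fraction=0.3)
```

With these example values 7 of the 10 sections remain, so the spectral multiplications for the last 3 contra-lateral sections are skipped for every input block.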
4. Room Impulse Response Convolution Simplification
d shows the third preferred embodiment methods to reduce the number of computations required in convolution of room impulse responses by modifying the middle portions of the sectioned filter of
On the other hand, just shortening the filter as a whole would cause a significant perceptual impact, because the total reverberation duration (i.e., filter length) is an important factor that must be preserved.
The third preferred embodiments use the sectioned filter block convolution approach and simplify intermediate filter sections where temporal structures tend to get masked.
It is worth noting that just shortening a filter section by truncating its trailing portion and applying a multiplicative compensation factor to preserve total energy is not a good solution due to the spectral distortion introduced by the truncation. In order to preserve the spectral shape of intermediate filter sections, the preferred embodiments transform them into minimum phase filters. The reason for this, besides the fact that the magnitude spectrum has the same shape as the original filter section, is that minimum phase filters have the property that their energy is maximally concentrated in the first coefficients and tends to decrease towards the end. Thus, the spectral distortion caused by truncation can be minimized. It is also worth noting that the minimum-phase transformation also produces spectral distortion due to the change in the phase spectrum of individual sections, but that does not represent a problem because phase relations are less important except for the first reflections.
The third preferred embodiment proceeds as follows: first, transform each of the room impulse response filter sections after the first section into a minimum-phase filter by reflecting all the z-transform zeros located outside of the unit circle into zeros inside the unit circle; next, truncate the minimum-phase filters in the time domain; and after truncation, apply a multiplicative factor to correct the energy level of each truncated minimum-phase filter to match the original filter section energy level.
More explicitly, convert a filter section (or combined filter sections) to a minimum-phase filter by convolving with an allpass filter determined by the zeros of the filter transfer function which lie outside of the unit circle. Indeed, let h(n) for 0≦n<N be a filter to convert to a minimum-phase filter, hmin(n). As with the first preferred embodiments, h(n) can be considered as an infinite sequence with a finite number of nonzero coefficients or as a finite (or periodic) sequence. The infinite sequence approach gives an exact hmin(n) which may have an infinite number of nonzero coefficients but is truncated anyway, and the finite sequence approach gives an approximate hmin(n). For the infinite sequence approach, initially compute the transfer function H(z); then find an allpass filter with transfer function Hallpass(z) so that H(z)=Hmin(z)Hallpass(z). To determine Hallpass(z), first find the zeros of H(z) which lie outside of the unit circle, and then for each such zero (e.g., $H(\alpha)=0$ for $1<|\alpha|$) include the bilinear factor $(z^{-1}-\alpha^{-1})/(1-(\alpha^{-1})^{*}z^{-1})$ in Hallpass(z) (note that * indicates complex conjugate). That is, compute $H_{\min}(z)=H(z)\prod (1-(\alpha^{-1})^{*}z^{-1})/(z^{-1}-\alpha^{-1})$ where the product is over the zeros of H(z) outside of the unit circle. Next, inverse z-transform to recover hmin(n). Then, truncate and correct the energy level to get hmin-trun(n). Lastly, 0 pad (each section if combined filter sections) and compute the FFT for use in the architecture of
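The zero reflection, truncation, and energy correction can be sketched as below. The sketch assumes the zeros of the filter section are already known (for example from a polynomial root finder, which is not shown), and the function names are hypothetical. Each reflected zero is compensated by a gain of |α|, a normalization chosen here to keep the magnitude response unchanged; it can differ from the allpass factor in the text by a unit-modulus constant.

```python
import math


def poly_from_zeros(zeros, gain=1.0):
    """Coefficients of gain * prod(1 - z0 * z^-1); real when zeros pair up."""
    coeffs = [complex(gain)]
    for z0 in zeros:
        nxt = coeffs + [0j]
        for i, c in enumerate(coeffs):
            nxt[i + 1] -= complex(z0) * c
        coeffs = nxt
    return [c.real for c in coeffs]


def reflect_zeros(zeros, gain=1.0):
    """Move zeros outside the unit circle to their conjugate-reciprocal positions."""
    new_zeros, new_gain = [], gain
    for z0 in zeros:
        z0 = complex(z0)
        if abs(z0) > 1.0:
            new_zeros.append(1.0 / z0.conjugate())
            new_gain *= abs(z0)      # keeps |H| unchanged on the unit circle
        else:
            new_zeros.append(z0)
    return new_zeros, new_gain


def truncate_with_energy_match(h, keep):
    """Keep the first `keep` taps and rescale to the energy of the full filter."""
    energy = sum(c * c for c in h)
    kept = list(h[:keep])
    scale = math.sqrt(energy / sum(c * c for c in kept))
    return [scale * c for c in kept]


zeros = [2.0, 0.5]                               # one zero outside the unit circle
h = poly_from_zeros(zeros)                       # [1.0, -2.5, 1.0]
min_zeros, min_gain = reflect_zeros(zeros)
h_min = poly_from_zeros(min_zeros, min_gain)     # [2.0, -2.0, 0.5]
h_trunc = truncate_with_energy_match(h_min, keep=2)

# Energy retained by the first two taps, before any correction.
kept_original = sum(c * c for c in h[:2])        # 7.25 of 8.25
kept_minimum = sum(c * c for c in h_min[:2])     # 8.0 of 8.25
```

In this small example the minimum-phase version retains more energy in its leading taps than the original section, which is the property that motivates truncating after the conversion.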
5. Zero-padding with FFT
Fourth preferred embodiments reduce the computational complexity of the FFT after 0-padding used in the overlap-add method of filtering by multiplication in the frequency domain, and thus can be applied to the foregoing preferred embodiments. In particular, let x(n) for 0≦n<N be 0 padded to define xpad(n) for 0≦n<2N as
Then the 2N-point FFT of xpad(n) is:
where the N-point inverse FFT expression for x(n) was substituted. Now rearranging:
$$X_{pad}(k) = \frac{1}{N} \sum_{0 \le i < N} X(i) \sum_{0 \le n < N} e^{-j2\pi n(k/2-i)/N}$$
Consider the case of k an even integer: k=2m. In this case:
Thus the even frequencies of the zero-padded spectrum can be computed as the frequencies of the non-zero-padded spectrum at one-half the frequency. That is, an N-point FFT of x(n) generates the even frequencies of the 2N-point FFT of xpad(n).
For the odd frequencies of the zero-padded spectrum, take k=2m+1 in the foregoing:
which is a circular convolution in the frequency domain where
$$S(k) = \sum_{0 \le n < N} e^{-j\pi n/N}\, e^{-j2\pi nk/N}$$
is the N-point FFT of $s(n)=e^{-j\pi n/N}$, extended by periodicity to negative k. Thus the odd frequencies of the zero-padded spectrum are computed as a circular convolution of S(k) with X(k), the N-point FFT of x(n).
For notational convenience, define Y(m)=Xpad(2m+1), then taking the N-point inverse FFT gives:
Thus, rather than computing the odd frequencies of Xpad(k) by convolution in the frequency domain, the fastest way is to move back to the time domain: multiply the original sequence x(n) point-wise by the complex exponential $e^{-j\pi n/N}$, which pre-warps the FFT to look at the odd frequencies, and take the N-point FFT of the product to get back to the frequency domain and the odd frequencies. Since the original sequence is available and s(n) can be pre-calculated, all that is required is a point-wise multiplication and a forward FFT to obtain the odd frequencies directly.
In short, the fourth preferred embodiment zero-padded 2N-point FFT requires two N-point FFTs and N complex multiplies instead of one 2N-point FFT. However, half of the complex multiplies in the time domain can be combined with twiddle factors in the first stage of many FFT implementations, so only an additional N/2 complex multiplies are required. Hence, about 3N/2 operations can potentially be saved.
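The two-transform computation of the zero-padded spectrum can be checked with a short sketch. A naive DFT stands in for the N-point and 2N-point FFTs, and the names `dft` and `zero_padded_spectrum` as well as the test sequence are assumptions for illustration.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) forward DFT, standing in for an FFT routine."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
            for k in range(n)]


def zero_padded_spectrum(x):
    """2N-point spectrum of x padded with N zeros, from two N-point transforms."""
    n = len(x)
    even = dft(x)  # even bins: Xpad(2m) = X(m)
    # Pre-warp by s(n) = exp(-j*pi*n/N); these factors can be precomputed.
    warped = [v * cmath.exp(-1j * math.pi * i / n) for i, v in enumerate(x)]
    odd = dft(warped)  # odd bins: Xpad(2m+1)
    out = [0j] * (2 * n)
    out[0::2] = even
    out[1::2] = odd
    return out


x = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0, 0.0, 1.5]
fast = zero_padded_spectrum(x)
reference = dft(x + [0.0] * len(x))  # direct 2N-point transform
max_error = max(abs(a - b) for a, b in zip(fast, reference))
```

The interleaved result should agree with the direct 2N-point transform to within rounding error.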
An alternative fourth preferred embodiment method approximates the terms in the definition of S(m) to simplify the frequency-domain convolution computation. In particular,
Note that if n=0, then cos[πn(2k+1)/N]=1 and for all other n the cosine is anti-symmetric about N/2: cos[πn(2k+1)/N]=−cos[π(N−n)(2k+1)/N]. And if N is even, n=N/2 gives cos[πn(2k+1)/N]=0. Thus all the cosine terms except n=0 cancel in the summation. In contrast, the sine is symmetric about N/2, so only the n=0 term can be omitted. And thus separating the n=0 terms out of the sum defining S(k) gives:
$$Y(m) = \frac{1}{N} \sum_{0 \le i < N} X(i) + \frac{-j}{N} \sum_{0 \le i < N} X(i)\, T(m-i)$$
where
$$T(k) = \sum_{1 \le n < N} \sin[\pi n(2k+1)/N]$$
Since T(k) is real-valued and anti-symmetric, it can be thought of as a linear-phase filter which is cyclically convolved with X(k). The first sum in Y(m) needs to be computed only once. However, the convolution needs to be calculated for each m, requiring $O(N^2)$ operations to calculate all Y(m). The preferred embodiment methods approximate the convolution with far fewer computations by windowing or other modification of the filter kernel.
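The exact frequency-domain form of the odd bins, before any approximation of the kernel, can be verified numerically. In this sketch the names `dft`, `kernel_t`, and `odd_bins_by_convolution` are hypothetical, a naive DFT stands in for the FFT, the second sum runs over all indices i, and T(k) is treated as periodic in k with period N.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) forward DFT, standing in for an FFT routine."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
            for k in range(n)]


def kernel_t(k, n):
    """T(k) = sum over 1 <= i < n of sin(pi * i * (2k + 1) / n)."""
    return sum(math.sin(math.pi * i * (2 * k + 1) / n) for i in range(1, n))


def odd_bins_by_convolution(spectrum):
    """Y(m) = mean(X) - (j/N) * circular convolution of X with T."""
    n = len(spectrum)
    mean = sum(spectrum) / n                 # computed once for all m
    t = [kernel_t(k, n) for k in range(n)]
    out = []
    for m in range(n):
        conv = sum(spectrum[i] * t[(m - i) % n] for i in range(n))
        out.append(mean - 1j * conv / n)
    return out


x = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0, 0.0, 1.5]
y = odd_bins_by_convolution(dft(x))
reference = dft(x + [0.0] * len(x))[1::2]    # odd bins of the 2N-point transform
max_error = max(abs(a - b) for a, b in zip(y, reference))
```

The sine sum also has the closed form cot[π(2k+1)/(2N)], which follows from summing the geometric series for S(k) and makes the dominance of coefficients near k=0 easy to see.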
Initially, consider the computational simplification in terms of operations. Presume a 2N-point FFT requires $K(2N\log_2(2N))$ operations and an N-point FFT requires $K(N\log_2(N))$ operations. If 2NM operations are required to compute the convolution directly (for a filter kernel with M non-zero coefficients), then to save computation requires
$$2NM < K(2N\log_2(2N)) - K(N\log_2(N)) - N$$
where the first term on the right is the direct 2N-point FFT complexity to get Xpad(k), the second term is the N-point FFT complexity to get X(k), and the last term is the cost of $(1/N)\sum_{0 \le i<N} X(i)$ for the non-convolution term of Y(m). This implies
$$2M < K\log_2(N) + 2K - 1$$
For example, with N=8192 and K=4 then M<29.5 is needed to save computation.
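The bound on the kernel length can be reproduced with a few lines of arithmetic. The function names below are hypothetical; the operation counts are the ones presumed in the text.

```python
import math


def kernel_length_bound(n, k_const):
    """Upper bound on M from 2M < K*log2(N) + 2K - 1."""
    return (k_const * math.log2(n) + 2 * k_const - 1) / 2


def saves_computation(n, k_const, m_taps):
    """Direct check of 2NM < K(2N log2(2N)) - K(N log2 N) - N."""
    direct_fft = k_const * 2 * n * math.log2(2 * n)
    small_fft = k_const * n * math.log2(n)
    return 2 * n * m_taps < direct_fft - small_fft - n


bound = kernel_length_bound(8192, 4)      # (4*13 + 8 - 1) / 2 = 29.5
ok_29 = saves_computation(8192, 4, 29)    # just below the bound
ok_30 = saves_computation(8192, 4, 30)    # just above the bound
```

So under these presumed counts a kernel of at most 29 non-zero coefficients saves computation for N=8192 and K=4, while 30 coefficients does not.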
Once the length (M) of the filter kernel has been set, the next step is to create the kernel. This is equivalent to a filter design problem. Perhaps the simplest approach is to truncate the original kernel. A graph of T(k) shows that coefficients near k=0 dominate, and coefficients near k=N/2 have small magnitudes. Hence, define a truncated version of T(k):
Alternatively, multiplication with a window function, such as a Hann window, can similarly reduce the number of nonzero filter coefficients but with a smoother transition in filter coefficient magnitude. Of course, other filter design methods could be used to approximate T(k) by a filter with a small number of nonzero filter coefficients.
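One possible truncation can be sketched as follows. The choice of which coefficients to keep (the M indices nearest k=0, counting negative k by periodicity) is an assumption made for this example, as are the function names and the test sequence; the accuracy of any given kernel length would need to be evaluated for the application.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) forward DFT, standing in for an FFT routine."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
            for k in range(n)]


def kernel_t(k, n):
    """T(k) = sum over 1 <= i < n of sin(pi * i * (2k + 1) / n)."""
    return sum(math.sin(math.pi * i * (2 * k + 1) / n) for i in range(1, n))


def truncated_kernel(n, m_taps):
    """Keep T(k) for the m_taps indices nearest k = 0 and zero the rest."""
    half = m_taps // 2
    kept = {k % n for k in range(-half, m_taps - half)}  # negative k wraps around
    return [kernel_t(k, n) if k in kept else 0.0 for k in range(n)]


def odd_bins(spectrum, kernel):
    """Y(m) = mean(X) - (j/N) * circular convolution of X with the kernel."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    return [mean - 1j * sum(spectrum[i] * kernel[(m - i) % n] for i in range(n)) / n
            for m in range(n)]


x = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0, 0.0, 1.5,
     0.25, -0.75, 1.0, 0.0, 2.0, -1.0, 0.5, 0.5]
n = len(x)
spectrum = dft(x)
exact = dft(x + [0.0] * n)[1::2]

short_kernel = truncated_kernel(n, 6)
approx = odd_bins(spectrum, short_kernel)
approx_error = max(abs(a - b) for a, b in zip(approx, exact))

full = odd_bins(spectrum, truncated_kernel(n, n))  # no truncation
full_error = max(abs(a - b) for a, b in zip(full, exact))
```

Comparing `approx_error` with `full_error` for different kernel lengths gives a quick view of the accuracy traded for the reduced operation count; a tapered window could be applied to `short_kernel` in the same way.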
6. Modifications
The preferred embodiments can be modified in various ways; for example, vary the sizes of blocks of samples, vary the size of FFTs, vary the sizes of filter partitions, truncate more or less of filter sections, use other spectrum modifications such as tapering, and so forth.
This application claims priority from provisional patent application Nos. 60/657,234, filed Feb. 28, 2005 and 60/756,045, filed Jan. 4, 2006. The following co-assigned copending applications disclose related subject matter: Appl. Ser. No.: 11/125,927, filed May 10, 2005.
Number | Date | Country
---|---|---
60657234 | Feb 2005 | US
60756045 | Jan 2006 | US