Room impulse response

BACKGROUND OF THE INVENTION

The present invention relates to digital audio signal processing, and more particularly to artificial room impulse responses for virtualization devices and methods.

Multi-channel audio is an important feature of DVD players and home entertainment systems. It provides a more realistic sound experience than is possible with conventional stereophonic systems by roughly approximating the speaker configuration found in movie theaters. FIG. 2b illustrates an example of 5-channel audio processing known as “virtual surround” which consists of creating the illusion of a 5-channel speaker system using a conventional pair of loudspeakers. This technique makes use of the impulse responses (time domain) or transfer functions (frequency domain) from each virtual loudspeaker to a listener's ears; these are the head-related impulse responses (HRIRs) or head-related transfer functions (HRTFs) and depend upon the angles and distance between the speaker and the facing direction of the listener.

By including HRIRs/HRTFs of paths with reflections and attenuations in addition to the direct path from a (virtual) speaker to a listener's ear, the virtual listening environment can be controlled. Such a combination of HRIRs/HRTFs gives a room impulse response or transfer function. A room impulse response is largely unknown, but the direct path HRTFs can be approximated by use of a library of measured HRTFs. For example, Gardner, Transaural 3-D Audio, MIT Media Laboratory Perceptual Computing Section Technical Report No. 342, Jul. 20, 1995, provides HRTFs for every 5 degrees (azimuthal). Then an artificial room impulse response/transfer function can be generated by the superposition of HRIRs/HRTFs corresponding to multiple reflection paths of the sound wave in a virtual room environment together with factors for absorption and phase change upon virtual wall reflections. A widely accepted method for simulating room acoustics called the “image method” can be used to determine a set of angles and distances of virtual speakers corresponding to wall reflections. Each virtual speaker (described by its angle and distance) can be associated with an HRIR (or its corresponding HRTF) attenuated by an amount that depends on the distance and number of reflections along its path. Therefore, the room impulse response corresponding to a speaker and its wall reflections can be obtained by summing the HRIR corresponding to the location of the original speaker with respect to the listener and the HRIRs corresponding to locations imaged by wall reflections. As the distance and number of reflections increase, the corresponding HRIR suffers a stronger attenuation that causes the room impulse response to decay slowly towards the end. An example of a room impulse response generated using this method is shown in FIG. 2h.

The signal processing can be more explicitly described as follows. FIG. 2e shows functional blocks of the 2-speaker implementation for the 5-channel arrangement of FIG. 2b; this implementation requires cross-talk cancellation for the real speakers which appears in the lower right block in FIG. 2e. Here cross-talk denotes the signal from the right speaker that is heard at the left ear and vice-versa. The basic solution to eliminate cross-talk was proposed in U.S. Pat. No. 3,236,949 and is explained as follows. Consider a listener facing two loudspeakers as shown in FIG. 2a. Let X₁(e^jω) and X₂(e^jω) denote the (short-term) Fourier transforms of the analog signals which drive the left and right loudspeakers, respectively, and let Y₁(e^jω) and Y₂(e^jω) denote the Fourier transforms of the analog signals actually heard at the listener's left and right ears, respectively. Presuming a symmetrical speaker arrangement, the system can then be characterized by two HRTFs, H₁(e^jω) and H₂(e^jω), which respectively relate to the short and long paths from speaker to ear; that is, H₁(e^jω) is the transfer function from left speaker to left ear or right speaker to right ear, and H₂(e^jω) is the transfer function from left speaker to right ear and from right speaker to left ear. This situation can be described as a linear transformation from X₁, X₂to Y₁, Y₂with a 2×2 matrix having elements H₁and H₂:

$[\begin{matrix} Y_{1} \\ Y_{2} \end{matrix}] = [\begin{matrix} H_{1} & H_{2} \\ H_{2} & H_{1} \end{matrix}] [\begin{matrix} X_{1} \\ X_{2} \end{matrix}]$

Note that the dependence of H₁and H₂on the angle that the speakers are offset from the facing direction of the listener has been omitted.

FIG. 3 shows a cross-talk cancellation system in which the input electrical signals (short-term Fourier transformed) E₁(e^jω), E₂(e^jω) are modified to give the signals X₁, X₂which drive the loudspeakers. (Note that the input signals E₁E₂are the recorded signals, typically recorded using either a pair of moderately-spaced omni-directional microphones or a pair of adjacent uni-directional microphones with an approximately 60 degree angle between the two microphone directions.) This conversion from E₁, E₂into X₁, X₂is also a linear transformation and can be represented by a 2×2 matrix. If the target is to reproduce signals E₁, E₂at the listener's ears (so Y₁=E₁and Y₂=E₂) and thereby cancel the effect of the cross-talk (due to H₂not being 0), then the 2×2 matrix should be the inverse of the 2×2 matrix having elements H₁and H₂. That is, taking

$[\begin{matrix} X_{1} \\ X_{2} \end{matrix}] = {[\begin{matrix} H_{1} & H_{2} \\ H_{2} & H_{1} \end{matrix}]}^{- 1} [\begin{matrix} E_{1} \\ E_{2} \end{matrix}] = \frac{1}{H_{1}^{2} - H_{2}^{2}} [\begin{matrix} H_{1} & - H_{2} \\ - H_{2} & H_{1} \end{matrix}] [\begin{matrix} E_{1} \\ E_{2} \end{matrix}]$

yields Y₁=E₁and Y₂=E₂.

Of course, the implementation of such filters would require considerable dynamic range reduction in order to avoid saturation about frequencies with response peaks. For example, with two real speakers each 30 degrees offset as in FIG. 2a, the log magnitude of

$\frac{1}{H_{1}^{2} - H_{2}^{2}}$

has the form illustrated by FIG. 2g. The range is from 0 Hz to 24000 Hz sampled every 93.75 Hz (using an FFT length of 512). The gain has been scaled so that the minimum gain is 1.0 (0 dB on the log scale). Note the large peak near 8000 Hz (near bin 90). This large peak in turn limits the available dynamic range.

For example, the left surround sound virtual speaker could be at an azimuthal angle of about 225 degrees. Thus with cross-talk cancellation, the corresponding two real speaker inputs to create the virtual left surround sound speaker would be:

$[\begin{matrix} X_{1} \\ X_{2} \end{matrix}] = \frac{1}{H_{1}^{2} - H_{2}^{2}} [\begin{matrix} H_{1} & - H_{2} \\ - H_{2} & H_{1} \end{matrix}] [\begin{matrix} TF 3_{left} \cdot LSS \\ TF 3_{right} \cdot LSS \end{matrix}]$

where H₁, H₂are for the left and right real speaker angles (e.g., 30 and 330 degrees), LSS is the (short-term Fourier transform of the) left surround sound signal, and TF3_left=H₁(225), TF3_right=H₂(225) are the HRTFs for the left surround sound speaker angle (225 degrees).

Again, FIG. 2e shows functional blocks for a virtualizer with the cross-talk canceller to implement 5-channel audio with two real speakers as in FIG. 2b; each channel signal is filtered by the corresponding pair of HRTFs for the corresponding (virtual) speaker's offset angle and distance, and the filtered signals summed and input into the cross-talk canceller and the two cross-talk-cancelled outputs then drive the two real speakers.

In the case of headphones, the cross-talk problem disappears, and the filtered channel signals can directly drive the headphones as shown in FIGS. 2c and 2f. Also, FIG. 2d illustrates an approximate symmetry between forward and rear speaker locations.

Generally in multi-channel audio processing, the filtering with HRIRs or HRTFs and/or room impulse responses takes the form of many convolutions of input audio signals with long filters. Typically, a room impulse response from each (virtual) sound source to each ear is used. Since an artificial room impulse response can be several seconds long, this poses a challenging computational problem even for fast digital signal processors. Further, artificial room impulse responses need to be corrected in terms of spectral characteristics due to coloration effects introduced by HRIR filters. And external equalizers would involve additional computational overhead and possibly disrupt phase relations that are important in 3D virtualization systems.

One approach to lowering computational complexity of the filtering convolutions first transforms the input signal and the filter impulse response into the frequency domain (as by FFT) where the convolution transforms into a pointwise multiplication and then inverse transforms the product back into the time domain (as by IFFT) to recover the convolution result. The overlap-add method uses this approach with 0 padding prior to FFT to avoid circular convolution feedback. Further, for filtering with a long impulse response, the impulse response can be sectioned into shorter filters and the filtering (convolution) by each filter section separately computed and the results added to give the overall filtering output.

SUMMARY OF THE INVENTION

The present invention provides artificial room impulse responses of the form of a superposition of HRIRs with individually modified HRIRs, and/or with omission of the large-delay contra-lateral portions of the responses, and/or with low computational complexity convolution by truncation of middle sections of the response, and/or by Fourier transform with simplified 0 padding for overlap-add.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1
a-1d show filters and method flow diagrams.

FIGS. 2
a-2j illustrate head-related acoustic transfer function and virtualizer geometries and room impulse response.

FIG. 3 is a cross-talk cancellation system.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

1. Overview

Preferred embodiments modify artificial room impulse responses (in the form of a superposition of direct path HRIR plus reflective path attenuated HRIRs) by individual HRIR adjustments prior to superposition; FIGS. 1a-1b illustrate functional blocks. Further preferred embodiments lower computational complexity by omission of large-delay contra-lateral portions of room impulse responses and/or truncation of middle sections of the responses.

Preferred embodiment systems (e.g., audio/visual receivers, DVD players, digital television sets, etc.) perform preferred embodiment methods with any of several types of hardware: digital signal processors (DSPs), general purpose programmable processors, application specific circuits, or systems on a chip (SoC) such as combinations of a DSP and a RISC processor together with various specialized programmable accelerators such as for FFTs and variable length coding (VLC). A stored program in an onboard or external flash EEPROM or FRAM could implement the signal processing.

2. Modification of Room Impulse Response

The first preferred embodiments allow direct manipulation of an artificial room impulse response off-line during or after its generation. Manipulating the spectrum of a long impulse response is only possible with careful consideration on the magnitude-phase relations that must hold for real, causal systems (Hilbert relations). Methods exist that permit inferring the phase spectrum of an arbitrary magnitude spectrum for particular situations such as minimum, linear, or maximum-phase systems. However, in this case the phase spectrum of the room impulse response must keep its original temporal structure, at least in terms of temporal envelope, which is the basis of the perceived reverberation effect.

FIGS. 1
a-1b illustrate a first preferred embodiment. The method consists of conducting spectral modification on each HRIR (FIG. 1b) prior to its superposition to form the room impulse response (FIG. 1a). The spectral modification block described in FIG. 1b consists of introducing changes in the magnitude spectrum before the calculation of the corresponding minimum phase spectrum. In addition, the initial delay corresponding to the original HRIR path is attached to the beginning of the modified impulse response. The spectral shaping can be done in several ways in the frequency domain by simply manipulating the bins of the magnitude spectrum. For example, a simple bass boost effect can be obtained by adding 6 dB at the first ⅛ bins of the magnitude spectrum.

More explicitly, let RIR(.) denote an artificial room impulse response from a sound source to a listener's ear, and presume RIR has the form of a sum of HRIRs corresponding to the direct plus various reflective paths with attenuations:

RIR()=Σ_1≦i≦KHRIR_i()

The summation index i labels the paths considered, and each HRIR(.) typically has only a few non-zero filter coefficients which are offset from 0 according to the delay along the path from source to ear. Indeed, the spikes visible in the lefthand portion of FIG. 2h indicate individual reflective paths.

Then modify each HRIR to correct for spectral coloration as in FIG. 1b. For example, presume HRIR_i(n)=0 for n<D_iand n≧D_i+L; that is, the impulse response has length L and an onset delay of D_i. For notational convenience, take h_i(n)=HRIR_i(n+D_i) and after spectral modification of h_i(n) the delay is restored to get the spectrally modified HRIR_i(n). Now the spectral modification can be done offline, and thus h_i(n) can be treated as either a finite sequence or an infinite sequence with only a finite number of nonzero coefficients. For the infinite sequence approach the z-transform H_i(z) is used and values on the unit circle are the Fourier transform; i.e., H_i(e^jω). In magnitude-phase representation: H_i(z)=|H_i(z)|exp(arg[H_i(z)]). Thus, modify the spectral magnitude (e.g., boost the bass) by multiplying |H_i(e^jω)| by a function of ω (e.g., 2 for |ω|<π/8 and 1 elsewhere) to give |H_i-mod(e^jω)|. Then the corresponding minimum phase is computed as:

arg[H_i-mod(e^jω)]=(½π)∫_−π<θ<πlog|H_i-mod(e^jθ)|cot[(θ−ω)/2]dθ

And then inverse transform H_i-mod(e^jω) to get h_i-mod(n). Of course h_i-mod(n) is minimum-phase by construction and thus packs most energy into the lowest coefficients, but h_i-mod(n) may have an infinite number of nonzero coefficients and can be truncated.

The alternative is to let H_i(k) be the N-point FFT of h_i(n) where N is at least 2L (the factor of 2 is a “causality” condition for finite length (or periodic) sequences). Then perform the desired spectral modification of |H_i(k)|, such as a bass boost by increasing |H_i(k)| by 6 dB for 0≦k<N/8. Denote this bass boosted spectral magnitude by |H_i-boost(k)|. An approximate minimum-phase for the bass-boosted spectrum can then be defined in terms of log|H_i-boost(k)|; namely, the phase is taken as the FFT of the product of the IFFT of log|H_i-boost(k)| with the unit step:

argH_i-boost(k)=FFT{u(n)IFFT[log|H_i-boost(k)|]}

where

$\begin{matrix} u (n) = 2 for 1 \underline{<} n < N / 2 \\ = 1 for n = 0, N / 2 \\ = 0 for N / 2 < n < N \end{matrix}$

Lastly, the delay D_iis attached; see FIG. 1b.

Compared with performing separate equalization, an obvious advantage of the first preferred embodiment is that the processing is performed off-line, resulting in higher computational efficiency. In addition, the present method avoids possible phase disruptions caused by external equalizers that could severely affect the virtualization effect.

On the other hand, modifying the room impulse response after its generation requires careful manipulation of the phase spectrum to maintain the real and causal characteristics of the impulse response. Using a minimum, linear, or maximum-phase spectrum conversion directly on the entire room impulse response is not possible, since the temporal envelope of the impulse response is an important element that cannot be changed. For example, if the entire impulse response is converted into minimum-phase, most of its energy will concentrate at the beginning of the filter, disrupting the temporal structure corresponding to the virtual speakers and their corresponding delays and attenuations.

The preferred embodiment method can successfully modify the magnitude spectrum of the generated impulse response by changing the magnitude spectrum of each HRIR to be overlapped, and also maintain the original envelope of the phase spectrum, since the modified HRIRs are added at the same positions with the same attenuations.

3. Room Impulse Response Convolution Shortening

The second preferred embodiments reduce the number of computations required in frequency-domain convolution of (artificial, modified) room impulse responses by skipping the computation of contra-lateral paths (a path from left side of head to right ear or from right side of head to left ear) for the last few filter sections which results in shorter contra-lateral impulse responses without affecting the resulting quality. This simplification is possible due to the nature of human hearing, which is less sensitive to late reverberation as compared to early arrivals of the sound wave, and to the fact that late reverberation contains little spatial information. Therefore, the trailing portions of room impulse responses do not need to have well-defined ipsi-lateral (path approaching from right side to right ear or from left side to left ear) and contra-lateral impulse responses.

FIGS. 2
i-2j and 1c illustrate the prior art and the second preferred embodiments as follows. FIG. 2i shows filter sectioning into blocks and frequency-domain computations generally; each individual filter section convolution is performed by the overlap-add method: segment at input signal into blocks, transform to frequency domain, pointwise multiply, transform back to time domain, and overlap and add to form the output signal for the filter section. Lastly, the outputs of the filter sections are added. In more detail, presume an input signal s(n) and filter coefficients f(n), the filter output y(n) is the convolution:

$\begin{matrix} y (n) = \sum_{0 \underline{<} / < L} s (n - l) f (l) \\ = \sum_{0 \underline{<} / < L} s (n - l) [\sum_{0 \underline{<} m < M} f_{m} (l)] \\ = \sum_{0 \underline{<} m < M} \sum_{0 \underline{<} b < B} s (n - b - mB) f_{m} (b + mB) \end{matrix}$

where the filter length L is the product of a block size B times the number of filter sections (blocks) M, and the m-th filter section f_m(n) has non-zero coefficients only in the m-th block, mB≦n<(m+1)B. Typically, the block size is taken as a convenient power of 2 for ease of FFT, such as B=256. Then each of the M convolutions is computed by the steps of: section the input signal into blocks of size B, pad input block and filter section with 0s to size 2B (this avoids circular convolution wrap-around), 2B-point FFT for both input block and filter section (filter section FFT may be precomputed and stored), pointwise multiply transforms, IFFT; and combine the 2B-point results by overlap-add where the overlap is by B samples to give the output of the m-th filter. Lastly, add the M filter outputs. Note that the m-th filter section filtering is equivalent to a length B filter acting on an input signal block which has been delayed by m blocks. That is,

y_m(n)=Σ_0≦b<Bs(n−b−mB)f_m(b+mB)=Σ_0≦b<Bs_m(n−b)h_m(b)

where h_m(b)=f_m(b+mB) has non-zero coefficients only for 0≦b<B and s_m(n)=s(n−mB) is a delayed version of s(n). Thus the FFT for the 0-padding input signal block has already been computed.

In FIG. 2i, S₀through S_M−1are the input audio signal blocks with 0 padding, and F₀through F_M−1are the corresponding filter section blocks with 0 padding. The filter in the preferred embodiments is a (modified) room impulse response. Calculation of the FFT for filter coefficients (F₀, F₁, . . . , F_M−1) can be done off-line. For uniform block-size partitioning, the signals S₁, . . . , S_M−1are respectively equal to the previous values of S₀, . . . , S_M−2. Thus, it is not necessary to compute the FFT of S₁, . . . , S_M−1but rather load previously saved data. This scheme can be applied to the room impulse response convolution problem by partitioning the ipsi-lateral and contra-lateral filters into sections and performing signal mixing in the frequency domain prior to the inverse FFT, as shown in FIG. 2j (only the process of creating the left output signal is shown).

The second preferred embodiments address the computational issue related to spectral multiplication taking into consideration the peculiarities of room impulse responses and human hearing. It is well known that human hearing is less sensitive to late reverberation as compared to early arrivals of the sound waves, and therefore the late reverberation portion of a room impulse response can be simplified in several ways without affecting the perceptual quality of the sound. This is also true with respect to the spatiality of the sound, which is dictated by the early arrivals of the sound wave. Therefore, the relation of the ipsi-lateral and contra-lateral for the trailing portion of room impulse responses can be modified for better efficiency. As shown in FIG. 1c the second preferred embodiments make use of this fact by skipping the multiplication by the contra-lateral filter sections location at the end of the room impulse response. For example, if the original filter has 10 sections and the 8 last contra-lateral sections are skipped, the number of spectral multiplications can be reduced by 40%. FIG. 1c illustrates the case where the 3 last filter sections have no contra-lateral paths.

In general, if the room impulse response has L coefficients, then the last 0.2-0.4 L (20-40%) of the coefficients are omitted, depending upon the size of L, with larger L implying a larger fraction of the contra-lateral coefficient omitted.

4. Room Impulse Response Convolution Simplification

FIG. 1
d shows the third preferred embodiment methods to reduce the number of computations required in convolution of room impulse responses by modifying the middle portions of the sectioned filter of FIG. 2i. Indeed, a room impulse response can be characterized by the pattern of sound reflections that are expected for the particular modeled environment. It is known from psychoacoustics that the human ear is sensitive to the first reflections that arrive after the direct signal, but tends not to distinguish the large number of reflections that occur afterwards; i.e., they are perceived as a collective effect. Therefore, it is expected that the impulse response can be greatly simplified without affecting the output quality.

On the other hand, just shortening the filter as a whole would cause a significant perceptual impact, because the total reverberation duration (i.e., filter length) is an important factor that must be preserved.

The third preferred embodiments use the sectioned filter block convolution approach and simplify intermediate filter sections where temporal structures tend to get masked.

It is worth noting that just shortening a filter section by truncating its trailing portion and applying a multiplicative compensation factor to preserve total energy is not a good solution due to the spectral distortion introduced by the truncation. In order to preserve the spectral shape of intermediate filter sections, the preferred embodiments transform them into minimum phase filters. The reason for this, besides the fact that the magnitude spectrum has the same shape as the original filter section, is that minimum phase filters have the property that their energy is maximally concentrated in the first coefficients and tends to decrease towards the end. Thus, the spectral distortion caused by truncation can be minimized. It is also worth noting that the minimum-phase transformation also produces spectral distortion due to the change in the phase spectrum of individual sections, but that does not represent a problem because phase relations are less important except for the first reflections.

The third preferred embodiment proceeds as: first transforms each of the room impulse response filter sections after the first section into minimum-phase filter by reflecting all the z-transform zeros located outside of the unit circle into zeros inside the unit circle; next, truncate the minimum-phase filters in the time domain; and after truncation, apply a multiplicative factor to correct the energy level of the truncated minimum-phase filter to match the original filter section energy level. FIG. 1d illustrates the minimum-phase, truncation, energy level correction as “modify”. Note that FIG. 1d implicitly presumes that the filter sections all have the same size (equal to the block size). An alternative combines two or more of the filter sections into a single large section prior to the transformation to minimum-phase and truncation plus energy level correction; after truncation and energy level correction, the combined sections can be separated into original size sections. In this case the truncation may result in some filter sections being all 0s. For example, with a block size M=256, the first four filter sections (1024 coefficients representing ˜20 ms at 48 kHz sampling) are left unchanged; the next eight sections are combined into a single 2048 coefficient filter for conversion to minimum-phase and truncated to 512 coefficients; the next sixteen sections are combined into a single 4096 coefficient filter for conversion to minimum-phase and truncation to 1024 coefficients; and the last 1024 coefficients are left unchanged in order to maintain the overall filter length. The eight sections combined and truncated give two non-zero sections followed by six sections with all 0 coefficients, and the sixteen sections combined and truncated gives four non-zero sections followed by twelve sections with all 0 coefficients. Then this modified room impulse response is used as shown in FIG. 1d with the FFTs for the modified room impulse response filter sections precomputed and stored.

More explicitly, convert a filter section (or combined filter sections) to a minimum-phase filter by convolving with an allpass filter determined by the zeros of the filter transfer function which lie outside of the unit circle. Indeed, let h(n) for 0≦n<N be a filter to convert to a minimum-phase filter, h_min(n). As with the first preferred embodiments, h(n) can be considered as an infinite sequence with a finite number of nonzero coefficients or as a finite (or periodic) sequence. The infinite sequence approach gives an exact h_min(n) which may have an infinite number nonzero coefficients but is truncated anyway, and the finite sequence approach gives an approximate h_min(n). For the infinite sequence approach, initially compute the transfer function H(z); then find an allpass filter with transfer function H_allpass(z) so that H(z)=H_min(z)H_allpass(z). To determine H_allpass(z), first find the zeros of H(z) which lie outside of the unit circle, and then for each such zero (e.g., H(α)=0 for 1<|α|) include the bilinear factor (z⁻¹−α⁻¹)/(1−(α⁻¹)*z⁻¹) in H_allpass(z) (note that * indicates complex conjugate). That is, compute H_min(z)=H(z) Π (1−(α⁻¹)*z⁻¹)/(z⁻¹−α⁻¹) where the product is over the zeros of H(z) outside of the unit circle. Next, inverse z-transform to recover h_min(n). Then, truncate and correct the energy level to get h_min-trun(n). Lastly, 0 pad (each section if combined filter sections) and compute the FFT for use in the architecture of FIG. 1d.

5. Zero-padding with FFT

Fourth preferred embodiments reduce the computational complexity of the FFT after 0-padding used in the overlap-add method of filtering by multiplication in the frequency domain, and thus can be applied to the foregoing preferred embodiments. In particular, let x(n) for 0≦n<N be 0 padded to define x_pad(n) for 0≦n<2N as

$\begin{matrix} x_{pad} (n) = x (n) & for 0 \underline{<} n < N \\ = 0 & for N \underline{<} n < 2 N \end{matrix} .$

Then the 2N-point FFT of x_pad(n) is:

$\begin{matrix} X_{pad} (k) = \sum_{0 \underline{<} n < 2 N} x_{pad} (n) ⅇ^{- j2π kn / 2 N} \\ = \sum_{0 \underline{<} n < N} x (n) ⅇ^{- j 2 π kn / 2 N} \\ = \sum_{0 \underline{<} n < N} [(1 / N) \sum_{0 \underline{<} i < N} X (i) ⅇ^{j 2 π in / N}] ⅇ^{- j 2 π kn / 2 N} \end{matrix}$

where the N-point inverse FFT expression for x(n) was substituted. Now rearranging:

X_pad(k)=(1/N)Σ_0≦i<NX(i)Σ_0≦n<Ne^{−j2πn(k/2−i)/N}

Consider the case of k an even integer: k=2m. In this case:

$\begin{matrix} \sum_{0 \underline{<} n < N} ⅇ^{- j 2 π n (k / 2 - i) / N} = \sum_{0 \underline{<} n < N} ⅇ^{- j 2 π n (m - i) / N} \\ = N δ_{m - i} so \\ X_{pad} (2 m) = (1 / N) \sum_{0 \underline{<} i < N} X (i) N δ_{m - i} \\ = X (m) \end{matrix}$

Thus the even frequencies of the zero-padded spectrum can be computed as the frequencies of the non-zero-padded spectrum at one-half the frequency. That is, an N-point FFT of x(n) generates the even frequencies of the 2N-point FFT of x_pad(n).

For the odd frequencies of the zero-padded spectrum, take k=2m+1 in the foregoing:

$\begin{matrix} X_{pad} (2 m + 1) = (1 / N) \sum_{0 \underline{<} i < N} X (i) \sum_{0 \underline{<} n < N} ⅇ^{j2 π n (i - (2 m + 1) / 2) / N} \\ = (1 / N) \sum_{0 \underline{<} i < N} X (i) \sum_{0 \underline{<} n < N} ⅇ^{- j π n / N} ⅇ^{j 2 π n (i - m) / N} \\ = (1 / N) \sum_{0 \underline{<} i < N} X (i) S (m - i) \end{matrix}$

which is a circular convolution in the frequency domain where

S(k)=Σ_0≦n<Ne^−jπn/Ne^−j2πnk/N

is the N-point FFT of s(n)=e^−jπn/Nand extended by periodicity to negative k. Thus the odd frequencies of the zero-padded spectrum are computed in terms of a convolution with the N-point FFT of x(n).

For notational convenience, define Y(m)=X_pad(2m+1), then taking the N-point inverse FFT gives:

$\begin{matrix} y (n) = (1 / N) \sum_{0 \underline{<} m < N} Y (m) ⅇ^{j 2 π mn / N} \\ = (1 / N) \sum_{0 \underline{<} m < N} (1 / N) \sum_{0 \underline{<} i < N} X (i) S (m - i) ⅇ^{j 2 π mn / N} \\ = x (n) s (n) \end{matrix}$

Thus to compute the odd frequencies of X_pad(k) in the frequency domain by convolution, the fastest way is to move back to the time domain, producing the original sequence x(n) and a complex exponential (e^−jπn/N) to pre-warp the FFT to look at the odd frequencies, multiplying point-wise, and taking the FFT of the result to get back to the frequency domain and the odd frequencies. Since the original sequence is available and the s(n) can be pre-calculated, all that is required is point-wise multiplication and a forward FFT to obtain the odd frequencies directly.

In short, the fourth preferred embodiment zero-padded 2N-point FFT requires two N-point FFTs and N complex multiplies instead of one 2N-point FFT. However, half of the complex multiplies in the time domain can be combined with twiddle factors in the first stage of many FFT implementations, so only an additional N/2 complex multiplies are required. Hence, about 3N/2 operations can potentially be saved.

An alternative fourth preferred embodiment method approximates the terms in the definition of S(m) to simplify the frequency-domain convolution computation. In particular,

$\begin{matrix} S (k) = \sum_{0 \underline{<} n < N} ⅇ^{- j π n / N} ⅇ^{- j 2 π nk / N} \\ = \sum_{0 \underline{<} n < N} \cos [π n (2 k + 1) / N] - j \sin [π n (2 k + 1) / N] \end{matrix}$

Note that if n=0, then cos[πn(2k+1)/N]=1 and for all other n the cosine is anti-symmetric about N/2: cos[πn(2k+1)/N]=−cos[π(N−n)(2k+1)/N]. And if N is even, n=N/2 gives cos[πn(2k+1)/N]=0. Thus all the cosine terms except n=0 cancel in the summation. In contrast, the sine is symmetric about N/2, so only the n=0 term can be omitted. And thus separating the n=0 terms out of the sum defining S(k) gives:

Y(m)=(1/N)Σ_0≦i<NX(i)+(−j/N)Σ_1≦i<NX(i)T(m−i)

where

T(k)=Σ_1≦n<Nsin[πn(2k+1)/N]

Since T(k) is real-valued and anti-symmetric, it can be thought of as a linear-phase filter which is cyclically convolved with X(k). The first sum in Y(m) needs to be computed only once. However, the convolution needs to be calculated for each m, requiring O(N²) operations to calculate all Y(m). The preferred embodiment methods approximate the convolution with far fewer computations by windowing or other modification of the filter kernel.

Initially, consider the computational simplification in terms of operations. Presume a 2N-point FFT requires K(2Nlog₂(2N)) operations and an N-point FFT requires K(Nlog₂(N)) operations. If 2NM operations are required to compute the convolution directly (for a filter kernel with M non-zero coefficients), then to save computation requires

2NM<K(2N log₂(2N))−K(N log₂(N))−N

where the first term on the right is the direct 2N-point FFT complexity to get X_pad(k), the second term is the N-point FFT complexity to get X(k), and the last term is (1/N) Σ_0≦i<NX(i) for the non-convolution term of Y(m). This implies

2M<K log₂(N)+2K−1

For example, with N=8192 and K=4 then M<29.5 is needed to save computation.

Once the length (M) of the filter kernel has been set, the next step is to create the kernel. This is equivalent to a filter design problem. Perhaps the simplest approach is to truncate the original kernel. A graph of T(k) shows that coefficients near k=0 dominate, and coefficients near k=N/2 have small magnitudes. Hence, define a truncated version of T(k):

$\begin{matrix} T_{trun} (k) = T (k) & for M / 2 \underline{<} k < M / 2 \\ = 0 & otherwise \end{matrix}$

Alternatively, multiplication with a window function, such as a Hann window, can similarly reduce the number of nonzero filer coefficients but with a smoother transition in filter coefficient magnitude. Of course, other filter design methods could be used to approximate T(k) by a filter with a small number of nonzero filter coefficients.

6. Modifications

The preferred embodiments can be modified in various ways; for example, vary the sizes of blocks of samples, vary the size of FFTs, vary the sizes of filter partitions, truncate more or less of filter sections, use other spectrum modifications such as tapering, and so forth.

Number	Name	Date	Kind
5946400	Matsuo	Aug 1999	A
6741711	Sibbald	May 2004	B1
7024259	Sporer et al.	Apr 2006	B1
20050117762	Sakurai et al.	Jun 2005	A1
20060045294	Smyth	Mar 2006	A1

	Number	Date	Country
	60657234	Feb 2005	US
	60756045	Jan 2006	US

Room impulse response

Information

Patent Number

Date Filed

Date Issued

Inventors

Original Assignees

Examiners

Agents

CPC

US Classifications

Field of Search

US

International Classifications

Term Extension

Abstract

Description

Claims

CROSS-REFERENCE TO RELATED APPLICATIONS

US Referenced Citations (5)

Provisional Applications (2)