1. Field of the Invention
The present invention relates to signal processing techniques. More particularly, the present invention relates to methods for processing audio signals.
2. Description of the Related Art
The majority of stereo spreader designs implemented today use a so-called stereo shuffling topology that splits an incoming stereo signal into its mid (M=L+R) and side (S=L−R) components and then processes those M and S signals with complementary lowpass and highpass filters. The cutoff frequencies of these filters are generally tuned by ear. The resultant M′ and S′ signals are recombined such that 2L=M+S and 2R=M−S. Unfortunately, the end result is usually a soundfield that extends beyond the physical loudspeaker arc but is not precisely localized in space. What is desired is an improved stereo spreading method.
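As a minimal sketch of the sum/difference structure just described, the following Python fragment replaces the M and S filters with plain gains. The function name `ms_shuffle` and the sample values are illustrative assumptions, not part of the specification. With both gains at 0.5 the recombination 2L=M+S, 2R=M−S returns the input unchanged, and raising only the difference gain exaggerates the side component.

```python
def ms_shuffle(left, right, sum_gain=0.5, diff_gain=0.5):
    """Sum/difference shuffler; plain gains stand in for the M and S filters."""
    out_l, out_r = [], []
    for l, r in zip(left, right):
        m = l + r              # mid component, M = L + R
        s = l - r              # side component, S = L - R
        m2 = sum_gain * m      # processed mid, M'
        s2 = diff_gain * s     # processed side, S'
        out_l.append(m2 + s2)  # left output = M' + S'
        out_r.append(m2 - s2)  # right output = M' - S'
    return out_l, out_r


# Illustrative sample values (assumed).
left = [1.0, 0.5, -0.25]
right = [0.0, 0.5, 0.75]

bypass = ms_shuffle(left, right)               # gains of 0.5: no effect
wide = ms_shuffle(left, right, diff_gain=1.0)  # side component doubled
```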
The M-S matrix can have other novel applications to spatial audio beyond the stereo spreader.
It is often desirable to reproduce binaural material over loudspeakers. In general, the aim of a crosstalk canceller is to cancel out the contra-lateral transmission path Hc such that the signal from the left speaker is heard at the left eardrum only and the signal from the right speaker is heard at the right eardrum only.
Traditional feedback crosstalk canceller designs require that the interaural transfer function (ITF) be constrained to be less than 1.0 for all frequencies. Tuning the spectral response of a traditional recursive crosstalk canceller filter design in order to control the perceived timbre is difficult or impractical. It is desirable to provide an improved crosstalk cancellation circuit that can allow tuning of the timbre of the canceller output without seriously affecting the spatial characteristics. Further it would be desirable to avoid possible sources of instability or signal clipping.
The present invention describes techniques that can be used to provide novel methods of spatial audio rendering using adapted M-S matrix shuffler topologies. Such techniques include headphone and loudspeaker-based binaural signal simulation and rendering, stereo expansion, multichannel upmix and pseudo multichannel surround rendering.
In accordance with another invention, a novel crosstalk canceller design methodology and topology combining a minimum-phase equalization filter and a feed-forward crosstalk filter is provided. The equalization filter can be adapted to tune the timbre of the crosstalk canceller output without affecting the spatial characteristics. The overall topology avoids possible sources of instability or signal clipping.
In one embodiment, the cross-talk cancellation uses a feed-forward cross-talk matrix cascaded with a spectral equalization filter. In one variation, this equalization filter is lumped within a binaural synthesis process preceding the cross-talk matrix. The design of the equalization filter includes limiting the magnitude frequency response at low frequencies.
These and other features and advantages of the present invention are described below with reference to the drawings.
Reference will now be made in detail to preferred embodiments of the invention. Examples of the preferred embodiments are illustrated in the accompanying drawings. While the invention will be described in conjunction with these preferred embodiments, it will be understood that it is not intended to limit the invention to such preferred embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. In other instances, well known mechanisms have not been described in detail in order not to unnecessarily obscure the present invention.
It should be noted herein that throughout the various drawings like numerals refer to like parts. The various drawings illustrated and described herein are used to illustrate various features of the invention. To the extent that a particular feature is illustrated in one drawing and not another, except where otherwise indicated or where the structure inherently prohibits incorporation of the feature, it is to be understood that those features may be adapted to be included in the embodiments represented in the other figures, as if they were fully illustrated in those figures. Unless otherwise indicated, the drawings are not necessarily to scale. Any dimensions provided on the drawings are not intended to be limiting as to the scope of the invention but merely illustrative.
The M-S shuffler matrix, also known as the stereo shuffler, was first introduced in the context of a coincident-pair microphone recording to adjust its width when played over two speakers. In reference to the left and right channels of a modern stereo recording, the M component can be considered to be equivalent to the sum of the channels and the S component equivalent to the difference. A typical M-S matrix is implemented by calculating the sum and difference of a two channel input signal, applying some filtering to one or both of those sum and difference channels, and once again calculating a sum and difference of the filtered signals, as shown in
The M-S shuffler matrix has two important properties that will be used many times throughout this document: (1) The stereo shuffler has no effect at frequencies where both the sum and difference filters are simple gains of 0.5. For example, for the topology given in
The head related transfer function (HRTF) is often used as the basis for 3-D audio reproduction systems. The HRTF relates to the frequency dependent time and amplitude differences that are imposed on the wave front emanating from any sound source that are attributed to the listener's head (and body). Every source from any direction will yield two associated HRTFs. The ipsilateral HRTF, Hi, represents the path taken to the ear nearest the source and the contralateral HRTF, Hc, represents the path taken to the farthest ear. A simplified representation of the head-related signal paths for symmetrical two-source listening is depicted in
The audio signal path diagram shown in
Such a topology is often used when it is desired to simulate a typical stereo loudspeaker listening experience over headphones. In this case, the ipsilateral and contralateral HRTFs have been previously measured and are implemented as minimum phase digital filters. The time delays on the contralateral path, represented by z^(−ITD), are integer-sample delays that emulate the time difference due to the different signal path lengths between the source and the nearest and farthest ears. The traditional HRTF implementation topology of
The sum and difference HRTF filters shown in
In one embodiment, we cross fade the magnitudes of the sum and difference HRTF function's frequency response to unity at higher frequencies. This facilitates cost effective implementation and may also provide a way of minimizing undesirable high frequency timbre changes. After calculating the minimum-phase of the new magnitude response we are left with an implementation that performs the appropriate HRTF filtering at lower frequencies and transitions to an effect bypass at higher frequencies (using Property 1, described above). An example is provided in
In accordance with another embodiment, we utilize the fact that we do not need to take the complex frequency response of the sum and difference filters into consideration until final implementation. We smooth the HRTF magnitude response to a differing degree in different frequency bands without worrying about consequences to the phase response. This can be done using either critical band smoothing or by splitting the frequency response into a fixed number of bands (for example, low, mid and high) and performing a radically different degree of smoothing per band. This allows us to preserve the most important head-related spatial cues (at the lowest frequencies) and smooth away the more listener-specific HRTF characteristics, such as those dependent on pinnae shape, at mid and high frequencies. By minimum phasing the resulting magnitude responses we ensure that the spatial attributes of the binaural signals are preserved at lower frequencies with greater (although less perceptually significant) errors at higher frequencies. An example is provided in
This kind of smoothing and crossfading-to-unity significantly simplifies the sum and difference filter frequency responses. That, together with the fact that the sum and difference filters have been implemented using minimum phase functions (i.e. no need for a time delay) yields very low order IIR filter requirements for implementation. This low complexity of the sum and difference filter frequency responses, together with no requirement to directly implement an ITD makes it possible to consider analogue implementations where, before, they would have been very difficult or impossible.
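The algebra behind replacing a direct ipsilateral/contralateral HRTF pair with a sum/difference shuffler can be checked numerically. The sketch below is an assumption-laden illustration: the short FIR coefficients are made up, the interaural delay is folded into the contralateral filter as leading zeros, and the sum and difference filters are taken as 0.5·(Hi+Hc) and 0.5·(Hi−Hc). It verifies only that the two structures give the same output; it does not perform the minimum-phase or smoothing steps described above.

```python
def convolve(x, h):
    """Direct-form FIR convolution of two lists."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def add(a, b, sign=1.0):
    """Element-wise a + sign*b, zero-padding the shorter list."""
    n = max(len(a), len(b))
    a = a + [0.0] * (n - len(a))
    b = b + [0.0] * (n - len(b))
    return [u + sign * v for u, v in zip(a, b)]


def direct_form(left, right, h_i, h_c):
    """Each ear hears its own channel through h_i and the other through h_c."""
    out_l = add(convolve(left, h_i), convolve(right, h_c))
    out_r = add(convolve(right, h_i), convolve(left, h_c))
    return out_l, out_r


def shuffler_form(left, right, h_i, h_c):
    """Same result using one sum filter and one difference filter."""
    h_sum = [0.5 * v for v in add(h_i, h_c)]
    h_diff = [0.5 * v for v in add(h_i, h_c, -1.0)]
    m = convolve(add(left, right), h_sum)
    s = convolve(add(left, right, -1.0), h_diff)
    return add(m, s), add(m, s, -1.0)


# Hypothetical filters: h_c carries a 2-sample interaural delay as leading zeros.
h_i = [1.0, 0.5]
h_c = [0.0, 0.0, 0.25, 0.125]
left = [1.0, 0.0, -0.5]
right = [0.0, 2.0, 1.0]

direct = direct_form(left, right, h_i, h_c)
shuffled = shuffler_form(left, right, h_i, h_c)
```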
In accordance with yet another embodiment, a novel crossfade between the full 3D effect and an effect bypass is implemented by the M-S shuffler implementation of an HRTF pair. Such a crossfade implementation is illustrated in
In accordance with another embodiment, the ability to crossfade between full 3D effect and no 3D effect allows us to provide the listener with interesting spatial transitions when the 3D effect is enabled and disabled. These transitions can help provide the listener with cues regarding what the effect is doing. It can also minimize the instantaneous timbre changes that can occur as a result of the 3D processing, which may be deemed undesirable to some listeners. In this case, the rate of change between CGF_SUM and CGF_DIFF can differ, allowing for interesting spatial transitions not possible with a traditional DSP effect crossfade. The listener could also be presented with a manual control that could allow him/her to choose the ‘amount’ of 3D effect applied to their source material according to personal taste. The scope of this embodiment of the present invention is not limited to any type of control. That is, the invention can be implemented using any type of suitable control, for a non-limiting example, a “slider” on a graphical user interface of a portable electronic device or generated by software running on a host computer.
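One possible reading of the crossfade described above is sketched here for a single frequency. It assumes, since the figure is not reproduced in this text, that each branch is a linear blend between its filter response and the 0.5 bypass gain, with CGF_SUM and CGF_DIFF as independent blend coefficients. The function names and numeric values are hypothetical.

```python
def crossfaded_gain(filter_gain, cgf):
    """Blend a branch response with the 0.5 bypass gain.

    cgf = 1.0 gives the full effect, cgf = 0.0 gives the bypass
    (assumed linear crossfade law).
    """
    return cgf * filter_gain + (1.0 - cgf) * 0.5


def shuffler_outputs(l, r, sum_gain, diff_gain, cgf_sum, cgf_diff):
    """Single-frequency shuffler with separate crossfades on each branch."""
    m = crossfaded_gain(sum_gain, cgf_sum) * (l + r)
    s = crossfaded_gain(diff_gain, cgf_diff) * (l - r)
    return m + s, m - s


# Hypothetical branch responses at one frequency.
off = shuffler_outputs(0.3, -0.7, 0.9, 0.2, 0.0, 0.0)      # effect bypassed
full = shuffler_outputs(0.3, -0.7, 0.9, 0.2, 1.0, 1.0)     # full effect
partial = shuffler_outputs(0.3, -0.7, 0.9, 0.2, 1.0, 0.5)  # branches faded at different rates
```

Because the two coefficients are independent, the sum branch can reach the full effect while the difference branch is still in transition, which is the kind of spatial transition the paragraph above refers to.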
It is often desirable to reproduce binaural material over loudspeakers. The role of the crosstalk canceller is to post-process binaural signals so that the impact of the signal paths between the speakers and the ears are negated at the listeners' eardrums. A typical crosstalk cancellation system is shown in
In general, the joint minimum-phase property of sum and difference filters for the crosstalk canceller implies that we can apply the same techniques as used in the symmetric HRTF pair M-S matrix implementation.
That is, the filter magnitude responses can be crossfaded to unity at higher frequencies, performing accurate spatial processing at lower frequencies and ‘doing no harm’ at higher frequencies. This is particularly of interest to crosstalk cancellation, where the inversion of the speaker signal path sums and differences can yield significant high frequency gains (perceived as undesirable resonance) when the listener is not exactly at the desired listening sweetspot. It is often better to opt to do nothing to the incoming signal than do potentially harmful processing.
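The crossfade-to-unity step followed by a minimum-phase reconstruction can be sketched as below. Several choices here are assumptions rather than details from the specification: the fade is a geometric (dB-domain) blend, the frequency grid, sample rate and test magnitude are invented, and the minimum-phase spectrum is obtained with the standard real-cepstrum folding method using a naive DFT so that only the standard library is needed.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(n^2) DFT; adequate for the short example below."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def fade_to_unity(mag, freqs, f_lo, f_hi):
    """Blend magnitudes towards 1.0 (0 dB) between f_lo and f_hi."""
    out = []
    for m, f in zip(mag, freqs):
        w = min(1.0, max(0.0, (f - f_lo) / (f_hi - f_lo)))  # 0 below f_lo, 1 above f_hi
        out.append(m ** (1.0 - w))                          # geometric blend towards unity
    return out


def minimum_phase(mag):
    """Minimum-phase spectrum with the given symmetric magnitude (even length)."""
    n = len(mag)
    cep = [c.real for c in dft([math.log(m) for m in mag], inverse=True)]
    folded = [0.0] * n
    folded[0] = cep[0]
    for t in range(1, n // 2):
        folded[t] = 2.0 * cep[t]   # fold the anti-causal half onto the causal half
    folded[n // 2] = cep[n // 2]
    return [cmath.exp(v) for v in dft(folded)]


n = 32
fs = 48000.0  # hypothetical sample rate
freqs = [min(k, n - k) * fs / n for k in range(n)]
mag = [1.0 + 0.8 * math.cos(2.0 * math.pi * k / n) for k in range(n)]  # made-up response

faded = fade_to_unity(mag, freqs, 4000.0, 8000.0)
spectrum = minimum_phase(faded)
```

Above the upper fade frequency the magnitude is exactly 1.0, so by the first shuffler property the processing reduces to a bypass there when the filter is scaled to the 0.5 bypass gain.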
The filter magnitude responses can also be smoothed by differing degrees based on increasing frequency, with higher frequency bands smoothed more than lower frequency bands, yielding low implementation cost and feasibility of analog implementations.
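Frequency-dependent smoothing of this kind might look as follows. This is a sketch under stated assumptions: the response is held in dB on a uniform frequency grid, smoothing is a simple moving average, and the three band edges and window widths are invented values, not figures from the specification.

```python
def smooth_by_band(mag_db, freqs, bands):
    """Moving-average smoothing whose window widens with frequency.

    bands is a list of (upper_edge_hz, half_width_bins) in ascending order;
    a half width of 0 leaves the band untouched.
    """
    n = len(mag_db)
    out = []
    for k, f in enumerate(freqs):
        half = bands[-1][1]
        for edge, width in bands:
            if f < edge:
                half = width
                break
        lo, hi = max(0, k - half), min(n, k + half + 1)
        out.append(sum(mag_db[lo:hi]) / (hi - lo))
    return out


# Hypothetical data: a +/-3 dB bin-to-bin ripple on a 100 Hz grid.
freqs = [100.0 * k for k in range(100)]
mag_db = [3.0 if k % 2 == 0 else -3.0 for k in range(100)]

# Assumed bands: untouched below 500 Hz, light smoothing to 4 kHz, heavy above.
bands = [(500.0, 0), (4000.0, 2), (float("inf"), 8)]
smoothed = smooth_by_band(mag_db, freqs, bands)
```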
Accordingly, in one embodiment we apply a crossfading circuit around the sum and difference filters that allows the user to choose the amount of desired crosstalk cancellation and also provides an interesting way to transition between headphone-targeted processing (HRTFs only) and loudspeaker-targeted processing (HRTFs+crosstalk cancellation).
A virtual loudspeaker pair is a conceptual name given to the process of using a combination of binaural synthesis and crosstalk cancellation in cascade to generate the perception of a symmetric pair of loudspeaker signals from specific directions, typically outside of the actual loudspeaker arc. The most common application of this technique is the generation of virtual surround speakers in a 5.1 channel playback system. In this case, the surround channels of the 5.1 channel system are post-processed such that they are implemented as virtual speakers to the side of or (if all goes well) behind the listener using just two front loudspeakers.
A typical virtual surround system is shown in
This invention permits the design of virtual loudspeakers at specific locations in space and for specific loudspeaker set ups using objective methodology that can be shown to be optimal using objective means.
The described design provides several advantages including improvements in the quality of the widened images. The widened stereo sound images generated using this method are tighter and more focused (localizable) than with traditional shuffler-based designs. The new design also allows precise definition of the listening arc subtended by the new soundstage, and allows for the creation of a pair of virtual loudspeakers anywhere around the listener using a single minimum phase filter. Another advantage is providing accurate control of virtual stereo image width for a given spacing of the physical speaker pair.
This design preferably includes a single minimum phase filter. This makes analogue implementation an easy option for low cost solutions. For example, a pair of virtual loudspeakers can be placed anywhere around the listener using a single minimum phase filter.
The new design also allows preservation of the timbre of center-panned sounds in the stereo image. Since the mid (mono) component of the signal is not processed, center-panned (‘phantom center’) sources are not affected and hence their timbre and presence are preserved.
It has already been shown that both of these sections could be individually implemented in an M-S shuffler configuration. For example, in this virtual surround speaker case the HRTFs could be implemented as shown in
These two M-S shuffler matrices can be combined to generate a virtual loudspeaker pair. Using MS matrix property 2 we eliminate one of the M-S matrices by simply multiplying the HRTF and crosstalk sum and difference functions of each individual matrix and using the result for our new virtual speaker sum and difference functions. The new sum and difference EQ functions can now be defined by
Any listener specific, but direction independent, HRTF contributions would cancel out of any loudspeaker-based virtual speaker implemented in this manner, assuming that all HRTF measurements were taken in the same session. This implies that measured HRTFs would require minimal post-processing. The new virtual speaker matrix is shown in
Since VSSUM and VSDIFF are derived from the product of two minimum phase functions, they can both be implemented as minimum phase functions of their magnitude response without appreciable timbre or spatial degradation of the resulting soundfield. This, in turn, implies that they inherit some of the advantageous characteristics of the HRTF and crosstalk shuffler implementations, i.e.
In accordance with any embodiment, the filter magnitude responses are crossfaded substantially to unity at higher frequencies, performing accurate spatial processing at lower frequencies and ‘doing no harm’ at higher frequencies. This is particularly of interest to virtual speaker based products, where the inversion of the speaker signal path sums and differences can yield high gains when the listener is not exactly at the desired listening sweetspot.
In accordance with yet another embodiment, the filter magnitude responses are smoothed by differing degrees based on increasing frequency, with higher frequency bands smoothed more than lower frequency bands, yielding low implementation cost and feasibility of analog implementations.
In a further embodiment, we apply crossfading circuits around the sum and difference filters that allow the user to choose the amount of desired 3D processing and also provide an interesting way to transition between 3D processing and no processing.
The scope of the invention is not limited to a single frequency for cutting off crosstalk cancellation and an HRTF response. Thus, in one embodiment, we cross-fade to unity at a different frequency for the numerator and denominator of equation 1 and equation 2. This would allow us to avoid crosstalk cancellation above frequencies for which typical head movement distances are much greater than the wavelength of impinging higher frequency signals and still provide the listener with HRTF cues relating to the virtual source location up to a different, less constraining frequency range. This technique could also be used, for example, in a system where the same 3D audio algorithm is used for both headphone and loudspeaker reproduction. In this case, we could implement an algorithm that performs virtual loudspeaker processing up to some lower frequency (as a non-limiting example, below 500 Hz) and HRTF-based virtualization above that frequency.
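The numerator/denominator split mentioned above can be made concrete with a hedged sketch. Assuming that the HRTF shuffler uses sum and difference filters of the form $H_i + H_c$ and $H_i - H_c$ at the virtual source angle $\theta_{VS}$, and that the crosstalk canceller uses the inverses of the same quantities at the physical speaker angle $\theta_S$, the products of the two matrices would take the following form. This is a reconstruction from the surrounding description, not a quotation of equations 1 and 2.

```latex
\begin{align}
VS_{SUM}  &= \frac{H_i(\theta_{VS}) + H_c(\theta_{VS})}{H_i(\theta_S) + H_c(\theta_S)} \\
VS_{DIFF} &= \frac{H_i(\theta_{VS}) - H_c(\theta_{VS})}{H_i(\theta_S) - H_c(\theta_S)}
\end{align}
```

Under this reading, each numerator carries the HRTF (virtual source) contribution and each denominator carries the crosstalk cancellation contribution, so fading the two to unity at different frequencies limits cancellation and virtualization independently.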
The ‘virtual loudspeaker’ M-S matrix topology can be used to provide a stereo spreader or stereo widening effect, whereby the stereo soundstage is perceived beyond the physical boundaries of the loudspeakers. In this case, a pair of virtual speakers, with a wider speaker arc (e.g., ±30°) is generated using a pair of physical speakers that have a narrower arc (e.g., ±10°).
A common desirable attribute of such stereo widening systems, and one that is rarely met, is the preservation of timbre for center-panned sources, such as vocals, when the stereo widening effect is enabled. Preserving the center channel has several advantages other than the requirement of timbre preservation between effect on and effect off. This may be important for applications such as AM radio transmission or internet audio broadcasting of downmixed virtualized signals.
Typical VSSUM and VSDIFF filter frequency responses derived from HRTFs measured at 10° and 30° are shown in
An intuitive answer to this problem might be to simply remove the VSSUM filter. However, removing this filter would disturb the inter-channel level and phase at the shuffler's outputs and, consequently, the interaural level and phase at the listener's ears. In order to preserve the center channel timbre while preserving the spatial attributes of the design we utilize an additional EQ.
In accordance with another embodiment, in order to fully retain the timbre of the front-center image we select the additional EQ such that:
Such a configuration yields the most ideal M-S matrix based stereo spreader solution that does not affect the original center panned images while retaining the spatial attributes of the original design.
It transpires, as a result of this additional filtering, that stereo-panned images are now being filtered by some function between 1 and EQ=1/VSSUM, relative to the original virtual speaker implementation, depending on their panned position, with hard-panned images exhibiting the largest timbre differences. For many applications, this is an undesirable outcome.
An ideal solution needs to make a compromise between undesirably filtered center panned sources and undesirably filtered hard panned sources. The problem here is that, for timbre preservation, we want the additional sum EQ filter to be close to EQSUM=1/VSSUM while we want the additional difference EQ filter to be close to EQDIFF=1, but both additional EQs must be the same in order to preserve the interaural phase.
In accordance with yet another embodiment we perform a weighted interpolation between the two extremes and model the resulting filter. The weighting is preferably based on the requirements of the final system. For example, if the application assumes that there will be a prevalent amount of monophonic content (perhaps a speaker system for a portable DVD player), EQDIFF and EQSUM might be designed to be closer to 1/VSSUM to better preserve dialogue.
In accordance with yet another embodiment we specify the EQ filter in terms of a geometric mean function.
Using this method, the perceptual impact of center-panned timbre modification is halved (in terms of dB) compared to our original implementation. This modification implies that stereo-panned images are now being filtered by some function between 1 and EQ=1/√(VSSUM), relative to the original virtual speaker implementation, again half the perceptual impact as before.
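The halving in dB can be checked with a short calculation. In the sketch below the magnitude of VSSUM at one frequency is an invented value, and the weighted interpolation is assumed to be carried out in the log-magnitude domain, so that a weight of 0.5 yields the geometric mean of 1 and 1/VSSUM. Other interpolation laws are equally consistent with the text.

```python
import math


def db(x):
    """Magnitude ratio expressed in decibels."""
    return 20.0 * math.log10(x)


def eq_weighted(vs_sum_mag, weight):
    """EQ magnitude between 1 (weight 0) and 1/VS_SUM (weight 1), log-domain blend."""
    return vs_sum_mag ** (-weight)


vs_sum_mag = 2.0                        # hypothetical |VS_SUM| at one frequency
eq_geo = eq_weighted(vs_sum_mag, 0.5)   # geometric mean: 1/sqrt(VS_SUM)

center_db = db(vs_sum_mag * eq_geo)     # residual coloration of center-panned sources
hard_db = db(eq_geo)                    # added coloration of hard-panned sources
```

With these values the center-panned coloration drops from about 6 dB to about 3 dB, at the cost of roughly 3 dB of change for hard-panned sources relative to the unequalized design.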
In accordance with still another embodiment, we design the filters such that
at higher frequencies. Hi(θVS) and Hi(θS) represent the ipsilateral HRTFs corresponding to the virtual source position and the physical loudspeaker positions, respectively. In this case, we assume the incident sound waves from the loudspeaker to the contralateral ear are shadowed by the head at higher frequencies. This would mean that we are predominantly concerned with canceling the ipsilateral HRTF corresponding to the speaker and replacing it with the ipsilateral HRTF corresponding to the virtual sound source.
Multi-channel upmix allows the owner of a multichannel sound system to redistribute an original two channel mix between more than two playback channels. A set of N modified M-S shuffler matrices can provide a cost efficient method of generating a 2N-channel upmix, where the 2N output channels are distributed as N (left, right) pairs.
Accordingly, in one embodiment, an M-S shuffler matrix is used to generate a 2N-channel upmix.
Total energy:
Front energy = $L_F^2 + R_F^2 = g_{MF}^2 \cdot M^2 + g_{SF}^2 \cdot S^2$
Back energy = $L_B^2 + R_B^2 = g_{MB}^2 \cdot M^2 + g_{SB}^2 \cdot S^2$
Total energy = $(g_{MF}^2 + g_{MB}^2) \cdot M^2 + (g_{SF}^2 + g_{SB}^2) \cdot S^2$
Energy and balance preservation condition:
For any signal (L,R), output energy must be equal to input energy.
This means:
$(g_{MF}^2 + g_{MB}^2) \cdot M^2 + (g_{SF}^2 + g_{SB}^2) \cdot S^2 = L^2 + R^2 = M^2 + S^2$.
In order to verify this condition for any (L,R) and therefore any (M,S), we need:
$g_{MF}^2 + g_{MB}^2 = 1$ and $g_{SF}^2 + g_{SB}^2 = 1$
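The preservation condition can be verified numerically for a four-channel (front pair plus back pair) case. The sketch assumes a 1/√2 normalization on both the input and output sum/difference stages so that the identity L²+R²=M²+S² used above holds, and it picks the gains from sine/cosine pairs so that each pair of squared gains adds up to one. Function names and angle values are hypothetical.

```python
import math


def upmix_4ch(l, r, g_mf, g_mb, g_sf, g_sb):
    """2-to-4 channel M-S upmix for one sample (front pair, back pair)."""
    k = 1.0 / math.sqrt(2.0)           # assumed normalization: M^2 + S^2 = L^2 + R^2
    m, s = k * (l + r), k * (l - r)
    lf, rf = k * (g_mf * m + g_sf * s), k * (g_mf * m - g_sf * s)
    lb, rb = k * (g_mb * m + g_sb * s), k * (g_mb * m - g_sb * s)
    return lf, rf, lb, rb


def gains_from_angle(theta):
    """Front/back gain pair whose squares sum to one."""
    return math.cos(theta), math.sin(theta)


l, r = 0.3, -0.8
g_mf, g_mb = gains_from_angle(0.4)     # M mostly to the front (arbitrary angle)
g_sf, g_sb = gains_from_angle(1.1)     # S mostly to the back (arbitrary angle)

out = upmix_4ch(l, r, g_mf, g_mb, g_sf, g_sb)
energy_in = l * l + r * r
energy_out = sum(v * v for v in out)
```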
In accordance with yet another embodiment, control is provided for the front-back energy distribution of the M and/or S components. For a non-limiting example, the upmix parameters can be made available to the listener using a set of four volume and balance controls (or sliders):
Proposed volume and balance control parameters:
M Level = $10 \cdot \log_{10}(g_{MF}^2 + g_{MB}^2)$, default: 0 dB
S Level = $10 \cdot \log_{10}(g_{SF}^2 + g_{SB}^2)$, default: 0 dB
M Front-Back Fader = $g_{MB}^2 / (g_{MF}^2 + g_{MB}^2)$, range: 0-100%
S Front-Back Fader = $g_{SB}^2 / (g_{SF}^2 + g_{SB}^2)$, range: 0-100%
For M/S balance preservation, M Level = S Level.
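The control definitions above can be inverted to obtain the gains from a level and a fader setting. The following sketch shows that mapping for one component (M or S); the function names are illustrative, and the fader is expressed as a fraction in [0, 1] rather than a percentage.

```python
import math


def gains_from_controls(level_db, fader):
    """Return (front_gain, back_gain) for a level in dB and a back-energy share."""
    total = 10.0 ** (level_db / 10.0)          # g_front^2 + g_back^2
    return math.sqrt(total * (1.0 - fader)), math.sqrt(total * fader)


def controls_from_gains(g_front, g_back):
    """Return (level_db, fader) using the definitions given above."""
    total = g_front ** 2 + g_back ** 2
    return 10.0 * math.log10(total), g_back ** 2 / total


g_mf, g_mb = gains_from_controls(0.0, 0.25)    # default level, 25% of M energy to the back
level_db, fader = controls_from_gains(g_mf, g_mb)
```

At the default level of 0 dB the squared gains sum to one, which is the energy preservation condition for that component.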
In one variation, improved performance is expected from decorrelating the back channels relative to the front channels. For example, some delays and allpass filters can be inserted into some or all of the upmix channel output paths, as shown in
In accordance with yet another embodiment, the output of the upmix is virtualized using any traditional headphone or loudspeaker virtualization techniques, including those described above, as shown in the generalized 2-2N channel upmix shown in
In this figure, SUMi and DIFFi represent the sum and difference filter specifications of the i-th symmetrical virtual headphone or loudspeaker pair.
In another embodiment and according to the second property of M-S matrices, described at the start of the specification, the upmix gains and the virtualization filters are combined. A generalized implementation of such a combined upmix and virtualizer implementation is shown in
One approach to obtain a compelling surround effect includes setting the S fader towards the back and the M fader towards the front. If we preserve the balance, this would cause gSB>gMB and gMF>gSF. The width of the frontal image would therefore be reduced. In one embodiment, this is corrected by widening the front virtual speaker angle.
The M-S shuffler based upmix structure can be used as a method of applying early reflections to a virtual loudspeaker rendering over headphones. In this case, the delay and allpass filter parameters are adjusted such that their combined impulse response resembles a typical room response. The M and S gains within the early reflection path are also tuned to allow the appropriate balance of mid versus side components used as inputs to the room reflection simulator. These reflections can be virtualized, with the delay and allpass filters having a dual role of front/back decorrelator and/or early reflection generator or they can be added as a separate path directly into the output mix, as shown in an example implementation in
Although the upmix has been described as a 2-N channel upmix, the description as such has been for illustrative purposes and not intended to be limiting. That is, the scope of the invention includes at least any M-N channel upmix (M<N).
As described earlier, any stereo signal can be apportioned into two mono components; a sum and a difference signal. A monophonic input (i.e. one that has the same content on the left and right channels) is 100% sum and 0% difference. By deriving a synthetic difference signal component from the original monophonic input and mixing back, as we do in any regular M-S shuffler, we can generate a sense of space equivalent to an original stereo recording. This concept is illustrated on
Of course, if the input was purely monophonic, the output of the first ‘difference’ operation would be zero and this difference operation would be unnecessary in practice. For maximum effect, the processing involved in generating the simulated difference signal should be such that it generates an output that is temporally decorrelated with respect to the original signal. This could be in separate embodiments an allpass filter or a monophonic reverb, for example. In its simplest form, this operation could be a basic N-sample delay, yielding an output that is equivalent to a traditional pseudo stereo algorithm using the complementary comb method first proposed by Lauridsen.
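The simplest case mentioned above, a synthetic difference signal produced by an N-sample delay, can be sketched as follows. The function name, the optional difference gain and the test signal are assumptions made for illustration. The mono input serves as the sum component and its delayed copy as the difference component, so the two outputs have complementary comb-filter responses.

```python
def pseudo_stereo(mono, delay_samples, diff_gain=1.0):
    """Pseudo stereo from mono: delayed copy used as a synthetic difference signal."""
    padded = list(mono) + [0.0] * delay_samples    # sum (M) component
    delayed = [0.0] * delay_samples + list(mono)   # synthetic difference (S) component
    left = [m + diff_gain * d for m, d in zip(padded, delayed)]
    right = [m - diff_gain * d for m, d in zip(padded, delayed)]
    return left, right


# A unit impulse exposes the two impulse responses directly.
left, right = pseudo_stereo([1.0, 0.0, 0.0], delay_samples=2)
```

Summing the two outputs cancels the delayed component and returns twice the original mono signal, so a later mono downmix is unaffected by the effect.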
In accordance with another embodiment, this implementation is expanded to a 1-N (N>2) channel ‘pseudo surround’ output by simulating additional difference channel components and applying them to additional channels.
The monophonic components of the additional channels could also be decorrelated relative to one another and the input if so desired, in one embodiment. A generalized 1-2N pseudo surround implementation in accordance with one embodiment is shown in
In one embodiment control of the front-back energy distribution of the M and/or S components is provided.
Proposed volume and balance control parameters:
M Level = $10 \cdot \log_{10}(g_{MF}^2 + g_{MB}^2)$, default: 0 dB
S Level = $10 \cdot \log_{10}(g_{SF}^2 + g_{SB}^2)$, default: 0 dB
M Front-Back Fader = $g_{MB}^2 / (g_{MF}^2 + g_{MB}^2)$, range: 0-100%
S Front-Back Fader = $g_{SB}^2 / (g_{SF}^2 + g_{SB}^2)$, range: 0-100%
For M/S balance preservation, M Level = S Level.
While the main purpose of this kind of algorithm is to create a pseudo surround signal from a monophonic 2-channel (LIN+RIN) or single channel (LIN only) input, it also works well when applied to a stereo input source.
In accordance with other embodiments, any of the above pseudo-stereo implementations are further enhanced by applying any headphone or speaker 3D audio virtualization technologies, including those described above, to the outputs of the pseudo stereo/surround algorithm. This concept is generalized in
Cross-talk Canceller with Independent Control of Spatial and Spectral Attributes
Assuming symmetric listening and a symmetrical listener, the ipsilateral and contralateral HRTFs between the loudspeaker and the listener's eardrums are illustrated in
Since this filter affects both channels equally and since the human auditory system is sensitive to phase differences only, the EQCTC filter is implemented minimum phase in accordance with the present invention.
A typical EQCTC curve is shown in
In fact, EQCTC can now be used to equalize the virtual sources reproduced by our crosstalk canceller without affecting the spatial attributes of the virtual source positions. This is useful in optimizing the crosstalk canceller design for particular directions (for example, left surround and right surround in a virtual 5.1 implementation).
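The separation of spatial and spectral control can be illustrated at a single frequency. The sketch assumes one standard factorization of the inverse of the symmetric speaker-to-ear matrix: a feed-forward cross gain of −ITF, with ITF = Hc/Hi, followed by a common equalizer 1/(Hi·(1−ITF²)). The exact EQCTC expression of the specification is not reproduced in this text, so both the formula and the complex HRTF values below are assumptions.

```python
def crosstalk_canceller(h_i, h_c):
    """Return (eq, ff) for one frequency bin under the assumed factorization."""
    itf = h_c / h_i                          # interaural transfer function
    ff = -itf                                # feed-forward cross gain
    eq = 1.0 / (h_i * (1.0 - itf * itf))     # common equalizer on both channels
    return eq, ff


def render(bl, br, h_i, h_c, eq_scale=1.0):
    """Binaural inputs -> speaker feeds -> signals at the two ears."""
    eq, ff = crosstalk_canceller(h_i, h_c)
    eq *= eq_scale                           # hypothetical timbre adjustment
    sl, sr = eq * (bl + ff * br), eq * (br + ff * bl)    # speaker feeds
    el, er = h_i * sl + h_c * sr, h_i * sr + h_c * sl    # acoustic paths to the ears
    return el, er


h_i, h_c = 1.0 + 0.2j, 0.4 - 0.3j            # made-up HRTF values at one frequency
exact = render(1.0, 0.0, h_i, h_c)           # left-only binaural input
retuned = render(1.0, 0.0, h_i, h_c, eq_scale=0.5)
```

Scaling the common equalizer changes the level reaching the intended ear but leaves the opposite ear at zero, which is why the equalizer can be retuned for timbre without disturbing the cancellation itself.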
Although the foregoing invention has been described in some detail for purposes of clarity of understanding, it will be apparent that certain changes and modifications may be practiced within the scope of the appended claims. Accordingly, the present embodiments are to be considered as illustrative and not restrictive, and the invention is not to be limited to the details given herein, but may be modified within the scope and equivalents of the appended claims.
This application is a continuation of and claims priority to U.S. patent application Ser. No. 11/835,403, filed Aug. 7, 2007, which claims priority from provisional U.S. Patent Application Ser. No. 60/821,702, filed Aug. 7, 2006, titled “STEREO SPREADER AND CROSSTALK CANCELLER WITH INDEPENDENT CONTROL OF SPATIAL AND SPECTRAL ATTRIBUTES”, the disclosures of which are incorporated herein by reference in their entirety.
Number | Date | Country
---|---|---
60821702 | Aug 2006 | US

Relation | Number | Date | Country
---|---|---|---
Parent | 11835403 | Aug 2007 | US
Child | 14144546 | | US