The invention relates to methods (sometimes referred to as headphone virtualization methods) and systems for generating a binaural signal in response to a multi-channel audio input signal, by applying a binaural room impulse response (BRIR) to each channel of a set of channels (e.g., to all channels) of the input signal. In some embodiments, at least one feedback delay network (FDN) applies a late reverberation portion of a downmix BRIR to a downmix of the channels.
Headphone virtualization (or binaural rendering) is a technology that aims to deliver a surround sound experience or immersive sound field using standard stereo headphones.
Early headphone virtualizers applied a head-related transfer function (HRTF) to convey spatial information in binaural rendering. A HRTF is a set of direction- and distance-dependent filter pairs that characterize how sound transmits from a specific point in space (sound source location) to both ears of a listener in an anechoic environment. Essential spatial cues such as the interaural time difference (ITD), interaural level difference (ILD), head shadowing effect, spectral peaks and notches due to shoulder and pinna reflections, can be perceived in the rendered HRTF-filtered binaural content. Due to the constraint of human head size, the HRTFs do not provide sufficient or robust cues regarding source distance beyond roughly one meter. As a result, virtualizers based solely on a HRTF usually do not achieve good externalization or perceived distance.
Most of the acoustic events in our daily life happen in reverberant environments where, in addition to the direct path (from source to ear) modeled by HRTF, audio signals also reach a listener's ears through various reflection paths. Reflections have a profound impact on auditory perception, conveying cues such as distance, room size, and other attributes of the space. To convey this information in binaural rendering, a virtualizer needs to apply the room reverberation in addition to the cues in the direct path HRTF. A binaural room impulse response (BRIR) characterizes the transformation of audio signals from a specific point in space to the listener's ears in a specific acoustic environment. In theory, BRIRs include all acoustic cues regarding spatial perception.
The multi-channel audio input signal may also include a low frequency effects (LFE) or subwoofer channel, identified in
In some conventional virtualizers, the input signal undergoes time domain-to-frequency domain transformation into the QMF (quadrature mirror filter) domain, to generate channels of QMF domain frequency components. These frequency components undergo filtering (e.g., in QMF-domain implementations of subsystems 2, . . . , 4 of
In general, each full frequency range channel of a multi-channel audio signal input to a headphone virtualizer is assumed to be indicative of audio content emitted from a sound source at a known location relative to the listener's ears. The headphone virtualizer is configured to apply a binaural room impulse response (BRIR) to each such channel of the input signal. Each BRIR can be decomposed into two portions: direct response and reflections. The direct response is the HRTF which corresponds to direction of arrival (DOA) of the sound source, adjusted with proper gain and delay due to distance (between sound source and listener), and optionally augmented with parallax effects for small distances.
The remaining portion of the BRIR models the reflections. Early reflections are usually primary or secondary reflections and have relatively sparse temporal distribution. The micro structure (e.g., ITD and ILD) of each primary or secondary reflection is important. For later reflections (sound reflected from more than two surfaces before being incident at the listener), the echo density increases with increasing number of reflections, and the micro attributes of individual reflections become hard to observe. For increasingly later reflections, the macro structure (e.g., the reverberation decay rate, interaural coherence, and spectral distribution of the overall reverberation) becomes more important. Because of this, the reflections can be further segmented into two parts: early reflections and late reverberations.
The delay of the direct response is the source distance from the listener divided by the speed of sound, and its level is (in absence of walls or large surfaces close to the source location) inversely proportional to the source distance. On the other hand, the delay and level of the late reverberations are generally insensitive to the source location. Due to practical considerations, virtualizers may choose to time-align the direct responses from sources with different distances, and/or compress their dynamic range. However, the temporal and level relationship among the direct response, early reflections, and late reverberation within a BRIR should be maintained.
The effective length of a typical BRIR extends to hundreds of milliseconds or longer in most acoustic environments. Direct application of BRIRs requires convolution with a filter of thousands of taps, which is computationally expensive. In addition, without parameterization, it would require a large memory space to store BRIRs for different source positions in order to achieve sufficient spatial resolution. Last but not least, sound source locations may change over time, and/or the position and orientation of the listener may vary over time. Accurate simulation of such movement requires time-varying BRIRs. Proper interpolation and application of such time-varying filters can be challenging if the impulse responses of these filters have many taps.
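The cost argument above can be made concrete with a short calculation. The following is a minimal sketch only: the 300 ms BRIR length, the 48 kHz sample rate, and the five-channel input are illustrative assumptions, not values taken from this disclosure, and the function name is hypothetical.

```python
def brir_convolution_cost(brir_ms, sample_rate_hz, num_channels, ears=2):
    """Return (taps, multiply-accumulates per second) for direct
    time-domain convolution of every input channel with a full BRIR."""
    # Number of filter taps needed to cover the BRIR duration.
    taps = brir_ms * sample_rate_hz // 1000
    # Each output sample, for each ear and each input channel,
    # costs one multiply-accumulate per tap.
    macs_per_second = taps * sample_rate_hz * num_channels * ears
    return taps, macs_per_second


# Assumed example: 300 ms BRIRs, 48 kHz audio, five full-range channels.
taps, macs = brir_convolution_cost(300, 48_000, 5)
print(f"{taps} taps per filter, {macs:,} MACs per second")
```

Fast convolution methods reduce this figure considerably, but the memory and interpolation concerns noted above remain.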
A filter having the well-known filter structure known as a feedback delay network (FDN) can be used to implement a spatial reverberator which is configured to apply simulated reverberation to one or more channels of a multi-channel audio input signal. The structure of an FDN is simple. It comprises several reverb tanks (e.g., the reverb tank comprising gain element g1 and delay line z−n1, in the FDN of
For example, the commercially available Dolby Mobile headphone virtualizer includes a reverberator having an FDN-based structure which is operable to apply reverb to each channel of a five-channel audio signal (having left-front, right-front, center, left-surround, and right-surround channels) and to filter each reverbed channel using a different filter pair of a set of five head related transfer function (“HRTF”) filter pairs. The Dolby Mobile headphone virtualizer is also operable in response to a two-channel audio input signal, to generate a two-channel “reverbed” binaural audio output (a two-channel virtual surround sound output to which reverb has been applied). When the reverbed binaural output is rendered and reproduced by a pair of headphones, it is perceived at the listener's eardrums as HRTF-filtered, reverbed sound from five loudspeakers at left front, right front, center, left rear (surround), and right rear (surround) positions. The virtualizer upmixes a downmixed two-channel audio input (without using any spatial cue parameter received with the audio input) to generate five upmixed audio channels, applies reverb to the upmixed channels, and downmixes the five reverbed channel signals to generate the two-channel reverbed output of the virtualizer. The reverb for each upmixed channel is filtered in a different pair of HRTF filters.
In a virtualizer, an FDN can be configured to achieve certain reverberation decay time and echo density. However, the FDN lacks the flexibility to simulate the micro structure of the early reflections. Further, in conventional virtualizers the tuning and configuration of FDNs has mostly been heuristic.
Headphone virtualizers which do not simulate all reflection paths (early and late) cannot achieve effective externalization. The inventors have recognized that virtualizers which employ FDNs that try to simulate all reflection paths (early and late) usually have no more than limited success in simulating both early reflections and late reverberation and applying both to an audio signal. The inventors have also recognized that virtualizers which employ FDNs but do not have the capability to properly control spatial acoustic attributes such as reverb decay time, interaural coherence, and direct-to-late ratio, might achieve a degree of externalization but at the price of introducing excess timbral distortion and reverberation.
In a first class of embodiments, the invention is a method for generating a binaural signal in response to a set of channels (e.g., each of the channels, or each of the full frequency range channels) of a multi-channel audio input signal, including steps of: (a) applying a binaural room impulse response (BRIR) to each channel of the set (e.g., by convolving each channel of the set with a BRIR corresponding to said channel), thereby generating filtered signals, including by using at least one feedback delay network (FDN) to apply a common late reverberation to a downmix (e.g., a monophonic downmix) of the channels of the set; and (b) combining the filtered signals to generate the binaural signal. Typically, a bank of FDNs is used to apply the common late reverberation to the downmix (e.g., with each FDN applying common late reverberation to a different frequency band). Typically, step (a) includes a step of applying to each channel of the set a “direct response and early reflection” portion of a single-channel BRIR for the channel, and the common late reverberation has been generated to emulate collective macro attributes of late reverberation portions of at least some (e.g., all) of the single-channel BRIRs.
A method for generating a binaural signal in response to a multi-channel audio input signal (or in response to a set of channels of such a signal) is sometimes referred to herein as a “headphone virtualization” method, and a system configured to perform such a method is sometimes referred to herein as a “headphone virtualizer” (or “headphone virtualization system” or “binaural virtualizer”).
In typical embodiments in the first class, each of the FDNs is implemented in a filterbank domain (e.g., the hybrid complex quadrature mirror filter (HCQMF) domain or the quadrature mirror filter (QMF) domain, or another transform or subband domain which may include decimation), and in some such embodiments, frequency-dependent spatial acoustic attributes of the binaural signal are controlled by controlling the configuration of each FDN employed to apply late reverberation. Typically, a monophonic downmix of the channels is used as the input to the FDNs for efficient binaural rendering of audio content of the multi-channel signal. Typical embodiments in the first class include a step of adjusting FDN coefficients corresponding to frequency-dependent attributes (e.g., reverb decay time, interaural coherence, modal density, and direct-to-late ratio), for example, by asserting control values to the feedback delay network to set at least one of input gain, reverb tank gains, reverb tank delays, or output matrix parameters for each FDN. This enables better matching of acoustic environments and more natural sounding outputs.
In a second class of embodiments, the invention is a method for generating a binaural signal in response to a multi-channel audio input signal having channels, by applying a binaural room impulse response (BRIR) to each channel of a set of the channels of the input signal (e.g., each of the input signal's channels or each full frequency range channel of the input signal), including by: processing each channel of the set in a first processing path configured to model, and apply to said each channel, a direct response and early reflection portion of a single-channel BRIR for the channel; and processing a downmix (e.g., a monophonic (mono) downmix) of the channels of the set in a second processing path (in parallel with the first processing path) configured to model, and apply a common late reverberation to the downmix. Typically, the common late reverberation has been generated to emulate collective macro attributes of late reverberation portions of at least some (e.g., all) of the single-channel BRIRs. Typically, the second processing path includes at least one FDN (e.g., one FDN for each of multiple frequency bands). Typically, a mono downmix is used as the input to all reverb tanks of each FDN implemented by the second processing path. Typically, mechanisms are provided for systematic control of macro attributes of each FDN in order to better simulate acoustic environments and produce more natural sounding binaural virtualization. Since most such macro attributes are frequency dependent, each FDN is typically implemented in the hybrid complex quadrature mirror filter (HCQMF) domain, the frequency domain, or another filterbank domain, and a different or independent FDN is used for each frequency band. A primary benefit of implementing the FDNs in a filterbank domain is to allow application of reverb with frequency-dependent reverberation properties.
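The two parallel processing paths described above can be illustrated with a minimal time-domain sketch. It rests on several assumptions that are not part of the disclosure: the late reverberation is represented by a short FIR pair standing in for the FDN bank, the downmix is a plain sum of the channels, and the names `virtualize`, `convolve`, and `add_into` are hypothetical helpers.

```python
def convolve(x, h):
    """Plain time-domain convolution of two sample lists."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def add_into(acc, sig):
    """Accumulate sig into acc in place, growing acc if needed."""
    if len(sig) > len(acc):
        acc.extend([0.0] * (len(sig) - len(acc)))
    for n, v in enumerate(sig):
        acc[n] += v


def virtualize(channels, early_brirs, late_reverb):
    """channels: list of equal-length sample lists.
    early_brirs: per-channel (left, right) direct-and-early filters.
    late_reverb: one (left, right) filter pair shared by all channels
    (an assumed FIR stand-in for the FDN bank)."""
    left, right = [], []
    # First path: per-channel direct response and early reflections.
    for x, (h_left, h_right) in zip(channels, early_brirs):
        add_into(left, convolve(x, h_left))
        add_into(right, convolve(x, h_right))
    # Second path: common late reverberation on a mono downmix.
    downmix = [sum(samples) for samples in zip(*channels)]
    add_into(left, convolve(downmix, late_reverb[0]))
    add_into(right, convolve(downmix, late_reverb[1]))
    return left, right


# Toy two-channel input with one-tap early filters and a delayed late tail.
left, right = virtualize(
    [[1.0, 0.0], [0.0, 1.0]],
    [([1.0], [0.5]), ([0.5], [1.0])],
    ([0.0, 0.0, 0.1], [0.0, 0.0, 0.1]),
)
print(left, right)
```

The point of the structure is that the expensive long tail is computed once for the downmix rather than once per input channel.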
In various embodiments, the FDNs are implemented in any of a wide variety of filterbank domains, using any of a variety of filterbanks, including, but not limited to real or complex-valued quadrature mirror filters (QMF), finite-impulse response filters (FIR filters), infinite-impulse response filters (IIR filters), discrete Fourier transforms (DFTs), (modified) cosine or sine transforms, Wavelet transforms, or cross-over filters. In a preferred implementation, the employed filterbank or transform includes decimation (e.g., a decrease of the sampling rate of the frequency-domain signal representation) to reduce the computational complexity of the FDN process.
Some embodiments in the first class (and the second class) implement one or more of the following features:
1. A filterbank domain (e.g., hybrid complex quadrature mirror filter-domain) FDN implementation, or hybrid filterbank domain FDN implementation and time domain late reverberation filter implementation, which typically allows independent adjustment of parameters and/or settings of the FDN for each frequency band (which enables simple and flexible control of frequency-dependent acoustic attributes), for example, by providing the ability to vary reverb tank delays in different bands so as to change the modal density as a function of frequency;
2. The specific downmixing process, employed to generate (from the multi-channel input audio signal) the downmixed (e.g., monophonic downmixed) signal processed in the second processing path, depends on the source distance of each channel and the handling of direct response in order to maintain proper level and timing relationship between the direct and late responses;
3. An all-pass filter (APF) is applied in the second processing path (e.g., at the input or output of a bank of FDNs) to introduce phase diversity and increased echo density without changing the spectrum and/or timbre of the resulting reverberation;
4. Fractional delays are implemented in the feedback path of each FDN in a complex-valued, multi-rate structure to overcome issues related to delays quantized to the downsample-factor grid;
5. In the FDNs, the reverb tank outputs are linearly mixed directly into the binaural channels, using output mixing coefficients which are set based on the desired interaural coherence in each frequency band. Optionally, the mapping of reverb tanks to the binaural output channels is alternating across frequency bands to achieve balanced delay between the binaural channels. Also optionally, normalizing factors are applied to the reverb tank outputs to equalize their levels while conserving fractional delay and overall power;
6. Frequency-dependent reverb decay time and/or modal density is controlled by setting proper combinations of reverb tank delays and gains in each frequency band to simulate real rooms;
7. one scaling factor is applied per frequency band (e.g., at either the input or output of the relevant processing path), to:
8. Simple parametric models are implemented for controlling essential frequency-dependent attributes of the late reverberation, such as reverb decay time, interaural coherence, and/or direct-to-late ratio.
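Feature 5 above sets output mixing coefficients from a desired interaural coherence. One way to see how a single mixing angle can do this is the sketch below. It assumes two uncorrelated, equal-power reverb signals and a cosine/sine mixing rule, for which the resulting coherence equals sin(2θ); this rule and the function names are illustrative assumptions, not the mixing matrix of any particular embodiment.

```python
import math
import random


def mixing_angle(target_coherence):
    """Angle theta such that sin(2*theta) equals the target coherence."""
    return 0.5 * math.asin(target_coherence)


def mix_to_binaural(a, b, target_coherence):
    """Mix two uncorrelated, equal-power signals into a left/right pair."""
    theta = mixing_angle(target_coherence)
    c, s = math.cos(theta), math.sin(theta)
    left = [c * x + s * y for x, y in zip(a, b)]
    right = [s * x + c * y for x, y in zip(a, b)]
    return left, right


def coherence(left, right):
    """Normalized zero-lag cross-correlation of two signals."""
    num = sum(l * r for l, r in zip(left, right))
    den = math.sqrt(sum(l * l for l in left) * sum(r * r for r in right))
    return num / den


# Assumed demo: two independent noise signals stand in for reverb tank outputs.
random.seed(0)
tank_a = [random.gauss(0.0, 1.0) for _ in range(20000)]
tank_b = [random.gauss(0.0, 1.0) for _ in range(20000)]
left, right = mix_to_binaural(tank_a, tank_b, 0.6)
print(round(coherence(left, right), 2))
```

Because cos²θ + sin²θ = 1, this mixing leaves the power of each output channel equal to that of the inputs, so coherence can be tuned per band without changing level.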
Aspects of the invention include methods and systems which perform (or are configured to perform, or support the performance of) binaural virtualization of audio signals (e.g., audio signals whose audio content consists of speaker channels, and/or object-based audio signals).
In another class of embodiments, the invention is a method and system for generating a binaural signal in response to a set of channels of a multi-channel audio input signal, including by applying a binaural room impulse response (BRIR) to each channel of the set, thereby generating filtered signals, including by using a single feedback delay network (FDN) to apply a common late reverberation to a downmix of the channels of the set; and combining the filtered signals to generate the binaural signal. The FDN is implemented in the time domain. In some such embodiments, the time-domain FDN includes:
The input filter may be implemented to generate (preferably as a cascade of two filters configured to generate) the first filtered downmix such that each BRIR has a direct-to-late ratio (DLR) which matches, at least substantially, a target DLR.
Each reverb tank may be configured to generate a delayed signal, and may include a reverb filter (e.g., implemented as a shelf filter or a cascade of shelf filters) coupled and configured to apply a gain to a signal propagating in said each of the reverb tanks, to cause the delayed signal to have a gain which matches, at least substantially, a target decayed gain for said delayed signal, in an effort to achieve a target reverb decay time characteristic (e.g., a T60 characteristic) of each BRIR.
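The relationship between a tank's delay, its gain, and a target T60 can be sketched with the standard broadband relation in which a signal recirculating through a delay of n samples must lose 60 dB over T60 seconds. This is a simplified, frequency-independent illustration; the shelf-filter reverb filter described above would make the gain frequency dependent, and the function name and example values are assumptions.

```python
def tank_gain(delay_samples, t60_seconds, rate_hz):
    """Gain applied once per pass through a delay of delay_samples so that
    the recirculating signal decays by 60 dB (amplitude factor 1e-3) after
    t60_seconds. rate_hz is the sample rate of the domain the tank runs in."""
    return 10.0 ** (-3.0 * delay_samples / (t60_seconds * rate_hz))


# Assumed example: 10 ms delay line at 48 kHz, target T60 of 0.5 s.
g = tank_gain(480, 0.5, 48_000)
passes_in_t60 = 0.5 * 48_000 / 480
print(g, g ** passes_in_t60)
```

Longer delay lines need smaller gains to reach the same decay time, which is why delays and gains are set in combination.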
In some embodiments, the first unmixed binaural channel leads the second unmixed binaural channel, the reverb tanks include a first reverb tank configured to generate a first delayed signal having a shortest delay and a second reverb tank configured to generate a second delayed signal having a second-shortest delay, wherein the first reverb tank is configured to apply a first gain to the first delayed signal, the second reverb tank is configured to apply a second gain to the second delayed signal, the second gain is different than the first gain, and application of the first gain and the second gain results in attenuation of the first unmixed binaural channel relative to the second unmixed binaural channel. Typically, the first mixed binaural channel and the second mixed binaural channel are indicative of a re-centered stereo image. In some embodiments, the IACC filtering and mixing stage is configured to generate the first mixed binaural channel and the second mixed binaural channel such that said first mixed binaural channel and said second mixed binaural channel have an IACC characteristic which at least substantially matches a target IACC characteristic.
Typical embodiments of the invention provide a simple and unified framework for supporting both input audio consisting of speaker channels, and object-based input audio. In embodiments in which BRIRs are applied to input signal channels which are object channels, the “direct response and early reflection” processing performed on each object channel assumes a source direction indicated by metadata provided with the audio content of the object channel. In embodiments in which BRIRs are applied to input signal channels which are speaker channels, the “direct response and early reflection” processing performed on each speaker channel assumes a source direction which corresponds to the speaker channel (i.e., the direction of a direct path from an assumed position of a corresponding speaker to the assumed listener position). Regardless of whether the input channels are object or speaker channels, the “late reverberation” processing is performed on a downmix (e.g., a monophonic downmix) of the input channels and does not assume any specific source direction for the audio content of the downmix.
Other aspects of the invention are a headphone virtualizer configured (e.g., programmed) to perform any embodiment of the inventive method, a system (e.g., a stereo, multi-channel, or other decoder) including such a virtualizer, and a computer readable medium (e.g., a disc) which stores code for implementing any embodiment of the inventive method.
Throughout this disclosure, including in the claims, the expression performing an operation “on” a signal or data (e.g., filtering, scaling, transforming, or applying gain to, the signal or data) is used in a broad sense to denote performing the operation directly on the signal or data, or on a processed version of the signal or data (e.g., on a version of the signal that has undergone preliminary filtering or pre-processing prior to performance of the operation thereon).
Throughout this disclosure including in the claims, the expression “system” is used in a broad sense to denote a device, system, or subsystem. For example, a subsystem that implements a virtualizer may be referred to as a virtualizer system, and a system including such a subsystem (e.g., a system that generates X output signals in response to multiple inputs, in which the subsystem generates M of the inputs and the other X-M inputs are received from an external source) may also be referred to as a virtualizer system (or virtualizer).
Throughout this disclosure including in the claims, the term “processor” is used in a broad sense to denote a system or device programmable or otherwise configurable (e.g., with software or firmware) to perform operations on data (e.g., audio, or video or other image data). Examples of processors include a field-programmable gate array (or other configurable integrated circuit or chip set), a digital signal processor programmed and/or otherwise configured to perform pipelined processing on audio or other sound data, a programmable general purpose processor or computer, and a programmable microprocessor chip or chip set.
Throughout this disclosure including in the claims, the expression “analysis filterbank” is used in a broad sense to denote a system (e.g., a subsystem) configured to apply a transform (e.g., a time domain-to-frequency domain transform) on a time-domain signal to generate values (e.g., frequency components) indicative of content of the time-domain signal, in each of a set of frequency bands. Throughout this disclosure including in the claims, the expression “filterbank domain” is used in a broad sense to denote the domain of the frequency components generated by a transform or an analysis filterbank (e.g., the domain in which such frequency components are processed). Examples of filterbank domains include (but are not limited to) the frequency domain, the quadrature mirror filter (QMF) domain, and the hybrid complex quadrature mirror filter (HCQMF) domain. Examples of the transform which may be applied by an analysis filterbank include (but are not limited to) a discrete-cosine transform (DCT), modified discrete cosine transform (MDCT), discrete Fourier transform (DFT), and a wavelet transform. Examples of analysis filterbanks include (but are not limited to) quadrature mirror filters (QMF), finite-impulse response filters (FIR filters), infinite-impulse response filters (IIR filters), cross-over filters, and filters having other suitable multi-rate structures.
Throughout this disclosure including in the claims, the term “metadata” refers to separate and different data from corresponding audio data (audio content of a bitstream which also includes metadata). Metadata is associated with audio data, and indicates at least one feature or characteristic of the audio data (e.g., what type(s) of processing have already been performed, or should be performed, on the audio data, or the trajectory of an object indicated by the audio data). The association of the metadata with the audio data is time-synchronous. Thus, present (most recently received or updated) metadata may indicate that the corresponding audio data contemporaneously has an indicated feature and/or comprises the results of an indicated type of audio data processing.
Throughout this disclosure including in the claims, the term “couples” or “coupled” is used to mean either a direct or indirect connection. Thus, if a first device couples to a second device, that connection may be through a direct connection, or through an indirect connection via other devices and connections.
Throughout this disclosure including in the claims, the following expressions have the following definitions:
The notation that a multi-channel audio signal is an “x.y” or “x.y.z” channel signal herein denotes that the signal has “x” full frequency speaker channels (corresponding to speakers nominally positioned in the horizontal plane of the assumed listener's ears), “y” LFE (or subwoofer) channels, and optionally also “z” full frequency overhead speaker channels (corresponding to speakers positioned above the assumed listener's head, e.g., at or near a room's ceiling).
The expression “IACC” herein denotes interaural cross-correlation coefficient in its usual sense, which is a measure of the difference between audio signal arrival times at a listener's ears, typically indicated by a number in a range from a first value indicating that the arriving signals are equal in magnitude and exactly out of phase, to an intermediate value indicating that the arriving signals have no similarity, to a maximum value indicating identical arriving signals having the same amplitude and phase.
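The range of values described above (negative for out-of-phase signals, near zero for dissimilar signals, maximum for identical signals) can be illustrated with a normalized cross-correlation searched over a small range of lags. This is a simplified sketch: the sign-preserving peak search, the lag range, and the function name `iacc` are assumptions made for illustration rather than a definition taken from this disclosure.

```python
import math


def iacc(left, right, max_lag):
    """Normalized cross-correlation of two equal-length signals, evaluated
    for lags in [-max_lag, max_lag]; returns the value of largest magnitude,
    keeping its sign."""
    energy = math.sqrt(sum(l * l for l in left) * sum(r * r for r in right))
    if energy == 0.0:
        return 0.0
    n = len(left)
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        acc = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                acc += left[i] * right[j]
        value = acc / energy
        if abs(value) > abs(best):
            best = value
    return best


signal = [0.0, 1.0, 0.0, -1.0, 0.5, 0.0, 0.0, 0.0]
inverted = [-s for s in signal]
print(iacc(signal, signal, 2), iacc(signal, inverted, 2))
```

At a 48 kHz sample rate, a lag range of about 48 samples would correspond to roughly one millisecond, which is on the order of the largest interaural time difference.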
Many embodiments of the present invention are technologically possible. It will be apparent to those of ordinary skill in the art from the present disclosure how to implement them. Embodiments of the inventive system and method will be described with reference to
System 20 may be a decoder which is coupled to receive an encoded audio program, and which includes a subsystem (not shown in
Subsystem 12 (with subsystem 15) is configured to convolve channel X1 with BRIR1 (the BRIR for the corresponding source direction and distance), subsystem 14 (with subsystem 15) is configured to convolve channel XN with BRIRN (the BRIR for the corresponding source direction), and so on for each of the N−2 other BRIR subsystems. The output of each of subsystems 12, . . . , 14, and 15 is a time-domain signal including a left channel and a right channel. Addition elements 16 and 18 are coupled to the outputs of elements 12, . . . , 14, and 15. Addition element 16 is configured to combine (mix) the left channel outputs of the BRIR subsystems, and addition element 18 is configured to combine (mix) the right channel outputs of the BRIR subsystems. The output of element 16 is the left channel, L, of the binaural audio signal output from the virtualizer of
Important features of typical embodiments of the invention are apparent from comparison of the
More specifically, subsystem 12 of
Addition element 16 can be implemented to simply sum corresponding Left binaural channel samples (the Left channel outputs of subsystems 12, . . . , 14, and 15) to generate the Left channel of the binaural output signal, assuming that appropriate level adjustments and time alignments are implemented in the subsystems 12, . . . , 14, and 15. Similarly, addition element 18 can also be implemented to simply sum corresponding Right binaural channel samples (e.g., the Right channel outputs of subsystems 12, . . . , 14, and 15) to generate the Right channel of the binaural output signal, again assuming that appropriate level adjustments and time alignments are implemented in the subsystems 12, . . . , 14, and 15.
Subsystem 15 of
Subsystems 12, . . . , 14 of
In the
When rendered and reproduced by a pair of headphones, a typical binaural audio signal output from element 210 is perceived at the listener's eardrums as sound from “N” loudspeakers (where N≥2 and N is typically equal to 2, 5 or 7) at any of a wide variety of positions, including positions in front of, behind, and above the listener. Reproduction of output signals generated in operation of the
Direct response and early reflection processing subsystem 100 can be implemented in any of a variety of ways (in either the time domain or a filterbank domain), with the preferred implementation for any specific application depending on various considerations, such as (for example) performance, computation, and memory. In one exemplary implementation, subsystem 100 is configured to convolve each channel asserted thereto with a FIR filter corresponding to the direct and early responses associated with the channel, with gain and delay properly set so that the outputs of subsystem 100 may be simply and efficiently combined (in element 210) with those of subsystem 200.
As shown in
In principle, each input channel (to subsystem 100 and subsystem 201 of
With reference to the
In a typical implementation each of the FDNs 203, 204, . . . , and 205, is implemented in the QMF domain, and filterbank 202 transforms the mono downmix from subsystem 201 into the QMF domain (e.g., the hybrid complex quadrature mirror filter (HCQMF) domain), so that the signal asserted from filterbank 202 to an input of each of FDNs 203, 204, . . . , and 205 is a sequence of QMF domain frequency components. In such an implementation, the signal asserted from filterbank 202 to FDN 203 is a sequence of QMF domain frequency components in a first frequency band, the signal asserted from filterbank 202 to FDN 204 is a sequence of QMF domain frequency components in a second frequency band, and the signal asserted from filterbank 202 to FDN 205 is a sequence of QMF domain frequency components in a “K”th frequency band. When analysis filterbank 202 is so implemented, synthesis filterbank 207 is configured to apply a QMF domain-to-time domain transform to the 2K sequences of output QMF domain frequency components from the FDNs, to generate the left channel and right channel late-reverbed time-domain signals which are output to element 210.
For example, if K=3 in the
Optionally, control subsystem 209 is coupled to each of the FDNs 203, 204, . . . , 205, and configured to assert control parameters to each of the FDNs to determine the late reverberation portion (LBRIR) which is applied by subsystem 200. Examples of such control parameters are described below. It is contemplated that in some implementations control subsystem 209 is operable in real time (e.g., in response to user commands asserted thereto by an input device) to implement real time variation of the late reverberation portion (LBRIR) applied by subsystem 200 to the monophonic downmix of input channels.
For example, if the input signal to the
D=[1 1 1 1 1]
After all-pass filtering (in element 301 in each of FDNs 203, 204, . . . , and 205), the mono downmix is up-mixed to the four reverb tanks in a power-conservative way:
Alternatively (as an example), we can choose to pan the left-side channels to the first two reverb tanks, the right-side channels to the last two reverb tanks, and the center channel to all reverb tanks. In this case, downmixing subsystem 201 would be implemented to form two downmix signals:
In this example, the upmixing to the reverb tanks (in each of FDNs 203, 204, . . . , and 205) is:
Because there are two downmix signals, the all-pass filtering (in element 301 in each of FDNs 203, 204, . . . , and 205) needs to be applied twice. Diversity would be introduced for the late responses of (L, Ls), (R, Rs) and C despite all of them having the same macro attributes. When the input signal channels have different source distances, proper delays and gains would still need to be applied in the downmixing process.
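The single-downmix case described above (a straight sum of five channels with D=[1 1 1 1 1], followed by a power-conservative upmix to four reverb tanks) can be sketched as follows. The equal upmix weight of 1/sqrt(4) per tank is an assumption chosen because it preserves total power; the function names are hypothetical.

```python
import math


def mono_downmix(frame, weights=None):
    """Weighted sum of one sample from each input channel.
    With no weights given, every channel has weight 1, i.e. D=[1 1 1 1 1]
    for a five-channel frame."""
    if weights is None:
        weights = [1.0] * len(frame)
    return sum(w * x for w, x in zip(weights, frame))


def upmix_to_tanks(sample, num_tanks=4):
    """Distribute one downmix sample to the reverb tank inputs so that the
    summed power of the tank inputs equals the power of the sample
    (assumed equal weights of 1/sqrt(num_tanks))."""
    g = 1.0 / math.sqrt(num_tanks)
    return [g * sample] * num_tanks


# One frame in the order L, R, C, Ls, Rs.
frame = [0.1, 0.2, 0.3, 0.4, 0.5]
mono = mono_downmix(frame)
tank_inputs = upmix_to_tanks(mono)
print(mono, tank_inputs)
```

When channels have different source distances, the per-channel weights (and delays, which this sketch omits) would differ from 1.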
We next describe considerations for specific implementations of downmixing subsystem 201, and subsystems 100 and 200 of the
td=d/vs
where d is the distance between the sound source and the listener and vs is the speed of sound. Furthermore, the gain of the direct response is proportional to 1/d. If these rules are preserved in the handling of direct responses of channels with different source distances, subsystem 201 can implement a straight downmixing of all channels because the delay and level of the late reverberation is generally insensitive to the source location.
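The two rules above (delay td=d/vs and direct gain proportional to 1/d) can be illustrated with a minimal sketch. The function name, the 343 m/s speed of sound, and the 1 meter reference distance are assumptions made for illustration only.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at room temperature


def direct_delay_and_gain(distance_m, ref_distance_m=1.0, speed=SPEED_OF_SOUND_M_S):
    """Return (delay in seconds, gain relative to the reference distance).

    Implements td = d / vs and a gain proportional to 1/d, normalized so
    that a source at ref_distance_m has unit gain (normalization is an
    assumption of this sketch).
    """
    if distance_m <= 0:
        raise ValueError("distance must be positive")
    delay_s = distance_m / speed
    gain = ref_distance_m / distance_m
    return delay_s, gain


# A source at 2 m arrives later and at half the level of a source at 1 m.
delay_s, gain = direct_delay_and_gain(2.0)
```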
Due to practical considerations, virtualizers (e.g., subsystem 100 of the virtualizer of
Virtualizers (e.g., subsystem 100 of the virtualizer of
The feedback delay network of
The FDN of
Element 302 is configured to add the output of matrix 308 which corresponds to delay line z−n1 (i.e., to apply feedback from the output of delay line z−n1 via matrix 308) to the input of the first reverb tank. Element 303 is configured to add the output of matrix 308 which corresponds to delay line z−n2 (i.e., to apply feedback from the output of delay line z−n2 via matrix 308) to the input of the second reverb tank. Element 304 is configured to add the output of matrix 308 which corresponds to delay line z−n3 (i.e., to apply feedback from the output of delay line z−n3 via matrix 308) to the input of the third reverb tank. Element 305 is configured to add the output of matrix 308 which corresponds to delay line z−n4 (i.e., to apply feedback from the output of delay line z−n4 via matrix 308) to the input of the fourth reverb tank.
Input gain element 300 of the FDN of
FDNs of the
If we assume the direct response (applied by subsystem 100 of
Gin=sqrt(ln(10^6)/(T60*DLR)),
where T60 is the reverb decay time defined as the time it takes for the reverberation to decay by 60 dB (it is determined by the reverb delays and reverb gains discussed below), and “ln” denotes the natural logarithmic function.
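As a hedged illustration, the input gain expression can be evaluated directly. The sketch assumes that DLR is supplied as a linear power ratio (not in dB) and that T60 is in the time unit used by the expression above; the function name and the sample values are hypothetical.

```python
import math


def fdn_input_gain(t60, dlr_linear):
    """Evaluate Gin = sqrt(ln(10^6) / (T60 * DLR)).

    dlr_linear is assumed to be a linear ratio, not a dB value.
    """
    if t60 <= 0 or dlr_linear <= 0:
        raise ValueError("T60 and DLR must be positive")
    return math.sqrt(math.log(1e6) / (t60 * dlr_linear))


# Hypothetical values: a longer decay time or a larger DLR lowers the gain.
gin = fdn_input_gain(t60=0.5, dlr_linear=4.0)
```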
The input gain factor, Gin, may be dependent on the content that is being processed. One application of such content dependency is to ensure that the energy of the downmix in each time/frequency segment is equal to the sum of the energies of the individual channel signals that are being downmixed, irrespective of any correlation that may exist between the input channel signals. In that case, the input gain factor can be (or can be multiplied by) a term similar or equal to:
in which i is an index over all downmix samples of a given time/frequency tile or subband, y(i) are the downmix samples for the tile, and xi(j) is the input signal (for channel Xi) asserted to the input of downmixing subsystem 201.
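One term that satisfies the stated goal (downmix energy equal to the sum of the channel energies in a tile) is the square root of the ratio of summed input energy to downmix energy. The following minimal sketch assumes that form; the function name and the sample data are hypothetical.

```python
import math


def energy_preserving_gain(channels, downmix):
    """Gain that makes the scaled downmix energy equal the summed channel energy.

    channels: list of equal-length sample lists, one per input channel.
    downmix:  list of downmix samples for the same tile (real or complex).
    """
    input_energy = sum(abs(s) ** 2 for ch in channels for s in ch)
    downmix_energy = sum(abs(s) ** 2 for s in downmix)
    if downmix_energy == 0.0:
        return 0.0  # nothing to scale in a silent tile
    return math.sqrt(input_energy / downmix_energy)


# Two fully correlated channels: a plain sum doubles the amplitude, so the
# gain must pull the downmix back to the summed channel energy.
left = [1.0, -1.0, 0.5]
right = [1.0, -1.0, 0.5]
mix = [l + r for l, r in zip(left, right)]
gain = energy_preserving_gain([left, right], mix)
scaled_energy = sum(abs(gain * s) ** 2 for s in mix)
```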
In a typical QMF-domain implementation of the FDN of
In implementing the reverb tank delays, z−ni, the reverb delays ni should be mutually prime numbers to avoid the reverb modes aligning at the same frequency. The sum of the delays should be large enough to provide sufficient modal density in order to avoid artificial sounding output. But the shortest delays should be short enough to avoid excess time gap between the late reverberation and the other components of the BRIR.
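The mutual-primality requirement is easy to check when choosing delays. The sketch below is illustrative only; the candidate delay values are hypothetical and are not taken from any embodiment.

```python
from itertools import combinations
from math import gcd


def are_mutually_prime(delays):
    """True if every pair of delays has greatest common divisor 1."""
    return all(gcd(a, b) == 1 for a, b in combinations(delays, 2))


candidate = [17, 23, 31, 41]                     # hypothetical delays
ok = are_mutually_prime(candidate)
bad = are_mutually_prime([16, 24, 31, 41])       # 16 and 24 share a factor of 8
```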
Typically, the reverb tank outputs are initially panned to either the left or the right binaural channel. Normally, the sets of reverb tank outputs being panned to the two binaural channels are equal in number and mutually exclusive. It is also desired to balance the timing of the two binaural channels. So if the reverb tank output with the shortest delay goes to one binaural channel, the one with the second shortest delay would go to the other channel.
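The balancing rule can be sketched by sorting the tanks by delay and assigning them to the two channels alternately. This is one way to satisfy the description, not necessarily the only one; the function name and delay values are hypothetical.

```python
def pan_tanks_alternately(delays):
    """Split tank indices into two equal, disjoint sets by ascending delay.

    The shortest delay goes to the first set, the second shortest to the
    second set, and so on.
    """
    order = sorted(range(len(delays)), key=lambda i: delays[i])
    first_channel = order[0::2]
    second_channel = order[1::2]
    return first_channel, second_channel


# Delays listed per tank index 0..3; tank 1 has the shortest delay.
left_tanks, right_tanks = pan_tanks_alternately([31, 17, 41, 23])
```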
The reverb tank delays can be different across frequency bands so as to change the modal density as a function of frequency. Generally, lower frequency bands require higher modal density and thus longer reverb tank delays.
The amplitudes of the reverb tank gains, gi, and the reverb tank delays jointly determine the reverb decay time of the FDN of
T60=−3ni/log10(|gi|)/FFRM
where FFRM is the frame rate of filterbank 202 (of
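Solving the T60 relation for the gain magnitude gives |gi| = 10^(−3·ni/(T60·FFRM)), which is how a target decay time would be turned into a tank gain. The sketch below assumes T60 in seconds and FFRM in frames per second; the 750 Hz frame rate and the delay of 17 frames are hypothetical values.

```python
import math


def tank_gain_magnitude(delay_frames, t60_s, frame_rate_hz):
    """|g_i| that yields the requested T60 for a tank delay of n_i frames."""
    return 10.0 ** (-3.0 * delay_frames / (t60_s * frame_rate_hz))


def t60_from_gain(delay_frames, gain_mag, frame_rate_hz):
    """T60 = -3 * n_i / log10(|g_i|) / F_FRM, as stated in the text."""
    return -3.0 * delay_frames / math.log10(gain_mag) / frame_rate_hz


g = tank_gain_magnitude(17, 0.4, 750.0)
roundtrip = t60_from_gain(17, g, 750.0)   # recovers the requested 0.4 s
```

Longer delays need gains closer to zero per pass to reach the same T60, because each sample circulates through the loop fewer times per second.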
The unitary feedback matrix 308 provides even mixing among the reverb tanks in the feedback path.
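As an illustration of even mixing, a scaled 4×4 Hadamard matrix is one well-known real unitary matrix in which every tank feeds every other tank with equal magnitude. Its use here is an assumption of the sketch; the text does not state which unitary matrix is employed.

```python
# Scaled 4x4 Hadamard matrix (assumed example of a unitary feedback matrix).
HADAMARD_4 = [[0.5 * s for s in row] for row in [
    [1, 1, 1, 1],
    [1, -1, 1, -1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
]]


def is_unitary(m, tol=1e-12):
    """Check that the rows of a real square matrix are orthonormal."""
    n = len(m)
    for i in range(n):
        for j in range(n):
            dot = sum(m[i][k] * m[j][k] for k in range(n))
            if abs(dot - (1.0 if i == j else 0.0)) > tol:
                return False
    return True


def mix(m, tank_outputs):
    """Apply the feedback matrix to the vector of tank outputs."""
    return [sum(row[k] * tank_outputs[k] for k in range(len(tank_outputs)))
            for row in m]


# Energy from a single tank is spread equally over all four tank inputs.
fed_back = mix(HADAMARD_4, [1.0, 0.0, 0.0, 0.0])
```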
To equalize the levels of the reverb tank outputs, gain elements 309 apply a normalization gain, 1/|gi| to the output of each reverb tank, to remove the level impact of the reverb tank gains while preserving fractional delays introduced by their phases.
Output mixing matrix 312 (also identified as matrix Mout) is a 2×2 matrix configured to mix the unmixed binaural channels (the outputs of elements 310 and 311, respectively) from initial panning to achieve output left and right binaural channels (the L and R signals asserted at the output of matrix 312) having desired interaural coherence. The unmixed binaural channels are close to being uncorrelated after the initial panning because they do not consist of any common reverb tank output. If the desired interaural coherence is Coh, where |Coh|≤1, output mixing matrix 312 may be defined as:
Because the reverb tank delays are different, one of the unmixed binaural channels would lead the other constantly. If the combination of reverb tank delays and panning pattern is identical across frequency bands, sound image bias would result. This bias can be mitigated if the panning pattern is alternated across the frequency bands such that the mixed binaural channels lead and trail each other in alternating frequency bands. This can be achieved by implementing the output mixing matrix 312 so as to have form as set forth in the previous paragraph in odd-numbered frequency bands (i.e., in the first frequency band (processed by FDN 203 of
where the definition of β remains the same. It should be noted that matrix 312 can be implemented to be identical in the FDNs for all frequency bands, but the channel order of its inputs may be switched for alternating ones of the frequency bands (e.g., the output of element 310 may be asserted to the first input of matrix 312 and the output of element 311 may be asserted to the second input of matrix 312 in odd frequency bands, and the output of element 311 may be asserted to the first input of matrix 312 and the output of element 310 may be asserted to the second input of matrix 312 in even frequency bands).
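A 2×2 matrix that conserves power and yields a prescribed coherence for uncorrelated, equal-power inputs is [[cos β, sin β], [sin β, cos β]] with β = arcsin(Coh)/2, since the resulting coherence is 2·sin β·cos β = sin 2β. The sketch below assumes that form and models the alternating-band variant as a swap of the two inputs; the function names are hypothetical.

```python
import math


def output_mixing_matrix(coh, swap_inputs=False):
    """Return a 2x2 mixing matrix for target coherence coh, |coh| <= 1.

    Assumed form: [[cos b, sin b], [sin b, cos b]] with b = asin(coh) / 2.
    swap_inputs exchanges the matrix columns, which is equivalent to
    swapping the order of the two unmixed binaural inputs.
    """
    beta = math.asin(coh) / 2.0
    m = [[math.cos(beta), math.sin(beta)],
         [math.sin(beta), math.cos(beta)]]
    if swap_inputs:
        m = [[row[1], row[0]] for row in m]
    return m


def coherence_for_uncorrelated_inputs(m):
    """Coherence of the two outputs when the inputs are uncorrelated, unit power."""
    (a, b), (c, d) = m
    cross = a * c + b * d
    return cross / math.sqrt((a * a + b * b) * (c * c + d * d))


odd_band = output_mixing_matrix(0.3)
even_band = output_mixing_matrix(0.3, swap_inputs=True)
```

Both variants give the same coherence and unit output power; only the lead and trail relationship between the two outputs is exchanged.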
In the case that frequency bands are (partially) overlapping, the width of the frequency range over which matrix 312's form is alternated can be increased (e.g., it could be alternated once for every two or three consecutive bands), or the value of β in the above expressions (for the form of matrix 312) can be adjusted to ensure that the average coherence equals the desired value to compensate for spectral overlap of consecutive frequency bands.
If the above-defined target acoustic attributes T60, Coh, and DLR are known for the FDN for each specific frequency band in the inventive virtualizer, each of the FDNs (each of which may have the structure shown in
We next describe an example of how a target reverb decay time (T60) for the FDN for each specific frequency band of an embodiment of the inventive virtualizer can be determined, by determining the target reverb decay time (T60) for each of a small number of frequency bands. The level of FDN response decays exponentially over time. T60 is inversely proportional to the decay factor, df (defined as dB decay over a unit of time):
T60=60 /df.
The decay factor, df, depends on frequency and generally increases linearly versus the log-frequency scale, so the reverb decay time is also a function of frequency which generally decreases as frequency increases. Therefore, if one determines (e.g., sets) the T60 values for two frequency points, the T60 curve for all frequencies is determined. For example, if the reverb decay times for frequency points fA and fB are T60,A and T60,B, respectively, the T60 curve is defined as:
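A curve consistent with these assumptions interpolates the decay factor linearly in log-frequency between the two points and converts back with T60=60/df. The sketch below is a minimal illustration under that assumption; the function name and the sample frequencies and decay times are hypothetical.

```python
import math


def t60_curve(f_hz, f_a, t60_a, f_b, t60_b):
    """T60 at f_hz, given T60 at two frequency points f_a and f_b.

    The decay factor df = 60 / T60 is interpolated linearly on a
    log-frequency axis, then converted back to a decay time.
    """
    df_a = 60.0 / t60_a
    df_b = 60.0 / t60_b
    position = math.log(f_hz / f_a) / math.log(f_b / f_a)
    df = df_a + (df_b - df_a) * position
    if df <= 0.0:
        raise ValueError("extrapolated decay factor is not positive")
    return 60.0 / df


# Halfway (in log-frequency) between 100 Hz and 10 kHz is 1 kHz.
mid = t60_curve(1000.0, 100.0, 0.6, 10000.0, 0.3)
```

Algebraically this is the same as T60(f) = T60,A·T60,B·log(fB/fA) / (T60,A·log(f/fA) − T60,B·log(f/fB)).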
We next describe an example of how a target Interaural coherence (Coh) for the FDN for each specific frequency band of an embodiment of the inventive virtualizer can be achieved by setting a small number of control parameters. The Interaural coherence (Coh) of the late reverberation largely follows the pattern of a diffuse sound field. It can be modeled by a sinc function up to a cross-over frequency fC, and a constant above the cross-over frequency. A simple model for the Coh curve is:
where the parameters Cohmin and Cohmax satisfy −1≤Cohmin<Cohmax≤1, and control the range of Coh. The optimal cross-over frequency fC depends on the head size of the listener. Too high a value of fC leads to an internalized sound source image, while too small a value leads to a dispersed or split sound source image.
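As a rough illustration only, the description (a sinc shape up to fC, a constant above it, and a range set by Cohmin and Cohmax) can be met by scaling a normalized sinc between the two limits. The exact expression used in the sketch is an assumption, as are the function names and the sample values.

```python
import math


def sinc(x):
    """Normalized sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0.0 else math.sin(math.pi * x) / (math.pi * x)


def coherence_target(f_hz, f_c, coh_min, coh_max):
    """Assumed model: coh_max at 0 Hz, falling to coh_min at f_c, flat above."""
    if f_hz >= f_c:
        return coh_min
    return coh_min + (coh_max - coh_min) * sinc(f_hz / f_c)


# Hypothetical parameters: cross-over at 700 Hz, range 0.1 to 0.9.
low = coherence_target(0.0, 700.0, 0.1, 0.9)
half = coherence_target(350.0, 700.0, 0.1, 0.9)
high = coherence_target(5000.0, 700.0, 0.1, 0.9)
```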
We next describe an example of how a target direct-to-late ratio (DLR) for the FDN for each specific frequency band of an embodiment of the inventive virtualizer can be achieved by setting a small number of control parameters. The Direct-to-late ratio (DLR), in dB, generally increases linearly versus the log-frequency scale. It can be controlled by setting DLR1K (DLR in dB@1 kHz) and DLRslope (in dB per 10× frequency). However, low DLR in the lower frequency range often results in excessive combing artifact. In order to mitigate the artifact, two modifying mechanisms are added to control the DLR:
The resulting DLR curve in dB is defined as:
It should be noted that DLR changes with source distance even in the same acoustic environment. Therefore, both DLR1K and DLRmin here are the values for a nominal source distance, such as 1 meter.
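The linear part of the DLR control can be sketched from DLR1K and DLRslope alone. The sketch additionally assumes that DLRmin acts as a lower bound on the curve, which its name suggests; any further modifying mechanism is not modeled. Function and parameter names and the sample values are hypothetical.

```python
import math


def dlr_db(f_hz, dlr_1k_db, slope_db_per_decade, dlr_min_db):
    """DLR in dB: linear in log10(f / 1 kHz), floored at dlr_min_db (assumed)."""
    linear = dlr_1k_db + slope_db_per_decade * math.log10(f_hz / 1000.0)
    return max(linear, dlr_min_db)


# Hypothetical settings for a nominal source distance.
at_1k = dlr_db(1000.0, 18.0, 6.0, 15.0)
at_100 = dlr_db(100.0, 18.0, 6.0, 15.0)    # linear value is below the floor
at_10k = dlr_db(10000.0, 18.0, 6.0, 15.0)
```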
Variations on the embodiments disclosed herein have one or more of the following features:
For applications in which system latency is critical and the delay caused by analysis and synthesis filterbanks is prohibitive, the filterbank-domain FDN structure of typical embodiments of the inventive virtualizer can be translated into the time domain, and each FDN structure can be implemented in the time domain in a class of embodiments of the virtualizer. In time domain implementations, the subsystems which apply the input gain factor (Gin), reverb tank gains (gi), and normalization gains (1/|gi|) are replaced by filters with similar amplitude responses in order to allow frequency-dependent controls. The output mixing matrix (Mout) is also replaced by a matrix of filters. Unlike for the other filters, the phase response of this matrix of filters is critical as power conservation and interaural coherence might be affected by the phase response. The reverb tank delays in a time domain implementation may need to be slightly varied (from their values in a filterbank domain implementation) to avoid sharing the filterbank stride as a common factor. Due to various constraints, the performance of time-domain implementations of the FDNs of the inventive virtualizer might not exactly match that of filterbank-domain implementations thereof.
With reference to
The
Whenever the setting of the late reverberation portion LBRIR is to be modified, impulse generator 211 is operated to assert a unit impulse to element 202, and the resulting output from filterbank 207 is captured and asserted to filter 208 (to set the filter 208 to apply the new LBRIR determined by the output of filterbank 207). To accelerate the time lapse from the LBRIR setting change to the time that the new LBRIR takes effect, the samples of the new LBRIR can start replacing the old LBRIR as they become available. To shorten the inherent latency of the FDNs, initial zeros of the LBRIR can be discarded. These options provide flexibility and allow the hybrid implementation to provide potential performance improvement (relative to that provided by a filterbank domain implementation), at a cost of added computation from the FIR filtering.
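Discarding the initial zeros of a captured impulse response can be sketched as follows. The function name, the optional threshold, and the sample data are hypothetical; a practical implementation would also need to decide how the removed latency is accounted for elsewhere in the system.

```python
def trim_leading_zeros(impulse_response, threshold=0.0):
    """Drop leading samples whose magnitude does not exceed threshold.

    Returns (trimmed response, number of samples removed).
    """
    for n, sample in enumerate(impulse_response):
        if abs(sample) > threshold:
            return impulse_response[n:], n
    return [], len(impulse_response)


# Hypothetical captured response with three samples of inherent latency.
captured = [0.0, 0.0, 0.0, 0.5, -0.2, 0.0, 0.1]
trimmed, removed = trim_leading_zeros(captured)
```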
For applications where system latency is critical, but computation power is less of a concern, the side-chain filterbank-domain late reverberation processor (e.g., that implemented by elements 211, 202, 203, 204, . . . , 205, and 207 of
The various FDN parameters and thus the resulting late-reverberation attributes can be manually tuned and subsequently hard-wired into an embodiment of the inventive late reverberation processing subsystem, for example by means of one or more presets that can be adjusted (e.g., by operating control subsystem 209 of
1. The end-user may manually control the FDN parameters, for example by means of a user-interface on a display (e.g., implemented by an embodiment of control subsystem 209 of
2. The author of the audio content to be virtualized may provide settings or desired parameters that are conveyed with the content itself, for example by metadata provided with the input audio signal. Such metadata may be parsed and employed (e.g., by an embodiment of control subsystem 209 of
3. A playback device may be aware of its location or environment, by means of one or more sensors. For example, a mobile device may use GSM networks, global positioning system (GPS), known WiFi access points, or any other location service to determine where the device is. Subsequently, data indicative of location and/or environment may be employed (e.g., by an embodiment of control subsystem 209 of
4. In relation to the location of the playback device, a cloud service or social media may be used to derive the most common settings consumers are using in a certain environment. Additionally, users may upload their current settings to a cloud or social media service, in association with the (known) location to make available for other users, or themselves;
5. A playback device may contain other sensors such as a camera, light sensor, microphone, accelerometer, gyroscope, to determine the activity of the user and the environment the user is in, to optimize FDN parameters for that particular activity and/or environment;
6. The FDN parameters may be controlled by the audio content. Audio classification algorithms, or manually-annotated content, may indicate whether segments of the audio comprise speech, music, sound effects, silence, and the like. FDN parameters may be adjusted according to such labels. For example, the direct-to-reverberation ratio may be reduced for dialog to improve the dialog intelligibility. Additionally, video analysis may be used to determine the location of a current video segment, and FDN parameters may be adjusted accordingly to more closely simulate the environment depicted in the video; and/or
7. A solid-state playback system may use different FDN settings than a mobile device, i.e., settings may be device dependent. A solid-state system present in a living room may simulate a typical (fairly reverberant) living room scenario with distant sources, while a mobile device may render content closer to the listener.
Some implementations of the inventive virtualizer include FDNs (e.g., an implementation of the FDN of
In a first class of embodiments, the invention is a headphone virtualization method for generating a binaural signal in response to a set of channels (e.g., each of the channels, or each of the full frequency range channels) of a multi-channel audio input signal, including steps of: (a) applying a binaural room impulse response (BRIR) to each channel of the set (e.g., by convolving each channel of the set with a BRIR corresponding to said channel, in subsystems 100 and 200 of
In typical embodiments in the first class, each of the FDNs is implemented in the hybrid complex quadrature mirror filter (HCQMF) domain or the quadrature mirror filter (QMF) domain, and in some such embodiments, frequency-dependent spatial acoustic attributes of the binaural signal are controlled (e.g., using control subsystem 209 of
Typical embodiments in this class include a step of adjusting (e.g., using control subsystem 209 of
In a second class of embodiments, the invention is a method for generating a binaural signal in response to a multi-channel audio input signal, by applying a binaural room impulse response (BRIR) to each channel (e.g., by convolving each channel with a corresponding BRIR) of a set of the channels of the input signal (e.g., each of the input signal's channels or each full frequency range channel of the input signal), including by: processing each channel of the set in a first processing path (e.g., implemented by subsystem 100 of
Some embodiments in the first class (and the second class) implement one or more of the following features:
1. a filterbank domain (e.g., hybrid complex quadrature mirror filter-domain) FDN implementation (e.g., the FDN implementation of
2. The specific downmixing process, employed to generate (from the multi-channel input audio signal) the downmixed (e.g., monophonic downmixed) signal processed in the second processing path, depends on the source distance of each channel and the handling of direct response in order to maintain proper level and timing relationship between the direct and late responses;
3. An all-pass filter (e.g., APF 301 of
4. Fractional delays are implemented in the feedback path of each FDN in a complex-valued, multi-rate structure to overcome issues related to delays quantized to the downsample-factor grid;
5. In the FDNs, the reverb tank outputs are linearly mixed directly into the binaural channels (e.g., by matrix 312 of
6. Frequency-dependent reverb decay time is controlled (e.g., using control subsystem 209 of
7. one scaling factor is applied (e.g., by elements 306 and 309 of
8. Simple parametric models are implemented (e.g., by control subsystem 209 of
In some embodiments (e.g., for applications in which system latency is critical and the delay caused by analysis and synthesis filterbanks is prohibitive), the filterbank-domain FDN structures of typical embodiments of the inventive system (e.g., the FDN of
In the
Downmixing subsystem 201 (of late reverberation processing subsystem 221) is configured to downmix the channels of the multi-channel input signal into a mono downmix (which is a time domain signal), and FDN 220 is configured to apply the late reverberation portion to the mono downmix
With reference to
Unitary matrix 415 (corresponding to unitary matrix 308 of
When the delay (n1) applied by line 410 is shorter than that (n2) applied by line 411, the delay applied by line 411 is shorter than that (n3) applied by line 412, and the delay applied by line 412 is shorter than that (n4) applied by line 413, the outputs of gain elements 417 and 419 (of the first and third reverb tanks) are asserted to inputs of addition element 422, and the outputs of gain elements 418 and 420 (of the second and fourth reverb tanks) are asserted to inputs of addition element 423. The output of element 422 is asserted to one input of IACC filtering and mixing stage 424, and the output of element 423 is asserted to the other input of IACC filtering and mixing stage 424.
Examples of implementations of gain elements 417-420 and elements 422, 423, and 424 of
The unmixed binaural channels (output from elements 310 and 311 of
Thus, in the
and the output mixing matrix 312 in even-numbered frequency bands may be implemented to multiply the two inputs asserted thereto by a matrix having the following form:
Alternatively, the above-noted sound image bias in the binaural output channels can be mitigated by implementing matrix 312 to be identical in the FDNs for all frequency bands, if the channel order of its inputs is switched for alternating ones of the frequency bands (e.g., the output of element 310 may be asserted to the first input of matrix 312 and the output of element 311 may be asserted to the second input of matrix 312 in odd frequency bands, and the output of element 311 may be asserted to the first input of matrix 312 and the output of element 310 may be asserted to the second input of matrix 312 in even frequency bands).
In the
More specifically, in a typical implementation of the FDN of
In this implementation, choice of the following gain values may result in an undesirable bias of the output sound image (indicated by the binaural channels output from element 424) to one side (i.e., to the left or right channel): g1=0.5, g2=0.5, g3=0.5, and g4=0.5. In accordance with an embodiment of the invention, the gain values g1, g2, g3, and g4 (applied by elements 417, 418, 419, and 420, respectively) are chosen as follows to center the sound-image: g1=0.38, g2=0.6, g3=0.5, and g4=0.5. Thus, the output stereo image is re-centered in accordance with an embodiment of the invention by attenuating the earliest-arriving signal (which has been panned to one side, by element 422 in the example) relative to the second-latest arriving signal (i.e., by choosing g1<g3), and boosting the second-earliest signal (which has been panned to the other side, by element 423 in the example), relative to the latest arriving signal (i.e., by choosing g4<g2).
Typical implementations of the time-domain FDN of
where g=0.6. All-pass filter 301 of
In some implementations of the time-domain FDN of
In
Each of filters 406, 407, 408, and 409, and each of elements 406A, 407A, 408A, and 409A of the
In some embodiments, the decay gains (decaygaini) applied by elements 406A, 407A, 408A, and 409A are determined as follows:
decaygaini=10^((−60*(ni/Fs)/T)/20),
where i is the reverb tank index (i.e., element 406A applies decaygain1, element 407A applies decaygain2, and so on), ni is the delay of the ith reverb tank (e.g., n1 is the delay applied by delay line 410), Fs is the sampling rate, T is the desired reverb decay time (T60) at a predetermined low frequency.
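The decay gain expression can be evaluated per reverb tank as in the following minimal sketch. The function name, the 48 kHz sampling rate, the 0.5 s decay time, and the four delay values are hypothetical.

```python
def decay_gain(delay_samples, sample_rate_hz, t60_s):
    """decaygain_i = 10 ** ((-60 * (n_i / Fs) / T) / 20)."""
    return 10.0 ** ((-60.0 * (delay_samples / sample_rate_hz) / t60_s) / 20.0)


# Hypothetical delays (in samples) for the four reverb tanks.
tank_delays = (1009, 1201, 1409, 1601)
gains = [decay_gain(n, 48000.0, 0.5) for n in tank_delays]

# A delay equal to the full decay time corresponds to exactly -60 dB.
full_decay = decay_gain(24000, 48000.0, 0.5)
```

Longer delay lines receive smaller gains, so that every tank decays at the same rate in dB per second regardless of its length.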
Thus, in a class of embodiments, the invention is a system (e.g., that of
The input filter may be implemented to generate (preferably as a cascade of two filters configured to generate) the first filtered downmix such that each BRIR has a direct-to-late ratio (DLR) which matches, at least substantially, a target DLR.
Each reverb tank may be configured to generate a delayed signal, and may include a reverb filter (e.g., implemented as a shelf filter or a cascade of shelf filters) coupled and configured to apply a gain to a signal propagating in said each of the reverb tanks, to cause the delayed signal to have a gain which matches, at least substantially, a target decayed gain for said delayed signal, in an effort to achieve a target reverb decay time characteristic (e.g., a T60 characteristic) of each BRIR.
In some embodiments, the first unmixed binaural channel leads the second unmixed binaural channel, the reverb tanks include a first reverb tank (e.g., the reverb tank of
Aspects of the invention include methods and systems (e.g., system 20 of
In some embodiments, the inventive virtualizer is or includes a general purpose processor coupled to receive or to generate input data indicative of a multi-channel audio input signal, and programmed with software (or firmware) and/or otherwise configured (e.g., in response to control data) to perform any of a variety of operations on the input data, including an embodiment of the inventive method. Such a general purpose processor would typically be coupled to an input device (e.g., a mouse and/or a keyboard), a memory, and a display device. For example, the
While specific embodiments of the present invention and applications of the invention have been described herein, it will be apparent to those of ordinary skill in the art that many variations on the embodiments and applications described herein are possible without departing from the scope of the invention described and claimed herein. It should be understood that while certain forms of the invention have been shown and described, the invention is not to be limited to the specific embodiments described and shown or the specific methods described.
Number | Date | Country | Kind |
---|---|---|---|
201410178258.0 | Apr 2014 | CN | national |
This application is a continuation of U.S. patent application Ser. No. 16/777,599 filed Jan. 30, 2020, which is a continuation of U.S. patent application Ser. No. 16/541,079 filed Aug. 14, 2019, now U.S. Pat. No. 10,555,109, which is a continuation of U.S. patent application Ser. No. 15/109,541 filed Jul. 1, 2016, now U.S. Pat. No. 10,425,763, which is a U.S. national phase of PCT International Application No. PCT/US2014/071100 filed Dec. 18, 2014, which claims the benefit of priority to Chinese Patent Application No. 201410178258.0 filed 29 Apr. 2014; U.S. Provisional Patent Application No. 61/923,579 filed 3 Jan. 2014; and U.S. Provisional Patent Application No. 61/988,617 filed 5 May 2014, each of which is hereby incorporated by reference in its entirety.
Number | Date | Country |
---|---|---|
20210051435 A1 | Feb 2021 | US |
Number | Date | Country |
---|---|---|
61988617 | May 2014 | US |
61923579 | Jan 2014 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 16777599 | Jan 2020 | US |
Child | 17012076 | | US |
Parent | 16541079 | Aug 2019 | US |
Child | 16777599 | | US |
Parent | 15109541 | | US |
Child | 16541079 | | US |