The present invention relates to audio coding. More particularly, the present invention relates to a device for and a method of converting an audio input signal into a binaural output signal, wherein the input signal comprises at least one audio channel and parameters representing additional channels.
It is well known to record and reproduce binaural audio signals, that is, audio signals which contain specific directional information to which the human ear is sensitive. Binaural recordings are typically made using two microphones mounted in a dummy human head, so that the recorded sound corresponds to the sound captured by the human ear and includes any influences due to the shape of the head and the ears. Binaural recordings differ from stereo (that is, stereophonic) recordings in that the reproduction of a binaural recording requires a headset, whereas a stereo recording is made for reproduction by loudspeakers. While a binaural recording allows a reproduction of all spatial information using only two channels, a stereo recording would not provide the same spatial perception.
Regular dual channel (stereophonic) or multiple channel (e.g. 5.1) recordings may be transformed into binaural recordings by convolving each regular signal with a set of perceptual transfer functions. Such perceptual transfer functions model the influence of the human head, and possibly other objects, on the signal. A well-known type of perceptual transfer function is the so-called Head-Related Transfer Function (HRTF). An alternative type of perceptual transfer function, which also takes into account reflections caused by the walls, ceiling and floor of a room, is the Binaural Room Impulse Response (BRIR).
In the case of multiple channel signals, transforming the signals into binaural recording signals with a set of perceptual functions typically implies a convolution of perceptual functions with the signals of all channels. As a typical convolution is computationally demanding, the signals and the HRTF are typically transformed to the frequency (Fourier) domain where the convolution is replaced with a computationally far less demanding multiplication.
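The equivalence relied upon here can be illustrated with a minimal sketch. It assumes NumPy is available and uses random data as a stand-in for an audio channel and a head-related impulse response; the function name `fft_convolve` is purely illustrative and is not taken from any described embodiment.

```python
import numpy as np

def fft_convolve(signal, impulse_response):
    """Linear convolution computed as a multiplication of zero-padded spectra."""
    n = len(signal) + len(impulse_response) - 1
    spectrum = np.fft.rfft(signal, n) * np.fft.rfft(impulse_response, n)
    return np.fft.irfft(spectrum, n)

rng = np.random.default_rng(0)
audio = rng.standard_normal(1024)  # stand-in for one audio channel
hrir = rng.standard_normal(128)    # stand-in for a head-related impulse response

direct = np.convolve(audio, hrir)  # time-domain convolution
fast = fft_convolve(audio, hrir)   # frequency-domain multiplication
max_error = float(np.max(np.abs(direct - fast)))
```

Both routes yield the same output up to rounding error; the frequency-domain route merely replaces the inner convolution loop by one multiplication per frequency bin.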
It is further well known to reduce the number of audio channels to be transmitted or stored by representing the original number of channels by a smaller number of channels and parameters indicative of the relationships between the original channels. A set of stereo signals may thus be represented by a single (mono) channel plus a number of associated spatial parameters, while a set of 5.1 signals may be represented by two channels and a set of associated spatial parameters, or even by a single channel plus the associated spatial parameters. This “downmixing” of multiple audio channels in spatial encoders, and the corresponding “upmixing” of audio signals in spatial decoders, is typically carried out in a transform domain or sub-band domain, for example the QMF (Quadrature Mirror Filter) domain.
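By way of a hedged illustration only, the representation of a stereo pair by a mono channel plus spatial parameters might be sketched as follows. The sketch works on the full band in the time domain for brevity, whereas practical spatial encoders operate per band in a transform or sub-band domain; the function name and the choice of a level difference in dB and a normalized correlation as parameters are assumptions made for this example.

```python
import numpy as np

def encode_parametric_stereo(left, right):
    """Return a mono downmix plus two illustrative spatial parameters."""
    mono = 0.5 * (left + right)
    power_left = np.mean(left ** 2)
    power_right = np.mean(right ** 2)
    # Level difference between the channels, in dB.
    cld_db = 10.0 * np.log10(power_left / power_right)
    # Normalized correlation between the channels.
    icc = np.mean(left * right) / np.sqrt(power_left * power_right)
    return mono, float(cld_db), float(icc)

rng = np.random.default_rng(0)
left = rng.standard_normal(4800)
right = 0.5 * left  # fully correlated, half the amplitude
mono, cld_db, icc = encode_parametric_stereo(left, right)
```

In this example the right channel is a scaled copy of the left channel, so the level difference equals 20·log10(2) dB and the correlation equals one.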
When downmixed input channels are to be converted into binaural output channels, the Prior Art approach is to first upmix the input channels using a spatial decoder to produce upmixed intermediary channels, and then convert these upmixed intermediary channels into binaural channels. This procedure typically produces five or six intermediary channels, which then have to be reduced to two binaural channels. First expanding and then reducing the number of channels is clearly not efficient and increases the computational complexity. In addition, reducing the five or six intermediary channels meant for multiple channel loudspeaker reproduction to only two channels meant for binaural reproduction inevitably introduces artifacts and therefore decreases the sound quality.
The QMF domain referred to above is similar, but not identical, to the frequency (Fourier transform) domain. If a spatial decoder is to produce binaural output signals, the downmixed audio signals would first have to be transformed to the QMF domain for upmixing, then be inversely QMF transformed to produce time domain intermediary signals, subsequently be transformed to the frequency domain for multiplication with the (Fourier transformed) HRTF, and finally be inversely transformed to produce time domain output signals. It will be clear that this procedure is not efficient, as several transforms must be performed in succession.
The number of computations involved in this Prior Art approach would make it very difficult to design a hand-held consumer device, such as a portable MP3 player, capable of producing binaural output signals from downmixed audio signals. Even if such a device could be implemented, its battery life would be very short due to the required computational load.
It is an object of the present invention to overcome these and other problems of the Prior Art and to provide a spatial decoder unit capable of producing a pair of binaural output channels from a set of downmixed audio channels represented by one or more audio input channels and an associated set of spatial parameters, which decoder has an increased efficiency.
Accordingly, the present invention provides a spatial decoder unit for producing a pair of binaural output channels using spatial parameters and one or more audio input channels, the device comprising a parameter conversion unit for converting the spatial parameters into binaural parameters using parameterized perceptual transfer functions, and a spatial synthesis unit for synthesizing a pair of binaural channels using the binaural parameters and the audio channels.
By converting the spatial parameters into binaural parameters, the spatial synthesis unit can directly synthesize a pair of binaural channels, without requiring an additional binaural synthesis unit. As no superfluous intermediary signals are produced, the computational requirements are reduced while the introduction of artifacts is substantially eliminated.
In the spatial decoder unit of the present invention, the synthesis of the binaural channels can be carried out in the transform domain, for example the QMF domain, without requiring the additional steps of transformation to the frequency domain and the subsequent inverse transformation to the time domain. As two transform steps can be omitted, both the number of computations and the memory requirements are significantly reduced. The spatial decoder unit of the present invention can therefore relatively easily be implemented in a portable consumer device.
Furthermore, in the spatial decoder unit of the present invention, binaural channels are produced directly from downmixed channels, each binaural channel comprising binaural signals for binaural reproduction using a headset or a similar device. The parameter conversion unit derives the binaural parameters used for producing the binaural channels from spatial (that is, upmix) parameters. This derivation of the binaural parameters involves parameterized perceptual transfer functions, such as HRTFs (Head-Related Transfer Functions) and/or Binaural Room Impulse Responses (BRIRs). According to the present invention, therefore, the processing of the perceptual transfer functions is performed in the parameter domain, while in the Prior Art this processing was carried out in the time domain or the frequency domain. This may result in a further reduction of the computational complexity as the resolution in the parameter domain is typically lower than the resolution in the time domain or the frequency domain.
It is preferred that the parameter conversion unit is arranged for combining in the parameter domain, in order to determine the binaural parameters, all perceptual transfer function contributions the input (downmix) audio channels would make to the binaural channels. In other words, the spatial parameters and the parameterized perceptual transfer functions are combined in such a manner that the combined parameters result in a binaural output signal having similar statistical properties to those obtained in the Prior Art method involving upmixed intermediary signals.
In a preferred embodiment, the spatial decoder unit of the present invention further comprises one or more transform units for transforming the audio input channels into transformed audio input channels, and a pair of inverse transform units for inversely transforming the synthesized binaural channels into the pair of binaural output channels, wherein the spatial synthesis unit is arranged for operating in a transform domain or sub-band domain, preferably the QMF domain.
The spatial decoder unit of the present invention may comprise two transform units, the parameter conversion unit being arranged for utilizing perceptual transfer function parameters involving three channels only, two of these three channels incorporating the contributions of composite front and rear channels. In such an embodiment, the parameter conversion unit may be arranged for processing channel level (e.g. CLD), channel coherence (e.g. ICC), channel prediction (e.g. CPC) and/or phase (e.g. IPD) parameters.
In an alternative embodiment, the spatial decoder unit of the present invention may comprise only a single transform unit, and may further comprise a decorrelation unit for decorrelating the transformed single channel output by the single transform unit. In such an embodiment, the parameter conversion unit may be arranged for processing channel level (e.g. CLD), channel coherence (e.g. ICC), and/or phase (e.g. IPD) parameters.
The spatial decoder unit of the present invention may additionally comprise a stereo reverberation unit. Such a stereo reverberation unit may be arranged for operating in the time domain or in a transform domain or sub-band (e.g. QMF) domain.
The present invention also provides a spatial decoder device for producing a pair of binaural output channels from an input bitstream, the device comprising a demultiplexer unit for demultiplexing the input bitstream into at least one downmix channel and signal parameters, a downmix decoder unit for decoding the at least one downmix channel, and a spatial decoder unit for producing a pair of binaural output channels using the spatial parameters and the at least one downmix channel, wherein the spatial decoder unit comprises a parameter conversion unit for converting the spatial parameters into binaural parameters using parameterized perceptual transfer functions, and a spatial synthesis unit for synthesizing a pair of binaural channels using the binaural parameters and the at least one downmix channel.
In addition, the present invention provides a consumer device and an audio system comprising a spatial decoder unit and/or spatial decoder device as defined above. The present invention further provides a method of producing a pair of binaural output channels using spatial parameters and one or more audio input channels, the method comprising the steps of converting the spatial parameters into binaural parameters using parameterized perceptual transfer functions, and synthesizing a pair of binaural channels using the binaural parameters and the audio channels. Further aspects of the method according to the present invention will become apparent from the description below.
The present invention additionally provides a computer program product for carrying out the method as defined above. A computer program product may comprise a set of computer executable instructions stored on a data carrier, such as a CD or a DVD. The set of computer executable instructions, which allow a programmable computer to carry out the method as defined above, may also be available for downloading from a remote server, for example via the Internet.
The present invention will further be explained below with reference to exemplary embodiments illustrated in the accompanying drawings, in which:
The application of perceptual transfer functions, such as Head-Related Transfer Functions (HRTFs), in accordance with the Prior Art is schematically illustrated in
Those skilled in the art will know that HRTFs may be determined by making both regular (stereo) recordings and binaural recordings, and deriving a transfer function which represents the shaping of the binaural recording relative to the regular recording. Binaural recordings are made using two microphones mounted in a dummy human head, so that the recorded sound corresponds to the sound captured by the human ear and includes any influences due to the shape of the head and the ears, and even the presence of hair and shoulders.
If the HRTF processing takes place in the time domain, the HRTFs are convolved with the (time domain) audio signals of the channels. Typically, however, the HRTFs are transformed to the frequency domain, and the resulting transfer functions and the frequency spectra of the audio signals are then multiplied (Fourier transform units and inverse Fourier transform units are not shown in
After HRTF processing by the appropriate HRTF unit 31, the resulting left and right signals are added by a respective adder 32 to yield the (time domain) left binaural signal lb and the right binaural signal rb.
The exemplary Prior Art binaural synthesis device 3 of
The spatial encoder device 1 comprises a spatial encoding (SE) unit 11, a downmix encoding (DE) unit 12 and a multiplexer (Mux) 13. The spatial encoding unit 11 receives five audio input channels lf (left front), lr (left rear), rf (right front), rr (right rear) and c (center). The spatial encoding unit 11 downmixes the five input channels to produce two channels l (left) and r (right), as well as signal parameters sp (it is noted that the spatial encoding unit 11 may produce a single channel instead of the two channels l and r). In the embodiment shown, where five channels are downmixed to two channels (a so-called 5-2-5 configuration), the signal parameters sp may for example comprise:
It is noted that “lfe” is an optional low frequency (sub-woofer) channel, and that the “rear” channels are also known as “surround” channels.
The two downmix channels l and r produced by the spatial encoding unit 11 are fed to the downmix encoding (DE) unit 12, which typically uses a type of coding aimed at reducing the amount of data. The thus encoded downmix channels l and r, and the signal parameters sp, are multiplexed by the multiplexer unit 13 to produce an output bit stream bs.
In an alternative embodiment (not shown), five (or six) channels are downmixed to a single (mono) channel (a so-called 5-1-5 configuration), and the signal parameters sp may for example comprise:
In this alternative embodiment the encoded downmix channel s, as well as the signal parameters sp, are also multiplexed by the multiplexer unit 13 to produce an output bit stream bs.
If this bitstream bs were to be used to produce a pair of binaural channels, the Prior Art approach would be to first upmix the two downmix channels l and r (or, alternatively, the single downmix channel) to produce the five or six original channels, and then convert these five or six channels into two binaural channels. An example of this Prior Art approach is illustrated in
The spatial decoder device 2′ according to the Prior Art comprises a demultiplexer (Demux) unit 21′, a downmix decoding unit 22′, and a spatial decoder unit 23′. A binaural synthesis device 3 is coupled to the spatial decoder unit 23′ of the spatial decoder device 2′.
The demultiplexer unit 21′ receives a bitstream bs, which may be identical to the bitstream bs of
An exemplary structure of the spatial decoder unit 23′ of the Prior Art is shown in
It is noted that in the example of
The configuration of
The present invention provides far more efficient processing by integrating the binaural synthesis device in the spatial decoder device and effectively carrying out the binaural processing in the parameter domain. A merely exemplary embodiment of a spatial decoder unit according to the present invention is schematically illustrated in
The inventive spatial decoder unit 23 shown merely by way of non-limiting example in
The transform units 231 each receive a downmix channel l and r respectively (see also
The spatial synthesis unit 232 may be similar or identical to the Prior Art spatial synthesis unit 232′ of
an average level per frequency band for the left transfer function as a function of azimuth (angle in a horizontal plane), elevation (angle in a vertical plane), and distance,
an average level per frequency band for the right transfer function as a function of azimuth, elevation and distance, and
an average phase or time difference per frequency band as a function of azimuth, elevation and distance.
In addition, the following parameters may be included:
a coherence measure of the left and right transfer functions per HRTF frequency band as a function of azimuth, elevation and distance, and/or
absolute phase and/or time parameters for the left and right transfer functions as a function of azimuth, elevation and distance.
The actual HRTF parameters used may depend on the particular embodiment.
The spatial synthesis unit 232 may determine the binaural channels Lb and Rb using the following formula:
where the index k denotes the QMF hybrid (frequency) band index and the index m denotes the QMF slot (time) index. The parameters hij of the matrix Hk are determined by the binaural parameters (bp in
The parameters hij of the matrix Hk may be determined in the following way in the case of two downmix channels (5-2-5 configuration). In the Prior Art spatial decoder unit of
In accordance with a further aspect of the present invention the parameter conversion unit (234 in
The operation of the two-to-three upmix unit 230′ can be described by the following matrix operation:
with matrix entries mij dependent on the spatial parameters. The relation between the spatial parameters and the matrix entries is identical to that of a 5.1 MPEG Surround decoder. For each of the three resulting signals l, r and c, the effect is determined of the perceptual transfer function (in the present example: HRTF) parameters which correspond to the desired (perceived) position of these sound sources. For the center channel (c), the spatial parameters of the sound source position can be applied directly, resulting in two output signals for center, lB(c) and rB(c):
As can be observed from equation (3), the HRTF parameter processing consists of a multiplication of the signal with average power levels Pl and Pr corresponding to the sound source position of the center channel, while the phase difference is distributed symmetrically. This process is performed independently for each QMF band, using the mapping from HRTF parameters to QMF filter bank on the one hand, and mapping from spatial parameters to QMF band on the other hand.
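The center-channel processing described above might be sketched as follows. This is a minimal illustration under stated assumptions: the per-band level parameters `p_left` and `p_right` and the per-band phase difference `phase_diff` are hypothetical inputs, the signal is held as a complex array of shape (bands, slots), and the sign convention (plus half the phase difference to the left ear, minus half to the right ear) is an assumption of the sketch.

```python
import numpy as np

def center_contribution(c, p_left, p_right, phase_diff):
    """Scale the center signal per band and split the phase difference symmetrically.

    c: complex array of shape (bands, slots).
    p_left, p_right, phase_diff: real arrays of shape (bands,).
    """
    left = (p_left * np.exp(+0.5j * phase_diff))[:, None] * c
    right = (p_right * np.exp(-0.5j * phase_diff))[:, None] * c
    return left, right

bands, slots = 4, 8
rng = np.random.default_rng(1)
c = rng.standard_normal((bands, slots)) + 1j * rng.standard_normal((bands, slots))
p_l = np.array([1.0, 0.9, 0.8, 0.7])
p_r = np.array([1.0, 0.8, 0.6, 0.5])
phi = np.array([0.0, 0.2, 0.4, 0.0])
lb_c, rb_c = center_contribution(c, p_l, p_r, phi)
```

Each band is processed independently, and the phase of the left output relative to the right output equals the per-band phase difference.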
For the left (l) channel, the HRTF parameters from the left-front and left-rear channels are combined into a single contribution, using the weights wlf and wlr. The resulting composite parameters simulate the effect of both the front and rear channels in a statistical sense. The following equations are used to generate the binaural output pair (lb, rb) for the left channel:
The weights wlf and wlr depend on the CLD parameter of the 1-to-2 unit for lf and lr:
In a similar fashion, the binaural output for the right channel is obtained according to:
It is noted that the phase modification term is applied to the contra-lateral ear in both cases. Furthermore, since the human auditory system is largely insensitive to binaural phase for frequencies above approx. 2 kHz, the phase modification term only needs to be applied in the lower frequency region. Hence for the remainder of the frequency range, real-valued processing suffices (assuming real-valued mij).
It is further noted that the equations above assume incoherent addition of the (HRTF) filtered signals of lf and lr. One possible extension would be to include the transmitted Inter-Channel Coherence (ICC) parameters of lf and lr (and of rf and rr) in the equations as well to account for front/rear correlation.
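The incoherent combination of front and rear contributions might be sketched as follows. The weight formulas used here (a power ratio derived from a CLD value in dB, normalized so that the two squared weights sum to one) and the power-domain addition are assumptions of this sketch, chosen because they are a common way of converting a level difference into complementary weights; the names `front_rear_weights` and `composite_level` are hypothetical.

```python
import numpy as np

def front_rear_weights(cld_db):
    """Squared weights for the front and rear contributions from a CLD in dB."""
    ratio = 10.0 ** (np.asarray(cld_db, dtype=float) / 10.0)
    w_front_sq = ratio / (1.0 + ratio)
    w_rear_sq = 1.0 / (1.0 + ratio)
    return w_front_sq, w_rear_sq

def composite_level(p_front, p_rear, cld_db):
    """Combine front and rear HRTF levels incoherently (powers add)."""
    w_front_sq, w_rear_sq = front_rear_weights(cld_db)
    return np.sqrt(w_front_sq * np.square(p_front) + w_rear_sq * np.square(p_rear))

cld = np.array([0.0, 6.0, -6.0])   # front/rear level difference per band, in dB
p_lf = np.array([1.0, 1.0, 1.0])   # HRTF level, left-front position
p_lr = np.array([0.5, 0.5, 0.5])   # HRTF level, left-rear position
p_l = composite_level(p_lf, p_lr, cld)
```

With a positive CLD the front channel dominates and the composite level moves towards the front HRTF level; with a negative CLD it moves towards the rear HRTF level.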
All processing steps described above can be combined in the parameter domain to result in a single, signal-domain 2×2 matrix:
As will be clear from the above, the present invention essentially processes the binaural (that is, HRTF) information in the parameter domain, instead of in the frequency or time domain as in the Prior Art. In this way, significant computational savings may be obtained.
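Applying a 2×2 matrix per band to the two transform-domain downmix channels might be sketched as follows. The array layout (bands first, then slots), the use of one matrix per band held constant over the slots shown, and the function name `apply_binaural_matrix` are assumptions of this sketch; the matrix values are random placeholders rather than values derived from spatial or HRTF parameters.

```python
import numpy as np

def apply_binaural_matrix(h, l, r):
    """Apply one 2x2 matrix per band k to every time slot m.

    h: complex array of shape (bands, 2, 2).
    l, r: complex arrays of shape (bands, slots).
    """
    x = np.stack([l, r], axis=1)         # shape (bands, 2, slots)
    y = np.einsum("kij,kjm->kim", h, x)  # per-band matrix multiplication
    return y[:, 0, :], y[:, 1, :]

bands, slots = 3, 5
rng = np.random.default_rng(2)
l = rng.standard_normal((bands, slots)) + 1j * rng.standard_normal((bands, slots))
r = rng.standard_normal((bands, slots)) + 1j * rng.standard_normal((bands, slots))
h = rng.standard_normal((bands, 2, 2)) + 1j * rng.standard_normal((bands, 2, 2))
lb, rb = apply_binaural_matrix(h, l, r)
```

The cost per time/frequency tile is four complex multiplications and two additions, independent of the number of channels the downmix originally represented.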
The spatial decoder device 2 according to the present invention shown merely by way of non-limiting example in
In the configuration of
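$$ l_b = H_l(\mathit{lf})\,\mathit{lf} + H_l(\mathit{rf})\,\mathit{rf} + H_l(\mathit{lr})\,\mathit{lr} + H_l(\mathit{rr})\,\mathit{rr} + H_l(c)\,c \tag{16} $$
$$ r_b = H_r(\mathit{lf})\,\mathit{lf} + H_r(\mathit{rf})\,\mathit{rf} + H_r(\mathit{lr})\,\mathit{lr} + H_r(\mathit{rr})\,\mathit{rr} + H_r(c)\,c \tag{17} $$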
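For comparison, the direct evaluation of equations (16) and (17) might be sketched as follows. The sketch assumes that each channel and each transfer function is available as a complex spectrum of equal length, so that filtering reduces to an element-wise multiplication; the dictionaries and random values are illustrative placeholders only.

```python
import numpy as np

CHANNELS = ("lf", "rf", "lr", "rr", "c")

def binaural_sum(spectra, hrtf_left, hrtf_right):
    """Sum the HRTF-filtered channels for each ear, as in equations (16) and (17)."""
    lb = sum(hrtf_left[ch] * spectra[ch] for ch in CHANNELS)
    rb = sum(hrtf_right[ch] * spectra[ch] for ch in CHANNELS)
    return lb, rb

rng = np.random.default_rng(3)
bins = 16

def random_spectrum():
    return rng.standard_normal(bins) + 1j * rng.standard_normal(bins)

spectra = {ch: random_spectrum() for ch in CHANNELS}
hrtf_left = {ch: random_spectrum() for ch in CHANNELS}
hrtf_right = {ch: random_spectrum() for ch in CHANNELS}
lb, rb = binaural_sum(spectra, hrtf_left, hrtf_right)
```

This direct route needs ten filter operations per frequency bin and requires all five channels to be available as signals, which is the cost the parameter-domain estimation described next avoids.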
Given the spatial parameters which describe statistical properties and inter-relations of the channels lf, rf, lr, rr and c, and the parameters of the HRTF impulse responses, it is possible to estimate the statistical properties (that is, an approximation of the binaural parameters) of the binaural output pair lb, rb as well. More specifically, the average energy (for each channel), the average phase difference and the coherence can be estimated and subsequently re-instated by means of decorrelation and matrixing of the mono input signal.
The binaural parameters comprise a (relative) level change for each of the two binaural output channels (and hence define a Channel Level Difference parameter), an (average) phase difference and a coherence measure (per transform domain time/frequency tile).
As a first step, the relative powers (with respect to the power of the mono input signal) of the five (or six) channel (5.1) signal are computed using the transmitted CLD parameters. The relative power of the left-front channel is given by:
Similarly, the relative powers of the other channels are given by:
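$$ \sigma_{rf}^2 = r_1(\mathrm{CLD}_{fs})\, r_1(\mathrm{CLD}_{fc})\, r_2(\mathrm{CLD}_{f}) \tag{21a} $$
$$ \sigma_{c}^2 = r_1(\mathrm{CLD}_{fs})\, r_2(\mathrm{CLD}_{fc}) \tag{21b} $$
$$ \sigma_{ls}^2 = r_2(\mathrm{CLD}_{fs})\, r_1(\mathrm{CLD}_{s}) \tag{21c} $$
$$ \sigma_{rs}^2 = r_2(\mathrm{CLD}_{fs})\, r_2(\mathrm{CLD}_{s}) \tag{21d} $$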
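The computation of the relative channel powers might be sketched as follows. Two points are assumptions of this sketch rather than statements of the text: the functions r1 and r2 are taken to be the complementary power fractions obtained from a CLD value in dB, and the left-front power is taken to be the product of three r1 terms by symmetry with equation (21a).

```python
def r1(cld_db):
    """Fraction of the power assigned to the first output of a 1-to-2 split (assumed form)."""
    ratio = 10.0 ** (cld_db / 10.0)
    return ratio / (1.0 + ratio)

def r2(cld_db):
    """Fraction of the power assigned to the second output of a 1-to-2 split (assumed form)."""
    return 1.0 / (1.0 + 10.0 ** (cld_db / 10.0))

def relative_powers(cld_fs, cld_fc, cld_f, cld_s):
    """Relative powers of the five channels with respect to the mono input."""
    return {
        "lf": r1(cld_fs) * r1(cld_fc) * r1(cld_f),  # assumed by symmetry with (21a)
        "rf": r1(cld_fs) * r1(cld_fc) * r2(cld_f),  # (21a)
        "c": r1(cld_fs) * r2(cld_fc),               # (21b)
        "ls": r2(cld_fs) * r1(cld_s),               # (21c)
        "rs": r2(cld_fs) * r2(cld_s),               # (21d)
    }

powers = relative_powers(cld_fs=3.0, cld_fc=6.0, cld_f=0.0, cld_s=-2.0)
total = sum(powers.values())
```

Under these assumptions r1 and r2 sum to one for any CLD value, so the five relative powers also sum to one, which is consistent with their being expressed relative to the power of the mono input signal.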
The expected value of the relative power $\sigma_L^2$ of the left binaural output channel (with respect to the mono input channel), the expected value of the relative power $\sigma_R^2$ of the right binaural output channel, and the expected value of the cross product $L_B R_B^*$ can then be calculated. The coherence of the binaural output ($\mathrm{ICC}_B$) is then given by:
and the average phase angle (IPDB) is given by:
$$ \mathrm{IPD}_B = \arg\left(L_B R_B^*\right) \tag{23} $$
The channel level difference (CLDB) of the binaural output is given by:
Finally, the overall (linear) gain of the binaural output compared to the mono input, gB, is given by:
$$ g_B = \sqrt{\sigma_L^2 + \sigma_R^2} \tag{25} $$
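The four binaural parameters might be computed from the estimated statistics as sketched below. The phase and gain follow equations (23) and (25); the coherence (magnitude of the cross product normalized by the two powers) and the level difference (power ratio in dB) use conventional definitions that are assumptions of this sketch. The function name and the numeric inputs are illustrative only.

```python
import cmath
import math

def binaural_parameters(sigma_l_sq, sigma_r_sq, cross):
    """Derive binaural parameters from relative powers and the expected cross product."""
    icc = abs(cross) / math.sqrt(sigma_l_sq * sigma_r_sq)  # assumed normalized coherence
    ipd = cmath.phase(cross)                               # equation (23)
    cld = 10.0 * math.log10(sigma_l_sq / sigma_r_sq)       # assumed dB level difference
    gain = math.sqrt(sigma_l_sq + sigma_r_sq)              # equation (25)
    return {"icc": icc, "ipd": ipd, "cld": cld, "gain": gain}

params = binaural_parameters(0.6, 0.3, 0.2 * cmath.exp(0.5j))
```

All four values are scalars per time/frequency tile, so this step operates at the parameter rate rather than at the signal rate.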
The matrix coefficients required to re-instate the IPDB, CLDB, ICCB and gB parameters in the binaural matrix are simply obtained from a conventional parametric stereo decoder, extended with overall gains gB:
Further embodiments of the spatial decoder unit of the present invention may contain a reverberation unit. It has been found that adding reverberation improves the perceived distance when binaural sound is produced. For this reason, the spatial decoder unit 23 of
In the embodiment of
The present invention additionally provides a consumer device, such as a hand-held consumer device, and an audio system comprising a spatial decoder unit or spatial decoder device as defined above. The hand-held consumer device may be constituted by an MP3 player or similar device. A consumer device is schematically illustrated in
The present invention is based upon the insight that the computational complexity of a combined spatial decoder device and a binaural synthesis device may be significantly reduced by modifying the spatial parameters in accordance with the binaural information. This allows the spatial decoder device to carry out spatial decoding and perceptual transfer function processing effectively in the same signal processing operation, while avoiding the introduction of any artifacts.
It is noted that any terms used in this document should not be construed so as to limit the scope of the present invention. In particular, the words “comprise(s)” and “comprising” are not meant to exclude any elements not specifically stated. Single (circuit) elements may be substituted with multiple (circuit) elements or with their equivalents.
It will be understood by those skilled in the art that the present invention is not limited to the embodiments illustrated above and that many modifications and additions may be made without departing from the scope of the invention as defined in the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
05108405 | Sep 2005 | EP | regional |
06110231 | Feb 2006 | EP | regional |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/IB2006/053040 | 8/31/2006 | WO | 00 | 3/12/2008 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2007/031896 | 3/22/2007 | WO | A |
Number | Date | Country | |
---|---|---|---|
20080205658 A1 | Aug 2008 | US |