The invention relates to the processing of sound data for the purpose of spatialized sound playing.
The three-dimensional spatialization (called “3D rendition”) of compressed audio signals takes place in particular during the decompression of a 3D audio signal, for example compression-encoded and represented on a certain number of channels, onto a different number of channels (two for example in order to allow playing 3D audio effects on a headset).
The term “binaural” means playing on a stereophonic headset a sound signal which nevertheless has spatialization effects. The invention is not however limited to the aforesaid technique and applies, in particular, to techniques derived from “binaural”, such as the techniques of playing sound called TRANSAURAL (registered trademark), i.e. on distant loud speakers. Such techniques can then use “cross-talk cancellation”, which consists in cancelling crossed acoustic channels, such that a sound thus processed and then emitted by the loud speakers can be perceived by only one of the two ears of a listener. These two techniques of playing sound, binaural and transaural, will be denoted below by the same terms “binaural sound restitution”.
Thus, more generally, the invention relates to the transmission of multi-channel audio signals and to their conversion for a spatialized sound restitution (with 3D rendition) on two channels. The restitution device (simple headset with earphones for example) is most often imposed by a user's equipment. The conversion can for example be for the purpose of sound restitution of a scene initially in the 5.1 multi-channel format (or 7.1, or another) by a simple audio listening headset (in binaural technique).
The invention also of course relates to the restitution, in the context of a game or of a video recording for example, of one or more sound samples stored in files, in order to spatialize them.
Among the techniques known in the field of binaural sound spatialization, different approaches have been proposed.
In particular, dual-channel binaural synthesis consists, with reference to
These transfer functions, commonly called “HRTF” functions (Head Related Transfer Functions), represent the acoustic transfer between the positions in space and the auditory canal of each of the listener's ears. The term “HRIR” (for “Head Related Impulse Response”) refers to their temporal form or impulse response. These HRIR functions can furthermore include a room effect.
For each sound source Si, two signals (left and right) are obtained which are then added to the left and right signals resulting from the spatialization of all the other sound sources, in order to produce finally the signals L and R which are delivered to the left and right ears of the listener through two respective loud speakers (earphones of a headset in binaural technique or loud speakers in transaural technique).
If N denotes the number of incident sound or audio flux sources to be spatialized, the number of filters, or transfer functions, necessary for the binaural synthesis is 2×N for a rendition in static binaural spatialization, and 4×N for a rendition in dynamic binaural spatialization (with transitions of the transfer functions).
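As a minimal sketch of the static case described above, the following Python fragment filters each of N sources with a left and a right HRIR (2×N filters in total) and sums the results into the signals L and R. The function names, the list-based signal representation and the toy impulse responses are assumptions made for illustration only; they are not taken from the description.

```python
def convolve(signal, impulse_response):
    """Direct-form FIR convolution of two finite sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, x in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += x * h
    return out


def binaural_synthesis(sources, hrirs_left, hrirs_right):
    """Sum the HRIR-filtered sources into left and right ear signals.

    sources[i] is the i-th source signal; hrirs_left[i] and hrirs_right[i]
    are the (hypothetical) impulse responses from that source position to
    the left and right ears, so 2 x N filters are used for N sources.
    """
    length = max(len(s) + max(len(hl), len(hr)) - 1
                 for s, hl, hr in zip(sources, hrirs_left, hrirs_right))
    left = [0.0] * length
    right = [0.0] * length
    for s, hl, hr in zip(sources, hrirs_left, hrirs_right):
        for n, v in enumerate(convolve(s, hl)):
            left[n] += v
        for n, v in enumerate(convolve(s, hr)):
            right[n] += v
    return left, right


# Toy data: two sources, each with a 2-tap HRIR per ear (illustrative values).
sources = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
hrirs_left = [[1.0, 0.5], [0.25, 0.0]]
hrirs_right = [[0.25, 0.0], [1.0, 0.5]]
left, right = binaural_synthesis(sources, hrirs_left, hrirs_right)
```

A dynamic rendition would, under the same assumptions, run two such filter sets per source and cross-fade between them, which is where the 4×N count comes from.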
The processing described above with reference to
Nevertheless, the invention starts from another type of prior art.
There are compression techniques, often in a transformed domain, of signals in a multi-channel format in order to be able to convey these signals, in particular through telecommunication networks, on a restricted number of channels, for example on only one or two channels. Thus, for the transmission of a signal in a multi-channel format comprising more than two channels (for example 5.1, 7.1 or other), an encoder compresses the multi-channel signal on only one or two channels (typically according to the data rate offered on the telecommunications network) and furthermore delivers spatialization information. This embodiment is shown in
With reference to
Many types of parametric encoders/decoders, in particular standardized ones, offer such possibilities.
Audio encoders (AAC, MP3) use time-frequency representations of signals for compressing the information. These representations are based on an analysis by banks of filters or by time-frequency transformation of the MDCT (Modified Discrete Cosine Transform) type. In the case where a binaural spatialization must be carried out after an audio decoding, the filtering operations are advantageously carried out directly in the transformed domain.
Recent work on filtering subbands in the transformed domain has made it possible to formalize the filtering architecture for a bank of filters commonly used in audio encoders. It will be useful to refer to the document:
A more recent transformed domain filtering technique of complex QMFs (Quadrature Mirror Filters) has been proposed in the “MPEG Surround” standard. This technique aims at the conversion of the (finite) impulse response of the temporal filter referenced h(v) into a set of M complex filters referenced hm(l), where M is the number of frequency subbands. The conversion is carried out by analysis of the temporal filter h(v) by a bank of complex filters similar to the bank of QMF filters used for the analysis of the signal. In an example of embodiment, the prototype filter q(v) used for generating the conversion filter bank can be of length 192. An extension with zeros of the temporal filter is defined by the following formula:
m = 0, 1, . . . , 63, corresponding to the index of the subband
l = 0, 1, . . . , Kh+1, corresponding to the temporal index in the decimated domain of the subbands.
In more generic terms, it will be understood that such processing, directly in the transformed domain, makes it possible to change from a representation of the compressed signal on two channels L, R into a representation of the signal on two restitution channels L-BIN, R-BIN (
Thus, now referring to
Thus, the subband filters in the transformed domain are calculated for each ear and for each of the five positions of the loud speakers. This technique is often called the “virtual loud speakers technique”.
Using the representation in subbands of the binaural filters determined as described above from HRTF transfer functions, the binaural spatialization can then be advantageously carried out by applying these binaural filters in the transformed domain within the audio decoder DECOD BIN such as shown in
Thus, this type of decoder DECOD BIN uses a monophonic or stereophonic representation (compressed channels L, R) of the multi-channel audio scene, a representation with which are associated spatialization parameters SPAT (which can consist, for example, in energy differences between channels and correlation indices between channels). These SPAT parameters are used in the decoding in order to reproduce the original multi-channel sound scene as well as possible.
Moreover, when the original signal is encoded by a parametric encoder (for example in the sense of recent work in the “MPEG Surround” standard), in addition to the monophonic or stereophonic signal transmitted and spatialization information, the decoding can use decorrelated representations of these signals L, R (which are obtained, for example, by the application of all-pass decorrelation filters or reverberation filters). These signals are then adjusted in energy using the inter-channel energy differences and then recombined in order to obtain the multi-channel signal for the purpose of restitution.
In particular, the parametric encoder (ENCOD—
A description of preparatory work for this standard is given at the following URL address:
http://www.chiariglione.org/mpeg/technologies/mpd-mps/index.htm and details regarding such an encoder according to this draft can be found in:
“MPEG Spatial Audio Coding/MPEG Surround: Overview and Current Status”, J. Breebaart et al., in 119th Conv. Aud. Eng. Soc (AES), New York, N.Y., USA, October 2005.
In the case of a parametric audio decoder for binaural restitution (DECOD BIN—
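$$h_{L,L} = g_{L,L}\,\sigma_{FL}\exp\!\left(-j\,\phi_{FL,BL}^{L}\,\sigma_{BL}^{2}\right)h_{L,FL} + g_{L,L}\,\sigma_{BL}\exp\!\left(j\,\phi_{FL,BL}^{L}\,\sigma_{FL}^{2}\right)h_{L,BL} \qquad (1)$$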
In this expression:
With reference to
With reference to the
In
The two signals L-BIN and R-BIN resulting from these filterings can then be applied to two loud speakers intended for the left ear and for the right ear respectively of the listener after changing from the transformed domain to the temporal domain.
However, a problem linked with this combination of filters for a binaural restitution is that it does not take account of a possible decorrelation between the front and back channels. This information, nevertheless used in the decoding of a 5.1 scene of an encoder according to the aforesaid draft of the MPEG Surround standard, is not used in the binaural decoding technique. Thus, when the sound scene comprises decorrelation effects between the front and back channels (for example for reverberated signals), this information is not used in the combination of HRTF filters, which results in a degradation of the spatialization quality and in particular of the surround effect of the 3D audio scene. The restitution in the binaural format is not therefore optimal.
The present invention improves the situation.
It firstly relates to a method of processing sound data for a three-dimensional spatialized restitution on two restitution channels for the respective ears of a listener,
the sound data being initially represented in a multi-channel format and then compression-encoded on a reduced number of channels (for example one or two channels),
said initial multi-channel format consisting in providing more than two channels able to feed respective loud speakers,
the method comprising the steps:
The method according to the invention furthermore comprises the following steps:
The spatialized restitution on two channels, according to the invention, can be in either the binaural or transaural format. The initial multi-channel format can be of the ambisonic type (aimed at the decomposition of the sound signal on a spherical harmonics basis). As a variant, it can be a 5.1 or 7.1 or even 10.2 format. It will therefore be understood that for these latter types of format using channels intended to respectively feed at least front left/back left pairs of loud speakers on the one hand and front right/back right pairs of loud speakers on the other hand, the decorrelation cue can relate to the respective channels of the front/back loud speakers preferably associated with a same ear (left or right).
According to one advantage provided by the invention, because this decorrelation cue at the back of a 3D scene is represented in the binaural or transaural restitution, a better representation of ambiances is obtained, for example crowd noises or a reverberation at the back of a scene, or other, unlike the embodiments of the prior art.
In a particular embodiment, the combination of filters comprises a weighting, according to a coefficient chosen between:
This weighting advantageously makes it possible to favour the unprocessed transfer function of this back loud speaker, or the decorrelated version of that unprocessed transfer function, depending on whether the signal in the back channel of the initial multi-channel format is correlated or not with at least one signal of one of the front channels.
Moreover, in a particular embodiment, the combination of filters associated with a restitution channel comprises at least one grouping forming a filter on the basis of:
Advantageously, the compression-encoding uses a parametric encoder delivering, in the compressed flow including the spatialization parameters, a cue of decorrelation between channels of the multi-channel format, on the basis of which said weighting can be determined in a dynamic manner.
Thus, in this embodiment, for a transcoding from a multi-channel format to a binaural format, the said combination of transfer functions makes use of the cues already present concerning the correlation between signals of channels in the multi-channel format, these cues being simply provided by the parametric encoder, with the said spatialization parameters.
By way of example, it is recalled that the parametric decoder according to the draft MPEG Surround standard delivers such inter-channel decorrelation cues in the 5.1 multi-channel format.
Other advantages and features of the invention will become apparent on reading the detailed description given hereafter by way of example, and on observation of the appended drawings, in which, apart from
With reference to
Thus, in general terms, the HRTF functions of front and back loud speakers on a same side of the listener are grouped in order to construct each filter from a combination of filters belonging to a restitution channel to one ear of a listener. A grouping of HRTF functions in order to construct a filter is for example an addition, subject to multiplying coefficients, an example of which will be described below.
According to the invention, there is also determined from the retrieved SPAT parameters, a decorrelated version of the HRTF functions of the loud speakers situated behind the listener (paths C, D, E and F of
As a purely illustrative example, the initial sound data can be in the 5.1 multi-channel format and, with reference to
A similar processing is provided in order to construct the signal intended to feed the other binaural restitution channel R-BIN shown in
Finally, the combinations of filters integrating the decorrelated versions of the HRTF functions of the back loud speakers are applied to the compressed channels L and R in order to deliver the restitution channels L-BIN and R-BIN, for spatialized binaural restitution with 3D rendition.
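The application of the filter combinations to the compressed channels can be sketched as follows. This is only an illustration under simplifying assumptions: each filter is reduced to a single complex tap per subband (the actual subband filters may have several taps in the decimated domain), the first index of each filter name denotes the ear and the second the compressed channel, and all names and numerical values are hypothetical.

```python
def apply_binaural_filters(l_sub, r_sub, h_ll, h_lr, h_rl, h_rr):
    """Mix the compressed channels L and R into L-BIN and R-BIN.

    Every argument is a list indexed by subband. h_ll and h_lr feed the
    left ear from channels L and R; h_rl and h_rr feed the right ear.
    One complex tap per subband is assumed for brevity.
    """
    l_bin = [hll * l + hlr * r
             for l, r, hll, hlr in zip(l_sub, r_sub, h_ll, h_lr)]
    r_bin = [hrl * l + hrr * r
             for l, r, hrl, hrr in zip(l_sub, r_sub, h_rl, h_rr)]
    return l_bin, r_bin


# Two subbands of one time slot, with illustrative values.
l_sub = [1 + 0j, 2 + 0j]
r_sub = [0 + 1j, 1 + 0j]
l_bin, r_bin = apply_binaural_filters(
    l_sub, r_sub,
    h_ll=[1.0, 1.0], h_lr=[0.5, 0.0],
    h_rl=[0.0, 0.5], h_rr=[1.0, 1.0],
)
```

The resulting subband signals would then still have to be converted from the transformed domain back to the temporal domain before restitution.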
In the examples shown in
In an advantageous embodiment, the initial sound data are in the 5.1 multi-channel format and are compression-encoded by a parametric encoder according to the abovementioned draft MPEG Surround standard. More particularly, during such encoding, it is possible to obtain, from the spatialization parameters provided, a decorrelation cue between the back right channel and the front right channel (loud speakers HP-BR and HP-FR respectively of
These decorrelation cues, in a 5.1 format, aim to make the restitution of the back loud speakers as independent as possible from the restitution of the front loud speakers, in order to enhance, in 5.1 format, the effect of surrounding by noises of reverberation or of the audience for concert recordings for example. It is recalled that this enhancement of 3D surround has not been proposed in binaural restitution and an advantage of the invention is to benefit from the availability of decorrelation cues among the spatialization parameters SPAT in order to construct decorrelated versions of the HRTF functions which are advantageously integrated in the combinations of filters for a binaural restitution.
According to another advantage, these combinations of filters can be calculated directly in the transformed domain, for example in the subbands domain, and the filters representing the decorrelated versions of the HRTF functions of the back loud speakers can be obtained for example by applying to the initial HRTF functions a phase shift depending on the frequency subband in question.
More generally, the decorrelation filters can be so-called “natural” reverberation filters (recorded in a particular acoustic environment such as a concert hall for example), or “synthetic” reverberation filters (created by summation of multiple reflections of decreasing amplitude over time). The application of a decorrelated filter can therefore amount to applying to the signal broken down into frequency subbands a different phase shift in each of the subbands, combined with the addition of an overall delay. In the case of a parametric decoder of the aforesaid type (formula (1) given previously in the description of the prior art), this amounts to multiplying each frequency subband by a complex exponential, having a different phase in each subband. These decorrelation filters can therefore correspond to syntheses of phase-shifting all-pass filters.
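The phase-shifting operation described above can be illustrated by a short sketch in which each subband of a filter is multiplied by a complex exponential having a different phase. The linear phase law, the four-subband filter and the function name are assumptions chosen for the example; the overall delay mentioned above is deliberately left out.

```python
import cmath
import math


def decorrelate_subband_filter(h_subbands, phases):
    """Multiply each subband of a filter by exp(j * phase) (all-pass).

    h_subbands[m] is the list of complex taps of subband m; phases[m] is
    the phase shift, in radians, applied to that subband.
    """
    return [[tap * cmath.exp(1j * phi) for tap in taps]
            for taps, phi in zip(h_subbands, phases)]


# Illustrative 4-subband filter with one tap per subband.
h_back = [[1.0 + 0.0j], [0.5 + 0.5j], [0.0 + 1.0j], [0.2 - 0.1j]]
# Hypothetical phase law: a different phase in each subband.
phases = [math.pi * m / 4 for m in range(len(h_back))]
h_back_decorr = decorrelate_subband_filter(h_back, phases)
```

Because each multiplier has unit modulus, the magnitude of every subband is preserved and only its phase changes, which is the all-pass behaviour referred to in the text.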
Advantageously, a weighting is applied between the transfer function of a back loud speaker and its decorrelated version in a same grouping forming a filter. Thus, taking again the formula (1) given previously for the calculation of a filter, for example hL,L for the left ear, weighting coefficients α and (1−α) and the decorrelated version of a transfer function are introduced as follows:
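$$h_{L,L} = g_{L,L}\,\sigma_{FL}\exp\!\left(-j\,\phi_{FL,BL}^{L}\,\sigma_{BL}^{2}\right)h_{L,FL} + g_{L,L}\,\sigma_{BL}\exp\!\left(j\,\phi_{FL,BL}^{L}\,\sigma_{FL}^{2}\right)\left(\alpha\,h_{L,BL} + (1-\alpha)\,h_{L,BL}^{Decorr}\right)$$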
with the same notations as explained previously and where hDecorrL,BL represents the decorrelated version of the transfer function of the back left loud speaker. Equations of the same type are of course provided for the other filters hL,R, hR,R and hR,L (
For example, for the filter hL,R for the crossed paths to the left ear, the expression is:
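$$h_{L,R} = g_{L,R}\,\sigma_{FR}\exp\!\left(-j\,\phi_{FR,BR}^{L}\,\sigma_{BR}^{2}\right)h_{L,FR} + g_{L,R}\,\sigma_{BR}\exp\!\left(j\,\phi_{FR,BR}^{L}\,\sigma_{FR}^{2}\right)\left(\alpha\,h_{L,BR} + (1-\alpha)\,h_{L,BR}^{Decorr}\right)$$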
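The weighted expressions above share one structure, which the following sketch evaluates for a single subband tap. The function and parameter names are hypothetical, and the numerical values are chosen only so that the result is easy to check by hand.

```python
import cmath


def combine_filters(g, sigma_f, sigma_b, phi, h_front, h_back,
                    h_back_decorr, alpha):
    """One subband tap of the grouped filter for one ear.

    Evaluates
        g*sigma_f*exp(-j*phi*sigma_b**2) * h_front
      + g*sigma_b*exp(+j*phi*sigma_f**2)
          * (alpha*h_back + (1 - alpha)*h_back_decorr)
    """
    front = g * sigma_f * cmath.exp(-1j * phi * sigma_b ** 2) * h_front
    # Weighted mix of the unprocessed and decorrelated back transfer functions.
    back_mix = alpha * h_back + (1.0 - alpha) * h_back_decorr
    back = g * sigma_b * cmath.exp(1j * phi * sigma_f ** 2) * back_mix
    return front + back


# Illustrative values: zero phase and unit gains make the sum easy to follow.
tap = combine_filters(g=1.0, sigma_f=1.0, sigma_b=1.0, phi=0.0,
                      h_front=1.0, h_back=2.0, h_back_decorr=4.0, alpha=0.5)
```

With alpha equal to 1 the decorrelated term vanishes and the expression reduces to the unweighted combination of the front and back transfer functions.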
More specifically, the weighting uses different pairs of coefficients, α1, (1−α1) or α2, (1−α2), depending on whether the back loud speaker is on the same side as the ear in question (α=α1, giving the filters hL,L and hR,R) or not (α=α2, giving the filters hL,R and hR,L). Preferentially, the decorrelated version is favoured for the crossed paths (back right loud speaker for the left ear and back left loud speaker for the right ear), such that the coefficient α1 will generally be greater than the coefficient α2.
In practice, the coefficients α (α1 or α2) are given by variable weighting functions in such a way as to dynamically favour the unprocessed version of the HRTF function of the back loud speaker or its decorrelated version depending on whether or not the back signal is correlated with the front signal. A better representation of ambiances (crowd noise, reverberation or other) is thus obtained in the 3D rendition.
The weighting function α can be defined dynamically because of the decorrelation cue provided with the spatialization parameters in the following way, given as a non-limitative example:
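$$\alpha = \begin{cases}\operatorname{sqrt}\!\left(\operatorname{abs}(ICC_{L})\right), & \text{if } \operatorname{abs}(ICC_{L}) > \sigma_{BL}^{2}\\ \operatorname{sqrt}\!\left(\sigma_{BL}^{2}\right), & \text{otherwise,}\end{cases}$$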
where the notation “sqrt” refers to the “square root” function, the notation “abs” refers to the “absolute value” function and the term ICCL represents the decorrelation cue (otherwise called the “correlation index”) between the front channel and the back channel on the same left side, which is part of the spatialization parameters transmitted by the encoder according to the draft MPEG Surround standard mentioned above. As described above, the term σBL represents the target energy of the back left channel when it is a matter of determining the coefficient α in order to calculate the filter hL,L (α=α1). An equivalent expression can of course be applied in order to calculate the weighting coefficient α used in the similar filter hR,R for the direct acoustic paths to the right ear. However, for the filters hL,R and hR,L for the crossed paths, for example for the filter hL,R for the crossed paths to the left ear, the coefficient α=α2 can preferably be written:
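$$\alpha_{2} = \begin{cases}\operatorname{abs}(ICC_{R}), & \text{if } \operatorname{abs}(ICC_{R}) > \sigma_{BR}^{2}\\ \sigma_{BR}^{2}, & \text{otherwise,}\end{cases}$$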
the term σBR representing the target energy of the back right channel and the term ICCR representing the correlation between the front right channel and the back right channel.
It will be noted that the “sqrt” function no longer applies for the crossed paths and for the calculation of the corresponding coefficient α2 in the described example. Since the target energies and the correlation indices are terms comprised between 0 and 1, and the square root of such a term is at least equal to the term itself, the coefficient α2 is generally lower than the coefficient α1.
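The two weighting rules of this non-limitative example can be written as the following short sketch. The function names are hypothetical; the parameter `sigma_back` stands for σBL or σBR and is squared as in the expressions above, and the input values are purely illustrative.

```python
import math


def alpha_direct(icc, sigma_back):
    """Coefficient for the direct paths (filters hL,L and hR,R).

    Returns sqrt(abs(icc)) if abs(icc) > sigma_back**2,
    and sqrt(sigma_back**2) otherwise.
    """
    energy = sigma_back ** 2
    return math.sqrt(abs(icc)) if abs(icc) > energy else math.sqrt(energy)


def alpha_crossed(icc, sigma_back):
    """Coefficient for the crossed paths (filters hL,R and hR,L).

    Same rule as alpha_direct but without the square root.
    """
    energy = sigma_back ** 2
    return abs(icc) if abs(icc) > energy else energy


# Illustrative values: correlation index 0.64 and sigma_back 0.5.
a1 = alpha_direct(0.64, 0.5)
a2 = alpha_crossed(0.64, 0.5)
```

For identical inputs between 0 and 1, the crossed-path coefficient is never greater than the direct-path one, so the decorrelated version of the back transfer function receives more weight on the crossed paths.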
The combination of overall filters, for the L-BIN channel, comprises groupings of HRTF functions forming filters hL,L and hL,R obtained by the formulae given previously, and, in each grouping, the HRTF function of a front loud speaker, the HRTF function of a back loud speaker and a decorrelated version of this latter HRTF function are used, which makes it possible to represent a decorrelation between the front and back channels directly in the combination of filters, and therefore directly in the binaural synthesis.
It is recalled that, as the sound data L, R (or M) are compression-encoded in a transformed domain, the combination of filters can be applied directly in the transformed domain as a function of the target energies (σFL, σBL, σFR, σBR) associated with the channels of the multi-channel format, these target energies being determined from the spatialization parameters SPAT. In this embodiment, there is of course then provision for changing from the transformed domain to the temporal domain again for the actual restitution in the binaural context (the TRANS modules in
The present invention also relates to a decoding module DECOD BIN such as shown by way of example in
The present invention also relates to a computer program intended to be stored in a memory of a decoding module, such as the memory MEM of the module DECOD-BIN shown in
Number | Date | Country | Kind |
---|---|---|---|
0606212 | Jul 2006 | FR | national |

Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/FR2007/051457 | 6/19/2007 | WO | 00 | 1/6/2009 |