This application is a national stage entry of International Application No. PCT/FR2007/050867, filed on Mar. 1, 2007, and claims priority to French Application No. 06 02098, filed Mar. 9, 2006, both of which are hereby incorporated by reference as if fully set forth herein in their entireties.
The present invention is concerned with processing sound signals for their spatialization.
Spatialized sound reproduction allows a listener to perceive sound sources originating from any direction or position in space.
The particular spatialized techniques of sound reproduction to which the present invention pertains are based on the acoustic transfer functions for the head between the positions in space and the auditory canal. These transfer functions termed “HRTF” (for “Head Related Transfer Functions”) relate to the frequency shape of the transfer functions. Their temporal shape will be denoted hereinafter by “HRIR” (for “Head Related Impulse Response”).
Additionally, the term “binaural” is concerned with reproduction on a stereophonic headset, but with spatialization effects. The present invention is not limited to this technique and applies in particular also to techniques derived from binaural such as so-called “transaural” reproduction techniques, that is to say those on remote loudspeakers. Such techniques can then use what is called “crosstalk cancellation” which consists in canceling the acoustic cross-paths in such a way that a sound, thus processed then emitted by the loudspeakers, can be perceived only by one of a listener's two ears.
The term “multichannel”, in processing for spatialized sound reproduction, consists in producing a representation of the acoustic field in the form of N signals (termed spatial components). These signals contain the whole set of sounds which make up the sound field, but with weightings which depend on their direction (or “incidence”) and described by N associated spatial encoding functions. The reconstruction of the sound field, for reproduction at a chosen point, is then ensured by N′ spatial decoding functions (usually with N=N′).
In the particular case of binaural, this decomposition makes it possible to carry out so-called “multichannel binaural” encoding and decoding. The decoding functions (which in reality are filters), associated with a given suite of spatial encoding functions (which in reality are encoding gains), when they are optimum in reproduction, ensure a feeling of perfect immersion of the listener within a sound scene, whereas in reality he has, for binaural reproduction, only two loudspeakers (earpieces of a headset or remote loudspeakers).
The advantages of a multichannel approach for binaural techniques are manifold since the encoding step is independent of the decoding step.
Thus, in the case of composition of a virtual sound scene on the basis of synthesized or recorded signals, the encoding is generally inexpensive in terms of memory and/or calculations since the spatial functions are gains which depend solely on the incidences of the sources to be encoded and not on the number of sources themselves. The cost of the decoding is also independent of the number of sources to be spatialized.
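The separation between encoding by gains and decoding by filters can be illustrated by a minimal sketch. The array layouts, the sizes and the function names `encode` and `decode` are assumptions made for illustration only and do not come from the cited documents; the data are random placeholders.

```python
import numpy as np

def encode(sources, gains):
    """Mix S mono sources into N spatial channels.

    sources: (S, T) array of source signals.
    gains:   (S, N) array; gains[s, n] is the encoding gain of source s on
             channel n and depends only on the incidence of that source.
    Returns an (N, T) array of encoded channels.
    """
    return gains.T @ sources

def decode(channels, filters_left, filters_right):
    """Filter each channel and sum, giving the two binaural signals.

    channels: (N, T); filters_left / filters_right: (N, K) FIR filters.
    The number of convolutions is N per ear, whatever the number of sources.
    """
    left = sum(np.convolve(c, f) for c, f in zip(channels, filters_left))
    right = sum(np.convolve(c, f) for c, f in zip(channels, filters_right))
    return left, right

rng = np.random.default_rng(0)
S, N, T, K = 3, 5, 64, 16                 # hypothetical sizes
sources = rng.standard_normal((S, T))     # placeholder signals
gains = rng.standard_normal((S, N))       # placeholder encoding gains
fl = rng.standard_normal((N, K))          # placeholder left-ear filters
fr = rng.standard_normal((N, K))          # placeholder right-ear filters
left, right = decode(encode(sources, gains), fl, fr)
```

In this sketch, adding a source adds one row to `gains` and N multiply-accumulate operations per sample on encoding, while the decoding stage is unchanged.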
In the case furthermore of a real sound field measured by an array of microphones and encoded according to known spatial functions, it is nowadays possible to find decoding functions which allow satisfactory binaural listening.
Finally, the decoding functions can be individualized for each of the listeners.
The present invention is concerned in particular with improved obtainment of the decoding filters and/or of the encoding gains in the multichannel binaural technique. The context is as follows: sources are spatialized by multichannel encoding and the reproduction of the spatially encoded content is performed by applying appropriate decoding filters.
The reference WO-00/19415 discloses a multichannel binaural processing which provides for the calculation of decoding filters. Denoting by:
this document WO-00/19415 essentially envisages two steps for obtaining filters on the basis of these spatial functions.
The delays are extracted from each HRTF. Specifically, the shape of a head is customarily such that, for a given position, a sound reaches one ear a certain time before reaching the other ear (a sound situated to the left reaching the left ear before reaching the right ear, of course). The difference in delay t between the two ears is an interaural index of location called the ITD (for “Interaural Time Difference”). New HRTF bases, denoted $\underline{L}$ and $\underline{R}$, are then defined by:
$L(\theta_p,\varphi_p,f) = T_L(\theta_p,\varphi_p)\,\underline{L}(\theta_p,\varphi_p,f)$ for $p = 1, 2, \ldots, P$
$R(\theta_p,\varphi_p,f) = T_R(\theta_p,\varphi_p)\,\underline{R}(\theta_p,\varphi_p,f)$ for $p = 1, 2, \ldots, P$
where $T_L$ and $T_R$ denote the delays extracted for the left ear and the right ear respectively.
Decoding filters Li(f) and Ri(f) for channel i which satisfy the equations:
To obtain these filters, this document proposes a procedure termed “calculation of the pseudo-inverse”, which seeks to satisfy the previous equations in the least squares sense, i.e.:
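$\underline{L} = G\,L \;\rightarrow\; L = (G^{T}G)^{-1}\,G^{T}\,\underline{L}$, where $\underline{L}$ gathers the delay-free HRTFs at the P positions, $G$ the encoding gains and $L$ the decoding filters $L_i(f)$.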
The implementation of such a technique therefore requires the reintroduction of a delay corresponding to the ITD at the moment of encoding each sound source. Each source is therefore encoded twice (once for each ear). Document WO-00/19415 specifies that it is possible not to extract the delays but that the sound rendition quality would then be worse. In particular, the quality is better, even with fewer channels, if the delays are extracted.
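The extraction of the delays can be illustrated by a minimal sketch. The onset-threshold estimator used here is an assumption chosen for simplicity; the cited documents do not specify it, and the names `onset_delay` and `extract_delay` are hypothetical. The impulse responses are toy data.

```python
import numpy as np

def onset_delay(hrir, threshold=0.1):
    """Index of the first sample whose magnitude reaches threshold * peak."""
    env = np.abs(hrir)
    return int(np.argmax(env >= threshold * env.max()))

def extract_delay(hrir, threshold=0.1):
    """Return (delay in samples, delay-free HRIR padded to the same length)."""
    d = onset_delay(hrir, threshold)
    aligned = np.concatenate([hrir[d:], np.zeros(d)])
    return d, aligned

# Toy pair of impulse responses: the right ear lags the left by 7 samples.
pulse = np.array([1.0, 0.5, 0.25])
left = np.zeros(32)
left[3:6] = pulse
right = np.zeros(32)
right[10:13] = pulse

d_left, left0 = extract_delay(left)
d_right, right0 = extract_delay(right)
itd_samples = d_right - d_left   # interaural time difference, in samples
```

With such an approach the stored delays `d_left` and `d_right` have to be reapplied per source and per ear at encoding time, which is why each source is encoded twice.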
Additionally, a second approach, proposed in document U.S. Pat. No. 5,500,900, for jointly calculating the decoding filters and the spatial encoding functions, consists in decomposing the HRIR suites by performing a principal component analysis (PCA) then by selecting a reduced number of components (which corresponds to the number of channels).
An equivalent approach, proposed in U.S. Pat. No. 5,596,644, uses a singular value decomposition (SVD) instead. If the delays are extracted from the HRIRs before decomposition and then used at the moment of encoding, reconstruction of the HRIRs is very good with a reduced number of components.
When the delays are left in the original filters, the number of channels must be increased so as to obtain good quality reconstruction.
Moreover, these prior art techniques do not make it possible to have universal spatial encoding functions. Specifically, the decomposition gives different spatial functions for each individual.
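The decomposition approach can be sketched with a truncated singular value decomposition. This is only an illustration under stated assumptions: the HRIR suite is stored as a (positions x samples) matrix, the data are a synthetic low-rank placeholder, and the name `svd_decompose` is hypothetical.

```python
import numpy as np

def svd_decompose(hrirs, n_channels):
    """Split a (P, T) HRIR matrix into spatial functions and filters.

    Returns spatial (P, n_channels) and filters (n_channels, T) such that
    spatial @ filters is the best rank-n_channels approximation of hrirs.
    """
    u, s, vt = np.linalg.svd(hrirs, full_matrices=False)
    spatial = u[:, :n_channels] * s[:n_channels]   # data-dependent gains
    filters = vt[:n_channels]                      # basis filters
    return spatial, filters

rng = np.random.default_rng(1)
P, T, rank = 24, 40, 3
# Synthetic rank-3 placeholder, not measured HRIRs.
hrirs = rng.standard_normal((P, rank)) @ rng.standard_normal((rank, T))

g3, f3 = svd_decompose(hrirs, 3)
g2, f2 = svd_decompose(hrirs, 2)
err3 = np.linalg.norm(hrirs - g3 @ f3)
err2 = np.linalg.norm(hrirs - g2 @ f2)
```

Because `spatial` is computed from the HRIR data themselves, a different individual yields different spatial functions, which is the lack of universality noted above.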
It is also indicated that multichannel binaural can also be viewed as the simulation in binaural of a multichannel rendition on a plurality of loudspeakers (more than two). One then speaks of the so-called “virtual loudspeaker” procedure when, nevertheless, binaural reproduction is effected, according to this approach, solely on two earpieces of a headset or on two remote loudspeakers. The principle of such reproduction consists in considering a configuration of loudspeakers distributed around the listener. During rendition on two real loudspeakers, intensity panning (or “pan pot”) laws are then used to give the listener the sensation that sources are actually positioned in the space solely on the basis of two loudspeakers. One then speaks of “phantom sources”. Similar rules are used to define positions of virtual loudspeakers, this amounting to defining spatial encoding functions. The decoding filters correspond directly to the HRIR functions calculated at the positions of the virtual loudspeakers.
For efficacious spatial rendition with a small number of channels, the prior art techniques require the extraction of the delays from the HRIRs. The techniques of sound pick-up or multichannel encoding at a point in space are widely used since it is then possible to subject the encoded signals to transformations (for example rotations). Now, in the case where the signal to be decoded is a multichannel signal measured (or encoded) at a point, the delay information is not extractible on the basis of the signal alone. The decoding filters must then make it possible to reproduce the delays for optimal sound rendition. Moreover, in the case of recordings, the number of channels may be small and the prior art techniques do not allow good decoding with few channels without extracting the delays. For example in the acquisition technique based on ambiophonic microphones, the multichannel signal acquired may be constituted by only four channels, typically. The expression “ambiophonic microphones” is understood to mean microphones composed of coincident directional sensors. The interaural delays must then be reproduced on decoding.
More generally, the extraction of the delays exhibits at least two other major drawbacks:
The present invention aims to improve the situation.
It proposes for this purpose a method of sound spatialization with multichannel encoding and for binaural reproduction on two loudspeakers, comprising a spatial encoding defined by encoding functions associated with a plurality of encoding channels and a decoding by applying filters for reproduction in a binaural context on the two loudspeakers.
The method within the sense of the invention comprises the steps:
a) obtaining an original suite of acoustic transfer functions specific to an individual's morphology (HRIR;HRTF),
b) choosing spatial encoding functions and/or decoding filters, and
c) through successive iterations, optimizing the filters associated with the chosen encoding functions or the encoding functions associated with the chosen filters, or jointly the chosen filters and encoding functions, by minimizing an error calculated as a function of a comparison between:
What is meant by “acoustic transfer functions specific to an individual's morphology” can relate to the HRIR functions expressed in the time domain. However, the consideration, in the first step a), of the HRTF functions expressed in the frequency domain and, in reality, customarily corresponding to the Fourier transforms of the HRIR functions, is not excluded.
Thus, generally, the invention proposes the calculation by optimization of the filters associated with a set of chosen encoding gains or encoding gains associated with a set of chosen decoding filters, or joint optimization of the decoding filters and encoding gains. These filters and/or these gains have for example been fixed or calculated initially by the pseudo-inverse technique or virtual loudspeaker technique, described in particular in document WO-00/19415. Then, these filters and/or the associated gains are improved, within the sense of the invention, by iterative optimization which is concerned with reducing a predetermined error function.
The invention thus proposes the determination of decoding filters and encoding gains which allow at one and the same time good reconstruction of the delay and also good reconstruction of the amplitude of the HRTFs (modulus of the HRTFs), doing so for a small number of channels, as will be seen with reference to the description detailed hereinbelow.
Other characteristics and advantages of the invention will become apparent on examining the detailed description hereinafter, and the appended drawings in which:
In an exemplary embodiment, the method within the sense of the invention can be broken down into three steps:
a) obtaining an HRIR suite (left ear and/or right ear) at P positions around the listener, hereinafter denoted H(θp,φp,t),
b) fixing spatial encoding functions and/or base filters, the encoding functions being denoted g(θp,φp,n) (or else g(θ,φ,n,f)), where:
c) and finding the filters associated with the fixed spatial functions or the spatial functions associated with the fixed filters or a combination of associated filters and spatial functions, by an optimization technique which will be described in detail further on.
It is simply indicated here that, for the implementation of the aforesaid first step a), the obtaining of the HRTFs of the second ear can be deduced from the measurement of the first ear by symmetry. The suite of HRIR functions can for example be measured on a subject by positioning microphones at the entrance of his auditory canal. As a variant, this HRIR suite can also be calculated by digital simulation procedures (modeling of the morphology of the subject or calculation by artificial neural net) or else have been subjected to a chosen processing (reduction of the number of samples, correction of the phase, or the like).
It is possible in this step a) to extract the delays from the HRIRs, to store them and then to add them at the moment of the spatial encoding, steps b) and c) remaining unchanged. This embodiment will be described in detail with reference in particular to
This first step a) bears the reference E0 in
For the implementation of step b), if one seeks to obtain optimized filters on the one hand, it is necessary to fix the spatial encoding functions g(θ,φ,n) (or g(θ,φ,n,f)) and, in order to obtain optimized spatial functions on the other hand, it is necessary to fix the decoding filters denoted F(t,n).
Nevertheless, provision may be made to optimize jointly, at one and the same time the filters and the spatial functions, as indicated above.
The choice to optimize the spatial functions or to optimize the decoding filters may depend on various application contexts.
If the spatial encoding functions are fixed, they are then reproducible and universal and the individualization of the filters is effected simply on decoding.
Additionally, the spatial encoding functions, when they comprise a large number of zeros among n encoding channels as in the second embodiment described further on, make it possible to limit the number of operations during encoding. The intensity panning (“pan pot”) laws between virtual loudspeakers in two dimensions and their extensions in three dimensions can be represented by encoding functions comprising only two nonzero gains, at most, for two dimensions and three nonzero gains for three dimensions, for a single given source. The number of nonzero gains is, of course, independent of the number of channels and, above all, the zero gains make it possible to lighten the encoding calculations.
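A pairwise panning law with at most two nonzero gains per source can be sketched as follows, using a tangent law of the form tan(θv) = ((L−R)/(L+R))·tan(u). Several points are assumptions made for this sketch: u is taken as the half-angle between the two adjacent virtual loudspeakers, θv as the source angle measured from their bisector, the gains are normalized to unit energy, and the ten loudspeakers are evenly spaced. The name `panning_gains` is hypothetical.

```python
import numpy as np

def panning_gains(azimuth_deg, speakers_deg):
    """Encoding gains of one source over all virtual loudspeakers.

    Only the two loudspeakers adjacent to the source receive a nonzero gain.
    Assumes at least two loudspeakers, adjacent ones less than 180 deg apart.
    """
    spk = np.sort(np.asarray(speakers_deg, dtype=float) % 360.0)
    n = len(spk)
    az = azimuth_deg % 360.0
    gains = np.zeros(n)
    # Loudspeaker at or just below the source azimuth (with wrap-around).
    lo = (np.searchsorted(spk, az, side="right") - 1) % n
    hi = (lo + 1) % n
    span = (spk[hi] - spk[lo]) % 360.0
    u = np.radians(span / 2.0)                          # half-aperture
    theta_v = np.radians((az - spk[lo]) % 360.0) - u    # from bisector, + toward hi
    r = np.tan(theta_v) / np.tan(u)                     # (g_hi - g_lo) / (g_hi + g_lo)
    g_hi, g_lo = 1.0 + r, 1.0 - r
    norm = np.hypot(g_hi, g_lo)                         # unit-energy normalization
    gains[hi] = g_hi / norm
    gains[lo] = g_lo / norm
    return gains

speakers = np.arange(0, 360, 36)          # ten evenly spaced virtual loudspeakers
g_mid = panning_gains(18.0, speakers)     # source midway between two loudspeakers
g_on = panning_gains(36.0, speakers)      # source exactly on a loudspeaker
g_wrap = panning_gains(350.0, speakers)   # source between the last and first
```

Whatever the number of channels, the encoding of one source then costs at most two multiplications per sample.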
As regards the encoding functions proper, several choices still present themselves.
The spatial functions of the spherical harmonic type in an ambiophonic context have mathematical qualities which make it possible to subject the encoded signals to transformations (for example rotations of the sound field). Moreover, such functions ensure compatibility between binaural decoding and ambiophonic recordings based on decomposing the sound field into spherical harmonics.
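For the two-dimensional case, encoding gains built from cos(mθ) and sin(mθ) and the rotation of an encoded sound field can be sketched as below. The absence of any normalization factor and the channel ordering are assumptions of this sketch; the names `circular_harmonics` and `rotation_matrix` are hypothetical.

```python
import numpy as np

def circular_harmonics(azimuth_rad, order):
    """Gains [1, cos(t), sin(t), ..., cos(M t), sin(M t)] per azimuth.

    Returns an array of shape (P, 2 * order + 1); order 2 gives 5 channels.
    """
    az = np.atleast_1d(np.asarray(azimuth_rad, dtype=float))
    cols = [np.ones_like(az)]
    for m in range(1, order + 1):
        cols += [np.cos(m * az), np.sin(m * az)]
    return np.stack(cols, axis=-1)

def rotation_matrix(alpha_rad, order):
    """Matrix R such that gains(theta + alpha) = R @ gains(theta)."""
    size = 2 * order + 1
    rot = np.zeros((size, size))
    rot[0, 0] = 1.0
    for m in range(1, order + 1):
        c, s = np.cos(m * alpha_rad), np.sin(m * alpha_rad)
        i = 2 * m - 1
        rot[i:i + 2, i:i + 2] = [[c, -s], [s, c]]
    return rot

g30 = circular_harmonics(np.radians(30.0), order=2)     # shape (1, 5)
g75 = circular_harmonics(np.radians(75.0), order=2)
rotated = rotation_matrix(np.radians(45.0), order=2) @ g30[0]
```

Since the rotation is a fixed matrix applied to the channels, the whole encoded sound field can be rotated without returning to the individual sources.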
The encoding functions can be real or simulated directivity functions of microphones so as to make it possible to listen to recordings in multichannel binaural.
The encoding functions may be any (non-universal) and determined by any procedure, rendition then having to be optimized during subsequent steps of the method within the sense of the invention.
The spatial functions may equally well be time dependent or frequency dependent.
The optimization will then be effected taking account of this dependence (for example by independently optimizing each temporal or frequency sample).
As regards the decoding filters, the latter may be fixed in such a way that the decoding can be universal.
The decoding filters can be chosen also in such a way as to reduce the cost in resources involved in the filtering. For example, the use of so-called “infinite impulse response” or “IIR” filters is advantageous.
The decoding filters may also be chosen according to a psychoacoustic criterion, for example constructed on the basis of normalized Bark bands.
More generally, the decoding filters may be determined by an arbitrary procedure. Rendition, in particular for an individual listener, can then be optimized during subsequent steps of the method pertaining to the encoding functions.
This second step b) relating to the calculation of an initial solution S0 bears the reference E1 in
For example, in the case where the fixed spatial functions are functions defining the intensity panning (“pan pot”) laws between virtual loudspeakers, the filters of the starting solution S0 in step E1 may be directly the HRIR functions given at the corresponding positions of the virtual loudspeakers.
In this example, provision may also be made to jointly optimize the decoding filters and the encoding gains, the starting solution S0 again being determined by functions defining the intensity panning (“pan pot”) laws as encoding functions and by the HRIR functions, themselves, given at the positions of the virtual loudspeakers, as decoding filters.
In another example where the spatial encoding functions are fixed as being spherical harmonics, the decoding filters are calculated in step E1 on the basis of the pseudo-inverse, so as to determine the starting solution S0.
More generally, the starting solution S0 in step E1 can be calculated on the basis of the least squares solution:
$F = \mathrm{HRIR}\; g^{-1}$
It should be specified here that the elements F, HRIR and g are matrices. Furthermore, the notation $g^{-1}$ denotes the pseudo-inverse of the gain matrix g according to the expression:
$g^{-1} = \mathrm{pinv}(g) = g^{T}\,(g\,g^{T})^{-1}$, the notation $g^{T}$ denoting the transpose of the matrix g.
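The least squares starting solution can be sketched numerically. The matrix layout is an assumption chosen so that the expression g^T (g g^T)^-1 is well defined: g has one row per channel and one column per position, and HRIR has one row per time sample and one column per position. The gains are two-dimensional harmonics of order 2 used as a stand-in, and the HRIR data are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
N, P, T = 5, 64, 32                          # channels, positions, samples (hypothetical)
theta = np.linspace(0.0, 2.0 * np.pi, P, endpoint=False)

# Gain matrix g, shape (N, P).
g = np.vstack([np.ones(P),
               np.cos(theta), np.sin(theta),
               np.cos(2 * theta), np.sin(2 * theta)])
hrir = rng.standard_normal((T, P))           # placeholder data, not measured HRIRs

g_pinv = g.T @ np.linalg.inv(g @ g.T)        # g^T (g g^T)^-1, shape (P, N)
F0 = hrir @ g_pinv                           # starting filters, shape (T, N)
hrir_star = F0 @ g                           # reconstructed suite, shape (T, P)
residual = hrir - hrir_star
```

With this layout the residual is orthogonal to every row of g, which is the defining property of the least squares solution; `np.linalg.pinv(g)` returns the same matrix as the explicit expression when g has full row rank.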
Again generally, the starting solution S0 can be any (random or fixed), the essential thing being that it leads to a converged solution SC being obtained in step E6 of
In step E2, the reconstruction of the suite of HRIR functions then gives a reconstructed suite HRIR*=gF that differs from the original suite, at the first iteration.
In step E3, the calculation of an error function is an important point of the optimization procedure within the sense of the invention. A proposed error function consists in simply minimizing the difference of moduli between the Fourier transform HRTF* of the reconstructed suite of HRIR functions and the Fourier transform HRTF of the original suite of HRIR functions (given in step E0). This error function, denoted c, may be written:
where F(X) denotes the Fourier transform of the function X.
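An error function comparing the moduli of the Fourier transforms can be sketched as follows. The squared difference summed over positions and frequencies is an assumption, suggested by the phrase “least squares on the modulus” used later in the description; the (positions x samples) layout and the name `modulus_error` are also assumptions.

```python
import numpy as np

def modulus_error(hrir, hrir_star):
    """Sum over positions and frequencies of (|HRTF| - |HRTF*|)^2.

    hrir, hrir_star: (P, T) arrays, time along the last axis.
    """
    H = np.abs(np.fft.rfft(hrir, axis=-1))
    H_star = np.abs(np.fft.rfft(hrir_star, axis=-1))
    return float(np.sum((H - H_star) ** 2))

rng = np.random.default_rng(5)
hrir = rng.standard_normal((6, 32))              # placeholder data
shifted = np.roll(hrir, 3, axis=-1)              # circular delay: phase change only
err_shift = modulus_error(hrir, shifted)
err_scale = modulus_error(hrir, 2.0 * hrir)      # amplitude change
```

Such a criterion is insensitive to the phase: a circular delay leaves the error at zero (up to rounding), whereas an amplitude change is penalized.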
Other error functions also allow optimal spatial rendition. For example, it is possible to weight the HRIR functions by a gain which depends on the position of the HRIR functions so as to better reconstruct certain favored positions in space, which may be written:
where wp is the gain corresponding to a position p. It is thus possible to favor the reconstruction of certain spatial zones of the HRIR function (for example the frontal part).
In the same manner, it is also possible to weight the HRIR functions as a function of time or frequency.
The error function can also minimize the energy difference between the moduli, i.e.:
Generally, it will be assumed that any error function calculated entirely or in part on the basis of the HRIR functions can be provided (modulus, phase, estimated delay or ITD, interaural differences, or the like).
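The position-weighted and energy-based variants can be sketched in the same spirit. The exact forms are assumptions: the weight w_p is applied to the moduli at position p before squaring, and the energy variant sums the absolute difference of the squared moduli. The function names and the placeholder data are hypothetical.

```python
import numpy as np

def _moduli(hrir, hrir_star):
    """Moduli of the Fourier transforms of two (P, T) HRIR suites."""
    H = np.abs(np.fft.rfft(hrir, axis=-1))
    H_star = np.abs(np.fft.rfft(hrir_star, axis=-1))
    return H, H_star

def weighted_modulus_error(hrir, hrir_star, weights):
    """Squared modulus difference with one weight per position."""
    H, H_star = _moduli(hrir, hrir_star)
    w = np.asarray(weights, dtype=float)[:, None]
    return float(np.sum((w * (H - H_star)) ** 2))

def energy_error(hrir, hrir_star):
    """Sum of the absolute differences between squared moduli."""
    H, H_star = _moduli(hrir, hrir_star)
    return float(np.sum(np.abs(H ** 2 - H_star ** 2)))

rng = np.random.default_rng(4)
hrir = rng.standard_normal((4, 16))      # placeholder data, 4 positions
approx = hrir.copy()
approx[3] *= 0.5                         # only position 3 is badly reconstructed
favored = [1.0, 1.0, 1.0, 0.0]           # hypothetical weights ignoring position 3

err_favored = weighted_modulus_error(hrir, approx, favored)
err_uniform = weighted_modulus_error(hrir, approx, np.ones(4))
err_energy = energy_error(hrir, approx)
```

Setting a weight to zero removes the corresponding position from the criterion, so the optimization effort is spent on the favored positions.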
Additionally, if the error criterion pertains to the frequency samples of the HRTF functions, independently of one another, unlike what was proposed above (sum over all the frequencies for the calculation of the error function c), the optimization iterations can be applied successively to each frequency sample, with the advantage of then reducing the number of simultaneous variables, of having an error function specific to each frequency f and of encountering a stopping criterion as a function of convergence specific to each frequency.
Step T4 is a test to stop or not stop the iteration of the optimization as a function of a chosen stopping criterion. It may involve a criterion characterizing the fact that:
If the criterion is attained (arrow Y on exit from the test T4), the filters F(n,t) or the gains g(θ,φ,n) or the filter/gain pairs calculated make it possible to obtain optimal spatial rendition, as will be seen in particular with reference to
If the criterion is not attained (arrow N on exit from the test T4), according to the error function used, it is difficult to ascertain analytically what the evolution of the filters F or of the gains g should be in order to minimize the error c. Recourse is advantageously had to a gradient calculation to adjust the filters and/or the gains so that they lead to a reduction in the error function c (iterative steps E5).
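The iterative loop (reconstruction, error calculation, stopping test, gradient adjustment) can be sketched for the case where the gains are fixed and the filters are optimized. This is a minimal sketch under stated assumptions: a finite-difference gradient with a backtracking step stands in for a full optimizer, the error is the squared modulus difference, the layout is g (P, N), F (N, T), HRIR (P, T), and the data are random placeholders.

```python
import numpy as np

def error(F, g, H_mod):
    """Reconstruct HRIR* = g @ F and compare HRTF moduli (squared difference)."""
    H_star = np.abs(np.fft.rfft(g @ F, axis=-1))
    return float(np.sum((H_star - H_mod) ** 2))

def numerical_gradient(F, g, H_mod, eps=1e-6):
    """Central finite-difference gradient of the error with respect to F."""
    grad = np.zeros_like(F)
    for idx in np.ndindex(F.shape):
        step = np.zeros_like(F)
        step[idx] = eps
        grad[idx] = (error(F + step, g, H_mod) - error(F - step, g, H_mod)) / (2 * eps)
    return grad

def optimize(F0, g, hrir, max_iter=200, tol=1e-10):
    """Gradient descent with backtracking; returns (filters, error history)."""
    H_mod = np.abs(np.fft.rfft(hrir, axis=-1))
    F, c = F0.copy(), error(F0, g, H_mod)
    history = [c]
    for _ in range(max_iter):
        if c < tol:                                   # stopping test
            break
        grad = numerical_gradient(F, g, H_mod)
        step = 1.0
        while step > 1e-12:                           # halve until the error drops
            F_new = F - step * grad
            c_new = error(F_new, g, H_mod)
            if c_new < c:
                break
            step *= 0.5
        else:
            break                                     # no improving step found
        F, c = F_new, c_new
        history.append(c)
    return F, history

rng = np.random.default_rng(3)
P, N, T = 12, 3, 8                                    # hypothetical small sizes
theta = np.linspace(0.0, 2.0 * np.pi, P, endpoint=False)
g = np.stack([np.ones(P), np.cos(theta), np.sin(theta)], axis=1)
hrir = rng.standard_normal((P, T))                    # placeholder data
F0 = np.linalg.pinv(g) @ hrir                         # least squares starting solution
F_opt, history = optimize(F0, g, hrir, max_iter=30)
```

By construction the recorded error never increases from one iteration to the next; the same loop applies to the gains, or to gains and filters jointly, by changing which array is treated as the variable.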
This processing is advantageously computationally assisted. A function dubbed “fminunc” from the “optimization Toolbox” module of the Matlab® software, programmed in an appropriate manner, makes it possible to carry out steps E2, E3, T4, E5, E6 described above with reference to
Of course, this embodiment illustrated in
Described hereinafter is an exemplary optimization of the filters for decoding a content arising from a spatial encoding by spherical harmonic functions in an ambiophonic context of high order (or “high order ambisonic”), for reproduction to binaural. This is a sensitive case since if sources have been recorded or encoded in an ambiophonic context, the interaural delays must be complied with in the processing when decoding, by applying the decoding filters.
In the implementation of the invention set forth hereinafter by way of example, we have chosen to limit ourselves to the case of two dimensions and thus seek to provide optimized filters so as to decode an ambiophonic content to order 2 (five ambiophonic channels) for binaural listening on a headset with earpieces.
For the embodiment of the first step a) of the general method described above (reference E0 of
A symmetry of the listener's head is assumed and the HRIRs of the right ear are symmetric to the HRIRs of the left ear.
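Under this symmetry assumption, the right-ear suite can be deduced from the left-ear suite by mirroring the azimuth. The sketch below assumes P evenly spaced azimuths 2πp/P starting at 0 (the median plane) and a (positions x samples) layout; the name `right_from_left` is hypothetical and the data are placeholders.

```python
import numpy as np

def right_from_left(hrir_left):
    """Right-ear HRIRs obtained by mirroring the azimuth of the left-ear ones.

    hrir_left: (P, T) array measured at azimuths 2*pi*p/P, p = 0..P-1.
    The right-ear response at azimuth theta equals the left-ear response
    at azimuth -theta, i.e. at index (-p) mod P.
    """
    P = hrir_left.shape[0]
    mirror = (-np.arange(P)) % P
    return hrir_left[mirror]

P, T = 8, 4
hrir_left = np.arange(P * T, dtype=float).reshape(P, T)   # placeholder data
hrir_right = right_from_left(hrir_left)
```

Only one ear then needs to be measured and optimized, the other being obtained by reindexing.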
As a variant of measurements to be performed on an individual, it is possible to obtain the HRIR functions from standard databases (“Kemar head”) or by modeling the morphology of the individual, or the like.
The spatial encoding functions chosen here are the spherical harmonics calculated on the basis of the functions cos(mθ) and sin(mθ), with increasing angular frequencies m=0, 1, 2, . . . , N to characterize the azimuthal dependence (as illustrated in
The starting solution S0 for step E1 is given by calculating the pseudo-inverse (with linear resolution). This starting solution constitutes the decoding solution which was proposed as such in document WO-00/19415 of the prior art described above. The optimization technique employed within the sense of the invention is preferably the gradient technique described above. The error function c employed corresponds to the least squares on the modulus of the Fourier transform of the HRIR functions, i.e.:
For the starting solution which nevertheless constituted the decoding solution within the sense of document WO-00/19415, the modulus of the HRTF functions is relatively poorly reconstructed, most of the reconstruction errors being greater than 8 dB.
Nevertheless, it is apparent that the error in the phase is practically unmodified in the course of the iterations. This error is however minimal at low frequencies and on the ipsilateral part of the HRTF functions (region at 0-180° of azimuth). On the other hand, the error in the modulus decreases greatly as the optimization iterations proceed, especially in this ipsilateral region. The optimization within the sense of the invention therefore makes it possible to improve the modulus of the HRTF functions without modifying the phase, therefore the group delay, and, thereby and especially, the interaural ITD delay, so that the rendition is particularly faithful by virtue of the implementation of this first embodiment.
Described hereinafter is an exemplary optimization of the decoding filters for spatial functions arising from intensity panning (“pan pot”) laws consisting, in simple terms, of mixing rules.
Panning laws are commonly employed by sound technicians to produce audio contents, in particular multichannel contents in so-called “surround” formats which are used in sound reproduction 5.1, 6.1, or the like. In this second embodiment, one seeks to calculate the filters which make it possible to reproduce a “surround” content on a headset. In this case, the encoding by panning laws is carried out by mixing a sound environment according to a “surround” format (tracks 5.1 of a digital recording for example). The filters optimized on the basis of the same panning laws then make it possible to obtain optimal binaural decoding for the desired rendition with this “surround” effect.
The present invention advantageously applies in the case where the positions of the virtual loudspeakers correspond to positions of a mass-market multichannel reproduction system, with “surround” effect. The optimized decoding filters then allow decoding of mass-market multimedia contents (typically multichannel contents with “surround” effect) for reproduction on two loudspeakers, for example on a binaural headset. This binaural reproduction of a content which is for example initially in the 5.1 format is optimized by virtue of the implementation of the invention.
The case of an example of ten virtual loudspeakers “disposed” around the listener is described hereinafter.
First of all, the HRIR functions are obtained at 64 positions around the listener, as described with reference to the first embodiment above.
The spatial functions given by the intensity panning laws (here the tangent law) between each pair of adjacent loudspeakers are determined in this second embodiment by a relation of the type:
$\tan(\theta_v) = \dfrac{L-R}{L+R}\,\tan(u)$, where:
The forms of the ten spatial functions adopted as a function of azimuth are given in
The optimization procedure used in the second embodiment is again the gradient procedure. The starting solution S0 in step E1 is given by the ten decoding filters which correspond to the ten HRIR functions given at the positions of the virtual loudspeakers. The fixed spatial functions are the encoding functions representing the panning laws. The error function c is based on the modulus of the Fourier transform of the HRIR functions, i.e.:
Reference is now made to
The optimized solution within the sense of the invention agrees perfectly with the original function, this being explained by the fact that the error function c proposed here is concerned with reducing to the maximum the error in the modulus of the function.
The optimization of the method within the sense of the invention therefore makes it possible to reconstruct at one and the same time the modulus of the HRTF functions and the ITD group delay between the two ears.
Moreover, it is apparent in this second embodiment that the quality of the reconstructed filters is not affected by the choice of the encoding functions. Therefore, it is possible to use any spatial encoding functions, for example advantageously comprising many zeros, as in this exemplary embodiment, thereby making it possible to correspondingly reduce the resources necessary for calculating the encoding.
The object of this part of the description is to assess the gain in terms of number of operations and memory resources necessary for the implementation of the encoding and the multichannel binaural decoding within the sense of the invention, with decoding filters which take the delay into account.
The case dealt with in the example described here is that of two spatially distinct sources to be encoded in multichannel and to be reproduced in binaural. The two implementation examples of
The example given in
The realization of
In the example of
In
In
In
For the decoding part of
Finally, L and R denote the left and right binaural channels.
In the implementation of
Thus, the fact of not having to take account of the interaural delays on encoding makes it possible to reduce the number of channels to n (and no longer 2 n). The use of the symmetry of the decoding filters makes it possible furthermore, in the implementation of
It is indicated that this implementation of
The processing on decoding of
Thus, whereas the solution illustrated in
the solution illustrated in
Additionally, even if the memory storage requires, for the two solutions, the same capacities (storage of n filters by calculating the delays and the gains on the fly), the useful work memory (buffer) for the implementation of
The present invention is thus concerned with a sound spatialization system with multichannel encoding and for reproduction on two channels comprising a spatial encoding block ENCOD defined by encoding functions associated with a plurality of encoding channels and a decoding block DECOD based on applying filters for reproduction in a binaural context. In particular, the spatial encoding functions and/or the decoding filters are determined by implementing the method described above. Such a system can correspond to that illustrated in
Another advantageous realization consists of the implementation of the method according to the second embodiment so as thus to construct a spatialization system with a block for direct encoding, without applying delay, so as to reduce a number of encoding channels and a corresponding number of decoding filters, which directly include the interaural delays ITD, according to an advantage offered by implementing the invention, as illustrated in
This realization of
The present invention is also concerned with a computer program comprising instructions for implementing the method described above and the algorithm of which may be illustrated by a general flowchart of the type represented in
Number | Date | Country | Kind |
---|---|---|---|
06 02098 | Mar 2006 | FR | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/FR2007/050867 | 3/1/2007 | WO | 00 | 9/8/2008 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2007/101958 | 9/13/2007 | WO | A |
Number | Name | Date | Kind |
---|---|---|---|
5500900 | Chen et al. | Mar 1996 | A |
5596644 | Abel et al. | Jan 1997 | A |
5727066 | Elliott et al. | Mar 1998 | A |
5802180 | Abel et al. | Sep 1998 | A |
5862227 | Orduna-Bustamante et al. | Jan 1999 | A |
6181800 | Lambrecht | Jan 2001 | B1 |
7231054 | Jot et al. | Jun 2007 | B1 |
20080137870 | Nicol et al. | Jun 2008 | A1 |
20080306720 | Nicol et al. | Dec 2008 | A1 |
Number | Date | Country |
---|---|---|
WO 9000851 | Jan 1990 | WO |
WO 0019415 | Apr 2000 | WO |
Number | Date | Country | |
---|---|---|---|
20090067636 A1 | Mar 2009 | US |