The invention relates to the processing of sound data for the spatialized restitution of acoustic signals.
The appearance of new formats for coding data on telecommunications networks allows the transmission of complex and structured sound scenes comprising multiple sound sources. In general, these sound sources are spatialized, that is to say they are processed in such a way as to afford a realistic final rendition in terms of position of the sources and room effect (reverberation). Such is the case for example for coding according to the MPEG-4 standard which makes it possible to transmit complex sound scenes comprising compressed or uncompressed sounds, and synthesis sounds, with which are associated spatialization parameters (position, effect of the surrounding room). This transmission is made over networks with constraints, and the sound rendition depends on the type of terminal used. On a mobile terminal of PDA type for example (standing for “Personal Digital Assistant”), a listening headset will preferably be used. The constraints of terminals of this type (calculation power, memory size) render the implementation of sound spatialization techniques difficult.
Sound spatialization covers two different types of processing. On the basis of a monophonic audio signal, one seeks to give a listener the illusion that the sound source or sources are at very precise positions in space (positions that one desires to be able to modify in real time), and are immersed in a space having particular acoustic properties (reverberation, or other acoustic phenomena such as occlusion). By way of example, on telecommunication terminals of mobile type, it is natural to envisage sound rendition over a stereophonic listening headset. The most effective technique for positioning the sound sources is then binaural synthesis.
It consists, for each sound source, in filtering the monophonic signal via acoustic transfer functions, called HRTFs (standing for “Head Related Transfer Functions”), which model the transformations engendered by the torso, the head and the auricle of the ear of the listener on a signal originating from a sound source. For each position in space, a pair of these functions can be measured (one for the right ear, one for the left ear). The HRTFs are therefore functions of a spatial position, more particularly of an angle of azimuth θ and an angle of elevation φ, and of the sound frequency f. Thus, for a given subject, a database of acoustic transfer functions for N positions in space is obtained, for each ear, in which a sound may be “placed” (or “spatialized” according to the terminology used hereinbelow).
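Purely by way of illustration, the sketch below (in Python, with hypothetical array names; the HRTF data themselves are assumed to be given) shows this static binaural filtering of a monophonic signal by a measured left/right HRTF pair.

```python
# Minimal sketch of static "bichannel" binaural synthesis, assuming an HRTF
# database measured for a set of discrete positions (hypothetical arrays
# hrtf_left / hrtf_right of shape [n_positions, n_taps], e.g. 256 taps at 44.1 kHz).
import numpy as np
from scipy.signal import fftconvolve

def binaural_static(mono, pos_index, hrtf_left, hrtf_right):
    """Place a monophonic signal at the measured position `pos_index`."""
    left = fftconvolve(mono, hrtf_left[pos_index])    # left-ear HRTF filtering
    right = fftconvolve(mono, hrtf_right[pos_index])  # right-ear HRTF filtering
    return np.stack([left, right], axis=0)            # shape (2, len(mono)+n_taps-1)

# Example with random data standing in for a real HRTF database.
rng = np.random.default_rng(0)
mono = rng.standard_normal(44100)
hL = rng.standard_normal((187, 256))   # e.g. 187 measured positions, 256 taps
hR = rng.standard_normal((187, 256))
out = binaural_static(mono, 42, hL, hR)
```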
It is indicated that a similar spatialization processing consists of a so-called “transaural” synthesis, in which provision is simply made for more than two loudspeakers in a restitution device (which then takes a different form from a headset with two earpieces, left and right).
In a conventional manner, the implementation of this technique is effected in a so-called “bichannel” form (processing represented diagrammatically in
If one wishes, on the contrary, to vary the positions of the sound sources in space over time (“dynamic” synthesis), the filters used to model the HRTFs (left ear and right ear) have to be modified. However, these filters being for the most part of the finite impulse response (FIR) or infinite impulse response (IIR) type, discontinuities of the left and right output signals appear, giving rise to audible “clicks”. The technical solution conventionally employed to alleviate this problem is to run two sets of binaural filters in parallel. The first set simulates a position [θ1, φ1] at the instant t1, the second a position [θ2, φ2] at the instant t2. The signal giving the illusion of a displacement between the positions at the instants t1 and t2 is then obtained by cross-fading the left and right signals resulting from the filtering processes for the position [θ1, φ1] and for the position [θ2, φ2]. The complexity of the system for positioning the sound sources is thus doubled (two positions at two instants) with respect to the static case.
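The following sketch illustrates, under the same assumptions as above, the conventional cross-fade between the two parallel binaural renderings (the variable names are purely illustrative).

```python
# Hedged sketch of the conventional "two filter sets in parallel" solution for
# dynamic synthesis: render the source at position 1 and at position 2, then
# cross-fade the left/right outputs between the instants t1 and t2 to avoid clicks.
import numpy as np

def crossfade_dynamic(render_pos1, render_pos2, fade_start, fade_len):
    """render_pos1, render_pos2: arrays of shape (2, n) produced by two parallel
    binaural filterings (position [theta1, phi1] and position [theta2, phi2])."""
    n = render_pos1.shape[1]
    gain = np.zeros(n)
    gain[fade_start:fade_start + fade_len] = np.linspace(0.0, 1.0, fade_len)
    gain[fade_start + fade_len:] = 1.0       # fully at position 2 after the fade
    return (1.0 - gain) * render_pos1 + gain * render_pos2   # per-sample cross-fade
```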
In order to alleviate this problem, techniques of linear decomposition of the HRTFs have been proposed (processing represented diagrammatically in
These techniques of linear decomposition are also of interest in the case of dynamic binaural synthesis (i.e. when the position of the sound sources varies in the course of time). Specifically, in this configuration, the values of the weighting coefficients and of the delays, rather than the coefficients of the filters, are now made to vary as a function of position alone. The principle described hereinabove of linear decomposition of sound rendition filters generalizes to other approaches, as will be seen hereinbelow.
Moreover, in the various group communication services (teleconferencing, audio conferencing, video conferencing, or the like) or “streaming” communication services, to adapt the bit rate to the bandwidth provided by a network, the audio and/or speech streams are transmitted in a compressed coded format. Hereinbelow we consider only streams initially compressed by coders of frequency type (or by frequency transform), such as those operating according to the MPEG-1 standard (layers I-II-III), the MPEG-2/4 AAC standard, the MPEG-4 TwinVQ standard, the Dolby AC-2 standard, the Dolby AC-3 standard, the ITU-T G.722.1 standard for speech coding, or else the Applicant's TDAC coding method. The use of such coders amounts to firstly performing a time/frequency transformation on blocks of the time signal. The parameters obtained are thereafter quantized and coded so as to be transmitted in a frame with other supplementary information required for decoding. This time/frequency transformation may take the form of a bank of frequency subband filters or else a transform of MDCT type (standing for “Modified Discrete Cosine Transform”). Hereinbelow, the term “subband domain” will designate equally a domain defined in a frequency subband space, the domain of a frequency transform of the time signal, or a frequency domain.
To perform the sound spatialization on such streams, the conventional procedure consists in firstly performing a decoding, carrying out the sound spatialization processing on the time signals, then recoding the resulting signals for transmission to a restitution terminal. This cumbersome succession of steps is often very expensive in terms of calculation power, memory required for the processing, and algorithmic lag introduced. It is therefore often unsuited to the constraints imposed by the machines where the processing is performed and to the communication constraints.
The present invention aims to improve this situation.
One of the aims of the present invention is to propose a method of processing sound data grouping together the operations of compression coding/decoding of the audio streams and of spatialization of said streams.
Another aim of the present invention is to propose a method of processing sound data, by spatialization, which adapts to a variable number (dynamically) of sound sources to be positioned.
A general aim of the present invention is to propose a method of processing sound data, by spatialization, allowing wide broadcasting of the spatialized sound data, in particular broadcasting for the general public, the restitution devices being simply equipped with a decoder of the signals received and restitution loudspeakers.
To this end it proposes a method of processing sound data, for spatialized restitution of acoustic signals, in which:
Each acoustic signal in step a) of the method within the sense of the invention is at least partially compression-coded and is expressed in the form of a vector of subsignals associated with respective frequency subbands, and each filtering unit is devised so as to perform a matrix filtering applied to each vector, in the frequency subband space.
Advantageously, each matrix filtering is obtained by conversion, in the frequency subband space, of a (finite or infinite) impulse response filter defined in the time space. Such an impulse response filter is preferably obtained by determination of an acoustic transfer function dependent on a direction of perception of a sound and the frequency of this sound.
According to an advantageous characteristic of the invention, these transfer functions are expressed by a linear combination of frequency-dependent terms weighted by direction-dependent terms, thereby making it possible, as indicated hereinabove, on the one hand, to process a variable number of acoustic signals in step a) and, on the other hand, to vary the position of each source dynamically over time. Furthermore, such an expression for the transfer functions “integrates” the interaural delay which, in binaural processing, is conventionally applied to one of the output signals with respect to the other before restitution. To this end, matrices of gain filters associated with each signal are envisaged.
Thus, said first and second output signals preferably being intended to be decoded into first and second restitution signals, the aforesaid linear combination already takes account of a time shift between these first and second restitution signals, in an advantageous manner.
Finally, between the step of reception/decoding of the signals received by a restitution device and the step of restitution itself, it is possible not to envisage any further step of sound spatialization, this spatialization processing being completely performed upstream and directly on coded signals.
According to one of the advantages afforded by the present invention, association of the techniques of linear decomposition of the HRTFs with the techniques of filtering in the subband domain makes it possible to profit from the advantages of the two techniques so as to arrive at sound spatialization systems with low complexity and reduced memory for multiple coded audio signals.
Specifically, in a conventional “bichannel” architecture, the number of filters to be used is dependent on the number of sources to be positioned. As indicated hereinabove, this problem does not arise in an architecture based on the linear decomposition of HRTFs. This technique is therefore preferable in terms of calculation power, but also memory space required for storing the binaural filters. Finally, this architecture makes it possible to optimally manage the dynamic binaural system, since it makes it possible to effect the “fading” between two instants t1 and t2 on coefficients which depend only on position, and therefore does not require two sets of filters in parallel.
According to another advantage afforded by the present invention, the direct filtering of the signals in the coded domain allows a saving of one complete decoding per audio stream before undertaking the spatialization of the sources, thereby entailing a considerable gain in terms of complexity.
According to another advantage afforded by the present invention, the sound spatialization of the audio stream can occur at various points of a transmission chain (servers, nodes of the network or terminals). The nature of the application and the architecture of the communication used may favor one or other case. Thus, in a teleconferencing context, the spatialization processing is preferably performed at the level of the terminals in a decentralized architecture and, on the contrary, at the audio bridge level (or MCU, standing for “Multipoint Control Unit”) in a centralized architecture. For audio “streaming” applications, especially on mobile terminals, the spatialization may be carried out either in the server, or in the terminal, or else during content creation. In these various cases, a decrease in the processing complexity and in the memory required for the storage of the HRTF filters is still obtained. For example, for mobile terminals (second- and third-generation portable telephones, PDAs, or pocket microcomputers) having heavy constraints in terms of calculational capacity and memory size, provision is preferably made for spatialization processing directly at the level of a contents server.
The present invention may also find applications in the field of the transmission of multiple audio streams included in structured sound scenes, as provided for in the MPEG-4 standard.
Other characteristics, advantages and applications of the invention will become apparent on examining the detailed description hereinbelow, and the appended drawings, in which:
Reference is firstly made to
Reference is now made to
$|HRTF(\theta,\phi,f)| = \sum_{n=1}^{P} C_n(\theta,\phi)\,L_n(f)$  Eq[1]
Each signal of a source Si to be spatialized (i=1, . . . , N) is weighted by coefficients Cni(θ,φ) (n=1, . . . , P) emanating from the linear decomposition of the HRTFs. These coefficients have the particular feature of depending only on the position [θ,φ] at which one wishes to place the source, and not on the frequency f. The number of these coefficients depends on the number P of basis vectors that were preserved for the reconstruction. The N signals of all the sources, weighted by the “directional” coefficients Cni, are then added together (for the right channel and the left channel, separately), then filtered by the filter corresponding to the nth basis vector. Thus, contrary to the “bichannel” binaural synthesis, the addition of a further source does not require the addition of two extra filters (often of FIR or IIR type). The P basis filters are in effect shared by all the sources present. This implementation is said to be “multichannel”. Moreover, in the case of dynamic binaural synthesis, it is possible to vary the coefficients Cni(θ,φ) without clicks appearing at the output of the device. In this case, only 2·P filters are required, whereas 4·N filters were required by the bichannel synthesis.
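A minimal sketch of this “multichannel” implementation of Eq[1] is given below; the names C, D, L_basis and R_basis merely mirror the notation of the text and are not part of the claimed method.

```python
# "Multichannel" binaural synthesis: each source i is weighted by position-only
# coefficients Cni (left ear) and Dni (right ear), the weighted sources are summed
# per basis channel n, and only the P shared basis filters Ln / Rn are applied.
import numpy as np
from scipy.signal import lfilter

def multichannel_binaural(sources, C, D, L_basis, R_basis):
    """
    sources : (N, T) monophonic time signals
    C, D    : (N, P) directional weights for left / right ears (position-only)
    L_basis, R_basis : lists of P FIR basis filters (1-D coefficient arrays)
    """
    P = C.shape[1]
    left = np.zeros(sources.shape[1])
    right = np.zeros(sources.shape[1])
    for n in range(P):
        mix_L = C[:, n] @ sources            # weighted sum of all N sources
        mix_R = D[:, n] @ sources
        left += lfilter(L_basis[n], [1.0], mix_L)    # one shared basis filter
        right += lfilter(R_basis[n], [1.0], mix_R)   # per basis channel
    return left, right
```

Adding a further source only adds one row to C and D; the P basis filters themselves are unchanged, which is precisely the interest of this architecture.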
In
However, before referring to
In the case where a spatialization processing of this type is carried out at the communication terminal level, a step of decoding the N signals is required before the spatialization processing proper. This step demands considerable calculational resources (this being problematic on current communication terminals, in particular of portable type). Moreover, this step entails a lag in the signals processed, thereby hindering the interactivity of the communication. If the sound scene transmitted comprises a large number of sources (N), the decoding step may in fact become more expensive in terms of calculational resources than the sound spatialization step proper. Specifically, as indicated hereinabove, the calculational cost of the “multichannel” binaural synthesis depends only very slightly on the number of sound sources to be spatialized.
The calculational cost of the operation for spatializing the N coded audio streams (in the multichannel synthesis of
In the case where the spatialization is not carried out at the level of a terminal but at the level of a server (case of
Referring to
Thus, the decoding of the N coded streams is required before the step of spatializing the sound sources, thereby giving rise to an increase in the calculational cost and the addition of a lag due to the processing of the decoder. It is indicated that the initial audio sources are generally stored directly in coded format, in the current contents servers.
It is indicated furthermore that for restitution on more than two loudspeakers (transaural synthesis or else in an “ambisonic” context that will be described below), the number of signals resulting from the spatialization processing is generally greater than two, thereby further increasing the calculational cost for completely recoding these signals before their transmission by the communication network.
Reference is now made to
It consists in associating the “multichannel” deployment of binaural synthesis (
The various steps for processing the data and the architecture of the system are described in detail hereinbelow.
In the case of spatialization of multiple coded audio signals, at the server level as in the example represented in
The overall calculational cost of the operation of spatializing the coded audio streams is then considerably reduced. Specifically, the initial operation of decoding in a conventional system is replaced with an operation of partial decoding of much lesser complexity. The calculational burden in a system within the sense of the invention becomes substantially constant as a function of the number of audio streams that one wishes to spatialize. With respect to conventional systems, one obtains a gain in terms of calculational cost which then becomes proportional to the number of audio streams that one wishes to spatialize. Moreover, the operation of partial decoding gives rise to a lower processing lag than the complete decoding operation, this being especially beneficial in an interactive communication context.
The system for the implementation of the method according to the invention, performing spatialization in the subband domain, is denoted “system II” in
Described hereinbelow is the obtaining of the parameters in the subband domain from binaural impulse responses.
In a conventional manner, the binaural transfer functions or HRTFs are accessible in the form of temporal impulse responses. These functions generally consist of 256 temporal samples, at a sampling frequency of 44.1 kHz (typical in the field of audio). These impulse responses may emanate from acoustic simulations or measurements.
The pre-processing steps for obtaining the parameters in the subband domain are preferably the following:
It will be noted that the matrices of filters Gi applied independently to each source “integrate” a conventional operation of delay calculation for the addition of the interaural delay between a signal Li and a signal Ri to be restored. Specifically, in the time domain, provision is conventionally made for delay lines τi (
In the case of a transmission from a server to restitution terminals, all these steps are performed advantageously off-line. The matrices of filters hereinabove are therefore calculated once and then stored definitively in the memory of the server. It will be noted in particular that the set of weighting coefficients Cni, Dni advantageously remains unchanged from the time domain to the subband domain.
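To illustrate the delay integration mentioned above, the sketch below shows how a pure interaural delay, conventionally applied by a delay line τi, can equivalently be expressed as a per-source filter applied in the weighting path; the integer-sample delay and the hypothetical value used here are illustrative simplifications, not the patent's actual gain-filter matrices Gi.

```python
import numpy as np
from scipy.signal import lfilter

def delay_filter(d):
    """z^{-d} written as an FIR filter (integer-sample delay). A real ITD is in
    general fractional and would call for an interpolating filter; this integer
    version is only an illustrative simplification."""
    b = np.zeros(d + 1)
    b[d] = 1.0
    return b

# Hypothetical ITD of 13 samples applied, for one source, to the path feeding the
# right-ear basis channels before the weighted summation (instead of an explicit
# delay line tau_i at the output).
itd = delay_filter(13)
source = np.random.default_rng(1).standard_normal(1024)
source_right_path = lfilter(itd, [1.0], source)
```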
For spatialization techniques based on filtering by HRTF filters and addition of the ITD delay (standing for “Interaural Time Delay”), such as binaural and transaural synthesis, or else on filters of transfer functions in the ambisonic context, a difficulty arises in finding equivalent filters to be applied to samples in the subband domain. Specifically, these filters, applied to the subband signals emanating from the bank of analysis filters, must preferably be constructed in such a way that the left and right time signals restored by the bank of synthesis filters exhibit the same sound rendition, without any artefact, as that obtained through direct spatialization of a temporal signal. The design of filters making it possible to achieve such a result is not immediate. Specifically, the modification of the spectrum of the signal afforded by filtering in the time domain cannot be carried out directly on the subband signals without taking account of the spectrum overlap phenomenon (“aliasing”) introduced by the bank of analysis filters. The dependency relation between the aliasing components of the various subbands is preferably preserved during the filtering operation so that their removal is ensured by the bank of synthesis filters.
Described hereinbelow is a method for transposing a rational filter S(z), of FIR or IIR type (its z-transform being a quotient of two polynomials), in the case of a linear decomposition of HRTFs or of transfer functions of this type, into the subband domain, for a bank of filters with M subbands and with critical sampling, defined respectively by its analysis and synthesis filters Hk(z) and Fk(z), where 0≦k≦M−1. The expression “critical sampling” is understood to mean that the total number of output samples of the subbands corresponds to the number of input samples. This bank of filters is also assumed to satisfy the perfect reconstruction condition.
We firstly consider a transfer matrix S(z) corresponding to the scalar filter S(z), which is expressed as follows:
where Sk(z) (0≦k≦M−1) are the polyphase components of the filter S(z).
These components are obtained directly for an FIR filter. For IIR filters, a calculational procedure is indicated in:
We thereafter determine polyphase matrices, E(z) and R(z), corresponding respectively to the banks of analysis and synthesis filters. These matrices are determined definitively for the filter bank considered.
We then calculate the matrix for complete subband filtering by the following formula: $S_{sb}(z) = z^{K}\,E(z)\,S(z)\,R(z)$, where $z^{K}$ corresponds to an advance, with $K = (L/M) - 1$ (characterizing the filter bank used), L being the length of the analysis and synthesis filters of the filter banks used.
We next construct the matrix $\tilde{S}_{sb}(z)$ whose rows are obtained from those of $S_{sb}(z)$ as follows: $[\,0\;\dots\;S_{sb}^{i,l}(z)\;\dots\;S_{sb}^{i,i}(z)\;\dots\;S_{sb}^{i,n}(z)\;\dots\;0\,]$ (0≦n≦M−1), where:
It is indicated that the number chosen δ corresponds to the number of bands that overlap sufficiently on one side with the passband of a filter of the bank of filters. It therefore depends on the type of bank of filters used in the coding chosen. By way of example, for the MDCT filter bank, δ may be taken equal to 2 or 3. For the pseudo-QMF filter bank of the MPEG-1 coding, δ is taken equal to 1.
It will be noted that the result of this transposition of a finite or infinite impulse response filter to the subband domain is a matrix of filters of size M×M. However, not all the filters of this matrix are considered during the subband filtering. Advantageously, only the filters of the main diagonal and of a few adjacent subdiagonals may be used to obtain a result similar to that obtained by filtering in the time domain (without however impairing the quality of restitution).
The matrix {tilde over (S)}sb(z) resulting from this transposition, then reduced, is that used for the subband filtering.
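A hedged sketch of this transposition procedure is given below. The polynomial matrices are stored as arrays indexed by the power of z⁻¹; the pseudo-circulant form assumed for the transfer matrix S(z) and the way the advance z^K is applied are common conventions and are assumptions here, since the exact expressions are fixed by the filter-bank theory referred to above.

```python
import numpy as np

def polyphase_components(s, M):
    """Type-1 polyphase components S_k(z) of an FIR filter s: S_k has taps s[k::M]."""
    return [np.asarray(s[k::M], dtype=float) for k in range(M)]

def pseudo_circulant(s, M):
    """Polynomial transfer matrix S(z), stored as an array of shape (deg+1, M, M).
    Assumed convention: entries S_{j-i}(z) on and above the main diagonal,
    z^{-1} S_{M+j-i}(z) below it."""
    comps = polyphase_components(s, M)
    deg = max(len(c) for c in comps)
    S = np.zeros((deg + 1, M, M))            # one extra slot for the z^{-1} factor
    for i in range(M):
        for j in range(M):
            if j >= i:
                c = comps[j - i]
                S[:len(c), i, j] = c
            else:
                c = comps[M + j - i]
                S[1:1 + len(c), i, j] = c    # multiplied by z^{-1}
    return S

def polymat_mul(A, B):
    """Product of polynomial matrices stored as (degree+1, rows, cols) arrays."""
    C = np.zeros((A.shape[0] + B.shape[0] - 1, A.shape[1], B.shape[2]))
    for a in range(A.shape[0]):
        for b in range(B.shape[0]):
            C[a + b] += A[a] @ B[b]
    return C

def subband_filter_matrix(s, E, R, M, K, delta):
    """Ssb(z) = z^K E(z) S(z) R(z), then reduction to the main diagonal plus
    `delta` adjacent diagonals. The advance z^K is applied here by discarding the
    first K coefficient matrices (assumed negligible for a perfect-reconstruction
    bank); K = (L/M) - 1, e.g. K = 1 for the MDCT bank."""
    Ssb = polymat_mul(polymat_mul(E, pseudo_circulant(s, M)), R)[K:]
    reduced = np.zeros_like(Ssb)
    for i in range(M):
        lo, hi = max(0, i - delta), min(M, i + delta + 1)
        reduced[:, i, lo:hi] = Ssb[:, i, lo:hi]   # main diagonal + delta neighbours
    return reduced
```

The polyphase matrices E(z) and R(z) are those of the filter bank considered (an MDCT example is sketched further below); δ is chosen as described above, for example δ = 2 or 3 for an MDCT bank.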
By way of example, indicated hereinbelow are the expressions for the polyphase matrices E(z) and R(z) for an MDCT filter bank, widely used in current transform-based coders such as those operating according to the MPEG-2/4 AAC standard, the Dolby AC-2 and AC-3 standards, or the Applicant's TDAC coding. The processing below may just as well be adapted to a bank of filters of pseudo-QMF type of the MPEG-1/2 Layer I-II coder.
An MDCT filter bank is generally defined by a matrix T=[tk,l], of size M×2M, whose elements are expressed as follows:
with 0≦k≦M−1 and 0≦l≦2M−1, where h[l] corresponds to the weighting window, a possible choice for which is the sinusoidal window, which is expressed in the following form:
The polyphase analysis and synthesis matrices are then given respectively by the following formulae:
$E(z) = T_1\,J_M + T_0\,J_M\,z^{-1},$
$R(z) = J_M\,T_0^{T} + J_M\,T_1^{T}\,z^{-1},$
where $J_M$ corresponds to the anti-identity matrix of size M×M, and $T_0$ and $T_1$ are matrices of size M×M resulting from the following partition: $T = [\,T_0\ \ T_1\,]$.
It is indicated that for this filter bank L=2M and K=1.
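The sketch below constructs these MDCT polyphase matrices in the same array layout as the previous sketch. Since the element formula of T and the window are not reproduced in the text above, one standard MDCT convention (sine window, TDAC-type phase) is assumed; the exact modulation and normalization used by a given coder (AAC, TDAC, etc.) may differ.

```python
import numpy as np

def mdct_modulation(M):
    """T = [t_{k,l}], size M x 2M (assumed convention: sine window, TDAC phase)."""
    l = np.arange(2 * M)
    k = np.arange(M).reshape(-1, 1)
    h = np.sin(np.pi / (2 * M) * (l + 0.5))                     # sine window h[l]
    return h * np.sqrt(2.0 / M) * np.cos(np.pi / M * (k + 0.5) * (l + 0.5 + M / 2))

def mdct_polyphase(M):
    """E(z) = T1*J_M + T0*J_M*z^{-1} and R(z) = J_M*T0^T + J_M*T1^T*z^{-1},
    stored as (degree+1, M, M) arrays (index 0 = constant term)."""
    T = mdct_modulation(M)
    T0, T1 = T[:, :M], T[:, M:]          # partition T = [T0 T1]
    J = np.fliplr(np.eye(M))             # anti-identity matrix J_M
    E = np.stack([T1 @ J, T0 @ J])
    R = np.stack([J @ T0.T, J @ T1.T])
    return E, R                          # for this bank, L = 2M and K = 1
```

These E and R can be passed, with K = 1, to the subband transposition sketched above.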
For filter banks of pseudo-QMF type of MPEG-1/2 Layer I-II, we define a weighting window h[i], i = 0, . . . , L−1, and a cosine modulation matrix Ĉ = [c_{k,l}], of size M×2M, whose coefficients are given by:
with the following relations: L = 2mM and K = 2m−1, where m is an integer. More particularly, in the case of the MPEG-1/2 Layer I-II coder, these parameters take the following values: M = 32, L = 512, m = 8 and K = 15.
The polyphase analysis matrix is then expressed as follows:
where g0(z) and g1(z) are diagonal matrices defined by:
In the MPEG-1 Audio Layer I-II standard, the values of the window $(-1)^{l}\,h(2lM+k)$ are typically provided, with 0≦k≦2M−1 and 0≦l≦m−1.
The polyphase synthesis matrix may then be deduced simply through the following formula:
$R(z) = z^{-(2m-1)}\,E^{T}(z^{-1})$
Thus, now referring to
Referring to
In the example represented in
Thus, an initial step of partial decoding of the coded signals Si is envisaged, before the spatialization processing. However, this step is much less expensive and faster than the operation of complete decoding which was required in the prior art (
It is indicated that the two vertical broken lines of
It is indicated that the document:
as well as the document [1] cited above, relate to a general procedure for calculating a transposition into the subband domain of a finite or infinite impulse response filter.
It is indicated moreover that techniques of sound spatialization in the subband domain have been proposed recently, in particular in another document:
This latter document presents a procedure making it possible to transpose a finite impulse response (FIR) filter into the subband domain of the pseudo-QMF filter banks of the MPEG-1 Layer I-II coder and of the MDCT filter bank of the MPEG-2/4 AAC coder. The equivalent filtering operation in the subband domain is represented by a matrix of FIR filters. In particular, this proposal fits within the context of a transposition of HRTF filters directly in their classical form, and not in the form of a linear decomposition over a basis of filters, within the sense of the invention, such as expressed by equation Eq[1] above. Thus, a drawback of the procedure within the sense of this latter document is that the spatialization processing cannot adapt to any number of encoded audio streams or sources to be spatialized.
It is indicated that, for a given position, each HRTF filter (of order 200 for an FIR and of order 12 for an IIR) gives rise to a (square) matrix of filters of dimension equal to the number of subbands of the filter bank used. In document [3] cited above, provision must be made for a sufficient number of HRTFs to represent the various positions in space, this posing a memory size problem if one wishes to spatialize a source at any position whatsoever in space.
On the other hand, an adaptation of a linear decomposition of the HRTFs in the subband domain, in the sense of the present invention, does not present this problem since the number (P) of matrices of basis filters Ln and Rn is much smaller. These matrices are then stored definitively in a memory (of the content server or of the restitution terminal) and allow simultaneous spatialization processing of any number of sources whatsoever, as represented in
Described hereinbelow is a generalization of the spatialization processing within the sense of
The aforesaid system may also take the form of a sound rendition system consisting in decoding the signals emanating from the sound pick-up so as to adapt them to the sound rendition transducer devices (such as a plurality of loudspeakers or a stereophonic type headset). The p signals are transformed into n signals which feed the n loudspeakers.
By way of example, binaural synthesis consists in carrying out a pick-up of real sound with the aid of a pair of microphones introduced into the ears of a human head (artificial or real). Recording may also be simulated by carrying out the convolution of a monophonic sound with the pair of HRTFs corresponding to a desired direction of the virtual sound source. On the basis of one or more monophonic signals originating from predetermined sources, two signals (left ear and right ear) are obtained, corresponding to a so-called “binaural encoding” phase; these two signals are thereafter simply applied to a headset with two earpieces (such as a stereophonic headset).
However, other encodings and decodings are possible on the basis of the decomposition of filters corresponding to transfer functions over a basis of filters. As indicated hereinabove, the spatial and frequency dependencies of the transfer functions, of the HRTF type, are separated by virtue of a linear decomposition and may be written as a sum of products of spatial functions Ci(θ,φ) and of reconstruction filters Li(f) which depend on frequency:
However, it is indicated that this expression may be generalized to any type of encoding, for n sound sources Sj(f) and an encoding format comprising p signals at output, to:
where, for example in the case of binaural synthesis, Xij may be expressed in the form of a product of the filters of gains Gj and of the coefficients Cij, Dij.
We refer to
Likewise, a general relation, for a decoding format comprising p signals Ei(f) and a sound rendition format comprising m signals, is given by:
For a given sound rendition system, the filters Kji(f) are fixed and depend, at constant frequency, only on the sound rendition system and its disposition with respect to a listener. This situation is represented in
Of course, several decoding systems may be arranged in series, according to the application in mind.
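A hedged sketch of these generic encoding and decoding relations, written per frequency bin, is given below; the matrix names X and K only mirror the notation of the text, and a frequency-flat K is used for brevity (a frequency-dependent K_ji(f) would simply be applied bin by bin).

```python
# p encoded signals are obtained from n sources through the position-dependent
# gains/filters X (p x n); m rendition signals are obtained from the p encoded
# signals through the fixed rendition filters K (m x p).
import numpy as np

def encode(sources_f, X):
    """sources_f: (n, n_bins) source spectra; X: (p, n) encoding gains -> (p, n_bins)."""
    return X @ sources_f

def decode(encoded_f, K):
    """encoded_f: (p, n_bins); K: (m, p) rendition gains -> (m, n_bins) feeds."""
    return K @ encoded_f
```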
For example, in the bidimensional ambisonic context of order 1, an encoding format with three signals W, X, Y for n sound sources is expressed, for the encoding, by:
$E_1 = W = \sum_{j=1}^{n} S_j$
$E_2 = X = \sum_{j=1}^{n} \cos(\theta_j)\,S_j$
$E_3 = Y = \sum_{j=1}^{n} \sin(\theta_j)\,S_j$
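By way of illustration, a minimal sketch of this first-order bidimensional ambisonic encoding is given below (the function name is purely illustrative).

```python
# W = sum_j S_j, X = sum_j cos(theta_j) S_j, Y = sum_j sin(theta_j) S_j
import numpy as np

def ambisonic_encode_2d(sources, azimuths):
    """sources: (n, T) time signals; azimuths: (n,) angles theta_j in radians."""
    W = sources.sum(axis=0)
    X = (np.cos(azimuths)[:, None] * sources).sum(axis=0)
    Y = (np.sin(azimuths)[:, None] * sources).sum(axis=0)
    return np.stack([W, X, Y])   # shape (3, T): the encoded signals E1, E2, E3
```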
For the “ambisonic” decoding at a restitution device with five loudspeakers over two frequency bands [0, f1] and [f1, f2], with f1 = 400 Hz and f2 corresponding to the passband of the signals considered, the filters Kji(f) take constant numerical values on these two frequency bands, given in Tables I and II below.
Of course, different methods of spatialization (ambisonic context and binaural and/or transaural synthesis) may be combined at a server and/or at a restitution terminal, such methods of spatialization complying with the general expression for a linear decomposition of transfer functions in the frequency space, as indicated hereinabove.
Described hereinbelow is an implementation of the method within the sense of the invention in an application related to a teleconference between remote terminals.
Referring again to
This spatialization may be static or dynamic and, furthermore, interactive. Thus, the position of the talkers is fixed or may vary over time. If the spatialization is not interactive, the position of the various talkers is fixed: the listener cannot modify it. On the other hand, if the spatialization is interactive, each listener can configure his terminal so as to position the voice of the other N talkers where he so desires, substantially in real time.
Referring now to
Described hereinbelow is the case of “streaming” or of downloading of a sound scene, in particular in the context of compression coding according to the MPEG-4 standard.
This scene may be simple, or else complex as often within the framework of MPEG-4 transmissions, where the sound scene is transmitted in a structured format. In the MPEG-4 context, the client terminal receives, from a multimedia server, a multiplexed binary stream corresponding to each of the coded primitive audio objects, as well as instructions as to their composition for reconstructing the sound scene. The expression “audio object” is understood to mean an elementary binary stream obtained via an audio MPEG-4 coder. The MPEG-4 System standard provides a special format, called “AudioBIFS” (standing for “Binary Format for Scene description”), so as to transmit these instructions. The role of this format is to describe the spatio-temporal composition of the audio objects. To construct the sound scene and ensure a certain rendition, these various decoded streams may undergo subsequent processing. Particularly, a sound spatialization processing step may be performed.
In the “AudioBIFS” format, the manipulations to be performed are represented by a graph. The decoded audio signals are provided as input to the graph. Each node of the graph represents a type of processing to be carried out on an audio signal. The various sound signals to be restored or to be associated with other media objects (images or the like) are provided as output from the graph.
The algorithms used are updated dynamically and are transmitted together with the graph of the scene. They are described in the form of routines written in a specific language such as “SAOL” (standing for “Structured Audio Orchestra Language”). This language possesses predefined functions which include, in particular and in an especially advantageous manner, FIR and IIR filters (which may then correspond to HRTFs, as indicated hereinabove).
Furthermore, in the audio compression tools provided by the MPEG-4 standard, there are transform-based coders used especially for high quality audio transmission (multiphonic and multichannel). Such is the case for the AAC and TwinVQ coders based on the MDCT transform.
Thus, in the MPEG-4 context, the tools making it possible to implement the method within the sense of the invention are already present.
In a receiver MPEG-4 terminal, it is then sufficient to integrate the bottom decoding layer with the nodes of the upper layer which ensure particular processing, such as binaural spatialization by HRTF filters. Thus, after partial decoding of the demultiplexed elementary audio binary streams arising from one and the same type of coder (MPEG-4 AAC for example), the nodes of the “AudioBIFS” graph which involve binaural spatialization may be processed directly in the subband domain (MDCT for example). The operation of synthesis based on the filter bank is performed only after this step.
In a centralized multipoint teleconferencing architecture such as represented in
It is understood that a reduction in the complexity of the processing is especially desired in this case. Specifically, for a conference with N terminals (N≧3), the audio bridge must carry out spatialization of the talkers arising from the terminals for each of the N subsets consisting of (N−1) talkers from among the N participants in the conference. Processing in the coded domain is of course all the more beneficial.
Additionally, as indicated hereinabove, the position of the sound source to be spatialized may vary over time, this amounting to making the directional coefficients of the subband domain Cni and Dni vary over time. The variation of the value of these coefficients is preferably effected in a discrete manner.
Of course, the present invention is not limited to the embodiments described hereinabove by way of examples but extends to other variants defined within the framework of the claims hereinbelow.