The present invention relates to the field of the coding/decoding of digital signals.
The coding and the decoding according to the invention are adapted in particular to the transmission and/or the storage of digital signals such as audiofrequency signals (speech, music or other).
More particularly, the present invention pertains to the parametric multichannel coding and decoding of multichannel audio signals.
The invention is therefore concerned with multichannel signals, and in particular with binaural signals which are sound signals recorded with microphones placed at the entrance of the canal of each ear (of a person or of a mannequin) or else synthesized artificially by way of filters known as HRIR (Head-Related Impulse Response) filters in the time domain or HRTF (Head-Related Transfer Function) filters in the frequency domain, which are dependent on the direction and distance of the sound source and the morphology of the subject.
Binaural signals are associated with listening typically with a headset or earpiece and exhibit the advantage of representing a spatial image giving the illusion of being naturally in the midst of a sound scene; it therefore entails reproduction of the sound scene in 3D with only 2 channels. It will be noted that it is possible to listen to a binaural sound on loudspeakers by way of complex processings for inverting the HRIR/HRTF filters and for reconstructing binaural signals.
Here we distinguish binaural signals from stereo signals. A stereo signal also consists of two channels but it does not in general allow perfect reproduction of the sound scene in 3D. For example, a stereo signal can be constructed by taking a given signal on the left channel and a zero signal on the right channel. Listening to such a signal will locate the sound source on the left, but in a natural environment this stratagem is not possible, since the signal at the right ear is a filtered version (including a time shift and an attenuation) of the signal at the left ear, as a function of the person's morphology.
Parametric multichannel coding is based on the extraction and the coding of spatial-information parameters so that, on decoding, these spatial characteristics can be used to recreate the same spatial image as in the original signal. Examples of codecs based on this principle are found in the 3GPP e-AAC+ or MPEG Surround standards.
The case of parametric stereo coding with N=2 channels is considered here by way of example, insofar as its description is simpler than in the case of N>2 channels.
A parametric stereo coding/decoding technique is for example described in the document by J. Breebaart, S. van de Par, A. Kohlrausch, E. Schuijers, entitled “Parametric Coding of Stereo Audio” in EURASIP Journal on Applied Signal Processing 2005:9, pp. 1305-1322. This example is also employed with reference to
Thus,
The temporal signals L (n) and R (n), where n is the integer index of the samples, are processed by the blocks 101, 102, 103 and 104 which perform a short-term Fourier analysis. The transformed signals L[k] and R[k], where k is the integer index of the frequency coefficients, are thus obtained.
The block 105 performs a channels reduction processing or “downmix” in English to obtain in the frequency domain on the basis of the left and right signals, a monophonic signal hereinafter named mono signal. Several techniques have been developed for stereo to mono channel reduction or “downmix” processing. This “downmix” can be performed in the time or frequency domain. One generally distinguishes:
Extraction of spatial-information parameters is also performed in the block 105. The extracted parameters are the following.
The parameters ICLD or ILD or CLD (for “InterChannel/Channel Level Difference” in English), also called differences of interchannel intensity, characterize the ratios of energy per frequency sub-band between the left and right channels. These parameters make it possible to position sound sources in the stereo horizontal plane by “panning”. They are defined in dB by the following formula:
where L[k] and R[k] correspond to the (complex) spectral coefficients of the channels L and R, each frequency band of index b=0, . . . , B−1 comprises the frequency spectral lines in the interval $[k_b, k_{b+1}-1]$, the symbol * indicates the complex conjugate and B is the number of sub-bands.
The parameters ICPD or IPD (for “InterChannel Phase Difference” in English), also called phase differences, are defined according to the following relation:
$$\mathrm{ICPD}[b] = \angle\left(\sum_{k=k_b}^{k_{b+1}-1} L[k]\,R^{*}[k]\right) \tag{2}$$
where ∠ indicates the argument (the phase) of the complex operand.
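As a minimal sketch of how the level and phase differences defined above could be computed per sub-band, the following Python fragment assumes the usual energy-ratio form of the ICLD in dB and takes the ICPD as the phase of the summed cross-spectrum L[k]·R*[k]. The function name `icld_icpd`, the boundary list `k_bounds` and the guard value `eps` are illustrative assumptions and are not taken from the cited article.

```python
import cmath
import math

def icld_icpd(L, R, k_bounds, eps=1e-12):
    """Return two lists: ICLD in dB and ICPD in radians, one value per sub-band.

    L, R     : sequences of complex spectral coefficients.
    k_bounds : hypothetical boundaries [k_0, ..., k_B]; sub-band b covers
               the indices k_b .. k_{b+1}-1.
    eps      : illustrative guard against a zero-energy channel.
    """
    icld, icpd = [], []
    for b in range(len(k_bounds) - 1):
        lo, hi = k_bounds[b], k_bounds[b + 1]
        energy_l = sum(abs(L[k]) ** 2 for k in range(lo, hi))
        energy_r = sum(abs(R[k]) ** 2 for k in range(lo, hi))
        cross = sum(L[k] * R[k].conjugate() for k in range(lo, hi))
        icld.append(10.0 * math.log10((energy_l + eps) / (energy_r + eps)))
        icpd.append(cmath.phase(cross))
    return icld, icpd
```

With this sign convention a positive ICLD means that the left channel carries more energy, and a positive ICPD means that the left channel leads in phase.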
It is also possible to define in an equivalent manner to the ICPD, an interchannel time shift called ICTD or ITD (for “InterChannel Time Difference” in English). The ITD can for example be measured as the delay which maximizes the intercorrelation between L and R:
where d defines the search interval for the maximum. It will be noted that the correlation in equation (3) can be normalized.
In contradistinction to the parameters ICLD, ICPD and ICTD which are location parameters, the parameter ICC (for “InterChannel Coherence” in English) represents the level of inter-channel correlation (or coherence) and is associated with the spatial width of a sound source; the ICC can be defined as:
where the correlation can be normalized just as for eq. (3).
It is noted in the article by Breebaart et al. that the ICC parameters are not necessary in the sub-bands that are reduced to a single frequency coefficient; indeed, the differences of amplitude and of phase completely describe the spatialization in this “degenerate” case.
The ICLD and ICPD parameters are extracted by analysis of the stereo signals, by the block 105. The ICTD or ICC parameters can also be extracted per sub-band on the basis of the spectra L[k] and R[k]; however their extraction is in general simplified by assuming an identical interchannel time shift for each sub-band and in this case a parameter can be extracted on the basis of the temporal channels L(n) and R(n).
The mono signal M[k] is transformed into the time domain (blocks 106 to 108) after short-term Fourier synthesis (inverse FFT, windowing and OverLap-Add or OLA in English) and a mono coding (block 109) is carried out thereafter. In parallel the stereo parameters are quantized and coded in the block 110.
In general the spectrum of the signals (L[k], R[k]) is divided according to a non-linear frequency scale of ERB (Equivalent Rectangular Bandwidth) or Bark type. The parameters (ICLD, ICPD, ICC, ITD) are coded by scalar quantization optionally followed by an entropy coding and/or by a differential coding. For example, in the article cited above, the ICLD is coded by a non-uniform quantizer (ranging from −50 to +50 dB) with differential entropy coding. The non-uniform quantization step exploits the fact that the larger the value of the ICLD the lower the auditory sensitivity to the variations of this parameter.
For the coding of the mono signal (block 109), several quantization techniques with or without memory are possible, for example “Pulse-Code Modulation” (PCM) coding, its version with adaptive prediction termed “Adaptive Differential Pulse-Code Modulation” (ADPCM) or more advanced techniques such as transform-based perceptual coding or “Code Excited Linear Prediction” (CELP) coding or multi-mode coding.
One is concerned here more particularly with the 3GPP EVS (for “Enhanced Voice Services”) standard which uses multi-mode coding. The algorithmic details of the EVS codec are provided in the specifications 3GPP TS 26.441 to 26.451 and they are therefore not repeated here. Hereinafter, these specifications will be referred to by the name EVS.
The input signal of the (mono) EVS codec is sampled at the frequency of 8, 16, 32 or 48 kHz and the codec can represent telephone audio bands (narrowband, NB), wide (wideband, WB), super-wide (super-wideband, SWB) or full band (fullband, FB). The bitrates of the EVS codec are divided into two modes:
To this is added the discontinuous-transmission mode (DTX) in which the frames detected as inactive are replaced with SID frames (SID Primary or SID AMR-WB IO) which are transmitted in an intermittent manner, about once every 8 frames.
At the decoder 200, with reference to
An exemplary parametric stereo coding seeking to represent binaural signals (without regard for the nature of the HRTF filters) is described in the article by Pasi Ojala, Mikko Tammi, Miikka Vilermo, entitled “Parametric binaural audio coding”, in Proc. ICASSP, 2010, pp. 393-396. Two parameters are coded to restore a spatial image with a location close to a binaural image: the ICLD and the ITD. Moreover a parameter ALC (for “Ambience Level Control” in English) similar to the ICC is also coded, making it possible to control the level of the “ambience” associated with the use of decorrelated channels. This codec is described for signals in the super-wide band with 20-ms frames and a bitrate of 20 or 32 kbit/s to code the mono signal to which is added a bitrate of 5 kbit/s to code the spatial parameters.
Another exemplary parametric stereo codec developed with a specific mode to code binaural signals is given by the standard G.722 Annex D, in particular in the stereo coding mode R1ws in wideband at 56+8 kbit/s. This codec operates with “short” frames of 5 ms according to 2 modes: a “transient” mode where ICLDs are coded on 38 bits and a “normal” mode where ICLDs are coded on 24 bits with a full-band ITD/IPD on 5 bits. The details of estimating the ITD and of coding the ICLD and ITD parameters are not repeated here. It will be noted that the ICLDs are coded by “decimation”, that is, by distributing the coding of the ICLDs over several successive frames and coding only a subset of the parameters in a given frame.
In the two examples it is important to note that one is not dealing with binaural codecs, but with stereo codecs seeking to reproduce a spatial image similar to a binaural signal.
It will be noted that the case of parametric multichannel coding with N>2 follows the same principle as the case N=2, however in general the downmix might not be mono but stereo and the inter-channel parameters must cover more than 2 channels. An exemplary embodiment is given in the MPEG Surround standard where ICLD, ICTD and ICC parameters are coded. It will also be noted that the MPEG Surround decoder includes a binaural restoration, parametrized by HRTF filters.
Let us consider now the case of a stereo coding and decoding of parameters of ICLD type such as is described in
where $\sigma_L^2[b]$ and $\sigma_R^2[b]$ represent respectively the energy of the left channel (L[k]) and of the right channel (R[k]):
According to the prior art, the coding of a block of 35 ICLD of a given frame can be carried out for example with:
thus giving a total of 5+32×4+2×3=139 bits/frame, i.e. a bitrate of close to 7 kbit/s in the case of 20-ms frames. This bitrate does not comprise the other parameters.
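The arithmetic behind this figure can be checked with a few lines of Python. This is only an illustrative sketch: the (count, bits) pairs are read off the sum quoted above, and their mapping to individual ICLD parameters is an assumption, as is the helper name `bitrate_kbps`.

```python
def bitrate_kbps(bits_per_frame, frame_ms):
    """Bits per frame divided by the frame duration in ms gives kbit/s."""
    return bits_per_frame / frame_ms

# (count, bits) pairs read off the sum 5 + 32x4 + 2x3; which ICLD
# parameters receive which resolution is assumed for illustration.
allocation = [(1, 5), (32, 4), (2, 3)]

n_params = sum(count for count, _ in allocation)
bits = sum(count * nbits for count, nbits in allocation)
rate = bitrate_kbps(bits, 20)
print(n_params, "parameters,", bits, "bits/frame,", rate, "kbit/s")
```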
This bitrate of approximately 7 kbit/s can be reduced on average by using a variable-bitrate entropy coding, for example a Huffman coding; however, in most cases, a drastic bitrate reduction will not be possible.
To halve the bitrate of the coding of the ICLD parameters, it would be possible to use the alternate coding approach described previously in the case of stereo G.722 coding. However, the associated bitrate remains significant for a coding with 35 sub-bands and 20 ms of frame; moreover, the temporal resolution of the coding would be reduced and this may be problematic in the case of non-stationary signals. Another approach would consist in reducing the number of sub-bands to go from 35 to for example 20 sub-bands. This would reduce the bitrate associated with the ICLD parameters, but would in general degrade the fidelity of the synthesized spatial image.
If it is assumed that the coder of
A need therefore exists to represent the spatial parameters of a multichannel signal in an efficient manner, at as low a bitrate as possible and with acceptable quality.
The invention improves the situation of the prior art.
For this purpose, it proposes a method of parametric coding of a multichannel digital audio signal comprising a step of coding a signal arising from a channels reduction processing applied to the multichannel signal and of coding spatialization cues in respect of the multichannel signal. The method is such that it comprises the following steps:
The scheme for coding the spatialization cues relies on a model-based approach which makes it possible to approximate the spatial cues. Thus the coding of a plurality of spatial cues is reduced to the coding of an angle parameter thereby considerably reducing the coding bitrate with respect to the direct coding of the spatial cue. The bitrate required for the coding of this parameter is therefore reduced.
In a particular embodiment based on sub-bands, the spatialization cues are defined by frequency sub-bands of the multichannel audio signal and at least one angle parameter per sub-band is determined and coded.
In a particular embodiment, the method furthermore comprises the steps of calculating a reference spatialization cue and of coding this reference spatialization cue.
Thus, the coding of a reference cue can improve decoding quality, and coding this reference cue does not require a significant bitrate.
This scheme is particularly well suited to the coding of the spatial cue of interchannel time shift (ITD) type and/or of interchannel intensity difference (ILD) type.
To further improve the quality of decoding of the cue of ILD type, the method furthermore comprises the following steps:
The coding of this residual requires an additional coding bitrate but this scheme still affords a gain in bitrate with respect to the direct coding of the ILD spatialization cue.
In a particular embodiment, a spatialization-cue-based representation model is obtained. It can be fixed and stored in memory.
This fixed and recorded model is for example a model of sine form. This type of model is adapted to suit the form of the ITD or ILD cue according to the position of the source.
In a variant embodiment, the obtaining of a representation model of the spatialization cues is performed by selecting from a table of models defined for various values of the spatialization cues.
Several models may be selectable as a function of characteristics of the multichannel signal. This makes it possible to best adapt the spatialization cue model to the signal.
In one embodiment, the index of the chosen model can then be coded and transmitted.
In a variant embodiment a representation model common to several spatialization cues is obtained.
This makes it possible to pool the selection of a model to several spatialization cues, thereby reducing the processing operations to be performed.
The invention also pertains to a method of parametric decoding of a multichannel digital audio signal comprising a step of decoding a signal arising from a channels reduction processing applied to the multichannel and coded signal and of decoding spatialization cues in respect of the multichannel signal. The method is such that it comprises the following steps for decoding at least one spatialization cue:
In the same way as for the coding, this scheme based on the use of a representation model of the spatialization cues makes it possible to retrieve the cue with good quality without it being necessary to have too large a bitrate. At reduced bitrate, a plurality of spatialization cues is retrieved by decoding a simple angle parameter.
In a particular embodiment, the method comprises a step of receiving and decoding an index of table of models and of obtaining the at least one representation model of the spatialization cues to be decoded on the basis of the decoded index.
Thus, it is possible to adapt the model to be used according to the characteristics of the multichannel signal.
The invention pertains to a parametric coder of a multichannel digital audio signal comprising a module for coding a signal arising from a module for channels reduction processing applied to the multichannel signal and modules for coding spatialization cues in respect of the multichannel signal. The coder is such that it comprises:
The coder exhibits the same advantages as the method that it implements.
The invention pertains to a parametric decoder of a multichannel digital audio signal comprising a module for decoding a signal arising from a channels reduction processing applied to the multichannel and coded signal and a module for decoding spatialization cues in respect of the multichannel signal. The decoder is such that it comprises:
The decoder exhibits the same advantages as the method that it implements.
Finally, the invention pertains to a computer program comprising code instructions for the implementation of the steps of a coding method according to the invention, when these instructions are executed by a processor, and to a computer program comprising code instructions for the implementation of the steps of a decoding method according to the invention, when these instructions are executed by a processor.
The invention also pertains to a storage medium readable by a processor on which is recorded a computer program comprising code instructions for the execution of the steps of the coding method such as described and/or of the decoding method such as described.
Other characteristics and advantages of the invention will become more clearly apparent on reading the following description, given solely by way of nonlimiting example and with reference to the appended drawings in which:
With reference to
The case of a signal with two channels is described here. The invention also applies to the case of a multichannel signal with a number of channels greater than 2.
To avoid overburdening the text, the coder described in
This parametric stereo coder such as illustrated uses an EVS mono coding according to the specifications 3GPP TS 26.442 (fixed-point source code) or TS 26.443 (floating-point source code). It operates with stereo or multichannel signals sampled at a sampling frequency Fs of 8, 16, 32 or 48 kHz, with 20-ms frames. Hereinafter, with no loss of generality, the description is given mainly for the case Fs=16 kHz and for the case N=2 channels.
It should be noted that the choice of a frame length of 20 ms is not in any case restrictive in the invention which applies likewise in variants of the embodiment where the frame length is different, for example 5 or 10 ms, with a codec other than EVS.
Moreover, the invention applies likewise to other types of mono coding (e.g.: IETF Opus, ITU-T G.722) operating at identical or non-identical sampling frequencies.
Each temporal channel (L(n) and R(n)) sampled at 16 kHz is firstly pre-filtered by a high-pass filter (HPF for High Pass Filter in English) typically eliminating the components below 50 Hz (blocks 301 and 302). This pre-filtering is optional, but it can be used to avoid the bias due to the continuous component (DC) in the estimation of parameters such as the ICTD or the ICC.
The channels L′(n) and R′(n) arising from the pre-filtering blocks are analyzed in terms of frequencies by discrete Fourier transform with sinusoidal windowing with overlap of 50% of length 40 ms i.e. 640 samples (blocks 303 to 306). For each frame, the signal (L′(n), R′(n)) is therefore weighted by a symmetric analysis window covering 2 frames of 20 ms i.e. 40 ms (or 640 samples for Fs=16 kHz). The 40-ms analysis window covers the current frame and the future frame. The future frame corresponds to a “future” signal segment commonly called “lookahead” of 20 ms. In variants of the invention, other windows could be used, for example a low-delay asymmetric window called “ALDO” in the EVS codec. Moreover, in variants, the analysis windowing could be rendered adaptive as a function of the current frame, so as to use an analysis with a long window on stationary segments and an analysis with short windows on transient/non-stationary segments, optionally with transition windows between long and short windows.
For the current frame of 320 samples (20 ms at Fs=16 kHz), the spectra obtained, L[k] and R[k] (k=0 . . . 320), comprise 321 complex coefficients, with a resolution of 25 Hz per frequency coefficient. The coefficient of index k=0 corresponds to the continuous component (0 Hz); it is real. The coefficient of index k=320 corresponds to the Nyquist frequency (8000 Hz for Fs=16 kHz); it is also real. The coefficients of index 0<k<320 are complex and correspond to a sub-band of width 25 Hz centered on the frequency of k.
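The figures quoted here follow directly from the sampling frequency and the window length. The short Python sketch below recomputes them; the function name `analysis_params` and its argument names are hypothetical, and a real-input DFT over the full window is assumed.

```python
def analysis_params(fs_hz, frame_ms=20, window_frames=2):
    """Return (frame length, window length, number of bins, resolution in Hz).

    Assumes a real-input DFT computed over the whole analysis window,
    so that only the bins from 0 Hz up to the Nyquist frequency are kept.
    """
    frame_len = fs_hz * frame_ms // 1000      # samples per frame
    win_len = window_frames * frame_len       # samples per analysis window
    n_bins = win_len // 2 + 1                 # bins from DC to Nyquist
    resolution_hz = fs_hz / win_len           # spacing between bins
    return frame_len, win_len, n_bins, resolution_hz

print(analysis_params(16000))
```

For Fs=16 kHz this gives a 320-sample frame, a 640-sample window, 321 bins and a spacing of 25 Hz, so that the bin of index 320 falls at 8000 Hz.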
The spectra L[k] and R[k] are combined in the block 307 to obtain a mono signal (downmix) M[k] in the frequency domain. This signal is converted into time by inverse FFT and windowing-overlap with the “lookahead” part of the previous frame (blocks 308 to 310).
An example of frequency “downmix” technique is described in the document entitled “A stereo to mono downmixing scheme for MPEG-4 parametric stereo encoder” by Samsudin, E. Kurniawati, N. Boon Poh, F. Sattar, S. George, in Proc. ICASSP, 2006.
In this document, the L and R channels are aligned in phase before performing the channels reduction processing.
More precisely, the phase of the L channel for each frequency sub-band is chosen as the reference phase, the R channel is aligned according to the phase of the L channel for each sub-band through the following formula:
$$R'[k] = e^{j\cdot\mathrm{ICPD}[b]}\,R[k] \tag{7}$$
where R′[k] is the aligned R channel, k is the index of a coefficient in the bth frequency sub-band, ICPD[b] is the inter-channel phase difference in the bth frequency sub-band given by equation (2).
Note that when the sub-band of index b is reduced to a frequency coefficient, we find:
$$R'[k] = |R[k]|\cdot e^{j\angle L[k]} \tag{8}$$
Finally the mono signal obtained by the “downmix” of the document of Samsudin et al. cited previously is calculated by averaging the L channel and the aligned R′ channel, according to the following equation:
$$M[k] = \frac{L[k] + R'[k]}{2} \tag{9}$$
The phase alignment therefore makes it possible to preserve the energy and to avoid the problems of attenuation by eliminating the influence of the phase. This “downmix” corresponds to the “downmix” described in the document by Breebaart et al. where:
M[k]=w1L[k]+w2R[k] (10)
with w1=0.5 and
in the case where the sub-band of index b comprises only a frequency value of index k.
Other “downmix” schemes can of course be chosen without modifying the scope of the invention.
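The phase-aligned downmix described above can be sketched as follows in Python. This is an illustrative reading of the text and not the reference implementation of the cited document: the function name `phase_aligned_downmix` and the boundary list `k_bounds` are assumptions, and the ICPD of each sub-band is taken as the phase of the summed cross-spectrum L[k]·R*[k].

```python
import cmath

def phase_aligned_downmix(L, R, k_bounds):
    """Average L with a copy of R rotated onto the phase of L, per sub-band.

    L, R     : sequences of complex spectral coefficients.
    k_bounds : hypothetical boundaries [k_0, ..., k_B]; sub-band b covers
               the indices k_b .. k_{b+1}-1.
    """
    M = [0j] * len(L)
    for b in range(len(k_bounds) - 1):
        lo, hi = k_bounds[b], k_bounds[b + 1]
        cross = sum(L[k] * R[k].conjugate() for k in range(lo, hi))
        rotation = cmath.exp(1j * cmath.phase(cross))   # e^{j*ICPD[b]}
        for k in range(lo, hi):
            aligned_r = rotation * R[k]                 # aligned R channel
            M[k] = 0.5 * (L[k] + aligned_r)             # average of L and R'
    return M
```

With L=[1] and R=[−1] in a single one-coefficient sub-band, a plain average would cancel to zero, whereas the aligned average keeps a magnitude of 1, which illustrates the attenuation problem that the phase alignment avoids.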
The algorithmic delay of the EVS codec is 30.9375 ms at Fs=8 kHz and 32 ms for the other frequencies Fs=16, 32 or 48 kHz. This delay includes the current frame of 20 ms; the additional delay with respect to the frame length is therefore 10.9375 ms at Fs=8 kHz and 12 ms for the other frequencies (i.e. 192 samples at Fs=16 kHz). The mono signal is delayed (block 311) by T=320−192=128 samples so that the delay accumulated between the mono signal decoded by EVS and the original stereo channels becomes a multiple of the frame length (320 samples). Accordingly, to synchronize the extraction of stereo parameters (block 314) with the spatial synthesis performed at the decoder on the basis of the mono signal, an additional delay of 2 frames (40 ms) with respect to the current frame is needed: 20 ms for the lookahead used to calculate the mono signal, and 20 ms for the mono coding/decoding delay to which the delay T is added to align the mono synthesis. This delay of 2 frames is specific to the implementation detailed here; in particular it is related to the 20-ms sinusoidal symmetric windows. This delay could be different. In a variant embodiment, it would be possible to obtain a delay of one frame with an optimized window having a smaller overlap between adjacent windows and with a block 311 introducing no delay (T=0).
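The delay bookkeeping above can be verified numerically. The Python sketch below covers only the sampling frequencies for which the additional codec delay is 12 ms (16, 32 and 48 kHz); the function name `mono_alignment_delay` and its arguments are hypothetical.

```python
def mono_alignment_delay(fs_hz, frame_ms=20, codec_extra_delay_ms=12):
    """Return (frame length, extra codec delay, alignment delay T) in samples.

    T is chosen so that the extra codec delay plus T equals one frame.
    The 12 ms default applies to Fs = 16, 32 or 48 kHz only.
    """
    frame_len = fs_hz * frame_ms // 1000
    extra = fs_hz * codec_extra_delay_ms // 1000
    return frame_len, extra, frame_len - extra

frame_len, extra, T = mono_alignment_delay(16000)
lookahead = frame_len                      # 20 ms of lookahead for the mono signal
total_delay = lookahead + extra + T        # delay relative to the current frame
print(frame_len, extra, T, total_delay)
```

At Fs=16 kHz this yields T=128 samples and a total of 640 samples, i.e. the 2 frames (40 ms) stated in the text.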
The shifted mono signal is thereafter coded (block 312) by the mono EVS coder for example at a bitrate of 13.2, 16.4 or 24.4 kbit/s. In variants, the coding could be performed directly on the unshifted signal; in this case the shift could be performed after decoding.
In a particular embodiment of the invention, illustrated here in
It would be possible in a more advantageous manner in terms of quantity of data to be stored, to shift the outputs of the parameters extraction block 314 or else the outputs of the quantization blocks 318, 316 and 319. It would also be possible to introduce this shift at the decoder on receiving the binary train of the stereo coder.
In parallel with the mono coding, the coding of the spatial cue is implemented in the blocks 315 to 319 according to a coding method of the invention. Moreover, the coding comprises an optional step of classifying the input signal in the block 321.
This classification block can make it possible to pass from one coding mode to another according to the multichannel signal to be coded, one of the coding modes being that which implements the invention for the coding of the spatialization cues. The other coding modes are not detailed here, but conventional techniques for stereo or multichannel coding can be used, including techniques for parametric coding with ILD, ITD, IPD, ICC parameters. The classification is indicated here with the L and R temporal signals as input; optionally, the signals in the frequency domain and the stereo or multichannel parameters can also serve for the classification. The classification can also be used to apply the invention to a given spatial parameter (for example to code the ITD or the ILD), in other words to switch the type of coding of spatial parameters, with a possible choice between a model-based coding scheme as in the invention and an alternative coding scheme of the prior art.
The spatial parameters are extracted (block 314) on the basis of the spectra L[k], R[k] and M[k] shifted by two frames: Lbuf[k], Rbuf[k] and Mbuf[k] and coded (blocks 315 to 319) according to a coding method described with reference to
For the extraction of the parameters ILD (block 314), the spectra Lbuf[k] and Rbuf[k] are for example sliced into frequency sub-bands.
In one embodiment, a ⅓ octave sub-band slicing defined in array 1 hereinbelow will be taken:
Array 1
This array covers all the cases of sampling frequency, for example for a coder with a sampling frequency at 16 kHz only the first B=20 sub-bands will be retained. Thus, it will be possible to define the array:
The above array delimits (as index of Fourier spectral lines) the frequency sub-bands of index b=0 to B−1 for the case Fs=16 kHz. Each sub-band of index b comprises the coefficients $k_b$ to $k_{b+1}-1$. The frequency spectral line of index k=320 which corresponds to the Nyquist frequency is not taken into account here.
In variants, it will be possible to use another sub-band slicing, for example according to the ERB scale; in this case, it will be possible to use B=35 sub-bands, the latter are defined by the following boundaries in the case where the input signal is sampled at 16 kHz:
The above array delimits (as index of Fourier spectral lines) the frequency sub-bands of index b=0 to B−1. For example the first sub-band (b=0) goes from the coefficient $k_b=0$ to $k_{b+1}-1=0$; it is therefore reduced to a single coefficient which represents 25 Hz. Likewise, the last sub-band (b=34) goes from the coefficient $k_b=307$ to $k_{b+1}-1=319$; it comprises 12 coefficients (300 Hz). The frequency spectral line of index k=320 which corresponds to the Nyquist frequency is not taken into account here.
For each frame, the ILD of the sub-band b=0, . . . , B−1 is calculated according to equations (5) and (6) repeated here:
where $\sigma_L^2[b]$ and $\sigma_R^2[b]$ represent respectively the energy of the left channel (Lbuf[k]) and of the right channel (Rbuf[k]):
According to a particular embodiment, the parameters ITD and ICC are extracted in the time domain (block 320). In variants of the invention these parameters could be extracted in the frequency domain (block 314), this not being represented in
In one embodiment the parameters ITD and ICC are estimated in the following manner. The ITD is sought by intercorrelation according to equation (3) repeated here:
$$\mathrm{ITD} = \arg\max_{-d\le\tau\le d}\ \sum_{n=0}^{N-\tau-1} L(n+\tau)\cdot R(n) \tag{13}$$
with for example d=630 μs×Fs, i.e. 10 samples at 16 kHz. This value of 630 μs is obtained for the binaural case, on the basis of Woodworth's law defined hereinafter, with a spherical approximation of the head (with a mean radius α=8.5 cm) and an azimuth θ=π/2.
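A minimal Python sketch of this search is given below, under the assumption that the ITD is the lag that maximizes the unnormalized intercorrelation and that samples outside the frame are simply ignored. The names `max_lag_samples` and `estimate_itd` are hypothetical, and the smoothing step mentioned in the text is not covered.

```python
def max_lag_samples(fs_hz, max_itd_s=630e-6):
    """Search bound d in samples, e.g. 10 samples at 16 kHz."""
    return round(max_itd_s * fs_hz)

def estimate_itd(left, right, d):
    """Return the lag tau in [-d, d] maximizing sum_n left[n + tau] * right[n].

    A positive result means that the left channel lags the right channel.
    Only index pairs that fall inside the frame contribute to the sum.
    """
    n_samples = len(left)
    best_tau, best_corr = 0, float("-inf")
    for tau in range(-d, d + 1):
        start = max(0, -tau)
        stop = min(n_samples, n_samples - tau)
        corr = sum(left[n + tau] * right[n] for n in range(start, stop))
        if corr > best_corr:
            best_tau, best_corr = tau, corr
    return best_tau
```

For example, if `right` contains a single pulse at sample 5 and `left` the same pulse at sample 8, the function returns a lag of 3 samples.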
The ITD obtained according to equation (3) is thereafter smoothed to attenuate its temporal variations. The benefit of the smoothing is to attenuate the fluctuations of the instantaneous ITD which may degrade the quality of the spatial synthesis at the decoder. The smoothing scheme adopted lies outside the scope of the invention and it is not detailed here.
During the calculation of the ITD, the ICC is also calculated according to equation (4) defined hereinabove.
The spatial parameters or cues ILD and ITD are coded according to a scheme forming the subject of the invention and described with reference to
These blocks 315 and 317 implement schemes based on models of respective representations of the cues ITD and ILD.
Certain parameters of the respective models obtained on output from the blocks 315 and 317 are thereafter coded at 316 and 318 for example according to a scalar quantization scheme.
All the spatialization cues thus coded are multiplexed by the multiplexer 322 before being transmitted.
Certain significant notions about sound perception are recalled in
In one embodiment it is considered that the signal comprises a sound source situated in the horizontal plane.
In the case of a binaural signal, it may be useful to define the position of a virtual source associated with the multichannel signal to be coded. As illustrated in
The angle θ is defined between the frontal axis 530 of the listener and the axis of the source 520. The two ears of the listener are represented as 550R for the right ear and as 550L for the left ear. The cue in respect of time shift between the two channels of a binaural signal is associated with the interaural time difference, that is to say the difference in time that a sound takes to arrive at the two ears. If the source is directly in front of the listener, the wave arrives at the same moment at both ears and the ITD cue is zero.
The interaural time difference (ITD) can be simplified by using a geometric approximation in the form of the following sine law:
ITD(θ)=α sin(θ)/c (14)
where θ is the azimuth in the horizontal plane, α is the radius of a spherical approximation of the head and c the speed of sound (in m·s⁻¹) which can be defined as c=343 m·s⁻¹. This law is independent of frequency, and it is known to give good results in terms of spatial location.
A virtual sound source can therefore be located with an angle θ and the ITD cue can be deduced through the following formula:
ITD(θ)=ITDmax sin(θ) (15)
where
ITDmax=α/c (16)
The value given to ITDmax may for example correspond to 630 μs, which is the limit of perceptual separation between two pulses. For larger values of ITD the subject will hear two different sounds and will not be able to interpret the sounds as a single sound source.
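As an illustration of equations (14) to (16), the sine-law model can be written in a few lines of Python. The function names are hypothetical; the default values (radius of 8.5 cm, c=343 m·s⁻¹, ITDmax of 630 μs) are the example values quoted in the text.

```python
import math

def itd_max_from_head(radius_m=0.085, c=343.0):
    """ITDmax = radius / c, as in equation (16), in seconds."""
    return radius_m / c

def itd_sine_law(theta_rad, itd_max=630e-6):
    """ITD(theta) = ITDmax * sin(theta), as in equation (15), in seconds."""
    return itd_max * math.sin(theta_rad)

# A source straight ahead gives no time difference; a source at 90 degrees
# gives the full ITDmax.
for deg in (0, 30, 90):
    print(deg, itd_sine_law(math.radians(deg)))
```

Passing `itd_max=itd_max_from_head()` instead of the predefined value ties the model to the head radius rather than to a fixed constant.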
In variants of the invention the sine law could be replaced with Woodworth's ITD model defined in the work by R. S Woodworth, Experimental Psychology (Holt, N.Y.), 1938, pp. 520-523, by the following equation:
ITD(θ)=α(sin(θ)+θ)/c (17)
which is valid for a far field (typically a source at a distance of at least 10·α). Employing the principle of normalization by a maximum value ITDmax as in equation (15), the ITD model according to Woodworth's law can be written in the form:
In variants, it would be possible to define a multiplicative factor which does not represent the maximum value of the ITD but a proportional value for example the factor α/c. The invention also applies in this case. For example, to simplify the expression for Woodworth's law it is possible to write:
ITD(θ)=ITDmax(sin(θ)+θ) (20)
where
ITDmax=α/c (21)
In this case the value of ITDmax does not represent the maximum value of the ITD. Hereinafter, this “disparity of notation” will be used.
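The following sketch, given only as an illustration, evaluates Woodworth's law in the simplified notation of equations (20) and (21) and shows that the factor ITDmax is then a scale factor rather than the largest ITD. The function name is hypothetical, and the value 0.2551 ms is used only as an example.

```python
import math


def itd_woodworth(theta_rad, itd_max):
    """Equation (20): ITD(theta) = ITDmax * (sin(theta) + theta)."""
    return itd_max * (math.sin(theta_rad) + theta_rad)


itd_max = 0.2551e-3  # seconds, example scale factor
peak = itd_woodworth(math.pi / 2, itd_max)

# The largest ITD over [-pi/2, pi/2] is ITDmax * (1 + pi/2), not ITDmax itself.
print(peak / itd_max)
```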
Thus, with reference to
This model is for example the model such as defined hereinabove in equation (15) with a value ITDmax=630 μs predefined in the model or the model of equation (20).
In variants, the value ITDmax could be rendered flexible by coding either this value directly, or by coding the difference between this value and a predetermined value. This approach makes it possible in fact to extend the application of the ITD model to more general cases, but its drawback is to require additional bitrate. To indicate that the explicit coding of the value ITDmax is optional, the block 412 appears dashed in
A module 411 for determining the angle θ such as defined hereinabove is implemented to obtain the angle defined by the sound source. More precisely this module searches for the azimuth parameter θ which makes it possible to approach as close as possible to the ITD extracted. When the law is known as in equation (15), this angle can be obtained in an analytical manner:
θ=arcsin(ITD/ITDmax) (22)
In variants, the arcsin function could be approximated.
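A minimal sketch of the analytical inversion of the sine law is given below. The clamping of the ratio to [−1, 1] is an assumption added here so that an extracted ITD slightly larger than ITDmax does not fall outside the domain of the inverse sine; it is not stated in the description.

```python
import math


def angle_from_itd_sine(itd, itd_max):
    """Invert the sine law: theta = arcsin(ITD / ITDmax)."""
    ratio = itd / itd_max
    ratio = max(-1.0, min(1.0, ratio))  # assumption: clamp out-of-range ITDs
    return math.asin(ratio)


# An ITD equal to half of ITDmax corresponds to an azimuth of 30 degrees.
print(math.degrees(angle_from_itd_sine(315e-6, 630e-6)))
```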
An equivalent approach for determining the azimuth can be implemented in the block 411. According to this approach, the determination of the angle θ for the sine law calls upon a search with the aid of the ITD model, for the closest value as a function of the possible values of azimuth:
θ=argmin_{θ∈T} (ITD−ITDmax sin(θ))² (23)
This search can be performed by pre-storing the various candidate values of ITDmax·sin(θ) arising from the ITD model in a table MITD for a search interval which may be T=[−π/2, π/2] assuming that the ITD is symmetric when the source is in front of or behind the subject. In this case, the values of θ are discretized, for example with a step size of 1° over the search interval.
In the case of Woodworth's law, it is also possible to follow the same approach as hereinabove for the sine law. The analytical expression for the inverse function of sin(θ)+θ not being trivial, it will be possible to prefer the search:
θ=argmin_{θ∈T} (ITD−ITDmax(sin(θ)+θ))² (24)
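The table search of equations (23) and (24) can be sketched as follows, with a step size of 1° over T=[−π/2, π/2] as in the example above. The function names and the representation of the table as a list of (angle, ITD) pairs are illustrative assumptions.

```python
import math


def sine_law(theta, itd_max):
    return itd_max * math.sin(theta)


def woodworth_law(theta, itd_max):
    return itd_max * (math.sin(theta) + theta)


def build_table(law, itd_max, step_deg=1):
    """Pre-store the candidate ITD values for azimuths in [-90, 90] degrees."""
    thetas = [math.radians(d) for d in range(-90, 91, step_deg)]
    return [(theta, law(theta, itd_max)) for theta in thetas]


def search_angle(itd, table):
    """Return the azimuth whose model ITD is closest to the extracted ITD."""
    best_theta, _ = min(table, key=lambda entry: (itd - entry[1]) ** 2)
    return best_theta


table_sine = build_table(sine_law, 630e-6)
table_woodworth = build_table(woodworth_law, 0.2551e-3)
print(math.degrees(search_angle(315e-6, table_sine)))
```

The same search routine serves both laws, since only the pre-stored table changes; no closed-form inverse of sin(θ)+θ is needed.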
The angle parameter θ determined in the block 411 is thereafter coded according to a conventional coding scheme for example by scalar quantization on 4 bits by the block 316. This block carries out a search for the quantization index
i=argmin_{j=0,…,15} (θ−Qθ[j])² (25)
where the table is given for the case of a uniform scalar quantization on 4 bits
In variants, the number of bits allocated to the coding of the azimuth could be different, and the quantization levels could be non-uniform to take account of the perceptual limits of location of a sound source according to the azimuth.
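By way of illustration, the quantization index search of equation (25) can be sketched as below. The table `Q_THETA` is a hypothetical uniform table of 16 levels spanning [−π/2, π/2]; the actual quantization table of the embodiment is not reproduced here.

```python
import math

N_LEVELS = 16  # 4 bits

# Hypothetical uniform table: 16 levels from -pi/2 to +pi/2 inclusive.
Q_THETA = [-math.pi / 2 + j * math.pi / (N_LEVELS - 1) for j in range(N_LEVELS)]


def quantize_angle(theta):
    """Equation (25): index of the closest reconstruction level."""
    return min(range(N_LEVELS), key=lambda j: (theta - Q_THETA[j]) ** 2)


def dequantize_angle(index):
    """Decoder side: look up the reconstruction level."""
    return Q_THETA[index]


index = quantize_angle(0.3)
print(index, dequantize_angle(index))
```

With such a uniform table the reconstruction error is bounded by half the step size, here π/30.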
It is the coding of this parameter which makes it possible to code the time shift cue ITD, optionally with the coding of ITDmax (block 412) as additional cue if the value predetermined by the ITD model must be adapted. The spatialization cue will therefore be retrieved on decoding by decoding the angle parameter, optionally by decoding ITDmax, and by applying the same representation model of the ITD. The bitrate necessary for coding this angle parameter is low (for example 4 bits per frame) when no correction of the value ITDmax predefined in the model is coded. Thus, the coding of this spatialization cue (ITD) consumes little bitrate.
At very low bitrate, the coding of a single angle θ can be implemented to code the spatialization cue in respect of a binaural signal.
In a variant embodiment, it will be possible to estimate an ITD per frequency band, for example by taking a slicing into B sub-bands, defined previously. In this case, an angle θ per frequency band is coded and transmitted to the decoder, which for the example of B sub-bands gives B angles to be transmitted.
In another variant, it will be possible to ignore the estimation of the ITD for certain high frequency bands for which the phase differences are not perceptible. Likewise, it will be possible to omit the estimation of the ITD for very low frequencies. For example, the ITD need not be estimated for bands above 1 kHz, and for a sub-band slicing as defined previously it will be possible to retain the bands b=0 to 11 in the embodiment using the ⅓ octave and b=1 to 16 in the variants using the ERB scale (the first band b=0 being omitted in the latter case since it entails frequencies below 25 Hz). In variants of the invention, a sub-band slicing with a different resolution from 25 Hz could be used; it will thus be possible to group together certain sub-bands since the ⅓ octave slicing or the ERB scale may be too fine for the coding of the ITD. This avoids coding too many angles per frame. For each frequency band, the ITD is thereafter converted into an angle as in the case of a single angle described hereinabove, with a bit allocation which can be either fixed or variable as a function of the significance of the sub-band. In all these variants where several angles are determined and coded, a vector quantization could be implemented in the block 316.
In this variant embodiment, one considers the definition of several “competing” models for coding the ITD, knowing that the invention also applies when a single ITD model is defined.
Thus, the model such as defined for the interchannel time shift (ITD) cue need not be fixed and may be parametrizable. Each model defines a set of values of ITD as a function of an angle parameter: the sine law and Woodworth's law constitute two examples of models. In this variant, for coding, a model index and an angle index (also called angle parameter) to be coded are determined in the block 432 on the basis of an ITD models table obtained at 430 according to the following equation:
(mopt, topt)=argmin_{m=0,…,NM−1; t=0,…,Nθ(m)−1} (ITD−MITD(m, t))² (27)
where NM is the number of models in the ITD models table, Nθ(m) is the number of azimuth angles considered for the m-th model and MITD(m, t) corresponds to a precise value of the cue ITD.
An exemplary model MITD(m, t) is given hereinbelow in the case of a model of index m=0 according to a Woodworth law as in equation 20 with ITDmax=0.2551 ms:
MITD(m=1, t=0, …, 7)=[−0.5362, −0.3807, −0.1978, 0, 0.1978, 0.3807, 0.5362, 0.6558]
where each value is in ms. The angle index t corresponds in fact to an angle θ covering the interval [−3π/8, π/2] with a step size of π/8.
This table can also be expressed in samples; for example, in the case of a sampling at 16 kHz, one obtains in an equivalent manner:
MITD(m=1, t=0, …, 7)=[−8.5795, −6.0919, −3.1648, 0, 3.1648, 6.0919, 8.5795, 10.4930]
In this case, Nθ(m)=8 and NM=1. It is therefore possible to code the cue ITD on 3 bits with this single model.
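The exemplary table can be regenerated with the short sketch below, assuming that the eight indices t=0, …, 7 correspond to the angles −3π/8 to π/2 with a step of π/8. This grid is inferred here from the listed values and is an assumption of the example; the variable names are hypothetical.

```python
import math

ITD_MAX_MS = 0.2551  # scale factor of equation (20), in ms
FS_HZ = 16000        # sampling frequency used for the conversion to samples

# Assumed angle grid: t = 0..7 maps to theta = -3*pi/8 .. pi/2, step pi/8.
thetas = [(t - 3) * math.pi / 8 for t in range(8)]

table_ms = [ITD_MAX_MS * (math.sin(theta) + theta) for theta in thetas]
table_samples = [value * FS_HZ / 1000.0 for value in table_ms]

print([round(value, 4) for value in table_ms])
print([round(value, 4) for value in table_samples])
```

Since the table holds 8 entries, an index t fits on 3 bits, as stated above.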
It will be noted that for a given model index m, the model MITD(m, t) is implicitly dependent on the azimuth angle, insofar as the index t in fact represents a quantization index for the angle θ. Thus, the model MITD(m, t) is an efficient means of combining the relation between ITD and θ, and the quantization of θ on Nθ(m) levels, and of potentially using several models (at least one), indexed by mopt when more than one model is used.
In one embodiment the case of two different models is for example considered:
ITD(θ)=ITDmax sin(θ) and ITDmax=30 (samples at 16 kHz)
It will be noted that the size Nθ(m) may be identical for all the models, but in the general case it is possible for different sizes to be used. For example it will be possible to define Nθ(m)=16 and NM=2. It is therefore possible to code the cue ITD on 4+1=5 bits.
An index of the selected law mopt is then coded on ⌈log2 NM⌉ bits and transmitted to the decoder in addition to the azimuth angle topt coded on ⌈log2 Nθ⌉ bits. In the example taken hereinabove, it will be possible to code mopt on 1 bit, and topt on 4 bits.
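A possible sketch of the joint search for the model index and the angle index, together with the resulting bit count, is given below for NM=2 and Nθ(m)=16. The angle grid and the scale factor of the second (Woodworth) model are hypothetical values chosen for the example; only the sine law with ITDmax=30 samples is taken from the text.

```python
import math


def search_model_and_angle(itd, models):
    """Return (m_opt, t_opt) minimising (itd - models[m][t]) ** 2."""
    candidates = ((m, t) for m, row in enumerate(models) for t in range(len(row)))
    return min(candidates, key=lambda mt: (itd - models[mt[0]][mt[1]]) ** 2)


def index_bits(models):
    """Bits for the model index and for the angle index."""
    n_models = len(models)
    n_angles = max(len(row) for row in models)
    return math.ceil(math.log2(n_models)), math.ceil(math.log2(n_angles))


# Hypothetical grid of 16 azimuths over [-pi/2, pi/2].
THETAS = [-math.pi / 2 + t * math.pi / 15 for t in range(16)]
MODELS = [
    [30.0 * math.sin(th) for th in THETAS],         # sine law, ITDmax = 30 samples
    [10.0 * (math.sin(th) + th) for th in THETAS],  # Woodworth, hypothetical factor
]

print(index_bits(MODELS))
print(search_model_and_angle(30.0, MODELS))
```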
In a variant, it will be possible to replace the model m=0 by an ITD table as a function of the azimuth arising from real measurements of HRTFs, without parametric law, but with ITD values estimated on the real data; in this case, the size Nθ(m) will be able to depend on the angular resolution used to measure HRTFs (assuming that no angular interpolation has been applied).
As in
In a variant of the invention the representation model of the ITD could be generalized so as not to reduce solely to the horizontal plane but also to include the elevation. In this case, two angles are determined, the azimuth angle θ and the elevation angle φ.
The search for the two angles can be made according to the following equation:
(mopt, topt, popt)=argmin_{m=0,…,NM−1; t=0,…,Nθ(m)−1; p=0,…,Nφ(m)−1} (ITD−MITD(m, t, p))² (28)
with Nφ(m) the number of elevation angles considered for the m-th model and popt representing the elevation angle to be coded.
In the invention, one also seeks to reduce the coding bitrate of spatialization cues other than the ITD, such as the spatialization interchannel intensity difference (ILD) cue. It will be noted that the block 316 of
Thus, in the same way as for the ITD it is possible to resort to a parametrization of the ILD. In the binaural case, in accordance with the thesis of Jérôme Daniel, entitled “Représentation de champs acoustiques, application à la transmission et à la reproduction de scènes sonores complexes dans un contexte multimédia” [Representation of acoustic fields, application to the transmission and reproduction of complex sound scenes in a multimedia context], University of Paris 6, Jul. 2011, the ILD can also be approximated according to the following law:
where f is the frequency, r the distance from the sound source and c the speed of sound.
By defining a relative ILD, ILDmax, it is possible under certain conditions to reduce this approximation to the equation:
ILDglob(θ)=ILDmax sin(θ) (30)
The above law is only an approximation corresponding to the global level of the HRTFs at a given azimuth; it does not make it possible to completely characterize the spectral coloration given by the HRTFs but it characterizes only their global level. The reference ILD can be defined (at a later time, when defining the ILD model, by taking a base of normalized signals or a base of HRTF filters) by taking the maximum of the total ILD of a binaural signal. In the invention it is considered that this sine law applies not only to the total (or global) ILD but also to the sub-band based ILD; in this case, the parameter ILDmax depends on the index of the sub-band and the model becomes:
ILD[b](θ)=ILDmax[b]sin(θ) (31)
Experimentally, it may be verified that if the energy (illustrated with reference to
It will be noted that even if the symmetry of the frontal half-plane (azimuth lying in [0, 180] degrees) and the half-plane at the rear of the head (azimuth lying in [180, 360] degrees) is in general not totally valid, this sine law is used in the invention to code and decode the ILD.
Just as for the case of the ITD where a value ITDmax has been defined, it is therefore possible either to transmit the parameter ILDmax, or to use a predetermined and stored value ILDmax, so as to derive therefrom a value ILDglob (θ) according to equation (30) and thus apply a global ILD, valid over the whole spectrum of the signal to obtain a rudimentary (global) location.
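The global and sub-band based sine laws of equations (30) and (31) can be sketched as follows. The numerical values of ILDmax used in the usage lines are illustrative only and are not taken from the description.

```python
import math


def ild_global(theta, ild_max):
    """Equation (30): ILDglob(theta) = ILDmax * sin(theta)."""
    return ild_max * math.sin(theta)


def ild_subbands(theta, ild_max_per_band):
    """Equation (31): ILD[b](theta) = ILDmax[b] * sin(theta) for each sub-band b."""
    return [ild_max_b * math.sin(theta) for ild_max_b in ild_max_per_band]


# Illustrative values in dB for three sub-bands and an azimuth of 30 degrees.
print(ild_global(math.radians(30), 10.0))
print(ild_subbands(math.radians(30), [4.0, 8.0, 12.0]))
```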
Another exemplary model relies on the configuration of ORTF stereo microphones which is illustrated in
In this example, the sub-band based ILD model could be defined in relation to a configuration of ORTF microphones as follows:
ILD(θ)=L(θ)−R(θ)=α(cos(θ−θ0)−cos(θ+θ0)) (32)
with
L(θ)=α(1+cos(θ−θ0)) (33)
R(θ)=α(1+cos(θ+θ0)) (34)
where θ0 (in radians) corresponds to 55°.
This model can also be written in the form:
ILD(θ)=L(θ)−R(θ)=2α sin(θ)sin(θ0) (35)
Here again it is possible to define a value ILDmax which corresponds to:
ILDmax=α (36)
Here again, it is assumed that the model defined in equation 35 applies not only to the case of a total (or global) ILD but also to the sub-band based ILD; in this case the parameter ILDmax (or a proportional version) will be dependent on the sub-band in the form ILD[b]max.
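As a numerical check, the sketch below evaluates L(θ)−R(θ) from equations (33) and (34) and compares it with the closed form 2α sin(θ) sin(θ0) obtained from the identity cos(θ−θ0)−cos(θ+θ0)=2 sin(θ) sin(θ0). The function names and the default scale factor `a=1.0` are assumptions of the example.

```python
import math

THETA0 = math.radians(55.0)  # angle of the ORTF configuration


def ortf_left(theta, a=1.0):
    """Equation (33): L(theta) = a * (1 + cos(theta - theta0))."""
    return a * (1.0 + math.cos(theta - THETA0))


def ortf_right(theta, a=1.0):
    """Equation (34): R(theta) = a * (1 + cos(theta + theta0))."""
    return a * (1.0 + math.cos(theta + THETA0))


def ortf_ild(theta, a=1.0):
    """Equation (32): difference between the left and right responses."""
    return ortf_left(theta, a) - ortf_right(theta, a)


def ortf_ild_closed_form(theta, a=1.0):
    """Closed form of the same difference: 2 * a * sin(theta) * sin(theta0)."""
    return 2.0 * a * math.sin(theta) * math.sin(THETA0)


print(ortf_ild(math.radians(30)), ortf_ild_closed_form(math.radians(30)))
```

In this form the model is again proportional to sin(θ), which is what allows a reference value ILDmax to be defined in the same way as for the sine law.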
Thus, with reference to
This model is for example the model such as defined hereinabove in equation (30) or with other models described in this document.
The angle parameter θ already defined at 411 can be reused at the decoder to retrieve the global ILD or the sub-band based ILD such as defined by equation (30), (31) or (35); this in fact makes it possible to “pool” the coding of the ITD and of the ILD. In the case where the value ILDmax is not fixed, the latter is determined at 423 and coded.
In a particular embodiment, a module 421 for estimating an interchannel intensity difference cue is implemented on the basis on the one hand of the angle parameter obtained by the block 411 in order to code the time shift cue (ITD) and on the other hand of the representation model of equation (30), (31) or (35). In an optional manner, the module 422 calculates a residual of the cue ILD, that is to say the difference between the cue in respect of real interchannel intensity difference (ILD) extracted at 314 and the interchannel intensity difference (ILD) cue estimated at 421 on the basis of the ILD model.
This residual can be coded at 318 for example by a conventional scalar quantization scheme. However, in contradistinction to the coding of a direct ILD, the quantization table may be for example limited to a dynamic range of +/−12 dB with a step size of 3 dB.
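A minimal sketch of the residual computation and of its scalar quantization is given below, using the example range of +/−12 dB with a step size of 3 dB (9 levels). The function names are hypothetical.

```python
# Quantization levels in dB: -12, -9, ..., 9, 12 (9 levels).
RESIDUAL_LEVELS_DB = [-12 + 3 * i for i in range(9)]


def quantize_residual(ild_measured_db, ild_model_db):
    """Residual = measured ILD - model ILD, mapped to the nearest level index."""
    residual = ild_measured_db - ild_model_db
    return min(
        range(len(RESIDUAL_LEVELS_DB)),
        key=lambda i: abs(residual - RESIDUAL_LEVELS_DB[i]),
    )


def dequantize_residual(index):
    """Decoder side: reconstruction level in dB."""
    return RESIDUAL_LEVELS_DB[index]


index = quantize_residual(10.0, 6.0)  # residual of 4 dB
print(index, dequantize_residual(index))
```

Residuals outside the range are simply mapped to the nearest extreme level in this sketch, which is one possible design choice among others.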
This ILD residual makes it possible to improve the quality of decoding of the cue ILD in the case where the ILD model is too specific and applies only to the signal to be coded in the current frame; it is recalled that a classification may optionally be used at the coder to avoid such cases, however in the general case it may be useful to code an ILD residual.
Thus, the coding of these parameters as well as that of angle of the ITD makes it possible to retrieve at the decoder the interchannel intensity difference (ILD) cue of the binaural audio signal with a good quality.
In the same way as for the ITD, the spatialization cue (global or sub-band based) will therefore be retrieved on decoding by applying the same representation model and by decoding if relevant the residual parameter and reference ILD parameter. The bitrate necessary for coding these parameters is lower than if the cue ILD itself were coded, in particular when the ILD residual does not have to be transmitted and when use is made of the parameter or parameters ILDmax predefined in the ILD model or models. Thus, the coding of this spatialization cue (ILD) consumes little bitrate.
This ILD model using only a global ILD value is however very simplistic since in general the ILD is defined on several sub-bands.
In the coder described previously, B sub-bands according to a ⅓ octave slicing or according to the ERB scale were defined. To make it possible to represent more than one parameter of total (or global) ILD the representation model of the ILD is therefore extended to several sub-bands. This extension applies to the invention described in FIG. 4a, however the associated description is given hereinafter in the context of
We consider the variant embodiment described in
where NM is the number of models in the ILD models table, Nθ(m) is the number of azimuth angles considered for the m-th model, MILD (m, t) corresponds to a precise value of the cue ILD and dist(.,.) is a criterion of distance between ILD vectors. However, in a variant embodiment, this search could be simplified by using the angle cue already obtained in the block 432 for the ITD model. It will be noted that the values t=0, . . . , Nθ(m)−1 for the ILD model do not necessarily correspond to the same set of values as for the ITD model, however it is advantageous to harmonize these sets so as to have coherence between representation models for the ILD and the ITD.
The following may for example be taken as possible distance criteria:
dist(X,Y)=|Σ_{b=0}^{B−1} X[b] − Σ_{b=0}^{B−1} Y[b]|^q (38)
where q=1 or 2.
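The distance criterion of equation (38) and its use in a search over an ILD models table can be sketched as follows. The table `MODELS`, built here from the sub-band sine law with illustrative values of ILDmax, is hypothetical.

```python
import math


def dist(x, y, q=2):
    """Equation (38): |sum_b X[b] - sum_b Y[b]| ** q, with q = 1 or 2."""
    return abs(sum(x) - sum(y)) ** q


def search_ild_model(ild, models, q=2):
    """Return (m_opt, t_opt) minimising dist(ild, models[m][t])."""
    candidates = ((m, t) for m, rows in enumerate(models) for t in range(len(rows)))
    return min(candidates, key=lambda mt: dist(ild, models[mt[0]][mt[1]], q))


# Hypothetical table: one model, B = 3 sub-bands, 5 azimuths.
ILD_MAX = [4.0, 8.0, 12.0]  # dB, illustrative values only
AZIMUTHS = [math.radians(d) for d in (-90, -45, 0, 45, 90)]
MODELS = [[[m * math.sin(th) for m in ILD_MAX] for th in AZIMUTHS]]

print(search_ild_model([3.0, 5.5, 8.0], MODELS))
```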
An exemplary ILD model is illustrated in
In a variant of the invention the representation model of the ILD could be generalized so as not to reduce solely to the horizontal plane but also to include the elevation. In this case, the search for two angles becomes:
with Nφ(m) the number of elevation angles considered for the m-th model and popt representing the elevation angle to be coded.
In a variant, an exemplary model MILD (m, t, p) can be obtained on the basis of a suite of HRTFs in the following manner. Given the HRTF filters for θ and φ, it is possible to:
An index of the selected law mopt is then coded and transmitted to the decoder at 318.
In the same way as for
Hitherto separate models have been considered for the ITD and the ILD, even if it was noted that the determination of the angle may be “pooled”. For example, the azimuth may be determined by using the ITD model and this same angle is used directly for the ILD model. Another variant embodiment calling upon a (joint) “integrated model” is now considered. This variant is described in
In this variant, rather than having separate models for the ITD and the ILD (MITD (m, t, p) and MILD (m, t, p)) it will be possible to define a joint model in the block 450: MITD,ILD (m, t, p) whose inputs comprise candidate values of ITD and of ILD; thus, for various discrete values representing θ and φ “vectors” (ITD, ILD) are defined. In this case, the distance measurement used for the search must combine the distance on the ITD and the distance on the ILD, however it is still possible to perform a separate search.
Thus, an index of the selected law mopt, of the azimuth angle topt and of the elevation angle popt that are determined at 453, are coded at 331 and transmitted to the decoder. Just as for
A variant of the coder illustrated in
With reference to
This decoder comprises a demultiplexer 701 in which the coded mono signal is extracted so as to be decoded at 702 by a mono EVS decoder (according to the specifications 3GPP TS 26.442 or TS 26.443) in this example. The part of the bitstream corresponding to the mono EVS coder is decoded according to the bitrate used at the coder. To simplify the description, it is assumed here that there are no frame losses and no bit errors in the bitstream; known techniques for correcting frame losses can nevertheless be implemented in the decoder.
The decoded mono signal corresponds to M̂(n) in the absence of channel errors. An analysis by short-term discrete Fourier transform with the same windowing as at the coder is carried out on M̂(n) (blocks 703 and 704) to obtain the spectrum M̂[k]. It is considered here that a decorrelation in the frequency domain (block 720) is also applied. This decorrelation could also have been applied in the time domain.
The details of implementation of the block 708 for the synthesis of the stereo signal are not presented here since they lie outside the scope of the invention, but the conventional synthesis techniques known from the prior art could be used.
In the synthesis block 708, it is for example possible to reconstruct a signal with two channels with the following processing on the mono signal decoded and transformed into frequencies:
L̂[k]=c1 M̂[k] (40)
R̂[k]=c2 M̂[k] e^(−j2πk·ITD/NFFT) (41)
where c=10^(ILD[b]/10) (with b the index of the sub-band containing the spectral line of index k),
ITD is the ITD decoded for the spectral line k (if a single ITD is coded, this value is identical for the various spectral lines of index k) and NFFT is the length of the FFT and of the inverse FFT (blocks 704, 709, 712).
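The per-line processing of equations (40) and (41) can be sketched as below. The gains c1 and c2 are not specified in this passage; the sketch assumes c1=√(2c/(1+c)) and c2=√(2/(1+c)), a choice which yields a power ratio of c between the two channels and which is used here purely as an assumption. The function name and argument names are hypothetical.

```python
import cmath
import math


def synthesize_bins(m_hat, ild_db_per_bin, itd_samples, n_fft):
    """Apply a gain per channel and a linear phase (time shift) to the right channel."""
    left, right = [], []
    for k, m in enumerate(m_hat):
        c = 10.0 ** (ild_db_per_bin[k] / 10.0)
        c1 = math.sqrt(2.0 * c / (1.0 + c))  # assumption, not given in the text
        c2 = math.sqrt(2.0 / (1.0 + c))      # assumption, not given in the text
        phase = cmath.exp(-2j * math.pi * k * itd_samples / n_fft)
        left.append(c1 * m)
        right.append(c2 * m * phase)
    return left, right


spectrum = [1 + 0j, 0.5 + 0.5j, -0.25j]
left, right = synthesize_bins(spectrum, [6.0, 6.0, 6.0], itd_samples=4, n_fft=512)
print(left, right)
```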
It is also possible to take into account the parameter ICC decoded at 718 to recreate a non-localized sound ambience (background noise) to improve the quality.
The spectra L̂[k] and R̂[k] are thus calculated and thereafter converted into the time domain by inverse FFT, windowing, addition and overlap (blocks 709 to 714) to obtain the synthesized channels L̂(n) and R̂(n).
The parameters which have been coded to obtain the spatialization cues are decoded at 705, 715 and 718.
At 718, it is the cues ICCq[b] which are decoded if, however, they have been coded.
At 705, it is the angle parameter θ which is decoded, optionally with a value ITDmax. On the basis of this parameter, the module 706 for obtaining a representation model of an interchannel time shift cue is implemented to obtain this model. Just as for the coder, this model can be defined by equation (15) defined hereinabove. Thus, on the basis of this model and of the decoded angle parameter, it is possible for the module 707 to determine the interchannel time shift (ITD) cue in respect of the multichannel signal.
If at the coder an angle per frequency or per frequency band is coded, then these various angles per frequency or frequency bands are decoded to define the cues ITD per frequency or frequency bands.
In the same way, in the case where parameters making it possible to code the interchannel intensity difference (ILD) cue are coded, they are decoded by the module for decoding these parameters at 715, at the decoder.
Thus, the residual parameter (Resid. ILD) and reference ILD parameter (ILDmax) are decoded at 715.
On the basis of these parameters, the module 716 for obtaining a representation model of an interchannel intensity difference cue is implemented to obtain this model. Just as for the coder, this model can be defined by equation (30) defined hereinabove.
Thus, on the basis of this model, of the ILD residual parameters (that is to say the difference between the cue in respect of real interchannel intensity difference (ILD) and the interchannel intensity difference (ILD) cue estimated with the model), of the reference ILD parameter (ILDmax) and of the angle parameter decoded at 705 for the cue ITD, it is possible for the module 717 to determine the interchannel intensity difference (ILD) cue of the multichannel signal.
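On the decoder side, the retrieval of both cues from a single decoded angle can be sketched as follows, assuming the sine laws of equations (15) and (30). Since the residual is the difference between the real ILD and the ILD estimated with the model, it is added back to the model estimate. The function name and the numerical values are illustrative.

```python
import math


def decode_cues(theta, itd_max, ild_max, ild_residual_db=0.0):
    """Rebuild the ITD and the ILD from the decoded angle and optional parameters."""
    itd = itd_max * math.sin(theta)                    # equation (15)
    ild = ild_max * math.sin(theta) + ild_residual_db  # equation (30) plus residual
    return itd, ild


# Illustrative values: lateral source, ITDmax = 630 microseconds, ILDmax = 10 dB.
itd, ild = decode_cues(math.pi / 2, 630e-6, 10.0, ild_residual_db=-3.0)
print(itd, ild)
```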
If at the coder the ILD coding parameters were itemized by frequency band, then these various frequency band based parameters are decoded to define the cues ILD per frequency or frequency bands.
It will be noted that the decoder of
In a variant of the invention the decoder of
The coder presented with reference to
The coders and decoders such as described with reference to
In the case of a coder, the memory block can advantageously comprise a computer program comprising code instructions for the implementation of the steps of the coding method in the sense of the invention, when these instructions are executed by the processor PROC, and in particular the steps of extracting a plurality of spatialization cues in respect of the multichannel signal, of obtaining at least one representation model of the spatialization cues extracted, of determining at least one angle parameter of a model obtained and of coding the at least one angle parameter determined so as to code the spatialization cues extracted during the coding of spatialization cues.
In the case of a decoder, the memory block can advantageously comprise a computer program comprising code instructions for the implementation of the steps of the decoding method in the sense of the invention, when these instructions are executed by the processor PROC, and in particular the steps of receiving and decoding at least one coded angle parameter, of obtaining at least one representation model of spatialization cues and of determining a plurality of spatialization cues in respect of the multichannel signal on the basis of the at least one model obtained and of the at least one decoded angle parameter.
The memory MEM can store the representation model or models of various spatialization cues which are used in the coding and decoding methods according to the invention.
Typically, the descriptions of
Such an item of equipment in the guise of coder comprises an input module able to receive a multichannel signal for example a binaural signal comprising the channels R and L for right and left, either through a communication network, or by reading a content stored on a storage medium. This multimedia equipment item can also comprise means for capturing such a binaural signal.
The device in the guise of coder comprises an output module able to transmit a mono signal M arising from a channels reduction processing and, at the minimum, an angle parameter θ making it possible to apply a representation model of a spatialization cue so as to retrieve this spatial cue. If relevant, other parameters such as the ILD residual or the reference ILD or ITD parameters (ILDmax or ITDmax) are also transmitted via the output module.
Such an item of equipment in the guise of decoder comprises an input module able to receive a mono signal M arising from a channels reduction processing and, at the minimum, an angle parameter θ making it possible to apply a representation model of the spatialization cue so as to retrieve this spatial cue. If relevant, to retrieve the spatialization cue, other parameters such as the ILD residual or the reference ILD or ITD parameters (ILDmax or ITDmax) are also received via the input module E.
The device in the guise of decoder comprises an output module able to transmit a multichannel signal for example a binaural signal comprising the channels R and L for right and left.
Although the present disclosure has been described with reference to one or more examples, workers skilled in the art will recognize that changes may be made in form and detail without departing from the scope of the disclosure and/or the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
1652034 | Mar 2016 | FR | national |
This application is a divisional of U.S. application Ser. No. 16/083,741, filed Sep. 10, 2018, which is a Section 371 National Stage Application of International Application No. PCT/FR2017/050547, filed Mar. 10, 2017, and published as WO 2017/153697 on Sep. 14, 2017, not in English, the contents of which are incorporated herein by reference in their entireties.
Number | Date | Country | |
---|---|---|---|
Parent | 16083741 | Sep 2018 | US |
Child | 17130567 | US |