The invention relates to a method for suppressing the late reverberation of an audio signal. The invention is more particularly, though not exclusively, adapted to the field of processing reverberation in an enclosed space.
There are two types of reflections, early reflections and late reverberation. The microphone 120 captures the early reflection signals with a slight delay relative to the source signal 130, on the order of zero to fifty milliseconds. Said early reflection signals are temporally and spatially separated from the source signal 130, but the human ear does not perceive these early reflection signals and the source signal 130 separately due to an effect called the “precedence effect.” When the audio signal emitted by the omnidirectional sound source 100 is a speech signal, the temporal integration of the early reflection signals by the human ear makes it possible to enhance certain characteristics of the speech, which improves the intelligibility of the audio signal.
Depending on the size of the room, the boundary between the early reflections and the late reverberation is between fifty and eighty milliseconds. The late reverberation comprises numerous reflected signals that are close together in time and therefore impossible to separate. This set of reflected signals is thus considered from a probability standpoint to be a random distribution whose density increases with time. When the audio signal emitted by the omnidirectional sound source 100 is a speech signal, the late reverberation degrades both the quality of said audio signal and its intelligibility. Said late reverberation also affects the performance of speech recognition and sound source separation systems.
According to the prior art, a first method known as “inverse filtering” attempts to identify the impulse response of the enclosed space 110 in order to then construct an inverse filter that can compensate the effects of the reverberation in the audio signal.
This type of method is for example described in the following scientific publications: B. W. Gillespie, H. S. Malvar and D. A. F. Florêncio, “Speech dereverberation via maximum-kurtosis subband adaptive filtering,” Proc. International Conference on Acoustics, Speech and Signal Processing, Volume 6 of ICASSP '01, pages 3701-3704, IEEE, 2001; M. Wu and D. L. Wang, “A two-stage algorithm for one-microphone reverberant speech enhancement,” Audio, Speech and Language Processing, IEEE Transactions on, 14(3): 774-784, 2006; and Saeed Mosayyebpour, Abolghasem Sayyadiyan, Mohsen Zareian, and Ali Shahbazi, “Single Channel Inverse Filtering of Room Impulse Response by Maximizing Skewness of LP Residual.”
This method uses, in the time domain, distortions introduced by reverberation in parameters of a linear prediction model of the audio signal. Proceeding from the observation that reverberation primarily modifies the residual of the linear prediction model of the audio signal, a filter that maximizes the higher order moments of said residual is constructed. This method is adapted to short impulse responses and is primarily used to compensate early reflection signals.
However, this method assumes that the impulse response of the enclosed space 110 does not vary over time. Furthermore, this method does not model late reverberation. Said method must thus be combined with another method for processing the late reverberation. These two methods combined require a large number of iterations before convergence is obtained, which means that said methods cannot be used for a real-time application. Moreover, the inverse filtering introduces artifacts such as pre-echoes, which must then be compensated.
A second method known as the “cepstral” method attempts to separate the effects of the enclosed space 110 and the audio signal in the cepstral domain. In essence, reverberation modifies the average and the variance of the cepstra of the reflected signals relative to the average and the variance of the cepstra of the source signal 130. Thus, when the average and the variance of the cepstra are normalized, the reverberation is attenuated.
This type of method is for example described in the following scientific publication: D. Bees, M. Blostein, and P. Kabal, “Reverberant speech enhancement using cepstral processing,” ICASSP '91 Proceedings of the Acoustics, Speech and Signal Processing, 1991.
This method is particularly useful for voice recognition problems since the reference databases of recognition systems can also be normalized so as to more closely approximate the signals captured by the microphone 120. However, the effects of the enclosed space 110 and the audio signal cannot be completely separated in the cepstral domain. Using this method therefore produces a distortion of the timbre of the audio signal emitted by the omnidirectional sound source 100. Moreover, this method processes early reflections rather than late reverberation.
A third method known as “estimating the power spectral density of late reverberation” makes it possible to establish a parametric model of the late reverberation.
This type of method is for example described in the following scientific publications: E. A. P. Habets, “Single- and Multi-Microphone Speech Dereverberation using Spectral Enhancement,” PhD thesis, Technische Universiteit Eindhoven, 2007; and T. Yoshioka, “Speech Enhancement in Reverberant Environments,” PhD thesis, 2010.
According to this third method, an estimation of the power spectral density of the late reverberation makes it possible to construct a spectral subtraction filter for the dereverberation. Spectral subtraction introduces artifacts such as musical noise, but said artifacts can be limited by applying more complex filtering schemes, as used in denoising methods.
However, an important parameter for estimating the power spectral density of late reverberation in the context of this third method is the reverberation time. Reverberation time is a parameter that is difficult to estimate with precision. The estimation of the reverberation time is distorted by background noise and other interfering audio signals. Moreover, this estimation of reverberation time is time-consuming and thus increases execution time.
A fourth method exploits the sparsity of speech signals in the time-frequency plane.
This type of method is for example described in the following scientific publication: T. Yoshioka, “Speech Enhancement in Reverberant Environments,” PhD thesis, 2010.
In this publication, the late reverberation is modeled as a delayed and attenuated version of the current observation whose attenuation factor is determined by solving a maximum likelihood problem with a sparsity constraint.
This type of method is also described in the following scientific publication: H. Kameoka, T. Nakatani, and T. Yoshioka, “Robust speech dereverberation based on nonnegativity and sparse nature of speech spectrograms,” Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '09, pages 45-48, IEEE Computer Society, 2009.
Dereverberation is approached in this publication as a problem of deconvolution by nonnegative matrix factorization, which makes it possible to separate the response of the enclosed space 110 from the audio signal. However, this method introduces a lot of noise and distortion. Moreover, said method depends on the initialization of the matrices for the factorization.
Furthermore, the methods cited require a plurality of microphones in order to process the reverberation with precision.
A particular object of the invention is to solve all or some of the above-mentioned problems.
To this end, the invention relates to a method for suppressing the late reverberation of an audio signal, characterized in that it comprises the following steps:
Thus, the method that is the subject of the invention is fast and offers reduced complexity. Said method can therefore be used in real time. Furthermore, this method does not introduce artifacts and is resistant to background noise. Moreover, said method reduces background noise and is compatible with noise reduction methods.
The invention can be implemented according to the embodiments described below, which may be considered individually or in any technically feasible combination.
Advantageously, the method also comprises the following steps:
Advantageously, the step for calculating the plurality of prediction vectors is performed by minimizing, for each prediction vector, the expression $\|\tilde{X}^{\nu} - D^{\alpha}\alpha\|_2$, which is the Euclidean norm of the difference between the subsampled observation vector associated with said prediction vector and the analysis dictionary associated with said prediction vector multiplied by said prediction vector, taking into account the constraint $\|\alpha\|_1 \leq \lambda$, according to which the 1-norm of said prediction vector is less than or equal to a maximum intensity parameter of the late reverberation.
Advantageously, the value of the maximum intensity parameter of the late reverberation is between 0 and 1.
Advantageously, the method also comprises the following step:
Advantageously, the method also comprises the following step:
Advantageously, the method also comprises a step for constructing a dereverberation filter according to the model
where ξ is the a priori signal-to-noise ratio and where the bound of integration υ is calculated according to the model
where γ is the a posteriori signal-to-noise ratio.
The invention also relates to a device for suppressing the late reverberation of an audio signal, characterized in that it comprises means for
The invention will be more clearly understood by reading the following description, given as a nonlimiting example in reference to the figures, which show:
In these figures, references that are identical from one figure to another designate identical or comparable elements. For the sake of clarity, the elements shown are not to scale, unless otherwise indicated.
The invention uses a device for dereverberating an audio signal emitted by an omnidirectional sound source 100 positioned in an enclosed space 110 such as an automotive vehicle or a room and captured by a microphone 120. Said dereverberation device is inserted into the audio processing chain of a device such as a telephone. This dereverberation device comprises a unit for applying a time-frequency transform 200, a dereverberation unit 210, and a unit for applying a frequency-time transform 220 (cf.
In a step 900, a microphone 120 captures an input signal x(t) formed by the superimposition of several delayed and attenuated versions of the audio signal emitted by the omnidirectional sound source 100. In essence, the microphone 120 initially captures the source signal 130, also called the direct signal 130, but also the signals 140 reflected off the walls of the enclosed space 110. The various reflected signals 140 have traveled along acoustic paths of various lengths and have been attenuated by the absorption of the walls of the enclosed space 110; the phase and the amplitude of the reflected signals 140 captured by the microphone 120 are therefore different.
There are two types of reflections, early reflections and late reverberation. The microphone 120 captures the early reflection signals with a slight delay relative to the source signal 130, on the order of zero to fifty milliseconds. Said early reflection signals are temporally and spatially separated from the source signal 130, but the human ear does not perceive these early reflection signals and the source signal 130 separately due to an effect called the “precedence effect.” When the audio signal emitted by the omnidirectional sound source 100 is a speech signal, the temporal integration of the early reflection signals by the human ear makes it possible to enhance certain characteristics of the speech, which improves the intelligibility of the audio signal.
The microphone 120 captures the late reverberation fifty to eighty milliseconds after the arrival of the source signal 130. The late reverberation comprises numerous reflected signals that are close together in time and therefore impossible to separate. This set of reflected signals is thus considered from a probability standpoint to be a random distribution whose density increases with time. When the audio signal emitted by the omnidirectional sound source 100 is a speech signal, the late reverberation degrades both the quality of said audio signal and its intelligibility. Said late reverberation also affects the performance of speech recognition and sound source separation systems.
The input signal x(t) is sampled at a sampling frequency fs. The input signal x(t) is thus subdivided into samples. In order to suppress the late reverberation of said input signal x(t), the power spectral density of the late reverberation is estimated, after which a dereverberation filter is constructed by the dereverberation unit 210. The estimation of the power spectral density of the late reverberation, the construction of the dereverberation filter, and the application of said dereverberation filter are performed in the frequency domain. Thus, in a step 901, a time-frequency transformation is applied to the input signal x(t) by the Short-Term Fourier Transform application unit 200 in order to obtain a complex time-frequency transform of the input signal x(t), notated XC (cf.
Each element XCk,n of the complex time-frequency transform XC is calculated as follows:
where k is a frequency sampling index with a value between 1 and a number K, n is a time index with a value between 1 and a number N, w(m) is a sliding analysis window, m is the index of the elements belonging to a frame, M is the length of a frame, i.e. the number of samples in a frame, and R is the hop size of the time-frequency transformation.
The input signal x(t) is analyzed by frames of length M with a hop size R equal to M/4 samples. For each frame of the input signal x(t) in the time domain, a discrete time-frequency transform with a frequency sampling index k and a time index n is thus calculated using the algorithm of the time-frequency transformation in order to obtain a complex signal XCk,n, defined by
$$X^{C}_{k,n} = |X_{k,n}|\, e^{j \angle X_{k,n}}$$
where |Xk,n| is the modulus of the complex signal XCk,n, and ∠Xk,n is the phase of the complex signal XCk,n.
The estimation of the power spectral density of the late reverberation is performed on the modulus of the complex time-frequency transform of the input signal XC, notated X. The phase of the complex time-frequency transform XC, notated ∠X, is stored in memory and is used to reconstruct a dereverberated signal in the time domain after the application of the dereverberation filter.
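By way of nonlimiting illustration, the time-frequency transformation of step 901 and the separation into modulus and phase can be sketched as follows. The function names `stft` and `modulus_and_phase`, the Hann window chosen for w(m), the zero-based indices, and the direct evaluation of the discrete Fourier sum are assumptions made for this sketch; the text only specifies a sliding analysis window, a frame length M and a hop size R equal to M/4.

```python
import cmath
import math


def stft(x, M=512, R=None):
    """Return frames[n][k], the complex transform for bins k = 0 .. M//2.

    Indices are zero-based here, whereas the description counts from 1.
    """
    if R is None:
        R = M // 4  # hop size of M/4 samples, as in the description
    # Hann analysis window: an assumption, the text only names it w(m).
    w = [0.5 - 0.5 * math.cos(2.0 * math.pi * m / M) for m in range(M)]
    frames = []
    for start in range(0, len(x) - M + 1, R):
        frame = [x[start + m] * w[m] for m in range(M)]
        # Direct DFT sum, kept simple for readability (not an FFT).
        spectrum = [
            sum(frame[m] * cmath.exp(-2j * math.pi * k * m / M) for m in range(M))
            for k in range(M // 2 + 1)
        ]
        frames.append(spectrum)
    return frames


def modulus_and_phase(frames):
    """Split each complex value into its modulus and its phase."""
    modulus = [[abs(c) for c in frame] for frame in frames]
    phase = [[cmath.phase(c) for c in frame] for frame in frames]
    return modulus, phase
```

With M equal to 512 this yields 257 frequency bins per frame, which matches the value of K given in the embodiment. A practical implementation would typically replace the direct sum with a fast Fourier transform.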
The modulus $X$ of the complex time-frequency transform of the input signal $X^{C}$ is then grouped into subbands. More precisely, said modulus $X$ comprises the number K of spectral lines notated $X_k$. The term “spectral line” in this context designates all the samples of the modulus $X$ of the complex time-frequency transform of the input signal $X^{C}$ for the frequency sampling index k and all of the time indices n. In a step 903, the subband grouping unit 400 groups the K spectral lines $X_k$ into a number J of subbands, in order to obtain a frequency subsampled modulus notated $\tilde{X}$ comprising a number J of spectral lines notated $\tilde{X}_j$, where j is a frequency subsampling index between 1 and the number J. The number J is less than the number K. Each subband thus comprises a plurality of spectral lines $X_k$, the frequency index k belonging to an interval having a lower bound $b_j$ and an upper bound $e_j$. In one example, each subband corresponds to an octave in order to adapt to the sound perception model of the human ear. Next, in a step 904, the subband grouping unit 400 calculates, for each subband, the mean of the spectral lines $X_k$ of said subband in order to obtain the J spectral lines $\tilde{X}_j$ of the frequency subsampled modulus $\tilde{X}$ (cf.
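As a hedged illustration of steps 903 and 904, the sketch below averages the spectral lines of one frame over each subband. The helper `octave_bounds` is hypothetical: the text says only that each subband may correspond to an octave and does not give the bounds, so the particular bin ranges produced here are an assumption. Indices are zero-based.

```python
def octave_bounds(K):
    """Hypothetical octave-like (lower, upper) bin bounds covering bins 0 .. K-1."""
    bounds = [(0, 0)]  # bin 0 kept as its own band (assumption)
    lo = 1
    while lo < K:
        hi = min(2 * lo - 1, K - 1)  # each band spans one doubling of the bin index
        bounds.append((lo, hi))
        lo = hi + 1
    return bounds


def group_subbands(frame_modulus, bounds):
    """Mean of the spectral lines of one frame over each subband [b_j, e_j]."""
    return [sum(frame_modulus[b:e + 1]) / (e - b + 1) for b, e in bounds]
```

Under this particular choice of bounds, K equal to 257 happens to give 10 subbands, the value of J used in the embodiment; other band layouts would serve equally well.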
Next, the prediction vector calculation unit 410 calculates, for each spectral line $\tilde{X}_j$ of the frequency subsampled modulus $\tilde{X}$ and for each time index n, a prediction vector $\alpha_{j,n}$ (cf.
$$\tilde{X}^{\nu}_{j,n} := [\tilde{X}_{j,n} \;\ldots\; \tilde{X}_{j,n-N+1}]^{T}$$
Each observation vector $\tilde{X}^{\nu}_{j,n}$ has size N×1, where the number N is the length of the observation. The length of the observation N is the number of frames of the time-frequency transformation required for the estimation of the late reverberation. The length of the observation N makes it possible to define the time resolution of the estimation. When the length of the observation N increases, the complexity of the system is reduced. The subsampling of the modulus $X$ of the complex time-frequency transform of the input signal $X^{C}$ makes it possible, among other things, to apply the method in real time.
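For illustration only, an observation vector of length N can be gathered from the frame history of one subband as sketched below. The name `observation_vector` and the representation of the history as a Python list indexed by frame are assumptions of this sketch.

```python
def observation_vector(band_track, n, N):
    """Return [x[n], x[n-1], ..., x[n-N+1]] for one subband.

    band_track[i] holds the subsampled modulus of the subband at frame i
    (zero-based frame index).
    """
    if n - N + 1 < 0:
        raise ValueError("not enough past frames for an observation of length N")
    return [band_track[n - i] for i in range(N)]
```

For example, with a history of `[10, 11, 12, 13, 14]`, `observation_vector(history, 4, 3)` gathers the three most recent frames, newest first.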
In a step 906, the analysis dictionary construction unit 710 constructs analysis dictionaries $D^{\alpha}$. More precisely, for each time index n and frequency subsampling index j, an analysis dictionary $D^{\alpha}_{j,n}$ is constructed by concatenating a number L of past observation vectors determined in step 905. The analysis dictionary $D^{\alpha}_{j,n}$ is thus defined as the matrix
where L is the number of past observation vectors and hence the size of the analysis dictionary $D^{\alpha}_{j,n}$ and $\delta \in \mathbb{R}^{*}$ is the delay of the analysis dictionary $D^{\alpha}_{j,n}$. More precisely, the delay δ is the frame delay between the current subsampled observation vector $\tilde{X}^{\nu}_{j,n}$ and the other subsampled observation vectors belonging to the analysis dictionary $D^{\alpha}_{j,n}$. Said delay δ makes it possible to reduce the distortions introduced by the method. This delay δ also makes it possible to improve the separation of the late reverberation from the early reflections. In order to calculate the current observation vector $\tilde{X}^{\nu}_{j,n}$ and the analysis dictionary $D^{\alpha}_{j,n}$ and thus the prediction vector $\alpha_{j,n}$ for each spectral line $\tilde{X}_j$ and for each time index n, a number L+N+δ of frames must be stored in memory.
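A minimal sketch of the dictionary construction follows. The exact column indexing is not recoverable from the text, so this sketch assumes that column l (l = 0, ..., L−1) is the observation vector taken at frame n − δ − l; the function names are likewise hypothetical.

```python
def observation_vector(track, n, N):
    """Return [x[n], x[n-1], ..., x[n-N+1]] from a per-frame history."""
    if n - N + 1 < 0:
        raise ValueError("not enough past frames")
    return [track[n - i] for i in range(N)]


def build_dictionary(track, n, N, L, delta):
    """Return an N x L matrix (list of rows) of delayed observation vectors.

    Assumed indexing: column l is the observation vector at frame n - delta - l.
    """
    columns = [observation_vector(track, n - delta - l, N) for l in range(L)]
    # Transpose the list of columns into a list of N rows.
    return [[column[i] for column in columns] for i in range(N)]
```

With this assumed indexing, the oldest frame touched is n − δ − L − N + 2, so the history fits within the L+N+δ frames that the text says must be stored. The same routine could be applied to a full-resolution spectral line to obtain a synthesis dictionary.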
In a step 907, the LASSO solving unit 720 solves a so-called “LASSO” problem, which is to minimize the Euclidean norm $\|\tilde{X}^{\nu}_{j,n} - D^{\alpha}_{j,n}\alpha_{j,n}\|_2$, taking into account the constraint $\|\alpha_{j,n}\|_1 \leq \lambda$, where λ is a maximum intensity parameter. In order to solve said problem, the best linear combination of the L vectors of the dictionary for approximating the current observation must be found. In one example, a method known as LARS, the English acronym for “Least Angle Regression,” makes it possible to solve said problem. The constraint $\|\alpha_{j,n}\|_1 \leq \lambda$ makes it possible to favor solutions that have few non-zero elements, i.e. sparse solutions. The maximum intensity parameter λ makes it possible to adjust the estimated maximum intensity of the late reverberation. This maximum intensity parameter λ theoretically depends on the acoustic environment, i.e. in one example the enclosed space 110. For each enclosed space 110, there is an optimal value of the maximum intensity parameter λ. However, tests have shown that said maximum intensity parameter λ can be set at an identical value for all enclosed spaces 110 without introducing degradations relative to the optimal value. Thus, the method works in a great variety of enclosed spaces 110 without requiring any particular adjustment, making it possible to avoid errors in the estimation of the reverberation time of the enclosed space 110. Moreover, the method according to the invention does not require any parameters that must be estimated, thus enabling said method to be applied in real time. The value of the maximum intensity parameter λ is between 0 and 1. In one example, the value of the maximum intensity parameter λ is equal to 0.5, which is a good compromise between the reduction of the reverberation and the overall quality of the method.
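To make the constrained problem concrete, the sketch below solves it by projected gradient descent, with a Euclidean projection onto the 1-norm ball of radius λ. This is a deliberately simple stand-in and not the LARS method named in the text; the function names, the fixed iteration count and the step-size rule are assumptions of this sketch.

```python
import math


def project_l1_ball(v, radius):
    """Euclidean projection of v onto the set {a : ||a||_1 <= radius}."""
    if sum(abs(x) for x in v) <= radius:
        return list(v)
    u = sorted((abs(x) for x in v), reverse=True)
    cumulative, theta = 0.0, 0.0
    for i, ui in enumerate(u, start=1):
        cumulative += ui
        t = (cumulative - radius) / i
        if ui - t > 0:
            theta = t  # threshold of the largest prefix that stays positive
    # Soft-threshold every entry by theta, keeping its sign.
    return [math.copysign(max(abs(x) - theta, 0.0), x) for x in v]


def lasso_constrained(D, x, lam=0.5, iterations=500):
    """Minimize ||x - D a||_2 subject to ||a||_1 <= lam (projected gradient)."""
    rows, cols = len(D), len(D[0])
    # 1 / squared Frobenius norm is a safe step size for this quadratic.
    step = 1.0 / max(sum(d * d for row in D for d in row), 1e-12)
    alpha = [0.0] * cols
    for _ in range(iterations):
        residual = [sum(D[i][l] * alpha[l] for l in range(cols)) - x[i]
                    for i in range(rows)]
        grad = [sum(D[i][l] * residual[i] for i in range(rows))
                for l in range(cols)]
        alpha = project_l1_ball([a - step * g for a, g in zip(alpha, grad)], lam)
    return alpha
```

When the unconstrained least-squares solution already satisfies the constraint, the projection is inactive and the result is that solution; otherwise the projection shrinks the coefficients and sets the smallest ones to zero, which is the sparsity effect described above.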
In a step 908, for each time index n and each frequency sampling index k, a current observation vector Xνk,n is created from the set of samples belonging to the kth spectral line Xk of the modulus X of the complex time-frequency transform and falling between the instants n1 and n, notated Xk,n
In a step 909, the synthesis dictionary construction unit 800 constructs a synthesis dictionary $D^{s}$. More precisely, for each time index n and each frequency sampling index k, the synthesis dictionary $D^{s}_{k,n}$ is constructed by concatenating a number L of past observation vectors determined in step 908. The synthesis dictionary $D^{s}_{k,n}$ is thus defined as the matrix
where L and δ are the same parameters as for the analysis dictionary $D^{\alpha}_{j,n}$.
In a step 910, for each time index n and each frequency sampling index k, an estimation of the power spectral density of the late reverberation, or spectrum of the late reverberation $X^{l}_{k,n}$, is constructed by multiplying the synthesis dictionary $D^{s}_{k,n}$ by the prediction vector $\alpha_{j,n}$ according to the formula
$$X^{l}_{k,n} = D^{s}_{k,n}\,\alpha_{j,n} \quad \forall k \in [b_j, e_j],\; j = 1, \ldots, J$$
Thus, the prediction vector αj,n indicates the columns of the synthesis dictionary that have been used for the estimation of the reverberation, and the contribution of each of them to the reverberation. The spectrum of the late reverberation Xl is considered in the rest of the method as a noise signal to be eliminated.
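The reuse of one prediction vector for every spectral line of its subband can be sketched as follows. The function names and the dictionary-of-matrices layout are assumptions of this sketch, and the matrices are taken as lists of rows.

```python
def matvec(D, alpha):
    """Multiply a matrix given as a list of rows by the vector alpha."""
    return [sum(d * a for d, a in zip(row, alpha)) for row in D]


def late_reverb_for_band(synthesis_dicts, alpha_j, b_j, e_j):
    """Apply the subband prediction vector alpha_j to every bin k in [b_j, e_j].

    synthesis_dicts[k] is the synthesis dictionary of bin k at the current frame.
    """
    return {k: matvec(synthesis_dicts[k], alpha_j) for k in range(b_j, e_j + 1)}
```

Each product is a vector with one entry per row of the synthesis dictionary. Which entry serves as the estimate for the current frame is not stated in the text; taking the first entry would be one plausible reading.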
To this end, a filtering of the reverberation is performed by the filtering unit 310. More precisely, in a step 911, for each time index n and each frequency sampling index k, a dereverberation filter Gk,n is constructed according to the formula
where ξk,n is the a priori signal-to-noise ratio, calculated as follows
$$\xi_{k,n} = \beta\, G_{k,n-1}^{2}\, \gamma_{k,n-1} + (1-\beta)\, \max\{\gamma_{k,n} - 1,\; 0\}$$
and where the bound of integration υk,n is calculated as follows
where γk,n is the a posteriori signal-to-noise ratio, calculated according to the formula
where Rk,n is the late reverberation calculated as follows
$$R_{k,n} = \alpha R_{k,n-1} + (1-\alpha)\, |X^{l}_{k,n}|$$
where α is a first smoothing constant and β is a second smoothing constant. In one example, the first smoothing constant α equals 0.77 and the second smoothing constant β equals 0.98.
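The two recursions above can be sketched as follows, using the example smoothing constants. The a priori signal-to-noise recursion is read here in the usual decision-directed form, with the previous gain squared and the previous a posteriori ratio in the first term and the current a posteriori ratio minus one in the second; that reading of the indices is an assumption. The a posteriori ratio itself is taken as an input, since its formula is not reproduced here.

```python
def smooth_reverb(R_prev, X_late, alpha=0.77):
    """Recursive smoothing of the late reverberation estimate."""
    return alpha * R_prev + (1.0 - alpha) * abs(X_late)


def a_priori_snr(G_prev, gamma_prev, gamma, beta=0.98):
    """Decision-directed a priori SNR (assumed reading of the indices)."""
    return beta * G_prev ** 2 * gamma_prev + (1.0 - beta) * max(gamma - 1.0, 0.0)
```

With α = 0.77, a previous estimate of 1.0 and a new late-reverberation magnitude of 2.0 give 0.77 × 1.0 + 0.23 × 2.0 = 1.23, so the estimate moves only about a quarter of the way toward the new value at each frame.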
In essence, the estimated reverberation is not stationary in the long term because the audio signal emitted by the omnidirectional sound source 100 that gives rise to said estimated reverberation is not stationary in the long term. Overly fast variations of the estimated reverberation can introduce annoying artifacts during the filtering. To limit these effects, a recursive smoothing is performed in order to calculate the power spectral density of the late reverberation.
In a step 912, for each time index n and each frequency sampling index k, the observation vectors Xνk,n are filtered by the dereverberation filter Gk,n calculated in step 911 so as to obtain a dereverberated signal modulus Yk,n calculated as follows
$$Y_{k,n} = G_{k,n}\, X_{k,n}.$$
The filter constructed in step 911 strongly attenuates certain observation vectors Xνk,n, which generates artifacts that can be detrimental to the quality of the dereverberated signal. To limit said artifacts, a lower bound is imposed on the attenuation of the filter. Thus, for each frequency sampling index k and for each time index n, if the dereverberation filter Gk,n is less than or equal to a minimum value of the dereverberation filter Gmin, then said dereverberation filter Gk,n is equal to said minimum value of the dereverberation filter Gmin.
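As a hedged sketch of the filtering with a lower bound on the attenuation, the code below floors the gain before applying it to the modulus. The conversion of the minimum value from decibels assumes an amplitude (20·log10) convention, which the text does not state; the function names are hypothetical.

```python
def db_to_linear(db):
    """Convert a gain in decibels to a linear amplitude factor (20*log10 convention)."""
    return 10.0 ** (db / 20.0)


def apply_gain(X, G, G_min_db=-12.0):
    """Apply the dereverberation gain G to the modulus X, floored at G_min."""
    g_min = db_to_linear(G_min_db)
    return max(G, g_min) * X
```

Under this convention, −12 decibels corresponds to a linear factor of about 0.25, so no time-frequency point is attenuated to less than roughly a quarter of its original modulus.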
In a step 913, for each frequency sampling index k and each time index n, the dereverberated signal modulus Yk,n and the phase ∠Xk,n of the complex signal XCk,n are multiplied in order to create a dereverberated complex signal YC.
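The recombination of step 913 amounts to rebuilding a complex number from a modulus and a stored phase. A minimal sketch, assuming the modulus and phase are held as lists of frames of equal shape (the name `recombine` is hypothetical):

```python
import cmath


def recombine(Y_modulus, phase):
    """Rebuild complex values Y * exp(j * phase) frame by frame."""
    return [
        [cmath.rect(y, p) for y, p in zip(frame_y, frame_p)]
        for frame_y, frame_p in zip(Y_modulus, phase)
    ]
```

Because only the modulus is filtered, the phase of the reverberant input is reused unchanged in the dereverberated complex signal.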
In a step 914, a frequency-time transformation is applied by the frequency-time transformation application unit 220 to the dereverberated complex signal Yk,nC in order to obtain a dereverberated time signal y(t) in the time domain. In one example, the frequency-time transformation is an Inverse Short-Term Fourier Transform.
In one embodiment, the value of the number of observation vectors L is equal to 10, the value of the number N of the length of the observation is equal to 8, the value of the delay δ is equal to 5, the value of the maximum intensity parameter λ is equal to 0.5, the value of the number K is equal to 257, the value of the number J is equal to 10, the value of the length of a frame M is equal to 512, and the minimum value of the dereverberation filter Gmin is equal to −12 decibels. The choice of these parameters enables the method to be applied in real time.
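The figures implied by this embodiment can be checked with a few lines of arithmetic. The sampling frequency is not given in the text, so the value `fs=16000` below is purely hypothetical and only serves to convert frame counts into milliseconds; the function name is likewise an assumption.

```python
def embodiment_summary(M=512, L=10, N=8, delta=5, fs=16000):
    """Derived quantities for the example parameters (fs is a hypothetical value)."""
    hop = M // 4                      # hop size R = M/4 samples
    return {
        "K": M // 2 + 1,              # non-redundant bins of a length-M transform
        "hop": hop,
        "frames_stored": L + N + delta,
        "delay_ms": 1000.0 * delta * hop / fs,
        "g_min_linear": 10.0 ** (-12.0 / 20.0),  # amplitude convention assumed
    }
```

With these values the transform has 257 bins, the hop is 128 samples and 23 frames are kept in memory. At the hypothetical 16 kHz sampling frequency, the delay of 5 frames corresponds to 40 milliseconds.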
The method for suppressing the late reverberation of an audio signal according to the invention is fast and offers reduced complexity. Said method can therefore be used in real time. Moreover, this method does not introduce artifacts and is resistant to background noise. Furthermore, said method reduces background noise and is compatible with noise-reduction methods.
The method for suppressing the late reverberation of an audio signal according to the invention requires only one microphone to process the reverberation with precision.
Number | Date | Country | Kind |
---|---|---|---|
1357226 | Jul 2013 | FR | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2014/065594 | 7/21/2014 | WO | 00 |