Method for suppressing the late reverberation of an audio signal

Information

  • Patent Grant
  • 9520137
  • Patent Number
    9,520,137
  • Date Filed
    Monday, July 21, 2014
  • Date Issued
    Tuesday, December 13, 2016
Abstract
A method for suppressing the late reverberation of an audio signal. A plurality of prediction vectors is calculated. A plurality of observation vectors is generated from the modulus of the complex time-frequency transform of an input signal. A plurality of synthesis dictionaries is constructed from the plurality of observation vectors. A late reverberation spectrum is estimated from the plurality of synthesis dictionaries and the plurality of prediction vectors. The plurality of observation vectors is filtered to eliminate the late reverberation spectrum and obtain a dereverberated signal modulus.
Description
RELATED APPLICATIONS

This application is a §371 application from PCT/EP2014/065594 filed Jul. 21, 2014, which claims priority from French Patent Application No. 13 57226 filed Jul. 23, 2013, each of which is herein incorporated by reference in its entirety.


TECHNICAL FIELD

The invention relates to a method for suppressing the late reverberation of an audio signal. The invention is more particularly, though not exclusively, adapted to the field of processing reverberation in an enclosed space.


PRIOR ART


FIG. 1 shows an omnidirectional sound source 100 positioned in an enclosed space 110 such as an automotive vehicle or a room, and a microphone 120. An audio signal emitted by the omnidirectional sound source 100 propagates in all directions. Thus, the signal observed at the level of the microphone is formed by the superimposition of several delayed and attenuated versions of the audio signal emitted by the omnidirectional sound source 100. In essence, the microphone 120 initially captures the source signal 130, also called the direct signal 130, but also the signals 140 reflected off the walls of the enclosed space 110. The various reflected signals 140 have traveled along acoustic paths of various lengths and have been attenuated by the absorption of the walls of the enclosed space 110; the phase and the amplitude of the reflected signals 140 captured by the microphone 120 are therefore different.


There are two types of reflections, early reflections and late reverberation. The microphone 120 captures the early reflection signals with a slight delay relative to the source signal 130, on the order of zero to fifty milliseconds. Said early reflection signals are temporally and spatially separated from the source signal 130, but the human ear does not perceive these early reflection signals and the source signal 130 separately due to an effect called the “precedence effect.” When the audio signal emitted by the omnidirectional sound source 100 is a speech signal, the temporal integration of the early reflection signals by the human ear makes it possible to enhance certain characteristics of the speech, which improves the intelligibility of the audio signal.


Depending on the size of the room, the boundary between the early reflections and the late reverberation is between fifty and eighty milliseconds. The late reverberation comprises numerous reflected signals that are close together in time and therefore impossible to separate. This set of reflected signals is thus considered from a probability standpoint to be a random distribution whose density increases with time. When the audio signal emitted by the omnidirectional sound source 100 is a speech signal, the late reverberation degrades both the quality of said audio signal and its intelligibility. Said late reverberation also affects the performance of speech recognition and sound source separation systems.


According to the prior art, a first method known as “inverse filtering” attempts to identify the impulse response of the enclosed space 110 in order to then construct an inverse filter that can compensate the effects of the reverberation in the audio signal.


This type of method is for example described in the following scientific publications: B. W. Gillespie, H. S. Malvar and D. A. F. Florèncio, “Speech dereverberation via maximum-kurtosis subband adaptive filtering,” Proc. International Conference on Acoustics, Speech and Signal Processing, Volume 6 of ICASSP '01, pages 3701-3704, IEEE, 2001; M. Wu and D. L. Wang, “A two-stage algorithm for one-microphone reverberant speech enhancement,” Audio, Speech and Language Processing, IEEE Transactions on, 14(3): 774-784, 2006; and Saeed Mosayyebpour, Abolghasem Sayyadiyan, Mohsen Zareian, and Ali Shahbazi, “Single Channel Inverse Filtering of Room Impulse Response by Maximizing Skewness of LP Residual.”


This method uses, in the time domain, distortions introduced by reverberation in parameters of a linear prediction model of the audio signal. Proceeding from the observation that reverberation primarily modifies the residual of the linear prediction model of the audio signal, a filter that maximizes the higher order moments of said residual is constructed. This method is adapted to short impulse responses and is primarily used to compensate early reflection signals.


However, this method assumes that the impulse response of the enclosed space 110 does not vary over time. Furthermore, this method does not model late reverberation. Said method must thus be combined with another method for processing the late reverberation. These two methods combined require a large number of iterations before convergence is obtained, which means that said methods cannot be used for a real-time application. Moreover, the inverse filtering introduces artifacts such as pre-echoes, which must then be compensated.


A second method known as the “cepstral” method attempts to separate the effects of the enclosed space 110 and the audio signal in the cepstral domain. In essence, reverberation modifies the average and the variance of the cepstra of the reflected signals relative to the average and the variance of the cepstra of the source signal 130. Thus, when the average and the variance of the cepstra are normalized, the reverberation is attenuated.


This type of method is for example described in the following scientific publication: D. Bees, M. Blostein, and P. Kabal, “Reverberant speech enhancement using cepstral processing,” ICASSP '91 Proceedings of the Acoustics, Speech and Signal Processing, 1991.


This method is particularly useful for voice recognition problems since the reference databases of recognition systems can also be normalized so as to more closely approximate the signals captured by the microphone 120. However, the effects of the enclosed space 110 and the audio signal cannot be completely separated in the cepstral domain. Using this method therefore produces a distortion of the timbre of the audio signal emitted by the omnidirectional sound source 100. Moreover, this method processes early reflections rather than late reverberation.


A third method known as “estimating the power spectral density of late reverberation” makes it possible to establish a parametric model of the late reverberation.


This type of method is for example described in the following scientific publications: E. A. P. Habets, “Single- and Multi-Microphone Speech Dereverberation using Spectral Enhancement,” PhD thesis, Technische Universiteit Eindhoven, 2007; and T. Yoshioka, “Speech Enhancement in Reverberant Environments,” PhD thesis, 2010.


According to this third method, an estimation of the power spectral density of the late reverberation makes it possible to construct a spectral subtraction filter for the dereverberation. Spectral subtraction introduces artifacts such as musical noise, but said artifacts can be limited by applying more complex filtering schemes, as used in denoising methods.


However, an important parameter for estimating the power spectral density of late reverberation in the context of this third method is the reverberation time. Reverberation time is a parameter that is difficult to estimate with precision. The estimation of the reverberation time is distorted by background noise and other interfering audio signals. Moreover, this estimation of reverberation time is time-consuming and thus increases execution time.


A fourth method exploits the sparsity of speech signals in the time-frequency plane.


This type of method is for example described in the following scientific publication: T. Yoshioka, “Speech Enhancement in Reverberant Environments,” PhD thesis, 2010.


In this publication, the late reverberation is modeled as a delayed and attenuated version of the current observation whose attenuation factor is determined by solving a maximum likelihood problem with a sparsity constraint.


This type of method is also described in the following scientific publication: H. Kameoka, T. Nakatani, and T. Yoshioka, “Robust speech dereverberation based on nonnegativity and sparse nature of speech spectrograms,” Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '09, pages 45-48, IEEE Computer Society, 2009.


Dereverberation is approached in this publication as a problem of deconvolution by nonnegative matrix factorization, which makes it possible to separate the response of the enclosed space 110 from the audio signal. However, this method introduces a lot of noise and distortion. Moreover, said method depends on the initialization of the matrices for the factorization.


Furthermore, the methods cited require a plurality of microphones in order to process the reverberation with precision.


SUMMARY OF THE INVENTION

A particular object of the invention is to solve all or some of the above-mentioned problems.


To this end, the invention relates to a method for suppressing the late reverberation of an audio signal, characterized in that it comprises the following steps:

    • capture of an input signal formed by the superimposition of several delayed and attenuated versions of the audio signal,
    • application of a time-frequency transformation to the input signal in order to obtain a complex time-frequency transform of the input signal,
    • calculation of a plurality of prediction vectors,
    • creation of a plurality of observation vectors from the modulus of the complex time-frequency transform of the input signal,
    • construction of a plurality of synthesis dictionaries from the plurality of observation vectors,
    • estimation of a late reverberation spectrum from the plurality of synthesis dictionaries and the plurality of prediction vectors,
    • filtering of the plurality of observation vectors so as to eliminate the late reverberation spectrum and obtain a dereverberated signal modulus.


Thus, the method that is the subject of the invention is fast and offers reduced complexity. Said method can therefore be used in real time. Furthermore, this method does not introduce artifacts and is resistant to background noise. Moreover, said method reduces background noise and is compatible with noise reduction methods.


The invention can be implemented according to the embodiments described below, which may be considered individually or in any technically feasible combination.


Advantageously, the method also comprises the following steps:

    • creation of a frequency subsampled modulus from the modulus of the complex time-frequency transform of the input signal,
    • creation of a plurality of subsampled observation vectors from said frequency subsampled modulus,
    • construction of a plurality of analysis dictionaries from the plurality of subsampled observation vectors,
    • calculation of the plurality of prediction vectors from the plurality of subsampled observation vectors and the plurality of analysis dictionaries.


Advantageously, the step for calculating the plurality of prediction vectors is performed by minimizing, for each prediction vector, the expression $\|\tilde{X}^{\nu} - D^{a}\alpha\|_2$, which is the Euclidean norm of the difference between the subsampled observation vector associated with said prediction vector and the analysis dictionary associated with said prediction vector multiplied by said prediction vector, taking into account the constraint $\|\alpha\|_1 \le \lambda$, according to which the norm 1 of said prediction vector is less than or equal to a maximum intensity parameter of the late reverberation.


Advantageously, the value of the maximum intensity parameter of the late reverberation is between 0 and 1.


Advantageously, the method also comprises the following step:

    • creation of a dereverberated complex signal from the dereverberated signal modulus and the phase of the complex time-frequency transform of the input signal.


Advantageously, the method also comprises the following step:

    • application of a frequency-time transformation to the dereverberated complex signal so as to obtain a dereverberated time signal.


Advantageously, the method also comprises a step for constructing a dereverberation filter according to the model

$$G = \frac{\xi}{1+\xi}\,\exp\left(\frac{1}{2}\int_{\nu}^{\infty}\frac{e^{-t}}{t}\,dt\right),$$

where ξ is the a priori signal-to-noise ratio and where the bound of integration ν is calculated according to the model

$$\nu = \gamma\,\frac{\xi}{1+\xi},$$

where γ is the a posteriori signal-to-noise ratio.


The invention also relates to a device for suppressing the late reverberation of an audio signal, characterized in that it comprises means for

    • capturing an input signal formed by the superimposition of several delayed and attenuated versions of the audio signal,
    • applying a time-frequency transformation to the input signal in order to obtain a complex time-frequency transform of the input signal,
    • calculating a plurality of prediction vectors,
    • creating a plurality of observation vectors from the modulus of the complex time-frequency transform of the input signal,
    • constructing a plurality of synthesis dictionaries from the plurality of observation vectors,
    • estimating a late reverberation spectrum from the plurality of synthesis dictionaries and the plurality of prediction vectors,
    • filtering the plurality of observation vectors so as to eliminate the late reverberation spectrum and obtain a dereverberated signal modulus.





DESCRIPTION OF THE FIGURES

The invention will be more clearly understood by reading the following description, given as a nonlimiting example in reference to the figures, which show:

    • FIG. 1 (already described): a schematic illustration of an omnidirectional sound source and a microphone positioned in an enclosed space according to an exemplary embodiment of the invention;
    • FIG. 2: a schematic illustration of an audio signal dereverberation device according to an exemplary embodiment of the invention;
    • FIG. 3: a schematic illustration of a dereverberation unit of an audio signal dereverberation device according to an exemplary embodiment of the invention;
    • FIG. 4: a schematic illustration of a late reverberation estimation unit of an audio signal dereverberation device according to an exemplary embodiment of the invention;
    • FIG. 5: a schematic illustration of a subband grouping of a modulus of a complex time-frequency transform of an input signal according to an exemplary embodiment of the invention;
    • FIG. 6: a schematic illustration of a prediction vector calculation unit of an audio signal dereverberation device according to an exemplary embodiment of the invention;
    • FIG. 7: a schematic illustration of a prediction vector calculation unit of an audio signal dereverberation device according to an exemplary embodiment of the invention;
    • FIG. 8: a schematic illustration of a reverberation evaluation unit of an audio signal dereverberation device according to an exemplary embodiment of the invention;
    • FIG. 9: a functional diagram showing various steps of the method according to an exemplary embodiment of the invention.





In these figures, references that are identical from one figure to another designate identical or comparable elements. For the sake of clarity, the elements shown are not to scale, unless otherwise indicated.


DETAILED DESCRIPTION OF THE EMBODIMENTS

The invention uses a device for dereverberating an audio signal emitted by an omnidirectional sound source 100 positioned in an enclosed space 110 such as an automotive vehicle or a room and captured by a microphone 120. Said dereverberation device is inserted into the audio processing chain of a device such as a telephone. This dereverberation device comprises a unit for applying a time-frequency transform 200, a dereverberation unit 210, and a unit for applying a frequency-time transform 220 (cf. FIG. 2). The dereverberation unit 210 comprises a late reverberation estimation unit 300 and a filtering unit 310 (cf. FIG. 3). The late reverberation estimation unit 300 comprises a subband grouping unit 400, a prediction vector calculation unit 410 and a reverberation evaluation unit 420 (cf. FIG. 4). The prediction vector calculation unit 410 comprises an observation construction unit 700, an analysis dictionary construction unit 710 and a LASSO solving unit 720 (cf. FIG. 7). The reverberation evaluation unit 420 comprises a synthesis dictionary construction unit 800 (cf. FIG. 8).


In a step 900, a microphone 120 captures an input signal x(t) formed by the superimposition of several delayed and attenuated versions of the audio signal emitted by the omnidirectional sound source 100. In essence, the microphone 120 initially captures the source signal 130, also called the direct signal 130, but also the signals 140 reflected off the walls of the enclosed space 110. The various reflected signals 140 have traveled along acoustic paths of various lengths and have been attenuated by the absorption of the walls of the enclosed space 110; the phase and the amplitude of the reflected signals 140 captured by the microphone 120 are therefore different.


There are two types of reflections, early reflections and late reverberation. The microphone 120 captures the early reflection signals with a slight delay relative to the source signal 130, on the order of zero to fifty milliseconds. Said early reflection signals are temporally and spatially separated from the source signal 130, but the human ear does not perceive these early reflection signals and the source signal 130 separately due to an effect called the “precedence effect.” When the audio signal emitted by the omnidirectional sound source 100 is a speech signal, the temporal integration of the early reflection signals by the human ear makes it possible to enhance certain characteristics of the speech, which improves the intelligibility of the audio signal.


The microphone 120 captures the late reverberation fifty to eighty milliseconds after the arrival of the source signal 130. The late reverberation comprises numerous reflected signals that are close together in time and therefore impossible to separate. This set of reflected signals is thus considered from a probability standpoint to be a random distribution whose density increases with time. When the audio signal emitted by the omnidirectional sound source 100 is a speech signal, the late reverberation degrades both the quality of said audio signal and its intelligibility. Said late reverberation also affects the performance of speech recognition and sound source separation systems.


The input signal x(t) is sampled at a sampling frequency fs. The input signal x(t) is thus subdivided into samples. In order to suppress the late reverberation of said input signal x(t), the power spectral density of the late reverberation is estimated, after which a dereverberation filter is constructed by the dereverberation unit 210. The estimation of the power spectral density of the late reverberation, the construction of the dereverberation filter, and the application of said dereverberation filter are performed in the frequency domain. Thus, in a step 901, a time-frequency transformation is applied to the input signal x(t) by the Short-Term Fourier Transform application unit 200 in order to obtain a complex time-frequency transform of the input signal x(t), notated XC (cf. FIG. 2). In one example, the time-frequency transform is a Short-Term Fourier Transform.


Each element $X^{C}_{k,n}$ of the complex time-frequency transform $X^{C}$ is calculated as follows:

$$X^{C}_{k,n} = \sum_{m=0}^{M-1} x(m+nR)\,w(m)\,e^{-j 2\pi k m / M}$$

where k is a frequency sampling index with a value between 1 and a number K, n is a time index with a value between 1 and a number N, w(m) is a sliding analysis window, m is the index of the elements belonging to a frame, M is the length of a frame, i.e. the number of samples in a frame, and R is the hop size of the time-frequency transformation.
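
The transform defined above can be sketched as follows. This is a minimal illustration in plain Python, not the implementation of unit 200: the Hann window, the 0-based indices, the restriction to the M/2+1 non-negative frequencies, and the name `stft` are assumptions made for the example, since the text only specifies a sliding analysis window w(m).

```python
import cmath
import math


def stft(x, M, R, window=None):
    """Return XC[n][k] = sum_m x[m + n*R] * w[m] * exp(-2j*pi*k*m/M)."""
    if window is None:
        # Assumption: periodic Hann window (the text does not name the window).
        window = [0.5 - 0.5 * math.cos(2.0 * math.pi * m / M) for m in range(M)]
    K = M // 2 + 1                      # non-negative frequencies of a real signal
    n_frames = 1 + (len(x) - M) // R    # only complete frames are analyzed
    frames = []
    for n in range(n_frames):
        seg = [x[m + n * R] * window[m] for m in range(M)]
        frames.append([
            sum(seg[m] * cmath.exp(-2j * math.pi * k * m / M) for m in range(M))
            for k in range(K)
        ])
    return frames


# Usage: a pure tone centered on bin 2 of an 8-sample frame, hop size R = M/4.
M, R = 8, 2
x = [math.cos(2.0 * math.pi * 2 * t / M) for t in range(32)]
XC = stft(x, M, R)
modulus = [[abs(v) for v in frame] for frame in XC]        # |X|, used for estimation
phase = [[cmath.phase(v) for v in frame] for frame in XC]  # stored for resynthesis
```

The direct summation costs M operations per bin and is only meant to mirror the formula; a practical implementation would presumably use a fast Fourier transform.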


The input signal x(t) is analyzed by frames of length M with a hop size R equal to M/4 samples. For each frame of the input signal x(t) in the time domain, a discrete time-frequency transform with a frequency sampling index k and a time index n is thus calculated using the algorithm of the time-frequency transformation in order to obtain a complex signal XCk,n, defined by

$$X^{C}_{k,n} = |X_{k,n}|\,e^{j\angle X_{k,n}}$$


where $|X_{k,n}|$ is the modulus of the complex signal $X^{C}_{k,n}$, and $\angle X_{k,n}$ is the phase of the complex signal $X^{C}_{k,n}$.


The estimation of the power spectral density of the late reverberation is performed on the modulus of the complex time-frequency transform of the input signal XC, notated X. The phase of the complex time frequency transform XC, notated ∠X, is stored in memory and is used to reconstruct a dereverberated signal in the time domain after the application of the dereverberation filter.


The modulus X of the complex time-frequency transform of the input signal $X^{C}$ is then grouped into subbands. More precisely, said modulus X comprises the number K of spectral lines notated $X_k$. The term “spectral line” in this context designates all the samples of the modulus X of the complex time-frequency transform of the input signal $X^{C}$ for the frequency sampling index k and all of the time indices n. In a step 903, the subband grouping unit 400 groups the K spectral lines $X_k$ into a number J of subbands, in order to obtain a frequency subsampled modulus notated $\tilde{X}$ comprising a number J of spectral lines notated $\tilde{X}_j$, where j is a frequency subsampling index between 1 and the number J. The number J is less than the number K. Each subband thus comprises a plurality of spectral lines $X_k$, the frequency index k belonging to an interval having a lower bound $b_j$ and an upper bound $e_j$. In one example, each subband corresponds to an octave in order to adapt to the sound perception model of the human ear. Next, in a step 904, the subband grouping unit 400 calculates, for each subband, the mean of the spectral lines $X_k$ of said subband in order to obtain the J spectral lines $\tilde{X}_j$ of the frequency subsampled modulus $\tilde{X}$ (cf. FIG. 5).
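
Steps 903 and 904 can be illustrated with the short sketch below. The band layout produced by `octave_bounds` is a hypothetical one (each band doubles in width, indices are 0-based and inclusive); the text only states that each subband corresponds to an octave in one example and does not give the bounds $b_j$ and $e_j$.

```python
def octave_bounds(K):
    """Hypothetical octave-like band edges (b_j, e_j) covering bins 0..K-1."""
    bounds = [(0, 0)]                 # bin 0 alone, then bands of doubling width
    lo = 1
    while lo < K:
        hi = min(2 * lo - 1, K - 1)
        bounds.append((lo, hi))
        lo = hi + 1
    return bounds


def group_subbands(modulus_frame, bounds):
    """Average the spectral lines X_k of each band (step 904)."""
    return [sum(modulus_frame[b:e + 1]) / (e - b + 1) for b, e in bounds]


# Usage on one frame of a 9-bin modulus.
bounds = octave_bounds(9)
frame = [1.0, 2.0, 3.0, 5.0, 1.0, 1.0, 1.0, 1.0, 4.0]
subsampled = group_subbands(frame, bounds)
```

With this particular layout, 257 spectral lines happen to fall into 10 bands, but nothing in the text confirms that this is the grouping actually used.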


Next, the prediction vector calculation unit 410 calculates, for each spectral line $\tilde{X}_j$ of the frequency subsampled modulus $\tilde{X}$ and for each time index n, a prediction vector $\alpha_{j,n}$ (cf. FIG. 6). More precisely, in a step 905, the observation construction unit 700 constructs, for each time index n and frequency subsampling index j, a subsampled observation vector $\tilde{X}^{\nu}_{j,n}$ from the set of samples $\tilde{X}_{j,n_1:n}$ belonging to the jth spectral line $\tilde{X}_j$ of the frequency subsampled modulus $\tilde{X}$ and falling between the instants $n_1 = n-N+1$ and n, where n is the index of the current instant and $n-n_1$ is the size of the memory of the dereverberation device. Each subsampled observation vector $\tilde{X}^{\nu}_{j,n}$ is defined by:

$$\tilde{X}^{\nu}_{j,n} := [\tilde{X}_{j,n} \;\dots\; \tilde{X}_{j,n-N+1}]^{T}$$


Each observation vector $\tilde{X}^{\nu}_{j,n}$ has the size of N×1, where the number N is the length of the observation. The length of the observation N is the number of frames of the time-frequency transformation required for the estimation of the late reverberation. The length of the observation N makes it possible to define the time resolution of the estimation. When the length of the observation N increases, the complexity of the system is reduced. The subsampling of the modulus X of the complex time-frequency transform of the input signal $X^{C}$ makes it possible, among other things, to apply the method in real time.


In a step 906, the analysis dictionary construction unit 710 constructs analysis dictionaries $D^{a}$. More precisely, for each time index n and frequency subsampling index j, an analysis dictionary $D^{a}_{j,n}$ is constructed by concatenating a number L of past observation vectors determined in step 905. The analysis dictionary $D^{a}_{j,n}$ is thus defined as the matrix

$$
D^{a}_{j,n} :=
\begin{bmatrix}
\tilde{X}_{j,n-\delta} & \tilde{X}_{j,n-\delta-1} & \cdots & \tilde{X}_{j,n-\delta-L+1} \\
\tilde{X}_{j,n-\delta-1} & \tilde{X}_{j,n-\delta-2} & \cdots & \tilde{X}_{j,n-\delta-L} \\
\vdots & \vdots & \ddots & \vdots \\
\tilde{X}_{j,n-\delta-N+1} & \tilde{X}_{j,n-\delta-N} & \cdots & \tilde{X}_{j,n-\delta-L-N+2}
\end{bmatrix}
$$

where L is the number of past observation vectors and hence the size of the analysis dictionary $D^{a}_{j,n}$, and $\delta \in \mathbb{R}^{*}$ is the delay of the analysis dictionary $D^{a}_{j,n}$. More precisely, the delay δ is the frame delay between the current subsampled observation vector $\tilde{X}^{\nu}_{j,n}$ and the other subsampled observation vectors belonging to the analysis dictionary $D^{a}_{j,n}$. Said delay δ makes it possible to reduce the distortions introduced by the method. This delay δ also makes it possible to improve the separation of the late reverberation from the early reflections. In order to calculate the current observation vector $\tilde{X}^{\nu}_{j,n}$ and the analysis dictionary $D^{a}_{j,n}$, and thus the prediction vector $\alpha_{j,n}$, for each spectral line $\tilde{X}_j$ and for each time index n, a number L+N+δ of frames must be stored in memory.
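
The construction of an observation vector and of a delayed dictionary can be sketched as below. The function names and the 0-based frame indices are assumptions for illustration; the entry at row i and column l of the dictionary is the sample at frame n−δ−l−i, so that column l is the observation vector taken at frame n−δ−l.

```python
def observation_vector(line, n, N):
    """[line[n], line[n-1], ..., line[n-N+1]]: the N latest samples, newest first."""
    if n - N + 1 < 0:
        raise ValueError("not enough stored frames")
    return [line[n - i] for i in range(N)]


def build_dictionary(line, n, N, L, delta):
    """N x L matrix whose column l is the observation vector at frame n - delta - l."""
    if n - delta - (L - 1) - (N - 1) < 0:
        raise ValueError("not enough stored frames")
    columns = [observation_vector(line, n - delta - l, N) for l in range(L)]
    # Transpose the list of columns into a list of rows.
    return [[columns[l][i] for l in range(L)] for i in range(N)]


# Usage: a stand-in spectral line whose value equals its frame index,
# which makes the index pattern of the matrix directly visible.
line = list(range(20))
current = observation_vector(line, n=19, N=3)
D = build_dictionary(line, n=19, N=3, L=4, delta=5)
```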


In a step 907, the LASSO solving unit 720 solves a so-called “LASSO” problem, which is to minimize the Euclidean norm $\|\tilde{X}^{\nu}_{j,n} - D^{a}_{j,n}\alpha_{j,n}\|_2$, taking into account the constraint $\|\alpha_{j,n}\|_1 \le \lambda$, where λ is a maximum intensity parameter. Solving said problem amounts to finding the best linear combination of the L vectors of the dictionary for approximating the current observation. In one example, a method known as LARS, the English acronym for “Least Angle Regression,” makes it possible to solve said problem. The constraint $\|\alpha_{j,n}\|_1 \le \lambda$ makes it possible to favor solutions that have few non-zero elements, i.e. sparse solutions. The maximum intensity parameter λ makes it possible to adjust the estimated maximum intensity of the late reverberation. This maximum intensity parameter λ theoretically depends on the acoustic environment, i.e. in one example the enclosed space 110. For each enclosed space 110, there is an optimal value of the maximum intensity parameter λ. However, tests have shown that said maximum intensity parameter λ can be set at an identical value for all enclosed spaces 110 without introducing degradations relative to the optimal value. Thus, the method works in a great variety of enclosed spaces 110 without requiring any particular adjustment, making it possible to avoid errors in the estimation of the reverberation time of the enclosed space 110. Moreover, the method according to the invention does not require any parameters that must be estimated, thus enabling said method to be applied in real time. The value of the maximum intensity parameter λ is between 0 and 1. In one example, the value of the maximum intensity parameter λ is equal to 0.5, which is a good compromise between the reduction of the reverberation and the overall quality of the method.
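
The constrained problem of step 907 can be illustrated with the sketch below. It does not use LARS: as a simpler stand-in, it applies projected gradient descent, where each gradient step is followed by a Euclidean projection onto the set of vectors whose norm 1 is at most λ (a sort-based projection). The function names, the iteration count and the step size (the inverse of the squared Frobenius norm of the dictionary) are assumptions chosen for the example.

```python
import math


def project_l1_ball(v, radius):
    """Euclidean projection of v onto {a : sum(|a_i|) <= radius}."""
    if sum(abs(t) for t in v) <= radius:
        return list(v)
    u = sorted((abs(t) for t in v), reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        t = (cumulative - radius) / j
        if uj - t > 0:
            theta = t               # keep the threshold of the last valid index
    # Soft-threshold every entry by theta, keeping its sign.
    return [math.copysign(max(abs(t) - theta, 0.0), t) for t in v]


def lasso_constrained(x, D, lam, n_iter=500):
    """Minimize ||x - D a||_2 subject to ||a||_1 <= lam (projected gradient)."""
    N, L = len(D), len(D[0])
    # The squared Frobenius norm bounds the Lipschitz constant of the gradient.
    lipschitz = sum(d * d for row in D for d in row) or 1.0
    a = [0.0] * L
    for _ in range(n_iter):
        residual = [sum(D[i][l] * a[l] for l in range(L)) - x[i] for i in range(N)]
        grad = [sum(D[i][l] * residual[i] for i in range(N)) for l in range(L)]
        a = project_l1_ball([a[l] - grad[l] / lipschitz for l in range(L)], lam)
    return a


# Usage: with an identity dictionary the solution is the projection of x itself.
x = [1.0, 0.2]
D = [[1.0, 0.0], [0.0, 1.0]]
alpha = lasso_constrained(x, D, lam=0.5)
```

In this small example the whole budget λ = 0.5 goes to the first coefficient and the second one is driven to zero, which illustrates how the constraint favors sparse solutions.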


In a step 908, for each time index n and each frequency sampling index k, a current observation vector $X^{\nu}_{k,n}$ is created from the set of samples belonging to the kth spectral line $X_k$ of the modulus X of the complex time-frequency transform and falling between the instants $n_1$ and n, notated $X_{k,n_1:n}$, where n is the current instant index and $n-n_1$ is the size of the memory of the dereverberation device. Each observation vector $X^{\nu}_{k,n}$ is defined by the formula $X^{\nu}_{k,n} := [X_{k,n} \;\dots\; X_{k,n-N+1}]^{T}$ and is of a size N×1, where N is the length of the observation.


In a step 909, the synthesis dictionary construction unit 800 constructs a synthesis dictionary $D^{s}$. More precisely, for each time index n and each frequency sampling index k, the synthesis dictionary $D^{s}_{k,n}$ is constructed by concatenating a number L of past observation vectors determined in step 908. The synthesis dictionary $D^{s}_{k,n}$ is thus defined as the matrix

$$
D^{s}_{k,n} :=
\begin{bmatrix}
X_{k,n-\delta} & X_{k,n-\delta-1} & \cdots & X_{k,n-\delta-L+1} \\
X_{k,n-\delta-1} & X_{k,n-\delta-2} & \cdots & X_{k,n-\delta-L} \\
\vdots & \vdots & \ddots & \vdots \\
X_{k,n-\delta-N+1} & X_{k,n-\delta-N} & \cdots & X_{k,n-\delta-L-N+2}
\end{bmatrix}
$$

where L and δ are the same parameters as for the analysis dictionary $D^{a}_{j,n}$.


In a step 910, for each time index n and each frequency sampling index k, an estimation of the power spectral density of the late reverberation, or the spectrum of the late reverberation $X^{l}_{k,n}$, is constructed by a multiplication of the synthesis dictionary $D^{s}_{k,n}$ with the prediction vector $\alpha_{j,n}$ according to the formula

$$X^{l}_{k,n} = D^{s}_{k,n}\,\alpha_{j,n} \quad \forall k \in [b_j, e_j],\; j = 1, \dots, J$$


Thus, the prediction vector αj,n indicates the columns of the synthesis dictionary that have been used for the estimation of the reverberation, and the contribution of each of them to the reverberation. The spectrum of the late reverberation Xl is considered in the rest of the method as a noise signal to be eliminated.
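
The estimation of step 910 can be sketched as follows. One point is an interpretation rather than something the text states: the product of the N×L synthesis dictionary with the L×1 prediction vector is an N×1 vector, and this sketch keeps its first entry (the row built from frames n−δ to n−δ−L+1) as the estimate for the current frame. The function name and the data layout `X[k][t]` are likewise hypothetical.

```python
def late_reverb_estimate(X, n, band_bounds, alphas, N, L, delta):
    """X[k][t]: modulus of bin k at frame t; alphas[j]: prediction vector of band j.

    Every bin k of band j reuses the prediction vector computed for that band.
    Returns {k: first entry of (synthesis dictionary of bin k) x alphas[j]}.
    """
    estimate = {}
    for j, (b, e) in enumerate(band_bounds):
        for k in range(b, e + 1):
            # Row i, column l of the synthesis dictionary is X[k][n - delta - l - i].
            product = [
                sum(X[k][n - delta - l - i] * alphas[j][l] for l in range(L))
                for i in range(N)
            ]
            estimate[k] = product[0]   # assumption: entry aligned with frame n
    return estimate


# Usage: two bins in a single band, constant moduli, sparse prediction vector.
X = [[1.0] * 12, [2.0] * 12]
estimate = late_reverb_estimate(
    X, n=11, band_bounds=[(0, 1)], alphas=[[0.3, 0.0, 0.2]], N=2, L=3, delta=2
)
```

The zero in the prediction vector means that the corresponding column of the dictionary does not contribute to the estimated reverberation.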


To this end, a filtering of the reverberation is performed by the filtering unit 310. More precisely, in a step 911, for each time index n and each frequency sampling index k, a dereverberation filter $G_{k,n}$ is constructed according to the formula

$$G_{k,n} = \frac{\xi_{k,n}}{1+\xi_{k,n}}\,\exp\left(\frac{1}{2}\int_{\nu_{k,n}}^{\infty}\frac{e^{-t}}{t}\,dt\right)$$

where $\xi_{k,n}$ is the a priori signal-to-noise ratio, calculated as follows

$$\xi_{k,n} = \beta\,G_{k,n-1}^{2}\,\gamma_{k,n-1} + (1-\beta)\max\{\gamma_{k,n}-1,\,0\}$$

and where the bound of integration $\nu_{k,n}$ is calculated as follows

$$\nu_{k,n} = \gamma_{k,n}\,\frac{\xi_{k,n}}{1+\xi_{k,n}}$$

where $\gamma_{k,n}$ is the a posteriori signal-to-noise ratio, calculated according to the formula

$$\gamma_{k,n} = \frac{|X_{k,n}|^{2}}{|R_{k,n}|^{2}}$$

where $R_{k,n}$ is the late reverberation calculated as follows

$$R_{k,n} = \alpha\,R_{k,n-1} + (1-\alpha)\,|X^{l}_{k,n}|$$


where α is a first smoothing constant and β is a second smoothing constant. In one example, the first smoothing constant α equals 0.77 and the second smoothing constant β equals 0.98.
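
The computation of the filter for a single time-frequency point can be sketched as below. Python's standard library has no exponential integral, so the helper `exp_integral_e1` is hand-written for the example (power series up to 1, continued fraction above 1). The small floors that guard against division by zero, the function names and the way the previous gain and the previous a posteriori ratio are passed in are assumptions, not part of the described method.

```python
import math

EULER_GAMMA = 0.5772156649015329


def exp_integral_e1(x):
    """Integral from x to infinity of exp(-t)/t dt, for x > 0."""
    if x <= 0.0:
        raise ValueError("x must be positive")
    if x <= 1.0:
        # Power series: -gamma - ln(x) - sum_{k>=1} (-x)^k / (k * k!)
        total, term = 0.0, 1.0
        for k in range(1, 60):
            term *= -x / k              # term = (-x)^k / k!
            total -= term / k
        return -EULER_GAMMA - math.log(x) + total
    # Continued fraction evaluated with the modified Lentz scheme.
    b = x + 1.0
    c, d = 1e300, 1.0 / b
    h = d
    for i in range(1, 200):
        a = -float(i * i)
        b += 2.0
        d = 1.0 / (a * d + b)
        c = b + a / c
        delta = c * d
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return h * math.exp(-x)


def dereverb_gain(x_mod, r, prev_gain, prev_gamma, beta=0.98):
    """Return (gain, gamma) for one bin from |X|, the smoothed reverberation R,
    and the gain and a posteriori ratio of the previous frame."""
    gamma = (x_mod * x_mod) / max(r * r, 1e-12)          # a posteriori ratio
    xi = beta * prev_gain ** 2 * prev_gamma + (1.0 - beta) * max(gamma - 1.0, 0.0)
    xi = max(xi, 1e-6)                                   # guard (assumption)
    v = max(gamma * xi / (1.0 + xi), 1e-12)              # bound of integration
    gain = xi / (1.0 + xi) * math.exp(0.5 * exp_integral_e1(v))
    return gain, gamma


# Usage with the example value beta = 0.98 and hypothetical previous-frame values.
gain, gamma = dereverb_gain(x_mod=2.0, r=1.0, prev_gain=1.0, prev_gamma=1.0)
```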


In essence, the estimated reverberation is not stationary in the long-term because the audio signal emitted by the omnidirectional sound source 100 that gives rise to said estimated reverberation is not stationary in the long term. Overly fast variations of the estimated reverberation can introduce annoying artifacts during the filtering. To limit these effects, a recursive smoothing is performed in order to calculate the power spectral density of the late reverberation.
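
The recursive smoothing can be illustrated on a single frequency line as follows. The function name and the zero initial state are assumptions; the constant 0.77 is the example value given for the first smoothing constant.

```python
def smooth_reverb(estimates, alpha=0.77, initial=0.0):
    """R_n = alpha * R_{n-1} + (1 - alpha) * |X^l_n| along the time index."""
    smoothed, prev = [], initial
    for value in estimates:
        prev = alpha * prev + (1.0 - alpha) * abs(value)
        smoothed.append(prev)
    return smoothed


# Usage: a reverberation estimate that switches off abruptly after three frames.
raw = [1.0, 1.0, 1.0, 0.0, 0.0]
smoothed = smooth_reverb(raw)
```

The smoothed sequence rises and decays gradually instead of following the abrupt change of the raw estimate, which is the behavior intended to limit artifacts during the filtering.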


In a step 912, for each time index n and each frequency sampling index k, the observation vectors Xνk,n are filtered by the dereverberation filter Gk,n calculated in step 911 so as to obtain a dereverberated signal modulus Yk,n calculated as follows

$$Y_{k,n} = G_{k,n}\,X_{k,n}.$$


The filter constructed in step 911 strongly attenuates certain observation vectors Xνk,n, which generates artifacts that can be detrimental to the quality of the dereverberated signal. To limit said artifacts, a lower bound is imposed on the attenuation of the filter. Thus, for each frequency sampling index k and for each time index n, if the dereverberation filter Gk,n is less than or equal to a minimum value of the dereverberation filter Gmin, then said dereverberation filter Gk,n is equal to said minimum value of the dereverberation filter Gmin.
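
The lower bound on the attenuation can be sketched as follows. The conversion assumes that the decibel value refers to an amplitude ratio (factor 20 in the logarithm), which the text does not specify; the function name is hypothetical.

```python
def apply_gain_floor(gain, floor_db=-12.0):
    """Clamp the filter gain to a minimum value given in decibels."""
    g_min = 10.0 ** (floor_db / 20.0)   # assumption: amplitude ratio, not power
    return max(gain, g_min)


# Usage: a strongly attenuating gain is raised to the floor, a mild one is kept.
floored = apply_gain_floor(0.01)
unchanged = apply_gain_floor(0.8)
```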


In a step 913, for each frequency sampling index k and each time index n, the dereverberated signal modulus Yk,n is combined with the phase ∠Xk,n of the complex signal Xk,nC, that is, multiplied by the unit-modulus phase term exp(j∠Xk,n), in order to create a dereverberated complex signal Yk,nC.
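
Step 913 can be illustrated with the short Python sketch below, which uses only the standard `cmath` module. The function name `recombine` and the sample values are hypothetical and serve only as an illustration.

```python
import cmath


def recombine(y_mod, x_complex):
    """Attach the phase of each complex input bin to the dereverberated modulus."""
    return [
        y * cmath.exp(1j * cmath.phase(xc))  # modulus times unit-modulus phase term
        for y, xc in zip(y_mod, x_complex)
    ]


yc = recombine([0.5, 2.0], [3 + 4j, -1j])
```

Each output bin keeps the phase of the corresponding input bin, while its modulus is the dereverberated modulus.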


In a step 914, a frequency-time transformation is applied by the frequency-time transformation application unit 220 to the dereverberated complex signal Yk,nC in order to obtain a dereverberated time signal y(t) in the time domain. In one example, the frequency-time transformation is an Inverse Short-Term Fourier Transform.


In one embodiment, the number of observation vectors L is equal to 10, the length N of the observation is equal to 8, the delay δ is equal to 5, the maximum intensity parameter λ is equal to 0.5, the number K is equal to 257, the number J is equal to 10, the length of a frame M is equal to 512, and the minimum value of the dereverberation filter Gmin is equal to −12 decibels. This choice of parameters enables the method to be applied in real time.
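
As a worked check of these values, and under two assumptions that the text does not state explicitly (that K counts the non-redundant frequency bins of an M-point transform of a real signal, and that Gmin in decibels is an amplitude ratio), the figures are related as follows:

```latex
K = \frac{M}{2} + 1 = \frac{512}{2} + 1 = 257,
\qquad
G_{\min} = 10^{-12/20} \approx 0.251
```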


The method for suppressing the late reverberation of an audio signal according to the invention is fast and offers reduced complexity. Said method can therefore be used in real time. Moreover, this method does not introduce artifacts and is resistant to background noise. Furthermore, said method reduces background noise and is compatible with noise-reduction methods.


The method for suppressing the late reverberation of an audio signal according to the invention requires only one microphone to process the reverberation with precision.

Claims
  • 1. Method for suppressing a late reverberation of an audio signal, comprising the steps of: capturing an input signal formed by a superimposition of several delayed and attenuated versions of the audio signal; applying a time-frequency transformation to the input signal to obtain a complex time-frequency transform of the input signal; generating a frequency subsampled modulus from a modulus of the complex time-frequency transform of the input signal; generating a plurality of subsampled observation vectors from said frequency subsampled modulus; constructing a plurality of analysis dictionaries from the plurality of subsampled observation vectors; calculating a plurality of prediction vectors from the plurality of subsampled observation vectors and the plurality of analysis dictionaries by minimizing, for each prediction vector (α), the expression ∥X̃ν−Dαα∥2, which is an Euclidean norm of a difference between the subsampled observation vector (X̃ν) associated with said each prediction vector (α) and the analysis dictionary (Dα) associated with said each prediction vector (α) multiplied by said each prediction vector (α), with a constraint ∥α∥1≤λ, according to which the norm 1 of said each prediction vector (α) is less than or equal to a maximum intensity parameter of the late reverberation (λ); generating a plurality of observation vectors from the modulus of the complex time-frequency transform of the input signal; constructing a plurality of synthesis dictionaries from a concatenation of the plurality of observation vectors; estimating a late reverberation spectrum from a multiplication of the plurality of synthesis dictionaries with the plurality of prediction vectors; and filtering the plurality of observation vectors to eliminate the late reverberation spectrum and to obtain a dereverberated signal modulus.
  • 2. The method according to claim 1, wherein a value of the maximum intensity parameter of the late reverberation (λ) is between 0 and 1.
  • 3. The method according to claim 1, further comprising the step of generating a dereverberated complex signal from the dereverberated signal modulus and a phase of the complex time-frequency transform of the input signal.
  • 4. The method according to claim 3, further comprising the step of applying a frequency-time transformation to the dereverberated complex signal to obtain a dereverberated time signal.
  • 5. The method according to claim 1, further comprising the step of constructing a dereverberation filter (G) according to the model
  • 6. A device for suppressing a late reverberation of an audio signal, comprising: a microphone to capture an input signal formed by a superimposition of several delayed and attenuated versions of the audio signal; a time-frequency unit to apply a time-frequency transformation to the input signal to obtain a complex time-frequency transform of the input signal; a subband grouping unit generates a frequency subsampled modulus from the modulus of the complex time-frequency transform of the input signal; an observation construction unit generates a plurality of subsampled observation vectors from said frequency subsampled modulus; an analysis dictionary construction unit constructs a plurality of analysis dictionaries from the plurality of subsampled observation vectors; a prediction vector calculation unit calculates a plurality of prediction vectors from the plurality of subsampled observation vectors and the plurality of analysis dictionaries by minimizing, for each prediction vector, the expression ∥X̃ν−Dαα∥2, which is an Euclidean norm of a difference between the subsampled observation vector associated with said each prediction vector (α) and the analysis dictionary associated with said each prediction vector (α) multiplied by said each prediction vector (α), with a constraint ∥α∥1≤λ, according to which the norm 1 of said each prediction vector (α) is less than or equal to a maximum intensity parameter of the late reverberation (λ); a reverberation evaluation unit generates a plurality of observation vectors from the modulus of the complex time-frequency transform of the input signal; a synthesis dictionary constructing unit constructs a plurality of synthesis dictionaries from the concatenation of the plurality of observation vectors; a late reverberation estimation unit estimates a late reverberation spectrum from the multiplication of the plurality of synthesis dictionaries with the plurality of prediction vectors; and a filtering unit to filter the plurality of observation vectors so as to eliminate the late reverberation spectrum and obtain a dereverberated signal modulus.
Priority Claims (1)
Number Date Country Kind
13 57226 Jul 2013 FR national
PCT Information
Filing Document Filing Date Country Kind
PCT/EP2014/065594 7/21/2014 WO 00
Publishing Document Publishing Date Country Kind
WO2015/011078 1/29/2015 WO A
US Referenced Citations (2)
Number Name Date Kind
8116471 Derkx Feb 2012 B2
9454956 Kondo Sep 2016 B2
Non-Patent Literature Citations (12)
Entry
Nakatani et al., “Speech Dereverberation Based on Variance-Normalized Delayed Linear Prediction,” IEEE Transactions on Audio, Speech and Language Processing, Sep. 1, 2010, pp. 1717-1731, vol. 18, No. 7, IEEE, New York, USA.
Kinoshita et al., “Suppression of Late Reverberation Effect on Speech Signal Using Long-Term Multiple-step Linear Prediction,” IEEE Transactions on Audio, Speech and Language Processing, May 1, 2009, pp. 534-545, vol. 17, No. 4, IEEE, New York, USA.
Habets et al., “Late Reverberant Spectral Variance Estimation Based on a Statistical Model,” IEEE Signal Processing Letters, Sep. 1, 2009, pp. 770-773, vol. 16, No. 9, IEEE Service Center, Piscataway, NJ, USA.
Li et al., “Feature Denoising Using Joint Sparse Representation for In-Car Speech Recognition,” IEEE Signal Processing Letters, Jul. 1, 2013, pp. 681-684, vol. 20, No. 7, IEEE, Piscataway, USA.
Ephraim et al., “Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator,” IEEE Transactions on Acoustics, Speech and Signal Processing, Dec. 1, 1984, pp. 1109-1121, vol. ASSP-32, No. 6, IEEE, New York, USA.
Gillespie et al., “Speech dereverberation via maximum-kurtosis subband adaptive filtering,” Proc. International Conference on Acoustics, Speech and Signal Processing, 2001, pp. 3701-3704, vol. 6, IEEE.
Wu et al., “A two-stage algorithm for one-microphone reverberant speech enhancement,” IEEE Transactions on Audio, Speech and Language Processing, May 2006, pp. 774-784, vol. 14, No. 3, IEEE.
Mosayyebpour et al., “Single Channel Inverse Filtering of Room Impulse Response by Maximizing Skewness of LP Residual,” International Conference on Signal Acquisition and Processing, Feb. 9-10, 2010, pp. 130-134, IEEE.
Bees et al., “Reverberant speech enhancement using cepstral processing,” ICASSP '91 Proceedings of the Acoustics, Speech and Signal Processing, Apr. 14-17, 1991, pp. 977-980, vol. 2, IEEE.
Habets, “Single-and Multi-Microphone Speech Dereverberation using Spectral Enhancement,” PhD thesis, Technische Universiteit Eindhoven, 2007.
Yoshioka, “Speech Enhancement in Reverberant Environments,” PhD thesis, Kyoto University, Mar. 2010.
Kameoka et al., “Robust speech dereverberation based on nonnegativity and sparse nature of speech spectrograms,” Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '09, Apr. 19-24, 2009, pp. 45-48, IEEE.
Related Publications (1)
Number Date Country
20160210976 A1 Jul 2016 US