This description relates to a method and a device for echo cancellation.
In the context of simultaneous sound capture and playback, it is appropriate to use processing involving acoustic echo cancellation (or “AEC” hereinafter).
As shown in
This echo signal is associated with the direct path between the microphone and the playback system, as well as with any reflections of the signal x(t) in the propagation environment.
The overall acoustic path can be modeled by a finite impulse response filter w whose length depends on the characteristics of the propagation environment, such that:
z(t)=x(t)*w(t)
The operation consisting of removing from the microphone signal y(t) the contribution of the echo signal z(t) is called “acoustic echo cancellation” (or AEC). Processing to perform this operation can consist of deriving an echo signal ẑ(t) from the estimation ŵ(t) of the acoustic path: this operation is called “adaptive filtering”. The estimated useful signal ŝ(t) is derived by subtracting the estimated echo signal ẑ(t) from the microphone signal y(t), as follows:
ŝ(t) = y(t) − ẑ(t) = y(t) − x(t)*ŵ(t)
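As a purely illustrative numerical sketch of this signal model (the toy filter, white-noise signals, and variable names are assumptions, not taken from this description), subtracting a perfectly estimated echo recovers the useful signal exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy signal model: the loudspeaker signal x(t) convolved with a short
# finite impulse response w(t) yields the echo z(t).
x = rng.standard_normal(1000)          # loudspeaker (reference) signal x(t)
w = np.array([0.6, 0.3, 0.1])          # toy acoustic path w(t), 3 taps
z = np.convolve(x, w)[:len(x)]         # echo z(t) = x(t) * w(t)
s = 0.1 * rng.standard_normal(1000)    # local (useful) signal s(t)
y = z + s                              # microphone signal y(t)

# With a perfect estimate w_hat = w, subtracting the estimated echo
# recovers the useful signal: s_hat = y - x * w_hat = s.
w_hat = w.copy()
z_hat = np.convolve(x, w_hat)[:len(x)]
s_hat = y - z_hat
```

With an imperfect ŵ, the residual ŝ(t) instead contains a residual echo term x(t)*(w − ŵ)(t), which is what the adaptive filtering discussed below seeks to minimize.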
Adaptive filtering is generally carried out on the basis of the correlation between the microphone signal and the loudspeaker signal, exploiting the statistical independence between the signal x(t) emitted by the loudspeaker and the signal of interest s(t). In practice, it is appropriate to carry out this processing over a short time horizon in order to track changes in the acoustic channel represented by the filter w (referred to hereinafter, for convenience, as the acoustic path w). These changes typically occur when the person speaking moves through the room which forms said environment.
A consequence of this short-term processing is that the statistical independence between the signal of interest s(t) and the loudspeaker signal x(t) may no longer hold in certain situations, except in the trivial case where the signal s(t) is zero. Indeed, this independence no longer holds when it is evaluated over short time windows of several tens to hundreds of milliseconds, typically corresponding to a conventional frame length of a digital signal.
The result, in these situations referred to as “double talk”, i.e. when the useful signal s(t) is non-zero, is a bias in the estimation of the acoustic channel, degrading the echo cancellation. Less complex solutions based on stochastic-gradient processing, such as the “Normalized Least Mean Square” (NLMS) technique and its derivatives, are very sensitive to the presence of a local signal s(t). During these double-talk situations, if the filter continues to adapt, it may even diverge and ultimately cause echo amplification, the opposite of the desired effect. Thus, to be effective, the adaptive filtering solution must be robust to double-talk situations while remaining able to quickly track changes in the acoustic path.
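To illustrate the stochastic-gradient family mentioned above, the following is a minimal time-domain NLMS sketch (the function name and all sizes are illustrative assumptions, not the method claimed here). In the echo-only case shown, the filter converges to the true path, whereas a strong local signal s(t) would perturb each update:

```python
import numpy as np

# Minimal time-domain NLMS sketch:
# w <- w + mu * e(t) * x_vec / (||x_vec||^2 + eps), where the a priori
# error e(t) also serves as the estimate of the useful signal s(t).
def nlms(x, y, num_taps, mu=0.5, eps=1e-8):
    w = np.zeros(num_taps)
    s_hat = np.zeros(len(y))
    for t in range(num_taps - 1, len(y)):
        x_vec = x[t - num_taps + 1 : t + 1][::-1]   # [x(t), x(t-1), ...]
        e = y[t] - w @ x_vec                         # a priori error
        w = w + mu * e * x_vec / (x_vec @ x_vec + eps)
        s_hat[t] = e
    return w, s_hat

rng = np.random.default_rng(1)
x = rng.standard_normal(5000)
w_true = np.array([0.5, -0.3, 0.2, 0.1])
z = np.convolve(x, w_true)[:len(x)]
# Echo-only case (s = 0): the filter converges towards the true path.
w_est, _ = nlms(x, z, num_taps=4)
```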
Ideally, this filtering should process only the data actually available, namely the reference signal x(t) and the microphone signal y(t).
To overcome double-talk situations, certain known adaptive filtering solutions implement double-talk detection (DTD) systems. A system of this type is described for example in reference [@jung2005new], for which the publication details are given in the appendix at the end of this description. Such systems disable adaptation during periods identified as double-talk. In practice, however, DTDs suffer from detection delays, which can let echo through. Moreover, since the decisions are binary, adaptation of the filter is frozen for the entire double-talk period, which is problematic in practice if the filter has not yet finished converging, resulting in a perceptible residual echo.
Other methods have instead proposed to derive an adaptive step size in the estimation of the acoustic path. In the known references, this step size is continuous. Such implementations make it possible, unlike binary decision approaches such as DTDs, to continue to track the acoustic path, including during periods of double-talk. These types of adaptation are usually derived by frequency bands, as follows:
Ŵ(f,k+1)=Ŵ(f,k)+ΔW(f,k)
Working in frequencies makes it possible on the one hand to make the convergence more uniform over the entire frequency range considered. On the other hand, the spectral sparseness of the signals makes it possible to continue to estimate the acoustic channel in one frequency band while freezing the estimate in another. Certain methods, referred to as “Variable Step-Size” or VSS, propose modulating the adaptation ΔW according to different criteria.
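A hypothetical per-band variable step size can be sketched as follows (the modulation rule mapping a local-signal indicator to a step size is purely illustrative, not the one claimed in this description):

```python
import numpy as np

# Hypothetical per-band variable step size (VSS) sketch: each frequency
# band f gets its own step mu(f), reduced where local-signal presence is
# estimated to be high, so adaptation continues in some bands while being
# frozen in others.
rng = np.random.default_rng(6)
F = 8
W = np.zeros(F, dtype=complex)                              # current filter estimate
dW = rng.standard_normal(F) + 1j * rng.standard_normal(F)   # raw per-band update
esr = np.array([10.0, 5.0, 2.0, 1.0, 0.5, 0.2, 0.1, 0.0])  # per-band echo-to-signal ratio

mu = esr / (1.0 + esr)        # low ESR (strong local signal) -> small step
W_next = W + mu * dW          # W(f, k+1) = W(f, k) + mu(f) * dW(f, k)
```

The last band (ESR of zero, i.e. local signal only) receives a zero step, so its estimate is frozen while the other bands continue to adapt.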
It has been attempted to smooth the stochastic adaptation by freezing the iterations deemed to be too random, in particular to avoid random updates due to the presence of double talk.
It has also been attempted to directly measure the local speech presence rate, in the form of a ratio between the energy of the local signal σ̂y²(t) and that of the echo signal σ̂z²(t), the adaptation becoming frozen when this ratio is too high. Since the variance estimates σ̂y²(t) and σ̂z²(t) are particularly noisy, their direct use in modulating the adaptive step size renders these approaches ineffective in practice: either they freeze the adaptation too much, slowing down convergence, or they insufficiently limit the mismatch during the double-talk period.
Other methods are based on an optimal adaptive step size which guarantees a minimal variance of the estimated filter, and hence a minimal residual echo. This criterion is called “BLUE”, for “Best Linear Unbiased Estimate” [@trump1998frequency]. Updating the acoustic path ΔW(k) in the adaptive filtering process according to this criterion limits the residual echo linked to variations of the adaptive filter around its (minimum-variance) solution. However, in practice, the BLUE expression depends on second-order statistics of the signal s(t) (more precisely, on its statistical autocorrelation matrix Γs), which are unknown and generally variable over time, as is typically the case for non-stationary signals such as speech [@van2007double]. The solution presented in [@trump1998frequency] is therefore not fully satisfactory.
The development improves the situation.
A method is proposed for processing a signal y(t) coming from at least one microphone of an equipment item, the equipment item further comprising at least one loudspeaker intended to be supplied with a signal x(t),
Such an implementation offers, as detailed below, an acoustic echo cancellation solution which is robust to double-talk situations in particular.
In one embodiment, the chosen criterion, mentioned above, is of the “BLUE” type, for “Best Linear Unbiased Estimate”.
Said statistical expectation can be written E{ssᴴ} in the case of a matrix representation of the useful signal s (sᴴ designating the conjugate transpose of s). For example, in the time domain and in the case of a simply scalar representation, it can depend on a time parameter τ and be written E{s(t)s(t−τ)}.
In the frequency domain, said statistical expectation can be represented by a parameter corresponding to a power spectral density. Thus, in an implementation where the adaptive filter is produced for example in a domain of frequency sub-bands f, its expression can be a function of a parameter corresponding to the power spectral density Γs(f) of the useful signal s(f). In particular, said normalization Λ(f), expressed in the frequency domain, is itself a function of a parameter corresponding to a power spectral density Γs of the useful signal s.
In such an embodiment, said normalization Λ(k) is defined more precisely as a function of the power spectral density Γs(k) of the useful signal s, and also of the power spectral density Γx(k) of the signal x supplied to the loudspeaker.
In this embodiment, in a matrix representation where f denotes a row index (and also here a frequency sub-band index) and b a column index, the normalization Λ(k)(f, b) can be given by:
with μ∈[0,2[, and where γ is a chosen positive coefficient (this choice can be empirical in the context of a practical implementation).
The power spectral density Γs(k) of the useful signal s can itself be estimated as a function of a power spectral density Γy(k) of the signal y captured by the microphone, and of a representation PESR(k) of an echo-to-signal energy ratio.
In this embodiment, in a matrix representation where f designates a row index and b a column index, the power spectral density Γs(k) of the useful signal s is given by:
The representation PESR(k) of the echo-to-signal energy ratio can itself be estimated as a function at least of a cross-power spectral density ΓyX(k) between the signal y coming from the microphone and the signal X intended to supply the loudspeaker.
For example, in a matrix representation where f denotes a row index and b a column index, the representation PESR(k) of the echo-to-signal energy ratio can be given by:
In this expression, the cross-power spectral density ΓyX(k) can be given by:
The power spectral densities of the signal X intended to supply the loudspeaker and of the signal y coming from the microphone can be given, in a matrix representation where X is a matrix and y a vector, by:
Γx(k) = αΓx(k−1) + (1−α)|X|²
Γy(k) = ηΓy(k−1) + (1−η)|y|²,
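The recursive estimates above amount to exponential smoothing of per-bin periodograms. The exact expression of PESR(k) is not reproduced in this text; purely as a hedged illustration, the sketch below smooths Γx, Γy and ΓyX in the same recursive manner and derives an echo-to-signal ratio from the magnitude-squared coherence, a standard identity which is not necessarily the expression used in this description:

```python
import numpy as np

rng = np.random.default_rng(2)
M, alpha = 256, 0.95
g_x = np.full(M, 1e-6)                    # Gamma_x, recursively smoothed
g_y = np.full(M, 1e-6)                    # Gamma_y
g_yx = np.zeros(M, dtype=complex)         # Gamma_yX (cross-PSD)
h = np.array([0.8, 0.4])                  # toy echo path
for _ in range(500):
    x = rng.standard_normal(M)                   # loudspeaker frame
    z = np.convolve(x, h)[:M]                    # echo (frame-boundary effects ignored)
    s = 0.5 * rng.standard_normal(M)             # local signal (double talk)
    y = z + s                                    # microphone frame
    X, Y = np.fft.fft(x), np.fft.fft(y)
    g_x = alpha * g_x + (1 - alpha) * np.abs(X) ** 2     # Gamma_x(k)
    g_y = alpha * g_y + (1 - alpha) * np.abs(Y) ** 2     # Gamma_y(k)
    g_yx = alpha * g_yx + (1 - alpha) * Y * np.conj(X)   # Gamma_yX(k)

# Coherence-based echo-to-signal ratio proxy (illustrative only): the
# coherent part of y with respect to x is the echo, the rest is s.
coh = np.abs(g_yx) ** 2 / (g_x * g_y)        # magnitude-squared coherence in [0, 1]
esr = coh / np.maximum(1.0 - coh, 1e-6)      # per-bin echo-to-signal ratio estimate
```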
In an embodiment offering advantages for the estimation of the adaptive filter, the latter can be represented by successive partitions. Thus, in such an embodiment, the filter w can be of the finite impulse response type and be N samples long. In particular, it is subdivided into B = N/L partitions wb of L samples each.
In such an embodiment, one can estimate a matrix W ∈ ℂ^(M×B) corresponding to an expression in a transformed domain (for example in the aforementioned domain of frequency sub-bands) of the partitions wb, such that W = [w1, …, wB], wb ∈ ℂ^M, and representing the filter in the transformed domain, with wb = Fwb, F ∈ ℂ^(M×L), M ≥ L, where F is a domain transformation matrix.
One will note that in this embodiment, said column index “b” here can correspond to a partition index wb. Nevertheless, the matrix representation presented above with row indices f and column indices b can be applied to situations other than those involving a partition of the filter. As an immediate illustrative example, the formulas given above remain valid in a degraded embodiment where b=1 for example, which therefore does not involve a partition.
Moreover, for each temporal frame, denoted xb ∈ ℝ^M, of M samples of the signal intended to supply the loudspeaker x(t), a matrix X ∈ ℂ^(M×B) is formed representing the signal intended to supply the loudspeaker and corresponding to the transforms of the last B frames xb, such that X = [x1, …, xB], xb ∈ ℂ^M, with xb = Fxb. For a temporal frame y ∈ ℝ^L of the signal coming from the microphone y(t), a vector y ∈ ℂ^M is finally formed.
This vector y can be constructed such that:
In this format, the update to the acoustic path ΔW(k) for a current frame k can then be given by Δwb(k) = GΛb(k) ∘ xb(k)* ∘ Fe(k), where:
G = FF^H (a “constrained” update) or G = I_M (a “non-constrained” update), as detailed below,
Λ(k) = [Λ1(k) … ΛB(k)] ∈ ℂ^(M×B) is a matrix representing the aforementioned normalization, and
The a priori error can be given by:
In an embodiment where the adaptive filter is updated from a current frame k to a following frame k+1 as a function of an update to the acoustic path ΔW(k), this update can be estimated for the current frame k, and the update to the acoustic path is given by a relation of the type:
W(k+1) = W(k) + ΔW(k)
This description also relates to a computer program comprising instructions for implementing the above method when this program is executed by a processor. In another aspect, a non-transitory, computer-readable storage medium is provided on which such a program is stored.
It also relates to a device for processing a signal y(t) coming from at least one microphone, comprising a processor configured to execute a method as defined above.
Other features, details, and advantages will become apparent upon reading the detailed description below, and upon analyzing the appended drawings, in which:
The drawings and the description below essentially contain elements of a definite character. They therefore not only serve to provide a better understanding of this disclosure, but also contribute to its definition, where applicable.
This description hereinafter proposes an acoustic echo cancellation solution that is robust to double-talk situations. It is based on processing that involves adaptive filtering, for example NLMS processing, typically applied successively to each frame of a succession of frames. Frame is understood here to mean a given number of successive samples of the signal supplied to the loudspeaker x(t), this signal of course being presumed to be digital.
In one embodiment, the filter used for the adaptive filtering is partitioned (the length of each partition may or may not correspond to the length of a frame), preferably in the frequency domain (technique referred to here as “Partitioned-Block Frequency Domain NLMS” or “PBFD-NLMS”). A technique of this type is presented for example in the reference [@borrallo1992implementation].
More particularly here, the solution is based on a derivation of the BLUE optimal step size, but estimates the necessary statistics directly from the reference and microphone signals, without adding auxiliary information. This makes it possible to calculate ΔW(k) without an error prediction model or an a priori prediction model of the acoustic path, as may be the case in the prior art references, in particular [@gil2014frequency].
Such an embodiment guarantees, without auxiliary information other than that inferred directly by the processing itself, at once a convergence that is close to optimal in terms of convergence speed, zero bias at convergence, and an absence of divergence in double-talk situations.
Adaptive filtering, when expressed in the frequency domain, makes it possible in particular to control and normalize the updating of the acoustic path independently of the frequency band involved. Thus, in addition to reduced complexity, the solution benefits from a more uniform convergence over the entire frequency range considered.
Its operation via partitioning, associated with filtering in the frequency domain, also makes it possible to estimate in each iteration of the processing a time-frequency representation W(k) of the acoustic path w. This makes it possible to implement different adaptation strategies according to the partitions. It also makes it possible to guarantee better convergence in the case of very long filters.
Such processing allows deriving a step size which optimizes both the behavior in a double-talk situation and the acoustic channel tracking.
An embodiment of an adaptive filtering for echo cancellation is described below by producing the filter by partitions according to the “partitioned-block” technique.
Considering a target acoustic path modeled by a finite impulse response filter w(t) with a length of N samples and split into B = N/L partitions wb ∈ ℝ^L, we can estimate the matrix W ∈ ℂ^(M×B) corresponding to the frequency transforms of the partitions wb such that:
W = [w1, …, wB], wb ∈ ℂ^M, with wb = Fwb, F ∈ ℂ^(M×L), M ≥ L.
F is the domain transformation matrix, for example here the redundant discrete Fourier transform (DFT) matrix such that each element is characterized by:
In practice, redundancy is achieved by padding with zeros in the time domain.
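The effect of this zero-padding can be checked numerically (sizes here are illustrative): with M ≥ 2L − 1, per-bin multiplication of the padded transforms implements a linear convolution, whereas without padding it yields a circular one:

```python
import numpy as np

# Zero-padding in the time domain makes the M-point DFT "redundant": with
# M >= 2L - 1, per-bin multiplication of the padded transforms implements
# a linear convolution, free of circular-convolution artifacts.
rng = np.random.default_rng(3)
L, M = 8, 16
x = rng.standard_normal(L)
h = rng.standard_normal(L)

X = np.fft.fft(x, M)                         # zero-padded (redundant) transforms
H = np.fft.fft(h, M)
lin = np.fft.ifft(X * H).real[:2 * L - 1]    # linear convolution recovered

# Without padding (M = L), the same product yields a circular convolution.
circ = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real
```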
In the same manner, we consider xb ∈ ℝ^M a temporal frame containing M samples of the reference signal x(t), and we form the matrix X ∈ ℂ^(M×B) corresponding to the frequency transforms of the last B frames xb such that:
X = [x1, …, xB], xb ∈ ℂ^M, with xb = Fxb.
By further considering a temporal frame y ∈ ℝ^L of the microphone signal y(t), we can denote the vector y ∈ ℂ^M such that:
To avoid the problems associated with convolution operations carried out in the frequency domain, the processing is based on an overlap-save (OLS) scheme. Here, the superscript ·(k) denotes the k-th iteration of the processing. After initialization of W(0), X(0), y(0) and the other quantities, the processing can continue by calculating the a priori error e(k) ∈ ℂ^M:
As indicated above, the redundancy of the DFT is achieved by means of zero-padding which is found in the expression of the a priori error and which advantageously makes it possible to avoid an artifact due to a circular convolution.
The method then continues with calculating the update to the acoustic path ΔW(k)=[Δw1(k) . . . ΔwB(k)]∈M×B in step S9, as follows:
Δwb(k) = GΛb(k) ∘ xb(k)* ∘ Fe(k), with Λ(k) = [Λ1(k) … ΛB(k)] ∈ ℂ^(M×B), G ∈ ℂ^(M×M)
In an embodiment where the update is optimal (thanks to zero-padding), we set G = FF^H (called a “constrained” update).
In an embodiment where the update is sub-optimal, we can instead set G = I_M (an update referred to as “non-constrained”). Such an implementation has the advantage of consuming fewer resources.
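Under the assumption that F is the zero-padded DFT described above, the two variants can be sketched as follows; up to a scale factor absorbed into the step size, applying G = FF^H amounts to projecting the update onto filters whose time-domain support is confined to the first L samples, while G = I_M skips this projection:

```python
import numpy as np

# Sketch of the constrained vs non-constrained update (sizes illustrative).
rng = np.random.default_rng(4)
L, M = 4, 8
dw_raw = np.fft.fft(rng.standard_normal(M))   # some raw frequency-domain update

# "Non-constrained" update (G = I_M): use the raw update as-is.
dw_unconstrained = dw_raw

# "Constrained" update (G = F F^H, up to scale): back to the time domain,
# zero the zero-padding region [L, M), forward transform again.
dw_time = np.fft.ifft(dw_raw).real
dw_time[L:] = 0.0
dw_constrained = np.fft.fft(dw_time)
```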
Step S10 then aims to calculate W(k+1) in order to update the acoustic path:
W(k+1) = W(k) + ΔW(k)
And in step S11, the useful signal ŝ(t)=y(t)−x(t)*ŵ(t) is obtained after convolution of x(k) by W(k) and brought back to the time domain.
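Steps S9 to S11 can be gathered into a minimal, hypothetical overlap-save sketch (single partition B = 1, M = 2L, plain per-bin normalization; the function name, sizes and step size are illustrative assumptions, and the BLUE normalization described below is not included):

```python
import numpy as np

def fdaf_nlms(x, y, L, mu=0.5, eps=1e-6):
    """Single-partition frequency-domain NLMS with overlap-save (sketch)."""
    M = 2 * L
    W = np.zeros(M, dtype=complex)
    num_blocks = len(x) // L - 1
    s_hat = np.zeros(num_blocks * L)
    for k in range(1, num_blocks + 1):
        x_buf = x[(k - 1) * L : (k + 1) * L]        # last M = 2L reference samples
        X = np.fft.fft(x_buf)
        y_blk = y[k * L : (k + 1) * L]              # current microphone block
        y_hat = np.fft.ifft(W * X).real[L:]         # OLS: keep the last L samples
        e = y_blk - y_hat                           # a priori error (estimate of s)
        E = np.fft.fft(np.concatenate([np.zeros(L), e]))
        W = W + mu * np.conj(X) * E / (np.abs(X) ** 2 + eps)
        w_t = np.fft.ifft(W).real                   # constrained update (G = F F^H)
        w_t[L:] = 0.0
        W = np.fft.fft(w_t)
        s_hat[(k - 1) * L : k * L] = e
    return W, s_hat

rng = np.random.default_rng(5)
L = 32
x = rng.standard_normal(120 * L)
w_true = np.array([0.5, -0.3, 0.2])
z = np.convolve(x, w_true)[:len(x)]          # pure echo, no local signal
W, s_hat = fdaf_nlms(x, z, L)
```

In the pure-echo case shown (s = 0), the a priori error, which is also the estimate ŝ, decays towards zero as W converges to the transform of the true path.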
For the echo cancellation solution to be robust in the situations described above, a spectral normalization term Λ(k) satisfying the BLUE criterion is chosen. This can be achieved provided the power spectral densities (PSD) of the loudspeaker signal x(t) and of the local signal s(t) are known.
In reference [@van2007double], BLUE is obtained at the cost of strong assumptions about the local signal, which must follow an autoregressive model (the signal being assumed to be speech), and through the use of an error prediction method. On the other hand, [@trump1998frequency] achieves BLUE by estimating the PSD of the local signal a posteriori, solely by means of the error signal e(k), at the cost of less stability and while also imposing strong constraints on the local signal (stationary colored noise).
The proposed solution, explained below, overcomes the above constraints by means of a robust estimation of the PSDs over time.
Assuming:
Finally, we denote as PESR ∈ ℝ^(M×B) (ESR standing for “Echo-to-Signal Ratio”) the matrix expressing, for each frequency band and each partition, the ratio of the energies of the echo and of the local signal. After initialization of Γx(0), Γy(0), ΓyX(0), Γs(0), PESR(0), and of the other quantities, the processing performs the following operations:
Estimation of power spectral densities (PSD):
Estimation of instantaneous Echo-to-Signal ratio (ESR):
Estimation of normalization Λ(k):
Then, by applying the method within the meaning of this description, the normalization parameter, in order to satisfy the BLUE criterion, is expressed by:
with μ∈[0,2[.
This last expression of the normalization parameter involves the term Γs(k), which is a function of the estimation of the echo-to-signal ratio; the latter can ultimately be the only parameter (along with, of course, the signal x(t)) to be estimated within the meaning of this description, for each frame k.
It should be noted that otherwise, by applying instead the teachings of the state of the art as described for example in [@borrallo1992implementation] to implement the classic technique of PBFD-NLMS, the normalization parameter of the filter would then be expressed as follows:
with μ∈[0,2[, without involving any measurement for an estimation of the echo-to-signal ratio.
Now by taking the steps of
Achieving the BLUE criterion with adaptive filtering performed in the time domain generally amounts to finding a solution ŵ ∈ ℝ^(N×1) such that:
ŵ = argmin_w((y − w^T x) Rs^(−1) (y − w^T x)^T),
where x is the matrix of the loudspeaker signal, and Rs ∈ ℝ^(M×M) is the autocorrelation matrix of the signal s.
Current echo cancellation methods are based on adaptive filtering in the frequency domain. The approach presented in [@trump1998frequency] proposes achieving a regularized version of the BLUE criterion by looking for an acoustic channel w solution in the frequency domain such that:
ŵ = argmin_w((y − w∘x*)^H (λΓs + Γx)^(−1) (y − w∘x*)),
However, the local signal s is not known. The estimator satisfying the BLUE criterion is then, in practice, very difficult to obtain without other information or a model on s.
A solution as described in [@borrallo1992implementation] can only produce an unbiased estimate (thus satisfying the BLUE criterion) of the acoustic channel Ŵ if the local signal s and the reference signal x are decorrelated, i.e. E{Xs^T} = 0 (with E{⋅} representing the expectation operator). In practice, this condition can only be met if the signal s is white noise. In all other situations, in order to reach an unbiased estimate Ŵ, it is necessary to add to the denominator of the normalization factor Λ(k) a fraction of the variance of the local signal s, i.e. E{ss^H}, or its power spectral density (PSD) in the frequency domain (which takes the form of the parameter Γs ∈ ℝ^(M×B) in the equations presented above).
As the expectation E{ss^H} is unknown, the solution proposed here overcomes the above constraints by means of a robust estimation over time of the power spectral densities, and in particular of the one appearing in the denominator cited above: Γs, the power spectral density of the local signal s.
The processing as described above can be used in particular in situations where it is necessary to capture sound and play it back simultaneously. The most common use cases are hands-free telephony (the person speaking at a distance hears his or her own delayed voice—the echo—mixed in with the voice of the other party), interactions with voice assistants (responses from the dialogue system and/or the music played on the voice assistant being mixed in with the commands issued by the user and interfering with voice recognition), intercoms, video-conferencing systems, and others.
A device for implementing the above method is represented in
Of course, this is an exemplary embodiment, where typically here the useful signal s(t) can be transmitted via the output interface OUT2 to a remote party for example. In this case, interface OUT2 can be connected to a communication antenna or to a router of a telecommunications network NET for example. The same is true for the input interface IN2 receiving “from the outside” a signal to be played over the loudspeaker.
In a device such as a voice assistant for example, it is typically possible to also be in a double-talk situation when the user is speaking voice commands at the same time as the assistant is responding for example to previous commands. In this case, at least part of the responses of the voice assistant can be issued locally from the content of the memory MEM for example without having to make use of a remote server and a telecommunications network. In addition, the useful signal s(t) can be locally interpreted only by the processor PROC in order to respond to voice commands from the user. Interfaces IN2 and OUT2 thus may not be necessary.
A typical use case for the processing referred to as “double-talk” processing with a voice assistant consists, for example, of listening to music through the loudspeaker of the voice assistant, while the user is speaking a command to wake up the assistant (WakeUpWord). In this case, it is advisable to eliminate the playing music x(t)*w(t) from the sound signal y(t) captured in an environment (with its reverberations) in which the assistant has just been placed, in order to be able to detect correctly the command signal actually spoken s(t).
Of course, the development is not limited to the embodiment presented above, and may extend to other variants.
For example, a practical embodiment was described above in which the signals are processed in the domain of the frequency sub-bands. Nevertheless, the development can alternatively be implemented in the time domain by exploiting a parameter such as the expectation E{ssT} equivalent to the power spectral density Γs of the local signal s in the frequency domain.
Consequently, the aforementioned estimates of power spectral densities correspond to one possible but non-essential implementation.
The same is true for the partitioning of the adaptive filter, which can also operate in the sub-band domain, but this is not necessary either.
Furthermore, in
Number | Date | Country | Kind |
---|---|---|---|
FR2010570 | Oct 2020 | FR | national |
This application is filed under 35 U.S.C. § 371 as the U.S. National Phase of Application No. PCT/FR2021/051659 entitled “METHOD AND DEVICE FOR VARIABLE PITCH ECHO CANCELLATION” and filed Sep. 27, 2021, and which claims priority to FR 2010570 filed Oct. 15, 2020, each of which is incorporated by reference in its entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/FR2021/051659 | 9/27/2021 | WO |