In acoustic echo cancellation (AEC) [1], keeping the computational complexity acceptable while maintaining, or only marginally degrading, the cancellation performance is challenging.
There is a need for smart devices with immersive audio playback capabilities. Often, a virtualizer and a larger number of loudspeakers are provided together with full-duplex communication functionality. With respect to acoustic echo control, this results in a performance degradation due to correlated loudspeaker signals (also known as the non-uniqueness problem), and in a higher computational complexity due to multichannel acoustic echo cancellation, as the number of AECs is in general equal to the number of loudspeakers times the number of microphones. Thus, the complexity can increase quadratically with the number of loudspeakers.
Complexity is a major issue as more and more devices with low-cost hardware become voice enabled. In the state-of-the-art, complexity reduction comes at the cost of some performance degradation. However, higher sampling rates and lower delays would be highly appreciated, in particular for communication applications.
In the state-of-the-art, adaptive filters are employed for system identification [2], and the entire acoustic echo path (AEP) is estimated [3].
The problem becomes significantly larger when multi-channel AEC (MC-AEC) is conducted [4]. Given such a setup, it is known that the computational complexity of the AEC module increases at least linearly, but commonly even quadratically, with the number of loudspeakers [5]. Therefore, the practical implementation of such an algorithm may exceed the available computational resources.
Reducing the adaptive filter's length not only reduces the computational complexity, but also enhances the convergence rate of the adaptive algorithm. This helps mitigate the convergence rate constraints of MC-AEC for highly correlated loudspeaker signals [6].
By applying a nonlinear transformation to the multi-channel reference signals, the correlation between them can be reduced [6]. This helps mitigate the non-uniqueness problem presented by MC-AEC for correlated loudspeaker signals [6], but it also reduces the quality of the output signal.
An embodiment may have an apparatus for conducting acoustic echo cancellation by generating one or more error signals, wherein the apparatus is to generate a first echo estimate comprising one or more first echo estimation signals by filtering one or more reference signals using a first filter configuration, wherein the one or more reference signals correspond to one or more loudspeaker signals or are derived from the one or more loudspeaker signals, wherein the apparatus is to generate a second echo estimate comprising one or more second echo estimation signals by filtering the one or more first echo estimation signals using a second filter configuration; or wherein the apparatus is to generate the second echo estimate comprising the one or more second echo estimation signals by generating an intermediate signal from the one or more first estimation signals, and by filtering the intermediate signal using the second filter configuration, wherein the apparatus is to generate the one or more error signals depending on one or more microphone signals and depending on the one or more second echo estimation signals, wherein the apparatus is to update the second filter configuration depending on the one or more error signals, and wherein the apparatus is to output the one or more error signals.
Another embodiment may have a method for conducting acoustic echo cancellation by generating one or more error signals, wherein the method comprises: generating a first echo estimate comprising one or more first echo estimation signals by filtering one or more reference signals using a first filter configuration, wherein the one or more reference signals correspond to one or more loudspeaker signals or are derived from the one or more loudspeaker signals, generating a second echo estimate comprising one or more second echo estimation signals by filtering the one or more first echo estimation signals using a second filter configuration; or generating the second echo estimate comprising the one or more second echo estimation signals by generating an intermediate signal from the one or more first estimation signals, and by filtering the intermediate signal using the second filter configuration, generating the one or more error signals depending on one or more microphone signals and depending on the one or more second echo estimation signals, updating the second filter configuration depending on the one or more error signals, and outputting the one or more error signals.
Another embodiment may have a non-transitory digital storage medium having a computer program stored thereon to perform the method for conducting acoustic echo cancellation by generating one or more error signals, wherein the method comprises: generating a first echo estimate comprising one or more first echo estimation signals by filtering one or more reference signals using a first filter configuration, wherein the one or more reference signals correspond to one or more loudspeaker signals or are derived from the one or more loudspeaker signals, generating a second echo estimate comprising one or more second echo estimation signals by filtering the one or more first echo estimation signals using a second filter configuration; or generating the second echo estimate comprising the one or more second echo estimation signals by generating an intermediate signal from the one or more first estimation signals, and by filtering the intermediate signal using the second filter configuration, generating the one or more error signals depending on one or more microphone signals and depending on the one or more second echo estimation signals, updating the second filter configuration depending on the one or more error signals, and outputting the one or more error signals, when said computer program is run by a computer.
An apparatus for conducting acoustic echo cancellation by generating one or more error signals is provided. The apparatus is to generate a first echo estimate comprising one or more first echo estimation signals by filtering one or more reference signals using a first filter configuration, wherein the one or more reference signals correspond to one or more loudspeaker signals or are derived from the one or more loudspeaker signals. Moreover, the apparatus is to generate a second echo estimate comprising one or more second echo estimation signals by filtering the one or more first echo estimation signals using a second filter configuration; or, the apparatus is to generate the second echo estimate comprising the one or more second echo estimation signals by generating an intermediate signal from the one or more first estimation signals, and by filtering the intermediate signal using the second filter configuration. Furthermore, the apparatus is to generate the one or more error signals depending on one or more microphone signals and depending on the one or more second echo estimation signals. Moreover, the apparatus is to update the second filter configuration depending on the one or more error signals. Furthermore, the apparatus is to output the one or more error signals.
A method for conducting acoustic echo cancellation by generating one or more error signals is provided. The method comprises: generating a first echo estimate comprising one or more first echo estimation signals by filtering one or more reference signals using a first filter configuration, wherein the one or more reference signals correspond to one or more loudspeaker signals or are derived from the one or more loudspeaker signals, generating a second echo estimate comprising one or more second echo estimation signals by filtering the one or more first echo estimation signals using a second filter configuration; or generating the second echo estimate comprising the one or more second echo estimation signals by generating an intermediate signal from the one or more first estimation signals, and by filtering the intermediate signal using the second filter configuration, generating the one or more error signals depending on one or more microphone signals and depending on the one or more second echo estimation signals, updating the second filter configuration depending on the one or more error signals, and outputting the one or more error signals.
Moreover, a computer program for implementing the above described method when being executed on a computer or signal processor is provided.
Some embodiments relate to voice quality enhancement (VQE). Some embodiments provide a filtered-reference multichannel interference canceller.
Embodiments provide acoustic echo cancellation (AEC) concepts that employ filtered reference signals to reduce the computational complexity of the AEC module while maintaining or only marginally degrading the cancellation performance. To do so, the state-of-the-art reference signals, e.g., the signals that are reproduced by the loudspeakers, are filtered with measured acoustic echo paths (AEPs), or a part thereof, e.g., only the loudspeaker and/or microphone responses, in the processing path. In doing so, it is possible to reduce the length of the adaptive filters employed for system identification, since it is no longer necessary to estimate the entire AEP.
Even though the proposed concepts already provide advantages for single-microphone single-loudspeaker loudspeaker-enclosure-microphone (LEM) environments, in some embodiments, the computational complexity of AEC for acoustic setups that comprise several loudspeakers, e.g., for multi-channel acoustic echo cancellation (MC-AEC), is reduced.
The use of filtered reference signals for multi-channel acoustic echo cancellation achieves, inter alia, a plurality of additional advantages.
E.g., reducing the adaptive filter's length not only reduces the computational complexity, but also enhances the convergence rate of the adaptive algorithm. This helps mitigate the convergence rate constraints of multi-channel acoustic echo cancellation for highly correlated loudspeaker signals [6].
By applying a nonlinear transformation to the multi-channel reference signals, the correlation between them can be reduced [6]. This helps mitigate the non-uniqueness problem presented by multi-channel acoustic echo cancellation for completely correlated loudspeaker signals [6].
In multiple-loudspeaker multiple-microphone acoustic setups, for loudspeaker-enclosure-microphone systems that are equipped with more than one microphone, the proposed approach can be deployed in two different manners if one acoustic echo cancellation module is placed for each microphone before applying a beamformer (BF) [7]:
In an embodiment, the acoustic echo paths are measured for each microphone, or, in a second embodiment, the acoustic echo paths are measured for a sub-set of microphones, denoted in the following as reference microphones.
For example, an adaptive algorithm estimates a system that copes with the errors in the measured acoustic echo paths for the reference microphones. Or, a relation in the frequency domain is estimated between a measured reference acoustic echo path and the acoustic echo path between a loudspeaker and each of the microphones other than the reference microphones.
Alternatively, a BF-first configuration could be employed [8]. If such a configuration is employed, the process of computing the filtered reference signal for the single-channel acoustic echo cancellation module would be as follows.
Filtered-reference signals for each microphone are computed. The spatial filter is applied to the filtered-reference signals to obtain an equivalent filtered-reference at the output of the beamformer.
One of the advantages offered by this concept with respect to state-of-the-art BF-first configurations (for beamformer-first configurations, see for example [9, 10, 11, 12, 13]) is the fact that the adaptive single-channel system identification algorithm would be agnostic to changes in the spatial filter weights due to modifications in the look direction, undesired signal estimation, etc.
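For illustration, a minimal sketch of how an equivalent filtered reference can be formed at the beamformer output in such a BF-first configuration is given below. It assumes frequency-domain spatial filter weights and per-microphone filtered references; the function and variable names (equivalent_filtered_reference, Xc_mics, w_bf) are illustrative and not taken from the description above.

```python
import numpy as np

def equivalent_filtered_reference(Xc_mics, w_bf):
    """Combine per-microphone filtered references into one equivalent
    filtered reference at the beamformer output (BF-first configuration).

    Xc_mics : complex ndarray, shape (num_mics, num_bins)
        Frequency-domain filtered-reference signals, one per microphone.
    w_bf : complex ndarray, shape (num_mics, num_bins)
        Spatial filter weights, also applied to the microphone signals.
    """
    # Applying the same spatial filter to the filtered references yields a
    # single reference for the single-channel AEC behind the beamformer.
    return np.sum(np.conj(w_bf) * Xc_mics, axis=0)
```

Because the same weights act on the microphone signals and on the filtered references, a change of the spatial filter affects both consistently, in line with the advantage noted above.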
Particular example embodiments may, e.g., be realized using one or more or all of the following features:
The acoustic echo path may, e.g., be modelled by at least two filters in series with different update rates.
The first/second filter may, e.g., be determined
The first filter may, e.g., be determined
The outputs of the first filters may, e.g., be employed as reference signals for a single or multi-channel (multi-loudspeaker) interference cancellation module.
The microphone signal(s) may, e.g., be delayed to allow a non-causal second filter. Here, the non-delayed loudspeaker signal is used as a reference/input for the first filter.
Alternatively, the loudspeaker signal is delayed.
The filtering operation may, e.g., be performed
For acoustic setups that comprise one or more microphones
Embodiments of the present invention will be detailed subsequently referring to the appended drawings, in which:
The apparatus 100 is to generate a first echo estimate comprising one or more first echo estimation signals by filtering one or more reference signals using a first filter configuration, wherein the one or more reference signals correspond to one or more loudspeaker signals or are derived from the one or more loudspeaker signals.
Moreover, the apparatus 100 is to generate a second echo estimate comprising one or more second echo estimation signals by filtering the one or more first echo estimation signals using a second filter configuration. Or, the apparatus 100 is to generate the second echo estimate comprising the one or more second echo estimation signals by generating an intermediate signal from the one or more first estimation signals, and by filtering the intermediate signal using the second filter configuration.
Furthermore, the apparatus 100 is to generate the one or more error signals depending on one or more microphone signals and depending on the one or more second echo estimation signals. Moreover, the apparatus 100 is to update the second filter configuration depending on the one or more error signals.
Furthermore, the apparatus 100 is to output the one or more error signals.
In an embodiment, the apparatus 100 may, e.g., be configured to update the second filter configuration depending on the one or more error signals to obtain an updated second filter configuration. The apparatus 100 may, e.g., be configured to update the one or more error signals depending on the updated second filter configuration to obtain one or more updated error signals. To output the one or more error signals, the apparatus 100 may, e.g., be configured to output the one or more updated error signals.
According to an embodiment, the apparatus 100 may, e.g., be configured to not update the first filter configuration at runtime. Or, the apparatus 100 may, e.g., be configured to also update the first filter configuration at runtime, wherein the apparatus 100 may, e.g., be configured to update the second filter configuration at runtime more often than the first filter configuration.
In an embodiment, the apparatus 100 may, e.g., be configured to introduce a delay into the one or more microphone signals. The apparatus 100 may, e.g., be configured to generate the one or more error signals depending on the one or more microphone signals after the delay is introduced by the apparatus 100 into the one or more microphone signals and depending on the one or more second echo estimation signals.
According to an embodiment, the apparatus 100 may, e.g., be configured to introduce a delay into the one or more loudspeaker signals. The one or more reference signals correspond to the one or more loudspeaker signals before the delay is introduced by the apparatus 100 into the loudspeaker signals; or, the one or more reference signals are derived from the one or more loudspeaker signals before the delay is introduced by the apparatus 100 into the loudspeaker signals.
In an embodiment, the apparatus 100 may, e.g., be configured to introduce a first delay into the one or more loudspeaker signals. The one or more reference signals may, e.g., correspond to one or more loudspeaker signals before the first delay is introduced by the apparatus 100 into the loudspeaker signals; or, the one or more reference signals may, e.g., be derived from the one or more loudspeaker signals before the first delay is introduced by the apparatus 100 into the loudspeaker signals. Moreover, the apparatus 100 is to introduce a second delay into the one or more microphone signals. Furthermore, the apparatus 100 may, e.g., be configured to generate the one or more error signals depending on the one or more microphone signals after the second delay is introduced by the apparatus 100 into the one or more microphone signals and depending on the one or more second echo estimation signals.
According to an embodiment, the one or more loudspeaker signals may, e.g., be two or more loudspeaker signals.
In an embodiment, the one or more microphone signals are two or more microphone signals. The apparatus 100 may, e.g., be configured to generate the first echo estimate comprising two or more first echo estimation signals as the one or more first echo estimation signals by filtering the one or more reference signals using the first filter configuration. For each of the two or more microphone signals, one of the two or more first echo estimation signals may, e.g., indicate a first estimation of an echo in said one of the two or more microphone signals.
According to an embodiment, the apparatus 100 may, e.g., comprise two or more first filters, wherein the first filter configuration comprises an individual filter configuration for each of the two or more first filters. Each of the two or more first filters may, e.g., be to generate one of the two or more first echo estimation signals by filtering at least one of the one or more reference signals using the individual filter configuration for said one of the two or more first filters.
In an embodiment, the apparatus 100 may, e.g., comprise exactly one first filter. The exactly one first filter may, e.g., be configured to generate all of the two or more first echo estimation signals by filtering the one or more reference signals using the first filter configuration.
According to an embodiment, there may, e.g., be three or more microphone signals. The apparatus 100 may, e.g., comprise two or more first filters being computed for a proper subset of the three or more microphones. The two or more first filters may, e.g., be configured to generate all of the two or more first echo estimation signals by filtering the one or more reference signals using the first filter configuration.
According to an embodiment, the apparatus 100 may, e.g., be configured to generate the second echo estimate, which comprises two or more second echo estimation signals, by filtering the two or more first echo estimation signals using the second filter configuration. For each of the two or more microphone signals, one of the two or more second echo estimation signals may, e.g., indicate a second estimation of the echo in said one of the two or more microphone signals. The apparatus 100 may, e.g., be configured to generate two or more error signals as the one or more error signals depending on the two or more microphone signals and depending on the two or more second echo estimation signals.
In an embodiment, the apparatus 100 may, e.g., comprise two or more second filters, wherein the second filter configuration comprises an individual filter configuration for each of the two or more second filters. Each of the two or more second filters may, e.g., be configured to generate one of the two or more second echo estimation signals by filtering one of the two or more first echo estimation signals using the individual filter configuration for said one of the two or more second filters. Moreover, the apparatus 100 may, e.g., be configured to generate each one of the two or more error signals using only one of the two or more microphone signals and using only that one of the two or more second echo estimation signals, which indicates the second estimation of the echo in said one of the two or more microphone signals. Furthermore, the apparatus 100 may, e.g., be configured to update the individual filter configuration of each second filter of the two or more second filters using the error signal of the two or more error signals which has been generated using said one of the two or more second echo estimation signals that has been generated by said second filter, without using another error signal of the two or more error signals.
According to an embodiment, the apparatus 100 may, e.g., be configured to generate a first beamformer signal from the two or more microphone signals. The apparatus 100 may, e.g., be configured to generate as the intermediate signal a second beamformer signal from the two or more first echo estimation signals. Moreover, the apparatus 100 may, e.g., be configured to generate exactly one second echo estimation signal as the second echo estimate by filtering the second beamformer signal using the second filter configuration, such that the second echo estimation signal indicates an estimation of an echo in the first beamformer signal. Furthermore, the apparatus 100 may, e.g., be configured to generate exactly one error signal as the one or more error signals depending on the first beamformer signal and depending on the exactly one second echo estimation signal. Moreover, the apparatus 100 may, e.g., be configured to update the second filter configuration depending on the exactly one error signal. Furthermore, the apparatus 100 may, e.g., be configured to output the exactly one error signal.
In an embodiment, the apparatus 100 may, e.g., be configured to introduce a delay into the first beamformer signal. The apparatus 100 may, e.g., be configured to generate the exactly one error signal depending on the one or more microphone signals after the delay is introduced by the apparatus 100 into the first beamformer signal and depending on the exactly one second echo estimation signal.
According to an embodiment, the apparatus 100 may, e.g., comprise one or more loudspeakers to output the one or more loudspeaker signals. Moreover, the apparatus 100 may, e.g., comprise one or more microphones to generate the one or more microphone signals.
In an embodiment, the apparatus 100 may, e.g., be configured to generate the first echo estimate comprising the one or more first echo estimation signals by filtering the one or more reference signals using the first filter configuration, depending on at least one room impulse response between at least one of one or more loudspeakers and at least one of at least one microphone, wherein the one or more loudspeakers are to output the one or more loudspeaker signals, and wherein the at least one microphone is to record the one or more microphone signals.
According to an embodiment, the apparatus 100 may, e.g., be configured to obtain at least one acoustic echo path by measuring said at least one acoustic echo path between said at least one of the one or more loudspeakers and said at least one of the at least one microphone.
In an embodiment, to generate the first echo estimate, the apparatus 100 may, e.g., be configured to filter the one or more reference signals in a time domain to obtain the one or more first echo estimation signals.
According to an embodiment, to generate the first echo estimate, the apparatus 100 may, e.g., be configured to filter the one or more reference signals in a transform domain to obtain the one or more first echo estimation signals.
In an embodiment, to generate the first echo estimate, the apparatus 100 may, e.g., be configured to filter the one or more reference signals in a time domain to obtain the one or more first echo estimation signals depending on the formula:
xc(n)=cTx(n)
wherein xc(n) is one of the one or more first echo estimation signals, wherein x(n) is one of the one or more reference signals, wherein c is a filter for filtering said one of the one or more reference signals, with c=[c(0), . . . ,c(Lc−1)]T,
wherein T indicates a transpose.
According to an embodiment, the apparatus 100 may, e.g., be configured to filter the one or more reference signals to obtain one or more first echo estimation signals depending on the formula:
xc(ℓ)=[xc(ℓN−N+1) . . . xc(ℓN)]T
wherein xc(ℓ) is one of the one or more first echo estimation signals, wherein
[0NTxcT(ℓ)]T=diag{[0NT1NT]T}F−1X̊c(ℓ),
wherein
X̊c(ℓ)=Σp=0P−1C(p)X(ℓ−p),
wherein
X(ℓ−p)=diag{Fx(ℓ−p)},
wherein
C(p)=F[cT(p)0NT]T,
wherein
x(ℓ−p)=[x(ℓN−pN−M+1) . . . x(ℓN−pN)]T, p∈{0, . . . ,P−1},
wherein
c(p)=[c(pN) . . . c(pN+N−1)]T, p∈{0, . . . ,P−1},
wherein Lc indicates a length of an acoustic echo path, wherein N indicates a length of a filter partition of a plurality of filter partitions, wherein P=┌Lc/N┐ indicates a number of the plurality of filter partitions, wherein p is an index of a partition of a loudspeaker signal of the one or more loudspeaker signals, wherein ℓ indicates a frame index, wherein M=2N indicates a discrete Fourier transform length, wherein C(p) indicates a frequency domain representation of a calibration filter, wherein X(ℓ−p) indicates a frequency domain representation of a loudspeaker signal partition, wherein F indicates a discrete Fourier transform matrix of length M, wherein 0N indicates a vector of length N whose elements are all zeroes, wherein 1N indicates a vector of length N whose elements are all ones, wherein the superscript ˚ indicates that an output of a convolution is polluted by circular convolution components, and wherein F−1 indicates an inverse DFT matrix defined such that F−1F=I.
Furthermore, the apparatus 100 is to generate the one or more error signals depending on one or more microphone signals and depending on the one or more second echo estimation signals. Moreover, the apparatus 100 is to update the second filter configuration depending on the one or more error signals.
Furthermore, the apparatus 100 is to output the one or more error signals.
Before further embodiments are described in more detail, some background considerations are presented.
The microphone signal y(n) is given by
y(n)=d(n)+s(n)+v(n), (1)
where d(n) is the echo signal, s(n) is the near-end speech and v(n) is the background noise. The echo signal, i.e., the signal that results from the acoustic coupling between loudspeaker and microphone, is given by d(n)=hTx(n), where h is the AEP, x(n) is the signal that is reproduced by the loudspeaker and ·T denotes transposition. In the remainder, it is assumed that the AEP can be modeled as a finite impulse response (FIR) filter of length Lh, and consequently:
h=[h(0), . . . ,h(Lh−1)]T (2)
x(n)=[x(n), . . . ,x(n−Lh+1)]T (3)
The aim of AEC is to compute an echo signal estimate {circumflex over (d)}(n) that is then subtracted from the microphone signal, resulting in echo reduction [1]. To achieve this, adaptive filtering techniques [2] are employed to estimate the AEP between the loudspeaker and the microphone. In the remainder, estimates are denoted by a circumflex, e.g., {circumflex over (d)}(n). The echo signal estimate {circumflex over (d)}(n) may, e.g., comprise one or more echo estimation signals.
The generic update equation of the adaptive filter takes the form
ĥ(n+1)=ĥ(n)+μ(n)x(n)e(n) (4)
where μ(n) denotes a step-size matrix, which is algorithm dependent, and the error signal after cancellation e(n) is given by
e(n)=[h−ĥ(n)]Tx(n)+s(n)+v(n) (5)
Note that the adaptive filter aims at computing the AEP estimate ĥ(n) that minimizes the error signal at the output of the canceler, i.e., the one that provides the largest echo reduction.
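As an illustration of Eqs. (1)-(5), the following minimal sketch implements a classical time-domain AEC, using an NLMS step size as one concrete choice for μ(n). The function name, the default step size and the regularization constant are assumptions made for the example only.

```python
import numpy as np

def nlms_aec(x, y, L_h, mu=0.5, eps=1e-8):
    """Classical time-domain AEC: adapt an FIR estimate of the AEP of length
    L_h and return the error (echo-reduced) signal, cf. Eqs. (1)-(5)."""
    h_hat = np.zeros(L_h)                    # AEP estimate, cf. Eq. (2)
    e = np.zeros(len(y))
    for n in range(L_h - 1, len(y)):
        x_vec = x[n - L_h + 1:n + 1][::-1]   # x(n) = [x(n), ..., x(n-L_h+1)]^T, Eq. (3)
        e[n] = y[n] - h_hat @ x_vec          # error after cancellation, cf. Eq. (5)
        # NLMS step: one concrete instance of the generic update in Eq. (4)
        h_hat = h_hat + mu * x_vec * e[n] / (x_vec @ x_vec + eps)
    return e, h_hat
```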
Instead of estimating the entire AEP with the adaptive filter, the proposed concept first filters the reference signal with a pre-processing (calibration) filter c, yielding the filtered reference xc(n)=cTx(n), with c=[c(0), . . . ,c(Lc−1)]T and Lc=Lh−Lg+1.
(Lc may, e.g., indicate a length of the filter c. Lg may, e.g., indicate a length of the filter ĝ.)
If this is done, the error signal can be rewritten as
e(n)=[gopt−ĝ(n)]Txc(n)+s(n)+v(n) (7)
where the system to be identified gopt=[gopt(0), . . . , gopt(Lg−1)]T models the difference between the true AEP and the pre-processing filter c. Note that in order to formulate the error as in Eq. (7) it is assumed that both c and gopt can be modeled as FIR filters of length Lc≤Lh and Lg≤Lh, respectively.
The generic update equation of an adaptive filter that aims at estimating gopt such that E{|e(n)|2} is minimized takes the form
ĝ(n+1)=ĝ(n)+η(n)xc(n)e(n) (8)
where E{·} denotes the expectation operator and η(n) is a step size matrix that is algorithm dependent.
Given the error equation (7), it is possible to expect that having knowledge of the true AEP would aid the adaptation process in terms of its convergence and re-convergence rate if Lg<Lh, since the shorter the adaptive filter, the faster it converges. To this end, one could employ a calibration phase to measure the acoustic echo path, such that c≈h.
For instance, if the AEP measurement were perfect, then xc(n)=d(n) and gopt(n)=1. Consequently, it would suffice to estimate Lg=1 filter coefficients. Nevertheless, in practice measured AEPs present measurement errors, e.g., they are degraded due to the microphone self-noise. Therefore, in (7) the optimum impulse response estimate ĝ(n) would aim at modelling the error in the AEP measurement, such that it would be necessary to estimate Lg>1 filter coefficients.
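The filtered-reference idea can be sketched as follows, assuming that a measured calibration filter c is available: the reference is first convolved with c, and only a short filter of length Lg is adapted. The NLMS-type update and all names are again illustrative choices for the example.

```python
import numpy as np
from scipy.signal import lfilter

def filtered_reference_aec(x, y, c, L_g, mu=0.5, eps=1e-8):
    """Filtered-reference AEC: pre-filter the reference with the measured
    calibration filter c, then adapt only a short filter of length L_g."""
    x_c = lfilter(c, [1.0], x)                # filtered reference x_c(n) = (c * x)(n)
    g_hat = np.zeros(L_g)                     # short adaptive filter (estimate of g_opt)
    e = np.zeros(len(y))
    for n in range(L_g - 1, len(y)):
        xc_vec = x_c[n - L_g + 1:n + 1][::-1]
        e[n] = y[n] - g_hat @ xc_vec          # residual echo, cf. Eq. (7)
        # NLMS-type realization of the generic update in Eq. (8)
        g_hat = g_hat + mu * xc_vec * e[n] / (xc_vec @ xc_vec + eps)
    return e, g_hat
```

If the measured AEP were perfect, a single coefficient (L_g = 1) would suffice; in practice L_g is chosen larger than 1, but still much smaller than L_h.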
In the following, embodiments which realize an implementation in the frequency domain are described.
In the example described above, the process to obtain the filtered-reference signal is described in relation to the adaptive filter's update equation and the error signal formulation.
The above provided equations can be equivalently written in the frequency domain. However, assuming that the calibration filters are time invariant or slowly time varying, one would pre-process the reference signal before passing it to the MC-AEC module in practice. Therefore, the pre-processing and MC-AEC modules are assumed to be interfaced in the time domain regardless of the adaptive filter's implementation, and only the filtered reference signal computation is considered in the following.
The computational overhead introduced by computing the filtered reference signal can be kept low if the filtering operation is applied in the frequency domain. Note that frequency-domain multiplications correspond to circular convolutions in the time domain [14, 15, 16]. For optimal performance, it is necessary to select only the linear convolution components after applying the filter.
This selection step is commonly denoted as constraint operation. In the following, the filtering operation is described in the partitioned-block frequency domain [17]. Note that this formulation degenerates into the non-partitioned one if the number of partitions P is set to 1. The advantage provided by the partitioned formulation is the fact that the buffering delay, which is proportional to the partition length, can be arbitrarily selected and is independent of the filter length.
As before, it is assumed that the length of the measured AEP is Lc. If one employs a partitioned formulation of the calibration filter, each filter partition of length N is given by
c(p)=[c(pN) . . . c(pN+N−1)]T,p∈{0, . . . ,P−1} (9)
where the number of partitions is P=┌Lc/N┐. Using an overlap save formulation with a 50% overlap [14, 16], the p-th loudspeaker signal partition can be formulated as:
x(ℓ−p)=[x(ℓN−pN−M+1) . . . x(ℓN−pN)]T,p∈{0, . . . ,P−1} (10)
where ℓ denotes the frame index and M=2N is the discrete Fourier transform (DFT) length. The frequency domain representation of the calibration filter and the loudspeaker signal partitions is respectively given by
C(p)=F[cT(p)0NT]T (11)
X(ℓ−p)=diag{Fx(ℓ−p)} (12)
where F denotes the DFT matrix of length M and 0N is a vector of length N whose elements are all zeroes. The filtered reference is obtained by convolving in the DFT domain, i.e.,
X̊c(ℓ)=Σp=0P−1C(p)X(ℓ−p) (13)
In Eq. (13), the superscript ˚ indicates that the output of the convolution is polluted by circular convolution components. Therefore, it is necessary to select only the linear components:
[0NTxcT(ℓ)]T=diag{[0NT1NT]T}F−1X̊c(ℓ), (14)
xc(ℓ)=[xc(ℓN−N+1) . . . xc(ℓN)]T (15)
where F−1 denotes the inverse DFT matrix defined such that F−1F=I and 1N is similarly defined as 0N.
Finally, the filtered reference signal in the frequency domain that is used by a frequency-domain adaptive filter (FDAF) can be obtained as
Xc(ℓ)=diag{F[xcT(ℓ−1)xcT(ℓ)]T} (16)
if both the filtering operation and the adaptive filter employ the same DFT length M and overlap between subsequent input frames N.
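A minimal sketch of the partitioned-block overlap-save filtering of Eqs. (9)-(15) is given below. It assumes a single loudspeaker channel, omits real-time buffering details, and uses illustrative function and variable names.

```python
import numpy as np

def filtered_reference_pbfd(x, c, N):
    """Partitioned-block overlap-save computation of the filtered reference
    (cf. Eqs. (9)-(15)): the calibration filter c is split into P partitions
    of length N, the DFT length is M = 2N, and only the linear-convolution
    part of each output block is kept (constraint operation)."""
    M = 2 * N
    P = int(np.ceil(len(c) / N))
    c = np.concatenate([c, np.zeros(P * N - len(c))])         # pad c to P*N taps
    # Frequency-domain filter partitions C(p), cf. Eq. (11)
    C = np.stack([np.fft.fft(np.concatenate([c[p * N:(p + 1) * N], np.zeros(N)]))
                  for p in range(P)])

    num_frames = len(x) // N
    X_hist = np.zeros((P, M), dtype=complex)                   # X(l-p), cf. Eq. (12)
    buf = np.zeros(M)                                          # last M loudspeaker samples
    x_c = np.zeros(num_frames * N)
    for l in range(num_frames):
        buf = np.concatenate([buf[N:], x[l * N:(l + 1) * N]])  # 50% overlap frame, Eq. (10)
        X_hist = np.roll(X_hist, 1, axis=0)
        X_hist[0] = np.fft.fft(buf)
        Xc_circ = np.sum(C * X_hist, axis=0)                   # DFT-domain convolution, Eq. (13)
        block = np.real(np.fft.ifft(Xc_circ))
        x_c[l * N:(l + 1) * N] = block[N:]                     # keep linear part, Eqs. (14)-(15)
    return x_c
```

The buffering delay is one partition of N samples, independently of the length of c, which is the advantage of the partitioned formulation mentioned above.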
In the following, MC-AEC is considered.
In the following, particular embodiments providing implementations of filtered-reference MC-AEC are described.
Analogously to the single-channel case, the i-th filtered reference signal is first computed in the DFT domain, i.e.,
X̊c,i(ℓ)=Σp=0P−1Ci(p)Xi(ℓ−p) (17)
and afterwards constrained to select only the linear convolution components
[0NTxc,iT(ℓ)]T=diag{[0NT1NT]T}F−1X̊c,i(ℓ) (18)
xc,i(ℓ)=[xc,i(ℓN−N+1) . . . xc,i(ℓN)]T (19)
with i∈{1, . . . , I}, where
Ci(p)=F[ciT(p)0NT]T (20)
Xi(ℓ−p)=diag{Fxi(ℓ−p)} (21)
are the frequency-domain representation of the i-th measured AEP and the i-th loudspeaker signal, respectively. Note that ci(p) and xi(ℓ−p) are defined analogously to c(p) and x(ℓ−p) in the previous section.
If the filtering of the reference signals and the adaptive algorithm for MCAEC are both implemented in the partitioned-block frequency domain using an equal transform length M and overlap between subsequent input frames N, the i-th filtered reference signal in the frequency domain can be written as
Xc,i(ℓ)=diag{F[xc,iT(ℓ−1)xc,iT(ℓ)]T} (22)
Consequently, if one defines the frequency-domain representation of the q-th partition of the system to be identified, giopt(q)=[giopt(qN), . . . , giopt(qN+N−1)]T, as
Giopt(q)=F[gioptT(q)0NT]T (23)
with q∈{0, . . . , Q−1} and Q=┌Lg/N┐. If one denotes the estimate of Giopt(q) as Ĝi(ℓ,q), the constrained echo estimate can be written as
{circumflex over (D)}(ℓ)=Fdiag{[0NT1NT]T}F−1Σi=1IΣq=0Q−1Xc,i(ℓ−q)Ĝi(ℓ,q) (24)
Therefore, analogously to any state-of-the-art MC-AEC implementation in the frequency domain, the error signal after cancellation is given by
E(ℓ)=Y(ℓ)−{circumflex over (D)}(ℓ) (25)
where A(ℓ)=F[0NTaT(ℓ)]T with a∈{e,y,{circumflex over (d)}}, and a(ℓ)=[a(ℓN+1), . . . , a(ℓN+N)]T. As previously mentioned, the computational complexity of MC-AEC can increase either linearly or quadratically with the number of loudspeaker channels.
The adaptive algorithms whose complexity increases only linearly are those which do not take the relation across loudspeaker channels into account when computing the filter update
Ĝi(ℓ+1,q)=Ĝi(ℓ,q)+Fdiag{[1NT0NT]T}F−1Sii(ℓ,q)Xc,iH(ℓ−q)E(ℓ) (26)
where Sii denotes the i-th step-size matrix; examples are the least-mean squares (LMS) and the normalized least-mean squares (NLMS) algorithms [2]. Alternatively, the recursive least squares (RLS) [18], the affine projection (APA) [5] or the state-space [19] algorithms take the relation between the reference signals into account, i.e.,
Ĝi(ℓ+1,q)=Ĝi(ℓ,q)+Fdiag{[1NT0NT]T}F−1Σj=1ISij(ℓ,q)Xc,jH(ℓ−q)E(ℓ) (27)
where Sij denotes the {ij}-th step-size matrix for all i,j∈{1, . . . , I}. This results in a prewhitening of the reference signals, which increases the convergence rate of the adaptation process. However, it comes at the cost of a significant increase in computational complexity for a large number of reference channels if long adaptive filters are employed. This underlines the importance of reducing the adaptive filters' length to ensure that the computational complexity does not exceed the available computational resources.
Summarizing the above, an underlying idea of the proposed filtered-reference MC-AEC algorithm is to reduce the length of the adaptive filters, i.e., Q=┌Lg/N┐<B=┌Lh/N┐, while maintaining or only moderately reducing the echo reduction capability of an MC-AEC implementation. As a byproduct, the convergence rate of the MC-AEC algorithms is enhanced, since the relation between the reference signals is reduced and the shorter the adaptive filter, the faster it converges. Note that the formulation of the multi-channel adaptive filtering process stays the same, and that the only difference between the proposed method and state-of-the-art MC-AEC is the formulation of the employed reference signals.
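The following sketch illustrates one frame of a per-channel, per-partition filter update in the spirit of Eq. (26). For simplicity, an NLMS-type diagonal step size stands in for Sii; the evaluation below uses the considerably more involved state-space algorithm, and all names are illustrative.

```python
import numpy as np

def fdaf_update(G, Xc_hist, E, mu=0.5, eps=1e-8):
    """One frame of a per-channel frequency-domain filter update in the
    spirit of Eq. (26).

    G       : complex ndarray (I, Q, M), adaptive filter partitions
    Xc_hist : complex ndarray (I, Q, M), filtered references X_c,i(l - q)
    E       : complex ndarray (M,), frequency-domain error E(l)
    """
    I, Q, M = G.shape
    N = M // 2
    # Per-bin power of the filtered references, used for normalization
    power = np.sum(np.abs(Xc_hist) ** 2, axis=(0, 1)) + eps
    for i in range(I):
        for q in range(Q):
            grad = np.conj(Xc_hist[i, q]) * E * (mu / power)
            # Gradient constraint: keep only the first N time-domain taps,
            # corresponding to diag{[1_N^T 0_N^T]^T} in Eq. (26)
            g_t = np.real(np.fft.ifft(grad))
            g_t[N:] = 0.0
            G[i, q] = G[i, q] + np.fft.fft(g_t)
    return G
```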
Particular embodiments may, e.g., comprise one or more or all of the following steps:
Application fields of embodiments may, e.g., be low-complexity upHear® VQE technology for smart devices and VoIP communication.
In the following, the performance of the provided concepts is evaluated.
A series of simulations were conducted to assess the performance of the proposed low-complexity MC-AEC algorithm and to compare it to state-of-the-art MCAEC implementations.
All simulations were conducted employing the state-space algorithm for MC-AEC implemented in the partitioned-block frequency domain as described in [20]. For all simulations, the microphone and reference signals employed were sampled at 16 kHz and the size of the filter partitions was set to N=256 samples, with a 50% overlap between subsequent frames. This implies that the transform length was set to M=512 samples. Further, the adaptive filter length for state-of-the-art MC-AEC was set to 100 ms by default, i.e., B=7 partitions. For both MC-AEC implementations, different numbers of filter partitions were evaluated, ranging from 1 to 7 partitions.
The exact details of the MC-AEC implementation are irrelevant for the evaluation, since no parameters were modified or specifically tuned for either of the two MC-AEC methodologies under test.
Two different sets of simulations were conducted. First, simulated data was generated and employed to evaluate the proposed approach under controlled conditions. Secondly, recordings were made to assess the performance of both MC-AEC approaches.
The objective performance measure used to evaluate both MC-AEC approaches is the normalized mean-squared error (NMSE):
where ∥·∥2 is the ℓ2 norm. The NMSE was computed for each frame; the resulting NMSE values are provided in the figures.
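Since the exact NMSE expression is not reproduced above, the following sketch merely illustrates a frame-wise NMSE computation under the common assumption that the residual energy is normalized by the microphone-signal energy in each frame; this is not necessarily the normalization used in the evaluation.

```python
import numpy as np

def nmse_db(y, d_hat, N):
    """Frame-wise NMSE in dB (illustrative convention: residual energy
    normalized by microphone-signal energy per frame of N samples)."""
    num_frames = min(len(y), len(d_hat)) // N
    nmse = np.zeros(num_frames)
    for l in range(num_frames):
        y_l = y[l * N:(l + 1) * N]
        e_l = y_l - d_hat[l * N:(l + 1) * N]
        nmse[l] = 10.0 * np.log10((np.sum(e_l ** 2) + 1e-12) /
                                  (np.sum(y_l ** 2) + 1e-12))
    return nmse
```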
In the following, generated data is considered. To generate the simulated data, room impulse responses (RIRs) were generated using the impulse response generator in [21] for a stereophonic playback setup and a uniform linear array (ULA) with four microphones. The distance between microphones in the array was 3 cm. The loudspeakers were simulated to be 10 cm away from the center of the array, with an angle of 60 degrees between them. The echo signals were obtained by convolving the generated RIRs with the far-end signals. Finally, white Gaussian noise (WGN) was added to the microphone signals to generate an echo-to-noise ratio (ENR) of 40 dB. No near-end speech was considered in this evaluation.
For the evaluation of the filtered-reference MC-AEC algorithm, white Gaussian noise was added to the generated RIRs to simulate noise in the measurements.
The noisy RIRs were then truncated to the length of the default adaptive filter, i.e., 100 ms. These were then used to obtain the filtered-reference signals for stereo AEC.
Two different test-cases were evaluated:
Since all microphones in the array delivered similar results, only the results obtained for one of the microphones are provided in the figures.
In the following, recorded data is considered. To evaluate the proposed approach with real data, recordings were done in a laboratory. To this end, a ULA with four microphones was employed for sound acquisition. The employed microphones were DPA microphones [ ] and the inter-microphone distance was 3 cm. For playback, two Genelec loudspeakers were employed. Both loudspeakers were placed facing the array, with a distance of 30 cm to the center of the array. The angle between them was approximately 75 degrees.
Both the playback and the recording were conducted so as to ensure the synchronicity of the signals. The sampling rate of both the played back and acquired signals was 48 kHz. The recordings were conducted in one session, using a concatenated audio file for playback that comprised:
For the processing stage, all signals were downsampled to 16 kHz. The downsampled files were cut into five segments. The two logarithmic sweeps were employed for the RIR measurement. The other three segmented signals were employed for processing and evaluation.
Two different sets of results are presented for the proposed approach:
Note that if this is done, the microphone signal has to be delayed by the same number of samples. This adds delay to the processing path, but as shown in the performance evaluation, it also results in a very large increase in performance.
The length of the calibration filters was for both test cases equal to 100 ms.
The results obtained for one of the microphones in the array are depicted in the figures.
Considering the different stereo signal content separately (see
In the following, a theoretical computational complexity analysis is provided. The analysis is based on the number of basic operations, i.e., additions and multiplications. The computational complexity introduced by the measurement of the AEPs is not considered in this analysis, since it is assumed to be negligible if compared to that of the filtered reference signal creation and MC-AEC modules.
In the following, reference signal creation is considered. The computational complexity overhead introduced by creating the filtered reference (FR) signals using a partitioned-block frequency-domain implementation is equal to:
for each input frame. In (29), C(FFT)≈2M log2(M)−4M is the computational complexity of a fast Fourier transform (FFT) of length M and C(mult)=6 is that of a complex multiplication of length 1 (see [22] and references therein).
In the following, multi-channel AEC is considered. Recall that regardless of the MC-AEC method used, the adaptive filtering process for each block and frame stays the same. Thus, in the following R∈{Q,B} denotes the number of filter partitions regardless of the MC-AEC approach.
The computational complexity for each frame of computing the echo estimate (EE) is similar to that of the creation of the reference signal, i.e.,
but requires only a single constraint operation per frame, since the echo estimates of the different channels are summed up. The error signal is commonly computed in the time domain, and then transformed back to the frequency domain by means of one FFT for each frame, i.e.,
C(EC)=N+C(FFT) (31)
Given a multi-channel adaptive filter that takes the relation between the reference channels into account, the computational complexity of the filter update (FU) is equal to
The computational complexity overhead introduced by the computation of the multi-channel step-size matrices, i.e., Sij(ℓ,r) in (27), depends on the specific adaptive algorithm.
To provide an example for a complete adaptive algorithm, the state-space algorithm [23] is employed. The multi-channel implementation of the state-space algorithm in the partitioned-block frequency domain is described in detail in [20], including an in-depth analysis of its computational complexity.
Considering the efficient implementation of the state-space algorithm as described in [20], the computational complexity of the gradient update is:
C(FU)=IR[IMC(mult)+2C(FFT)+2M]+4M (33)
and that of the update of the step-size matrix (MU) (Kalman gain computation and update of the system distance covariance matrix):
Summarizing the detailed analysis above, the computational complexity of state-of-the-art MC-AEC for each frame is equal to:
C(MC−AEC,B)=C(EE)+C(EC)+C(FU)+C(MU) (35)
with R=B, and that of the proposed approach with P filtered-reference partitions and R=Q is equal to:
C(FR MC−AEC,P,Q)=C(FR)+C(EE)+C(EC)+C(FU)+C(MU) (36)
To demonstrate that filtered-reference MC-AEC is capable of reducing the computational complexity with respect to state-of-the-art MC-AEC, the difference in complexity between both approaches, i.e., C(MC−AEC,B)−C(FR MC−AEC,P,Q), is depicted in the figures.
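To make the comparison concrete, the sketch below evaluates only those terms of the analysis that are reproduced above, namely C(FFT), C(mult) and the gradient-update cost C(FU) of Eq. (33), for the parameters of the evaluation (I=2 loudspeaker channels, M=512). The remaining terms of Eqs. (35) and (36) are omitted here since their expressions are not given in this text.

```python
import math

def c_fft(M):
    # Complexity of an FFT of length M, as used in the analysis above
    return 2 * M * math.log2(M) - 4 * M

C_MULT = 6  # complexity of a complex multiplication of length 1

def c_filter_update(I, R, M):
    # Gradient-update complexity, cf. Eq. (33)
    return I * R * (I * M * C_MULT + 2 * c_fft(M) + 2 * M) + 4 * M

if __name__ == "__main__":
    M = 512  # DFT length used in the evaluation (N = 256, 50% overlap)
    for R in (7, 2):  # B = 7 partitions (state of the art) vs., e.g., Q = 2
        print(f"R = {R}: C(FU) = {c_filter_update(2, R, M):,.0f} operations per frame")
```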
Although some aspects have been described in the context of an apparatus, these aspects also represent a description of the corresponding method, where a block or device corresponds to a method step or a feature of a method step. Analogously, aspects described in the context of a method step also represent a description of a corresponding block or item or feature of a corresponding apparatus. Some or all of the method steps may be executed by (or using) a hardware apparatus, like for example, a microprocessor, a programmable computer or an electronic circuit. In some embodiments, one or more of the most important method steps may be executed by such an apparatus.
Depending on certain implementation requirements, embodiments of the invention can be implemented in hardware or in software or at least partially in hardware or at least partially in software. The implementation can be performed using a digital storage medium, for example a floppy disk, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. Therefore, the digital storage medium may be computer readable.
Some embodiments according to the invention comprise a data carrier having electronically readable control signals, which are capable of cooperating with a programmable computer system, such that one of the methods described herein is performed.
Generally, embodiments of the present invention can be implemented as a computer program product with a program code, the program code being operative for performing one of the methods when the computer program product runs on a computer. The program code may for example be stored on a machine-readable carrier.
Other embodiments comprise the computer program for performing one of the methods described herein, stored on a machine-readable carrier.
In other words, an embodiment of the inventive method is, therefore, a computer program having a program code for performing one of the methods described herein, when the computer program runs on a computer.
A further embodiment of the inventive methods is, therefore, a data carrier (or a digital storage medium, or a computer-readable medium) comprising, recorded thereon, the computer program for performing one of the methods described herein. The data carrier, the digital storage medium or the recorded medium are typically tangible and/or non-transitory.
A further embodiment of the inventive method is, therefore, a data stream or a sequence of signals representing the computer program for performing one of the methods described herein. The data stream or the sequence of signals may for example be configured to be transferred via a data communication connection, for example via the Internet.
A further embodiment comprises a processing means, for example a computer, or a programmable logic device, configured to or adapted to perform one of the methods described herein.
A further embodiment comprises a computer having installed thereon the computer program for performing one of the methods described herein.
A further embodiment according to the invention comprises an apparatus or a system configured to transfer (for example, electronically or optically) a computer program for performing one of the methods described herein to a receiver. The receiver may, for example, be a computer, a mobile device, a memory device or the like. The apparatus or system may, for example, comprise a file server for transferring the computer program to the receiver.
In some embodiments, a programmable logic device (for example a field programmable gate array) may be used to perform some or all of the functionalities of the methods described herein. In some embodiments, a field programmable gate array may cooperate with a microprocessor in order to perform one of the methods described herein. Generally, the methods are performed by any hardware apparatus.
The apparatus described herein may be implemented using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.
The methods described herein may be performed using a hardware apparatus, or using a computer, or using a combination of a hardware apparatus and a computer.
While this invention has been described in terms of several embodiments, there are alterations, permutations, and equivalents which fall within the scope of this invention. It should also be noted that there are many alternative ways of implementing the methods and compositions of the present invention. It is therefore intended that the following appended claims be interpreted as including all such alterations, permutations and equivalents as fall within the true spirit and scope of the present invention.
This application is a continuation of copending International Application No. PCT/EP2021/085055, filed Dec. 9, 2021, which is incorporated herein by reference in its entirety, and additionally claims priority from European Application No. EP 20 215 209.6, filed Dec. 17, 2020, which is incorporated herein by reference in its entirety. The present invention relates to acoustic echo cancellation, and, in particular, to an apparatus and method for filtered-reference acoustic echo cancellation.