The present invention is related to acoustic echo cancellation (AEC), and more particularly, to an AEC system that uses an adaptive filter to generate an estimated echo signal for canceling/reducing an echo signal in a data signal before the data signal is fed into a neural network.
Acoustic echo often occurs in audio/video calls when a far-end speaker's voice is played by a near-end loudspeaker and is picked up by a near-end microphone (e.g. a near-end microphone signal generated by the near-end microphone may include an echo signal). In a conventional AEC system, an adaptive filter and a neural network are utilized to suppress the echo signal, wherein the adaptive filter is a part of the neural network architecture. Some problems may occur, however. Since the neural network architecture includes the adaptive filter, the adaptive filter also needs to be considered during training, which may reduce the training efficiency or degrade the training effect. As a result, a novel AEC system that uses the adaptive filter to generate an estimated echo signal for canceling/reducing the echo signal in a data signal (e.g. the near-end microphone signal) before the data signal is fed into the neural network is urgently needed.
It is therefore one of the objectives of the present invention to provide an AEC system that uses the adaptive filter to suppress the echo signal in the data signal before the data signal is fed into the neural network, to address the above-mentioned issues.
According to an embodiment of the present invention, an AEC system is provided. The AEC system comprises an adaptive filter, a subtraction circuit, and a processor. The adaptive filter is arranged to generate an estimated echo signal according to a first microphone signal played by a loudspeaker. The subtraction circuit is arranged to subtract the estimated echo signal from a signal that is output from a microphone receiving both a speech signal and an echo signal, to generate a second microphone signal. The first microphone signal is not output from the microphone, and the echo signal is transmitted from the loudspeaker to the microphone. The processor is arranged to execute a model. The model is arranged to perform short-time Fourier transform upon the first microphone signal and the second microphone signal, respectively, to generate a first transformed microphone signal and a second transformed microphone signal, and generate an estimated speech signal through a neural network according to the first transformed microphone signal and the second transformed microphone signal.
According to an embodiment of the present invention, an AEC method is provided. The AEC method comprises: generating an estimated echo signal according to a first microphone signal played by a loudspeaker; subtracting the estimated echo signal from a signal that is output from a microphone receiving both a speech signal and an echo signal, to generate a second microphone signal, wherein the first microphone signal is not output from the microphone, and the echo signal is transmitted from the loudspeaker to the microphone; performing short-time Fourier transform upon the first microphone signal and the second microphone signal, respectively, to generate a first transformed microphone signal and a second transformed microphone signal; and generating an estimated speech signal through a neural network according to the first transformed microphone signal and the second transformed microphone signal.
One of the benefits of the present invention is that, in the AEC system and associated method of the present invention, before a data signal (e.g. a signal that is output from the microphone receiving an echo signal) is fed into an AEC model for training, an adaptive filter is utilized to generate an estimated echo signal to cancel/reduce most of the echo signal in the data signal. In this way, the training efficiency and the training effect of the AEC model can be improved.
These and other objectives of the present invention will no doubt become obvious to those of ordinary skill in the art after reading the following detailed description of the preferred embodiment that is illustrated in the various figures and drawings.
Certain terms are used throughout the following description and claims, which refer to particular components. As one skilled in the art will appreciate, electronic equipment manufacturers may refer to a component by different names. This document does not intend to distinguish between components that differ in name but not in function. In the following description and in the claims, the terms “include” and “comprise” are used in an open-ended fashion, and thus should be interpreted to mean “include, but not limited to . . . ”.
It should be noted that, before the near-end microphone signal y(n) is transmitted to the AEC model 206, the adaptive filter 204 may be arranged to generate an estimated echo signal d′(n) according to the far-end microphone signal x(n) for canceling the echo signal d(n) in the signal MS output from the microphone 202. For example, the adaptive filter 204 may be arranged to multiply a room response r(n) and an impulse response h(n) (which is a time-varying impulse response between the loudspeaker 200 and the microphone 202), to generate the estimated echo signal d′(n) (i.e. d′(n)=r(n)*h(n), where d′(n)≈d(n)). In this embodiment, the AEC system 20 may further include a subtraction circuit 208 (which may be implemented by an adder that is configured to perform a subtraction operation), wherein the subtraction circuit 208 may be coupled to the microphone 202, the adaptive filter 204, and the AEC model 206, and may be arranged to subtract the estimated echo signal d′(n) from the signal MS to generate the near-end microphone signal y(n) (i.e. y(n)=MS−d′(n)=s(n)+d(n)+v(n)−d′(n)≈s(n)+v(n)). In this way, before the near-end microphone signal y(n) is transmitted to the AEC model 206 for training, most of the echo signal d(n) in the near-end microphone signal y(n) has been canceled/reduced by the adaptive filter 204.
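The pre-cancellation stage described above can be illustrated with a minimal sketch in plain Python. The description does not state which adaptation rule the adaptive filter 204 uses; the normalized least-mean-squares (NLMS) update below is an assumed, commonly used choice, and the function name `nlms_echo_canceller`, the parameters `taps`, `mu`, and `eps`, and the three-tap `echo_path` are all hypothetical values chosen for illustration. The demo uses an echo-only microphone signal (s(n)=0) with a small external noise v(n), so the residual energy directly shows how much of d(n) has been removed.

```python
import random


def nlms_echo_canceller(x, ms, taps=8, mu=0.5, eps=1e-8):
    """Return (y, d_est), where d_est[n] estimates the echo and y[n] = ms[n] - d_est[n].

    NLMS is an assumed adaptation rule; taps, mu, and eps are illustrative
    parameters, not values taken from the description.
    """
    w = [0.0] * taps      # current estimate of the loudspeaker-to-microphone path
    buf = [0.0] * taps    # most recent far-end samples, newest first
    y, d_est = [], []
    for x_n, ms_n in zip(x, ms):
        buf = [x_n] + buf[:-1]
        est = sum(w_i * b_i for w_i, b_i in zip(w, buf))   # estimated echo d'(n)
        err = ms_n - est                                    # subtraction: y(n) = MS - d'(n)
        norm = sum(b_i * b_i for b_i in buf) + eps          # input power for normalization
        w = [w_i + mu * err * b_i / norm for w_i, b_i in zip(w, buf)]
        d_est.append(est)
        y.append(err)
    return y, d_est


def energy(sig):
    """Sum of squared samples."""
    return sum(a * a for a in sig)


# Synthetic demo: echo d(n) is x(n) convolved with a hypothetical echo path.
random.seed(0)
echo_path = [0.6, -0.3, 0.1]
x = [random.uniform(-1.0, 1.0) for _ in range(4000)]          # far-end signal x(n)
d = [sum(echo_path[k] * x[n - k] for k in range(len(echo_path)) if n >= k)
     for n in range(len(x))]                                   # echo d(n)
v = [random.uniform(-1e-3, 1e-3) for _ in x]                   # external noise v(n)
ms = [d_n + v_n for d_n, v_n in zip(d, v)]                     # MS with s(n) = 0
y, d_est = nlms_echo_canceller(x, ms)

# Ratio of residual energy to echo energy over the last 500 samples.
print(energy(y[-500:]) / energy(d[-500:]))
```

Because the filter adapts outside the AEC model, the residual y(n) it produces is what the model receives; nothing about the filter needs to be learned during training.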
Afterwards, the AEC model 206 may be arranged to generate the estimated speech signal s′(n) through the separation kernel 314 according to the first transformed microphone signal X_T and the second transformed microphone signal Y_T. In this embodiment, the separation kernel 314 may include multiple long short term memory (LSTM) layers (e.g. 3 LSTM layers 316-320) and a fully-connected layer 322 (labeled as “FC” in
Specifically, for the training of the AEC model 206, a noisy speech signal Y(k, l) is a sum of a clean speech signal S(k, l) and a noise signal N(k, l) (which may correspond to the external noise signal v(n) and the remnant echo signal that is generated by subtracting the estimated echo signal d′(n) from the echo signal d(n)), that is, Y(k, l)=S(k, l)+N(k, l), wherein k is a frame index, and l is a frequency bin index. After the AEC model 206 is trained according to the AI-based algorithms, a spectral magnitude mask (SMM) may be predicted and generated through the LSTM layers 316-320 and the fully-connected layer 322 with the sigmoid activation 324, wherein the SMM is equal to a ratio of a spectral magnitude of the clean speech signal S(k, l) and a spectral magnitude of the noisy speech signal Y(k, l) (i.e. SMM(k, l)=|S(k, l)|/|Y(k, l)|).
The real part mask RM is a real part of the SMM, and the imaginary part mask IM is an imaginary part of the SMM. In this way, a real part of the estimated speech signal s′(n) can be obtained by multiplying the real part mask RM by a real part of the near-end microphone signal y(n), and an imaginary part of the estimated speech signal s′(n) can be obtained by multiplying the imaginary part mask IM by an imaginary part of the near-end microphone signal y(n).
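The masking arithmetic can be sketched as follows, in plain Python on a toy spectrum. This is an illustration under stated assumptions rather than the trained model: the mask here is the oracle ratio computed from a known clean spectrum, whereas in the description it is predicted by the network. The function names `spectral_magnitude_mask` and `apply_masks`, the `eps` guard against division by zero, and the choice to pass the same real-valued mask as both RM and IM are all hypothetical.

```python
def spectral_magnitude_mask(S, Y, eps=1e-12):
    """Oracle mask: ratio of clean to noisy spectral magnitude for each bin.

    S and Y are lists of frames, each frame a list of complex bins.
    eps is an assumed safeguard against division by zero.
    """
    return [[abs(s) / (abs(y) + eps) for s, y in zip(s_frame, y_frame)]
            for s_frame, y_frame in zip(S, Y)]


def apply_masks(Y, RM, IM):
    """Scale the real part of each bin by RM and the imaginary part by IM."""
    return [[complex(rm * y.real, im * y.imag)
             for y, rm, im in zip(y_frame, rm_frame, im_frame)]
            for y_frame, rm_frame, im_frame in zip(Y, RM, IM)]


# Toy example: one frame, two frequency bins.
S = [[1 + 1j, 2 + 0j]]                      # clean speech spectrum
N = [[1 + 1j, 0 + 2j]]                      # noise plus remnant echo
Y = [[s + n for s, n in zip(S[0], N[0])]]   # noisy spectrum Y = S + N

smm = spectral_magnitude_mask(S, Y)
# Assumption: the same real-valued mask is used for both parts.
est = apply_masks(Y, RM=smm, IM=smm)
print(smm, est)
```

In the first bin the noise is in phase with the speech, so the masked bin matches the clean bin. In the second bin it does not, which shows that a magnitude-ratio mask rescales each bin but leaves the phase of the noisy signal unchanged.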
In Step 400, the far-end microphone signal x(n) is received and played by the loudspeaker 200.
In Step 402, the speech signal s(n) is received by the microphone 202, wherein the far-end microphone signal x(n) is not output from the microphone 202, the echo signal d(n) is transmitted from the loudspeaker 200 to the microphone 202, and the external noise signal v(n) may also be received by the microphone 202.
In Step 404, the estimated echo signal d′(n) is generated by the adaptive filter 204 according to the far-end microphone signal x(n).
In Step 406, by the subtraction circuit 208, the estimated echo signal d′(n) is subtracted from the signal MS that is output from the microphone 202 receiving the speech signal s(n), the echo signal d(n), and the external noise signal v(n), to generate the near-end microphone signal y(n).
In Step 408, by the AEC model 206, the short-time Fourier transform is performed upon the far-end microphone signal x(n) and the near-end microphone signal y(n), respectively, to generate the first transformed microphone signal X_T and the second transformed microphone signal Y_T.
In Step 410, the estimated speech signal s′(n) is generated through the neural network according to the first transformed microphone signal X_T and the second transformed microphone signal Y_T.
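The last two steps of the flow can be sketched in plain Python as a short-time Fourier transform of both signals followed by a per-frame mask. Several parts are assumptions made for illustration: the frame length, hop size, and Hann window are not specified in the description; the transform is a naive DFT written out for readability; and `mask_fn` is a hypothetical stand-in for the trained neural network, here replaced by a trivial all-ones mask so that the sketch runs without a model.

```python
import cmath
import math


def stft(signal, frame_len=16, hop=8):
    """Naive STFT: Hann-windowed frames, one-sided DFT per frame.

    frame_len and hop are illustrative values, not taken from the description.
    Returns a list of frames, each a list of frame_len // 2 + 1 complex bins.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = [sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len))
                for k in range(frame_len // 2 + 1)]
        frames.append(bins)
    return frames


def estimate_speech_spectrum(x, y, mask_fn, frame_len=16, hop=8):
    """Transform x(n) and y(n), then mask Y_T frame by frame.

    mask_fn is a hypothetical placeholder for the neural network: it maps the
    concatenated magnitudes of one X_T frame and one Y_T frame to one mask
    value per frequency bin.
    """
    X_T = stft(x, frame_len, hop)
    Y_T = stft(y, frame_len, hop)
    estimate = []
    for x_bins, y_bins in zip(X_T, Y_T):
        features = [abs(b) for b in x_bins] + [abs(b) for b in y_bins]
        mask = mask_fn(features)
        estimate.append([m * b for m, b in zip(mask, y_bins)])
    return estimate


def unity_mask(features):
    """Placeholder for the network: passes every bin through unchanged."""
    return [1.0] * (len(features) // 2)


# Demo: a tone centered on bin 2 as x(n), and a scaled copy as y(n).
x = [math.sin(2 * math.pi * 2 * n / 16) for n in range(64)]
y = [0.5 * a for a in x]
X_T = stft(x)
est = estimate_speech_spectrum(x, y, unity_mask)
print(len(est), len(est[0]))
```

Feeding both X_T and Y_T to the network gives it a reference for the far-end signal, which is what allows it to tell residual echo apart from near-end speech.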
Since a person skilled in the pertinent art can readily understand details of the steps after reading above paragraphs directed to the AEC system 20 shown in
In summary, in the AEC system 20 and associated method of the present invention, before the signal MS that is output from the microphone 202 receiving the echo signal d(n) is fed into the AEC model 206 for training, the adaptive filter 204 is utilized to generate the estimated echo signal d′(n) to cancel/reduce most of the echo signal d(n) in the signal MS. In this way, the training efficiency and the training effect of the AEC model 206 can be improved.
Those skilled in the art will readily observe that numerous modifications and alterations of the device and method may be made while retaining the teachings of the invention. Accordingly, the above disclosure should be construed as limited only by the metes and bounds of the appended claims.
This application claims the benefit of U.S. Provisional Application No. 63/396,218, filed on Aug. 8, 2022. The content of the application is incorporated herein by reference.
Number | Date | Country
---|---|---
63396218 | Aug 2022 | US