The present invention is related to acoustic echo cancellation (AEC), and more particularly, to an AEC system in which a loss function of a model is a mean square error between a true mask and a spectral magnitude mask (SMM).
Acoustic echo often occurs in audio/video calls when a far-end speaker's voice (e.g. a far-end microphone signal) is played by a near-end loudspeaker and is picked up by a near-end microphone (e.g. a near-end microphone signal generated by the near-end microphone may include an echo signal and a clean speech signal). For a conventional AEC system, a model is trained and built through a neural network to predict and generate an estimated speech signal according to the far-end microphone signal and the near-end microphone signal, wherein the purpose of the training of the model is to minimize a difference between the estimated speech signal and the clean speech signal (e.g. a loss function of the model is a mean square error between the estimated speech signal and the clean speech signal). Some problems may occur, however. The loss function of the model in the conventional AEC system may have a large loss range, which may reduce the training effect. As a result, a novel AEC system in which a loss function of a model is a mean square error between a true mask (which is a ratio of a spectral magnitude of the clean speech signal and a spectral magnitude of a noisy speech signal) and an SMM (which is a ratio of a spectral magnitude of the estimated speech signal and the spectral magnitude of the noisy speech signal) is urgently needed.
It is therefore one of the objectives of the present invention to provide an AEC system in which a loss function of a model is a mean square error between a true mask and an SMM.
According to an embodiment of the present invention, an AEC system is provided. The AEC system comprises a loudspeaker interface, a microphone interface, and a processor. The loudspeaker interface is coupled to a loudspeaker. The microphone interface is coupled to a microphone. The processor is arranged to execute a model. The model is arranged to predict and generate an SMM through a neural network according to a first microphone signal played by the loudspeaker and a second microphone signal output by the microphone, wherein a noisy speech signal is a sum of a clean speech signal and a noise signal; the SMM is a ratio of a spectral magnitude of an estimated speech signal and a spectral magnitude of the noisy speech signal; a true mask is a ratio of a spectral magnitude of the clean speech signal and the spectral magnitude of the noisy speech signal; and a loss function of the model is a mean square error between the SMM and the true mask.
According to an embodiment of the present invention, an AEC method is provided. The AEC method comprises: executing a model to predict and generate an SMM through a neural network according to a first microphone signal played by a loudspeaker and a second microphone signal output by a microphone, wherein a noisy speech signal is a sum of a clean speech signal and a noise signal; the SMM is a ratio of a spectral magnitude of an estimated speech signal and a spectral magnitude of the noisy speech signal; a true mask is a ratio of a spectral magnitude of the clean speech signal and the spectral magnitude of the noisy speech signal; and a loss function of the model is a mean square error between the SMM and the true mask.
One of the benefits of the present invention is that, in the AEC system and associated method of the present invention, a loss function of an AEC model is a mean square error between an SMM and a true mask. Compared with a conventional model in which a loss function is a mean square error between an estimated speech signal and a clean speech signal, the loss range of the proposed AEC model can be reduced and the training effect of the proposed AEC model can be improved.
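The benefit described above can be illustrated numerically. The following sketch (with illustrative magnitude values that are not from the disclosure) compares the conventional signal-domain loss with the proposed mask-domain loss; because both masks are ratios against the same noisy magnitude, the mask difference is normalized and the resulting loss range is smaller:

```python
import numpy as np

# Hypothetical spectral magnitudes for a few time-frequency points
# (illustrative numbers only, not from the disclosure).
Y = np.array([10.0, 8.0, 12.0])    # noisy speech magnitude |Y(k, l)|
S = np.array([6.0, 2.0, 9.0])      # clean speech magnitude |S(k, l)|
S_est = np.array([5.0, 3.0, 8.0])  # estimated speech magnitude |S'(k, l)|

# Conventional loss: MSE between estimated and clean spectral magnitudes.
loss_signal = np.mean((S_est - S) ** 2)

# Proposed loss: MSE between the SMM and the true mask.
smm = S_est / Y        # SMM = |S'| / |Y|
true_mask = S / Y      # true mask = |S| / |Y|
loss_mask = np.mean((smm - true_mask) ** 2)

print(loss_signal)  # 1.0
print(loss_mask)    # much smaller, since the masks are bounded ratios
```

The same prediction error yields a far smaller mask-domain loss, consistent with the reduced loss range claimed above.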
These and other objectives of the present invention will no doubt become obvious to those of ordinary skill in the art after reading the following detailed description of the preferred embodiment that is illustrated in the various figures and drawings.
Certain terms are used throughout the following description and claims, which refer to particular components. As one skilled in the art will appreciate, electronic equipment manufacturers may refer to a component by different names. This document does not intend to distinguish between components that differ in name but not in function. In the following description and in the claims, the terms “include” and “comprise” are used in an open-ended fashion, and thus should be interpreted to mean “include, but not limited to . . . ”.
The AEC model 204 may be arranged to receive the far-end microphone signal x(n) and the near-end microphone signal y(n) through a loudspeaker interface (not shown) and a microphone interface (not shown) of the AEC system 20. The AEC model 204 may be arranged to predict and generate a spectral magnitude mask (SMM) through a neural network according to the far-end microphone signal x(n) and the near-end microphone signal y(n), for generating an estimated speech signal s′(n). Specifically, please refer to
Afterwards, the AEC model 204 may be arranged to generate the SMM through the separation kernel 314 according to the first transformed microphone signal X_T and the second transformed microphone signal Y_T. In this embodiment, the separation kernel 314 may include multiple long short-term memory (LSTM) layers (e.g. 3 LSTM layers 316-320) and a fully-connected layer 322 (labeled as “FC” in
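The output stage of the separation kernel can be sketched as follows. This is a minimal stand-in (the LSTM stack is replaced by random features, and all shapes are assumed for illustration); it shows how the fully-connected layer followed by a sigmoid activation maps hidden features to mask values bounded in (0, 1):

```python
import numpy as np

def sigmoid(x):
    # Logistic function: maps any real input into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

# Assumed shapes for illustration; not taken from the disclosure.
rng = np.random.default_rng(0)
frames, hidden, bins = 4, 8, 5
h = rng.standard_normal((frames, hidden))  # stand-in for the LSTM stack output
W = rng.standard_normal((hidden, bins))    # fully-connected layer weights ("FC")
b = np.zeros(bins)                         # fully-connected layer bias

smm = sigmoid(h @ W + b)  # sigmoid keeps every mask entry in (0, 1)
print(smm.shape)          # (4, 5): one mask value per frame and frequency bin
```

The sigmoid is what guarantees a bounded mask, which in turn bounds the mask-domain loss.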
Specifically, for the training of the AEC model 204, a noisy speech signal YSS (which may correspond to the near-end microphone signal y(n)) is a sum of a clean speech signal CSS (which may correspond to the speech signal s(n)) and a noise signal NSS (which may correspond to the external noise signal v(n) and the echo signal d(n)), that is, YSS=CSS+NSS. After the AEC model 204 is trained according to the AI-based algorithms, the SMM may be predicted and generated through the LSTM layers 316-320 and the fully-connected layer 322 with the sigmoid activation 324, wherein the SMM is equal to a ratio of a spectral magnitude S′(k, l) of the estimated speech signal s′(n) and a spectral magnitude Y(k, l) of the noisy speech signal YSS, which can be expressed by the following equation:

SMM(k, l)=S′(k, l)/Y(k, l)

wherein k is a frame index, and l is a frequency bin index.
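The SMM defined above is an element-wise ratio over the time-frequency grid. A minimal sketch (with hypothetical magnitude spectrograms of shape (frames, bins), matching the (k, l) indexing in the text):

```python
import numpy as np

# Hypothetical magnitude spectrograms, shape (frames, bins): entry [k, l]
# is frame k, frequency bin l (illustrative values only).
S_est = np.array([[4.0, 1.0],
                  [3.0, 2.0]])  # |S'(k, l)|, estimated speech magnitude
Y = np.array([[8.0, 4.0],
              [6.0, 8.0]])      # |Y(k, l)|, noisy speech magnitude

smm = S_est / Y  # SMM(k, l) = |S'(k, l)| / |Y(k, l)|, element-wise
print(smm)       # [[0.5  0.25] [0.5  0.25]]
```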
In this embodiment, the training objective of the AEC model 204 is to minimize a difference between the SMM and a true mask, wherein the true mask is a ratio of a spectral magnitude S(k, l) of the clean speech signal CSS and the spectral magnitude Y(k, l) of the noisy speech signal YSS, which can be expressed by the following equation:

True mask(k, l)=S(k, l)/Y(k, l)
In other words, a loss function LmaskMSE of the AEC model 204 is a mean square error between the SMM and the true mask. Since the SMM includes the real part mask RM and the imaginary part mask IM, the loss function LmaskMSE of the AEC model 204 is a sum of a mean square error between the real part mask RM and a real part of the true mask (labeled as “RT” in the following equation) and a mean square error between the imaginary part mask IM and an imaginary part of the true mask (labeled as “IT” in the following equation), which can be expressed by the following equation:
LmaskMSE=MSE(RT, RM)+MSE(IT, IM)
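The loss equation above can be sketched directly. The mask values below are hypothetical (chosen only to illustrate the computation); the loss is the sum of the two per-component mean square errors:

```python
import numpy as np

def mse(a, b):
    # Mean square error between two arrays of equal shape.
    return np.mean((a - b) ** 2)

# Hypothetical real/imaginary mask components (illustrative values only).
RM = np.array([0.5, 0.4])  # predicted real-part mask
IM = np.array([0.2, 0.1])  # predicted imaginary-part mask
RT = np.array([0.6, 0.4])  # real part of the true mask
IT = np.array([0.1, 0.1])  # imaginary part of the true mask

loss = mse(RT, RM) + mse(IT, IM)  # LmaskMSE = MSE(RT, RM) + MSE(IT, IM)
print(loss)  # ≈ 0.01
```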
In Step S400, the far-end microphone signal x(n) is received and played by the loudspeaker 200.
In Step S402, the speech signal s(n) is received and the near-end microphone signal y(n) is output by the microphone 202, wherein the far-end microphone signal x(n) is not output from the microphone 202, the echo signal d(n) is transmitted from the loudspeaker 200 to the microphone 202, and the external noise signal v(n) may also be received by the microphone 202.
In Step S404, the AEC model 204 is executed to predict and generate the SMM through the neural network according to the far-end microphone signal x(n) and the near-end microphone signal y(n), wherein the loss function LmaskMSE of the AEC model 204 is a mean square error between the SMM and the true mask.
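The steps above can be sketched end-to-end for a single analysis frame. This is an assumed inference-time usage (the disclosure only states that the model outputs the SMM): the trained model is replaced here by a fixed stand-in mask, and the estimated speech is reconstructed with the masked magnitude and the noisy phase:

```python
import numpy as np

# One frame of a synthetic near-end microphone signal (assumed signals;
# a 440 Hz tone as speech s(n) plus a 1 kHz tone as echo d(n)).
fs, n = 8000, 256
t = np.arange(n) / fs
s = np.sin(2 * np.pi * 440 * t)          # near-end speech s(n)
d = 0.3 * np.sin(2 * np.pi * 1000 * t)   # echo d(n) from the loudspeaker
y = s + d                                # near-end microphone signal y(n)

Y = np.fft.rfft(y)                       # spectrum of one analysis frame
smm = np.full(Y.shape, 0.5)              # stand-in for the model's predicted SMM
S_est = smm * np.abs(Y) * np.exp(1j * np.angle(Y))  # masked magnitude, noisy phase
s_est = np.fft.irfft(S_est, n)           # estimated speech s'(n)
```

With the uniform 0.5 mask, the output is simply half the noisy frame; a trained model would instead predict a frequency-dependent mask that suppresses the echo bins.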
Since a person skilled in the pertinent art can readily understand details of the steps after reading the above paragraphs directed to the AEC system 20 shown in
In summary, in the AEC system 20 and associated method of the present invention, the loss function LmaskMSE of the AEC model 204 is a mean square error between the SMM and the true mask. Compared with a conventional model in which a loss function is a mean square error between the estimated speech signal s′(n) and the clean speech signal CSS (e.g. the speech signal s(n)), the loss range of the proposed AEC model 204 can be reduced and the training effect of the proposed AEC model 204 can be improved.
Those skilled in the art will readily observe that numerous modifications and alterations of the device and method may be made while retaining the teachings of the invention. Accordingly, the above disclosure should be construed as limited only by the metes and bounds of the appended claims.
This application claims the benefit of U.S. Provisional Application No. 63/396,218, filed on Aug. 8, 2022. The content of the application is incorporated herein by reference.