The present implementations relate generally to audio signal processing, and specifically to single-microphone acoustic echo and noise suppression techniques.
Many hands-free communication devices (such as voice over Internet protocol (VoIP) phones, speakerphones, and mobile phones configured to operate in a hands-free mode) include microphones and speakers that are located in relatively close proximity to one another. The microphones are configured to convert sound waves from the surrounding environment into audio signals (also referred to as “microphone signals”) that can be transmitted, over a communications channel, to a far-end device. The speakers are configured to convert audio signals received from the far-end device into sound waves that can be heard by a near-end user. Due to the proximity of the speakers and microphones, the microphone signals may include a speech component (representing audio originating from the near-end user), an echo component (representing audio emitted by the speakers), and a noise component (representing ambient audio from the background environment).
Acoustic echo cancellation (AEC) refers to various techniques that attempt to cancel or suppress the echo component of the microphone signal. Many existing AEC techniques rely on linear transfer functions that approximate the impulse response between a speaker and a microphone. For example, the linear transfer function may be determined using an adaptive filter (such as a normalized least mean square (NLMS) algorithm) that models the acoustic coupling (or channel) between the speaker and the microphone. However, the convergence rate of the NLMS algorithm may depend on double-talk conditions (such as where the near-end user and far-end user speak at the same time) and changes to the echo path. Moreover, such linear transfer functions cannot account for nonlinearities introduced along the echo path by amplifiers and various mechanical components of the speaker. Thus, there is a need to further improve the quality of speech in microphone signals.
This Summary is provided to introduce in a simplified form a selection of concepts that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to limit the scope of the claimed subject matter.
One innovative aspect of the subject matter of this disclosure can be implemented in a method of speech enhancement. The method includes steps of receiving a first audio signal via a microphone; receiving a second audio signal for output via a speaker; estimating a reference audio signal based on a delay between the first audio signal and the second audio signal; determining a plurality of masks based on the first audio signal and the reference audio signal, where the plurality of masks includes a speech mask associated with a speech component of the first audio signal, an echo mask associated with an echo component of the first audio signal, and a noise mask associated with a noise component of the first audio signal; and suppressing the echo component and the noise component of the first audio signal based at least in part on the plurality of masks.
Another innovative aspect of the subject matter of this disclosure can be implemented in a speech enhancement system, including a processing system and a memory. The memory stores instructions that, when executed by the processing system, cause the speech enhancement system to receive a first audio signal via a microphone; receive a second audio signal for output via a speaker; estimate a reference audio signal based on a delay between the first audio signal and the second audio signal; determine a plurality of masks based on the first audio signal and the reference audio signal, where the plurality of masks includes a speech mask associated with a speech component of the first audio signal, an echo mask associated with an echo component of the first audio signal, and a noise mask associated with a noise component of the first audio signal; and suppress the echo component and the noise component of the first audio signal based at least in part on the plurality of masks.
The present implementations are illustrated by way of example and are not intended to be limited by the figures of the accompanying drawings.
In the following description, numerous specific details are set forth such as examples of specific components, circuits, and processes to provide a thorough understanding of the present disclosure. The term “coupled” as used herein means connected directly to or connected through one or more intervening components or circuits. The terms “electronic system” and “electronic device” may be used interchangeably to refer to any system capable of electronically processing information. Also, in the following description and for purposes of explanation, specific nomenclature is set forth to provide a thorough understanding of the aspects of the disclosure. However, it will be apparent to one skilled in the art that these specific details may not be required to practice the example embodiments. In other instances, well-known circuits and devices are shown in block diagram form to avoid obscuring the present disclosure. Some portions of the detailed descriptions which follow are presented in terms of procedures, logic blocks, processing and other symbolic representations of operations on data bits within a computer memory.
These descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. In the present disclosure, a procedure, logic block, process, or the like, is conceived to be a self-consistent sequence of steps or instructions leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, although not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated in a computer system. It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities.
Unless specifically stated otherwise as apparent from the following discussions, it is appreciated that throughout the present application, discussions utilizing the terms such as “accessing,” “receiving,” “sending,” “using,” “selecting,” “determining,” “normalizing,” “multiplying,” “averaging,” “monitoring,” “comparing,” “applying,” “updating,” “measuring,” “deriving” or the like, refer to the actions and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
In the figures, a single block may be described as performing a function or functions; however, in actual practice, the function or functions performed by that block may be performed in a single component or across multiple components, and/or may be performed using hardware, using software, or using a combination of hardware and software. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described below generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure. Also, the example input devices may include components other than those shown, including well-known components such as a processor, memory and the like.
The techniques described herein may be implemented in hardware, software, firmware, or any combination thereof, unless specifically described as being implemented in a specific manner. Any features described as modules or components may also be implemented together in an integrated logic device or separately as discrete but interoperable logic devices. If implemented in software, the techniques may be realized at least in part by a non-transitory processor-readable storage medium including instructions that, when executed, perform one or more of the methods described above. The non-transitory processor-readable storage medium may form part of a computer program product, which may include packaging materials.
The non-transitory processor-readable storage medium may comprise random access memory (RAM) such as synchronous dynamic random-access memory (SDRAM), read only memory (ROM), non-volatile random access memory (NVRAM), electrically erasable programmable read-only memory (EEPROM), FLASH memory, other known storage media, and the like. The techniques additionally, or alternatively, may be realized at least in part by a processor-readable communication medium that carries or communicates code in the form of instructions or data structures and that can be accessed, read, and/or executed by a computer or other processor.
The various illustrative logical blocks, modules, circuits and instructions described in connection with the embodiments disclosed herein may be executed by one or more processors (or a processing system). The term “processor,” as used herein may refer to any general-purpose processor, special-purpose processor, conventional processor, controller, microcontroller, and/or state machine capable of executing scripts or instructions of one or more software programs stored in memory.
As described above, many hands-free communication devices include microphones and speakers that are located in relatively close proximity to one another. As such, microphone signals captured by the microphones may include a speech component (representing audio originating from a near-end user), an echo component (representing audio emitted by the speakers), and a noise component (representing ambient audio from the background environment). Acoustic echo cancellation (AEC) refers to various techniques that attempt to cancel or suppress the echo component of the microphone signal. Many existing AEC techniques rely on linear transfer functions that approximate the impulse response between a speaker and a microphone. However, such linear transfer functions cannot account for nonlinearities introduced along the echo path by amplifiers and various mechanical components of the speakers.
Some modern AEC techniques rely on machine learning to separate the speech component of microphone signals from the echo and noise components. Machine learning, which generally includes a training phase and an inferencing phase, is a technique for improving the ability of a computer system or application to perform a certain task. During the training phase, a machine learning system is provided with one or more “answers” and a large volume of raw training data associated with the answers. The machine learning system analyzes the training data to learn a set of rules that can be used to describe each of the one or more answers. During the inferencing phase, the machine learning system may infer answers from new data using the learned set of rules. Unlike linear AEC filters, machine learning models can be trained to account for nonlinear distortions along the echo path.
Many existing machine learning models used for AEC are trained to enhance speech by jointly suppressing the echo and noise components of microphone signals. However, because the echo component and the noise component originate from different audio sources, such machine learning models may perform poorly in untrained environments (such as with speakers or background audio sources that are different than those used for training). Aspects of the present disclosure recognize that machine learning models also can be used to decouple the echo component of a microphone signal from the noise component of the microphone signal. As a result, a microphone signal can be decomposed into separate speech, echo, and noise signals (representing the speech, echo, and noise components, respectively, of the microphone signal).
Various aspects relate generally to audio signal processing, and more particularly, to speech enhancement techniques for separating microphone signals into speech, echo, and noise signals. In some aspects, a speech enhancement system may include a delay estimator and an acoustic echo and noise (AEN) decoupling filter. The delay estimator receives a microphone signal (
In some aspects, the AEN decoupling filter may determine a set of masks based on the microphone signal
Particular implementations of the subject matter described in this disclosure can be implemented to realize one or more of the following potential advantages. By producing three masks MS(l, k), ME(l, k), and MV(l, k) based on a microphone signal
The far-end device 110 includes a microphone 112 and a speaker 114. The microphone 112 is configured to detect acoustic waves propagating through the far end environment. In the example of
The near-end device 120 includes a speaker 122 and a microphone 124. The speaker 122 is configured to convert the far-end audio signal 103 to acoustic sound waves 104 that can be heard in the near end environment. The microphone 124 is configured to detect acoustic waves propagating through the near end environment. In the example of
The acoustic echoes 104 and background noise 108 may mix with and distort the user speech 106 detected by the microphone 124. As a result, the microphone signal 109 may include a speech component (representing the user speech 106), an echo component (representing the acoustic echoes 104), and a noise component (representing the background noise 108). In some aspects, the near-end device 120 may improve the quality of speech in the microphone signal 109 (also referred to as “speech enhancement”) by suppressing the echo and noise components of the microphone signal 109 or otherwise increasing the signal-to-echo ratio (SER) and the signal-to-noise ratio (SNR) of the microphone signal 109. As a result, the resulting microphone signal may include a relatively unaltered copy of the speech component with only minor (if any) residuals of the echo and noise components.
In some implementations, the microphone signal X(l, k) and the far-end audio signal R(l, k) may be examples of the microphone signal 109 and the far-end audio signal 103, respectively, of
With reference for example to
The speech enhancement system 200 includes a delay estimator 210 and an acoustic echo and noise (AEN) decoupling filter 220. The delay estimator 210 is configured to estimate a delay (δ) between the microphone signal X(l, k) and the far-end audio signal R(l, k) and produce a reference audio signal (R(l−δ, k)) based on the estimated delay. As described with reference to
where f(⋅) is a nonlinear function that describes the effects of the speaker 122 on the far-end audio signal R(l, k), and H(l, k) is the acoustic transfer function between the speaker 122 and the microphone 124.
In some implementations, the delay estimator 210 may estimate the delay δ between the microphone signal X(l, k) and the far-end audio signal R(l, k) based on a generalized cross-correlation phase transform (GCC-PHAT) algorithm. For example, the audio signals R(l, k) and X(l, k) can be expressed as time-domain signals x1(t) and x2(t), respectively:
where s(t) represents the far-end speech component in each of the audio signals x1(t) and x2(t); n1(t) and n2(t) represent the noise components in the audio signals x1(t) and x2(t), respectively; a is an attenuation factor associated with the second audio signal x2(t); and D is the delay (in the time domain) between the first audio signal x1(t) and the second audio signal x2(t).
Aspects of the present disclosure recognize that the time-domain delay D can be determined by computing the cross-correlation (Rx1x2(t)) between the audio signals x1(t) and x2(t):
where E[⋅] is the expected value, and the value of t that maximizes Rx1x2(t) corresponds to the delay D.
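As an illustration of the GCC-PHAT approach described above, the sketch below estimates the sample delay between two signals. This is a generic, minimal implementation (the function name and zero-padding choices are illustrative, not taken from the disclosure): the cross-power spectrum is normalized by its magnitude, and the lag of the resulting correlation peak yields the delay.

```python
import numpy as np

def gcc_phat_delay(x1, x2):
    """Estimate the sample delay of x2 relative to x1 using GCC-PHAT."""
    n = len(x1) + len(x2)              # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12     # PHAT weighting: keep phase, discard magnitude
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return -int(np.argmax(np.abs(cc)) - max_shift)
```

Because the PHAT weighting whitens the cross-spectrum, the correlation peak remains sharp even for reverberant far-end speech, which is why GCC-PHAT is a common choice for this kind of signal alignment.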
In some implementations, the microphone signal X(l, k) and the reference audio signal R(l−δ, k) may be passed directly to the AEN decoupling filter 220. In some other implementations, the speech enhancement system 200 may further include an acoustic echo cancellation (AEC) filter 230 configured to reduce the acoustic echo in the microphone signal X(l, k) based on the reference audio signal R(l−δ, k). In some implementations, the AEC filter 230 may rely on a linear transfer function to approximate the impulse response between the speaker 122 and the microphone 124. For example, the linear transfer function may be determined using an adaptive filter (such as a normalized least mean square (NLMS) algorithm) that models the acoustic coupling (or channel) between the speaker 122 and the microphone 124.
More specifically, the AEC filter 230 may estimate the echo to be subtracted from the microphone signal X(l, k) based on a direct path attenuation factor (γ). In other words, the AEC filter 230 may adjust the reference audio signal R(l−δ, k) by the direct path attenuation factor γ to produce an adjusted reference audio signal (
where
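For context, the linear AEC stage described above can be sketched as a time-domain adaptive FIR filter. This is a textbook NLMS sketch under simplifying assumptions (the names and parameters are illustrative, and the direct path attenuation factor γ is omitted), not the specific implementation of the AEC filter 230:

```python
import numpy as np

def nlms_echo_canceller(mic, ref, num_taps=16, mu=0.5, eps=1e-8):
    """Subtract a linearly estimated echo of `ref` from `mic` using NLMS."""
    w = np.zeros(num_taps)       # adaptive estimate of the echo channel
    buf = np.zeros(num_taps)     # most recent reference samples, newest first
    out = np.empty_like(mic)
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = ref[n]
        echo_hat = w @ buf                       # predicted echo sample
        e = mic[n] - echo_hat                    # residual = echo-suppressed output
        w += (mu / (buf @ buf + eps)) * e * buf  # normalized LMS weight update
        out[n] = e
    return out
```

As noted above, such a linear filter cannot model nonlinear distortion introduced by the speaker, which is the gap the AEN decoupling filter is intended to address.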
The AEN decoupling filter 220 is configured to produce the enhanced audio signal 201 based on the filtered microphone signal
In some implementations, the AEN decoupling filter 220 may decompose the microphone signal
The AEN decoupling system 300 includes a deep neural network (DNN) 310 and a mask generator 320. The DNN 310 is configured to infer a number (N) of outputs 312(1)-312(N) from the microphone signal 302 and the reference audio signal 304 based on a neural network model. Deep learning is a particular form of machine learning in which the inferencing (and training) phases are performed over multiple layers, producing a more abstract dataset in each successive layer. Deep learning architectures are often referred to as “artificial neural networks” due to the manner in which information is processed (similar to a biological nervous system). For example, each layer of an artificial neural network may be composed of one or more “neurons.” The neurons may be interconnected across the various layers so that the input data can be processed and passed from one layer to another. More specifically, each layer of neurons may perform a different transformation on the output data from a preceding layer so that the final output of the neural network results in the desired inferences. The set of transformations associated with the various layers of the network is referred to as a “neural network model.”
The mask generator 320 is configured to generate a set of audio masks based on the outputs 312(1)-312(N) of the DNN 310. In some aspects, the set of audio masks may include a speech mask (MS(l, k)) associated with a speech component of the microphone signal 302, an echo mask (ME(l, k)) associated with an echo component of the microphone signal 302, and a noise mask (MV(l, k)) associated with a noise component of the microphone signal 302. The audio masks MS(l, k), ME(l, k), and MV(l, k) can be used to decompose the microphone signal 302 into the component audio signals 322-326. In some implementations, the AEN decoupling system 300 may apply the speech mask MS(l, k) to the microphone signal 302 to obtain a speech signal 322. In some other implementations, the AEN decoupling system 300 may apply the echo mask ME(l, k) to the microphone signal 302 to obtain an echo signal 324. Still further, in some implementations, the AEN decoupling system 300 may apply the noise mask MV(l, k) to the microphone signal 302 to obtain a noise signal 326.
In some implementations, the microphone signal 302 may be one example of the microphone signal X(l, k) of
S(l,k)=X(l,k)MS(l,k)
E(l,k)=X(l,k)ME(l,k)
V(l,k)=X(l,k)MV(l,k)
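In code, Equations 3-5 reduce to elementwise multiplication of the STFT-domain microphone signal by the (in general complex-valued) masks. A minimal sketch, with illustrative names:

```python
import numpy as np

def decompose(X, M_S, M_E, M_V):
    """Split STFT frames X(l, k) into speech, echo, and noise estimates
    by elementwise masking, per Equations 3-5."""
    return X * M_S, X * M_E, X * M_V
```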
In some other implementations, the microphone signal 302 may be one example of the filtered microphone signal
In some implementations, the DNN 310 may be trained to infer a number (M) of outputs for each component of the microphone signal 302 (where N=3*M). In such implementations, the mask generator 320 may use each set of M DNN outputs to produce a respective one of the audio masks MS(l, k), ME(l, k), and MV(l, k). In some other implementations, the DNN 310 may be trained to infer a number (P) of outputs for only two components of the microphone signal 302 (where N=2*P). In such implementations, the mask generator 320 may use the DNN outputs 312(1)-312(N) to produce two of the audio masks and may produce the third audio mask based on the other two audio masks. With reference for example to Equations 3-6, the audio masks MS(l, k), ME(l, k), and MV(l, k) must sum to 1:
Thus, any of the audio masks MS(l, k), ME(l, k), or MV(l, k) can be determined based on a sum of the other two audio masks.
In some implementations, the mask generator 320 may determine the speech mask MS(l, k) and the echo mask ME(l, k) based on the DNN outputs 312(1)-312(N) and may further determine the noise mask MV(l, k) based on the speech mask MS(l, k) and the echo mask ME(l, k):
In such implementations, the DNN 310 can be implemented using a smaller or more compact neural network model (compared to neural network models that are trained to infer outputs associated with all three audio masks). In other words, for a given neural network size, Equation 8 allows the DNN 310 to produce more accurate inferencing results compared to neural network models that are trained to infer outputs associated with all three audio masks.
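The constraint above is simple to apply in practice: only two masks need to be inferred, and the third follows by subtraction. A minimal sketch (names illustrative); because the masks sum to one, the masked components reconstruct the microphone signal exactly:

```python
import numpy as np

def noise_mask_from(M_S, M_E):
    """Derive the noise mask from the speech and echo masks via the
    sum-to-one constraint: M_V = 1 - M_S - M_E (Equation 8)."""
    return 1.0 - M_S - M_E
```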
The audio mask generation system 400 includes a speech mask generation component 402, an echo mask generation component 404, and a noise mask generation component 406. In some implementations, the speech mask generation component 402 may generate the speech mask MS(l, k) based on the DNN outputs OS1-OS5 and a complementary speech mask (M̌S(l, k)), where M̌S(l,k)=1−MS(l,k) and X(l,k)M̌S(l,k)=E(l,k)+V(l,k). More specifically, the speech mask generation component 402 may determine the magnitude of the speech mask (|MS(l, k)|) based on the DNN outputs OS1(l, k), OS2(l, k), and OS3(l, k):
where softplus(⋅) is a smooth approximation of the rectified linear unit (ReLU) activation function (also referred to as the “softplus function”) and σ(⋅) is the sigmoid activation function.
The speech mask generation component 402 also may determine the magnitude of the complementary speech mask (|M̌S(l, k)|) based on the DNN outputs OS1(l,k), OS2(l,k), and OS3(l,k):
The speech mask generation component 402 may further determine the phase of the speech mask (θS(l, k)) based on the magnitude of the speech mask |MS(l, k)|, the magnitude of the complementary speech mask |M̌S(l, k)|, and the DNN outputs OS4(l, k) and OS5(l, k), where:
where g0 and g1 can be sampled using inverse transform sampling by drawing un ~ Uniform(0, 1) and computing gn = −log(−log(un)), n ∈ {0, 1}.
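The inverse transform sampling step for g0 and g1 is a standard way to draw standard Gumbel noise; a minimal sketch (function name illustrative):

```python
import numpy as np

def sample_gumbel(shape, rng, eps=1e-12):
    """Draw standard Gumbel samples: u ~ Uniform(0, 1), g = -log(-log(u)).

    `eps` keeps u away from 0 so the inner log stays finite.
    """
    u = rng.uniform(low=eps, high=1.0, size=shape)
    return -np.log(-np.log(u))
```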
In some implementations, the echo mask generation component 404 may generate the echo mask ME(l, k) based on the DNN outputs OE1-OE5 and a complementary echo mask (M̌E(l, k)), where M̌E(l, k)=1−ME(l,k) and X(l,k)M̌E(l,k)=V(l,k)+S(l, k). More specifically, the echo mask generation component 404 may determine the magnitude of the echo mask (|ME(l, k)|) based on the DNN outputs OE1(l, k), OE2(l, k), and OE3(l, k):
The echo mask generation component 404 also may determine the magnitude of the complementary echo mask (|M̌E(l, k)|) based on the DNN outputs OE1(l, k), OE2(l, k), and OE3(l,k):
The echo mask generation component 404 may further determine the phase of the echo mask (θE(l, k)) based on the magnitude of the echo mask |ME(l, k)|, the magnitude of the complementary echo mask |M̌E(l, k)|, and the DNN outputs OE4(l, k) and OE5(l, k), where:
In some implementations, the noise mask generation component 406 may generate the noise mask MV(l, k) based on the speech mask MS(l, k) and the echo mask ME(l, k). More specifically, the noise mask generation component 406 may generate the noise mask MV(l, k) based on Equation 8 (such as described with reference to
As described with reference to
By combining Equations 5 and 9, the echo mask ME(l, k) can be expressed as a function of the reference mask MR(l, k):
where α(l, k) is a ratio of the reference audio signal
Aspects of the present disclosure further recognize that the complementary echo mask M̆E(l, k) can be estimated based on a complementary reference mask (M̆R(l,k)), where:
where ᾰ (l, k) is a complementary ratio of the reference audio signal
The example of Equations 9-13 assumes that the AEN decoupling system 300 receives a filtered microphone signal
The audio mask generation system 400 includes a reference mask generation component 412 and an echo mask generation component 414 in addition to the speech mask generation component 402 and the noise mask generation component 406 of
where α(l, k) is the ratio of the reference audio signal
In some implementations, the reference mask generation component 412 may further determine the magnitude of the complementary reference mask |M̆R(l, k)| based on the DNN outputs OR1(l, k), OR2(l, k), and OR3(l, k):
In some implementations, the echo mask generation component 414 may generate the echo mask ME(l, k) based on the magnitude of the reference mask |MR(l, k)|, the magnitude of the complementary reference mask |M̆R(l, k)|, and the DNN outputs OR4(l, k) and OR5(l, k). More specifically, the echo mask generation component 414 may determine the magnitude of the echo mask |ME(l, k)| and the magnitude of the complementary echo mask |M̆E(l, k)| based on the magnitude of the reference mask |MR(l, k)| and the magnitude of the complementary reference mask |M̆R(l, k)|, respectively:
The echo mask generation component 414 may further determine the phase of the echo mask θE(l, k) based on the magnitude of the echo mask |ME(l, k)|, the magnitude of the complementary echo mask |M̆E(l, k)|, and the DNN outputs OR4(l, k) and OR5(l, k), where:
The speech enhancement system 500 includes a device interface 510, a processing system 520, and a memory 530. The device interface 510 is configured to communicate with one or more components of an audio communication device (such as the near-end device 120 of
The memory 530 may include an audio data store 532 configured to store frames of the microphone signal and the reference audio signal as well as any intermediate signals that may be produced by the speech enhancement system 500 as a result of producing the enhanced audio signal. The memory 530 also may include a non-transitory computer-readable medium (including one or more nonvolatile memory elements, such as EPROM, EEPROM, Flash memory, or a hard drive, among other examples) that may store at least the following software (SW) modules:
The processing system 520 may include any suitable one or more processors capable of executing scripts or instructions of one or more software programs stored in the speech enhancement system 500 (such as in the memory 530). For example, the processing system 520 may execute the delay estimation SW module 534 to estimate the reference audio signal based on a delay between the microphone signal and the far-end audio signal. The processing system 520 also may execute the mask generation SW module 536 to determine a plurality of masks based on the microphone signal and the reference audio signal, where the plurality of masks includes a speech mask associated with a speech component of the microphone signal, an echo mask associated with an echo component of the microphone signal, and a noise mask associated with a noise component of the microphone signal. Further, the processing system 520 may execute the speech enhancement SW module 538 to suppress the echo component and the noise component of the microphone signal based at least in part on the plurality of masks.
The speech enhancement system receives a first audio signal via a microphone (610). The speech enhancement system also receives a second audio signal for output via a speaker (620). The speech enhancement system estimates a reference audio signal based on a delay between the first audio signal and the second audio signal (630). In some aspects, the speech enhancement system may perform an AEC operation on the first audio signal based on the reference audio signal. In some implementations, the AEC operation may be associated with a linear filter.
The speech enhancement system determines a plurality of masks based on the first audio signal and the reference audio signal, where the plurality of masks includes a speech mask (MS) associated with a speech component of the first audio signal, an echo mask (ME) associated with an echo component of the first audio signal, and a noise mask (MV) associated with a noise component of the first audio signal (640). The speech component may include audio originating from a near-end user associated with the microphone, the echo component may include audio output by the speaker based on the second audio signal, and the noise component may include audio that does not originate from the near-end user and is not output by the speaker. The speech enhancement system further suppresses the echo component and the noise component of the first audio signal based at least in part on the plurality of masks (650).
In some aspects, the determining of the plurality of masks may include inferring a plurality of outputs from the first audio signal and the reference audio signal based on a neural network model. In some implementations, the determining of the plurality of masks may further include estimating the speech mask MS based on a first subset of the plurality of outputs and a complementary speech mask (M̆S), where M̆S=1−MS; estimating the echo mask ME based on a second subset of the plurality of outputs and a complementary echo mask (M̆E), where M̆E=1−ME; and determining the noise mask MV based on the speech mask MS and the echo mask ME.
In some implementations, the estimating of the speech mask MS may include determining a magnitude of the speech mask MS based on one or more first outputs of the first subset of the plurality of outputs; determining a magnitude of the complementary speech mask M̆S based on the one or more first outputs; and determining a phase of the speech mask MS based on the magnitude of the speech mask MS, the magnitude of the complementary speech mask M̆S, and one or more second outputs of the first subset of the plurality of outputs.
In some implementations, the estimating of the echo mask ME may include determining a magnitude of the echo mask ME based on one or more first outputs of the second subset of the plurality of outputs; determining a magnitude of the complementary echo mask M̆E based on the one or more first outputs; and determining a phase of the echo mask ME based on the magnitude of the echo mask ME, the magnitude of the complementary echo mask M̆E, and one or more second outputs of the second subset of the plurality of outputs.
In some other implementations, the estimating of the echo mask ME may include estimating a reference mask (MR) associated with the reference audio signal based on the second subset of the plurality of outputs and a complementary reference mask (M̆R), where M̆R=1−MR. In such implementations, the speech enhancement system may further determine a magnitude of the reference mask MR based on one or more first outputs of the second subset of the plurality of outputs; determine a magnitude of the complementary reference mask M̆R based on the one or more first outputs; determine a magnitude of the echo mask ME based on the magnitude of the reference mask MR; determine a magnitude of the complementary echo mask M̆E based on the magnitude of the complementary reference mask M̆R; and determine a phase of the echo mask ME based on the magnitude of the echo mask ME, the magnitude of the complementary echo mask M̆E, and one or more second outputs of the second subset of the plurality of outputs.
Those of skill in the art will appreciate that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
Further, those of skill in the art will appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the aspects disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosure.
The methods, sequences or algorithms described in connection with the aspects disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor.
In the foregoing specification, embodiments have been described with reference to specific examples thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader scope of the disclosure as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative sense rather than a restrictive sense.