The present implementations relate generally to signal processing, and specifically to neural noise reduction techniques with linear and nonlinear filtering for single-channel audio signals.
Many hands-free communication devices include microphones configured to convert sound waves into audio signals that can be transmitted, over a communications channel, to a receiving device. The audio signals often include a speech component (such as from a user of the communication device) and a noise component (such as from a reverberant enclosure). Speech enhancement is a signal processing technique that attempts to suppress the noise component of the received audio signals without distorting the speech component. Many existing speech enhancement techniques rely on statistical signal processing algorithms that continuously track the pattern of noise in each frame of the audio signal to model a spectral suppression gain or filter that can be applied to the received audio signal in a time-frequency domain.
Some modern speech enhancement techniques implement machine learning to model a spectral suppression gain or filter that can be applied to the received audio signal in a time-frequency domain. Machine learning, which generally includes a training phase and an inferencing phase, is a technique for improving the ability of a computer system or application to perform a certain task. During the training phase, a machine learning system is provided with one or more “answers” and a large volume of raw training data associated with the answers. The machine learning system analyzes the training data to learn a set of rules that can be used to describe each of the one or more answers. During the inferencing phase, the machine learning system may infer answers from new data using the learned set of rules.
Deep learning is a particular form of machine learning in which the inferencing (and training) phases are performed over multiple layers, producing a more abstract dataset in each successive layer. Deep learning architectures are often referred to as “artificial neural networks” due to the manner in which information is processed (similar to a biological nervous system). For example, each layer of an artificial neural network may be composed of one or more “neurons.” The neurons may be interconnected across the various layers so that the input data can be processed and passed from one layer to another. More specifically, each layer of neurons may perform a different transformation on the output data from a preceding layer so that the final output of the neural network results in a desired inference. The set of transformations associated with the various layers of the network is referred to as a “neural network model.”
The size of a neural network (such as the number of layers in the neural network or the number of neurons in each layer) generally affects the accuracy of the inferencing result. More specifically, larger neural networks tend to produce more accurate inferences than smaller or more compact neural networks. However, speech enhancement for single-channel audio is often implemented by low power edge devices with very limited resources (such as battery-powered headsets, earbuds, and other hands-free communication devices with a single microphone input). As such, many existing single channel speech enhancement techniques rely on compact neural network architectures that produce filtered audio signals with some amount of speech distortion or noise leakage (also referred to as “residual noise”). Thus, there is a need to improve the quality of speech in single-channel audio signals.
This Summary is provided to introduce in a simplified form a selection of concepts that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to limit the scope of the claimed subject matter.
One innovative aspect of the subject matter of this disclosure can be implemented in a method of speech enhancement. The method includes steps of receiving a series of frames of an audio signal; denoising a first frame in the series of frames based at least in part on a temporal correlation between the series of frames; inferring a probability of speech associated with the denoised first frame based on a neural network model; generating a first speech signal and a first noise signal based on the probability of speech associated with the denoised first frame, where the first speech signal and the first noise signal represent a speech component and a noise component, respectively, of the audio signal in the first frame; determining a first spectral suppression gain based on the first speech signal and the first noise signal; and suppressing residual noise in the first speech signal based on the first spectral suppression gain.
Another innovative aspect of the subject matter of this disclosure can be implemented in a speech enhancement system, including a processing system and a memory. The memory stores instructions that, when executed by the processing system, cause the speech enhancement system to receive a series of frames of an audio signal; denoise a first frame in the series of frames based at least in part on a temporal correlation between the series of frames; infer a probability of speech associated with the denoised first frame based on a neural network model; generate a first speech signal and a first noise signal based on the probability of speech associated with the denoised first frame, where the first speech signal and the first noise signal represent a speech component and a noise component, respectively, of the audio signal in the first frame; determine a first spectral suppression gain based on the first speech signal and the first noise signal; and suppress residual noise in the first speech signal based on the first spectral suppression gain.
The present implementations are illustrated by way of example and are not intended to be limited by the figures of the accompanying drawings.
In the following description, numerous specific details are set forth such as examples of specific components, circuits, and processes to provide a thorough understanding of the present disclosure. The term “coupled” as used herein means connected directly to or connected through one or more intervening components or circuits. The terms “electronic system” and “electronic device” may be used interchangeably to refer to any system capable of electronically processing information. Also, in the following description and for purposes of explanation, specific nomenclature is set forth to provide a thorough understanding of the aspects of the disclosure. However, it will be apparent to one skilled in the art that these specific details may not be required to practice the example embodiments. In other instances, well-known circuits and devices are shown in block diagram form to avoid obscuring the present disclosure. Some portions of the detailed descriptions which follow are presented in terms of procedures, logic blocks, processing and other symbolic representations of operations on data bits within a computer memory.
These descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. In the present disclosure, a procedure, logic block, process, or the like, is conceived to be a self-consistent sequence of steps or instructions leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, although not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated in a computer system. It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities.
Unless specifically stated otherwise as apparent from the following discussions, it is appreciated that throughout the present application, discussions utilizing the terms such as “accessing,” “receiving,” “sending,” “using,” “selecting,” “determining,” “normalizing,” “multiplying,” “averaging,” “monitoring,” “comparing,” “applying,” “updating,” “measuring,” “deriving” or the like, refer to the actions and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
In the figures, a single block may be described as performing a function or functions; however, in actual practice, the function or functions performed by that block may be performed in a single component or across multiple components, and/or may be performed using hardware, using software, or using a combination of hardware and software. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described below generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure. Also, the example input devices may include components other than those shown, including well-known components such as a processor, memory and the like.
The techniques described herein may be implemented in hardware, software, firmware, or any combination thereof, unless specifically described as being implemented in a specific manner. Any features described as modules or components may also be implemented together in an integrated logic device or separately as discrete but interoperable logic devices. If implemented in software, the techniques may be realized at least in part by a non-transitory processor-readable storage medium including instructions that, when executed, perform one or more of the methods described above. The non-transitory processor-readable data storage medium may form part of a computer program product, which may include packaging materials.
The non-transitory processor-readable storage medium may comprise random access memory (RAM) such as synchronous dynamic random-access memory (SDRAM), read only memory (ROM), non-volatile random access memory (NVRAM), electrically erasable programmable read-only memory (EEPROM), FLASH memory, other known storage media, and the like. The techniques additionally, or alternatively, may be realized at least in part by a processor-readable communication medium that carries or communicates code in the form of instructions or data structures and that can be accessed, read, and/or executed by a computer or other processor.
The various illustrative logical blocks, modules, circuits and instructions described in connection with the embodiments disclosed herein may be executed by one or more processors (or a processing system). The term “processor,” as used herein may refer to any general-purpose processor, special-purpose processor, conventional processor, controller, microcontroller, and/or state machine capable of executing scripts or instructions of one or more software programs stored in memory.
As described above, some modern speech enhancement techniques utilize neural networks to model a spectral suppression gain or filter that can be applied to an audio signal in the time-frequency domain. Generally, larger neural networks tend to produce more accurate inferences than smaller or more compact neural networks. However, speech enhancement for single-channel audio is often implemented by low power edge devices with limited resources (such as battery-powered headsets, earbuds, and other hands-free communication devices with a single microphone input). As such, many existing single channel speech enhancement techniques rely on compact neural network architectures that produce filtered audio signals with some amount of speech distortion or noise leakage (also referred to as “residual noise”). Aspects of the present disclosure recognize that statistical signal processing techniques can be combined with neural network inferencing to further improve the quality of speech in single-channel audio signals.
Various aspects relate generally to audio signal processing, and more particularly, to speech enhancement techniques that combine statistical signal processing with neural network inferencing. In some aspects, a speech enhancement system may include a linear filter, a deep neural network (DNN), and a nonlinear post-filter. The linear filter and the nonlinear post-filter are configured to suppress noise in audio signals using statistical signal processing techniques. More specifically, the linear filter denoises an input audio signal based on a temporal correlation between successive frames of the audio signal. The DNN infers a probability of speech in the denoised audio signal and produces a speech signal and a noise signal (representing a speech component and a noise component, respectively, of the audio signal) based on the inferred probability of speech. In some implementations, the probability of speech also may be used to update various parameters of the linear filter (such as a vector of weights associated with a multi-frame beamformer). The nonlinear post-filter suppresses residual noise in the speech signal based on a Gaussian mixture model (GMM) associated with the speech signal and the noise signal.
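For purposes of illustration only, the per-frame processing flow can be sketched in Python. The function and parameter names below (enhance_frame, dnn_probability, postfilter_gain, and the simplified per-bin gains) are hypothetical and do not correspond to any particular implementation described herein; the sketch merely shows how the linear pre-filter, the DNN, and the nonlinear post-filter may be chained together.

```python
import numpy as np

def enhance_frame(x_frames, w, dnn_probability, postfilter_gain):
    """Illustrative per-frame flow: linear pre-filter -> DNN -> nonlinear post-filter.

    x_frames        : (Q+1, K) complex STFT values of the stacked input frames X(l,k)
    w               : (Q+1, K) complex linear-filter weights (e.g., MF-MVDR weights)
    dnn_probability : callable mapping the denoised frame to p_DNN(l,k) in [0, 1]
    postfilter_gain : callable mapping (Z, N, p_DNN) to a residual-noise gain G_GMM(l,k)
    """
    # Linear pre-filter: combine the stacked frames to denoise the current frame.
    y = np.einsum('qk,qk->k', np.conj(w), x_frames)   # Y(l,k) = w^H(l,k) X(l,k)

    # DNN inference on the denoised frame.
    p_dnn = dnn_probability(y)                        # per-bin probability of speech

    # Split the denoised frame into speech and noise estimates.
    z = p_dnn * y                                     # speech signal Z(l,k) (simplified gain)
    n = y - z                                         # noise signal N(l,k)

    # Nonlinear post-filter: suppress residual noise in the speech signal.
    g_gmm = postfilter_gain(z, n, p_dnn)              # spectral suppression gain
    return g_gmm * z                                  # enhanced audio frame
```

In practice, the DNN gain and the post-filter gain would be computed as described in the sections that follow; the sketch above only conveys the ordering of the three stages.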
Particular implementations of the subject matter described in this disclosure can be implemented to realize one or more of the following potential advantages. By combining statistical signal processing with neural network inferencing, aspects of the present disclosure can significantly improve the speech quality of hands-free communication devices. More specifically, the linear filter preconditions audio signals to have reduced noise prior to being input to the DNN while the nonlinear post-filter further suppresses residual noise in the audio signals output by the DNN. Accordingly, the speech enhancement system of the present implementations may use a relatively compact neural network to achieve inferencing results similar to much larger neural networks. Because statistical signal processing techniques require relatively low overhead (compared to larger neural networks), the speech enhancement systems of the present implementations may be well-suited for implementation in low power edge devices with very limited resources.
In some implementations, the sound waves 101 may include user speech mixed with background noise or interference (such as reverberant noise from a headset enclosure). Thus, the audio signal 102 may include a speech component and a noise component. For example, the audio signal 102 (X(l,k)) can be expressed as a combination of the speech component (S(l,k)) and the noise component (N(l,k)), where l is a frame index and k is a frequency index associated with a time-frequency domain:
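X(l,k) = S(l,k) + N(l,k)

(This additive form is implied by the preceding sentence and is shown here for reference.)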
The speech enhancement component 120 is configured to improve the quality of speech in the audio signal 102, for example, by suppressing the noise component N(l,k) or otherwise increasing the signal-to-noise ratio (SNR) of the audio signal 102. In some implementations, the speech enhancement component 120 may apply a spectral suppression gain or filter to the audio signal 102. The spectral suppression gain attenuates the power of the noise component N(l,k) of the audio signal 102, in the time-frequency domain, to produce an enhanced speech signal 104. As a result, the enhanced speech signal 104 may have a higher SNR than the audio signal 102.
In some aspects, the speech enhancement component 120 may determine a spectral suppression gain based, at least in part, on a deep neural network (DNN) 122. For example, the DNN 122 may be trained to infer a likelihood or probability of speech in audio signals. Example suitable DNNs include, among other examples, convolutional neural networks (CNNs) and recurrent neural networks (RNNs). During a training phase, the DNN 122 may be provided with a large volume of audio signals containing speech mixed with background noise (also referred to as “noisy speech” signals). The DNN 122 also may be provided with clean speech signals representing only the speech component of each audio signal (without the background noise). The DNN 122 compares the noisy speech signals with the clean speech signals to determine a set of features that can be used to classify speech.
During an inferencing phase, the DNN 122 may determine a probability of speech (pDNN(l,k)) in each frame l of the audio signal 102, at each frequency index k associated with the time-frequency domain, based on the classification results. In some implementations, the DNN 122 may further convert the probability of speech pDNN(l,k) into a spectral suppression gain (GDNN(l,k)) that can be used to produce an enhanced speech signal (Z(l,k)), where Z(l,k)=GDNN(l,k)X(l,k). More specifically, the spectral suppression gain GDNN(l,k) may suppress the noise component N(l,k) of the audio signal 102 in the lth audio frame. For example, if there is a low probability of speech in the Lth frame of the audio signal 102 at the Kth frequency index (indicating that noise is dominant at this time-frequency index), the value of GDNN(L,K) may be relatively low so that the power of N(L,K) is attenuated when applying the spectral suppression gain to the audio signal 102.
As described above, the size of a neural network (such as the number of layers in the neural network or the number of neurons in each layer) generally affects the accuracy of the inferencing result. More specifically, larger neural networks tend to produce more accurate inferences than smaller or more compact neural networks. As such, existing neural network architectures require significant processing power and memory to achieve accurate speech enhancement, particularly for single-channel audio signals. However, single channel speech enhancement is often used in low power edge devices with limited resources (such as battery-powered headsets, earbuds, and other hands-free communication devices with a single microphone input). As such, compact neural networks may be more suitable than larger neural networks for many single channel speech enhancement applications.
In some implementations, the DNN 122 may be a relatively compact neural network. As a result, the DNN 122 may be unable to filter out at least some of the noise in the audio signal 102. In other words, the DNN 122 may produce an enhanced speech signal Z(l,k) having some speech distortion or residual noise. Aspects of the present disclosure recognize that statistical signal processing techniques can be combined with neural network inferencing to further improve the quality of speech in single-channel audio signals. Example suitable statistical processing techniques include, among other examples, linear and nonlinear filtering techniques. In some aspects, the speech enhancement component 120 may perform linear filtering on the input of the DNN 122 to suppress noise in the audio signal 102 prior to being processed by the DNN 122. In some other aspects, the speech enhancement component 120 may perform nonlinear filtering on the output of the DNN 122 to further suppress residual noise in the enhanced speech signal 104.
The series of input audio frames X(l,k) includes the current audio frame to be processed (X0(l,k)), a number (c) of future audio frames (Xfuture(l,k)) that follow the current audio frame X0(l,k) in time, and a number (d) of past audio frames (Xpast(l,k)) that precede the current audio frame X0(l,k) in time, such that:
where c+d≥1 and Δ is a delay parameter that determines a delay between successive frames in the series of input audio frames X(l,k). In some implementations, the delay parameter Δ may be set to a value less than a frame hop associated with the speech enhancement system 200 (such as 1 sample) to ensure temporal speech correlation across the input audio frames X(l,k).
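For reference, the stacked multi-frame vector may be written, for example, as:

X(l,k) = [Xfuture(l,k)^T, X0(l,k), Xpast(l,k)^T]^T

where Xfuture(l,k) collects the c future frames and Xpast(l,k) collects the d past frames, with adjacent frames in the stack offset by the delay Δ (the ordering shown is illustrative).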
For example, given a fast Fourier transform (FFT) size of K (where K is the number of frequency bins associated with the FFT), the current audio frame X0(l,k) can be expressed as:
and the past audio frames Xpast(l,k) can be expressed as:
and the future audio frames Xfuture(l,k) can be expressed as:
In some implementations, the speech enhancement system 200 may include a linear filter 210, a DNN 220, and a nonlinear post-filter 230. The linear filter 210 is configured to produce a denoised audio frame Y(l,k) based on the series of input audio frames X(l,k). More specifically, the linear filter 210 may suppress or attenuate a noise component (N(l,k)) of the current audio frame X0(l,k) based on a temporal correlation associated with the series of input audio frames X(l,k). In some implementations, the linear filter 210 may include a multi-frame beamformer. Example suitable multi-frame beamformers include, but are not limited to, multi-frame minimum variance distortionless response (MF-MVDR) beamformers.
Multi-frame beamformers exploit the temporal characteristics of single-channel audio signals to enhance speech. More specifically, multi-frame beamforming relies on accurate predictions or estimations of the temporal correlation of speech between consecutive audio frames (also referred to as the “interframe correlation of speech”). With reference for example to Equation 1, the speech component S(l,k) of an audio signal can be decomposed into a correlated part (a(l,k)s(l,k)) and an uncorrelated part (s′(l,k)):
where a(l,k) is an interframe correlation (IFC) vector associated with the speech component of the audio frames X(l,k), ΦSS(l,k) is a matrix representing the covariance of the speech component, and e is a vector selecting the first column of ΦSS(l,k). Accordingly, the multi-frame signal model can be expressed as:
where the uncorrelated speech component s′(l,k) is treated as interference.
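In the multi-frame beamforming literature, this decomposition and signal model are commonly written in the following form (shown here for illustration; the notation follows the definitions above):

S(l,k) = a(l,k)s(l,k) + s′(l,k),   a(l,k) = ΦSS(l,k)e / (e^H ΦSS(l,k)e)

X(l,k) = a(l,k)s(l,k) + s′(l,k) + N(l,k)

where s(l,k) denotes the speech in the current frame.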
A multi-frame beamformer may use the IFC vector a(l,k) to align the series of input frames X(l,k), for example, so that the speech component S(l,k) combines in a constructive manner (or the noise component N(l,k) combines in a destructive manner) when the input frames X(l,k) are summed together. For example, an MF-MVDR beamformer may apply a vector of weights w = [w0, . . . , wQ]^T to the series of audio frames X(l,k) to produce the denoised audio frame Y(l,k):
Y(l,k) = w^H(l,k)X(l,k)
In some aspects, the linear filter 210 may determine a vector of weights w(l,k) that optimizes the denoised audio frame Y(l,k) with respect to one or more conditions. For example, the linear filter 210 may determine a vector of weights w(l,k) that reduces or minimizes the variance of the noise component of the audio frame Y(l,k) without distorting the speech component of the audio frame Y(l,k). In other words, the vector of weights w(l,k) may satisfy the following condition:
argmin_w w^H(l,k)ΦNN(l,k)w(l,k)   s.t.   w^H(l,k)a(l,k) = 1
where ΦNN(l,k) is a matrix representing the covariance of the noise component of the audio frames X(l,k). The resulting vector of weights w(l,k) represents an MF-MVDR beamforming filter (wMVDR(l,k)), which can be expressed as:
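wMVDR(l,k) = ΦNN^−1(l,k)a(l,k) / (a^H(l,k)ΦNN^−1(l,k)a(l,k))

(This is the standard closed-form solution to the constrained minimization above and is shown here for reference.)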
In some implementations, the linear filter 210 may estimate or track the IFC vector a(l,k) and the noise covariance matrix ΦNN(l,k), over time, as a function of X(l,k)X^H(l,k). More specifically, the linear filter 210 may update the IFC vector a(l,k) when speech is present or otherwise detected in the input audio signal and may refrain from updating the IFC vector a(l,k) when speech is absent or otherwise not detected in the input audio signal. On the other hand, the linear filter 210 may update the noise covariance matrix ΦNN(l,k) when speech is absent or otherwise not detected in the input audio signal and may refrain from updating the noise covariance matrix ΦNN(l,k) when speech is present or otherwise detected in the input audio signal.
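For purposes of illustration, the probability-gated tracking described above may be sketched as follows. The smoothing factor alpha, the threshold p_thresh, and the use of a hard threshold on the speech probability are assumptions made for this sketch (a soft, probability-weighted update may be used instead).

```python
import numpy as np

def update_statistics(x, p_speech, phi_ss, phi_nn, alpha=0.95, p_thresh=0.5):
    """Sketch of probability-gated statistics tracking for one frequency bin k.

    x        : (Q+1,) stacked frame vector X(l,k)
    p_speech : speech probability for this bin (e.g., p_DNN(l,k) from the DNN)
    phi_ss   : (Q+1, Q+1) speech covariance estimate; phi_nn : noise covariance estimate
    alpha    : assumed recursive smoothing factor; p_thresh : assumed presence threshold
    """
    outer = np.outer(x, np.conj(x))                    # instantaneous X(l,k) X^H(l,k)
    if p_speech >= p_thresh:
        phi_ss = alpha * phi_ss + (1 - alpha) * outer  # speech present: update speech stats
    else:
        phi_nn = alpha * phi_nn + (1 - alpha) * outer  # speech absent: update noise stats
    # IFC vector: first column of the speech covariance, normalized to its first entry.
    a = phi_ss[:, 0] / (phi_ss[0, 0] + 1e-12)
    return a, phi_ss, phi_nn
```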
The DNN 220 is configured to infer a probability of speech pDNN(l,k) in the current audio frame X0(l,k) based on a neural network model, where 0≤pDNN(l,k)≤1. In some implementations, the DNN 220 may be one example of the DNN 122 of
In some aspects, the DNN 220 may further produce a speech signal (Z(l,k)) and a noise signal (N(l,k)) based on the probability of speech pDNN(l,k), where the speech signal Z(l,k) represents a speech component of the denoised audio frame Y(l,k) and the noise signal N(l,k) represents a noise component of the denoised audio frame Y(l,k). For example, the DNN 220 may compute a spectral suppression gain (GDNN(l,k)) based on the probability of speech pDNN(l,k) and may apply the spectral suppression gain GDNN(l,k) to the denoised audio frame Y(l,k) to produce the speech signal Z(l,k), where Z(l,k)=GDNN(l,k)Y(l,k). The noise signal N(l,k) may be computed as a difference between the denoised audio frame Y(l,k) and the speech signal Z(l,k), where N(l,k)=Y(l,k)−Z(l,k).
In some aspects, the DNN 220 may be biased towards minimizing speech distortion, rather than maximizing noise suppression. In other words, the spectral suppression gain GDNN(l,k) may be tuned to ensure that the speech component of the denoised audio frame Y(l,k) is not distorted in the resulting speech signal Z(l,k). For example, the DNN 220 may calculate the speech signal Z(l,k) as a function of the denoised audio frame Y(l,k), the probability of speech pDNN(l,k), and a tuning parameter (σ) that controls the amount of noise reduction by the DNN 220:
where |Z(l,k)| is the magnitude of the speech signal Z(l,k), phase(Z(l,k)) is the phase of the speech signal Z(l,k), and 0≤σ<1. In some implementations, the tuning parameter σ may be configured so that the speech signal Z(l,k) contains no speech distortion (and may thus contain some residual noise).
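As one illustrative possibility (an assumption made here; the exact expression is not reproduced in this text), the tuning parameter σ may act as a lower bound on the DNN suppression gain:

|Z(l,k)| = [σ + (1 − σ)pDNN(l,k)]|Y(l,k)|,   phase(Z(l,k)) = phase(Y(l,k))

so that values of σ closer to 1 yield less noise reduction and less speech distortion.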
In the example of
The nonlinear post-filter 230 is configured to produce the enhanced audio frame S(l,k) based, at least in part, on the speech signal Z(l,k) and the noise signal N(l,k) output by the DNN 220.
Aspects of the present disclosure recognize that the speech signal Z(l,k) contains mostly target speech and the noise signal N(l,k) contains mostly background noise. As such, the normalized difference e(l,k) between the signals Z(l,k) and N(l,k) may be closer to +1 when the target speech is present in the speech signal Z(l,k) and closer to −1 when the target speech is absent from the speech signal Z(l,k):
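e(l,k) = (|Z(l,k)| − |N(l,k)|) / (|Z(l,k)| + |N(l,k)|)

(One plausible form of the normalized difference, shown here for illustration; it is bounded between −1 and +1, consistent with the behavior described above.)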
In some implementations, the nonlinear post-filter 230 may use an online Gaussian mixture model (GMM) to determine a probability of speech (PGMM(l,k)) associated with the speech signal Z(l,k) based on the normalized difference e(l,k) between the speech signal Z(l,k) and the noise signal N(l,k). For example, the normalized difference e(l,k) can be used to create a bimodal model with two Gaussian probability density functions (PDFs), including a Gaussian PDF for which the target speech is dominant and a Gaussian PDF for which the noise is dominant. The online GMM can be used to calculate a weight (wc), mean (μc), and variance (σc) for each Gaussian PDF, where c=1 represents the Gaussian PDF for which target speech is dominant and c=2 represents the Gaussian PDF for which noise is dominant:
where PGMM(l,k) is a soft probability of speech at each time-frequency index, λ(l,k)={w1(l,k), μ1(l,k), σ1(l,k), w2(l,k), μ2(l,k), σ2(l,k)}, and ηc is a learning rate step size.
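For purposes of illustration, one possible online update of such a two-component GMM is sketched below. The stochastic-approximation form of the updates, the fixed step size, and the dictionary layout are assumptions made for this sketch.

```python
import numpy as np

def update_gmm(e, lam, eta=0.05):
    """Sketch of an online two-component GMM update for one time-frequency bin.

    e   : VAD feature value (e.g., the normalized difference, roughly in [-1, 1])
    lam : dict with length-2 arrays 'w', 'mu', 'var'
          (component 0: speech-dominant, component 1: noise-dominant)
    eta : learning-rate step size (fixed here; may be adapted as described below)
    """
    w, mu, var = lam['w'], lam['mu'], lam['var']
    # Likelihood of the feature under each Gaussian PDF.
    lik = np.exp(-0.5 * (e - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
    post = w * lik / (np.sum(w * lik) + 1e-12)       # responsibility of each component
    # Stochastic-approximation updates of the weights, means, and variances.
    diff = e - mu
    w = w + eta * (post - w)
    mu = mu + eta * post * diff
    var = np.maximum(var + eta * post * (diff ** 2 - var), 1e-6)
    lam.update(w=w / np.sum(w), mu=mu, var=var)
    return post[0]                                   # soft probability of speech p_GMM(l,k)
```

A caller might initialize the speech-dominant component with a mean near +1 and the noise-dominant component with a mean near −1, consistent with the normalized difference described above.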
In some implementations, the step size ηc can be adaptively determined based on the denoised audio frame Y(l,k) and the probability of speech pDNN(l,k) inferred by the DNN 220:
where η is a maximum step size that can be used for sub-band parameter tracking (and may be a tunable hyperparameter associated with the speech enhancement system 200), and kmin to kmax represents a frequency range for which speech is dominant (such as 0-2 kHz).
In some implementations, a VAD based on multi-frame temporal processing (VADt(l)) can be expressed as a function of the current audio frame X0(l,k) and the denoised audio frame Y(l,k):
where c1, c2, c3, and c4 are tuning thresholds (between 0 and 1). In some implementations, VADupdate(l) may be used to indicate which sets of GMM parameters (such as for speech or noise) should be updated:
In some implementations, the nonlinear post-filter 230 may estimate the magnitude (or power) of noise (Pn(l,k)) in the speech signal Z(l,k) based on the probability of speech PGMM(l,k) and, using spectral subtraction, determine a spectral suppression gain (GGMM(l,k)) that can be applied to the speech signal Z(l,k) to produce the enhanced audio frame S(l,k):
where g is a tuning parameter which represents a floor gain associated with the spectral subtraction.
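For purposes of illustration, a noise-tracking and spectral-subtraction step consistent with this description may be sketched as follows; the probability-weighted noise update and the exact gain rule are assumptions made for this sketch.

```python
import numpy as np

def residual_noise_gain(z_mag, p_gmm, noise_mag, alpha=0.9, floor_gain=0.1):
    """Sketch of GMM-driven noise tracking and spectral subtraction (assumed form).

    z_mag      : (K,) magnitude of the speech signal |Z(l,k)|
    p_gmm      : (K,) soft probability of speech p_GMM(l,k)
    noise_mag  : (K,) running residual-noise magnitude estimate P_n(l-1,k)
    floor_gain : floor gain g that limits over-suppression (musical noise)
    """
    # Track the residual-noise magnitude where speech is unlikely.
    noise_mag = alpha * noise_mag + (1 - alpha) * (1.0 - p_gmm) * z_mag
    # Spectral subtraction gain, limited from below by the floor gain g.
    gain = np.maximum((z_mag - noise_mag) / (z_mag + 1e-12), floor_gain)
    return gain, noise_mag
```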
Aspects of the present disclosure recognize that multiple VAD features (including the normalized difference e (l,k) between the speech signal Z(l,k) and the noise signal N(l,k)) can be combined to produce a spectral suppression gain GGMM(l,k) that is more robust to different noisy environments. Other example suitable VAD features may include a cepstral peak, a spectral entropy, and a harmonic product spectrum (HPS) of the speech signal Z(l,k), among other examples. In some implementations, the nonlinear post-filter 230 may determine a respective probability of speech (PGMMi(l,k)) associated with each VAD feature (ei(l,k)), where i≥1, and calculate the spectral suppression gain GGMM(l,k) based on the lowest probability among the probabilities of speech PGMMi(l,k) associated with any of the VAD features ei(l,k).
In some aspects, the nonlinear filter 300 may suppress residual noise in the speech signal Z(l,k) based on a number (M) of VAD features associated with the speech signal Z(l,k) and the noise signal N(l,k). In some implementations, the nonlinear filter 300 may suppress the residual noise in the speech signal Z(l,k) based on multiple VAD features (M>1). In the example of
In some implementations, at least one of the feature extractors 310(1)-310(M) may compute a normalized difference between the speech signal Z(l,k) and the noise signal N(l,k). For example, a first VAD feature (e1(l,k)) may be calculated according to Equation 3. Although each of the feature extractors 310(1)-310(M) is shown to receive the speech signal Z(l,k) and the noise signal N(l,k), some feature extractors may calculate a respective VAD feature ei(l,k) based on the speech signal Z(l,k) alone. For example, one or more of the VAD features ei(l,k) may be computed based on a power spectrum φZZ(l,k) of the speech signal Z(l,k):
where β is a smoothing factor (between 0 and 1) and φNN(l,k) is a noise power spectrum, which can be estimated by averaging the power spectrum φZZ(l,k) during pauses in speech (such as when speech is not detected in the speech signal Z(l,k)).
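For reference, a standard recursive estimate consistent with this description is:

φZZ(l,k) = βφZZ(l−1,k) + (1 − β)|Z(l,k)|²

with φNN(l,k) obtained by smoothing |Z(l,k)|² in the same manner during speech pauses.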
In some implementations, at least one of the feature extractors 310(1)-310(M) may compute a cepstral peak associated with the speech signal Z(l,k). For example, a second VAD feature (e2(l,k)) may be computed as:
where τ represents a lag associated with the nth sample of the lth frame and K is the total number of frequency bins associated with the speech signal Z(l,k).
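One common cepstral-peak formulation consistent with these definitions (an illustrative assumption, with the lag search range [τmin, τmax] hypothetical) is:

c(l,τ) = |(1/K) Σk=0..K−1 log|Z(l,k)| e^(j2πkτ/K)|,   e2(l) = max τmin≤τ≤τmax c(l,τ)

where the lag range corresponds to typical voiced pitch periods.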
In some implementations, at least one of the feature extractors 310(1)-310(M) may compute a spectral entropy associated with the speech signal Z(l,k). For example, a third VAD feature (e3(l,k)) may be computed as:
where K is the total number of frequency bins associated with the speech signal Z(l,k).
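For illustration, a typical spectral-entropy computation over the K frequency bins (an assumed form) is:

p(l,k) = |Z(l,k)|² / Σk′=0..K−1 |Z(l,k′)|²,   e3(l) = −Σk=0..K−1 p(l,k) log p(l,k)

where low entropy suggests a peaky (speech-like) spectrum and high entropy suggests noise.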
In some implementations, at least one of the feature extractors 310(1)-310(M) may compute a harmonic product spectrum (HPS) associated with the speech signal Z(l,k). For example, a fourth VAD feature (e4(l,k)) may be computed as:
where R is a number of harmonic components corresponding to k, τ represents a lag associated with the nth sample of the lth frame, and K is the total number of frequency bins associated with the speech signal Z(l,k).
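For illustration, the harmonic product spectrum is commonly formed by multiplying downsampled copies of the magnitude spectrum (an assumed form; the exact mapping to e4(l,k) is not reproduced here):

HPS(l,k) = Πr=1..R |Z(l, r·k)|

where a pronounced peak in HPS(l,k) indicates harmonic (voiced speech) content.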
Each of the GMMs 320(1)-320(M) is configured to determine a probability of speech PGMMi(l,k), associated with the speech signal Z(l,k), based on a respective VAD feature ei(l,k). For example, each of the VAD features ei(l,k) may be used to create a respective bimodal model with two Gaussian PDFs, including a Gaussian PDF for which the target speech is dominant and a Gaussian PDF for which the noise is dominant. In some implementations, each of the GMMs 320(1)-320(M) may compute the respective probability of speech PGMMi(l,k) based on a weight wc, mean μc, and variance σc of each Gaussian PDF associated with the corresponding VAD feature ei(l,k) (such as described with reference to
The noise suppressor 330 is configured to produce the enhanced audio frame S(l,k) based, at least in part, on the probabilities of speech PGMM1(l,k)-PGMMM(l,k) associated with the VAD features e1(l,k)-eM(l,k), respectively. In some implementations, the noise suppressor 330 may select the lowest probability of speech (PGMM(l,k)) among the probabilities of speech PGMM1(l,k)-PGMMM(l,k), where pGMM(l,k)=min {pGMMi(l,k)}, and compute a spectral suppression gain GGMM(l,k) based on the lowest probability of speech PGMM(l,k). For example, the noise suppressor 330 may use the probability of speech PGMM(l,k) to calculate the magnitude (or power) of noise Pn(l,k) in the speech signal Z(l,k), according to Equation 5, and may calculate the spectral suppression gain GGMM(l,k) based on the magnitude (or power) of noise Pn(l,k), according to Equation 6. The noise suppressor 330 may further apply the spectral suppression gain GGMM(l,k) to the speech signal Z(l,k) to produce the enhanced audio frame S(l,k).
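For purposes of illustration, this conservative combination of the per-feature probabilities may be sketched as follows (the function name and array layout are hypothetical; the subsequent gain computation follows Equations 5 and 6 as described above):

```python
import numpy as np

def fuse_speech_probabilities(p_features):
    """Keep the lowest per-bin speech probability across the M VAD features.

    p_features : (M, K) array holding p_GMM_i(l,k) for each feature i = 1..M
    """
    return np.min(p_features, axis=0)   # p_GMM(l,k) = min_i { p_GMM_i(l,k) }
```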
The device interface 410 is configured to communicate with one or more components of an audio receiver (such as the microphone 110 of
The memory 430 may include an audio data store 432 configured to store a series of frames of the audio signal as well as any intermediate signals that may be produced by the speech enhancement system 400 as a result of performing the speech enhancement operation. The memory 430 also may include a non-transitory computer-readable medium (including one or more nonvolatile memory elements, such as EPROM, EEPROM, Flash memory, or a hard drive, among other examples) that may store at least the following software (SW) modules:
a linear filtering SW module 434 to denoise a first frame in the series of frames based at least in part on a temporal correlation between the series of frames;
a DNN SW module 436 to infer a probability of speech associated with the denoised first frame based on a neural network model and to generate a speech signal and a noise signal based on the probability of speech associated with the denoised first frame; and
a nonlinear filtering SW module 438 to determine a spectral suppression gain based on the speech signal and the noise signal and to suppress residual noise in the speech signal based on the spectral suppression gain.
Each software module includes instructions that, when executed by the processing system 420, cause the speech enhancement system 400 to perform the corresponding functions.
The processing system 420 may include any suitable one or more processors capable of executing scripts or instructions of one or more software programs stored in the speech enhancement system 400 (such as in the memory 430). For example, the processing system 420 may execute the linear filtering SW module 434 to denoise a first frame in the series of frames based at least in part on a temporal correlation between the series of frames. The processing system 420 also may execute the DNN SW module 436 to infer a probability of speech associated with the denoised first frame based on a neural network model and to generate a speech signal and a noise signal based on the probability of speech associated with the denoised first frame, where the speech signal and the noise signal represent a speech component and a noise component, respectively, of the audio signal in the first frame. Further, the processing system 420 may execute the nonlinear filtering SW module 438 to determine a spectral suppression gain based on the speech signal and the noise signal and suppress residual noise in the speech signal based on the spectral suppression gain.
The speech enhancement system receives a series of frames of an audio signal (510). In some aspects, the audio signal may represent a single channel of audio data. The speech enhancement system denoises a first frame in the series of frames based at least in part on a temporal correlation between the series of frames (520). In some implementations, the first frame may be denoised based on an MF-MVDR beamformer that reduces a power of the noise component of the audio signal without distorting the speech component.
The speech enhancement system infers a probability of speech associated with the denoised first frame based on a neural network model (530). The speech enhancement system further generates a first speech signal and a first noise signal based on the probability of speech associated with the denoised first frame, where the first speech signal and the first noise signal represent a speech component and a noise component, respectively, of the audio signal in the first frame (540). In some implementations, the speech component of the audio signal in the denoised first frame may be equal to the speech component of the audio signal in the first speech signal.
The speech enhancement system determines a first spectral suppression gain based on the first speech signal and the first noise signal (550). In some implementations, the determining of the first spectral suppression gain may include determining a number (M) of VAD features that are indicative of whether speech is present in the first frame based at least in part on the first speech signal and the first noise signal; determining M probabilities of speech associated with the first speech signal based on the M VAD features, respectively; and determining a magnitude or power of the residual noise in the first speech signal based on the M probabilities of speech associated with the first speech signal. The speech enhancement system further suppresses residual noise in the first speech signal based on the first spectral suppression gain (560).
In some implementations, the number of VAD features may be greater than 1 (M>1). In some implementations, the magnitude or power of the residual noise in the first speech signal may be determined based only on the lowest probability of speech among the M probabilities of speech associated with the first speech signal. In some implementations, each of the M probabilities of speech associated with the first speech signal is determined based on a respective GMM. In some implementations, the M VAD features may include a normalized difference between the first speech signal and the first noise signal. In some implementations, the M VAD features may include at least one of a cepstral peak, a spectral entropy, or a harmonic product spectrum (HPS) associated with the first speech signal.
In some aspects, the speech enhancement system may further determine an IFC vector associated with a speech component of the audio signal based at least in part on the probability of speech associated with the denoised first frame; denoise a second frame in the series of frames based at least in part on the IFC vector; infer a probability of speech associated with the denoised second frame based on the neural network model; generate a second speech signal and a second noise signal based on the probability of speech associated with the denoised second frame, where the second speech signal and the second noise signal represent a speech component and a noise component, respectively, of the audio signal in the second frame; determine a second spectral suppression gain based on the second speech signal and the second noise signal; and suppress residual noise in the second speech signal based on the second spectral suppression gain.
Those of skill in the art will appreciate that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
Further, those of skill in the art will appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the aspects disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosure.
The methods, sequences or algorithms described in connection with the aspects disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor.
In the foregoing specification, embodiments have been described with reference to specific examples thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader scope of the disclosure as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative sense rather than a restrictive sense.