The technology described in this document relates generally to audio signal processing and more particularly to systems and methods for reducing background noise in an audio signal.
Noise suppression systems including computer hardware and/or software are used to improve the overall quality of an audio sample by distinguishing a desired signal from ambient background noise. For example, in processing audio samples that include speech, it is desirable to improve the signal-to-noise ratio (SNR) of the speech signal to enhance the intelligibility and/or perceived quality of the speech. Enhancement of speech degraded by noise is an important area of audio signal processing and is used in a variety of applications (e.g., mobile phones, voice over IP, teleconferencing systems, speech recognition, and hearing aids). Such speech enhancement may be particularly useful in processing audio samples recorded in environments having high levels of ambient background noise, such as an aircraft, a vehicle, or a noisy factory.
The present disclosure is directed to systems and methods for reducing noise from an input signal to generate a noise-reduced output signal. In an example method of reducing noise from an input signal to generate a noise-reduced output signal, an input signal is received. The input signal is transformed from a time domain to a plurality of subbands in a frequency domain, where each subband of the plurality of subbands includes a speech component and a noise component. For each of the subbands, an amplitude of the speech component is estimated based on an amplitude of the subband and an estimate of at least one signal-to-noise ratio (SNR) of the subband. The estimating of the amplitude of the speech component is not based on an exponential function or a Bessel function. The estimating of the amplitude of the speech component is based on a closed-form solution. The plurality of subbands in the frequency domain are filtered based on the estimated amplitudes of the speech components to generate the noise-reduced output signal.
An example system for reducing noise from an input signal to generate a noise-reduced output signal includes a time-to-frequency transformation device. The time-to-frequency transformation device is configured to transform an input signal from a time domain to a plurality of subbands in a frequency domain, where each subband of the plurality of subbands includes a speech component and a noise component. The system further includes a filter coupled to the time-to-frequency transformation device. The filter is configured, for each of the subbands, to estimate an amplitude of the speech component based on an amplitude of the subband and an estimate of at least one signal-to-noise ratio (SNR) of the subband. The estimating of the amplitude of the speech component is not based on an exponential function or a Bessel function. The estimating of the amplitude of the speech component is based on a closed-form solution. The filter is also configured to filter the plurality of subbands in the frequency domain based on the estimated amplitudes of the speech components to generate the noise-reduced output signal. The system also includes a frequency-to-time transformation device configured to transform the noise-reduced output signal from the frequency domain to the time domain.
In another example, a filter includes an input for receiving an input signal in a frequency domain. The input signal includes a plurality of subbands in the frequency domain, where each subband of the plurality of subbands includes a speech component and a noise component. The filter also includes an attenuation filter coupled to the input. The attenuation filter is configured to attenuate frequencies in the input signal based on

$\hat{A}_k = \frac{\sqrt{\nu_k(1+\nu_k)}}{\gamma_k}\, Y_k,$

where $\hat{A}_k$ is an estimate of an amplitude of the speech component for a subband k of the plurality of subbands, $\gamma_k$ is an estimate of an a posteriori SNR of the subband k, $Y_k$ is an amplitude of the subband k, and $\nu_k$ is

$\nu_k = \frac{\xi_k}{1+\xi_k}\, \gamma_k,$

where $\xi_k$ is an estimate of an a priori SNR of the subband k. The filter also includes an output coupled to the attenuation filter for outputting a noise-reduced output signal.
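For illustration only, the attenuation rule above can be written as a few lines of code; the following Python sketch (the function and variable names are illustrative and not part of the disclosure) computes the estimate for a single subband:

```python
import math

def estimate_speech_amplitude(Y_k: float, xi_k: float, gamma_k: float) -> float:
    """Closed-form speech-amplitude estimate for subband k.

    Y_k     -- amplitude of the noisy subband
    xi_k    -- a priori SNR estimate for the subband
    gamma_k -- a posteriori SNR estimate for the subband
    """
    nu_k = (xi_k / (1.0 + xi_k)) * gamma_k            # parameter nu_k
    gain = math.sqrt(nu_k * (1.0 + nu_k)) / gamma_k   # attenuation gain for the subband
    return gain * Y_k

# Example: a subband with a priori SNR 4 and a posteriori SNR 5
print(estimate_speech_amplitude(Y_k=1.0, xi_k=4.0, gamma_k=5.0))  # ~0.894
```

Note that only multiplication, division, and a square root are required, which is what makes the rule inexpensive to implement.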
The noise suppression filter system 106 is used to reduce the noise in the input signal. The noise suppression filter system 106 may be understood as performing “speech enhancement” because suppressing the noise in the input signal may enhance the intelligibility and/or perceived quality of the speech components of the signal. The noise suppression filter system 106 is described in greater detail below with reference to FIG. 2.
Example features of the noise suppression filter system 106 of FIG. 1 are illustrated in FIG. 2.
The time domain signal y(t) 206 is received at a time-to-frequency domain converter 208. In an example, the time-to-frequency domain converter 208 comprises hardware and/or computer software for converting the frames of the signal y(t) 206 from the time domain to the frequency domain. The time-to-frequency domain conversion is achieved in the converter 208, for example, using a Fast Fourier Transform (FFT) algorithm, a short-time Fourier transform (STFT) (i.e., short-term Fourier transform) algorithm, or another algorithm (e.g., an algorithm that performs a discrete Fourier transform mathematical process). The conversion of the frames from the time domain to the frequency domain permits analysis and filtering of the speech sample to occur in the frequency domain, as explained in further detail below. In an example, the time-to-frequency domain converter 208 operates on individual frames of the signal y(t) 206 and determines the Fourier transform of each frame individually using the STFT algorithm.
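As a rough sketch of this analysis stage (the 512-sample frames, 50% hop, and Hann window below are illustrative assumptions, not requirements of the disclosure), each windowed frame may be transformed with an FFT:

```python
import numpy as np

def frames_to_subbands(y: np.ndarray, frame_len: int = 512, hop: int = 256) -> np.ndarray:
    """Convert a time-domain signal into per-frame subband amplitudes (an STFT)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(y) - frame_len) // hop
    amplitudes = []
    for m in range(n_frames):
        frame = y[m * hop : m * hop + frame_len] * window  # windowed time-domain frame
        spectrum = np.fft.rfft(frame)                      # one-sided frequency domain
        amplitudes.append(np.abs(spectrum))                # Y_k values for this frame
    return np.array(amplitudes)  # shape: (n_frames, frame_len // 2 + 1)
```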
The time-to-frequency domain converter 208 converts each frame of the signal 206 into K subbands in the frequency domain and determines amplitude values Yk 210, k=1, . . . , K. The amplitude values Yk 210 are amplitude values for each of the K frequency subbands. For example, if a frequency domain representation of a frame includes frequency components over a range of 0 Hz to 20 kHz, and if each subband has a width of 20 Hz, then K=1,000, and the amplitude values Yk 210 include one thousand (1,000) amplitude values, with each of the K subbands being associated with an amplitude value. In this example, a first subband has an amplitude value (e.g., Y1) for frequency components ranging from 0 to 20 Hz, a second subband has an amplitude value (e.g., Y2) for frequency components ranging from 20 Hz to 40 Hz, and so on. Each frequency subband includes a speech component and a noise component.
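The bookkeeping in this example can be made concrete with a short sketch; the 40 kHz sample rate and 2,000-point FFT below are assumed values chosen only to reproduce the 20 Hz bin width of the example:

```python
sample_rate = 40_000  # Hz (assumed); the one-sided spectrum spans 0 Hz to 20 kHz
fft_size = 2_000      # FFT points (assumed), giving the 20 Hz subbands of the example

bin_width = sample_rate / fft_size      # 20.0 Hz per subband
K = int((sample_rate / 2) / bin_width)  # 1,000 subbands covering 0 Hz to 20 kHz
# Subband k spans [k * bin_width, (k + 1) * bin_width) Hz, matching the text above.
print(bin_width, K)  # 20.0 1000
```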
The frequency subbands may also be referred to as “frequency bins.”
With reference again to FIG. 2, the amplitude values Yk 210 are received at an attenuation filter 212. For each of the K frequency subbands, the attenuation filter 212 estimates an amplitude of the speech component of the subband based on the amplitude value Yk of the subband and an estimate of at least one signal-to-noise ratio (SNR) of the subband.
In an example, the estimating of the amplitude of the speech component is based on a simple function having few terms. The simple function (described in further detail below) is in contrast to the complex mathematical functions used in conventional speech enhancement systems. Such complex mathematical functions may be based on exponential functions, gamma functions, and modified Bessel functions, among others, which are difficult and costly to implement in hardware. By contrast, the attenuation filter 212 described herein utilizes the aforementioned simple function, which includes few terms and does not require evaluating exponential functions, gamma functions, or modified Bessel functions. The attenuation filter 212 described herein is based on a closed-form solution (e.g., a non-infinite order polynomial function). The simple function described herein can be efficiently implemented in hardware. The hardware implementation may include, for example, a computer processor, a non-transitory computer-readable storage medium (e.g., a memory device), and additional components (e.g., multiplier, divider, and adder components implemented in hardware). It should be understood that the function used in estimating the amplitude of the speech component may be implemented in hardware in a variety of different ways.
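For contrast, the following sketch places the closed-form gain next to what is widely known as the Ephraim-Malah MMSE short-time spectral amplitude gain (per the Ephraim et al. reference listed at the end of this document); the conventional rule is shown only to illustrate the exponential and Bessel evaluations that the closed-form rule avoids:

```python
import math
from scipy.special import i0e, i1e  # exponentially scaled modified Bessel functions

def gain_closed_form(xi: float, gamma: float) -> float:
    """Closed-form gain: only multiply, divide, and square root."""
    nu = (xi / (1.0 + xi)) * gamma
    return math.sqrt(nu * (1.0 + nu)) / gamma

def gain_ephraim_malah(xi: float, gamma: float) -> float:
    """Conventional MMSE-STSA gain: requires exp and modified Bessel functions."""
    nu = (xi / (1.0 + xi)) * gamma
    # i0e(x) = exp(-x) * I0(x), so exp(-nu/2) * I0(nu/2) == i0e(nu/2); likewise for I1.
    bessel_term = (1.0 + nu) * i0e(nu / 2.0) + nu * i1e(nu / 2.0)
    return (math.sqrt(math.pi * nu) / (2.0 * gamma)) * bessel_term

print(gain_closed_form(4.0, 5.0), gain_ephraim_malah(4.0, 5.0))
```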
Based on the estimates of the amplitudes of the speech components for each of the plurality of frequency subbands for the frame, the attenuation filter 212 filters the plurality of frequency subbands. The attenuation filter 212 thus performs frequency domain filtering on the input signal, and the result is transformed back into the time domain using a frequency-to-time domain converter 218. The output of the frequency-to-time domain converter 218 is the noise-reduced output signal 220. The noise-reduced output signal 220 differs from the noisy speech sample 202 because frequencies of the noisy speech sample 202 determined to have high noise levels are suppressed in the noise-reduced output signal 220. In an example, the frequency-to-time domain converter 218 includes hardware and/or computer software for generating the noise-reduced output signal 220 based on an inverse Fourier transform operation.
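A minimal end-to-end sketch of the pipeline follows; the Hann window, 50% overlap, the noise-variance estimate taken from leading frames, and the crude a priori SNR estimate are all assumptions of this sketch and are not specified by the disclosure:

```python
import numpy as np

def noise_suppress(y: np.ndarray, frame_len: int = 512, hop: int = 256,
                   noise_frames: int = 10) -> np.ndarray:
    """Frequency-domain noise suppression with overlap-add resynthesis."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(y) - frame_len) // hop
    spectra = np.array([np.fft.rfft(window * y[m * hop : m * hop + frame_len])
                        for m in range(n_frames)])
    # Assumed noise estimate: mean power of the first few frames, per subband.
    lambda_N = np.mean(np.abs(spectra[:noise_frames]) ** 2, axis=0) + 1e-12

    out = np.zeros(len(y))
    for m, spec in enumerate(spectra):
        gamma = np.maximum(np.abs(spec) ** 2 / lambda_N, 1e-6)  # a posteriori SNR
        xi = np.maximum(gamma - 1.0, 1e-3)                      # crude a priori SNR
        nu = (xi / (1.0 + xi)) * gamma
        gain = np.sqrt(nu * (1.0 + nu)) / gamma                 # closed-form gain per subband
        frame = np.fft.irfft(gain * spec, n=frame_len)          # back to the time domain
        out[m * hop : m * hop + frame_len] += window * frame    # overlap-add synthesis
    return out
```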
The input Y 402 is an amplitude value for the particular frequency subband, where the particular frequency subband is part of a frequency domain representation of a noisy speech sample. The input Y 402 is similar to one of the amplitude values Yk 210, k=1, . . . , K, described above with reference to FIG. 2.
The output ÂN_MMSE 404 of the spectral amplitude estimator 400 is an estimated amplitude of the speech component of the particular subband. As illustrated in FIG. 4, determining the output ÂN_MMSE 404 is based on a minimization of a normalized mean squared error.
The output ÂN_MMSE 404 of the spectral amplitude estimator 400 is the value of Â that minimizes

$\frac{E[(A - \hat{A})^2 \mid Y]}{E[A \mid Y]\, E[\hat{A} \mid Y]},$ (Equation 1)

where $E[A \mid Y]\, E[\hat{A} \mid Y]$ is a term that normalizes the mean squared error represented by $E[(A - \hat{A})^2 \mid Y]$. Because the estimate Â is a deterministic function of the observed amplitude Y, $E[\hat{A} \mid Y] = \hat{A}$. The spectral amplitude estimator 400 of FIG. 4 is accordingly referred to herein as a normalized MMSE (N_MMSE) estimator.
To determine the value of  that minimizes Equation 1, the derivative of Equation 1 is taken with respect to  as follows:
Equation 2 is set equal to zero to determine a value of Â that minimizes Equation 1, as follows:

$\hat{A}^2 = E[A^2 \mid Y].$ (Equation 3)
Although the value Y is known (i.e., the value Y is the input 402 received by the spectral amplitude estimator), A is an unknown value representing the actual amplitude of the speech component, as noted above. Thus, an additional transformation of Equation 3 is used to eliminate the equation's dependence on A. In the additional transformation, because Â is always positive, Equation 3 is rewritten as

$\hat{A}_{N\_MMSE} = \sqrt{E[A^2 \mid Y]},$ (Equation 4)
where ÂN_MMSE is the value of  that minimizes Equation 1.
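The step from Equation 1 to Equation 4 can be checked numerically; in the sketch below, the conditional moments E[A|Y] and E[A²|Y] are given arbitrary assumed values, and the cost of Equation 1 is evaluated on a grid of candidate estimates:

```python
import numpy as np

m1, m2 = 1.5, 3.0  # assumed E[A|Y] and E[A^2|Y] for one subband (m2 >= m1**2)

A_hat = np.linspace(0.1, 5.0, 2000)
# Equation 1 with E[A_hat|Y] = A_hat (the estimate is a deterministic function of Y):
cost = (m2 - 2.0 * A_hat * m1 + A_hat ** 2) / (m1 * A_hat)

print(A_hat[np.argmin(cost)])  # ~1.732, the grid minimizer
print(np.sqrt(m2))             # 1.732..., i.e., sqrt(E[A^2|Y]) per Equation 4
```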
The expectation term of Equation 4 is evaluated as a function of an assumed probabilistic model and likelihood function. The assumed model utilizes asymptotic properties of the Fourier expansion coefficients. Specifically, the model assumes that the Fourier expansion coefficients of the speech and noise processes can be modeled as statistically independent Gaussian random variables. The mean of each coefficient is assumed to be zero, because the processes involved are assumed to have zero mean. The variance of each speech Fourier expansion coefficient is time-varying due to the non-stationarity of speech. Under this model and likelihood function, the expectation term of Equation 4 is evaluated as
$E[A^2 \mid Y] = \int_0^\infty A^2\, p(A \mid Y)\, dA.$ (Equation 5)
The term p(A|Y) is a probability density function of A given Y. Using Bayes' theorem, Equation 5 can be rewritten to include a probability density function of Y given A, as follows:

$E[A^2 \mid Y] = \frac{\int_0^\infty A^2\, p(Y \mid A)\, p(A)\, dA}{p(Y)}.$ (Equation 6)
Based on the assumed probabilistic model for speech and additive noise, the terms of Equation 6 are as follows:

$p(Y \mid A) = \frac{2Y}{\lambda_N} \exp\left( -\frac{Y^2 + A^2}{\lambda_N} \right) I_0\left( \frac{2AY}{\lambda_N} \right),$ (Equation 6.1)

$p(A) = \frac{2A}{\lambda_X} \exp\left( -\frac{A^2}{\lambda_X} \right),$ (Equation 6.2)

$p(Y) = \frac{2Y}{\lambda_X + \lambda_N} \exp\left( -\frac{Y^2}{\lambda_X + \lambda_N} \right),$ (Equation 6.3)
where I0 is the modified Bessel function of order zero, λN is a variance of the noise for the particular frequency subband being considered, and λX is a variance of the clean speech for the particular frequency subband. One or more assumptions regarding the probabilistic model of speech may be used in estimating the values of λN and λX. For example, it may be assumed that the clean speech follows a zero-mean Gaussian distribution with variance λX, and that the noise follows a zero-mean Gaussian distribution with variance λN. Equation 6.1 is a probability density function of Y given A, Equation 6.2 is a probability density function of A, and Equation 6.3 is a probability density function of Y. Substituting Equations 6.1, 6.2, and 6.3 into Equation 6 yields the following:

$E[A^2 \mid Y] = \frac{\frac{4Y}{\lambda_N \lambda_X} \exp\left( -\frac{Y^2}{\lambda_N} \right) \int_0^\infty A^3\, e^{-\alpha A^2}\, I_0(\beta A)\, dA}{p(Y)},$ (Equation 7)

where

$\alpha = \frac{1}{\lambda_X} + \frac{1}{\lambda_N},$ (Equation 8)

$\beta = \frac{2Y}{\lambda_N}.$ (Equation 9)
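As an aside, the assumed model of Equations 6.1 through 6.3 can be sanity-checked by simulation; in the sketch below, the complex speech and noise coefficients are drawn as zero-mean Gaussians with assumed variances λX and λN, and the second moment of the noisy amplitude Y is compared against the value λX+λN implied by Equation 6.3:

```python
import numpy as np

rng = np.random.default_rng(0)
lam_X, lam_N, n = 2.0, 0.5, 200_000  # assumed variances and sample count

def complex_gaussian(lam: float, n: int) -> np.ndarray:
    """Zero-mean complex Gaussian samples with total variance lam."""
    return (rng.normal(0.0, np.sqrt(lam / 2), n)
            + 1j * rng.normal(0.0, np.sqrt(lam / 2), n))

Y = np.abs(complex_gaussian(lam_X, n) + complex_gaussian(lam_N, n))
# Under the Equation 6.3 density, E[Y^2] = lam_X + lam_N.
print(np.mean(Y ** 2), lam_X + lam_N)  # both approximately 2.5
```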
The integral in Equation 7 can be calculated based on the following formulas:

$\int_0^\infty x^{\mu - 1}\, e^{-\alpha x^2}\, I_0(\beta x)\, dx = \frac{\Gamma(\mu/2)}{2\, \alpha^{\mu/2}}\, \Phi\left( \frac{\mu}{2}, 1; \frac{\beta^2}{4\alpha} \right).$ (Equation 10)
Specifically, using the above formula with μ = 4, the integral of Equation 7 is rewritten as follows:

$\int_0^\infty A^3\, e^{-\alpha A^2}\, I_0(\beta A)\, dA = \frac{\Gamma(2)}{2\, \alpha^2}\, \Phi\left( 2, 1; \frac{\beta^2}{4\alpha} \right),$

where Γ is the gamma function and ₁F₁ (written Φ herein) is the confluent hypergeometric function. The gamma function is defined as
$\Gamma(z) = \int_0^\infty e^{-t}\, t^{z-1}\, dt \quad [\operatorname{Re} z > 0].$ (Equation 10.1)
Some particular values of the gamma function are
$\Gamma(2) = \Gamma(1) = 1.$ (Equation 11)
The confluent hypergeometric function is defined by a series expansion as follows:

$\Phi(\alpha, \gamma; z) = 1 + \frac{\alpha}{\gamma} \frac{z}{1!} + \frac{\alpha(\alpha + 1)}{\gamma(\gamma + 1)} \frac{z^2}{2!} + \frac{\alpha(\alpha + 1)(\alpha + 2)}{\gamma(\gamma + 1)(\gamma + 2)} \frac{z^3}{3!} + \cdots$ (Equation 11.1)
In Equation 11.1, Φ(α, γ; z) is equivalent to ₁F₁(α; γ; z). Changing the notation of the confluent hypergeometric function as shown in Equation 11.1 and substituting Equations 10 and 11 into Equation 7 yields the following:

$E[A^2 \mid Y] = \frac{1}{\alpha}\, e^{-\beta^2/(4\alpha)}\, \Phi\left( 2, 1; \frac{\beta^2}{4\alpha} \right).$ (Equation 12)
The confluent hypergeometric function has the property $\Phi(\alpha, \gamma; z) = e^z\, \Phi(\gamma - \alpha, \gamma; -z)$. Using this property, Equation 12 is rewritten as follows:

$E[A^2 \mid Y] = \frac{1}{\alpha}\, \Phi\left( -1, 1; -\frac{\beta^2}{4\alpha} \right).$ (Equation 13)
Parameters α and β, defined in Equations 8 and 9, respectively, are rewritten in terms of the a priori signal-to-noise ratio (SNR) ξ of the particular frequency subband, the a posteriori SNR γ of the particular subband, and a parameter ν for the particular frequency subband. Equations 14, 15, and 16 define the a priori SNR ξ, the a posteriori SNR γ, and the parameter ν for the particular frequency subband, respectively, and Equations 17 and 18 rewrite the parameters α and β in terms of ξ, γ, and ν:

$\xi = \frac{\lambda_X}{\lambda_N},$ (Equation 14)

$\gamma = \frac{Y^2}{\lambda_N},$ (Equation 15)

$\nu = \frac{\xi}{1 + \xi}\, \gamma,$ (Equation 16)

$\alpha = \frac{1}{\lambda_X} + \frac{1}{\lambda_N} = \frac{\gamma^2}{\nu\, Y^2},$ (Equation 17)

$\beta = \frac{2Y}{\lambda_N} = \frac{2\gamma}{Y}.$ (Equation 18)
Using the notation for parameters α and β as shown in Equations 17 and 18, Equation 13 is rewritten as follows:

$E[A^2 \mid Y] = \frac{\nu}{\gamma^2}\, Y^2\, \Phi(-1, 1; -\nu).$ (Equation 19)
Based on Equation 11.1, the series expansion of Φ(−1, 1; −ν) in Equation 19 simplifies to the following:

$\Phi(-1, 1; -\nu) = 1 + \nu.$ (Equation 20)

Because the first parameter is −1, every term of the series beyond the first two contains the factor (−1 + 1) = 0, so the expansion terminates after two terms and is an exact finite polynomial rather than an infinite series.
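Both the Kummer property stated above and the truncation in Equation 20 can be verified numerically; the sketch below uses scipy's hyp1f1, which implements the ₁F₁ function equated with Φ in Equation 11.1:

```python
import numpy as np
from scipy.special import hyp1f1  # 1F1(a; b; z), i.e., Phi(a, b; z)

for nu in np.linspace(0.1, 10.0, 5):
    lhs = hyp1f1(-1.0, 1.0, -nu)                 # Phi(-1, 1; -nu)
    kummer = np.exp(-nu) * hyp1f1(2.0, 1.0, nu)  # e^{-nu} * Phi(2, 1; nu)
    print(lhs, kummer, 1.0 + nu)                 # all three columns agree
```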
Substituting the expansion of Equation 20 into Equation 19 yields the following:

$E[A^2 \mid Y] = \frac{\nu (1 + \nu)}{\gamma^2}\, Y^2.$ (Equation 21)
By inserting Equation 21 into Equation 4, the equation for the value of Â that minimizes Equation 1 is rewritten as follows:

$\hat{A}_{N\_MMSE} = \frac{\sqrt{\nu (1 + \nu)}}{\gamma}\, Y.$ (Equation 22)
In Equation 22, the term $\sqrt{\nu(1+\nu)}/\gamma$ is a gain function $G_{N\_MMSE}$ that is applied to the amplitude Y of the subband, such that Equation 22 can be rewritten as

$\hat{A}_{N\_MMSE} = G_{N\_MMSE}\, Y.$ (Equation 23)
The value ÂN_MMSE from Equations 22 and 23 is the output 404 of the spectral amplitude estimator 400 and is equal to the estimated amplitude of the speech component of the particular subband. The calculation of the value ÂN_MMSE is performed for each subband of the plurality of frequency subbands corresponding to a frame of the input signal. Based on the estimates of the amplitudes of the speech components for each of the frequency subbands of the frame, the plurality of frequency subbands are filtered. Thus, as explained above with reference to FIG. 2, frequencies of the noisy speech sample determined to have high noise levels are suppressed in the noise-reduced output signal.
It should be appreciated that the spectral amplitude estimator 400 of FIG. 4, because it requires no evaluation of exponential, gamma, or modified Bessel functions, can be implemented efficiently in hardware.
This written description uses examples to disclose the invention, including the best mode, and also to enable a person skilled in the art to make and use the invention. The patentable scope of the invention includes other examples. Additionally, the methods and systems described herein may be implemented on many different types of processing devices by program code comprising program instructions that are executable by the device processing subsystem. The software program instructions may include source code, object code, machine code, or any other stored data that is operable to cause a processing system to perform the methods and operations described herein. Other implementations may also be used, however, such as firmware or even appropriately designed hardware configured to carry out the methods and systems described herein.
The systems' and methods' data (e.g., associations, mappings, data input, data output, intermediate data results, final data results, etc.) may be stored and implemented in one or more different types of computer-implemented data stores, such as different types of storage devices and programming constructs (e.g., RAM, ROM, Flash memory, flat files, databases, programming data structures, programming variables, IF-THEN (or similar type) statement constructs, etc.). It is noted that data structures describe formats for use in organizing and storing data in databases, programs, memory, or other computer-readable media for use by a computer program.
The computer components, software modules, functions, data stores and data structures described herein may be connected directly or indirectly to each other in order to allow the flow of data needed for their operations. It is also noted that a module or processor includes but is not limited to a unit of code that performs a software operation, and can be implemented for example as a subroutine unit of code, or as a software function unit of code, or as an object (as in an object-oriented paradigm), or as an applet, or in a computer script language, or as another type of computer code. The software components and/or functionality may be located on a single computer or distributed across multiple computers depending upon the situation at hand.
It should be understood that as used in the description herein and throughout the claims that follow, the meaning of “a,” “an,” and “the” includes plural reference unless the context clearly dictates otherwise. Also, as used in the description herein and throughout the claims that follow, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise. Further, as used in the description herein and throughout the claims that follow, the meaning of “each” does not require “each and every” unless the context clearly dictates otherwise. Finally, as used in the description herein and throughout the claims that follow, the meanings of “and” and “or” include both the conjunctive and disjunctive and may be used interchangeably unless the context expressly dictates otherwise; the phrase “exclusive of” may be used to indicate situations where only the disjunctive meaning may apply.
This disclosure claims priority to U.S. Provisional Patent Application No. 61/916,622, filed on Dec. 16, 2013, which is incorporated herein by reference in its entirety.
Ephraim et al., “Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-32, no. 6, Dec. 1984, pp. 1109-1121.
Wolfe et al., “Efficient Alternatives to the Ephraim and Malah Suppression Rule for Audio Signal Enhancement,” EURASIP Journal on Applied Signal Processing, vol. 2003, no. 10, 2003, pp. 1043-1051.