The invention relates generally to digital audio signal processing, and more particularly relates to noise suppression in voice or speech signals.
Noise suppression (NS) of speech signals can be useful to many applications. In cellular telephony, for example, noise suppression can be used to remove background noise to provide more readily intelligible speech from calls made in noisy environments. Likewise, noise suppression can improve perceptual quality and speech intelligibility in teleconferencing, voice chat in on-line games, Internet-based voice messaging and voice chat, and other like communications applications. The input audio signal is typically noisy for these applications since the recording environment is less than ideal. Further, noise suppression can improve compression performance when used prior to coding or compression of voice signals (e.g., via the Windows Media Voice codec, and other like codecs). Noise suppression also can be applied prior to speech recognition to improve recognition accuracy.
There are several well-known techniques for noise suppression in speech signals, such as spectral subtraction and Minimum Mean Square Error (MMSE) estimation. Almost all of these known techniques suppress the noise by applying a spectral gain G(m, k), based on an estimate of the noise in the speech signal, to each short-time spectrum value S(m, k) of the speech signal, where m is the frame number and k is the spectrum index. (See, e.g., S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. Acoustics, Speech and Signal Processing, ASSP-27(2), April 1979; and Rainer Martin, “Noise Power Spectral Density Estimation Based on Optimal Smoothing and Minimum Statistics,” IEEE Transactions on Speech and Audio Processing, Vol. 9, No. 5, pp. 504-512, July 2001.) A very low spectral gain is applied to spectrum values estimated to contain noise, so as to suppress the noise in the signal.
Unfortunately, noise suppression may introduce artificial distortions (audible “artifacts”) into the speech signal, typically because the spectral gain applied by the noise suppression is either too great (removing more than the noise) or too small (failing to remove the noise completely). One artifact from which many NS techniques suffer is called musical noise, where the NS technique introduces an artifact perceived as a melodic audio signal pattern that was not present in the input. In some cases, this musical noise can become noticeable and distracting, in addition to being an inaccurate representation of the speech present in the input signal.
In a speech noise suppressor implementation described herein, a novel gain-constrained technique is introduced to improve noise suppression precision and thereby reduce the occurrence of musical noise artifacts. The technique estimates the noise spectrum during speech, and not just during pauses in speech, so that the noise estimate remains accurate during long speech periods. Further, noise estimation smoothing is used to achieve a better noise estimate. Listening tests show that these gain-constrained noise suppression and noise estimation smoothing techniques significantly improve the voice quality of speech signals.
The gain-constrained noise suppression and smoothed noise estimation techniques can be used in noise suppressor implementations that operate by applying a spectral gain G(m, k) to each short-time spectrum value S(m, k). Here m is the frame number and k is the spectrum index.
More particularly, in one example noise suppressor implementation, the input voice signal is divided into frames. An analysis window is applied to each frame, and then the signal is converted into a frequency domain signal S(m, k) using the Fast Fourier Transform (FFT). The spectrum values are grouped into N bins for further processing. A noise characteristic is estimated for each bin when it is classified as being a noise bin. An energy parameter is smoothed in both the time domain and the frequency domain to get a better noise estimate per bin. The gain factors G(m, k) are calculated based on the current signal spectrum and the noise estimate. A gain smoothing filter is applied to smooth the gain factors before they are applied to the signal spectral values S(m, k). The modified signal spectrum is then converted into the time domain for output.
The gain smoothing filter performs two steps to smooth the gain factors before they are applied to the spectrum values. First, a noisy factor ξ(m)∈[0,1] is computed for the current frame, determined from the ratio of the number of noise bins to the total number of bins. A zero-valued noisy factor (ξ(m)=0) means a single constant gain is used for all the spectrum values, whereas a noisy factor of ξ(m)=1 means no smoothing at all. Then, this noisy factor is used to alter the gain factors G(m, k) to produce smoothed gain factors GS(m, k). In the example noise suppressor implementation, this is done by applying the FFT to G(m, k) and then cutting off the high frequency components.
Additional features and advantages of the invention will be made apparent from the following detailed description of embodiments that proceeds with reference to the accompanying drawings.
The following description is directed to gain-constrained noise suppression techniques for use in audio or speech processing systems. As illustrated in
1. Illustrated Embodiment
At a pre-emphasis stage 220, this input speech signal (x(i)) is processed to emphasize speech, e.g., via high-pass filtering (although other forms of emphasis can alternatively be used). First, framing is performed to group the speech signal samples into frames of a preset length, N, which may be 160 samples. The framed speech signal is denoted as x(m,n), where m is the frame number, and n is the number of the sample within the frame. A suitable high-pass filtering for emphasis can be represented in the following formula:
H(z)=1+βz⁻¹
with a suitable value of β being −0.8. This high pass filter can be realized by calculating the emphasized speech signal, xh(m,n), as a weighted moving average of the corresponding sample of the input speech signal with its immediately preceding sample, as in the following equation:
xh(m,n)=x(m,n)+βx(m,n−1)
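As a concrete illustration, the pre-emphasis filter above can be sketched in Python as follows (the treatment of the first sample, which in practice depends on the previous frame, is simplified here):

```python
import numpy as np

def pre_emphasize(x, beta=-0.8):
    """High-pass pre-emphasis: x_h(m,n) = x(m,n) + beta * x(m,n-1).

    For simplicity, the sample preceding the frame is taken as 0 here;
    in practice it would be the last sample of the previous frame.
    """
    x = np.asarray(x, dtype=float)
    x_prev = np.concatenate(([0.0], x[:-1]))  # x(m, n-1)
    return x + beta * x_prev

# A constant (DC) signal is strongly attenuated by the high-pass filter.
emphasized = pre_emphasize(np.ones(160))
```

With β = −0.8, a constant input of 1.0 is reduced to 0.2 from the second sample onward, reflecting the filter's attenuation of low-frequency content.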
A windowing function 300 (shown in
This windowing function is multiplied by an overlapped frame (xw) of the emphasized (high-pass filtered) signal, xh(m,n−Lw), given by:
The multiplication produces a windowed signal, sw(m,n), as in the following equation:
sw(m,n)=xw(n)w(n), 0≤n<L
After windowing, the speech signal is transformed via a frequency analysis (e.g., using the Fast Fourier Transform (FFT) 240 or other like transform) to the frequency domain. This yields a set of spectral coefficients or frequency spectrum for each frame of the signal, as shown in the following equation:
S(m,k)=FFTL(sw(m,n))
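The windowing and FFT analysis can be sketched as follows; the actual analysis window w(n) is not reproduced above, so a Hann window and a window length of L = 320 (two overlapped 160-sample frames) are assumed here purely for illustration:

```python
import numpy as np

L = 320                                    # assumed FFT/window length
n = np.arange(L)
w = 0.5 - 0.5 * np.cos(2 * np.pi * n / L)  # assumed Hann analysis window

xw = np.random.default_rng(0).standard_normal(L)  # stand-in overlapped frame
sw = xw * w                                # s_w(m,n) = x_w(n) w(n)
S = np.fft.fft(sw)                         # S(m,k) = FFT_L(s_w(m,n))
```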
The spectral coefficients are complex values, and thus represent both the spectral amplitude (SA) and phase (SP) of the speech signal according to the following relationships:
SA(m,k)=|S(m,k)|
SP(m,k)=tan⁻¹(Im S(m,k)/Re S(m,k))
The spectral amplitude is analyzed in the following process to provide a more accurate estimate of the gain to be used in noise suppression, whereas the phase is preserved for use in the inverse FFT.
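In code, the amplitude/phase split and its inverse can be sketched as:

```python
import numpy as np

# Split a complex spectrum into amplitude and phase, then rebuild it.
# np.angle computes the same phase as tan^-1(Im/Re), with correct
# quadrant handling.
S = np.array([3 + 4j, 1 - 1j, -2 + 0j])   # example spectral coefficients
SA = np.abs(S)                             # S_A(m,k) = |S(m,k)|
SP = np.angle(S)                           # S_P(m,k)
S_rebuilt = SA * np.exp(1j * SP)           # lossless reconstruction
```

Because the split is lossless, preserving the phase while modifying only the amplitude (as the gain filter does) fully determines the modified spectrum.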
At stages 250-251, frequency and time domain smoothing is performed on the energy bands of the spectrum for each frame. A sliding window smoothing in the frequency domain is first performed, as in the following equation:
This is followed by a time domain smoothing given by the following equation:
where
Here, the value of γ is a parameter that can be varied to control the amount of smoothing. In particular, as the value of γ approaches the ratio (N/Fs), α goes to zero, resulting in less smoothing when the above time domain smoothing is applied. On the other hand, as the value is made larger (γ→∞), α approaches unity, producing greater smoothing.
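The two smoothing stages can be sketched as follows. Since the exact equations are not reproduced above, both forms here are assumptions: a 3-bin sliding average in frequency, and a first-order recursive smoother in time whose coefficient α is chosen to match the stated limits (α → 0 as γ → N/Fs, α → 1 as γ → ∞):

```python
import numpy as np

def smooth_in_frequency(E):
    """Assumed 3-bin sliding-window average over the bin energies E(m,k)."""
    E = np.asarray(E, dtype=float)
    padded = np.pad(E, 1, mode='edge')
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0

def smoothing_alpha(gamma, N=160, Fs=8000):
    """Assumed mapping with the stated limit behavior."""
    return max(0.0, 1.0 - N / (gamma * Fs))

def smooth_in_time(prev_smoothed, freq_smoothed, alpha):
    # SS(m,k) = alpha * SS(m-1,k) + (1 - alpha) * E_smoothed(m,k)
    return alpha * prev_smoothed + (1.0 - alpha) * freq_smoothed

alpha = smoothing_alpha(gamma=0.1)   # illustrative 0.1-second constant
```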
Stages 260 and 261 calculate the frame energy and historical lowest energy, respectively. The frame energy is calculated from the following equation:
The historical lowest energy is given by:
where M is a constant parameter, typically corresponding to 1 or 2 seconds of signal.
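These two quantities can be sketched as follows; the exact equations are not reproduced above, so the frame energy is assumed here to be the mean bin energy, and the historical minimum is taken over a sliding window of the most recent M frames:

```python
import numpy as np

def frame_energy(SS_frame):
    """Assumed frame energy SE(m): mean of the per-bin energies."""
    return float(np.mean(SS_frame))

def historical_lowest(energy_history, M=100):
    """Assumed Smin(m): minimum frame energy over the last M frames."""
    return min(energy_history[-M:])

SE = frame_energy(np.array([1.0, 3.0, 2.0]))
S_min = historical_lowest([5.0, 2.5, 4.0], M=3)
```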
At an update checking stage 262, the noise suppressor 120 judges whether to update noise statistics of the speech signal that are tracked on a frequency bin basis. The noise suppressor 120 groups the spectrum values of the speech signal frames into a number of frequency bins. In the illustrated implementation, the spectrum values (k) are grouped one spectrum value per frequency bin. However, in alternative implementations, various other groupings of the frames' spectrum values into frequency bins can be made, such as more than one spectrum value per frequency bin, or non-uniform groupings of spectrum values into frequency bins.
First, in determining whether to reset the noise statistics, the noise suppressor checks (decision 410) whether the frame energy is below a first threshold multiple (λ1) of the historical lowest energy for the speech signal (which generally indicates a pause in speech), as shown in the following equation:
SE(m)<λ1Smin(m)
If so (at block 415), the noise suppressor sets a reset flag for the frame to one (R(m)=1), which indicates the noise statistics are to be reset in the current frame.
Otherwise, the noise suppressor proceeds to check whether to update the frequency bins. For this check (decision 420), the noise suppressor checks whether the frame energy is below a second (higher) threshold multiple (λ2) of the historical lowest energy (which generally indicates a continuing speech pause), as in the following equation:
SE(m)<λ2Smin(m)
If so, the noise suppressor sets the update flags for the frame's frequency bins to one (i.e., U(m,k)=1).
Otherwise (inside “for” loop blocks 430, 460), the noise suppressor makes determination on a per frequency bin basis whether to update the respective frequency bin. For each frequency bin, the noise suppressor checks whether the frame energy is lower than a function of the noise mean and noise variance of the respective frequency bin in the preceding frame (decision 440), as shown in the following equation:
log SE(m)<SM(m−1,k)+λ3√(SV(m−1,k))
If the logarithmic frame energy is lower than this threshold function of the noise mean and variance of the frequency bin in the preceding frame, then the noise suppressor sets the update flag for the frequency bin to one (U(m,k)=1) at block 445. Otherwise, the update flag for the frequency bin is set to zero (U(m,k)=0), indicating no update.
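The reset/update decision logic of decisions 410, 420, and 440 can be sketched as follows; the threshold values λ1, λ2, and λ3 here are illustrative assumptions, not values from the implementation:

```python
import numpy as np

LAMBDA1, LAMBDA2, LAMBDA3 = 1.2, 2.0, 1.5   # illustrative thresholds

def noise_update_flags(SE, S_min, SM_prev, SV_prev):
    """Return (reset flag R(m), per-bin update flags U(m,k)) for one frame.

    SE: frame energy; S_min: historical lowest energy;
    SM_prev, SV_prev: per-bin noise mean/variance of the previous frame.
    """
    if SE < LAMBDA1 * S_min:                 # deep pause: reset statistics
        return 1, np.ones_like(SM_prev, dtype=int)
    if SE < LAMBDA2 * S_min:                 # continuing pause: update all
        return 0, np.ones_like(SM_prev, dtype=int)
    # Otherwise, decide bin by bin:
    # log SE(m) < SM(m-1,k) + lambda3 * sqrt(SV(m-1,k))
    U = (np.log(SE) < SM_prev + LAMBDA3 * np.sqrt(SV_prev)).astype(int)
    return 0, U

r1, U1 = noise_update_flags(0.5, 1.0, np.zeros(3), np.zeros(3))
r2, U2 = noise_update_flags(3.0, 1.0, np.array([2.0, 0.0]), np.zeros(2))
```

The frame-level checks short-circuit the per-bin loop: a deep pause resets everything, a continuing pause marks every bin for update, and otherwise each bin is tested individually against its own noise statistics.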
With reference again to
SM(m,k)=logSS(m,k)
Otherwise, if the reset flag for the frame is not set (R(m)≠1), the noise suppressor updates the noise mean for the frequency bins according to their update flags. In “for” loop 520, 550, the noise suppressor checks the update flag of each frequency bin (decision 530). If the update flag is set (U(m,k)=1), the noise mean for the frequency bin is updated as a weighted sum of the noise mean of the frequency bin in the preceding frame and the speech signal of the frequency bin in the present frame, as shown in the following equation:
SM(m,k)=αMSM(m−1,k)+(1−αM)logSS(m,k)
Otherwise, the noise mean of the frequency bin is not updated, and therefore carried forward from the preceding frame, as in the following equation:
SM(m,k)=SM(m−1,k)
Similarly, if the reset flag for the frame is set (R(m)=1), the noise variance of each frequency bin is reinitialized from the current frame, as in the following equation:

SV(m,k)=|logSS(m,k)−SM(m,k)|²
Otherwise, if the reset flag for the frame is not set (R(m)≠1), the noise suppressor updates the noise variance for the frequency bins according to their update flags. In “for” loop 620, 650, the noise suppressor checks the update flag of each frequency bin (decision 630). If the update flag is set (U(m,k)=1), the noise variance for the frequency bin is updated as a weighted function of the noise variance of the frequency bin in the preceding frame and that of the speech signal of the frequency bin in the present frame, as shown in the following equation:
SV(m,k)=αVSV(m−1,k)+(1−αV)|logSS(m,k)−SM(m,k)|²
Otherwise, the noise variance of the frequency bin is not updated, and therefore carried forward from the preceding frame, as in the following equation:
SV(m,k)=SV(m−1,k)
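The mean and variance updates across both branches can be sketched together; the smoothing constants αM and αV here are illustrative assumptions:

```python
import numpy as np

ALPHA_M, ALPHA_V = 0.9, 0.9   # assumed smoothing constants

def update_noise_stats(SM_prev, SV_prev, logSS, U, reset):
    """Per-bin noise mean/variance update for one frame."""
    if reset:                  # R(m) = 1: reinitialize from the current frame
        SM = logSS.copy()      # SM(m,k) = log SS(m,k)
        SV = np.abs(logSS - SM) ** 2        # zero right after a reset
        return SM, SV
    # SM(m,k) = alpha_M SM(m-1,k) + (1 - alpha_M) log SS(m,k)  where U = 1
    SM = np.where(U == 1, ALPHA_M * SM_prev + (1 - ALPHA_M) * logSS, SM_prev)
    # SV(m,k) = alpha_V SV(m-1,k) + (1 - alpha_V) |log SS(m,k) - SM(m,k)|^2
    SV = np.where(U == 1,
                  ALPHA_V * SV_prev + (1 - ALPHA_V) * np.abs(logSS - SM) ** 2,
                  SV_prev)
    return SM, SV

SM, SV = update_noise_stats(np.zeros(2), np.ones(2),
                            np.array([1.0, 1.0]), np.array([1, 0]), reset=False)
```

Bins whose update flag is zero simply carry their statistics forward, which is what keeps the noise estimate stable through active-speech bins.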
With reference again to
In a Signal-to-Noise Ratio (SNR) gain filter stage 270, the noise suppressor initially calculates the SNR of the frequency bins, as in the following equation:
The noise suppressor then uses the SNR to calculate the gain factors for the gain filter, as follows:
In a gain smoothing stage 271, the noise suppressor then smoothes the gain factors according to a calculation of the “noisy”-ness (herein termed a “noisy factor”) of the frame, where a stronger smoothing is applied to more noisy frames than is applied to speech frames. The noise suppressor calculates a noise ratio for the frame as a ratio of the number of noisy frequency bins (i.e., the bins flagged for update) to the total number of bins, as follows:
The noise suppressor then calculates a smoothing factor for the frame (clamped to the range 0 to 1), as follows:
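Since the exact equations for the noise ratio and the smoothing factor are not reproduced above, the following sketch assumes the smoothing factor is simply the clamped noise ratio:

```python
import numpy as np

def smoothing_factor(U):
    """U: per-bin update flags for a frame (1 = noisy bin).

    Returns M(m), assumed here to be the noise ratio (noisy bins over
    total bins) clamped to the range [0, 1].
    """
    noise_ratio = np.count_nonzero(U) / len(U)
    return min(1.0, max(0.0, noise_ratio))
```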
In this implementation, the noise suppressor applies smoothing in the frequency domain, using the FFT to transform the gain filter to the frequency domain. For the frequency domain transform, the noise suppressor calculates a set of expanded gain factors (G′(m,k)) from the gain factors (G(m,k)), as follows:
where K is the number of frequency bins. L is typically 2K. The expanded gain factors thus effectively copy the gain factors from 0 to K−1, and copy a mirror image of the gain factors from K to L−1.
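The mirror construction can be sketched as:

```python
import numpy as np

# Expanded gain vector G'(m,k): the K gain factors are copied to
# positions 0..K-1 and their mirror image fills positions K..L-1,
# so the length-L sequence (L = 2K) is symmetric.
def expand_gains(G):
    G = np.asarray(G, dtype=float)
    return np.concatenate([G, G[::-1]])

G = np.array([0.1, 0.5, 0.9])
G_exp = expand_gains(G)
```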
The noise suppressor then calculates a gain spectrum (g(Λ̄)) via the FFT of the expanded gain factors, as follows:
g(Λ̄)=FFT(G′(m,k))
The FFT produces spectrum coefficients having complex values, from which amplitude and phase of the gain spectrum are calculated as follows:
gA(Λ̄)=|g(Λ̄)|

gP(Λ̄)=tan⁻¹(Im g(Λ̄)/Re g(Λ̄))
The noise suppressor then smoothes the gain filter by zeroing high frequency components of the gain spectrum. It retains the gain spectrum coefficients up to a cutoff number determined by the smoothing factor (M(m)), and zeroes the components above that number, according to the following equation:
Ng=roundoff[(1−M(m))(K−1)]+1
such that,
An inverse FFT is then applied to this reduced gain spectrum to produce the smoothed gain filter, by:
GS(m,k)=IFFT(g′A(Λ̄),gP(Λ̄))
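Putting the expansion, FFT, truncation, and inverse FFT together, the smoothing can be sketched as follows (the conjugate-symmetric partners of the retained coefficients are kept so that the inverse transform stays real-valued):

```python
import numpy as np

def smooth_gains(G, M):
    """FFT-based smoothing of K gain factors G with smoothing factor M."""
    G = np.asarray(G, dtype=float)
    K = len(G)
    G_exp = np.concatenate([G, G[::-1]])          # G'(m,k), length L = 2K
    g = np.fft.fft(G_exp)                         # gain spectrum
    Ng = int(round((1.0 - M) * (K - 1))) + 1      # coefficients to keep
    g_smooth = np.zeros_like(g)
    g_smooth[:Ng] = g[:Ng]                        # keep low frequencies
    if Ng > 1:
        g_smooth[-(Ng - 1):] = g[-(Ng - 1):]      # conjugate partners
    return np.fft.ifft(g_smooth).real[:K]         # smoothed G_S(m,k)

G = np.array([0.2, 0.8, 0.4, 0.6])       # example gain factors (K = 4)
G_smooth_none = smooth_gains(G, 0.0)      # M = 0: gains pass unchanged
G_smooth_full = smooth_gains(G, 1.0)      # M = 1: constant average gain
```

The two endpoint behaviors match the limits described next: at M(m)=0 all coefficients survive and the gains pass through unchanged, while at M(m)=1 only the DC coefficient survives and every bin receives the average gain.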
This FFT based smoothing effectively produces little or no smoothing for a smoothing factor near zero (e.g., with no or few “noisy” frequency bins marked by the update flag in the frame), and smoothes the gain filter toward a constant value as the smoothing factor approaches one (e.g., with all or nearly all “noisy” bins). Accordingly, for a zero smoothing factor (M(m)=0), the smoothed gain filter is:
Gs(m,k)=G(m,k)
Whereas, for a smoothing factor equal to one (M(m)=1), only the DC coefficient of the gain spectrum is retained, so the smoothed gain filter reduces to the constant average gain:

GS(m,k)=(1/K)ΣjG(m,j)
At a next stage 272, the noise suppressor applies the resulting smoothed gain filter to the spectral amplitude of speech signal frame, as follows:
S′A(m,k)=SA(m,k)Gs(m,k)
As a result of the noise statistic estimation and smoothing processes, the gain factors applied to noisy bins should be much lower than those applied to non-noise frequency bins, such that noise in the speech signal is suppressed.
At stage 280, the noise suppressor applies the inverse transform to the spectrum of the speech signal as modified by the gain filter, as follows:
yo(m,n)=IFFTL(S′A(m,k),SP(m,k))
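The gain application and resynthesis of stages 272 and 280 can be sketched as follows, with a uniform stand-in gain in place of the computed smoothed gain filter:

```python
import numpy as np

rng = np.random.default_rng(1)
frame = rng.standard_normal(64)            # stand-in windowed frame
S = np.fft.fft(frame)
SA, SP = np.abs(S), np.angle(S)            # amplitude and phase

Gs = np.full(64, 0.5)                      # stand-in smoothed gain filter
SA_mod = SA * Gs                           # S'_A(m,k) = S_A(m,k) G_s(m,k)

# Resynthesize with the original phase: y_o(m,n) = IFFT_L(S'_A, S_P)
y = np.fft.ifft(SA_mod * np.exp(1j * SP)).real
```

With a uniform gain of 0.5, the output is simply the input frame scaled by 0.5, confirming that the phase-preserving resynthesis introduces no distortion of its own.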
An inverse of the overlap and pre-emphasis (high-pass filtering) are then applied at stages 281, 282 to produce the final output 290 of the noise suppressor, as per the following formulas:
2. Computing Environment
The above-described noise suppression system 100 (
With reference to
A computing environment may have additional features. For example, the computing environment (700) includes storage (740), one or more input devices (750), one or more output devices (760), and one or more communication connections (770). An interconnection mechanism (not shown) such as a bus, controller, or network interconnects the components of the computing environment (700). Typically, operating system software (not shown) provides an operating environment for other software executing in the computing environment (700), and coordinates activities of the components of the computing environment (700).
The storage (740) may be removable or non-removable, and includes magnetic disks, magnetic tapes or cassettes, CD-ROMs, CD-RWs, DVDs, or any other medium which can be used to store information and which can be accessed within the computing environment (700). The storage (740) stores instructions for the software (780) implementing the gain-constrained noise suppression processing 200 (
The input device(s) (750) may be a touch input device such as a keyboard, mouse, pen, or trackball, a voice input device, a scanning device, or another device that provides input to the computing environment (700). For audio, the input device(s) (750) may be a sound card or similar device that accepts audio input in analog or digital form, or a CD-ROM reader that provides audio samples to the computing environment. The output device(s) (760) may be a display, printer, speaker, CD-writer, or another device that provides output from the computing environment (700).
The communication connection(s) (770) enable communication over a communication medium to another computing entity. The communication medium conveys information such as computer-executable instructions, compressed audio or video information, or other data in a modulated data signal. A modulated data signal is a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media include wired or wireless techniques implemented with an electrical, optical, RF, infrared, acoustic, or other carrier.
The gain-constrained noise suppression techniques herein can be described in the general context of computer-readable media. Computer-readable media are any available media that can be accessed within a computing environment. By way of example, and not limitation, with the computing environment (700), computer-readable media include memory (720), storage (740), communication media, and combinations of any of the above.
The gain-constrained noise suppression techniques herein can be described in the general context of computer-executable instructions, such as those included in program modules, being executed in a computing environment on a target real or virtual processor. Generally, program modules include routines, programs, libraries, objects, classes, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The functionality of the program modules may be combined or split between program modules as desired in various embodiments. Computer-executable instructions for program modules may be executed within a local or distributed computing environment.
For the sake of presentation, the detailed description uses terms like “determine,” “generate,” “adjust,” and “apply” to describe computer operations in a computing environment. These terms are high-level abstractions for operations performed by a computer, and should not be confused with acts performed by a human being. The actual computer operations corresponding to these terms vary depending on implementation.
In view of the many possible embodiments to which the principles of our invention may be applied, we claim as our invention all such embodiments as may come within the scope and spirit of the following claims and equivalents thereto.