Speech enhancement involves processing either degraded speech signals or clean speech that is expected to be degraded in the future, where the goal of processing is to improve the quality and intelligibility of speech for the human listener. Though it is possible to enhance speech that is not degraded, such as by high pass filtering to increase perceived crispness and clarity, some of the most significant contributions that can be made by speech enhancement techniques are in reducing noise degradation of the signal. The applications of speech enhancement are numerous. Examples include correction for room reverberation effects, reduction of noise in speech to improve vocoder performance, and improvement of un-degraded speech for people with impaired hearing. The degradation can take forms as different as room echoes, additive random noise, multiplicative or convolutional noise, and competing speakers. Approaches differ, depending on the context of the problem. One significant problem is that of speech degraded by additive random noise, particularly in the context of a Harmonic Excitation Linear Predictive Speech Coder (HE-LPC).
The selection of the error criteria by which speech enhancement systems are optimized and compared is of central importance, but there is no absolute best set of criteria. Ultimately, the selected criteria must relate to the subjective evaluation by a human listener, and should take into account traits of auditory perception. An example of a system that exploits certain perceptual aspects of speech is that developed by Drucker, as described in “Speech Processing in a High Ambient Noise Environment”, IEEE Trans. on Audio and Electroacoustics, Vol. AU-16, pp. 165-168, June 1968. Based on experimental findings, Drucker concluded that a primary cause of intelligibility loss in speech degraded by wide-band noise is confusion between fricative and plosive sounds, which is partially due to a loss of the short pauses immediately before the plosive sounds. Drucker reports a significant improvement in intelligibility after high pass filtering the /s/ fricative and inserting short pauses before the plosive sounds. However, Drucker's assumption that the plosive sounds can be accurately determined limits the usefulness of the system.
Many speech enhancement techniques take a more mathematical approach, using criteria that are empirically matched to human perception. An example of a mathematical criterion that is useful in matching short time spectral magnitudes, a perceptually important characterization of speech, is the mean squared error (MSE). A computational advantage of using this criterion is that minimizing the MSE reduces to solving a linear set of equations. Other factors, however, can make an “optimally small” MSE misleading. In the case of speech degraded by narrow-band noise, which is considerably less comfortable to listen to than wide-band noise, wide-band noise can be added to mask the more unpleasant narrow-band noise. This technique makes the mean squared error larger, even though the result is more comfortable to listen to.
The enhancement of speech degraded by additive noise has led to diverse approaches and systems. Some systems, like Drucker's, exploit certain perceptual aspects of speech. Others have focused on improving the estimate of the short time Fourier transform magnitude (STFTM), which is perceptually important in characterizing speech. The phase, on the other hand, may be considered as relatively unimportant.
Because the STFTM of speech is perceptually very important, one approach has been to estimate the STFTM of clean speech, given information about the noise source. Two classes of techniques have evolved out of this approach. In the first, the short time spectral amplitude is estimated from the spectrum of degraded speech and information about the noise source. Usually, the processed spectrum adopts the phase of the spectrum of the noisy speech because phase information is not as important perceptually. This first class includes spectral subtraction, correlation subtraction and maximum likelihood estimation techniques. The second class of techniques, which includes Wiener filtering, uses the degraded speech and noise information to create a zero-phase filter that is then applied to the noisy speech. As reported by H. L. Van Trees in “Detection, Estimation and Modulation Theory”, Pt. 1, John Wiley and Sons, New York, N.Y. 1968, with Wiener filtering the goal is to develop a filter which can be applied to noisy speech to form the enhanced speech.
Turning first to the class concerned with estimation of short time spectral amplitude, particularly where spectral subtraction is used, statistical information is obtained about the noise source to estimate the STFTM of clean speech. This technique is also known as power spectrum subtraction. Variations of these techniques include the more general relation identified by Lim et al. in “Enhancement and Bandwidth Compression of Noisy Speech”, Proc. of the IEEE, Vol. 67, No. 12, December 1979, as:
$$|\hat{S}(\omega)|^{\alpha} = |Y(\omega)|^{\alpha} - \beta\, E\!\left[|N(\omega)|^{\alpha}\right] \qquad (1)$$
where α and β are parameters that can be chosen. Magnitude spectral subtraction is the case where α=1 and β=1. A different subtractive speech enhancement algorithm was presented by McAulay and Malpass in “Speech Enhancement Using a Soft Decision Noise Suppression Filter”, IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-28, No. 2, pp. 137-145, April 1980. Their method uses a maximum-likelihood estimate of the noisy speech signal, assuming that the noise is Gaussian. When the enhanced magnitude yields a value smaller than an attenuation threshold, however, the spectral magnitude is automatically set to the defined threshold.
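By way of illustration only, relation (1) may be sketched in Python as follows. The function name, the list-per-bin representation and the `floor` parameter are assumptions made for this sketch; the floor is only loosely analogous to the attenuation threshold described above and is not taken from either cited reference.

```python
def spectral_subtract(noisy_mag, noise_mag_pow, alpha=1.0, beta=1.0, floor=0.0):
    """Apply relation (1) bin by bin and return the estimated clean magnitudes.

    noisy_mag     -- |Y(w)| for each frequency bin
    noise_mag_pow -- E[|N(w)|^alpha] for each bin (already raised to alpha)
    floor         -- hypothetical lower bound on the output magnitude (assumption)
    """
    estimate = []
    for y, n in zip(noisy_mag, noise_mag_pow):
        diff = y ** alpha - beta * n
        # The difference can go negative when the noise estimate exceeds the
        # observed bin, so it is clamped before the alpha-th root is taken.
        diff = max(diff, floor ** alpha)
        estimate.append(diff ** (1.0 / alpha))
    return estimate
```

With `alpha=1.0` and `beta=1.0` the sketch corresponds to magnitude spectral subtraction; other choices of α and β give the other members of the family described by relation (1).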
Spectral subtraction is generally considered to be effective at reducing the apparent noise power in degraded speech. Lim has shown, however, that this noise reduction is achieved at the price of lower speech intelligibility (8). Moderate amounts of noise reduction can be achieved without significant intelligibility loss, but large amounts of noise reduction can seriously degrade the intelligibility of the speech. Other researchers have also drawn attention to other distortions which are introduced by spectral subtraction (5). Moderate to high amounts of spectral subtraction often introduce “tonal noise” into the speech.
Another class of speech enhancement methods exploits the periodicity of voiced speech to reduce the amount of background noise. These methods average the speech over successive pitch periods, which is equivalent to passing the speech through an adaptive comb filter. In these techniques, harmonic frequencies are passed by the filter while other frequencies are attenuated. This leads to a reduction in the noise between the harmonics of voiced speech. One problem with this technique is that it severely distorts any unvoiced spectral regions. Typically this problem is handled by classifying each segment as either voiced or unvoiced and then only applying the comb filter to voiced regions. Unfortunately, this approach does not account for the fact that even at modest noise levels many voiced segments have large frequency regions which are dominated by noise. Comb filtering these noise dominated frequency regions severely changes the perceived characteristics of the noise.
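The equivalence between pitch-period averaging and comb filtering can be seen in a minimal Python sketch. It assumes a fixed, integer pitch period in samples, whereas the adaptive filters referred to above track a changing pitch; the function name and the `num_periods` parameter are hypothetical.

```python
def pitch_average(signal, period, num_periods=3):
    """Average each sample with the samples one, two, ... pitch periods earlier.

    Components that repeat every `period` samples (the pitch harmonics) pass
    unchanged, while components that do not repeat are attenuated.
    """
    out = []
    for i in range(len(signal)):
        # Collect the available taps spaced one pitch period apart.
        taps = [signal[i - j * period]
                for j in range(num_periods)
                if i - j * period >= 0]
        out.append(sum(taps) / len(taps))
    return out
```

A signal that is exactly periodic in `period` is returned unchanged, which is the harmonic pass-band behaviour; an alternating sequence averaged over a period of three samples cancels, which is the attenuation between harmonics.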
These known problems with current speech enhancement methods have generated considerable interest in developing new or improved speech enhancement methods which are capable of reducing a substantial amount of noise without adding noticeable artifacts to the speech signal. A particular application for such a technique is Harmonic Excitation Linear Predictive Coding (HE-LPC), although it is desirable for such a technique to be applicable to any sinusoidal based speech coding algorithm.
The conventional Harmonic Excitation Linear Predictive Coder (HE-LPC) is disclosed in S. Yeldener, “A 4 kb/s Toll Quality Harmonic Excitation Linear Predictive Speech Coder”, Proc. of ICASSP-1999, Phoenix, Ariz., pp. 481-484, March 1999, which is incorporated herein by reference. A simplified block diagram of the conventional HE-LPC coder is shown in
In the HE-LPC speech coder, the pitch detection circuit 120 uses a pitch estimation algorithm that takes advantage of the most important frequency components to synthesize speech and then estimates the pitch based on a mean squared error approach. The pitch search range is first partitioned into various sub-ranges, and then a computationally simple pitch cost function is computed. The computed pitch cost function is then evaluated and a pitch candidate for each sub-range is obtained. After pitch candidates are selected, an analysis by synthesis error minimization procedure is applied to choose the optimal pitch estimate. In this case, the LPC residual signal is low pass filtered first and then the low pass filtered excitation signal is passed through an LPC synthesis filter to obtain the reference speech signal. For each pitch candidate, the LPC residual spectrum is sampled at the harmonics of the corresponding pitch candidate to get the harmonic amplitudes and phases. These harmonic components are used to generate a synthetic excitation signal based on the assumption that the speech is purely voiced. This synthetic excitation signal is then passed through the LPC synthesis filter to obtain the synthesized speech signal. The perceptually weighted mean squared error (PWMSE) between the reference and synthesized signals is then computed, and the computation is repeated for each pitch candidate. The candidate pitch period having the least PWMSE is then chosen as the optimal pitch estimate P.
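The two-stage structure of this search may be sketched in Python as follows. Only the control flow is shown: `cost` and `synthesis_error` are hypothetical callables standing in for the computationally simple pitch cost function and the PWMSE of the analysis by synthesis stage, and equal-width sub-ranges over integer pitch lags are an assumption of the sketch.

```python
def select_pitch(pitch_min, pitch_max, num_subranges, cost, synthesis_error):
    """Return the pitch lag chosen by a two-stage candidate search.

    cost            -- cheap per-lag cost function (hypothetical stand-in)
    synthesis_error -- expensive per-lag error, e.g. a PWMSE (hypothetical stand-in)
    """
    lags = list(range(pitch_min, pitch_max + 1))
    size = -(-len(lags) // num_subranges)  # ceiling division
    candidates = []
    for start in range(0, len(lags), size):
        subrange = lags[start:start + size]
        # Stage 1: the cheap cost function picks one candidate per sub-range.
        candidates.append(min(subrange, key=cost))
    # Stage 2: the expensive error measure decides among the few candidates.
    return min(candidates, key=synthesis_error)
```

The expensive measure is therefore evaluated once per sub-range rather than once per lag, which is the point of partitioning the search range.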
Also significant to the operation of the HE-LPC is the computation of the voicing probability that defines a cutoff frequency in voicing estimation circuit 160. First, a synthetic speech spectrum is computed based on the assumption that the speech signal is fully voiced. The original and synthetic speech spectra are then compared and a voicing probability is computed on a harmonic-by-harmonic basis, and the speech spectrum is assigned as either voiced or unvoiced, depending on the magnitude of the error between the original and reconstructed spectra for the corresponding harmonic. The computed voicing probability Pv is then applied to a spectral amplitude estimation circuit 170 for an estimation of the spectral amplitude Ak for the kth harmonic. A quantizer and encoder unit 180 receives the pitch detection signal P, the noise residual in the amplitude, the voicing probability Pv and the spectral amplitude Ak, along with the output lsfj of the LPC-LSF transform 140, to generate an encoded output speech signal for application to the output channel 181.
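A harmonic-by-harmonic voicing decision of this general kind may be sketched in Python as follows. The normalized squared error measure, the `threshold` value, and the mapping from the highest voiced harmonic to a probability are all assumptions made for illustration; the text above does not specify them.

```python
def voicing_probability(orig_mag, synth_mag, threshold=0.2):
    """Return (pv, decisions) from per-harmonic original and synthetic magnitudes.

    decisions[k-1] is True when harmonic k is judged voiced. pv is the position
    of the highest voiced harmonic as a fraction of the band (an assumed mapping
    from the voicing decisions to a cutoff frequency).
    """
    decisions = []
    for o, s in zip(orig_mag, synth_mag):
        # Normalized squared error for this harmonic (assumed error measure).
        err = (o - s) ** 2 / (o ** 2) if o > 0 else 1.0
        decisions.append(err < threshold)
    cutoff = 0
    for k, voiced in enumerate(decisions, start=1):
        if voiced:
            cutoff = k
    pv = cutoff / len(decisions) if decisions else 0.0
    return pv, decisions
```

A small error means the fully voiced synthetic spectrum explains that harmonic well, so the harmonic is treated as voiced; a large error indicates a noise-like region.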
In other coders to which the invention would apply, the excitation signal would also be specified by a consideration of the fundamental frequency, spectral amplitudes of the excitation spectrum and the voicing information.
At the decoder 200, as illustrated in
The voiced part of the excitation signal is determined as the sum of the sinusoidal harmonics. The unvoiced part of the excitation signal is generated by weighting the random noise spectrum with the original excitation spectrum for the frequency regions determined as unvoiced. The voiced and unvoiced excitation signals are then added together at mixer 270 and passed through an LPC synthesis filter 280, which responds to an input from the LSF-LPC transform 220 to form the final synthesized speech. At the output, a post-filter 290, which also receives an input from the LSF-LPC transform circuit 220 via an amplifier 225 with a constant gain α, is used to further enhance the output speech quality. This arrangement produces high quality speech.
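The sum of sinusoidal harmonics that forms the voiced excitation may be sketched in Python as follows. The sketch assumes amplitudes, phases and fundamental frequency that are constant over the frame; the function and parameter names are hypothetical, and any interpolation between frames that a practical decoder may apply is omitted.

```python
import math

def voiced_excitation(amplitudes, phases, w0, num_samples):
    """Return e[n] = sum over k of A_k * cos(k * w0 * n + phi_k).

    amplitudes, phases -- per-harmonic A_k and phi_k, for k = 1, 2, ...
    w0                 -- fundamental frequency in radians per sample
    """
    return [
        sum(a * math.cos(k * w0 * n + p)
            for k, (a, p) in enumerate(zip(amplitudes, phases), start=1))
        for n in range(num_samples)
    ]
```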
However, the conventional arrangement of the HE-LPC encoder and decoder does not provide the desired performance for a variety of input signal and background noise conditions. Accordingly, there is a need for a further way to improve speech quality significantly in background noise conditions.
The present invention comprises the reduction of background noise in a processed speech signal prior to quantization and encoding for transmission on an output channel.
More specifically, the present invention comprises the application of an algorithm to the spectral amplitude estimation signal generated in a speech codec on the basis of detected pitch and voicing information for reduction of background noise.
The present invention further concerns the application of a background noise algorithm on the basis of individual harmonics k in a spectral amplitude estimated signal Ak in a speech codec.
The present invention more specifically concerns the application of a background noise elimination algorithm to any sinusoidal based speech coding algorithm, and in particular, an algorithm based on harmonic excitation linear predictive encoding.
The preferred embodiment of the present invention can be best appreciated by considering in
In considering the detailed operation of the background noise-compensating encoder of the present invention, reference is made to
The first step S1 of the speech enhancement process is to have a voice activity detection (VAD) decision for each frame of speech signal. The VAD decision in block 410 is based on the periodicity P0 and the auto-correlation function ACF of the speech signal, which appear as inputs on lines 401 and 405, respectively, of
If the VAD decision is that there is no speech, in step S2, the noise spectrum is updated in every speech segment in which speech is not active, and a long term noise spectrum is estimated in noise spectrum estimation unit 420. The long term average noise spectrum is formulated as (2):
where 0≦ω≦π, |Nm(ω)| is the long term noise spectrum magnitude, α is a constant that can be set to 0.95, and VAD=0 means that speech is not active. In this formulation |U(ω)| can be formed in two ways. In the first way, |U(ω)| can be considered to be directly the current signal spectrum. In the second way, harmonic spectral amplitudes are first estimated according to equation (3) as:
where Ak is the kth harmonic spectral amplitude, and ω0 is the fundamental frequency of the current signal, |S(ω)|, which is an input to the noise spectrum estimation unit 420 along with the pitch P0. Notably, S(ω) and P0 are inputs to each of the VAD decision circuit 410, noise spectrum estimation unit 420, harmonic-by-harmonic noise-signal ratio unit 430 and the harmonic noise attenuation factor unit 460, as subsequently discussed.
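A long term noise spectrum update consistent with the description of step S2 may be sketched in Python as follows. The sketch assumes a first-order recursive average of the same form as equation (13), applied only when VAD=0, with the previous estimate held otherwise; this form is an assumption inferred from the surrounding definitions, and the function name is hypothetical.

```python
def update_noise_spectrum(prev_noise, current_mag, vad, alpha=0.95):
    """Return the updated long term noise magnitude spectrum.

    prev_noise  -- previous estimate, one value per frequency bin
    current_mag -- |U(w)| for the current frame, one value per frequency bin
    vad         -- 0 when speech is not active, non-zero otherwise
    """
    if vad != 0:
        # Speech is active: hold the previous noise estimate (assumption).
        return list(prev_noise)
    # Speech is not active: smooth the current spectrum into the estimate.
    return [alpha * n + (1.0 - alpha) * u
            for n, u in zip(prev_noise, current_mag)]
```

With α=0.95 the estimate changes slowly, so a single noise-only frame has little effect on it.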
In step S3, the Estimated Noise to Signal Ratio (ENSR) for each harmonic lobe is calculated on the basis of S(ω), the excitation spectrum and the pitch input. In this case, the ENSR for the kth harmonic is computed as:
where γk is the kth ENSR, Nm(ω) is the estimated noise spectrum, S(ω) is the speech spectrum and Wk(ω) is the window function computed as:
where BkL and BkU are the lower and upper limits for the kth harmonic and computed as:
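One way to picture the per-harmonic ratio of step S3 is the following Python sketch. It rests on several assumptions that are not taken from the text: the ratio is formed from squared magnitudes, Wk(ω) is a rectangular window, the band limits are placed half a fundamental either side of each harmonic, and the spectra are sampled on a uniform grid from 0 to π. The function name is hypothetical.

```python
import math

def estimated_nsr(noise_mag, speech_mag, w0, num_harmonics):
    """Return a list of noise-to-signal ratios, one per harmonic lobe."""
    n_bins = len(speech_mag)
    ratios = []
    for k in range(1, num_harmonics + 1):
        lower = (k - 0.5) * w0  # assumed lower band limit for harmonic k
        upper = (k + 0.5) * w0  # assumed upper band limit for harmonic k
        noise_energy = 0.0
        speech_energy = 0.0
        for i in range(n_bins):
            w = math.pi * i / (n_bins - 1)
            if lower <= w < upper:  # rectangular window (assumption)
                noise_energy += noise_mag[i] ** 2
                speech_energy += speech_mag[i] ** 2
        # Treat an empty or silent band as fully noise dominated (assumption).
        ratios.append(noise_energy / speech_energy if speech_energy > 0 else 1.0)
    return ratios
```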
In step S4, a long term average ACF is calculated in section 440, using the autocorrelation function (ACF), and on the basis of an input of the VAD decision from section 410, an input is provided to noise reduction control circuit 450, which in step S5 is used to control the noise reduction gain, βm, from one frame to the next:
where Δ is a constant (typically Δ=0.1) and
where min is the lowest noise attenuation factor (typically, min=0.5).
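A frame-to-frame gain control of this kind may be sketched in Python as follows. The direction of the update is an assumption: the gain is lowered by Δ toward the minimum while speech is not active and raised by Δ toward 1.0 while speech is active. Only the step size Δ=0.1 and the lower limit of 0.5 come from the text; the function name and the upper limit of 1.0 are hypothetical.

```python
def update_noise_gain(prev_gain, vad, delta=0.1, min_gain=0.5):
    """Return the noise reduction gain for the current frame.

    prev_gain -- gain used for the previous frame
    vad       -- 0 when speech is not active, non-zero otherwise
    """
    if vad == 0:
        # No speech: step toward stronger attenuation, limited by min_gain.
        return max(prev_gain - delta, min_gain)
    # Speech active: step back toward unity gain (assumed upper limit).
    return min(prev_gain + delta, 1.0)
```

Changing the gain by at most Δ per frame avoids abrupt level changes at the boundaries between speech and noise-only segments.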
In step S5, a harmonic-by-harmonic noise-signal ratio is calculated in section 430 and the harmonic spectral amplitudes are interpolated according to equation (4) to have a fixed dimension spectrum as:
where 1≦k≦L and L is the total number of harmonics within the 4 kHz speech band. The noise gain control that is calculated in step S7, on the basis of the VAD decision output 1 and 0, and as represented in the block 450 of
$$\alpha_k = \beta_m \sqrt{1.0 - \mu\,\gamma_k} \qquad (11)$$
In this case, if αk<0.1, then αk is set to 0.1. Here, μ is a constant factor that can be set as:
where Em is the long term average energy that can be computed as:
$$E_m = \alpha E_{m-1} + (1.0 - \alpha)\, E_0 \qquad (13)$$
where α is a constant factor (typically α=0.95) and E0 is the average energy of the current frame of the speech signal.
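The recursion of equation (13) may be sketched in Python as follows. Taking E0 as the mean of the squared samples of the frame is an assumption, since the text states only that E0 is the average energy of the current frame; the function names are hypothetical.

```python
def frame_energy(frame):
    """Return the mean squared sample value of a frame (assumed definition of E0)."""
    return sum(x * x for x in frame) / len(frame)

def update_long_term_energy(prev_energy, frame, alpha=0.95):
    """Apply equation (13) to the previous long term energy and the current frame."""
    return alpha * prev_energy + (1.0 - alpha) * frame_energy(frame)
```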
The noise attenuation factor for each harmonic that was computed in step S5 is used in step S6 to scale the harmonic amplitudes that are computed during the encoding process of the HE-LPC coder, to attenuate noise in the residual spectral amplitudes Ak, and to produce the modified spectral amplitudes Âk.
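The attenuation factor of equation (11), its lower limit of 0.1, and the scaling of the harmonic amplitudes may be sketched in Python as follows. The value of μ is passed in as a parameter, and the clamping of the quantity under the square root at zero is an assumption added so that the square root is always defined; the function names are hypothetical.

```python
import math

def attenuation_factors(ensr, beta_m, mu, floor=0.1):
    """Return one attenuation factor per harmonic from the per-harmonic ENSR values."""
    factors = []
    for gamma_k in ensr:
        # Clamp at zero so the square root is defined (assumption).
        radicand = max(1.0 - mu * gamma_k, 0.0)
        # Equation (11), followed by the lower limit of 0.1.
        factors.append(max(beta_m * math.sqrt(radicand), floor))
    return factors

def attenuate_amplitudes(amplitudes, factors):
    """Scale each harmonic amplitude A_k by its attenuation factor."""
    return [a * f for a, f in zip(amplitudes, factors)]
```

Harmonics with a low noise-to-signal ratio pass almost unchanged, while noise dominated harmonics are reduced, though never by more than the lower limit allows.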
The background noise reduction algorithm discussed above may be incorporated into the Harmonic Excitation Linear Predictive Coder (HE-LPC), or any other coder for a sinusoidal based speech coding algorithm.
The decoder as illustrated in
While the present invention is described with respect to certain preferred embodiments, the invention is not limited thereto. The full scope of the invention is to be determined on the basis of the issued claims, as interpreted in accordance with applicable principles of the U.S. Patent Laws.
This is a continuation of application Ser. No. 11/598,813 filed Nov. 14, 2006, which is a continuation of application Ser. No. 10/504,131 filed Aug. 8, 2002, and of PCT/US01/04526 filed Feb. 12, 2001, which claims benefit of Provisional Application No. 60/181,734 filed Feb. 11, 2000. The entire disclosures of the prior applications are hereby incorporated by reference.
Number | Date | Country | |
---|---|---|---|
20080140395 A1 | Jun 2008 | US |
Number | Date | Country | |
---|---|---|---|
60181734 | Feb 2000 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 11598813 | Nov 2006 | US |
Child | 11772768 | | US |
Parent | 10504131 | | US |
Child | 11598813 | | US |