The present disclosure relates to the field of signal processing in hearing aids and speech communication devices, and more specifically relates to a method and system for suppressing background noise in the input speech signal, using spectral subtraction wherein the noise spectrum is updated using quantile based estimation and the quantile values are approximated using dynamic quantile tracking.
Sensorineural loss is caused by degeneration of the sensory hair cells of the inner ear or the auditory nerve. Persons with such loss experience severe difficulty in speech perception in noisy environments. Suppression of wide-band non-stationary background noise as part of the signal processing in hearing aids and other speech communication devices can serve as a practical solution for improving speech quality and intelligibility for persons with sensorineural or mixed hearing loss. Many signal processing techniques developed for improving speech perception require noise-free speech signal as the input and these techniques can benefit from noise suppression as a pre-processing stage. Noise suppression can also be used for improving the performance of speech codecs, speech recognition systems, and speaker recognition systems under noisy conditions.
For implementing the noise suppression on a low-power processor in a hearing aid or a communication device, the technique should have low algorithmic delay and low computational complexity. Spectral subtraction (M. Berouti, R. Schwartz, and J. Makhoul, “Enhancement of speech corrupted by acoustic noise,” Proc. IEEE ICASSP 1979, pp. 208-211; S. F. Boll, “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. Acoust., Speech, Signal Process., vol. 27, no. 2, pp. 113-120, 1979) can be used as a single-input speech enhancement technique for this application. A large number of variations of the basic technique have been developed for use in audio codecs and speech recognition (P. C. Loizou, “Speech Enhancement: Theory and Practice,” CRC Press, 2007). The processing steps are segmentation and spectral analysis, estimation of the noise spectrum, calculation of the enhanced magnitude spectrum, and re-synthesis of the speech signal. Due to the non-stationary nature of the interfering noise, its spectrum needs to be dynamically estimated. Under-estimation of the noise results in residual noise, and over-estimation results in distortion, leading to degraded quality and reduced intelligibility. Noise can be estimated during the silence intervals identified by a voice activity detector, but the detection may not be satisfactory under low SNR conditions, and the method may not correctly track the noise spectrum during long speech segments.
Several techniques based on minimum statistics for estimating the noise spectrum, without voice activity detection, have been reported (R. Martin, “Spectral subtraction based on minimum statistics,” Proc. EUSIPCO 1994, pp. 1182-1185; I. Cohen, “Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging,” IEEE Trans. Speech Audio Process., vol. 11, no. 5, pp. 466-475, 2003; G. Doblinger, “Computationally efficient speech enhancement by spectral minima tracking in subbands,” Proc. EUROSPEECH 1995, pp. 1513-1516). These techniques involve tracking the noise as the minima of the magnitude spectra of the past frames and are suitable for real-time operation. However, they often underestimate the noise and need estimation of an SNR-dependent subtraction factor. In the absence of significant silence segments, processing may remove some parts of the speech signal during the weaker speech segments. Stahl et al. (V. Stahl, A. Fisher, and R. Bipus, “Quantile based noise estimation for spectral subtraction and Wiener filtering,” Proc. IEEE ICASSP 2000, pp. 1875-1878) reported that a quantile-based estimation of the noise spectrum from the spectrum of the noisy speech can be used for spectral subtraction based noise suppression. It is based on the observation that the signal energy in a particular frequency bin is low in most of the frames and high only in 10-20% of the frames corresponding to voiced speech segments. For improving word accuracy in a speech recognition task, a time-frequency quantile based noise estimation was reported by Evans and Mason (N. W. Evans and J. S. Mason, “Time-frequency quantile-based noise estimation,” Proc. EUSIPCO 2002, pp. 539-542). These quantile-based noise estimation techniques use quantiles obtained by ordering the spectral samples or from dynamically generated histograms. Due to the large memory space required for storing the spectral samples and the high computational complexity, they are not suited for use in hearing aids and communication devices. Use of the median, i.e. the 0.5-quantile, considerably reduces the computation requirement, but still does not permit real-time implementation. Waddi et al. (S. K. Waddi, P. C. Pandey, and N. Tiwari, “Speech enhancement using spectral subtraction and cascaded-median based noise estimation for hearing impaired listeners,” Proc. NCC 2013, paper no. 1569696063) used a cascaded median as an approximation to the median for real-time implementation of speech enhancement. The improvements in speech quality were found to be different for different types of noises, indicating the need for using frequency-bin dependent quantiles for suppression of non-white and non-stationary noises.
Kazama et al. (M. Kazama, M. Tohyama, and T. Hirai, “Current noise spectrum estimation method and apparatus with correlation between previous noise and current noise signal,” U.S. Pat. No. 7,596,495 B2, 2009) have disclosed a method for updating the noise spectrum based on the correlation between the envelope of previously estimated noise spectrum and the envelope of the current spectrum of the input. It has high computational complexity due to the need for calculating the spectral envelopes and the correlation. As all the spectral samples of the noise are updated using a single mixing ratio, the method may not be effective in suppressing non-stationary non-white noises.
In a noise suppression method disclosed by Schmidt et al. (G. U. Schmidt, T. Wolff, and M. Buck, “System for speech signal enhancement in a noisy environment through corrective adjustment of spectral noise power density estimations,” U.S. Pat. No. 8,364,479 B2, 2013), the noise spectrum is estimated using moving average and minimum statistics and a frequency-dependent correction factor is obtained using the variance of relative spectral noise power density estimation error, estimated noise spectrum, and the input spectrum. The relative spectral noise power density estimation error is calculated during non-speech frames whose identification requires a voice activity detector and minimum statistics based noise estimation requires an SNR-dependent subtraction factor, leading to increased computational complexity.
In a method for estimating noise spectrum using quantile-based noise estimation, disclosed by Jabloun (F. Jabloun “Quantile based noise estimation,” UK patent No. GB 2426167 A, 2006), spectra of a fixed number of past input frames are stored in a buffer and sorted using a fast sorting algorithm for obtaining the specified quantile value for each spectral sample. A recursive smoothening is applied on the quantile-estimated noise spectrum, using smoothening parameter calculated from the estimated frequency-dependent SNR. Although the method does not need a voice activity detector, it requires a large memory for buffering the spectra. For reducing the high computational complexity due to sorting operations, the quantile computations are restricted to a small number of frequency samples and the noise spectrum is obtained using interpolation, restricting the effectiveness of the method in case of non-stationary non-white noise.
Nakajima et al. (H. Nakajima, K. Nakadai, and Y. Hasegawa, “Noise power estimation system, noise power estimating method, speech recognition system and speech recognizing method,” U.S. Pat. No. 8,666,737 B2, 2014) have described a method for estimating the noise spectrum using a cumulative histogram for each spectral sample which is updated at each analysis window using a time decay parameter. Although the method does not require large memory for buffering the spectra, it has high computational complexity and the estimated quantile values can have large errors in case of non-stationary noise.
Thus for noise signal suppression in speech signals in hearing aids and speech communication devices, there is a need to mitigate the disadvantages associated with the methods and systems described above. Particularly, there is a need for noise signal suppression without involving voice activity detection and without needing large memory and high computational complexity.
The present disclosure describes a method and a system for speech enhancement in speech communication devices and more specifically in hearing aids for suppressing stationary and non-stationary background noise in the input speech signal. The method uses spectral subtraction wherein the noise spectrum is updated using quantile-based estimation without voice activity detection and the quantile values are approximated using dynamic quantile tracking without involving large storage and sorting of past spectral samples. The technique permits use of a different quantile at each frequency bin for noise estimation without introducing processing overheads. The preferred embodiment uses analysis-synthesis based on Fast Fourier transform (FFT) and it can be integrated with other FFT-based signal processing techniques like dynamic range compression, spectral shaping, and signal enhancement used in the hearing aids and speech communication devices. A noise suppression system based on this method and using hardware with an audio codec and a digital signal processor (DSP) chip with on-chip FFT hardware is also disclosed.
The present disclosure discloses a method for noise suppression using spectral subtraction wherein the noise spectrum is dynamically estimated without voice activity detection and without storage and sorting of past spectral samples. It also discloses a system using this method for speech enhancement in hearing aids and speech communication devices, for improving speech quality and intelligibility. The disclosed method is suited for implementation using low power processors and the signal delay is small enough to be acceptable for audio-visual speech perception.
In the short-time spectrum of a speech signal mixed with background noise, the signal energy in a frequency bin is low in most of the frames and high only in 10-20% of the frames corresponding to voiced speech segments. Therefore, the spectral samples of the noise spectrum are updated using quantile-based estimation without using voice activity detection. A technique for dynamic quantile tracking is used for approximating the quantile values without involving storage and sorting of past spectral samples. The technique permits use of a different quantile at each frequency bin for noise estimation without introducing processing overheads.
The processing involves noise suppression by spectral subtraction, using analysis-modification-synthesis and comprising the steps of short-time spectral analysis, estimation of the noise spectrum, calculation of the enhanced magnitude spectrum, and re-synthesis of the output signal. The preferred embodiment uses FFT-based analysis-modification-synthesis along with overlapping analysis windows or frames.
The digitized input signal x(n) (151) is applied to the input windowing block (101) which outputs overlapping windowed segments (152). These segments serve as the input analysis frames for the FFT block (102) which calculates the complex spectrum Xn(k) (153), with k referring to the frequency sample index. The magnitude spectrum calculation block (103) calculates the magnitude spectrum |Xn(k)| (154). The noise estimation block (104) uses the magnitude spectrum |Xn(k)| (154) to estimate the noise spectrum Dn(k) (155) using dynamic quantile tracking. The enhanced magnitude spectrum calculation block (105) uses the magnitude spectrum |Xn(k)| (154) and the estimated noise spectrum Dn(k) (155) as the inputs and calculates the enhanced magnitude spectrum |Yn(k)| (156). In this block (105), the estimated noise spectrum Dn(k) (155) is smoothened by applying an averaging filter along the frequency axis. The smoothened noise spectrum Dn′(k) is calculated using a (2b+1)-sample filter, realized recursively for computational efficiency, as the following:
Dn′(k) = Dn′(k−1) + [Dn(k+b) − Dn(k−b−1)]/(2b+1) (1)
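As an illustration of Equation-1, a minimal Python sketch of the recursive smoothing is given below; the function name, the use of numpy, and the repetition of the band-edge samples for handling the edges are assumptions made for illustration and are not part of the disclosed method:

```python
import numpy as np

def smooth_noise_spectrum(D, b):
    """(2b+1)-sample moving average along frequency (Equation-1),
    realized recursively so each output needs one add and one subtract.
    Edge bins are handled by repeating the first/last sample (an assumption)."""
    N = len(D)
    D_pad = np.concatenate((np.full(b, D[0]), D, np.full(b, D[-1])))
    D_s = np.empty(N)
    acc = np.sum(D_pad[:2 * b + 1])             # window centred on k = 0
    D_s[0] = acc / (2 * b + 1)
    for k in range(1, N):
        acc += D_pad[k + 2 * b] - D_pad[k - 1]  # slide the window by one bin
        D_s[k] = acc / (2 * b + 1)
    return D_s
```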
The smoothened noise spectrum Dn′(k) is used for calculating the enhanced magnitude spectrum |Yn(k)| (156) using the generalized spectral subtraction as the following:
|Yn(k)| = [|Xn(k)|^γ − αDn′(k)^γ]^(1/γ) if |Xn(k)|^γ > (α+β)Dn′(k)^γ, and |Yn(k)| = [βDn′(k)^γ]^(1/γ) otherwise (2)
The exponent factor γ may be selected as 2 for power subtraction or as 1 for magnitude subtraction. Choosing the subtraction factor α>1 helps in reducing the broadband peaks in the residual noise, but it may result in deep valleys, causing warbling or musical noise, which is masked by a floor noise controlled by the spectral floor factor β.
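A minimal sketch of this subtraction step in Python, assuming the form of Equation-2 given above; the function name and the default parameter values are illustrative placeholders, not the tuned values of the disclosure:

```python
import numpy as np

def spectral_subtraction(X_mag, D_s, alpha=2.0, beta=0.01, gamma=1.0):
    """Generalized spectral subtraction: subtract the scaled noise spectrum
    in the gamma-power domain and apply a spectral floor (Equation-2).
    Default parameter values are placeholders, not tuned values."""
    Xg = X_mag ** gamma
    Dg = D_s ** gamma
    Yg = Xg - alpha * Dg          # subtraction in the magnitude/power domain
    floor = beta * Dg             # spectral floor to mask musical noise
    Yg = np.maximum(Yg, floor)    # keep at least the floor level
    return Yg ** (1.0 / gamma)
```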
The enhanced complex spectrum calculation block (106) uses the complex spectrum Xn(k) (153), magnitude spectrum |Xn(k)| (154), and enhanced magnitude spectrum |Yn(k)| (156) as the inputs and calculates the enhanced complex spectrum Yn(k) (157). In spectral subtraction for noise suppression, the output complex spectrum is obtained by associating the enhanced magnitude spectrum with the phase spectrum of the input signal. To avoid phase computation, the enhanced complex spectrum calculation block (106) calculates the enhanced complex spectrum Yn(k) (157) as the following:
Yn(k) = |Yn(k)| Xn(k)/|Xn(k)| (3)
The IFFT block (107) takes Yn(k) (157) as the input and calculates the time-domain enhanced signal (158), which is windowed by the output windowing block (108), and the resulting windowed segments (159) are applied as input to the overlap-add block (109) for re-synthesis of the output signal y(n) (160).
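The operations of blocks (106)-(109) for one frame can be sketched as follows in Python; the buffer handling, the small epsilon guarding against division by zero, and all names are assumptions made for illustration:

```python
import numpy as np

def resynthesize_frame(X, Y_mag, w2, out_buf, frame_start, eps=1e-12):
    """Form the enhanced complex spectrum by reusing the input phase (Equation-3),
    take the inverse FFT, apply the output window w2, and overlap-add the
    windowed segment into the output buffer."""
    Y = Y_mag * X / np.maximum(np.abs(X), eps)   # avoid division by zero
    y_frame = np.real(np.fft.ifft(Y))            # time-domain enhanced segment (real part)
    L = len(w2)
    # First L samples carry the windowed segment; the zero-padded tail is dropped (assumption)
    out_buf[frame_start:frame_start + L] += w2 * y_frame[:L]
    return out_buf
```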
In signal processing using short-time spectral analysis-modification-synthesis, the input analysis window is selected with the considerations of resolution and spectral leakage. Spectral subtraction involves association of the modified magnitude spectrum with the phase spectrum of the input signal to obtain the complex spectrum of the output signal. This non-linear operation results in discontinuities in the signal segments corresponding to the modified complex spectra of the consecutive frames. Overlap-add in the synthesis along with overlapping analysis windows is used for masking these discontinuities. A smooth output window function in the synthesis can be applied for further masking these discontinuities. The input analysis window w1(n) and the output synthesis window w2(n) should be such that the sum of w1(n)w2(n) for all the overlapped samples is unity, i.e.:
Σm w1(n − mS)w2(n − mS) = 1 (4)
where S is the number of samples for the shift between successive analysis windows. To limit the error due to spectral leakage, a smooth symmetric window function, such as a Hamming window, a Hanning window, or a triangular window, is used as w1(n) and a rectangular window is used as w2(n). The requirement as given in Equation-4 is met by using 50% overlap in the window positions, i.e. a window shift S=L/2 for a window length of L samples. Alternatively, a rectangular window as w1(n) and a smooth window as w2(n) with 50% overlap are used for masking the discontinuities in the output. In order to limit the error due to spectral leakage and to mask the discontinuities in the consecutive output frames, processing is carried out using a modified Hamming window as the following:
w1(n) = w2(n) = [1/√(4d^2 + 2e^2)][d + e cos(2π(n+0.5)/L)] (5)
with d=0.54 and e=0.46. The requirement as given in Equation-4 is met by using 75% overlap in window positioning, i.e. S=L/4. FFT size N is selected to be larger than the window length L and the analysis frame as input for FFT calculation is obtained by padding the windowed segment with N−L zero-valued samples.
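The window of Equation-5 and the overlap condition of Equation-4 can be checked numerically with a short Python sketch (variable names are illustrative); in the steady-state region where four shifted windows overlap, the sum of w1(n)w2(n) is unity:

```python
import numpy as np

L, S = 256, 64                  # window length and 75%-overlap shift (S = L/4)
d, e = 0.54, 0.46
n = np.arange(L)
w = (1.0 / np.sqrt(4 * d**2 + 2 * e**2)) * (d + e * np.cos(2 * np.pi * (n + 0.5) / L))

# Sum of w1(n)w2(n) over the four overlapping window positions (Equation-4)
overlap_sum = np.zeros(L + 3 * S)
for m in range(4):
    overlap_sum[m * S:m * S + L] += w * w
print(np.allclose(overlap_sum[3 * S:L], 1.0))   # steady-state region -> True
```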
The noise spectrum estimation block (104) estimates each spectral sample of the noise spectrum Dn(k) (155) from the magnitude spectrum |Xn(k)| (154) using dynamic quantile tracking, as described in the following.
Let the kth spectral sample of the noise spectrum Dn(k) be estimated as the p(k)-quantile of the magnitude spectrum |Xn(k)|. It is tracked dynamically as
Dn(k) = Dn−S(k) + dn(k) (6)
where S is the number of samples for the shift between successive analysis frames and the change dn(k) is given as
dn(k) = Δ+(k) if |Xn(k)| ≥ Dn−S(k), and dn(k) = −Δ−(k) otherwise (7)
The values of Δ+(k) and Δ−(k) should be such that the quantile estimate approaches the sample quantile and the sum of the changes in the estimate approaches zero, i.e. Σdn(k)≈0. For a stationary input and a sufficiently large number of frames M, dn(k) is expected to be −Δ−(k) for p(k)M frames and Δ+(k) for (1−p(k))M frames. Therefore,
(1−p(k))MΔ+(k)−p(k)MΔ−(k)≈0 (8)
Thus the ratio of the increment to the decrement should satisfy the following condition:
Δ+(k)/Δ−(k)=p(k)/(1−p(k)) (9)
and therefore Δ+(k) and Δ−(k) may be selected as
Δ+(k)=λp(k)R (10)
Δ−(k)=λ(1−p(k))R (11)
where R is the range (difference between the maximum and minimum values of the sequence of spectral values in a frequency bin) and λ is a factor which controls the step size during tracking. As the sample quantile may be overestimated by Δ+(k) or underestimated by Δ−(k), the ripple in the estimated value is given as
δ = Δ+(k) + Δ−(k) = λR (12)
During tracking, the number of steps needed for the estimated value to change from an initial value Di(k) to a final value Df(k) is given as
s = |Df(k) − Di(k)|/Δ+(k) for an increasing estimate, and s = |Df(k) − Di(k)|/Δ−(k) for a decreasing estimate (13)
Since (|Df(k) − Di(k)|)max = R, the maximum number of steps is given as
smax = max[1/(λp(k)), 1/(λ(1 − p(k)))] (14)
The factor λ can be considered as the convergence factor and its value is selected for an appropriate tradeoff between δ and smax. It may be noted that the convergence becomes slow for very low or high values of p(k).
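For example, with the quantile p(k) = 0.25 and the convergence factor λ = 1/256 (values used in the evaluation described later), Equation-10 and Equation-11 give Δ+(k) = (1/256)(0.25)R = R/1024 and Δ−(k) = (1/256)(0.75)R = 3R/1024, and the ratio Δ+(k)/Δ−(k) = 1/3 = p(k)/(1 − p(k)) satisfies the condition of Equation-9.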
The range is estimated using dynamic peak and valley detectors. The peak Pn(k) and the valley Vn(k) are updated using the following first-order recursive relations:
Pn(k) = τPn−S(k) + (1 − τ)|Xn(k)| if |Xn(k)| ≥ Pn−S(k), and Pn(k) = σPn−S(k) + (1 − σ)|Xn(k)| otherwise (15)
Vn(k) = τVn−S(k) + (1 − τ)|Xn(k)| if |Xn(k)| ≤ Vn−S(k), and Vn(k) = σVn−S(k) + (1 − σ)|Xn(k)| otherwise (16)
The constants τ and σ are selected in the range [0, 1] to control the rise and fall times of the detection. As the peak and valley samples may occur after long intervals, τ should be small to provide fast detector responses to an increase in the range and σ should be relatively large to avoid ripples.
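For example, τ = 0.1 leaves only 10% of a step increase remaining after a single frame shift, i.e. a rise time of about one frame shift, while σ = (0.9)^(1/1024), as used in the evaluation described later, lets the detector output move only 10% of the way toward a decreased input over 1024 frame shifts, i.e. a fall time of 1024 frame shifts.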
The range is tracked as:
Rn(k) = Pn(k) − Vn(k) (17)
The dynamic quantile tracking for estimating the noise spectrum can be written as the following:
Dn(k) = Dn−S(k) + λp(k)Rn(k) if |Xn(k)| ≥ Dn−S(k), and Dn(k) = Dn−S(k) − λ(1 − p(k))Rn(k) otherwise (18)
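A per-bin sketch of the noise estimation of block (104) is given below in Python, combining the peak/valley range tracking of Equations 15-17 with the quantile update of Equation-18; the class structure, the initialization from the first frame, and all names are assumptions made for illustration, with default parameter values taken from those reported as suitable in the evaluation described later:

```python
import numpy as np

class QuantileNoiseTracker:
    """Dynamic quantile tracking of the noise spectrum, updated once per frame shift.
    Per frequency bin it keeps a peak, a valley, and the noise estimate, and updates
    them from the current magnitude spectrum |Xn(k)| without storing past frames."""

    def __init__(self, num_bins, p=0.25, lam=1.0 / 256, tau=0.1, sigma=0.9 ** (1.0 / 1024)):
        self.p = np.full(num_bins, p)       # quantile per bin (may differ across bins)
        self.lam = lam                      # convergence factor
        self.tau, self.sigma = tau, sigma   # fast and slow detector constants
        self.P = None                       # peak tracker Pn(k)
        self.V = None                       # valley tracker Vn(k)
        self.D = None                       # noise estimate Dn(k)

    def update(self, X_mag):
        if self.D is None:                  # initialize trackers from the first frame
            self.P = X_mag.copy()
            self.V = X_mag.copy()
            self.D = X_mag.copy()
            return self.D
        # Peak detector: fast rise (tau), slow fall (sigma), as in Equation-15
        rise = X_mag >= self.P
        self.P = np.where(rise, self.tau * self.P + (1 - self.tau) * X_mag,
                                self.sigma * self.P + (1 - self.sigma) * X_mag)
        # Valley detector: fast fall (tau), slow rise (sigma), as in Equation-16
        fall = X_mag <= self.V
        self.V = np.where(fall, self.tau * self.V + (1 - self.tau) * X_mag,
                                self.sigma * self.V + (1 - self.sigma) * X_mag)
        R = self.P - self.V                 # dynamic range per bin (Equation-17)
        # Quantile step: increment when the input is above the estimate, else decrement
        up = X_mag >= self.D
        self.D = np.where(up, self.D + self.lam * self.p * R,
                              self.D - self.lam * (1 - self.p) * R)
        return self.D
```

In use, such a tracker would be updated once per frame shift with the current magnitude spectrum |Xn(k)|, and its output taken as the estimated noise spectrum Dn(k) (155) for the spectral subtraction described earlier.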
A noise suppression system using the above disclosed method is implemented using hardware consisting of an audio codec and a low-power digital signal processor (DSP) chip for real-time processing of the input signal for use in aids for the hearing impaired and also in other speech communication devices.
To examine the effect of the processing parameters, the technique was implemented for offline processing using Matlab. Implementation was carried out using magnitude subtraction (exponent factor γ=1) as it showed higher tolerance to variation in the values of α and β. Processing was carried out with a sampling frequency of 10 kHz and a window length of 25.6 ms (i.e. L=256 samples) with 75% overlap (i.e. S=64 samples). As the processed outputs with FFT length N=512 and higher were indistinguishable, N=512 was used. The processing with τ=0.1 and σ=(0.9)^(1/1024), corresponding to a rise time of one frame shift and a fall time of 1024 frame shifts, was found to be the most appropriate combination for different types of noises and SNRs. Processing with these empirically obtained values and without spectral smoothening of the estimated noise spectrum was used for evaluation with informal listening and for objective evaluation with the Perceptual Evaluation of Speech Quality (PESQ) measure. The PESQ score (scale: 0-4.5) is calculated from the difference between the loudness spectra of level-equalized and time-aligned noise-free reference and test signals (ITU, “Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs,” ITU-T Rec. P.862, 2001). The speech material consisted of a recording with three isolated vowels, a Hindi sentence, and an English sentence (-/a/-/i/-/u/“aayiye aap kaa naam kyaa hai”—“where were you a year ago”) from a male speaker. A longer test sequence was generated by speech-speech-silence-speech concatenation of the recording for the informal listening test. Testing involved processing of speech with additive white, street, babble, car, and train noises at SNRs of 15, 12, 9, 6, 3, 0, −3, −6, −9, and −12 dB.
To find the most suitable quantile for noise estimation and the number of frames over which this quantile should be estimated, the offline processing was carried out using the sample quantile. Processing significantly enhanced the speech for all noises and there was no audible roughness. For objective evaluation of the processed outputs, PESQ scores were calculated for the processed output with β=0, α in the range of 0.4 to 6, and with quantile p=0.1, 0.25, 0.5, 0.75, and 0.9. The quantile values were obtained using the previous M frames, where M=32, 64, 128, 256, and 512. For fixed values of SNR, α, and p, the highest PESQ scores were obtained for M=128. Lower values of M resulted in attenuation of the speech signal and larger values were unable to track non-stationary noise. The investigations were repeated using dynamic quantile tracking. The PESQ scores of the processed output with convergence factor λ=1/256 were found to be nearly equal to the PESQ scores obtained using the sample quantile with M=128. It was further observed that noise estimation with p=0.25 resulted in nearly the best scores for different types of noises at all SNRs.
For real-time processing, the system is realized using the hardware described above, with the audio codec interfaced to the low-power DSP chip for acquiring the input signal and delivering the enhanced output signal.
The preferred embodiment of the noise suppression system has been described with reference to its application in hearing aids and speech communication devices wherein the input and output signals are in analog form and the processing is carried out using a processor interfaced to an audio codec consisting of an ADC and a DAC with a single digital interface between the audio codec and the processor. It can also be realized using separate ADC and DAC chips interfaced to the processor or using a processor with on-chip ADC and DAC hardware. The system can also be used for noise suppression in speech communication devices with the digitized audio signals available in the form of digital samples at regular intervals or in the form of data packets, by implementing the processing block (306) in such devices.
The disclosed processing method and the preferred embodiment of the disclosed processing system use FFT-based analysis-synthesis. Therefore the processing can be integrated with other FFT-based signal processing techniques like dynamic range compression, spectral shaping, and signal enhancement for use in the hearing aids and speech communication devices. Noise suppression can also be implemented using other signal analysis-synthesis methods like the ones based on discrete cosine transform (DCT) and discrete wavelet transform (DWT). These methods can also be implemented for real-time processing with the use of the disclosed method of approximation of quantile values by dynamic quantile tracking for noise estimation.
The above description along with the accompanying drawings is intended to describe the preferred embodiments of the invention in sufficient detail to enable those skilled in the art to practice the invention. The above description is intended to be illustrative and should not be interpreted as limiting the scope of the invention. Those skilled in the art to which the invention relates will appreciate that many variations of the described example implementations and other implementations exist within the scope of the claimed invention.
Priority application: 640/MUM/2015, filed February 2015, India (national).
PCT filing: PCT/IN2015/000183, filed 4/24/2015 (WO).