Not Applicable
1. Field of the Invention
The present invention relates to speech enhancement methods and systems used to improve speech quality and the performance of Automatic Speech Recognizers (ASR) in noisy environments. It removes unwanted noise from the near end user speech, emphasizes the formants of the user speech, and simultaneously extracts clean speech acoustic features for the ASR to improve its recognition rate.
2. Background of the Invention
In everyday living environments, noise is everywhere. It not only affects speech quality in mobile communications and Voice Over IP (VOIP) applications, but also severely decreases the accuracy of Automatic Speech Recognition.
One particular example relates to the digital living room environment. Connected devices such as smart TVs and smart appliances are being adopted by increasing numbers of consumers. As a result, the digital living room is evolving into a new digital hub, where Voice Over Internet Protocol communications, social gaming and voice interactions over Smart TVs become central activities. In these situations, the microphones are usually placed near the TV or conveniently integrated into the Smart TV itself. The users normally sit at a comfortable viewing distance in front of the TV. The microphones not only receive the users' speech, but also pick up unwanted sound from the TV speakers and room reverberations. Due to the close proximity of the microphone(s) to the TV loudspeakers, the users' speech can be overpowered by the unwanted audio generated by the TV speakers. Inevitably this affects the speech quality in VOIP applications. In Talk Over Media (TOM) situations, when users prefer to use their voice to control and search media content while watching TV, their speech commands, coupled with the high level of unwanted TV sound, would render Automatic Speech Recognition nearly impossible.
Speech enhancement has been a crucial technology for improving speech clarity and intelligibility in noisy environments. Microphone array beamformers have been used to focus on and enhance the speech from the direction of the talker; a beamformer essentially acts as a spatial filter. Acoustic Echo Cancellation (AEC) is another technique to filter out the unwanted far end echo. If the signal produced by the TV speaker(s) is known, it can be treated as a far end reference signal. However, there are several problems with the prior art speech enhancement techniques. Firstly, the prior art techniques are mainly designed for near field applications where the microphones are placed close to the talker, such as in mobile phones and Bluetooth headsets. In near field applications, the Signal to Noise Ratio (SNR) is high enough for speech enhancement techniques to be effective in suppressing and removing the interfering noise and echo. However, in far field applications, the microphones can be 10 to 20 feet away from the talker. The SNR of a microphone signal located at this distance is very low, and the traditional techniques normally do not perform well. The results produced by the traditional methods either leave large amounts of noise and echo remaining or introduce high levels of distortion to the speech signal, which severely decreases its intelligibility. Secondly, the prior art techniques fail to distinguish VOIP applications from ASR applications. A processed output that is intelligible to a human may not be recognized well by an ASR. Thirdly, the prior art speech enhancement techniques are not power efficient. In the prior art techniques, adaptive filters are used to cancel the acoustic coupling between loudspeakers and microphones. However, a large number of filter taps is required to reduce the reverberant echo. The adaptive filters used in the prior art are slow to adapt to the optimum solution and, furthermore, require significant processing power and memory space.
The current invention intends to overcome or alleviate all or part of the shortcomings of the prior art techniques.
Accordingly, the present invention provides a system and method to enhance speech intelligibility and improve the detection rate of an automatic speech recognizer in noisy environments. The present invention reduces an acoustically coupled loudspeaker signal from a plurality of microphone signals to enhance a near end user speech signal. The early reflections of the loudspeaker signal(s) are first removed by an estimation filtering unit. The estimated early reflections signal is transformed into an estimated late reflections signal which statistically closely resembles the remaining noise components within the estimation filtering unit output. A speech probability measure is also derived to indicate the amount of near end user speech within the estimation filtering unit output. A noise reduction unit uses the estimated late reflections signal as a noise reference to remove the remaining loudspeaker signal. A decision unit checks a system configuration parameter to determine whether the cleaned speech is intended for human communication and/or Automatic Speech Recognition. The low frequency bands of the cleaned speech signal are reconstructed to enhance its naturalness and intelligibility for communication applications. In the case that ASR is enabled, the peaks and valleys of the lower formants of the cleaned speech are emphasized by a formant emphasis filter to improve the ASR recognition rate. A set of acoustic features and processing profiles is also generated for the ASR engine. The present invention can also be applied to devices which have foreground microphone(s) and background microphone(s).
Embodiments of the present invention not only improve the speech intelligibility, but also simultaneously provide suitable features to improve the recognition rate of the ASR.
(TOM) application to which the present invention may be applied. New Smart TV services integrate traditional cable TV offerings with other internet functionality which was previously offered through a computer. Users can browse the internet, watch streaming videos and make VOIP calls on their big screen TV. The large display format and high definition of the TV make it ideal for playing internet games or performing video chat. Smart TVs will function as the infotainment hub of the future digital living room environment. However, complicated user menu systems make the TV remote an inadequate control device. Voice control is more natural, convenient and efficient, and is highly desirable. In the case where the microphone(s) are integrated into or placed near the TV set, VOIP call quality can be adversely affected due to the large separation distance between the user and the microphone(s). The distance can greatly decrease the SNR of the received speech, which can render the ASR ineffective. This problem is even more acute when the media audio is simultaneously playing through the loudspeakers. As depicted in
The speaker signal from the TV is normally in stereo format. There is a high degree of correlation between the left channel and the right channel. This inter-channel correlation increases the difficulty for the estimation filter to converge to the true optimum solution. In
The method in the present invention can be implemented in the time domain or the frequency domain. Signal processing in the frequency domain is generally more efficient than processing in the time domain. In the case of a frequency domain implementation, the microphone signal and the speaker signal are transformed into frequency coefficients or frequency bands, as depicted by blocks 305 and 306. Filter banks such as the Quadrature Mirror Filter (QMF) and the Modified Discrete Cosine Transform (MDCT) can be used to implement the time domain to frequency domain transformation. In one embodiment, the time domain to frequency domain transformation is done using a short time Fast Fourier Transform (FFT). First, the time domain signal is segmented into overlapping frames. The overlapping ratio may be 0.5. A sliding analysis window is applied to each overlapping frame. The sliding analysis window may be a Hamming window, a Hanning window or a Cosine window; other windows are also possible. Each windowed overlapping frame is transformed into the frequency domain by an FFT operation. The output of the FFT can further be transformed into a suitable human psycho-acoustical scale such as the Bark scale or the Mel scale. A logarithmic operation may further be applied to the magnitude of the transformed frequency bands.
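The following Python sketch illustrates one possible short time FFT analysis of the kind described above; the 512-sample frame length, the Hanning window choice and the optional logarithmic step are illustrative assumptions, not requirements of the invention.

import numpy as np

def stft_frames(x, frame_len=512, overlap=0.5):
    # Segment the time domain signal into overlapping frames (overlap ratio 0.5),
    # apply a sliding Hanning analysis window and transform each frame with an FFT.
    hop = int(frame_len * (1.0 - overlap))
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    spectra = np.empty((n_frames, frame_len // 2 + 1), dtype=complex)
    for t in range(n_frames):
        frame = x[t * hop : t * hop + frame_len] * window
        spectra[t] = np.fft.rfft(frame)   # frequency coefficients of frame t
    return spectra

# Optional step mentioned above: logarithm of the magnitude of the frequency bands.
# log_mag = np.log(np.abs(spectra) + 1e-12)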
An estimation filtering unit 307 is used to estimate and remove the early reflections of the speaker signal. In one embodiment, the estimation filter can be implemented as an FIR filter with fixed filter coefficients. The fixed filter coefficients may be derived from measurements of the room. In another embodiment, an adaptive filter can be used to estimate the early reflections of the speaker signal. A detailed embodiment of an adaptive estimation filtering unit can be found in
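As one hypothetical realization of such an adaptive estimation filter (the actual unit 307 may differ), the sketch below runs a normalized least mean squares (NLMS) adaptation independently in each frequency band of the microphone spectrum X and the speaker spectrum Y; the tap count L and the step size mu are assumed values.

import numpy as np

def nlms_early_reflection_estimate(X, Y, L=8, mu=0.3, eps=1e-8):
    # X, Y: complex spectrograms shaped (frames, bands).
    # Returns Yest (estimated early reflections) and the residual E = X - Yest.
    n_frames, n_bands = X.shape
    W = np.zeros((n_bands, L), dtype=complex)      # one L-tap filter per band
    Yest = np.zeros_like(X)
    E = np.zeros_like(X)
    for t in range(n_frames):
        for m in range(n_bands):
            # most recent L speaker frames for band m (newest first, zero-padded)
            y_hist = Y[max(0, t - L + 1): t + 1, m][::-1]
            y_vec = np.zeros(L, dtype=complex)
            y_vec[:len(y_hist)] = y_hist
            Yest[t, m] = np.vdot(W[m], y_vec)      # filter output W^H y
            E[t, m] = X[t, m] - Yest[t, m]         # residual after early-reflection removal
            norm = np.real(np.vdot(y_vec, y_vec)) + eps
            W[m] += (mu / norm) * y_vec * np.conj(E[t, m])   # NLMS coefficient update
    return Yest, E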
The estimation filtering unit removes the early reflections of the speaker signal. The output of the estimation filtering unit consists of the user speech signal with a certain amount of residual noise, which is largely caused by the late reflections of the speaker signal. The noise transformation unit uses the estimated early reflections of the speaker signal from the estimation filtering unit to derive a representation of the late reflections of the speaker signal. The goal is to generate a noise reference that is statistically similar to the noise component which remains in the output of the estimation filtering unit. The noise transformation unit also generates a plurality of speech probability measures Pspeech(t, m) to indicate the amount of near end user speech signal present in the estimated early reflections signal, where t represents the t-th frame and m represents the m-th frequency band. A detailed embodiment of a noise transformation unit is represented in
Noise reduction unit 311 is used to further reduce late reflection components from the speech bands. An exemplary embodiment can be found in
A configuration decision unit 312 is used to route the processing into two branches according to a system configuration parameter. In one embodiment, only one of the two branches is processed. In another embodiment, both branches are processed. One processing branch 314 aims to improve speech quality for a human listener. The other processing branch 313 focuses on improving the recognition rate of the ASR. In order to adequately suppress noise, the noise reduction unit 311 may remove a significant amount of low frequency content from the speech signal. Thus, the speech signal sounds thin and unnatural when the bass components are lost. In the speech enhancement branch 314 for human listeners, spectrum content analysis is performed and the lower frequency bands can be reconstructed 320. In one embodiment, Blind Bandwidth Extension is used to reconstruct the bass part of the speech spectrum. In another embodiment, the Pspeech(t, m) generated by the noise transformation unit 308 is compared to a threshold to generate a binary decision. An exemplary value for the threshold may be 0.5. The binary decision is used to determine whether to reconstruct the t-th frame and the m-th frequency band. In yet another embodiment, the reconstructed low frequency bands after Blind Bandwidth Extension are multiplied with the corresponding Pspeech(t, m) to generate a new set of reconstructed speech bands. This new set of reconstructed speech bands is transformed back to the time domain to be transmitted to the VOIP channels. In one exemplary embodiment, the transformation from the frequency domain to the time domain can be implemented using the Inverse Fast Fourier Transform (IFFT). In other embodiments, filter bank reconstruction techniques can be utilized.
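A minimal sketch of the band reconstruction weighting described above is given below. It assumes the reconstructed low frequency bands have already been produced by a Blind Bandwidth Extension stage (not shown); the 0.5 threshold follows the exemplary value in the text, and the function and argument names are illustrative.

import numpy as np

def weight_reconstructed_bands(recon_low, p_speech, binary=False, threshold=0.5):
    # recon_low, p_speech: arrays shaped (frames, low_bands).
    # Returns the new set of reconstructed speech bands.
    if binary:
        # binary decision per frame and band: keep only bands where speech is likely
        return recon_low * (p_speech > threshold)
    # soft weighting: scale each reconstructed band by its speech probability
    return recon_low * p_speech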
In the processing branch for ASR 313, a formant emphasis filter 315 is used to emphasize the spectrum peaks of the cleaned speech while maintaining the spectral integrity of the signal. It can improve the Word Error Rate (WER) and the confidence score of the ASR engine. One embodiment of the emphasis filter is illustrated in the
When the near end user speech signal is absent from the microphone signal, the signal E(t, m) contains mostly the late reflections of the signal Y(t, m); the signal E(t, m) is highly correlated with Y(t, m); and the signal Yest(t, m) approaches the true estimate of the early reflections of Y(t, m). Alternatively, when the near end user speech is present in the microphone signal, E(t, m) contains the late reflections of Y(t, m) and the near end user speech, and E(t, m) is less correlated with Y(t, m). Due to the nature of the adaptation processes used in the estimation filtering unit 307, Yest(t, m) contains a mix of the early reflections estimate and a small portion of the near end user speech signal. A speech probability measure Pspeech(t, m) is used to indicate the degree of presence of near end user speech within Yest(t, m). Both Yest(t, m) and Pspeech(t, m) are used in block 509 to derive the estimated noise N(t, m). In one embodiment of the present invention, a set of measures is calculated in block 505. The measures Re(t), Rx(t), Ry(t) and Ryest(t) represent the spectrum energies of E, X, Y and Yest at a given time. Rex(t, m) is the cross correlation between E and X for the t-th frame and the m-th frequency band. Rey(t, m) is the cross correlation between E and Y for the t-th frame and the m-th frequency band. Block 506 calculates the ratio R(t, m). The value of R is proportional to the value of Re and inversely proportional to Rey. The value of R is also inversely proportional to the difference between Rx and Ryest. In one embodiment, R(t, m) is a multiplication of several terms, which can be expressed as follows,
R(t, m) = 1/((Rey(t, m)/Ry(t)) * (Rex(t, m)/Rx(t)) * (Ryest(t)/Re(t)))
In another embodiment, R(t, m) can be calculated recursively as,
R(t, m) = alpha_R*R(t−1, m) + (1−alpha_R)/((Rey/Ry)*(Rex/Rx)*(Ryest/(Rx−Ryest)))
where alpha_R is a smoothing constant, 0<alpha_R<1.
In yet another embodiment, R(t, m) is calculated using different equations depending on the values of Rx(t), Ry(t), Ryest(t) and on the convergence state of the adaptive filter 403. Pspeech(t, m) can be obtained by smoothing R(t, m) across several time frames and across several adjacent frequency bands. In one embodiment, a moving average filter can be used to achieve the smoothing effect. In another embodiment, the measures Re, Rx, Ry, Ryest, Rex and Rey can be smoothed across time frames and frequency bands before the ratio R(t, m) is calculated.
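The sketch below illustrates the non-recursive form of R(t, m) together with a moving average smoothing across time frames and adjacent frequency bands to obtain Pspeech(t, m); the window sizes and the final clipping to the range [0, 1] are assumptions made only for illustration.

import numpy as np

def speech_probability(Re, Rx, Ry, Ryest, Rex, Rey, t_win=3, m_win=3, eps=1e-12):
    # Re, Rx, Ry, Ryest: per-frame spectrum energies shaped (frames,).
    # Rex, Rey: cross correlations shaped (frames, bands).
    # R(t, m) = 1 / ((Rey/Ry) * (Rex/Rx) * (Ryest/Re))
    R = 1.0 / ((Rey / (Ry[:, None] + eps)) *
               (Rex / (Rx[:, None] + eps)) *
               (Ryest[:, None] / (Re[:, None] + eps)) + eps)
    # moving average smoothing across time frames and adjacent frequency bands
    kernel = np.ones((t_win, m_win)) / (t_win * m_win)
    pad_t, pad_m = t_win // 2, m_win // 2
    Rp = np.pad(R, ((pad_t, pad_t), (pad_m, pad_m)), mode='edge')
    smoothed = np.empty_like(R)
    for t in range(R.shape[0]):
        for m in range(R.shape[1]):
            smoothed[t, m] = np.sum(kernel * Rp[t:t + t_win, m:m + m_win])
    return np.clip(smoothed, 0.0, 1.0)   # Pspeech(t, m)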
In block 509, the noise estimation N(t, m) may be obtained as a weighted sum of Yest(t, m) and a function of prior Yest values, which can be expressed as:
N(t, m) = (1−Pspeech(t, m))*Yest(t, m) + F[(1−Pspeech(t−i, j))*Yest(t−i, j)];
where i < t; 1 < j < max number of bands; F[ ] is a function.
In one embodiment, F[ ] can be a weighted linear combination of the previous elements of Yest. Since the late reflections energy decays exponentially, the i term can be limited to the frames within the first 100 milliseconds preceding the current frame. In one embodiment, the weights used in the linear combination may be the same across all previous elements of Yest. In another embodiment, the weights used in the linear combination decrease exponentially, where the newer elements of Yest receive larger weights than the older elements. In another embodiment, N(t, m) may be derived recursively as follows,
A(1,m)=P(1, m)*Yest(1, m);
B(1, m)=P(1, m)*Yest(1, m)−Yest(0, m);
A(t−1, m)=beta1*P(t−1,m)*Yest(t−1, m)+(1−beta1)*(A(t−2, m)−B(t−2, m));
B(t−1, m)=beta2*(A(t−1, m)−A(t−2, m))+(1−beta2)*B(t−2, m);
N(t, m)=P(t, m)*Yest(t, m)+P(t−1, m)*C_decay*(A(t−1, m)+B(t−1,m));
where P(t, m)=1−Pspeech(t, m);
beta1 is a constant, beta1 is within the range of 0.0 to 1.0;
beta2 is a constant, beta2 is within the range of 0.0 to 1.0;
C_decay is a constant, C_decay is within the range of 0.0 to 1.0.
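For illustration, the recursive noise estimate defined by the equations above, with P(t, m) = 1 − Pspeech(t, m), could be implemented as sketched below; the particular values chosen for beta1, beta2 and C_decay are assumptions within the stated 0.0 to 1.0 range.

import numpy as np

def recursive_noise_estimate(Yest, Pspeech, beta1=0.9, beta2=0.5, c_decay=0.8):
    # Yest, Pspeech: arrays shaped (frames, bands). Returns N shaped like Yest.
    P = 1.0 - Pspeech
    n_frames, _ = Yest.shape
    N = np.zeros_like(Yest)
    A = np.zeros_like(Yest)
    B = np.zeros_like(Yest)
    # initial conditions from the equations above
    A[1] = P[1] * Yest[1]
    B[1] = P[1] * Yest[1] - Yest[0]
    for t in range(2, n_frames):
        if t >= 3:   # A[1], B[1] come from the initial conditions
            A[t - 1] = beta1 * P[t - 1] * Yest[t - 1] + (1 - beta1) * (A[t - 2] - B[t - 2])
            B[t - 1] = beta2 * (A[t - 1] - A[t - 2]) + (1 - beta2) * B[t - 2]
        # noise estimate combines the current frame with the decayed recursive terms
        N[t] = P[t] * Yest[t] + P[t - 1] * c_decay * (A[t - 1] + B[t - 1])
    return N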
In one embodiment, the noise reduction unit 311 derives the noise reduction gain from the estimation filtering unit output E(t, m), the noise estimate N(t, m) and the speech probability measure Pspeech(t, m) using the following procedure, where Var_N(t, m) denotes the variance of the noise estimate:
1) calculate the a posteriori SNR post(t, m),
post(t, m) = power[E(t, m)]/Var_N(t, m)
2) calculate a priori SNR prior(t,m),
prior(t, m) = a*S(t−1, m)/Var_N(t−1, m) + (1−a)*P[post(t, m)−1]
3) calculate a ratio U(t, m);
U(t, m)=prior(t, m)*post(t, m)/(1+prior(t, m))
4) calculate a Minimum Mean Squared Error (MMSE) estimator gain Gm(t, m);
Gm(t, m) = (sqrt(PI)/2) * (sqrt(U(t, m))/post(t, m)) * exp(−U(t, m)/2) * ((1+U(t, m))*I0[U(t, m)/2] + U(t, m)*I1[U(t, m)/2])
5) calculate the noise reduction gain G(t, m);
G(t, m) = Pspeech(t, m)*Gm(t, m) + (1−Pspeech(t, m))*Gmin
6) apply the noise reduction gain G(t, m) to E(t, m) to obtain the cleaned speech S(t, m);
S(t, m)=G(t, m)* E(t, m);
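For illustration only, the six steps above can be implemented per frame as sketched below. The exponentially scaled Bessel functions i0e and i1e from scipy are used to evaluate exp(−U/2)*I0(U/2) and exp(−U/2)*I1(U/2) without overflow; the smoothing constant a, the gain floor Gmin, the use of the squared magnitude of S(t−1, m) in the decision-directed a priori SNR, and the half-wave rectifier for P[ ] are assumptions of this sketch.

import numpy as np
from scipy.special import i0e, i1e

def mmse_noise_reduction_frame(E_t, S_prev, var_N_t, var_N_prev, Pspeech_t,
                               a=0.98, Gmin=0.1, eps=1e-12):
    # E_t: noisy speech spectrum of frame t (complex, shape (bands,)).
    # S_prev: cleaned speech spectrum of frame t-1; var_N_*: noise variance estimates.
    post = np.abs(E_t) ** 2 / (var_N_t + eps)                       # 1) a posteriori SNR
    prior = a * np.abs(S_prev) ** 2 / (var_N_prev + eps) \
            + (1 - a) * np.maximum(post - 1.0, 0.0)                 # 2) a priori SNR
    U = prior * post / (1.0 + prior)                                # 3) ratio U(t, m)
    Gm = (np.sqrt(np.pi) / 2.0) * (np.sqrt(U) / (post + eps)) * \
         ((1.0 + U) * i0e(U / 2.0) + U * i1e(U / 2.0))              # 4) MMSE estimator gain
    G = Pspeech_t * Gm + (1.0 - Pspeech_t) * Gmin                   # 5) noise reduction gain
    return G * E_t                                                  # 6) cleaned speech S(t, m)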
In one embodiment, the Wiener filter gain is used in the 4-th step of the above procedure to derive the noise reduction gain. In another embodiment, the Log-Spectral Amplitude (LSA) estimator is used in the 4-th step. In yet another embodiment, the Optimally Modified LSA (OM-LSA) estimator is used in the 4-th step.
In one embodiment of the formant emphasis filter 315, a formant emphasis gain is calculated from the speech probability measure as,
G_formant(t, m) = Kconst*Pspeech(t, m)/Pspeech_max(t);
where Kconst is a constant and Kconst>1.0; Pspeech_max(t) is the maximum value of Pspeech(t, m) in the t-th frame across the frequency bands. In one embodiment, the gain G_formant(t, m) is applied to part of the cepstral coefficients. The zero order and first order cepstral coefficients are not gain adjusted, in order to preserve the spectrum tilt. The cepstral coefficients beyond the 30th order are also unaltered, as those coefficients do not significantly change the formant spectrum shape. The new cepstral coefficients are then transformed back to the frequency domain by the Inverse Discrete Cosine Transform (IDCT). The resulting new speech spectrum SE(t, m) has higher formant peaks and lower formant valleys, which can improve the ASR recognition rate.
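A sketch of this cepstral-domain formant emphasis is given below. It assumes the cleaned speech log magnitude spectrum is converted to cepstral coefficients with a Discrete Cosine Transform (the scipy DCT/IDCT pair is an implementation assumption), and Kconst = 1.5 is an illustrative value.

import numpy as np
from scipy.fft import dct, idct

def formant_emphasis(log_spectrum_t, p_speech_t, k_const=1.5):
    # log_spectrum_t: log magnitude spectrum of frame t, shape (bands,).
    # p_speech_t: Pspeech(t, m) for the same frame. Returns the emphasized spectrum SE.
    # G_formant(t, m) = Kconst * Pspeech(t, m) / Pspeech_max(t)
    g_formant = k_const * p_speech_t / (np.max(p_speech_t) + 1e-12)
    cep = dct(log_spectrum_t, norm='ortho')      # cepstral coefficients
    new_cep = cep.copy()
    hi = min(31, len(cep))
    # adjust only orders 2..30: orders 0 and 1 preserve the spectrum tilt,
    # and orders beyond 30 are left unaltered, following the text
    new_cep[2:hi] = cep[2:hi] * g_formant[2:hi]
    return idct(new_cep, norm='ortho')           # SE(t, m): emphasized speech spectrum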
The foregoing description of the embodiments of the invention has been presented for the purpose of illustration; it is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Persons skilled in the relevant art can appreciate that many modifications and variations are possible in light of the above teachings. It is therefore intended that the scope of the invention be limited not by this detailed description, but rather by the claims appended hereto.
This application claims the benefit of U.S. Provisional Application No. 61/674,361, filed Jul. 22, 2012, which is hereby incorporated by reference in its entirety.