This invention relates to a noise-resistant utterance detector and more particularly to data processing for such a detector.
Typical speech recognizers require at the input thereof an utterance detector 11 to indicate where to start and to stop the recognition of the incoming speech stream. See
In applications such as hands-free speech recognition in a car driven on a highway, the signal-to-noise ratio is typically around 0 dB; that is, the energy of the noise is about the same as that of the signal. While speech energy gives good results for clean to moderately noisy speech, it is clearly not adequate for reliable detection under such noisy conditions.
In accordance with one embodiment of the present invention a solution for performing endpoint detection of speech signals in the presence of background noise includes noise adaptive spectral extraction.
In accordance with another embodiment of the present invention a solution for performing endpoint detection of speech signals in the presence of background noise includes noise adaptive spectral extraction and inverse filtering.
In accordance with another preferred embodiment of the present invention a solution for performing endpoint detection of speech signals in the presence of background noise includes noise adaptive spectral extraction, inverse filtering, and spectral reshaping.
Frame-Level Speech Detection
Speech/non-speech Decision Parameter
In speech utterance detection, two components are identified. The first component 11 makes a speech/non-speech decision for each incoming speech frame as illustrated in
A preferred embodiment of the present invention provides speech utterance detection by noise-adaptive spectrum extraction (NASE) 15, frequency-domain inverse filtering 17, and spectrum reshaping 19 before autocorrelation 21, as illustrated by the block diagram of
Autocorrelation Function
For resistance to noise, the periodicity of the speech signal, rather than its energy, is used. Specifically, an autocorrelation function derived from the speech X(t) is used, defined as:
R_X(τ) = E[X(t) X(t+τ)]  (1)
where X(t) is the observed speech signal at time t.
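As a concrete illustration, equation 1 can be estimated on a finite analysis frame by a sample autocorrelation. The Python sketch below is illustrative only; the biased 1/N normalization and the function name are assumptions and are not taken from the specification.

import numpy as np

def sample_autocorrelation(frame, max_lag):
    # Biased sample estimate of R_X(tau) = E[X(t) X(t+tau)] for tau = 0 .. max_lag.
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    return np.array([np.dot(frame[:n - tau], frame[tau:]) / n
                     for tau in range(max_lag + 1)])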
Important properties of R_X(τ) include:
Therefore, we have, for large τ:
R_X(τ) ≈ R_S(τ)  (5)
This property shows that the autocorrelation function has some noise immunity: at lags large enough for the noise autocorrelation to have decayed, R_X(τ) is dominated by the speech term R_S(τ).
Search for Periodicity
As the speech signal typically contains periodic waveforms, periodicity can be used as an indication of the presence of speech. The periodicity measure ρ is defined as:
T_l and T_h are pre-specified so that the period found corresponds to a pitch frequency between 75 Hz and 400 Hz. A larger value of ρ indicates a high energy level at the lag at which ρ is found. According to the present invention, the signal is decided to be speech if ρ is larger than a threshold. The threshold is set to be larger than typical values of R_X(τ) for non-speech frames.
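For illustration only, a minimal sketch of the resulting frame-level decision is given below; the 8 kHz sampling rate, the normalization of ρ by R_X(0), and the threshold value are assumptions for this sketch and are not values specified by the invention.

import numpy as np

def is_speech_frame(frame, fs=8000, threshold=0.25):
    # Decide speech/non-speech from the periodicity of one frame.
    # R_X(tau) is searched over lags corresponding to pitch frequencies of 75-400 Hz.
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    t_l = int(fs / 400)                      # shortest period of interest (400 Hz)
    t_h = int(fs / 75)                       # longest period of interest (75 Hz)
    r = np.array([np.dot(frame[:n - tau], frame[tau:]) / n
                  for tau in range(t_h + 1)])
    rho = r[t_l:t_h + 1].max() / (r[0] + 1e-12)   # periodicity measure (normalization assumed)
    return rho > threshold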
Noise-adaptive Spectrum Extraction (NASE)
Outline
Applicants teach to use ρ as the parameter for speech/non-speech decision in an utterance detector. For adequate performance, the input to the autocorrelation function, X(t), must be enhanced. Such enhancement can be achieved in the power-spectral representation of X(t), using the proposed noise-adaptive pre-processing.
The input is the power spectrum of noisy speech (pds_signal[]) and the output is the power spectrum of clean speech in the same memory space. The following steps illustrated in
Sequence A consists of the initialization stage. Sequence B consists of the main processing block applied to every frame of the input signal.
For sequence A, noise-adaptive processing initialization:
For sequence B, noise adaptive processing main section:
For i = 0 to freq_nbr do
end
If frm_count = 10, then γ = γ_MIN fi.
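The per-bin operations inside this loop are detailed in the referenced figure and are not reproduced above. Purely to illustrate the overall structure that the text does describe (a noise estimate built up over the first frames, an over-subtraction factor γ lowered to γ_MIN once frm_count reaches 10, and in-place replacement of pds_signal[]), a generic spectral-subtraction style sketch in Python follows. The update rules, constants, and all names other than pds_signal, frm_count, γ, and γ_MIN are assumptions and do not reproduce the patented NASE procedure.

import numpy as np

GAMMA_INIT = 2.0      # assumed initial over-subtraction factor
GAMMA_MIN = 1.0       # gamma after the initialization frames (value assumed)
INIT_FRAMES = 10      # frames used to initialize the noise estimate
FLOOR = 1e-3          # assumed spectral floor

class NoiseAdaptiveExtraction:
    # Illustrative spectral-subtraction structure; not the patented update rules.
    def __init__(self, freq_nbr):
        self.noise_est = np.zeros(freq_nbr)   # running noise power-spectrum estimate
        self.frm_count = 0
        self.gamma = GAMMA_INIT

    def process(self, pds_signal):
        # Replace the noisy power spectrum pds_signal[] in place with an enhanced spectrum.
        self.frm_count += 1
        if self.frm_count <= INIT_FRAMES:
            # running average of the noise estimate during the initialization stage
            # (frozen afterwards in this sketch)
            self.noise_est += (pds_signal - self.noise_est) / self.frm_count
        if self.frm_count == INIT_FRAMES:
            self.gamma = GAMMA_MIN
        for i in range(len(pds_signal)):      # for i = 0 .. freq_nbr
            cleaned = pds_signal[i] - self.gamma * self.noise_est[i]
            pds_signal[i] = max(cleaned, FLOOR * pds_signal[i])
        return pds_signal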
Spectral Inverse Filtering
Outline
The production of speech sounds by humans is dictated by the source/vocal-tract principle. The speech signal s(n) is thought to be produced by the source signal u(n) (the excitation generated at the larynx through the vocal cords) modulated by the vocal tract filter h(n), which resonates at some characteristic formant frequencies. In other words, the speech spectrum S(ω) is the result of the multiplication (convolution in the time domain) of the excitation spectrum U(ω) by the vocal tract transfer function H(ω):
S(ω)=U(ω)×H(ω) (7)
For many speech applications, it is important to apply the inverse vocal tract filtering operation to perform analysis on the excitation signal u(n).
Since equation 6 focuses on the periodicity in the range of the excitation signal only, and not on the periodicity induced by the formant frequencies, inverse filtering the speech signal to restore a good approximation of the unmodulated speech signal improves the endpoint detection performance.
Detailed Description
Typically, the vocal tract filter is estimated using linear prediction techniques. The coefficients α_k of the auto-regressive prediction filter are computed by minimizing the mean-square prediction error.
In the present application, instead of basing the inverse filtering operation on the often-used Linear Prediction (LP) filter, applicants teach performing the inverse filtering operation based on a normalized approximation of the envelope of the short-term speech spectrum, derived from the local maxima of that spectrum. The advantage is that applicants avoid the computation of the LP coefficients and of the corresponding spectrum. Selecting local maxima in the short-term spectrum is an extremely simple task, especially considering the low resolution of the short-term spectrum (128 frequency points). Note that since we never operate in the time domain to find an estimate of the vocal tract filter, the inverse filtering itself is performed in the log frequency domain (dB) and is implemented by simply removing (subtracting) the estimated envelope spectrum from the original spectrum.
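To make this explicit, taking 20 log10 of both sides of equation 7 turns the product into a sum in the dB domain, so subtracting an envelope estimate Ĥ_dB(ω) (this notation is introduced here only for illustration) leaves an approximation of the excitation spectrum:

S_dB(ω) = U_dB(ω) + H_dB(ω), and therefore S_dB(ω) − Ĥ_dB(ω) ≈ U_dB(ω).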
Determination of the inverse filter by use of the spectrum maxima and the inverse filtering operation is performed by the steps in
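By way of illustration only, a minimal Python sketch of such an operation follows; the linear interpolation between maxima, the zero-mean normalization of the envelope, and the handling of frames with fewer than two maxima are assumptions made for this sketch rather than the exact steps of the referenced figure.

import numpy as np

def inverse_filter_db(spectrum_db):
    # Subtract from a short-term spectrum (in dB) an envelope estimated from its local maxima.
    spectrum_db = np.asarray(spectrum_db, dtype=float)
    n = len(spectrum_db)                         # e.g. 128 low-resolution frequency points
    peaks = [i for i in range(1, n - 1)
             if spectrum_db[i] >= spectrum_db[i - 1] and spectrum_db[i] >= spectrum_db[i + 1]]
    if len(peaks) < 2:                           # too few maxima to define an envelope
        return spectrum_db - spectrum_db.mean()
    envelope = np.interp(np.arange(n), peaks, spectrum_db[peaks])   # piecewise-linear envelope (assumed)
    envelope -= envelope.mean()                  # normalized approximation of the envelope
    return spectrum_db - envelope                # inverse filtering = subtraction in the dB domain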
Spectral Reshaping
Outline
The spectral reshaping technique allows the inverse filtering technique based on the envelope of the maxima to operate properly even when the first two formants in the speech signal are close together, such as in the /u/ or /ow/ sound. Indeed, in this case the formants are so close that no valley can be determined in the spectrum between the maxima of the formant frequencies, and the envelope spectrum resembles a large dome in the low-frequency domain. The consequence is that the entire low-frequency spectrum is excessively inverse filtered and it is difficult to notice the voicing of the excitation in the resulting spectrum. The solution is to implement a detector at the input of the spectrum re-shaper 19 (see
Detailed Description
First, the short-term speech spectrum of the speech frame is normalized to a mean equal to zero dB. Then, a battery of tests is performed to detect the presence of two close low-frequency formants. If we determine the following parameters,
In applicants' preferred embodiment, the values of the parameters are set to τ_1 = 3.25 dB, τ_2 = 3.00 dB, τ = 1.25 dB, λ_min = 12, λ_max = 20, δ_min = 8 and δ_max = 16.
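The tests themselves and the reshaping operation are specified in the referenced figure rather than in the text above. Purely as an illustration of the detector part, the Python sketch below normalizes the short-term dB spectrum to a zero mean and flags frames in which the two strongest low-frequency maxima have no pronounced valley between them. The bin range, the valley-depth threshold, and all names are assumptions and do not reproduce the tests based on τ_1, τ_2, τ, λ and δ.

import numpy as np

def close_low_formants(spectrum_db, low_bins=40, valley_depth_db=3.0):
    # Flag frames whose two strongest low-frequency maxima form a single broad dome
    # (no valley of at least valley_depth_db between them).  Thresholds are illustrative.
    s = np.asarray(spectrum_db, dtype=float)
    s = s - s.mean()                              # normalize to a zero-dB mean
    low = s[:low_bins]                            # assumed low-frequency region
    peaks = [i for i in range(1, len(low) - 1)
             if low[i] >= low[i - 1] and low[i] >= low[i + 1]]
    if len(peaks) < 2:
        return True                               # only one dome visible: treat as close formants
    top_two = sorted(sorted(peaks, key=lambda i: low[i], reverse=True)[:2])
    valley = low[top_two[0]:top_two[1] + 1].min()
    depth = min(low[top_two[0]], low[top_two[1]]) - valley
    return depth < valley_depth_db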
Validation Experiments
Illustration of Functioning
Noise-adaptive Spectrum Extraction (NASE)
To illustrate the effectiveness of the noise-adaptive processing, the utterance “695-6250” was processed and the result is plotted in
Spectral Inverse Filtering
To illustrate the effectiveness of the inverse filtering technique, the utterance “Taylor Dean” was processed and the normalized autocorrelation results are plotted in dB in
Spectral reshaping manifests itself only in frames for which the detector signaled the presence of two close low-frequency formants, and a visual inspection might therefore not immediately show its advantage. The results presented in the following paragraph and in Table 1 illustrate the additional gain that can be obtained by using the technique.
Utterance Detection Assessment
To evaluate the performance improvement due to the three methods, a speech database was collected in automobile environments. The signal was recorded using a hands-free microphone mounted on the visor. Five vehicles were used for recording, representing several automobile categories.
Table 1 summarizes the test results. On average the first method reduces the detection errors by about an order of magnitude. The other two methods further reduce the remaining error by more than 50 percent.
The amount of additional reduction in detection errors offered by the inverse filtering technique over the noise-adaptive spectral extraction clearly illustrates the complementarity of the two techniques. While NASE helps minimize the autocorrelation of the background noise by removing it, it does not help in finding the voicing information within the speech signal. The inverse filtering technique, on the other hand, is able to extract the periodic voicing information from the speech signal, but is insufficient to remove the autocorrelation created by the background noise. In terms of noise characteristics, it can be stated that NASE operates efficiently on slowly time-varying noises with broad spectra (almost white), while inverse filtering is able to remove noises with sharp spectral characteristics (almost tones).
It should be pointed out that the remaining 1 percent of detection error can often be attributed to an external cause over which the endpoint detector has little control, such as paper friction or speaker aspiration.
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made without departing from the spirit and scope of the invention.