Method and system for FFT-based companding for automatic speech recognition

Description

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1A and 1B are block diagrams of a system and method for companding speech signals according to an embodiment of the invention;

FIG. 2 is a block diagram of a compressor of FIG. 1A;

FIG. 3 shows graphs of outputs at various stages of a channel for a mixture of three tones according to an embodiment of the invention; and

FIGS. 4A and 4B are narrow-band spectrograms of a speech signal before and after companding according to an embodiment of the invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

FFT-Based Companding

The embodiments of our invention provide a method and system for fast Fourier transform (FFT) based companding of speech signals to be processed by an automated speech recognition (ASR) system. Our FFT-based companding method mimics two-tone suppression as described above. Performing the FFT greatly improves the processing efficiency of the companding system and method according to the embodiments of the invention, making the method and system practical for real-time ASR systems.

Companding

FIGS. 1A and 1B show a front end for an automated speech recognition (ASR) system 160. The front end includes an FFT block 110, multiple companding channels 105, and an optional adder 155.

Each channel 105 includes a broadband spatial filter stage 120, an n power exponent compressor stage 130, a narrowband spatial filter stage 140, and a 1/n power exponent expander stage 150, connected serially.

Input to the system is a speech signal 101. In the preferred embodiment, the speech signal is corrupted with noise. For example, the speech signal is acquired in a moving vehicle.

The input signal can be sampled at 8 or 16 kHz into overlapping analysis frames. Each analysis frame can include data from 25 ms of the input signal 101, and temporally adjacent frames are overlapped by 15 ms.
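The framing described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function name `frame_signal`, the use of NumPy, and the choice to drop a trailing partial frame are assumptions made for the example. With 25 ms frames overlapped by 15 ms, consecutive frames start 10 ms apart.

```python
import numpy as np

def frame_signal(signal, sample_rate_hz, frame_ms=25.0, overlap_ms=15.0):
    """Split a 1-D signal into overlapping analysis frames, one frame per row.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    frame_len = int(round(sample_rate_hz * frame_ms / 1000.0))
    overlap_len = int(round(sample_rate_hz * overlap_ms / 1000.0))
    hop_len = frame_len - overlap_len          # 10 ms hop for 25 ms / 15 ms
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    num_frames = 1 + (len(signal) - frame_len) // hop_len
    starts = hop_len * np.arange(num_frames)
    return np.stack([signal[s:s + frame_len] for s in starts])

# One second of a placeholder signal sampled at 8 kHz.
frames = frame_signal(np.arange(8000, dtype=float), 8000)
```

At 8 kHz each frame holds 200 samples and the hop is 80 samples; at 16 kHz the same durations give 400 and 160 samples.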

ASR

The ASR system 160 includes Mel filters 161, a discrete cosine transform (DCT) and cepstral mean subtraction (CMS) block 162, followed by a hidden Markov model (HMM) speech recognizer 163. The output of the system is recognized speech 103. The output can be in the form of text, phonemes, or lattice-based speech representations, such as word lattices and phoneme lattices.
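As a rough illustration of the DCT and CMS block 162, the sketch below turns a matrix of Mel-band energies (frames by bands) into mean-normalized cepstra. Several details are assumptions not stated in the text: the logarithm taken before the DCT, the unnormalized DCT-II basis, the default of 13 coefficients, the energy floor, and the name `cepstra_with_cms`.

```python
import numpy as np

def cepstra_with_cms(mel_energies, num_ceps=13, floor=1e-10):
    """Log Mel energies -> DCT-II cepstra -> cepstral mean subtraction.

    mel_energies: array of shape (num_frames, num_mel_bands), non-negative.
    """
    log_mel = np.log(np.maximum(mel_energies, floor))   # assumed log compression
    num_bands = log_mel.shape[1]
    k = np.arange(num_ceps)[:, None]
    m = np.arange(num_bands)[None, :]
    dct_basis = np.cos(np.pi * k * (m + 0.5) / num_bands)  # unnormalized DCT-II
    cepstra = log_mel @ dct_basis.T
    # CMS: subtract the per-coefficient mean taken over all frames.
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# Placeholder Mel energies for 50 frames and 20 bands.
demo_mel = np.abs(np.random.default_rng(0).standard_normal((50, 20))) + 0.1
demo_cepstra = cepstra_with_cms(demo_mel)
```

After CMS every cepstral coefficient has zero mean over the utterance, which removes a constant channel offset in the cepstral domain.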

Companding Channels

The broadband filter, compressor, narrowband filter, and expander are implemented as multiple, non-coupled, parallel channels. There is one channel for each of a set of narrow frequency bands that together span the frequencies found in speech signals. For example, there are nine equally spaced frequency bands.

Each channel includes the four serially connected stages: the broadband F filter 120, the compressor 130, the narrowband G filter 140, and the expander 150. The outputs from the channels can be combined (summed) 155 to yield an output signal 102 with enhanced spectral peaks. Alternatively, the outputs can be used without summation, and features can be determined directly from the channel outputs.

The output signal can be provided to the automatic speech recognition (ASR) system 160. The broadband filter and the narrowband filter in every channel 105 have the same resonant frequency. The resonant frequencies of the various channels are equally spaced and span a desired spectral range, for example, the spectra of speech signals. The broadband filter 120 determines a set of frequencies for the channel that affects a gain of the compressor.

As shown in FIG. 2, the compressor 130 includes an envelope detector (ED) 131, a nonlinearity block 132, and a multiplier 133. The output of the envelope detector X_{1e}, which we denote by AMP(X₁), represents the amplitude of X₁, the output of the broadband filter. The nonlinearity raises the envelope of the signal to a power (n−1). As a result, the amplitude of X₂, the output of the multiplier, is approximately AMP(X₁)ⁿ. If n is less than 1, then this results in a compression of the output of the broadband filter.

The narrowband filter 140 selects only a narrower subset of the frequencies that are passed by the broadband filter.

The expander 150 is similar to the compressor and also includes an envelope detector 151, a nonlinearity block 152, and a multiplier 153. The output of the envelope detector X_{3e} represents the amplitude of X₃, the output of the narrowband filter. The nonlinearity block raises the envelope of the signal to a power (1−n)/n. Consequently, the amplitude of X₄, the output of the multiplier, is approximately AMP(X₃)^(1/n). If n is less than 1, then this results in an expansion of the output of the narrowband filter.

Consider the case where the input to a channel X is a first signal (primary tone) αcos(ω₁t), at time t, with a resonant frequency ω₁ for the channel. The broadband filter passes the signal unchanged, i.e., X₁=αcos(ω₁t), assuming a unit-gain, zero-phase filter, and X₂=αⁿcos(ω₁t).

The narrowband filter has a resonant frequency identical to the broadband filter. Therefore, the narrowband filter also passes the signal. Hence, an amplitude of the output of the narrowband filter is the same as an amplitude of the output of the compressor, i.e. X₃=αⁿcos(ω₁t).

The amplitude of the output of the channel X₄ is

AMP(X₃)^(1/n)=α, i.e., X₄=αcos(ω₁t).

The channel has no effect on the overall level of an isolated tone at the resonant frequency.

Now, consider the case where the input to the channel is a sum of a first signal (primary tone) at the resonant frequency ω₁ of the channel, and a second signal with a higher energy at an adjacent frequency ω₂, such that ω₂ lies within the bandwidth of the broadband filter, but outside that of the narrowband filter, i.e.,

X=αcos(ω₁t)+kαcos(ω₂t),

where the amplitude of the second signal is k times that of the first tone.

If the broadband filter passes both signals without modification, then

$X_{1} \approx \alpha \cos (\omega_{1} t) + k \alpha \cos (\omega_{2} t) .$

As an extreme case, we consider k>>1. The amplitude of X₁ is approximately kα, and

$X_{2} \approx k^{n - 1} \alpha^{n} \cos (\omega_{1} t) + k^{n} \alpha^{n} \cos (\omega_{2} t) .$

The narrowband filter does not pass the second signal at the adjacent frequency ω₂, hence X₃=k⁽ⁿ⁻¹⁾αⁿcos(ω₁t).

The expander expands the signal according to the amplitude of X₃, leading to

$X_{4} = k^{(n - 1) / n} \alpha \cos (\omega_{1} t) ,$

i.e., the output of the channel is the first signal at the resonant frequency, scaled by a factor k^((n−1)/n). Because k>1 and n<1, k^((n−1)/n)<1, i.e., the companding results in a suppression of the signal at the center frequency of the channel. The greater the energy of the second signal at the frequency ω₂, i.e., the larger the value of k, the greater the suppression of the signal at the center frequency.
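The two-tone derivation can be checked numerically by following the amplitude of the primary tone through the four stages under the same k>>1 approximation. The sketch below is only an illustration with α=1; the function name `primary_tone_gain` and the sample values of k and n are chosen for the example.

```python
def primary_tone_gain(k, n):
    """Gain of the channel on the primary tone when a k-times stronger tone
    lies inside the broadband filter but outside the narrowband filter.

    Uses the k >> 1 approximation of the text, with alpha = 1.
    """
    amp_x1 = k                                  # envelope of X1 is dominated by the strong tone
    x2_primary = amp_x1 ** (n - 1.0)            # compressor gain applied to the primary tone
    amp_x3 = x2_primary                         # narrowband filter removes the strong tone
    return amp_x3 ** ((1.0 - n) / n) * x2_primary   # expander

# Suppression grows with the strength k of the adjacent tone.
gains = [primary_tone_gain(k, n=0.35) for k in (2.0, 10.0, 100.0)]
```

Each value equals k^((n−1)/n), so the list is strictly decreasing and every entry is below 1 for 0<n<1.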

More generally, the process results in the enhancement of spectral peaks at the expense of signals at adjacent frequencies. Any sufficiently intense frequencies outside the range of the narrowband filter, but within the range of the broadband filter, set a conservatively low gain in the compressor and are filtered out by the narrowband filter. In this case, the gain of the compressor is set by one set of frequencies, while the gain of the expander is set by another set of frequencies, such that the gain in the expander does not undo the effect of the compressor.

The net effect is an overall suppression of weak narrowband signals in a channel by strong out-of-band signals. Note that these out-of-band signals in one channel are dominant signals in a neighboring channel where the signals are resonant.

FIG. 3 shows the outputs at various stages of a channel for a mixture of three tones. Consequently, the output spectrum of the filter bank has a local ‘winner-take-all’ characteristic. Effectively, strong spectral peaks in the input signal suppress or mask weaker neighboring signals, and signals with high signal-to-noise ratios (SNRs) are emphasized over signals with low SNRs.

FFT-Based Companding

The prior art companding is suited for low-power analog circuit implementations. However, a straightforward digital implementation of the prior art companding is computationally intensive.

Therefore, we describe a computationally efficient digital implementation of the companding based on the fast Fourier transform (FFT).

FIG. 2 shows the details of processing the signals in a single channel in the frequency domain. An FFT of the input speech signal 101 over an analysis frame is represented by X. Herein, upper case letters always refer to signals in the frequency domain. In our representation, X is a column vector with as many elements as the number of unique frequency bands in the frequency domain.
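As a small illustration of the vector X, the sketch below computes a 512-point FFT of one analysis frame and keeps only the unique (non-negative frequency) bands. The 8 kHz sampling rate, the 200-sample frame, the random placeholder data, and the absence of a window function are assumptions of this example.

```python
import numpy as np

SAMPLE_RATE_HZ = 8000
FFT_SIZE = 512

# Stand-in for one 25 ms analysis frame (200 samples at 8 kHz).
frame = np.random.default_rng(0).standard_normal(200)

# rfft zero-pads the frame to FFT_SIZE and returns FFT_SIZE // 2 + 1 bins.
X = np.fft.rfft(frame, n=FFT_SIZE)

# Center frequency of each band, from 0 Hz up to the Nyquist frequency.
bin_freqs_hz = np.fft.rfftfreq(FFT_SIZE, d=1.0 / SAMPLE_RATE_HZ)
```

For a real-valued frame, the 512-point transform therefore yields 257 unique complex-valued bands.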

The Fourier spectrum of the response of the broadband filter in the i-th channel is a vector F_i. The spectrum X_{i,1} of the output of the broadband filter is given by X_{i,1}=F_i⊗X, where ⊗ represents element-wise (Hadamard) multiplication. Note that the i in X_{i,1} denotes the i-th spectral channel, while the 1 denotes that the term corresponds to the signal X₁, i.e., the output of the broadband filter in the single-channel description above.

The ED block extracts the RMS value of its input such that X_{i,1e}=|X_{i,1}|, where the |·| operator represents the Euclidean norm of a vector. We also assume that the output of the ED is constant over the duration of the analysis frame. However, the output can change from frame to frame. The output of the envelope detector, a scalar over the duration of the frame, is raised to the power n−1 and multiplied by X_{i,1}. The spectrum of the output of the multiplier is therefore given by

$X_{i, 2} = {\lvert X_{i, 1} \rvert}^{n - 1} X_{i, 1} .$

The FFT of the impulse response of the narrowband filter in the i-th channel is G_i. The spectrum of the output of the narrowband filter is given by

$\begin{matrix} X_{i, 3} = G_{i} \otimes X_{i, 2} \\ = {\lvert X_{i, 1} \rvert}^{n - 1} G_{i} \otimes X_{i, 1} \\ = {\lvert F_{i} \otimes X \rvert}^{n - 1} G_{i} \otimes F_{i} \otimes X . \end{matrix}$

We define a filter H_i that is the combination of the F_i and G_i filters:

$H_{i} = F_{i} \otimes G_{i} = G_{i} \otimes F_{i} .$

Therefore, we can write

$X_{i, 3} = {\lvert F_{i} \otimes X \rvert}^{n - 1} H_{i} \otimes X .$

The second ED block determines the RMS value of X_{i,3}, i.e.,

$X_{i, 3 e} = {\lvert F_{i} \otimes X \rvert}^{n - 1} \lvert H_{i} \otimes X \rvert .$

The output of the second ED block is constant during the time of analysis of a frame. The output of the ED block is raised to a power (1−n)/n, and multiplied by X_{i,3}, the output of the narrowband filter. The spectrum of the output of the second multiplier is given by

$\begin{matrix} X_{i, 4} = {\lvert X_{i, 3 e} \rvert}^{(1 - n) / n} X_{i, 3} \\ = {({\lvert F_{i} \otimes X \rvert}^{n - 1} \lvert H_{i} \otimes X \rvert)}^{(1 - n) / n} {\lvert F_{i} \otimes X \rvert}^{n - 1} H_{i} \otimes X \\ = {\lvert F_{i} \otimes X \rvert}^{(n - 1) / n} {\lvert H_{i} \otimes X \rvert}^{(1 - n) / n} H_{i} \otimes X \end{matrix}$
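The four stages of a single channel in the frequency domain can be sketched stage by stage as below. This is an illustrative sketch rather than the implementation of the embodiment: the function name `compand_channel`, the small floor `eps` that guards against a zero envelope, and the 8-bin toy filters are assumptions made for the example.

```python
import numpy as np

def compand_channel(X, F_i, G_i, n, eps=1e-12):
    """Pass one frame spectrum X through a single companding channel."""
    X1 = F_i * X                                  # broadband filter (element-wise product)
    env1 = max(np.linalg.norm(X1), eps)           # first envelope detector (Euclidean norm)
    X2 = env1 ** (n - 1.0) * X1                   # compressor: scale by env1^(n-1)
    X3 = G_i * X2                                 # narrowband filter
    env3 = max(np.linalg.norm(X3), eps)           # second envelope detector
    return env3 ** ((1.0 - n) / n) * X3           # expander: scale by env3^((1-n)/n)

# Toy example: 8 bands, channel centered on band 3.
bands = np.arange(8)
F_i = np.clip(1.0 - np.abs(bands - 3) / 3.0, 0.0, None)   # wide triangular filter
G_i = (bands == 3).astype(float)                           # single-band filter

isolated = np.zeros(8, dtype=complex)
isolated[3] = 2.0                 # tone at the channel center only
masked = isolated.copy()
masked[4] = 20.0                  # add a ten times stronger neighbor

out_isolated = compand_channel(isolated, F_i, G_i, n=0.35)
out_masked = compand_channel(masked, F_i, G_i, n=0.35)
```

In this toy case the isolated tone passes through with its amplitude unchanged, while the same tone is strongly attenuated when the stronger neighbor falls inside the broadband filter.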

In one embodiment, the outputs of all the channels are summed 155. A spectrum of the summed signal is a sum of the spectra from the individual channels. Hence, the spectrum of the companded signal 102 is given by

$\begin{matrix} Y = \sum_{i} X_{i, 4} \\ = \sum_{i} {\lvert F_{i} \otimes X \rvert}^{(n - 1) / n} {\lvert H_{i} \otimes X \rvert}^{(1 - n) / n} H_{i} \otimes X \\ = (\sum_{i} {\lvert F_{i} \otimes X \rvert}^{(n - 1) / n} {\lvert H_{i} \otimes X \rvert}^{(1 - n) / n} H_{i}) \otimes X . \end{matrix}$

The above formulation is a combination of Hadamard multiplications, exponentiation, and summation that can be performed very efficiently. Note that by introducing a term J(X) such that

$J (X) = \sum_{i} {\lvert F_{i} \otimes X \rvert}^{(n - 1) / n} {\lvert H_{i} \otimes X \rvert}^{(1 - n) / n} H_{i},$

we can write

$Y = J (X) \otimes X .$
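The closed form Y=J(X)⊗X lends itself to a vectorized computation over all channels at once. The sketch below is one possible arrangement under stated assumptions: the filters are stored as rows of matrices `F` and `G` (channels by bands), the function name `compand_spectrum` and the floor `eps` are introduced for the example, and the 16-band filters and random spectrum are placeholder data.

```python
import numpy as np

def compand_spectrum(X, F, G, n, eps=1e-12):
    """Compute Y = J(X) (x) X for one frame.

    X: complex spectrum, shape (num_bands,)
    F, G: real filter banks, shape (num_channels, num_bands); row i is F_i or G_i.
    """
    H = F * G                                                  # H_i = F_i (x) G_i
    f_norm = np.maximum(np.linalg.norm(F * X, axis=1), eps)    # |F_i (x) X| per channel
    h_norm = np.maximum(np.linalg.norm(H * X, axis=1), eps)    # |H_i (x) X| per channel
    weights = f_norm ** ((n - 1.0) / n) * h_norm ** ((1.0 - n) / n)
    J = weights @ H                                            # signal-dependent filter J(X)
    return J * X

# Placeholder data: 16 bands, one channel per band.
rng = np.random.default_rng(0)
num_bands = 16
idx = np.arange(num_bands)
F = np.clip(1.0 - np.abs(idx[:, None] - idx[None, :]) / 3.0, 0.0, None)
G = np.eye(num_bands)
X = rng.standard_normal(num_bands) + 1j * rng.standard_normal(num_bands)
Y = compand_spectrum(X, F, G, n=0.35)
```

Because J(X) depends on X only through one scalar pair per channel, the per-frame cost is dominated by a few element-wise products and one matrix-vector product.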

It is clear from the above formulation that the effect of the companding is to filter the frequency domain signal X by a filter that is a function of the signal X itself. It is this non-linear operation that results in the desired enhancement of spectral contrast.

Mel-frequency spectral vectors are determined by multiplying Y by a matrix of Mel filters M:

Y_mel=MY

The companding method according to the invention has several parameters that can be adjusted to optimize speech recognition performance, namely the number of channels in the filter bank, the spacing of the center frequencies of the channels, the design of the broadband filters (the F filters) and the narrow-band filters (the G filters), and the companding factor n.

In the prior art companding method, the center frequencies of the F and G filters were spaced logarithmically.

In contrast, in the FFT-based companding method according to an embodiment of the invention, the filters are spaced linearly. In this embodiment, the filter bank has as many filters as the number of frequency bands in the FFT. The frequency responses of the broadband filters (the F filters) and the narrowband filters (the G filters) have a triangular shape. The G filters are much narrower than the F filters. The width of the F filters represents a spectral neighborhood that affects the masking of any frequency. The width of the G filters determines the selectivity of the masking.

The optimal values of the width of the F and G filters and the degree of companding n are determined by experimentation. The best performance is obtained with F filters that span nine frequency bands of a 512-point FFT of the signal, and G filters that span one frequency band. The optimal value of n is 0.35.
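A linearly spaced bank of triangular filters with these settings could be built as in the sketch below. It interprets "spans nine frequency bands" as nine non-zero bands centered on the channel frequency with unit gain at the center; that interpretation, the function name `triangular_bank`, and the constant names are assumptions of this example.

```python
import numpy as np

def triangular_bank(num_bands, span):
    """One triangular filter per band; row i is centered on band i.

    span is the (odd) number of non-zero bands in each filter.
    """
    half = (span + 1) / 2.0
    idx = np.arange(num_bands)
    dist = np.abs(idx[:, None] - idx[None, :])     # distance from each center band
    return np.clip(1.0 - dist / half, 0.0, None)   # unit gain at the center

NUM_BANDS = 512 // 2 + 1          # unique bands of a 512-point FFT
F = triangular_bank(NUM_BANDS, span=9)   # broadband F filters
G = triangular_bank(NUM_BANDS, span=1)   # narrowband G filters (one band each)
N_COMPAND = 0.35                          # companding factor n
```

With a span of one band the G bank reduces to the identity matrix, so each H_i=F_i⊗G_i selects only the center band of its channel.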

FIGS. 4A and 4B show narrow-band spectrogram plots for the sentence “three oh three four nine nine nine two three two” spoken in the noisy environment of a moving vehicle. The energy in any time-frequency component is represented by a grey scale, i.e., the darker, the greater the energy.

FIG. 4A shows the spectrogram before companding, and FIG. 4B shows the plot after companding to achieve simultaneous masking on the signal. It is evident from FIG. 4B that the companding is able to follow harmonic and formant transitions with clarity and suppress the surrounding clutter. In contrast, FIG. 4A shows that in the absence of companding, the formant transitions are less clear, especially at low frequencies where the noise is high.

Effect of the Invention

The embodiments of the invention provide a biologically-motivated signal-processing method and system that effects simultaneous masking of speech spectra via the mechanism of two-tone suppression. Cepstral features derived from spectra enhanced in this manner result in significantly superior automatic speech recognition performance, compared to conventional Mel-frequency cepstra.

In an application of recognizing speech signals acquired in a moving vehicle, the relative word error is improved by 12.5% at −5 dB signal-to-noise ratio (SNR), and by 6.2% across all SNRs (−5 dB SNR to +15 dB SNR). These improvements are often substantial.

In the quest for a perfect biologically inspired signal processing scheme for noise-robust speech recognition, it is important to be able to distinguish psycho-acoustic phenomena that are relevant to the problem from those that are simply incidental. The method described above reproduces simultaneous masking to an extent that speech recognition is significantly improved.

Significantly, the embodiments of the invention described herein are not a direct transliteration of conventional companding processes. Rather, the invention uses FFT-based companding that is intended to be more efficient and amenable to incorporation in an automatic speech recognition system than the conventional companding operating in the time domain.

The FFT-based implementation varies significantly from the conventional analog design. For instance, the conventional companding incorporates time constants through which past sounds affect the spectrum of current sounds. The FFT-based companding according to the invention is instantaneous within an analysis frame.

The F and G filters can be triangular. However, biologically-correct filters, e.g., asymmetric filters that resemble typical masking curves measured in humans, can also be used.

It is to be understood that various other adaptations and modifications may be made within the spirit and scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.

Claims

1. A method for processing a speech signal, comprising the steps of: performing a fast Fourier transform on a speech signal to produce a speech signal having a plurality of frequency bands in a frequency domain, and for each frequency band further comprising the steps of: filtering the speech signal in the frequency domain with a spatial broadband filter; compressing the broadband filtered speech signal; filtering the compressed speech signal with a spatial narrowband filter; and expanding the narrowband filtered signal to an expanded signal.
2. The method of claim 1, in which the speech signal includes noise of an environment in which the speech signal is acquired.
3. The method of claim 1, in which the speech signal is sampled into a plurality of overlapping frames, and the broadband filtering, compressing, narrowband filtering, and expanding is performed individually on the plurality of frames.
4. The method of claim 1, further comprising: summing the expanded signals into a summed speech signal; and recognizing the summed speech signal by an automatic speech recognizer.
5. The method of claim 1, in which the plurality of frequency bands are equally spaced.
6. The method of claim 1, in which cepstral features are determined directly from the expanded signal.
7. The method of claim 1, in which the broadband filter and the narrowband filter are linear.
8. The method of claim 1, in which a Fourier spectrum of a response of the broadband filter is a vector F, and a spectrum of the broadband filtered signal is X1=F⊗X, where ⊗ represents an element-wise Hadamard multiplication, and X is the speech signal, and X1 is the broadband filtered signal, and in which the expanding produces the expanded signal X2=|X1|^(n−1)X1, where n is a constant companding factor.
9. The method of claim 8, in which a Fourier spectrum response of the narrowband filter is a vector G, and a spectrum of the narrowband filtered signal X3 is X3=G⊗X2.
10. The method of claim 9, in which a filter H is F⊗G, and X3=|F⊗X|^(n−1)H⊗X.
11. The method of claim 10, in which the expanded signal is X4=|F⊗X|^((n−1)/n)|H⊗X|^((1−n)/n)H⊗X.
12. The method of claim 11, in which the summed signal is J(X)=Σ|F⊗X|^((n−1)/n)|H⊗X|^((1−n)/n)H⊗X.
13. The method of claim 12, in which an integral spectrum is computed as Y=J(X)⊗X.
14. The method of claim 13, in which a Mel-frequency spectral vector is determined as Ymel=MY.
15. A system for processing a speech signal, comprising: a fast Fourier transform configured to be applied to a speech signal to produce a speech signal having a plurality of frequency bands in a frequency domain, and for each frequency band further comprising: a spatial broadband filter configured to filter the speech signal in the frequency domain; means for compressing the broadband filtered speech signal; a spatial narrowband filter configured to filter the compressed speech signal; and means for expanding the narrowband filtered signal to an expanded signal.
16. The method of claim 15, wherein speech recognition features are determined from the expanded signal.
