The invention relates generally to speech-based user interfaces, and more particularly to hands-free interfaces.
A speech-based user interface acquires speech input from a user for further processing. Typically, the speech acquired by the interface is processed by an automatic speech recognition (ASR) system. Ideally, the interface responds only to the user speech that is specifically directed at the interface, but not to any other sounds.
This requires that the interface recognizes when it is being addressed, and only responds at that time. When the interface does accept speech from the user, the interface must acquire and process the entire audio signal for the speech. The interface must also determine precisely the start and the end of the speech, and not process signals significantly before the start of the speech and after the end of the speech. Failure to satisfy these requirements can cause incorrect or spurious speech recognition.
A number of speech-based user interfaces are known. These can be roughly categorized as follows.
Push-to-Talk
With this type of interface, the user must press a button only for the duration of the speech. Thus, the start and end of speech signals are precisely known, and the speech is only processed while the button is pressed.
Hit-to-Talk
Here, the user briefly presses a button to indicate the start of the speech. It is the responsibility of the interface to determine where the speech ends. As with the push-to-talk interface, the hit-to-talk interface also attempts to ensure that speech is only processed after the button is pressed.
However, there are a number of situations where the use of a button may be impossible, inconvenient, or simply unnatural, for example, any situation where the user's hands are otherwise occupied, the user is physically impaired, or the interface precludes the inclusion of a button. Therefore, hands-free interfaces have been developed.
Hands-Free
With hands-free speech-based interfaces, the interface itself determines when speech starts and ends.
Of the three types of interface, the hands-free interface is arguably the most natural, because the interface does not require an express signal to initiate or terminate processing of the speech. In most conventional hands-free interfaces, only the audio signal acquired by the primary sensor, i.e., the microphone, is analyzed to decide when the speech starts and ends.
However, the hands-free interface is the most difficult to implement because it is difficult to determine automatically when the interface is being addressed by the user, and when the speech starts and ends. This problem becomes particularly difficult when the interface operates in a noisy or reverberant environment, or in an environment where there is additional unrelated speech.
One conventional solution uses “attention words.” The attention words are intended to indicate expressly the start and/or end of the speech. Another solution analyzes an energy profile of the audio signal. Processing begins when there is a sudden increase in the energy, and stops when the energy decreases. However, this solution can fail in a noisy environment, or an environment with background speech.
The zero-crossing rate of the audio signal can also be used. Zero-crossings occur when the speech signal changes sign between positive and negative. When the energy and the zero-crossing rate are at predetermined levels, speech is probably present.
Another class of solutions uses secondary sensors to acquire secondary measurements of the speech signal, such as a glottal electromagnetic sensor (GEMS), a physiological microphone (P-mic), a bone conduction sensor, and an electroglottograph. However, all of the above secondary sensors need to be mounted on the user of the interface. This can be inconvenient in any situation where it is difficult to forward the secondary signal to the interface. That is, the user may need to be ‘tethered’ to the interface.
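The energy and zero-crossing test described above can be sketched as follows. This is a minimal illustration, not a conventional detector's actual code: the function names and the threshold values `min_energy` and `max_zcr` are hypothetical, since the text only says the levels are predetermined.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def looks_like_speech(frame, min_energy=0.01, max_zcr=0.25):
    """Hypothetical rule: enough energy and a moderate zero-crossing rate.

    Both thresholds are illustrative assumptions, not values from the text.
    """
    return (frame_energy(frame) >= min_energy
            and zero_crossing_rate(frame) <= max_zcr)
```

As the surrounding text notes, a rule of this kind relies on the audio signal alone, so loud background sound can satisfy it.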
An ideal secondary sensor for a hands-free, speech-based interface should be able to operate at a distance from the user. Video cameras could be used as effective far-field sensors for detecting speech. Video images can be used for face detection and tracking, and to determine when the user is speaking. However, cameras are expensive, and detecting faces and recognizing moving lips is tedious, difficult and error prone.
Another secondary sensor uses the Doppler effect. An ultrasonic transmitter and receiver are deployed at a distance from the user. A transmitted ultrasonic signal is reflected by the face of the user. As the user speaks, parts of the face move, which changes the frequency of the reflected signal. Measurements obtained from the secondary sensor are used in conjunction with the audio signal acquired by the primary sensor to detect when the user speaks.
In addition to being usable at a distance from the user, the Doppler sensor differs from conventional secondary sensors in another, crucial way. The measurements provided by conventional secondary sensors are usually linearly related to the speech signal itself. The GEMS sensor provides measurements of the excitation function to the vocal tract. The signals acquired by P-mics, throat microphones and bone-conduction microphones are essentially filtered versions of the speech signal itself.
In contrast, the signal acquired by the Doppler sensor is not linearly related to the speech signal. Rather, the signal expresses information related to the movement of the face while speaking. The relationship between facial movement and the speech is not obvious, and certainly not linear.
However, the Doppler sensors use a support vector machine (SVM) to classify the audio signal as speech or non-speech. The classifier must first be trained off-line on joint speech and Doppler recordings. Consequently, the performance of the classifier is highly dependent on the training data used. It may be that different speakers articulate speech in different ways, e.g., depending on gender, age, and linguistic class. Therefore, it may be difficult to train the Doppler-based secondary sensor for a broad class of users. In addition, that interface requires both a speech signal and the Doppler signal for speech activity detection.
Therefore, it is desired to provide a speech activity sensor that does not require training of a classifier. It is also desired to detect speech only from the Doppler signal, without using any part of the concomitant audio signal. Then, as an advantage, the detection process can be independent of background “noise,” be it speech or any other spurious sounds.
The embodiments of the invention provide a hands-free, speech-based user interface. The interface detects when speech is to be processed. In addition, the interface detects the start and end of speech so that proper segmentation of the speech can be performed. Accurate segmentation of speech improves noise estimation and speech recognition accuracy.
A secondary sensor includes an ultrasonic transmitter and receiver. Using the Doppler effect, the sensor detects facial movement when the user of the interface speaks. Because speech detection can be based entirely on the secondary signal due to the facial movement, the interface works well even in extremely noisy environments.
Interface Structure
Transmitter
The transmitter 101 includes an ultrasonic emitter 110 coupled to an oscillator 111, e.g., a 40 kHz oscillator. The oscillator 111 is a microcontroller that is programmed to toggle one of its pins, e.g., at 40 kHz with a 50% duty cycle. The use of a microcontroller greatly decreases the cost and complexity of the overall design.
In one embodiment, the emitter has a resonant carrier frequency centered at 40 kHz. Although the input to the emitter is a square wave, the actual ultrasonic signal emitted is a pure tone due to a narrow-band response of the emitter. The narrow bandwidth of the emitted signal corresponds approximately to the bandwidth of a demodulated Doppler signal.
Receiver
The receiver 102 includes an ultrasonic channel 103 and an audio channel 104.
The ultrasonic channel includes a transducer 120, which, in one embodiment, has a resonant frequency of 40 kHz, with a 3 dB bandwidth of less than 3 kHz. The transducer 120 is coupled to a mixer 140 via a preamplifier 130. The mixer also receives input from a band pass filter 145 that uses, in one embodiment, a 36 kHz signal generator 146. The output of the mixer is coupled to a first low pass filter 150.
The audio channel includes a microphone 160 coupled to a second low pass filter 170. The audio channel acquires an audio signal. Hereinafter, an audio signal specifically means an acoustic signal that is audible. In a preferred embodiment, the audio channel is duplicated so that a stereo audio signal can be acquired.
Outputs 151 and 171 of the low pass filters 150 and 170, respectively, are processed 200 as described below. The eventual goal is to detect only speech activity 181 by a user of the interface in the received audio signal.
The emitter 110 and the transducer 120 in the preferred embodiment have a diameter of approximately 16 mm, which is nearly twice the wavelength of the ultrasonic signal at 40 kHz. As a result, the emitted ultrasonic signal is a spatially narrow beam, e.g., with a 3 dB beam width of approximately 30 degrees. This makes the ultrasonic signal highly directional, which decreases the likelihood of sensing extraneous signals not associated with facial movement. In fact, it makes sense to colocate the transducer 120 with the microphone 160.
Most conventional audio signal processors cut off received acoustic signals well below 40 kHz prior to digitization. Therefore, we heterodyne the received ultrasonic signal such that the resultant, much lower “beat frequency” signal falls within the audio range. Doing so also provides us with another advantage. The heterodyned signal can be sampled at audio frequencies, with the additional benefit of a reduction in computational complexity.
The signal 121 acquired by the transducer is pre-amplified 130 and input to the analog mixer 140. The second input to the mixer is a sinusoid signal, at 36 kHz in our preferred embodiment. The sinusoid signal is generated by producing a 36 kHz, 50% duty cycle square wave from the microcontroller. The square wave is bandpass filtered 145 with a fourth-order active filter. The output of the mixer is then low-pass filtered 150 with a cutoff frequency of 8 kHz in our preferred embodiment.
The audio channel includes a microphone 160 to acquire the audio signal. In the preferred embodiment, the microphone is selected to have a frequency response with a 3 dB cutoff frequency below 8 kHz. This ensures that the audio channel does not acquire the ultrasonic signal. The audio signal is further low-pass filtered by a second-order RC filter 170 with a cutoff frequency of 8 kHz.
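The relation between the 16 mm diameter and the wavelength can be checked with a short calculation. The sketch below assumes a speed of sound of 343 m/s (dry air at roughly 20 °C), a value that is not stated in the text; the names are illustrative.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumption: dry air at about 20 degrees Celsius


def wavelength_mm(frequency_hz, speed_m_s=SPEED_OF_SOUND_M_S):
    """Acoustic wavelength in millimetres for a tone of the given frequency."""
    return 1000.0 * speed_m_s / frequency_hz


carrier_wavelength_mm = wavelength_mm(40_000.0)        # about 8.6 mm
diameter_in_wavelengths = 16.0 / carrier_wavelength_mm  # a little under 2
```

Under this assumption the diameter comes out at somewhat less than two wavelengths, which agrees with the "nearly twice" figure given above.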
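The effect of the mixer and the low-pass filter on the carrier can be illustrated with idealized arithmetic. This is a sketch under simplifying assumptions: the mixer is treated as a perfect multiplier that produces only sum and difference frequencies, and the filter as a brick-wall cutoff. The function names are hypothetical.

```python
def mixer_products(f_signal_hz, f_local_hz):
    """Ideal multiplying mixer: components at the difference and sum frequencies."""
    return abs(f_signal_hz - f_local_hz), f_signal_hz + f_local_hz


def after_low_pass(freqs_hz, cutoff_hz):
    """Idealized brick-wall low-pass filter: keep components at or below the cutoff."""
    return [f for f in freqs_hz if f <= cutoff_hz]


# 40 kHz received carrier mixed with the 36 kHz sinusoid, then the 8 kHz low-pass.
products = mixer_products(40_000.0, 36_000.0)
kept = after_low_pass(products, 8_000.0)
```

Only the 4 kHz difference component survives, which lies below the 8 kHz Nyquist limit of the 16 kHz sampling rate used for the stereo signal.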
The outputs 151 and 171 of the ultrasonic channel and the audio channel are jointly fed to the processor 200. The stereo signal is sampled at 16 kHz before the processing 200 to detect the speech activity 181.
Interface Operation
The ultrasonic transmitter 101 directs a narrow-beam, e.g., 40 kHz, ultrasonic signal at the face of the user of the interface 100. The signal emitted by the transmitter is a continuous tone that can be represented as $s(t) = \sin(2\pi f_c t)$, where $f_c$ is the emitted frequency, e.g., 40 kHz in our case.
The user's face reflects the ultrasonic signal as a Doppler signal. Herein, the Doppler signal generally refers to the reflected ultrasonic signal. While speaking, the user moves articulatory facial structures including but not limited to the mouth, lips, tongue, chin and cheeks. Thus, the articulated face can be modeled as a discrete combination of moving articulators, where the $i$-th component has a time-varying velocity $v_i(t)$. The low velocity movements cause changes in wavelength of the incident ultrasonic signal. A complex articulated object, such as the face, exhibits a range of velocities while in motion. Consequently, the reflected Doppler signal has a spectrum of frequencies that is related to the entire set of velocities of all parts of the face that move as the user speaks. Therefore, as stated above, the bandwidth of the ultrasonic signal corresponds approximately to the bandwidth of frequencies at which the facial articulators move.
The Doppler effect states that if a tone of frequency $f$ is incident on an object with velocity $v$ relative to a sensor 120, the frequency $\hat{f}$ of the reflected Doppler signal is given by
$$\hat{f} = f\,\frac{v_s + v}{v_s - v} \approx f\left(1 + \frac{2v}{v_s}\right), \qquad (1)$$
where $v_s$ is the speed of sound in a particular medium, e.g., air. The approximation to the right in Equation (1) holds true if $v \ll v_s$, which is true for facial movement.
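To give a sense of scale, the sketch below evaluates the standard two-way Doppler relation for a reflector approaching a co-located emitter and receiver, $\hat{f} = f (v_s + v)/(v_s - v)$, together with its small-velocity approximation $f(1 + 2v/v_s)$. The speed of sound (343 m/s) and the example velocity (0.1 m/s) are assumptions chosen for illustration, not values from the text.

```python
def doppler_exact(f_hz, v_m_s, v_sound_m_s=343.0):
    """Reflected frequency for a reflector approaching at v_m_s."""
    return f_hz * (v_sound_m_s + v_m_s) / (v_sound_m_s - v_m_s)


def doppler_approx(f_hz, v_m_s, v_sound_m_s=343.0):
    """First-order approximation, valid when v_m_s is much smaller than v_sound_m_s."""
    return f_hz * (1.0 + 2.0 * v_m_s / v_sound_m_s)


# Hypothetical articulator speed of 0.1 m/s with a 40 kHz carrier.
shift_hz = doppler_approx(40_000.0, 0.1) - 40_000.0   # roughly 23 Hz
error_hz = doppler_exact(40_000.0, 0.1) - doppler_approx(40_000.0, 0.1)
```

At such speeds the two forms differ by well under 0.01 Hz, which is why the approximate form is adequate for facial movement.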
The various articulators have different velocities. Therefore, each articulator reflects a different frequency. The frequencies change continuously with the velocity of the articulators. The received ultrasonic signal can therefore be considered as a sum of multiple frequency modulated (FM) signals, all modulating the same carrier frequency $f_c$. The FM can be modeled as:
$$d(t) = \sum_i a_i \sin\left(2\pi f_c\left(t + \frac{2}{v_s}\int_0^t v_i(\tau)\,d\tau\right) + \phi_i\right), \qquad (2)$$
where $v_i(\tau)$ is the velocity of the $i$-th articulator at a specific instant of time $\tau$.
Equation (2) uses the approximate form of the Doppler Equation (1). The variable $a_i$ is the amplitude of the signal reflected by the $i$-th articulated component. This variable is related to the distance of the component from the sensor. Although $a_i$ is time varying, the changes are relatively slow compared to the sinusoidal terms in Equation (2). We assume the term to be a constant gain term.
The variable $\phi_i$ is a phase term intended to represent relative phase differences between the Doppler signals reflected by the various moving articulators. If $f_c$ is the carrier frequency, then Equation (2) represents the sum of multiple frequency modulated (FM) signals, all operating on the single carrier frequency $f_c$.
Most of the information relating to the movement of facial articulators resides in the frequency of the signals in Equation (1). In the preferred embodiment, we demodulate the signal such that this information is also expressed in the amplitude of the sinusoidal components, so that a measure of the energy of these movements can be obtained.
Conventional FM demodulation proceeds by eliminating amplitude variations through hard limiting and band-pass filtering, followed by differentiating the signal to extract the ‘message’ into the amplitude of the sinusoid signal, followed finally by an envelope detector.
Our FM demodulation is different. We do not perform the hard-limiting and band-pass filtering operation because we want to retain the information in the amplitudes $a_i$. This gives us an output that is more similar to a spectral decomposition of the ultrasonic signal.
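The first step differentiates the received ultrasonic signal $d(t)$. From Equation (2) we obtain
$$\frac{d}{dt}d(t) = \sum_i 2\pi a_i f_c\left(1 + \frac{2 v_i(t)}{v_s}\right)\cos\left(2\pi f_c t + \frac{4\pi f_c}{v_s}\int_0^t v_i(\tau)\,d\tau + \phi_i\right). \qquad (3)$$
The derivative of $d(t)$ is multiplied by the sinusoid of frequency $f_c$. This gives us:
$$\sin(2\pi f_c t)\,\frac{d}{dt}d(t) = \sum_i \pi a_i f_c\left(1 + \frac{2 v_i(t)}{v_s}\right)\left[-\sin\left(\frac{4\pi f_c}{v_s}\int_0^t v_i(\tau)\,d\tau + \phi_i\right) + \sin\left(4\pi f_c t + \frac{4\pi f_c}{v_s}\int_0^t v_i(\tau)\,d\tau + \phi_i\right)\right]. \qquad (4)$$
A low-pass filter with a cutoff below $f_c$ removes the second sinusoid on the right in Equation (4), finally giving us:
$$\mathrm{LPF}\left(\sin(2\pi f_c t)\,\frac{d}{dt}d(t)\right) = -\sum_i \pi a_i f_c\left(1 + \frac{2 v_i(t)}{v_s}\right)\sin\left(\frac{4\pi f_c}{v_s}\int_0^t v_i(\tau)\,d\tau + \phi_i\right), \qquad (5)$$
where LPF represents the low-pass-filtering operation.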
The signal represented by Equation (5) encodes velocity terms in both amplitudes and frequencies. If the signal is analyzed using relatively short analysis frames, the velocities do not change significantly within a particular analysis frame, and the right hand side of Equation (5) can be interpreted as a frequency decomposition of the left hand side.
The signal contains energy primarily at frequencies related to the various velocities of the moving articulators. The energy at any velocity is a function of the number and distance of facial articulators moving with that velocity, as well as the velocity itself.
Speech Activity Detection
The signals are then partitioned 210 into frames using, e.g., a 1024 point Hamming window.
The audio signal 171 is processed only while speech activity 181 from the user is detected.
Facial articulators move relatively slowly. The frequency variations due to their velocity are low. The ultrasonic signal is demodulated 220 into a frequency range of, e.g., 25 Hz to 150 Hz. Frequencies outside this range, although potentially related to speech activity, are usually corrupted by the carrier frequency, as well as harmonics of the speech signal including any background speech or babble, particularly in speech segments.
To obtain the frequency resolution needed for analyzing the ultrasonic signal, the frame size is relatively large, e.g., 64 ms. Each frame includes 1024 samples. Adjacent frames overlap by 50%.
From each frame of the demodulated and windowed Doppler signal, we extract 230 discrete Fourier transform (DFT) coefficients for eight bins in a frequency range from 25 Hz to 150 Hz. In our preferred implementation, we use the well-known Goertzel algorithm; see, e.g., U.S. Pat. No. 4,080,661 issued to Niwa on Mar. 21, 1978, “Arithmetic unit for DFT and/or IDFT computation,” incorporated herein by reference.
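The framing step can be sketched as below. At the 16 kHz sampling rate, 1024 samples span 64 ms, and a 50% overlap corresponds to a hop of 512 samples. The function names are hypothetical, and the symmetric form of the Hamming window is an assumption, since the text does not say which variant is used.

```python
import math


def hamming(n):
    """Symmetric Hamming window of length n (assumed variant)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * k / (n - 1)) for k in range(n)]


def windowed_frames(signal, size=1024, hop=512):
    """Split a sample sequence into overlapping, Hamming-windowed frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    window = hamming(size)
    out = []
    for start in range(0, len(signal) - size + 1, hop):
        out.append([signal[start + k] * window[k] for k in range(size)])
    return out


frame_duration_s = 1024 / 16_000  # 0.064 s, i.e. 64 ms
```

For example, a 2048-sample signal yields three frames, starting at samples 0, 512 and 1024.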
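A minimal sketch of the textbook Goertzel recurrence for the power of a single DFT bin is given below; it is not the arithmetic unit of the cited patent. The choice of bins 2 through 9 is an assumption: with 1024-sample frames at 16 kHz the bin spacing is 15.625 Hz, so these eight bins span 31.25 Hz to 140.625 Hz, which is one plausible reading of "eight bins" between 25 Hz and 150 Hz.

```python
import math


def goertzel_power(frame, k):
    """Squared magnitude |X[k]|^2 of DFT bin k, via the Goertzel recurrence."""
    n = len(frame)
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s1 = 0.0  # state one step back
    s2 = 0.0  # state two steps back
    for x in frame:
        s0 = x + coeff * s1 - s2
        s2 = s1
        s1 = s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2


def doppler_band_energy(frame, bins=range(2, 10)):
    """Total energy in the selected bins (bin choice is an assumption)."""
    return sum(goertzel_power(frame, k) for k in bins)
```

Computing only eight bins this way avoids a full 1024-point transform for every frame.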
The energy in these frequency bands is determined from the DFT coefficients. Typically, the sequence of energy values is very noisy. Therefore, we “smooth” 240 the energy using a five point median filter.
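The smoothing step can be sketched as a sliding five-point median over the per-frame energy values. How the first and last two frames are handled is not stated in the text; the version below shortens the window at the edges, which is an assumption.

```python
import statistics


def median_filter(values, width=5):
    """Sliding median; the window is truncated at both ends of the sequence."""
    half = width // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(statistics.median(values[lo:hi]))
    return out
```

A single-frame spike such as the 9 in `[1, 1, 9, 1, 1]` is removed entirely, which is the behaviour wanted for a noisy energy sequence.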
To determine if the $t$-th frame of the audio signal represents speech, the median-filtered energy value $E_d(t)$ of the Doppler signal in the corresponding frame is compared 250 to an adaptive threshold $\beta_t$ to determine whether the frame indicates speech activity 202, or not 203. The threshold for the $t$-th frame is adapted as follows:
$$\beta_t = \beta_{t-1} + \mu\left(E_d(t) - E_d(t-1)\right),$$
where μ is an adaptation factor that can be adjusted for optimal performance.
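The threshold update can be sketched as follows. Several details are assumptions because the text leaves them open: the initial threshold `beta0`, the default value of `mu`, the convention that a frame counts as speech activity when its energy exceeds the threshold, and the choice to leave the threshold unchanged for the first frame, which has no predecessor.

```python
def detect_activity(energies, beta0, mu=0.5):
    """Flag frames whose smoothed Doppler energy exceeds an adaptive threshold.

    The threshold follows beta_t = beta_{t-1} + mu * (E_d(t) - E_d(t-1)).
    beta0, mu and the 'energy > threshold' convention are assumptions.
    """
    beta = beta0
    previous = None
    flags = []
    for energy in energies:
        if previous is not None:
            beta = beta + mu * (energy - previous)
        flags.append(energy > beta)
        previous = energy
    return flags
```

With `energies = [1, 1, 5, 5, 1]`, `beta0 = 2` and `mu = 0.5`, the threshold takes the values 2, 2, 4, 4, 2, so only the third and fourth frames are flagged.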
If the frame is not indicative of speech, then we assume an end of utterance 260 event. An utterance is defined as a sequence of one or more frames of speech activity followed by a frame that is not speech. The energy $E_c$ of the current audio frame 204 and the energy $E_p$ of the last confirmed frame 289 that includes speech are compared 285 according to $\alpha E_p \le E_c$. The scalar $\alpha$ is a selectable non-speech parameter between 0 and 1 used to distinguish speech and non-speech frames 291-292, respectively.
This event initiates end of speech detection 270, which operates only on the audio signal. The method continues 275 to detect speech up to three frames after the end of utterance event. Finally, adjacent speech segments that are within 200 ms of each other are merged.
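The final merging step can be sketched as below. Segments are represented as `(start, end)` pairs in seconds, which is a hypothetical representation, and treating a gap of exactly 200 ms as "within" is also an assumption.

```python
def merge_segments(segments, max_gap_s=0.2):
    """Merge speech segments whose gap is at most max_gap_s seconds."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap_s:
            # Close enough to the previous segment: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

For instance, segments ending at 1.0 s and starting at 1.1 s are joined into one, while a segment starting a full second later is kept separate.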
The interface according to the embodiments of the invention detects speech only when speech is directed at the interface. The interface also concatenates adjacent speech utterances. The interface excludes non-speech audio signals.
The ultrasonic Doppler sensor is accurate at SNRs as low as −10 dB. The interface is also relatively insensitive to false alarms.
The interface has several advantages. It is inexpensive, has a low false trigger rate and is not affected by ambient out-of-band noise. Also, due to the finite range of the ultrasonic receiver, the output is not affected by distant movements.
The interface only uses the Doppler signals to make the initial decision whether speech activity is present or not. The audio signal can be used optionally to concatenate adjacent short utterances into continuous speech segments.
Although the invention has been described by way of examples of preferred embodiments, it is to be understood that various other adaptations and modifications may be made within the spirit and scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.
Number | Name | Date | Kind |
---|---|---|---|
4080661 | Niwa | Mar 1978 | A |
20070165881 | Ramakrishnan et al. | Jul 2007 | A1 |
Number | Date | Country | |
---|---|---|---|
20080071532 A1 | Mar 2008 | US |