ULTRASONIC DOPPLER SENSOR FOR SPEECH-BASED USER INTERFACE

Information

  • Patent Application
  • Publication Number
    20080071532
  • Date Filed
    September 12, 2006
  • Date Published
    March 20, 2008
Abstract
A method and system detect speech activity. An ultrasonic signal is directed at a face of a speaker over time. A Doppler signal of the ultrasonic signal is acquired after reflection by the face. Energy in the Doppler signal is measured over time. The energy over time is compared to a predetermined threshold to detect speech activity of the speaker in a concurrently acquired audio signal.
Description

BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 is a block diagram of a hands-free speech-based user interface according to an embodiment of our invention;



FIG. 2 is a flow diagram of a method for detecting speech activity using the interface of FIG. 1; and



FIGS. 3A-3C are timing diagrams of primary and secondary signals acquired and processed by the interface of FIG. 1 and the method of FIG. 2.





DESCRIPTION OF THE PREFERRED EMBODIMENT

Interface Structure


Transmitter



FIG. 1 shows a hands-free, speech-based interface 100 according to an embodiment of our invention. Our interface includes a transmitter 101, a receiver 102, and a processor 200 executing the method according to an embodiment of the invention. The transmitter and receiver, in combination, form an ultrasonic Doppler sensor 105 according to an embodiment of the invention. Hereinafter, ultrasound is defined as sound with a frequency greater than the upper limit of human hearing. This limit is approximately 20 kHz.


The transmitter 101 includes an ultrasonic emitter 110 coupled to an oscillator 111, e.g., a 40 kHz oscillator. The oscillator 111 is a microcontroller that is programmed to toggle one of its pins, e.g., at 40 kHz with a 50% duty cycle. The use of a microcontroller greatly decreases the cost and complexity of the overall design.


In one embodiment, the emitter has a resonant carrier frequency centered at 40 kHz. Although the input to the emitter is a square wave, the actual ultrasonic signal emitted is a pure tone due to the narrow-band response of the emitter. The narrow bandwidth of the emitted signal corresponds approximately to the bandwidth of the demodulated Doppler signal.
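To illustrate why the square-wave drive emerges as an essentially pure tone, the following sketch models the emitter as a narrow band-pass filter centered at 40 kHz and measures how strongly the square wave's third harmonic is suppressed. The 1 MHz simulation rate and the second-order Butterworth model of the emitter's resonance are illustrative assumptions, not the patent's hardware.

```python
import numpy as np
from scipy import signal

fs = 1_000_000                                  # simulation rate (assumed)
t = np.arange(0, 0.01, 1 / fs)
drive = signal.square(2 * np.pi * 40_000 * t)   # 40 kHz, 50% duty-cycle drive

# Model the resonant emitter as a narrow band-pass around 40 kHz (assumed response).
sos = signal.butter(2, [38_000, 42_000], btype="bandpass", fs=fs, output="sos")
emitted = signal.sosfilt(sos, drive)

spectrum = np.abs(np.fft.rfft(emitted))
freqs = np.fft.rfftfreq(len(emitted), 1 / fs)
fund = spectrum[np.argmin(np.abs(freqs - 40_000))]
third = spectrum[np.argmin(np.abs(freqs - 120_000))]   # square waves have odd harmonics
print(f"third harmonic suppressed by {20 * np.log10(fund / third):.1f} dB")
```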


Receiver


The receiver 102 includes an ultrasonic channel 103 and an audio channel 104.


The ultrasonic channel includes a transducer 120, which, in one embodiment, has a resonant frequency of 40 kHz, with a 3 dB bandwidth of less than 3 kHz. The transducer 120 is coupled to a mixer 140 via a preamplifier 130. The mixer also receives input from a band-pass filter 145 fed, in one embodiment, by a 36 kHz signal generator 146. The output of the mixer is coupled to a first low pass filter 150.


The audio channel includes a microphone 160 coupled to a second low pass filter 170. The audio channel acquires an audio signal. Hereinafter, an audio signal specifically means an acoustic signal that is audible. In a preferred embodiment, the audio channel is duplicated so that a stereo audio signal can be acquired.


Outputs 151 and 171 of the low pass filters 150 and 170, respectively, are processed 200 as described below. The eventual goal is to detect only speech activity 181 by a user of the interface in the received audio signal.


The emitter 110 and the transducer 120 in the preferred embodiment have a diameter of approximately 16 mm, which is nearly twice the wavelength of the ultrasonic signal at 40 kHz. As a result, the emitted ultrasonic signal is a spatially narrow beam, e.g., with a 3 dB beam width of approximately 30 degrees. The ultrasonic signal is thus highly directional, which decreases the likelihood of sensing extraneous signals not associated with facial movement. In fact, it makes sense to colocate the transducer 120 with the microphone 160.


Most conventional audio signal processors cut off received acoustic signals well below 40 kHz prior to digitization. Therefore, we heterodyne the received ultrasonic signal such that the resultant much lower "beat frequency" signal falls within the audio range. Doing so also provides us with another advantage. The heterodyned signal can be sampled at audio frequencies, with the additional benefit of reduced computational complexity.


The signal 121 acquired by the transducer is pre-amplified 130 and input to the analog mixer 140. The second input to the mixer is a sinusoid signal, e.g., 36 kHz in our preferred embodiment. The sinusoid signal is generated by producing a 36 kHz 50% duty cycle square wave from the microcontroller. The square wave is bandpass filtered 145 with a fourth order active filter. The output of the mixer is then low-pass filtered 150 with a cutoff frequency of 8 kHz, as in our preferred embodiment.
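A minimal numerical sketch of this heterodyning chain (the simulation rate, the +50 Hz Doppler shift, and the ideal sinusoidal local oscillator are illustrative assumptions): mixing a 40.05 kHz reflection with a 36 kHz sinusoid and low-pass filtering at 8 kHz leaves a beat near 4.05 kHz, well inside the audio band.

```python
import numpy as np
from scipy import signal

fs = 500_000                              # analog stage simulated at 500 kHz (assumed)
t = np.arange(0, 0.02, 1 / fs)
doppler = np.sin(2 * np.pi * 40_050 * t)  # reflection shifted +50 Hz by motion (assumed)
lo = np.sin(2 * np.pi * 36_000 * t)       # 36 kHz local oscillator

mixed = doppler * lo                      # sum and difference: 76.05 kHz and 4.05 kHz
sos = signal.butter(4, 8_000, btype="lowpass", fs=fs, output="sos")
beat = signal.sosfilt(sos, mixed)         # only the 4.05 kHz beat survives

freqs = np.fft.rfftfreq(len(beat), 1 / fs)
peak = freqs[np.argmax(np.abs(np.fft.rfft(beat)))]
print(f"beat frequency ≈ {peak:.0f} Hz")  # ≈ 4050 Hz, sampleable at audio rates
```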


The audio channel includes a microphone 160 to acquire the audio signal. In a preferred embodiment, the microphone is selected to have a frequency response with a 3 dB cutoff frequency below 8 kHz. This ensures that the audio channel does not acquire the ultrasonic signal. The audio signal is further low-pass filtered by a second order RC filter 170 with a cutoff frequency of 8 kHz.


The outputs 151 and 171 of the ultrasonic channel and the audio channel are jointly fed to the processor 200. The stereo signal is sampled at 16 kHz before the processing 200 to detect the speech activity 181.


Interface Operation


The ultrasonic transmitter 101 directs a narrow-beam, e.g., 40 kHz, ultrasonic signal at the face of the user of the interface 100. The signal emitted by the transmitter is a continuous tone that can be represented as s(t)=sin(2πfct), where fc is the emitted frequency, e.g., 40 kHz in our case.


The user's face reflects the ultrasonic signal as a Doppler signal. Herein, the Doppler signal generally refers to the reflected ultrasonic signal. While speaking, the user moves articulatory facial structures including, but not limited to, the mouth, lips, tongue, chin, and cheeks. Thus, the articulated face can be modeled as a discrete combination of moving articulators, where the ith component has a time-varying velocity vi(t). These low-velocity movements cause changes in the wavelength of the incident ultrasonic signal. A complex articulated object, such as the face, exhibits a range of velocities while in motion. Consequently, the reflected Doppler signal has a spectrum of frequencies that is related to the entire set of velocities of all parts of the face that move as the user speaks. Therefore, as stated above, the bandwidth of the ultrasonic signal corresponds approximately to the bandwidth of frequencies at which the facial articulators move.


The Doppler effect states that if a tone of frequency f is incident on an object with velocity v relative to a sensor 120, the frequency f̂ of the reflected Doppler signal is given by

\hat{f} = \frac{v_s + v}{v_s - v}\, f \approx \left(1 + \frac{2v}{v_s}\right) f, \qquad (1)
where vs is the speed of sound in a particular medium, e.g., air. The approximation on the right of Equation (1) holds when v ≪ vs, which is true for facial movement.
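As a quick worked example (the articulator velocity of 0.1 m/s is an illustrative assumption, not a value from the patent), a facial articulator moving toward the sensor shifts the 40 kHz carrier by

\Delta f = \hat{f} - f \approx \frac{2v}{v_s} f = \frac{2 \times 0.1\ \mathrm{m/s}}{343\ \mathrm{m/s}} \times 40\,000\ \mathrm{Hz} \approx 23\ \mathrm{Hz},

which is on the order of the 25 Hz to 150 Hz demodulation band used below.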


The various articulators have different velocities. Therefore, each articulator reflects a different frequency. The frequencies change continuously with the velocities of the articulators. The received ultrasonic signal can therefore be considered as a sum of multiple frequency modulated (FM) signals, all modulating the same carrier frequency fc. The FM can be modeled as:











d(t) = \sum_i a_i \sin\!\left( 2\pi f_c \left( t + \frac{2}{v_s} \int_0^t v_i(\tau)\, d\tau \right) + \varphi_i \right), \qquad (2)
where vi(τ) is the velocity of the ith articulator at time instant τ.


Equation (2) uses the approximate form of the Doppler Equation (1). The variable ai is the amplitude of the signal reflected by the ith articulated component. This variable is related to the distance of the component from the sensor. Although ai is time-varying, it changes slowly relative to the sinusoidal terms in Equation (2), so we treat it as a constant gain term.


The variable φi is a phase term intended to represent relative phase differences between the Doppler signals reflected by the various moving articulators. If fc is the carrier frequency, then Equation (2) represents the sum of multiple frequency modulated (FM) signals, all operating on the single carrier frequency fc.
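As an illustration of this FM model, the following sketch synthesizes a Doppler signal d(t) directly from Equation (2). The number of articulators and their amplitudes, peak velocities, movement rates, and phases are invented for the example.

```python
import numpy as np

fs = 500_000                   # simulation rate (assumed)
v_s, f_c = 343.0, 40_000.0     # speed of sound in air; 40 kHz carrier
t = np.arange(0.0, 0.1, 1.0 / fs)

# Hypothetical articulators: (amplitude a_i, peak velocity m/s, movement rate Hz, phase)
articulators = [(1.0, 0.05, 3.0, 0.0), (0.6, 0.10, 5.0, 1.2), (0.3, 0.02, 8.0, 2.5)]

d = np.zeros_like(t)
for a_i, v_peak, f_move, phi_i in articulators:
    v_i = v_peak * np.sin(2 * np.pi * f_move * t)   # articulator velocity v_i(t)
    integral = np.cumsum(v_i) / fs                  # running integral of v_i(tau)
    d += a_i * np.sin(2 * np.pi * f_c * (t + 2.0 * integral / v_s) + phi_i)
```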


Most of the information relating to the movement of facial articulators resides in the frequencies of the signals in Equation (2). In a preferred embodiment, we demodulate the signal such that this information is also expressed in the amplitude of the sinusoidal components, so that a measure of the energy of these movements can be obtained.


Conventional FM demodulation proceeds by eliminating amplitude variations through hard limiting and band-pass filtering, followed by differentiating the signal to extract the ‘message’ into the amplitude of the sinusoid signal, followed finally by an envelope detector.


Our FM demodulation is different. We do not perform the hard-limiting and band-pass filtering operation because we want to retain the information in the amplitude ai. This gives us an output that is more similar to a spectral decomposition of the ultrasonic signal.


The first step differentiates the received ultrasonic signal d(t). From Equation (2) we obtain

\frac{\partial}{\partial t} d(t) = \sum_i 2\pi a_i f_c \left( 1 + \frac{2 v_i(t)}{v_s} \right) \cos\!\left( 2\pi f_c \left( t + \frac{2}{v_s} \int_0^t v_i(\tau)\, d\tau \right) + \varphi_i \right) \qquad (3)
The derivative of d(t) is multiplied by the sinusoid of frequency fc. This gives us:

\sin(2\pi f_c t)\, \frac{\partial}{\partial t} d(t) = \sum_i 2\pi a_i f_c \left( 1 + \frac{2 v_i(t)}{v_s} \right) \sin(2\pi f_c t) \cos\!\left( 2\pi f_c \left( t + \frac{2}{v_s} \int_0^t v_i(\tau)\, d\tau \right) + \varphi_i \right)

= \sum_i \pi a_i f_c \left( 1 + \frac{2 v_i(t)}{v_s} \right) \left( -\sin\!\left( \frac{4\pi f_c}{v_s} \int_0^t v_i(\tau)\, d\tau + \varphi_i \right) + \sin\!\left( 4\pi f_c t + \frac{4\pi f_c}{v_s} \int_0^t v_i(\tau)\, d\tau + \varphi_i \right) \right) \qquad (4)
A low-pass filter with a cutoff frequency below fc removes the second sinusoid on the right in Equation (4), finally giving us:

\mathrm{LPF}\!\left( \sin(2\pi f_c t)\, \frac{\partial}{\partial t} d(t) \right) = -\sum_i \pi a_i f_c \left( 1 + \frac{2 v_i(t)}{v_s} \right) \sin\!\left( \frac{4\pi f_c}{v_s} \int_0^t v_i(\tau)\, d\tau + \varphi_i \right), \qquad (5)
where LPF represents the low-pass-filtering operation.
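Equations (3) to (5) amount to a three-step pipeline: differentiate, multiply by the carrier sinusoid, and low-pass filter. A self-contained sketch with a single synthetic articulator follows; the velocity profile, simulation rate, and 20 kHz filter cutoff are illustrative assumptions.

```python
import numpy as np
from scipy import signal

fs, f_c, v_s = 500_000, 40_000.0, 343.0
t = np.arange(0.0, 0.1, 1.0 / fs)
v = 0.05 * np.sin(2 * np.pi * 5.0 * t)    # one articulator's velocity (assumed)
d = np.sin(2 * np.pi * f_c * (t + 2.0 * np.cumsum(v) / fs / v_s))  # Equation (2), one term

ddt = np.gradient(d, 1.0 / fs)            # Equation (3): differentiate d(t)
mixed = np.sin(2 * np.pi * f_c * t) * ddt # Equation (4): multiply by the carrier
sos = signal.butter(4, 20_000, btype="lowpass", fs=fs, output="sos")
baseband = signal.sosfilt(sos, mixed)     # Equation (5): keep the baseband term
```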


The signal represented by Equation (5) encodes the velocity terms in both its amplitudes and its frequencies. If the signal is analyzed using relatively short analysis frames, the velocities, and hence the frequencies, do not change significantly within a particular analysis frame, and the right-hand side of Equation (5) can be interpreted as a frequency decomposition of the left-hand side.


The signal contains energy primarily at frequencies related to the various velocities of the moving articulators. The energy at any velocity is a function of the number and distance of facial articulators moving with that velocity, as well as the velocity itself.


Speech Activity Detection



FIG. 2 shows the method 200 for speech activity detection according to an embodiment of the invention. The ultrasonic Doppler signal 151 and the audio signal 171 acquired by the ultrasonic Doppler sensor 105 are both sampled 201 at 16 kHz. FIG. 3A shows the reflected Doppler signal. In FIGS. 3A-3B, the vertical axis is amplitude. FIG. 3C shows the audio signal along with the normalized energy contour of the Doppler signal. In all three figures, the horizontal axis is time.


The signals are then partitioned 210 into frames using, e.g., a 1024 point Hamming window.


The audio signal 171 is processed only while speech activity 181 from the user is detected.


Facial articulators move relatively slowly, so the frequency variations due to their velocities are low. The ultrasonic signal is demodulated 220 into a range of frequency bands, e.g., 25 Hz to 150 Hz. Frequencies outside this range, although potentially related to speech activity, are usually corrupted by the carrier frequency, as well as by harmonics of the speech signal, including any background speech or babble, particularly in speech segments. FIG. 3B shows the demodulated Doppler signal.


To obtain the frequency resolution needed for analyzing the ultrasonic signal, the frame size is relatively large, e.g., 64 ms. Each frame includes 1024 samples. Adjacent frames overlap by 50%.
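A sketch of this framing step with the stated parameters (a straightforward implementation, not the patent's code):

```python
import numpy as np

def frame_signal(x, frame_len=1024, hop=512):
    """Split x into 50%-overlapping, Hamming-windowed frames (64 ms at 16 kHz)."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] * window
                     for i in range(n_frames)])

frames = frame_signal(np.random.randn(16_000))   # one second at 16 kHz -> 30 frames
```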


From each frame of the demodulated and windowed Doppler signal, we extract 230 discrete Fourier transform (DFT) coefficients for eight bins in a frequency range from 25 Hz to 150 Hz. In our preferred implementation, we use the well-known Goertzel algorithm, see, e.g., U.S. Pat. No. 4,080,661 issued to Niwa on Mar. 21, 1978, “Arithmetic unit for DFT and/or IDFT computation,” incorporated herein by reference.
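The Goertzel recursion evaluates individual DFT bins at lower cost than a full FFT when only a few bins are needed, which suits the eight-bin analysis here. The following is a textbook sketch; the choice of bins 2 through 9, which span roughly 31 Hz to 141 Hz with 1024-point frames at 16 kHz, is our reading of the stated range.

```python
import numpy as np

def goertzel_power(frame, k):
    """Squared magnitude of DFT bin k of `frame`, via the Goertzel recursion."""
    n = len(frame)
    coeff = 2.0 * np.cos(2.0 * np.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in frame:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev**2 + s_prev2**2 - coeff * s_prev * s_prev2

def doppler_energy(frame, bins=range(2, 10)):    # eight bins in roughly 25-150 Hz
    return sum(goertzel_power(frame, k) for k in bins)
```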


The energy in these frequency bands is determined from the DFT coefficients. Typically, the sequence of energy values is very noisy. Therefore, we “smooth” 240 the energy using a five-point median filter.
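The smoothing can be realized with a standard five-point median filter; here is one possible implementation using scipy (the noisy energy sequence is invented for the example):

```python
import numpy as np
from scipy.signal import medfilt

energy = np.array([0.1, 0.2, 9.0, 0.2, 0.3, 0.2, 8.5, 0.3])  # toy per-frame energies
smoothed = medfilt(energy, kernel_size=5)                     # five-point median filter
```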



FIG. 3C shows the energy contour as well as the audio signal. The Figure shows that the energy in the Doppler signal is correlated to speech activity.


To determine whether the tth frame of the audio signal represents speech, the median-filtered energy value Ed(t) of the Doppler signal in the corresponding frame is compared 250 to an adaptive threshold βt to determine whether the frame indicates speech activity 202, or not 203. The threshold for the tth frame is adapted as follows:





\beta_t = \beta_{t-1} + \mu \left( E_d(t) - E_d(t-1) \right),


where μ is an adaptation factor that can be adjusted for optimal performance.
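A minimal sketch of this adaptive comparison (the initial threshold and the value of μ are assumptions; the patent leaves μ tunable):

```python
import numpy as np

def detect_speech(energies, mu=0.1, beta0=0.0):
    """Flag frames whose smoothed Doppler energy Ed(t) exceeds the adaptive threshold."""
    beta, prev_e, flags = beta0, energies[0], []
    for e in energies:
        beta += mu * (e - prev_e)      # beta_t = beta_{t-1} + mu * (Ed(t) - Ed(t-1))
        flags.append(e > beta)         # speech activity if energy exceeds threshold
        prev_e = e
    return np.array(flags)
```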


If the frame is not indicative of speech, then we assume an end of an utterance 260 event. An utterance is defined as a sequence of one or more frames of speech activity followed by a frame that is not speech. The energy Ec of the current audio frame 204 and the energy Ep of the last confirmed frame 289 that includes speech are compared 285 according to αEp ≤ Ec. The scalar α is a selectable parameter between 0 and 1 used to classify frames as speech 291 or non-speech 292, respectively.


This event initiates end of speech detection 270, which operates only on the audio signal. The method continues 275 to detect speech up to three frames after the end of utterance event. Finally, adjacent speech segments that are within 200 ms of each other are merged.
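The final merging step might look like the following sketch (representing detected speech segments as (start, end) pairs in seconds is our assumption):

```python
def merge_segments(segments, gap=0.2):
    """Merge speech segments that are within `gap` seconds of each other."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))  # extend previous
        else:
            merged.append((start, end))
    return merged

print(merge_segments([(0.0, 1.0), (1.1, 2.0), (3.0, 4.0)]))
# [(0.0, 2.0), (3.0, 4.0)] -- segments within 200 ms are merged
```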


EFFECT OF THE INVENTION

The interface according to the embodiments of the invention detects speech only when speech is directed at the interface. The interface also concatenates adjacent speech utterances. The interface excludes non-speech audio signals.


The ultrasonic Doppler sensor is accurate at SNRs as low as −10 dB. The interface is also relatively insensitive to false alarms.


The interface has several advantages. It is inexpensive, has a low false-trigger rate, and is not affected by ambient out-of-band noise. Also, due to the finite range of the ultrasonic receiver, the output is not affected by distant movements.


The interface only uses the Doppler signals to make the initial decision whether speech activity is present or not. The audio signal can optionally be used to concatenate adjacent short utterances into continuous speech segments.


Although the invention has been described by way of examples of preferred embodiments, it is to be understood that various other adaptations and modifications may be made within the spirit and scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.

Claims
  • 1. A method for detecting speech activity, comprising: directing an ultrasonic signal at a face of a speaker over time; acquiring a Doppler signal of the ultrasonic signal after reflection by the face; measuring an energy in the Doppler signal over time; and comparing the energy over time to a predetermined threshold to detect speech activity of the speaker.
  • 2. The method of claim 1, further comprising: frequency demodulating the Doppler signal before the measuring.
  • 3. The method of claim 2, in which the frequency demodulation is into a range of frequency bands.
  • 4. The method of claim 1, further comprising: sampling the Doppler signal; and partitioning the samples into frames before the measuring.
  • 5. The method of claim 4, in which the frames overlap in time.
  • 6. The method of claim 2, further comprising: extracting discrete Fourier transform (DFT) coefficients from the demodulated Doppler signal; and measuring the energy from the DFT coefficients.
  • 7. The method of claim 1, further comprising: filtering the Doppler signal to smooth the energy before the measuring.
  • 8. The method of claim 7, further comprising: determining a median of the energy over time before the comparing using the filtering.
  • 9. The method of claim 1, further comprising: acquiring concurrently an audio signal while acquiring the Doppler signal; and processing the audio signal only while detecting the speech activity.
  • 10. The method of claim 1, further comprising: heterodyning the Doppler signal before the measuring.
  • 11. The method of claim 1, in which the ultrasonic signal is spatially narrow beam.
  • 12. The method of claim 11, in which the ultrasonic signal has a bandwidth corresponding to a bandwidth of the demodulated Doppler signal.
  • 13. The method of claim 9, in which the acquiring is performed with colocated sensors.
  • 14. The method of claim 1, in which a bandwidth of the ultrasonic signal corresponds to a bandwidth of frequencies at which articulators of the face move while speaking.
  • 15. The method of claim 2, in which the energy is obtained from an amplitude of the demodulated Doppler signal.
  • 16. The method of claim 2, in which the demodulating is similar to spectral-decomposition of the ultrasonic signal.
  • 17. The method of claim 1, further comprising: sampling the ultrasonic signal to obtain overlapping frames.
  • 18. A system for detecting speech activity, comprising: a transmitter configured to direct an ultrasonic signal at a face of a speaker; a receiver configured to acquire a Doppler signal of the ultrasonic signal after reflection by the face; means for measuring an energy in the Doppler signal; and means for comparing the energy to a threshold to detect speech activity.
  • 19. An apparatus for detecting speech activity, comprising: an emitter configured to direct an ultrasonic signal at a face of a speaker; a transducer configured to acquire a Doppler signal of the ultrasonic signal after reflection by the face; a microphone configured to acquire an audio signal; and means coupled to the transducer and microphone to detect speech activity in the audio signal based on an energy of the Doppler signal.
  • 20. The apparatus of claim 19, in which the emitter, transducer and microphone are colocated.