Embodiments of the present disclosure relate generally to a system and method for performing automatic speech recognition (ASR) using end-pointing markers generated using an accelerometer-based voice activity detector.
Currently, a number of consumer electronic devices are adapted to receive speech via microphone ports or headsets. While the typical example is a portable telecommunications device (mobile telephone), with the advent of Voice over IP (VoIP), desktop computers, laptop computers, and tablet computers may also be used to perform voice communications.
When using these electronic devices, the user also has the option of using the speakerphone mode or a wired headset to receive his speech. However, a common complaint with these hands-free modes of operation is that the speech captured by the microphone port or the headset includes environmental noise, such as wind noise, secondary speakers in the background, or other background noises. This environmental noise often renders the user's speech unintelligible and thus degrades the quality of the voice communication.
When performing speech recognition, the electronic device may assess speech captured by the microphone port or headset that comes from secondary speakers in the background in addition to speech from the electronic device's primary user (or speaker).
The embodiments of the invention are illustrated by way of example and not by way of limitation in the figures of the accompanying drawings in which like references indicate similar elements. It should be noted that references to “an” or “one” embodiment of the invention in this disclosure are not necessarily to the same embodiment, and they mean at least one. In the drawings:
In the following description, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In other instances, well-known circuits, structures, and techniques have not been shown to avoid obscuring the understanding of this description.
The present disclosure relates generally to systems and methods for performing ASR using end-pointing markers generated using an accelerometer-based voice activity detector. In one example system, at least one accelerometer is included in at least one earbud to detect vibration of the user's vocal chords. The at least one accelerometer is used to generate data output that is used by an accelerometer-based voice activity detector (VADa) to generate a VADa output. The VADa is a more robust voice activity detector that is less affected by ambient acoustic noise. Accordingly, the VADa may more accurately detect speech by the primary speaker rather than speech from a secondary speaker in the background. The VADa output is then used to perform the ASR on the acoustic signals received from at least one microphone that may be included in at least one earbud.
In one embodiment, each of the earbuds 110L, 110R is a wireless earbud and may also include a battery device, a processor, and a communication interface (not shown). In this embodiment, the processor may be a digital signal processing chip that processes the acoustic signal from at least one of the microphones 111BR, 111ER and the inertial sensor output from the accelerometer 113R. In one embodiment, the beamformers' patterns illustrated in
The communication interface may include a Bluetooth™ receiver and transmitter to communicate acoustic signals from the microphones 111BR, 111ER, and the inertial sensor output from the accelerometer 113R wirelessly in both directions (uplink and downlink) with the electronic device. In some embodiments, the communication interface communicates an encoded signal from a speech codec 156 to the electronic device 10.
When the user speaks, his speech signals may include voiced speech and unvoiced speech. Voiced speech is speech that is generated with excitation or vibration of the user's vocal chords. In contrast, unvoiced speech is speech that is generated without excitation of the user's vocal chords. For example, unvoiced speech sounds include /s/, /sh/, /f/, etc. Accordingly, in some embodiments, both types of speech (voiced and unvoiced) are detected in order to generate an augmented voice activity detector (VAD) output, which more faithfully represents the user's speech.
First, in order to detect the user's voiced speech, in one embodiment, the output data signal from accelerometer 113 placed in each earbud 110 together with the signals from the microphones 111B, 111E or the microphone array 121_1-121_M or the beamformer may be used. The accelerometer 113 may be a sensing device that measures proper acceleration in three directions, X, Y, and Z or in only one or two directions. When the user is generating voiced speech, the vibrations of the user's vocal chords are filtered by the vocal tract and cause vibrations in the bones of the user's head which are detected by the accelerometer 113 in the headset 100. In other embodiments, an inertial sensor, a force sensor or a position, orientation and movement sensor may be used in lieu of the accelerometer 113 in the headset 100.
In the embodiment with the accelerometer 113, the accelerometer 113 is used to detect the low frequencies since the low frequencies include the user's voiced speech signals. For example, the accelerometer 113 may be tuned such that it is sensitive to the frequency band range that is below 2000 Hz. In one embodiment, the signals below 60 Hz-70 Hz may be filtered out using a high-pass filter and the signals above 2000 Hz-3000 Hz may be filtered out using a low-pass filter. In one embodiment, the sampling rate of the accelerometer may be 2000 Hz but in other embodiments, the sampling rate may be between 2000 Hz and 6000 Hz. In another embodiment, the accelerometer 113 may be tuned to a frequency band range under 1000 Hz. It is understood that the dynamic range may be optimized to provide more resolution within a forced range that is expected to be produced by the bone conduction effect in the headset 100. Based on the outputs of the accelerometer 113, an accelerometer-based VAD output (VADa) may be generated, which indicates whether or not the accelerometer 113 detected speech generated by the vibrations of the vocal chords. In one embodiment, the power or energy level of the outputs of the accelerometer 113 is assessed to determine whether the vibration of the vocal chords is detected. The power may be compared to a threshold level that indicates the vibrations are found in the outputs of the accelerometer 113. In another embodiment, the VADa signal indicating voiced speech is computed using the normalized cross-correlation between any pair of the accelerometer signals (e.g., X and Y, X and Z, or Y and Z). If the cross-correlation has values exceeding a threshold within a short delay interval, the VADa indicates that the voiced speech is detected.
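By way of illustration only, the power-threshold variant of the VADa described above may be sketched as follows. The function names, the non-overlapping framing, and the particular threshold value are assumptions made for this sketch and are not taken from the disclosure.

```python
from typing import List, Sequence


def frame_power(samples: Sequence[float]) -> float:
    """Mean squared amplitude of one frame (0.0 for an empty frame)."""
    return sum(s * s for s in samples) / len(samples) if samples else 0.0


def vada_from_power(signal: Sequence[float], frame_len: int,
                    threshold: float) -> List[int]:
    """Return one binary VADa decision per non-overlapping frame.

    A frame is marked 1 when its power exceeds the (assumed) threshold,
    and 0 otherwise. Trailing samples that do not fill a frame are ignored.
    """
    decisions = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        decisions.append(1 if frame_power(frame) > threshold else 0)
    return decisions


# Example: at a 2000 Hz sampling rate, 20 samples correspond to a 10 ms frame.
accel = [0.0] * 20 + [1.0, -1.0] * 10 + [0.0] * 20
decisions = vada_from_power(accel, frame_len=20, threshold=0.1)
```

In practice the threshold would be chosen relative to the sensor's noise floor; the value above is only a placeholder.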
In some embodiments, the VADa is a binary output that is generated as a voice activity detector (VAD), wherein 1 indicates that the vibrations of the vocal chords have been detected and 0 indicates that no vibrations of the vocal chords have been detected.
Using at least one of the microphones in the headset 100 (e.g., one of the microphones in the microphone array 121_1-121_M, back earbud microphone 111B, or end earbud microphone 111E) or the output of a beamformer, a microphone-based VAD output (VADm) may be generated by the VAD to indicate whether or not speech is detected. This determination may be based on an analysis of the power or energy present in the acoustic signal received by the microphone. The power in the acoustic signal may be compared to a threshold that indicates that speech is present. In another embodiment, the VADm signal indicating speech is computed using the normalized cross-correlation between any pair of the microphone signals (e.g., 121_1 and 121_M). If the cross-correlation has values exceeding a threshold within a short delay interval, the VADm indicates that the speech is detected. In some embodiments, the VADm is a binary output that is generated as a voice activity detector (VAD), wherein 1 indicates that the speech has been detected in the acoustic signals and 0 indicates that no speech has been detected in the acoustic signals.
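The normalized cross-correlation variant, which the description applies both to pairs of microphone signals and to pairs of accelerometer axes, may be sketched as follows. This is a minimal illustration; the function names, the integer-lag search, and the threshold are assumptions of the sketch.

```python
import math
from typing import Sequence


def normalized_xcorr(x: Sequence[float], y: Sequence[float], lag: int) -> float:
    """Normalized cross-correlation between x[n] and y[n + lag].

    Returns a value in [-1, 1], or 0.0 when either overlap has no energy.
    """
    if lag >= 0:
        a, b = x[:len(x) - lag], y[lag:]
    else:
        a, b = x[-lag:], y[:len(y) + lag]
    n = min(len(a), len(b))
    a, b = a[:n], b[:n]
    num = sum(p * q for p, q in zip(a, b))
    den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return num / den if den > 0 else 0.0


def vad_from_xcorr(x: Sequence[float], y: Sequence[float],
                   max_lag: int, threshold: float) -> int:
    """Return 1 if the correlation exceeds the threshold at any lag
    within the short delay interval [-max_lag, max_lag], else 0."""
    peak = max(normalized_xcorr(x, y, lag)
               for lag in range(-max_lag, max_lag + 1))
    return 1 if peak > threshold else 0


# Example: the second signal is the first one delayed by one sample.
tone = [math.sin(2 * math.pi * 0.1 * n) for n in range(40)]
delayed = [0.0] + tone[:-1]
unrelated = [1.0, -1.0] * 20
```

Restricting the search to a short delay interval reflects the expectation that two sensors observing the same talker see nearly the same waveform with only a small relative delay.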
Both the VADa and the VADm may be subject to erroneous detections of voiced speech. For instance, the VADa may falsely identify the movement of the user or the headset 100 as being vibrations of the vocal chords while the VADm may falsely identify noises in the environment as being speech in the acoustic signals. Accordingly, in one embodiment, the VAD output (VADv) is set to indicate that the user's voiced speech is detected (e.g., VADv output is set to 1) if coincidence is detected between the speech detected in the acoustic signals (e.g., VADm) and the user's speech vibrations detected in the accelerometer data output signals (e.g., VADa). Conversely, the VAD output is set to indicate that the user's voiced speech is not detected (e.g., VADv output is set to 0) if this coincidence is not detected. In other words, the VADv output is obtained by applying an AND function to the VADa and VADm outputs.
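The AND combination described above may be illustrated per frame as follows. The function name and the assumption that both detectors emit one decision per frame on a common frame grid are choices made for this sketch.

```python
from typing import List, Sequence


def combine_vad(vada: Sequence[int], vadm: Sequence[int]) -> List[int]:
    """Frame-wise AND of the accelerometer and microphone VAD outputs.

    Both inputs are assumed to be binary sequences on the same frame grid.
    """
    if len(vada) != len(vadm):
        raise ValueError("VADa and VADm must cover the same frames")
    return [a & m for a, m in zip(vada, vadm)]


# Example: only the first frame has both detectors active.
vadv = combine_vad([1, 1, 0, 0], [1, 0, 1, 0])
```

A frame flagged only by the VADa (e.g., head movement) or only by the VADm (e.g., a background talker) is therefore rejected.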
As shown in
In
The electronic device 10 also includes a voice activity detector (VAD) 130 that generates an accelerometer VAD output (VADa) based on data output by the at least one accelerometer 113L. As shown in
The accelerometer data output signals (or accelerometer signals) may first be pre-conditioned. First, the DC component and the low frequency components are removed by applying a high-pass filter with a cut-off frequency of 60 Hz-70 Hz, for example. Second, the stationary noise is removed from the accelerometer signals by applying a spectral subtraction method for noise suppression. Third, the cross-talk or echo introduced in the accelerometer signals by the speakers in the earbuds may also be removed. This cross-talk or echo suppression can employ any known methods for echo cancellation. Once the accelerometer signals are pre-conditioned, the VAD 130 may use these signals to generate the VADa output. In one embodiment, the VADa output is generated by using one of the X, Y, and Z accelerometer signals which shows the highest sensitivity to the user's speech or by adding the three accelerometer signals and computing the power envelope for the resulting signal. When the power envelope is above a given threshold, the VADa output is set to 1; otherwise, it is set to 0. In another embodiment, the VADa output indicating voiced speech is computed using the normalized cross-correlation between any pair of the accelerometer signals (e.g., X and Y, X and Z, or Y and Z). If the cross-correlation has values exceeding a threshold within a short delay interval, the VADa output indicates that the voiced speech is detected. In another embodiment, a combined VAD output is generated by computing the coincidence as an “AND” function between the VADm from one of the microphone signals or beamformer output and the VADa from one or more of the accelerometer signals. This coincidence between the VADm from the microphones and the VADa from the accelerometer signals ensures that the VAD is set to 1 only when both signals display significant correlated energy, such as the case when the user is speaking.
In another embodiment, when at least one of the accelerometer signals (e.g., X, Y, or Z signals) indicates that the user's speech is detected and is greater than a required threshold and the acoustic signals received from the microphones also indicate that the user's speech is detected and are also greater than the required threshold, the VAD output is set to 1; otherwise, it is set to 0. In some embodiments, an exponential decay function and a smoothing function are further applied to the VADa output.
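A minimal sketch of the summed-axis, power-envelope variant is given below. It covers only the high-pass step of the pre-conditioning; the spectral subtraction and echo cancellation steps are omitted. The first-order filter, the one-pole envelope smoother, and all parameter values are assumptions of the sketch rather than details of the disclosure.

```python
import math
from typing import List, Sequence


def high_pass(signal: Sequence[float], sample_rate: float,
              cutoff_hz: float) -> List[float]:
    """First-order (RC-style) high-pass filter that removes DC and drift."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = rc / (rc + 1.0 / sample_rate)
    out: List[float] = []
    for i, x in enumerate(signal):
        if i == 0:
            out.append(x)
        else:
            out.append(alpha * (out[-1] + x - signal[i - 1]))
    return out


def power_envelope(signal: Sequence[float], smoothing: float) -> List[float]:
    """One-pole smoothed instantaneous power; smoothing is in (0, 1]."""
    env, out = 0.0, []
    for x in signal:
        env = (1.0 - smoothing) * env + smoothing * x * x
        out.append(env)
    return out


def vada_from_axes(x: Sequence[float], y: Sequence[float], z: Sequence[float],
                   sample_rate: float, cutoff_hz: float,
                   smoothing: float, threshold: float) -> List[int]:
    """Sum the three axes, high-pass filter, and threshold the power envelope."""
    summed = [a + b + c for a, b, c in zip(x, y, z)]
    filtered = high_pass(summed, sample_rate, cutoff_hz)
    return [1 if e > threshold else 0
            for e in power_envelope(filtered, smoothing)]


# Example at 2000 Hz: a 200 Hz vibration on X versus a constant (gravity) offset.
n = 400
voiced = [math.sin(2 * math.pi * 200 * k / 2000) for k in range(n)]
still = [1.0] * n
zeros = [0.0] * n
```

The constant offset models the gravity component that an accelerometer reports at rest, which the high-pass step is intended to remove before the power is assessed.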
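The description does not specify the exponential decay function or the smoothing function. One possible reading, offered purely as an assumption, is a decaying hold that keeps the output active briefly after the last detection, followed by a three-frame majority filter that removes isolated single-frame detections.

```python
from typing import List, Sequence


def decay_hold(vad: Sequence[int], decay: float, floor: float) -> List[int]:
    """Hold the VAD active while an exponentially decaying level stays
    at or above `floor`. The level resets to 1.0 on every active frame."""
    level = 0.0
    out = []
    for v in vad:
        level = 1.0 if v else level * decay
        out.append(1 if level >= floor else 0)
    return out


def majority3(vad: Sequence[int]) -> List[int]:
    """Three-frame majority vote (edges are padded by repetition)."""
    if not vad:
        return []
    padded = [vad[0]] + list(vad) + [vad[-1]]
    return [1 if padded[i] + padded[i + 1] + padded[i + 2] >= 2 else 0
            for i in range(len(vad))]


# Example: a single detection is held for two extra frames,
# and an isolated one-frame detection is removed by the majority vote.
held = decay_hold([1, 0, 0, 0, 0], decay=0.5, floor=0.2)
smoothed = majority3([0, 1, 0, 0, 1, 1, 1, 0])
```

The hold avoids clipping word endings in which the vibration fades, while the majority vote suppresses brief spurious detections such as a single tap on the earbud.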
Referring back to
The voice processor 150 may include a beamformer 152, a noise suppressor 153, a spectral mixer 154, an AGC controller 155, and a speech codec 156. In some embodiments, the headset 100 is coupled to the electronic device 10 wirelessly and communicates the output of the speech codec 156 to the electronic device 10. In this embodiment, the earbuds 110L, 110R include the beamformer 152, noise suppressor 153, spectral mixer 154, AGC controller 155, and speech codec 156. In other embodiments, the earbuds 110L, 110R are coupled to the electronic device 10 via the headset wire 120 and the electronic device 10 includes the beamformer 152, noise suppressor 153, spectral mixer 154, AGC controller 155, and speech codec 156.
The beamformer 152 receives the acoustic signals from at least one of the microphones 111BL and 111EL as illustrated in
In one embodiment, the VADa output may be used to steer the beamformer 152. For example, when the VADa output is set to 1, one microphone in one of the earbuds 110L, 110R may detect the direction of the user's mouth and steer a beamformer in the direction of the user's mouth to capture the user's speech while another microphone in one of the earbuds 110L, 110R may steer a cardioid or other beamforming patterns in the opposite direction of the user's mouth to capture the environmental noise with as little contamination of the user's speech as possible. In this embodiment, when the VADa output is set to 0, one or more microphones in one of the earbuds 110L, 110R may detect the direction and steer a second beamformer in the direction of the main noise source or in the direction of the individual noise sources from the environment.
In the embodiment illustrated in
Referring back to
The noise suppressor 153 may be a 2-channel noise suppressor that can perform adequately for both stationary and non-stationary noise estimation. In one embodiment, the noise suppressor 153 includes a two-channel noise estimator that produces noise estimates that are noise estimate vectors, where the vectors have several spectral noise estimate components, each being a value associated with a different audio frequency bin. This is based on a frequency domain representation of the discrete time audio signal, within a given time interval or frame.
The noise suppressor 153 then uses the output noise estimate generated by the two-channel noise estimator to attenuate the voice beam signal. The action of the noise suppressor 153 may be in accordance with a conventional gain versus SNR curve, where typically the attenuation is greater when the noise estimate is greater. The attenuation may be applied in the frequency domain, on a per frequency bin basis, and in accordance with a per frequency bin noise estimate which is provided by the two-channel noise estimator. The noise suppressed voice beam signal (e.g., clean beamformer signal) is then outputted to the spectral mixer 154.
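The per-frequency-bin attenuation may be illustrated with a simple gain rule. The Wiener-style curve and the gain floor used here are assumptions chosen for the sketch; the description only requires a conventional gain versus SNR curve in which attenuation increases with the noise estimate.

```python
from typing import List, Sequence


def suppression_gains(signal_power: Sequence[float],
                      noise_power: Sequence[float],
                      min_gain: float = 0.1) -> List[float]:
    """Per-bin gain from a per-bin noise estimate.

    Uses gain = snr / (1 + snr), with snr estimated as
    max(signal - noise, 0) / noise, and never drops below min_gain.
    """
    gains = []
    for s, n in zip(signal_power, noise_power):
        if n <= 0.0:
            gains.append(1.0)  # no noise estimated in this bin
            continue
        snr = max(s - n, 0.0) / n
        gains.append(max(snr / (1.0 + snr), min_gain))
    return gains


def apply_gains(bins: Sequence[complex], gains: Sequence[float]) -> List[complex]:
    """Scale each frequency bin of the voice beam signal by its gain."""
    return [g * b for g, b in zip(gains, bins)]


# Example: three bins with high, zero, and undefined (noise-free) SNR.
gains = suppression_gains([4.0, 1.0, 2.0], [1.0, 1.0, 0.0])
```

The gain floor limits the maximum attenuation, which is a common way to reduce audible artifacts in bins dominated by noise.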
The spectral mixer 154 may receive (i) the accelerometer signal (e.g., from at least one accelerometer 113L) and (ii) the clean beamformer signal (e.g., the noise suppressed or de-noised beamformer signal). The spectral mixer 154 performs spectral mixing of the received signals to generate a mixed signal. In one embodiment, the spectral mixer 154 generates a mixed signal that includes the accelerometer signal to account for the low frequency band (e.g., 800 Hz and under) of the mixed signal, and the clean beamformer signal to account for the high frequency band (e.g., over 4000 Hz).
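One way to picture the spectral mixing is as a per-bin weighted sum. The description gives example band edges but does not state how the band between them is handled; the linear cross-fade below is purely an assumption of this sketch, as are the function names.

```python
from typing import List, Sequence


def mix_weights(bin_freqs_hz: Sequence[float],
                low_edge_hz: float = 800.0,
                high_edge_hz: float = 4000.0) -> List[float]:
    """Weight given to the accelerometer signal in each frequency bin:
    1.0 at or below the low edge, 0.0 at or above the high edge,
    and a linear cross-fade in between (assumed)."""
    weights = []
    for f in bin_freqs_hz:
        if f <= low_edge_hz:
            w = 1.0
        elif f >= high_edge_hz:
            w = 0.0
        else:
            w = (high_edge_hz - f) / (high_edge_hz - low_edge_hz)
        weights.append(w)
    return weights


def spectral_mix(accel_bins: Sequence[float], beam_bins: Sequence[float],
                 bin_freqs_hz: Sequence[float]) -> List[float]:
    """Blend accelerometer and clean beamformer spectra bin by bin."""
    weights = mix_weights(bin_freqs_hz)
    return [w * a + (1.0 - w) * b
            for w, a, b in zip(weights, accel_bins, beam_bins)]


# Example: bins at 400 Hz, 2400 Hz, and 5000 Hz.
mixed = spectral_mix([2.0, 2.0, 2.0], [4.0, 4.0, 4.0], [400.0, 2400.0, 5000.0])
```

Both inputs are assumed to be available on the same frequency grid, which in practice would require resampling the accelerometer signal to the microphone sampling rate.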
The AGC controller 155 receives the mixed signal from the spectral mixer 154 and performs AGC on the mixed signal based on the VADa output received from the VAD 130. The speech codec 156 receives the AGC output from the AGC controller 155 and performs encoding on the AGC output based on the VADa output from the VAD 130. The speech codec may generate a speech signal.
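How the AGC uses the VADa output is not detailed in the description. A plausible reading, stated here as an assumption, is that the gain adapts toward a target level only during frames in which the VADa indicates speech and is frozen otherwise, so that the gain is not driven up by background noise between utterances.

```python
from typing import List, Sequence


def agc_gains(frame_rms: Sequence[float], vada: Sequence[int],
              target_rms: float, step: float,
              initial_gain: float = 1.0) -> List[float]:
    """Per-frame AGC gain that adapts only while VADa is 1.

    On active frames the gain moves a fraction `step` of the way toward
    target_rms / frame_rms; on inactive frames it is held unchanged.
    """
    gain = initial_gain
    gains = []
    for rms, active in zip(frame_rms, vada):
        if active and rms > 0.0:
            desired = target_rms / rms
            gain += step * (desired - gain)
        gains.append(gain)
    return gains


# Example: the quiet middle frame has VADa = 0, so the gain is held.
gains = agc_gains([0.5, 0.001, 0.5], [1, 0, 1], target_rms=1.0, step=0.5)
```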
Referring back to
In
The VADa decoder 161 and the speech decoder 163 receive and decode the encoded combined signal to respectively obtain a decoded VADa output and a decoded speech signal. In one embodiment, the VADa decoder 161 may pass the combined signal through a low-pass filter and the speech decoder 163 may pass the combined signal through a high-pass filter. In one embodiment, both filters may have a cutoff frequency of about 80 Hz. The VADa decoder 161 may detect whether, in each frame of 10 ms, for example, there is either a positive or a negative semi-sinusoid. If the VADa decoder 161 detects either the positive or the negative semi-sinusoid, then the VADa decoder 161 generates the decoded VADa output that indicates that voice activity is detected; otherwise, the VADa decoder 161 generates the decoded VADa output that indicates that voice activity is not detected.
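The semi-sinusoid detection may be sketched as a template match on the low-pass filtered signal. The matching method (normalized correlation against a half-sine template), the thresholds, and the assumption that frame boundaries are already aligned are all choices made for this illustration; the low-pass filtering itself is assumed to have been applied beforehand.

```python
import math
from typing import List, Sequence


def half_sine(frame_len: int, amplitude: float = 1.0) -> List[float]:
    """One positive semi-sinusoid spanning exactly one frame."""
    return [amplitude * math.sin(math.pi * n / frame_len)
            for n in range(frame_len)]


def decode_vada(lowpassed: Sequence[float], frame_len: int,
                corr_threshold: float = 0.9,
                min_energy: float = 1e-6) -> List[int]:
    """Return 1 for each frame that contains a positive or negative
    semi-sinusoid, judged by normalized correlation with a template."""
    template = half_sine(frame_len)
    t_energy = sum(t * t for t in template)
    out = []
    for start in range(0, len(lowpassed) - frame_len + 1, frame_len):
        frame = lowpassed[start:start + frame_len]
        energy = sum(s * s for s in frame)
        if energy < min_energy:
            out.append(0)  # effectively silent frame: no marker present
            continue
        corr = sum(s * t for s, t in zip(frame, template))
        corr /= math.sqrt(energy * t_energy)
        out.append(1 if abs(corr) >= corr_threshold else 0)
    return out


# Example at 8000 Hz: 80 samples per 10 ms frame.
stream = half_sine(80, 0.1) + [0.0] * 80 + half_sine(80, -0.1)
decoded = decode_vada(stream, frame_len=80)
```

Taking the absolute value of the correlation lets the same template match both polarities of the semi-sinusoid.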
The decoded VADa output is provided to the end-pointer 162 which is a server-side end-pointer in system 300. The end-pointer 162 may include a Deep Neural Network (DNN). The end-pointer 162 generates end-pointing markers (e.g., indicating beginning and ending of the user or primary speaker's utterance) based on the decoded VADa output from the VADa decoder 161. The ASR module 164 may generate acoustic and linguistic information during the decoding process from the acoustic model and the linguistic model that is transmitted to the end-pointer 162. In one embodiment, the end-pointer 162 generates end-pointing markers based on the VADa output and the acoustic and linguistic information that is received from the ASR module 164. The ASR module 164 may perform ASR on the speech signal based on the end-pointing markers received from the end-pointer 162. The ASR module 164 may be implemented to have a front-end DNN. The ASR module 164 may generate an ASR output that is transmitted back to the electronic device 10 wirelessly. The ASR output may include the text of the speech signal.
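As a simplified stand-in for the end-pointer 162, which the description says may include a DNN, the following rule-based sketch derives beginning and ending markers from a binary VADa sequence alone. The minimum-silence rule and the function name are assumptions; the acoustic and linguistic information from the ASR module is not modeled.

```python
from typing import List, Sequence, Tuple


def end_point_markers(vada: Sequence[int],
                      min_silence_frames: int) -> List[Tuple[int, int]]:
    """Return (begin, end) frame indices for each utterance.

    An utterance begins at the first active frame and ends once
    `min_silence_frames` consecutive inactive frames are seen.
    `end` is exclusive: it is the index just after the last active frame.
    """
    markers = []
    start = None
    silence = 0
    for i, v in enumerate(vada):
        if v:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                markers.append((start, i - silence + 1))
                start, silence = None, 0
    if start is not None:  # utterance still open at the end of the input
        markers.append((start, len(vada) - silence))
    return markers


# Example: a one-frame pause does not split the first utterance.
markers = end_point_markers([0, 1, 1, 0, 1, 0, 0, 0, 1], min_silence_frames=2)
```

Because the VADa responds to the primary user's vocal vibrations, markers derived from it are less likely to be extended by a secondary speaker talking in the background.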
Contrary to system 300 in
In the embodiment in
In system 500B, the electronic device 10 includes VAD 130 that generates a VADa output based on data output by the at least one accelerometer 113L. The electronic device 10 in
The VADa output is provided to the end-pointer 162 which is included in the ASR engine 160 that is also included in the electronic device 10 in system 500B. The end-pointer 162 may include a Deep Neural Network (DNN). The end-pointer 162 generates end-pointing markers (e.g., indicating beginning and ending of the user or primary speaker's utterance) based on the VADa output. The ASR module 164 may generate acoustic and linguistic information during the decoding process from the acoustic model and the linguistic model that is transmitted to the end-pointer 162. In one embodiment, the end-pointer 162 generates end-pointing markers based on the VADa output and the acoustic and linguistic information that is received from the ASR module 164. The ASR module 164 may perform ASR on the speech signal based on the end-pointing markers received from the end-pointer 162. The ASR module 164 may be implemented to have a front-end DNN. The ASR module 164 may generate an ASR output that is further processed by the electronic device 10. For example, the ASR output may include the text of the speech signal that the electronic device 10 displays on its display device (e.g., touch screen or display screen).
Contrary to
In another embodiment, the accelerometer signal received by the ASR engine 160 may also be received by the ASR module 164. In this embodiment, the accelerometer signal can be applied as a secondary input to the ASR module 164. Based on the accelerometer signal, the speech signal, and the end-pointing markers, the ASR module 164 in this embodiment performs ASR and generates an ASR output.
Contrary to the system in
The following embodiments of the invention may be described as a process, which is usually depicted as a flowchart, a flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged. A process is terminated when its operations are completed. A process may correspond to a method, a procedure, etc.
The method 800 starts, at Block 801, with a voice activity detector (VAD) generating an accelerometer VAD output (VADa) based on data output by at least one accelerometer that is included in at least one earbud. The at least one accelerometer detects vibration of the user's vocal chords. In one embodiment, the VAD is included in an ASR engine included in a server. In this embodiment, the electronic device transmits the data output by the at least one accelerometer to the ASR engine and the ASR engine computes the VADa output using the server side VAD. In another embodiment, the VAD is included in an electronic device. In this embodiment, the VADa output is generated by the device-side VAD and transmitted to the ASR engine.
At Block 802, a voice processor generates a speech signal based on acoustic signals from at least one microphone. The voice processor may be included in the electronic device. In one embodiment, the VADa output generated by the VAD included in the electronic device and the speech signal from the voice processor are encoded into a combined signal by an encoder included in the electronic device. The ASR engine in this embodiment then decodes the combined signal to obtain a decoded VADa output and a decoded speech signal.
At Block 803, an end-pointer generates the end-pointing markers based on the VADa output. In one embodiment, the end-pointer is included in the ASR engine. The ASR engine may be included on a server.
At Block 804, an ASR engine performs ASR on the speech signal based on the end-pointing markers. In one embodiment, the ASR module included in the ASR engine generates acoustic and linguistic information. In this embodiment, the end-pointer may generate the end-pointing markers based on the decoded VADa output and the acoustic and linguistic information from the ASR module.
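The use of the end-pointing markers at Block 804 may be illustrated as follows: the speech signal is cut at the marked frame boundaries and only the marked segments are passed to a recognizer. The `recognize` callable is a hypothetical placeholder for the ASR module, and the frame-index marker format is an assumption of the sketch.

```python
from typing import Callable, List, Sequence, Tuple


def utterance_segments(speech: Sequence[float],
                       markers: Sequence[Tuple[int, int]],
                       frame_len: int) -> List[List[float]]:
    """Cut the speech signal into one sample list per (begin, end) marker.

    Markers are frame indices with an exclusive end, so each segment
    covers samples [begin * frame_len, end * frame_len).
    """
    return [list(speech[b * frame_len:e * frame_len]) for b, e in markers]


def run_asr(speech: Sequence[float],
            markers: Sequence[Tuple[int, int]],
            frame_len: int,
            recognize: Callable[[List[float]], str]) -> List[str]:
    """Apply a recognizer (hypothetical callable) to each marked utterance."""
    return [recognize(seg)
            for seg in utterance_segments(speech, markers, frame_len)]


# Example with a stub recognizer that only reports the segment length.
speech = [float(n) for n in range(12)]
segments = utterance_segments(speech, [(1, 2)], frame_len=4)
texts = run_asr(speech, [(1, 2)], 4, lambda seg: f"{len(seg)} samples")
```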
Keeping the above points in mind,
An embodiment of the invention may be a machine-readable medium having stored thereon instructions which program a processor to perform some or all of the operations described above. A machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), such as Compact Disc Read-Only Memory (CD-ROM), Read-Only Memory (ROM), Random Access Memory (RAM), and Erasable Programmable Read-Only Memory (EPROM). In other embodiments, some of these operations might be performed by specific hardware components that contain hardwired logic. Those operations might alternatively be performed by any combination of programmable computer components and fixed hardware circuit components.
While the invention has been described in terms of several embodiments, those of ordinary skill in the art will recognize that the invention is not limited to the embodiments described, but can be practiced with modification and alteration within the spirit and scope of the appended claims. The description is thus to be regarded as illustrative instead of limiting. There are numerous other variations to different aspects of the invention described above, which in the interest of conciseness have not been provided in detail. Accordingly, other embodiments are within the scope of the claims.