There are voice wakeup systems designed to allow a user to perform a voice search by uttering a query immediately after uttering a keyword. A typical example of a voice search (assuming the keyword is “Hello VoiceQ” and the query is “find the nearest gas station”) would be “Hello VoiceQ, find the nearest gas station.” Typically, the entire voice search utterance, including both the keyword and the query, is sent to an automatic speech recognition (ASR) engine for further processing. This can result in the ASR engine not properly recognizing the query. This failure can be due to the ASR engine confusing the keyword and query, e.g., mistakenly considering part of the keyword to be part of the query or mistakenly considering part of the query to be part of the keyword. As a result, the voice search may not be performed as the user intended.
Better voice search results could be obtained if the entire query, and only the query, were sent to the ASR engine. It is, therefore, desirable to accurately and reliably locate the end of the keyword and the start of the query, and then send just the query to the ASR engine for further processing.
The technology disclosed herein relates to systems and methods for locating the end of a keyword in acoustic signals. Various embodiments of the disclosure can provide methods and systems for facilitating more accurate and reliable voice search based on an audio input including a voice search query uttered after a keyword. The keyword can be designed to trigger a wakeup of a voice sensing system (e.g., “Hello VoiceQ”), whereas the query (e.g., “find the nearest gas station”) includes information upon which a search can be performed.
Various embodiments of the disclosure can facilitate more accurate voice searches by providing a clean query to the automatic speech recognition (ASR) engine for further processing. The clean query can include the entire query and only the entire query, absent any part of the keyword. This approach can assist the ASR engine by determining the end of the keyword and separating out the query so that the ASR engine can more quickly and more reliably respond to just the question posed in the query.
Various embodiments of the present disclosure may be practiced with any audio device operable to capture and process acoustic signals. In various embodiments, audio devices can include smart microphones which combine microphone(s) and other sensors into a single device. Various embodiments may be practiced in smart microphones that include voice activity detection for providing a wakeup feature. Low power applications can be enabled by allowing the voice wakeup feature to keep the smart microphone in a lower power mode until voice activity is detected.
In some embodiments, the audio devices may include hand-held devices, such as wired and/or wireless remote controls, notebook computers, tablet computers, phablets, smart phones, smart watches, media players, mobile telephones, and the like. In certain embodiments, the audio devices may include personal desktop computers, TV sets, car control and audio systems, smart thermostats, and so forth. The audio devices may have radio frequency (RF) receivers, transmitters, and transceivers, wired and/or wireless telecommunications and/or networking devices, amplifiers, audio and/or video players, encoders, decoders, loudspeakers, inputs, outputs, storage devices, and user input devices.
Referring now to the figures, an example environment including a smart microphone 110 coupled to a host device 120 is described below.
In some embodiments, the smart microphone 110 includes an acoustic sensor 112, a sigma-delta modulator 114, a downsampling element 116, a circular buffer 118, upsampling elements 126 and 128, an amplifier 132, a buffer control element 122, a control element 134, and a low power sound detect (LPSD) element 124. The acoustic sensor 112 may include, for example, a microelectromechanical system (MEMS) sensor, a piezoelectric sensor, and so forth. In various embodiments, components of the smart microphone 110 are implemented as combinations of hardware and programmed software. At least some of the components of the smart microphone 110 may be disposed on an application-specific integrated circuit (ASIC). Further details concerning various elements of the smart microphone 110 are provided below.
In various embodiments, the smart microphone 110 may operate in multiple operational modes, including a voice activity detection (VAD) mode, a signal transmit mode, and a burst mode. While operating in the voice activity detection mode, the smart microphone 110 may consume less power than in the signal transmit mode.
While in the VAD mode, the smart microphone 110 may detect voice activity. Upon detection of the voice activity, the select/status (SEL/STAT) signal may be sent from the smart microphone 110 to the host device 120 to indicate the presence of the voice activity detected by the smart microphone 110.
In some embodiments, the host device 120 includes various processing elements, such as a digital signal processing (DSP) element, a smart codec, a power management integrated circuit (PMIC), and so forth. The host device 120 may be part of a device, such as, but not limited to, a cellular phone, a smart phone, a personal computer, a tablet, and so forth. In some embodiments, the host device is communicatively connected to a cloud-based computational resource (also referred to as a computing cloud).
In response to receiving an indication of the presence of a voice activity, the host device 120 may start a wakeup process. After a wakeup latency period, the host device 120 may provide the smart microphone 110 with a clock signal (CLK) (for example, 768 kHz). Responsive to receipt of the external CLK signal, the smart microphone 110 can enter the signal transmit mode.
In the signal transmit mode, the smart microphone 110 may provide buffered audio data (DATA signal) to the host device 120 at the serial digital interface (SDI) input. In some embodiments, the buffered audio data may continue to be provided to the host device 120 as long as the host device 120 provides the external clock signal CLK to the smart microphone 110.
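For purposes of illustration only, the mode transitions described above can be sketched as a simple state machine. The following Python sketch is not part of the disclosure; the class and method names (SmartMic, on_voice_detected, on_external_clock) are hypothetical, and the SEL/STAT and DATA signaling is abstracted into callbacks on an assumed host object.

```python
from enum import Enum, auto

class Mode(Enum):
    VAD = auto()              # low-power voice activity detection mode
    SIGNAL_TRANSMIT = auto()  # streaming buffered audio to the host

class SmartMic:
    """Hypothetical sketch of the VAD / signal transmit handshake."""

    def __init__(self, host):
        self.mode = Mode.VAD
        self.host = host  # assumed object exposing assert_sel_stat()

    def on_voice_detected(self):
        # While in VAD mode, signal SEL/STAT to start the host wakeup.
        if self.mode is Mode.VAD:
            self.host.assert_sel_stat()

    def on_external_clock(self, clk_hz):
        # Receipt of an external clock (e.g., 768 kHz) moves the
        # microphone into signal transmit mode.
        self.mode = Mode.SIGNAL_TRANSMIT

    def on_clock_lost(self):
        # Buffered data is provided only while the host supplies CLK;
        # without it, the microphone returns to the VAD mode.
        self.mode = Mode.VAD
```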
In some embodiments, a burst mode can be employed by the smart microphone 110 in order to reduce the latency due to the buffering of the audio data. The burst mode can provide faster than real time transfer of data between the smart microphone 110 and the host device 120. Example methods employing a burst mode are provided in U.S. patent application Ser. No. 14/989,445, filed Jan. 6, 2016, entitled “Utilizing Digital Microphones for Low Power Keyword Detection and Noise Suppression”, which is incorporated herein by reference in its entirety.
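For illustration, the latency benefit of burst mode can be worked out with simple arithmetic: if buffered audio is transferred at k times real time while new audio continues to arrive at 1x real time, the backlog shrinks at (k - 1) times real time. In the sketch below, the 256 ms backlog and the 4x burst speedup are assumed example values, not parameters from the disclosure.

```python
def catch_up_time_ms(backlog_ms: float, burst_speedup: float) -> float:
    """Time for a burst-mode transfer to catch up to real time.

    While draining, new audio arrives at 1x real time, so the backlog
    shrinks at (burst_speedup - 1)x real time.
    """
    if burst_speedup <= 1.0:
        raise ValueError("burst transfer must be faster than real time")
    return backlog_ms / (burst_speedup - 1.0)

# Assumed example: a 256 ms backlog drained at 4x real time catches up
# to the live audio in about 85 ms.
print(catch_up_time_ms(256.0, 4.0))  # ~85.3
```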
The charge pump 212 can provide a voltage to charge up a diaphragm of the MEMS sensor 214. An acoustic signal including voice may move the diaphragm, causing the capacitance of the MEMS sensor 214 to vary and thereby generating an analog electrical signal.
The clock detector 224 can control which clock is provided to the sigma-delta modulator 226. If an external clock is provided (at the CLOCK pin 244), the clock detector 224 can use the external clock. In some embodiments, if no external clock is provided, the clock detector 224 uses the internal oscillator 222 for data timing/clocking.
The sigma-delta modulator 226 may convert the analog electrical signal into a digital signal. The output of the sigma-delta modulator (representing a one-bit serial stream) can be provided to the LPSD element for further processing. In some embodiments, the further processing includes voice activity detection. In certain embodiments, the further processing can also include keyword detection, for example, after detecting voice activity, determining that a specific keyword is present in the acoustic signal.
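A one-bit sigma-delta (PDM) stream can in principle be reduced to multi-bit PCM samples by low-pass filtering and decimation. The sketch below is illustrative only and is not the disclosed implementation: it uses a crude moving-average filter where a practical design would use a CIC or FIR decimator, and the decimation factor is an assumed value.

```python
import numpy as np

def pdm_to_pcm(pdm_bits: np.ndarray, decimation: int = 64) -> np.ndarray:
    """Reduce a one-bit PDM stream (values 0/1) to multi-bit PCM by
    low-pass filtering and decimating.

    A moving-average (boxcar) filter is used here only for clarity.
    """
    centered = 2.0 * pdm_bits.astype(np.float64) - 1.0   # map {0,1} to {-1,+1}
    kernel = np.ones(decimation) / decimation            # crude low-pass filter
    filtered = np.convolve(centered, kernel, mode="same")
    return filtered[::decimation]                        # keep every Nth sample
```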
In some embodiments, the smart microphone 200 may detect voice activity while operating in an ultra-low power mode, running only on an internal clock without the need for an external clock. In some embodiments, the LPSD element 124, the VAD gain element 230, and the circular buffer 118 are configured to run in the ultra-low power mode to provide VAD capabilities.
The LPSD element 124 can be operable to detect voice activity in the ultra-low power mode. Sensitivity of the LPSD element 124 may be controlled via the VAD gain element 230, which provides an input to the LPSD element 124. The LPSD element 124 can be operable to monitor incoming acoustic signals and determine the presence of a voice-like signature indicative of voice activity.
Upon detection of an acoustic activity that meets trigger requirements to qualify as voice activity, the smart microphone 200 can provide a signal at the SEL/STAT pin 248 to wake up a host device coupled to the smart microphone 200.
In some embodiments, the circular buffer 118 stores acoustic data generated prior to detection of voice activity. In some embodiments, the circular buffer 118 may store 256 milliseconds of acoustic data. The host device can provide a CLK signal to a smart microphone CLK pin. Once the CLK signal is detected, the smart microphone 200 may provide data at the DATA pin.
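As an illustration of the pre-trigger buffering described above, the following sketch keeps the most recent 256 milliseconds of audio and overwrites the oldest samples as new ones arrive. The 16 kHz sample rate and the class name CircularAudioBuffer are assumptions for the sketch, not details from the disclosure.

```python
import numpy as np

class CircularAudioBuffer:
    """Hypothetical pre-trigger buffer: keeps the most recent
    `duration_ms` of audio, overwriting the oldest samples."""

    def __init__(self, duration_ms: int = 256, sample_rate: int = 16000):
        self.capacity = duration_ms * sample_rate // 1000
        self.data = np.zeros(self.capacity, dtype=np.int16)
        self.write_pos = 0
        self.filled = 0

    def write(self, samples: np.ndarray) -> None:
        for s in samples:  # simple per-sample write for clarity
            self.data[self.write_pos] = s
            self.write_pos = (self.write_pos + 1) % self.capacity
            self.filled = min(self.filled + 1, self.capacity)

    def read_all(self) -> np.ndarray:
        """Return buffered samples in chronological order."""
        if self.filled < self.capacity:
            return self.data[: self.filled].copy()
        # When full, the oldest sample sits at write_pos; rotate it to front.
        return np.roll(self.data, -self.write_pos)[: self.filled].copy()
```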
In some embodiments, keyword detection can be performed within the smart microphone 110 described above.
A determination as to which frame corresponds to the end of the keyword may be made based on a confidence value (i.e., a posterior likelihood). The confidence value can represent a measurement of how well the part 320 of the acoustic signal 310 matches a pre-determined keyword (for example, “Ok VoiceQ”).
In some embodiments, the keyword detection is performed based on a phoneme Hidden Markov Model (HMM). In other embodiments, the keyword detection is performed using a neural network trained to output the confidence value. In these and other embodiments, the confidence value can be computed using Gaussian Mixture Models, Deep Neural Networks, or any other type of detection scheme (e.g., support vector machines). In some embodiments, the confidence level can be calculated from the confidence values measured at a number of frames fed to the phoneme HMM or neural network. Therefore, the confidence level can be considered a function of a number of consecutive frames of the acoustic signal.
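As a minimal sketch of treating the confidence level as a function of a number of consecutive frames, the code below pools per-frame keyword posteriors over a sliding window. The frame_posterior callable stands in for the phoneme HMM or neural network, whose internals are outside the scope of this sketch; the window length is an assumed value.

```python
from collections import deque
from typing import Callable, Iterable, List

def confidence_levels(
    frames: Iterable,
    frame_posterior: Callable[[object], float],  # stand-in for the HMM/NN
    window: int = 20,                            # assumed number of pooled frames
) -> List[float]:
    """Pool per-frame keyword posteriors into a per-frame confidence level."""
    recent = deque(maxlen=window)  # last `window` per-frame scores
    levels = []
    for frame in frames:
        recent.append(frame_posterior(frame))
        levels.append(sum(recent) / len(recent))  # sliding-window average
    return levels
```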
A plot 400 is an example plot of a confidence value 410 for an example signal.
Tests performed by the inventors have shown that the maximum of the confidence value correlates well with the true end of the keyword. In the tests, the standard deviation of the error between the true end of the keyword and the frame corresponding to the maximum of the confidence value is less than 50 milliseconds, with the mean value at 0.
According to various embodiments of the present disclosure, when a keyword detection occurs (due to the confidence value exceeding the detection threshold), the voice sensing system flags a keyword detection event. In various embodiments, the voice sensing system then continues to monitor the acoustic signal in frames to compute a running maximum of the confidence value for every frame. The frame at which the running maximum occurs (for example, frame 440) can be assigned as the end-of-keyword frame.
In some embodiments, a fixed offset is added to the end-of-keyword frame. In these embodiments, the maximum value of the confidence may give a good estimate of the location of the end of the keyword, but for flexibility purposes an offset can be added when assigning the final end-of-keyword time. For example, some embodiments may mark the end of the keyword 10 ms later to prevent any part of the keyword from remaining in the query, in applications where it is not considered problematic if a very small amount of the query is accidentally removed. Other embodiments may mark the end of the keyword 10 ms earlier where it is important not to miss anything in the query.
The confidence value cannot be monitored indefinitely while waiting for a maximum value to occur. Therefore, in some embodiments, the monitoring is stopped when any of the following conditions is satisfied (a minimal code sketch of this monitoring loop follows the list):
1) The time elapsed since the keyword detection exceeds a pre-determined duration time (DT) 450. In some embodiments, DT is between 100 and 200 milliseconds.
2) The confidence value at the current frame has dropped below the running maximum by a pre-determined threshold (marked as DC 460).
3) The confidence value at the current frame has dropped below the detection threshold 420.
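The following is a minimal sketch of the monitoring loop described above: it tracks the running maximum of the confidence value after detection, stops when any of the three conditions is met, and applies the optional fixed offset discussed earlier. The numeric defaults (frame hop, thresholds) are assumed example values, not parameters from the disclosure.

```python
from typing import Sequence

def find_keyword_end(
    confidences: Sequence[float],      # per-frame values, from detection onward
    frame_ms: float = 10.0,            # assumed frame hop
    detection_threshold: float = 0.5,  # detection threshold 420 (value assumed)
    max_wait_ms: float = 200.0,        # duration time DT 450
    drop_threshold: float = 0.2,       # drop DC 460 (value assumed)
    offset_ms: float = 0.0,            # optional fixed offset (see above)
) -> float:
    """Estimated end-of-keyword time in ms, relative to the detection frame."""
    best_value = float("-inf")
    best_frame = 0
    for i, c in enumerate(confidences):
        if c > best_value:
            best_value, best_frame = c, i            # running maximum
        if (i * frame_ms >= max_wait_ms              # condition 1: DT exceeded
                or best_value - c >= drop_threshold  # condition 2: dropped by DC
                or c < detection_threshold):         # condition 3: below threshold
            break
    return best_frame * frame_ms + offset_ms
```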
In some embodiments, method 500 commences in block 502 with receiving an acoustic signal that includes a keyword portion immediately followed by a query portion. The acoustic signal represents at least one captured sound. In block 504, method 500 can determine the end of the keyword portion. In block 506, method 500 can separate, based on the end of the keyword portion, the query portion from the keyword portion of the acoustic signal. In block 508, method 500 can provide the query portion, absent any part of the keyword portion, to an automatic speech recognition (ASR) system.
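A minimal sketch of method 500, assuming the end of the keyword has already been estimated (e.g., by the monitoring loop sketched above): the signal is split at the end-of-keyword time and only the query portion is handed to the ASR engine. The asr_engine object and its transcribe method are hypothetical placeholders, not a real ASR API.

```python
def voice_search(acoustic_signal, sample_rate, keyword_end_ms, asr_engine):
    """Sketch of method 500: separate the query at the end of the keyword
    and provide only the clean query to the ASR engine.

    `asr_engine.transcribe` is a hypothetical placeholder.
    """
    split_index = int(keyword_end_ms * sample_rate / 1000)
    query_audio = acoustic_signal[split_index:]  # query only, no keyword
    return asr_engine.transcribe(query_audio)
```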
In block 604, method 600 can determine a first point in the time period. The first point can divide the acoustic signal into a first part and a second part. The first point is a point at which a confidence value reaches a first threshold, where the confidence value represents a measure of the degree of match between the first part and the keyword (i.e., how well the first part of the acoustic signal matches the keyword).
In response to the determination of the first point, method 600 can proceed, in block 606, to monitor further confidence values at further points following the first point. In some embodiments, during the monitoring, a running maximum of the confidence value is computed at every frame.
The monitoring can continue until determining that a predefined condition is satisfied. The predefined condition may include one of the following: the further points reach a maximum predefined detection time, a further confidence value drops below the first threshold, or a further confidence value drops below the maximum of the confidence values by a second pre-determined threshold.
In block 608, method 600 proceeds with estimating, based on the confidence values for the further points, the location of the end of the keyword. In some embodiments, the point that corresponds to the maximum of the confidence values is assigned as the location of the end of the keyword in the acoustic signal.
The present technology is described above with reference to example embodiments. Variations upon the example embodiments are intended to be covered by the present disclosure.
This application claims the benefit of and priority to U.S. Provisional Application No. 62/425,155, filed Nov. 22, 2016, the entire contents of which are incorporated herein by reference.