Devices that analyze audio are used for various applications. For example, digital assistants can help users with a variety of daily tasks, such as providing information, setting reminders, maintaining task lists, and the like. The digital assistants can operate by continuously listening for audio streams. The audio streams can be processed to wake the digital assistant and then provide the information the user is looking for.
Examples described herein provide devices with event based keyword detectors. As discussed above, devices with voice activated digital assistants are becoming more ubiquitous. Voice activated digital assistants can be activated by speaking a keyword to “wake” the voice activated digital assistant. However, the voice activated digital assistants are typically deployed on devices that use a continuous power source. For example, the device may be plugged into a power outlet to power a digital signal processor (DSP) that continuously processes audio until a keyword is detected by the DSP.
Due to the high power consumption of continuously operating the DSP, voice activated digital assistants have not moved to mobile devices. The continuous operation of the DSP may consume too much power to be practical on battery operated devices (e.g., mobile devices).
Rather, on mobile devices, a user may select a button to activate the digital assistant. After the button is pressed, the user may interact with the voice activated digital assistant.
Examples herein provide a device that includes a separate component to initially analyze the audio stream using less power. Once the keyword is detected, the DSP (or any other type of deep learning/machine learning system) can be activated to analyze subsequent streams of audio. As a result, the voice activated digital assistants can be deployed on mobile devices that operate on battery power. The DSP can be selectively activated once the keyword is detected by a low power consumption component that can detect keywords based on events generated from an audio stream. Thus, the device can continuously monitor audio signals to detect the keyword without greatly affecting the battery life on the mobile devices.
In one example, the device may also provide improved privacy. For example, an event based keyword detector in the device may be tuned such that the events generated from the audio signal cannot be reconstructed into the original audio signal.
In an example, the device 100 may include a processor 102, the event based keyword detector 104, a digital signal processor (DSP) 106, a memory 108, a microphone 110, and a speaker 112. The processor 102 may be communicatively coupled to, and control operation of, the event based keyword detector 104, the DSP 106, the memory 108, the microphone 110, and the speaker 112. The DSP 106 may be any type of very long instruction word (VLIW) processor, single instruction multiple data (SIMD) processor, application specific integrated circuit (ASIC), deep learning system, machine learning system, and the like.
In an example, the memory 108 may be a non-transitory computer readable medium that includes instructions executed by the processor 102. For example, the instructions to perform the functions for a voice activated digital assistant may be stored in the memory 108. The memory 108 may also store data. For example, an audio signal 114 received by the microphone 110 may be temporarily stored in the memory 108 for sampling and tuning the event based keyword detector 104, as described in further detail below. The memory 108 may be a hard disk drive, random access memory, read only memory, and the like.
In an example, the event based keyword detector 104 may be deployed as dedicated hardware components within the device 100. The components may be programmed to perform a particular function. The components, in combination, may represent the event based keyword detector 104 to perform keyword detection on the audio signal 114 based on events generated from the audio signal 114.
The amount of data generated by the events may be much smaller than the amount of data contained in the audio signal 114. For example, the events may be a portion of the audio signal 114. As a result, analyzing the lower amount of data associated with the events may consume less power than traditional analysis of the entire audio signal 114 by the DSP 106.
The event based keyword detector 104 may process the audio signal 114 to search for a keyword based on the events. The keyword may be a command or word that is spoken by a user to begin interaction with the voice activated digital assistant. In response to detecting the keyword, the event based keyword detector 104 may send an enable signal to the DSP 106. The DSP 106 may then perform full analysis on subsequent audio signals 114 that are received. For example, the subsequent audio signals 114 may bypass the event based keyword detector 104 until the DSP 106 is deactivated again. The user may provide the audio signals 114 via the microphone 110 and the voice activated digital assistant may provide information via the speaker 112.
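As a rough sketch of this routing (the class and method names here are hypothetical; the description above does not specify a software interface), audio can be fed through the low-power detector path until the keyword enables the DSP path:

```python
class Device:
    """Toy model of the audio routing in device 100 (names are illustrative)."""

    def __init__(self, detector, dsp):
        self.detector = detector      # stands in for the event based keyword detector 104
        self.dsp = dsp                # stands in for the DSP 106
        self.dsp_active = False

    def on_audio(self, chunk):
        if self.dsp_active:
            # Subsequent audio bypasses the detector and goes straight to the DSP.
            return self.dsp(chunk)
        if self.detector(chunk):
            # Keyword found: the enable signal wakes the DSP for later audio.
            self.dsp_active = True
        return None

# Usage with stand-in callables for the detector and the DSP:
device = Device(lambda c: c == "keyword", lambda c: "processed " + c)
device.on_audio("background noise")       # handled by the low-power path only
device.on_audio("keyword")                # wakes the DSP
result = device.on_audio("what time is it")
```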
As a result, the large amount of power consumed by the DSP 106 to perform full analysis of the entire audio signal 114 continuously may be avoided. Rather, a lower amount of power may be consumed by the event based keyword detector 104 to analyze events generated from the audio signal 114 until a keyword is detected. In response to detecting the keyword, the DSP 106 may be activated.
In an example, the raster plot generator 204 may record the events generated by the event generator 202 as a raster plot. The raster plot may be a way that the events are recorded and stored in memory for analysis by the keyword detector 206. The raster plot may be stored as data in memory 108, accessed by the keyword detector 206, and analyzed to detect a pattern of events.
In an example, the raster plot may be represented as a two dimensional graph in a Cartesian coordinate system. An x-axis of the raster plot may represent time (e.g., in seconds). The y-axis may represent a particular event generator 202. As discussed in further detail below, a plurality of the event generators 202 may be deployed.
The raster plot may provide a visual representation of the events that are generated by the event generator 202. Different words or audio signals may be represented by different patterns of events that can be detected within the raster plot.
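A minimal way to record such a raster in memory (the dict-of-lists layout is an assumption; the description only fixes the two axes) is one row of event times per event generator:

```python
# Raster plot data: key = event generator index (y-axis), value = event times (x-axis).
raster: dict[int, list[float]] = {}

def record_event(generator_id: int, t: float) -> None:
    """Append one event for the given generator at time t (seconds)."""
    raster.setdefault(generator_id, []).append(t)

record_event(0, 0.8)
record_event(0, 1.3)
record_event(1, 1.1)
# Different words would leave different patterns of events across these rows.
```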
In an example, the keyword detector 206 may be any type of component (e.g., hardware, software, or combination of hardware and software) that can be trained to detect a keyword by recognizing a pattern of events that is associated with the keyword. In an example, the keyword detector 206 may be a neural network. Examples of neural networks include a recurrent neural network (RNN) that uses a long short-term memory (LSTM) element, a convolutional neural network (CNN), a spiking neural network (SNN), and the like.
The keyword detector 206 may be trained to detect the keyword in the raster plot generated by the raster plot generator 204. In other words, the keyword may be detected by the keyword detector 206 when a pattern in the events generated by the event generator 202 matches a pattern of events associated with a known keyword. The keyword detector 206 may be trained to detect the pattern of events associated with the known keyword.
In an example, the audio signal 114 may be provided to the integrator 302 as a series of input values x(t). The audio signals x(t) may be integrated over time to generate integrated audio signal values h(t).
The comparator 304 may be set with a positive and a negative threshold value. Said another way, the threshold value may be represented as a positive value and a negative value in the comparator 304. In an example, the threshold values may be set to +0.5 and −0.5. However, it should be noted that the threshold values can be non-symmetrical (e.g., +0.8 and −0.3).
Each one of the integrated audio signal values h(t) may be compared against the positive and negative threshold values in the comparator 304. When an integrated audio signal value h(t) exceeds the positive threshold value or falls below the negative threshold value, the event generator may output a signal or value indicating that an event has been generated. When an event is generated, the reset, or refractory, timer 306 may pause the integrator 302 for a predefined amount of time associated with the reset timer 306 (e.g., 10 seconds, 30 seconds, and so forth). After the reset timer 306 expires, the integrator 302 may continue to integrate the audio signal x(t) to generate integrated audio signal values h(t). The comparator 304 may again compare the integrated audio signal values h(t) to the threshold values until a threshold value is exceeded to generate another event. The process may be continuously repeated for the audio signal 114.
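The integrate-compare-reset loop above can be sketched as follows. This is a minimal model under stated assumptions: the refractory period is expressed in samples rather than seconds, and the integrator is assumed to restart from zero after each event, neither of which is fixed by the description.

```python
class EventGenerator:
    """Sketch of integrator 302, comparator 304, and reset timer 306."""

    def __init__(self, pos_threshold=0.5, neg_threshold=-0.5, refractory_samples=5):
        self.pos_threshold = pos_threshold
        self.neg_threshold = neg_threshold
        self.refractory_samples = refractory_samples
        self.h = 0.0          # integrated value h(t)
        self._pause = 0       # remaining refractory samples

    def step(self, x: float) -> bool:
        """Consume one sample x(t); return True when an event is generated."""
        if self._pause > 0:                    # reset timer 306 still running
            self._pause -= 1
            return False
        self.h += x                            # integrator 302 accumulates x(t)
        if self.h > self.pos_threshold or self.h < self.neg_threshold:
            self.h = 0.0                       # assumed: integrator restarts after an event
            self._pause = self.refractory_samples
            return True
        return False

gen = EventGenerator()
samples = [0.2, 0.2, 0.2, -0.1, 0.3, 0.3, 0.2, 0.2, 0.2]
events = [t for t, x in enumerate(samples) if gen.step(x)]   # sample indices of events
```

The boolean output at each sample is the "event"; only these sparse outputs, not the raw samples, are passed downstream.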
For example, the audio signal 114 may be represented as values x(t) as shown by the graph 402. The values x(t) may vary over time. The audio signal 114 may be integrated to calculate the integrated audio signal values h(t) illustrated in the graph 404.
Using the example threshold values of +0.5 and −0.5 described above, the integrated audio signal values h(t) may be compared to the thresholds. At time 408₁, the value of h(t) may exceed +0.5. As a result, an event may be detected as indicated by a corresponding line in the graph 406. As noted above, the reset timer 306 may pause the integrator 302 for a predefined period of time.
The integrator 302 may begin integrating the audio signal x(t) again after the reset timer 306 expires. The value of h(t) begins to rise again until the threshold value of +0.5 is exceeded again at time 408₂. As a result, a corresponding line is generated at time 408₂ in the graph 406 and a second event is generated.
The above process can be repeated for the audio signal x(t) to detect the events at times 408₁ to 408ₙ, as illustrated in the graph 406. The pattern of the events (e.g., the spacing between the events at times 408₁ to 408ₙ, the number of events generated, and the like) can be recorded in a raster plot. For example, the pattern of events generated at times 408₁ to 408ₙ in the graph 406 may be associated with a keyword and recorded in a raster plot. When the pattern is detected by the keyword detector 206, the neural network may determine that the keyword is detected.
In addition, the events may greatly compress the amount of data contained in the audio signal 114. For example, 10 events may be detected from an audio signal 114 that has 100 points of data. Thus, the amount of data to be analyzed may be compressed or reduced by 10 times in one example. As noted above, this may reduce the amount of processing used to analyze the smaller amount of data and consume less power than processing the raw audio signal or other audio processing techniques.
Based on the descriptions above, an “event” may be defined as an output of the event generator 202. Said another way, an “event” may be defined as a time when an integrated value of the audio signal exceeds a threshold value.
In an example, the number of events that is generated may be controlled or tuned based on the value of the threshold set in the comparator 304. For example, the larger the threshold, the fewer events that may be detected. Conversely, the smaller the threshold, the more events that may be detected.
The value of the threshold in the comparator 304 may be set to optimize the event generator 202. For example, as the threshold value is reduced, the number of events approaches the same number of data points as originally found in the audio signal x(t). As a result, the savings in power and processing resources may be reduced as the amount of data increases. However, as the threshold value is increased, the number of events is reduced to a point where the confidence and accuracy of the keyword detection performed by the keyword detector 206 may be reduced. In addition, as the threshold value is increased and the number of events is reduced, privacy is increased as the audio signal 114 cannot be reconstructed from the events that are generated. Thus, the threshold value in the comparator 304 may be set to generate a number of events that minimizes the processing resources to analyze the data (e.g., reduces power consumption), maximizes the accuracy and confidence of the keyword detector 206 to detect the keyword, and provides privacy such that the audio signal cannot be reconstructed from the events that are generated.
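This trade-off can be seen by sweeping the threshold over the same signal and counting events: a larger threshold yields fewer events (more compression and more privacy), while a smaller one approaches the raw sample count. The signal, the symmetric threshold, and the refractory length below are illustrative assumptions, not values from the description.

```python
def count_events(signal, threshold, refractory=3):
    """Count integrate-and-fire events, using a symmetric +/- threshold."""
    h, pause, n = 0.0, 0, 0
    for x in signal:
        if pause:
            pause -= 1
            continue
        h += x
        if abs(h) > threshold:        # crossed +threshold or -threshold
            h, pause, n = 0.0, refractory, n + 1
    return n

signal = [0.1] * 50                   # toy input: constant small samples
counts = {thr: count_events(signal, thr) for thr in (0.2, 0.5, 1.0)}
# counts shrinks as the threshold grows: fewer events, less data to analyze.
```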
In one example, the event generators 202₁ to 202ₘ may each be set with a different threshold value. For example, the comparator 304 of the event generator 202₁ may be set to a first threshold value. The comparator 304 of the event generator 202₂ may be set to a second threshold value. The comparator 304 of the event generator 202₃ may be set to a third threshold value, and so forth.
As a result, each one of the event generators 202₁ to 202ₘ may generate a different number of events. As discussed above, the threshold value of the comparator 304 may determine how many events are generated by a respective event generator 202₁ to 202ₘ. The raster plot generator 504 may generate a raster plot that includes each event generator 202₁ to 202ₘ along the y-axis and time along the x-axis. The events generated by each event generator 202₁ to 202ₘ may be recorded in the raster plot. The raster plot may then be provided to the keyword detector 506.
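A bank of such generators with different thresholds feeding one raster might be sketched as follows (the thresholds, the signal, and the refractory length are illustrative values, not taken from the description):

```python
def generate_events(signal, threshold, refractory=3):
    """Integrate-and-fire event times for one generator (symmetric +/- threshold)."""
    h, pause, times = 0.0, 0, []
    for t, x in enumerate(signal):
        if pause:
            pause -= 1
            continue
        h += x
        if abs(h) > threshold:
            h, pause = 0.0, refractory
            times.append(t)
    return times

thresholds = [0.3, 0.5, 0.8]          # one threshold per event generator
signal = [0.1, 0.2, -0.1, 0.4, 0.3, 0.2, 0.1, 0.5, -0.2, 0.3]
# Raster: row i holds the event times of the i-th generator.
raster = {i: generate_events(signal, thr) for i, thr in enumerate(thresholds)}
```

Each row ends up with a different event count and spacing, which together form the pattern the detector is trained on.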
In an example, the keyword detector 506 may be trained to detect a pattern of events from the event generators 202₁ to 202ₘ recorded in the raster plot that represents a keyword. For example, the keyword detector 506 may be trained to detect a particular pattern of events from the event generators 202₁ to 202ₘ that is associated with a known keyword. When the pattern of events in the raster plot matches the pattern of events associated with the known keyword, the keyword detector 506 may determine that the keyword is detected. The keyword detector 506 may be a neural network, such as an RNN that uses an LSTM element, a CNN, an SNN, and the like.
In an example, the event generators 202₁ to 202ₘ may each be selectively enabled or disabled. For example, a different combination of event generators 202₁ to 202ₘ and associated number of events generated by the event generators 202₁ to 202ₘ may provide the most accurate and confident keyword detection. The event generators 202₁ to 202ₘ may be tuned via a feedback loop before being implemented in the device 100.
In an example, the event generators 202₁ to 202ₘ may operate in a cascading fashion to trigger the DSP 106. For example, the first event generator 202₁ may generate events from the audio signal 114. The raster plot generator 504 may generate the raster plot of the events and the keyword detector 506 may detect a desired keyword based on detecting a pattern of events associated with the desired keyword in the raster plot.
If the desired keyword is detected from the events generated by the first event generator 202₁, the second event generator 202₂ may be executed. The second event generator 202₂ may generate events from the same audio signal 114 processed by the first event generator 202₁, but with tighter threshold values (e.g., set in the respective comparator 304). The raster plot generator 504 may generate a raster plot of the events and the keyword detector 506 may detect the desired keyword based on detecting a pattern of events associated with the desired keyword in the raster plot.
If the desired keyword is detected, the process may be repeated with the third event generator 202₃ and continue up to the last event generator 202ₘ. The threshold values for each successive event generator 202₃ to 202ₘ may continue to get tighter (e.g., the range becomes gradually narrower). If the desired keyword is detected based on the events generated by the last event generator 202ₘ, the keyword detector 506 may generate a signal that causes the DSP 106 to trigger and begin full analysis of subsequent audio signals.
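The cascade might be sketched as below, with later stages using narrower thresholds and the DSP enabled only when every stage agrees. The toy detector standing in for the trained keyword detector 506, and all numeric values, are assumptions for illustration.

```python
def generate_events(signal, threshold, refractory=3):
    """Integrate-and-fire event times (same scheme as the event generators 202)."""
    h, pause, times = 0.0, 0, []
    for t, x in enumerate(signal):
        if pause:
            pause -= 1
            continue
        h += x
        if abs(h) > threshold:
            h, pause = 0.0, refractory
            times.append(t)
    return times

def cascade_detect(signal, stage_thresholds, detect_keyword):
    """Each stage re-analyzes the same signal with a tighter threshold; the
    DSP is triggered only if every stage reports the keyword."""
    for thr in stage_thresholds:
        if not detect_keyword(generate_events(signal, thr)):
            return False          # early exit: cheap stages gate the expensive DSP
    return True

# Toy stand-in for the keyword detector 506 (assumption):
# "keyword present" means at least two events were generated.
toy_detector = lambda ev: len(ev) >= 2
fire_dsp = cascade_detect([0.2] * 20, [0.8, 0.5, 0.3], toy_detector)
```

The early exit is the point of the cascade: most non-keyword audio is rejected by the first, coarsest stage, so the later stages and the DSP rarely run.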
In an example, the feedback loop may include the memory 108, the DSP 106, a confidence calculator 608, and an event rate generation adjuster 610. The feedback loop may be used to tune the event generators 202₁ to 202ₘ such that the keyword detector 606 may consistently detect the keyword with accuracy and confidence. The event generators 202₁ to 202ₘ may be tuned by adjusting the threshold values in the respective comparators 304 of the event generators 202₁ to 202ₘ and/or selectively enabling or disabling the event generators 202₁ to 202ₘ.
The event based keyword detector 104 may operate similarly to the event based keyword detector 104 described above.
In one example, the high confidence signal may be an enable signal to the DSP 106 to begin operating and performing full analysis on subsequent audio streams. For example, the subsequent audio signals may be provided directly to the DSP 106 rather than being fed through the event based keyword detector 104 when the DSP 106 is activated.
In one example, when the event generators 202₁ to 202ₘ are being tuned, the DSP 106 may verify whether or not the keyword detector 606 was accurate in detecting the keyword. For example, it is possible that the neural network may provide high confidence in an inaccurate conclusion.
The audio signal 114 may be temporarily stored in the memory 108 to be sampled by the DSP 106. The DSP 106 may analyze the audio signal 114 to determine if the keyword is detected. If the keyword is detected in the audio signal 114 by the DSP 106, then the DSP 106 may provide feedback that the keyword was accurately detected by the keyword detector 606. If the keyword is not detected in the audio signal 114 by the DSP 106, then the DSP 106 may provide feedback that the keyword was not accurately detected. The DSP 106 may provide an accuracy feedback 616 to the event rate generation adjuster 610.
In an example, if the confidence calculator 608 determines that the confidence value generated by the keyword detector 606 is below the confidence threshold value, then a low confidence signal may be transmitted to the event rate generation adjuster 610. In response to a signal from the DSP 106 that the keyword detector 606 was inaccurate, in response to a low confidence signal from the confidence calculator 608, or in response to both, the event rate generation adjuster 610 may tune the event generators 202₁ to 202ₘ.
In an example, the event rate generation adjuster 610 may generate an enable/disable event generator signal 612 and/or generate a threshold adjust signal 614. The enable/disable event generator signal 612 may send either an enable signal or a disable signal to any one of the event generators 202₁ to 202ₘ. The threshold adjust signal 614 may set the threshold value in the respective comparators 304 of the event generators 202₁ to 202ₘ to a desired value. In one example, the threshold adjust signal 614 may cause the threshold value to be incrementally increased or decreased.
After the event rate generation adjuster 610 tunes the event generators 202₁ to 202ₘ, the process may be repeated with another audio signal 114. The process may be repeated until the keyword detector 606 generates a confidence score that exceeds the confidence threshold and the DSP 106 indicates that the keyword detector 606 has accurately detected the keyword.
In an example, the event rate generation adjuster 610 may perform step changes to perform the tuning. For example, the event generators 202₁ to 202ₘ may all be initially enabled and have the respective comparators 304 set to a particular threshold value. For example, the threshold value may be different for the respective comparator 304 of each event generator 202₁ to 202ₘ. The event rate generation adjuster 610 may disable one of the event generators 202₁ to 202ₘ at a time for each iteration of the feedback and tuning loop that is performed. The event rate generation adjuster 610 may then incrementally change the threshold value in the respective comparators 304 of the event generators 202₁ to 202ₘ one at a time. For example, the event rate generation adjuster 610 may increase or decrease the threshold value of the event generator 202₁ by 0.05, then increase or decrease the threshold value of the event generator 202₂ by 0.05, and so forth.
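One iteration of such step-change tuning might look like the sketch below. The configuration layout, the step size handling, and the exact sequencing of disabling versus threshold nudging are assumptions; the description only fixes that one generator is disabled at a time and that thresholds change in increments such as 0.05.

```python
STEP = 0.05   # illustrative increment, matching the 0.05 example above

def tune_step(config, iteration):
    """One tuning iteration: disable a single generator, then nudge one threshold.

    config: list of dicts like {"enabled": True, "threshold": 0.5}, one per
    event generator. Returns a new candidate configuration; the caller keeps
    it only if confidence and accuracy feedback improve.
    """
    m = len(config)
    candidate = [dict(g) for g in config]          # copy so the old config survives
    candidate[iteration % m]["enabled"] = False    # disable one generator this round
    candidate[(iteration + 1) % m]["threshold"] += STEP   # incrementally adjust another
    return candidate

config = [{"enabled": True, "threshold": 0.5} for _ in range(3)]
candidate = tune_step(config, iteration=0)
# The feedback loop would evaluate `candidate` against the keyword detector's
# confidence score and the DSP's accuracy feedback before adopting it.
```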
In an example, the event rate generation adjuster 610 may perform changes randomly. For example, the event rate generation adjuster 610 may disable the event generator 202₂ and change the threshold value of the comparator 304 of the event generator 202₃. If another tuning step is performed, the event rate generation adjuster 610 may enable the event generator 202₂ and change the threshold value of the comparator 304 in the event generator 202₁ and the threshold value of the comparator 304 in the event generator 202₂, and so forth.
Thus, in an example, the event based keyword detector 104 may be tuned such that the keyword is detected in the audio signal 114 with high confidence and accuracy. Once the event based keyword detector 104 is tuned, the tuning/feedback process may be stopped and the device 100 may be activated to listen for the keyword. In an example, the tuning/feedback process may be periodically performed to ensure that the event based keyword detector 104 continues to detect the keyword in the audio signal 114 with high confidence and accuracy over time.
At block 702, the method 700 begins. At block 704, the method 700 receives an audio signal. For example, the audio signal may be sound or speech received via a microphone of the device.
At block 706, the method 700 generates, by an event based keyword detector, a plurality of events from the audio signal. In an example, the event based keyword detector may include an event generator to generate events from the audio signal. The event generator may be a biphasic integrator that integrates the audio signal over time. The integrated audio signal values may be compared to a positive and a negative threshold value. When the integrated audio signal value exceeds either the positive or negative threshold value, an event may be detected. The integrator may be paused for a predefined period of time after the event is generated before continuing to integrate the audio signal over time. The process may then be repeated. Thus, the amount of data may be reduced by compressing the audio signal into a smaller number of events that represent the audio signal.
In an example, the event generators of the event based keyword detector may be tuned beforehand. During tuning, the audio signal may be stored in memory to be sampled by a digital signal processor (DSP). The detection of the keyword by the event based keyword detector may be compared to the detection of the keyword in the audio signal by the digital signal processor. Based on the difference in detection, the event generators may be tuned. For example, the tuning may selectively enable or disable event generators and/or change a threshold value of a respective comparator of the event generators. The tuning process may be repeated until an amount of accuracy (e.g., accuracy greater than 90% of the time) and an amount of confidence (e.g., a confidence score above 95%) are above a desired threshold before the audio signal is received.
At block 708, the method 700 generates, by the event based keyword detector, a raster plot of the plurality of events. For example, the event based keyword detector may include a raster plot generator. The raster plot generator may record each event generated from each event generator on a Cartesian coordinate system.
At block 710, the method 700 analyzes, by the event based keyword detector, the raster plot to detect a pattern in the plurality of events that is associated with a keyword. For example, the event based keyword detector may include a neural network that can be trained to detect a particular pattern of events in the raster plot. The particular pattern of events may be associated with a known keyword to activate a voice activated digital assistant. The neural network may detect the keyword when a pattern of events in the raster plot matches the particular pattern of events that is associated with the known keyword.
At block 712, the method 700 activates a digital signal processor to analyze subsequent audio streams in response to the keyword being detected. For example, when the keyword is detected by the event based keyword detector, an enable signal may be sent by the event based keyword detector to the digital signal processor. The digital signal processor may activate and analyze the subsequent audio streams. In other words, when the digital signal processor is activated, the subsequent audio streams may bypass the event based keyword detector until interaction with the voice activated digital assistant is completed. In an example, the interaction may be completed when no audio signal is detected for a predefined period of time.
When the interaction ends and the voice activated digital assistant is deactivated, the digital signal processor may also be deactivated. Audio signals may then be passed through the event based keyword detector again until the keyword is detected. At block 714, the method 700 ends.
In an example, the instructions 806 may include instructions to set a threshold value for a plurality of event generators of an event based keyword detector. The instructions 808 may include instructions to receive an audio signal. The instructions 810 may include instructions to detect a keyword from the audio signal by the event based keyword detector based on a pattern of events generated by each one of the plurality of event generators and recorded in a raster plot, wherein an event is generated when an integrated audio signal value exceeds the threshold. The instructions 812 may include instructions to activate a digital signal processor to analyze subsequent audio streams after the keyword is detected by the event based keyword detector.
It will be appreciated that variants of the above-disclosed and other features and functions, or alternatives thereof, may be combined into many other different systems or applications. Various presently unforeseen or unanticipated alternatives, modifications, variations, or improvements therein may be subsequently made by those skilled in the art which are also intended to be encompassed by the following claims.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2019/056638 | 10/17/2019 | WO |