Hotphrase Triggering Based On A Sequence Of Detections

Information

  • Patent Application 20230298588
  • Publication Number
    20230298588
  • Date Filed
    May 25, 2023
  • Date Published
    September 21, 2023
Abstract
A method includes receiving audio data corresponding to an utterance spoken by a user and captured by a user device. The utterance includes a command for a digital assistant to perform an operation. The method also includes determining, using a hotphrase detector configured to detect each trigger word in a set of trigger words associated with a hotphrase, whether any of the trigger words in the set of trigger words are detected in the audio data during a corresponding fixed-duration time window. The method also includes identifying, in the audio data corresponding to the utterance, the hotphrase when each other trigger word in the set of trigger words was also detected in the audio data. The method also includes triggering an automated speech recognizer to perform speech recognition on the audio data when the hotphrase is identified in the audio data corresponding to the utterance.
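As an illustration only, the windowed check described in the abstract can be sketched in Python. This is a minimal sketch under stated assumptions, not the applicant's implementation: the names `Detection` and `hotphrase_identified`, the example words "play" and "music", and the choice to open the window at each detection of the first trigger word are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One trigger-word detection (hypothetical structure)."""
    word: str
    time_s: float  # when the word was detected, in seconds


def hotphrase_identified(detections, trigger_words, window_s):
    """Return True when every other trigger word is detected within
    window_s seconds after a detection of the first trigger word.

    Assumption: the fixed-duration window opens at the first trigger
    word and the order of the remaining words is not checked.
    """
    first, others = trigger_words[0], set(trigger_words[1:])
    for start in (d for d in detections if d.word == first):
        seen = {
            d.word
            for d in detections
            if d.word in others
            and start.time_s < d.time_s <= start.time_s + window_s
        }
        if seen == others:
            return True
    return False


# Other words may be spoken between the trigger words; only the
# trigger-word detections matter here.
detections = [Detection("play", 0.2), Detection("music", 1.1)]
print(hotphrase_identified(detections, ("play", "music"), window_s=2.0))  # True
print(hotphrase_identified(detections, ("play", "music"), window_s=0.5))  # False
```

In this sketch a positive result is the point at which a system would trigger the speech recognizer; the recognizer itself is out of scope.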
Description
Claims
  • 1. A computer-implemented method executed on data processing hardware that causes the data processing hardware to perform operations comprising: receiving audio data corresponding to an utterance spoken by a user, the utterance comprising: a command for a digital assistant to perform an operation; a set of trigger words; and one or more other words that are spoken between a first trigger word in the set of trigger words and a last trigger word in the set of trigger words; determining that the first trigger word in the set of trigger words is detected in the audio data; after determining that the first trigger word in the set of trigger words is detected in the audio data, determining that each other trigger word in the set of trigger words is also detected in the audio data; and based on determining that the first trigger word and each other trigger word in the set of trigger words is detected in the audio data, triggering an automated speech recognizer (ASR) to perform speech recognition on the audio data.
  • 2. The method of claim 1, wherein triggering the ASR to perform speech recognition on the audio data comprises: generating a transcription of the utterance by processing the audio data; and performing query interpretation on the transcription to identify that the transcription includes the command for the digital assistant to perform the operation.
  • 3. The method of claim 2, wherein generating the transcription comprises: rewinding the audio data buffered in memory hardware in communication with the data processing hardware to a time at or before the first trigger word in the set of trigger words was detected in the audio data; and processing the audio data commencing at the time at or before the first trigger word in the sequence of trigger words to generate the transcription of the utterance.
  • 4. The method of claim 2, wherein the transcription comprises, between the first trigger word in the set of trigger words and the last trigger word in the set of trigger words, the one or more other words.
  • 5. The method of claim 1, wherein the operations further comprise: determining that each other trigger word in the set of trigger words is detected in the audio data during a fixed-duration time window commencing when the first trigger word in the set of trigger words was detected in the audio data, wherein triggering the ASR to perform speech recognition processing is based on determining that each other trigger word in the set of trigger words is detected in the audio data during the fixed-duration time window.
  • 6. The method of claim 1, wherein determining that the first trigger word in the set of trigger words is detected in the audio data comprises: generating, using a hotphrase detector, a trigger word confidence score indicating a likelihood that the first trigger word is present in the audio data; detecting the first trigger word in the audio data when the trigger word confidence score satisfies a trigger word confidence threshold; and buffering, in memory hardware in communication with the data processing hardware, the audio data and a trigger event for the first trigger word detected in the audio data, the trigger event indicating the trigger word confidence score and a timestamp indicating when the first trigger word was detected in the audio data.
  • 7. The method of claim 6, wherein the operations further comprise, based on determining that the first trigger word in the set of trigger words is detected in the audio data, executing a trigger word aggregation routine configured to: determine that a respective trigger event for each other corresponding trigger word in the set of trigger words is also buffered in the memory hardware; and when the respective trigger event for each other corresponding trigger word in the set of trigger words is also buffered in the memory hardware, determine a hotphrase confidence score indicating a likelihood that the utterance spoken by the user includes the set of trigger words, wherein triggering the ASR to perform speech recognition on the audio data comprises triggering the ASR to perform speech recognition on the audio data when the hotphrase confidence score satisfies a hotphrase confidence threshold.
  • 8. The method of claim 7, wherein executing the trigger word aggregation routine comprises executing a neural network-based model.
  • 9. The method of claim 7, wherein executing the trigger word aggregation routine comprises executing a heuristic-based model.
  • 10. The method of claim 1, wherein the data processing hardware resides on the user device.
  • 11. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising: receiving audio data corresponding to an utterance spoken by a user, the utterance comprising: a command for a digital assistant to perform an operation; a set of trigger words; and one or more other words that are spoken between a first trigger word in the set of trigger words and a last trigger word in the set of trigger words; determining that the first trigger word in the set of trigger words is detected in the audio data; after determining that the first trigger word in the set of trigger words is detected in the audio data, determining that each other trigger word in the set of trigger words is also detected in the audio data; and based on determining that the first trigger word and each other trigger word in the set of trigger words is detected in the audio data, triggering an automated speech recognizer (ASR) to perform speech recognition on the audio data.
  • 12. The system of claim 11, wherein triggering the ASR to perform speech recognition on the audio data comprises: generating a transcription of the utterance by processing the audio data; and performing query interpretation on the transcription to identify that the transcription includes the command for the digital assistant to perform the operation.
  • 13. The system of claim 12, wherein generating the transcription comprises: rewinding the audio data buffered in memory hardware in communication with the data processing hardware to a time at or before the first trigger word in the set of trigger words was detected in the audio data; and processing the audio data commencing at the time at or before the first trigger word in the sequence of trigger words to generate the transcription of the utterance.
  • 14. The system of claim 12, wherein the transcription comprises, between the first trigger word in the set of trigger words and the last trigger word in the set of trigger words, the one or more other words.
  • 15. The system of claim 11, wherein the operations further comprise: determining that each other trigger word in the set of trigger words is detected in the audio data during a fixed-duration time window commencing when the first trigger word in the set of trigger words was detected in the audio data, wherein triggering the ASR to perform speech recognition processing is based on determining that each other trigger word in the set of trigger words is detected in the audio data during the fixed-duration time window.
  • 16. The system of claim 11, wherein determining that the first trigger word in the set of trigger words is detected in the audio data comprises: generating, using a hotphrase detector, a trigger word confidence score indicating a likelihood that the first trigger word is present in the audio data; detecting the first trigger word in the audio data when the trigger word confidence score satisfies a trigger word confidence threshold; and buffering, in memory hardware in communication with the data processing hardware, the audio data and a trigger event for the first trigger word detected in the audio data, the trigger event indicating the trigger word confidence score and a timestamp indicating when the first trigger word was detected in the audio data.
  • 17. The system of claim 16, wherein the operations further comprise, based on determining that the first trigger word in the set of trigger words is detected in the audio data, executing a trigger word aggregation routine configured to: determine that a respective trigger event for each other corresponding trigger word in the set of trigger words is also buffered in the memory hardware; and when the respective trigger event for each other corresponding trigger word in the set of trigger words is also buffered in the memory hardware, determine a hotphrase confidence score indicating a likelihood that the utterance spoken by the user includes the set of trigger words, wherein triggering the ASR to perform speech recognition on the audio data comprises triggering the ASR to perform speech recognition on the audio data when the hotphrase confidence score satisfies a hotphrase confidence threshold.
  • 18. The system of claim 17, wherein executing the trigger word aggregation routine comprises executing a neural network-based model.
  • 19. The system of claim 17, wherein executing the trigger word aggregation routine comprises executing a heuristic-based model.
  • 20. The system of claim 11, wherein the data processing hardware resides on the user device.
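
The buffering and aggregation steps recited in claims 3, 6, 7 and 9 can be illustrated with a short Python sketch. It is a hedged example, not the claimed implementation: `TriggerEvent`, `TriggerEventBuffer`, and `aggregate` are hypothetical names, "satisfies a threshold" is assumed to mean greater than or equal to, and the heuristic shown (taking the lowest per-word confidence as the hotphrase confidence score) is just one possible heuristic-based model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriggerEvent:
    """Buffered record of one detected trigger word (hypothetical)."""
    word: str
    confidence: float   # trigger word confidence score
    timestamp_s: float  # when the word was detected


class TriggerEventBuffer:
    """Keeps a trigger event only when its score meets the word threshold."""

    def __init__(self, word_threshold):
        self.word_threshold = word_threshold
        self.events = []

    def offer(self, word, confidence, timestamp_s):
        # Assumption: "satisfies" means score >= threshold.
        if confidence < self.word_threshold:
            return False
        self.events.append(TriggerEvent(word, confidence, timestamp_s))
        return True


def aggregate(events, trigger_words, hotphrase_threshold):
    """Heuristic aggregation sketch.

    Returns (trigger_asr, hotphrase_score, rewind_to_s). rewind_to_s is
    the timestamp of the first trigger word, i.e. where buffered audio
    would be rewound to before transcription.
    """
    best = {}
    for event in events:
        if event.word not in trigger_words:
            continue
        current = best.get(event.word)
        if current is None or event.confidence > current.confidence:
            best[event.word] = event
    if set(best) != set(trigger_words):
        return False, 0.0, None  # some trigger word has no buffered event
    # Assumed heuristic: the weakest word bounds the hotphrase score.
    score = min(event.confidence for event in best.values())
    rewind_to_s = best[trigger_words[0]].timestamp_s
    return score >= hotphrase_threshold, score, rewind_to_s


buffer = TriggerEventBuffer(word_threshold=0.5)
buffer.offer("play", 0.9, 0.2)
buffer.offer("music", 0.8, 1.1)
print(aggregate(buffer.events, ("play", "music"), hotphrase_threshold=0.6))
```

A neural network-based model, as in claims 8 and 18, would replace the body of `aggregate` while keeping the same inputs (the buffered trigger events) and the same decision output.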
Continuations (1)
          Number      Date        Country
Parent    17118251    Dec 2020    US
Child     18323725                US