The present invention is in the field of automated speech recognition (ASR) and, more specifically, the use of multiple recognizers.
An objective of some speech recognition systems is to perform recognition as accurately as possible; of others, to respond with the most useful results; of others, to respond robustly in environments with failure-prone connections; and of still others, to distribute the processing workload among computer processors or geographical locations such as server farms.
Those objectives, in many embodiments, conflict with another key objective: responding to users with useful results as quickly as possible, that is, with low latency. Dual mode speech recognition systems and methods use multiple recognizers to convert speech into useful results. Known embodiments of dual mode speech recognition attempt to reconcile these conflicting objectives by sending speech to multiple recognizers that vary in speed and accuracy, and provide low latency by setting a timeout and choosing among the results, if any, received before the timeout occurs.
This approach has a major drawback: in some instances, the user receives no response until the timeout occurs; that is, the user must wait for the longest period the system is designed to wait for any response. Furthermore, in no case will the system respond before receiving a second result, even if the first result is of sufficient quality.
The present disclosure is directed to embodiments of systems, methods, and non-transitory computer readable media that perform dual mode speech recognition. Various embodiments respond to user speech immediately if a result is of sufficient quality, as measured by a recognition score. Various embodiments respond early if the first result is useful, as measured by the recognition score, and vary the latency as a function of the quality of the result. Various embodiments use timeout events whose duration varies with quality: a low quality result suggests waiting longer for a higher quality result. Various embodiments ignore early results if they are below an acceptable level of quality, and respond with a later result or an error if no second result is received before a timeout occurs.
Some embodiments have asymmetrical recognizers, such as one that responds more quickly and one that responds with more accurate or more useful results. For example, some mobile phones perform speech recognition both locally and over a wireless Internet connection. Some earpiece headsets perform speech recognition in the headset, but also in a phone connected over a personal area network.
Some embodiments are Internet-connected automobiles that respond, when possible, from a remote server, which has access to useful dynamic data such as weather and traffic conditions, but respond from a local recognizer when the automobile is in a location that has no wireless network connectivity.
Some embodiments are power-sensitive systems-on-chip that use low power processors for recognition in a typical mode, but wake up a high performance processor if needed to provide better results.
Some embodiments use server-based dual mode recognition and send speech to more than one server with symmetrical recognizers that differ in ping latency or in the availability of local data.
Some embodiments send speech to multiple remote recognizers when accessing different recognizers incurs different costs. In such cases, it may be advantageous to send a spoken utterance to the less expensive recognizer, compare the resulting recognition score to a threshold, and, if the recognition score is below the threshold, send the spoken utterance to a second recognizer.
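For illustration only, a minimal sketch of this cost-aware strategy follows, assuming each recognizer is a callable that returns a (result, score) pair; the interfaces and the threshold value are assumptions, not features of any particular embodiment.

```python
# Hypothetical sketch: consult the less expensive recognizer first, and
# escalate to the more expensive one only when the score is below threshold.

def recognize(audio, cheap_recognizer, expensive_recognizer, threshold=0.85):
    """Each recognizer is assumed to be a callable returning (result, score)."""
    result, score = cheap_recognizer(audio)
    if score >= threshold:
        return result, score              # good enough; avoid the extra cost
    return expensive_recognizer(audio)    # escalate to the costlier recognizer
```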
Some embodiments perform recognition on delimited spoken queries, such as the speech between a detected wake-up phrase and a detected end-of-utterance. Some embodiments perform recognition continuously, typically at periodic intervals called frames, such as every 10 msec. Some embodiments perform speech recognition incrementally.
Various embodiments quantify the quality of results using various appropriate techniques. As part of speech recognition, some embodiments compute hypotheses and probability scores for phonemes, phonetic sequences, word sequences (transcriptions), grammatically correct sentences (parses), and meaningful interpretations. Recognition scores, in various embodiments, are based on a probability score alone or a combination of such probability scores.
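As one hedged illustration, several such probability scores can be combined into a single scalar recognition score by a weighted sum of log probabilities; the levels and weights below are assumptions for illustration only.

```python
import math

# Illustrative weights for combining per-level hypothesis probabilities;
# actual embodiments may weight, select, or combine scores differently.
WEIGHTS = {"phonetic": 0.2, "transcription": 0.4, "parse": 0.2, "interpretation": 0.2}

def recognition_score(probabilities):
    """Combine per-level probabilities in (0, 1] into one scalar score
    (a weighted sum of log probabilities; higher is better)."""
    return sum(WEIGHTS[level] * math.log(p) for level, p in probabilities.items())
```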
Spoken utterances are delimited segments of speech, typically comprising multiple words. In various embodiments, they are initiated by a wake-up phrase or a UI action such as clicking or tapping, and terminated by detection of an end-of-utterance event or a UI action such as tapping or releasing a button.
Recognizers are hardware- or software-implemented subsystems that receive speech and return recognition results with associated scores. The form and nature of results vary widely across embodiments but can include a text transcription, information requested by the speech, or representations of user intents, as data structures in JavaScript Object Notation (JSON) or an equivalent internal or exchange data format.
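For example, a result might be exchanged as a JSON document like the hypothetical one below; the field names and values are illustrative only, not a prescribed schema.

```python
import json

# A hypothetical JSON-encoded result carrying a transcription, a recognition
# score, and a representation of user intent.
result = json.loads("""
{
  "transcription": "what is the weather in hamburg",
  "score": 0.92,
  "interpretation": {"intent": "weather_query", "location": "Hamburg"}
}
""")
```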
Various embodiments respond to users so as to give them a feeling that their speech has effected a desired result. Responses comprise results, but may include other information or actions as appropriate for various embodiments. For example, a spoken user request for a coffee causes a speech-enabled coffee maker to respond with spoken words and to produce a cup of coffee. Results are the basis for embodiments to produce responses. In some embodiments, results are text for a machine to output from a text-to-speech module. In some embodiments, results include text with mark-up meta information and instructions encoded for a machine to process.
Various recognizer embodiments associate recognition scores with results and, depending on the embodiment, return scores within results or prior to results. For example, a recognition score may be included within a result or presented separately from it. Recognizers produce scores in various appropriate ways known to practitioners of the art.
Local recognizers are ones present within devices with which users interact directly. Remote recognizers are ones that couple with user devices through means such as networks, cables, or wireless signaling.
The term timeout can refer to a period (i.e., duration) of time, a point in time, an event, or a stored value, as will be apparent to readers skilled in the art. Various embodiments start a timeout timer counting as soon as they send speech to a recognizer, start a timeout timer upon receiving a first score, or start a timeout timer at the time of any other event, as appropriate.
A function of a recognition score may be used to determine a timeout duration that is appropriate for a given recognition score. Some embodiments use discrete and some use continuous functions. For many embodiments, a non-increasing function is appropriate.
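A minimal sketch of one such function follows, assuming scores on a 0-to-1 scale; the linear shape and the duration bounds are illustrative assumptions.

```python
# Illustrative non-increasing mapping from a recognition score in [0, 1] to a
# timeout duration: the lower the quality, the longer the wait for a better result.
MIN_TIMEOUT_MS = 100    # assumed floor, for high-quality results
MAX_TIMEOUT_MS = 2000   # assumed ceiling, for low-quality results

def timeout_duration_ms(score):
    """Linear and non-increasing: score 0.0 -> MAX_TIMEOUT_MS, 1.0 -> MIN_TIMEOUT_MS."""
    score = min(max(score, 0.0), 1.0)  # clamp to the assumed score range
    return MAX_TIMEOUT_MS - score * (MAX_TIMEOUT_MS - MIN_TIMEOUT_MS)
```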
In the embodiment of FIG. 4, a chooser receives recognition results and associated scores from two recognizers and chooses a result as the basis for creating a response.
Upon receiving a first score, the chooser compares the first received score to a high threshold in Step 41. If the score is above the high threshold, the embodiment chooses in Step 42 a first result associated with the first score as the basis for creating a response, without waiting for another result. If the first score is not above the high threshold, the embodiment sets a timeout duration as a function of the first score in Step 44. Upon receiving a second result before the timeout occurs in Step 45, the embodiment presumes that the second result is superior to the first result and chooses it in Step 47. If the timeout occurs in Step 45 before receiving a second result, the embodiment chooses the first result in Step 46 as the basis for creating a response.
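This flow can be sketched as follows; the wait_for_second primitive, the threshold, and the timeout function are hypothetical stand-ins, not elements of the figure.

```python
# Sketch of the FIG. 4 chooser: accept a high-scoring first result immediately;
# otherwise wait for a second result no longer than a score-dependent timeout.
def choose(first_result, first_score, wait_for_second, high_threshold=0.9):
    """wait_for_second(timeout_ms) is assumed to return the second result,
    or None if the timeout occurs first."""
    if first_score > high_threshold:
        return first_result                            # Step 42: respond immediately
    timeout_ms = max(100, 2000 * (1.0 - first_score))  # Step 44: non-increasing function
    second_result = wait_for_second(timeout_ms)        # Step 45: await a second result
    if second_result is not None:
        return second_result                           # Step 47: presume second is better
    return first_result                                # Step 46: fall back to the first
```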
In the embodiment of FIG. 6, upon receiving a score associated with a first result, the chooser compares the score to a high threshold in Step 61. If the score is above the high threshold, the chooser chooses the first result as the basis for creating a response in Step 62, without waiting for a response from the second recognizer or, in some embodiments, even without requesting such a response.
If the score is not above the high threshold in Step 61, the chooser compares the score to a low threshold in Step 63. If the score is not below the low threshold in Step 63, the chooser sets a timeout duration as a function of the score in Step 64. When the score is between the low and high thresholds, the corresponding result may serve as the basis for the response, depending on whether, and with what score, results arrive from the second recognizer before the timeout. Upon receiving a second result before the timeout occurs in Step 65, the chooser assumes that the second result is more accurate than the first and chooses it in Step 67. If the timeout occurs in Step 65 before receiving a second result, the chooser chooses the first result in Step 66 as the basis for creating a response.
If the score is below the low threshold, the embodiment ignores the associated result in Step 68. The chooser proceeds to set a pre-configured timeout in Step 69, without basing the timeout duration on a function of the score. Upon receiving a second result before the timeout occurs in Step 610, the chooser chooses the second result in Step 67 regardless of its associated score. In another embodiment, if the score associated with the second result is below a low threshold for second scores, the chooser may produce no useful response and signal an error. If the timeout occurs in Step 610 before receiving a second result, the embodiment produces no useful response and signals an error in Step 611.
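The two-threshold flow can be sketched as follows; the wait_for_second primitive and the numeric thresholds are assumptions for illustration.

```python
# Sketch of the FIG. 6 chooser with both a high and a low threshold.
def choose_with_two_thresholds(first_result, first_score, wait_for_second,
                               low=0.3, high=0.9, default_timeout_ms=2000):
    """wait_for_second(timeout_ms) returns a second result, or None on timeout."""
    if first_score > high:
        return first_result                                # Step 62: respond immediately
    if first_score >= low:
        timeout_ms = max(100, 2000 * (1.0 - first_score))  # Step 64: score-based timeout
        second = wait_for_second(timeout_ms)               # Step 65: await second result
        return second if second is not None else first_result  # Steps 67 and 66
    # Score below the low threshold: ignore the first result (Step 68) and set
    # a pre-configured timeout (Step 69) instead of a score-based one.
    second = wait_for_second(default_timeout_ms)           # Step 610
    if second is not None:
        return second                                      # Step 67
    raise TimeoutError("no useful response")               # Step 611: signal an error
```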
Some embodiments use multiple similar recognizers. Some embodiments use different kinds of recognizers. Some embodiments with different kinds of recognizers perform a step of normalizing scores from different recognizers before comparing the scores to thresholds or to scores from other recognizers. Scores are most often scalar. Various embodiments represent scores on linear, logarithmic, or other scales. Various embodiments base scores on hypothesis probability calculations of phonemes, phonetic sequences, n-grams, word sequences (such as transcriptions), grammatically correct sentences (such as parses), and recognized interpretations of utterances according to domains of knowledge. Some embodiments combine two or more ways of computing scores into a single scalar score. Some embodiments use multi-dimensional scores based on retaining two or more ways of computing scores.
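As one hedged example, scores from dissimilar recognizers can be normalized onto a common 0-to-1 scale by min-max calibration before comparison; the recognizer names and calibration bounds below are assumptions.

```python
# Illustrative min-max normalization of scores from dissimilar recognizers;
# the per-recognizer calibration bounds are assumed, not prescribed.
CALIBRATION = {
    "local":  {"min": -50.0, "max": 0.0},    # e.g., a log-probability scale
    "remote": {"min": 0.0,   "max": 100.0},  # e.g., a vendor-specific scale
}

def normalize(recognizer, raw_score):
    """Map a raw score onto a common [0, 1] scale for comparison."""
    bounds = CALIBRATION[recognizer]
    fraction = (raw_score - bounds["min"]) / (bounds["max"] - bounds["min"])
    return min(max(fraction, 0.0), 1.0)
```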
Various curves 701, which map a first recognition score to a timeout duration, take shapes such as linear, parabolic, s-shaped, and staircase. All curves 701 are non-increasing; most are decreasing or step-wise decreasing.
Some embodiments compute a recognition score, not directly from a hypothesis strength within a recognizer, but as a probability of a second recognition score being above a threshold of desired improvement over the first recognition score. In some embodiments, the improvement threshold changes over time.
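One hedged way to estimate such a probability is empirically, from a history of scores previously returned by the second recognizer; the history and the improvement margin below are assumed inputs.

```python
# Illustrative empirical estimate of the probability that a second score will
# exceed the first score by a desired improvement margin.
def improvement_probability(first_score, past_second_scores, improvement=0.1):
    """Fraction of historical second-recognizer scores exceeding
    first_score + improvement; returns 0.0 with no history."""
    if not past_second_scores:
        return 0.0
    better = sum(1 for s in past_second_scores if s > first_score + improvement)
    return better / len(past_second_scores)
```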
As will be apparent to practitioners of the art, the descriptions herein extend to systems of more than two recognizers. Any plural number of recognizers can be considered in making decisions such as where to send speech; whether to choose a result immediately, discard it, or wait for another; how to compare scores; whether to start a timeout timer; and what function to use for one or more timeout timers.
Some embodiments operate on continuous speech. Such embodiments, on an effectively continuous basis, re-compute or adjust recognition scores and start new timeout timers or disable running ones. Some such embodiments have multiple timers that run simultaneously. In various embodiments, continuous operation effectively means repeating operations on a timescale that is imperceptible to users, such as less than a few hundred milliseconds.
Some such embodiments are systems that display a continuously updated transcription as a user speaks. It is desirable to update the transcription within a certain maximum latency, and as soon as possible if the accuracy is sufficient. If a recognition score from a faster, but less accurate, recognizer exceeds a threshold, then the system updates the transcription with that recognizer's result. If the score does not exceed the threshold then the system waits for a response from a more accurate, but slower, recognizer. Some such embodiments repeatedly send speech to both recognizers and start timers every 10 milliseconds, expecting new results with a latency of 30 to 500 milliseconds. Accordingly, the system will have multiple timers running simultaneously and can switch between the results of one recognizer and the other on any frame boundary.
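A sketch of this per-frame operation follows; the recognizer callables, the threshold, and the use of one worker thread per frame are assumptions for illustration.

```python
import threading

FRAME_MS = 10     # assumed frame interval
THRESHOLD = 0.8   # assumed acceptable score for the fast recognizer

def on_frame(frame_audio, fast_recognizer, slow_recognizer, update_display):
    """Called every FRAME_MS milliseconds with the latest speech frame.
    Each recognizer is assumed to return a (text, score) pair."""
    def run():
        text, score = fast_recognizer(frame_audio)
        if score > THRESHOLD:
            update_display(text)    # fast result is accurate enough
        else:
            # Wait for the slower, more accurate recognizer instead.
            update_display(slow_recognizer(frame_audio)[0])
    # One worker per frame; many run simultaneously, so the display can switch
    # between recognizers' results on any frame boundary.
    threading.Thread(target=run, daemon=True).start()
```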
Some embodiments tend to favor local recognition results for real-time transcriptions, but choose more accurate, remotely-processed results of delimited spoken utterances as the basis for responses. Some embodiments that process delimited spoken utterances only respond to complete commands; automobiles, vending machines, humanoid robots, and some personal assistants may depend on such embodiments.
Dual mode speech recognition, as described herein, is embodied in methods, in machines, and in computer-readable media that store code that, if executed by one or more computer processors, would cause the computer processors to perform speech recognition accordingly.
Some embodiments are implemented in modular ways, and various such embodiments use combinations of hardware logic modules and software function modules. Various modular embodiments divide the necessary functions among modules differently. For example, some embodiments have a module for receiving speech from a user, a module for sending speech to a first recognizer, a module for sending speech to a second recognizer, a module for receiving a recognition score, a module for detecting a timeout, and a module for updating a speech recognition vocabulary, as sketched below.
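For illustration only, such a modular decomposition might be expressed as follows; the module names mirror the functions listed above and are not prescriptive.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical composition of the modules named above; each field holds a
# hardware- or software-implemented module exposed as a callable.
@dataclass
class DualModeRecognizer:
    receive_speech: Callable     # receives speech from a user
    send_to_first: Callable      # sends speech to a first recognizer
    send_to_second: Callable     # sends speech to a second recognizer
    receive_score: Callable      # receives a recognition score
    detect_timeout: Callable     # detects a timeout
    update_vocabulary: Callable  # updates a speech recognition vocabulary
```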