1. Field of the Invention
The present invention relates to the field of speech processing technologies and, more particularly, to using a combination of end-of-path and silence frame detections with inclusive finalization timeouts to detect end of utterance (EOU) events in a speech processing system.
2. Description of the Related Art
When developing applications that employ speech recognition, one of the main goals is always to create a positive user experience. For most application designers, this means developing an application that acts more like a human than a machine. In applications employing speech recognition, this goal equates to having an application that detects speech directed at the application, understands speaker pauses/breaks, reacts to recognized phrases, and provides a response that the request was understood.
One of the recurring problems with modern speech recognition systems is accurately determining the end of speech. Adding to this difficulty, many application designers desire control over the length of time allowed for inter-word pauses before the recognition engine determines that the speaker has stopped speaking. Thus, to satisfy both users and application designers, an intuitive mechanism for detecting end-of-utterances is necessary, which can still be configured in an application specific manner to establish application specific inter-word pauses.
End of utterance (EOU) detection difficulties have been addressed in various ways in the past, each of which has its own significant drawbacks. One technique for resolving EOU problems is to employ a push-to-talk (PTT) technology, which forces the speaker to notify the application of an EOU event. PTT technologies, however, require explicit user feedback regarding EOU events, which many users find cumbersome and/or unnatural.
Another EOU problem mitigation technique involves segmenting an incoming audio stream into a set of data frames, each of which is labeled as a speech frame or a silence frame. Whenever a definable quantity of consecutive silence frames is detected, the speech recognition engine can assume that a speaker has stopped speaking. In relatively quiet environments, using consecutive silence frames to determine EOU events works relatively well. In noisy environments, however, loud ambient noises can easily cause one or more frames to be marked as speech, which can be problematic because each mis-marked frame causes the count of consecutive silence frames (for EOU determination purposes) to be reset. Thus, in noisy environments, use of consecutive silence frames for EOU determinations often results in excessively long delays in deciding an EOU occurrence.
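By way of illustration only, the consecutive silence frame technique and its sensitivity to mis-marked frames can be sketched as follows. The function name, the boolean label encoding (True for silence), and the threshold of five frames are assumptions made for this sketch and do not come from any particular prior art system.

```python
from typing import Iterable, Optional


def consecutive_silence_eou(labels: Iterable[bool], required: int) -> Optional[int]:
    """Return the index of the frame at which an EOU is declared, or None.

    labels: True for a frame labeled silence, False for a frame labeled speech.
    required: hypothetical number of consecutive silence frames needed.
    """
    run = 0
    for index, is_silence in enumerate(labels):
        if is_silence:
            run += 1
            if run >= required:
                return index
        else:
            # A single frame marked as speech resets the consecutive count.
            run = 0
    return None


# Three speech frames followed by uninterrupted silence.
quiet = [False] * 3 + [True] * 10
# Same input, but ambient noise causes one frame to be marked as speech.
noisy = [False] * 3 + [True] * 4 + [False] + [True] * 10

print(consecutive_silence_eou(quiet, required=5))  # 7
print(consecutive_silence_eou(noisy, required=5))  # 12
```

In this sketch, the single mis-marked frame delays the EOU determination by five frames, which illustrates why the technique degrades in noisy environments.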
An enhancement of the silence frame based technique, referenced as a dual factor technique, permits an EOU determination to be made from a combination of end-of-path determinations and a quantity of consecutive silence frames. The dual factor technique tends to perform better in a variety of environments (silent as well as somewhat noisy environments) than techniques based on silence frames or end-of-path determinations alone. The problem with existing dual factor techniques is that under certain conditions, they wait a relatively long time before making a determination.
The present invention represents an enhancement of a dual factor technique for end of utterance (EOU) determinations. The invention speeds up the EOU determination process when an EOU determination is based upon a number of silence frames. More specifically, situations exist currently where conventional dual factor EOU determinations must wait until an entire silence frame window is full before making an EOU determination. Once a tentative EOU determination is made based upon a number of silence frames, a sending of audio frames to a decoder is halted to be resumed only after the tentative EOU determination is finalized, which currently requires the silence frame window to be full. In many instances, however, a sufficient number of frames are present in the silence frame window to make a definitive determination. That is, no matter what the remaining frames are, the ultimate determination will not change. The present invention looks for such a state, and makes an immediate EOU finalization determination even before the silence frame window is completely filled. This improves efficiency by reducing a delay period for EOU determinations, while having no negative effect on accuracy.
The present invention can be implemented in accordance with numerous aspects consistent with the materials presented herein. One aspect of the present invention can include a system for determining end of utterance (EOU) events. The system can include a frame based segmenter, a frame labeler, a decoder, a silence EOU detector, an end-of-path manager, and an EOU detector. The frame based segmenter can be configured to segment an incoming audio stream into a sequence of frames. The frame labeler can label each frame created by the frame based segmenter as either a silence frame or a speech frame. The decoder can match audio contained in speech frames against entries in a speech recognition grammar and can perform programmatic actions based upon match results. The silence EOU detector can initiate a tentative end of utterance event when a number of silence frames within a sequence of frames exceeds a previously defined threshold. The end-of-path manager can initiate a tentative end of utterance event when an end of a path of an enabled recognition grammar is determined. The EOU detector can establish a waiting period and a set of conditions for converting a tentative end of utterance event into a finalized end of utterance event and for releasing a tentative end of utterance event that is not to be finalized.
Another aspect of the present invention can include software for determining an EOU event, which includes a silence component, a path component, and a finalization component. The silence component can initiate a silence induced EOU event based upon a number of sequential frames labeled as silence that are received. The path component can initiate an end-of-path induced EOU event based upon programmatic determinations that terminal nodes of recognition grammar paths for a speech input have been reached. The finalization component can delay determinations of EOU events initiated by the silence component and the path component for a defined duration, can perform at least one determination as to whether the initiated EOU event is to be finalized, and can then either finalize the initiated EOU event or ignore the initiated EOU event based upon the performed determination.
Still another aspect of the present invention can include a method for determining EOU events in a speech processing situation. The method can segment an incoming audio stream into a set of frames. Each of the frames can be labeled as containing speech or silence. An end-of-path determination can be made. The method can wait for an application requested time out period to expire before finalizing a result. During this time, speech frames can continue to be speech recognized. The end-of-path determination can be selectively revoked depending upon results of the speech recognitions occurring during the requested time out period. When the requested time out period expires and when results have not been revoked, an EOU event can be initiated based upon a finalized end-of-path determination.
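By way of illustration only, the waiting and revocation behavior of this method can be sketched as follows. The sketch assumes that the time out period is counted in frames and that a caller-supplied, hypothetical callback named extends_utterance reports whether recognizing a speech frame carries the utterance past the previously reached end of path. Neither assumption is stated in the description above, and the Frame type is likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Frame:
    """Hypothetical frame record: a label plus the raw audio payload."""
    is_speech: bool
    audio: bytes = b""


def finalize_end_of_path(
    frames: Iterable[Frame],
    timeout_frames: int,
    extends_utterance: Callable[[Frame], bool],
) -> bool:
    """Return True if the tentative end-of-path determination is finalized,
    or False if it is revoked during the time out period."""
    for waited, frame in enumerate(frames):
        if waited >= timeout_frames:
            break  # the requested time out period has expired
        # Speech frames continue to be recognized during the wait.
        if frame.is_speech and extends_utterance(frame):
            return False  # recognition result revokes the determination
    # Time out expired (or the stream ended) without revocation.
    return True


silence_only = [Frame(False)] * 5
late_speech = [Frame(False)] * 2 + [Frame(True)] + [Frame(False)] * 2

print(finalize_end_of_path(silence_only, 5, lambda f: True))  # True
print(finalize_end_of_path(late_speech, 5, lambda f: True))   # False
```

A speech frame that the callback does not treat as extending the utterance, such as a burst of noise that matches nothing in the grammar, leaves the tentative determination in place in this sketch.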
It should be noted that various aspects of the invention can be implemented as a program for controlling computing equipment to implement the functions described herein, or as a program for enabling computing equipment to perform processes corresponding to the steps disclosed herein. This program may be provided by storing the program in a magnetic disk, an optical disk, a semiconductor memory or any other recording medium. The program can also be provided as a digitally encoded signal conveyed via a carrier wave. The described program can be a single program or can be implemented as multiple subprograms, each of which interacts within a single computing device or interacts in a distributed fashion across a network space.
There are shown in the drawings, embodiments which are presently preferred, it being understood, however, that the invention is not limited to the precise arrangements and instrumentalities shown.
The present invention discloses a solution for a speech processing system to determine end-of-utterance (EOU) events. The solution is a modified dual factor technique, where one factor is based upon a number of approximately continuous silence frames received and a second factor is based upon an end-of-path occurrence. The solution permits a set of configurable timeout delay values to be established, which can be configured on an application specific basis by application developers. The solution can also speed up those EOU determinations of the dual factor technique that are based in part upon a number of silence frames received, which improves the efficiency of the modified dual factor technique without impacting accuracy.
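By way of illustration only, one way to detect "approximately continuous" silence is to count silence frames within a sliding window of recent frames, so that an isolated frame marked as speech does not restart the count. The function name and the default values (a window of ten frames, eight of which must be silence) are assumptions chosen for this sketch.

```python
from collections import deque
from typing import Deque, Iterable, Optional


def tentative_silence_eou(
    labels: Iterable[bool], window: int = 10, needed: int = 8
) -> Optional[int]:
    """Return the index of the frame at which a tentative silence induced
    EOU would be raised, or None. True marks a silence frame."""
    recent: Deque[bool] = deque(maxlen=window)  # oldest label drops off
    for index, is_silence in enumerate(labels):
        recent.append(is_silence)
        # Tolerates up to (window - needed) speech labels in the window.
        if sum(recent) >= needed:
            return index
    return None


# Three speech frames, four silence frames, one frame marked as speech
# because of noise, then ten silence frames.
noisy = [False] * 3 + [True] * 4 + [False] + [True] * 10

print(tentative_silence_eou(noisy))  # 11
```

Here the noise-induced speech label at index 7 remains inside the window but does not prevent the tentative determination, which is raised once eight of the last ten labels are silence.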
Two different occurrences can trigger a tentative EOU event; one being determined by the silence EOU handler 123, the other being determined by the end-of-path manager 132. Once a tentative EOU event occurs, an EOU detector 140 can determine whether conditions exist to finalize the tentative EOU occurrence to produce a confirmed EOU event or whether conditions exist for negating the tentative EOU event. The detector 140 can use a counter 142 and a finalization timeout variable 144 to make its determinations.
End-of-path process 210 illustrated in
When the finalization time-out expires, the process can progress from step 222 to step 224, where the EOU event can be finalized. In step 226, responsive to the finalized EOU event, a set of actions suitable for the decoded speech and/or state of the speech enabled device can be performed. One of the suitable actions can be to generate result 116. Additionally, the decoding of speech frames can be halted once the EOU event has been finalized, as shown by step 228.
The silence process 240 illustrated in
Once the silence window is fixed and the tentative EOU determination made, the decoding of speech labeled frames can be halted, as indicated by step 246. Halting the decoding process when a silence situation is believed to exist can conserve processing resources. In step 248, a time-out counter 142 can be started. New frames from the audio stream 112 continue to be labeled by labeler 122 at this time. While the time-out counter is less than the finalization time out threshold 144, a quantity of speech and/or silence frames within the window can be intermittently checked. This permits the process 240 to take immediate action when it becomes evident that the tentative EOU determination should be either finalized or released. When no preliminary determination is possible, the window can be allowed to fill and/or the time-out counter can reach the finalization threshold, at which point a determination can be made.
Accordingly, step 250 checks to see if a sufficient number of silence frames exist to finalize the tentative EOU determination. If so, the process can progress to step 258, where finalization actions can be performed. Otherwise, step 252 can execute, where a determination can be made as to whether a sufficient quantity of speech frames is present in the window to release the tentative EOU determination. If so, the process can progress to step 262, where release actions can execute. Otherwise, a current value of the time-out counter can be compared against the finalization time out threshold (or, in a different implementation, the silence window can be checked to see whether it has filled). When the time-out event has not occurred, the process can loop back to step 250, where, after a time, another check for sufficient silence frames can be performed.
After the time-out event occurs, a decision can be made in step 256 to finalize the tentative EOU determination or not. A decision to finalize results in the process progressing from step 256 to step 258, whereas a decision to release the tentative determination results in the process progressing from step 256 to step 262. In step 258, the EOU determination can be finalized. In step 260, actions can be performed responsive to the finalized EOU determination. For example, result handler 130 can initiate a programmatic action or can produce result 116, which causes another programmatic component to take actions relating to the received result 116. In step 262, a tentative EOU determination can be released and the previously halted decoder 126 can resume decoding speech frames, as shown by step 264. Speech frames accumulated while the decoder 126 was halted (in step 246) can be queued to be processed when decoding is resumed in step 264.
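By way of illustration only, the flow of steps 246 through 264 can be sketched as a single loop. The sketch makes several assumptions that the description leaves open: the time-out counter is advanced once per labeled frame, labels are booleans with True meaning silence, and the step 256 decision is made with a hypothetical silence-fraction rule (timeout_silence_ratio). All names below are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Sequence


@dataclass
class Outcome:
    finalized: bool            # True: steps 258/260, False: steps 262/264
    frames_waited: int         # value of the time-out counter at the decision
    queued_speech: List[int] = field(default_factory=list)


def run_silence_process(
    fixed_window: Sequence[bool],
    incoming: Iterable[bool],
    timeout_frames: int,
    silence_exit: int,
    speech_exit: int,
    timeout_silence_ratio: float = 0.7,  # assumed rule for step 256
) -> Outcome:
    silence = sum(fixed_window)
    speech = len(fixed_window) - silence
    queued: List[int] = []  # speech frames held while decoding is halted (step 246)
    waited = 0              # time-out counter (step 248)
    for index, is_silence in enumerate(incoming):
        if waited >= timeout_frames:
            break                       # time-out event
        waited += 1
        if is_silence:
            silence += 1
        else:
            speech += 1
            queued.append(index)
        if silence >= silence_exit:     # step 250
            return Outcome(True, waited)
        if speech >= speech_exit:       # step 252
            return Outcome(False, waited, queued)
    # Step 256: decide at time-out using the assumed ratio rule.
    total = silence + speech
    finalized = total > 0 and silence / total >= timeout_silence_ratio
    # On release, queued frames are handed back for decoding (step 264).
    return Outcome(finalized, waited, [] if finalized else queued)


window = [True] * 8 + [False] * 2
print(run_silence_process(window, [True] * 20, 20, 22, 7).frames_waited)   # 14
print(run_silence_process(window, [False] * 20, 20, 22, 7).queued_speech)  # [0, 1, 2, 3, 4]
```

Returning the queued frame indices on release mirrors the queuing of speech frames for the resumed decoder; a fuller implementation would queue the audio itself rather than indices.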
To illustrate by example, a sliding silence window can be fixed when at least eight out of the last ten frames are labeled as silence. The window can be created to contain thirty frames. After the window is fixed, so that it includes the eight silence frames of eight to ten sequentially received frames, subsequent frames can be placed in the now fixed window during a time period when the tentative EOU determination has yet to be finalized. When either the window fills or the time out period expires, the determination can be finalized or released. Additionally, a speech exit threshold can be established for a sufficient number of speech frames in a window (e.g., seven frames) for terminating the finalization period early. That is, after the speech exit threshold has been reached or surpassed, the tentative EOU determination can be immediately released (e.g., ignored) and the speech processing system can resume normal input processing operations. A silence exit threshold can also be established for a sufficient number of silence frames in a window (e.g., twenty-two) to terminate the finalization period early with a finalized EOU result.
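The arithmetic behind this example can be made concrete with a short sketch. It assumes, for illustration, that the window is fixed with exactly ten frames in it (eight silence and two speech); the function name is hypothetical.

```python
from typing import Tuple


def frames_to_decision(
    window_size: int,
    silence_now: int,
    speech_now: int,
    silence_exit: int,
    speech_exit: int,
) -> Tuple[int, int, int]:
    """Return (earliest_finalize, earliest_release, frames_to_fill).

    earliest_finalize: additional frames needed if every one is silence.
    earliest_release:  additional frames needed if every one is speech.
    frames_to_fill:    additional frames needed to fill the window.
    """
    earliest_finalize = max(0, silence_exit - silence_now)
    earliest_release = max(0, speech_exit - speech_now)
    frames_to_fill = window_size - (silence_now + speech_now)
    return earliest_finalize, earliest_release, frames_to_fill


# Thirty-frame window fixed with 8 silence and 2 speech frames,
# silence exit threshold of 22 and speech exit threshold of 7.
print(frames_to_decision(30, 8, 2, 22, 7))  # (14, 5, 20)
```

Under these assumptions, a finalized result can be reached after as few as fourteen further frames and a release after as few as five, compared with twenty further frames when waiting for the window to fill.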
As used herein, the speech processing system 110 can be any computing device or set of computing devices able to perform speech recognition functions, which include an EOU feature. The speech processing system 110 can be implemented as a stand-alone server, as part of a cluster of servers, within a virtual computing space formed from a set of one or more physical devices, and the like. In one embodiment, functionality attributed to the EOU detector 140, the decoder 126, and the like can be incorporated within different machines or machine components.
The present invention may be realized in hardware, software, or a combination of hardware and software. The present invention may be realized in a centralized fashion in one computer system, or in a distributed fashion where different elements are spread across several interconnected computer systems. Any kind of computer system or other apparatus adapted for carrying out the methods described herein is suited. A typical combination of hardware and software may be a general purpose computer system with a computer program that, when being loaded and executed, controls the computer system such that it carries out the methods described herein.
The present invention also may be embedded in a computer program product, which comprises all the features enabling the implementation of the methods described herein, and which when loaded in a computer system is able to carry out these methods. Computer program in the present context means any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: a) conversion to another language, code or notation; b) reproduction in a different material form.
This invention may be embodied in other forms without departing from the spirit or essential attributes thereof. Accordingly, reference should be made to the following claims, rather than to the foregoing specification, as indicating the scope of the invention.