The use of speech recognition technology is currently gaining popularity. One reason is that speech is one of the most convenient human-machine communication interfaces for running computer applications. Automatic speech recognition technology is one of the fundamental components for facilitating human-machine communication, and therefore this technology has made substantial progress in the past several decades.
However, in real world applications, speech recognition technology has not gained as much penetration as was first believed. One reason for this is that it is still difficult to maintain consistent, robust speech recognition performance across different operating conditions. For example, it is difficult to maintain accurate speech recognition in applications that have variable background noise, different speakers and speaking styles, dialectal accents, out-of-vocabulary words, and so on.
Due to the difficulty in maintaining accurate speech recognition performance, speech recognition error correction is also an important part of automatic speech recognition technology. Efficient correction of speech recognition errors remains rather difficult in most speech recognition systems.
Many current speech recognition systems rely on a spoken input in order to correct speech recognition errors. In other words, when a user is using a speech recognizer, the speech recognizer outputs a proposed result of the speech recognition function. When the speech recognition result is incorrect, the speech recognition system asks the user to repeat the utterance which was incorrectly recognized. In doing so, many users repeat the utterance in an unnatural way, such as very slowly and distinctly, rather than fluently, as it would normally be spoken. This, in fact, often makes it more difficult for the speech recognizer to recognize the utterance accurately, and therefore the next speech recognition result output by the speech recognizer is often erroneous as well. Correcting a speech recognition result with speech thus often results in a very frustrating user experience.
Therefore, in order to correct errors made by an automatic speech recognition system, some other input modes (other than speech) have been tried. Some such modes include using a keyboard, spelling out the words using spoken language, and using pen-based writing of the word. Among these various input modalities, the keyboard is probably the most reliable. However, for small handheld devices, such as personal digital assistants (PDAs) or telephones, which often have a very small keypad, it is difficult to key in words in an efficient manner without going through at least some type of training process.
It is also known that some current handheld devices are provided with a handwriting input option. In other words, using a “pen” or stylus, a user can perform handwriting on a touch-sensitive screen. The handwritten characters entered on the screen are submitted to a handwriting recognition component that attempts to recognize the characters written by the user.
In most prior error correction interfaces, locating the error in a speech recognition result is usually done by having a user select the misrecognized word in the result. However, this does not indicate the type of error in any way. For instance, by selecting a misrecognized word, it is still not clear whether the recognition result contains an extra word or character, has misspelled a word, has output the wrong sense of a word, or is missing a word, etc.
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.
A speech recognition result is displayed for review by a user. If it is incorrect, the user provides pen-based editing marks, and an error type and location (within the speech recognition result) are identified. An alternative result template is generated and an N-best alternative list is also generated by applying the template to intermediate recognition results from the automatic speech recognizer. The N-best alternative list is output for use in correcting the speech recognition results.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. The claimed subject matter is not limited to implementations that solve any or all disadvantages noted in the background.
More specifically, speech recognizer 102 first receives a spoken input 118 from a user. This is indicated by block 200 in
In generating the speech recognition result 120, speech recognizer 102 also generates intermediate recognition results 122. Intermediate recognition results 122 are commonly generated by current speech recognizers as a word graph or confusion network. These are normally not output by a speech recognizer because they cannot easily be read or deciphered by a human user. When depicted in graphical form, they normally resemble a highly interconnected graph (or “spider web”) of nodes and links. The graph is a very compact representation of the high probability recognition hypotheses (word sequences) generated by the speech recognizer. The speech recognizer eventually outputs only the highest probability recognition hypothesis, but the intermediate results are used to identify that hypothesis.
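As a rough illustration only (the system does not prescribe any particular data structure, and the names below are hypothetical), a confusion-network style intermediate result can be thought of as a sequence of time slots, each holding competing words with scores, from which the single best hypothesis is read out:

    from dataclasses import dataclass, field

    @dataclass
    class Slot:
        # One time slot of a confusion network: competing words with scores.
        start: float                                       # slot start time (seconds)
        end: float                                         # slot end time (seconds)
        alternatives: dict = field(default_factory=dict)   # word -> score

    @dataclass
    class ConfusionNetwork:
        # Compact representation of high-probability recognition hypotheses.
        slots: list

        def best_hypothesis(self):
            # Read out the single highest-scoring word in each slot.
            return [max(s.alternatives, key=s.alternatives.get) for s in self.slots]

    # Example: "speech" scored highest, but "screech" was also a strong competitor.
    net = ConfusionNetwork(slots=[
        Slot(0.00, 0.45, {"speech": 0.55, "screech": 0.40, "peach": 0.05}),
        Slot(0.45, 1.10, {"recognition": 0.90, "and": 0.10}),
    ])
    print(net.best_hypothesis())   # ['speech', 'recognition']

In this sketch, the best_hypothesis readout corresponds to the single result that is displayed, while the full set of slot alternatives corresponds to the intermediate results that are later searched for correction alternatives.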
In any case, once the recognition result 120 is output by speech recognizer 102 and displayed on user interface display 106, it is determined whether the recognition result 120 is correct or whether it needs to be corrected. This is indicated by block 206 in
If the user determines that the displayed speech recognition result is incorrect, then the user provides pen-based editing marks 124 through user interface display 106. For instance, system 100 is illustratively deployed on a handheld device, such as a palmtop computer, a telephone, a personal digital assistant, or another type of mobile device. User interface display 106 illustratively includes a touch-sensitive area which, when contacted by a user (such as by using a pen or stylus), receives the user input editing marks from the pen or stylus. In the embodiment described herein, the pen-based editing marks not only indicate a position within the displayed recognition result 120 that contains the error, but also indicate the type of error that occurs at that position. Receiving the pen-based editing marks 124 is indicated by block 208 in
The marked up speech recognition result 126 is received, through display 106, by error identification component 108. Error identification component 108 then identifies the type and location of the error in the marked up recognition result 126, based on the pen-based editing marks 124 input by the user. Identifying the type and location of the error is indicated by block 210 in
In one embodiment, error identification component 108 includes a handwriting recognition component (which can be the same as handwriting recognition component 116 described below, or a different handwriting recognition component) which is used to process and identify the symbols used by the user in pen-based editing marks 124. While a wide variety of different types of pen-based editing marks can be used to identify error type and error position in the recognition result 120, a number of examples of such symbols are shown in
A substitution error is an error in which a word (or other token) is misrecognized as another word. For instance, where the word “speech” is misrecognized as the word “screech”, this is a substitution error because an erroneous word was substituted for a correct word in the recognition result.
An insertion error is an error in which one or more spurious words or characters (or other tokens) are inserted in the speech recognition result where no word(s) or character(s) belong. In other words, where the erroneous recognition result is “speech and recognition” but the actual result should be “speech recognition”, the word “and” is erroneously inserted in a spot where no word belongs, and is thus an insertion error.
A deletion error is an error in which one or more words or characters (or other tokens) have been erroneously deleted. For instance, where the erroneous speech recognition result is “speech provides” but the actual recognition result should be “speech recognition provides”, the word “recognition” has erroneously been deleted from the speech recognition result.
It will, of course, be noted that the particular pen-based editing marks used in
Error identification component 108 identifies the particular error type and location in the speech recognition result 120 by performing handwriting recognition on the symbols in the pen-based editing marks to determine whether they are circles, v or inverted v shapes, or horizontal lines. Based on this handwriting recognition, component 108 identifies the particular types of errors that have been marked by the user.
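As a sketch only, the recognized mark shape can be mapped to an error type with a small lookup table. The particular shape-to-error assignment below is an assumption made for illustration (the actual assignment is defined by the editing-mark symbols themselves), and the names are hypothetical:

    from enum import Enum

    class ErrorType(Enum):
        SUBSTITUTION = "substitution"   # a wrong word appears in place of the spoken word
        INSERTION = "insertion"         # a spurious word appears in the result
        DELETION = "deletion"           # a spoken word is missing from the result

    # Hypothetical shape-to-error assignment, for illustration only.
    SHAPE_TO_ERROR = {
        "circle": ErrorType.SUBSTITUTION,        # assumed: a circled word is wrong
        "caret": ErrorType.DELETION,             # assumed: a v / inverted-v marks a missing word
        "horizontal_line": ErrorType.INSERTION,  # assumed: a strike-through marks an extra word
    }

    def classify_mark(shape: str) -> ErrorType:
        # Map a handwriting-recognized editing-mark shape to an error type.
        return SHAPE_TO_ERROR[shape]

    print(classify_mark("circle"))   # ErrorType.SUBSTITUTION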
Component 108 then correlates the particular position of the pen-based editing marks 124 on the user interface display 106 with the words in the speech recognition result 120 displayed on the user interface display 106. Of course, both are provided together in marked up result 126. Component 108 can thus identify, within the speech recognition result, the type of error noted by the user and the particular position at which the error occurred.
The particular position may be the word position of the word within the speech recognition result, or it may be a letter position within an individual word, or it may be the location of a phrase. The error position can thus be correlated to a position in the speech signal that spawned the marked result. The error type and location 128 are output by error identification component 108 to template generator 110.
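One plausible way to perform this correlation, assuming the display keeps a per-word bounding box and a per-word time alignment (both hypothetical details not specified above), is sketched below:

    from dataclasses import dataclass

    @dataclass
    class DisplayedWord:
        text: str
        x_left: float    # horizontal extent of the word on the display (pixels)
        x_right: float
        start: float     # time alignment of the word in the speech signal (seconds)
        end: float

    def locate_error(words, mark_x):
        # Find the word index and its time interval [s, t] under a mark drawn at mark_x.
        for index, w in enumerate(words):
            if w.x_left <= mark_x <= w.x_right:
                return index, (w.start, w.end)
        return None

    # Example: a mark drawn over the second word of "speech screech provides".
    words = [
        DisplayedWord("speech",    10,  70, 0.00, 0.45),
        DisplayedWord("screech",   75, 150, 0.45, 0.95),
        DisplayedWord("provides", 155, 230, 0.95, 1.40),
    ]
    print(locate_error(words, mark_x=100))   # (1, (0.45, 0.95))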
Template generator 110 generates a template 130 that represents word sequences which can be used to correct the error having the identified error type. In other words, the template defines allowable sequences of words that can be used in correcting the error. Template generation is described in greater detail below with respect to
Once template 130 has been generated, it is provided to N-best alternative generator 112. Recall that intermediate speech recognition results 122 have been provided from speech recognizer 102 to N-best alternative generator 112. The intermediate speech recognition results 122 embody a very compact representation of high probability recognition hypotheses generated by speech recognizer 102. N-best alternative generator 112 applies the template 130 provided by template generator 110 against the intermediate speech recognition results 122 to find various word sequences in the intermediate speech recognition results 122 that conform to the template 130.
The intermediate speech recognition results 122 will also, illustratively, have scores associated with them from the various models in speech recognizer 102. For instance, speech recognizer 102 will illustratively include acoustic models and language models, all of which output scores indicating how likely it is that the components (or tokens) of the hypotheses in the intermediate speech recognition results are the correct recognition for the spoken input. Therefore, N-best alternative generator 112 identifies the intermediate speech recognition results 122 that conform to template 130, and ranks them according to a conditional posterior probability, which is also described below with respect to
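A minimal sketch of this sift-and-rank step follows, using the toy confusion-network slots introduced earlier and a simple product of per-word scores as a stand-in for the full posterior of Eq. 2 (the function and variable names are hypothetical, not part of the system described above):

    import re
    from itertools import product

    def nbest_alternatives(slots, template_pattern, n=5):
        # Enumerate word sequences over the confusion-network slots covering the
        # marked error, keep those matching the template pattern, and rank them
        # by the product of their per-word scores (a stand-in for the acoustic
        # and language model posterior described in the text).
        scored = []
        for combo in product(*(slot.items() for slot in slots)):
            words = " ".join(w for w, _ in combo)
            if re.fullmatch(template_pattern, words):
                score = 1.0
                for _, p in combo:
                    score *= p
                scored.append((score, words))
        scored.sort(reverse=True)
        return [(words, score) for score, words in scored[:n]]

    # Slots overlapping the marked error region, each mapping word -> score.
    slots = [
        {"speech": 0.55, "screech": 0.40, "peach": 0.05},
        {"recognition": 0.90, "and": 0.10},
    ]
    # Substitution template: any single word followed by the known right context.
    print(nbest_alternatives(slots, r"\w+ recognition"))
    # Highest-scoring alternatives first, e.g. "speech recognition" before "screech recognition".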
In one illustrative embodiment, once the N-best alternative list has been generated, error correction component 114 automatically corrects speech recognition result 120 by substituting the first-best alternative from N-best alternative list 132 as the corrected result 134. The corrected result 134 is then displayed on user interface display 106 for confirmation by the user. Automatically correcting the recognition result using the first-best alternative is indicated by block 216 in
If, at block 226, the user is unable to locate the correct result in the N-best alternative list 132, the user can simply provide a user handwriting input 140. User handwriting input 140 is illustratively a user input in which the user spells out the correct word or phrase that is currently being corrected on user interface display 106. For instance,
Once the user handwriting input 142 is received, it is provided to handwriting recognition component 116 which performs handwriting recognition on the characters and symbols provided by input 142. Handwriting recognition component 116 then generates a handwriting recognition result 144 based on the user handwriting input 142. Any of a wide variety of different known handwriting recognition components can be used to perform handwriting recognition. Performing the handwriting recognition is indicated by block 232 in
Recognition result 144 is provided to error correction component 114. Error correction component 114 then substitutes the handwriting recognition result 144 for the word or phrase being corrected, and outputs the newly corrected result 134 for display on user interface display 106.
Once the correct recognition result has been obtained (at any of blocks 206, 220, 228, or 232), it is finally displayed on user interface display 106. This is indicated by block 234 in
The result can then be output to any of a wide variety of different applications, either for further processing, or to execute some task, such as command and control. Outputting the result for some type of further action or processing is indicated by block 236 in
It can be seen from the above description that interface component 104 significantly reduces the handwriting burden on the user in making error corrections in the speech recognition result. Automatic correction can be performed first. Also, to speed up the process, in one embodiment, an N-best alternative list is generated, from which the user chooses an alternative if the automatic correction is unsuccessful. A long alternative list 132 can be visually overwhelming, and can slow down the correction process and require more interaction from the user, which may be undesirable. In one embodiment, the N-best alternative list 132 displays the five best alternatives for selection by the user. Of course, any other desired number could be used as well, and five is given for the sake of example only.
However, prior to generating the probability, the present system first generates template 130 to constrain a modified generalized posterior probability calculation. The calculation is performed to assess the confidence of recognition hypotheses, obtained from intermediate speech recognition results 122 by applying the template 130 against those results, at marked error locations in the recognition result 120. By using a template to sift out relevant hypotheses (paths) from the intermediate speech recognition results 122, the template constrained probability estimation can assess the confidence of a unit hypothesis, a substring hypothesis, or a substring hypothesis that includes a wildcard component, as is discussed below.
In any case, the first step in generating the N-best alternative list is for template generator 110 to generate template 130. The template 130 is generated to identify a structure of possibly matching results that can be identified in intermediate speech recognition results 122, based upon the error type and the position of the error (or the context of the error) within recognition result 120. Generating the template is indicated by block 350 in
In one embodiment, the template 130 is denoted as a triple, [T;s,t]. The template T is a template pattern that includes hypothesized units and metacharacters that can support regular expression syntax. The pair [s,t] defines the time interval constraint of the template. In other words, it defines the time frame within recognition result 120 that corresponds to the position of the marked error. The term s is the start time, in the speech signal that spawned the recognition result 120, corresponding to the beginning of the marked error, and t is the end time in that speech signal corresponding to the end of the marked error. Referring again to
In one embodiment, in a regular expression of the template, the basic template can also include metacharacters, such as a “don't care” symbol *, a blank symbol Φ, or a question mark ?. A list of some exemplary metacharacters is found below in Table 1.
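For illustration only, the triple might be represented as a small record (a sketch with hypothetical names, not a prescribed implementation):

    from dataclasses import dataclass

    @dataclass
    class Template:
        # The triple [T; s, t]: a pattern over words plus a time-interval constraint.
        pattern: str   # regular-expression style pattern over hypothesized units
        s: float       # start time of the marked error region in the speech signal
        t: float       # end time of the marked error region in the speech signal

    # Example: the marked error spans 0.45 s to 0.95 s of the utterance, and any
    # single word is allowed in the marked position between the context words.
    tpl = Template(pattern=r"speech \w+ provides", s=0.45, t=0.95)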
The next template in
Template 406 in
Template 408 in
Different types of customized templates 130 are illustratively generated for different types of errors. For example, let W1 . . . WN be the word sequence in a speech recognition result 120, for a spoken input. In one exemplary embodiment, the template T can be designed as follows:
where 0≤i≤N, 1≤j≤N−i, W0=^ (the sentence start), WN+1=$ (the sentence end), and the symbols “?” and “*” are the same as defined in Table 1. Eq. 1 only includes templates for correcting substitution and deletion errors. Insertion errors can be corrected by a simple deletion, and no template is needed in order to correct such errors.
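Since Eq. 1 itself is not reproduced above, the following sketch only approximates the kind of template construction described, using “?” for exactly one arbitrary word and “*” for zero or more words as in Table 1; the exact template forms are those defined by Eq. 1, and the function and parameter names below are hypothetical:

    def build_template(words, error_type, i, j=1):
        # Build an approximate template pattern for a marked error.
        #   words:      recognized word sequence W1 .. WN (0-indexed list here)
        #   error_type: "substitution" or "deletion"
        #   i:          index of the first marked word (substitution), or of the
        #               word after which words appear to be missing (deletion)
        #   j:          number of words covered by the mark (substitution only)
        # "?" stands for exactly one arbitrary word and "*" for zero or more
        # words, mirroring the metacharacters of Table 1.
        if error_type == "substitution":
            # Keep the left and right context, free the marked span.
            return words[:i] + ["?"] * j + words[i + j:]
        if error_type == "deletion":
            # Allow one or more missing words between the two context words.
            return words[:i + 1] + ["?", "*"] + words[i + 1:]
        raise ValueError("insertion errors are corrected by simple deletion; no template needed")

    print(build_template(["speech", "screech", "provides"], "substitution", i=1))
    # ['speech', '?', 'provides']
    print(build_template(["speech", "provides"], "deletion", i=0))
    # ['speech', '?', '*', 'provides']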
Depending on the type of error indicated by the pen-based editing marks 124 provided by the user, the particular portion of the template in Eq. 1 will be used to sift hypotheses in the intermediate speech recognition results 122 output by speech recognizer 102, in order to identify alternatives for N-best alternatives list 132. Searching the intermediate speech recognition results 122 for results that match the template 130 is indicated by block 352 in
The matching hypotheses are then scored. All string hypotheses that match template [T;s,t] form the hypothesis set H([T;s,t]). The template constrained posterior probability of [T;s,t] is a generalized posterior probability summed over all string hypotheses in the hypothesis set H([T;s,t]), as follows:
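Eq. 2 is not reproduced above; a plausible form, consistent with the description that follows (per-word acoustic and language model terms, summed over the matching hypothesis set and normalized by the acoustic likelihood of the full observation sequence), is:

    P([T;s,t] \mid x_1^T) \;=\; \frac{\displaystyle\sum_{W \in H([T;s,t])} \;\prod_{n=1}^{|W|} p\bigl(x_{s_n}^{t_n} \mid w_n\bigr)^{\alpha}\, P\bigl(w_n \mid h_n\bigr)^{\beta}}{p\bigl(x_1^T\bigr)}

in which each string hypothesis W = w1 . . . w|W| in the hypothesis set contributes, for each of its words wn, the acoustic score of the observation span aligned to wn and the language model score of wn given its history hn,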
where x1T (i.e., x1 . . . xT) denotes the whole sequence of acoustic observations, and α and β are exponential weights for the acoustic and language models, respectively.
It can thus be seen that the numerator of the summation in Eq. 2 contains two terms. The first is the acoustic model probability associated with the sequence of acoustic observations delimited by the template's starting and ending positions, given a current word, and the second term is the language model likelihood for a given word, given its history. For a given hypothesis that matches the template 130 (i.e., for a given hypothesis in the hypothesis set), all of the aforementioned probabilities are summed and normalized by the acoustic probability for the sequence of acoustic observations in the denominator of Eq. 2. This score is used to rank the N-best alternatives to generate list 132.
It can thus be seen that the template 130 acts to sift the hypotheses in intermediate speech recognition results 122. Therefore, the constraints on the template can be made finer (by generating a more restrictive template) to sift out more of the hypotheses, or coarser (by generating a less restrictive template) to include more of the hypotheses. As discussed above,
In calculating the template constrained posterior probabilities set out in Eq. 2, the reduced search space (the granularity of the template), the time relaxation registration (how wide the time parameters s and t are set), and the weights assigned to the acoustic and language model likelihoods can be set according to conventional techniques used in generating generalized word posterior probabilities for measuring the reliability of recognized words, except for the string hypothesis selection, which corresponds to the term under the sigma summation in Eq. 2. Of course, these items in the template constrained posterior probability calculation can be set by machine learned processes or empirically, as well. Scoring each matching result using a conditional posterior probability is indicated by block 354 in
The N most likely substring hypotheses that match the template are found in the intermediate speech recognition results, and a score is generated for each. They are output as the N-best alternative list 132, in rank order. This is indicated by block 356 in
A-to-D converter 406 converts the analog signal from microphone 417 into a series of digital values. In several embodiments, A-to-D converter 406 samples the analog signal at 16 kHz and 16 bits per sample, thereby creating 32 kilobytes of speech data per second. These digital values are provided to a frame constructor 407, which, in one embodiment, groups the values into 25 millisecond frames that start 10 milliseconds apart.
The frames of data created by frame constructor 407 are provided to feature extractor 408, which extracts a feature from each frame. Examples of feature extraction modules include modules for performing Linear Predictive Coding (LPC), LPC-derived Cepstrum, Perceptive Linear Prediction (PLP), Auditory model feature extraction, and Mel-Frequency Cepstrum Coefficients (MFCC) feature extraction. Note that the invention is not limited to these feature extraction modules and that other modules may be used within the context of the present invention.
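As a brief sketch of the framing step described above (assuming 16 kHz input as stated, and using numpy; the function name is hypothetical):

    import numpy as np

    def frame_signal(samples, sample_rate=16000, frame_ms=25, step_ms=10):
        # Group samples into 25 ms frames that start 10 ms apart, as described above.
        frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz
        step_len = int(sample_rate * step_ms / 1000)     # 160 samples at 16 kHz
        n_frames = 1 + max(0, (len(samples) - frame_len) // step_len)
        return np.stack([samples[k * step_len : k * step_len + frame_len]
                         for k in range(n_frames)])

    # One second of (silent) audio yields 98 overlapping frames of 400 samples each.
    frames = frame_signal(np.zeros(16000))
    print(frames.shape)   # (98, 400)

Feature extraction such as MFCC would then apply a window, FFT, mel filterbank, and discrete cosine transform to each frame to produce one feature vector per frame.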
The feature extraction module produces a stream of feature vectors that are each associated with a frame of the speech signal.
Noise reduction can also be used so the output from extractor 408 is a series of “clean” feature vectors. If the input signal is a training signal, this series of “clean” feature vectors is provided to a trainer 424, which uses the “clean” feature vectors and a training text 426 to train an acoustic model 418 or other models as described in greater detail below.
If the input signal is a test signal, the “clean” feature vectors are provided to a decoder 412, which identifies a most likely sequence of words based on the stream of feature vectors, a lexicon 414, a language model 416, and the acoustic model 418. The particular method used for decoding is not important to the present invention and any of several known methods for decoding may be used. However, in performing the decoding, decoder 412 generates intermediate recognition results 122 discussed above.
Optional confidence measure module 420 can assign a confidence score to the recognition results and provide them to output module 422. Output module 422 can thus output recognition results 120, either by themselves or along with their confidence scores.
The mobile device 510 shown in
It should also be noted that in the embodiment illustrated in
Although the subject matter has been described in language specific to structural features and/or methodology acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.