The present invention relates generally to human expression recognition and more specifically to speech, handwriting, or gesture recognition using an expression recognition function.
Automated methods and apparatus for recognizing human expressions such as speech, handwriting, and gestures are known that use conventional recognition functions, also called herein expression recognizers. For example, speaker independent speech recognizers are used for telephone answering systems and for some cellular telephones. These speech recognizers are typically fixed recognizers, which is a type also used for many handwriting and gesture recognizers. A fixed expression recognizer, as the term is used herein, is a recognizer that is not adapted while it is being used; i.e., the databases used to analyze the human expression are not substantially changed after the recognizer is distributed by a manufacturer, after the software is installed, or after a training process is completed. Other conventional expression recognizers may employ limited adaptation techniques that serve to improve the conventional scheme that is used for recognition.
Although such expression recognizers work well in many circumstances, the reliability of their output is not perfect. In some circumstances where expression recognizers are or could be used to advantage because of their greater simplicity, lower power drain, and smaller memory requirements, such as in handheld electronic devices, their performance may suffer. In particular, when such expression recognizers are used substantially by only one person, the resulting error rate may be undesirable due to several factors. For an example of a speech recognizer, the person may have a vocal tract that renders the person's speech in a manner more difficult for the recognizer to interpret than the range of speech for which the recognizer was designed or trained. As another example, the recognizer may not have 100% reliability for any person due to inherent limits in the recognition technology or due to a constant noise in the background. Finally, the person may have a habit of enunciating certain words such that they sound like two words or such that a word is dropped. Such observations pertain to handwriting and gesture systems as well.
The accompanying figures, where like reference numerals refer to identical or functionally similar elements throughout the separate views, together with the detailed description below, are incorporated in and form part of the specification, and serve to further illustrate the embodiments and explain various principles and advantages, in accordance with the present invention.
Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help to improve understanding of embodiments of the present invention.
Before describing in detail embodiments that are in accordance with the present invention, it should be observed that the embodiments reside primarily in combinations of method steps and apparatus components related to human expression recognition. Accordingly, the apparatus components and method steps have been represented where appropriate by conventional symbols in the drawings, showing only those specific details that are pertinent to understanding the embodiments of the present invention so as not to obscure the disclosure with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein.
In this document, relational terms such as first and second, top and bottom, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms “comprises,” “comprising,” or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. An element preceded by “comprises . . . a” does not, without more constraints, preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises the element.
Referring to
The expression recognizer 115 generates a recognized sequence of words W″ 116 that is most likely to represent the expressed sequence of words W′ 111 that it received. This sequence may be generated as digitally encoded text, or, for gestures, it may simply be a sequence of codes. It will be appreciated that the most likely sequence of words 116 may not convey the originally intended communication 106, either because of imperfect conversion from human intention 106 to human expressed words 111 or because of inaccurate conversion from human expressed words 111 to the recognized sequence of words 116.
A corrector 120 receives the recognized sequence of words 116 and analyzes the sequence one word at a time. The word being analyzed is termed the target word. To analyze the target word, the corrector 120 provides the target word and one or more words in the sequence near the target word to a correction model, which determines a replacement for the target word. The replacement may be in the form of a substitute word, an added word, or a deletion of the target word. The substitute word may be, in some instances, the original target word. When the corrector 120 has analyzed each word in the recognized sequence of words 116, it then may generate a corrected sequence of words W′″ 121 that may be presented to the human that generated the expressed sequence of words 111.
The presentation of the corrected sequence of words 121 may be performed by a function of the electronic device 100 not shown in
Referring to
At step 305 (
In Table 1, row 1 (R1) stores the target word, 7, together with the two words (digits, in this example) preceding the target word and the two digits following the target word. Row 2 (R2) stores the number of times that this sequence has been analyzed by the selector 205 and correction model 210, which in this case is 20. The possible word values in the vocabulary (0-9) are listed in the second column. The conditional probabilities for each word value, given the target word and the nearby words (the two preceding and two following words in this example), are listed in the third column. In this example, the conditional probability of the target value being a 6 is 0.95 for the 20 times this sequence has been analyzed in the past.
At step 310 (
It should be noted that there is actually another value used in the vocabulary that was not listed in Table 1. That is a value used for one or two unvoiced digits at the beginning or end of a set of words being analyzed. Thus, the first sequence of words that would be selected in this example by the selector 205 is ##847, where # is the symbol for the unvoiced digit.
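The selection of a target word together with its padded neighbors can be sketched as follows. This is a minimal illustration only, assuming single-character words (digits), two preceding and two following words, and a hypothetical helper name `context_windows`; none of these names come from the embodiments themselves.

```python
from typing import List

PAD = "#"  # symbol for an unvoiced position, as used in the text


def context_windows(words: str, p: int = 2, f: int = 2) -> List[str]:
    """Return one window per target word.

    Each window holds p preceding words, the target word, and f following
    words. Positions before the start or after the end are filled with PAD.
    Assumes every word is a single character (a digit in this example).
    """
    padded = PAD * p + words + PAD * f
    width = p + 1 + f
    return [padded[i:i + width] for i in range(len(words))]


# The recognized sequence used later in the description.
windows = context_windows("8475775054")
first_window = windows[0]    # "##847": target 8, padded on the left
middle_window = windows[5]   # "57750": target is the third 7
last_window = windows[-1]    # "054##": target 4, padded on the right
```

Each window can then serve as the lookup key for the set of conditional probabilities associated with that target word and its neighbors.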
Table 1 is a table for replacement values that are more specifically called substitution values, because the most likely value determined using the set of conditional probabilities defined by Table 1 is substituted on a one-to-one basis for the target value. It will be appreciated that in many instances, the substitution value will be the same value as the target word, so that no change occurs. For simplicity of definition, this may still be classified as a substitution. In accordance with embodiments of the present invention, additional conditional probabilities exist for replacements that are made by adding an identified most probable value after the target word, instead of substituting the most probable value for the target value. This accommodates errors in which a digit is dropped from the recognized sequence of words (the dropping of the digit may have been caused by the human expression 110 or the expression recognizer 115, or some partial combination of the two aspects). In some embodiments, yet another conditional probability exists for deleting the target word.
A more complete table for the same target value used in Table 1 is shown in Table 2.
In Table 2, row 2 (R2) now has two values. The first value is the number of times that this sequence has been analyzed by the selector 205 and correction model 210, which in this case is 5. The second value is the conditional probability for the target word being deleted. For rows 3-13 (R3-R13), there are now three columns. The first two columns are the same as in Table 1. The third column lists the conditional probabilities for adding the word value in the first column to the word sequence, after the target word. Also, row 13 (R13) has been added to include the word value #. For this example, the highest conditional probability in the table is for adding the word value 6 after the target word, which will generate the intended subsequence 757650 of the intended full sequence 8475765054. It should be noted that the sum of all the conditional probabilities (23 in this example) should be 1.0.
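One possible in-memory form of such a table, and the selection of the most probable replacement from it, is sketched below. The class and function names are hypothetical, and the probability values are illustrative placeholders, not the values of Table 2. The window 75750 is inferred from the stated result (adding 6 after the target 7 yields 757650), and the analysis count of 5 is taken from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ReplacementTable:
    """Conditional probabilities for one target word and its neighbors."""
    analyses: int                      # times this window has been analyzed
    p_delete: float = 0.0              # probability of deleting the target
    p_substitute: Dict[str, float] = field(default_factory=dict)
    p_add: Dict[str, float] = field(default_factory=dict)

    def total(self) -> float:
        """Sum of all stored probabilities; should be 1.0."""
        return (self.p_delete
                + sum(self.p_substitute.values())
                + sum(self.p_add.values()))

    def best(self) -> Tuple[str, str]:
        """Return (kind, word) with the highest probability.

        Ties go to the first candidate here; a random choice among
        equally probable candidates is another option.
        """
        candidates = [("delete", "", self.p_delete)]
        candidates += [("substitute", w, p) for w, p in self.p_substitute.items()]
        candidates += [("add", w, p) for w, p in self.p_add.items()]
        kind, word, _ = max(candidates, key=lambda c: c[2])
        return kind, word


def apply_replacement(target: str, kind: str, word: str) -> List[str]:
    """Return the words that replace the target in the corrected sequence."""
    if kind == "delete":
        return []
    if kind == "substitute":
        return [word]
    return [target, word]  # "add": keep the target, then the added word


# Illustrative values only (assumed, not from Table 2).
table = ReplacementTable(analyses=5, p_substitute={"7": 0.2}, p_add={"6": 0.8})
kind, word = table.best()
window = "75750"  # two preceding words, target 7, two following words
corrected = window[:2] + "".join(apply_replacement(window[2], kind, word)) + window[3:]
```

With these placeholder values the selected replacement is the addition of 6 after the target, so `corrected` holds 757650. Word values omitted from the dictionaries are treated as having probability 0.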
In accordance with the above example, and for more general embodiments of the present invention, it can be seen that when there are M unique words in the vocabulary, there are at most M substitution conditional probabilities, M addition conditional probabilities, and 1 deletion conditional probability in the set of conditional probabilities for the target word. When the number of conditional probabilities in the set of conditional probabilities for the target word is expressed as C, then C≦2M+1. M will clearly be an integer greater than zero.
It will be appreciated that fewer or more than two words directly preceding and directly following the target word could be used to formulate the set of conditional probabilities for a target word, and that the number of preceding words need not be the same as the number of following words. Thus, the sequence near the target word may comprise P words of the sequence directly preceding the target word, and F words of the sequence directly following the target word, wherein P and F are non-negative integers.
The number of sets of conditional probabilities can be seen to be a maximum of M^(P+F+1). In the above example, with M=11 (the digits 0-9 and #) and P=F=2, the maximum number of sets of conditional probabilities is 11^5, or 161,051. Each table in the above example could have 29 values (the five digits defining the condition, the one value of the number of analyses, the one probability value for the deletion, and the 22 probability values for the substitutions and additions). Thus, the maximum amount of memory that could theoretically be used for this example is 11^5 × 29, or approximately 4.67 million values. However, the tables may be generated only as needed, that is, only when a particular combination of a target value and the nearby words is first recognized. The actual number of tables needed is typically at least an order of magnitude smaller than the theoretical maximum for many practical uses. For a telephone number application storing 250 telephone numbers, the memory requirements are quite compatible with today's cellular telephones.
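Generating tables only as needed can be sketched with a store that creates a table the first time a given window is encountered. The class name `LazyTableStore`, the dictionary layout of a table, and the second telephone number are assumptions made purely for illustration.

```python
from typing import Dict, Iterator

PAD = "#"  # symbol for an unvoiced position


def windows(words: str, p: int = 2, f: int = 2) -> Iterator[str]:
    """Yield the padded window around each single-character word."""
    padded = PAD * p + words + PAD * f
    width = p + 1 + f
    return (padded[i:i + width] for i in range(len(words)))


class LazyTableStore:
    """Holds one table per window, created on first use only."""

    def __init__(self) -> None:
        self._tables: Dict[str, dict] = {}

    def table_for(self, window: str) -> dict:
        # Allocate the table only when this window is first recognized.
        if window not in self._tables:
            self._tables[window] = {
                "analyses": 0, "delete": 0.0, "substitute": {}, "add": {},
            }
        return self._tables[window]

    def __len__(self) -> int:
        return len(self._tables)


store = LazyTableStore()
# First number is the intended sequence from the text; the second is made up.
for number in ["8475765054", "8475551212"]:
    for w in windows(number):
        store.table_for(w)["analyses"] += 1

table_count = len(store)  # number of tables actually allocated
```

Under these assumptions a ten-digit number contributes at most ten windows, and windows shared between numbers (here ##847 and #8475) reuse the same table, so the number of stored tables grows with the number of distinct windows actually seen rather than with the size of the vocabulary. Misrecognized sequences would add further windows beyond those of the stored numbers.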
Referring again to
At step 325 (
In this example, the intended word sequence is 8475765054, and the recognized sequence is 8475775054. When the third 7 of the sequence is analyzed, the highest conditional probability is for a substitution word value of 3. When the most likely sequence of words is presented, it would likely be denied. When queried for the correct word values, the human would indicate an intention of 6 for the sixth word in the sequence. Since the values in Table 3 had been generated using 3 previous occurrences of the recognized sequence 8475775054, of which two were determined to be correct, a new conditional probability for the word value 6 would be calculated as 2/4, or 0.5, a new probability for the word value 3 would be calculated as 2/4, or 0.5, and the other probabilities would remain at 0. When this table is used again, the corrector 120 would pick one of the two values randomly, since their conditional probabilities are equal. It will be appreciated that the situation of this example is not very likely to arise in a typical telephone number application, since there would have to be two phone numbers each having a five-digit sequence that differs by only one digit from the other.
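The update described above can be sketched by keeping a count per replacement value and deriving the conditional probabilities from the counts. Storing counts rather than probabilities is an assumption made for simplicity, as are the function names and the starting counts (two prior confirmations of 3 and one of 6), which are inferred so that the result matches the 2/4 and 2/4 values stated in the text.

```python
import random
from collections import Counter
from typing import Dict, Optional


def probabilities(counts: Counter) -> Dict[str, float]:
    """Convert confirmation counts into conditional probabilities."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


def record_confirmation(counts: Counter, confirmed_word: str) -> None:
    """Record that the human confirmed this word value for the window."""
    counts[confirmed_word] += 1


def pick(counts: Counter, rng: Optional[random.Random] = None) -> str:
    """Pick the most probable word; break ties randomly.

    Assumes counts is not empty.
    """
    rng = rng or random.Random()
    best = max(counts.values())
    tied = sorted(word for word, n in counts.items() if n == best)
    return rng.choice(tied)


# Assumed state after three previous analyses of the window.
counts = Counter({"3": 2, "6": 1})
record_confirmation(counts, "6")   # the human indicates 6 was intended
probs = probabilities(counts)      # both values now have probability 0.5
choice = pick(counts)              # "3" or "6", chosen at random
```

Word values that have never been confirmed are simply absent from the counter, which corresponds to a probability of 0.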
Thus, an electronic device that includes an expression recognizer has been described that provides for the recognition of human intent, thereby improving the recognition reliability provided by the electronic device in comparison to when the electronic device uses only the expression recognizer. It will be appreciated that these embodiments can provide correction for a speaker's unique vocal aspects, for example an accent or a vocal impediment, for a speaker's habitual errors, and/or for shortcomings of an expression recognizer without training the expression recognizer to the speaker, using a simple technology.
It will be appreciated that embodiments of the invention described herein may be comprised of one or more conventional processors and unique stored program instructions that control the one or more processors to implement, in conjunction with certain non-processor circuits, some, most, or all of the functions of the embodiments of the invention described herein. The non-processor circuits may include, but are not limited to, a radio receiver, a radio transmitter, signal drivers, clock circuits, power source circuits, and user input devices. As such, these functions may be interpreted as steps of a method to perform recognition of human intent. Alternatively, some or all functions could be implemented by a state machine that has no stored program instructions, or in one or more application specific integrated circuits (ASICs), in which each function or some combinations of certain of the functions are implemented as custom logic. Of course, a combination of these approaches could be used. Thus, methods and means for these functions have been described herein. In those situations for which functions of the embodiments of the invention can be implemented using a processor and stored program instructions, it will be appreciated that one means for implementing such functions is the media that stores the stored program instructions, be it magnetic storage or a signal conveying a file. Further, it is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein will be readily capable of generating such stored program instructions and ICs with minimal experimentation.
In the foregoing specification, specific embodiments of the present invention have been described. However, one of ordinary skill in the art appreciates that various modifications and changes can be made without departing from the scope of the present invention as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present invention. The benefits, advantages, solutions to problems, and any element(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as critical, required, or essential features or elements of any or all the claims. The invention is defined solely by the appended claims including any amendments made during the pendency of this application and all equivalents of those claims as issued.