Claims
- 1. In a computer implemented system for recognizing spoken utterances which compares an unknown speech segment represented by a fine sequence of frames selected from a preselected set of prototype data frames with at least some of a vocabulary of word models each of which is represented by a fine sequence of prototype states selected from a preselected set of prototype states, a method of preselecting candidate models comprising:
- providing a precalculated matrix of distance metrics relating said prototype frames with said prototype states;
- thresholding said matrix by assigning a default value to metrics which do not meet a preselected criterion for being meaningful;
- for each prototype frame, forming a list of prototype states for which the distance metric is meaningful;
- for each input utterance, generating a fine sequence of prototype frames and a coarse set of input representative frames selected from said fine sequence, the number of representatives being a minor fraction of the number of frames in the corresponding fine sequence of frames and being distributed in position along said fine sequence;
- for each input utterance, generating a temporary matrix of distance metrics relating each of said sequence of input representatives to said states by performing the following steps:
- (a) setting all entries in said temporary matrix to the default value;
- (b) sequentially scanning said input representatives to locate the corresponding lists for included prototype states;
- (c) adjusting those entries in said temporary matrix which are included in said corresponding lists; and
- subsampling at least a selected portion of said vocabulary models and scoring the subsampled prototype states from said selected models using distance metrics obtained from said temporary matrix, the scoring providing a basis for preselection of candidate models for further processing.
- 2. A method as set forth in claim 1 wherein the generation of said coarse set includes subsampling of the fine sequence at a number of spaced positions along said fine sequence.
- 3. A method as set forth in claim 2 wherein the generation of said coarse set further includes the combining of distances obtained at said subsampled positions with distances obtained at positions adjacent to said subsampled positions thereby to effect a time averaging.
- 4. A method as set forth in claim 1 wherein the scoring performs a time warping of the coarse set of representatives with the subsampled states.
- 5. A method as set forth in claim 1 wherein said scoring includes determining, for each subsampled state, the input representative within a predetermined range of representatives which provides the best match.
- 6. In a computer implemented system for recognizing spoken utterances which compares an unknown speech segment represented by a fine sequence of frames selected from a preselected set of prototype data frames with at least some of a vocabulary of word models each of which is represented by a fine sequence of prototype states selected from a preselected set of prototype states, a method of preselecting candidate models comprising:
- providing a precalculated matrix of distance metrics relating said prototype frames with said prototype states;
- thresholding said matrix by assigning a default value to metrics which do not meet a preselected criterion for being meaningful;
- for each prototype frame, forming a list of prototype states for which the distance metric is meaningful;
- for each input utterance, generating a fine sequence of prototype frames;
- dividing said fine sequence into a series of equal segments thereby to obtain a coarse set of input sample positions along said fine sequence, the number of sample positions being a minor fraction of the number of frames in the corresponding fine sequence of frames;
- for each input utterance, generating a temporary matrix of distance metrics relating each of said sequence of input sample positions to said states by performing the following steps:
- (a) setting all entries in said temporary matrix to the default value;
- (b) sequentially scanning a predetermined number of input frames adjacent to and including each input sample position to locate the corresponding lists for included prototype states;
- (c) determining the one of said predetermined number of frames which best matches each included prototype state; and
- (d) adjusting those entries in said temporary matrix which correspond to said best matches; and
- subsampling at least a selected portion of said vocabulary models and scoring the subsampled prototype states from said selected models using distance metrics obtained from said temporary matrix, the scoring providing a basis for preselection of candidate models for further processing.
- 7. In a computer implemented system for recognizing spoken utterances which compares an unknown speech segment represented by a fine sequence of frames selected from a preselected set of prototype data frames with at least some of a vocabulary of word models each of which is represented by a fine sequence of prototype states selected from a preselected set of prototype states, a method of preselecting candidate models comprising:
- precalculating a matrix of distance metrics relating said prototype frames with said prototype states;
- thresholding said matrix by assigning a default value to metrics which do not meet a preselected criterion for being meaningful;
- for each prototype frame, forming a list of prototype states for which the distance metric is meaningful;
- for each input utterance, generating a fine sequence of prototype frames and a coarse set of a predetermined number of input representative frames selected from said fine sequence, the predetermined number of representatives being a minor fraction of the number of frames in the corresponding fine sequence of frames;
- for each input utterance, generating a temporary matrix of distance metrics relating each of said sequence of input representatives to said states by performing the following steps:
- (a) setting all entries in said temporary matrix to the default value;
- (b) sequentially scanning said input representatives to locate the corresponding lists for included prototype states;
- (c) adjusting those entries in said temporary matrix which are included in said corresponding lists; and
- for each model to be considered, subsampling the corresponding fine sequence of states to obtain a respective coarse sequence comprising a predetermined number of states;
- said predetermined numbers together defining a comparison matrix, there being a preselected region within said matrix which is examined by said method;
- for each state in said limited collection, determining for each state position in said comparison matrix the input representative which provides the best match with that state, considering and examining only frames which lie within said preselected region, a measure of the match being stored in a table;
- calculating, using said table, for each model to be considered a value representing the overall match of said coarse sequence of frames with the respective coarse sequence of states;
- preselecting for accurate comparison those models with the better overall match values as so calculated.
- 8. The method as set forth in claim 7 wherein, in determining the input frame which provides the best match for each possible state in each possible matrix position, the method examines not only the respective subsampled frame but also a preselected number of frames which precede and follow the respective subsampled frame in said fine sequence of frames.
- 9. In a computer implemented system for recognizing spoken utterances which compares an unknown speech segment represented by a fine sequence of frames selected from a preselected set of prototype data frames with at least some of a vocabulary of word models each of which is represented by a fine sequence of prototype states selected from a preselected set of prototype states, a method of preselecting candidate models comprising:
- providing a precalculated matrix of distance metrics relating said prototype frames with said prototype states;
- thresholding said matrix by assigning a default value to metrics which do not meet a preselected criterion for being meaningful;
- for each prototype frame, forming a list of prototype states for which the distance metric is meaningful;
- for each input utterance, generating a fine sequence of prototype frames;
- dividing said fine sequence into a series of equal segments thereby to obtain a coarse set of input sample positions along said fine sequence, the number of sample positions being a minor fraction of the number of frames in the corresponding fine sequence of frames;
- for each input utterance, generating a first temporary matrix of distance metrics relating each of said sequence of input sample positions to said states by performing the following steps:
- (a) setting all entries in said first temporary matrix to a default value;
- (b) sequentially scanning said input representatives to locate the corresponding lists for included prototype states;
- (c) adjusting those entries in said temporary matrix which are included in said corresponding lists;
- for each input utterance, also generating a second temporary matrix of distance metrics relating each of said sequence of input sample positions to said states by performing the following steps:
- (d) setting all entries in said second temporary matrix to a default value;
- (e) sequentially scanning a predetermined number of input frames adjacent to and including each input sample position to locate the corresponding lists for included prototype states;
- (f) determining the one of said predetermined number of frames which best matches each included prototype state; and
- (g) adjusting those entries in said temporary matrix which correspond to said best matches; and
- subsampling at least a selected portion of said vocabulary models;
- scoring the subsampled prototype states from said selected models first using distance metrics obtained from said second temporary matrix; and
- selecting a group of the models scoring higher using said second matrix for scoring using distance metrics obtained from said first matrix.
- 10. A method as set forth in claim 9 wherein the scoring using distance metrics obtained from said first matrix follows a time warping of said sample positions against said subsampled prototype states.
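The preselection steps recited in the claims above can be illustrated with a minimal sketch. It is not the patented implementation. Every function name, the `DEFAULT` value, the "distance below a threshold" test for a meaningful metric, the choice of the centre frame of each equal segment as the sample position, and the summed-distance score are assumptions made for illustration only. Setting `radius=0` corresponds to the temporary matrix of claims 1 and 7; setting `radius` above zero approximates the adjacent-frame best match of claims 6 and 8; the `window` argument approximates the predetermined range of claim 5.

```python
from typing import Dict, List, Sequence, Tuple

DEFAULT = 255  # assumed default value for entries that are not meaningful


def build_lists(dist: Sequence[Sequence[int]], threshold: int) -> List[List[Tuple[int, int]]]:
    """For each prototype frame, list the (state, distance) pairs that pass the
    assumed 'meaningful' test (distance below threshold)."""
    return [[(s, d) for s, d in enumerate(row) if d < threshold] for row in dist]


def sample_positions(n_frames: int, n_samples: int) -> List[int]:
    """Centre index of each of n_samples equal segments of the fine sequence."""
    seg = n_frames / n_samples
    return [int(seg * i + seg / 2) for i in range(n_samples)]


def temp_matrix(frames: Sequence[int], positions: Sequence[int],
                lists: List[List[Tuple[int, int]]], n_states: int,
                radius: int = 0) -> List[List[int]]:
    """Rows are sample positions, columns are prototype states."""
    matrix = [[DEFAULT] * n_states for _ in positions]   # (a) fill with default
    for row, pos in zip(matrix, positions):              # (b) scan the samples
        lo, hi = max(0, pos - radius), min(len(frames), pos + radius + 1)
        for frame in frames[lo:hi]:
            for state, d in lists[frame]:
                if d < row[state]:                       # (c) keep the best match
                    row[state] = d
    return matrix


def subsample(states: Sequence[int], n: int) -> List[int]:
    """Pick n states spread evenly along a model's fine sequence of states."""
    seg = len(states) / n
    return [states[int(seg * i + seg / 2)] for i in range(n)]


def score(matrix: List[List[int]], coarse_states: Sequence[int], window: int = 1) -> int:
    """Sum, over the subsampled states, of the best distance found within
    +/- window sample positions of the state's own position (lower is better)."""
    total = 0
    for j, state in enumerate(coarse_states):
        lo, hi = max(0, j - window), min(len(matrix), j + window + 1)
        total += min(matrix[i][state] for i in range(lo, hi))
    return total


def preselect(frames: Sequence[int], models: Dict[str, Sequence[int]],
              dist: Sequence[Sequence[int]], threshold: int,
              n_samples: int = 4, keep: int = 2) -> List[str]:
    """Return the names of the `keep` best-scoring models as candidates."""
    lists = build_lists(dist, threshold)
    positions = sample_positions(len(frames), n_samples)
    matrix = temp_matrix(frames, positions, lists, len(dist[0]))
    scores = {name: score(matrix, subsample(states, n_samples))
              for name, states in models.items()}
    return sorted(scores, key=scores.get)[:keep]
```

Because the lists hold only the meaningful entries, building the temporary matrix touches a small fraction of the frame-by-state pairs; every other entry simply keeps the default value. A two-stage selection along the lines of claim 9 could call `temp_matrix` twice, once with `radius` above zero for a first pass and once with `radius=0` for rescoring the surviving models.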
CROSS-REFERENCE TO RELATED APPLICATIONS
The present application is a continuation-in-part of application Ser. No. 07/905,345, filed Jun. 29, 1992, now U.S. Pat. No. 5,386,492, and a continuation-in-part of application Ser. No. 08/250,696, filed May 27, 1994, now U.S. Pat. No. 5,546,499.
Continuation in Parts (1)
Parent: Number 905345, Date Jun 1992