Claims
- 1. A system for processing speech, said speech including a succession of utterances spoken in any of continuous and discrete form, comprising:
- A. means for storing a plurality of word models;
- B. means for identifying a succession of temporal segments of said utterances spoken in any of continuous and discrete form;
- C. means selectively operable on ones of said segments for identifying a subset of said plurality of word models meeting predetermined criteria, said subset defining a list of candidate words;
- D. control means for determining arbitrarily selected frame start times t during said utterances spoken in any of continuous and discrete form, said frame start times being independent of identification of an initial anchor; and
- E. means for generating a signal representative of said list of candidate words for selected ones of said frame start times t determined in step D.
- 2. A prefiltering system for processing speech, said speech including a succession of utterances spoken in any of continuous and discrete form, comprising:
- A. cluster data storage means for storing a plurality of M cluster data sets, C.sub.1, . . . , C.sub.M, where M is an integer greater than 1, each of said cluster data sets including data representative of a plurality of word models;
- B. frame data means for generating a succession of w frame data sets v.sub.t, v.sub.t+1, . . . v.sub.t+w-1, beginning at a frame start time t during said succession of utterances spoken in any of continuous and discrete form, where w is an integer greater than 1, said succession of frame data sets being representative of a corresponding succession of temporal segments of said utterances spoken in any of continuous and discrete form, each of said frame data sets including k values representative of different frame parameters, where k.gtoreq.1;
- C. data reduction means selectively operable on said w frame data sets for generating s reduced frame data sets Y.sub.1, Y.sub.2, . . . , Y.sub.s, where s<w, each of said reduced frame data sets being related to an associated plurality of said frame data sets and including j values representative of different reduced frame data set parameters;
- D. scoring means for evaluating each of said reduced frame data sets against a succession of said cluster data sets to generate a cluster score S.sub.Y for each of said cluster data sets;
- E. selectively operable identifying means for identifying each of said word models of said cluster data sets having a cluster score bearing a predetermined relation to at least one threshold score T, said identified word models defining a candidate word list;
- F. control means for determining said frame start times t, where successive start times t are spaced apart arbitrarily, said frame start times being independent of identification of an initial anchor; and
- G. means for generating a signal representative of said candidate word list for preselected ones of said frame start times t determined by said control means.
- 3. A system according to claim 2 wherein said cluster data storage means and said frame data means are adapted whereby each of said frame data sets are associated with duration D.sub.1, and wherein said cluster data sets are each associated with duration D.sub.2, such that:
- D.sub.1 .ltoreq.D.sub.2.
- 4. A system according to claim 2 wherein said cluster data storage means is adapted whereby said cluster data sets are wordstart cluster data sets.
- 5. A system according to claim 2 wherein said cluster data storage means is adapted whereby said word models of each of said cluster data sets correspond to acoustically similar utterances spoken in any of continuous and discrete form over a succession of no more than w frame data sets.
- 6. A system according to claim 2 wherein said data reduction means includes smooth frame means for processing said frame data sets whereby said reduced frame data sets are smoothed frame data sets.
- 7. A system according to claim 6 wherein said smooth frame means is adapted for determining smooth frame data sets in accordance with: ##EQU4## wherein Y.sub.1, Y.sub.2 . . . Y.sub.s are said smooth frame data sets, and whereby each of said smooth frame data sets is associated with b of said frame data sets, and wherein said b are integers greater than 1, a.sub.i are predetermined weighting coefficients, and c defines an offset of each smooth frame data set with respect to the next previous smooth frame data set.
- 8. A system according to claim 7 wherein said data reduction means is adapted whereby w=b+(s-1)c.
- 9. A system according to claim 8 wherein said data reduction means is adapted whereby w=12.
- 10. A system according to claim 8 wherein said data reduction means is adapted whereby k=8.
- 11. A system according to claim 10 wherein said data reduction means is adapted whereby s=3.
- 12. A system according to claim 11 wherein said data reduction means is adapted whereby b=4.
- 13. A system according to claim 12 wherein said data reduction means is adapted whereby c=4.
- 14. A system according to claim 8 wherein w=12, k=8, s=3, b=4, c=4, and a.sub.i =1/b.
- 15. A system according to claim 7 wherein said data reduction means is adapted whereby said weighting coefficients a.sub.i =1/b.
- 16. A system according to claim 2 wherein said cluster data storage means is adapted whereby each of said word models includes r node data vectors f.sub.1, . . . , f.sub.r, where r.ltoreq.s, each of said node data vectors being representative of a characteristic related to the occurrence of a selected one of acoustic segments from each word of a set of words associated with said cluster data sets in said acoustic segment of said utterances spoken in any of continuous and discrete form.
- 17. A system according to claim 16 wherein said cluster data storage means is adapted whereby said characteristic related to the occurrence of a selected one of acoustic segments comprises a probability distribution.
- 18. A system according to claim 17 wherein said cluster data storage means is adapted whereby said cluster score S.sub.Y corresponds to: ##EQU5## wherein Y.sub.i are ones of Y.sub.1, Y.sub.2, . . . , Y.sub.s, and f.sub.i are said node data vectors f.sub.1, . . . , f.sub.r for corresponding ones of said word models.
- 19. A system according to claim 2 wherein said data reduction means is adapted whereby j=k.
- 20. A system according to claim 2 wherein said identifying means includes a first identifying means for identifying each of said cluster data sets having a cluster score measured with respect to a first predetermined threshold T.sub.1, said identified cluster data sets defining preliminary cluster data sets.
- 21. A system according to claim 20 further comprising means for generating a word score S.sub.W representative of the sum of said cluster score S.sub.Y of said cluster data set associated with each of said word models and a language model score S.sub.L, and wherein said identifying means further comprises a selectively operable second identifying means for identifying each of said word models of each of said cluster data sets having a word score S.sub.W measured with respect to a second threshold T.sub.2, said sum represented by:
- S.sub.W =S.sub.Y +S.sub.L.
- 22. A speech processing method for processing speech including a succession of utterances spoken in any of continuous and discrete form comprising the steps of:
- A. storing a plurality of M cluster data sets, C.sub.1, . . . , C.sub.M, where M is an integer greater than 1, each of said cluster data sets including data representative of a plurality of word models;
- B. generating a succession of w frame data sets v.sub.t, v.sub.t+1, . . . v.sub.t+w-1, beginning at a frame start time t during said succession of utterances spoken in any of continuous and discrete form, where w is an integer greater than 1, each of said frame data sets being representative of successive acoustic segments of utterances spoken in any of continuous and discrete form for a frame period, each of said frame data sets including k values representative of different frame parameters where k.gtoreq.1;
- C. reducing w of said frame data sets to generate s reduced frame data sets, Y.sub.1, Y.sub.2, . . . , Y.sub.s, where s<w, each of said reduced frame data sets being related to an associated plurality of said frame data sets and including j values related to the k values of said associated frame data sets, where j.ltoreq.k;
- D. evaluating said reduced frame data sets with a succession of said cluster data sets to generate a cluster score S.sub.Y for each of said cluster data sets;
- E. identifying each of said word models having a cluster score bearing a predetermined relation to at least one threshold score T, said identified word models defining a candidate word list;
- F. determining said frame start times t, where successive start times t are identified at arbitrarily selected intervals; said frame start times being independent of identification of an initial anchor; and
- G. generating a signal representative of said candidate word list for selected ones of said frame start times determined in step F.
- 23. A method according to claim 22 wherein said reducing step C further comprises the substep of smoothing said frame data sets.
- 24. A method according to claim 23 wherein said smoothing substep includes smoothing in accordance with: ##EQU6## wherein Y.sub.1, Y.sub.2, . . . Y.sub.s are said smooth frame data sets, and wherein said b are integers greater than 1, a.sub.i are predetermined weighting coefficients, and c defines an offset of each smooth frame data set with respect to the next previous smooth frame data set.
- 25. A method according to claim 24 wherein said smoothing substep includes setting w=12, k=8, s=3, b=4, c=4, and a.sub.i =1/b.
- 26. A method according to claim 22 wherein said storing step includes the substep of generating for each of said word models, r node data vectors f.sub.1, . . . , f.sub.r, where r is less than or equal to s, each of said node data vectors being representative of a characteristic related to the occurrence of a selected one of acoustic segments from each word of a set of words associated with said cluster data sets in said acoustic segment of said utterances spoken in any of continuous and discrete form, and
- wherein said evaluating step includes the substep of determining said cluster score S.sub.Y in accordance with: ##EQU7## wherein Y.sub.i are ones of Y.sub.1, Y.sub.2, . . . , Y.sub.s, and f.sub.i are said node data vectors f.sub.1, . . . , f.sub.r for corresponding ones of said word models.
- 27. A prefiltering method for processing speech, said speech including a succession of utterances spoken in any of continuous and discrete form, comprising the steps of:
- A. storing a plurality of word models;
- B. identifying a succession of temporal segments of said utterances spoken in any of continuous and discrete form;
- C. operating on ones of said segments and selectively identifying a subset of said plurality of word models meeting predetermined criteria, said subset defining a list of candidate words;
- D. determining arbitrarily selected frame start times t during said utterances spoken in any of continuous and discrete form, said frame start times being independent of identification of an initial anchor; and
- E. generating a signal representative of said list of candidate words for selected ones of said frame start times t determined in step D.
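The prefiltering flow recited in claims 2, 7, 8, 14, 18, 20 and 21 can be illustrated with a minimal sketch. It is not the claimed implementation: the equations marked ##EQU4## through ##EQU7## are not reproduced in this text, so the window indexing in `smooth_frames`, the Laplacian form of the node distributions, the use of a summed negative log-probability as the cluster score (lower is better), and the "score at or below threshold" comparisons are all assumptions made for illustration. The function names and the dictionary layout of a cluster are likewise hypothetical.

```python
import math


def smooth_frames(frames, s=3, b=4, c=4, weights=None):
    """Reduce w = b + (s - 1) * c frames to s smooth frames.

    Each smooth frame is a weighted sum of b consecutive frames, and each
    window starts c frames after the previous one (assumed indexing).
    """
    w = b + (s - 1) * c  # relation stated in claim 8
    if len(frames) != w:
        raise ValueError(f"expected {w} frames, got {len(frames)}")
    if weights is None:
        weights = [1.0 / b] * b  # a_i = 1/b, as in claims 14 and 15
    k = len(frames[0])  # number of parameters per frame
    smooth = []
    for n in range(s):
        window = frames[n * c : n * c + b]
        smooth.append(
            [sum(a * v[p] for a, v in zip(weights, window)) for p in range(k)]
        )
    return smooth


def cluster_score(smooth, nodes):
    """Assumed score S_Y: sum over nodes of -log f_i(Y_i).

    Each node is a (means, deviations) pair describing independent
    Laplacian distributions, one per parameter (an assumption).
    """
    score = 0.0
    for y, (means, devs) in zip(smooth, nodes):  # r <= s nodes are used
        for y_p, mu, dev in zip(y, means, devs):
            score += abs(y_p - mu) / dev + math.log(2.0 * dev)
    return score


def prefilter(smooth, clusters, language_score, t1, t2):
    """Return (word, S_W) pairs that pass both thresholds, best first."""
    candidates = []
    for cluster in clusters:
        s_y = cluster_score(smooth, cluster["nodes"])
        if s_y > t1:  # first threshold T_1 on the cluster score
            continue
        for word in cluster["words"]:
            s_w = s_y + language_score(word)  # S_W = S_Y + S_L
            if s_w <= t2:  # second threshold T_2 on the word score
                candidates.append((word, s_w))
    return sorted(candidates, key=lambda pair: pair[1])
```

With w=12, s=3, b=4 and c=4, the three windows cover frames 0-3, 4-7 and 8-11 without overlap. Because the frame start times are independent of any anchor, a caller could run `smooth_frames` and `prefilter` at any chosen start time t over the incoming frame stream.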
REFERENCE TO RELATED PATENT APPLICATION
This application is a continuation-in-part of U.S. patent application Ser. No. 542,520, entitled "Large-Vocabulary Continuous Speech Prefiltering and Processing System" (as amended), filed Jun. 22, 1990, now U.S. Pat. No. 5,202,952, issued Apr. 13, 1993, and assigned to the assignee of the present application.
US Referenced Citations (13)
Foreign Referenced Citations (1)
| Number | Date | Country |
| WO9200585 | Jan 1992 | WOX |
Non-Patent Literature Citations (3)
Bahl et al., "Obtaining Candidate Words by Polling in a Large Vocabulary Speech Recognition System," IEEE, Sep. 1988, pp. 489-492.
Bahl et al., "Matrix Fast Match: A Fast Method for Identifying a Short List of Candidate Words for Decoding," IEEE, Feb. 1989, pp. 345-347.
Bahl et al., "Constructing Groups of Acoustically Confusable Words," IEEE, Feb. 1990, pp. 85-88.
Continuation in Parts (1)
| | Number | Date | Country |
| Parent | 542520 | Jun 1990 | |