Claims
- 1. A system for recognizing speech spoken by a speaker comprising:
at least one visual transducer with a view of the speaker;
at least one audio transducer receiving the spoken speech;
an audio speech recognizer in communication with the at least one audio transducer, the audio speech recognizer determining a subset of speech elements for at least one speech segment received from the at least one audio transducer, the subset including a plurality of speech elements more likely to represent the speech segment; and
a visual speech recognizer in communication with the at least one visual transducer and the audio speech recognizer, the visual speech recognizer operative to:
(a) receive at least one image from the at least one visual transducer corresponding to a particular speech segment;
(b) receive the subset of speech elements from the audio speech recognizer corresponding to the particular speech segment; and
(c) determine a figure of merit for at least one of the subset of speech elements based on the at least one received image.
- 2. A system for recognizing speech as in claim 1 further comprising decision logic in communication with the audio speech recognizer and the visual speech recognizer, the decision logic determining a spoken speech element for each speech segment based on the subset of speech elements from the audio speech recognizer and on at least one figure of merit from the visual speech recognizer.
- 3. A system for recognizing speech as in claim 1 wherein the visual speech recognizer implements at least one hidden Markov model for determining at least one figure of merit.
- 4. A system for recognizing speech as in claim 3 wherein the hidden Markov model bases decisions on at least one feature extracted from at least one image acquired by the at least one visual transducer.
- 5. A system for recognizing speech as in claim 1, the visual speech recognizer converting signals received from the at least one visual transducer into at least one viseme, wherein at least one figure of merit is based on the at least one viseme.
- 6. A system for recognizing speech as in claim 1, the visual speech recognizer extracting at least one geometric feature from each of a sequence of frames received from the at least one visual transducer, wherein at least one figure of merit is based on the at least one extracted geometric feature.
- 7. A system for recognizing speech as in claim 1, the visual speech recognizer determining visual motion of lips of the speaker from a plurality of frames received from the at least one visual transducer, wherein at least one figure of merit is based on the determined lip motion.
- 8. A system for recognizing speech as in claim 1, the visual speech recognizer fitting at least one model to an image of lips received from the at least one visual transducer, wherein the at least one figure of merit is based on the at least one fitted model.
- 9. A system for recognizing speech as in claim 1 wherein at least one speech element comprises a phoneme.
- 10. A system for recognizing speech as in claim 1 wherein at least one speech element comprises a word.
- 11. A system for recognizing speech as in claim 1 wherein at least one speech element comprises a phrase.
- 12. A system for recognizing speech as in claim 1 wherein the visual speech recognizer represents speech elements with a plurality of models, the visual speech recognizer limiting the models considered to determine the figures of merit to only those models representing speech elements in the subset received from the audio speech recognizer.
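Claims 1, 2, and 12 above describe an audio recognizer that proposes a subset of candidate speech elements, a visual recognizer that scores only the models for those candidates, and decision logic that combines the two. The following Python sketch illustrates that flow under invented candidate scores and stand-in visual models; none of the names, scores, or the 50/50 weighting come from the patent.

```python
# Hypothetical sketch of claims 1, 2 and 12: the audio recognizer proposes
# an n-best subset, the visual recognizer evaluates only the models for
# elements in that subset, and decision logic combines the scores.

def audio_subset(audio_scores, n_best=3):
    """Return the n_best speech elements most likely to match the segment."""
    ranked = sorted(audio_scores, key=audio_scores.get, reverse=True)
    return ranked[:n_best]

def visual_figures_of_merit(subset, visual_models, image_features):
    """Claim 12: consider only the visual models for elements in the
    audio subset, returning one figure of merit per candidate."""
    return {elem: visual_models[elem](image_features) for elem in subset}

def decide(audio_scores, visual_scores, w=0.5):
    """Decision logic (claim 2): weighted combination of both scores."""
    combined = {e: (1 - w) * audio_scores[e] + w * visual_scores[e]
                for e in visual_scores}
    return max(combined, key=combined.get)

# Toy example: "b" and "p" are acoustically confusable; the lips decide.
audio = {"b": 0.40, "p": 0.38, "k": 0.05, "t": 0.04}
models = {e: (lambda feats, e=e: feats.get(e, 0.0)) for e in audio}
subset = audio_subset(audio)                     # top-3 audio candidates
fom = visual_figures_of_merit(subset, models, {"p": 0.9, "b": 0.2})
best = decide(audio, fom)
```

The point of the claim-12 restriction is efficiency: the visual recognizer never evaluates models for elements the audio channel has already ruled out.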
- 13. A method for recognizing speech from a speaker, the method comprising:
receiving a sequence of audio speech segments from the speaker;
for each of at least one of the audio speech segments, determining a subset of possible speech elements most probably spoken by the speaker during the audio speech segment;
receiving at least one image of the speaker corresponding to the audio speech segment;
extracting at least one feature from the at least one image of the speaker; and
determining the most likely speech element from the subset of speech elements based on the at least one extracted feature.
- 14. A method for recognizing speech as in claim 13 wherein determining the most likely speech element comprises determining a video figure of merit for at least one speech element.
- 15. A method for recognizing speech as in claim 14 further comprising:
determining an audio figure of merit for each speech segment based on the audio speech segment; and
determining a spoken speech segment based on the audio figures of merit and the video figures of merit.
- 16. A method for recognizing speech as in claim 13 wherein determining the most likely speech element is based on at least one hidden Markov model.
- 17. A method for recognizing speech as in claim 13 wherein extracting at least one feature comprises determining at least one viseme.
- 18. A method for recognizing speech as in claim 13 wherein extracting at least one feature comprises extracting at least one geometric feature from at least one speaker image.
- 19. A method for recognizing speech as in claim 13 wherein extracting at least one feature comprises determining motion of the speaker in a plurality of frames.
- 20. A method for recognizing speech as in claim 13 wherein extracting at least one feature comprises determining at least one model fit to at least one region of the speaker's face.
- 21. A method for recognizing speech as in claim 13 wherein at least one speech element comprises a phoneme.
- 22. A method for recognizing speech as in claim 13 wherein at least one speech element comprises a word.
- 23. A method for recognizing speech as in claim 13 wherein at least one speech element comprises a phrase.
- 24. A method for recognizing speech as in claim 13 wherein a visual speech recognizer represents speech elements with a plurality of models, and wherein determining the most likely speech element from the subset of speech elements comprises considering only those visual speech recognizer models representing speech elements in the subset received from the audio speech recognizer.
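Claims 3 and 16 determine the visual figure of merit with a hidden Markov model over features extracted from the images. A minimal sketch of that idea follows, using the standard forward algorithm over a quantized lip feature; the two-state models, the observation alphabet, and all probability values are invented for illustration and are not taken from the patent.

```python
# Hedged sketch of claims 3 and 16: one small discrete-emission HMM per
# candidate speech element scores a sequence of quantized lip features.

def forward_likelihood(obs, start, trans, emit):
    """P(obs | model) computed by the forward algorithm."""
    alpha = [start[s] * emit[s][obs[0]] for s in range(len(start))]
    for o in obs[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in range(len(alpha)))
                 * emit[s][o] for s in range(len(start))]
    return sum(alpha)

# Observations: 0 = lips closed, 1 = lips open (a crude visual feature).
models = {
    "p": dict(start=[1.0, 0.0],                      # closure then release
              trans=[[0.6, 0.4], [0.1, 0.9]],
              emit=[{0: 0.9, 1: 0.1}, {0: 0.2, 1: 0.8}]),
    "a": dict(start=[0.5, 0.5],                      # mouth stays open
              trans=[[0.5, 0.5], [0.5, 0.5]],
              emit=[{0: 0.1, 1: 0.9}, {0: 0.1, 1: 0.9}]),
}

frames = [0, 0, 1, 1]          # closed, closed, open, open
fom = {e: forward_likelihood(frames, m["start"], m["trans"], m["emit"])
       for e, m in models.items()}
best = max(fom, key=fom.get)   # visually most likely element
```

Here the closed-then-open frame sequence fits the "p" model (a bilabial closure followed by a release) far better than the always-open "a" model.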
- 25. A system for enhancing speech spoken by a speaker comprising:
at least one visual transducer with a view of the speaker;
at least one audio transducer receiving the spoken speech;
a visual speech recognizer in communication with the at least one visual transducer, the visual speech recognizer estimating at least one visual speech parameter for each segment of speech; and
a variable filter filtering output from at least one of the audio transducers, the variable filter having at least one parameter value based on the at least one estimated visual speech parameter.
- 26. A system for enhancing speech as in claim 25 wherein the at least one visual speech parameter comprises at least one viseme.
- 27. A system for enhancing speech as in claim 25 wherein the variable filter comprises at least one discrete filter.
- 28. A system for enhancing speech as in claim 25 wherein the variable filter comprises at least one wavelet-based filter.
- 29. A system for enhancing speech as in claim 25 wherein the variable filter comprises a plurality of parallel filters with adaptive filter coefficients.
- 30. A system for enhancing speech as in claim 25 wherein the variable filter comprises a serially arranged bank of filters implementing a cochlear inner-ear model.
- 31. A system for enhancing speech as in claim 25 wherein the variable filter changes at least one filter bandwidth based on the at least one visual speech parameter.
- 32. A system for enhancing speech as in claim 25 wherein the variable filter changes at least one filter cut-off frequency based on the at least one visual speech parameter.
- 33. A system for enhancing speech as in claim 25 wherein the variable filter changes at least one filter gain based on the at least one visual speech parameter.
- 34. A system for enhancing speech as in claim 25 further comprising an audio speech recognizer in communication with the variable filter, the audio speech recognizer generating speech representations based on the filtered output of the at least one audio transducer.
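Claims 25 and 31-33 describe a filter whose bandwidth, cut-off frequency, or gain is set from a visual speech parameter such as a viseme. The sketch below shows one way that could look, using a one-pole low-pass filter whose cut-off and gain are looked up from the current viseme; the viseme labels and the parameter mapping are invented for the example, not specified in the patent.

```python
# Hypothetical sketch of claims 25, 32 and 33: a viseme label selects the
# cut-off frequency and gain of a one-pole low-pass filter applied to the
# audio samples for the corresponding speech segment.

import math

# Invented mapping: open-mouth visemes pass more high-frequency energy.
VISEME_PARAMS = {
    "closed":    {"cutoff_hz": 1000.0, "gain": 0.8},
    "open":      {"cutoff_hz": 4000.0, "gain": 1.0},
    "protruded": {"cutoff_hz": 2000.0, "gain": 0.9},
}

def variable_filter(samples, viseme, fs=16000.0):
    """One-pole low-pass with viseme-dependent cut-off and gain."""
    p = VISEME_PARAMS[viseme]
    alpha = 1.0 - math.exp(-2.0 * math.pi * p["cutoff_hz"] / fs)
    y, out = 0.0, []
    for x in samples:
        y += alpha * (x - y)            # low-pass recursion
        out.append(p["gain"] * y)
    return out

# The same audio segment, filtered under two different visemes.
segment = [1.0] + [0.0] * 15            # impulse-like test input
quiet = variable_filter(segment, "closed")
bright = variable_filter(segment, "open")
```

A practical system would update the lookup once per video frame, so the filter tracks the speaker's mouth; claims 27-30 cover richer filter structures (discrete, wavelet-based, parallel adaptive banks, cochlear models) than this single pole.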
- 35. A method of enhancing speech from a speaker comprising:
receiving a sequence of images of the speaker for a speech segment;
determining at least one visual speech parameter for the speech segment based on the sequence of images;
receiving an audio signal corresponding to the speech segment; and
variably filtering the received audio signal based on the determined at least one visual speech parameter.
- 36. A method of enhancing speech as in claim 35 wherein determining at least one visual speech parameter comprises determining a viseme.
- 37. A method of enhancing speech as in claim 35 wherein variable filtering comprises changing at least one filter bandwidth based on the at least one visual speech parameter.
- 38. A method of enhancing speech as in claim 35 wherein variable filtering comprises changing at least one filter gain based on the at least one visual speech parameter.
- 39. A method of enhancing speech as in claim 35 wherein variable filtering comprises changing at least one filter cut-off frequency based on the determined at least one visual speech parameter.
- 40. A method of enhancing speech as in claim 35 further comprising generating a speech representation based on the variably filtered audio signal.
- 41. A method of enhancing speech from a speaker comprising:
receiving a sequence of images of the speaker for a speech segment;
determining at least one visual speech parameter for the speech segment based on the sequence of images;
receiving an audio signal corresponding to the speech segment; and
editing the received audio signal based on the determined at least one visual speech parameter.
- 42. A method of enhancing speech as in claim 41 wherein editing comprises cutting out at least a section of the audio signal.
- 43. A method of enhancing speech as in claim 41 wherein editing comprises inserting a section of speech into the audio signal.
- 44. A method of enhancing speech as in claim 41 wherein editing comprises superposition of another audio section upon a section of the audio signal.
- 45. A method of enhancing speech as in claim 41 wherein editing comprises replacing a section of the audio signal with another audio section.
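Claims 41-45 edit the audio signal, rather than filter it, based on the visual channel: cutting, inserting, superposing, or replacing sections. The toy Python functions below mirror those four operations on a list of samples; the sample values and the "lips closed" cue used to locate the edit are invented for illustration.

```python
# Illustrative sketch of claims 41-45: a visual cue flags a section of the
# signal (e.g. the lips were closed, so any audio there is presumed noise)
# and the audio is edited accordingly.

def cut(audio, start, end):
    """Claim 42: cut out a section of the audio signal."""
    return audio[:start] + audio[end:]

def insert(audio, at, section):
    """Claim 43: insert a section of speech into the audio signal."""
    return audio[:at] + section + audio[at:]

def superpose(audio, at, section):
    """Claim 44: superpose another audio section upon a section."""
    out = list(audio)
    for i, s in enumerate(section):
        out[at + i] += s
    return out

def replace(audio, start, end, section):
    """Claim 45: replace a section with another audio section."""
    return audio[:start] + section + audio[end:]

audio = [0.1, 0.2, 0.9, 0.8, 0.1]
lips_closed = (2, 4)                    # visual cue: no speech in samples 2-3
cleaned = replace(audio, *lips_closed, [0.0, 0.0])
```

Real samples would of course be frames of PCM audio rather than single floats, and the visual cue would come from the claim-41 step of determining a visual speech parameter per segment.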
- 46. A method of detecting speech comprising:
using at least one visual cue about a speaker to filter an audio signal containing the speech;
determining a plurality of possible speech elements for each segment of the speech from the filtered audio signal; and
selecting among the plurality of possible speech elements based on the at least one visual cue.
- 47. A method of detecting speech as in claim 46 wherein the at least one visual cue comprises at least one viseme.
- 48. A method of detecting speech as in claim 46 wherein the at least one visual cue comprises extracting at least one geometric feature from at least one speaker image.
- 49. A method of detecting speech as in claim 46 wherein the at least one visual cue comprises determining speaker motion in a plurality of image frames.
- 50. A method of detecting speech as in claim 46 wherein the at least one visual cue comprises determining at least one model fit to at least one speaker image.
- 51. A method of detecting speech as in claim 46 wherein the at least one visual cue used to filter the audio signal is different from the at least one visual cue for selecting among possible speech elements.
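Claims 46 and 51 chain the two uses of vision: one visual cue tunes the audio filter, and a (possibly different) visual cue selects among the speech elements the filtered audio produces. The sketch below separates the two cues as claim 51 requires; the cue values, the candidate set, and the scoring rules are all hypothetical stand-ins.

```python
# Hedged sketch of claims 46 and 51: mouth openness (cue 1) tunes the audio
# front end, while lip rounding (cue 2) selects among the candidates.

def filter_audio(samples, mouth_openness):
    """Visual cue 1: scale the signal by how open the mouth is."""
    return [mouth_openness * s for s in samples]

def candidate_elements(filtered):
    """Stand-in acoustic stage: propose confusable candidates when the
    filtered signal carries enough energy, otherwise silence."""
    return ["u", "i"] if max(filtered) > 0.5 else ["(silence)"]

def select(candidates, lip_rounding):
    """Visual cue 2: rounded lips favour 'u', spread lips favour 'i'."""
    if candidates == ["(silence)"]:
        return "(silence)"
    return "u" if lip_rounding > 0.5 else "i"

samples = [0.2, 0.9, 0.4]
filtered = filter_audio(samples, mouth_openness=0.8)
spoken = select(candidate_elements(filtered), lip_rounding=0.9)
```

Using distinct cues at the two stages is the substance of claim 51: a coarse cue (is the mouth open at all?) is enough to gate the filter, while a finer articulatory cue disambiguates the surviving candidates.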
CROSS-REFERENCE TO RELATED APPLICATION
[0001] This application claims the benefit of U.S. provisional application Serial No. 60/236,720, filed Oct. 2, 2000, which is incorporated herein by reference in its entirety.
Provisional Applications (1)

| Number | Date | Country |
| --- | --- | --- |
| 60236720 | Oct 2000 | US |