Claims
- 1. A system for speech verification of an utterance, comprising: a speech verifier configured to generate a confidence index for said utterance, said utterance containing frames of sound energy, said speech verifier including a noise suppressor, a pitch detector, and a confidence determiner that are stored in a memory device which is coupled to said system, said noise suppressor reducing noise in a frequency spectrum for each of said frames in said utterance, said each of said frames corresponding to a frame set that includes a selected number of previous frames, said noise suppressor summing frequency spectra of each frame set to produce a spectral sum for each of said frames in said utterance; and a processor coupled to said system to control said speech verifier.
- 2. The system of claim 1, wherein said spectral sum for each of said frames is calculated according to a formula: $Z_n(k) = \sum_{i=n}^{n-N+1} X_i(\beta_i k)$ where $Z_n(k)$ is said spectral sum for a frame n, $X_i(\beta_i k)$ is an adjusted frequency spectrum for a frame i for i equal to n through n−N+1, $\beta_i$ is a frame set scale for said frame i for i equal to n through n−N+1, and N is a selected total number of frames in said frame set.
- 3. The system of claim 2, wherein said frame set scale for said frame i for i equal to n through n−N+1 is selected so that a difference between said frequency spectrum for said frame n of said utterance and a frequency spectrum for said frame n−N+1 of said utterance is minimized.
- 4. The system of claim 1, wherein said pitch detector generates correlation values for each of said frames in said utterance and determines an optimum frequency index for each of said frames in said utterance.
- 5. The system of claim 1, wherein said pitch detector generates correlation values by applying a spectral comb window to said spectral sum for each of said frames in said utterance, and determines an optimum frequency index that corresponds to a maximum of said correlation values.
- 6. The system of claim 5, wherein said pitch detector generates said correlation values according to a formula: $P_n(k) = \sum_{i=1}^{N_1} W(ik)\,\log\left(\left|Z_n(ik)\right|\right), \quad k = K_0, \ldots, K_1$ where $P_n(k)$ are said correlation values for a frame n, $W(ik)$ is said spectral comb window, $Z_n(ik)$ is said spectral sum for said frame n, $K_0$ is a lower frequency index, $K_1$ is an upper frequency index, and $N_1$ is a selected number of teeth of said spectral comb window.
- 7. The system of claim 4, wherein said pitch detector generates alternate correlation values for each of said frames in said utterance and determines an optimum alternate frequency index for each of said frames in said utterance.
- 8. The system of claim 4, wherein said pitch detector generates alternate correlation values by applying an alternate spectral comb window to said spectral sum for each of said frames in said utterance, and determines an optimum alternate frequency index that corresponds to a maximum of said alternate correlation values.
- 9. The system of claim 7, wherein said pitch detector generates said alternate correlation values by a formula: $P'_n(k) = \sum_{i=2}^{N_1} W(ik)\,\log\left(\left|Z_n(ik)\right|\right), \quad k = K_0, \ldots, K_1$ where $P'_n(k)$ are said alternate correlation values for a frame n, $W(ik)$ is a spectral comb window, $Z_n(ik)$ is said spectral sum for said frame n, $K_0$ is a lower frequency index, $K_1$ is an upper frequency index, and $N_1$ is a selected number of teeth of said spectral comb window.
- 10. The system of claim 7, wherein said confidence determiner determines a frame confidence measure for each of said frames in said utterance by analyzing a maximum peak of said correlation values for each of said frames.
- 11. The system of claim 7, wherein said confidence determiner determines a frame confidence measure for each of said frames in said utterance according to a formula: $c_n = \begin{cases} 1 & \text{if } R_n Q > \gamma \text{ and } h_n = 1 \\ 0 & \text{otherwise} \end{cases}$ where $c_n$ is said frame confidence measure for a frame n, $R_n$ is a peak ratio for said frame n, $h_n$ is a harmonic index for said frame n, $\gamma$ is a predetermined constant, and Q is an inverse of a width of said maximum peak of said correlation values at a half-maximum point.
- 12. The system of claim 11, wherein said peak ratio is determined according to a formula: $R_n = \frac{P_{\text{peak}} - P_{\text{avg}}}{P_{\text{peak}}}$ where $R_n$ is said peak ratio for said frame n, $P_{\text{peak}}$ is said maximum of said correlation values, and $P_{\text{avg}}$ is an average of said correlation values.
- 13. The system of claim 11, wherein said harmonic index is determined by a formula: $h_n = \begin{cases} 1 & \text{if } k_n^{\prime *} = k_n^{*} \\ 0 & \text{otherwise} \end{cases}$ where $h_n$ is said harmonic index for said frame n, $k_n^{\prime *}$ is said optimum alternate frequency index for said frame n, and $k_n^{*}$ is said optimum frequency index for said frame n.
- 14. The system of claim 10, wherein said confidence determiner determines said confidence index for said utterance according to a formula: $C = \begin{cases} 1 & \text{if } c_n = c_{n-1} = c_{n-2} = 1 \text{ for any } n \text{ in the utterance} \\ 0 & \text{otherwise} \end{cases}$ where C is said confidence index for said utterance, $c_n$ is said frame confidence measure for a frame n, $c_{n-1}$ is a frame confidence measure for a frame n−1, and $c_{n-2}$ is a frame confidence measure for a frame n−2.
- 15. The system of claim 1, wherein said speech verifier further comprises a pre-processor that generates a frequency spectrum for each of said frames in said utterance.
- 16. The system of claim 15, wherein said pre-processor applies a Fast Fourier Transform to each of said frames in said utterance to generate said frequency spectrum for each of said frames in said utterance.
- 17. The system of claim 1, wherein said system is coupled to a voice-activated electronic system.
- 18. The system of claim 17, wherein said voice-activated electronic system is implemented in an automobile.
- 19. A method for speech verification of an utterance, comprising the steps of: generating a confidence index for said utterance by using a speech verifier, said utterance containing frames of sound energy, said speech verifier including a noise suppressor, a pitch detector, and a confidence determiner that are stored in a memory device which is coupled to an electronic system, said noise suppressor suppressing noise in a frequency spectrum for each of said frames in said utterance, said each of said frames in said utterance corresponding to a frame set that includes a selected number of previous frames, said noise suppressor summing frequency spectra of each frame set to produce a spectral sum for each of said frames in said utterance; and controlling said speech verifier with a processor that is coupled to said electronic system.
- 20. The method of claim 19, wherein said spectral sum for each of said frames in said utterance is calculated according to a formula: $Z_n(k) = \sum_{i=n}^{n-N+1} X_i(\beta_i k)$ where $Z_n(k)$ is said spectral sum for a frame n, $X_i(\beta_i k)$ is an adjusted frequency spectrum for a frame i for i equal to n through n−N+1, $\beta_i$ is a frame set scale for said frame i for i equal to n through n−N+1, and N is a selected total number of frames in said frame set.
- 21. The method of claim 20, wherein said frame set scale for said frame i for i equal to n through n−N+1 is selected so that a difference between said frequency spectrum for said frame n of said utterance and a frequency spectrum for said frame n−N+1 of said utterance is minimized.
- 22. The method of claim 19, further comprising the steps of generating correlation values for each of said frames in said utterance and determining an optimum frequency index for each of said frames in said utterance using said pitch detector.
- 23. The method of claim 19, wherein said pitch detector generates correlation values by applying a spectral comb window to said spectral sum for each of said frames in said utterance, and determines an optimum frequency index that corresponds to a maximum of said correlation values.
- 24. The method of claim 23, wherein said pitch detector generates said correlation values according to a formula: $P_n(k) = \sum_{i=1}^{N_1} W(ik)\,\log\left(\left|Z_n(ik)\right|\right), \quad k = K_0, \ldots, K_1$ where $P_n(k)$ are said correlation values for a frame n, $W(ik)$ is said spectral comb window, $Z_n(ik)$ is said spectral sum for said frame n, $K_0$ is a lower frequency index, $K_1$ is an upper frequency index, and $N_1$ is a selected number of teeth of said spectral comb window.
- 25. The method of claim 22, further comprising the steps of generating alternate correlation values for each of said frames in said utterance and determining an optimum alternate frequency index for each of said frames in said utterance using said pitch detector.
- 26. The method of claim 22, wherein said pitch detector generates alternate correlation values by applying an alternate spectral comb window to said spectral sum for each of said frames in said utterance, and determines an optimum alternate frequency index that corresponds to a maximum of said alternate correlation values.
- 27. The method of claim 25, wherein said pitch detector generates said alternate correlation values by a formula: $P'_n(k) = \sum_{i=2}^{N_1} W(ik)\,\log\left(\left|Z_n(ik)\right|\right), \quad k = K_0, \ldots, K_1$ where $P'_n(k)$ are said alternate correlation values for a frame n, $W(ik)$ is a spectral comb window, $Z_n(ik)$ is said spectral sum for said frame n, $K_0$ is a lower frequency index, $K_1$ is an upper frequency index, and $N_1$ is a selected number of teeth of said spectral comb window.
- 28. The method of claim 25, further comprising the step of determining a frame confidence measure for each of said frames in said utterance by analyzing a maximum peak of said correlation values for each of said frames using said confidence determiner.
- 29. The method of claim 25, wherein said confidence determiner determines a frame confidence measure for each of said frames in said utterance according to a formula: $c_n = \begin{cases} 1 & \text{if } R_n Q > \gamma \text{ and } h_n = 1 \\ 0 & \text{otherwise} \end{cases}$ where $c_n$ is said frame confidence measure for a frame n, $R_n$ is a peak ratio for said frame n, $h_n$ is a harmonic index for said frame n, $\gamma$ is a predetermined constant, and Q is an inverse of a width of said maximum peak of said correlation values at a half-maximum point.
- 30. The method of claim 29, wherein said peak ratio is determined according to a formula: $R_n = \frac{P_{\text{peak}} - P_{\text{avg}}}{P_{\text{peak}}}$ where $R_n$ is said peak ratio for said frame n, $P_{\text{peak}}$ is said maximum of said correlation values, and $P_{\text{avg}}$ is an average of said correlation values.
- 31. The method of claim 29, wherein said harmonic index is determined by a formula: $h_n = \begin{cases} 1 & \text{if } k_n^{\prime *} = k_n^{*} \\ 0 & \text{otherwise} \end{cases}$ where $h_n$ is said harmonic index for said frame n, $k_n^{\prime *}$ is said optimum alternate frequency index for said frame n, and $k_n^{*}$ is said optimum frequency index for said frame n.
- 32. The method of claim 28, wherein said confidence determiner determines said confidence index for said utterance according to a formula: $C = \begin{cases} 1 & \text{if } c_n = c_{n-1} = c_{n-2} = 1 \text{ for any } n \text{ in the utterance} \\ 0 & \text{otherwise} \end{cases}$ where C is said confidence index for said utterance, $c_n$ is said frame confidence measure for a frame n, $c_{n-1}$ is a frame confidence measure for a frame n−1, and $c_{n-2}$ is a frame confidence measure for a frame n−2.
- 33. The method of claim 19, further comprising the step of generating a frequency spectrum for each of said frames in said utterance using a pre-processor.
- 34. The method of claim 33, wherein said pre-processor applies a Fast Fourier Transform to each of said frames in said utterance to generate said frequency spectrum for each of said frames in said utterance.
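By way of illustration only, the formulas recited in claims 2, 6, 9, and 11 through 14 (and their method counterparts) may be sketched in Python as follows. This is a minimal sketch under stated assumptions and is not the implementation disclosed in the application. The function names are hypothetical. The claims do not fix the form of the spectral comb window, so W(ik) = 1/i is assumed here. Likewise assumed are the nearest-bin rounding of β_i·k, the small floor added before taking the logarithm, the guard for a non-positive peak, and the measurement of the half-maximum width as a count of contiguous frequency indices.

```python
import math


def spectral_sum(spectra, scales):
    """Claim 2: Z_n(k) = sum over the frame set of X_i(beta_i * k).

    Assumption: beta_i * k is rounded to the nearest bin, and bins that
    fall outside the spectrum contribute nothing.
    """
    size = len(spectra[0])
    total = [0.0] * size
    for spectrum, beta in zip(spectra, scales):
        for k in range(size):
            j = int(round(beta * k))
            if 0 <= j < size:
                total[k] += spectrum[j]
    return total


def comb_correlation(z, k_lo, k_hi, teeth, first_tooth=1):
    """Claims 6 and 9: P_n(k) = sum_i W(ik) * log|Z_n(ik)|, k = K0..K1.

    first_tooth=1 gives the correlation values; first_tooth=2 gives the
    alternate values. Assumption: W(ik) = 1/i, a hypothetical comb weight.
    """
    eps = 1e-12  # assumed floor so log() never sees zero
    values = {}
    for k in range(k_lo, k_hi + 1):
        acc = 0.0
        for i in range(first_tooth, teeth + 1):
            if i * k < len(z):  # teeth beyond the spectrum are skipped
                acc += (1.0 / i) * math.log(abs(z[i * k]) + eps)
        values[k] = acc
    return values


def frame_confidence(p, p_alt, gamma):
    """Claims 11 to 13: c_n = 1 if R_n * Q > gamma and h_n = 1, else 0."""
    k_star = max(p, key=p.get)              # optimum frequency index
    k_alt_star = max(p_alt, key=p_alt.get)  # optimum alternate index
    harmonic = 1 if k_alt_star == k_star else 0
    peak = p[k_star]
    if peak <= 0.0:  # assumed guard: the peak ratio is not meaningful here
        return 0
    ratio = (peak - sum(p.values()) / len(p)) / peak
    # Assumed width: contiguous indices at or above half the maximum.
    width = 1
    for step in (-1, 1):
        k = k_star + step
        while k in p and p[k] >= peak / 2.0:
            width += 1
            k += step
    q = 1.0 / width
    return 1 if ratio * q > gamma and harmonic == 1 else 0


def utterance_confidence(frame_flags):
    """Claim 14: C = 1 if any three consecutive frames all have c_n = 1."""
    return 1 if any(
        frame_flags[n] == frame_flags[n - 1] == frame_flags[n - 2] == 1
        for n in range(2, len(frame_flags))
    ) else 0
```

In this sketch, calling `comb_correlation` once with the default `first_tooth=1` and once with `first_tooth=2` yields the two sets of values that `frame_confidence` compares to obtain the harmonic index, and the per-frame results are then passed as a list to `utterance_confidence`.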
CROSS-REFERENCE TO RELATED APPLICATION
This application is related to, and claims priority in, U.S. Provisional Patent Application Ser. No. 60/099,739, entitled “Speech Verification Method For Isolated Word Speech Recognition,” filed on Sep. 10, 1998. The related applications are commonly assigned.
Provisional Applications (1)
Number | Date | Country
60/099739 | Sep 1998 | US