Claims
- 1. A method comprising:
identifying an initial set of pitch value candidates within each frame of a plurality of frames of received audio content utilizing a first pitch estimation algorithm; reducing the initial set of pitch value candidates to a select set of pitch value candidates based, at least in part, on pitch value re-scoring utilizing a second pitch estimation algorithm, wherein the select set of pitch values are selected in substantially real-time; and associating at least some of the select pitch value candidates with at least one speech phoneme in substantially real-time.
- 2. The method as recited in claim 1, wherein the associating further comprises calculating a transition probability between one of the select pitch value candidates and a select pitch value candidate of an adjacent frame of audio content; and
selecting a pitch value within each frame with the highest transition probability between adjacent frames as the pitch value for the frame.
- 3. The method as recited in claim 2, wherein the transition probability is based, at least in part, on dynamic programming configured to determine a significantly best path between different pitch candidates of adjacent frames.
- 4. The method as recited in claim 2, further comprising smoothing a curve representing the select pitch values over a plurality of frames based at least in part on other information, wherein the other information includes one or more of an energy value for each frame, a zero crossing rate of the audio content, and/or a vocal tract spectrum of the audio content.
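Claims 2 through 4 describe picking, per frame, the candidate with the highest transition probability to the adjacent frame, using dynamic programming to find the best path through the candidates. The claims do not specify the scoring function; as an illustration only, the following is a minimal Viterbi-style sketch in which the transition score (a hypothetical penalty on log-pitch jumps) and all parameter values are assumptions, not taken from the claims:

```python
# Illustrative Viterbi-style dynamic programming over per-frame pitch
# candidates. The transition score below (penalizing large log-pitch jumps)
# is a hypothetical choice, not the one specified in the claims.
import math

def best_pitch_path(candidates, local_scores, jump_penalty=4.0):
    """candidates: list of lists of pitch values (Hz), one list per frame.
    local_scores: matching list of lists of per-candidate scores.
    Returns one pitch value per frame along the best-scoring path."""
    n = len(candidates)
    # score accumulated along the best path ending at each candidate
    acc = [list(local_scores[0])]
    back = [[0] * len(candidates[0])]
    for t in range(1, n):
        acc_t, back_t = [], []
        for j, pitch in enumerate(candidates[t]):
            best, arg = -math.inf, 0
            for i, prev in enumerate(candidates[t - 1]):
                # transition score: small jumps in log-pitch are favored
                trans = -jump_penalty * abs(math.log(pitch) - math.log(prev))
                s = acc[t - 1][i] + trans
                if s > best:
                    best, arg = s, i
            acc_t.append(best + local_scores[t][j])
            back_t.append(arg)
        acc.append(acc_t)
        back.append(back_t)
    # backtrack from the best final candidate
    j = max(range(len(acc[-1])), key=acc[-1].__getitem__)
    path = [candidates[-1][j]]
    for t in range(n - 1, 0, -1):
        j = back[t][j]
        path.append(candidates[t - 1][j])
    return path[::-1]
```

With equal transition conditions, the path follows whichever pitch track accumulates the better local scores while avoiding octave-sized jumps between frames.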
- 5. The method as recited in claim 1, wherein identifying the initial set of pitch value candidates within each frame comprises:
passing each frame of audio content through an average magnitude difference function (AMDF); and selecting N near-zero minima pitch values in the audio content as the initial set of pitch value candidates.
- 6. The method as recited in claim 5, wherein N is set to 288 pitch value candidates, selected as the initial set of pitch value candidates based, at least in part, on the AMDF.
- 7. The method as recited in claim 1, wherein identifying a select set of pitch values comprises:
generating a local score for each of the initial set of pitch value candidates utilizing a normalized cross-correlation function (NCCF); and selecting M pitch value candidates with the highest local score.
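Claim 7 re-scores the candidates with a normalized cross-correlation function (NCCF) and keeps the M highest-scoring ones. A hedged sketch, in which the window length and the way the two correlated segments are formed are illustrative assumptions:

```python
# Illustrative NCCF re-scoring of candidate pitches. The window length and
# the frame/segment split are common choices, not details fixed by the claim.
import math

def nccf(frame, lag, window):
    """Normalized cross-correlation of a frame with itself at a lag."""
    a = frame[:window]
    b = frame[lag:lag + window]
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den else 0.0

def rescore(frame, candidate_hz, sample_rate=8000, m_keep=2, window=160):
    """Keep the m_keep candidates with the highest NCCF local score."""
    scored = []
    for hz in candidate_hz:
        lag = int(round(sample_rate / hz))
        scored.append((nccf(frame, lag, window), hz))
    scored.sort(reverse=True)
    return [hz for _, hz in scored[:m_keep]]
```

A candidate at the true pitch period correlates near 1.0 with the lag-shifted frame, so true-period candidates survive the cut while spurious AMDF minima are discarded.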
- 8. The method as recited in claim 1, further comprising comparing a sequence of multiple phonemes associated with corresponding select pitch value candidates from multiple adjacent frames of the audio content with a language model to determine a syllable of speech in substantially real time.
- 9. The method as recited in claim 8, wherein the language model comprises at least in part one or more syllable-based speech and text corpora.
- 10. The method as recited in claim 1, further comprising comparing a temporal sequence of the phonemes corresponding to adjacent frames of the audio content with a language model to determine a syllable of speech in substantially real time.
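Claims 8 through 10 compare phoneme sequences from adjacent frames against a language model to resolve syllables. The claims leave the model unspecified; purely as a toy illustration, a syllable lexicon lookup with an invented lexicon might look like this:

```python
# Toy sketch of matching a phoneme sequence against a syllable lexicon.
# The lexicon below is invented for illustration; the claims only require
# comparison with a language model built from syllable-based corpora.
SYLLABLES = {
    ("m", "a"): "ma",
    ("n", "i"): "ni",
    ("h", "a", "o"): "hao",
}

def to_syllables(phonemes):
    """Greedy longest-match segmentation of phonemes into syllables."""
    out, i = [], 0
    while i < len(phonemes):
        for length in (3, 2, 1):            # try the longest candidate first
            key = tuple(phonemes[i:i + length])
            if key in SYLLABLES:
                out.append(SYLLABLES[key])
                i += length
                break
        else:
            i += 1                          # skip an unmatched phoneme
    return out
```

A real system would score alternative segmentations probabilistically rather than matching greedily, but the lookup conveys the phoneme-to-syllable step the claims describe.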
- 11. A computer readable medium having computer instructions for performing acts comprising:
identifying an initial set of pitch values within frames of audio content utilizing a first pitch estimation algorithm; reducing the initial set of pitch values to a select set of pitch values based, at least in part, on pitch value re-scoring utilizing a second pitch estimation algorithm, wherein the select set of pitch values are determined in substantially real-time; and associating at least some of the pitch values from the select set with at least one speech phoneme in substantially real-time.
- 12. A computer readable medium as recited in claim 11, having further computer instructions for performing acts comprising:
calculating a transition probability between at least one of the pitch values of adjacent frames.
- 13. A computer readable medium as recited in claim 11, having further computer instructions for performing acts comprising:
within each frame of audio content, selecting a pitch value with the highest transition probability between adjacent frames as the pitch value representing the pitch of the frame.
- 14. A computer readable medium as recited in claim 11, wherein the transition probability is based, at least in part, on dynamic programming configured to determine a significantly best path between different pitch values of adjacent frames.
- 15. A computer readable medium as recited in claim 11, having further computer instructions for performing acts comprising:
smoothing a curve representing the pitch values of the select set over a plurality of frames based, at least in part, on other information.
- 16. A computer readable medium as recited in claim 15, wherein the other information includes one or more of an energy value for each frame, a zero crossing rate of the audio content, and/or a vocal tract spectrum of the audio content.
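The "other information" recited in claims 15 and 16 (and claim 4) includes per-frame energy and the zero crossing rate. These features are standard; a minimal sketch of one common way to compute them:

```python
# Illustrative per-frame features of the kind claim 16 lists as "other
# information" used when smoothing the pitch curve. These are standard
# definitions, not formulas specified by the claims.
def frame_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / (len(frame) - 1)
```

Low energy or a high zero crossing rate typically indicates unvoiced or silent audio, where pitch values are unreliable and the smoothed curve should not trust them.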
- 17. A computer readable medium as recited in claim 11, wherein identifying the initial set of pitch values within each frame comprises:
passing each frame of audio content through an average magnitude difference function (AMDF); and selecting N near-zero minima pitch values in the audio content as the initial set of pitch values.
- 18. A computer readable medium as recited in claim 17, wherein N is set to 288 pitch value candidates, selected as the initial set of pitch values based, at least in part, on the AMDF.
- 19. A computer readable medium as recited in claim 11, wherein identifying a select set of pitch values comprises:
generating a local score for each of the initial set of pitch values utilizing a normalized cross-correlation function (NCCF); and selecting M pitch values with the highest local score.
- 20. A computer readable medium as recited in claim 11, further comprising instructions to compare a sequence of multiple phonemes associated with corresponding select pitch value candidates from multiple adjacent frames of the audio content with a language model to determine a syllable of speech in substantially real time.
- 21. A computer readable medium as recited in claim 20, wherein the language model comprises at least in part one or more syllable-based speech and text corpora.
- 22. A computer readable medium as recited in claim 20, further comprising instructions to compare a temporal sequence of the phonemes corresponding to adjacent frames of the audio content with a language model to determine a syllable of speech in substantially real time.
- 23. An audio analysis engine, comprising:
a pitch tracker to:
receive audio content; identify an initial set of pitch value candidates within each frame of a plurality of frames of the received audio content utilizing a first pitch estimation algorithm; and reduce the initial set of pitch value candidates to a select set of pitch value candidates based, at least in part, on pitch value re-scoring utilizing a second pitch estimation algorithm, wherein the select set of pitch values are selected in substantially real-time; and a syllable recognition module to associate at least some of the select pitch value candidates determined by the pitch tracker with at least one speech phoneme in substantially real-time.
- 24. The audio analysis engine as recited in claim 23, wherein the pitch tracker calculates a transition probability between at least one of the select pitch value candidates of adjacent frames and selects a pitch value within each frame with the highest transition probability between adjacent frames as the pitch value for the frame.
- 25. The audio analysis engine as recited in claim 24, wherein the transition probability is based, at least in part, on dynamic programming configured to determine a significantly best path between different pitch candidates of adjacent frames.
- 26. The audio analysis engine as recited in claim 24, wherein the pitch tracker smooths a curve representing the select pitch values over a plurality of frames based, at least in part, on other information.
- 27. The audio analysis engine as recited in claim 26, wherein the other information includes one or more of an energy value for each frame, a zero crossing rate of the audio content, and/or a vocal tract spectrum of the audio content.
- 28. The audio analysis engine as recited in claim 23, wherein, to identify the initial set of pitch value candidates within each frame, the pitch tracker passes each frame of audio content through an average magnitude difference function (AMDF) and selects N near-zero minima pitch values in the audio content as the initial set of pitch value candidates.
- 29. The audio analysis engine as recited in claim 28, wherein N is set to 288 pitch value candidates, selected as the initial set of pitch value candidates based, at least in part, on the AMDF.
- 30. The audio analysis engine as recited in claim 23, wherein, to identify the select set of pitch values, the pitch tracker generates a local score for each of the initial set of pitch value candidates utilizing a normalized cross-correlation function (NCCF) and selects M pitch value candidates with the highest local score.
- 31. The audio analysis engine as recited in claim 23, wherein the syllable recognition module compares a sequence of multiple phonemes associated with corresponding select pitch value candidates from multiple adjacent frames of the audio content with a language model to determine a syllable of speech in substantially real time.
- 32. The audio analysis engine as recited in claim 31, wherein the language model comprises at least in part one or more syllable-based speech and text corpora.
- 33. The audio analysis engine as recited in claim 23, wherein the syllable recognition module compares a temporal sequence of the phonemes corresponding to adjacent frames of the audio content with a language model to determine a syllable of speech in substantially real time.
RELATED APPLICATIONS
[0001] This is a continuation of U.S. patent application Ser. No. 09/843,212 entitled, “A Method And Apparatus For Tracking Pitch In Audio Analysis,” to Eric I-Chao Chang and Jian Lai Zhou, filed Apr. 24, 2001.
Continuations (1)
| | Number | Date | Country |
|---|---|---|---|
| Parent | 09843212 | Apr 2001 | US |
| Child | 10860344 | Jun 2004 | US |