Claims
- 1. A method comprising:
separating at least a portion of an audio signal into a plurality of frames; extracting line spectrum pairs from each of the plurality of frames; and using at least the line spectrum pairs to classify at least the portion as either speech or non-speech.
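The pipeline of claim 1 (frame the signal, extract line spectrum pairs per frame) can be sketched as below. This is a minimal illustrative sketch, not the claimed implementation: the LPC order (10), the Hann window, the frame/hop sizes, and the synthetic test signal are all assumptions. LSPs are computed the standard way, as the unit-circle root angles of the symmetric and antisymmetric polynomials built from the LPC polynomial.

```python
import numpy as np

def frame_signal(signal, frame_len, hop):
    """Split a 1-D signal into overlapping frames."""
    n = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop:i * hop + frame_len] for i in range(n)])

def lpc(frame, order):
    """Linear-prediction coefficients via the autocorrelation method
    (Levinson-Durbin recursion)."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:len(frame) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        new_a = a.copy()
        new_a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a

def lpc_to_lsp(a):
    """Line spectrum pairs: angles in (0, pi) of the unit-circle roots of
    P(z) = A(z) + z^-(p+1) A(1/z) and Q(z) = A(z) - z^-(p+1) A(1/z)."""
    a_ext = np.concatenate([a, [0.0]])
    P = a_ext + a_ext[::-1]   # symmetric polynomial (trivial root at z = -1)
    Q = a_ext - a_ext[::-1]   # antisymmetric polynomial (trivial root at z = +1)
    angles = []
    for poly in (P, Q):
        for root in np.roots(poly):
            ang = np.angle(root)
            if 1e-6 < ang < np.pi - 1e-6:   # keep one angle per conjugate pair
                angles.append(ang)
    return np.sort(np.array(angles))

# Extract one 10-dimensional LSP vector per frame of a synthetic signal.
rng = np.random.default_rng(0)
t = np.arange(4096)
signal = np.sin(2 * np.pi * 0.01 * t) + 0.1 * rng.standard_normal(4096)
frames = frame_signal(signal, 512, 256)
window = np.hanning(512)
lsps = np.stack([lpc_to_lsp(lpc(f * window, 10)) for f in frames])
```

For an order-p predictor this yields p LSP frequencies per frame, sorted in ascending order, which is the per-frame feature vector the later claims build on.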
- 2. A method as recited in claim 1, wherein the using comprises:
generating an input Gaussian Model corresponding to the plurality of frames based on the extracted line spectrum pairs; comparing the input Gaussian Model to a Vector Quantization codebook including a plurality of trained Gaussian Models; identifying one of the plurality of trained Gaussian Models that is closest to the input Gaussian Model; determining a distance between the input Gaussian Model and the closest trained Gaussian Model; and classifying at least the portion as speech if the distance is less than a threshold value.
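The codebook comparison of claim 2 can be sketched as follows. The claim does not specify the distance measure, so the symmetrised Kullback-Leibler divergence between Gaussians used here is an assumption, as are the threshold value and the synthetic "trained" models standing in for a real Vector Quantization codebook.

```python
import numpy as np

def fit_gaussian(vectors):
    """Fit a single Gaussian (mean, covariance) to per-frame feature vectors."""
    mu = vectors.mean(axis=0)
    cov = np.cov(vectors, rowvar=False) + 1e-6 * np.eye(vectors.shape[1])
    return mu, cov

def kl_gauss(m0, S0, m1, S1):
    """KL divergence KL(N0 || N1) between two multivariate Gaussians."""
    d = len(m0)
    S1_inv = np.linalg.inv(S1)
    diff = m1 - m0
    _, ld0 = np.linalg.slogdet(S0)
    _, ld1 = np.linalg.slogdet(S1)
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - d + ld1 - ld0)

def classify(input_model, codebook, threshold):
    """Find the closest trained model (symmetrised KL) and threshold it."""
    dists = [0.5 * (kl_gauss(*input_model, *g) + kl_gauss(*g, *input_model))
             for g in codebook]
    best = int(np.argmin(dists))
    label = 'speech' if dists[best] < threshold else 'non-speech'
    return label, best, dists[best]

# Toy codebook: one model resembling the input, one far away.
rng = np.random.default_rng(1)
speechlike = fit_gaussian(rng.normal(0.0, 1.0, size=(500, 4)))
other = fit_gaussian(rng.normal(5.0, 1.0, size=(500, 4)))
codebook = [speechlike, other]
input_model = fit_gaussian(rng.normal(0.0, 1.0, size=(200, 4)))
label, best, dist = classify(input_model, codebook, threshold=1.0)
```

Because the input samples are drawn from the same distribution as the first codebook entry, its distance stays below the threshold and the portion is labelled speech.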
- 3. A method as recited in claim 1, wherein the using comprises:
generating an input Gaussian Model corresponding to the plurality of frames based on the extracted line spectrum pairs; comparing the input Gaussian Model to a Vector Quantization codebook including a plurality of trained Gaussian Models; identifying one of the plurality of trained Gaussian Models that is closest to the input Gaussian Model; determining a distance between the input Gaussian Model and the closest trained Gaussian Model; and classifying at least the portion as non-speech if the distance is greater than a first threshold value.
- 4. A method as recited in claim 3, further comprising:
determining an energy distribution of the plurality of frames in a first bandwidth; and classifying at least the portion as non-speech if the distance is greater than a second threshold value and the energy distribution of the plurality of frames in the first bandwidth is less than a third threshold value, wherein the second threshold value is less than the first threshold value.
- 5. A method as recited in claim 4, further comprising:
determining an energy distribution of the plurality of frames in a second bandwidth; and classifying at least the portion as speech if the distance is less than the second threshold value and the energy distribution of the plurality of frames in the second bandwidth is greater than a fourth threshold value.
- 6. A method as recited in claim 5, further comprising otherwise classifying at least the portion as speech.
- 7. A method as recited in claim 2, further comprising:
extracting a high zero crossing rate ratio feature from the plurality of frames; extracting a low short time energy ratio feature from the plurality of frames; extracting a spectrum flux feature from the plurality of frames; pre-classifying the portion as speech or non-speech based at least in part on an average zero crossing rate, the high zero crossing rate ratio, the low short time energy ratio, and the spectrum flux features; using a first value as the threshold value if the portion is pre-classified as speech; and using a second value as the threshold value if the portion is pre-classified as non-speech, wherein the second value is less than the first value.
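The pre-classification features of claim 7 can be sketched as below. The claim names the features but not their formulas, so the definitions here (and the 1.5x and 0.5x comparison factors) are conventional illustrative choices, not values taken from the claims.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent-sample pairs whose signs differ."""
    return float(np.mean(np.signbit(frame[:-1]) != np.signbit(frame[1:])))

def preclassification_features(frames):
    zcrs = np.array([zero_crossing_rate(f) for f in frames])
    energies = np.array([np.mean(f ** 2) for f in frames])
    avg_zcr = zcrs.mean()
    # High zero-crossing-rate ratio: fraction of frames well above the mean ZCR.
    hzcrr = float(np.mean(zcrs > 1.5 * avg_zcr))
    # Low short-time-energy ratio: fraction of frames well below the mean energy.
    lster = float(np.mean(energies < 0.5 * energies.mean()))
    # Spectrum flux: mean squared change between successive magnitude spectra.
    specs = np.stack([np.abs(np.fft.rfft(f)) for f in frames])
    flux = float(np.mean(np.sum(np.diff(specs, axis=0) ** 2, axis=1)))
    return avg_zcr, hzcrr, lster, flux

rng = np.random.default_rng(2)
frames = rng.standard_normal((20, 256))
avg_zcr, hzcrr, lster, flux = preclassification_features(frames)
```

Speech tends to alternate voiced and unvoiced segments, so it typically shows higher values of these ratio features than music does; a pre-classifier built on them then selects the stricter or looser distance threshold, as the claim recites.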
- 8. One or more computer-readable memories containing a computer program that is executable by a processor to perform the method recited in claim 1.
- 9. A method for determining when a speaker changes, the method comprising:
separating at least a portion of an audio signal into a plurality of frames; extracting line spectrum pairs from each of the plurality of frames; and determining when a speaker of the audio signal changes based at least in part on the line spectrum pairs.
- 10. A method as recited in claim 9, wherein the determining comprises:
calculating a difference between line spectrum pairs for successive frames of the plurality of frames; if the difference between two line spectrum pairs exceeds a threshold value, then determining that the speaker has changed, otherwise determining that the speaker has not changed.
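The speaker-change test of claims 9 and 10 reduces to thresholding the frame-to-frame difference of LSP vectors. A minimal sketch follows; the Euclidean norm as the difference measure and the threshold value are assumptions, since the claims leave both open.

```python
import numpy as np

def speaker_change_points(lsp_vectors, threshold):
    """Return frame indices where the LSP vector jumps by more than
    `threshold` relative to the previous frame (assumed Euclidean metric)."""
    diffs = np.linalg.norm(np.diff(lsp_vectors, axis=0), axis=1)
    return list(np.flatnonzero(diffs > threshold) + 1)

# Toy sequence: five frames from one "speaker", then five from another.
lsps = np.concatenate([np.full((5, 10), 0.3), np.full((5, 10), 0.9)])
changes = speaker_change_points(lsps, threshold=0.5)
```

Here the only large jump is at frame 5, where the synthetic LSP vectors switch, so that single index is reported as a speaker change.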
- 11. One or more computer-readable memories containing a computer program that is executable by a processor to perform the method recited in claim 9.
- 12. An apparatus comprising:
a line spectrum pair (LSP) analyzer to extract line spectrum pairs from a portion of an audio signal; and a speech discriminator, communicatively coupled to the LSP analyzer, to classify the portion of the audio signal as either speech or non-speech based at least in part on the line spectrum pairs extracted by the LSP analyzer.
- 13. An apparatus as recited in claim 12, further comprising:
a distance calculator, communicatively coupled to the LSP analyzer, to determine a distance between at least one trained Gaussian Model and an input Gaussian Model based on the extracted line spectrum pairs; and wherein the speech discriminator is further to classify the portion of the audio signal as either speech or non-speech based at least in part on the distance between the at least one trained Gaussian Model and the input Gaussian Model.
- 14. An apparatus as recited in claim 12, further comprising:
a Fast Fourier Transform (FFT) analyzer to extract Fast Fourier Transform features from the portion of the audio signal; an energy distribution calculator, communicatively coupled to both the FFT analyzer and the speech discriminator, to determine an energy distribution of the portion of the audio signal in at least one bandwidth; and wherein the speech discriminator is further to classify the portion of the audio signal as either speech or non-speech based at least in part on the energy distribution of the portion of the audio signal in the at least one bandwidth.
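The energy-distribution element of claim 14 amounts to measuring what fraction of a portion's spectral energy lies in a given band. A sketch under assumed parameters (sample rate, band edges, and the pure-tone test input are all illustrative):

```python
import numpy as np

def band_energy_ratio(frame, sample_rate, f_lo, f_hi):
    """Fraction of total spectral energy falling in [f_lo, f_hi) Hz,
    computed from the FFT magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    in_band = (freqs >= f_lo) & (freqs < f_hi)
    return float(spectrum[in_band].sum() / spectrum.sum())

# A 1 kHz tone at 8 kHz sampling: nearly all energy sits in a 900-1100 Hz band.
fs = 8000
t = np.arange(1024) / fs
tone = np.sin(2 * np.pi * 1000 * t)
ratio = band_energy_ratio(tone, fs, 900, 1100)
```

The speech discriminator would compare such a ratio against the claimed energy-distribution thresholds alongside the model distance.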
- 15. One or more computer-readable media having stored thereon a computer program to classify a portion of an audio signal as speech, music, silence, or environment sound, wherein the computer program, when executed by one or more processors, causes the one or more processors to perform acts including:
(a) analyzing line spectrum pair features of the portion to determine if the portion is speech; (b) analyzing energy features of the portion to determine if the portion is silence; (c) analyzing periodicity features of the portion to determine if the portion is music or environment sound; and (d) classifying the portion as speech, music, silence, or environment sound based on at least one of the analyzing acts (a)-(c).
- 16. One or more computer-readable media as recited in claim 15, wherein the computer program is further to cause the one or more processors to perform the acts (a)-(d) in the order (a), then (b), then (c), then (d).
- 17. One or more computer-readable media as recited in claim 16, wherein the computer program is further to cause the one or more processors to perform act (b) only if act (a) results in a determination that the portion is not speech.
- 18. One or more computer-readable media as recited in claim 16, wherein the computer program is further to cause the one or more processors to perform act (c) only if act (b) results in a determination that the portion is not silence.
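The ordering constraints of claims 16-18 describe a decision cascade: each later test runs only if every earlier test ruled its class out. A minimal sketch, with the three analyses abstracted as caller-supplied predicates (the predicate names are hypothetical, not from the claims):

```python
def classify_portion(portion, is_speech, is_silence, is_music):
    """Cascade per claims 16-18: acts (a)-(c) in order, each gated on
    the previous act's negative result, then classification (d)."""
    if is_speech(portion):       # (a) line-spectrum-pair based test
        return 'speech'
    if is_silence(portion):      # (b) energy based test
        return 'silence'
    if is_music(portion):        # (c) periodicity based test
        return 'music'
    return 'environment sound'

# Example: a portion that is neither speech nor silence but is periodic.
label = classify_portion(
    object(),
    is_speech=lambda p: False,
    is_silence=lambda p: False,
    is_music=lambda p: True,
)
```

Ordering the cheap, high-precision tests first means most portions are labelled without running every analysis.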
RELATED APPLICATIONS
[0001] This is a division of U.S. patent application Ser. No. 09/553,166, filed Apr. 19, 2000, entitled “Audio Segmentation and Classification” to Hao Jiang and Hongjiang Zhang.
Divisions (1)
|        | Number   | Date     | Country |
|--------|----------|----------|---------|
| Parent | 09553166 | Apr 2000 | US      |
| Child  | 10843011 | May 2004 | US      |