Claims
- 1. In a system that concatenates speech units to produce synthetic speech, a method for automatically segmenting unit labels, the method comprising:
training a set of Hidden Markov Models (HMMs) using seed data in a first iteration; aligning the set of HMMs using a Viterbi alignment to produce segmented unit labels; and adjusting boundaries of the unit labels using spectral boundary correction.
- 2. A method as defined in claim 1, wherein training a set of Hidden Markov Models further comprises:
initializing the set of HMMs using at least one of hand-labeled bootstrapped data, speaker-independent HMM bootstrapped data, and flat start data; re-estimating the set of HMMs; and performing an embedded re-estimation on the set of HMMs.
- 3. A method as defined in claim 1, wherein adjusting boundaries of the unit labels using spectral boundary correction further comprises adjusting boundaries of the unit labels within specified time windows.
- 4. A method as defined in claim 1, wherein adjusting boundaries of the unit labels using spectral boundary correction further comprises:
combining HMM-based segmentation with spectral features to reduce misalignments between target unit boundaries and boundaries assigned by the HMM-based segmentation.
- 5. A method as defined in claim 1, wherein adjusting boundaries of the unit labels using spectral boundary correction further comprises:
identifying context-dependent time windows around the unit boundaries, wherein the unit boundaries include one or more of:
a vowel-to-vowel boundary; a vowel-to-nasal boundary; a vowel-to-voiced stop boundary; a vowel-to-liquid boundary; a vowel-to-unvoiced stop boundary; a vowel-to-voiced fricative boundary; an unvoiced stop-to-vowel boundary; a nasal-to-vowel boundary; a voiced stop-to-vowel boundary; a liquid-to-vowel boundary; an unvoiced fricative-to-vowel boundary; and a voiced fricative-to-vowel boundary.
- 6. A method as defined in claim 5, wherein context-dependent time windows are empirically determined by adjacent phones.
- 7. A method as defined in claim 1, further comprising using the unit labels whose boundaries have been adjusted by spectral boundary correction as input for a next iteration of:
training a set of HMMs; aligning the set of HMMs using a Viterbi alignment to produce segmented unit labels; and adjusting boundaries of the unit labels using spectral boundary correction.
- 8. A computer-readable medium having computer-executable instructions for implementing the method of claim 1.
- 9. In a system having a speech inventory that includes phone labels that are concatenated to form synthetic speech, a method for segmenting the phone labels, the method comprising:
performing a first alignment on a trained set of HMMs to produce phone labels that are segmented, wherein each phone label has a spectral boundary; and performing spectral boundary correction on the phone labels, wherein spectral boundary correction re-aligns each spectral boundary using bending points of spectral transitions.
- 10. A method as defined in claim 9, wherein performing a first alignment on a trained set of HMMs to produce phone labels that are segmented further comprises bootstrapping the set of HMMs with at least one of speaker-dependent HMMs and speaker-independent HMMs.
- 11. A method as defined in claim 9, wherein performing a first alignment on a trained set of HMMs to produce phone labels that are segmented further comprises:
initializing the set of HMMs; re-estimating the set of HMMs; and performing embedded re-estimation on the set of HMMs.
- 12. A method as defined in claim 9, wherein performing a first alignment on a trained set of HMMs to produce phone labels that are segmented further comprises performing a Viterbi alignment on the trained set of HMMs to produce phone labels that are segmented.
- 13. A method as defined in claim 11, wherein performing a first alignment on a trained set of HMMs to produce phone labels that are segmented and performing spectral boundary correction on the phone labels are performed iteratively.
- 14. A method as defined in claim 13, further comprising training the set of HMMs using phone labels having boundaries that have been re-aligned using spectral boundary correction.
- 15. A method as defined in claim 9, wherein performing spectral boundary correction on the phone labels further comprises performing spectral boundary correction on the phone labels within a context-dependent time window.
- 16. A method as defined in claim 15, further comprising empirically determining the context-dependent time window using adjacent phones.
- 17. A method as defined in claim 15, wherein each spectral boundary is between a first phone class and a second phone class.
- 18. A computer-readable medium having computer-executable instructions for implementing the method of claim 9.
- 19. A method for segmenting phone labels to reduce misalignments in order to improve synthetic speech when the phone labels are concatenated, the method comprising:
training a set of HMMs using one of a specific speaker's hand-labeled speech data and speaker-independent speech data; segmenting the trained set of HMMs using a first alignment to produce phone labels, wherein each phone label has a spectral boundary; using a weighted slope metric to identify bending points of spectral transitions, wherein each bending point corresponds to a spectral boundary; and correcting a particular spectral boundary of a particular phone label if the particular spectral boundary does not coincide with a particular bending point.
- 20. A method as defined in claim 19, wherein using a weighted slope metric to identify bending points of spectral transitions further comprises applying the weighted slope metric within context-dependent time windows such that spurious spectral boundaries are not applied to the phone labels.
- 21. A method as defined in claim 20, further comprising retraining the set of HMMs using the phone labels that have been corrected using the weighted slope metric.
- 22. A method as defined in claim 20, wherein each spectral boundary is defined by a first phone class and a second phone class, wherein the first phone class and the second phone class include at least one of a vowel, an unvoiced stop, a voiced stop, an unvoiced fricative, a voiced fricative, a liquid, and a nasal.
- 23. A method as defined in claim 20, further comprising determining context-dependent time windows empirically.
- 24. A computer-readable medium having computer-executable instructions for performing the method of claim 19.
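Claims 1, 7, and 13 together describe an iterative loop: train HMMs on seed labels, Viterbi-align to produce segmented labels, correct the boundaries spectrally, and feed the corrected labels back into the next training pass. A minimal, library-agnostic sketch of that control flow follows; the three stage callables (`train_hmms`, `viterbi_align`, `correct_boundaries`) are hypothetical placeholders supplied by the caller, not real library functions, and a real system would back them with an HMM toolkit such as HTK:

```python
def auto_segment(speech, seed_labels, train_hmms, viterbi_align,
                 correct_boundaries, iterations=3):
    """Iterative HMM-based segmentation with spectral boundary correction.

    The stages are injected as callables so the loop itself stays
    toolkit-agnostic; each pass retrains on the previous pass's
    corrected labels, as in claims 1, 7, and 13.
    """
    labels = seed_labels
    for _ in range(iterations):
        hmms = train_hmms(speech, labels)         # claim 2: init + re-estimation
        labels = viterbi_align(hmms, speech)      # claim 1: segmented unit labels
        labels = correct_boundaries(speech, labels)  # claim 1: boundary correction
    return labels
```

The number of iterations would in practice be set by a convergence check on how far the boundaries move between passes, rather than a fixed count.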
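Claims 19 and 20 locate the "bending point" of a spectral transition with a weighted slope metric, searching only a context-dependent window around the HMM-assigned boundary so spurious bends elsewhere are ignored. A simplified sketch, assuming the speech is available as a frames-by-bins NumPy array of log-magnitude spectra; the metric here is a basic Klatt-style weighted comparison of per-band spectral slopes, not necessarily the exact metric of the claims:

```python
import numpy as np

def weighted_slope_distance(s1, s2, weights=None):
    """Compare per-band slopes of two log-magnitude spectra.

    A large value indicates a spectral 'bend' between the two frames.
    With uniform weights this reduces to a plain slope-difference norm.
    """
    d1, d2 = np.diff(s1), np.diff(s2)
    if weights is None:
        weights = np.ones_like(d1)
    return float(np.sum(weights * (d1 - d2) ** 2))

def correct_boundary(spectra, hmm_boundary, window=8):
    """Re-align an HMM-assigned boundary to the nearest bending point.

    Only frames within `window` of the initial boundary are searched,
    mirroring the context-dependent time window of claim 20.
    """
    lo = max(1, hmm_boundary - window)
    hi = min(len(spectra) - 1, hmm_boundary + window)
    dists = [weighted_slope_distance(spectra[t - 1], spectra[t])
             for t in range(lo, hi + 1)]
    return lo + int(np.argmax(dists))
```

For example, if the true spectral transition sits a few frames after the HMM boundary but inside the search window, the corrected boundary snaps to the transition frame; per claims 6 and 16, the window size itself would be chosen empirically per phone-class pair.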
RELATED APPLICATIONS
[0001] This application claims the benefit of U.S. Provisional Patent Application Serial No. 60/369,043 entitled “System and Method of Automatic Segmentation for Text to Speech Systems” and filed Mar. 29, 2002, which is incorporated herein by reference.
Provisional Applications (1)
| Number | Date | Country |
| --- | --- | --- |
| 60369043 | Mar 2002 | US |