Claims
- 1. A method for synthesizing speech, the method comprising:
generating a training context vector for each of a set of training speech units in a training speech corpus, each training context vector indicating the prosodic context of a training speech unit in the training speech corpus; indexing a set of speech segments associated with a set of training speech units based on the context vectors for the training speech units; generating an input context vector for each of a set of input speech units in an input text, each input context vector indicating the prosodic context of an input speech unit in the input text; using the input context vectors to find a speech segment for each input speech unit; and concatenating the found speech segments to form a synthesized speech signal.
- 2. The method of claim 1 wherein the each context vector comprises a position-in-phrase coordinate indicating the position of the speech unit in a phrase.
- 3. The method of claim 1 wherein the each context vector comprises a position-in-word coordinate indicating the position of the speech unit in a word.
- 4. The method of claim 1 wherein the each context vector comprises a left phonetic coordinate indicating a category for the phoneme to the left of the speech unit.
- 5. The method of claim 1 wherein the each context vector comprises a right phonetic coordinate indicating a category for the phoneme to the right of the speech unit.
- 6. The method of claim 1 wherein the each context vector comprises a left tonal coordinate indicating a category for the tone of the speech unit to the left of the speech unit.
- 7. The method of claim 1 wherein the each context vector comprises a right tonal coordinate indicating a category for the tone of the speech unit to the right of the speech unit.
- 8. The method of claim 1 wherein indexing a set of speech segments comprises generating a decision tree based on the training context vectors.
- 9. The method of claim 8 wherein using the input context vectors to find a speech segment comprises searching the decision tree using the input context vector.
- 10. The method of claim 9 wherein searching the decision tree comprises:
identifying a leaf in the tree for each input context vector, each leaf comprising at least one candidate speech segments; and selecting one candidate speech segment in each leaf node, wherein if there is more than one candidate speech segment on the node The selection is based on a cost function.
- 11. The method of claim 10 wherein the cost function comprises a distance between the input context vector and a training context vector associated with a speech segment.
- 12. The method of claim 11 wherein the cost function further comprises a smoothness cost that is based on a candidate speech segment of at least one neighboring speech unit.
- 13. The method of claim 12 wherein the smoothness cost gives preference to selecting a series of speech segments for a series of input context vectors if the series of speech segments occurred in series in the training speech corpus.
- 14. A method of selecting sentences for reading into a training speech corpus used in speech synthesis, the method comprising:
identifying a set of prosodic context information for each of a set of speech units; determining a frequency of occurrence for each distinct context vector that appears in a very large text corpus; using the frequency of occurrence of the context vectors to identify a list of necessary context vectors; and selecting sentences in the large text corpus for reading into the training speech corpus, each selected sentence containing at least one necessary context vector.
- 15. The method of claim 14 wherein identifying a collection of prosodic context information sets as necessary context information sets comprises:
determining the frequency of occurrence of each prosodic context information set across a very large text corpus; and identifying a collection of prosodic context information sets as necessary context information sets based on their frequency of occurrence.
- 16. The method of claim 15 wherein identifying a collection of prosodic context information sets as necessary context information sets further comprises:
sorting the context information sets by their frequency of occurrence in decreasing order; determining a threshold, F, for accumulative frequency of top context vectors; and selecting the top context vectors whose accumulative frequency is not smaller than F for each speech unit as necessary prosodic context information sets.
- 17. The method of claim 14 further comprising indexing only those speech segments that are associated with sentences in the smaller training text and wherein indexing comprises indexing using a decision tree.
- 18. The method of claim 17 wherein indexing further comprises indexing the speech segments in the decision tree based on information in the context information sets.
- 19. The method of claim 18 wherein the decision tree comprises leaf nodes and at least one leaf node comprises at least two speech segments for the same speech unit.
- 20. A method of selecting speech segments for concatenative speech synthesis, the method comprising:
parsing an input text into speech units; identifying context information for each speech unit based on its location in the input text and at least one neighboring speech unit; identifying a set of candidate speech segments for each speech unit based on the context information; and identifying a sequence of speech segments from the candidate speech segments based in part on a smoothness cost between the speech segments.
- 21. The method of claim 20 wherein identifying a set of candidate speech segments for a speech unit comprises applying the context information for a speech unit to a decision tree to identify a leaf node containing candidate speech segments for the speech unit.
- 22. The method of claim 21 wherein identifying a set of candidate speech segments further comprises pruning some speech segments from a leaf node based on differences between the context information of the speech unit from the input text and context information associated with the speech segments.
- 23. The method of claim 20 wherein identifying a sequence of speech segments comprises using a smoothness cost that is based on whether two neighboring candidate speech segments appeared next to each other in a training corpus.
- 24. The method of claim 21 wherein identifying a sequence of speech segments further comprises identifying the sequence based in part on differences between context information for the speech unit of the input text and context information associated with a candidate speech segment.
- 25. A computer-readable medium having computer executable instructions for synthesizing speech from speech segments based on speech units found in an input text, the speech being synthesized through a method comprising steps of:
identifying context information for each speech unit based on the prosodic structure of the input text; identifying a set of candidate speech segments for each speech unit based on the context information; identifying a sequence of speech segments from the candidate speech segments; concatenating the sequence of speech segments without modifying the prosody of the speech segments to form the synthesized speech.
REFERENCE TO RELATED APPLICATION
[0001] The present application claims priority to a U.S. Provisional application having serial No. 60/251,167, filed on Dec. 4, 2000 and entitled “PROSODIC WORD SEGMENTATION AND MULTI-TIER NON-UNIFORM UNIT SELECTION”.
Provisional Applications (1)
|
Number |
Date |
Country |
|
60251167 |
Dec 2000 |
US |