Multi-unit approach to text-to-speech synthesis

Information

  • Patent Application
  • Publication Number
    20070192105
  • Date Filed
    February 16, 2006
  • Date Published
    August 16, 2007
Abstract
Methods, apparatus, systems, and computer program products are provided for synthesizing speech. One method includes matching first level units of a received input string to audio segments from a plurality of audio segments, including using properties of or between first level units to locate matching audio segments from a plurality of selections; parsing unmatched first level units into second level units; matching the second level units to audio segments, using properties of or between the units to locate matching audio segments from a plurality of selections; and synthesizing the input string, including combining the audio segments associated with the first and second level units.
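The multi-level matching described in the abstract can be sketched as follows. This is a minimal illustration with hypothetical names and a toy segment library, not the patented implementation: whole phrases are tried first, unmatched phrases are parsed into words, and words with no stored segment fall back to uncorrelated phonemes.

```python
def synthesize(text, phrase_segments, word_segments):
    """Return a list of audio-segment labels covering the input string.

    phrase_segments / word_segments: dicts mapping unit text to a stored
    audio-segment identifier (a stand-in for the segment library).
    """
    matched = []
    for phrase in text.split(", "):          # crude first-level (phrase) parse
        if phrase in phrase_segments:        # first-level match
            matched.append(phrase_segments[phrase])
        else:
            for word in phrase.split():      # second-level parse and match
                # Unmatched words would be rendered as uncorrelated phonemes.
                matched.append(word_segments.get(word, f"<phonemes:{word}>"))
    return matched

phrase_db = {"good morning": "seg_001"}
word_db = {"dave": "seg_042"}
print(synthesize("good morning, dave", phrase_db, word_db))
# → ['seg_001', 'seg_042']
```

The same pattern extends downward: anything still unmatched at the word level would be parsed into sub-word or phonetic-segment units and matched again before synthesis.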
Description

DESCRIPTION OF DRAWINGS


FIG. 1 is a block diagram illustrating a proposed system for text-to-speech synthesis.

FIG. 2 is a block diagram illustrating a synthesis block of the proposed system of FIG. 1.

FIG. 3A is a flow diagram illustrating one method for synthesizing text into speech.

FIG. 3B is a flow diagram illustrating a second method for synthesizing text into speech.

FIG. 4 is a flow diagram illustrating a method for providing a plurality of audio segments having defined properties that can be used in the method shown in FIG. 3.

FIG. 5 is a schematic diagram illustrating linked segments.

FIG. 6 is a schematic diagram illustrating another example of linked segments.

FIG. 7 is a flow diagram illustrating a method for matching units from a stream of text to audio segments at a highest possible unit level.

FIG. 8 is a schematic diagram illustrating linked segments.


Claims
  • 1. A method, including: matching phrase units of a received input string to audio segments from a plurality of audio segments including using properties of or between phrase units to locate matching audio segments from a plurality of selections; parsing unmatched phrase units into word units; matching the word units to audio segments using properties of or between words to locate matching audio segments from a plurality of selections; and synthesizing the input string, including combining the audio segments associated with the phrase and word units.
  • 2. The method of claim 1, wherein matching the phrase units further comprises: searching metadata associated with the plurality of audio segments and that describes the properties of or between the plurality of audio segments.
  • 3. The method of claim 1, wherein matching the word units further comprises: searching metadata associated with the plurality of audio segments and that describes the properties of or between the plurality of audio segments.
  • 4. The method of claim 1, further comprising: parsing unmatched word units into sub-word units; matching the sub-word units to audio segments including, searching metadata associated with the plurality of audio segments and that describes properties of or between the plurality of audio segments.
  • 5. The method of claim 1, further comprising: parsing unmatched word units into phonetic segment units; matching the phonetic segment units to audio segments including, searching metadata associated with the plurality of audio segments and that describes properties of or between the plurality of audio segments.
  • 6. The method of claim 4 or 5, wherein synthesizing the input string includes: combining the audio segments associated with phrase, word, sub-word and phonetic segment units.
  • 7. The method of claim 1, further comprising: providing an index to the plurality of audio segments.
  • 8. The method of claim 1, further comprising: generating metadata associated with the plurality of audio segments.
  • 9. The method of claim 8, wherein generating the metadata comprises: receiving a voice sample; determining two or more portions of the voice sample having properties; and generating a portion of the metadata associated with a first portion of the voice sample to associate a second portion of the voice sample, and a portion of the metadata associated with the second portion of the voice sample to associate the first portion of the voice sample.
  • 10. The method of claim 8, wherein generating the metadata comprises: receiving a voice sample; delimiting a portion of the voice sample in which articulation relationships are substantially self-contained; and generating a portion of the metadata to describe the portion of the voice sample.
  • 11. The method of claim 1, wherein the phrase units each comprise one or more of one or more sentences, one or more phrases, one or more word pairs, or one or more words.
  • 12. The method of claim 1, wherein the input string is received from an application or an operating system.
  • 13. The method of claim 1, further comprising: transforming unmatched portions of the input string to uncorrelated phonemes.
  • 14. The method of claim 1, wherein the input string comprises ASCII or Unicode characters.
  • 15. The method of claim 1, further comprising: outputting amplified speech comprising the combined audio segments.
  • 16. A method, including: receiving a stream of textual input; matching portions of the input textual stream to audio segments derived from one or more voice samples at multiple levels; and synthesizing matching audio segments into speech output.
  • 17. The method of claim 16, wherein synthesizing comprises: synthesizing both matching audio segments for successfully matched portions of the input stream and uncorrelated phonemes for unmatched portions of the input stream.
  • 18. A computer program product including instructions tangibly stored on a computer-readable medium, the product including instructions for causing a computing device to: match phrase units of an input string to audio segments from a plurality of audio segments; parse unmatched phrase units into word units; match the word units to audio segments; and synthesize the input string, including combining the audio segments associated with the phrase and word units.
  • 19. A system, including: an input capture routine to receive an input string that includes phrase units; a unit matching engine, in communication with the input capture routine, to match the phrase units to audio segments from a plurality of audio segments including using properties of or between audio segments for matching phrase units; a parsing engine, in communication with the unit matching engine, to parse unmatched phrase units into word units, the unit matching engine configured to match the word units to audio segments including using properties of or between the audio segments for matching word units; a synthesis block, in communication with the unit matching engine, to synthesize the input string, including combining the audio segments associated with the phrase and word units; and a storage unit to store audio segments and properties of or between the audio segments.
  • 20. A method including providing a library of audio segments and associated metadata defining properties of or between a given segment and another segment, the library including one or more levels of units in accordance with a hierarchy; matching, at a first level of the hierarchy, units of a received input string to audio segments, the received input string having one or more units at a first level; parsing unmatched units to units at a second level in the hierarchy; matching one or more units at the second level of the hierarchy to audio segments; and synthesizing the input string including combining the audio segments associated with the first and second levels.
  • 21. A method including receiving audio segments; parsing the audio segments into units of a first level in a hierarchy of levels; defining properties of or between units; storing the units and the properties; parsing the units into sub-units; defining properties of or between the sub-units; and storing the sub-units and properties.
  • 22. The method of claim 21, further comprising: parsing a received input string to units; determining properties of or between the units if any; matching units to stored units using the properties; parsing unmatched units to sub-units; determining properties of or between the sub-units if any; matching one or more sub-units to stored sub-units; and synthesizing the input string including combining the audio segments associated with the units and sub-units.
  • 23. The method of claim 21 where the units are phrase units and the sub-units are word units.
  • 24. The method of claim 21 where the units are word units and the sub-units are phonetic segments.
  • 25. The method of claim 21 further comprising defining properties between units and sub-units, and storing the properties with both the associated units and sub-units.
  • 26. The method of claim 21 further comprising continuing to parse the sub-units to phonetic segments, determining properties of or between phonetic segments if any, and storing the phonetic segments including properties.
  • 27. The method of claim 26 further comprising storing the phonetic segments without the properties.
  • 28. The method of claim 21 further comprising parsing the sub-units into parsed sub-units; defining properties of or between the parsed sub-units; and storing the parsed sub-units and properties.
  • 29. The method of claim 27 further comprising parsing a received input string to units; determining properties of or between the units if any; matching units to stored units using the properties; parsing unmatched units to sub-units; determining properties of or between the sub-units if any; matching one or more sub-units to stored sub-units; parsing unmatched sub-units; determining properties of or between the parsed sub-units if any; matching parsed sub-units to stored parsed units; and synthesizing the input string including combining the audio segments associated with the units, sub-units and parsed sub-units.
  • 30. The method of claim 27 further comprising storing the parsed sub-units without the properties.
  • 31. The method of claim 30 further comprising parsing a received input string to units; determining properties of or between the units if any; matching units to stored units using the properties; parsing unmatched units to sub-units; determining properties of or between the sub-units if any; matching one or more sub-units to stored sub-units; parsing unmatched sub-units; determining properties of or between the parsed sub-units if any; matching parsed sub-units to stored parsed units using the properties; and synthesizing the input string including combining the audio segments associated with the units, sub-units and parsed sub-units.
  • 32. A method including receiving audio segments; parsing the audio segments into units of a first level in a hierarchy of levels; defining properties of or between units; storing the units and the properties; parsing the units into units of a next level in the hierarchy of levels; defining properties of or between units in the next level; storing the units and properties; and continuing to parse units at a given level into units at a next level in the hierarchy until a final parsing is performed; at each level, defining properties of or between units and storing the units and the properties; and at a final level in the hierarchy storing units.
  • 33. The method of claim 32, further comprising: parsing a received input string to units; determining properties for the units if any; matching units having properties to stored units at a first level in the hierarchy; parsing unmatched units in a given level of the hierarchy to units at a next level in the hierarchy; determining properties for the parsed units if any; matching one or more parsed units to stored units at a given level in the hierarchy; and synthesizing the input string including combining the audio segments associated with the units and parsed units.
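The library-building side described in claims 21 and 32 can be sketched in the same spirit. This is a hedged illustration with hypothetical names, not the patented implementation: transcribed audio is parsed into phrase units and then word units, and "properties of or between" units are recorded as metadata (here, simply each unit's adjacent neighbors); deeper levels (sub-word, phonetic segment) would follow the same pattern.

```python
def build_library(transcribed_segments):
    """transcribed_segments: list of (text, segment_id) pairs, one per phrase.

    Returns a two-level library of the form
    {level: {unit_text: {"segment": id, "neighbors": [...]}}}.
    """
    library = {"phrase": {}, "word": {}}
    for i, (text, seg_id) in enumerate(transcribed_segments):
        # Phrase level: store the unit with links to its neighboring phrases.
        neighbors = [t for j, (t, _) in enumerate(transcribed_segments)
                     if abs(i - j) == 1]
        library["phrase"][text] = {"segment": seg_id, "neighbors": neighbors}
        # Word level: parse the phrase into word units and link each word
        # to its adjacent words within the phrase.
        words = text.split()
        for k, word in enumerate(words):
            adj = [w for m, w in enumerate(words) if abs(k - m) == 1]
            library["word"].setdefault(
                word, {"segment": f"{seg_id}.{k}", "neighbors": adj})
    return library

lib = build_library([("good morning", "s1"), ("dave", "s2")])
```

At synthesis time (claims 22 and 33), the matcher would consult these neighbor properties to prefer stored units whose recorded context matches the context in the input string.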