The present disclosure relates generally to text-to-speech (TTS) systems, and, in particular, to a system and method for selecting among TTS systems dynamically.
The quality of the output of a text-to-speech synthesis system is dependent on the particular text presented as input; some sentences synthesize well, while others are plagued by discontinuities and bad prosody. Moreover, systems using different algorithms or different settings may behave differently on a given text. One system may perform better than another system on some texts, but worse on others. Typically, a TTS system uses a particular algorithm and system, and adjusts the parameters related to that algorithm and system.
Embodiments of the invention include a method for dynamically selecting among text-to-speech systems, the method including identifying text for converting into a speech waveform, synthesizing the text by two or more TTS systems, generating a candidate waveform from each of the systems, generating a score from each of the systems, comparing the scores, selecting a score based on a criterion, and selecting one of the candidate waveforms based on the selected score.
Additional embodiments include a system for dynamically selecting among text-to-speech systems, including a first text synthesizer, a second text synthesizer, a third text synthesizer (or multiple synthesizers), an input device providing desired text, to be converted into a speech output, to the first, second, and third text synthesizers, and an output device for receiving a synthesized waveform and a score from each of the first, second, and third text synthesizers, the output device determining a cost score for each of the waveforms and generating the one of the three waveforms with the lowest cost score as the speech output for said desired text.
Further embodiments include a storage medium with machine-readable computer program code for dynamically selecting among text-to-speech systems, the storage medium including instructions for causing a system to implement a method, including identifying text for converting into an output speech waveform, synthesizing the text by multiple TTS systems, generating a candidate waveform from each of the systems, generating a cost function score from each of the systems, associating each of the scores with the respective waveforms, identifying the lowest cost function score and generating the waveform associated with the lowest cost function score as the output speech waveform.
Other systems, methods, and/or computer program products according to embodiments will be or become apparent to one with skill in the art upon review of the following drawings and detailed description. It is intended that all such additional systems, methods, and/or computer program products be included within this description, be within the scope of the present invention, and be protected by the accompanying claims.
The subject matter which is regarded as the invention is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
The detailed description explains the preferred embodiments of the invention, together with advantages and features, by way of example with reference to the drawings.
Exemplary embodiments include a system for dynamically and automatically selecting among TTS systems having different algorithms for generating waveforms. The desired text is synthesized several times by different systems, and the output is selected dynamically among the systems based on a confidence score or a minimum cost function score to produce the final synthetic speech output waveform. The score is used as a switch to select one of the available TTS renditions of the text as the speech output.
Various choices for the multiple TTS systems exist. In general, in the embodiments described herein, it is understood that several different TTS technologies can be implemented such as, but not limited to: a formant TTS engine; a concatenative TTS engine; a Hidden-Markov-Model-based engine, etc. Another choice is to use the same basic technology, but vary some of the parameters to generate different outputs. For example, the concatenative TTS engine has weights that allow a trade-off among various aspects of the cost function. Therefore, in one implementation, spectral smoothness could be traded off against closeness to the prosodic targets when selecting a segment for concatenation. By adjusting the weights controlling this trade-off, different output speech could be generated from the same system.
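The weight trade-off described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each candidate segment carries two sub-costs (a spectral discontinuity cost and a prosodic mismatch cost) that are combined linearly; the names `Segment`, `pick_segment`, `w_spectral`, and `w_prosody` are hypothetical and do not come from any particular TTS engine.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Segment:
    """A candidate unit for concatenation (hypothetical structure)."""
    name: str
    spectral_cost: float  # assumed: discontinuity at the join, lower is smoother
    prosody_cost: float   # assumed: distance from pitch/duration targets


def pick_segment(candidates: Sequence[Segment],
                 w_spectral: float,
                 w_prosody: float) -> Segment:
    """Return the candidate with the lowest weighted cost."""
    return min(
        candidates,
        key=lambda s: w_spectral * s.spectral_cost + w_prosody * s.prosody_cost,
    )


candidates = [
    Segment("smooth", spectral_cost=0.1, prosody_cost=0.9),
    Segment("on_target", spectral_cost=0.8, prosody_cost=0.2),
]

# Same engine, two weight settings, two different selections.
smooth_choice = pick_segment(candidates, w_spectral=1.0, w_prosody=0.2)
target_choice = pick_segment(candidates, w_spectral=0.2, w_prosody=1.0)
print(smooth_choice.name, target_choice.name)
```

Running the same engine under two weight settings in this way yields two of the "different systems" whose outputs can then be compared.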
It is appreciated that the exemplary embodiments of the methods and systems described here apply to TTS for speech at various utterance pieces including sentence-by-sentence, word-by-word, syllable-by-syllable, etc.
Referring still to
Therefore, in system 100, desired text 105 is synthesized by three systems 110, 120, 130, each of which generates a candidate waveform and a score reflecting the quality of its output 115, 125, 135. Those scores carried in output 115, 125, 135 are then compared and the waveform generated by the system reporting the lowest cost is selected as the best waveform for the text to be synthesized, and output by selector 140. The best waveform is taken as the output of the overall system 100.
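The selection step performed by selector 140 might be sketched as follows. This is an illustrative sketch only: the `Candidate` tuple, the `select_best` function, and the stub synthesizers (with made-up cost formulas) are assumptions standing in for real TTS engines, which would return actual audio samples and their own internal costs.

```python
from typing import Callable, List, NamedTuple, Sequence


class Candidate(NamedTuple):
    system: str            # which TTS system produced this candidate
    waveform: List[float]  # placeholder for audio samples
    cost: float            # cost reported by that system, lower is better


Synthesizer = Callable[[str], Candidate]


def select_best(text: str, synthesizers: Sequence[Synthesizer]) -> Candidate:
    """Synthesize the text with every system and keep the lowest-cost result."""
    candidates = [synth(text) for synth in synthesizers]
    return min(candidates, key=lambda c: c.cost)


def make_stub(name: str, cost_fn: Callable[[str], float]) -> Synthesizer:
    """Build a stand-in synthesizer with a made-up cost formula."""
    def synth(text: str) -> Candidate:
        return Candidate(name, [0.0] * len(text), cost_fn(text))
    return synth


stubs = [
    make_stub("system_110", lambda t: 0.5 * len(t)),
    make_stub("system_120", lambda t: 3.0),
    make_stub("system_130", lambda t: 10.0 / len(t)),
]

# Different texts can be won by different systems.
print(select_best("hello", stubs).system)
print(select_best("hi", stubs).system)
```

Because the comparison is repeated for every input, the winning system can change from one text to the next without any manual intervention.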
As discussed above, the selection process is automatic and dynamic, based on a confidence score or other quality measure automatically assigned to each of the outputs 115, 125, 135 of the candidate TTS systems 110, 120, 130. In exemplary embodiments, each synthesis system 110, 120, 130 reports a cost associated with synthesizing the desired text 105, which is output to selector 140. Cost reflects the ability of the system to achieve a smooth output, to match the desired pitch and durations, etc. For example, in the speech generation process, the degree of mismatch between the input text and the output waveform is determined by a cost function. Mismatch can be determined by a variety of factors such as, but not limited to, sequences of phonemes and prosodic characteristics (intonation). Many concatenative TTS systems use cost functions internally to select a sequence of segments to synthesize a given text. In general, the higher the cumulative cost function for a given piece of dialog (utterance), the worse the overall naturalness and intelligibility of the speech generated. The cost function is therefore an inherent measure of the quality of concatenative speech generation.
In an exemplary embodiment, system 100 uses that same cost function as a means of assigning a measure of quality to the system outputs. The synthetic speech generated by the synthesis system reporting the lowest cost is then selected as the final output. In the case where the cost functions used by different systems are not directly comparable (e.g., one system multiplies all costs by 10, so that its scores tend to be larger than the scores of the other systems), a function of the scores rather than the scores themselves may be used, where the function normalizes the scores so that they may be compared.
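One possible normalizing function is sketched below. The source does not specify the function; this sketch assumes a z-score computed against per-system calibration costs (costs each system reported on some held-out set of texts), and the names `make_normalizer`, `norm_a`, and `norm_b` are hypothetical.

```python
from statistics import mean, pstdev
from typing import Callable, Sequence


def make_normalizer(calibration_costs: Sequence[float]) -> Callable[[float], float]:
    """Return a function mapping a raw cost to a z-score for one system.

    Assumption: calibration_costs are costs this system reported on a
    representative set of texts, so they capture its typical scale.
    """
    mu = mean(calibration_costs)
    sigma = pstdev(calibration_costs) or 1.0  # guard against zero spread
    return lambda cost: (cost - mu) / sigma


# System "b" reports costs on a scale ten times larger than system "a".
norm_a = make_normalizer([1.0, 2.0, 3.0])
norm_b = make_normalizer([10.0, 20.0, 30.0])

raw = {"a": 2.5, "b": 15.0}
normalized = {"a": norm_a(raw["a"]), "b": norm_b(raw["b"])}

raw_winner = min(raw, key=raw.get)            # misled by the scale difference
winner = min(normalized, key=normalized.get)  # compares like with like
print(raw_winner, winner)
```

In this example the raw comparison favors system "a" only because its scale is smaller; after normalization, system "b" is seen to be below its own typical cost while "a" is above its own, so "b" is selected.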
The processing can occur at various levels. Fusion can be late, where the sentence or paragraph is generated by each candidate system and the entire passage is chosen from one of the systems based on cost. Fusion can also be early, where the decision as to which system's output to choose happens at the phrase, word, or sub-word level. When fusion happens earlier than at the sentence level, the sub-sentence portions of speech are concatenated at system output to form the desired sentence.
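The difference between late and early fusion can be sketched as follows, assuming each system reports one cost per unit (here, per word) and that the per-unit costs are already comparable across systems. The function names and the cost values are illustrative only, and the sketch ignores any additional cost of joining units produced by different systems.

```python
from typing import Dict, List


def late_fusion(unit_costs: Dict[str, List[float]]) -> str:
    """Pick one system for the whole passage by total cost."""
    totals = {system: sum(costs) for system, costs in unit_costs.items()}
    return min(totals, key=totals.get)


def early_fusion(unit_costs: Dict[str, List[float]]) -> List[str]:
    """Pick the lowest-cost system independently for each unit."""
    systems = list(unit_costs)
    n_units = len(unit_costs[systems[0]])
    return [
        min(systems, key=lambda s: unit_costs[s][i])
        for i in range(n_units)
    ]


# Hypothetical per-word costs for a three-word sentence.
costs = {
    "A": [1.0, 5.0, 1.0],
    "B": [2.0, 2.0, 2.0],
}

passage_choice = late_fusion(costs)  # one system for the entire sentence
word_choices = early_fusion(costs)   # one system per word, then concatenated
print(passage_choice, word_choices)
```

With these numbers, late fusion chooses system B for the whole sentence, while early fusion takes the first and third words from A and the second from B, and the resulting pieces would then be concatenated.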
It is appreciated that system 100 and method 200 as described above allow for automatic selection of the best waveform output for any given text. For one section of desired text, the first engine may produce the lowest cost function score, so the waveform output of the first engine is automatically selected as the output waveform of the overall system. For the next section of desired text, the third engine may have the lowest cost function score, so the waveform output of the third engine is automatically selected as the output of the system. For the third section of text, the second engine may produce the lowest cost function score, so the output waveform of the second engine is automatically selected as the output of the overall system, and so on.
As described above, embodiments can be embodied in the form of computer-implemented processes and apparatuses for practicing those processes. In exemplary embodiments, the invention is embodied in computer program code executed by one or more network elements. Embodiments include computer program code containing instructions embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other computer-readable storage medium, wherein, when the computer program code is loaded into and executed by a computer, the computer becomes an apparatus for practicing the invention. Embodiments include computer program code, for example, whether stored in a storage medium, loaded into and/or executed by a computer, or transmitted over some transmission medium, such as over electrical wiring or cabling, through fiber optics, or via electromagnetic radiation, wherein, when the computer program code is loaded into and executed by a computer, the computer becomes an apparatus for practicing the invention. When implemented on a general-purpose microprocessor, the computer program code segments configure the microprocessor to create specific logic circuits.
While the invention has been described with reference to exemplary embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed as the best mode contemplated for carrying out this invention, but that the invention will include all embodiments falling within the scope of the appended claims. Moreover, the use of the terms first, second, etc. does not denote any order or importance; rather, the terms first, second, etc. are used to distinguish one element from another. Furthermore, the use of the terms a, an, etc. does not denote a limitation of quantity, but rather denotes the presence of at least one of the referenced item.
Number | Date | Country |
---|---|---|
20080172234 A1 | Jul 2008 | US |