The present invention relates to a text-to-speech system and the method thereof, and more particularly to a multi-language text-to-speech system and the method thereof.
For a text-to-speech system, the input text carries only linguistic features, whether the input data is a paragraph or an entire article. That is, the text does not contain any acoustic features, for example tones, durations or speeds. Therefore, the system has to generate the possible acoustic features of the text through automatic prediction. Recently, the concatenative ("stringing") method has become very popular, which picks up the sound unit corresponding to each word from a prerecorded database.
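The "stringing" method described above can be sketched as a simple lookup-and-concatenate step. This is a minimal illustration, not the patent's implementation; the unit names and database contents are invented for the example.

```python
# Illustrative prerecorded acoustic-unit database (contents are assumed,
# not from the patent); each entry stands in for a recorded waveform.
ACOUSTIC_DB = {
    "zao3": "<wave:zao3>",
    "can1": "<wave:can1>",
}

def string_units(units):
    """Concatenate prerecorded sound units; an unknown unit raises KeyError."""
    return [ACOUSTIC_DB[u] for u in units]
```

A sentence is then rendered by looking up each of its sound units in turn, e.g. `string_units(["zao3", "can1"])`.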
The major function of a text-to-speech system is to convert a text input into a fluent speech output. For example, a Chinese input sentence such as "你吃過早餐了嗎" ("Have you eaten breakfast?") will become "你 (ni3)", "吃過 (chi1 guo4)", "早餐 (zao3 can1)", "了 (le5)" and "嗎 (ma5)", where some words have been determined as a meaningful term. After the linguistic processing, each semantic segment is assembled into the relevant speech data. Finally, prosody processing is applied to adjust the pitch contours, volumes and durations of each acoustic unit of the sentence.
A multi-language text-to-speech system and method are disclosed in U.S. Pat. No. 6,141,642. The method uses different linguistic processing systems to perform text-to-speech tasks in different languages respectively, and then outputs the combination of the speech data from the different processing systems. In U.S. Pat. No. 6,243,681 B1, a multi-language speech synthesizer for a computer telephony integration system is disclosed. The disclosed multi-language speech synthesizer includes several speech synthesizers for text-to-speech in different languages. The speech data from the different linguistic processing systems are then combined and output.
The above-mentioned U.S. patents are both based on the combination of different acoustic databases for different languages. When the speech data is output, users will hear a different sound for each language, which means the voices and the prosodies are different and inconsistent. Further, even if all words of each language could be recorded by the same speaker, doing so would take a great deal of effort and is not easily achievable.
In order to overcome the aforesaid drawbacks in the prior art, the present invention provides a text-to-speech system and the method thereof, especially a multi-language text-to-speech system and the method thereof.
It is an aspect of the present invention to provide a text-to-speech system, including a text processor dividing a first text data and a second text data from a text string having at least a first language and a second language; a database including a plurality of acoustic units commonly used by the first and second languages; a first speech synthesis unit and a second speech synthesis unit generating a first speech data corresponding to the first text data and a second speech data corresponding to the second text data respectively by using the plurality of acoustic units; and a prosody processor optimizing prosodies of the first and second speech data.
Preferably, the first and second text data include acoustic data respectively.
Preferably, the plurality of acoustic units are recorded from the same speaker.
Preferably, the prosody processor includes a reference prosody.
More preferably, the prosody processor determines a first prosody parameter and a second prosody parameter for the first speech data and the second speech data respectively according to the reference prosody.
More preferably, the first and second prosody parameters define tones, volumes, speeds and durations for the first and second speech data.
More preferably, the prosody processor connects the first speech data with the second speech data in a hierarchical manner according to the first and second prosody parameters to obtain a successive prosody thereof.
More preferably, the prosody processor further adjusts the connected first speech data and second speech data.
It is another aspect of the present invention to provide a method for a text-to-speech conversion, including steps of: (a) providing a text string comprising at least a first language and a second language; (b) discriminating a first text data and a second text data from the text string; (c) providing a database having a plurality of acoustic units commonly used by the first and second languages; (d) generating a first speech data corresponding to the first text data and a second speech data corresponding to the second text data respectively by using the plurality of acoustic units; and (e) optimizing prosodies of the first and second speech data.
Preferably, the first and second text data include acoustic data respectively.
Preferably, the plurality of acoustic units are recorded from the same speaker.
Preferably, the step (e) further includes a step (e1) of providing a reference prosody.
More preferably, the step (e) further includes a step (e2) of determining a first prosody parameter and a second prosody parameter for the first and second speech data respectively according to the reference prosody.
More preferably, the first and second prosody parameters define tones, volumes, speeds and durations of the first and second speech data.
Preferably, the step (e) further includes a step (e3) of connecting the first and second speech data in a hierarchical manner according to the first and second prosody parameters to obtain a successive prosody.
More preferably, the step (e) further includes a step (e4) of adjusting the connected first and second speech data.
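Steps (a) through (e) above can be sketched end to end as a short pipeline. This is a hedged illustration only: the language-discrimination rule (ASCII vs. non-ASCII), the database contents and the placeholder prosody pass are all assumptions introduced for the example, not details from the invention.

```python
import re

def discriminate(text):
    """Step (b): split the string into per-language runs (ASCII vs non-ASCII,
    an assumed heuristic standing in for real language identification)."""
    runs = re.findall(r"[A-Za-z ]+|[^A-Za-z]+", text)
    return [r.strip() for r in runs if r.strip()]

def synthesize(segment, db):
    """Step (d): map each word to an acoustic unit, with a fallback label."""
    return [db.get(word, f"<unit:{word}>") for word in segment.split()]

def optimize_prosody(speech):
    """Step (e): placeholder prosody pass - tag every unit with a pitch scale."""
    return [(unit, 1.0) for unit in speech]

# Step (c): a shared acoustic-unit database (illustrative contents).
db = {"fa": "<fa>", "mo": "<mo>"}
# Steps (a)-(e) applied to a sample string.
segments = discriminate("father mother")
speech = [u for s in segments for u in synthesize(s, db)]
result = optimize_prosody(speech)
```

Each stage corresponds to one claimed step; a real system would replace every body here with genuine language identification, unit selection and prosody modeling.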
It is a further aspect of the present invention to provide a text-to-speech system, including: a text processor discriminating a first text data and a second text data from a text data comprising at least a first language and a second language; a translation module translating the second text data to a translated data in the first language; a speech synthesis unit receiving the first text data and the translated data and generating a speech data therefrom; and a prosody processor optimizing a prosody of the speech data.
Preferably, the second text data is at least one selected from a group consisting of a word, a phrase and a sentence.
Preferably, the speech synthesis unit further includes an analyzing module for rearranging the first text data and the translated data to obtain the speech data with a correct grammar and meaning according to the first language.
Preferably, the prosody processor includes a reference prosody.
More preferably, the prosody processor determines a prosody parameter for the speech data according to said reference prosody.
More preferably, the prosody parameter defines tones, volumes, speeds and durations of the speech data.
More preferably, the prosody processor adjusts the speech data according to the prosody parameter to obtain a successive prosody thereof.
It is further another aspect of the present invention to provide a method for a text-to-speech conversion, including steps of: (a) providing a text data comprising at least a first language and a second language; (b) dividing a first text data and a second text data from the text data; (c) translating the second text data to a translated data in the first language; (d) generating a speech data corresponding to the first text data and the translated data; and (e) optimizing a prosody of the speech data.
Preferably, the second text data is at least one selected from a group consisting of a word, a phrase and a sentence.
Preferably, the step (d) further includes a step (d1) of rearranging the first text data and the translated data according to grammar and meanings of the first language to obtain the speech data with a correct grammar and meaning.
Preferably, the step (e) further includes a step (e1) of providing a reference prosody.
More preferably, the step (e) further includes a step (e2) of determining a prosody parameter of the speech data according to the reference prosody.
More preferably, the prosody parameter defines a tone, a volume, a speed and a duration of the speech data.
More preferably, the step (e) further includes a step (e3) of adjusting the speech data according to the prosody parameters to obtain a successive prosody thereof.
The above aspects and advantages of the present invention will become more readily apparent to those ordinarily skilled in the art after reviewing the following detailed description and accompanying drawings.
The present invention will be described more specifically with reference to the following embodiments. It is to be noted that the following descriptions of preferred embodiments of this invention are presented herein for the purposes of illustration and description only; they are not intended to be exhaustive or to limit the invention to the precise form disclosed.
Please refer to the accompanying drawings, which illustrate a preferred embodiment of the text-to-speech system of the present invention.
The components of the text-to-speech system and the functions thereof are described below. The text processor 11 receives a text string, which includes text data in at least a first language and a second language. The text processor 11 divides a first text data and a second text data from the text string according to the different languages, and the first text data and the second text data contain acoustic data and semantic segments. The database of acoustic units 12 includes a plurality of acoustic units, which are commonly used by the first language and the second language. Preferably, the database of acoustic units 12 is recorded from the same speaker.
The first speech synthesis unit 131 and the second speech synthesis unit 132 automatically acquire the acoustic units defined in the first language and the second language through an algorithm. When the acoustic units defined in the first language and the second language are the commonly used acoustic units in the database, the first and second speech synthesis units synthesize the speech with the commonly used acoustic units, and generate a first speech data corresponding to the first text data and a second speech data corresponding to the second text data respectively.
The prosody processor 14 receives the first and second speech data and optimizes the prosodies thereof. The prosody processor 14 includes a reference prosody, and the prosody processor 14 determines a first prosody parameter and a second prosody parameter for the first and second speech data respectively according to the reference prosody. The first and second prosody parameters represent tones, volumes, speeds and durations for the first and second speech data respectively. Then, the prosody processor 14 connects the first speech data with the second speech data in a hierarchical manner according to the first and second prosody parameters to obtain a successive prosody thereof. Thus, a fluent synthetic speech is output.
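The prosody processor's behavior can be sketched as follows. This is a minimal illustration under stated assumptions: the parameter names (`f0`, `vol`, `speed`), the reference values and the 0.9 scaling rule for a minor language are all invented for the example; the patent does not specify how parameters are derived from the reference prosody.

```python
# Assumed reference prosody: baseline pitch (Hz), volume and speed scales.
REFERENCE_PROSODY = {"f0": 120.0, "vol": 1.0, "speed": 1.0}

def derive_parameters(reference, role):
    """Derive per-segment prosody parameters from the reference prosody.
    The 0.9 attenuation for a minor-language segment is an assumed rule."""
    scale = 0.9 if role == "minor" else 1.0
    return {k: v * scale for k, v in reference.items()}

def connect(first_speech, second_speech, p1, p2):
    """Attach the derived parameters to each speech data and join them
    into a single utterance in order."""
    return [(first_speech, p1), (second_speech, p2)]

p1 = derive_parameters(REFERENCE_PROSODY, "main")
p2 = derive_parameters(REFERENCE_PROSODY, "minor")
utterance = connect("speech-en", "speech-zh", p1, p2)
```

The key point the sketch captures is that both languages are parameterized against one shared reference, which is what keeps the connected output consistent.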
For example, when the input text string consists of “father”, a Chinese word and “mother”, the text processor 22 discriminates the text string into three text data, i.e. “father”, the Chinese word and “mother”, according to English and Chinese respectively. The text data contain acoustic data and are further divided into “fa”, “th”, “er”, the Chinese syllables, “mo”, “th” and “er”. Since the acoustic units of “fa” and “mo” are commonly used by Chinese and English in the database, the English speech synthesis unit 231 will acquire the defined acoustic units through an algorithm automatically after receiving the text data of “father” and “mother”. The acoustic units of “fa” and “mo” are acquired directly from the database 21, and the acoustic units of “th” and “er” are picked up from the database of the English speech synthesis unit 231. Therefore, the English speech of the words “father” and “mother” is generated.
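The unit-acquisition rule described above, prefer the common shared-speaker database and fall back to the language-specific one, can be sketched briefly. The database contents below are illustrative placeholders, not recordings from the patent.

```python
# Units shared by both languages, recorded from the same speaker (assumed).
COMMON_DB = {"fa": "<common:fa>", "mo": "<common:mo>"}
# Units specific to the English synthesis unit's own database (assumed).
ENGLISH_DB = {"th": "<en:th>", "er": "<en:er>"}

def acquire(unit, language_db):
    """Prefer the common database; fall back to the language database."""
    if unit in COMMON_DB:
        return COMMON_DB[unit]
    return language_db[unit]

# "father" = "fa" + "th" + "er": "fa" comes from the common database,
# while "th" and "er" come from the English-specific database.
father = [acquire(u, ENGLISH_DB) for u in ["fa", "th", "er"]]
```

Because "fa" and "mo" resolve to the common database, both languages reuse the same speaker's recordings for those units, which is the source of the voice consistency claimed above.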
The Chinese speech synthesis unit 232 receives the Chinese text data and likewise tries to acquire the acoustic units through the algorithm. However, the acoustic unit of the Chinese word is not built in the common database; it is generated from the database of the Chinese speech synthesis unit 232. Therefore, the Chinese speech of the word is synthesized.
Then, the synthetic Chinese and English speech data are input into the prosody processor 24 for overall prosody processing. The text string “father”, the Chinese word, “mother” is converted by the text-to-speech system according to the present invention, and the output speech proceeds in English and Chinese alternately. In order to perform the synthetic speech of the different languages fluently, it is required to adjust tones (F0 base), volumes (Vol base), speeds (Speed base) and durations. The prosody processor of the present invention has a reference prosody as the basis for adjustment. Furthermore, the prosody parameters define the tones, volumes, speeds and durations of each speech data. Therefore, the prosody processor of the present invention connects the different languages in a hierarchical manner according to the reference prosody and the prosody parameters to obtain a successive prosody. For example, in this preferred embodiment, the text string includes a main language, i.e. English, and a minor language, i.e. Chinese. The prosody parameters “(F0b, Volb) and (F0e, Vole)” of the minor language are determined according to the reference prosody. After that, the prosody parameters of the main language are determined. Then, the prosody processor further adjusts the prosody parameters of the main-language words “father” and “mother” to “(F01, Vol1) . . . (F0n, Voln)” and “(F01, Vol1) . . . (F0m, Volm)” respectively according to the prosody parameters of the minor language, in order to obtain a successive prosody thereof.
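The hierarchical adjustment toward the minor-language endpoint parameters (F0b, Volb) and (F0e, Vole) can be sketched numerically. The linear interpolation used here is an assumption for illustration; the patent does not disclose the actual adjustment rule, and the pitch values are invented.

```python
def interpolate(start, end, n):
    """Return n values linearly spaced from start to end, inclusive."""
    if n == 1:
        return [start]
    step = (end - start) / (n - 1)
    return [start + i * step for i in range(n)]

# Assumed minor-language endpoint pitches, standing in for F0b and F0e.
f0_b, f0_e = 110.0, 130.0
# Adjust a five-unit main-language word so that its pitch contour runs
# smoothly between the minor-language endpoints, giving (F01 ... F0n).
contour = interpolate(f0_b, f0_e, 5)
```

Fixing the minor-language endpoints first and fitting the main-language contour to them is what yields the "successive prosody" across the language boundary.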
In another preferred embodiment, a text string consisting of the English word “tomorrow” and a Chinese clause is input into the text processor 51, and the text string is divided into the text data “tomorrow” and the Chinese text data according to English and Chinese respectively. The Chinese text data is translated into the English text data “will it rain?” by a translation module 52. Then the speech synthesis unit 53 receives the text data “tomorrow” and “will it rain?” and converts them into a speech data. The speech synthesis unit 53 further includes an analyzing module, which rearranges the received text data “tomorrow” and “will it rain?” to obtain the speech data “Will it rain tomorrow?” with a correct grammar and meaning according to English grammar and meanings. The prosody processor 54 optimizes the prosody of the speech data. The prosody processor 54 further contains a reference prosody and determines a prosody parameter of the speech data according to the reference prosody. The prosody parameter defines the tones, volumes, speeds and durations of the speech. Therefore, the prosody processor 54 can adjust the speech data according to the prosody parameter to obtain a successive prosody thereof.
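The translation-and-rearrangement path of this embodiment can be sketched as two small functions. This is a hedged illustration: the one-entry translation dictionary, the sample Chinese clause and the hard-coded adverb-final rearrangement rule are all assumptions standing in for a real translation module and analyzing module.

```python
# Toy translation table standing in for the translation module 52
# (the Chinese clause here is an invented example).
TRANSLATIONS = {"会下雨吗": "will it rain?"}

def translate(second_text):
    """Translate second-language text into the first language; unknown
    input is passed through unchanged."""
    return TRANSLATIONS.get(second_text, second_text)

def rearrange(first_text, translated):
    """Analyzing-module stand-in: append the remaining first-language word
    in the adverb-final English position (an assumed, hard-coded rule)."""
    return (translated.rstrip("?") + " " + first_text + "?").capitalize()

sentence = rearrange("tomorrow", translate("会下雨吗"))
```

After this step a single first-language synthesizer and the prosody processor can handle the whole sentence, which is the point of this variant.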
The above-mentioned embodiments are illustrated with the combination of Chinese and English speech. However, the text-to-speech system and method according to the present invention can be applied to other combinations of different languages.
According to the present invention, the text-to-speech system and method can convert a text string, which is a combination of several languages, into a natural and fluent multi-language synthetic speech through a database of common acoustic units and prosody processing. Moreover, the text-to-speech system and method according to the present invention may further include a translation module, converting such a text string into a natural and fluent synthetic speech through translation and prosody processing. The text-to-speech system and method according to the present invention thus overcome the prior-art drawback of faltering speech in multi-language text-to-speech conversion.
While the invention has been described in terms of what is presently considered to be the most practical and preferred embodiments, it is to be understood that the invention need not be limited to the disclosed embodiments. On the contrary, it is intended to cover various modifications and similar arrangements included within the spirit and scope of the appended claims, which are to be accorded the broadest interpretation so as to encompass all such modifications and similar structures.
| Number | Date | Country | Kind |
|---|---|---|---|
| 093138499 | Dec 2004 | TW | national |