While the quality of text-to-speech (TTS) synthesis has improved greatly in recent years, various telecommunication applications (e.g. information inquiry, reservation and ordering, and email reading) demand higher synthesis quality than current TTS systems can provide. In particular, with globalization and its accompanying mixing of languages, such applications can benefit from a multilingual TTS system in which one engine can synthesize multiple languages or even mixed languages. Most conventional TTS systems can deal with only a single language, where the sentences of the voice database are pronounced by a single native speaker. Although multilingual text can be read correctly by switching voices or engines at each language change, this is not practical for code-switched text, in which the language changes occur within a sentence as words or phrases. Furthermore, with the widespread use of mobile phones and embedded devices, the footprint of a speech synthesizer becomes a factor for applications based on such devices.
Studies of multilingual TTS systems indicate that phonetic coverage can be achieved by collecting multilingual speech data, but language-specific information (e.g. specialized text analysis) is also required. A global phone set, which uses the smallest phone inventory that covers all phones of the languages involved, has been tried in multilingual or language-independent speech recognition and synthesis. Such an approach adopts phone sharing, with phonetic similarity measured by data-driven clustering methods or by phonetic-articulatory features defined by the International Phonetic Alphabet (IPA). There is also strong interest in small-footprint TTS systems, for which Hidden Markov Model (HMM)-based speech synthesis tends to be more promising. Some HMM synthesizers can have a relatively small footprint (e.g., ≤2 MB), which lends itself to embedded systems. In particular, such HMM synthesizers have been successfully applied to speech synthesis in many individual languages, e.g. English, Japanese and Mandarin. Such an HMM approach has also been applied for multilingual purposes, where an average voice is first trained using mixed speech from several speakers in different languages and the average voice is then adapted to a specific speaker. Consequently, the specific speaker is able to speak all the languages contained in the training data.
Through globalization, English words or phrases embedded in Mandarin utterances are becoming more commonly used among students and educated people in China. However, Mandarin and English belong to different language families; the two languages are largely unrelated in that few phones can be shared between them based on examination of their IPA symbols.
A bilingual (Mandarin-English) TTS is conventionally built from pre-recorded Mandarin and English sentences uttered by a bilingual speaker, where a unit selection module of the system is shared across the two languages while phones from the two languages are not shared with each other. Such an approach has certain shortcomings. The footprint of such a system is large, i.e., about twice the size of a single language system. In practice, it is also not easy to find a sufficient number of professional bilingual speakers to build multiple bilingual voice fonts for various applications.
Various exemplary techniques discussed herein pertain to multilingual TTS systems. Such techniques can reduce a TTS system's footprint compared to existing techniques that require a separate TTS system for each language.
An exemplary method for generating speech based on text in one or more languages includes providing a phone set for two or more languages, training multilingual HMMs where the HMMs include state level sharing across languages, receiving text in one or more of the languages of the multilingual HMMs and generating speech, for the received text, based at least in part on the multilingual HMMs. Other exemplary techniques include mapping between a decision tree for a first language and a decision tree for a second language, and optionally vice versa, and Kullback-Leibler divergence analysis for a multilingual text-to-speech system.
Non-limiting and non-exhaustive embodiments are described with reference to the following figures, wherein like reference numerals refer to like parts throughout the various views unless otherwise specified.
Techniques are described herein for use in multilingual TTS systems. Such techniques may be applied to any of a variety of TTS approaches that use probabilistic models. While various examples are described with respect to HMM-based approaches for English and Mandarin, exemplary techniques may apply broadly to other languages and TTS systems for more than two languages.
Several exemplary approaches for sound sharing are described herein. An approach that uses an IPA-based examination of phones is suitable for finding that some phones from English and Mandarin are sharable. Another exemplary approach demonstrates that sound similarities exist at the level of sub-phonemic productions, which can be sharable as well. Additionally, complex phonemes may be rendered by two or three simple phonemes, and numerous allophones, which are used in specific phonetic contexts, provide more chances for phone sharing between Mandarin and English.
Various exemplary techniques are discussed with respect to context-independence and context-dependence. A particular exemplary technique includes context-dependent HMM state sharing in a bilingual (Mandarin-English) TTS system. Another particular exemplary technique includes state level mapping for new language synthesis without having to rely on speech for a particular speaker in the new language. More specifically, a speaker's speech sounds in another language are mapped to sounds in the new language to generate speech in the new language. Hence, such a method can generate speech for a speaker in a new language without requiring recorded speech of the speaker in the new language. Such a technique synthetically extends the language speaking capabilities of a user.
An exemplary approach is based on a framework of HMM-based speech synthesis. In this framework, spectral envelopes, fundamental frequencies, and state durations are modeled simultaneously by corresponding HMMs. For a given text sequence, speech parameter trajectories and corresponding signals are then generated from trained HMMs in the Maximum Likelihood (ML) sense.
Various exemplary techniques can be used to build an HMM-based bilingual (Mandarin-English) TTS system. A particular exemplary technique includes use of language-specific and language-independent questions designed for clustering states across two languages in one single decision tree. Trial results demonstrate that an exemplary TTS system with context-dependent HMM state sharing across languages outperforms a simple baseline system where two separate language-dependent HMMs are used together. Another exemplary technique includes state mapping across languages based upon the Kullback-Leibler divergence (KLD) to synthesize Mandarin speech using model parameters in an English decision tree. Trial results demonstrate that synthesized Mandarin speech via such an approach is highly intelligible.
An exemplary technique can enhance learning by allowing a student to generate foreign language speech using the student's native language speech sounds. Such a technique uses a mapping, for example, established using a talented bilingual speaker. According to such a technique, the student may more readily comprehend the foreign language when it is synthesized using the student's own speech sounds, albeit from the student's native language. Such a technique optionally includes supplementation with the foreign language; for example, as the student becomes more proficient, the student may provide speech in the foreign language.
The STT method 110 receives energy (e.g., analog to digital conversion to a digital waveform) or a recorded version of energy (e.g., digital waveform file), parameterizes the energy waveform 112 and recognizes text corresponding to the energy waveform 114. The TTS method 120 receives text, performs a text analysis 122, a prosody analysis 124 and then generates an energy waveform 126.
As already mentioned, exemplary techniques described herein pertain primarily to TTS methods and systems and, more specifically, to multilingual TTS methods and systems.
The English method and system 202 and the Mandarin method and system 204 are described together, as the various steps and components are quite similar. The English method and system 202 receive English text 203 and the Mandarin method and system 204 receive Mandarin text 205. TTS methods 220 and 240 perform text analysis 222, 242, prosody analysis 224, 244 and waveform generation 226, 246 to produce waveforms 207, 208. Of course, the specifics of text analysis, for example, differ between English and Mandarin.
The English TTS system 230 includes English phones 232 and English HMMs 234 to generate waveform 207 while the Mandarin TTS system 250 includes Mandarin phones 252 and Mandarin HMMs 254 to generate waveform 208.
As described herein, an exemplary method and system allows for multilingual TTS.
As for building a bilingual, Mandarin and English, TTS system such as the system 330 of
One exemplary approach examines IPA symbols for phones of a first language and phones of a second language for purposes of phone sharing. IPA is an international standard for use in transcribing speech sounds of any spoken language. It classifies phonemes according to their phonetic-articulatory features. IPA fairly accurately represents phonemes and it is often used by classical singers to assist in singing songs in any of a variety of languages. Phonemes of different languages labeled by the same IPA symbol should be considered as the same phoneme when ignoring language-dependent aspects of speech perception.
The exemplary IPA approach and an exemplary Kullback-Leibler divergence (KLD) approach are explained with respect to
In the exemplary IPA approach, which examines IPA symbols, eight consonants, /k/, /p□/, /t□/, /f/, /s/, /m/, /n/ and /l/, and two vowels (ignoring the tone information), /□/ and /a/, can be shared between the two languages. Thus, the IPA approach can determine a shared phone set.
In the exemplary KLD-based approach, a determination block 408 performs a KLD-based analysis by checking EP 410 and MP 420 for sharable phones (SP) 430. The KLD technique provides an information-theoretic measure of (dis)similarity between two probability distributions. When the temporal structure of language HMMs is aligned by dynamic programming, KLD can be further modified to measure the difference between HMMs of two evolving speech sounds.
where μ and Σ are the corresponding mean vectors and covariance matrices, respectively. According to the KLD technique 440, each EP and each MP in block 404 is acoustically represented by a context-independent HMM with 5 emitting states (States 1-5 in
According to the KLD technique 440, the spectral feature 442 used for measuring the KLD between any two given HMMs is the first 24 LSPs out of the 40-order LSP 416 and the first 24 LSPs out of the 40-order LSP 426. The first 24 are chosen because, in general, the most perceptually discriminating spectral information is located in the lower frequency range.
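The measure described above can be sketched in a few lines. The sketch below is illustrative only: it assumes that each HMM state is a single Gaussian with a diagonal covariance and that the two directed divergences are summed to obtain a symmetric score; neither assumption is stated in the text above. The function names and the `n_dims` parameter are hypothetical, and the closed form used is the standard Kullback-Leibler divergence between two Gaussians.

```python
import math
from typing import Sequence


def kl_diag_gauss(mu_p: Sequence[float], var_p: Sequence[float],
                  mu_q: Sequence[float], var_q: Sequence[float]) -> float:
    """KL(p || q) for two Gaussians with diagonal covariances (assumption)."""
    if not (len(mu_p) == len(var_p) == len(mu_q) == len(var_q)):
        raise ValueError("all mean and variance vectors must have equal length")
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        # Per-dimension term of the standard closed form.
        total += math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
    return 0.5 * total


def symmetric_kld(mu_p: Sequence[float], var_p: Sequence[float],
                  mu_q: Sequence[float], var_q: Sequence[float],
                  n_dims: int = 24) -> float:
    """Sum of both directed KLDs over the first n_dims features.

    n_dims=24 mirrors the choice of the first 24 LSPs out of 40.
    """
    p = (list(mu_p[:n_dims]), list(var_p[:n_dims]))
    q = (list(mu_q[:n_dims]), list(var_q[:n_dims]))
    return kl_diag_gauss(*p, *q) + kl_diag_gauss(*q, *p)
```

A score of zero indicates identical state distributions over the retained dimensions; larger scores indicate less similar sounds. Extending the score from single states to whole HMMs (e.g., by summing over aligned states) is left out of this sketch.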
In the KLD example of
While the KLD-based technique of
Mandarin is a tonal language of the Sino-Tibetan family, while English is a stress-timed language of the Indo-European family; hence, the analysis results shown in
From another perspective, many complex phonemes can be well rendered by two or three phonemes (e.g. an English diphthong may be similar to a Mandarin vowel pair). An exemplary method can find sharing of sounds by comparing multiple phone groups of one language to sounds in another language, which may be multiple phone groups as well (see, e.g., the method 700 of
Moreover, as described herein, allophones (e.g., the Initial ‘w’/u/ in Mandarin corresponds to [u] in syllable ‘wo’ and [v] in syllable ‘wei’) provide more chances for phone sharing between Mandarin and English under certain contexts. Therefore, an exemplary method can use context-dependent HMM state level sharing for a bilingual (Mandarin-English) TTS system (see, e.g., the method 800 of
Yet another approach described herein includes state level mapping for new language synthesis without recording data (see, e.g., the method 900 of
In the example of
i) Language-independent questions: e.g. Velar_Plosive, “Does the state belong to velar plosive phones, which contain // (Eng.), /k□/ (Eng.), /k/ (Man.) or /k□/ (Man.)?”
ii) Language-specific questions: e.g. E_Voiced_Stop, “Does the state belong to English voiced stop phones, which contain /b/, /d/ and /□/?”
According to manner and place of articulations, supra-segmental features, etc., questions are constructed so as to tie states of English and Mandarin phone models together.
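As a rough illustration of how the two kinds of questions can tie states across languages, the following sketch represents each question as a set of (language, phone) pairs and splits a pool of states on it. The class names, the `split` helper and the ASCII phone labels ("k", "kh", "g", "b", "d") are hypothetical stand-ins chosen for this example; they are not transcriptions of the IPA symbols in the text above.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class State:
    language: str  # "E" for English, "M" for Mandarin
    phone: str     # illustrative ASCII phone label
    index: int     # position of the state within the phone HMM


@dataclass(frozen=True)
class Question:
    name: str
    members: FrozenSet[Tuple[str, str]]  # (language, phone) pairs answering "yes"

    def ask(self, state: State) -> bool:
        return (state.language, state.phone) in self.members


def split(states: Iterable[State], question: Question) -> Tuple[List[State], List[State]]:
    """Partition states into those answering yes and those answering no."""
    states = list(states)
    yes = [s for s in states if question.ask(s)]
    no = [s for s in states if not question.ask(s)]
    return yes, no


# Language-independent question: phones of both languages appear in one set,
# so English and Mandarin states can fall into the same tree node.
VELAR_PLOSIVE = Question(
    "Velar_Plosive",
    frozenset({("E", "g"), ("E", "k"), ("M", "k"), ("M", "kh")}),
)

# Language-specific question: only English phones are listed.
E_VOICED_STOP = Question(
    "E_Voiced_Stop",
    frozenset({("E", "b"), ("E", "d"), ("E", "g")}),
)
```

In a full clustering procedure, the question chosen at each node would be the one giving the best split under some criterion (e.g., likelihood gain); that selection step is omitted here.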
In the example of
According to the technique 900, a build block 914 builds two language-specific decision trees by using bilingual data recorded by one speaker. Per mapping block 918, each leaf node in the Mandarin decision tree (MT) 920 has a mapped leaf node, in the minimum KLD sense, in the English decision tree (ET) 910. Per mapping block 922, each leaf node in the English decision tree (ET) 910 has a mapped leaf node, in the minimum KLD sense, in the Mandarin decision tree (MT) 920. In the tree diagram, tied, context-dependent state mapping (from Mandarin to English) is shown (MT 920 to ET 910). The directional mapping from Mandarin to English can have more than one leaf node in the Mandarin tree mapped to one leaf node in the English tree. As shown in the diagram, two nodes in the Mandarin tree 920 are mapped into one node in the English tree 910 (see dashed circles). The mapping from English to Mandarin is done similarly but in the reverse direction; for example, for every English leaf node, the technique finds its nearest neighbor, in the minimum KLD sense, among all leaf nodes in the Mandarin tree. A particular node-to-node link in the map may be unidirectional or bidirectional.
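The nearest-neighbor mapping between leaf nodes can be sketched as follows. This is a minimal illustration, not the patented implementation: `map_leaves` and `bidirectional_links` are hypothetical names, leaf models are left abstract, and the `distance` argument stands in for whatever KLD computation is used between two leaf-node distributions.

```python
from typing import Callable, Dict, Hashable, Mapping, Set, Tuple, TypeVar

Model = TypeVar("Model")


def map_leaves(source: Mapping[Hashable, Model],
               target: Mapping[Hashable, Model],
               distance: Callable[[Model, Model], float]) -> Dict[Hashable, Hashable]:
    """Map every source leaf to the target leaf with the smallest distance."""
    if not target:
        raise ValueError("target tree has no leaf nodes")
    mapping: Dict[Hashable, Hashable] = {}
    for s_id, s_model in source.items():
        mapping[s_id] = min(target, key=lambda t_id: distance(s_model, target[t_id]))
    return mapping


def bidirectional_links(a_to_b: Mapping[Hashable, Hashable],
                        b_to_a: Mapping[Hashable, Hashable]) -> Set[Tuple[Hashable, Hashable]]:
    """Return the (a, b) pairs that are each other's nearest neighbor."""
    return {(a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a}
```

Because each direction is computed independently, several source leaves may share one target leaf, and a link found in one direction need not be mirrored in the other.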
With respect to speech synthesis,
A determination block 1024 determines upper bound of KLD between two MSD-HMMs according to the equation of
To synthesize speech in a new language without pre-recorded data from the same voice talent, the mapping established with bilingual data and new monolingual data recorded by a different speaker can be used. For example, a context-dependent state mapping trained from speech data of a bilingual (English-Mandarin) speaker “A” can be used to choose the appropriate states trained from speech data of a different, monolingual Mandarin speaker “B” to synthesize English sentences. In this example, the same structure of decision trees should be used for Mandarin training data from speakers A and B.
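A minimal sketch of this idea follows, under the assumption that the mapping from speaker A is stored as a dictionary from English leaf identifiers to Mandarin leaf identifiers and that speaker B's Mandarin model parameters are indexed by the same Mandarin leaf identifiers (which is why the same tree structure is needed for both speakers). The function name and the data layout are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence, TypeVar

Params = TypeVar("Params")


def select_states(english_leaf_sequence: Sequence[Hashable],
                  english_to_mandarin: Dict[Hashable, Hashable],
                  speaker_b_mandarin: Dict[Hashable, Params]) -> List[Params]:
    """Pick speaker B's Mandarin state parameters for an English leaf sequence.

    english_to_mandarin is the state mapping learned from bilingual speaker A;
    speaker_b_mandarin holds parameters trained on monolingual speaker B.
    """
    selected: List[Params] = []
    for leaf in english_leaf_sequence:
        mandarin_leaf = english_to_mandarin[leaf]           # mapping from speaker A
        selected.append(speaker_b_mandarin[mandarin_leaf])  # voice of speaker B
    return selected
```

The selected parameter sequence would then be passed to parameter generation and waveform synthesis, which are outside the scope of this sketch.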
System 1100 is a direct combination of HMMs (Baseline). Specifically, the system 1100 is a baseline system, where language-specific Mandarin and English HMMs and decision trees are trained separately 1104, 1108. In the synthesis part, input text is first converted into a sequence of contextual phone labels through a bilingual TTS text-analysis frontend 1112 (Microsoft® Mulan software marketed by Microsoft Corporation, Redmond, Wash.). The corresponding parameters of contextual states in HMMs are retrieved via language-specific decision trees 1116. Then LSP, gain and F0 trajectories are generated in the maximum likelihood sense 1120. Finally, speech waveforms are synthesized from the generated parameter trajectories 1124. In synthesizing a mixed-language sentence, depending upon whether the text segment to be synthesized is Mandarin or English, the appropriate language-specific HMMs are chosen to synthesize the corresponding parts of the sentence.
System 1200 includes state sharing across languages. In the system 1200, both 1,000 Mandarin sentences and 1,024 English sentences were used together for training HMMs 1204 and context-dependent state sharing across languages as discussed above was applied. Per a text analysis block 1208, since there are no mixed-language sentences in the training data, the context of phones at a language switching boundary (e.g. the left phone or the right phone), is replaced with the nearest context in the language which the central phone belongs to in the text analysis module. For example, the triphone //(E)−//(C)+/□/(C) will be replaced with /o□/(C)−/□/(C)+/□/(C), where the left context /o□/(C) is the nearest Mandarin substitute for /□/(E) according to the KLD measure. In a synthesis block 1212, decision trees of mixed-languages are used instead of the language-specific ones as in block 1124 of the system 1100.
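The context replacement at a language switching boundary can be sketched as below. The sketch assumes that a table of nearest cross-language substitutes has already been computed (e.g., from KLD scores); the function name, the tuple layout and the ASCII phone labels are hypothetical and do not reproduce the symbols in the example above.

```python
from typing import Dict, Tuple

Phone = Tuple[str, str]  # (illustrative phone label, language tag such as "E" or "C")


def normalize_triphone(left: Phone, center: Phone, right: Phone,
                       nearest: Dict[Tuple[Phone, str], Phone]
                       ) -> Tuple[Phone, Phone, Phone]:
    """Replace context phones from another language with their nearest
    substitute in the language of the central phone.

    nearest maps (phone, target_language) to the substitute phone and is
    assumed to be precomputed.
    """
    target_language = center[1]

    def fix(phone: Phone) -> Phone:
        if phone[1] == target_language:
            return phone  # same language as the central phone: keep as is
        return nearest[(phone, target_language)]

    return fix(left), center, fix(right)
```

For example, with a table entry mapping an English left context to its nearest Mandarin substitute, a boundary triphone becomes a purely Mandarin triphone that the trained models can match.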
System 1300 includes state mapping across languages. In this system, training of Mandarin HMMs 1304 and English HMMs 1308 occurs followed by building two language-specific decision trees 1312 (see, e.g., ET 910 and MT 920 of
Synthesis quality is measured objectively in terms of distortions between original speech and speech synthesized by the system 1100 and the system 1200. Since the predicted HMM state durations of generated utterances are in general not the same as those of original speech, the trials measured the root mean squared error (RMSE) of phone durations of synthesized speech. Spectra and pitch distortions were then measured between original speech and synthesized speech where the state durations of the original speech (obtained by forced alignment) were used for speech generation. In this way, both spectrum and pitch are compared on a frame-synchronous basis between the original and synthesized utterances.
Table 1410 shows the averaged log spectrum distance and the RMSE of F0 and of phone durations, evaluated on 100 test sentences (50 Mandarin and 50 English) generated by the system 1100 and the system 1200. The data indicate that the differences in distortion between the system 1100 and the system 1200, in terms of log spectrum distance and RMSE of F0 and duration, are negligibly small.
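For reference, the RMSE used for phone durations and F0 can be computed as in the short sketch below. The function name is hypothetical, and details such as the unit of duration or the handling of unvoiced frames for F0 are not specified in the text above and are left to the caller.

```python
import math
from typing import Sequence


def rmse(reference: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between two equal-length sequences."""
    if len(reference) != len(predicted):
        raise ValueError("sequences must have the same length")
    if not reference:
        raise ValueError("sequences must not be empty")
    squared = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared) / len(squared))
```

The same function applies to frame-synchronous F0 values once the original state durations have been imposed on the synthesized utterance, since both sequences then have the same number of frames.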
The plot 1420 provides results of a subjective evaluation. Informal listening to the monolingual sentences synthesized by the system 1100 and the system 1200 confirms the objective measures shown in the table 1410: i.e. there is hardly any difference, subjective or objective, in 100 sentences (50 Mandarin, 50 English) synthesized by the systems 1100 and 1200.
Specifically, the results of the plot 1420 are from the 50 mixed-language sentences generated by the two systems 1100 and 1200 as evaluated subjectively in an AB preference test by nine subjects. The preference score of the system 1200 (60.2%) is significantly higher than that of the system 1100 (39.8%) (α=0.001, CI=[0.1085, 0.3004]). The main perceptually noticeable difference in the paired sentences synthesized by the systems 1100 and 1200 is at the transitions between English and Chinese words in the mixed-language sentences. State sharing through tied states across Mandarin and English in the system 1200 helps to alleviate the problem of segmental and supra-segmental discontinuities between Mandarin and English transitions. Since all training sentences are either exclusively Chinese or English, there is no specific training data to train such language-switching phenomena. As a result, the system 1100, without any state sharing across English and Mandarin, is more prone to the synthesis artifacts at the switches of English and Chinese words.
Overall, results from the trials indicate that system 1200, which is obtained via efficient state tying across different languages and with a significantly smaller HMM model size than the system 1100, can produce the same synthesis quality for non-mixed language sentences and better synthesis quality for mixed-language ones.
With respect to the system 1300, fifty Mandarin test sentences were synthesized by English HMMs. Five subjects were asked to transcribe the 50 synthesized sentences to evaluate their intelligibility. A Chinese character accuracy of 93.9% is obtained.
An example of F0 trajectories predicted by the system 1100 (dotted line) and the system 1300 (solid line) are shown in plot 1430 of
As described herein, various exemplary techniques are used to build exemplary HMM-based bilingual (Mandarin-English) TTS systems. The trial results show that the exemplary TTS system 1200 with context-dependent HMM state sharing across languages outperforms the simple baseline system 1100 where two language-dependent HMMs are used together. In addition, state mapping across languages based upon the Kullback-Leibler divergence can be used to synthesize Mandarin speech using model parameters in an English decision tree and the trial results show that the synthesized Mandarin speech is highly intelligible.
According to the technique 1370, a provision block 1374 provides the voice of a talented speaker that is fluent in language 1 and language 2 where language 1 is understood (e.g., native) by the ordinary speaker and where language 2 is not fully understood (e.g., foreign) by the ordinary speaker. A map block 1378 maps leaf nodes for language 1 to “nearest neighbor” leaf nodes for language 2 for the voice of the talented speaker. As the talented speaker can provide “native” sounds in both languages, the mapping can more accurately map similarities between sounds used in language 1 and sounds used in language 2.
The technique 1370 continues in provision block 1382 where the voice of the ordinary speaker in language 1 is provided. An association block 1386 associates the provided voice sounds of the ordinary speaker with the appropriate leaf nodes for language 1. As a map already exists, as established using the talented speaker's voice, between language 1 sounds and language 2 sounds, an exemplary system can now generate at least some language 2 speech using the ordinary speaker's sounds from language 1.
For purposes of TTS, a provision block 1390 provides text in language 2, which is, for example, the language “foreign” to the ordinary speaker, and a generation block 1394 generates speech in language 2 using the map and the voice (e.g., speech sounds) of the ordinary speaker in language 1. Thus, the technique 1370 extends the speech abilities of the ordinary speaker to language 2.
In the example of
In the example of
In block 1478, the student trains an exemplary TTS system in the student's native language where the TTS system maps the student's speech sounds to the foreign language. To more fully comprehend the speech of the teacher and hence the foreign language, per block 1482, the student enters text for the uttered phrase (e.g., “the grass is green”). In a generation block 1486, the TTS system generates the foreign language speech using the student's speech sounds, which are more familiar to the student's ear. Consequently, the student more readily comprehends the teacher's utterance. Further, the TTS system may display or otherwise output a listing of sounds (e.g., phonetically or as words, etc.) such that the student can more readily pronounce the phrase of interest (i.e., per the entered text of block 1482). The technique 1470 can provide a student with feedback in a manner that can enhance learning of a language.
In the exemplary techniques 1370 and 1470, sounds may be phones, sub-phones, etc. As already explained, at the sub-phone level mapping may occur more readily or accurately, depending on the similarity criterion (or criteria) used. An exemplary technique may use a combination of sounds. For example, phones, sub-phones, complex phones, phone pairs, etc., may be used to increase mapping and more broadly cover the range of sounds for a language or languages.
An exemplary method for generating speech based on text in one or more languages, implemented at least in part by a computer, includes providing a phone set for two or more languages, training multilingual HMMs where the HMMs include state level sharing across languages, receiving text in one or more of the languages of the multilingual HMMs and generating speech, for the received text, based at least in part on the multilingual HMMs. Such a method optionally includes context-dependent states. Such a method optionally includes clustering states into a decision tree, for example, where the clustering may use a language-independent question and/or a language-specific question.
An exemplary method for generating speech based on text in one or more languages, implemented at least in part by a computer, includes building a first language-specific decision tree, building a second language-specific decision tree, mapping a leaf node from the first tree to a leaf node of the second tree, mapping a leaf node from the second tree to a leaf node of the first tree, receiving text in one or more of the first language and the second language and generating speech, for the received text, based at least in part on the mapping of a leaf node from the first tree to a leaf node of the second tree and/or the mapping of a leaf node from the second tree to a leaf node of the first tree. Such a method optionally uses a KLD technique for mapping. Such a method optionally includes multiple leaf nodes of one decision tree that map to a single leaf node of another decision tree. Such a method optionally generates speech without using recorded data. Such a method may use unidirectional mapping where, for example, mapping only exists from language 1 to language 2 or only exists from language 2 to language 1.
An exemplary method for reducing memory size of a multilingual TTS system, implemented at least in part by a computer, includes providing an HMM for a sound in a first language, providing an HMM for a sound in a second language, determining line spectral pairs for the sound in the first language, determining line spectral pairs for the sound in the second language, calculating a KLD score based on the line spectral pairs for the sound in the first language and the sound in the second language, where the KLD score indicates similarity/dissimilarity between the sound in the first language and the sound in the second language, and building a multilingual HMM-based TTS system where the TTS system comprises shared sounds based on KLD scores. In such a method, the sound in the first language may be a phone, a sub-phone, a complex phone, a phone multiple, etc., and the sound in the second language may be a phone, a sub-phone, a complex phone, a phone multiple, etc. In such a method, a sound may be a context-dependent sound.
The computing device shown in
With reference to
The operating system 1505 may include a component-based framework 1520 that supports components (including properties and events), objects, inheritance, polymorphism, reflection, and provides an object-oriented component-based application programming interface (API), such as that of the .NET™ Framework manufactured by Microsoft Corporation, Redmond, Wash.
Computing device 1500 may have additional features or functionality. For example, computing device 1500 may also include additional data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape. Such additional storage is illustrated in
Computing device 1500 may also contain communication connections 1516 that allow the device to communicate with other computing devices 1518, such as over a network. Communication connection(s) 1516 is one example of communication media. Communication media may typically be embodied by computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. The term computer readable media as used herein includes both storage media and communication media.
Various modules and techniques may be described herein in the general context of computer-executable instructions, such as program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. These program modules and the like may be executed as native code or may be downloaded and executed, such as in a virtual machine or other just-in-time compilation execution environment. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments.
An implementation of these modules and techniques may be stored on or transmitted across some form of computer readable media. Computer readable media can be any available media that can be accessed by a computer. By way of example, and not limitation, computer readable media may comprise “computer storage media” and “communications media.”
An exemplary computing device may include a processor, a user input mechanism (e.g., a mouse, a stylus, a scroll pad, etc.), a speaker, a display and control logic implemented at least in part by the processor to implement one or more of the various exemplary methods described herein for TTS. For TTS, such a device may be a cellular telephone or generally a handheld computer.
One skilled in the relevant art may recognize, however, that the techniques described herein may be practiced without one or more of the specific details, or with other methods, resources, materials, etc. In other instances, well known structures, resources, or operations have not been shown or described in detail merely to avoid obscuring aspects of various exemplary techniques.
While various examples and applications have been illustrated and described, it is to be understood that the techniques are not limited to the precise configuration and resources described above. Various modifications, changes, and variations apparent to those skilled in the art may be made in the arrangement, operation, and details of the methods, systems, etc., disclosed herein without departing from their practical scope.