The present invention relates to speech synthesis. In particular, the present invention relates to a multi-lingual speech synthesis system.
Text-to-speech systems have been developed to allow computerized systems to communicate with users through synthesized speech. Some applications include spoken dialog systems, call center services, voice-enabled web and e-mail services, to name a few. Although text-to-speech systems have improved over the past few years, some shortcomings still exist. For instance, many text-to-speech systems are designed for only a single language. However, there are many applications that need a system that can provide speech synthesis of words from multiple languages, and in particular, speech synthesis where words from two or more languages are contained in the same sentence.
Systems, that have been developed to provide speech synthesis for utterances having words from multiple languages, use separate text-to-speech engines to synthesize words from each respective language of the utterance, each engine generating waveforms for the synthesized words. The waveforms are then joined or otherwise outputted successively in order to synthesize the complete utterance. The main drawback of this approach is that voices coming out of the two engines usually sound different. Users are commonly annoyed when hearing such voice utterances, because it appears that two different speakers are speaking. In addition, overall sentence intonation is destroyed, which impairs comprehension.
Accordingly, a system for multi-lingual speech synthesis that addresses at least some of the foregoing disadvantages would be beneficial and improve multi-lingual speech synthesis.
A text processing system for a speech synthesis system receives input text comprising a mixture of at least two languages and provides an output that is suitable for use by a back-end portion of a speech synthesizer. Generally, the text processing system includes language-independent modules and language-dependent modules that perform text processing. This architecture has the advantage of smooth switching between languages and maintaining fluent intonation for mixed-lingual sentences.
Before describing aspects of the present invention, it may be helpful to first describe exemplary computer environments for the invention.
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices. Tasks performed by the programs and modules are described below and with the aid of figures. Those skilled in the art can implement the description and figures herein as processor executable instructions, which can be written on any form of a computer readable media.
With reference to
Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 100.
Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, FR, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation,
The computer 110 may also include other removable/non-removable volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media discussed above and illustrated in
A user may enter commands and information into the computer 110 through input devices such as a keyboard 162, a microphone 163, and a pointing device 161, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 190.
The computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110. The logical connections depicted in
When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
Memory 204 is implemented as non-volatile electronic memory such as random access memory (RAM) with a battery back-up module (not shown) such that information stored in memory 204 is not lost when the general power to mobile device 200 is shut down. A portion of memory 204 is preferably allocated as addressable memory for program execution, while another portion of memory 204 is preferably used for storage, such as to simulate storage on a disk drive.
Memory 204 includes an operating system 212, application programs 214 as well as an object store 216. During operation, operating system 212 is preferably executed by processor 202 from memory 204. Operating system 212, in one preferred embodiment, is a WINDOWS® CE brand operating system commercially available from Microsoft Corporation. Operating system 212 is preferably designed for mobile devices, and implements database features that can be utilized by applications 214 through a set of exposed application programming interfaces and methods. The objects in object store 216 are maintained by applications 214 and operating system 212, at least partially in response to calls to the exposed application programming interfaces and methods.
Communication interface 208 represents numerous devices and technologies that allow mobile device 200 to send and receive information. The devices include wired and wireless modems, satellite receivers and broadcast tuners to name a few. Mobile device 200 can also be directly connected to a computer to exchange data therewith. In such cases, communication interface 208 can be an infrared transceiver or a serial or parallel communication connection, all of which are capable of transmitting streaming information.
Input/output components 206 include a variety of input devices such as a touch-sensitive screen, buttons, rollers, and a microphone as well as a variety of output devices including an audio generator, a vibrating device, and a display. The devices listed above are by way of example and need not all be present on mobile device 200. In addition, other input/output devices may be attached to or found with mobile device 200 within the scope of the present invention.
To further help understand the present invention, it may be helpful to provide a brief description of current speech synthesizers or engines 300 and 302, which are illustrated in
Speech synthesizer 302 also includes the text and prosody analysis module 303 that receives the input text 306 and provides a symbolic description of prosody at output 308. However, as illustrated, front-end portion 304 also includes a prosody prediction module 320 that receives the symbolic description of prosody 308 and provides a numerical description of prosody at output 322. As is known, prosody prediction module 320 takes some high-level prosodic constraints, such as part-of-speech, phrasing, accent and emphasizes, etc., as input and makes predictions on pitch, duration, energy, etc., generating deterministic values for them that comprise output 322. Output 322 is provided to back-end portion 312, which in this form comprises a speech generation module 326 that generates the synthesized speech waveform 314, which has prosody features matching the numerical description of prosody input 322. This can be achieved by setting corresponding parameters in a formant based or LPC based back-end or by applying prosody scaling algorithms such as PSOLA or HNM in a concatenative back-end.
Upon normalization, a morphological analysis module 342 can be used to perform morphological analysis to ascertain plurals, past tense, etc. in the input text. Syntactic/semantic analysis can then be performed by module 344 to identify parts of speech (POS) of the words or to predict syntactic/semantic structure of sentences, if necessary. Further processing can then be performed if desired by module 346 that groups the words into phrases according to the input from module 344 (i.e., the POS tagging or syntactic/semantic structure) or simply by commas, periods, etc. Semantic features including stress, accent, and/or focus are predicted by module 348. Grapheme-to-phoneme conversion module 350 converts the words to phonetic symbols corresponding to proper pronunciation. The output of 303 is the phonetic unit strings with symbolic description of prosody 308.
It should be emphasized that the modules forming text and prosody analysis portion 303 are merely illustrative and are included as necessary to generate the desired output from front-end portion 304 to be used by the back-end portion 312 illustrated in
For multi-lingual text, a speech engine 300 or 302 would be provided for each language of the text to be synthesized. Portions corresponding to each separate language in the text would be provided to the respective single-language speech synthesizer, and processed separately, wherein the outputs 314 would be joined or otherwise successively outputted using suitable hardware. As discussed in the background section, disadvantages include loss of overall sentence intonation and portions of a single sentence appearing to emanate from two or more different speakers.
In the illustrative embodiment, the text and prosody analysis portion 400 contains a language dispatch module that includes a language identifier module 406 and an integrator. The language identifier module 406 receives the input text 402 and includes or associates language identifiers (Ids) or tags to sentences and/or words denoting them appropriately for the language they are used in. In the example illustrated, Chinese characters and English characters use very distinctly different codes to form the input text 402, thus it is relatively easy to identify that part of the input text 402 corresponding to Chinese or corresponding to English. For languages such as French, German or Spanish where common characters may be present in each of the languages, further processing may be needed.
The input text having appropriate language identifiers is then provided to an integrator module 410. Generally, the integrator module 410 manages data flow between the language-independent and language-dependent modules and maintains a unified data flow to ensure appropriate processing upon receipt of the output from each of the modules. Typically, the integrator module 410 first passes the input text having language identifiers to a text-normalization module 412. In the embodiment illustrated, the text-normalization module 412 is a language independent rule interpreter. The module 412 includes two components. One is a pattern identifier, while the other is a pattern interpreter, which converts a matching pattern into a readable text string according to rules. Each rule has two parts, the first part is a definition of a pattern, while the other is the converting rule for the pattern. The definition part can either be shared by both languages or be specified to one of them. The converting rules are typically language specific. If a new language is added, the rule interpreting module does not need to be changed, only new rules for the new language need be added. As appreciated by those skilled in the art, the text-normalization module 412 could precede the language identifier module 410 if appropriate processing is provided in the text-normalization module 412 to identify each of the language words in the input text.
Upon receipt of the output from the text- normalization module 412, the integrator 410 forwards appropriate words and/or phrases for text and prosody analysis to the appropriate language-dependent module. In the illustrated example, a Chinese Mandarin module 420 and an English module 422 are provided. The Chinese module 420 and the English module 422 deal with all language specific processes such as phrasing and grapheme-to-phoneme conversion for both languages, word segmentation for Chinese and abbreviation expansion for English, to name a few. In
In addition to language identifiers, the segments of the input text 402 may include or have associated therewith identifiers denoting their position in the input text 402 such that upon receipt of the outputs from the various language-independent and language-dependent modules, the integrator 410 can reconstruct the proper order of the segments, since not all segments are processed by the same modules. This allows parallel processing and thus faster processing of the input text 402. Of course, processing of the input text 402 can be segment by segment in the order as found in the input text 402.
The outputs from the language-dependent modules are then processed by a unified feature extraction module 430 for prosody and phonetic context. In this manner, overall sentence intonation is not lost since the prosodic and phonetic context will be analyzed for the entire sentence after text and prosody analysis by modules 420 and 422 for Chinese and English segments as appropriate. In the illustrated embodiment, an output 432 of the text and prosody analysis portion 400 is a sequential unit list (including units in both English and Mandarin) with unified feature vectors that include prosodic and phonetic context. Unit concatenation can then be provided in the back-end portion such as illustrated in
In one embodiment, the back-end portion 312 can take the form as illustrated in
All instances for a base unit are clustered using a CART (Classification and Regression Tree) by querying about the prosodic constraints. The splitting criterion for CART is to maximize reduction in the weighted sum of the MSEs (Mean Squared Error) of the three features: the average f0, the dynamic range of f0, and the duration. The MSE of each feature is defined as the mean of the square distances from the feature values of all instances to the mean value of their host leaves. After the trees are grown, instances on the same leaf node have similar prosodic features. Two phonetic constraints, the left and right phonetic context and a smoothness cost are used to assure the continuity of the concatenation between the units. Concatenative cost is defined as the weighted sum of the source-target distances of the seven prosodic constraints, the two phonetic constraints and the smoothness cost. The distance table for each prosodic/phonetic constraint and the weights for all components are first assigned manually and then tuned automatically with the method presented in “Perpetually optimizing the cost function for unit selection in a TTS system for one single run of MOS evaluation”, Proc. of ICSLP'2002, Denver, by H. Peng, Y. Zhao and M. Chu. When synthesizing an utterance, prosodic constraints are first used to find a cluster of instances (a leaf node in the CART tree) for each unit, then, a Viterbi search is used to find the best instance for each unit that will generate the smallest overall concatenative cost. The selected segments are then concatenated one by one to form a synthetic utterance. Preferably, the corpus of units is obtained from a single bilingual speaker. Although the two languages adopt units of different size, they share the same unit selection algorithm and the same set of features for units. Therefore, the back-end portion of the speech synthesizer can process unit sequences in a single language or a mixture of the two languages. Selection of unit instances in accordance with that described above is described in greater detail in U.S. patent application Ser. No. 20020099547A1, entitled “Method and Apparatus for Speech Synthesis Without Prosody Modification” and published Jul. 25, 2002, the content of which is hereby incorporated by reference in its entirety.
Although the present invention has been described with reference to particular embodiments, workers skilled in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of the invention.
Number | Name | Date | Kind |
---|---|---|---|
4718094 | Bahl | Jan 1988 | A |
5146405 | Church | Sep 1992 | A |
5384893 | Hutchins | Jan 1995 | A |
5440481 | Kostoff | Aug 1995 | A |
5592585 | Van Coile et al. | Jan 1997 | A |
5727120 | Van Coile et al. | Mar 1998 | A |
5732395 | Alexander Silverman | Mar 1998 | A |
5839105 | Ostendorf et al. | Nov 1998 | A |
5857169 | Seide | Jan 1999 | A |
5890117 | Silverman | Mar 1999 | A |
5905972 | Huang et al. | May 1999 | A |
5912989 | Watanabe | Jun 1999 | A |
5933806 | Beyerlein | Aug 1999 | A |
5937422 | Nelson | Aug 1999 | A |
6064960 | Bellegarda et al. | May 2000 | A |
6076060 | Lin et al. | Jun 2000 | A |
6101470 | Eide et al. | Aug 2000 | A |
6141642 | Oh | Oct 2000 | A |
6151576 | Warnock et al. | Nov 2000 | A |
6172675 | Ahmad | Jan 2001 | B1 |
6185533 | Holm et al. | Feb 2001 | B1 |
6230131 | Kuhn et al. | May 2001 | B1 |
6401060 | Critchlow et al. | Jun 2002 | B1 |
6499014 | Chihara | Dec 2002 | B1 |
6505158 | Conkie | Jan 2003 | B1 |
6665641 | Coorman et al. | Dec 2003 | B1 |
6708152 | Kivimaki | Mar 2004 | B2 |
6751592 | Shiga | Jun 2004 | B1 |
6829578 | Huang et al. | Dec 2004 | B1 |
6978239 | Chu et al. | Dec 2005 | B2 |
7010489 | Lewis et al. | Mar 2006 | B1 |
20020072908 | Case et al. | Jun 2002 | A1 |
20020103648 | Case et al. | Aug 2002 | A1 |
20020152073 | DeMoortel et al. | Oct 2002 | A1 |
20030208355 | Stylianou et al. | Nov 2003 | A1 |
Number | Date | Country |
---|---|---|
0 984 426 | Mar 2000 | EP |
1213705 | Jun 2002 | EP |
Number | Date | Country | |
---|---|---|---|
20040193398 A1 | Sep 2004 | US |