The present invention deals with speech properties. More specifically, the present invention deals with unit inventories in text-to-speech systems.
Speech signal generators or synthesizers in a text-to-speech (TTS) system can be classified into three distinct categories: articulatory synthesizers; formant synthesizers; and concatenative synthesizers. Articulatory synthesizers are based on the physics of sound generation in the vocal apparatus. Individual parameters related to the position and movement of vocal cords are provided. The sound generated therefrom is determined according to physics. In view of the complexity of the physics, practical applications of this type of synthesizer are considered to be far off.
Formant synthesizers do not use equations of physics to generate speech, but rather, model acoustic features or the spectra of the speech signal, and use a set of rules to generate speech. In a formant synthesizer, a phoneme is modeled with formants wherein each formant has a distinct frequency “trajectory” and a distinct bandwidth which varies over the duration of the phoneme. An audio signal is synthesized by using the frequency and bandwidth trajectories to control a formant synthesizer. While the formant synthesizer can achieve high intelligibility, its “naturalness” is typically low, since it is very difficult to accurately describe the process of speech generation in a set of rules. In some systems, in order to mimic natural speech, the synthetic pronunciation of each phoneme is determined by a set of rules which analyzes the phonetic context of the phoneme. U.S. Pat. No. 4,979,216 issued to Malsheen et al. describes a text-to-speech synthesis system and method using context dependent vowel allophones.
Concatenation systems and methods for generating text-to-speech operate under an entirely different principle. Concatenative synthesis uses pre-recorded actual speech forming a large database or corpus. The corpus is segmented based on phonological features of a language. Commonly, the phonological features include transitions from one phoneme to at least one other phoneme. For instance, the phonemes can be segmented into diphone units, syllables or even words. Diphone concatenation systems are particularly prominent. A diphone is an acoustic unit which extends from the middle of one phoneme to the middle of the next phoneme. In other words, the diphone includes the transition between each partial phoneme. It is believed that synthesis using concatenation of diphones provides good voice quality since each diphone is concatenated with adjoining diphones where the beginning and the ending phonemes have reached steady state, and since each diphone records the actual transition from phoneme to phoneme.
In a concatenative text-to-speech (TTS) system, speech output is generated by concatenating small pre-stored speech segments one by one. Most state-of-the-art TTS systems adopt corpus-driven approaches, called unit selection, due to their capability to generate highly natural speech. In these systems, a set of “atom units” is defined, that is, the smallest constituents in the concatenation procedure, which cannot be segmented further. Typically, many instances of each unit, with phonetic and prosodic variations, are kept in a very large unit inventory, and a unit selection algorithm is used to select the most suitable unit sequence by minimizing a cost function.
Defining a suitable set of atom units is very important for such systems. There is always a balance between two conflicting requirements for the unit inventory. On the one hand, in order to get natural prosody, smaller units are preferred so that a pre-recorded unit inventory can cover as many prosodic variations of each unit as possible. On the other hand, in order to make concatenated utterances smooth, larger units are preferred because they reduce the likelihood of an unsmooth concatenation in the synthesized utterances. Strategies for defining the atom unit differ among languages due to the different phonological characteristics of languages. For languages that have a relatively small syllable set, such as Chinese, which contains fewer than 2000 syllables, syllables are often used as the atom units. However, using syllables as atom units becomes somewhat impractical for languages that have too many syllables to enumerate effectively. For example, English contains more than 20,000 possible syllables. This makes it difficult to generate a closed list of syllables for English. In such a language, smaller atom units such as the phoneme, the diphone or a mixture of the two are often adopted. However, using such small units has many shortcomings.
Using smaller units means more units per utterance and more instances per unit. This results in a much larger search space for unit selection, so more search time is required during speech generation.
Smaller units also cause more difficulties in precise unit segmentation, which is crucial for the quality of synthesized speech. For example, in English, the word ‘yes’ consists of three phones, /j/, /e/ and /s/, where the boundary between /e/ and /s/ can be labeled easily, yet it is difficult to separate /j/ from /e/ due to the flat transition between their formant tracks. Moreover, experimentation shows that if the co-articulation between two phones is strong, it is difficult to smoothly concatenate two segments selected from different locations during the synthesis phase.
Therefore, a method is desired for defining a set of atom units having a size between a phone and a syllable, to increase the overall efficiency of the text-to-speech system in languages with large syllable sets, such as English.
One embodiment of the present invention is directed towards a method for defining a set of atom units for use in the unit inventory of a text-to-speech synthesizer.
A spoken text along with a phonetic transcription of the text is received. Then a list of monophones for the target language is obtained. These monophones form the basis of the unit inventory for the language and the speaker. Next the method identifies a set of common multiphones for the language. These common multiphones form the atom units for the language and are sized between a phone and a syllable. These common multiphones are then added to the unit inventory for the target language. The atom units are of varying sizes, and are not merely diphones, triphones, or quinphones as used in previous systems.
In determining the common multiphones to add to the unit inventory, the present invention uses an expanded nucleus slice for each syllable in the lexicon. The expanded nucleus slice is sized between a phone and a full syllable. In one embodiment the common multiphones that are selected are those multiphones whose frequency of occurrence in the training data exceeds a threshold value. The common multiphones are then added to the unit inventory.
The remaining multiphones are considered non-common. The non-common multiphones are decomposed according to a set of rules until a sequence that is composed of one of the common multiphones and several monophones at its margin, or a list of monophones is identified. If the non-common multiphone cannot be decomposed to match either a sequence that is composed of one of the common multiphones and several monophones at its margin, or a list of monophones, it is added to the unit inventory. If the decomposed slice is matched with an entry in the unit inventory, the process of decomposing is stopped.
During the process of decomposition, any phones that are removed from the slice are added to the adjoining slice. The newly formed slices are then decomposed to determine if the newly formed slice should be included in the unit inventory.
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
With reference to
Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation,
The computer 110 may also include other removable/non-removable volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media discussed above and illustrated in
A user may enter commands and information into the computer 110 through input devices such as a keyboard 162, a microphone 163, and a pointing device 161, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 195.
The computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110. The logical connections depicted in
When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
An exemplary text-to-speech synthesizer 200 is illustrated in
The unit concatenation module 230 receives the phoneme string and constructs corresponding synthetic speech, which is provided as an output signal 260 to a digital-to-analog converter 270, which in turn, provides an analog signal 275 to the speaker 83.
Based on the string input from the text analyzer 220, the unit concatenation module 230 selects representative instances from a unit inventory 240 after working through corresponding decision trees stored at 250. The unit inventory 240 is a store of representative context-dependent phoneme-based units of actual acoustic data. In one embodiment, triphones (a phoneme with its one immediately preceding and succeeding phonemes as the context) are used for the context-dependent phoneme-based units. Other forms of phoneme-based units include quinphones and diphones or other n-phones. The decision trees 250 are accessed to determine which acoustic instance of a phoneme-based unit is to be used by the unit concatenation module 230. In one embodiment, the phoneme-based unit is one phoneme so a total of 45 phoneme decision trees are created and stored at 250. However, other numbers of phoneme decision trees can be used.
The decision tree 250 is illustratively a binary tree that is grown by splitting a root node and each of a succession of nodes with a linguistic question associated with each node, for instance, a question asking about the category of the left (preceding) or right (following) phoneme. The linguistic questions about a phoneme's left or right context are usually generated by an expert in linguistics and are designed to capture linguistic classes of contextual effects. In one embodiment, Hidden Markov Models (HMMs) are created for each unique context-dependent phoneme-based unit. One illustrative example of creating the unit inventory 240 and the decision trees 250 is provided in U.S. Pat. No. 6,163,769 entitled “TEXT-TO-SPEECH USING CLUSTERED CONTEXT-DEPENDENT PHONEME-BASED UNITS”, which is assigned to the same assignee as the present application. However, other methods can be used.
As stated above, the unit concatenation module 230 selects the representative instance from the unit inventory 240 after working through the decision trees 250. During run time, the unit concatenation module 230 can either concatenate the best preselected phoneme-based unit or dynamically select the best phoneme-based unit available from a plurality of instances that minimizes a joint distortion function. In one embodiment, the joint distortion function is a combination of HMM score, phoneme-based unit concatenation distortion and prosody mismatch distortion.
The text-to-speech synthesizer 200 can be embodied in the computer 50 wherein the text analyzer 220 and the unit concatenation module 230 are hardware or software modules, and where the unit inventory 240 and the decision trees 250 can be stored using any of the storage devices described with respect to computer 50. As appreciated by those skilled in the art, other forms of text-to-speech synthesizers can be used. Besides the concatenative synthesizer 200 described above, articulatory synthesizers and formant synthesizers can also be used to provide audio proofreading feedback.
The first step of the process is to receive or identify a complete list of monophones for the target language. This is illustrated at step 310. The target language can be any spoken language, such as Chinese, English, French, German, Hindi, Italian, Japanese or Spanish. Next, a spoken lexicon or speech corpus in the target language is received. The lexicon provided includes a phonetic transcription for each of the words that comprise the lexicon. This is illustrated at step 320. However, it should be noted that the order of steps 310 and 320 can be reversed.
Once the speech lexicon and monophones are received, a set of common multiphone units is identified. Common multiphone units are units that are sized between a phone and a syllable. This is illustrated at step 330. The identified common multiphones are then added to the unit inventory for the target language. This is illustrated at step 340.
The first step in identifying the common multiphone units is to decompose each syllable contained in the lexicon into a plurality of slices. This is illustrated at step 410. In one embodiment the syllable is broken down into three slices. However, other numbers of slices can be used. For purposes of this discussion these slices are referred to as an onset slice, a nucleus slice, and a coda slice.
This view provides better results as co-articulation between vowels and other sonorants is typically strong while the boundaries between such phonemes are often difficult to determine. By grouping the vowel and surrounding sonorants into the same unit, the unit segmentation problem is generally easier to manage, and the likelihood of generating an unsmooth concatenation for the syllable is reduced. The formation of the nucleus slice is illustrated at step 415.
Once the nucleus slice is determined at step 415, the onset and coda slices for the syllable are determined at step 420. At this step all consonants in the syllable occurring before the nucleus slice 515 form the onset slice 513 and all consonants occurring after the nucleus slice 515 form the coda slice 517. However, other methods for generating a slice can be used. While the present invention discusses three slices, only the nucleus slice is needed as all syllables have a nucleus, but may not have a coda slice such as in “shoe”, or may not have an onset slice such as in “eight”.
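By way of illustration only, the slicing of steps 415 and 420 can be sketched as follows. This is a minimal sketch and not the claimed implementation: the phone symbols, the VOWELS and SONORANTS sets, and the function name slice_syllable are assumptions introduced for the example, and a practical system would derive the phone classes from the monophone list received at step 310.

```python
# Hypothetical phone classes (assumed for illustration only).
VOWELS = {"aa", "ae", "ah", "ao", "aw", "ay", "eh", "er", "ey",
          "ih", "iy", "ow", "oy", "uh", "uw"}
SONORANTS = {"m", "n", "ng", "l", "r", "w", "y"}


def slice_syllable(phones):
    """Split one syllable (a list of phone symbols) into
    (onset, nucleus, coda) lists of phones."""
    vowel_positions = [i for i, p in enumerate(phones) if p in VOWELS]
    if not vowel_positions:
        raise ValueError("syllable has no vowel")
    start, end = vowel_positions[0], vowel_positions[-1]
    # Expand the nucleus over sonorant consonants adjacent to the vowel.
    while start > 0 and phones[start - 1] in SONORANTS:
        start -= 1
    while end < len(phones) - 1 and phones[end + 1] in SONORANTS:
        end += 1
    # Everything before the nucleus is the onset; everything after is the coda.
    return phones[:start], phones[start:end + 1], phones[end + 1:]


# "shoe" has no coda slice and "eight" has no onset slice.
print(slice_syllable(["sh", "uw"]))
print(slice_syllable(["ey", "t"]))
print(slice_syllable(["p", "r", "ih", "n", "t"]))
```

In this sketch the onset or coda slice is simply an empty list when the syllable has no consonants on that side of the nucleus.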
The next step is to generate an initial slice set for the target language. This is illustrated at step 430. In order to generate a full list of possible slices for the target language, a lexicon containing word entries with pronunciations in that language is needed. This lexicon corresponds to the lexicon obtained at step 320 in
Table 1 illustrates an example of a portion of an English lexicon which can be used by the present invention. All of the syllables in the lexicon are decomposed into one to three slices according to the list of phonemes received at step 310 in
Once the lexicon has been decomposed into slices, a set of common slices is identified. This is illustrated at step 440. The common slices that are not already in the unit inventory, based on the obtained list of phones, are added to the unit inventory at step 450. The present invention then decomposes the non-common slices according to a set of rules until a sequence that is composed of one of the common multiphones and several monophones at its margin, or a list of monophones, is identified. This is illustrated at step 460. Non-common slices are only added to the unit inventory if it is not possible to decompose the slice into an atom unit that matches an atom unit already in the unit inventory either as a phone or common multiphone slice. The process of adding slices or atom units to the unit inventory is discussed in greater detail with respect to
In an ideal environment where storage size of the unit inventory is not an issue it is desirable to use the slice set developed at step 430 as the atom unit set for the unit inventory. However, it has been found that some slices in the set have very low frequency and provide very little to the overall unit inventory. In other words, these slices are those that are found in infrequently used words or words that are not native to the target language. To increase the efficiency of the unit inventory, these non-common slices should not be treated as a single unit. Therefore, the present invention takes these non-common slices and breaks the slices into smaller slices. This process is also called decomposition of the slice. However, the non-common slices must first be identified.
In order to identify the non-common slices the present invention determines the frequency of each slice in the set of initial slices. This is illustrated at step 610. In one embodiment the slice's frequency is equal to the total number of words in the speech corpus or lexicon having the slice. However, as the slice set is used as a portion of the atom units in the unit inventory it is desirable to verify that each slice has appeared enough times in the speech corpus or lexicon prior to adding the slice to the unit inventory. Therefore, in one embodiment the present invention takes into account the frequency of the word in the speech corpus.
Next the slices are sorted based on the frequency or number of occurrences of the slice in the speech corpus. Sorting the slices in the initial list in order of frequency often reveals that the distribution of the slices is uneven. That is, some slices occur much more frequently than others. For example, in English, the cumulative frequency of the top 50% of the slices represents as many as 99% of the total occurrences of all slices in the speech corpus. The sorting of the slices is illustrated at step 620.
Once the slices have been sorted in the order determined above at step 620, the present invention identifies those slices whose frequency of occurrence exceeds a threshold value. This is illustrated at step 630. Depending on the configuration of the system the threshold value can be set differently. In one embodiment those slices that occur more than a set number of times, such as 12, are considered common slices. In another embodiment those slices that represent a set percentage of the total slices are considered common. Typically in this situation, the percentage will be significantly less than one percent. Those slices identified as common are added to the unit inventory at step 640.
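Steps 610 through 630 can be sketched, for illustration only, as a frequency count followed by a threshold test. The function names, the representation of a slice as a tuple of phones, and the default weight of 1 for words missing from the word counts are assumptions made for this example; the threshold of 12 is the example value given above.

```python
from collections import Counter


def count_slice_frequencies(lexicon, word_counts):
    """lexicon: word -> list of slices (each slice a tuple of phones).
    word_counts: word -> occurrences of the word in the speech corpus.
    Words missing from word_counts are assumed to occur once."""
    freq = Counter()
    for word, slices in lexicon.items():
        weight = word_counts.get(word, 1)
        for s in slices:
            freq[s] += weight
    return freq


def split_common(freq, threshold=12):
    """Sort slices by descending frequency and separate those whose
    frequency exceeds the threshold from the remaining slices."""
    ranked = sorted(freq.items(), key=lambda item: item[1], reverse=True)
    common = [s for s, n in ranked if n > threshold]
    non_common = [s for s, n in ranked if n <= threshold]
    return common, non_common


# Toy data, invented purely to exercise the functions.
lexicon = {
    "print": [("p",), ("r", "ih", "n"), ("t",)],
    "tin": [("t",), ("ih", "n")],
}
word_counts = {"print": 20, "tin": 5}
freq = count_slice_frequencies(lexicon, word_counts)
common, non_common = split_common(freq)
print(common)
print(non_common)
```

A percentage-based threshold, as in the alternative embodiment, could be substituted by comparing each count against a fraction of the sum of all counts.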
Next the non-common slices are decomposed into a sequence of a common slice plus monophones or a sequence of monophones. There are several methods that can be used to decompose non-common slices. One method is to construct a look-up table to map the decomposing operations. A second method could split the slices into phones. However, in one embodiment of the present invention a rule-based method, which combines the statistics over the corpus script and human prior phonology knowledge, is used. The basic idea behind this method is to re-compose the odd target phone cluster with a core slice plus other marginal mono-phones. In other words, the present invention determines how to truncate a phone cluster based on its heading or tailing phone, according to a set of truncating priority rules, until a residual set of the phone cluster is covered by the defined slice set, or no further truncation can occur. One example of the truncation is discussed with respect to
The first step in this process is the decomposition of nucleus slices. The format of a nucleus slice can be represented as:
[sonorant consonant cluster] xx [sonorant consonant cluster]
where “xx” denotes a vowel in the nucleus. As discussed above, some non-common nucleus slices should be truncated into a core nucleus slice plus other marginal mono-phones as illustrated below:
[sonorant *] core nucleus slice [sonorant *]
For the nuclei outlying the core nucleus slice set, the slice is truncated on its heading or tailing phone, according to a set of truncating priority rules, until the residual is covered by the core nucleus slice set. In one embodiment the truncating priority is based on the phonetic and phonologic knowledge of the language. However, other truncation processes can be used. This process does not guarantee uniformity for all languages, but provides sufficient coverage for the language.
The first step in the exemplary truncation rules is to determine if a left nasal such as [m n ng] is present in the slice. This is illustrated at step 710. If the left nasal is present the system truncates the nasal off of the slice. If the nasal is not present the system determines if a right nasal, such as [m n ng] is present in the slice. This is illustrated at step 720. If the right nasal is present the system truncates the right nasal from the slice.
If the right nasal is not present the system determines if a right glide, such as [y w], is present in the slice. This is illustrated at step 730. If the right glide is present the system removes the glide from the slice. If the right glide is not present in the slice the system determines if the slice contains a left lateral, such as [l r]. This is illustrated at step 740. If the left lateral is present in the slice the left lateral is removed from the slice.
If a left lateral is not present in the slice the system determines if there is a right “l” sound present in the slice. This is illustrated at step 750. If the right “l” sound is present in the slice, it is removed from the slice. If the right “l” is not present in the slice the system determines if there is a left glide, such as [y w], present in the slice. This is illustrated at step 760. If a left glide is present it is removed from the slice.
If a left glide is not present in the slice the system determines if there is a right “r” present in the slice. This is illustrated at step 770. If there is a right “r” present in the slice, it is removed from the slice. If the system proceeds through the entire list of rules for truncating the slice, the slice can, according to one embodiment, be added to the unit inventory at step 775.
The truncation of the slice is illustrated at step 780. At this step the phone that was identified in the rules is removed from the slice, and the remaining slice is reformed. Next the remaining phone cluster is compared against the slices in the unit inventory. This is illustrated at step 790. If the new phone cluster is not present in the unit inventory, the truncation process will be repeated until the remaining phone cluster is either matched with a cluster in the unit inventory or the system completes all of the truncating rules. The portion of the phone cluster that is removed from the slice is treated as either a new onset or a new coda slice. In an alternative embodiment the removed phones are added to the adjoining onset or coda slice. This is illustrated at step 795.
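The exemplary truncation rules of steps 710 through 795 can be sketched as a prioritized rule table applied repeatedly to the edges of a nucleus slice. This is an illustrative sketch only: the names TRUNCATION_RULES, first_matching_side and decompose_nucleus, the tuple representation of slices, and the guard that never truncates a one-phone slice are assumptions made for the example. The phone classes follow the examples given above.

```python
NASALS = {"m", "n", "ng"}
GLIDES = {"y", "w"}
LATERALS = {"l", "r"}  # grouping as given in the exemplary rules

# (side, phone set) in truncating priority order, steps 710 to 770.
TRUNCATION_RULES = [
    ("left", NASALS),
    ("right", NASALS),
    ("right", GLIDES),
    ("left", LATERALS),
    ("right", {"l"}),
    ("left", GLIDES),
    ("right", {"r"}),
]


def first_matching_side(core):
    """Return the side named by the highest priority rule that matches
    an edge phone of core, or None if no rule applies."""
    if len(core) < 2:  # assumed guard: keep at least one phone
        return None
    for side, phones in TRUNCATION_RULES:
        edge = core[0] if side == "left" else core[-1]
        if edge in phones:
            return side
    return None


def decompose_nucleus(nucleus, inventory):
    """Truncate edge phones until the residual is in the inventory or
    no rule applies. Returns (left_removed, core, right_removed, matched)."""
    core = list(nucleus)
    left, right = [], []
    while tuple(core) not in inventory:
        side = first_matching_side(core)
        if side is None:
            return left, core, right, False
        if side == "left":
            left.append(core.pop(0))
        else:
            right.insert(0, core.pop())
    return left, core, right, True


# Hypothetical inventory containing a single core nucleus slice.
print(decompose_nucleus(("r", "ih", "n"), {("ih",)}))
```

When the final element of the result is False, the residual core matched no inventory entry and no rule applied, which corresponds to the case in which the slice can be added to the unit inventory at step 775. The removed phones in the first and third elements would become, or be merged into, the adjoining onset and coda slices.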
Since the set of nucleus slices is changed, and the onset and coda slices are regenerated, it is necessary to decompose these slices as well. In a process similar to that illustrated above for the nucleus slice, only high frequency slices in the onset and coda slice sets are kept as single units; the others are truncated. For example, in English, only some high frequency consonant clusters in the onset part, such as /st/ and /sp/, are treated as one slice; all others are split into mono-phones. This is illustrated as step 650 of
The final step of the process is to verify the coverage of the slice set. This is illustrated at step 660. At this step the process verifies that any syllable present in the language can be formed from slices, or combinations of slices, in the unit inventory. This is especially important for those syllables that do not appear in the speech corpus that was used for counting the frequencies of occurrences. Therefore it is desirable that the set of atom units in the unit inventory includes all mono-phones for the target language. Many onset, nucleus and coda slices are mono-phones, as are the marginal truncated mono-phones, which makes this test an easy one. If all of the monophones for the language are not present in the unit inventory, the frequency threshold for the three types of slices can be increased respectively until all monophones for the language are included in the unit inventory.
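The coverage test of step 660 can be sketched, under stated assumptions, as a set difference between the monophone list and the single-phone entries of the unit inventory. The function name missing_monophones and the tuple representation of units are hypothetical and serve only to illustrate the check.

```python
def missing_monophones(monophones, unit_inventory):
    """Return, in sorted order, the monophones that do not appear
    as single-phone units in the unit inventory."""
    singles = {unit[0] for unit in unit_inventory if len(unit) == 1}
    return sorted(set(monophones) - singles)


# Toy inventory, invented for illustration: "n" occurs only inside a
# multiphone unit, so it is reported as missing.
inventory = {("ih", "n"), ("t",), ("ih",)}
print(missing_monophones(["ih", "n", "t"], inventory))
```

An empty result indicates that every monophone of the target language is available as an atom unit, so any syllable can be formed at least from monophones.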
Although the present invention has been described with reference to particular embodiments, workers skilled in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of the invention.