The present invention relates to speech synthesis. In particular, the present invention relates to adaptation of general-purpose text-to-speech systems to specific domains.
Text-to-speech (TTS) technology enables a computerized system to communicate with users utilizing synthesized speech. With newly burgeoning applications such as spoken dialog systems, call center services, and voice-enabled web and email services, increasing emphasis is put on generating natural sounding speech. The quality of synthesized speech is typically evaluated in terms of how natural or human-like the produced speech sounds.
Simply replaying a recording of an entire sentence or paragraph of speech can produce very natural sounding speech. However, the complexity of human languages and the limitations of computer storage make it impossible to store every conceivable sentence that may occur in a text. Instead, systems have been developed to use a concatenative approach to speech synthesis. This concatenative approach combines stored speech samples representing small speech units such as phonemes, diphones, triphones, syllables or the like to form a larger speech signal unit.
Concatenation based speech synthesis has been widely adopted and rapidly developed. To some extent, this type of speech synthesis involves collecting, annotating, indexing and retrieving speech units within large databases. Accordingly, it follows that the naturalness of the synthesized speech depends to some extent on the size and coverage of a given unit inventory. Due to the complexity of human languages and the limitations of computer storage and processing, generally expanding the unit inventory is not a particularly efficient way to increase naturalness of speech for a general-purpose TTS system. However, expanding the unit inventory is a reasonable method for increasing naturalness within a specific domain for a domain-specific TTS system.
The simplest way of generating speech prompts in domain-specific applications is to play back a collection of pre-stored waveforms for words, phrases and sentences. When the domain is narrow and closed, very natural speech prompts can be generated with this method at relatively low cost. However, when the domain is not closed or is broader, or when the number of domains increases, the cost of constructing and maintaining such prompt systems increases greatly.
A general-purpose TTS system is preferred instead. However, general-purpose TTS systems sometimes cannot generate high quality speech for some domains, especially when the domain mismatches the speech corpus that is used as the unit inventory. It would be desirable to have a general-purpose TTS system that can produce rather natural speech without domain restrictions and that can generate more natural speech for a specific domain after domain adaptation. Domain adaptation is a concept that has been explored in many research areas; however, few studies have been conducted in the context of TTS systems. Efficient domain adaptation of a general-purpose TTS can be accomplished through generation of an optimized script for collecting domain-specific speech.
Embodiments of the present invention pertain to adaptation of a corpus-driven general-purpose TTS system to at least one specific domain. The domain adaptation is realized by adding a limited amount of domain-specific speech that provides a maximum impact on improved perceived naturalness of speech. An approach for generating an optimized script for adaptation is proposed, the core of which is a dynamic programming based algorithm that segments a domain-specific corpus into a minimum number of segments that appear in the unit inventory. Increases in perceived naturalness of speech after adaptation are estimated from the generated script without recording speech from it.
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices. Tasks performed by the programs and modules are described below and with the aid of figures. Those skilled in the art can implement the description and figures as processor executable instructions, which can be written on any form of computer readable media.
With reference to
Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110.
Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation,
The computer 110 may also include other removable/non-removable volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media discussed above and illustrated in FIG. 1, provide storage of computer readable instructions, data structures, program modules and other data for the computer 110. In
A user may enter commands and information into the computer 110 through input devices such as a keyboard 162, a microphone 163, and a pointing device 161, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port, or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 195.
The computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110. The logical connections depicted in
When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communication over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
To assist in understanding the usefulness of the present invention, it may be helpful to provide a high-level overview of a general-purpose corpus-driven TTS system. Such a system is depicted in block form as system 200 in
System 200 is illustratively configured to construct synthesized speech 202 from input text 204. A speech component bank or unit inventory 208 contains speech components. In order to generate speech 202, a component locator 210 is utilized to match input text 204 with speech components contained in bank 208. Speech constructor 212 is then utilized to assemble the speech components selected from bank 208 so as to create speech 202 based on input text 204.
In accordance with one general aspect of the present invention, naturalness of speech 202 is improved through a system of selective domain adaptation of inventory 208. This domain adaptation is realized by adding optimized units of speech to bank 208. The optimized units to be added are illustratively based on scripts 214 that are automatically generated and derived from a target domain text corpus 206.
The core of the present invention, which will be described in detail below, involves at least three primary parts. The first part is an addition of domain-specific speech into the unit inventory of a corpus-driven TTS engine to improve the naturalness of synthetic speech on the target domain.
The second part is a measurement of the naturalness of synthetic speech on the target domain before and after adding domain-specific speech to the general unit inventory. The naturalness is illustratively measured in terms of Average Concatenative Cost (ACC) and Average Segment Length (ASL) in order to enable a determination as to estimated improvements in Mean Opinion Score (MOS). An estimated impact of the added domain-specific speech on naturalness can be determined even before the added speech is actually recorded.
The third part is a generation and utilization of an algorithm to generate a domain-specific script for recording speech. The script generation algorithm can include any of several proposed constraints. A first proposed constraint is a minimization of the amount of speech data to be recorded given a certain requirement on target ACC (or ASL, or estimated MOS). The amount of speech can be measured by the number of words (for alphabet languages such as English) or the number of characters (or Kanji) for Chinese or Japanese. A second proposed constraint is a minimization of ACC (or maximization of ASL or estimated MOS) for a given amount of speech to be recorded.
In the last decade, concatenation based speech synthesis has been widely adopted and rapidly developed because of its potential for producing high quality speech. To some extent, speech synthesis becomes a problem of collecting, annotating, indexing and retrieving speech units within large speech databases. The naturalness of synthesized speech depends to some extent on the size and coverage of the unit inventory. Though generally expanding the unit inventory is not an efficient way to increase naturalness for a general-purpose TTS, it is a reasonable method for increasing naturalness on a specific domain. Thus, the problem of domain adaptation is converted into a problem of generating an optimized script for collecting domain-specific speech. The sticking point of the problem is to find an efficient objective measure for naturalness.
A formal evaluation has been done to investigate the relationship between the naturalness of synthetic speech and some objective measures. The measurement ACC is shown to be highly correlated with MOS, which reveals that the ACC predicts, to a great extent, the perceptual behavior of human beings. Four hundred ACC vs. MOS pairs are plotted in
Several factors are considered when calculating ACC. Among them, smoothness cost proves to be a very important one. With this constraint, the longest speech segment that matches other prosodic and phonetic constraints should be selected for concatenation. Utterances concatenated by a few long segments often sound very natural. Thus, the Average Segment Length (ASL), defined as the average number of characters in selected segments, also reflects naturalness. The ASL vs. MOS pairs for 400 synthetic utterances are plotted in
The domain adaptation problem to be solved by embodiments of the present invention can be described as follows:
1. Definition of Symbols
2. The problem
In accordance with one aspect of the present invention, an automatic approach is provided for generating Us. Generally speaking, the goal is to generate optimized script(s) Us that will provide maximum increase in ASL (and therefore perceived naturalness) within a size limitation Ss. Theoretically, the larger Ss is, the larger the ASL will be. However, it is normally undesirable to spend too much time and energy on speech collection for specific domains. An automatic approach is proposed by embodiments of the present invention and relates to an extraction of Domain-Specific Strings (DSS) one by one according to their contribution to the increase in ASL. A stop threshold for Ss and/or ASL can be selected according to a particular user's expectation of recording effort and naturalness.
ACC is proposed to be a good objective measure for naturalness of synthetic speech, from which MOS can be estimated (from the MOS-COST curve in
A broad overview of a method of generating optimized script(s) Us is illustrated in
In the DSS extraction step, all sentences in Cs are assumed to be synthesized by concatenating sub-strings that appear in the unit inventory U. Among many possible schemes for sub-string selection, the one with the maximum ASL is assumed to be the most natural one. A Dynamic Programming (DP) based algorithm is presently proposed for finding the segmentation scheme with the minimum number of segments, i.e., the maximum ASL. Details of the algorithm will be described below. After finding the best string sequence for all sentences in Cs, the ASL for Cs, when U is used, is given by equation (1):
ASL(Cs,U)=Size(Cs)/Count(Segment,(Cs,U)) (1)
where Size(Cs) is the number of characters in corpus Cs, and Count(Segment,(Cs,U)) is the number of segments used to synthesize Cs with U. When a DSS (or a corresponding sentence that contains the DSS) is added into U, ASL(Cs, U) will increase. An iterative algorithm is utilized to select, one at a time, the DSS that provides the maximum increase in ASL(Cs, U) until a predetermined threshold for ASL, or a predetermined number of DSS, is met. This optimization of DSS is reflected at block 504 in
In some instances it will be most desirable that DSS carry sentence level prosody. For example, it may be desirable to maintain sentence level intonation with regard to all DSS. Accordingly, an optional step indicated by block 506 is performed in order to generate Domain Dependent Sentences (DDS) that include the extracted optimal DSS selected from Cs. Specific schemes for the generation of DDS will be described in greater detail below.
A detailed and specific flow diagram of an approach for generating domain-specific script is provided in
A PAT tree is an efficient data structure that has been successfully utilized in the field of information retrieval and content indexing. A PAT tree is a binary digital tree in which each internal node has two branches and each external node represents a semi-infinite string (denoted as Sistring). For constructing a PAT tree, each Sistring in the corpus should be encoded into a bit stream. For example, GB2312 code for Chinese is used. Once the PAT tree is constructed, all Sistrings which appear in the corpus can be retrieved efficiently. In accordance with block 608, a list of candidate DSS is generated from the tree for Cs according to the criteria that a candidate DSS should appear in Cs at least N times and should never appear in Ug.
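The candidate criteria of block 608 can be illustrated with a minimal sketch. It is an assumption-laden illustration only: brute-force substring counting stands in for the PAT tree, and the names `candidate_dss`, `min_count` (playing the role of N), and `max_len`, as well as the minimum candidate length of two characters, are hypothetical and do not come from the description above.

```python
from collections import Counter

def candidate_dss(domain_sentences, general_script, min_count=2, max_len=8):
    """Return substrings that occur at least `min_count` times in the domain
    corpus (Cs) and never occur in the general script (Ug).

    Brute-force counting is used here in place of a PAT tree (assumption).
    Only substrings of 2..max_len characters are considered (assumption).
    """
    counts = Counter()
    for sentence in domain_sentences:
        n = len(sentence)
        for i in range(n):
            for j in range(i + 2, min(n, i + max_len) + 1):
                counts[sentence[i:j]] += 1

    candidates = [
        s for s, c in counts.items()
        if c >= min_count and not any(s in line for line in general_script)
    ]
    # Ordering (longest first, then most frequent) is for readability only.
    return sorted(candidates, key=lambda s: (-len(s), -counts[s], s))

# "abc" and "bc" occur twice in the domain text and never in the general script;
# "ab" also occurs twice but is already covered by the general script.
print(candidate_dss(["abcd", "xabc"], ["ab", "cd"]))
```

A PAT tree would retrieve the same repeated strings far more efficiently on a large corpus; the sketch only shows which strings qualify.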
In accordance with block 610, to find the best DSS from all candidates, Cs is segmented into substrings appearing in Ug with the maximum ASL constraint. The problem is best illustrated in the context of a specific example:
A sentence with N Chinese characters is denoted as C1C2 . . . CN. It is to be segmented into M (M≦N) sub-strings, all of which should appear at least once in Ug. Though many segmentation schemes exist, only the one with the smallest M is sought. This is a search for the optimal path, which is illustrated under the DP framework in
The segmentation algorithm is described as follows:
Step 1: Initialization
Step 2: Recursion
Step 3: Termination
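One way the three steps above might look in practice is sketched below. This is an illustrative reading of the minimum-segment search, not the formulation of the original steps: the function names `available`, `min_segments`, and `asl` are hypothetical, and the unit inventory is modelled simply as a list of script strings in which a segment is usable if it occurs as a substring of any line.

```python
def available(segment, script):
    """True if `segment` occurs somewhere in the script (stand-in for Ug)."""
    return any(segment in line for line in script)

def min_segments(sentence, script):
    """Split `sentence` into the fewest substrings that all occur in `script`.

    Returns the list of segments, or None if no segmentation exists.
    """
    n = len(sentence)
    INF = float("inf")
    # Initialization: best[i] is the fewest segments covering sentence[:i].
    best = [0] + [INF] * n
    back = [0] * (n + 1)  # back[i]: start index of the last segment ending at i
    # Recursion: extend every reachable prefix by one available segment.
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] + 1 < best[i] and available(sentence[j:i], script):
                best[i] = best[j] + 1
                back[i] = j
    # Termination: trace the back-pointers from the end of the sentence.
    if best[n] == INF:
        return None
    segments, i = [], n
    while i > 0:
        segments.append(sentence[back[i]:i])
        i = back[i]
    return segments[::-1]

def asl(corpus, script):
    """Equation (1): characters in the corpus divided by segments used.

    Assumes every sentence in `corpus` can be segmented with `script`.
    """
    total_chars = sum(len(s) for s in corpus)
    total_segments = sum(len(min_segments(s, script)) for s in corpus)
    return total_chars / total_segments

script = ["ab", "xcde", "abc"]
print(min_segments("abcde", script))
print(asl(["abcde"], script))
```

The search is quadratic in sentence length per sentence, which is acceptable for a sketch; a real system would likely bound the segment length or index the inventory.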
In accordance with block 610 in
ASLIPC=(ASLa−ASL0)/L (3)
where L is the length of a candidate DSS in characters, ASL0 is the ASL for Cs when it is segmented by the unit inventory without the current candidate DSS, and ASLa is the ASL after adding the current candidate DSS into the unit inventory. Among the extracted DSS, some are sub-strings of others. It is not necessary to keep them all. The shorter ones can be pruned under certain circumstances. For example, an extracted DSS can optionally be eliminated if it is a part of a longer one. It should be noted that block 611 in
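The ranking by equation (3) can be sketched as a greedy loop. The sketch below is hypothetical in its details: `count_segments`, `aslipc`, `select_dss`, and the `max_dss` stop limit are illustrative names, the inventory is modelled as a list of script strings, and a character missing from the script is assumed to count as one segment so that the ASL is always defined.

```python
def count_segments(sentence, script):
    """Fewest script substrings needed to cover `sentence`.

    Assumption: any single character is always usable as its own segment.
    """
    n = len(sentence)
    best = [0] + [n + 1] * n
    for i in range(1, n + 1):
        for j in range(i):
            segment = sentence[j:i]
            if i - j == 1 or any(segment in line for line in script):
                best[i] = min(best[i], best[j] + 1)
    return best[n]

def asl(corpus, script):
    """Equation (1): total characters divided by total segments."""
    chars = sum(len(s) for s in corpus)
    return chars / sum(count_segments(s, script) for s in corpus)

def aslipc(candidate, corpus, script):
    """Equation (3): ASL gain from adding `candidate`, per character."""
    asl_0 = asl(corpus, script)
    asl_a = asl(corpus, script + [candidate])
    return (asl_a - asl_0) / len(candidate)

def select_dss(candidates, corpus, script, max_dss):
    """Repeatedly add the candidate with the highest ASLIPC to the script,
    stopping at `max_dss` strings or when no candidate raises the ASL."""
    script, remaining, chosen = list(script), list(candidates), []
    while remaining and len(chosen) < max_dss:
        top = max(remaining, key=lambda c: aslipc(c, corpus, script))
        if aslipc(top, corpus, script) <= 0:
            break
        chosen.append(top)
        script.append(top)
        remaining.remove(top)
    return chosen

corpus = ["abcd", "abcd", "xyab"]
script = ["ab", "cd", "xy"]
print(aslipc("abcd", corpus, script))
print(select_dss(["bc", "abcd"], corpus, script, max_dss=2))
```

In this toy corpus, adding "abcd" raises the ASL from 2.0 to 3.0, while "bc" does not change the best segmentation and is never selected. Re-scoring every candidate on each pass is simple but costly; it is shown here only to make the selection criterion concrete.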
As was discussed above, once optimal DSS have been selected, in accordance with block 612, an optional step of DDS generation can be performed. Since sentences are sometimes preferred for speech data collections to carry sentence level prosody, DDS that cover all extracted DSS can be generated. Though they can be written manually, it is more efficient to select DDS from Cs automatically.
All sentences in Cs are considered as candidates for DDS generation. The criterion for selecting DDS is illustratively ASLIPC for a sentence, which is the sum of ASL increase for all DSS appearing in the sentence divided by the sentence length. The sentence with the highest ASLIPC is illustratively selected first and removed from the candidate list Cs. The DSS appearing in this sentence should be removed from the DSS list too. These procedures are illustratively done iteratively until the DSS candidate list is empty or the number of selected sentences reaches a predetermined limit. Block 614 in
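The sentence selection procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `select_dds`, the `dss_gain` mapping (each extracted DSS paired with its ASL increase), and the `max_sentences` limit are hypothetical, and the gain values in the usage example are invented purely for demonstration.

```python
def select_dds(sentences, dss_gain, max_sentences):
    """Greedily pick sentences that cover the extracted DSS.

    A sentence is scored by the summed ASL increase of the still-uncovered
    DSS it contains, divided by its length in characters.
    """
    pool = [s for s in dict.fromkeys(sentences) if s]  # unique, non-empty
    pending = dict(dss_gain)                           # DSS not yet covered
    selected = []

    def score(sentence):
        gain = sum(g for dss, g in pending.items() if dss in sentence)
        return gain / len(sentence)

    while pending and pool and len(selected) < max_sentences:
        best = max(pool, key=score)
        if score(best) == 0:
            break  # no remaining sentence contains an uncovered DSS
        selected.append(best)
        pool.remove(best)
        # Remove every DSS that the chosen sentence already carries.
        for dss in [d for d in pending if d in best]:
            del pending[dss]
    return selected

sentences = ["book a flight to Paris", "a flight", "to Paris please"]
gains = {"a flight": 0.4, "to Paris": 0.3}  # illustrative values only
print(select_dds(sentences, gains, max_sentences=5))
```

Dividing by sentence length favours short sentences that carry high-value DSS, which keeps the recording script small for a given coverage.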
Experimentation performed in association with the present invention has shown that the amount of MOS increase to the general-purpose TTS system depends not only on the size of the training set and the size of the script for adaptation, but also on the broadness of the domain. Narrower domains have larger increases in MOS.
In accordance with a specific experiment,
In accordance with another specific experiment,
In accordance with another specific experiment,
In accordance with another specific experiment,
The present invention presents a framework for generating domain-specific scripts. With it, application developers can estimate how much improvement can be achieved before starting to record speech for a specific domain. Experiments show that the extent of increase in naturalness depends not only on the size of the training set and the size of the script for adaptation, but also on the broadness of the domain. Greater increases in naturalness are observed for narrower domains.
Although the present invention has been described with reference to preferred embodiments, workers skilled in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of the invention.