This application claims the benefit of European Patent Application No. EP5105449.2 filed Jun. 21, 2005.
1. Field of the Invention
The present invention relates to computer-generated text-to-speech conversion, and, more particularly, to updating a Concatenative Text-To-Speech (CTTS) system with a speech database from a base version to a new version.
2. Description of the Related Art
Natural speech output is one of the key elements for a wide acceptance of voice enabled applications and is indispensable for interfaces that cannot make use of other output modalities, such as plain text or graphics. Recently, major improvements in the field of text-to-speech synthesis have been made through the development of so-called “corpus-based” methods: systems such as the IBM trainable text-to-speech system or AT&T's NextGen system make use of explicit or parametric representations of short segments of natural speech, referred to herein as “synthesis units,” that are extracted from a large set of recorded utterances in a preparative synthesizer training session, and which are retrieved, further manipulated, and concatenated during a subsequent speech synthesis runtime session.
In more detail, and with a particular focus on the disadvantages of prior art, such methods for operating a CTTS system include the following features:
The pre-processed text, the requested sequence of synthesis units, and the desired intonation contour are passed to a back-end concatenation module 14 that generates the synthetic speech in a synthesis engine 16. For that purpose, a back-end database 18 of speech segments is searched for units that best match the acoustic/prosodic specifications computed by the front-end. The back-end database 18 stores an explicit or parametric representation of the speech data.
Synthesis units, such as phones, sub-phones, diphones, or syllables, are well known to sound different when articulated in different acoustic and/or prosodic contexts. Consequently, a large number of these units have to be stored in the synthesizer's database in order to enable the system to produce high quality speech output across a broad variety of applications or domains. For combinatorial and performance reasons, it is prohibitive to search all instances of a required synthesis unit during runtime. Accordingly, a fast selection of suitable candidate segments is generally performed based upon previously established criteria, rather than by searching the entirety of synthesis units in the synthesizer's database.
With reference to
While concatenative text-to-speech synthesis is able to produce synthetic speech of remarkable quality, it is also true that such systems sound most natural for applications and/or domains that have been thoroughly covered by the recording script (i.e., the above-mentioned base text) and are thus present in the speech database. Different speaking styles and acoustic contexts are only two reasons that help to explain this observation.
Since it is impossible to record speech material for all possible applications in advance, both the construction of synthesizers for limited domains and adaptation with additional, domain-specific prompts, have been proposed in the literature. Limited domain synthesis constructs a specialized synthesizer for each individual application. Domain adaptation adds speech segments from a domain-specific speech corpus to an already existing, general synthesizer.
Referencing
Therefore, while both approaches, limited domain synthesis and domain adaptation, can help to increase the quality of synthetic speech for a particular application, these methods are disadvantageously time-consuming and expensive, since a professional human speaker (preferably the original voice talent) has to be available for the update speech session, and because of the need for expert phonetic-linguistic skills in the synthesizer construction step (shown in
Prior art unit selection based text-to-speech systems can generate high quality synthetic speech for a variety of applications, but achieve best results for domains and applications that are covered in the base recordings used for synthesizer construction. Prior art methods for the adaptation of a speech synthesizer towards a particular application demand the recording of additional human speech corpora covering additional application-specific text, which is time consuming and expensive, and ideally requires the availability of the original voice talent and recording environment.
The domain adaptation method disclosed in the present invention overcomes this problem. By making use of statistics generated during the CTTS system runtime, the present invention examines the acoustic and/or prosodic contexts of the application-specific text, and re-organizes the speech segments in the base database according to newly created contexts. The latter is achieved by application-specific decision tree modifications. Thus, in contrast to prior art, adaptation of a CTTS system according to the present invention requires only a relatively small amount of application-specific text, and does not require additional speech recordings. The present invention, therefore, allows the creation of application-specific synthesizers with improved output speech quality for arbitrary domains and applications at very low cost.
The present invention can be implemented in accordance with numerous aspects consistent with material presented herein. For example, one aspect of the present invention can include a method and respectively programmed computer system for updating a Concatenative Text-To-Speech System (CTTS) with a speech database from a base version to a new version. The CTTS system can use segments of natural speech, stored in their original form or in any parametric representation, which are obtained by recording a base text. The recorded speech can be dissected into synthesis units including, but not limited to, subphones (such as a ⅓ phone), phones, diphones, and syllables. Speech can be synthesized by a concatenation and modification of the synthesis units. The base speech database can include a base set of acoustic and/or prosodic context classes derived from and thus matching said base text.
A method of updating to a new database better suited for synthesizing text from a predetermined target application can include specifying a new text corpus subset that is not completely covered by the base speech database. Acoustic contexts from the base version speech database that are present in the target application can be collected. Acoustic context classes which remain unused when the CTTS system is used for synthesizing new text of the target application can be discarded. New context classes can be created from the discarded classes. The speech database can be re-indexed to reflect the newly created context classes.
In one embodiment, the speech segments can be organized in a clustered hierarchy of subsets of speech segments, or even in a tree-like hierarchy. This organization provides a fast runtime operation.
Both the removal of unused acoustic and/or prosodic contexts and the creation of new context classes can be implemented as operations on decision trees, such as pruning (removal of subtrees) and split-and-merge (for the creation of new subtrees).
The method can be advantageously enriched with a weighting function. One such weighting function can analyze how frequently each of the synthesis units under a single given leaf is used. The speech database update procedure can be triggered without human intervention when a predetermined condition is met. This function can be used to keep the new speech database relatively small, which speeds up the segment search and thus improves the scalability of the application. The function also allows the speech database to be updated without significant human intervention.
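As a rough illustration of the weighting idea, the following Python sketch counts how often each synthesis unit under one leaf is selected at runtime and keeps only the units whose share of selections reaches a cutoff. The function name `prune_leaf_by_usage`, the format of the selection log, and the `min_share` parameter are assumptions made for this example and are not taken from the disclosure.

```python
from collections import Counter

def prune_leaf_by_usage(leaf_units, selection_log, min_share=0.05):
    """Keep units whose share of selections within this leaf is >= min_share.

    leaf_units: unit ids stored under one leaf (hypothetical layout).
    selection_log: unit ids chosen by the synthesizer at runtime (hypothetical).
    """
    members = set(leaf_units)
    # Count only selections that belong to this leaf.
    counts = Counter(u for u in selection_log if u in members)
    total = sum(counts.values())
    if total == 0:
        return list(leaf_units)  # no usage evidence yet: keep everything
    return [u for u in leaf_units if counts[u] / total >= min_share]

kept = prune_leaf_by_usage(["s1", "s2", "s3"], ["s1", "s1", "s2", "s1", "s9"])
print(kept)
```

Units that are never selected for the target application drop out of the leaf, which is one way the adapted database could be kept small.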
In one embodiment, the method can be advantageously applied for portlets each producing a voice output. Each of the portlets can be equipped with a portlet-specific database.
The present invention can be performed automatically without a human trigger, i.e., an “online-adaptation.” An automatically triggered embodiment can include a step of collecting CTTS-quality data during runtime of the CTTS system. The system can check for a predetermined CTTS update condition. A speech database update procedure can be automatically performed when the predetermined CTTS update condition is met.
Benefits of the invention can result from an ability to adapt a speech database without requiring an additional recording of application specific prompts. Specific benefits can include: improved quality of synthetic speech achieved without additional costs; an increase in application lifecycle, since adaptation can be applied whenever the design of the application changes; and, lower skill levels needed for creation and maintenance of speech synthesizers for specific domains, since the invention is based only upon domain specific text.
It should be noted that various aspects of the invention can be implemented as a program for controlling computing equipment to implement the functions described herein, or a program for enabling computing equipment to perform processes corresponding to the steps disclosed herein. This program may be provided by storing the program in a magnetic disk, an optical disk, a semiconductor memory, or any other recording medium. The program can also be provided as a digitally encoded signal conveyed via a carrier wave. The described program can be a single program or can be implemented as multiple subprograms, each of which interacts within a single computing device or interacts in a distributed fashion across a network space.
There are shown in the drawings, embodiments which are presently preferred, it being understood, however, that the invention is not limited to the precise arrangements and instrumentalities shown.
The present invention adapts a general domain Concatenative Text-to-Speech (CTTS) system for a target application. The invention presupposes that a speech synthesizer uses one or more decision trees or a decision network for a selection of candidate speech segments. These candidate speech segments are subject to further evaluation by the concatenation engine's search module. The target application is defined by a representative, but not necessarily exhaustive, text corpus. Accordingly, the invention teaches a method for adapting decision trees for fast selection of candidate speech segments at runtime for target applications. Unlike conventional CTTS implementations, no additional speech recordings are necessary to tailor the CTTS system's decision tree structure to the target application.
It should be noted that while many examples for the present invention are phrased in terms of decision tree adaptation in an acoustic context, the invention can be applied in other contexts. For example, the present invention can apply to the adaptation of decision trees used by a synthesizer for the computation of loudness, pitch, duration, and the like.
Further, the inventive arrangements detailed herein are not to be construed as limited to decision tree implementations. The invention can also be implemented for other tree-like data structures, such as a hierarchy of speech segment clusters. In a hierarchy, the present invention can be used for finding a set of candidate speech segments that best match the requirements imposed by the CTTS system's front-end. In a hierarchy case, instead of being used to find an appropriate decision tree leaf, the invention can be used to identify a cluster (subset of speech segments) based upon a distance measurement that best matches front-end requirements. The adaptive tree traversal tailored for a target application remains the same for the hierarchy of speech segment clusters implementation as it does for the decision tree embodiment.
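The cluster-hierarchy variant can be sketched as follows. This is a minimal Python example under stated assumptions: each cluster is summarized by a centroid feature vector, the distance measurement is Euclidean, and the `Cluster` class and `find_cluster` function are hypothetical names introduced only for illustration.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Cluster:
    centroid: tuple                       # representative feature vector (assumed)
    segments: list = field(default_factory=list)
    children: list = field(default_factory=list)

def find_cluster(root, target):
    """Descend the hierarchy, always moving to the child nearest to target."""
    node = root
    while node.children:
        node = min(node.children, key=lambda c: math.dist(c.centroid, target))
    return node

root = Cluster((0.0, 0.0), children=[
    Cluster((1.0, 0.0), segments=["seg_a", "seg_b"]),
    Cluster((0.0, 1.0), segments=["seg_c"]),
])
# The target vector stands in for the front-end requirements.
candidates = find_cluster(root, (0.9, 0.2)).segments
print(candidates)
```

The returned cluster plays the same role as a decision tree leaf: its segments form the candidate list handed to the back-end search.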
In order to allow a fast selection of candidate speech segments during runtime, decision trees for each synthesis unit (e.g., for phones or, preferably, sub-phones) are trained as part of the synthesizer construction process, and the same decision trees are traversed during synthesizer runtime.
Decision tree growing divides the general domain training data aligned to a particular synthesis unit into a set of homogeneous regions, i.e. a number of clusters with similar spectral or prosodic properties, and thus similar sound. It does so by starting with a single root node holding all the data, and by iteratively asking questions about a unit's phonetic and/or prosodic context, e.g., of the form:
Is the phone to the left a vowel?
Is the phone two positions to the right a plosive?
Is the current phone part of a word-initial syllable?
In each step of the process, the question that yields the best result with respect to some pre-defined measurement of homogeneity is stored in the node, and two successor nodes are created which hold all data that yield a positive (or negative, respectively) answer to the selected question. The process stops when a given number of leaves, i.e., nodes without successors, is reached.
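The growing procedure described above can be sketched in Python as follows. This is an illustrative sketch, not the synthesizer's actual training code: it assumes each training sample is a pair of a context dictionary and a single scalar feature (for example, a pitch value), it uses the sum of squared deviations as the homogeneity measurement, and it greedily splits whichever leaf offers the largest gain. All names (`Node`, `impurity`, `best_split`, `grow`) are hypothetical.

```python
from dataclasses import dataclass

@dataclass(eq=False)
class Node:
    samples: list              # (context dict, feature value) pairs
    question: str = None       # name of the question stored in this node
    yes: "Node" = None
    no: "Node" = None

def impurity(samples):
    """Sum of squared deviations from the mean (lower = more homogeneous)."""
    if not samples:
        return 0.0
    values = [v for _, v in samples]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(node, questions):
    """Return (gain, name, yes_samples, no_samples) for the best question."""
    best = None
    for name, ask in questions.items():
        yes = [s for s in node.samples if ask(s[0])]
        no = [s for s in node.samples if not ask(s[0])]
        if not yes or not no:
            continue  # question does not separate the data
        gain = impurity(node.samples) - impurity(yes) - impurity(no)
        if best is None or gain > best[0]:
            best = (gain, name, yes, no)
    return best

def grow(samples, questions, max_leaves):
    """Split leaves until max_leaves is reached or no question applies."""
    root = Node(samples)
    leaves = [root]
    while len(leaves) < max_leaves:
        splits = [(best_split(leaf, questions), leaf) for leaf in leaves]
        splits = [(s, leaf) for s, leaf in splits if s is not None]
        if not splits:
            break
        (gain, name, yes, no), leaf = max(splits, key=lambda x: x[0][0])
        leaf.question, leaf.yes, leaf.no = name, Node(yes), Node(no)
        leaves.remove(leaf)
        leaves += [leaf.yes, leaf.no]
    return root

questions = {
    "left_is_vowel": lambda ctx: ctx["left"] in "aeiou",
    "word_initial": lambda ctx: ctx["word_initial"],
}
data = [
    ({"left": "a", "word_initial": True}, 120.0),
    ({"left": "e", "word_initial": False}, 118.0),
    ({"left": "t", "word_initial": True}, 90.0),
    ({"left": "s", "word_initial": False}, 92.0),
]
tree = grow(data, questions, max_leaves=2)
print(tree.question)
```

In this toy data the left-context question separates the feature values far better than the word-position question, so it is the one stored in the root node.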
During runtime, after baseform generation by the synthesizer's front-end, the decision tree for each required synthesis unit is traversed from top to bottom by asking the question stored in each node and following the respective YES- or NO-branch until a leaf node is reached. The speech segments associated with these leaves are now suitable candidate segments from which the concatenation engine has to select the segment that, in terms of a pre-defined cost function, best matches the requirements imposed by the front-end as well as the already synthesized speech.
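A minimal Python sketch of this runtime traversal is given below. The nested-dictionary tree layout, the segment records, and the cost function are assumptions for illustration; in particular, the cost here considers only the distance to a target pitch and ignores how well a candidate joins the already synthesized speech.

```python
def find_leaf(node, context):
    """Follow YES/NO branches from the root until a leaf is reached."""
    while "question" in node:
        branch = "yes" if node["question"](context) else "no"
        node = node[branch]
    return node

def select_segment(candidates, target_pitch):
    """Pick the candidate with the lowest cost (here: pitch distance only)."""
    return min(candidates, key=lambda seg: abs(seg["pitch"] - target_pitch))

# Hypothetical one-question tree with two leaves.
tree = {
    "question": lambda ctx: ctx["left_is_vowel"],
    "yes": {"segments": [{"id": "a1", "pitch": 110.0}, {"id": "a2", "pitch": 125.0}]},
    "no": {"segments": [{"id": "b1", "pitch": 95.0}]},
}
leaf = find_leaf(tree, {"left_is_vowel": True})
best = select_segment(leaf["segments"], target_pitch=120.0)
print(best["id"])
```

Only the segments in the reached leaf are scored, which is what keeps the search fast compared with scanning every instance of the unit.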
If text from a new domain or application has to be synthesized, the same runtime procedure is carried out using the general domain synthesizer's decision tree. However, since the decision tree was designed to discriminate speech segments according to the acoustic and/or prosodic contexts in the training data, traversal of the tree will frequently end in the very same leaves, therefore making only a small fraction of all speech segments available for further search. As a consequence, the prior art back-end may search a list of candidate speech segments that are less suited to meet the prosody targets specified by the front-end, and output speech of less than optimal quality will be produced.
Domain specific adaptation of context classes, as provided by the present invention, will overcome this problem by altering the list of candidate speech segments, thus allowing the back-end search to access speech segments that potentially better match the prosodic targets specified by the front-end. Thus, better output speech is produced without the incorporation of additionally recorded domain specific speech material, as is required by prior art synthesizer adaptation methods.
For the purpose of domain adaptation, the steps shown in
In step 460, decision tree adaptation occurs, where new context classes are created. This creation of context classes can utilize decision tree pruning and/or refinement techniques.
In step 470, the speech database used by the target application can be re-indexed. This step can tag the synthesizer's speech database according to the newly created context classes. Database size for the target application can optionally be reduced to increase search speed.
In step 480, after the database or tree structure used for fast candidate selection is updated, which can occur automatically at runtime, speech synthesis tasks can be performed. It should be emphasized that the database or tree structure is updated for the target application without requiring additional speech recordings, as would be the case for a conventionally implemented system.
The steps shown in
With additional reference to
Specifically, the method begins with a decision 440 to perform an update of the speech database. The context identification step 450 is implemented in the program component—which can be a part of the synthesis engine. The program component can use a pre-existing general domain synthesizer with decision trees shown in
In a context identification 450 the following actions can be performed:
As a result, two disjoint sets of decision tree leaves can be obtained: a first set with counter values above the threshold, and a second set with counter values below the threshold. Leaves 627, 628, 629 in the first set can carry the speech segments 634 and 636, . . . , 641 for acoustic and/or prosodic contexts present in the application specific new text. Leaf 630 from the second set can contain speech segments 631, . . . , 633 that are not accessible by the new application due to the previously mentioned context mismatch of training data and new application.
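The partition into used and unused leaves can be sketched as follows in Python. The leaf identifiers, the visit log, and the function name `split_leaves_by_usage` are hypothetical; the sketch also assumes that a counter exactly equal to the threshold counts as unused, a case the text leaves open.

```python
from collections import Counter

def split_leaves_by_usage(all_leaves, visited_leaves, threshold):
    """Return (used, unused) leaf id sets based on visit counters."""
    counters = Counter(visited_leaves)
    # A leaf is "used" only if its counter is strictly above the threshold.
    used = {leaf for leaf in all_leaves if counters[leaf] > threshold}
    unused = set(all_leaves) - used
    return used, unused

# One entry per leaf reached while traversing the tree for the new text.
visits = ["L1", "L2", "L1", "L3", "L1", "L2"]
used, unused = split_leaves_by_usage(["L1", "L2", "L3", "L4"], visits, threshold=1)
print(sorted(used), sorted(unused))
```

The two sets are disjoint by construction and together cover every leaf of the tree.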
In the decision tree adaptation step 460, an adaptation software program can perform a decision tree adaptation procedure which is best implemented as an iterative process that discards and/or creates acoustic contexts based on the information collected in the preceding context identification step 450. Assuming a binary decision tree, we can distinguish three different situations:
By comparing the new leaves' usage counters to a new threshold (which may be different from the previous one), the process creates two new sets of (un-)used leaves in each iteration. The process stops if either further pruning is not applicable or if a stop criterion is reached. For example, the stop criterion can be met once a predefined number of leaves, or speech segments per leaf, is reached.
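The iterative character of this adaptation can be illustrated with a short Python sketch. The three situations themselves are not reproduced above, so the sketch assumes one simple rule in their place: when two sibling leaves exist and at least one has a usage counter below the threshold, both are merged into their parent, which becomes a new leaf holding the combined segments and the summed counter. The dictionary layout, the fixed threshold, and the leaf-count stop criterion are likewise assumptions.

```python
def is_leaf(node):
    return "yes" not in node

def count_leaves(node):
    if is_leaf(node):
        return 1
    return count_leaves(node["yes"]) + count_leaves(node["no"])

def prune_once(node, threshold):
    """Merge one pair of sibling leaves if either is under-used.

    Returns True if a merge took place somewhere below node.
    """
    if is_leaf(node):
        return False
    yes, no = node["yes"], node["no"]
    if is_leaf(yes) and is_leaf(no):
        if min(yes["count"], no["count"]) < threshold:
            # Parent becomes a leaf carrying both children's segments.
            node["segments"] = yes["segments"] + no["segments"]
            node["count"] = yes["count"] + no["count"]
            del node["yes"], node["no"]
            return True
        return False
    return prune_once(yes, threshold) or prune_once(no, threshold)

def adapt(tree, threshold, min_leaves):
    """Prune until nothing changes or the leaf-count stop criterion is hit."""
    while count_leaves(tree) > min_leaves and prune_once(tree, threshold):
        pass
    return tree

tree = {
    "yes": {"yes": {"segments": ["a"], "count": 9},
            "no": {"segments": ["b"], "count": 0}},
    "no": {"segments": ["c", "d"], "count": 7},
}
adapt(tree, threshold=2, min_leaves=2)
print(count_leaves(tree), tree["yes"]["segments"])
```

After the merge, segment "b", which was previously unreachable for the new application, sits in a leaf that the application does visit.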
The lower part of
Then, in a final adaptation step 470 the program component re-builds the speech database storing all the speech segments by means of a re-indexing procedure, which transforms the new tree structure into a respective new database structure having a new arrangement of table indexes.
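A re-indexing pass of this kind might look like the following Python sketch. It assumes the adapted tree is a nested dictionary whose leaves list their segment ids, and it uses a path string as the leaf identifier; the actual table layout of the speech database is not specified here, so `reindex` and `invert` are hypothetical helpers.

```python
def reindex(tree, prefix="root"):
    """Map every segment id to the path-like id of the leaf that now holds it."""
    if "yes" not in tree:
        return {seg: prefix for seg in tree["segments"]}
    index = {}
    index.update(reindex(tree["yes"], prefix + "/yes"))
    index.update(reindex(tree["no"], prefix + "/no"))
    return index

def invert(index):
    """Build the leaf -> segments table used for candidate lookup."""
    table = {}
    for seg, leaf in index.items():
        table.setdefault(leaf, []).append(seg)
    return table

adapted_tree = {
    "yes": {"segments": ["a", "b"]},
    "no": {"yes": {"segments": ["c"]}, "no": {"segments": ["d"]}},
}
index = reindex(adapted_tree)
table = invert(index)
print(table)
```

The segment data itself is untouched; only the mapping from context class to segments changes.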
Finally, the speech database is completely updated in step 480, still comprising only the original speech segments, but now being organized according to the characteristic acoustic and/or prosodic contexts of the new domain. Thus, the adapted database and decision tree can be used instead of their general domain counterparts in normal runtime operation mode.
The descriptive data mentioned above can include, but is not limited to, any (combination) of the following:
During application runtime, the synthesis engine collects the above-mentioned descriptive data, which allow the quality of the CTTS system to be judged and are thus called CTTS quality data (step 750). The CTTS quality data can be checked against a predetermined CTTS update condition 760.
If the condition is not met, the system continues to synthesize speech using the current (original) versions of the acoustic/prosodic decision trees and speech segment database (see the YES-branch in block 770). Otherwise (NO-branch), the current version of the system is considered insufficient for the given application, and in a step 780 the CTTS system is prepared for a database update procedure. This preparation can be implemented by defining a time during runtime when it can be reasonably expected that the update procedure does not interrupt a current CTTS application session.
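The check-and-prepare logic can be sketched in Python as follows. Because the concrete quality data are not enumerated here, the sketch uses one hypothetical statistic, the fraction of decision tree leaves never reached at runtime, and a hypothetical limit `max_unused_ratio` as the update condition. The `session_active` flag stands in for whatever mechanism determines that an update would not interrupt a running session.

```python
def unused_leaf_ratio(leaf_counters):
    """Fraction of leaves never reached at runtime (hypothetical statistic)."""
    if not leaf_counters:
        return 0.0
    unused = sum(1 for c in leaf_counters.values() if c == 0)
    return unused / len(leaf_counters)

def next_action(leaf_counters, session_active, max_unused_ratio=0.5):
    """Return 'continue', 'defer_update', or 'update'."""
    if unused_leaf_ratio(leaf_counters) <= max_unused_ratio:
        return "continue"  # keep the current trees and database
    # Update condition met: wait until no session would be interrupted.
    return "defer_update" if session_active else "update"

counters = {"L1": 40, "L2": 0, "L3": 0, "L4": 0}
print(next_action(counters, session_active=True))
print(next_action(counters, session_active=False))
```

Any other runtime statistic could be substituted for the ratio without changing the control flow.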
Thus, as a skilled reader may appreciate, the foregoing embodiment of the present invention offers an improved quality of synthetic speech output for a particular application or domain without imposing restrictions on the synthesizer's universality and without the need of additional speech recordings.
It should be noted that the term “application” as used in this disclosure does not necessarily refer to a single task with a static set of prompts, but can also refer to a set of different, dynamically changing applications, e.g., a set of voice portlets in a web portal application such as the WebSphere® Voice Application Access environment. It is further important to note that in the case of a multilingual text-to-speech system, these applications are not required to output speech in one and the same language.
The present invention can be realized in hardware, software, or a combination of hardware and software. A synthesis tool, according to the present invention, can be realized in a centralized fashion in one computer system or in a distributed fashion where different elements are spread across several interconnected computer systems. Any kind of computer system or other apparatus adapted for carrying out the methods described herein is suited. A typical combination of hardware and software could be a general purpose computer system with a computer program that, when being loaded and executed, controls the computer system such that it carries out the methods described herein.
The present invention can also be embedded in a computer program which comprises all the features enabling the implementation of the methods described herein, and which—when loaded in a computer system—is able to carry out these methods.
Computer program means or computer program in the present context mean any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following:
a) conversion to another language, code, or notation; and
b) reproduction in a different material form.
Number | Date | Country | Kind |
---|---|---|---|
5105449 | Jun 2005 | EP | regional |
Number | Name | Date | Kind |
---|---|---|---|
6029132 | Kuhn et al. | Feb 2000 | A |
6081774 | de Hita et al. | Jun 2000 | A |
7328157 | Chu et al. | Feb 2008 | B1 |
20020087314 | Fischer et al. | Jul 2002 | A1 |
20020095282 | Goronzy et al. | Jul 2002 | A1 |
20020120450 | Junqua et al. | Aug 2002 | A1 |
20030055641 | Yi et al. | Mar 2003 | A1 |
20040015478 | Pauly | Jan 2004 | A1 |
20050131676 | Ghasemi et al. | Jun 2005 | A1 |
20060069566 | Fukada et al. | Mar 2006 | A1 |
20060074674 | Zhang et al. | Apr 2006 | A1 |
Entry |
---|
Fischer et al. “Domain adaptation methods in the IBM trainable text-to-speech system”, ICSLP, Oct. 2004. |
Cronk et al. “Optimized stopping criteria for Tree-based unit selection in concatenative synthesis”, ICSLP, 2002. |
Kain et al. “Text-to-speech voice adaptation from sparse training data”, ICSLP, 1998. |
Hunt et al. “Unit selection in a concatenative speech synthesis system using a large speech database”, ICASSP, 1996. |
Yamagishi et al. “Speaking Style Adaptation Using Context Clustering Decision Tree for HMM-Based Speech Synthesis”, IEEE ICASSP, 2004. |
Chu et al. “Domain Adaptation for TTS system”, IEEE ICASSP 1992. |
Number | Date | Country | |
---|---|---|---|
20060287861 A1 | Dec 2006 | US |