In one embodiment, rather than calculating each measure during the selection process, measures can be pre-calculated and stored in cost tables 118. In particular, a cost table can be calculated for each phonetic and prosodic feature component.
Before speech synthesizer 100 can be utilized to construct speech 102, it must be initialized with samples of speech units taken from a training text 120 that is audibly read to provide training speech 122. The manner in which training text 120 and training speech 122 are obtained is not pertinent to this description. However, in one embodiment, the training text 120 can be obtained from a large corpus of text using the techniques described in U.S. Pat. No. 6,978,239.
Storing speech units 116 begins by parsing the sentences of text 120 into individual speech units that are annotated with high-level phonetic and prosodic information. In
Prosodic context:
Phonetic context:
It should be noted that other units of a language besides the “word” and “phrase” described above can be used, if desired.
The context vectors produced by context vector generator 112 are provided to a speech storing component 124 along with speech samples produced by a sampler 126 from training speech signal 122. Each sample provided by sampler 126 corresponds to a speech unit identified by parser 108. Speech storing component 124 indexes each speech sample by its context vector to form an indexed set of stored speech units 116.
As indicated above, one aspect described herein is selecting speech units from stored speech units 116 based on context-dependent HMM models that have been previously trained to represent units in different phonetic and prosodic contexts. In particular, the synthesizer 100 uses a cost function to aid in the selection of speech units for speech synthesis. The cost function is typically defined as a weighted sum of the target cost and the concatenation cost. The target cost is the sum of measures of phonetic constraints and/or prosodic constraints and will be discussed further below. The concatenation cost can take any appropriate values, indicators, or labels; for example, with binary values, it is 0 when the two segments to be concatenated are successive segments in the recorded speech and 1 otherwise.
The target cost takes into account the compatibility between the candidate speech unit and the target speech unit. Let t=[t1,t2, . . . ,tJ] and u=[u1,u2, . . . ,uJ] denote the corresponding target and candidate context vectors, respectively. Generally, target cost is defined as:
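The equation itself appears to have been lost here; a reconstruction consistent with the “where” clause that follows (sub-costs C<i>j</i><sup>t</sup> over J features, each weighted by w<i>j</i><sup>t</sup>) would take the form:

```latex
C^{t}(\mathbf{t}, \mathbf{u}) = \sum_{j=1}^{J} w_{j}^{t}\, C_{j}^{t}(t_{j}, u_{j})
```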
where Cjt,j=1,2, . . . J, is the sub-cost for the jth feature, and is weighted by wjt.
The sub-costs for the categorical features can be automatically estimated by acoustic modeling of the context classes of the feature. Referring to
There are many ways to define the measure between context dependent HMM acoustic models. In one embodiment, Kullback-Leibler Divergence (KLD) is used to measure the dissimilarity between two HMM acoustic models. In terms of KLD, the target cost can be represented as:
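The displayed equation appears to be missing here; consistent with the weighted-sum form of the target cost given earlier and the “where” clause that follows, a plausible reconstruction is:

```latex
C^{t}(\mathbf{t}, \mathbf{u}) = \sum_{j=1}^{J} w_{j}^{t}\, D_{KL}(T_{j} \,\|\, U_{j})
```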
where Tj and Uj denote the target and candidate models corresponding to unit feature tj and uj, respectively. For purposes of explanation, the target cost can be defined based on phonetic and prosodic features (i.e. a phonetic target sub-cost and a prosodic target sub-cost). A schematic diagram of measuring the target cost for a first HMM ti 302 and a second HMM uj 304 with KLD 306 is illustrated in
A significant problem in target cost estimation is how to build reliable context-feature HMMs, which characterize the addressed context classes, while removing the influences of other features. In the discussion provided below, exemplary methods are provided to build appropriate HMM models for both phonetic and prosodic features and obtain corresponding measures between different combinations. Methods 400 and/or 500 illustrated in
Using by way of example the features described above, the phonetic target sub-cost comprises sub-costs for the Left Phone Context (LPhC) and the Right Phone Context (RPhC). For example, when selecting a speech unit /aw/ for the target phone sequence /m aw/, a speech unit /aw/ following a /m/ is desired; yet assume only /aw/'s following other speech units are available. Hence a measurement that can rank similarity of LPhC of the available /aw/'s is needed.
At step 404, acoustic models for sub-units of the speech units are created based on preceding phones of the speech unit and succeeding phones of the speech unit. Although various phone models can be used, in one embodiment by way of example, a biphone model is used to represent the LPhC and RPhC. As indicated above, the models can be estimated from the regular triphone HMMs. Using LPhC by way of example, let l−c+r denote a triphone model, where l, c, and r are the left phone, center phone, and right phone, respectively. When the focus is on the LPhC of c, all triphone models with center phone c and the specified left phone l, whatever the right phone, are extracted and merged into a left biphone model l-c for c. The left biphone models are then substantially independent of their right context, i.e. the states on the right half of the model should have little discriminating information about the right phone context, while the states on the left half of the model preserve the discrimination between left phones.
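The grouping step described above (collecting all triphones with the same center phone and left phone, regardless of right phone) can be sketched as follows. This is a hypothetical illustration of the bookkeeping only; the actual HMM parameter merging is not shown, and the `l-c+r` naming convention is assumed from the text.

```python
from collections import defaultdict

def group_left_biphones(triphone_names):
    """Group triphone model names of the form 'l-c+r' by (left phone, center phone).

    Each group collects every triphone sharing center phone c and left phone l,
    whatever the right phone r; merging the HMMs within one group would yield
    the left biphone model l-c described in the text.
    """
    groups = defaultdict(list)
    for name in triphone_names:
        left, rest = name.split("-")
        center, right = rest.split("+")
        groups[(left, center)].append(name)
    return dict(groups)

triphones = ["m-aw+t", "m-aw+k", "b-aw+t", "b-aw+k"]
biphones = group_left_biphones(triphones)
# biphones[("m", "aw")] collects all /aw/ triphones with left context /m/,
# which would be merged into the left biphone model m-aw.
```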
At step 406, representative measures are calculated for different combinations of sub-units of the same phonetic context. As discussed above, the measure can be the KLD, a novel algorithm for which is discussed below. Also at step 406, the measure can be normalized; for instance, the KLD measure can be normalized with a logarithm function, or a sigmoid function, into a fixed range such as 0 to 1, so that the weight of the sub-cost for each feature can remain essentially unchanged.
At step 408, the representative measures can be organized based on phonetic context in a convenient form such as in a look-up form, thereby forming some of the cost tables 118 of
It should be pointed out that the representative measures obtained with the proposed method are unit dependent, i.e. if 40 phones are defined for English, 40 cost tables will be created for each of the phonetic context vector feature elements LPhC and RPhC.
The prosodic target costs comprise the sub-costs, for example, Position in Phrase (PinP), Position in Word (PinW), Position in Syllable (PinS), Accent Level in Word (AinW) and Emphasis Level of the word in Phrase (EinP).
At step 504, prosody-sensitive HMM acoustic models are obtained from the training data. In particular, after the base monophone HMMs are trained at step 502, the base phone models are split into an appropriate number of prosody-sensitive HMMs for the particular prosodic context feature; i.e., the phone set is expanded by integrating the categorical labels for the prosodic context feature, which may take the form c:x, where x is phone c's categorical label for a given prosodic context feature. Using PinW by way of example, it may have 4 categorical values or labels: at the Onset, Nucleus, or Coda of a word, or in a Mono-syllable word. To model PinW context, base monophone HMMs are first trained, then the base phone models are split into 4 PinW-sensitive HMMs; i.e., the phone set is expanded by integrating with PinW, taking the form c:x, where x is phone c's PinW label. For example, the word ‘robot’ with pronunciation /r ow-b ax t/ is composed of two syllables, where the first syllable is at the PinW Onset and the second at the Coda; thus the phones are expanded in the form /r:o ow:o-b:c ax:c t:c/, where o stands for Onset and c for Coda.
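The syllable-level labeling in the ‘robot’ example can be sketched as below. This is a simplified, hypothetical helper: it assigns one PinW tag per syllable (Onset for the first, Coda for the last, Nucleus otherwise, Mono for single-syllable words) and attaches it to every phone of that syllable, matching the /r:o ow:o-b:c ax:c t:c/ expansion in the text; finer-grained labeling schemes are possible.

```python
def expand_pinw(syllables):
    """Attach Position-in-Word (PinW) labels to each phone of a word.

    syllables : list of syllables, each a list of phone strings.
    Tags: 'o' (Onset) for the first syllable, 'c' (Coda) for the last,
    'n' (Nucleus) for interior syllables, 'm' (Mono) for one-syllable words.
    """
    n = len(syllables)
    labeled = []
    for i, syl in enumerate(syllables):
        if n == 1:
            tag = "m"
        elif i == 0:
            tag = "o"
        elif i == n - 1:
            tag = "c"
        else:
            tag = "n"
        labeled.append(["%s:%s" % (phone, tag) for phone in syl])
    return labeled

# 'robot' = /r ow . b ax t/: first syllable Onset, second Coda
print(expand_pinw([["r", "ow"], ["b", "ax", "t"]]))
# -> [['r:o', 'ow:o'], ['b:c', 'ax:c', 't:c']]
```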
In a manner similar to step 406, at step 506 representative measures for acoustic models having different prosodic contexts are calculated for each speech unit. Again, in one embodiment, this calculation can comprise calculating the KLD of the different prosodic contexts in the manner discussed below, where the calculated measure can be normalized. For instance, the normalized KLD between HMM models c:x1 and c:x2 represents the measure between PinW x1 and x2 for speech unit c. All the prosodic target sub-costs can be calculated in this manner by creating the monophone models extended with the corresponding prosodic labels. However, it should be noted that, with respect to PinS, vowels and consonants generally take on two category values each: vowels are generally either Nucleus or Mono, while consonants are either Onset or Coda. With respect to Accent in Word, this may be based only on whether a vowel is accented or not, although if desired Accent in Word can be extended to consonants. Emphasis in Phrase is based on whether the word is emphasized or not within the phrase.
At step 508, like step 408, the representative measures can be organized based on prosodic context in a convenient form such as in a look-up form, thereby forming some of the cost tables 118 of
As indicated above, speech unit dependent cost tables for each of the context vector features can be generated. This is in contrast to prior art cost tables that are typically shared by all speech units. Speech unit dependent cost tables can improve naturalness in view of the inherent differences in the audible sound of the speech units, particularly when rendered in different contexts. However, in a further embodiment, some cost tables can be shared by a plurality of speech units, if desired. Grouping of speech units to share one or more cost tables can be performed in several ways, including manual grouping according to phonetic knowledge and/or automatic grouping based on the similarity of the cost tables.
A method 600 for using the cost measures calculated above and stored for example in cost tables 118 of
Stated another way,
Kullback-Leibler Divergence Calculation
Kullback-Leibler Divergence (KLD) is a meaningful statistical measure of the dissimilarity between two probabilistic distributions. If two N-dimensional distributions are respectively assigned to probabilistic or statistical models M and {tilde over (M)} of x (where untilded and tilded variables are related to the target model and its competing model, respectively), KLD between the two models can be calculated as:
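The displayed equation appears to have been dropped here; the standard definition of KLD, using the symbols introduced in the surrounding sentence, is:

```latex
D(M \,\|\, \tilde{M}) = \int p(x \mid M)\,
  \log \frac{p(x \mid M)}{p(x \mid \tilde{M})}\, dx
```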
However, given two stochastic processes, it is usually cumbersome to calculate their KLD since the random variable sequence can be infinite in length. Although a procedure has been advanced to approximate the KLD rate between two HMMs, the KLD rate only measures the similarity between the steady states of two HMMs, while, at least with respect to acoustic processing such as speech processing, the dynamic evolution is of more concern than the steady states.
KLD between two Gaussian mixtures forms the basis for comparing a pair of acoustic HMM models. In particular, using an unscented transform approach, KLD between two N dimensional Gaussian mixtures
(where o is a sigma point, w is the kernel weight, μ is the mean vector, Σ is the covariance matrix, and m indexes the M Gaussian kernels) can be approximated by:
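The approximation formula itself appears to be missing; a common unscented-transform approximation consistent with the “where” clause that follows (2N sigma points o<sub>m,k</sub> per N-dimensional Gaussian kernel, kernel weights w<sub>m</sub>) is:

```latex
D(b \,\|\, \tilde{b}) \approx \frac{1}{2N} \sum_{m=1}^{M} w_{m}
  \sum_{k=1}^{2N} \log \frac{b(o_{m,k})}{\tilde{b}(o_{m,k})}
```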
where om,k is the kth sigma point in the mth Gaussian kernel of M Gaussian kernels of b.
Use of the unscented transform is useful in comparing HMM models.
As is known, HMMs for phones can have unequal numbers of states. In the following, a synchronous state matching method is used first to measure the KLD between two equal-length HMMs; it is then generalized via a dynamic programming algorithm to HMMs with different numbers of states. It should be noted that all the HMMs are considered to have a no-skip, left-to-right topology.
In left-to-right HMMs, dummy end states are only used to indicate the end of the observation, so it is reasonable to endow both of them with an identical distribution; as a result, D(bJ∥{tilde over (b)}J)=0. Based on the following decomposition of π (vector of initial probabilities), A (state transition matrix) and d (distance between two states):
the following relationship is obtained:
where T represents transpose, t is the time index, and τ is the length of the observation in terms of time.
By substituting,
an approximation of KLD for symmetric (equal length) HMMs can be represented as:
where Δi,j represents the symmetric KLD between the ith state in the first HMM and the jth state in the second HMM, and can be represented as:
where li=1/(1−aii) is the average duration of the ith state and the terms
Having described calculation of KLD for equal length HMMs, a more flexible KLD method using Dynamic Programming (DP) will be described to deal with two unequal-length left-to-right HMMs, where J and {tilde over (J)} will be used to denote the state numbers of the first and second HMM, respectively.
In a state-synchronized method as described above and illustrated in
will first be compared.
It can be shown that the upper bound can be represented as
DS(H∥{tilde over (H)})≦Δ1,1+Δ2,1+φ(ã11,a11,a22)  (5)
where φ(ã11,a11,a22) is a penalty term following the function φ(z,x,y)=(1−z)/(1−x)+(1−z)/(1−y), although it is to be appreciated that any suitable penalty may be used, including a zero penalty.
Referring to
It has been discovered that the calculation of KLD between two HMMs can be treated in a manner similar to a generalized string matching process, where state and HMM are counterparts of character and string, respectively. Although various algorithms known from string matching can be used, in one embodiment the basic DP algorithm (Seller, P., “The Theory and Computation of Evolutionary Distances: Pattern Recognition”, Journal of Algorithms, 1:359-373, 1980) based on edit distance (Levenshtein, V., “Binary Codes Capable of Correcting Spurious Insertions and Deletions of Ones”, Problems of Information Transmission, 1:8-17, 1965) can be used. The algorithm adapts readily to the present application.
In string matching, three kinds of errors are considered: insertion, deletion and substitution. Edit distances caused by all these operations are identical. In KLD calculation, they should be redefined to measure the divergence reasonably. Based on Equation (5) and the atom operation of state copy, generalized edit distances can be defined as:
Generalized substitution distance: If the ith state in the first HMM and the jth state in the second HMM are compared, the substitution distance should be δS(i,j)=Δi,j.
Generalized insertion distance: During DP, if the ith state in the first HMM is treated as a state insertion, three reasonable choices for its competitor in the second HMM can be considered:
(a) Copy the jth state in the second HMM forward as a competitor, then the insertion distance is
δIF(i,j)=Δi−1,j+Δi,j+φ(ãjj,ai−1,i−1,aii)−Δi−1,j=Δi,j+φ(ãjj,ai−1,i−1,aii)
(b) Copy the j+1th state in the second HMM backward as a competitor, then the insertion distance is
δIB(i,j)=Δi,j+1+Δi+1,j+1+φ(ãj+1,j+1,aii,ai+1,i+1)−Δi+1,j+1=Δi,j+1+φ(ãj+1,j+1,aii,ai+1,i+1)
(c) Incorporate a “non-skippable” short pause (sp) state in the second HMM as a competitor with the ith states in the first HMM, and the insertion distance is defined as δIS(i,j)=Δi,sp. Here the insertion of the sp state is not penalized because it is treated as a modified pronunciation style to have a brief stop in some legal position. It should be noted that the short pause insertion is not always reasonable, for example, it may not appear at any intra-syllable positions.
Generalized deletion distance: A deletion in the first HMM can be treated as an insertion in the second HMM. So the competitor choices and the corresponding distance are symmetric to those in state insertion:
δDF(i,j)=Δi,j+φ(aii,ãj−1,j−1,ãjj),
δDB(i,j)=Δi+1,j+φ(ai+1,i+1,ãjj,ãj+1,j+1),
δDS(i,j)=Δsp,j.
To deal with the case of HMM boundaries, the following are defined:
Δi,j=∞,(i ∉ [1,J−1] or j ∉ [1,{tilde over (J)}−1]),
φ(a,ãj−1,j−1,ãjj)=∞(j ∉ [2,{tilde over (J)}−1]) and
φ(ã,ai−1,i−1,aii)=∞(i ∉ [2,J−1])
In view of the foregoing, a general DP algorithm for calculating KLD between two HMMs regardless of whether they are equal in length can be described. This method is illustrated in
If desired, during DP at step 1104, a J×{tilde over (J)} cost matrix C can be used to save information. Each element Ci,j is an array of {Ci,j,OP},OP∈Ω, where Ci,j,OP is the best partial result when the two HMMs reach their ith and jth states, respectively, and the current operation is OP.
Saving all or some of the operation-related variables may be useful since the current operation depends on the previous one. The “legal” operation matrices listed in Table 1 below may be used to direct the DP procedure. The left table is used when sp is incorporated, and the right one is used when it is forbidden.
For all OP∈Ω, elements of cost matrix C can be filled iteratively as follows:
where From(i,j,OP) is the previous position given the current position (i,j) and current operation OP, from
At the end of the dynamic programming, the KLD approximation can be obtained:
In a further embodiment, another J×{tilde over (J)} matrix B can be used as a counterpart of C at step 1106 to save the best previous operations during DP. Based on the matrix, the best state matching path can be extracted by back-tracing from the end position (J−1,{tilde over (J)}−1).
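The DP procedure described above can be sketched as follows. This is a simplified, hypothetical illustration: it uses 0-based state indices, assumes the symmetric state KLDs Δi,j are precomputed, and implements only substitution plus the forward-copy insertion and deletion competitors (the backward copies, the short-pause competitor, and the back-tracing matrix B are omitted for brevity).

```python
def phi(z, x, y):
    """Penalty term phi(z, x, y) = (1-z)/(1-x) + (1-z)/(1-y) from the text."""
    return (1.0 - z) / (1.0 - x) + (1.0 - z) / (1.0 - y)

def dp_kld(delta, a, a_t):
    """Approximate the KLD between two left-to-right HMMs by dynamic programming.

    delta[i][j] : symmetric KLD between state i of HMM 1 and state j of HMM 2
    a, a_t      : self-loop probabilities a_ii of HMM 1 and HMM 2, respectively
    Returns the accumulated distance at the final state pair.
    """
    J, Jt = len(a), len(a_t)
    INF = float("inf")
    C = [[INF] * Jt for _ in range(J)]
    C[0][0] = delta[0][0]
    for i in range(J):
        for j in range(Jt):
            if i == 0 and j == 0:
                continue
            best = INF
            if i > 0 and j > 0:
                # Substitution: compare state i with state j directly.
                best = min(best, C[i - 1][j - 1] + delta[i][j])
            if i > 0:
                # Insertion in HMM 1: copy state j of HMM 2 forward as competitor.
                best = min(best, C[i - 1][j] + delta[i][j]
                           + phi(a_t[j], a[i - 1], a[i]))
            if j > 0:
                # Deletion in HMM 1: copy state i of HMM 1 forward as competitor.
                best = min(best, C[i][j - 1] + delta[i][j]
                           + phi(a[i], a_t[j - 1], a_t[j]))
            C[i][j] = best
    return C[J - 1][Jt - 1]
```

For two identical equal-length HMMs the substitution path along the diagonal incurs no penalty, so the result reduces to the sum of the (zero) diagonal state KLDs, matching the state-synchronized case.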
In the state-synchronized algorithm, there is a strong assumption that the two observation sequences jump from one state to the next synchronously. For two equal-length HMMs, the algorithm is quite effective and efficient. Considering the calculation of Δ as a basic operation, its computational complexity is O(J). This algorithm lays the foundation for the DP algorithm.
In the DP algorithm, the assumption that the two observation sequences jump from one state to the next one synchronously is relaxed. After penalization, the two expanded state sequences corresponding to the best DP path are equal in length, so the state-synchronized algorithm (
In addition to the examples herein provided, other well known computing systems, environments, and/or configurations may be suitable for use with concepts herein described. Such systems include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The concepts herein described may be embodied in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Those skilled in the art can implement the description and/or figures herein as computer-executable instructions, which can be embodied on any form of computer readable media discussed below.
The concepts herein described may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
With reference to
Computer 1310 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 1310 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 1310.
The system memory 1330 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 1331 and random access memory (RAM) 1332. A basic input/output system 1333 (BIOS), containing the basic routines that help to transfer information between elements within computer 1310, such as during start-up, is typically stored in ROM 1331. RAM 1332 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 1320. By way of example, and not limitation,
The computer 1310 may also include other removable/non-removable volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media discussed above and illustrated in
A user may enter commands and information into the computer 1310 through input devices such as a keyboard 1362, a microphone 1363, and a pointing device 1361, such as a mouse, trackball or touch pad. These and other input devices are often connected to the processing unit 1320 through a user input interface 1360 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port or a universal serial bus (USB). A monitor 1391 or other type of display device is also connected to the system bus 1321 via an interface, such as a video interface 1390.
The computer 1310 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 1380. The remote computer 1380 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 1310. The logical connections depicted in
When used in a LAN networking environment, the computer 1310 is connected to the LAN 1371 through a network interface or adapter 1370. When used in a WAN networking environment, the computer 1310 typically includes a modem 1372 or other means for establishing communications over the WAN 1373, such as the Internet. The modem 1372, which may be internal or external, may be connected to the system bus 1321 via the user-input interface 1360, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 1310, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
It should be noted that the concepts herein described can be carried out on a computer system such as that described with respect to
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not limited to the specific features or acts described above as has been held by the courts. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.