A text-to-speech engine is a software program that generates speech from inputted text. A text-to-speech engine may be useful in applications that use synthesized speech, such as a wireless communication device that reads incoming text messages, a global positioning system (GPS) that provides voice directional guidance, or other portable electronic devices that present information as audio speech.
Many text-to-speech engines use Hidden Markov Model (HMM) based text-to-speech synthesis. A variety of contextual factors may affect the quality of synthesized human speech. For instance, parameters such as spectrum, pitch, and duration may interact with one another during speech synthesis. Thus, important contextual factors for speech synthesis may include, but are not limited to, phone identity, stress, accent, and position. In HMM-based speech synthesis, the label of the HMMs may be composed of a combination of these contextual factors. Moreover, conventional HMM-based speech synthesis also uses a universal Maximum Likelihood (ML) criterion during both training and synthesis. The ML criterion is capable of estimating statistical parameters of the HMMs. The ML criterion may also impose a static-dynamic parameter constraint during speech synthesis, which may help to generate a smooth parametric trajectory that yields highly intelligible speech.
However, speech synthesized using conventional HMM-based approaches may be overly smooth, as ML parameter estimation after decision tree-based tying usually leads to highly averaged HMM parameters. Thus, speech synthesized using the conventional HMM-based approaches may become blurred and muffled. In other words, the quality of the synthesized speech may be degraded.
Described herein are techniques and systems for using rich context modeling to generate Hidden Markov Model (HMM)-based synthesized speech from text. The use of rich context modeling, as described herein, may enable the generation of synthesized speech that is of higher quality (i.e., less blurred and muffled) than speech that is synthesized using conventional HMM-based speech synthesis.
The rich context modeling described herein initially uses a special training procedure to estimate rich context model parameters. Subsequently, speech may be synthesized based on the estimated rich context model parameters. The spectral envelopes of the speech synthesized based on the rich context models may have crisper formant structures and richer details than those obtained from conventional HMM-based speech synthesis.
In at least one embodiment, a text-to-speech engine refines a plurality of rich context models based on decision tree-tied Hidden Markov Models (HMMs) to produce a plurality of refined rich context models. The text-to-speech engine then generates synthesized speech for an input text based at least on some of the plurality of refined rich context models.
This Summary is provided to introduce a selection of concepts in a simplified form that is further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference number in different figures indicates similar or identical items.
The embodiments described herein pertain to the use of rich context modeling to generate Hidden Markov Model (HMM)-based synthesized speech from input text. Many contextual factors may affect HMM-based synthesis of human speech from input text. Some of these contextual factors may include, but are not limited to, phone identity, stress, accent, and position. In HMM-based speech synthesis, the label of the HMMs may be composed of a combination of these contextual factors. “Rich context models”, as used herein, refer to these HMMs as they exist prior to decision tree-based tying. Decision tree-based tying is an operation that is implemented in conventional HMM-based speech synthesis. Each of the rich context models may carry rich segmental and suprasegmental information.
The implementation of text-to-speech engines that use rich context models in HMM-based synthesis may generate speech with crisper formant structures and richer details than those obtained from conventional HMM-based speech synthesis. Accordingly, the use of rich context models in HMM-based speech synthesis may provide synthesized speech that is more natural sounding. As a result, user satisfaction with embedded systems, server systems, and other computing systems that present information via synthesized speech may be increased at a minimal cost. Various example uses of rich context models in HMM-based speech synthesis in accordance with the embodiments are described below with reference to the accompanying figures.
The text-to-speech engine 102 may be implemented on an electronic device 104. The electronic device 104 may be a portable electronic device that includes one or more processors that provide processing capabilities and a memory that provides data storage/retrieval capabilities. In various embodiments, the electronic device 104 may be an embedded system, such as a smart phone, a personal digital assistant (PDA), a digital camera, a global positioning system (GPS) tracking unit, or the like. However, in other embodiments, the electronic device 104 may be a general purpose computer, such as a desktop computer, a laptop computer, a server, or the like. Further, the electronic device 104 may have network capabilities. For example, the electronic device 104 may exchange data with other electronic devices (e.g., laptop computers, servers, etc.) via one or more networks, such as the Internet.
The text-to-speech engine 102 may ultimately convert the input text 106 into synthesized speech 108. The input text 106 may be inputted into the text-to-speech engine 102 as electronic data (e.g., ASCII data). In turn, the text-to-speech engine 102 may output synthesized speech 108 in the form of an audio signal. In various embodiments, the audio signal may be electronically stored in the electronic device 104 for subsequent retrieval and/or playback. The outputted synthesized speech 108 (i.e., audio signal) may be further transformed by the electronic device 104 into an acoustic form via one or more speakers.
During the conversion of input text 106 into synthesized speech 108, the text-to-speech engine 102 may generate rich context models 110 from the input text 106. The text-to-speech engine 102 may further refine the rich context models 110 into refined rich context models 112 based on decision tree-tied Hidden Markov Models (HMMs) 114. In various embodiments, the decision tree-tied HMMs 114 may also be generated by the text-to-speech engine 102 from the input text 106.
Subsequently, the text-to-speech engine 102 may derive a guiding sequence 116 of HMM models from the decision tree-tied HMMs 114 for the input text 106. The text-to-speech engine 102 may also generate a plurality of candidate sequences of rich context models 118 for the input text 106. The text-to-speech engine 102 may then compare the plurality of candidate sequences 118 to the guiding sequence of HMM models 116. The comparison may enable the text-to-speech engine 102 to obtain an optimal sequence of rich context models 120 from the plurality of candidate sequences 118. The text-to-speech engine 102 may then produce synthesized speech 108 from the optimal sequence 120.
The selected components may be implemented on an electronic device 104 (
The memory 204 may include volatile and/or nonvolatile memory, removable and/or non-removable media implemented in any method or technology for storage of information, such as computer-readable instructions, data structures, program modules or other data. Such memory may include, but is not limited to, random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology; CD-ROM, digital versatile disks (DVD) or other optical storage; magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices; and RAID storage systems, or any other medium which can be used to store the desired information and is accessible by a computer system. Further, the components may be in the form of routines, programs, objects, and data structures that cause the performance of particular tasks or implement particular abstract data types.
The memory 204 may store components of the text-to-speech engine 102. The components, or modules, may include routines, program instructions, objects, and/or data structures that perform particular tasks or implement particular abstract data types. The components may include a training module 206, a pre-selection module 208, an HMM sequence module 210, a least divergence module 212, a unit pruning module 214, a cross correlation search module 216, a waveform concatenation module 218, and a synthesis module 220. The components may further include a user interface module 222, an application module 224, an input/output module 226, and a data storage module 228.
The training module 206 may train a set of rich context models 110, and in turn, a set of decision tree-tied HMMs 114, to model speech data. For example, the set of HMMs 114 may be trained via, e.g., a broadcast news style North American English speech sample corpus for the generation of American-accented English speech. In other examples, the set of HMMs 114 may be similarly trained to generate speech in other languages (e.g., Chinese, Japanese, French, etc.). In various embodiments, the training module 206 may initially derive the set of rich context models 110. In at least one embodiment, the rich context models may be initialized by cloning mono-phone models.
The training module 206 may estimate the variance parameters for the set of rich context models 110. Subsequently, the training module 206 may derive the decision tree-tied HMMs 114 from the set of rich context models 110. In at least one embodiment, a universal Maximum Likelihood (ML) criterion may be used to estimate statistical parameters of the set of decision tree-tied HMMs 114.
The training module 206 may further refine the set of rich context models 110 based on the decision tree-tied HMMs 114 to generate a set of refined rich context models 112. In various embodiments of the refinement, the training module 206 may designate the set of decision-tree tied HMMs 114 as a reference. Based on the reference, the training module 206 may perform a single pass re-estimation to estimate the mean parameters for the set of rich context models 110. This re-estimation may rely on the set of decision tree-tied HMMs 114 to obtain the state-level alignment of the speech corpus. The mean parameters of the set of rich context models 110 may be estimated according to the alignment.
Subsequently, the training module 206 may tie the variance parameters of the set of rich context models 110 using a conventional tree structure to generate the set of refined rich context models 112. In other words, the variance parameters of the set of rich context models 110 may be set to be equal to the variance parameters of the set of decision tree-tied HMMs 114. In this way, the data alignment of the rich context models during training may be ensured by the set of decision tree-tied HMMs 114. As further described below, the refined rich context models 112 may be stored in a data storage module 228.
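The refinement described above may be illustrated with a minimal Python sketch. It assumes single-Gaussian states with diagonal covariance and a state-level alignment that has already been obtained from the decision tree-tied HMMs 114. The function name `refine_rich_context_models`, the `tied_state_of` mapping, and the array layout are hypothetical and are not taken from the embodiments.

```python
import numpy as np

def refine_rich_context_models(frames, rich_labels, tied_state_of, tied_variances):
    """Re-estimate one mean per rich context state and copy the tied variance.

    frames: (T, D) array of acoustic feature vectors.
    rich_labels: length-T sequence giving the rich context state of each frame
        (the state-level alignment, assumed to come from the tied HMMs).
    tied_state_of: dict mapping rich context state label -> tied state id.
    tied_variances: dict mapping tied state id -> (D,) variance vector.
    """
    frames = np.asarray(frames, dtype=float)
    sums, counts = {}, {}
    for x, label in zip(frames, rich_labels):
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1

    refined = {}
    for label in sums:
        mean = sums[label] / counts[label]  # single-pass mean estimate
        # Variance is not re-estimated; it is set equal to the tied state's.
        variance = np.asarray(tied_variances[tied_state_of[label]], dtype=float)
        refined[label] = (mean, variance)
    return refined
```

In this sketch, only the means differ between rich context states that share a tied state, which reflects the idea that the tied models supply the alignment and variances while the rich context models retain un-averaged means.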
The pre-selection module 208 may compose a rich context model candidate sausage. The composition of a rich context model candidate sausage may be the first step in the selection and assembly of a sequence of rich context models that represents the input text 106 from the set of refined rich context models 112.
In some embodiments, the pre-selection module 208 may initially extract the tri-phone-level context of each target rich context label of the input text 106 to form a pattern. Subsequently, the pre-selection module 208 may choose one or more refined rich context models 112 that match this tri-phone pattern to form a sausage node of the rich context model candidate sausage. The pre-selection module 208 may further connect successive sausage nodes to compose the candidate sausage. The use of tri-phone-level, context-based pre-selection by the pre-selection module 208 may keep the sequence selection search space at a reasonable size. In other words, the tri-phone-level pre-selection may maintain a good balance between sequence candidate coverage and sequence selection search space size.
However, in alternative embodiments in which the pre-selection module 208 is unable to obtain a tri-phone pattern, the pre-selection module 208 may extract the bi-phone-level context of each target rich context label of the input text 106 to form a pattern. Subsequently, the pre-selection module 208 may choose one or more refined rich context models 112 that match this bi-phone pattern to form a sausage node.
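The tri-phone pre-selection with bi-phone fallback may be sketched as follows. This is an illustration under stated assumptions: each refined rich context model is reduced to a hypothetical `(model_id, left, center, right)` tuple, and the bi-phone pattern is assumed to be the left and center phones, which the description above does not specify.

```python
from collections import defaultdict

def build_indexes(models):
    """Index refined rich context models by tri-phone and bi-phone context.

    models: iterable of (model_id, left, center, right) tuples
        (hypothetical layout chosen for this sketch).
    """
    tri, bi = defaultdict(list), defaultdict(list)
    for model_id, left, center, right in models:
        tri[(left, center, right)].append(model_id)
        bi[(left, center)].append(model_id)  # assumed bi-phone: left + center
    return tri, bi

def compose_sausage(targets, tri, bi):
    """Return one candidate list (sausage node) per target label, in order."""
    sausage = []
    for left, center, right in targets:
        node = tri.get((left, center, right))
        if not node:
            # No tri-phone match: back off to the bi-phone pattern.
            node = bi.get((left, center), [])
        sausage.append(list(node))
    return sausage
```

Successive nodes are connected simply by keeping them in target order, so every path that picks one candidate per node is a candidate sequence.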
The pre-selection module 208 may connect successive sausage nodes to compose a rich context model candidate sausage, as shown in
Returning to
The least divergence module 212 may determine the optimal sequence 120 from a rich context model candidate sausage, such as the candidate sausage 302 of the input text 106. The optimal sequence 120 may be further used to generate a speech trajectory that is eventually converted into synthesized speech.
In various embodiments, the optimal sequence 120 may be a sequence of rich context models that exhibits a global trend that is “closest” to the guiding sequence 116. It will be appreciated that the guiding sequence 116 may provide an over-smoothed but stable trajectory. Therefore, by using this stable trajectory as a guide, the least divergence module 212 may select a sequence of rich context models, or optimal sequence 120, that has the smoothness of the guiding sequence 116 and the improved local speech fidelity provided by the refined rich context models 112.
The least divergence module 212 may search for the “closest” rich context model sequence by measuring the distance between the guiding sequence 116 and a plurality of rich context model candidate sequences 118 that are encompassed in the candidate sausage 302. In at least one embodiment, the least divergence module 212 may adopt an upper-bound of a state-aligned Kullback-Leibler divergence (KLD) approximation as the distance measure, in which spectrum, pitch, and duration information are considered simultaneously.
Thus, given P={p1, p2, . . . , pN} as the decision tree-tied guiding sequence 116, the least divergence module 212 may determine the state-level duration of the guiding sequence 116 using the conventional duration model, which may be denoted as T={t1, t2, . . . , tN}. Further, for each of the rich context model candidate sequences 118, the least divergence module 212 may set the corresponding state sequence to be aligned to the guiding sequence 116 in a one-to-one mapping. It will be appreciated that due to the particular structure of the candidate sausage 302, the guiding sequence 116 and each of the candidate sequences 118 may have the same number of states. Therefore, any of the candidate sequences 118 may be denoted as Q={q1, q2, . . . , qN}, and may share the same duration with the guiding sequence 116.
Accordingly, the least divergence module 212 may use the following approximated criterion to measure the distance between the guiding sequence 116 and each of the candidate sequences 118 (in which S represents spectrum, and f0 represents pitch):
$$D(P,Q)=\sum_{n=1}^{N} D_{KL}(p_n,q_n)\cdot t_n \qquad (1)$$
and in which $D_{KL}(p,q)=D_{KL}^{S}(p,q)+D_{KL}^{f0}(p,q)$ is the sum of the upper-bound KLD for the spectrum and pitch parameters between two multi-space probability distribution (MSD)-HMM states:
in which $w_0$ and $w_1$ may represent prior probabilities of the discrete and continuous sub-space (for $D_{KL}^{S}(p,q)$, $w_0 \equiv 0$ and $w_1 \equiv 1$), and $\mu$ and $\Sigma$ may be mean and variance parameters, respectively.
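The expression for the state-level KLD does not survive in this text. For orientation only, and as an assumption rather than a reproduction of the missing equation, the symmetric KLD between two MSD distributions p and q, each having a discrete sub-space with weight $w_0$ and a single Gaussian with weight $w_1$ in the continuous sub-space, may be written in closed form as:

$$
\begin{aligned}
D_{KL}^{sym}(p,q) ={}& (w_0^{p}-w_0^{q})\log\frac{w_0^{p}}{w_0^{q}} + (w_1^{p}-w_1^{q})\log\frac{w_1^{p}}{w_1^{q}} \\
&+ \frac{1}{2}\operatorname{tr}\Big\{\big(w_1^{p}\Sigma_q^{-1}+w_1^{q}\Sigma_p^{-1}\big)(\mu_p-\mu_q)(\mu_p-\mu_q)^{\top} + w_1^{p}\big(\Sigma_p\Sigma_q^{-1}-I\big) + w_1^{q}\big(\Sigma_q\Sigma_p^{-1}-I\big)\Big\} \\
&+ \frac{1}{2}\,(w_1^{p}-w_1^{q})\log\frac{|\Sigma_q|}{|\Sigma_p|}
\end{aligned}
$$

For the spectrum stream, where $w_0 \equiv 0$ and $w_1 \equiv 1$, the weight terms vanish (taking $0\cdot\log(0/0)$ as 0) and the expression reduces to $\tfrac{1}{2}\operatorname{tr}\{(\Sigma_p^{-1}+\Sigma_q^{-1})(\mu_p-\mu_q)(\mu_p-\mu_q)^{\top}+\Sigma_p\Sigma_q^{-1}+\Sigma_q\Sigma_p^{-1}-2I\}$, the symmetric KLD between two Gaussians.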
By using equations (1) and (2), spectrum, pitch and duration may be embedded in a single distance measure. Accordingly, the least divergence module 212 may select an optimal sequence of rich context models 120 from the rich context model candidate sausage 302 by minimizing the total distance D(P,Q). In various embodiments, the least divergence module 212 may select the optimal sequence 120 by choosing the best rich context candidate models for every node of the candidate sausage 302 to form the optimal global solution.
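The selection by minimal total distance may be illustrated with a short Python sketch. It is a simplification under stated assumptions: each state is a hypothetical `(mean, variance)` pair with diagonal covariance, and only the continuous (spectrum-like) term is modeled using the symmetric Gaussian KLD, so the MSD weights used for pitch are omitted.

```python
import numpy as np

def symmetric_kld_diag(mu_p, var_p, mu_q, var_q):
    """Symmetric KLD between two diagonal-covariance Gaussians."""
    mu_p, var_p = np.asarray(mu_p, float), np.asarray(var_p, float)
    mu_q, var_q = np.asarray(mu_q, float), np.asarray(var_q, float)
    diff2 = (mu_p - mu_q) ** 2
    return 0.5 * np.sum((1.0 / var_p + 1.0 / var_q) * diff2
                        + var_p / var_q + var_q / var_p - 2.0)

def sequence_distance(guide, candidate, durations):
    """D(P, Q) = sum_n D_KL(p_n, q_n) * t_n for state-aligned sequences."""
    return sum(symmetric_kld_diag(*p, *q) * t
               for p, q, t in zip(guide, candidate, durations))

def select_least_divergence(guide_nodes, durations_nodes, sausage):
    """Pick, for every sausage node, the index of the closest candidate."""
    best = []
    for guide, durs, candidates in zip(guide_nodes, durations_nodes, sausage):
        dists = [sequence_distance(guide, cand, durs) for cand in candidates]
        best.append(int(np.argmin(dists)))
    return best
```

Because D(P,Q) is a sum over states, and each node contributes its own states, choosing the closest candidate independently at every node also minimizes the total distance over the whole sausage.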
The unit pruning module 214, in combination with the cross correlation search module 216 and the waveform concatenation module 218, may also determine the optimal sequence 120 from a rich context model candidate sausage, such as the candidate sausage 302 of the input text 106. Thus, in some embodiments, the combination of the unit pruning module 214, the cross correlation search module 216, and the waveform concatenation module 218 may be implemented as an alternative to the least divergence module 212.
The unit pruning module 214 may prune candidate sequences of rich context models 118 encompassed in the candidate sausage 302 that are farther than a predetermined distance from the guiding sequence 116. In other words, the unit pruning module 214 may select one or more candidate sequences 118 with less than a predetermined amount of distortion from the guiding sequence 116.
During operation, the unit pruning module 214 may first consider the spectrum and pitch information to perform pruning within each sausage node of the candidate sausage 302. For example, given sausage node i, and that the guiding sequence 116 is denoted by Pi={pi(1), pi(2), . . . , pi(S)}, the corresponding state duration of node i may be represented by Ti={ti(1), ti(2), . . . , ti(S)}. Further, the Ni rich context model candidates of node i may be denoted by Qij, 1≦j≦Ni, with states qij(s).
Thus, the unit pruning module 214 may use the following approximated criterion to measure the distance between the guiding sequence 116 and each of the candidate sequences 118:
$$D(P_i,Q_{ij})=\sum_{s=1}^{S} D_{KL}\big(p_i(s),q_{ij}(s)\big)\cdot t_i(s) \qquad (3)$$
in which $D_{KL}(p,q)=D_{KL}^{S}(p,q)+D_{KL}^{f0}(p,q)$ is the sum of the upper-bound KLD for the spectrum and pitch parameters between two multi-space probability distribution (MSD)-HMM states:
and in which $w_0$ and $w_1$ may be prior probabilities of the discrete and continuous sub-space (for $D_{KL}^{S}(p,q)$, $w_0 \equiv 0$ and $w_1 \equiv 1$), and $\mu$ and $\Sigma$ may be mean and variance parameters, respectively.
Moreover, by using equations (3) and (4), as well as a beam width of β, the unit pruning module 214 may prune those candidate sequences 118 for which:
$$D(P_i,Q_{ij}) > \min_{1\le j\le N_i} \ldots \qquad (5)$$
Accordingly, for each sausage node, only the one or more candidate sequences 118 with distortions that are below a predetermined threshold from the guiding sequence 116 may survive pruning. In various embodiments, the distortion may be calculated based not only on the static parameters of the models, but also their delta and delta-delta parameters.
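A minimal sketch of beam pruning within one sausage node follows. The exact right-hand side of the pruning criterion is not recoverable from this text, so the sketch assumes, purely for illustration, that the beam width β acts as an additive margin over the smallest distance in the node. The function name `beam_prune` is hypothetical.

```python
def beam_prune(distances, beta):
    """Keep indices of candidates whose distance is within beta of the best.

    distances: list of D(P_i, Q_ij) values for one sausage node.
    beta: beam width, assumed here to be an additive margin over the minimum.
    """
    if not distances:
        return []
    threshold = min(distances) + beta
    return [j for j, d in enumerate(distances) if d <= threshold]
```

Under this assumption the best candidate of each node always survives, so pruning can never empty a non-empty node.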
The unit pruning module 214 may also consider duration information to perform pruning within each sausage node of the candidate sausage 302. In other words, the unit pruning module 214 may further prune candidate sequences 118 with durations that do not fall within a predetermined duration interval. In at least one embodiment, for a sausage node i, the target phone-level mean and variance given by a conventional HMM-based duration model may be represented by μi and σi², respectively. In such an embodiment, the unit pruning module 214 may prune those candidate sequences 118 for which:
$$|d_{ij}-\mu_i| > \gamma\,\sigma_i \qquad (6)$$
in which dij is the duration of the jth candidate sequence, and γ is a ratio controlling the pruning threshold.
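The duration test of equation (6) may be sketched in a few lines of Python. The function name `duration_prune` and the choice to return surviving indices are assumptions of this sketch.

```python
import math

def duration_prune(durations, mean, variance, gamma):
    """Keep indices j for which |d_ij - mu_i| <= gamma * sigma_i.

    durations: candidate durations d_ij for one sausage node.
    mean, variance: target phone-level duration mean and variance.
    gamma: ratio controlling the pruning threshold.
    """
    sigma = math.sqrt(variance)
    return [j for j, d in enumerate(durations) if abs(d - mean) <= gamma * sigma]
```

For example, with a target mean of 12 frames, a variance of 4, and γ = 2, candidates lasting between 8 and 16 frames would survive.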
In some embodiments, the unit pruning module 214 may perform the calculations in equations (3) and (4) in advance, such as during an off-line training phase, rather than during an actual run-time of the speech synthesis. Accordingly, the unit pruning module 214 may generate a KLD target cost table 230 during the advance calculation that stores the target cost data. The target cost table 230 may be further used during a search for an optimal rich context unit path.
The cross correlation search module 216 may search for an optimal rich context unit path through rich context models of the one or more candidate sequences 118 in the candidate sausage 302 that have survived pruning. In this way, the cross correlation search module 216 may derive the optimal rich context model sequence 120. The optimal rich context model sequence 120 may be the smoothest rich context model sequence. In various embodiments, the cross correlation search module 216 may implement the search as a search for a path with minimal concatenation cost. Accordingly, the optimal sequence 120 may be a minimal concatenation cost sequence.
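A search for a minimal concatenation cost path may be implemented with dynamic programming over the sausage nodes. The following sketch assumes a caller-supplied `concat_cost` function, which is hypothetical (for instance, one minus the maximal normalized cross correlation between two waveform units), and is offered as one possible realization rather than the specific search used by the embodiments.

```python
def min_concatenation_path(sausage, concat_cost):
    """Return (total_cost, path) minimizing the summed concatenation cost.

    sausage: list of nodes; each node is a non-empty list of candidate ids
        that survived pruning.
    concat_cost: function (left_id, right_id) -> non-negative cost.
    """
    # best[c] = (cost of the cheapest path ending in candidate c, that path)
    best = {c: (0.0, [c]) for c in sausage[0]}
    for node in sausage[1:]:
        new_best = {}
        for c in node:
            cost, path = min(
                ((prev_cost + concat_cost(p, c), prev_path)
                 for p, (prev_cost, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            new_best[c] = (cost, path + [c])
        best = new_best
    return min(best.values(), key=lambda item: item[0])
```

The work grows with the product of candidate counts in adjacent nodes, which is one reason pruning each node before the search may be worthwhile.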
The waveform concatenation module 218 may concatenate waveform units along a path of the derived optimal rich context model sequence 120 to form an optimized waveform sequence. The optimized waveform sequence may be further converted into synthesized speech. In various embodiments, the waveform concatenation module 218 may use a normalized cross correlation as the measure of concatenation smoothness. Given two time series x(t), y(t), and an offset of d, the cross correlation module 216 may calculate the normalized cross correlation r(d) as follows:
in which μx and μy are the means of x(t) and y(t) within the calculating window, respectively. Thus, at each concatenation point in the sausage 302, and for each waveform pair, the cross correlation module 216 may first calculate the best offset d that yields the maximal possible r(d), as illustrated in
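The offset search may be illustrated with a minimal Python sketch. Because the expression for r(d) does not survive in this text, the sketch assumes the standard definition of normalized cross correlation, in which mean-removed samples of x(t) and y(t−d) are multiplied over their overlapping window and divided by the product of their norms. The function names and the symmetric offset range are assumptions of the sketch.

```python
import numpy as np

def normalized_cross_correlation(x, y, d):
    """r(d) over the window where x(t) and y(t - d) overlap (|d| < length)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = min(len(x), len(y))
    if d >= 0:
        xs, ys = x[d:n], y[:n - d]
    else:
        xs, ys = x[:n + d], y[-d:n]
    # Remove the window means before correlating.
    xs, ys = xs - xs.mean(), ys - ys.mean()
    denom = np.sqrt(np.sum(xs ** 2) * np.sum(ys ** 2))
    return float(np.sum(xs * ys) / denom) if denom > 0 else 0.0

def best_offset(x, y, max_offset):
    """Offset d in [-max_offset, max_offset] that maximizes r(d)."""
    offsets = range(-max_offset, max_offset + 1)
    scores = [normalized_cross_correlation(x, y, d) for d in offsets]
    k = int(np.argmax(scores))
    return offsets[k], scores[k]
```

The maximal r(d) found this way could then serve as a smoothness score for the waveform pair, with higher values indicating a smoother join.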
Returning to
Following the selection of the optimal sequence of the rich context models 120 or a waveform sequence that is derived from the optimal sequence 120, the text-to-speech engine 102 may further use the synthesis module 220 to process the optimal sequence 120 or the waveform sequence into synthesized speech 108.
The synthesis module 220 may process the optimal sequence 120, or the waveform sequence that is derived from the optimal sequence 120, into synthesized speech 108. In various embodiments, the synthesis module 220 may use the predicted speech data from the input text 106, such as the speech patterns, line spectral pair (LSP) coefficients, fundamental frequency, gain, and/or the like, in combination with the optimal sequence 120 or the waveform sequence to generate the synthesized speech 108.
The user interface module 222 may interact with a user via a user interface (not shown). The user interface may include a data output device (e.g., visual display, audio speakers), and one or more data input devices. The data input devices may include, but are not limited to, combinations of one or more of keypads, keyboards, mouse devices, touch screens, microphones, speech recognition packages, and any other suitable devices or other electronic/software selection methods. The user interface module 222 may enable a user to input or select the input text 106 for conversion into synthesized speech 108.
The application module 224 may include one or more applications that utilize the text-to-speech engine 102. For example, but not as a limitation, the one or more applications may include a global positioning system (GPS) navigation application, a dictionary application, a text messaging application, a word processing application, and the like. Accordingly, in various embodiments, the text-to-speech engine 102 may include one or more interfaces, such as one or more application program interfaces (APIs), which enable the application module 224 to provide input text 106 to the text-to-speech engine 102.
The input/output module 226 may enable the text-to-speech engine 102 to receive input text 106 from another device. For example, the text-to-speech engine 102 may receive input text 106 from another electronic device (e.g., a server) via one or more networks. Moreover, the input/output module 226 may also provide the synthesized speech 108 to the audio speakers for acoustic output, or to the data storage module 228.
As described above, the data storage module 228 may store the refined rich context models 112. The data storage module 228 may further store the input text 106, as well as rich context models 110, decision tree-tied HMMs 114, the guiding sequence of HMM models 116, the plurality of candidate sequences of rich context models 118, the optimal sequence 120, and the synthesized speech 108. However, in embodiments in which the target cost table 230 and the concatenation cost table 232 are generated, the data storage module 228 may store the tables 230 and 232 instead of the rich context models 110 and the decision tree-tied HMMs 114. The one or more input texts 106 may be in various forms, such as documents in various formats, downloaded web pages, and the like. The data storage module 228 may also store any additional data used by the text-to-speech engine 102, such as various additional intermediate data produced during the production of the synthesized speech 108 from the input text 106, e.g., waveform sequences.
At block 502, the training module 206 of the text-to-speech engine 102 may derive rich context models 110 and trained decision tree-tied HMMs 114 based on a speech corpus. The speech corpus may be a corpus of one of a variety of languages, such as English, French, Chinese, Japanese, etc.
At block 504, the training module 206 may further estimate the mean parameters of the rich context models 110 based on the trained decision tree-tied HMMs 114. In at least one embodiment, the training module 206 may perform the estimation of the mean parameters via a single pass re-estimation. The single pass re-estimation may use the trained decision tree-tied HMMs 114 to obtain the state-level alignment of the speech corpus. The mean parameters of the rich context models 110 may be estimated according to this alignment.
At block 506, based on the estimated mean parameters, the training module 206 may set the variance parameters of the rich context models 110 equal to those of the trained decision tree-tied HMMs 114. Thus, the training module 206 may produce refined rich context models 112 via blocks 502-506.
At block 508, the text-to-speech engine 102 may generate synthesized speech 108 for an input text 106 using at least some of the refined rich context models 112.
At block 510, the text-to-speech engine 102 may output the synthesized speech 108. In various embodiments, the electronic device 104 on which the text-to-speech engine 102 resides may use speakers to transmit the synthesized speech 108 as acoustic energy to be heard by a user. The electronic device 104 may also store the synthesized speech 108 as data in the data storage module 228 for subsequent retrieval and/or output.
At block 602, the pre-selection module 208 of the text-to-speech engine 102 may perform a pre-selection of the refined rich context models 112. The pre-selection may compose a rich context model candidate sausage 302.
At block 604, the HMM sequence module 210 may obtain a guiding sequence 116 from the decision tree-tied HMMs 114 that corresponds to the input text 106. In various embodiments, the HMM sequence module may obtain the guiding sequence of decision tree-tied HMMs 116 from the set of decision tree-tied HMMs 114 using conventional techniques.
At block 606, the least divergence module 212 may obtain the optimal sequence 120 from a rich context model candidate sausage, such as the candidate sausage 302 of the input text 106. The candidate sausage 302 may encompass the plurality of rich context model candidate sequences 118. In various embodiments, the least divergence module 212 may select the optimal sequence 120 by finding a rich context model sequence with the “shortest” measured distance from the guiding sequence 116 that is included in the plurality of rich context model candidate sequences 118.
At block 608, the synthesis module 220 may generate and output synthesized speech 108 based on the selected optimal sequence 120 of rich context models.
At block 702, the pre-selection module 208 of the text-to-speech engine 102 may perform a pre-selection of the refined rich context models 112. The pre-selection may compose a rich context model candidate sausage 302.
At block 704, the HMM sequence module 210 may obtain a guiding sequence 116 from the decision tree-tied HMMs 114 that corresponds to the input text 106. In various embodiments, the HMM sequence module may obtain the guiding sequence of decision tree-tied HMMs 116 from the set of decision tree-tied HMMs 114 using conventional techniques.
At block 706, the unit pruning module 214 may prune rich context model candidate sequences 118 encompassed in the candidate sausage 302 that are farther than a predetermined distance from the guiding sequence 116. In other words, the unit pruning module 214 may select one or more candidate sequences 118 that are within a predetermined distance from the guiding sequence 116. In various embodiments, the unit pruning module 214 may perform the pruning based on spectrum, pitch, and duration information of the candidate sequences 118. In at least one of such embodiments, the unit pruning module 214 may generate the target cost table 230 in advance of the actual speech synthesis. The target cost table 230 may facilitate the pruning of the rich context model candidate sequences 118.
At block 708, the cross correlation search module 216 may conduct a cross correlation-based search to derive the optimal rich context model sequence 120 encompassed in the candidate sausage 302 from the one or more candidate sequences 118 that survived the pruning. In various embodiments, the cross correlation module 216 may implement the search for the optimal sequence 120 as a search for a minimal concatenation cost path through the rich context models of the one or more surviving candidate sequences 118. Accordingly, the optimal sequence 120 may be a minimal concatenation cost sequence. In some embodiments, the waveform concatenation module 218 may calculate the normalized cross-correlation in advance of the actual speech synthesis to build a concatenation cost table 232. The concatenation cost table 232 may be used to facilitate the selection of the optimal rich context model sequence 120.
At block 710, the waveform concatenation module 218 may concatenate waveform units along a path of the derived optimal sequence 120 to form an optimized waveform sequence. The synthesis module 220 may further convert the optimized waveform sequence into synthesized speech.
In at least one configuration, computing device 800 typically includes at least one processing unit 802 and system memory 804. Depending on the exact configuration and type of computing device, system memory 804 may be volatile (such as RAM), non-volatile (such as ROM, flash memory, etc.) or some combination thereof. System memory 804 may include an operating system 806, one or more program modules 808, and may include program data 810. The operating system 806 includes a component-based framework 812 that supports components (including properties and events), objects, inheritance, polymorphism, reflection, and provides an object-oriented component-based application programming interface (API), such as, but by no means limited to, that of the .NET™ Framework manufactured by the Microsoft® Corporation, Redmond, Wash. The computing device 800 is of a very basic configuration demarcated by a dashed line 814. Again, a terminal may have fewer components but may interact with a computing device that may have such a basic configuration.
Computing device 800 may have additional features or functionality. For example, computing device 800 may also include additional data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape. Such additional storage is illustrated in
Computing device 800 may also contain communication connections 824 that allow the device to communicate with other computing devices 826, such as over a network. These networks may include wired networks as well as wireless networks. Communication connections 824 are some examples of communication media. Communication media may typically be embodied by computer readable instructions, data structures, program modules, etc.
It is appreciated that the illustrated computing device 800 is only one example of a suitable device and is not intended to suggest any limitation as to the scope of use or functionality of the various embodiments described. Other well-known computing devices, systems, environments and/or configurations that may be suitable for use with the embodiments include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, game consoles, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and/or the like.
The implementation of text-to-speech engines that use rich context models in HMM-based synthesis may generate speech with crisper formant structures and richer details than those obtained from conventional HMM-based speech synthesis. Accordingly, the use of rich context models in HMM-based speech synthesis may provide synthesized speech that is more natural sounding. As a result, user satisfaction with embedded systems that present information via synthesized speech may be increased at a minimal cost.
In closing, although the various embodiments have been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as exemplary forms of implementing the claimed subject matter.
This application claims priority to U.S. Provisional Patent Application No. 61/239,135 to Yan et al., entitled “Rich Context Modeling for Text-to-Speech Engines”, filed on Sep. 2, 2009, and incorporated herein by reference.
Number | Date | Country |
---|---|---|
20110054903 A1 | Mar 2011 | US |

Number | Date | Country |
---|---|---|
61239135 | Sep 2009 | US |