Field
The disclosure relates to techniques for text-to-speech conversion with emotional content.
Background
Computer speech synthesis is an increasingly common human interface feature found in modern computing devices. In many applications, the emotional impression conveyed by the synthesized speech is important to the overall user experience. The perceived emotional content of speech may be affected by such factors as the rhythm and prosody of the synthesized speech.
Text-to-speech techniques commonly ignore the emotional content of synthesized speech altogether, generating only emotionally “neutral” renditions of a given script. Alternatively, text-to-speech techniques may utilize a separate voice model for each emotion type, incurring the relatively high cost of storing many separate voice models in memory. Such techniques are also inflexible when it comes to generating speech with emotional content for which no voice models are readily available.
Accordingly, it would be desirable to provide novel and efficient techniques for text-to-speech conversion with emotional content.
Summary
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Briefly, various aspects of the subject matter described herein are directed towards techniques for generating speech output having emotional content. In an aspect, a “neutral” representation of a script is prepared using an emotionally neutral model. Emotion-specific adjustments are separately prepared for the script based on a desired emotion type for the speech output, and the emotion-specific adjustments are applied to the neutral representation to generate a transformed representation. In an aspect, the emotion-specific adjustments may be applied on a per-phoneme, per-state, or per-frame basis, and may be stored and categorized (or clustered) by an independent emotion-specific decision tree or other clustering scheme. The clustering schemes for each emotion type may be distinct both from each other and from a clustering scheme used for the neutral model parameters.
Other advantages may become apparent from the following detailed description and drawings.
Detailed Description
Various aspects of the technology described herein are generally directed towards a technology for generating speech output with a given emotion type. The detailed description set forth below in connection with the appended drawings is intended as a description of exemplary aspects of the invention and is not intended to represent the only exemplary aspects in which the invention can be practiced. The term “exemplary” used throughout this description means “serving as an example, instance, or illustration,” and should not necessarily be construed as preferred or advantageous over other exemplary aspects. The detailed description includes specific details for the purpose of providing a thorough understanding of the exemplary aspects of the invention. It will be apparent to those skilled in the art that the exemplary aspects of the invention may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the novelty of the exemplary aspects presented herein.
Based on the processing performed by processor 125, device 120 may generate speech output 126 responsive to speech input 122, using audio speaker 128. Note in alternative processing scenarios, device 120 may also generate speech output 126 independently of speech input 122, e.g., device 120 may autonomously provide alerts or relay messages from other users (not shown) to user 110 in the form of speech output 126.
Speech recognition 210 generates a text rendition of spoken words in speech input 122. Techniques for speech recognition may utilize, e.g., Hidden Markov Models (HMMs) having statistical parameters trained from text databases.
Language understanding 220 is performed on the output of speech recognition 210. In an exemplary embodiment, functions such as parsing and grammatical analysis may be performed to derive the intended meaning of the speech according to natural language understanding techniques.
Emotion response decision 230 generates a suitable emotional response to the user's speech input as determined by language understanding 220. For example, if it is determined that the user's speech input calls for a “happy” emotional response by dialog system 200, then emotion response decision 230 may specify an emotion type 230a corresponding to “happy.”
Output script generation 240 generates a suitable output script 240a in response to the user's speech input 220a as determined by language understanding 220, and also based on the emotion type 230a determined by emotion response decision 230. Output script generation 240 presents the generated output script 240a in a natural language format, e.g., obeying lexical and grammatical rules, for ready comprehension by the user. Output script 240a of output script generation 240 may be in the form of, e.g., sentences in a target language conveying an appropriate response to the user in a natural language format.
Text-to-speech (TTS) conversion 250 synthesizes speech output 126 having textual content as determined by output script 240a, and emotional content as determined by emotion type 230a. Speech output 126 of text-to-speech conversion 250 may be an audio waveform, and may be provided to a listener, e.g., user 110.
As mentioned hereinabove, it is desirable in certain applications for speech output 126 to be generated not only as an emotionally neutral rendition of text, but further for speech output 126 to convey specific emotional content to user 110. Existing techniques for generating artificial speech with emotional content rely on recordings of speakers delivering speech with the pre-specified emotion type, or otherwise require full speech models to be trained for each emotion type, leading to prohibitive storage requirements for the models and a limited range of emotional output expression. Accordingly, it would be desirable to provide efficient and effective techniques for text-to-speech conversion with emotional content.
In an exemplary embodiment, script 240a is first converted at block 310 into a corresponding phoneme sequence 310a.
At block 320, contextual features are further extracted from script 240a to modify phoneme sequence 310a and generate linguistic-contextual feature sequence 320a as (p_1, . . . , p_t, . . . , p_T), wherein p_t represents the t-th feature in the sequence, for t=1 to T. For example, adjustments to phoneme sequence 310a may be made at block 320 to account for speech variations due to phonetic and linguistic contextual features of the script, thereby generating linguistic-contextual feature sequence 320a. Note the sequence 320a may be based on both the identity of each phoneme as well as other contextual information, such as the part of speech of the word each phoneme belongs to, the number of syllables of the previous word, etc. Accordingly, each element of the sequence 320a may generally be referred to herein as a “linguistic-contextual” phoneme.
Sequence 320a is provided to block 330, wherein the acoustic trajectory 330a of sequence 320a is predicted. In particular, the acoustic trajectory 330a specifies a set of acoustic parameters for sequence 320a including duration (Dur), fundamental frequency or pitch (F0), and spectrum (Spectrum, or spectral coefficients). In an exemplary embodiment, Dur(p_t) may be specified for each feature in sequence 320a, while F0(f) and Spectrum(f) may be specified for each frame f of the F_t frames of feature p_t. In an exemplary embodiment, a duration model predicts how many frames each state of a phoneme may last. Sequences of acoustic parameters in acoustic trajectory 330a are subsequently provided to vocoder 350, which may synthesize a speech waveform corresponding to speech output 126.
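For illustration only, the following Python sketch shows one possible in-memory layout for an acoustic trajectory such as 330a, with one duration per linguistic-contextual phoneme and per-frame F0 and spectrum values; the class and function names, frame counts, and placeholder values are hypothetical and are not part of the disclosure.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class AcousticTrajectory:
    """Per-phoneme durations plus per-frame F0/spectrum, cf. block 330."""
    durations: List[int]   # Dur(p_t): number of frames for each phoneme p_t
    f0: np.ndarray         # F0(f) for every frame f, shape (total_frames,)
    spectrum: np.ndarray   # Spectrum(f), shape (total_frames, n_coeffs)

def predict_neutral_trajectory(phonemes: List[str], n_coeffs: int = 24) -> AcousticTrajectory:
    """Toy stand-in for neutral trajectory prediction.

    A real system would derive these values from a trained neutral voice
    model such as 332; here placeholder values are fabricated so that the
    data layout is concrete.
    """
    durations = [5 for _ in phonemes]              # e.g., 5 frames per phoneme
    total_frames = sum(durations)
    f0 = np.full(total_frames, 120.0)              # flat 120 Hz pitch contour
    spectrum = np.zeros((total_frames, n_coeffs))  # placeholder spectral coefficients
    return AcousticTrajectory(durations, f0, spectrum)

traj = predict_neutral_trajectory(["sil", "h", "eh", "l", "ow", "sil"])
print(len(traj.durations), traj.f0.shape, traj.spectrum.shape)
```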
Acoustic trajectory prediction at block 330 may utilize a neutral voice model 332, which may be implemented according to any of various speech synthesis techniques.
One such technique includes Hidden Markov Model (HMM)-based speech synthesis, in which speech output is modeled as a plurality of states characterized by statistical parameters such as initial state probabilities, state transition probabilities, and state output probabilities. The statistical parameters of an HMM-based implementation of neutral voice model 332 may be derived from training the HMM to model speech samples found in one or more speech databases having known speech content. The statistical parameters may be stored in a memory (not shown in
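As an informal illustration of the kind of statistical parameters such an HMM-based neutral voice model might hold per phoneme and per state, consider the following Python sketch; the data layout and all values shown are hypothetical assumptions, not a definitive implementation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class StateParams:
    """Gaussian output parameters for one HMM state of the neutral model."""
    mean: np.ndarray        # mean vector of the state output distribution
    covariance: np.ndarray  # covariance matrix of the state output distribution

@dataclass
class PhonemeHMM:
    """A (typically left-to-right) phoneme HMM with its statistical parameters."""
    phoneme: str
    initial_probs: np.ndarray     # initial state probabilities
    transition_probs: np.ndarray  # state transition probability matrix
    states: list                  # one StateParams per emitting state
    mean_duration: float          # neutral duration for the phoneme, in frames

# Hypothetical three-state, two-dimensional example for illustration only.
hmm = PhonemeHMM(
    phoneme="eh",
    initial_probs=np.array([1.0, 0.0, 0.0]),
    transition_probs=np.array([[0.6, 0.4, 0.0],
                               [0.0, 0.6, 0.4],
                               [0.0, 0.0, 1.0]]),
    states=[StateParams(np.zeros(2), np.eye(2)) for _ in range(3)],
    mean_duration=6.0,
)
print(hmm.phoneme, len(hmm.states))
```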
In an exemplary embodiment, emotion-specific model 334 generates emotion-specific adjustments 334a that are applied to parameters obtained from neutral voice model 332 to adapt the synthesized speech to have characteristics of given emotion type 230a. In particular, emotion-specific adjustments 334a may be derived from training models based on speech samples having pre-specified emotion type found in one or more speech databases having known speech content and emotion type. In an exemplary embodiment, emotion-specific adjustments 334a are provided as adjustments to the output parameters 332a of neutral voice model 332, rather than as emotion-specific statistical or acoustic parameters independently sufficient to produce an acoustic trajectory for each emotion type. As such adjustments will generally require less memory to store than independently sufficient emotion-specific parameters, memory resources can be conserved when generating speech with pre-specified emotion type according to the present disclosure. In an exemplary embodiment, emotion-specific adjustments 334a can be trained and stored separately for each emotion type designated by the system.
In an exemplary embodiment, emotion-specific adjustments 334a can be stored and applied to neutral voice model 332 on, e.g., a per-phoneme, per-state, or per-frame basis. For example, in an exemplary embodiment, for a phoneme HMM having three states, three emotion-specific adjustments 334a can be stored and applied for each phoneme on a per-state basis. Alternatively, if each state of the three-state phoneme corresponds to two frames, e.g., each frame having a duration of 10 milliseconds, then six emotion-specific adjustments 334a can be stored and applied for each phoneme on a per-frame basis. Note an acoustic or model parameter may generally be adjusted distinctly for each individual phoneme based on the emotion type, depending on the emotion-specific adjustments 334a specified by emotion-specific model 334.
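The per-state versus per-frame bookkeeping described above may be illustrated with the following minimal Python sketch, which simply counts the adjustments stored for a single three-state phoneme under each scheme; the structure and zero-valued placeholders are hypothetical.

```python
# Illustrative only: counting the emotion-specific adjustments stored for one phoneme
# under the per-state and per-frame schemes described above (hypothetical values).
n_states = 3          # three-state phoneme HMM
frames_per_state = 2  # e.g., two 10-ms frames per state

per_state_adjustments = {f"state_{s}": {"f0_adj": 0.0, "spectrum_adj": 0.0}
                         for s in range(n_states)}                      # 3 per phoneme
per_frame_adjustments = {f"frame_{f}": {"f0_adj": 0.0, "spectrum_adj": 0.0}
                         for f in range(n_states * frames_per_state)}   # 6 per phoneme

print(len(per_state_adjustments), len(per_frame_adjustments))  # prints: 3 6
```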
Emotion-specific model 334.1 generates duration adjustment parameters Dur_adj_e(p_1), . . . , Dur_adj_e(p_T), or 334.1a, specific to the emotion type 230a and sequence 320a. Duration adjustments block 410 applies the duration adjustment parameters 334.1a to neutral durations 405a to generate the adjusted duration sequence Dur(p_1), . . . , Dur(p_T), or 410a.
Based on adjusted duration sequence 410a, the neutral trajectories 420a for F0 and Spectrum are predicted at block 420. In particular, neutral acoustic trajectory 420a includes predictions for acoustic parameters F0_n(f) and Spectrum_n(f) based on the F0 and spectrum parameters 332.1b of neutral voice model 332.1, as well as the adjusted duration parameters Dur(p_1), . . . , Dur(p_T) derived earlier as 410a.
At block 430, emotion-specific F0 and spectrum adjustments 334.1b are applied to the corresponding neutral F0 and spectrum parameters of 420a. In particular, F0 and spectrum adjustments F0_adj_e(1), . . . , F0_adj_e(F_T), Spectrum_adj_e(1), . . . , Spectrum_adj_e(F_T), or 334.1b, are generated by emotion-specific model 334.1 based on sequence 320a and emotion type 230a. The output 330.1a of block 430 includes the emotion-specific adjusted Duration, F0, and Spectrum parameters.
In an exemplary embodiment, the adjustments applied at blocks 410 and 430 may correspond to the following:
Dur(p_t) = Dur_n(p_t) + Dur_adj_e(p_t); (Equation 1)
F0(f) = F0_n(f) + F0_adj_e(f); (Equation 2) and
Spectrum(f) = Spectrum_n(f) + Spectrum_adj_e(f); (Equation 3)
wherein, e.g., Equation 1 may be applied by block 410, and Equations 2 and 3 may be applied by block 430. The resulting acoustic parameters 330.1a, including Dur(p_t), F0(f), and Spectrum(f), may be provided to a vocoder for speech synthesis.
It is noted that in the exemplary embodiment described by Equations 1-3, the emotion-specific adjustments are applied as additive adjustment factors to be combined with the neutral acoustic parameters during speech synthesis. It will be appreciated that in alternative exemplary embodiments, emotion-specific adjustments may readily be stored and/or applied in alternative manners, e.g., multiplicatively, using affine transformation, non-linearly, etc. Such alternative exemplary embodiments are contemplated to be within the scope of the present disclosure.
It is further noted that while duration adjustments are shown as being applied on a per-phoneme basis in Equation 1, and F0 and Spectrum adjustments are shown as being applied on a per-frame basis in Equations 2 and 3, it will be appreciated that alternative exemplary embodiments can adjust any acoustic parameter on a per-state, per-phoneme, or per-frame basis. Such alternative exemplary embodiments are contemplated to be within the scope of the present disclosure.
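A minimal sketch of the additive adjustment of Equations 1-3 is given below in Python; the function name and toy dimensions are hypothetical, and in the flow of blocks 410-430 the neutral F0 and spectrum trajectories would be predicted only after the durations have been adjusted.

```python
import numpy as np

def apply_additive_adjustments(dur_n, f0_n, spec_n, dur_adj_e, f0_adj_e, spec_adj_e):
    """Apply Equations 1-3: add emotion-specific offsets to the neutral trajectory.

    dur_n, dur_adj_e   : per-phoneme durations/offsets, shape (T,)
    f0_n, f0_adj_e     : per-frame F0 values/offsets, shape (F,)
    spec_n, spec_adj_e : per-frame spectra/offsets, shape (F, n_coeffs)
    """
    dur = np.asarray(dur_n) + np.asarray(dur_adj_e)         # Equation 1 (block 410)
    f0 = np.asarray(f0_n) + np.asarray(f0_adj_e)            # Equation 2 (block 430)
    spectrum = np.asarray(spec_n) + np.asarray(spec_adj_e)  # Equation 3 (block 430)
    return dur, f0, spectrum

# Toy example: two phonemes whose adjusted durations total six frames,
# with four spectral coefficients per frame.
dur, f0, spec = apply_additive_adjustments(
    dur_n=[3, 2], f0_n=np.full(6, 120.0), spec_n=np.zeros((6, 4)),
    dur_adj_e=[1, 0], f0_adj_e=np.full(6, 15.0), spec_adj_e=0.1 * np.ones((6, 4)))
print(dur, f0[0], spec[0])
```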
Sequence 320a is further specified as input to emotion-specific model 334.2, along with emotion type 230a. The output 334.2a of emotion-specific model 334.2 specifies emotion-specific model adjustment factors. In an exemplary embodiment, the adjustment factors 334.2a include model adjustment factors α_e(p_1, s_1), . . . , α_e(p_T, s_M), β_e(p_1, s_1), . . . , β_e(p_T, s_M), γ_e(p_1, s_1), . . . , γ_e(p_T, s_M) specified on a per-state basis, as well as emotion-specific duration adjustment factors a_e(p_1), . . . , a_e(p_T), b_e(p_1), . . . , b_e(p_T), on a per-phoneme basis.
Block 520 applies the emotion-specific model adjustment factors 334.2a specified by block 334.2 to corresponding parameters of the neutral HMM λ_n to generate an output 520a. In an exemplary embodiment, the adjustments may be applied as follows:
μ(p_t, s_m) = α_e(p_t, s_m) μ_n(p_t, s_m) + β_e(p_t, s_m); (Equation 4)
Σ(p_t, s_m) = γ_e(p_t, s_m) Σ_n(p_t, s_m); (Equation 5) and
Dur(p_t) = a_e(p_t) Dur_n(p_t) + b_e(p_t); (Equation 6)
wherein μ(p_t, s_m), μ_n(p_t, s_m), and β_e(p_t, s_m) are vectors, α_e(p_t, s_m) is a matrix, and α_e(p_t, s_m) μ_n(p_t, s_m) represents left-multiplication of μ_n(p_t, s_m) by α_e(p_t, s_m), while Σ(p_t, s_m), γ_e(p_t, s_m), and Σ_n(p_t, s_m) are all matrices, and γ_e(p_t, s_m) Σ_n(p_t, s_m) represents left-multiplication of Σ_n(p_t, s_m) by γ_e(p_t, s_m). It will be appreciated that the adjustments of Equations 4 and 6 effectively apply affine transformations (i.e., a linear transformation along with addition of a constant) to the neutral mean vector μ_n(p_t, s_m) and neutral duration Dur_n(p_t) to generate new model parameters μ(p_t, s_m) and Dur(p_t). In this Specification and in the claims, μ(p_t, s_m), Σ(p_t, s_m), and Dur(p_t) are generally denoted the “transformed” model parameters. Note alternative exemplary embodiments need not apply affine transformations to generate the transformed model parameters, and other transformations such as non-linear transformations may also be employed. Such alternative exemplary embodiments are contemplated to be within the scope of the present disclosure.
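The following Python sketch illustrates the affine transformation of Equations 4-6 for a single state; the function name and the example adjustment factors are hypothetical and serve only to make the matrix and vector shapes concrete.

```python
import numpy as np

def transform_state(mu_n, sigma_n, dur_n, alpha_e, beta_e, gamma_e, a_e, b_e):
    """Apply Equations 4-6 to one state's neutral model parameters.

    mu_n     : neutral mean vector, shape (d,)
    sigma_n  : neutral covariance matrix, shape (d, d)
    dur_n    : neutral duration for the phoneme (scalar)
    alpha_e, gamma_e : (d, d) emotion-specific matrices
    beta_e   : (d,) emotion-specific offset vector
    a_e, b_e : scalar duration adjustment factors
    """
    mu = alpha_e @ mu_n + beta_e   # Equation 4: affine transform of the mean
    sigma = gamma_e @ sigma_n      # Equation 5: left-multiplication of the covariance
    dur = a_e * dur_n + b_e        # Equation 6: affine transform of the duration
    return mu, sigma, dur

# Toy two-dimensional example with hypothetical adjustment factors.
d = 2
mu, sigma, dur = transform_state(
    mu_n=np.array([1.0, 2.0]), sigma_n=np.eye(d), dur_n=6.0,
    alpha_e=1.1 * np.eye(d), beta_e=np.array([0.2, -0.1]),
    gamma_e=0.9 * np.eye(d), a_e=1.3, b_e=1.0)
print(mu, dur)  # transformed mean vector and duration
```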
Based on the transformed model parameters, the acoustic trajectory (e.g., F0 and spectrum) may subsequently be predicted at block 530, and predicted acoustic trajectory 330.2a is output to the vocoder to generate the speech waveform. Based on choice of the emotion-specific adjustment factors, it will be appreciated that acoustic parameters 330.2a are effectively adapted to generate speech having emotion-specific characteristics.
In an exemplary embodiment, clustering techniques may be used to reduce the memory resources required to store emotion-specific state model or acoustic parameters, as well as enable estimation of model parameters for states wherein training data is unavailable or sparse. In an exemplary embodiment employing decision tree clustering, a decision tree may be independently built for each emotion type to cluster emotion-specific adjustments. It will be appreciated that providing independent emotion-specific decision trees in this manner may more accurately model the specific prosody characteristics associated with a target emotion type, as the questions used to cluster emotion-specific states may be specifically chosen and optimized for each emotion type. In an exemplary embodiment, the structure of an emotion-specific decision tree may be different from the structure of a decision tree used to store neutral model or acoustic parameters.
In an exemplary embodiment, a neutral decision tree 610 categorizes a state s into one of a plurality of neutral leaf nodes, based on a plurality of neutral questions applied to state s and its context, and corresponding neutral model or acoustic parameters may be associated with each neutral leaf node.
On the other hand, emotion-specific decision tree 620 categorizes state s into one of a plurality of emotion-specific leaf nodes E1, E2, E3, etc., based on a plurality of emotion-specific questions q1_e, q2_e, etc., applied to state s and its context. Associated with each leaf node of emotion-specific decision tree 620 may be corresponding emotion-specific adjustment factors, e.g., α_e(p, s), β_e(p, s), γ_e(p, s), and/or other factors to be applied as emotion-specific adjustments, e.g., as specified in Equations 1-6. Note the structure of the emotion-specific leaf nodes and the choice of emotion-specific questions for emotion-specific decision tree 620 may generally be entirely different from the structure of the neutral leaf nodes and choice of neutral questions for neutral decision tree 610, i.e., the neutral and emotion-specific decision trees may be “distinct.” The difference in structure of the decision trees allows, e.g., each emotion-specific decision tree to be optimally constructed for a given emotion type to more accurately capture the emotion-specific adjustment factors.
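By way of illustration, an emotion-specific decision tree and its traversal might be sketched as follows in Python; the node structure, the questions, and the adjustment factors shown are invented for the example and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

@dataclass
class TreeNode:
    """One node of an emotion-specific decision tree (cf. tree 620).

    Leaf nodes carry emotion-specific adjustment factors; interior nodes
    carry a yes/no question about the state and its linguistic context.
    """
    question: Optional[Callable[[Dict], bool]] = None  # None for leaf nodes
    yes: Optional["TreeNode"] = None
    no: Optional["TreeNode"] = None
    adjustments: Optional[Dict[str, float]] = None     # e.g., {"a_e": ..., "b_e": ...}

def lookup(node: TreeNode, context: Dict) -> Dict[str, float]:
    """Descend the tree by answering questions until a leaf is reached."""
    while node.question is not None:
        node = node.yes if node.question(context) else node.no
    return node.adjustments

# Hypothetical two-question tree for one emotion type; purely illustrative.
happy_tree = TreeNode(
    question=lambda c: c["phone_class"] == "vowel",
    yes=TreeNode(adjustments={"a_e": 1.2, "b_e": 0.5}),
    no=TreeNode(
        question=lambda c: c["state_index"] == 0,
        yes=TreeNode(adjustments={"a_e": 1.0, "b_e": 1.0}),
        no=TreeNode(adjustments={"a_e": 0.9, "b_e": 0.0}),
    ),
)
print(lookup(happy_tree, {"phone_class": "vowel", "state_index": 1}))
```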
In an exemplary embodiment, each transform decision tree may be constructed based on various criteria for selecting questions, e.g., a series of questions may be chosen to maximize a model auxiliary function such as the weighted sum of log-likelihood functions for the leaf nodes, wherein the weights applied may be based on state occupation probabilities of the corresponding states. Per iterative algorithms known for constructing decision trees, the selection of questions may proceed and terminate based on a metric such as minimum description length (MDL) or other cross-validation methods.
In an exemplary embodiment, a given emotion type 230a is used to select a corresponding one of a plurality of emotion-specific decision trees 730.1 through 730.N, and neutral model parameters 710a are separately obtained for the states to be synthesized.
The output of the selected one of the emotion-specific decision trees 730.1 through 730.N is provided as 730a, which includes emotion-specific adjustment factors for the given emotion type 230a.
Adjustment block 740 applies the adjustment factors 730a to the neutral model parameters 710a, e.g., as earlier described hereinabove with reference to Equations 4 and 5, to generate the transformed model or acoustic parameters.
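A condensed sketch of this run-time flow is shown below, with simple dictionary lookups standing in for neutral parameters 710a and for the selected emotion-specific decision tree; all parameter values and keys are hypothetical.

```python
import numpy as np

# Hypothetical lookups keyed by (phoneme, state index).
neutral_params = {("eh", 1): {"mu": np.array([1.0, 2.0]), "dur": 6.0}}   # cf. 710a
emotion_trees = {
    "happy": {("eh", 1): {"alpha": 1.1 * np.eye(2), "beta": np.array([0.3, 0.0]),
                          "a": 1.2, "b": 0.5}},
    "sad":   {("eh", 1): {"alpha": 0.9 * np.eye(2), "beta": np.array([-0.2, 0.0]),
                          "a": 1.4, "b": 2.0}},
}

def transformed_parameters(phoneme, state, emotion_type):
    """Select the adjustments for the given emotion type and apply them (cf. block 740)."""
    n = neutral_params[(phoneme, state)]
    adj = emotion_trees[emotion_type][(phoneme, state)]   # adjustment factors, cf. 730a
    mu = adj["alpha"] @ n["mu"] + adj["beta"]             # Equation 4
    dur = adj["a"] * n["dur"] + adj["b"]                  # Equation 6
    return mu, dur

print(transformed_parameters("eh", 1, "happy"))
```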
Training audio 802 corresponding to training script 801 is further provided to block 830. Training audio 802 corresponds to a rendition of the text in training script 801 with a pre-specified emotion type 802a. Training audio 802 may be generated, e.g., by pre-recording a human speaker instructed to read the training script 801 with the given emotion type 802a. From training audio 802, acoustic features 830a are extracted at block 830. Examples of acoustic features 830a may include, e.g., duration, F0, spectral coefficients, etc.
The extracted acoustic features 830a are provided (e.g., as observation vectors) to block 840, which generates a set of parameters for a speech model, also denoted herein as the “initial emotion model,” corresponding to training audio 802 with pre-specified emotion type 802a. Note block 840 performs analysis on the extracted acoustic features 830a to derive the initial emotion model parameters, since block 840 may not directly be provided with the training script 801 corresponding to training audio 802. It will be appreciated that deriving an optimal set of model parameters, e.g., HMM output probabilities and state transition probabilities, etc., for training audio 802 may be performed using, e.g., an iterative procedure such as the expectation-maximization (EM) algorithm (Baum-Welch algorithm) or a maximum likelihood (ML) algorithm. To aid in convergence, the parameter set used to initialize the iterative algorithm at block 840 may be derived from neutral model parameters 820a.
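As a rough illustration of this training step, the following Python sketch replaces full Baum-Welch re-estimation with a hard-assignment (segmental, k-means-style) approximation, initializing the state means from assumed neutral model parameters as suggested above; it is a simplification for clarity, not the algorithm of block 840.

```python
import numpy as np

def reestimate_emotion_model(observations, neutral_means, n_iter=10):
    """Very simplified stand-in for emotion-model training.

    Frames are hard-assigned to the nearest state mean instead of using full
    Baum-Welch, with the neutral means used to initialize the iteration.
    observations  : (F, d) array of acoustic feature vectors (cf. 830a)
    neutral_means : (S, d) array of neutral state means (cf. 820a)
    """
    means = neutral_means.copy()
    for _ in range(n_iter):
        # E-like step: assign each frame to its closest state mean.
        dists = np.linalg.norm(observations[:, None, :] - means[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # M-like step: re-estimate each state's mean from its assigned frames.
        for s in range(means.shape[0]):
            frames = observations[assign == s]
            if len(frames):
                means[s] = frames.mean(axis=0)
    # Hard occupancy counts per state, loosely analogous to the occupation
    # statistics described below.
    occupancy = np.bincount(assign, minlength=means.shape[0])
    return means, occupancy

rng = np.random.default_rng(0)
obs = rng.normal(size=(200, 2)) + np.array([2.0, 0.0])   # toy "emotional" features
emo_means, occ = reestimate_emotion_model(
    obs, neutral_means=np.array([[-1.0, 0.0], [0.0, 0.0], [1.0, 0.0]]))
print(emo_means.shape, occ)
```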
Block 840 generates emotion-specific model parameters λ_{μ,Σ}(p, s) 840a, along with state occupation probabilities 840b for each state s, e.g.:
Occupation statistic for state s = Occ[s] = P(O, s | λ_{μ,Σ}(p, s)); (Equation 7)
wherein O represents the total set of observation vectors. In an exemplary embodiment, occupation statistics 840b may aid in the generation of a decision tree for the emotion-specific model parameters, as previously described hereinabove.
At block 850, a decision tree is constructed for context clustering of the emotion-specific adjustments. It will be appreciated that in view of the present disclosure, the decision tree may be constructed using any suitable techniques for clustering the emotion-specific adjustments. In an exemplary embodiment, the decision tree may be constructed directly using the emotion-specific model parameters λ_{μ,Σ}(p, s) 840a. In an alternative exemplary embodiment, the decision tree may be constructed using a version of the transformed model, e.g., by applying Equations 4-6 hereinabove to the parameters of neutral model λ^n_{μ,Σ}(p, s) 820a to generate transformed model parameters. In such an exemplary embodiment, the corresponding adjustment factors (e.g., α_e(p_t, s_m), β_e(p_t, s_m), and γ_e(p_t, s_m), as well as the duration adjustment factors) to be applied for the transformation may be estimated by applying linear regression techniques to obtain a best linear fit of the transformed parameters of neutral model λ^n_{μ,Σ}(p, s) 820a to the emotion-specific model λ_{μ,Σ}(p, s) 840a, as necessary.
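A sketch of such a linear-regression estimate is given below in Python, fitting the α_e and β_e factors of Equation 4 by ordinary least squares over the states of one hypothetical cluster; the function name and data are illustrative assumptions.

```python
import numpy as np

def fit_affine_adjustment(neutral_means, emotion_means):
    """Least-squares estimate of the affine factors of Equation 4 for one cluster.

    Given neutral mean vectors and the corresponding emotion-specific means for
    the states grouped into one leaf node, solve for alpha_e and beta_e such that
    emotion_mean ~= alpha_e @ neutral_mean + beta_e (a simple regression fit;
    a sketch, not the only possible estimator).
    neutral_means, emotion_means : (N, d) arrays, one row per state in the cluster
    """
    n, d = neutral_means.shape
    design = np.hstack([neutral_means, np.ones((n, 1))])   # append a bias column
    coeffs, *_ = np.linalg.lstsq(design, emotion_means, rcond=None)
    alpha_e = coeffs[:d].T   # (d, d) matrix applied to the neutral mean
    beta_e = coeffs[d]       # (d,) offset vector
    return alpha_e, beta_e

# Toy check: recover a known affine map from slightly noisy data.
rng = np.random.default_rng(1)
true_alpha, true_beta = np.array([[1.2, 0.0], [0.1, 0.9]]), np.array([0.5, -0.3])
mu_n = rng.normal(size=(50, 2))
mu_e = mu_n @ true_alpha.T + true_beta + 0.01 * rng.normal(size=(50, 2))
alpha_e, beta_e = fit_affine_adjustment(mu_n, mu_e)
print(np.round(alpha_e, 2), np.round(beta_e, 2))
```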
It will be appreciated that construction of the decision tree (based on, e.g., the emotion-specific model or the transformed model) may proceed by, e.g., selecting appropriate questions to maximize the weighted sum of the log-likelihoods of the leaf nodes of the tree. In an exemplary embodiment, the weights applied in the weighted sum may include the occupation statistics Occ[s] 840b. The addition of branches and leaf nodes may proceed until terminated based on a metric, e.g., a minimum description length (MDL) criterion or other cross-validation techniques.
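For illustration, the following Python sketch grows a toy clustering tree greedily, scoring candidate questions by the occupation-weighted log-likelihood of the resulting leaf nodes and stopping when the best gain falls below a fixed MDL-like penalty; the questions, feature values, and penalty are hypothetical simplifications of the procedure described above.

```python
import math

def weighted_loglik(items):
    """Occupation-weighted log-likelihood of one leaf under a single 1-D Gaussian."""
    w = sum(occ for occ, _ in items)
    if w <= 0:
        return 0.0
    mean = sum(occ * x for occ, x in items) / w
    var = sum(occ * (x - mean) ** 2 for occ, x in items) / w
    var = max(var, 1e-6)  # floor to keep the log finite
    return -0.5 * w * (math.log(2 * math.pi * var) + 1.0)

def grow_tree(items, questions, penalty):
    """Greedy construction of a toy clustering tree.

    items     : list of (occupancy, value, context) triples, one per state
    questions : dict mapping a question name to a predicate over the context
    penalty   : MDL-like cost charged for each split (terminates the growth)
    """
    base = weighted_loglik([(occ, x) for occ, x, _ in items])
    best = None
    for name, q in questions.items():
        yes = [(occ, x) for occ, x, c in items if q(c)]
        no = [(occ, x) for occ, x, c in items if not q(c)]
        if not yes or not no:
            continue
        gain = weighted_loglik(yes) + weighted_loglik(no) - base
        if best is None or gain > best[1]:
            best = (name, gain, q)
    if best is None or best[1] < penalty:  # no split is worth its cost: make a leaf
        return {"leaf": [c for _, _, c in items]}
    name, _, q = best
    return {"question": name,
            "yes": grow_tree([it for it in items if q(it[2])], questions, penalty),
            "no": grow_tree([it for it in items if not q(it[2])], questions, penalty)}

# Hypothetical (occupancy, per-state F0 offset, context) data for a few states.
states = [(30, 12.0, {"vowel": True,  "state": 0}), (25, 14.0, {"vowel": True,  "state": 1}),
          (20, -3.0, {"vowel": False, "state": 0}), (15, -5.0, {"vowel": False, "state": 1})]
qs = {"is_vowel?": lambda c: c["vowel"], "is_first_state?": lambda c: c["state"] == 0}
print(grow_tree(states, qs, penalty=5.0))
```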
At block 870, the structure of the constructed decision tree and the adjustment factors for each leaf node are stored in memory, e.g., for later use as emotion-specific model 334.3. Storage of this information in memory at block 870 completes the training phase. During speech synthesis, e.g., per the exemplary embodiments described hereinabove, the stored decision tree and adjustment factors may then be retrieved and applied to the neutral model parameters to generate the transformed representation.
In an exemplary method according to the present disclosure, an emotionally neutral representation of a script is first generated, the emotionally neutral representation including at least one parameter associated with a plurality of phonemes.
At block 920, the at least one parameter is adjusted distinctly for each of the plurality of phonemes based on an emotion type to generate a transformed representation.
Computing system 1000 includes a processor 1010 and a memory 1020. Computing system 1000 may optionally include a display subsystem, communication subsystem, sensor subsystem, camera subsystem, and/or other components not shown.
Processor 1010 may include one or more physical devices configured to execute one or more instructions. For example, the processor may be configured to execute one or more instructions that are part of one or more applications, services, programs, routines, libraries, objects, components, data structures, or other logical constructs. Such instructions may be implemented to perform a task, implement a data type, transform the state of one or more devices, or otherwise arrive at a desired result.
Processor 1010 may include one or more processors configured to execute software instructions. Additionally or alternatively, processor 1010 may include one or more hardware or firmware logic machines configured to execute hardware or firmware instructions. The individual processors may be single core or multicore, and the programs executed thereon may be configured for parallel or distributed processing. Processor 1010 may optionally include individual components that are distributed throughout two or more devices, which may be remotely located and/or configured for coordinated processing. One or more aspects of processor 1010 may be virtualized and executed by remotely accessible networked computing devices configured in a cloud computing configuration.
Memory 1020 may include one or more physical devices configured to hold data and/or instructions executable by the processor to implement the methods and processes described herein. When such methods and processes are implemented, the state of memory 1020 may be transformed (e.g., to hold different data).
Memory 1020 may include removable media and/or built-in devices. Memory 1020 may include optical memory devices (e.g., CD, DVD, HD-DVD, Blu-Ray Disc, etc.), semiconductor memory devices (e.g., RAM, EPROM, EEPROM, etc.) and/or magnetic memory devices (e.g., hard disk drive, floppy disk drive, tape drive, MRAM, etc.), among others. Memory 1020 may include devices with one or more of the following characteristics: volatile, nonvolatile, dynamic, static, read/write, read-only, random access, sequential access, location addressable, file addressable, and content addressable. In some embodiments, processor 1010 and memory 1020 may be integrated into one or more common devices, such as an application specific integrated circuit or a system on a chip.
Memory 1020 may also take the form of removable computer-readable storage media, which may be used to store and/or transfer data and/or instructions executable to implement the herein described methods and processes. Removable computer-readable storage media 1030 may take the form of CDs, DVDs, HD-DVDs, Blu-Ray Discs, EEPROMs, and/or floppy disks, among others.
It is to be appreciated that memory 1020 includes one or more physical devices that store information. The terms “module,” “program,” and “engine” may be used to describe an aspect of computing system 1000 that is implemented to perform one or more particular functions. In some cases, such a module, program, or engine may be instantiated via processor 1010 executing instructions held by memory 1020. It is to be understood that different modules, programs, and/or engines may be instantiated from the same application, service, code block, object, library, routine, API, function, etc. Likewise, the same module, program, and/or engine may be instantiated by different applications, services, code blocks, objects, routines, APIs, functions, etc. The terms “module,” “program,” and “engine” are meant to encompass individual or groups of executable files, data files, libraries, drivers, scripts, database records, etc.
In an aspect, computing system 1000 may correspond to a computing device including a memory 1020 holding instructions executable by a processor 1010 to generate an emotionally neutral representation of a script, the emotionally neutral representation including at least one parameter associated with a plurality of phonemes. The memory 1020 may further hold instructions executable by processor 1010 to adjust the at least one parameter distinctly for each of the plurality of phonemes based on an emotion type to generate a transformed representation. Note such a computing device will be understood to correspond to a process, machine, manufacture, or composition of matter.
An adjustment block 1120 is configured to adjust the at least one parameter in the emotionally neutral representation 1110a distinctly for each of the plurality of frames, based on an emotion type 1120b. The output of adjustment block 1120 corresponds to the transformed representation 1120a. In an exemplary embodiment, adjustment block 1120 may apply, e.g., a linear or affine transformation to the at least one parameter as described hereinabove with reference to, e.g., blocks 440 or 520, etc. The transformed representation may correspond to, e.g., transformed model parameters such as described hereinabove with reference to Equations 4-6, or transformed acoustic parameters such as described hereinabove with reference to Equations 1-3. Transformed representation 1120a may be further provided to a block (e.g., block 530 in
In an exemplary embodiment, the adjustment block 1120 may be configured to retrieve an adjustment factor corresponding to the state of the HMM from an emotion-specific decision tree.
In this specification and in the claims, it will be understood that when an element is referred to as being “connected to” or “coupled to” another element, it can be directly connected or coupled to the other element or intervening elements may be present. In contrast, when an element is referred to as being “directly connected to” or “directly coupled to” another element, there are no intervening elements present. Furthermore, when an element is referred to as being “electrically coupled” to another element, it denotes that a path of low resistance is present between such elements, while when an element is referred to as being simply “coupled” to another element, there may or may not be a path of low resistance between such elements.
The functionality described herein can be performed, at least in part, by one or more hardware and/or software logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc.
While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.