An audio-to-video engine is a software program that generates a video of facial movements (e.g., a virtual talking head) from inputted speech audio. An audio-to-video engine may be useful in multimedia communication applications, such as video conferencing, because it can generate video in environments where direct video capturing is either not available or places an undesirable burden on the communication network. The audio-to-video engine may also be useful for increasing the intelligibility of speech.
In prior implementations, audio-to-video methods generally apply maximum likelihood estimation (MLE)-based conversion processes to a Gaussian Mixture Model (GMM) to estimate video feature vectors given a set of audio feature vectors. However, the MLE-based conversion processes typically include conversion errors since an audiovisual GMM with maximum likelihood on the training data does not necessarily result in converted visual trajectories that have minimized error in human perception.
Described herein are techniques and systems for providing an audio-to-video engine that utilizes a Minimum Converted Trajectory Error (MCTE)-based process to refine a Gaussian Mixture Model (GMM). The refined GMM may then be used to convert input speech into realistic output video. Unlike previous methods which apply a maximum likelihood estimation (MLE)-based conversion process directly to the GMM to generate the video output, the MCTE-based process focuses on minimizing conversion errors of the MLE-based conversion process.
The MCTE-based process may refine the GMM in two steps. First, the MCTE-based process may weigh the audio data and the video data of the GMM separately using a log likelihood function. The MCTE-based process may then apply a generalized probabilistic descent (GPD) algorithm to refine the visual parameters of the GMM.
The audio-to-video engine may use the refined GMM to convert input speech into realistic output video. First, the audio-to-video engine may recognize the input speech as a source feature vector. The audio-to-video engine may then determine a Maximum A Posterior (MAP) mixture sequence based on the source feature vector and the refined GMM. Finally, the audio-to-video engine may estimate the video feature parameters using the MAP mixture sequence. The video feature parameters may be stored or may be output as a video of facial movements (e.g., a virtual talking head). Other embodiments will become more apparent from the following detailed description when taken in conjunction with the accompanying drawings.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
The detailed description is described with reference to the accompanying Figures. In the Figures, the left-most digit(s) of a reference number identifies the Figure in which the reference number first appears. The use of the same reference number in different Figures indicates similar or identical items.
The embodiments described herein pertain to a Minimum Converted Trajectory Error (MCTE)-based audio-to-video engine that focuses on minimizing conversion errors of traditional MLE-based conversion processes. Accordingly, the audio-to-video engine may provide better user experience in comparison to other audio-to-video engines.
The processes and systems described herein may be implemented in a number of ways. Example implementations are provided below with reference to the following figures.
Illustrative Scheme
The audio-to-video engine 102 may be implemented on a computing device 104. The computing device 104 may be a computing device that includes one or more processors that provide processing capabilities and memory that provides data storage and retrieval capabilities. In various embodiments, the computing device 104 may be a general purpose computer, such as a desktop computer, a laptop computer, a server, or the like. However, in other embodiments, the computing device 104 may be a mobile phone, set-top box, game console, personal digital assistant (PDA), portable media player (e.g., portable video player and digital audio player), netbook, tablet PC, or other type of computing device. Further, the computing device 104 may have network capabilities. For example, the computing device 104 may exchange data with other computing devices (e.g., laptop computers, servers, etc.) via one or more networks, such as the Internet.
The audio-to-video engine 102 may convert an input speech 106 into facial movement 108. In various embodiments, the input speech 106 is inputted into the audio-to-video engine as digital data (e.g., audio data). The audio-to-video engine 102 may recognize the input speech 106 as a source feature vector where each time slice includes static and dynamic feature parameters which are each of one or more dimensions. In some instances, the dynamic feature parameters may be represented as a linear transformation of the static feature parameters. The input speech 106 may be of any linguistic content such as a Western language (e.g., English, French, Spanish, etc.), an Asian language (e.g., Chinese, Japanese, Korean, etc.), other known languages, numerical speech, input speech of which the linguistic content is unknown, or non-linguistic speech such as laughing, coughing, sneezing, etc.
During the conversion of input speech 106 into facial movement 108, the audio-to-video engine 102 may employ a Gaussian Mixture Model (GMM) 110. The GMM may be a joint GMM that contains a training set of video feature vectors, ŷ, 216 and corresponding audio feature vectors, X, 218. Unlike previous methods which convert input speech directly to output video using a maximum likelihood estimation (MLE)-based conversion process, the audio-to-video engine 102 may employ a Minimum Converted Trajectory Error (MCTE)-based process to refine the GMM. For example, the MCTE-based process may weigh an audio space of the GMM and a video space of the GMM separately using a log likelihood function. The MCTE-based process may then apply a generalized probabilistic descent (GPD) algorithm to replace the visual parameters of the GMM with updated visual parameters to generate the refined GMM.
The audio-to-video engine 102 may use the refined GMM to convert the input speech 106 into video feature parameters. The video feature parameters may be a feature vector Y=[y1, y2, . . . yT] where each time slice may include static and dynamic feature parameters (i.e., Yt=[yt; Δyt]) which are each of one or more dimensions, Dy. The dynamic feature parameters, Δyt, of the target feature vector may be represented as a linear transformation of the static vectors.
The video feature parameters may be stored or may be processed into facial movements (e.g., a virtual talking head).
MLE-Based Conversion
The memory 204 may store components and/or modules. The components, or modules, may include routines, program instructions, objects, and/or data structures that perform particular tasks or implement particular abstract data types. The selected components include the audio-to-video engine 102, a user interface module 206 to enable input and/or output communications, an application module 208 to utilize the audio-to-video engine 102, an input/output module 210 to facilitate the input and/or output communications, and a data storage module 212 to store data to the memory 204. The user interface module 206, application module 208, and input/output module 210 are described further below.
The data storage module 212 may store a training set 214 of video feature vectors, ŷ, 216 and corresponding audio feature vectors, X, 218 (i.e., speech data) to generate and refine a model for converting the input speech 106 into the facial movements 108.
The audio-to-video engine 102 may be operable to convert the input speech 106 into facial movement 108. In various embodiments, the audio-to-video engine 102 utilizes the video feature vectors, ŷ, 216 and corresponding audio feature vectors, X, 218 of the training set 214 to generate a Gaussian Mixture Model (GMM) 220. A GMM can be regarded as a type of unsupervised learning or clustering that estimates probabilistic densities using a mixture distribution.
The audio-to-video engine 102 may utilize a maximum likelihood estimation (MLE)-based conversion process 222 to convert the audio feature vectors, X, 218 to target feature vectors, Y, 224. The target feature vectors, Y, 224 may be a time sequence, Y=[y1, y2, . . . yT], where each time slice includes static and dynamic feature parameters (i.e., Yt=[yt; Δyt]) which are each of one or more dimensions, Dy. The dynamic feature parameters may be represented as a linear transformation of the static vectors.
A Minimum Converted Trajectory Error (MCTE) process 226 may refine the GMM 220 to generate a refined GMM 228. The audio-to-video engine 102 may then use the refined GMM 228 to convert the input speech 106 to the facial movement 108.
As noted above, the audio-to-video engine 102 may utilize the MLE-based conversion process 222 to convert the audio feature vectors, X, 218 to the target feature vectors, Y, 224. The MLE-based conversion process 222 used to convert the audio feature vectors, X, 218 to the target feature vectors Y 224 may be formulated as shown in equation (1) as follows:
ŷ=argmax P(Y|X)≈argmax P(Y|X,θ) (1)
in which X is the audio feature vectors 218, and θ is the Gaussian Mixture Model (GMM) 220 derived using an expectation maximization (EM) algorithm for the probability P(Xt, Yt). In other words, P(Xt, Yt) is the joint probability density of the audio feature vectors, X, 218 and the target feature vectors, Y, 224. The audio feature vectors, X, 218 may be expressed as a time sequence vector X=[x1, x2, . . . xT] where each time slice, xt, may include static and dynamic feature parameters (i.e., Xt=[xt; Δxt]) which are each of one or more dimensions, D. In some instances, the dynamic feature parameters, Δxt, may be represented as a linear transformation of the static feature parameters.
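In some instances, the GMM, θ, 220 may have multiple mixture components. Given that the GMM, θ, 220 has M mixture components, the maximum likelihood estimation (MLE) of the target feature vector Y 224 based on the audio feature vectors, X, 218 may be determined as shown in equation (2) as follows:
$$P(Y \mid X, \theta) = \sum_{\text{all } m} P(m \mid X, \theta)\, P(Y \mid X, m, \theta) = \prod_{t=1}^{T} \sum_{m_t=1}^{M} P(m_t \mid X_t, \theta)\, P(Y_t \mid X_t, m_t, \theta) \quad (2)$$
The first product term of equation (2) may be written as shown in equation (3):
$$P(m_t \mid X_t, \theta) = \frac{\omega_{m_t}\, \mathcal{N}\left(X_t; \mu_{m_t}^{(X)}, \Sigma_{m_t}^{(XX)}\right)}{\sum_{n=1}^{M} \omega_{n}\, \mathcal{N}\left(X_t; \mu_{n}^{(X)}, \Sigma_{n}^{(XX)}\right)} \quad (3)$$
in which N(X; μ, Σ) denotes a Gaussian distribution over the vector X where μ is the mean and Σ is the covariance matrix. In addition, ω_m is a continuous weight for the individual cluster (mixture component) m according to the source feature vector.
The second product term of equation (2) may be written as shown in equations (4), (5), and (6):
$$P(Y_t \mid X_t, m_t, \theta) = \mathcal{N}\left(Y_t; E_{m_t,t}^{(Y)}, D_{m_t}^{(Y)}\right) \quad (4)$$
in which
$$E_{m_t,t}^{(Y)} = \mu_{m_t}^{(Y)} + \Sigma_{m_t}^{(YX)} \Sigma_{m_t}^{(XX)^{-1}} \left(X_t - \mu_{m_t}^{(X)}\right) \quad (5)$$
$$D_{m_t}^{(Y)} = \Sigma_{m_t}^{(YY)} - \Sigma_{m_t}^{(YX)} \Sigma_{m_t}^{(XX)^{-1}} \Sigma_{m_t}^{(XY)} \quad (6)$$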
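As a rough illustration of the two product terms, the following Python sketch computes the mixture posteriors of equation (3) for one audio frame and the conditional mean and covariance of the video frame for one mixture component. It assumes NumPy is available and that equations (5) and (6) take the standard conditional-Gaussian form; the function names and the toy one-dimensional parameter values are hypothetical and are not part of the described engine.

```python
import numpy as np


def gaussian_pdf(x, mean, cov):
    """Density of a multivariate Gaussian N(x; mean, cov)."""
    dim = x.shape[0]
    diff = x - mean
    norm = np.sqrt((2.0 * np.pi) ** dim * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)


def mixture_posteriors(x_t, weights, mu_x, sigma_xx):
    """P(m | X_t, theta) for every component m, as in equation (3)."""
    lik = np.array([w * gaussian_pdf(x_t, m, s)
                    for w, m, s in zip(weights, mu_x, sigma_xx)])
    return lik / lik.sum()


def conditional_stats(x_t, mu_x, mu_y, s_xx, s_xy, s_yx, s_yy):
    """Conditional mean and covariance of Y_t given X_t for one component."""
    gain = s_yx @ np.linalg.inv(s_xx)
    cond_mean = mu_y + gain @ (x_t - mu_x)   # assumed form of equation (5)
    cond_cov = s_yy - gain @ s_xy            # assumed form of equation (6)
    return cond_mean, cond_cov


# Hypothetical toy model: two components, 1-D audio and 1-D video features.
weights = np.array([0.5, 0.5])
mu_x = [np.array([-1.0]), np.array([1.0])]
sigma_xx = [np.eye(1), np.eye(1)]
x_t = np.array([2.0])

post = mixture_posteriors(x_t, weights, mu_x, sigma_xx)
E, D = conditional_stats(x_t, mu_x[1], np.array([2.0]),
                         np.eye(1), np.array([[0.5]]),
                         np.array([[0.5]]), np.eye(1))
```

In this toy setting the audio frame lies closer to the second component, so that component receives the larger posterior, and the conditional mean is shifted from the video mean in proportion to how far the audio frame is from the audio mean.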
As noted above, the audio feature vectors, X, 218 and the target feature vectors, Y, 224 may include static and dynamic feature parameters (i.e., Xt=[xt; Δxt] and Yt=[yt; Δyt], respectively). Accordingly, the target feature vectors, Y, 224 may be expressed as a linear transformation of the static feature parameters, Y=Wy, in which y is the sequence of static feature parameters and W is the transformation matrix that appends the dynamic feature parameters.
Similarly, the audio feature vectors, X, 218 may be expressed as X=Wx.
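The relation Y=Wy can be sketched in Python as follows. The specific dynamic-feature window is an assumption made only for illustration: the sketch uses a central difference, Δyt = 0.5(yt+1 − yt−1), with neighbors clamped at the sequence edges. The function name `build_delta_matrix` is hypothetical, and NumPy is assumed to be available.

```python
import numpy as np


def build_delta_matrix(num_frames, dim):
    """Return W with shape (2*T*dim, T*dim) so that W @ y stacks [y_t; delta y_t].

    Assumption: delta y_t = 0.5 * (y_{t+1} - y_{t-1}), with the neighbor
    index clamped to the first or last frame at the sequence edges.
    """
    W = np.zeros((2 * num_frames * dim, num_frames * dim))
    eye = np.eye(dim)
    for t in range(num_frames):
        row = 2 * t * dim
        # Static rows copy y_t unchanged.
        W[row:row + dim, t * dim:(t + 1) * dim] = eye
        # Dynamic rows take half the difference of the neighboring frames.
        prev_t = max(t - 1, 0)
        next_t = min(t + 1, num_frames - 1)
        W[row + dim:row + 2 * dim, next_t * dim:(next_t + 1) * dim] += 0.5 * eye
        W[row + dim:row + 2 * dim, prev_t * dim:(prev_t + 1) * dim] -= 0.5 * eye
    return W


# Four frames of a scalar static feature.
y = np.array([0.0, 1.0, 4.0, 9.0])
W = build_delta_matrix(num_frames=4, dim=1)
Y = W @ y  # interleaved as [y_1, delta y_1, y_2, delta y_2, ...]
```

Because the static rows of W form an identity block, W has full column rank, which is what allows the static sequence to be recovered from the stacked static and dynamic parameters.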
Thus, equation (1) may be written as shown in equation (7):
ŷ≈argmax P(Wy|X,θ) (7)
In some instances, the complexity of solving equation (7) can be significantly reduced using two reasonable approximations. First, the summation over all mixture components, M, in equation (2) can be approximated with a single component sequence, m̂, as shown in equation (8):
P(Y|X,θ)≈P(m̂|X,θ)P(Y|X,m̂,θ) (8)
in which m̂ is a Maximum A Posterior (MAP) single component sequence (i.e., m̂=argmax_m P(m|X,θ)). Using this first approximation, equation (8) can be used to solve equation (7) in a closed form as shown in equations (9), (10), and (11):
$$\hat{y} = \left(W^{\top} D_{\hat{m}}^{(Y)^{-1}} W\right)^{-1} W^{\top} D_{\hat{m}}^{(Y)^{-1}} E_{\hat{m}}^{(Y)} \quad (9)$$
in which
$$E_{\hat{m}}^{(Y)} = \left[E_{\hat{m}_1,1}^{(Y)}, E_{\hat{m}_2,2}^{(Y)}, \ldots, E_{\hat{m}_T,T}^{(Y)}\right] \quad (10)$$
$$D_{\hat{m}}^{(Y)^{-1}} = \operatorname{diag}\left[D_{\hat{m}_1}^{(Y)^{-1}}, D_{\hat{m}_2}^{(Y)^{-1}}, \ldots, D_{\hat{m}_T}^{(Y)^{-1}}\right] \quad (11)$$
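The closed-form solution of equation (9) amounts to a weighted least-squares solve, which the following Python sketch illustrates for a scalar video feature. The delta window used to build W (a central difference clamped at the edges), the function names, and the toy variances are assumptions made for illustration only; NumPy is assumed to be available.

```python
import numpy as np


def delta_window_matrix(num_frames):
    """W for a scalar feature: rows alternate static y_t and delta y_t.

    Assumption: delta y_t = 0.5 * (y_{t+1} - y_{t-1}), clamped at the edges.
    """
    W = np.zeros((2 * num_frames, num_frames))
    for t in range(num_frames):
        W[2 * t, t] = 1.0
        W[2 * t + 1, min(t + 1, num_frames - 1)] += 0.5
        W[2 * t + 1, max(t - 1, 0)] -= 0.5
    return W


def solve_trajectory(W, E, D_inv):
    """Static trajectory y_hat = (W^T D^-1 W)^-1 W^T D^-1 E, as in equation (9)."""
    lhs = W.T @ D_inv @ W
    rhs = W.T @ D_inv @ E
    # Solving the linear system avoids forming an explicit matrix inverse.
    return np.linalg.solve(lhs, rhs)


# Sanity check: if the stacked means E are exactly W @ y_true, the solve
# returns y_true for any positive diagonal precision matrix.
y_true = np.array([0.0, 1.0, 3.0, 2.0, 1.0])
W = delta_window_matrix(len(y_true))
E = W @ y_true
D_inv = np.diag(1.0 / np.array([1.0, 4.0] * len(y_true)))
y_hat = solve_trajectory(W, E, D_inv)
```

When the stacked means are not exactly consistent with any static trajectory, the same solve returns the trajectory that best balances the static and dynamic means according to their precisions.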
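The second approximation that may be applied to the MLE-based conversion process 222 is based on the observation that in a given mixture component, mo, the full covariance matrix in the space of the audio feature vectors, X, and the target feature vectors, Y, can be partitioned into the blocks Σ^(XX), Σ^(XY), Σ^(YX), and Σ^(YY). Neglecting the cross-covariance blocks, equations (5) and (6) reduce to equations (12) and (13):
$$E_{m_t,t}^{(Y)} \approx \mu_{m_t}^{(Y)} \quad (12)$$
$$D_{m_t}^{(Y)} \approx \Sigma_{m_t}^{(YY)} \quad (13)$$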
Using the MLE-based conversion process 222 and the discussed assumptions, equation (1) may be written as shown in equation (14):
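$$\hat{y} \approx \arg\max \prod_{t=1}^{T} P(\hat{m}_t \mid X_t, \theta)\, \mathcal{N}\left(Y_t; \mu_{\hat{m}_t}^{(Y)}, \Sigma_{\hat{m}_t}^{(YY)}\right) \quad (14)$$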
Equation (14) can be solved as discussed above with respect to equation (9).
In summary, the MLE-based conversion process 222 utilizes equations (1)-(14) to generate the target feature vectors, Y, 224.
Audio-to-Video Conversion with MCTE
Although the above MLE-based conversion process 222 is effective, it does not necessarily optimize the audio-to-video conversion error. In other words, a comparison of the target feature vectors, Y, 224 (graphically depicted in
The MCTE-based process may refine the GMM 220 using two steps. First, the MCTE-based process may refine the GMM 220 using a minimum generation error (MGE) 236 which analyzes the spaces of the audio feature vectors, X, 218 and the target feature vectors, Y, 224 separately. Second, the MCTE-based process may apply a generalized probabilistic descent (GPD) algorithm to further refine the GMM.
In general, the MLE-based conversion process imposes equal weights on all the feature dimensions (i.e., Dx=Dy). Although such restriction may be satisfactory for audio-to-audio conversions where the input audio signal and the output audio signal have similar dimensions, this is not necessarily satisfactory for audio-to-video conversions where the dimensions of the video feature vectors, ŷ, and the audio feature vectors, X, 218 are not necessarily of the same order. Accordingly, the MCTE-based process may first refine the GMM 220 using the MGE 236 which analyzes the spaces of the audio feature vectors, X, 218 and the target feature vectors, Y, 224 separately.
In some instances, the MGE 236 weighs the audio space of the audio feature vectors, X, 218 and the video space of the target feature vectors, Y, 224 separately with parameters αx and αy respectively. Specifically, a log likelihood function approximated with a single mixture component is used to define the minimum generation error (MGE) 236 as shown in equation (15) as follows:
Weighing the audio space of the audio feature vectors, X, 218 and the video space of the target feature vectors, Y, 224 separately reduces the mean square error of the MLE-based conversion process 222 results. In some instances, heavier weighting on the audio space of the audio feature vectors, X, 218 in equation (15) leads to more distinguishable mixture components in the P(m|X, θ) component of equation (2) but increased perplexity of P(Y|X, m, θ) component. In such instances, the P(m|X, θ) component may dominate the approximation quality of equation (2). In some non-limiting instances, the weighting parameters may be selected to be αx=1 and αy=1.
Second, the MCTE-based process may apply a generalized probabilistic descent (GPD) algorithm to further refine the GMM. A GPD algorithm 238 may further refine the GMM by minimizing the conversion error 234 of the MLE-based conversion process. In general, the conversion error 234 may be defined as the Euclidean distance, D, between the target feature vectors, Y, 224 (graphically depicted in
$$D(y, \hat{y}) = \sum_{t=1}^{T} \lVert y_t - \hat{y}_t \rVert \quad (16)$$
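Equation (16) sums the per-frame Euclidean distances between the reference and converted trajectories. A minimal Python sketch, assuming NumPy is available and that both trajectories are stored as arrays with one row per frame (the function name `conversion_error` is hypothetical):

```python
import numpy as np


def conversion_error(y_ref, y_hat):
    """Sum over frames of the Euclidean distance ||y_t - y_hat_t||."""
    y_ref = np.asarray(y_ref, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    # axis=1 takes the norm across feature dimensions within each frame.
    return float(np.linalg.norm(y_ref - y_hat, axis=1).sum())


# Two frames of a 2-D feature: the first frame matches, the second is off by (3, 4).
error = conversion_error([[0.0, 0.0], [3.0, 4.0]],
                         [[0.0, 0.0], [0.0, 0.0]])
```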
With the approximation using the MAP mixture component sequence adopted in equation (8), the conversion problem, i.e., maximizing P(Y|X, θ), may include the following two steps. First, given the sequence of audio feature vectors, X, 218, a MAP mixture sequence is estimated (i.e., m̂=argmax_m P(m|X, θ)). Second, given the MAP mixture sequence, the corresponding target feature vectors, Y, 224 are estimated by maximizing P(Y|X, m̂, θ). Note that the second step is the same as a parameter generation problem for a single component sequence m̂. In other words, the conversion problem is solved by generating features from a corresponding hidden Markov model (HMM), which has a sequence of states and Gaussian kernels m̂ determined by the MAP process. The following cost function, L(θ), shown in equation (17) may be used to minimize the conversion error 234 between the target feature vectors, Y, 224 (graphically depicted in
in which N is the number of training utterances.
Using the GPD algorithm 238, given the nth training utterance, the updating rule for the parameters of the mixtures on the MAP sequence is shown in equation (18) as follows:
Applying equation (9) to equation (18) yields equation (19) as follows:
in which E{circumflex over (m)}
In some instances, Σm
In contrast to the MGE, which directly estimates the parameters in the involved HMMs, the Minimum Converted Trajectory Error (MCTE)-based process 226 uses the generalized probabilistic descent (GPD) algorithm 238 to update the target feature vectors of the MAP mixture component sequence. In other words, the MCTE-based process replaces the video parameters of the GMM with updated video parameters to generate the refined GMM 228.
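The idea of updating the visual targets to reduce the converted trajectory error can be sketched with plain gradient descent. The sketch below is not the update rule of equations (18) and (19): as simplifying assumptions it uses a squared-error loss instead of the distance in equation (16), updates the stacked target vector E directly rather than per-mixture parameters, uses unit variances, and builds W from a central-difference delta window. All function names are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np


def delta_window_matrix(num_frames):
    """W for a scalar feature (assumed central-difference delta, clamped edges)."""
    W = np.zeros((2 * num_frames, num_frames))
    for t in range(num_frames):
        W[2 * t, t] = 1.0
        W[2 * t + 1, min(t + 1, num_frames - 1)] += 0.5
        W[2 * t + 1, max(t - 1, 0)] -= 0.5
    return W


def trajectory_operator(W, D_inv):
    """Matrix A such that y_hat = A @ E, following equation (9)."""
    return np.linalg.solve(W.T @ D_inv @ W, W.T @ D_inv)


def descent_step(E, y_ref, A, step_size):
    """One gradient step on the squared error ||y_ref - A @ E||^2 with respect to E."""
    residual = y_ref - A @ E
    grad = -2.0 * A.T @ residual
    return E - step_size * grad


y_ref = np.array([0.0, 1.0, 2.0, 1.0])       # reference visual trajectory
W = delta_window_matrix(len(y_ref))
A = trajectory_operator(W, np.eye(2 * len(y_ref)))
E = np.zeros(2 * len(y_ref))                  # initial stacked visual targets

errors = []
for _ in range(50):
    errors.append(float(np.sum((y_ref - A @ E) ** 2)))
    E = descent_step(E, y_ref, A, step_size=0.1)
errors.append(float(np.sum((y_ref - A @ E) ** 2)))
```

Because the converted trajectory is a linear function of the targets once the mixture sequence is fixed, the squared error is quadratic in E and a small enough step size reduces it at every iteration.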
Audio-to-Video Mapping
After the Minimum Converted Trajectory Error (MCTE)-based process refines the GMM 220, the refined GMM 228 may be used to convert the input speech 106 to the corresponding facial movement 108. First, the audio-to-video engine 102 may recognize the input speech 106 as a source feature vector X=[x1, x2, . . . xT] where each time slice, xt, is a temporal frame of audio feature vector. As discussed above in
Next, the audio-to-video engine 102 may determine a MAP mixture sequence 240 of the input speech (i.e., m̂=argmax_m P(m|X,θ)). In some instances, the audio-to-video engine 102 utilizes techniques similar to the GPD algorithm 238 to determine the MAP mixture sequence 240. Next, the audio-to-video engine 102 may estimate video feature parameters, Y, 242 using the MAP mixture sequence 240 by maximizing P(Y|X, m̂, θ). Finally, the video feature parameters 242 may be stored or may be output as a video of facial movements (e.g., a virtual talking head).
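One simple way to obtain a MAP mixture sequence is sketched below in Python. It assumes that the posterior factorizes over frames, so the sequence-level argmax reduces to a per-frame argmax, and that the audio marginal of each mixture component has a diagonal covariance. The function name and the toy parameters are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np


def map_mixture_sequence(X, weights, means, variances):
    """Return the most probable mixture index for every audio frame.

    X: (T, D) audio frames; weights: (M,); means, variances: (M, D).
    Log densities are used so that small likelihoods do not underflow.
    """
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - means[None, :, :]                    # (T, M, D)
    log_gauss = -0.5 * np.sum(diff ** 2 / variances
                              + np.log(2.0 * np.pi * variances), axis=2)
    # The normalizer of the posterior is shared by all components of a frame,
    # so it can be dropped without changing the argmax.
    log_post = np.log(weights)[None, :] + log_gauss             # (T, M)
    return np.argmax(log_post, axis=1)


# Hypothetical two-component audio model with 2-D features.
weights = np.array([0.5, 0.5])
means = np.array([[-2.0, 0.0], [2.0, 0.0]])
variances = np.ones((2, 2))
X = np.array([[-1.5, 0.1], [1.8, -0.2], [2.5, 0.0]])

m_hat = map_mixture_sequence(X, weights, means, variances)
```

The resulting index sequence selects, frame by frame, which component's visual statistics are used when the video feature parameters are estimated.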
In various embodiments, referring to
The application module 208 may include one or more applications that utilize the audio-to-video engine 102. For example, but not as a limitation, the one or more applications may include a mobile device application of a talking head that reads any text such as news stories or electronic mail (e-mail). In some instances, the one or more applications may include multimedia communication applications, such as video conferencing, that use voice to drive a talking head. In other instances, the one or more applications may include speech conversion applications which output the converted speech via a talking head. In further instances, the one or more applications may include remote educational applications that convert text-based education material to a talking head instructor. The one or more applications may even include applications utilized to increase the intelligibility of speech, and the like. Accordingly, in various embodiments, the audio-to-video engine 102 may include one or more interfaces, such as one or more application program interfaces (APIs), which enable the application module 208 to provide input speech 106 to the audio-to-video engine 102.
The input/output module 210 may enable the audio-to-video engine 102 to receive input speech 106 from another device. For example, the audio-to-video engine 102 may receive input speech 106 from another electronic device (e.g., a server) via one or more networks.
As described above, the data storage module 212 may store the training set 214 of video feature vectors, ŷ, 216 and corresponding audio feature vectors, X, 218 (i.e., speech data). The data storage module 212 may further store one or more input speeches 106, as well as one or more video feature parameters 242 and/or facial movements 108. The data storage module 212 may also store any additional data used by the audio-to-video engine 102, such as, but not limited to, the weighting parameters αx and αy.
Illustrative Processes
At block 302, the audio-to-video engine 102 may receive an input speech 106 and recognize the input speech as one or more source feature vectors X=[x1, x2, . . . xT]. The source feature vectors may include static and dynamic feature parameters which are each of one or more dimensions. The audio-to-video engine 102 may generate the static feature parameters from a phoneme structure of the input speech.
At block 304, the audio-to-video engine 102 may determine a Maximum A Posterior (MAP) mixture sequence 240 based on the source feature vectors. In some instances, the MAP mixture sequence 240 is a function of the refined Gaussian Mixture Model (GMM) 228 which includes both audio parameters and updated video parameters. The updated video parameters of the refined GMM 228 may be updated based on the Minimum Converted Trajectory Error (MCTE) process 226 described above in
In some instances, the audio-to-video engine 102 refines the GMM 220 by weighing the video space of the video feature vectors and the audio space of the audio feature vectors separately as illustrated in equation (15). The audio-to-video engine 102 may further refine the GMM 220 using the generalized probabilistic descent (GPD) algorithm 238 as illustrated in equations (16)-(20).
At block 306, the audio-to-video engine 102 may estimate the video feature parameters 242 using the MAP mixture sequence 240.
At block 308, the audio-to-video engine 102 may generate the facial movement 108 based on the estimated video feature parameters 242.
At block 310, the audio-to-video engine 102 may output (e.g., render) the facial movement 108. In various embodiments, the computing device 104 on which the audio-to-video engine 102 resides may include a display device to display the facial movement 108 as video to a user. The computing device 104 may also store the facial movement 108 as data in the data storage module 212 for subsequent retrieval and/or output.
At block 402, the audio-to-video engine 102 may generate a minimum generation error (MGE) 236 based on the GMM 220. The audio-to-video engine 102 may apply a log likelihood function approximated with a single mixture component as illustrated in equation (15) to generate the MGE 236. In some instances, the log likelihood function weighs the audio space of the audio feature vectors, X, 218 and the video space of the target feature vectors, Y, 224 separately with parameters αx and αy respectively.
At block 404, the audio-to-video engine 102 may apply the generalized probabilistic descent (GPD) algorithm 238 as illustrated in equations (16)-(20) to refine the GMM 220. Applying the GPD algorithm at 404 may include estimating the Maximum A Posterior (MAP) mixture sequence at 406 and estimating the video feature parameters 242 at 408. In contrast to previous processes, which directly estimate the parameters in the involved HMMs, the MCTE process of process 400 uses the GPD algorithm 238 to update the video parameters of the GMM 220. In turn, the updated video parameters replace the corresponding video parameters in the GMM 220 to generate the refined GMM 228.
Illustrative Computing Device
The computing device 104 may be operable to generate facial movement from input speech. For instance, the computing device 104 may be operable to input the input speech 106, recognize the input speech as one or more source feature vectors, determine a Maximum A Posterior (MAP) mixture sequence based on the source feature vectors, estimate the video feature parameters 242 using the MAP mixture sequence, and generate the facial movement based on the estimated video feature parameters.
In at least one configuration, the computing device 104 comprises one or more processors 502 and memory 504. The computing device 104 may also include one or more input devices 506 and one or more output devices 508. The input devices 506 may be a keyboard, mouse, pen, voice input device, touch input device, etc., and the output devices 508 may be a display, speakers, printer, etc. coupled communicatively to the processor 502 and the memory 504. The computing device 104 may also contain communications connection(s) 510 that allow the computing device 104 to communicate with other computing devices 512 such as via a network.
The memory 504 of the computing device 104 may store an operating system 514 and one or more program modules 516, and may include program data 518. The memory 504, or portions thereof, may be implemented using any form of computer-readable media that is accessible by the computing device 104. Computer-readable media includes at least two types of computer-readable media, namely computer storage media and communications media.
Computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules, or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by a computing device.
In some instances, the program modules 516 may be configured to generate facial movement from input speech using the process 300 illustrated in
Conclusion
In closing, although the various embodiments have been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as exemplary forms of implementing the claimed subject matter.
Number | Date | Country | |
---|---|---|---|
20120116761 A1 | May 2012 | US |