Image sequences with lip movements synchronized with speech are commonly called “talking heads.” Talking heads are useful in applications of human-machine interaction, e.g. reading emails, news or eBooks, acting as an intelligent voice agent or a computer assisted language teacher, etc. A lively talking head can attract the attention of a user, make the human/machine interface more engaging or add entertainment to an application.
Generating talking heads that look like real people is challenging. A talking head needs not only to be photo-realistic in its static appearance, but also to exhibit convincing plastic deformations of the lips synchronized with the corresponding speech, because the most eye-catching region of a talking face is the area around the mouth containing the "articulators" (lips, teeth, and tongue).
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Audiovisual data of an individual reading a known script is obtained and stored in an audio library and an image library. The audiovisual data is processed to extract feature vectors used to train a statistical model, such as a context dependent hidden Markov model, in which a single Gaussian mixture model (GMM) is used to characterize state outputs. An input audio feature vector corresponding to desired speech with which a synthesized image sequence will be synchronized is provided. This input audio feature vector may be derived from text or from a speech signal.
The statistical model is used to generate a trajectory of visual feature vectors that corresponds to the input audio feature vector. These visual feature vectors are used to identify a matching image sequence from the image library. The matching process takes into account both a target cost and a concatenation cost. The target cost represents a measure of the difference (or similarity) between feature vectors of images in the image library and the feature vectors in the trajectory. For example, the target cost may be a Euclidean distance between pairs of feature vectors. The concatenation cost represents a measure of the difference (or similarity) between adjacent images in the output image sequence. For example, the concatenation cost may be a correlation between adjacent images in the output image sequence. The resulting sequence of images, concatenated from the image library, provides a photorealistic image sequence with lip movements synchronized with the desired speech.
In the following description, reference is made to the accompanying drawings which form a part hereof, and in which are shown, by way of illustration, specific example implementations of this technique. It is understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the claimed subject matter.
The following section provides an example system environment in which photorealistic image sequence generation can be used.
Referring now to FIG. 1, an example system environment includes an application 100 that interacts with an end user through an audiovisual output device 102.
The application 100 can use a talking head for a variety of purposes. For example, the application 100 can be a computer assisted language learning application, a language dictionary (e.g., to demonstrate pronunciation), an email reader, a news reader, a book reader, a text-to-speech system, an intelligent voice agent, an avatar of an individual for a virtual meeting room, a virtual agent in a dialogue system, video conferencing, online chatting, gaming, movie animation, or another application that provides visual and speech-based interaction with an individual.
In general, such an application 100 provides an input, such as text 110, or optionally speech 112, to a synthesis module 104, which in turn generates an image sequence 106 with lip movements synchronized with speech that matches the text or the input speech. The synthesis module 104 relies on a model 108, described in more detail below. The operation of the synthesis module also is described in more detail below.
When text is provided by the application 100, the text 110 is input to a text-to-speech conversion module 114 to generate speech 112. The application 100 also might provide a speech signal 112, in which case the text-to-speech conversion is not used and the synthesis module generates an image sequence 106 using the speech signal 112.
The speech signal 112 and the image sequence 106 are played back using a synchronized playback module 120, which generates audiovisual signals 122 that are output to the end user through an audiovisual output device 102. The synchronized playback module may reside in a computing device at the end user's location, or may be in a remote computer.
Having now described the application environment in which the synthesis of image sequences may be used, how such image sequences are generated will now be described.
Referring now to FIG. 2, a training module processes recorded audiovisual data of an individual to build a model 204 for that individual.
The model 204 is used by a synthesis module 206 to generate a visual feature vector sequence corresponding to an input set of feature vectors for speech with which the facial animation is to be synchronized. The input set of feature vectors for speech is derived from input 208, which may be text or speech. The visual feature vector sequence is used to select an image sample sequence from an image library (part of the model 204). This image sample sequence is processed to provide the photo-realistic image sequence 210 to be synchronized with speech signals corresponding to the input 208 of the synthesis module.
The training module, in general, would be used once for each individual for whom a model is created for generating photorealistic image sequences. The synthesis module is used each time a new text or speech sequence is provided for which a new image sequence is to be synthesized from the model. It is possible to create, store and re-use image sequences from the synthesis module instead of recomputing them each time.
Training of the statistical model will be described first in connection with FIG. 3.
In FIG. 3, audio and video of an individual reading a known script are recorded and stored in an audiovisual database 300.
Because a reader typically moves his or her head naturally during recording, the images can be normalized for head position by a head pose normalization module 302. For example, the head pose in each frame of the recorded audiovisual content is normalized and aligned to a full-frontal view. An example implementation of head pose normalization is to use the techniques found in Q. Wang, W. Zhang, X. Tang, H. Y. Shum, "Real-time Bayesian 3-d pose tracking," IEEE Transactions on Circuits and Systems for Video Technology 16(12) (2006), pp. 1533-1541. Next, the images of just the articulators (i.e., the mouth, lips, teeth, tongue, etc.) are cropped out with a fixed rectangular window to form a library of lips sample images. These images also may be stored in the audiovisual database 300 and/or passed on to a visual feature extraction module 304.
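As a simple illustration of the cropping step only (the pose normalization itself relies on the cited 3-D tracking technique and is not reproduced here), a fixed rectangular window could be applied to each normalized frame as in the sketch below; the window coordinates are hypothetical.

```python
import numpy as np

# Hypothetical fixed cropping window (top, bottom, left, right), chosen once for
# the normalized, full-frontal frames so that it covers the articulator region.
LIPS_WINDOW = (300, 420, 220, 420)

def crop_lips(frame):
    """Crop the articulator region from a pose-normalized frame.

    frame: (height, width) or (height, width, channels) image array.
    Returns the cropped lips sample image.
    """
    top, bottom, left, right = LIPS_WINDOW
    return frame[top:bottom, left:right]

# Building the library of lips sample images from normalized frames:
# lips_library = [crop_lips(f) for f in normalized_frames]
```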
Using the library of lips sample images, visual feature extraction module 304 generates a visual feature vector for each image. In one implementation, eigenvectors of the lips images are obtained by applying principal component analysis (PCA). In experiments, the top twenty eigenvectors captured about 90% of the accumulated variance, so twenty eigenvectors are used for each lips image. Thus the visual feature vector for each lips image s_t is given by its PCA projection,

v_t = s_t W   (1)

where W is the projection matrix formed by the top twenty eigenvectors of the lips images.
Acoustic feature vectors for the audio data associated with each of the lips sample images are also created, using conventional techniques such as computing Mel-frequency cepstral coefficients (MFCCs).
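As a rough illustration of this feature extraction stage, the sketch below uses scikit-learn for the PCA projection and librosa for the MFCCs; both library choices, and the detail of resampling the MFCCs to the video frame rate, are assumptions rather than part of the described system.

```python
import numpy as np
import librosa                          # assumed library for MFCC extraction
from sklearn.decomposition import PCA   # assumed library for the eigenvector projection

def extract_visual_features(lips_images, n_components=20):
    """Project flattened lips images onto their top PCA eigenvectors.

    lips_images: (num_frames, height * width) array of cropped lips images.
    Returns the (num_frames, n_components) visual feature vectors and the
    fitted PCA model, whose components play the role of W in Equation (1).
    """
    pca = PCA(n_components=n_components)
    visual_features = pca.fit_transform(lips_images)
    return visual_features, pca

def extract_acoustic_features(audio_path, n_mfcc=13, video_fps=30.0):
    """Compute MFCCs, with one acoustic frame per video frame (an assumed alignment)."""
    samples, sr = librosa.load(audio_path, sr=None)
    hop_length = int(sr / video_fps)
    mfcc = librosa.feature.mfcc(y=samples, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)
    return mfcc.T   # (num_frames, n_mfcc)
```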
Next, the audio and video feature vectors 305 (which also may be stored in the audiovisual library) are used by a statistical model training module 307 to generate a statistical model 306. In one implementation, acoustic vectors A_t = [a_t^T, Δa_t^T, ΔΔa_t^T]^T and visual vectors V_t = [v_t^T, Δv_t^T, ΔΔv_t^T]^T are used, formed by augmenting the static features with their dynamic (delta and delta-delta) counterparts to represent the audio and video data. Audio-visual hidden Markov models (HMMs), λ, are trained by maximizing the joint probability p(A, V | λ) over the acoustic and visual training vectors. To capture contextual effects, context-dependent HMMs are trained, and tree-based clustering is applied to the acoustic and visual feature streams separately to improve the robustness of the corresponding models. For each audio-visual HMM state q, a single Gaussian mixture model (GMM) is used to characterize the state output, with mean vectors μ_q^(A) and μ_q^(V). In one implementation, diagonal covariance matrices Σ_q^(AA) and Σ_q^(VV) and null cross-covariance matrices Σ_q^(AV) and Σ_q^(VA) are used, by assuming independence between the audio and visual streams and between different components. Training of an HMM is described, for example, in Fundamentals of Speech Recognition by Lawrence Rabiner and Biing-Hwang Juang, Prentice-Hall, 1993.
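The context-dependent modeling and tree-based clustering above are typically performed with an HTS-style toolkit and are not reproduced here; the following sketch only illustrates two of the ingredients, namely forming the static-plus-dynamic observation vectors and fitting an HMM whose state outputs are single diagonal-covariance Gaussians, using hmmlearn as an assumed stand-in.

```python
import numpy as np
from hmmlearn import hmm   # assumed stand-in for an HTS-style trainer

def add_deltas(static):
    """Augment static features x_t with delta and delta-delta features,
    mirroring A_t = [a_t, delta a_t, delta-delta a_t] and likewise for V_t."""
    delta = np.gradient(static, axis=0)    # central differences as the delta window
    delta2 = np.gradient(delta, axis=0)
    return np.hstack([static, delta, delta2])

def train_audiovisual_hmm(acoustic, visual, n_states=32):
    """Fit a single HMM on joint audio-visual observations.

    Each state output is one Gaussian with a diagonal covariance matrix, which
    mirrors the single-Gaussian-per-state and null cross-covariance assumptions
    above. n_states is an arbitrary illustrative choice, not a context-dependent
    state inventory.
    """
    observations = np.hstack([add_deltas(acoustic), add_deltas(visual)])
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
    model.fit(observations)
    return model
```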
Referring now to FIG. 4, the foregoing training process can be summarized as follows: audiovisual data of the individual is recorded, head poses are normalized, lips sample images are cropped, visual and acoustic feature vectors are extracted, and the audio-visual HMMs are trained.
Having now described how a statistical model is trained using audiovisual data, the process of synthesizing an image sequence using this model will now be described in more detail.
Referring now to FIG. 5, a trajectory generation module 500 receives acoustic feature vectors derived from the input text or speech and, using the trained statistical model, generates a visual feature vector sequence 506 corresponding to the input.
An implementation of module 500 is as follows. Given a continuous audio-visual HMM λ and acoustic feature vectors A = [A_1^T, A_2^T, ..., A_T^T]^T, the module identifies a visual feature vector sequence V = [V_1^T, V_2^T, ..., V_T^T]^T such that the following likelihood function is maximized:
p(V | A, λ) = Σ_{all Q} p(Q | A, λ) · p(V | A, Q, λ),   (2)
Equation (2) is maximized with respect to V, where Q is the state sequence. In particular, at frame t, p(V_t | A_t, q_t, λ) is given by:
p(V_t | A_t, q_t, λ) = N(V_t; μ̂_{q_t}, Σ̂_{q_t}),

where the conditional mean vector and covariance matrix for state q_t are

μ̂_{q_t} = μ_{q_t}^(V) + Σ_{q_t}^(VA) (Σ_{q_t}^(AA))^{−1} (A_t − μ_{q_t}^(A)),

Σ̂_{q_t} = Σ_{q_t}^(VV) − Σ_{q_t}^(VA) (Σ_{q_t}^(AA))^{−1} Σ_{q_t}^(AV).

With the null cross-covariance assumption described above, these reduce to μ_{q_t}^(V) and Σ_{q_t}^(VV).
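The per-state conditional mean and covariance above are the standard formulas for conditioning one block of a joint Gaussian on the other; a small numpy helper (hypothetical, not part of the described system) makes the computation explicit.

```python
import numpy as np

def conditional_gaussian(A_t, mu_A, mu_V, S_AA, S_VV, S_VA, S_AV):
    """Return the mean and covariance of p(V_t | A_t) for one HMM state.

    With the null cross-covariance assumption above (S_VA = S_AV = 0),
    the result reduces to (mu_V, S_VV).
    """
    S_AA_inv = np.linalg.inv(S_AA)
    mu_hat = mu_V + S_VA @ S_AA_inv @ (A_t - mu_A)
    sigma_hat = S_VV - S_VA @ S_AA_inv @ S_AV
    return mu_hat, sigma_hat
```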
The optimal state sequence Q̂ is determined by maximizing the likelihood function p(Q | A, λ) for the given acoustic feature vectors A and model λ. The logarithm of the likelihood function can then be written as

log p(V | A, Q̂, λ) = −(1/2) V^T (Û^(VV))^{−1} V + V^T (Û^(VV))^{−1} M̂^(V) + K,

where M̂^(V) = [μ̂_{q_1}^T, μ̂_{q_2}^T, ..., μ̂_{q_T}^T]^T is the stacked sequence of conditional mean vectors, Û^(VV) = diag[Σ̂_{q_1}, Σ̂_{q_2}, ..., Σ̂_{q_T}] is the block-diagonal matrix of conditional covariances, and the constant K is independent of V. The relationship between a sequence of the static feature vectors C = [v_1^T, v_2^T, ..., v_T^T]^T and the sequence of static and dynamic feature vectors V can be represented as a linear conversion,
V = W_c C,   (9)
where W_c is a transformation matrix that appends the dynamic (delta and delta-delta) features to the static features, such as described in K. Tokuda, H. Zen, et al., "The HMM-based speech synthesis system (HTS)," http://hts.ics.nitech.ac.jp/. By setting

∂ log p(V | A, Q̂, λ) / ∂C = 0,

the V̂_opt that maximizes the logarithmic likelihood function is given by

V̂_opt = W_c C_opt = W_c (W_c^T (Û^(VV))^{−1} W_c)^{−1} W_c^T (Û^(VV))^{−1} M̂^(V).   (10)
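The closed-form solution in Equation (10) can be computed directly once W_c, Û^(VV) and M̂^(V) are assembled. The sketch below assumes simple first- and second-difference windows for the delta features (the exact windows are not specified here) and takes the inverse covariance as a precomputed block-diagonal matrix.

```python
import numpy as np

def build_wc(T, dim):
    """Build W_c mapping stacked static features C to stacked [static, delta,
    delta-delta] features V, using simple difference windows (an assumed choice)."""
    I = np.eye(T)
    D1 = np.zeros((T, T))   # delta:        0.5 * (c_{t+1} - c_{t-1})
    D2 = np.zeros((T, T))   # delta-delta:  c_{t+1} - 2 c_t + c_{t-1}
    for t in range(T):
        tm, tp = max(t - 1, 0), min(t + 1, T - 1)
        D1[t, tp] += 0.5
        D1[t, tm] -= 0.5
        D2[t, tp] += 1.0
        D2[t, t] -= 2.0
        D2[t, tm] += 1.0
    rows = []
    for t in range(T):                   # per-frame ordering [c_t, delta, delta-delta]
        rows.extend([I[t], D1[t], D2[t]])
    W = np.array(rows)                   # (3T, T) for a scalar feature dimension
    return np.kron(W, np.eye(dim))       # expand to dim-dimensional frames

def optimal_trajectory(Wc, U_inv, M):
    """Solve Equation (10): C_opt = (Wc^T U^-1 Wc)^-1 Wc^T U^-1 M, return Wc @ C_opt.

    U_inv: block-diagonal inverse of the stacked conditional covariances.
    M:     stacked conditional mean vectors, flattened to shape (3*T*dim,).
    """
    lhs = Wc.T @ U_inv @ Wc
    rhs = Wc.T @ U_inv @ M
    return Wc @ np.linalg.solve(lhs, rhs)
```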
The visual feature vector sequence 506 is a compact description of articulator movements in the lower-rank eigenvector space of the lips images. However, the lips image sequence to which it directly corresponds, if used as the output image sequence, would be blurred due to: (1) dimensionality reduction in PCA; and (2) maximum likelihood (ML)-based model parameter estimation and trajectory generation. Therefore, this trajectory is used to guide the selection of real sample images, which in turn are concatenated to construct the output image sequence. In particular, an image selection module 508 receives the visual feature vector sequence 506 and searches the audiovisual database 510 for the real image sample sequence 512 in the library that is closest to the predicted trajectory. Thus, the articulator movement in the visual trajectory is reproduced, and photo-realistic rendering is provided by using real image samples.
An implementation of the image selection module 508 is as follows. First, the total cost for a sequence of T selected samples is the weighted sum of the target and concatenation costs:
C(V̂_1^T, Ŝ_1^T) = Σ_{i=1}^{T} ω_t C_t(V̂_i, Ŝ_i) + Σ_{i=2}^{T} ω_c C_c(Ŝ_{i−1}, Ŝ_i)   (11)
The target cost of an image sample Ŝ_i is measured by the Euclidean distance between its PCA vector and the corresponding visual feature vector V̂_i in the predicted trajectory:

C_t(V̂_i, Ŝ_i) = ‖V̂_i − Ŝ_i W‖   (12)
The concatenation cost is measured by the normalized two-dimensional cross-correlation (NCC) between two image samples Ŝ_i and Ŝ_j, as shown in Equation 13:

NCC(Ŝ_i, Ŝ_j) = Σ_{x,y} (Ŝ_i(x,y) − S̄_i)(Ŝ_j(x,y) − S̄_j) / √( Σ_{x,y} (Ŝ_i(x,y) − S̄_i)^2 · Σ_{x,y} (Ŝ_j(x,y) − S̄_j)^2 )   (13)

where S̄_i and S̄_j denote the mean pixel values of the two samples. Since the correlation coefficient ranges in value from −1.0 to 1.0, the NCC is by nature a normalized similarity score.
Assume that the corresponding samples of Ŝ_i and Ŝ_j in the sample library are S_p and S_q, i.e., Ŝ_i = S_p and Ŝ_j = S_q, where p and q are the sample indexes in the video recording. Hence S_p and S_{p+1}, and S_{q−1} and S_q, are consecutive frames in the original recording. As defined in Equation 14, the concatenation cost between Ŝ_i and Ŝ_j is measured by the NCC of S_p and S_{q−1} and the NCC of S_{p+1} and S_q.
Because NCC(S_p, S_p) = NCC(S_q, S_q) = 1, it follows that C_c(S_p, S_{p+1}) = C_c(S_{q−1}, S_q) = 0, so the selection of consecutive frames from the original recording is encouraged.
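Below is a minimal sketch of the NCC similarity and of a concatenation cost with the zero-cost property just described. Since Equation 14 itself is not reproduced above, the exact form used here (2 minus the two NCC terms) is only one plausible choice consistent with that property.

```python
import numpy as np

def ncc(img_a, img_b):
    """Normalized cross-correlation between two equally sized lips images (Eq. 13)."""
    a = img_a.astype(float).ravel() - img_a.mean()
    b = img_b.astype(float).ravel() - img_b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def concatenation_cost(library, p, q):
    """Cost of placing library sample S_p immediately before S_q.

    Compares S_p with the frame that originally preceded S_q, and S_q with the
    frame that originally followed S_p; consecutive frames from the recording
    (q == p + 1) therefore get zero cost. Indices p + 1 and q - 1 are assumed
    to be within range.
    """
    return 2.0 - ncc(library[p], library[q - 1]) - ncc(library[p + 1], library[q])
```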
The sample selection procedure is the task of determining the set of image samples Ŝ_1^T so that the total cost defined by Equation 11 is minimized, which is represented mathematically by Equation 15:

Ŝ_1^T = argmin_{Ŝ_1^T} C(V̂_1^T, Ŝ_1^T)   (15)
Optimal sample selection can be performed with a Viterbi search. However, to obtain near real-time synthesis on a large dataset containing tens of thousands of samples, the search space is pruned. One example of such pruning is implemented in two parts. First, for every target frame in the trajectory, the K nearest samples are identified according to the target cost. The beam width K can be, for example, between 1 and N (the total number of images); the number K can be selected so as to provide the desired performance. Second, the remaining samples are pruned according to the concatenation cost.
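A sketch of such a beam-pruned, Viterbi-style search over the sample library follows; the function names, weights, and plain-Python dynamic program are illustrative assumptions rather than the described implementation.

```python
import numpy as np

def select_samples(trajectory, library_pca, concat_cost, K=20, w_t=1.0, w_c=1.0):
    """Beam-pruned Viterbi-style minimization of the total cost in Equation (11).

    trajectory:  (T, d) predicted visual feature vectors.
    library_pca: (N, d) PCA vectors of the library lips images.
    concat_cost: function (i, j) -> concatenation cost of sample i before sample j.
    Returns a list of T library indices.
    """
    T = len(trajectory)
    # 1) Beam pruning: keep the K nearest library samples per target frame,
    #    ranked by the Euclidean target cost of Equation (12).
    candidates, target_costs = [], []
    for v in trajectory:
        dist = np.linalg.norm(library_pca - v, axis=1)
        idx = np.argsort(dist)[:K]
        candidates.append(idx)
        target_costs.append(dist[idx])

    # 2) Dynamic programming over the pruned lattice.
    best = w_t * target_costs[0]
    back = [np.zeros(len(candidates[0]), dtype=int)]
    for t in range(1, T):
        cur = np.empty(len(candidates[t]))
        ptr = np.empty(len(candidates[t]), dtype=int)
        for j, sj in enumerate(candidates[t]):
            costs = best + w_t * target_costs[t][j] + w_c * np.array(
                [concat_cost(si, sj) for si in candidates[t - 1]])
            ptr[j] = int(np.argmin(costs))
            cur[j] = costs[ptr[j]]
        best = cur
        back.append(ptr)

    # 3) Backtrace the minimum-cost path.
    path = [int(np.argmin(best))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    path.reverse()
    return [int(candidates[t][j]) for t, j in enumerate(path)]
```

The selected indices can then be used to concatenate the corresponding library images into the output image sequence.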
The operation of a system such as shown in the foregoing figures proceeds as follows: acoustic feature vectors are derived from the input text or speech, a visual feature trajectory is generated from the trained statistical model, and an image sample sequence is selected from the library and concatenated.
As a result of this image selection technique, a set of real images that closely matches the predicted trajectory and transitions smoothly from frame to frame provides a photorealistic image sequence with lip movements that closely match the provided audio or text.
The system for generating photorealistic image sequences is designed to operate in a computing environment. The following description is intended to provide a brief, general description of a suitable computing environment in which this system can be implemented. The system can be implemented with numerous general purpose or special purpose computing hardware configurations. Examples of well known computing devices that may be suitable include, but are not limited to, personal computers, server computers, hand-held or laptop devices (for example, media players, notebook computers, cellular phones, personal data assistants, voice recorders), multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
With reference to FIG. 7, an example computing environment includes a computing device, such as computing device 700, having at least one processing unit and memory.
Device 700 may also contain communications connection(s) 712 that allow the device to communicate with other devices. Communications connection(s) 712 is an example of communication media. Communication media typically carries computer program instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal, thereby changing the configuration or state of the receiving device of the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media.
Device 700 may have various input device(s) 714 such as a display, a keyboard, mouse, pen, camera, touch input device, and so on. Output device(s) 716 such as speakers, a printer, and so on may also be included. All of these devices are well known in the art and need not be discussed at length here.
The system for photorealistic image sequence generation may be implemented in the general context of software, including computer-executable instructions and/or computer-interpreted instructions, such as program modules, being processed by a computing device. Generally, program modules include routines, programs, objects, components, data structures, and so on, that, when processed by the computing device, perform particular tasks or implement particular abstract data types. This system may be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Any or all of the aforementioned alternate embodiments described herein may be used in any combination desired to form additional hybrid embodiments. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. The specific features and acts described above are disclosed as example forms of implementing the claims.
Lei Xie et al., “A coupled HMM approach to video-realistic speech animation”, 2006, Pattern Recognition Society, pp. 2325-2340. |
Kang Liu et al., “Optimization of an Image-based Talking Head System”, Jul. 3, 2009, pp. 1-13. |
Potamianos, et al., “An Image Transform Approach for HMM Based Automatic Lipreading”, Retrieved at << http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=999008 >>, Proceedings of the International Conference on Image Processing, 1998, pp. 173-177. |
Bailly, G., “Audiovisual Speech Synthesis”, Retrieved at << http://www.google.co.in/url?sa=t&source=web&cd=5&ved=0CDkQFjAE&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.25.5223%26rep%3Drep1%26type%3Dpdf&ei=OjtjTZ70Gsms8AOu—I3xCA&usg=AFQjCNHLBrzLXHD3BqweVV5XSVvNPFrKoA >>, International Journal of Speech Technology, vol. 06, 2001, pp. 10. |
Zhuang, et al., “A Minimum Converted Trajectory Error (MCTE) Approach to High Quality Speech-to-Lips Conversion”, Retrieved at << http://www.isle.illinois.edu/sst/pubs/2010/zhuang10interspeech.pdf >>, 11th Annual Conference of the International Speech Communication Association, Sep. 26-30, 2010, pp. 4. |
“Agenda with Abstracts”, Retrieved at << http://research.microsoft.com/en-us/events/asiafacsum2010/agenda—expanded.aspx >>, Retrieved Date: Feb. 22, 2011, pp. 6. |
Cosatto, et al., “Photo-realistic Talking Heads from Image Samples”, Retrieved at << http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=865480 >>, IEEE Transactions on Multimedia, vol. 02, No. 3, Sep. 2000, pp. 152-163. |
Bregler, et al., “Video Rewrite: Driving Visual Speech with Audio”, Retrieved at << http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=6A9DA58ECBE8EA0OBCA13494C68D82E0?doi=10.1.1.162.1921&rep=rep1&type=pdf >>, The 24th International Conference on Computer Graphics and Interactive Techniques, Aug. 3-8, 1997, pp. 1-8. |
Huang, et al., “Triphone based Unit Selection for Concatenative Visual Speech Synthesis”, Retrieved at << http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.13.6042&rep=rep1&type=pdf >>, IEEE International Conference on Acoustics, Speech, and Signal Processing, Apr. 27-30, 1993, pp. II-2037-II-2040. |
Ezzat, et al., “Trainable VideoRealistic Speech Animation”, Retrieved at << http://cbcl.mit.edu/cbcl/publications/ps/siggraph02.pdf >>, The 29th International Conference on Computer Graphics and Interactive Techniques, Jul. 21-26, 2002, pp. 11. |
Mattheyses, et al., “Multimodal Unit Selection for 2D Audiovisual Text-to-Speech Synthesis”, Retrieved at << http://www.esat.kuleuven.be/psi/spraak/cgi-bin/get—file.cgi?/space/mattheyses—mimi08/paper.pdf >>, Machine Learning for Multimodal Interaction, 5th International Workshop, MLMI, Sep. 8-10, 2008, pp. 12. |
Liu, et al., “Realistic Facial Animation System for Interactive Services”, Retrieved at << http://www.tnt.uni-hannover.de/papers/data1692/692—1.pdf >>, 9th Annual Conference of the International Speech Communication Association, Sep. 22-26, 2008, pp. 2330-2333. |
Zen, et al., “The HMM-based Speech Synthesis System (HTS)”, Retrieved at << http://www.cs.cmu.edu/˜awb/papers/ssw6/ssw6—294.pdf >>, 6th ISCA Workshop on Speech Synthesis, Aug. 22-24, 2007, pp. 294-299. |
Sako, et al., “HMM-based Text-To-Audio-Visual Speech Synthesis”, Retrieved at << http://www.netsoc.tcd.ie/˜fastnet/cd—paper/ICSLP/ICSLP—2000/pdf/01692.pdf >>, Proceedings 6th International Conference on Spoken Language Processing, ICSLP, 2000, pp. 4. |
Xie, et al., “Speech Animation using Coupled Hidden Markov Models”, Retrieved at << http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1699088 >>, 18th International Conference on Pattern Recognition (ICPR), Aug. 20-24, 2006, pp. 4. |
Yan, et al., “Rich-context Unit Selection (RUS) Approach to High Quality TTS”, Retrieved at << http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5495150 >>, Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, Mar. 14-19, 2010, pp. 4798-4801. |
Theobald, et al., “LIPS2008: Visual Speech Synthesis Challenge”, Retrieved at << http://hal.archives-ouvertes.fr/docs/00/33136/55/PDF/bjt—IS08.pdf >>, 2008, pp. 4. |
Chen, Tsuhan., “Audiovisual Speech Processing”, Retrieved at << http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=911195 >>, Jan. 2001, pp. 9-21. |
King, et al., “Creating Speech-synchronized Animation”, Retrieved at << http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1407866 >>, IEEE Transactions on Visualization and Computer Graphics, vol. 11, No. 3, May-Jun. 2005, pp. 341-352. |
Cosatto, et al., “Sample-based Synthesis of Photo-realistic Talking Heads”, Retrieved at << http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=681914 >>, 1998, pp. 8. |
Ezzat, et al., “Miketalk: A Talking Facial Display based on Morphing Visemes”, Retrieved at << http://people.csail.mit.edu/tonebone/publications/ca98.pdf >>, Proceedings of the Computer Animation Conference, Jun. 1998, pp. 7. |
Liu, et al., “Parameterization of Mouth Images by LLE and PCA for Image-based Facial Animation”, Retrieved at << http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1661312&userType=inst >>, 2006, pp. V-461-V-464. |
Wang, et al., “Real-time Bayesian 3-D Pose Tracking”, Retrieved at << http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4016113 >>, IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, No. 12, Dec. 2006, pp. 1533-1541. |
Nakamura, Satoshi., “Statistical Multimodal Integration for Audio-visual Speech Processing”, Retrieved at << http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1021886 >>, IEEE Transactions on Neural Networks, vol. 13, No. 4, Jul. 2002, pp. 854-866. |
Lucey, et al., “Integration Strategies for Audio-visual Speech Processing: Applied to Text-dependent Speaker Recognition”, Retrieved at << http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1430725 >>, IEEE Transactions on Multimedia, vol. 07, No. 3, Jun. 2005, pp. 495-506. |
Masuko, et al., “Speech Synthesis using HMMs with Dynamic Features”, Retrieved at << http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=541114 >>, IEEE International Conference on Acoustics, Speech and Signal Processing, May 7-10, 1996, pp. 389-392. |
Tokuda, et al., “Hidden Markov Models based on Multi-space Probability Distribution for Pitch Pattern Modeling”, Retrieved at << http://www.netsoc.tcd.ie/˜fastnet/cd—paper/ICASSP/ICASSP—1999/PDF/AUTHOR/IC992479.PDF >>, IEEE International Conference on Acoustics, Speech, and Signal Processing, Mar. 15-19, 1999, pp. 4. |
Toda, et al., “Spectral Conversion based on Maximum Likelihood Estimation Considering Global Variance of Converted Parameter”, Retrieved at << http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1415037 >>, 2005, pp. I-9-I-12. |
Perez, et al., “Poisson Image Editing”, Retrieved at << http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.133.6932&rep=rep1&type=pdf >>, Special Interest Group on Computer Graphics and Interactive Techniques, Jul. 27-31, 2003, pp. 313-318. |
Huang, et al., “Recent Improvements on Microsoft's Trainable Text-to-speech System—Whistler”, Retrieved at << http://research.microsoft.com/pubs/77517/1997-xdh-icassp.pdf >>, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Apr. 1997, pp. 4. |
Donovan, et al., “The IBM Trainable Speech Synthesis System”, Retrieved at << http://www.shirc.mq.edu.au/proceedings/icslp98/PDF/SCAN/SL980166.PDF >>, Proceedings of the 5th International Conference of Spoken Language Processing, 1998, pp. 4. |
Hirai, et al., “Using 5 ms Segments in Concatenative Speech Synthesis”, Retrieved at << http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.10.9628&rep=rep1&type=pdf >>, 5th ISCA Speech Synthesis Workshop, 2004, pp. 37-42. |
Hunt, et al., “Unit Selection in a Concatenative Speech Synthesis System using a Large Speech Database”, Retrieved at << http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=541110 >>, IEEE International Conference on Acoustics, Speech, and Signal Processing, May 7-10, 1996, pp. 373-376. |
Lewis, J. P., “Fast Normalized Cross-correlation”, Retrieved at << http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.21.6062&rep=rep1&type=pdf >>, 1995, pp. 7. |
Graf, et al., “Face Analysis for the Synthesis of Photo-Realistic Talking Heads”, In IEEE International Conference on Automatic Face and Gesture Recognition, 2000, 6 Pages. |
Sheng, et al., “Automatic 3D Face Synthesis using Single 2D Video Frame”, In Electronics Letters, vol. 40, Issue 19, Sep. 16, 2004, 2 Pages. |
Tao, et al., “Speech Driven Face Animation Based on Dynamic Concatenation Model”, In Journal of Information & Computational Science, vol. 3, Issue 4, Dec. 2006, 10 Pages. |
Theobald, et al., “2.5D Visual Speech Synthesis Using Appearance Models”, In Proceedings of the British Machine Vision Conference, 2003, 10 Pages. |
“Non-Final Office Action Issued in U.S. Appl. No. 13/099,387”, Mailed Date: May 9, 2014, 15 Pages. |
“Final Office Action Issued in U.S. Appl. No. 13/099,387”, Mailed Date: Jan. 5, 2015, 16 Pages. |
“Non-Final Office Action Issued in U.S. Appl. No. 13/099,387”, Mailed Date: Apr. 7, 2016. |