The present invention relates to avatar animation, and more particularly, to facial feature tracking.
Virtual spaces filled with avatars are an attractive way to provide the experience of a shared environment. However, animating a photo-realistic avatar often requires tedious effort to generate realistic animation information.
Accordingly, there exists a significant need for improved techniques for generating animation information. The present invention satisfies this need.
The present invention is embodied in a method, and related apparatus, for generating facial animation values using a sequence of facial image frames and synchronously captured audio data of a speaking actor. In the method, a plurality of visual facial animation values are provided based on tracking, without using markers on the speaking actor, of facial features in the sequence of facial image frames of the speaking actor, and a plurality of audio facial animation values are provided based on visemes detected using the synchronously captured audio voice data of the speaking actor. The plurality of visual facial animation values and the plurality of audio facial animation values are combined to generate output facial animation values for use in facial animation.
In more detailed features of the invention, the output facial animation values associated with a mouth for a facial animation may be based only on the respective mouth-associated values of the plurality of audio facial animation values. Alternatively, the output facial animation values associated with a mouth for a facial animation may be based on a weighted average of the respective mouth-associated values of the plurality of visual facial animation values and the respective mouth-associated values of the plurality of audio facial animation values. Also, the output facial animation values associated with a mouth for a facial animation may be based on Kalman filtering of the respective mouth-associated values of the plurality of visual facial animation values and the respective mouth-associated values of the plurality of audio facial animation values. Further, the step of combining the plurality of visual facial animation values and the plurality of audio facial animation values to generate output facial animation values may include detecting whether speech is occurring in the synchronously captured audio voice data of the speaking actor and, while speech is detected as occurring, generating the output facial animation values associated with a mouth based only on the respective mouth-associated values of the plurality of audio facial animation values and, while speech is not detected as occurring, generating the output facial animation values associated with the mouth based only on the respective mouth-associated values of the plurality of visual facial animation values.
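By way of a hedged illustration only (not the claimed implementation), the following Python sketch shows how mouth-associated animation values might be selected or blended under the three strategies described above; the function name, the weighting parameter alpha, the mode flag, and the speech_active input are assumptions introduced here for clarity.

```python
def combine_mouth_values(visual_mouth, audio_mouth, speech_active,
                         mode="audio_only", alpha=0.5):
    """Sketch of combining visual and audio mouth animation values.

    visual_mouth, audio_mouth: lists of mouth-associated animation values
    (e.g., vertical mouth position, mouth width) for one frame.
    speech_active: output of a speech activity detector for this frame.
    mode and alpha are illustrative parameters, not taken from the patent text.
    """
    if mode == "audio_only":
        # Mouth values taken only from the audio (viseme-based) stream.
        return list(audio_mouth)
    if mode == "weighted":
        # Weighted average of the visual and audio mouth values.
        return [alpha * v + (1.0 - alpha) * a
                for v, a in zip(visual_mouth, audio_mouth)]
    if mode == "speech_gated":
        # While speech is detected, use audio values; otherwise visual ones.
        return list(audio_mouth) if speech_active else list(visual_mouth)
    raise ValueError(f"unknown mode: {mode}")
```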
In other more detailed features of the invention, the tracking of facial features in the sequence of facial image frames of the speaking actor may be performed using bunch graph matching, or using transformed facial image frames generated based on wavelet transformations, such as Gabor wavelet transformations, of the facial images.
Other features and advantages of the present invention should be apparent from the following description of the preferred embodiments taken in conjunction with the accompanying drawings, which illustrate, by way of example, the principles of the invention.
The present invention is embodied in a method, and related apparatus, for generating facial animation values using a sequence of facial image frames and synchronously captured audio data of a speaking actor.
As shown in
The output facial animation values associated with a mouth in the facial animation may be based only on the respective mouth-associated values of the plurality of audio facial animation values. The combination of the visually generated facial animation values and the audio-based mouth animation values provides advantageous display of animated avatars.
Visemes are the visual equivalent of phonemes; that is, visemes are facial expressions associated with temporal speech units in audio voice data. For the English language, it is generally agreed that there may be 15 visemes associated with 43 possible phonemes. Speech analysis and viseme detection may be accomplished with analysis products produced by LIPSinc, Inc., of Morrisville, N.C. (www.lipsinc.com).
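Purely as an illustration, a detected viseme stream might be converted into mouth animation values with a simple lookup; the viseme labels and displacement numbers below are hypothetical and are not taken from the LIPSinc products or from this disclosure.

```python
# Hypothetical viseme-to-mouth-value lookup; the viseme labels and the
# displacement values (relative to a neutral face) are illustrative only.
VISEME_TO_MOUTH = {
    "sil": {"lip_distance": 0.0, "mouth_width": 0.0},   # silence / closed mouth
    "AA":  {"lip_distance": 0.8, "mouth_width": 0.3},   # open vowel
    "EE":  {"lip_distance": 0.3, "mouth_width": 0.7},   # spread lips
    "OO":  {"lip_distance": 0.5, "mouth_width": -0.4},  # rounded lips
    "FV":  {"lip_distance": 0.1, "mouth_width": 0.1},   # f/v lip-teeth contact
}

def visemes_to_mouth_values(viseme_frames):
    """Convert a per-frame viseme sequence into mouth animation values."""
    return [VISEME_TO_MOUTH.get(v, VISEME_TO_MOUTH["sil"]) for v in viseme_frames]
```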
The facial animation values or tags may be displacement values relative to neutral face values. Advantageously, 8 to 22 (or more) facial animation values may be used to define and animate the mouth, eyes, eyebrows, nose, and the head angle. Representative facial animation values for the mouth may include vertical mouth position, horizontal mouth position, mouth width, lip distance, and mouth corner position (left and right).
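A minimal sketch of such a representation, assuming a flat set of named displacement values relative to the neutral face (the field names and the choice of a Python dataclass are illustrative assumptions, not part of this disclosure):

```python
from dataclasses import dataclass

@dataclass
class FacialAnimationValues:
    """Displacements relative to a neutral face; field names are illustrative."""
    mouth_vertical: float = 0.0      # vertical mouth position
    mouth_horizontal: float = 0.0    # horizontal mouth position
    mouth_width: float = 0.0
    lip_distance: float = 0.0
    mouth_corner_left: float = 0.0
    mouth_corner_right: float = 0.0
    # Additional values (eyes, eyebrows, nose, head angle) would follow the
    # same pattern, giving roughly 8 to 22 values in total.
```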
With reference to
Alternatively, the combined values may be based on recursive estimates using a series of the animation values. Accordingly, the output facial animation values associated with a mouth in the facial animation are based on Kalman filtering of the respective mouth-associated values of the plurality of visual facial animation values and the respective mouth-associated values of the plurality of audio facial animation values. The Kalman filtering may be accomplished in accordance with Equations 2-7.
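Equations 2-7 are not reproduced here. As a hedged sketch of the general idea only, a scalar Kalman-style filter can fuse the visual and audio estimates of a single mouth value by treating both as noisy measurements of one underlying state; the noise variances and the random-walk state model below are illustrative assumptions.

```python
def kalman_fuse_mouth_value(visual_seq, audio_seq,
                            process_var=1e-3,
                            visual_var=1e-2, audio_var=5e-3):
    """Scalar Kalman-style fusion of one mouth animation value over time.

    visual_seq, audio_seq: per-frame measurements of the same mouth value
    from the visual tracker and from the audio (viseme) analysis.
    The variances are illustrative assumptions, not values from the patent.
    """
    x, p = 0.0, 1.0  # state estimate and its variance (start at neutral face)
    fused = []
    for v_meas, a_meas in zip(visual_seq, audio_seq):
        # Predict: random-walk model, so the estimate carries over while
        # its uncertainty grows by the process variance.
        p += process_var
        # Update with the visual measurement.
        k = p / (p + visual_var)
        x += k * (v_meas - x)
        p *= (1.0 - k)
        # Update with the audio measurement.
        k = p / (p + audio_var)
        x += k * (a_meas - x)
        p *= (1.0 - k)
        fused.append(x)
    return fused
```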
With reference to
The switches, S1 and S2, may be controlled by a Speech Activity Detector 22 (SAD). The operation of the SAD is described with reference to FIG. 4. The audio voice data 24 is filtered by a low-pass filter (step 26), and audio features are computed to separate speech activity from background noise (step 28). The background noise may be characterized to minimize its effect on the SAD. The noise and audio speech indications 30 are temporally smoothed to decrease the effect of spurious detections of audio speech (step 32).
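A minimal sketch of such a detector, assuming a frame-energy feature compared against an adaptive noise floor with temporal smoothing (the filter order, cutoff, threshold ratio, and window lengths are assumptions, not values from this disclosure):

```python
import numpy as np
from scipy.signal import butter, lfilter

def detect_speech(audio, sample_rate, frame_ms=20,
                  cutoff_hz=3000, threshold_ratio=3.0, smooth_frames=5):
    """Per-frame speech/no-speech decisions from raw audio samples.

    The low-pass cutoff, energy threshold, and smoothing window are
    illustrative assumptions, not parameters from the patent.
    """
    # Low-pass filter the audio voice data (step 26).
    b, a = butter(4, cutoff_hz / (sample_rate / 2), btype="low")
    filtered = lfilter(b, a, audio)

    # Frame-wise energy as the audio feature (step 28).
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(filtered) // frame_len
    frames = filtered[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)

    # Characterize the background noise from the quietest frames.
    noise_floor = np.percentile(energy, 10)
    raw_decisions = energy > threshold_ratio * noise_floor

    # Temporal smoothing to suppress spurious detections (step 32): a frame
    # counts as speech only if most frames in its window are speech.
    kernel = np.ones(smooth_frames) / smooth_frames
    smoothed = np.convolve(raw_decisions.astype(float), kernel, mode="same")
    return smoothed > 0.5
```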
The tracking of facial features in the sequence of facial image frames of the speaking actor may be performed using bunch graph matching, or using transformed facial image frames generated based on wavelet transformations, such as Gabor wavelet transformations, of the facial image frames. Wavelet-based tracking techniques are described in U.S. Pat. No. 6,272,231. The wavelet-based sensing allows tracking of a person's natural characteristics without introducing unnatural elements that interfere with those characteristics. Existing methods of facial feature sensing typically use markers that are glued to a person's face. The use of markers for facial motion capture is cumbersome and has generally restricted the use of facial motion capture to high-cost applications such as movie production. The entire disclosure of U.S. Pat. No. 6,272,231 is hereby incorporated herein by reference. The techniques of the invention may be accomplished using generally available image processing systems.
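As an illustrative sketch of the kind of wavelet-based sensing involved (a general Gabor-jet computation, not the specific method of U.S. Pat. No. 6,272,231), a stack of Gabor filter responses can be computed at a candidate feature location and compared across frames; the kernel sizes, wavelengths, and similarity measure below are assumptions.

```python
import numpy as np

def gabor_kernel(size=31, wavelength=8.0, theta=0.0, sigma=4.0):
    """Complex Gabor kernel; the size and parameters are illustrative choices."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    rotated = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * sigma ** 2))
    carrier = np.exp(1j * 2.0 * np.pi * rotated / wavelength)
    return envelope * carrier

def gabor_jet(image, x, y, orientations=8, wavelengths=(4.0, 8.0, 16.0)):
    """Stack of Gabor responses (a 'jet') at pixel (x, y) of a grayscale image.

    Assumes (x, y) lies at least half a kernel width away from the border.
    """
    size = 31
    half = size // 2
    patch = image[y - half:y + half + 1, x - half:x + half + 1].astype(float)
    jet = []
    for lam in wavelengths:
        for k in range(orientations):
            kernel = gabor_kernel(size, lam, theta=np.pi * k / orientations)
            jet.append(np.sum(patch * kernel))  # inner product with the kernel
    return np.array(jet)

def jet_similarity(jet_a, jet_b):
    """Magnitude-based similarity used to match a feature across frames."""
    a, b = np.abs(jet_a), np.abs(jet_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```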
Although the foregoing discloses the preferred embodiments of the present invention, it is understood that those skilled in the art may make various changes to the preferred embodiments without departing from the scope of the invention. The invention is defined only by the following claims.
This is a continuation-in-part of U.S. patent application Ser. No. 09/871,370, filed May 31, 2001, which is a continuation of U.S. patent application Ser. No. 09/188,079, filed Nov. 6, 1998, now U.S. Pat. No. 6,272,231, which claims priority from U.S. Provisional Application No. 60/081,615, filed Apr. 13, 1998.
Number | Date | Country
---|---|---
20020118195 A1 | Aug 2002 | US

Number | Date | Country
---|---|---
60081615 | Apr 1998 | US

Relation | Number | Date | Country
---|---|---|---
Parent | 09188079 | Nov 1998 | US
Child | 09871370 | | US

Relation | Number | Date | Country
---|---|---|---
Parent | 09871370 | May 2001 | US
Child | 09929516 | | US