1. Field of the Invention
The present invention relates generally to the field of animation and more particularly to automatic generation of animation using pre-rendered images with lip sync from a text file containing a particular message.
2. Description of the Prior Art
It is well known in the art to animate humans and animals so that they execute various human-like gestures. It is also known in the art to synchronize animated mouth movements when an animated character talks, sings or otherwise makes audible mouth sounds. This is known in the trade as lip synchronization or simply lip sync, and various commercially available software can take an input sound file and return a set of mouth shapes as outputs along with matching time points. These mouth shapes can then be used with an animated character at the specific time points.
Typically, the rules for lip sync, whether provided by software or generated by hand, associate particular spoken sounds with particular mouth shapes.
There are numerous other rules and techniques known in the art for lip sync or for causing an animated character to mouth words. In addition to these rules, timing is important. Commercial lip sync software is available that returns a set of mouth shapes and a corresponding time frame, both based on the sound file. For example, the word “hello” spoken at normal speed might have a time frame output like:
0 sec.-0.1 sec.
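By way of a hedged illustration only, such a timing output might be represented as a list of time intervals, each paired with a mouth shape. The Python sketch below is hypothetical; the interval boundaries and shape labels are illustrative and do not correspond to any particular commercial lip sync software.

# Hypothetical representation of a lip sync timing output for the word
# "hello"; the intervals and mouth-shape labels are illustrative only.
hello_timing = [
    {"start": 0.00, "end": 0.10, "mouth_shape": "E"},
    {"start": 0.10, "end": 0.15, "mouth_shape": "L"},
    {"start": 0.15, "end": 0.30, "mouth_shape": "O"},
    {"start": 0.30, "end": 0.40, "mouth_shape": "U"},
]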
As stated, it is also known in the art to cause an animated character to make human-like gestures. For example, in saying “hello”, the character might execute a waving gesture.
Prior art techniques generally use hand-drawn animation for characters and movement, with mouth movements also drawn in. Automated prior art computes movement on-the-fly and renders it with a rendering engine, which leads to inferior rendering quality. It would be advantageous to have a system and method that could take an input sound file and a basic set of animation frames and compose a complete animation sequence, including gestures taken from hundreds of pre-rendered images, with the correct lip sync mouth movements based on the sound file. It would also be advantageous to have a system and method that could take an input text file along with a choice of an animated character and generate a complete animated sequence with video and audio components, including gesturing and correct lip sync, using images from a large set of high quality renderings.
The present invention relates to a system and method for generating animation sequences from either input sound files, input text files or both. One embodiment of the present invention takes an animation sequence of a particular character performing a gesture (such as waving hello) and a sound file, and produces a complete animation sequence with sound and correct lip synchronization using high quality rendered images. Another embodiment takes an input text file, decodes it to determine one or more gestures, produces a sound file, and then outputs a complete animated sequence with a chosen animation character performing the one or more gestures and mouthing the spoken sounds with correct lip synchronization. Still another embodiment allows entry of a sound file containing multiple spoken gesture keywords. This file can be converted to text or searched for keywords as an audio file. Finally, gestures can be sequenced, and the final animation sequence produced with correct lip synchronization for each gesture present.
The present invention can produce very high quality animations very quickly because it chooses from a stored database of high-quality renderings. These renderings cannot be generated on-the-fly; rather, they take a very large amount of time to produce. With numerous high-quality renderings stored in the database at run time, the composition engine can simply choose the best renderings in a very short amount of time as they are needed. This allows the present invention to produce, in only seconds, very high-quality animation files using renderings that took hours to prepare.
Several drawings and figures are presented to illustrate features of the present invention:
Several diagrams, animations, and drawings have been presented to aid in understanding the present invention. The scope of the present invention is not limited to what is shown in the figures.
The present invention relates to a system and method for generating animation sequences from either input sound files or input text files. A first embodiment of the present invention takes an animation sequence of a particular character performing a gesture (such as waving hello) and a sound file, and produces a complete animation sequence with sound and correct lip sync.
The present invention chooses from a stored database of high-quality renderings. As previously stated, these renderings cannot be generated on-the-fly; rather, they take a very large amount of time to produce. With numerous high-quality renderings stored in the database at run time, the composition engine can simply choose the best renderings in a very short amount of time as they are needed.
Turning to
As an example, the input sound file might contain the word “Hello”. The basic animation is just a girl waving hello without moving her lips. The animation is typically created by an animator. A user wants to create the same animation for his website, but with the girl saying hello. The user can supply a sound file of a girl saying hello, or one can be recorded. The user can upload the sound file from a remote location over a network such as the Internet and choose that particular animation (out of perhaps several possible choices) from menus that appear on his computer screen. The system of
After the uploaded sound file has been analyzed by software such as the lip sync software, and the complete set of frames with all the possible mouth shapes at every point in the gesture has been obtained, the composition engine uses the timing output from the lip sync software to calculate which frame must be picked for each final animation frame.
0 sec.-0.1 sec.
The eight final frames in this example are: 0.00 sec. F1-E, 0.05 sec. F2-E, 0.10 sec. F3-L, 0.15 sec. F4-O, 0.20 sec. F5-O, 0.25 sec. F6-O, 0.30 sec. F7-U, and 0.35 sec. F8-U, as shown in
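As a non-limiting sketch of how such a selection might be implemented, assume a frame database keyed by frame position and mouth shape, and a final frame spacing of 0.05 seconds. The names frame_db, FRAME_INTERVAL, shape_at and compose below are hypothetical, and the timing intervals simply repeat the illustrative values used earlier.

# Hedged sketch: selecting pre-rendered frames from the lip sync timing.
# FRAME_INTERVAL, frame_db and the 0.05 sec. frame spacing are assumptions
# for illustration; the actual database layout and frame rate may differ.
FRAME_INTERVAL = 0.05  # seconds per final animation frame

# (start, end, mouth shape) intervals from the lip sync software for the
# word "hello" (illustrative values consistent with the example above).
hello_cues = [(0.00, 0.10, "E"), (0.10, 0.15, "L"),
              (0.15, 0.30, "O"), (0.30, 0.40, "U")]

# frame_db[(frame_index, mouth_shape)] -> pre-rendered image identifier;
# each frame of the gesture is stored once per possible mouth shape.
frame_db = {(i, s): f"F{i}-{s}"
            for i in range(1, 9) for s in ("E", "L", "O", "U")}

def shape_at(cues, t):
    """Mouth shape active at time t according to the lip sync output."""
    for start, end, shape in cues:
        if start <= t < end:
            return shape
    return cues[-1][2]  # hold the last shape past the final cue

def compose(cues, num_frames):
    """Pick one pre-rendered image for each final animation frame."""
    return [frame_db[(i + 1, shape_at(cues, i * FRAME_INTERVAL))]
            for i in range(num_frames)]

# -> ['F1-E', 'F2-E', 'F3-L', 'F4-O', 'F5-O', 'F6-O', 'F7-U', 'F8-U']
print(compose(hello_cues, 8))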
A second embodiment of the present invention takes an input text file, decodes it to determine one or more gestures and what sounds should be produced, produces a sound component, chooses animation frames from a large set of pre-rendered images, and then outputs a complete animated sequence with a chosen animation character performing the one or more gestures and mouthing the spoken sounds with correct lip sync.
Turning to
A sound file 1 can be separately supplied as in the first embodiment, or it can be generated from the text file using techniques known in the art (text to voice). This basic sound file can be enhanced by adjusting accents or stress points from templates of the predefined keywords. Also, punctuation in the text file can be used to get the accents and rising and falling pitch correct. For example, the text “Hello!” might be pronounced differently than the text “Hello.” The text file can optionally be accented to show stress points, for example: h e l l o′, where the pitch rises on the last syllable, or: H e′ l l o, where the pitch drops on the last syllable. A question mark in the file might show that the last word has a higher pitch; for example, “Do you want the best deal in town?” requires the word “town” to have a higher pitch than the other words. In some cases, the sound file 1 may need to be adjusted by a human after it is generated.
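As a minimal, hedged sketch of the punctuation rule described above, the fragment below tags the last word of a sentence with a pitch hint; the hint labels and the idea of tagging only the final word are assumptions for illustration, and real text-to-voice engines expose their own prosody controls.

# Hedged sketch: deriving a simple pitch hint from trailing punctuation.
# The hint labels ("rising", "falling", "stressed") are illustrative only.
def pitch_hints(text):
    words = text.strip().rstrip("!?.").split()
    if not words:
        return []
    hints = [(w, "neutral") for w in words]
    last_word, _ = hints[-1]
    if text.rstrip().endswith("?"):
        hints[-1] = (last_word, "rising")    # e.g. "town" gets a higher pitch
    elif text.rstrip().endswith("!"):
        hints[-1] = (last_word, "stressed")
    elif text.rstrip().endswith("."):
        hints[-1] = (last_word, "falling")
    return hints

# [..., ('in', 'neutral'), ('town', 'rising')]
print(pitch_hints("Do you want the best deal in town?"))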
Once the sound file 1 is supplied or generated, it can be fed to the composition engine 3 and to the lip sync software 4 as in the previous embodiment. The chosen keywords are used to pick pre-stored animation sequences from the skeleton frame database 2. The composition engine 3 can then produce a final animation with the correct sequence of gestures and the correct mouthed words with lip sync. The complete animation 5 can be stored in an output file 6 as before, and also transmitted to the user over the network for use on their website.
The system of the present invention can be stored on a server that is accessible over a network such as the Internet. A user on a computer with a browser can access the system in order to generate animation sequences. Under the control of various menus, boxes and the like, the user can be guided through the process. The user could first be shown a catalog of possible animation characters. These would be characters that have libraries of gesture frames stored for them. The user could choose one or more such characters.
Next, the user might be asked to enter text into a textbox. Alternatively, the user could be shown a library of pre-stored phrases to choose from. These pre-stored phrases can already have completed animation sequences stored for them, or at least templates that allow basic animation sequences to be generated for a particular chosen animation character. Pre-stored phrases could also have associated sound files ready for use. The server can store numerous large sets of high quality pre-rendered images for various animation characters.
If the user chooses freeform text entry, then the system can attempt to parse it and find gesture keywords using a parsing engine. The gesture keywords can have sound bites associated with them for composition into a final sound file. Also, the user could be asked to upload a sound file.
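A minimal sketch of such a parsing step is given below, assuming a hypothetical table that maps gesture keywords to stored gesture sequences; an actual parsing engine could use more sophisticated matching.

# Hedged sketch: scanning freeform text for gesture keywords. The keyword
# table and gesture names are hypothetical.
GESTURE_KEYWORDS = {
    "hello": "wave_hello",
    "goodbye": "wave_goodbye",
    "yes": "nod",
    "no": "shake_head",
}

def find_gestures(text):
    """Return (word_position, gesture) pairs in the order they appear."""
    gestures = []
    for position, raw_word in enumerate(text.lower().split()):
        word = raw_word.strip(".,!?\"'")
        if word in GESTURE_KEYWORDS:
            gestures.append((position, GESTURE_KEYWORDS[word]))
    return gestures

# -> [(0, 'wave_hello')]
print(find_gestures("Hello! Do you want the best deal in town?"))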
Once a sound file is present or generated, and the gesture sequence is known, the composition engine 3 can put together a connected sequence of gestures that is timed to the location of the keywords in the sound file. Finally, different parts of the sound file that correspond to different keywords can be fed to the lip sync software 4 to generate mouth shapes and timing for each separate gesture. The composition engine 3 can then create a connected, smooth-flowing complete animation sequence corresponding to the entered text. The final output file 6 can be downloaded to the user's computer in a format usable on a webpage or playable on the user's browser.
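One possible, non-limiting way to organize this step is sketched below; run_lip_sync and compose_gesture are stubs standing in for the lip sync software 4 and the frame selection performed by the composition engine 3, and all names and data here are hypothetical.

# Hedged sketch: timing gestures to keyword positions in the sound file and
# assembling the final sequence. The stubs below stand in for the lip sync
# software and the frame-selection step; they are not real library calls.
def run_lip_sync(sound_file, start, end):
    """Stub: would return (start, end, mouth shape) cues for that segment."""
    return [(start, end, "E")]

def compose_gesture(gesture, cues):
    """Stub: would return the pre-rendered frames chosen for this gesture."""
    return [f"{gesture}-frame-{i}" for i, _ in enumerate(cues, start=1)]

def build_animation(keyword_spans, sound_file):
    """keyword_spans: list of (start_sec, end_sec, gesture) for each keyword."""
    timeline = []
    for start, end, gesture in sorted(keyword_spans):
        cues = run_lip_sync(sound_file, start, end)   # per-keyword sound segment
        timeline.append((start, compose_gesture(gesture, cues)))
    return timeline

# -> [(0.0, ['wave_hello-frame-1'])]
print(build_animation([(0.0, 0.4, "wave_hello")], "hello.wav"))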
A third embodiment of the present invention allows the user to upload a sound file containing multiple gesture keywords. This sound file can be searched for gesture keywords either using filters in the audio domain or by converting the sound file to a text file using techniques known in the art (voice recognition, or sound to text). The generated text file can be searched for the keywords. A final animation sequence can then be generated from the keyword list and sound file as in the previous embodiment.
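By way of a hedged illustration, the text-conversion path might be organized as below; transcribe is a placeholder for any voice recognition routine and is not a real library call.

# Hedged sketch: finding gesture keywords in an uploaded sound file by first
# converting it to text. transcribe() is a placeholder only.
def transcribe(sound_file):
    """Placeholder for voice recognition (sound to text)."""
    return "hello and welcome"

def keywords_in_sound(sound_file, keyword_table):
    text = transcribe(sound_file)
    return [keyword_table[w] for w in text.lower().split() if w in keyword_table]

# -> ['wave_hello']
print(keywords_in_sound("greeting.wav", {"hello": "wave_hello"}))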
Any of these embodiments can run on any computer, especially a server on a network. The server typically has at least one processor executing computer instructions stored in a memory to transform data stored in the memory. A communications module connects the server to a network like the Internet or any other network, by wire, fiber optics, or wirelessly such as by WiFi or cellular telephone. The network can be any type of network, including a cellular telephone network.
The present invention transforms simple word data and images from pre-rendered sets of hundreds of images into a completed animation sequence with sound and lip sync. The final product is a totally new form that requires considerable computation to achieve.
Several descriptions and illustrations have been provided to aid in understanding the present invention. One with skill in the art will realize that numerous changes and variations may be made without departing from the spirit of the invention. Each of these changes and variations is within the scope of the present invention.