The following relates generally to computer animation and more specifically to a system and method for animated lip synchronization.
Facial animation tools in industrial practice have remained remarkably static, typically using animation software such as MAYA™ to animate a 3D facial rig, often with simple interpolation between an array of target blend shapes. More principled rigs are anatomically inspired, with a skeletally animated jaw and target shapes representing various facial muscle action units (FACS), but the onus of authoring the detail and complexity necessary for human nuance and physical plausibility remains tediously in the hands of the animator.
While professional animators may have the ability, budget and time to bring faces to life with a laborious workflow, the results produced by novices using these tools, or by existing procedural or rule-based animation techniques, are generally less flattering. Procedural approaches to automate aspects of facial animation such as lip-synchronization, despite showing promise in the early 1990s, have not kept pace in quality with the complexity of modern facial models. On the other hand, facial performance capture has achieved such a level of quality that it is a viable alternative for production facial animation. As with all performance capture, however, it has several shortcomings, for example: the animation is limited by the capabilities of the human performer, whether physical, technical or emotional; subsequent refinement is difficult; and partly hidden anatomical structures that play a part in the animation, such as the tongue, have to be animated separately.
A technical problem is thus to produce animator-centric procedural animation tools that are comparable to, or exceed, the quality of performance capture, and that are easy to edit and refine.
In an aspect, there is provided a method for animated lip synchronization executed on a processing unit, the method comprising: mapping phonemes to visemes; synchronizing the visemes into viseme action units, the viseme action units comprising jaw and lip contributions for each of the phonemes; and outputting the viseme action units.
In a particular case, the method further comprising capturing speech input; parsing the speech input into the phonemes; and aligning the phonemes to the corresponding portions of the speech input.
In a further case, aligning the phonemes comprises one or more of phoneme parsing and forced alignment.
In another case, two or more viseme action units are co-articulated such that the respective two or more visemes are approximately concurrent.
In yet another case, the jaw contributions and the lip contributions are respectively synchronized to independent visemes, and wherein the viseme action units are a linear combination of the independent visemes.
In yet another case, the jaw contributions and the lip contributions are each respectively synchronized to activations of one or more facial muscles in a biomechanical muscle model such that the viseme action units represent a dynamic simulation of the biomechanical muscle model.
In yet another case, mapping the phonemes to visemes comprises at least one of mapping a start time of at least one of the visemes to be prior to an end time of a previous respective viseme and mapping an end time of at least one of the visemes to be after a start time of a subsequent respective viseme.
In yet another case, a start time of at least one of the visemes is at least 120 ms before the respective phoneme is heard, and an end time of at least one of the visemes is at least 120 ms after the respective phoneme is heard.
In yet another case, a start time of at least one of the visemes is at least 150 ms before the respective phoneme is heard, and an end time of at least one of the visemes is at least 150 ms after the respective phoneme is heard.
In yet another case, viseme decay of at least one of the visemes begins between seventy-percent and eighty-percent of the completion of the respective phoneme.
In yet another case, an amplitude of each viseme is determined by one or more of lexical stress and word prominence.
In yet another case, the viseme action units further comprise tongue contributions for each of the phonemes.
In yet another case, the viseme action unit for a neutral pose comprises a viseme mapped to a bilabial phoneme.
In yet another case, the method further comprising outputting a phonetic animation curve based on the change of viseme action units over time.
In another aspect, there is provided a system for animated lip synchronization, the system having one or more processors and a data storage device, the one or more processors in communication with the data storage device, the one or more processors configured to execute: a correspondence module for mapping phonemes to visemes; a synchronization module for synchronizing the visemes into viseme action units, the viseme action units comprising jaw and lip contributions for each of the phonemes; and an output module for outputting the viseme action units to an output device.
In a particular case, the system further comprising an input module for capturing speech input received from an input device, the input module parsing the speech input into the phonemes; and an alignment module for aligning the phonemes to the corresponding portions of the speech input.
In another case, the system further comprising a speech analyzer module for analyzing one or more of pitch and intensity of the speech input.
In yet another case, the alignment module aligns the phonemes by at least one of phoneme parsing and forced alignment.
In yet another case, the output module further outputs a phonetic animation curve based on the change of viseme action units over time.
In another aspect, there is provided a facial model for animation on a computing device, the computing device having one or more processors, the facial model comprising: a neutral face position; an overlay of skeletal jaw deformation, lip deformation and tongue deformation; and a displacement of the skeletal jaw deformation, the lip deformation and the tongue deformation by a linear blend of weighted blend-shape action units.
These and other aspects are contemplated and described herein. It will be appreciated that the foregoing summary sets out representative aspects of systems and methods for animated lip synchronization to assist skilled readers in understanding the following detailed description.
The features of the invention will become more apparent in the following detailed description in which reference is made to the appended drawings wherein:
Embodiments will now be described with reference to the figures. For simplicity and clarity of illustration, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements. In addition, numerous specific details are set forth in order to provide a thorough understanding of the embodiments described herein. However, it will be understood by those of ordinary skill in the art that the embodiments described herein may be practiced without these specific details. In other instances, well-known methods, procedures and components have not been described in detail so as not to obscure the embodiments described herein. Also, the description is not to be considered as limiting the scope of the embodiments described herein.
Various terms used throughout the present description may be read and understood as follows, unless the context indicates otherwise: “or” as used throughout is inclusive, as though written “and/or”; singular articles and pronouns as used throughout include their plural forms, and vice versa; similarly, gendered pronouns include their counterpart pronouns so that pronouns should not be understood as limiting anything described herein to use, implementation, performance, etc. by a single gender; “exemplary” should be understood as “illustrative” or “exemplifying” and not necessarily as “preferred” over other embodiments. Further definitions for terms may be set out herein; these may apply to prior and subsequent instances of those terms, as will be understood from a reading of the present description.
Any module, unit, component, server, computer, terminal, engine or device exemplified herein that executes instructions may include or otherwise have access to computer readable media such as storage media, computer storage media, or data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape. Computer storage media may include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data. Examples of computer storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by an application, module, or both. Any such computer storage media may be part of the device or accessible or connectable thereto. Further, unless the context clearly indicates otherwise, any processor or controller set out herein may be implemented as a singular processor or as a plurality of processors. The plurality of processors may be arrayed or distributed, and any processing function referred to herein may be carried out by one or by a plurality of processors, even though a single processor may be exemplified. Any method, application or module herein described may be implemented using computer readable/executable instructions that may be stored or otherwise held by such computer readable media and executed by the one or more processors.
As used herein, the term “viseme” means ‘visible phoneme’ and refers to the shape of the mouth at approximately the apex of a given phoneme. A viseme is understood to mean a facial image that can be used to describe a particular sound; that is, a viseme is the visual equivalent of a phoneme, or unit of sound, in spoken language.
Further, for the purposes of the following disclosure, the relevant phonemic notation is as follows:
/AX/ (about)
/b/ (bin)
/C/ (chin)
/d/ (din)
/D/ (them)
/f/ (fin)
/g/ (gain)
/h/ (hat)
/J/ (jump)
/k/ (kin)
/l/ (limb)
/m/ (mat)
/n/ (nap)
/p/ (pin)
/r/ (ran)
/s/ (sin)
/S/ (shin)
/t/ (tin)
/T/ (thin)
/v/ (van)
/w/ (wet)
/y/ (yet)
/z/ (zoo)
The following relates generally to computer animation and more specifically to a system and method for animated lip synchronization.
Generally, prior techniques for computer animation of mouth poses rely on dividing a speech segment into its phonemes and then producing an animation for each phoneme; for example, creating a viseme for each phoneme and then applying the visemes to the given speech segment. Typically, such techniques would unnaturally transition from a neutral face straight to the viseme animation. Additionally, such techniques typically assume that each phoneme has a unique physical animation represented by a unique viseme.
However, prior techniques that follow this approach do not accurately represent realistic visual depictions of speech. As an example, a ventriloquist can produce many words and phonemes with very minimal facial movement, and thus with atypical visemes. As such, these conventional approaches are not able to automatically generate expressive lip-synchronized facial animation that is based not only on certain unique phonetic shapes, but also on other visual characteristics of a person's face during speech. As an example of the substantial advantage of the method and system described herein, animation of speech can advantageously be based on the visual characteristics of a person's jaw and lips during speech. The system described herein is able to generate different animated visemes for a certain phonetic shape based on jaw and lip parameters; for example, reflecting how the audio signal changes the way a viseme looks.
In embodiments of the system and method described herein, technical approaches are provided to solve the technological computer problem of realistically representing and synchronizing computer-based facial animation to sound and speech. In embodiments herein, technical solutions are provided such that given an input audio soundtrack, and in some cases a speech transcript, there is automatic generation of expressive lip-synchronized facial animation that is amenable to further artistic refinement. The systems and methods herein draw from psycholinguistics to capture speech using two visually distinct anatomical actions: those of the jaw and lip. In embodiments herein, there is provided construction of a transferable template 3D facial rig.
Turning to
In the context of speech synchronization, an example of a substantial technical problem is that given an input audio soundtrack and speech transcript, there is a need to generate a realistic, expressive animation of a face with lip and jaw, and in some cases tongue, movements that synchronize with an audio soundtrack. In some cases, beyond producing realistic output, such a system should integrate with the traditional animation pipeline, including the use of motion capture, blend shapes and key-framing. In further cases, such a system should allow animator editing of the output. While preserving the ability of animators to tune final results, other non-artistic adjustments may be necessary in speech synchronization to deal with, for example, prosody, mispronunciation of text, and speech affectations such as slurring and accents. In yet further cases, such a system should respond to editing of the speech transcript to account for speech anomalies. In yet further cases, such a system should be able to produce realistic facial animation on a variety of face rigs.
For the task of speech synchronization, the system 200 can aggregate its attendant facial motions into two independent categories: functions related to jaw motion, and functions related to lip motion (see
Turning to
In some cases, at block 304, the alignment module 204 employs forced alignment to align utterances in the soundtrack to the text, giving an output time series containing a sequence of phonemes.
At block 306, the correspondence module 206 combines audio, text and alignment information to produce text-to-phoneme and phoneme-to-audio correspondences.
At block 308, the synchronization module 208 computes lip-synchronization viseme action units. The lip-synchronization viseme action units are computed by extracting jaw and lip motions for individual phonemes. However, humans do not generally articulate each phoneme separately. Thus, at block 310, the synchronization module 208 blends the corresponding visemes into co-articulated action units. As such, the synchronization module 208 is advantageously able to more accurately track real human speech.
At block 312, the output module 210 outputs the synchronized co-articulated action units to the output device 226.
In some cases, the speech input can include at least one of a speech audio and a speech transcript.
In some cases, as described in greater detail herein, two or more viseme action units can be co-articulated such that the respective two or more visemes are approximately concurrent.
In some cases, jaw behavior and lip behavior can be captured as independent viseme shapes. As such, jaw and lip intensity can be used to modulate the blend-shape weight of the respective viseme shape. In this case, the viseme action units are a linear combination of the modulated viseme shapes. In other words, the jaw contributions and the lip contributions can be respectively synchronized to independent visemes, and the viseme action units can be a linear combination of the independent visemes.
In some cases, the jaw contributions and the lip contributions can each respectively be synchronized to activations of one or more facial muscles in a biomechanical muscle model. In this way, the viseme action units represent a dynamic simulation of the biomechanical muscle model.
In some cases, viseme action units can be determined by manually setting jaw and lip values over time by a user via the input device 222. In other cases, the viseme action units can be determined by receiving the lip contributions via the input device 222, with the jaw contributions determined from the modulation of volume of the input speech audio. In other cases, the lip contributions and the jaw contributions can be automatically determined by the system 200 from input speech audio and/or an input speech transcript.
In some cases, as described in greater detail herein, mapping the phonemes to visemes can include at least one of mapping a start time of at least one of the visemes to be prior to an end time of a previous respective viseme and mapping an end time of at least one of the visemes to be after a start time of a subsequent respective viseme.
In some cases, as described in greater detail herein, a start time of at least one of the visemes is at least 120 ms before the respective phoneme is heard, and an end time of at least one of the visemes is at least 120 ms after the respective phoneme is heard.
In some cases, as described in greater detail herein, a start time of at least one of the visemes is at least 150 ms before the respective phoneme is heard, and an end time of at least one of the visemes is at least 150 ms after the respective phoneme is heard.
In some cases, as described in greater detail herein, viseme decay of at least one of the visemes begins between seventy-percent and eighty-percent of the completion of the respective phoneme.
In the following, Applicant details an exemplary development and validation of the JALI model according to embodiments of the system and method described herein. Applicant then demonstrates how the JALI model can be constructed over a typical FACS-based 3D facial rig and transferred across such rigs. Further, Applicant provides a system implementation of an automated lip-synchronization approach, according to an embodiment herein.
Computer facial animation can be broadly classified as procedural, data-driven, or performance-capture. Procedural speech animation segments speech into a string of phonemes, which are then mapped by rules or look-up tables to visemes; typically many-to-one. As an example, / m b p / all map to the viseme MMM in
Procedural animation techniques generally produce compact animation curves amenable to refinement by animators; however, such approaches are not as useful for expressive realism as data-driven and performance-capture approaches. However, neither procedural animation, nor data-driven and performance-capture approaches, explicitly model speech styles; namely the continuum of viseme shapes manifested by intentional variations in speech. Advantageously, such speech styles are modelled by the system and method described herein.
Data-driven methods smoothly stitch together pieces of facial animation data from a large corpus to match an input speech track. Multi-dimensional morphable models, hidden Markov models, and active appearance models (AAM) have been used to capture facial dynamics. For example, the AAM-based Dynamic Visemes approach uses clustered sets of related visemes gathered through analysis of the TIMIT corpus. Data-driven methods have also been used to drive a physically-based or statistically-based model. However, the quality of data-driven approaches is often limited by the data available; many statistical models drive the face directly, disadvantageously taking ultimate control away from an animator.
Performance-capture based speech animation transfers acquired motion data from a human performer onto a digital face model. Performance-capture approaches generally build on real-time performance-based facial animation and, while often not specifically focused on speech, are able to create facial animation. One conventional approach uses a pre-captured database to correct performance capture with a deep neural network trained to extract phoneme probabilities from audio input in real time using an appropriate sensor. A substantial disadvantage of performance-capture approaches is that the result is limited by the captured actor's abilities and is difficult for an animator to refine.
The JALI viseme model, according to an embodiment herein, is driven by the directly observable bioacoustics of sound production using a mixture of diaphragm, jaw, and lip. The majority of variation in visual speech is accounted for by jaw, lip and tongue motion. While trained ventriloquists are able to speak entirely using their diaphragm with little observable facial motion, most people typically speak using a mix of independently controllable jaw and lip facial action. The JALI model simulates visible speech as a linear mix of jaw-tongue (with minimal face muscle) action and face-muscle action values. The absence of any JA (jaw) and LI (lip) action is not a static face but one perceived as poor-ventriloquy or mumbling. The other extreme is hyper-articulated screaming (see, for example,
Conventional animation of human speech is based on a mapping from phonemes to visemes, such as the two labiodental phonemes /f v/ mapping to a single FFF viseme, shown in
Visemes corresponding to five arbitrarily-chosen speaking styles for the phoneme /AO/ in ‘thOUght’ performed by an actor are shown in
Applicant recognized the substantial advantage of using a JALI viseme field to provide a controllable abstraction over expressive speech animation of the same phonetic content. As described herein, the JALI viseme field setting over time, for a given performance, can be extracted plausibly through analysis of the audio signal. In the systems and methods described herein, a combination of the JALI model with lip-synchronization, described herein, can animate a character's face with considerable realism and accuracy.
In an embodiment, as shown in
A conventional facial rig often has individual blend-shapes for each viseme, usually with a many-to-one mapping from phonemes to visemes, or many-to-many using dynamic visemes. In contrast, a JALI-rigged character, according to the system and method described herein, may require that such visemes be separated to capture sound production and shaping as a mixed contribution of the jaw, tongue and facial muscles that control the lips. As such, the face geometry is a composition of a neutral face nface, overlaid with skeletal jaw and tongue deformation jd; td, displaced by a linear blend of weighted blend-shape action unit displacements au; thus, face=nface+jd+td+au.
To create a viseme within the 2D field defined by JA and LI for any given phoneme p, the geometric face(p) can be set for any point JA,LI in the viseme field of p to be:
face(p;JA;LI)=nface+JA*(jd(p)+td(p))+LI*au(p)
where jd(p), td(p), and au(p) represent an extreme configuration of the jaw, tongue and lip action units, respectively, for the phoneme p. Suppressing both the JA and LI values here would result in a static neutral face, barely obtainable by the most skilled of ventriloquists. Natural speech without JA, LI activation is closer to a mumble or an amateur attempt at ventriloquy.
For an open-jaw neutral pose and ‘ventriloquist singularity’, a neutral face of the JALI model is configured such that the character's jaw hangs open slightly (for example, see
Advantageously, the neutral face according to the system and method described herein is better suited to produce ‘ventriloquist’ visemes (with zero (JA,LI) activation). In some cases, three ‘ventriloquist’ visemes can be used: the neutral face itself (for the bilabials /b m p/), the neutral face with the orbicularis oris superior muscle relaxed (for the labiodentals /f v/), and the neutral face with both orbicularis oris superior and inferior muscles relaxed, with lips thus slightly parted (for all other phonemes). This ‘Ventriloquist Singularity’ at the origin of the viseme field (i.e. (JA,LI)=(0,0)) represents the lowest energy viseme state for any given phoneme.
For any given phoneme p, the geometric face for any point (p, JA, LI) is thus defined as:
face(p;JA;LI)=nface+JA*jd(p)+(vtd(p)+JA*td(p))+(vau(p)+LI*au(p))
where vtd(p) and vau(p) are the small tongue and muscle deformations necessary to pronounce the ventriloquist visemes, respectively.
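By way of non-limiting illustration, the blend equation above can be evaluated directly once the per-phoneme deformations are available; in the following sketch the rig container, its field names, and the use of vertex-displacement arrays are illustrative assumptions rather than a required implementation:

```python
def jali_face(p, JA, LI, rig):
    """Evaluate the JALI viseme field for phoneme p at jaw/lip activations (JA, LI).

    `rig` is a hypothetical container holding, per phoneme, the extreme jaw (jd),
    tongue (td) and lip action-unit (au) displacements, plus the small
    'ventriloquist' tongue (vtd) and muscle (vau) displacements; all are
    vertex-displacement arrays added to the neutral face geometry.
    """
    nface = rig.neutral
    jd, td, au = rig.jd[p], rig.td[p], rig.au[p]
    vtd, vau = rig.vtd[p], rig.vau[p]

    # face(p; JA, LI) = nface + JA*jd(p) + (vtd(p) + JA*td(p)) + (vau(p) + LI*au(p))
    return nface + JA * jd + (vtd + JA * td) + (vau + LI * au)
```

At (JA,LI)=(0,0), this reduces to the ‘ventriloquist’ viseme nface+vtd(p)+vau(p) described above.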
For animated speech, the JALI model provides a layer of speech abstraction over the phonetic structure. The JALI model can be phonetically controlled by traditional keyframing or automatic procedurally generated animation (as described herein). The JALI viseme field can be independently controlled by the animator over time, or automatically driven by the audio signal (as described herein). In an example, for various speaking styles, a single representative set of procedural animation curves for the face's phonetic performance can be used, and only the (JA,LI) controls are varied from one performance to the next.
In another embodiment of a method for animated lip synchronization 900 shown in
In the animation phase 904, the aligned phonemes are mapped to visemes by the correspondence module 206. Viseme amplitudes are set (for articulation) 914. Then the visemes are re-processed 916, by the synchronization module 208, for co-articulation to produce viseme timings and resulting animation curves for the visemes (in an example, a Maya MEL script of sparsely keyframed visemes). These phonetic animation curves can be outputted by the output module 210 to demonstrate how the phonemes are changing over time.
In the output phase 906, the output module 210 drives the animated viseme values on a viseme compatible rig 918 such as that represented by
As an example, pseudocode for the method 900 can include:
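In a non-limiting illustration (the viseme class sets, default power values, and data structures below are assumptions for exposition rather than the exact pseudocode of method 900), the animation phase can key one viseme arc per phoneme, where a duplicate of the preceding viseme extends that viseme's arc:

```python
from dataclasses import dataclass

# Assumed viseme classes and default values, for illustration only.
LIP_HEAVY = {"w", "UW", "OW", "S", "Z", "C", "J"}
TONGUE_ONLY = {"t", "d", "l", "n"}
DEFAULT_POWER, STRESSED_POWER = 7, 10

@dataclass
class Key:
    viseme: str
    power: int
    start: float
    end: float
    onset: float   # rise time before the apex, in seconds
    decay: float   # decay time after the sustained arc, in seconds

def articulate(keys, viseme, power, start, end, onset, decay):
    """Key one viseme arc; a duplicate of the immediately preceding viseme
    extends that viseme's arc instead of adding a new key."""
    if keys and keys[-1].viseme == viseme:
        keys[-1].end = max(keys[-1].end, end)
        keys[-1].power = max(keys[-1].power, power)
        return
    keys.append(Key(viseme, power, start, end, onset, decay))

def animate(phonemes, viseme_of, stressed):
    """phonemes: [(symbol, start, end)] triples from forced alignment;
    viseme_of: many-to-one phoneme-to-viseme map;
    stressed: indices of lexically stressed phonemes."""
    keys = []
    for i, (sym, start, end) in enumerate(phonemes):
        vis = viseme_of[sym]
        power = STRESSED_POWER if i in stressed else DEFAULT_POWER
        onset = decay = 0.120                      # default 120 ms onset/decay
        if vis in LIP_HEAVY:                       # lip-heavy visemes start early, end late
            if i > 0:
                start = phonemes[i - 1][1]
            if i + 1 < len(phonemes):
                end = phonemes[i + 1][2]
            onset = decay = 0.150
        articulate(keys, vis, power, start, end, onset, decay)
        if vis in TONGUE_ONLY and len(keys) >= 2:
            # A tongue-only phoneme articulates the tongue while the mouth
            # holds the previous viseme, so that viseme's arc is extended.
            keys[-2].end = max(keys[-2].end, end)
    return keys
```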
As an example, the method 900 can be used to animate the word “what”. Before animation begins, the speech audio track must first be aligned with the text in the transcript. This can happen in two stages: phoneme parsing 908 then forced alignment 912. Initially, the word ‘what’ is parsed into the phonemes: w 1UX t; then, the forced alignment stage returns timing information: w(2.49-2.54), 1UX(2.54-2.83), t(2.83-3.01). In this case, this is all that is needed to animate this word.
At block 904, the speech animation can be generated. First, ‘w’ maps to a ‘Lip-Heavy’ viseme and thus commences early; in some cases, start time would be replaced with the start time of the previous phoneme, if one exists. The mapping also ends late; in some cases, the end time is replaced with the end time of the next phoneme: ARTICULATE (‘w’, 7, 2.49, 2.83, 150 ms, 150 ms). Next, the ‘Lexically-Stressed’ viseme ‘UX’ (indicated by a ‘1’ in front) is more strongly articulated; and thus power is set to 10 (replacing the default value of 7): ARTICULATE (‘UX’, 10, 2.54, 2.83, 120 ms, 120 ms). Finally, ‘t’ maps to a ‘Tongue-Only’ viseme, and thus articulates twice: 1) ARTICULATE (‘t’, 7, 2.83, 3.01, 120 ms, 120 ms); and then it is replaced with the previous, which then counts as a duplicate and thus extends the previous, 2) ARTICULATE (‘UX’, 10, 2.54, 3.01, 120 ms, 120 ms).
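Running the earlier sketch on this example (with the same illustrative assumptions) reproduces these keys:

```python
phonemes = [("w", 2.49, 2.54), ("UX", 2.54, 2.83), ("t", 2.83, 3.01)]
viseme_of = {"w": "w", "UX": "UX", "t": "t"}       # identity map for this word
keys = animate(phonemes, viseme_of, stressed={1})  # 'UX' carries the lexical stress
for k in keys:
    print(k)
# Key(viseme='w', power=7, start=2.49, end=2.83, onset=0.15, decay=0.15)
# Key(viseme='UX', power=10, start=2.54, end=3.01, onset=0.12, decay=0.12)
# Key(viseme='t', power=7, start=2.83, end=3.01, onset=0.12, decay=0.12)
```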
For the input phase 902, an accurate speech transcript is preferable in order to produce procedural lip synchronization, as extra, missing, or mispronounced words and punctuation can result in poor alignment and cause cascading errors in the animated speech. In some cases, automatic transcription tools may be used, for example, for real-time speech animation. In further cases, manual transcription from the speech recording may be used for ease and suitability. Any suitable transcript text-to-phoneme conversion, for various languages, can be used; as an example, the speech libraries built into Mac™ OS X™ can be used to convert English text into a phonemic representation.
Forced alignment 912 is then used by the alignment module 204 to align the speech audio to its phonemic transcript. Unlike the creation of the speech transcript, this task requires automation and, in some cases, is done by training a Hidden Markov Model (HMM) on speech data annotated with the beginning, middle, and end of each phoneme, and then aligning phonemes to the speech features. Several tools can be employed for this task; for example, the Hidden Markov Model Toolkit (HTK), SPHINX, and FESTIVAL. Using these tools, as an example, Applicant measured alignment misses to be acceptably within 15 ms of the actual timings.
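As a minimal sketch, the aligner output can be reduced to the (phoneme, start, end) triples consumed by the animation phase; the simple ‘symbol start end’ line format shown here is an assumption, as each alignment tool emits its own label format:

```python
def parse_alignment(lines):
    """Parse 'phoneme start end' lines (times in seconds) into triples,
    e.g. 'w 2.49 2.54' -> ('w', 2.49, 2.54)."""
    phonemes = []
    for line in lines:
        sym, start, end = line.split()
        phonemes.append((sym, float(start), float(end)))
    return phonemes

# e.g. parse_alignment(["w 2.49 2.54", "UX 2.54 2.83", "t 2.83 3.01"])
```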
In the animation phase 904, the correspondence module 206 animates a facial rig by producing sparse animation keyframes for visemes. The viseme to be keyframed is determined by the co-articulation model described herein. The timing of the viseme is determined by forced alignment after it has been processed through the co-articulation model. The amplitude of the viseme is determined by lexical and word stresses returned by the phonemic parser. The visemes are built on Action Units (AU), and can thus drive any facial rig (for example, simulated muscle, blend-shape, or bone-based) that has a Facial Action Coding System (FACS) or MPEG-4 FA based control system.
The amplitude of the viseme can be set based on two inputs: Lexical Stress and Word Prominence. These two inputs are retrieved as part of the phonemic parsing. Lexical Stress indicates which vowel sound in a word is emphasized by convention. For example, the word ‘water’ stresses the ‘a’, not the ‘e’, by convention. One can certainly say ‘watER’, but typically people say ‘WAter’. Word Prominence is the de-emphasis of a given word by convention. For example, the ‘of’ in ‘out of work’ has less word prominence than its neighbours. In an example, if a vowel is lexically stressed, the amplitude of that viseme is set to high (e.g., 9 out of 10). If a word is de-stressed, then all visemes in the word are lowered (e.g., 3 out of 10); if a de-stressed word has a stressed phoneme, or if an un-stressed phoneme occurs in a stressed word, then the viseme is set to normal (e.g., 6 out of 10).
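A minimal sketch of this amplitude rule follows; the 9/6/3 values mirror the examples above, while the function and argument names are illustrative assumptions:

```python
def viseme_amplitude(phoneme_stressed, word_destressed):
    """Map lexical stress and word prominence to a viseme amplitude on a 0-10 scale."""
    if phoneme_stressed and not word_destressed:
        return 9   # lexically stressed vowel in a normally stressed word: high
    if word_destressed and not phoneme_stressed:
        return 3   # unstressed phoneme in a de-stressed word: lowered
    return 6       # stressed phoneme in a de-stressed word, or unstressed in a stressed word
```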
For co-articulation 916, timing can be based on the alignment returned by the forced alignment and the results of the co-articulation model. Given the amplitude, the phoneme-to-viseme conversion is processed through a co-articulation model; otherwise, the lips, tongue and jaw would distinctly pronounce each phoneme, which is neither realistic nor expressive. Severe mumbling or ventriloquism makes it clear that coherent audible speech can often be produced with very little visible facial motion, making co-articulation essential for realism.
In the field of linguistics, “co-articulation” is the movement of the articulators to anticipate the next sound or to preserve movement from the last sound. In some cases, the representation of speech can have a few simplifying aspects. First, many phonemes map to a single viseme; for example, the phonemes /AO/ (caught), /AX/ (about), /AY/ (bite), and /AA/ (father) all map to the viseme AHH (see, for example,
For the JALI model for audio-visual synchronized speech, the model can be based on three anatomical dimensions of visible movements: tongue, lips and jaw. Each affects speech and co-articulation in particular ways. The rules for visual speech representation can be based on linguistic categorization and divided into constraints, conventions and habits.
In certain cases, there are four particular constraints of articulation:
The above visual constraints are observable and, for all but a trained ventriloquist, likely necessary to physically produce these phonemes.
In certain cases, there are three speech conventions which influence articulation:
Generally, it takes conscious effort to break the above speech conventions and most common visual speaking styles are influenced by them.
In certain cases, there are nine co-articulation habits that generally shape neighbouring visemes:
A technical problem for speech motion in computerized animation is to be able to optimize both simplicity (for the benefit of the editing animator) and plausibility (for the benefit of the unedited performance).
In general, speech onset begins 120 ms before the apex of the viseme, wherein the apex typically coincides with the beginning of a sound. The apex is sustained in an arc to the point where 75% of the phoneme is complete; viseme decay then begins and takes another 120 ms to reach zero. In further cases, viseme decay can advantageously begin between 70% and 80% of the completion of the respective phoneme. However, there is evidence of variance in onset times for different classes of phonemes and phoneme combinations; for example, empirical measurements of the specific phonemes /m p b f/ in two different states: after a pause (mean range: 137-240 ms) and after a vowel (mean range: 127-188 ms). The JALI model of the system and method described herein can advantageously use context-specific, phoneme-specific mean-time offsets. Phoneme onsets are parameterized in the JALI model, so new empirical measurements of phoneme onsets can be quickly assimilated.
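A minimal sketch of these timing rules follows; the keyframe representation is an illustrative assumption, while the 120 ms onset/decay and the 75% sustain point follow the description above:

```python
def viseme_keyframes(start, end, amplitude, onset=0.120, decay=0.120, sustain=0.75):
    """Return (time, value) keys for one viseme arc: rise to the apex at the
    phoneme start, hold in an arc to `sustain` of the phoneme's duration,
    then decay back to zero."""
    apex = start                               # apex coincides with the start of the sound
    hold = start + sustain * (end - start)     # decay begins at ~75% of the phoneme
    return [
        (apex - onset, 0.0),                   # onset begins ~120 ms before the apex
        (apex, amplitude),
        (hold, amplitude),
        (hold + decay, 0.0),                   # ~120 ms decay back to zero
    ]
```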
In some cases, where phoneme durations are very short, visemes will have a wide influence beyond their direct neighbours. In some cases, visemes can influence mouth shape up to five phonemes away, specifically for lip-protrusion. In an embodiment herein, each mouth shape can actually be influenced by both direct neighbours, since the start of one is the end of another and both are keyed at that point. In further embodiments, as shown in
The Arc is a principle of animation and, in some cases, the system and method described herein can flatten and retain the facial muscle action in one smooth motion arc over duplicated visemes. In some cases, all the phoneme articulations have an exaggerated quality in line with another principle of animation, Exaggeration. This is due to the clean curves and the sharp rise and fall of each phoneme, each simplified and each slightly more distinct from its neighbouring visemes than in real-world speech.
For computing JALI values, according to the system and method described herein, from audio, in the animation phase 904, the JA and LI parameters of the JALI-based character can be animated by examining the pitch and intensity of each phoneme and comparing it to all other phonemes of the same class uttered in a given performance.
In some cases, three classes of phonemes can be examined: vowels, plosives and fricatives. Each of these classes requires a slightly different method of analysis to animate the lip parameter. Fricatives (s z f v S Z D T) create friction by pushing air past the teeth with either the lips or the tongue. This creates intensity at high frequencies, and thus they have markedly increased mean frequencies in their spectral footprints compared to those of conversational speech. If greater intensity is detected at a high frequency for a given fricative, then it is known that it was spoken forcefully and heavily articulated. Likewise, with plosives (p b d t g k), the air stoppage by lip or tongue builds pressure and the sudden release creates similarly high-frequency intensity; the greater the intensity, the greater the articulation.
Unlike fricatives and plosives, vowels are generally voiced. This fact allows the system to measure the pitch and volume of the glottis with some precision. Simultaneous increases in pitch and volume are associated with emphasis. A high mean fundamental frequency (F0) and high mean intensity are correlated with high arousal (for example, panic, rage, excitement, joy, or the like), which is associated with baring teeth, greater articulation, and exaggerated speech. Likewise, simultaneous decreases are associated with low arousal (for example, shame, sadness, boredom, or the like).
In a particular embodiment, vowels are only considered by the JALI model if they are lexically stressed, and fricatives/plosives are only considered if they arise before/after a lexically stressed vowel. These criteria advantageously choose candidates carefully and keep the animation from being too erratic. Specifically, lexically stressed sounds will be the most affected by the intention to articulate, yell, speak strongly or emphasize a word in speech. Likewise, the failure to do so will be most indicative of a mutter, mumble or an intention not to be clearly heard, due for example to fear, shame, or timidity.
Applicant recognized further advantages to the method and system described herein. The friction of air through lips and teeth makes high-frequency sounds which impair comparison between fricatives/plosives and vowel sounds on both the pitch and intensity dimensions, such that they must be separated from vowels for coherent and accurate statistical analysis. These three phoneme types can be compared separately because of the unique characteristics of the sound produced (these phoneme types are categorically different). This comparison is done in a way that optimally identifies changes specific to each given phoneme type. In further cases, the articulation of other phoneme types can be detected.
In some embodiments, pitch and intensity of the audio can be analyzed with a phonetic speech analyzer module 212 (for example, using PRAAT™). Voice pitch is measured spectrally in hertz and retrieved from the fundamental frequency. The fundamental frequency of the voice is the rate of vibration of the glottis and is abbreviated as F0. Voice intensity is measured in decibels and retrieved from the power of the signal. The significance of these two signals is that they are perceptual correlates. Intensity is power normalized to the threshold of human hearing, and pitch perception is linear between 100-1000 Hz, corresponding to the common range of the human voice, and non-linear (logarithmic) above 1000 Hz. In a certain case, high-frequency intensity is calculated by measuring the intensity of the signal in the 8-20 kHz range.
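As a non-limiting illustration, such per-clip measurements can be gathered with the parselmouth Python bindings to PRAAT™; the function name and the returned dictionary layout are assumptions, and the 8-20 kHz band measurement presumes audio sampled at a sufficiently high rate:

```python
import numpy as np
import parselmouth
from parselmouth.praat import call

def speech_statistics(wav_path):
    """Gather clip-wide pitch, intensity and high-frequency intensity statistics."""
    snd = parselmouth.Sound(wav_path)

    f0 = snd.to_pitch().selected_array["frequency"]
    f0 = f0[f0 > 0]                                  # keep voiced frames only

    db = snd.to_intensity().values.flatten()         # dB re. the threshold of hearing

    # High-frequency intensity: band-pass the 8-20 kHz range, then measure intensity.
    hf = call(snd, "Filter (pass Hann band)", 8000, 20000, 100)
    hf_db = hf.to_intensity().values.flatten()

    def stats(x):
        return {"mean": float(np.mean(x)), "std": float(np.std(x)),
                "min": float(np.min(x)), "max": float(np.max(x))}

    return {"pitch": stats(f0), "intensity": stats(db), "hf_intensity": stats(hf_db)}
```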
In a further embodiment, for vocal performances of a face that is shouting throughout, automatic modulation of the JA (jaw) parameter may not be needed; the jaw value can simply be set to a high value for the entire performance. However, when a performer fluctuates between shouting and mumbling, the automatic full JALI model, as described herein, can be used. The method, as described herein, gathers statistics (mean, max, min and standard deviation) for each of intensity, pitch, and high-frequency intensity.
Table 1 shows an example of how jaw values are set for vowels (the ‘vowel intensity’ is of the current vowel, and ‘mean’ is the global mean intensity of all vowels in the audio clip):
Table 2 shows an example of how lip values are set for vowels (the ‘intensity/pitch’ is of the current vowel, and ‘mean’ is the respective global mean intensity/pitch of all vowels in the audio clip):
Table 3 shows an example of how lip values are set for fricatives and plosives (the ‘intensity’ is the high frequency intensity of the current fricative or plosive, and ‘mean’ is the respective global mean high frequency intensity of all fricatives/plosives in the audio clip):
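As an illustrative sketch only, with hypothetical threshold and output values that do not reproduce Tables 1 to 3, per-phoneme jaw and lip values can be set by comparing each phoneme's measurement (taken over its aligned time span) against the clip-wide statistics gathered above:

```python
def jaw_value(vowel_intensity, stats):
    """Hypothetical thresholding of a vowel's intensity against the clip-wide mean/std."""
    m, s = stats["intensity"]["mean"], stats["intensity"]["std"]
    if vowel_intensity > m + s:
        return 1.0       # markedly louder than average: wide-open, shouted articulation
    if vowel_intensity < m - s:
        return 0.2       # markedly quieter: mumbled, near-ventriloquist jaw
    return 0.6           # otherwise: conversational jaw opening

def lip_value_vowel(intensity, pitch, stats):
    """Simultaneous rises in pitch and intensity suggest emphasis and high arousal."""
    emphatic = (intensity > stats["intensity"]["mean"] and pitch > stats["pitch"]["mean"])
    return 1.0 if emphatic else 0.5

def lip_value_fricative_plosive(hf_intensity, stats):
    """Greater 8-20 kHz intensity implies a more forcefully articulated fricative/plosive."""
    return 1.0 if hf_intensity > stats["hf_intensity"]["mean"] else 0.5
```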
In a further embodiment, given two input files representing speech audio and a text transcript, phonemic breakdown and forced alignment can be undertaken according to the method described herein. In an example, scripts (for example, applescript and praatscript) can be used to produce a phonemic breakdown and forced alignment using an appropriate utility. This phonemic alignment is then used, by the speech analyzer 212 (for example, using PRAAT™), to produce pitch and intensity mean/min/max for each phoneme. Then, the phonemes can be run through the method to create animated viseme curves, by setting articulation and co-articulation keyframes of visemes as well as animated JALI parameters, output as an appropriate script (for example, a Maya Embedded Language (MEL) script). In some cases, this script is able to drive the animation of any JALI rigged character, for example in MAYA™.
As described below, the method and system described herein can advantageously produce low-dimensionality signals. In an embodiment, the dimensionality of the output phase 906 is matched to a human communication signal. In this way, people can perceive phonemes and visemes, not arbitrary positions of a part of the face. For example, the procedural result of saying the word “water”, as shown in
In an example, the success of a realistic procedural animation model can be evaluated by comparing that animation to ‘ground truth’; i.e., a live-action source. Using live-action footage, the Applicant has evaluated the JALI model, as described in the system and method herein, by comparing it not only to the live footage, but also to the speech animation output from a dynamic visemes method and a Dominance model method.
In this evaluation, a facial motion capture tool was utilized to track the face of the live-action performer in the live-action footage, as well as the animated faces output from the aforementioned methods. Tracking data was then applied to animate ValleyBoy 704, allowing evaluation of the aforementioned models on a single facial rig. By comparing the JALI model, dynamic visemes and the dominance model to the ‘ground truth’ of the motion-captured live-action footage, a determination can be made regarding the relative success of each method. The exemplary evaluation used ‘heatmaps’ of the displacement errors of each method with respect to the live-action footage.
In
In the map of 1216, accumulated error for the 7-second duration of the actor's speech is shown. The dynamic viseme and JALI models fare significantly better than the dominance model in animating this vocal track. In general, dominance incurs excessive co-articulation of lip-heavy phonemes such as /F/ with adjacent phonemes. The dynamic viseme model appears to under-articulate certain jaw-heavy vowels such as /AA/, and to blur each phoneme over its duration. To a conspicuously lesser extent, the JALI model appears to over-articulate these same vowels at times.
Applicant recognized the substantial advantages of the methods and systems described herein for the automatic creation of lip-synchronized animation. The present approach can produce technological results that are comparable to or better than conventional approaches in both performance-capture and data-driven speech, encapsulating a range of expressive speaking styles that is easy for animators to edit and refine.
In an example of the application of the advantages of the JALI model, as described herein, the Applicant recruited professional and student animators to complete three editing tasks: 1) adding a missing viseme, 2) fixing a non-trivial out-of-sync phrase and 3) exaggerating a speech performance. Each of these tasks was completed with motion capture generated data and with JALI model generated data. All participants reported disliking editing motion capture data and unanimously rated it lowest for ease-of-use, ability to reach expectations and quality of the final edited result for all tasks, especially when compared to the JALI model. Overall, editing with the JALI model was preferred 77% of the time.
As evidenced above, Applicant recognized the advantages of having a model that combines the benefits of procedural generation with ease of use for animators; such ease of use allows animators to reach an end product faster than with conventional methods.
In a further advantage of the method and system described herein, the JALI model does not require marker-based performance capture. This is advantageous because output can be tweaked rather than recaptured. In some cases, for example with the capture of bilabials, the system noticeably outperforms performance capture approaches. Bilabials in particular are very important to get correct, or near correct, because the audience can easily and conspicuously perceive when their animation is off. Furthermore, the approaches described herein do not require the capturing of voice actors as in performance capture approaches. Thus, the approaches described herein do not have to rely on actors who may not always be very expressive with their facial features, which would risk the resulting animation not being particularly expressive.
The JALI model advantageously allows for the automatic creation of believable speech-synchronized animation sequences using only text and audio as input. Unlike many data-driven or performance capture methods, the output from the JALI model is animator-centric, and amenable to further editing for more idiosyncratic animation.
Applicant further recognized the advantages of allowing the easy combination of both the JALI model and its output with other animation workflows. As an example, the JALI model lip and jaw animation curves can be easily combined with head motion obtained from performance-capture.
The system and method, described herein, has a wide range of potential applications and uses; for example, in conjunction with body motion capture. Often the face and body are captured separately. One could capture the body and record the voice, then use the JALI model to automatically produce face animation that is quickly synchronized to the body animation via the voice recording. This is particularly useful in a virtual reality or augmented reality setting where facial motion capture is complicated by the presence of head mounted display devices.
In another example of a potential application, the system and method, as described herein, could be used for video games; specifically, for role-playing games, where animating many lines of dialogue is prohibitively time-consuming.
In yet another example of a potential application, the system and method, as described herein, could be used for crowds and secondary characters in film, as audiences' attention is not focused on these characters nor is the voice track forward in the mix.
In yet another example of a potential application, the system and method, as described herein, could be used for animatics or pre-viz, to settle questions of layout.
In yet another example of a potential application, the system and method, as described herein, could be used for animating main characters since the animation produced is designed to be edited by a skilled animator.
In yet another example of a potential application, the system and method, described herein, could be used for facial animation by novice or inexperienced animators.
Other applications may become apparent.
Although the invention has been described with reference to certain specific embodiments, various modifications thereof will be apparent to those skilled in the art without departing from the spirit and scope of the invention as outlined in the claims appended hereto. The entire disclosures of all references recited above are incorporated herein by reference.
Number | Name | Date | Kind |
---|---|---|---|
3916562 | Burkhart | Nov 1975 | A |
5286205 | Inouye | Feb 1994 | A |
5613056 | Gasper | Mar 1997 | A |
5878396 | Henton | Mar 1999 | A |
5995119 | Cosatto | Nov 1999 | A |
6130679 | Chen | Oct 2000 | A |
6181351 | Merrill | Jan 2001 | B1 |
6504546 | Cosatto | Jan 2003 | B1 |
6539354 | Sutton | Mar 2003 | B1 |
6665643 | Lande | Dec 2003 | B1 |
6735566 | Brand | May 2004 | B1 |
6839672 | Beutnagel | Jan 2005 | B1 |
7827034 | Munns | Nov 2010 | B1 |
8614714 | Koperwas | Dec 2013 | B1 |
9094576 | Karakotsios | Jul 2015 | B1 |
10217261 | Li | Feb 2019 | B2 |
20050207674 | Fright | Sep 2005 | A1 |
20060009978 | Ma | Jan 2006 | A1 |
20060012601 | Francini | Jan 2006 | A1 |
20060221084 | Yeung | Oct 2006 | A1 |
20070009180 | Huang | Jan 2007 | A1 |
20080221904 | Cosatto | Sep 2008 | A1 |
20100057455 | Kim | Mar 2010 | A1 |
20100085363 | Smith | Apr 2010 | A1 |
20110099014 | Zopf | Apr 2011 | A1 |
20120026174 | McKeon | Feb 2012 | A1 |
20130141643 | Carson | Jun 2013 | A1 |
20140035929 | Matthews | Feb 2014 | A1 |
20170040017 | Matthews | Feb 2017 | A1 |
20170092277 | Sandison | Mar 2017 | A1 |
20170154457 | Theobald | Jun 2017 | A1 |
20170213076 | Francisco | Jul 2017 | A1 |
20170243387 | Li | Aug 2017 | A1 |
20180158450 | Tokiwa | Jun 2018 | A1 |
Entry |
---|
Ostermann, Animation of Synthetic Faces in MPEG-4, 1998, IEEE Computer Animation, pp. 49-55. |
King et al., An Anatomically-based 3D Parametric Lip Model to Support Facial Animation and Synchronized Speech, 2000, Department of Computer and Information Sciences of Ohio State University, pp. 1-19. |
Wong et al., Allophonic Variations in Visual Speech Synthesis for Corrective Feedback in CAPT, 2011, IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5708-5711 (Year: 2011). |
Anderson, Robert et al., (2013), Expressive Visual Text-to-Speech Using Active Appearance Models, (pp. 3382-3389). |
Bevacqua, E., & Pelachaud, C., (2004). Expressive Audio-Visual Speech. Computer Animation and Virtual Worlds, 15(3-4), 297-304. |
Cassell, J., Pelachaud, C., Badler, N., Steedman, M., Achorn, B., Becket, T., et al. (1994). Animated Conversation: Rule-Based Generation of Facial Expression, Gesture & Spoken Intonation for Multiple Conversational Agents. Presented at the SIGGRAPH '94: Proceedings of the 21st annual conference on Computer graphics and interactive techniques, ACM Request Permissions. http://doi.org/10.1145/192161.192272. |
Chandrasekaran, C., Trubanova, A., Stillittano, S., Caplier, A., & Ghazanfar, A. A. (2009). The Natural Statistics of Audiovisual Speech. PLoS Computational Biology, 5(7), 1-18. http://doi.org/10.1371/journal.pcbi.1000436. |
Cohen, M. M., & Massaro, D. W. (1993). Modeling Coarticulation in Synthetic Visual Speech. Models and Techniques in Computer Animation, 139-156. |
Deng, Z., Neumann, U., Lewis, J. P., Kim, T.-Y., Bulut, M., & Narayanan, S. (2006). Expressive Facial Animation Synthesis by Learning Speech Coarticulation and Expression Spaces. IEEE Transactions on Visualization and Computer Graphics, 12(6), 1523-1534. http://doi.org/10.1109/TVCG.2006.90. |
Kent, R. D., & Minifie, F. D. (1977). Coarticulation in Recent Speech Production Models. Journal of Phonetics, 5(2), 115-133. |
King, S. A. & Parent, R. E. (2005). Creating Speech-Synchronized Animation. IEEE Transactions on Visualization and Computer Graphics, 11(3), 341-352. http://doi.org/10.1109/TVCG.2005.43. |
Lasseter, J. (1987). Principles of Traditional Animation Applied to 3D Computer Animation. SIGGRAPH Computer Graphics, 21(4), 35-44. |
Marsella, S., Xu, Y., Lhommet, M., Feng, A. W., Scherer, S., & Shapiro, A. (2013). Virtual Character Performance From Speech (pp. 25-36). Presented at the SCA 2013, Anaheim, California. |
Mattheyses, W., & Verhelst, W. (2015). Audiovisual Speech Synthesis: An Overview of the State-of-the-Art. Speech Communication, 66(C), 182-217. http://doi.org/10.1016/j.specom.2014.11.001. |
Ohman, S. E. (1967). Numerical model of coarticulation. Journal of the Acoustical Society of America, 41(2), 310-320. |
Schwartz, J.-L., & Savariaux, C. (2014). No, There Is No 150 ms Lead of Visual Speech on Auditory Speech, but a Range of Audiovisual Asynchronies Varying from Small Audio Lead to Large Audio Lag. PLoS Computational Biology (PLOSCB) 10(7), 10(7), 1-10. http://doi.org/10.1371/journal.pcbi.1003743. |
Sutton, S., Cole, R. A., de Villiers, J., Schalkwyk, J., Vermeulen, P. J. E., Macon, M. W., et al. (1998). Universal Speech Tools: the CSLU Toolkit. Icslp 1998. |
Taylor, S. L., Mahler, M., Theobald, B.-J., & Matthews, I. (2012). Dynamic Units of Visual Speech. Presented at the SCA '12: Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation, Eurographics Association. |
Troille, E., Cathiard, M.-A., & Abry, C. (2010). Speech face perception is locked to anticipation in speech production. Speech Communication, 52(6), 513-524. http://doi.org/10.1016/j.specom.2009.12.005. |
Number | Date | Country | |
---|---|---|---|
20180253881 A1 | Sep 2018 | US |