The appearance and expressiveness of facial performances are greatly influenced by complex deformations of the face at several scales. Large-scale deformations are driven by muscles and determine the overall shape of the face. Medium-scale deformations are mainly caused by skin wrinkling, and produce many of the expressive qualities in facial expressions. Finally, at the skin mesostructure there is fine-scale stretching and compression which produces subtle but perceptually significant cues. This complex behavior is challenging to reproduce in virtual characters with any combination of artistry and simulation.
Currently, creating realistic virtual faces often involves capturing textures, geometry, and facial motion of real people. It has proven, however, to be difficult to capture and represent facial dynamics accurately at all scales. Face scanning systems can acquire high-resolution facial textures and geometry, but typically only for static poses. Motion capture techniques record continuous facial motion, but only at a coarse level of detail. Straightforward techniques of driving high-resolution character models by relatively coarse motion capture data often fail to produce realistic motion at medium and fine scales. This limitation has motivated techniques such as wrinkle maps, blend shapes, and real-time 3D scanning. However, these prior art methods either fail to reproduce the non-linear nature of skin deformation, are labor-intensive, or do not capture and represent all scales of skin deformation faithfully.
Several prior art real-time 3D scanning systems/methods exist that are able to capture dynamic facial performances. These systems/methods either rely on structured light, use photometric stereo, or a combination of both. These prior art systems/methods are not suited for acquiring data for facial deformation synthesis, either because they do not attain the acquisition rate necessary to capture temporal deformations faithfully, because they are too data-intensive, or because they do not provide sufficient resolution to model facial details.
Modeling and capturing fine wrinkle details is a challenging problem for which a number of specialized prior art acquisition and modeling techniques have been developed. For instance, while some prior art techniques have modeled static pore detail using texture synthesis, these techniques can be suitable for enhancing static geometry but do not model wrinkle or pore deformations over time. Some other prior art techniques have demonstrated how linear interpolation of artist-modeled wrinkle maps can be used for real-time rendering. These techniques, however, model wrinkle and pore detail either statistically or artistically, making the creation of an exact replica of a subject's skin detail difficult.
A different prior art approach has been to model skin detail by measuring it from live subjects. Some prior art techniques have relied on normal maps to model skin meso-structure, captured using photometric stereo from a few static expressions. Dynamic normal variation in skin meso-structure for intermediate facial poses can be obtained using trilinear interpolation. Certain prior art techniques record dynamic facial wrinkle behavior from motion capture and video of an actor. A pattern of colored makeup is employed to improve shape-from-shading to detect wrinkle indentations in these regions. A non-linear thin shell model can be used to recreate the buckling of skin surrounding each wrinkle. While these systems estimate realistic facial geometry, they are mostly limited to larger scale wrinkles, and rely on (a form of) linear data interpolation to generate intermediate expressions.
Performance capture techniques use the recorded motion of an actor to drive a performance of a virtual character, most often from a set of tracked motion capture markers attached to the actor's face. Mapping the set of tracked markers to character animation controls is a complex but well-studied problem. Prior art techniques have introduced linear expression blending models. Blend shapes have become an established method for animating geometric deformation, and can be either defined by an artist or estimated automatically. Several techniques have used blend shapes to simulate detailed facial performances by linearly interpolating between a set of images or geometric exemplars with different facial expressions. A drawback of this approach is that it can be difficult to use linear blend shapes to reproduce the highly non-linear nature of skin deformation. Skin tends to stretch smoothly up to a point and then buckle nonlinearly into wrinkles. Furthermore, relating blend shapes to motion capture data is a non-trivial task.
Physically based simulation models use underlying bio-mechanical behavior of the human face to create realistic facial animations. Certain prior art techniques have determined individual muscle activations from sparse motion capture data using an anatomical model of the actor. Synthesizing detailed animations from such performance capture data would require very detailed models of facial structure and musculature, which are difficult to accurately reconstruct for a specific performer.
Thus, while prior art techniques may be suitable for certain situations and applications, they have exhibited limitations for creating realistic virtual faces, including for capturing textures, geometry, and facial motion of real people. What is needed therefore are new techniques that more accurately model and reproduce natural looking facial movements.
The present disclosure is directed to novel techniques/methods/systems addressing and remedying the limitations noted previously for the prior art. Embodiments of the present disclosure can provide for acquisition, modeling, compression, and synthesis of realistic facial deformations using polynomial displacement maps. These techniques/methods/systems can make use of and include an analysis phase where the relationship between motion capture markers and detailed facial geometry is inferred, and a synthesis phase where detailed animated facial geometry is driven solely by a sparse set of motion capture markers.
An aspect of the present disclosure is directed to methods including an analysis phase for subsequent use in generating realistic facial movements. For such analysis, an actor can be visually recorded while wearing facial markers and performing a set of training expression clips. During the performance, real-time high-resolution facial deformations can be captured, including dynamic wrinkle and pore detail. The recording and capturing can utilize interleaved structured light 3D scanning and photometric stereo. Next, displacements can be computed between a neutral mesh driven by the motion capture markers and the high-resolution captured expressions. These geometric displacements are stored in one or more polynomial displacement maps (“PDMs”), which can be parameterized according to the local deformations of the motion capture dots as described in further detail in the following description. Additionally, generation or synthesis of realistic facial movement can be provided. For such synthesis, polynomial displacement map(s) can be driven with new motion capture data. This allows the recreation of large-scale muscle deformation, medium and fine wrinkles, and dynamic skin pore detail.
A further aspect of the present disclosure is directed to 3D facial deformation rendering systems including a central processing unit ("CPU"), a graphics processing unit ("GPU"), and a plurality of motion capture markers. The systems can be capable of analyzing and/or synthesizing facial deformations.
Another aspect of the present disclosure is directed to training data acquisition systems for real-time 3D image capturing. Such systems can include a stereo pair of high-resolution high-speed cameras. A high-speed digital light projection video projector can also be included. The high-speed cameras can be synchronized to the video projector, and the video projector can output a plurality of grayscale sinusoidal structured light patterns. A spherical gradient illumination device can also be included; from the stereo camera pair and the structured illumination, a base geometry can be calculated or configured. The systems can also include a plurality of motion capture markers for placement on the face of an actor. The plurality of motion capture markers can allow each frame of motion to be registered in a common texture space.
Moreover, embodiments of the present disclosure can be implemented in a computer-readable medium (e.g., hardware, software, firmware, or any combinations of such), and can be distributed over one or more networks. Steps described herein, including processing functions to derive, learn, or calculate formulas and/or mathematical models utilized and/or produced by the embodiments of the present disclosure, can be processed by one or more suitable processors, e.g., central processing units ("CPUs") and/or one or more graphics processing units ("GPUs") implementing suitable code/instructions.
While aspects of the present disclosure are described herein in connection with certain embodiments, it is noted that variations can be made by one with skill in the applicable arts within the spirit of the present disclosure and the scope of the appended claims.
Aspects of the disclosure may be more fully understood from the following description when read together with the accompanying drawings, which are to be regarded as illustrative in nature, and not as limiting. The drawings are not necessarily to scale, emphasis instead being placed on the principles of the disclosure. In the drawings:
While certain embodiments are depicted in the drawings, one skilled in the art will appreciate that the embodiments depicted are illustrative and that variations of those shown, as well as other embodiments described herein, may be envisioned and practiced within the scope of the present disclosure.
Aspects of the present disclosure are, in general terms, directed to methods and systems for modeling and/or synthesizing facial performances with realistic dynamic wrinkles and fine scale facial details. Embodiments of the present disclosure can utilize one or more of the following: (i) deformation-driven polynomial displacement maps, as a compact representation for facial deformations; (ii) novel real-time acquisition systems for acquiring highly detailed geometry based on structured light and photometric stereo; and (iii) novel methods that are able to generate highly detailed facial geometry from motion capture marker locations making use of PDMs describing the subject's appearance.
In exemplary embodiments, a real-time 3D scanning system can record training data of the high-resolution geometry and appearance of an actor performing a small set of predetermined facial expressions. A set of motion capture markers can be placed on the actor's face to track large-scale deformations. These large-scale deformations can be related to the deformations at finer scales. This relation can be represented compactly in the form of two of the previously-mentioned deformation-driven polynomial displacement maps ("PDMs"), encoding variations in medium-scale and fine-scale displacements for a face undergoing motion, as shown, for example, in the accompanying drawings.
Embodiments of the present disclosure can also include an acquisition system that is capable of capturing high-resolution geometry of dynamic facial performances at a desired frame rate, e.g., at 30 fps. Not only wrinkles but also dynamic fine-scale pore detail can be captured. The acquired training data can be represented as a biquadratic polynomial function (a PDM), driven by a sparse set of motion capture marker positions. Such representations can be both compact and able to maintain the non-linear dynamics of the human face. Embodiments of methods/systems according to the present disclosure can utilize one or more sets of captured training expressions. The captured expressions do not have to be directly used during synthesis; instead, a compact representation can be used that encodes the non-linear behavior of the deformations as a function of motion capture marker positions. These compact representations can be deformation-driven polynomial displacement maps, as described in further detail below. Accordingly, methods/systems according to the present disclosure can be used for deriving accurate high-resolution animation from relatively sparse motion capture data, which can be utilized for various applications, including the compression of existing performance data and the synthesis of new performances. Techniques of the present disclosure can be independent of the underlying geometry capture system and can be used to automatically generate high-frequency wrinkle and pore details on top of many existing facial animation systems.
The deformation-driven PDMs utilized by embodiments of the present disclosure can use biquadratic polynomials stored as textures to model the data. Such deformation driven PDMs differ from polynomial texture maps (“PTMs”) in three significant aspects. First, PDMs model geometric deformations instead of changes in scene radiance. Second, PTMs have never been driven by changes in geometry. Finally, unlike PTMs used to date, the utilized driving parameters (not just the coefficients) can vary over the image space to better model complex facial expressions. The PDM representation can accordingly yield a relatively compact model that allows synthesis of realistic medium-scale and fine-scale facial motion using coarse motion capture data.
Embodiments of real-time 3D capture systems can use a combination of structured light and photometric stereo to obtain high-resolution face scans, and can consist of a stereo pair of high-resolution high-speed cameras synchronized to a high-speed DLP video projector and a spherical gradient illumination device. In exemplary embodiments, multiple (e.g., six) grayscale sinusoidal structured light patterns at varying scales and a full-on pattern can be output by the high-speed video projector running at a desired frame rate, e.g., 360 frames per second. From the stereo camera pair and the structured illumination, a base geometry can be triangulated. After each structured light sequence, four gradient illumination patterns and an additional diffuse tracking pattern can be generated, e.g., with a spherical lighting apparatus, for computing photometric normals. In an exemplary embodiment, 178 tracking dots were placed on an actor's face so that each frame of motion could be registered in a common texture space; the marker motion also served as the basis for the parameter space for facial detail synthesis. Two lower-resolution cameras were placed to the sides to assist with motion capture marker tracking. Further suitable 3D capture systems and techniques are shown and described in Applicant's co-owned U.S. patent application Ser. No. 12/105,141, entitled "Acquisition of Surface Normal Maps from Spherical Gradient Illumination," filed 17 Apr. 2008, the entire contents of which are incorporated herein by reference; and as also described in Ma et al., "Rapid Acquisition of Specular and Diffuse Normal Maps from Polarized Spherical Gradient Illumination," University of Southern California, (2007), the entire contents of which are incorporated herein by reference.
For geometry reconstruction, geometry can be triangulated based on camera-to-camera correspondences computed from the ratios of the sinusoidal structured light patterns to the full-on pattern. Photometric surface normals can be computed from the spherical gradient patterns and then the photometric normals can be used to add fine-scale detail to the base geometry. This allows details such as dynamic wrinkles and fine-scale stretching and compression of skin pores to be captured in real-time.
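By way of non-limiting illustration, the following is a minimal sketch of how diffuse photometric normals could be recovered from the three gradient-lit images and the constant (full-on) gradient image, assuming Lambertian reflectance as in Ma et al. [2007]; the array names are illustrative and the exact normalization details may differ in practice.

```python
import numpy as np

def photometric_normals(I_x, I_y, I_z, I_full, eps=1e-6):
    """Per-pixel surface normals from images lit by spherical gradient
    illumination along x, y, z plus a constant (full-on) pattern.
    Assumes Lambertian reflectance."""
    # Each gradient pattern encodes (1 + n_i) / 2 of the full-sphere light,
    # so the ratio to the full-on image recovers each normal component.
    nx = 2.0 * I_x / (I_full + eps) - 1.0
    ny = 2.0 * I_y / (I_full + eps) - 1.0
    nz = 2.0 * I_z / (I_full + eps) - 1.0
    n = np.stack([nx, ny, nz], axis=-1)
    return n / (np.linalg.norm(n, axis=-1, keepdims=True) + eps)
```

The resulting normal map can then be used to emboss fine-scale detail onto the triangulated base geometry, e.g., by transferring the high-frequency component of the photometric detail onto the base surface.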
Because the gradient illumination patterns are captured at different points in time, subject motion can be corrected for using an optical flow algorithm, e.g., the optical flow algorithm of Brox et al. [2004]. This flow can be computed between the first gradient pattern and the tracking pattern, and then the flow can be used to warp the four gradient-lit images to the same point in time. This allows for accurate calculation of surface normals using ratios of the gradient-lit images. Compensation for motion in the structured light patterns is not necessarily performed, because the optical flow would lose stereo correspondences. However, slight errors due to motion in the structured light geometry are acceptable, since the geometry is subsequently refined by the photometric normals, which corrects for these errors.
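As a concrete illustration of this motion compensation, a minimal sketch follows; OpenCV's Farneback optical flow is used as a readily available stand-in for the Brox et al. [2004] algorithm referenced above, and the image names and 8-bit grayscale assumption are illustrative only.

```python
import cv2
import numpy as np

def warp_gradient_images(tracking_img, gradient_imgs):
    """Warp the gradient-lit frames to the time of the tracking frame so
    that surface normals are computed from motion-compensated ratios.
    A single flow field, computed between the tracking frame and the
    first gradient frame, is reused for all gradient images."""
    h, w = tracking_img.shape
    flow = cv2.calcOpticalFlowFarneback(tracking_img, gradient_imgs[0],
                                        None, 0.5, 3, 15, 3, 5, 1.2, 0)
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return [cv2.remap(img, map_x, map_y, cv2.INTER_LINEAR)
            for img in gradient_imgs]
```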
The complete set of 3D training models can be registered to a common texture space determined by the motion capture tracking dots to achieve the initial alignment. The optical flow algorithm can be re-used to achieve alignment at the level of fine-scale features. Facial skin is often lacking in high-frequency diffuse texture features needed for accurate traditional optical flow. Instead the fact that skin is rich in high-frequency geometric details such as pores, cracks, and wrinkles can be leveraged to achieve accurate optical flow. To do this, the computed normal maps can be integrated to derive fine-scale displacement maps per frame.
To capture the range of facial deformation, several short sequences can be captured as the subject transitions from the neutral expression to various strong expressions, such as those shown in the top row of the accompanying drawings.
From each transition, a plurality of frames can be selected, e.g., between 10 and 30 frames, to use as input to the PDM fitting process, including the neutral start point, the extreme expression end points, as well as intermediate deformations. This can allow for the non-linear character of wrinkle formation and other fine-scale deformations to be modeled by the PDM.
The use of deformation-driven PDMs in accordance with the present disclosure is based on the observation that medium-scale and fine-scale changes in surface shape correlate with larger-scale deformations in the corresponding facial region. For example, the formation of horizontal forehead wrinkles correlates with the larger-scale compression of the surface in a direction transverse to the wrinkles. Similarly, skin pores and fine wrinkles can become stretched or flattened according to the local stretching of the skin at coarser scales. Further detail of the development of PDMs to represent these deformations based on the high-resolution training data and tracked motion capture markers is described below.
The mathematical form of deformation-driven PDMs is as follows:
$$D_{u,v}(d_1, d_2) = a_0(u,v)\,d_1^2 + a_1(u,v)\,d_2^2 + a_2(u,v)\,d_1 d_2 + a_3(u,v)\,d_1 + a_4(u,v)\,d_2 + a_5(u,v) \quad (1)$$
In Eq. 1, $D_{u,v}$ is the local displacement at point $(u,v)$, and $d_1$ and $d_2$ are measures of low-frequency deformation evaluated at point $(u,v)$. The measurement of large-scale deformation is limited in Eq. 1 to the two dimensions $d_1$ and $d_2$ in order to keep the number of PDM coefficients as small as possible. A method, according to exemplary embodiments, for computing the best 2D parameterization of large-scale deformation is described below.
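For concreteness, Eq. 1 can be evaluated per texel as in the following minimal sketch, which assumes the six coefficient maps are stored together as a floating-point texture; the array names are illustrative.

```python
import numpy as np

def evaluate_pdm(coeffs, d1, d2):
    """Evaluate a deformation-driven PDM (Eq. 1) at every texel.

    coeffs : (H, W, 6) per-texel coefficients a0..a5
    d1, d2 : (H, W) coarse deformation parameters, already interpolated
             into texture space
    returns: (H, W) displacements D_{u,v}(d1, d2)
    """
    a0, a1, a2, a3, a4, a5 = np.moveaxis(coeffs, -1, 0)
    return (a0 * d1**2 + a1 * d2**2 + a2 * d1 * d2
            + a3 * d1 + a4 * d2 + a5)
```

For a medium-scale PDM storing 3D displacements, the same evaluation is simply performed once per displacement coordinate.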
An example of computing PDMs based on the captured training set of motion sequences is illustrated in the accompanying drawings.
With continued reference to the drawings, deformation-driven PDMs can be fit separately for the medium-scale and the fine-scale displacements, e.g., at different texture resolutions, as described further below.
Scaled to the same resolution, these separately fit deformation-driven PDMs can be combined back into a single deformation-driven PDM by simply adding their respective coefficients. In practice, however, the medium-scale PDM can be applied to the geometric vertices and the evaluated fine-scale PDM can be used for graphics processing unit ("GPU") displacement mapping.
To generate and make use of PDMs, an input parameter space is created that characterizes local coarse-scale facial motion in a well-conditioned manner. To generate these parameters, a coarse triangle mesh can first be created over the set of motion markers. At each vertex $V_i$ of this coarse mesh, the low-frequency deformation at time $t$ can be defined as $S_i(t)$ by conjoining the 3D positional offset $O_i(t)$ of the vertex with two additional values $E_i^u(t)$ and $E_i^v(t)$ representing the large-scale in-plane strain of the surface. This forms a 5D deformation space:
$$S_i(t) = \{\,O_i(t),\ E_i^u(t),\ E_i^v(t)\,\}.$$
The vertex position offsets $O_i(t)$ are computed by first applying a rigid transformation $R$ to the coarse mesh to best match the neutral pose, thereby correcting for overall head pose. $O_i(t)$ is then simply the difference between the transformed vertex position $R(P_i(t))$ and the neutral vertex position $P_i^{ref}$.
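By way of non-limiting illustration, the offsets $O_i(t)$ could be computed as in the following sketch, which uses a standard least-squares (Kabsch/Procrustes) rigid alignment as an assumed way of obtaining the transformation $R$; the exact pose-correction procedure may differ.

```python
import numpy as np

def rigid_align(P, P_ref):
    """Least-squares rigid transform (rotation R, translation t) taking the
    current coarse-mesh vertices P (N, 3) toward the neutral vertices P_ref."""
    mu_p, mu_r = P.mean(axis=0), P_ref.mean(axis=0)
    U, _, Vt = np.linalg.svd((P - mu_p).T @ (P_ref - mu_r))
    R = (U @ Vt).T
    if np.linalg.det(R) < 0:          # guard against reflections
        Vt[-1] *= -1
        R = (U @ Vt).T
    t = mu_r - R @ mu_p
    return R, t

def vertex_offsets(P_t, P_ref):
    """O_i(t): head-pose-corrected positional offsets of the coarse mesh."""
    R, t = rigid_align(P_t, P_ref)
    return (P_t @ R.T + t) - P_ref
```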
The large-scale strains $E^u$ and $E^v$ can be estimated from the coarse mesh vertex positions of all vertices $N_2(V_i)$ connected to $V_i$ by a path of two or fewer edges. The positions of $N_2(V_i)$ can be projected into the local texture coordinate system $(u,v)$. The 2D strain can be approximated as the difference between the standard deviation of the projected positions $P_j$ of $V_j \in N_2(V_i)$ in the current deformation and the standard deviation in the reference neutral expression:
$$E_i^u(t) = \sigma\{\hat{u} \cdot P_j(t)\} - \sigma\{\hat{u} \cdot P_j^{ref}\},\qquad E_i^v(t) = \sigma\{\hat{v} \cdot P_j(t)\} - \sigma\{\hat{v} \cdot P_j^{ref}\}. \quad (2)$$
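A minimal sketch of Eq. 2 for a single coarse-mesh vertex follows; it assumes the two-ring neighbor indices and the local texture axes $\hat{u}$, $\hat{v}$ are available, and the array names are illustrative.

```python
import numpy as np

def local_strain(P_t, P_ref, n2_idx, u_hat, v_hat):
    """Approximate in-plane strain at one coarse-mesh vertex (Eq. 2).

    P_t, P_ref   : (N, 3) current and neutral coarse-mesh vertex positions
    n2_idx       : indices of the vertices within two edges of V_i
    u_hat, v_hat : (3,) unit texture-axis directions at V_i
    """
    Pj_t, Pj_ref = P_t[n2_idx], P_ref[n2_idx]
    E_u = np.std(Pj_t @ u_hat) - np.std(Pj_ref @ u_hat)
    E_v = np.std(Pj_t @ v_hat) - np.std(Pj_ref @ v_hat)
    return E_u, E_v
```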
To find a suitable 2D parameterization for the PDM domain, principal component analysis (PCA) can be performed on the 5D deformation vectors $S_i$ over all captured deformations. This determines the most important axes of large-scale shape variation in the neighborhood of $V_i$. Prior to PCA, the strain values $(E_i^u, E_i^v)$ can be scaled by $\sqrt{|N_2(V_i)|}$ to account for the lower noise of this aggregate measure relative to the noise in the single measurement $O_i$. The two most significant principal components, $\hat{Q}_i$ and $\hat{R}_i$, can then be selected. The present inventors found that the eigenvalues decreased very quickly after the first two, indicating that most of the variation in $S$ could be well captured by two principal components and validating the choice of a two-dimensional PDM parameterization. Examples of eigenvalues, averaged across the face, are shown in Table 1. This analysis shows that most of the eigenvalues at each motion capture marker decay quickly. By choosing the best two dimensions for each motion capture marker, over 90% of the training data can be modeled.
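A minimal sketch of this per-vertex analysis is shown below; the strain rescaling follows the text, while the use of an SVD and the variance-explained bookkeeping are implementation assumptions.

```python
import numpy as np

def deformation_basis(S, n_neighbors):
    """Select the two most significant axes of the 5D deformation samples.

    S : (T, 5) samples {O_x, O_y, O_z, E_u, E_v} at one vertex over all
        captured deformations
    n_neighbors : |N2(V_i)|, used to rescale the strain components
    """
    S = S.copy()
    S[:, 3:] *= np.sqrt(n_neighbors)   # strains are aggregate and less noisy
    S -= S.mean(axis=0)
    _, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    Q_hat, R_hat = Vt[0], Vt[1]        # top two principal directions
    explained = (sigma[:2] ** 2).sum() / (sigma ** 2).sum()
    return Q_hat, R_hat, explained
```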
Finally, the final PDM domain axes can be derived over the coarse mesh by a smoothing process that ensures the deformation bases of adjacent vertices do not differ excessively. This can be accomplished by comparing each basis vector to the average of the corresponding basis vectors at adjacent vertices. The worst-case outlier vector is then successively replaced, over the entire mesh, with the average of the adjacent basis vectors. These vectors are reorthogonalized and renormalized at each step. This process is repeated until the worst-case outlier lies within a threshold angle of the neighborhood-averaged vector (a sketch of this smoothing procedure is provided following Eq. 3 below). Denoting the result of smoothing $\hat{Q}_i$ and $\hat{R}_i$ by $\hat{q}_i$ and $\hat{r}_i$, the input parameters to the PDM at the coarse mesh vertex $V_i$ at time $t$ are then simply:
$$d_1(P_i, t) = E_i(t) \cdot \hat{q}_i,\qquad d_2(P_i, t) = E_i(t) \cdot \hat{r}_i. \quad (3)$$
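The basis-smoothing procedure referenced above can be sketched as follows; the adjacency structure, the angular threshold, and the iteration cap are illustrative assumptions.

```python
import numpy as np

def smooth_bases(bases, neighbors, max_angle_deg=10.0, max_iters=10000):
    """Iteratively replace the worst-outlier basis vector with the average
    of the corresponding vectors at adjacent vertices, reorthonormalizing
    the pair at that vertex, until every vector lies within a threshold
    angle of its neighborhood average.

    bases     : (V, 2, 5) per-vertex basis vectors (Q_hat, R_hat)
    neighbors : list where neighbors[i] holds the vertices adjacent to i
    """
    bases = bases.copy()
    cos_thresh = np.cos(np.radians(max_angle_deg))
    for _ in range(max_iters):
        avg = np.stack([bases[nbr].mean(axis=0) for nbr in neighbors])
        avg /= np.linalg.norm(avg, axis=-1, keepdims=True)
        agreement = np.einsum('vkd,vkd->vk', bases, avg)   # cosine to average
        v, k = np.unravel_index(np.argmin(agreement), agreement.shape)
        if agreement[v, k] >= cos_thresh:
            break                        # worst outlier is within threshold
        bases[v, k] = avg[v, k]          # replace the outlier vector
        q = bases[v, 0] / np.linalg.norm(bases[v, 0])
        r = bases[v, 1] - (bases[v, 1] @ q) * q
        bases[v] = [q, r / np.linalg.norm(r)]
    return bases
```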
To extend these deformation values over the entire mesh, $E$, $\hat{q}$, and $\hat{r}$ can be interpolated from their values at the vertices $V$ using barycentric interpolation over the coarse triangle mesh.
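A minimal sketch of forming $(d_1, d_2)$ at arbitrary surface points is shown below; how the containing triangle and barycentric coordinates are obtained for each dense point is left as an assumption.

```python
import numpy as np

def deformation_parameters(tri_idx, bary, E_vert, q_vert, r_vert):
    """Interpolate deformation vectors and basis axes from the coarse mesh
    onto dense surface points, then project to obtain (d1, d2) per Eq. 3.

    tri_idx : (M, 3) coarse-mesh vertex indices of the triangle containing
              each dense point
    bary    : (M, 3) barycentric coordinates of each dense point
    E_vert, q_vert, r_vert : (V, 5) per-coarse-vertex deformation vectors
              and smoothed basis axes
    """
    E = np.einsum('mk,mkd->md', bary, E_vert[tri_idx])
    q = np.einsum('mk,mkd->md', bary, q_vert[tri_idx])
    r = np.einsum('mk,mkd->md', bary, r_vert[tri_idx])
    d1 = np.einsum('md,md->m', E, q)
    d2 = np.einsum('md,md->m', E, r)
    return d1, d2
```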
The choice to include the vertex offsets $O_i$ themselves in the deformation vector is perhaps counterintuitive, as mechanical properties are typically invariant with respect to simple translation. It has been found by the present inventors, however, that local shape deformation correlates significantly with these vertex offsets. This is believed to be due to the strong influence of the underlying bone structure on the skin deformation. For example, it is expected that a skin patch under a fixed strain will nonetheless change shape as it slides over different bony facial features. The thin shell model does not account for such effects, and they therefore are preferably accounted for by the PDM.
An absolute strain formulation for $E$ was used rather than the more traditional relative strain because the units of absolute strain are distance, which facilitates common analysis with the positional offsets $O_i$. In addition, bending and shear strains were neglected in the described implementation/embodiment(s). Over the restricted domain of facial motion, the five dimensions that were analyzed can function as effective proxies for the omitted dimensions. Examples of $d_1$ and $d_2$ evaluated over the face for different facial expressions can be seen in the accompanying drawings.
Optimal polynomial coefficients for Eq. 1 can be calculated at each texture point using the measured displacement values and the derived deformation input parameters. Given the sequence of measured displacement coordinate values $f_t$ at a point, the PDM coefficients can be computed as the least-squares solution to the system of equations obtained by evaluating Eq. 1 at the deformation parameters $(d_{1,t}, d_{2,t})$ of each training frame $t$:

$$a_0\,d_{1,t}^2 + a_1\,d_{2,t}^2 + a_2\,d_{1,t} d_{2,t} + a_3\,d_{1,t} + a_4\,d_{2,t} + a_5 = f_t.$$
Regularization terms can be included to account for the possibility that one or both of the input parameters may not have exhibited sufficient variation in the training set, which could make recovery of the non-constant coefficients of the PDM unstable. It was found by the inventors that the regularization was effective for low values of the regularization constant γ, such that no degradation in the fidelity of the fitting was noticeable.
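By way of non-limiting illustration, the per-texel fit can be sketched as a small regularized least-squares problem; the exact form of the regularization on the non-constant coefficients is an assumption consistent with the description above.

```python
import numpy as np

def fit_pdm_coefficients(d1, d2, f, gamma=1e-3):
    """Least-squares fit of the six PDM coefficients a0..a5 at one texel.

    d1, d2 : (T,) coarse deformation parameters for the T training frames
    f      : (T,) measured displacement values at this texel
    gamma  : regularization constant on the non-constant coefficients
    """
    A = np.column_stack([d1**2, d2**2, d1 * d2, d1, d2, np.ones_like(d1)])
    # Tikhonov rows keep a0..a4 stable when a parameter shows little
    # variation in the training set; the constant term a5 is left free.
    reg = gamma * np.eye(6)
    reg[5, 5] = 0.0
    A_aug = np.vstack([A, reg])
    f_aug = np.concatenate([f, np.zeros(6)])
    coeffs, *_ = np.linalg.lstsq(A_aug, f_aug, rcond=None)
    return coeffs
```

For the medium-scale PDM, this fit is simply repeated for each displacement coordinate.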
In exemplary embodiments, two deformation-driven PDMs can be recovered for each subject: one for medium-scale 3D displacement, e.g., at 512×512 pixel resolution, and one for fine-scale 1D displacement normal to the mesh, e.g., at 1024×1024 resolution. For the medium-scale displacement, each coordinate of displacement can be fit independently, yielding 18 total PDM coefficients. A deformation-driven PTM can be fit to the time-varying diffuse albedo measurements, yielding an additional 18 PTM coefficients.
A comparison was made of the results obtained using biquadratic PDMs with those obtainable from linear PDMs and bicubic PDMs. An error comparison 500 for these different cases (columns a-c) is shown in the accompanying drawings.
Once training data has been captured and deformation-driven PDMs have been derived, highly detailed models can be synthesized according to a performance recorded with standard facial motion capture. In this work, the same markers used in the training sequences can be used to record novel facial performances not in the training set. This makes synthesis and rendering of detailed facial geometry for each frame a straightforward process: (1) deform the neutral mesh to the current motion capture marker positions using a linear thin shell deformation; (2) compute the coarse deformation parameters $d_1$ and $d_2$ from the marker motion and interpolate them over the mesh; (3) evaluate the medium-scale PDM and add the resulting 3D displacements to the mesh vertices; (4) evaluate the fine-scale PDM and apply it as a displacement map along the surface normals; and (5) evaluate the reflectance PTM to obtain a diffuse albedo map and render the result.
For embodiments providing for off-line rendering, steps 1-5, supra, can be performed on/with a suitable CPU. For embodiments providing for real-time rendering, the linear thin shell mesh can be generated on/with a CPU at a reduced vertex count, e.g., 10 k (versus 200 k for the off-line rendering system), to maintain/facilitate a frame rate of 20+ fps; a suitable system can include an Intel Pentium 4 Xeon and an nVidia 8800GTS. Next, the two PDMs can be evaluated using the GPU and the displacement can be added to the thin shell mesh. Examples of synthesized displacement maps, as well as a synthesized diffuse albedo map, can be seen in the accompanying drawings.
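To make the displacement step concrete, the following is a minimal CPU sketch of applying an evaluated fine-scale PDM along the vertex normals; on the GPU this would instead be done in a shader as displacement mapping, and the nearest-texel sampling and array names here are simplifying assumptions.

```python
import numpy as np

def apply_fine_displacement(verts, normals, uv, coeffs, d1_map, d2_map):
    """Displace vertices along their normals by the fine-scale PDM value.

    verts, normals : (V, 3); uv : (V, 2) texture coordinates in [0, 1)
    coeffs         : (H, W, 6) fine-scale PDM coefficient maps
    d1_map, d2_map : (H, W) deformation parameters in texture space
    """
    H, W = d1_map.shape
    px = np.clip((uv[:, 0] * W).astype(int), 0, W - 1)
    py = np.clip((uv[:, 1] * H).astype(int), 0, H - 1)
    a = coeffs[py, px]                              # (V, 6) nearest texel
    d1, d2 = d1_map[py, px], d2_map[py, px]
    disp = (a[:, 0] * d1**2 + a[:, 1] * d2**2 + a[:, 2] * d1 * d2
            + a[:, 3] * d1 + a[:, 4] * d2 + a[:, 5])
    return verts + disp[:, None] * normals
```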
In practice, the performance data can be captured as desired, e.g., at a different time than the training data. Consequently, this could involve a new application of the motion capture markers. One industry-standard technique for obtaining maximally consistent marker placements uses a thin plastic mold of the actor's face with small holes drilled into it to guide the marker placement, usually to within a millimeter of repeatability. Accommodating marker placements with greater deviation could require a remapping step to evaluate the PDM as follows: First, the new motion capture markers (observed in a neutral position) can be mapped onto the neutral mesh acquired during training. Second, if the density of motion capture markers is different, it is preferable to scale the values $E_i^u(t)$ and $E_i^v(t)$ accordingly. Because such mappings and corrections can involve some error, optimal results may be obtained by using approximately the same marker locations during performance capture and training.
Results from an Exemplary Embodiment
For a demonstration of exemplary embodiments, facial performances of two subjects were captured (recorded). For each subject, the six training expressions shown in the accompanying drawings were captured and used to derive the subject's deformation-driven PDMs.
Several facial performances for each subject were then captured. Although the idea is that these performances need only consist of motion capture marker motion, real-time high-resolution face scans were continuously acquired to serve as "ground truth" validation data for the synthesized sequences. The derived PDMs were used to reconstruct sequences that were part of the training set (e.g., as shown in the accompanying drawings).
In designing the expression set, expressions were not broken down into individual facial action units. The fitting process inherently segments the captured data into usable sub-expressions by choosing different PCA parameters for different facial regions (e.g., as shown in the accompanying drawings).
The low-frequency deformation parameters can also be used to model and synthesize other attributes such as facial reflectance. In addition to displacement, a three-channel PTM was fit to the dynamic surface reflectance recorded by the video cameras, parameterized by the same facial deformation space. Semiautomatic dot removal was performed, e.g., as in [Guenter et al. 1998], to create a clean texture, though some black smudges remained in the images. The bottom two rows of the accompanying drawings illustrate these results.
Accordingly, deformation-driven polynomial displacement maps have been described, along with embodiments demonstrating their application in modeling and synthesizing dynamic high-resolution geometry for facial animation. A high-resolution real-time 3D geometry acquisition system was built that is capable of capturing facial performances at the level of wrinkle and pore details. Furthermore, deformation-driven polynomial displacement maps, a novel compact representation for facial deformation, were presented. This compact representation was demonstrated to provide a high level of visual fidelity, comparable to that achievable with hardware-intensive real-time scanning techniques.
Finally, the deformation-driven PDMs were shown to be well suited to synthesizing new expressions that are not part of the original training dataset, using only the motion capture marker positions of the new facial expression. The techniques yield accurate reconstructions of medium-scale and fine-scale geometry over most of the face.
While certain embodiments have been described herein, it will be understood by one skilled in the art that the methods, systems, and apparatus of the present disclosure may be embodied in other specific forms without departing from the spirit thereof. For example, while PDMs have been described in the context of second-order or bi-quadratic equations, PDMs may be of different order, e.g., cubic, etc., within the scope of the present disclosure. Accordingly, the embodiments described herein, and as claimed in the attached claims, are to be considered in all respects as illustrative of the present disclosure and not restrictive.
This application claims the benefit of U.S. Provisional Patent Application No. 61/025,173, entitled “Facial Performance Synthesis Using Deformation-Driven Polynomial Displacement Maps,” filed 31 Jan. 2008, the entire contents of which are incorporated herein by reference.
This invention was made with government support under Contract No. W911NF-04-D0005 awarded by the National Science Foundation. The government has certain rights in the invention.