This patent document contains material subject to copyright protection. The copyright owner has no objection to the reproduction of this patent document or any related materials in the files of the United States Patent and Trademark Office, but otherwise reserves all copyrights whatsoever.
In recent years, several approaches have been proposed for facial expression re-targeting, aimed at transferring facial expressions captured from a real subject to a virtual CG avatar. Facial reenactment goes one step further by transferring the captured source expressions to a different, real actor, such that the new video shows the target actor reenacting the source expressions photo-realistically. Reenactment is a far more challenging task than expression re-targeting, as even the slightest errors in the transferred expressions and appearance will be noticed by a human user as inconsistencies with the surrounding video. Most methods for facial reenactment proposed so far work offline, and only a few of them produce results that are close to photo-realistic [DALE, K., SUNKAVALLI, K., JOHNSON, M. K., VLASIC, D., MATUSIK, W., AND PFISTER, H. 2011. Video face replacement. ACM TOG 30, 6, 130; GARRIDO, P., VALGAERTS, L., REHMSEN, O., THORMAEHLEN, T., PEREZ, P., AND THEOBALT, C. 2014. Automatic face reenactment. In Proc. CVPR].
However, new applications require real-time processing, e.g. a multilingual video-conferencing setup in which the video of one participant may be altered in real time to photo-realistically reenact the facial expression and mouth motion of a real-time translator. Application scenarios reach even further, as photo-realistic reenactment enables the real-time manipulation of facial expression and motion in videos, while making it challenging to detect that the video input is spoofed.
These objects are achieved by a method and a device according to the independent claims. Advantageous embodiments are defined in the dependent claims.
By providing a separate representation of an identity/geometric shape and an expression of a human face, the invention allows re-enacting a facial expression without changing the identity.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
These and other aspects of the invention will be more readily understood when considering the following description of detailed embodiments of the invention, in connection with the drawing, in which
To synthesize and render new human facial imagery according to a first embodiment of the invention, a parametric 3D face model is used as an intermediary representation of facial identity, expression, and reflectance. This model also acts as a prior for facial performance capture, rendering it more robust with respect to noisy and incomplete data. In addition, the environment lighting is modeled to estimate the illumination conditions in the video. Both of these models together allow for a photo-realistic re-rendering of a person's face with different expressions under general unknown illumination.
As a face prior, a linear parametric face model Mgeo(α,δ) is used which embeds the vertices vi∈ℝ3, i∈{1, . . . , n} of a generic face template mesh in a lower-dimensional subspace. The template is a manifold mesh defined by the set of vertex positions V=[vi] and corresponding vertex normals N=[ni], with |V|=|N|=n. Mgeo(α,δ) parameterizes the face geometry by means of a set of dimensions encoding the identity with weights α and a set of dimensions encoding the facial expression with weights δ. In addition to the geometric prior, a prior is also used for the skin albedo Malb(β), which reduces the set of vertex albedos of the template mesh C=[ci], with ci∈ℝ3 and |C|=n, to a linear subspace with weights β. More specifically, the parametric face model according to the first embodiment is defined by the following linear combinations
Mgeo(α,δ)=aid+Eidα+Eexpδ, (1)
Malb(β)=aalb+Ealbβ. (2)
Here Mgeo∈ℝ3n and Malb∈ℝ3n contain the n vertex positions and vertex albedos, respectively, while the columns of the matrices Eid, Eexp, and Ealb contain the basis vectors of the linear subspaces. The vectors α, δ and β control the identity, the expression and the skin albedo of the resulting face, and aid and aalb represent the mean identity shape at rest and the mean skin albedo. While vi and ci are defined by a linear combination of basis vectors, the normals ni can be derived as the cross product of the partial derivatives of the shape with respect to a (u, v)-parameterization.
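For illustration, a minimal Python/numpy sketch of evaluating this linear model is given below; the array names and dimensions are assumptions chosen to match Eqs. (1) and (2), and the actual embodiment evaluates the model on the GPU.

```python
import numpy as np

def evaluate_face_model(a_id, E_id, E_exp, a_alb, E_alb, alpha, delta, beta):
    """Evaluate the linear face model of Eqs. (1) and (2).

    a_id, a_alb : mean identity shape / mean albedo, shape (3n,)
    E_id, E_exp : identity and expression bases, shape (3n, 160) and (3n, 76)
    E_alb       : albedo basis, shape (3n, 160)
    alpha, delta, beta : weights for identity, expression and albedo
    Returns per-vertex positions and albedos as (n, 3) arrays.
    """
    M_geo = a_id + E_id @ alpha + E_exp @ delta   # Eq. (1)
    M_alb = a_alb + E_alb @ beta                  # Eq. (2)
    return M_geo.reshape(-1, 3), M_alb.reshape(-1, 3)
```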
The face model is built once in a pre-computation step. For the identity and albedo dimensions, one may use the morphable model of BLANZ, V., AND VETTER, T. 1999. A morphable model for the synthesis of 3d faces. In Proc. SIGGRAPH, ACM Press/Addison-Wesley Publishing Co., 187-194. This model has been generated by non-rigidly deforming a face template to 200 high-quality scans of different subjects using optical flow and a cylindrical parameterization. It is assumed that the distribution of scanned faces is Gaussian, with a mean shape aid, a mean albedo aalb, and standard deviations σid and σalb. The first 160 principal directions are used to span the space of plausible facial shapes with respect to the geometric embedding and skin reflectance. Facial expressions are added to the identity model by transferring the displacement fields of two existing blend shape rigs by means of deformation transfer [SUMNER, R. W., AND POPOVIĆ, J. 2004. Deformation transfer for triangle meshes. ACM TOG 23, 3, 399-405]. The used blend shapes have been created manually [ALEXANDER, O., ROGERS, M., LAMBETH, W., CHIANG, M., AND DEBEVEC, P. 2009. The Digital Emily Project: photoreal facial modeling and animation. In ACM SIGGRAPH Courses, ACM, 12:1-12:15] or by non-rigid registration to captured scans [CAO, C., WENG, Y., LIN, S., AND ZHOU, K. 2013. 3D shape regression for real-time facial animation. ACM TOG 32, 4, 41]. The space of plausible expressions is parameterized by 76 blend shapes, which turned out to be a good trade-off between computational complexity and expressibility. The identity is parameterized in PCA space with linearly independent components, while the expressions are represented by blend shapes that may be overcomplete.
To model the illumination, it is assumed that the lighting is distant and that the surfaces in the scene are predominantly Lambertian. This allows the use of a Spherical Harmonics (SH) basis [MUELLER, C. 1966. Spherical harmonics. Springer. PIGHIN, F., AND LEWIS, J. 2006. Performance-driven facial animation. In ACM SIGGRAPH Courses] for a low dimensional representation of the incident illumination.
Following RAMAMOORTHI, R., AND HANRAHAN, P. 2001. A signal-processing framework for inverse rendering. In Proc. SIGGRAPH, ACM, 117-128, the irradiance at a vertex with normal n and scalar albedo c is represented using b=3 bands of SHs for the incident illumination:
L(γ,n,c)=c·Σk=1..b2 γk yk(n), (3)
with yk being the k-th SH basis function and γ=(γ1, . . . , γb2) the b2=9 SH coefficients, estimated per color channel.
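A possible per-vertex shading evaluation under these assumptions is sketched below; the constants are the standard real SH basis coefficients for the first three bands, and the per-channel layout of γ is an assumption consistent with the 27 illumination parameters used later.

```python
import numpy as np

def sh_basis(n):
    """First b = 3 bands (9 functions) of the real spherical harmonics,
    evaluated at the unit normal n = (x, y, z)."""
    x, y, z = n
    return np.array([
        0.282095,                                   # band 0
        0.488603 * y, 0.488603 * z, 0.488603 * x,   # band 1
        1.092548 * x * y, 1.092548 * y * z,         # band 2
        0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z, 0.546274 * (x * x - y * y),
    ])

def irradiance(gamma, n, c):
    """Lambertian irradiance as in Eq. (3): albedo c times the SH-expanded
    incident illumination; gamma holds 9 coefficients per color channel."""
    y = sh_basis(n)                        # (9,)
    return c * (gamma.reshape(3, 9) @ y)   # shaded RGB value at the vertex
```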
In order to represent the head pose and the camera projection onto the virtual image plane, the origin and the axes of the world coordinate frame are anchored to the RGB-D sensor, while the camera is assumed to be calibrated. The model-to-world transformation for the face is then given by Φ(v)=Rv+t, where R is a 3×3 rotation matrix and t∈ℝ3 a translation vector. R is parameterized using Euler angles and, together with t, represents the 6-DOF rigid transformation that maps the vertices of the face between the local coordinates of the parametric model and the world coordinates. The known intrinsic camera parameters define a full perspective projection Π that transforms the world coordinates to image coordinates. With this, one may define an image formation model S(P), which allows synthetic views of virtual faces to be generated, given the parameters P that govern the structure of the complete scene:
P=(α,β,δ,γ,R,t), (4)
with p=160+160+76+27+3+3=429 being the total number of parameters. The image formation model enables the transfer of facial expressions between different persons, environments and viewpoints, but in order to manipulate a given video stream of a face, one first needs to determine the parameters P that faithfully reproduce the observed face in each RGB-D input frame.
For the simultaneous estimation of the identity, facial expression, skin albedo, scene lighting, and head pose, the image formation model S(P) is fitted to the input of a commodity RGB-D camera recording an actor's performance. In order to obtain the best fitting parameters P that explain the input in real-time, an analysis-through-synthesis approach is used, where the image formation model is rendered for the old set of (potentially non-optimal) parameters and P is further optimized by comparing the rendered image to the captured RGB-D input. An overview of the fitting pipeline is shown in
The input for the facial performance capture system is provided by an RGB-D camera and consists of the measured input color sequence CI and depth sequence XI. It is assumed that the depth and color data are aligned in image space and can be indexed by the same pixel coordinates; i.e., the color and back-projected 3D position at an integer pixel location p=(i,j) are given by CI(p)∈ℝ3 and XI(p)∈ℝ3, respectively. The range sensor implicitly provides a normal field NI, where NI(p)∈ℝ3 is obtained as the cross product of the partial derivatives of XI with respect to the continuous image coordinates.
The image formation model S(P), which generates a synthetic view of the virtual face, is implemented by means of the GPU rasterization pipeline. Apart from efficiency, this allows the problem to be formulated in terms of 2D image arrays, which is the native data structure for GPU programs. The rasterizer generates a fragment per pixel p if a triangle is visible at its location and barycentrically interpolates the vertex attributes of the underlying triangle. The output of the rasterizer is the synthetic color CS, the 3D position XS and the normal NS at each pixel p. Note that CS(p), XS(p), and NS(p) are functions of the unknown parameters P. The rasterizer also writes out the barycentric coordinates of the pixel and the indices of the vertices in the covering triangle, which is required to compute the analytical partial derivatives with respect to P.
From now on, only pixels belonging to the set V of pixels for which both the input and the synthetic data is valid are considered.
The problem of finding the virtual scene that best explains the input RGB-D observations may be cast as an unconstrained energy minimization problem in the unknowns P. To this end, an energy may be formulated that can be robustly and efficiently minimized:
E(P)=Eemb(P)+ωcolEcol(P)+ωlanElan(P)+ωregEreg(P). (5)
The design of the objective takes the quality of the geometric embedding Eemb, the photo-consistency of the re-rendering Ecol, the reproduction of a sparse set of facial feature points Elan, and the geometric faithfulness of the synthesized virtual head Ereg into account. The weights ωcol, ωlan, and ωreg compensate for different scaling of the objectives. They have been empirically determined and are fixed for all shown experiments.
The reconstructed geometry of the virtual face should match the observations captured by the input depth stream. To this end, one may define a measure that quantifies the discrepancy between the rendered synthetic depth map and the input depth stream:
Eemb(P)=ωpointEpoint(P)+ωplaneEplane(P). (6)
The first term minimizes the sum of the projective Euclidean point-to-point distances for all pixels in the visible set V:
Epoint(P)=Σp∈V∥dpoint(p)∥22, (7)
with dpoint(p)=XS(p)−XI(p) the difference between the measured 3D position and the 3D model point. To improve robustness and convergence, one may also use a first-order approximation of the surface-to-surface distance [CHEN, Y., AND MEDIONI, G. G. 1992. Object modelling by registration of multiple range images. Image and Vision Computing 10, 3, 145-155]. This is particularly relevant for purely translational motion, where a point-to-point metric alone would fail. To this end, one measures the symmetric point-to-plane distance from model to input and input to model at every visible pixel:
Eplane(P)=Σp∈V[dplane2(NS(p),p)+dplane2(NI(p),p)], (8)
with dplane(n,p)=nTdpoint(p) the distance between the 3D point XS(p) or XI(p) and the plane defined by the normal n.
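The per-pixel geometric residuals of Eqs. (7) and (8) could be assembled as in the following sketch; stacking them with √ωpoint and √ωplane so that their squared sum reproduces Eq. (6) is an assumption about how the terms enter the solver.

```python
import numpy as np

def geometric_residuals(X_S, X_I, N_S, N_I, w_point, w_plane):
    """Depth residuals for all visible pixels.

    X_S, X_I : synthesized / measured 3D positions, shape (|V|, 3)
    N_S, N_I : synthesized / measured unit normals,  shape (|V|, 3)
    Returns 3 point-to-point and 2 symmetric point-to-plane residuals per pixel.
    """
    d_point = X_S - X_I                         # Eq. (7) residuals, 3 per pixel
    d_plane_S = np.sum(N_S * d_point, axis=1)   # model-to-input plane distance
    d_plane_I = np.sum(N_I * d_point, axis=1)   # input-to-model plane distance
    return np.concatenate([
        np.sqrt(w_point) * d_point.ravel(),
        np.sqrt(w_plane) * d_plane_S,
        np.sqrt(w_plane) * d_plane_I,
    ])
```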
In addition to the face model being metrically faithful, one may require that the RGB images synthesized using the model are photo-consistent with the given input color images. Therefore, one minimizes the difference between the input RGB image and the rendered view for every pixel p∈V:
Ecol(P)=Σp∈V∥CS(p)−CI(p)∥22, (9)
where CS(p) is the illuminated (i.e., shaded) color of the synthesized model. The color consistency objective introduces a coupling between the geometry of the template model, the per-vertex skin reflectance map and the SH illumination coefficients. It is directly induced by the illumination model L used.
The face includes many characteristic features, which can be tracked more reliably than other points. In addition to the dense color consistency metric, one therefore tracks a set of sparse facial landmarks in the RGB stream using a state-of-the-art facial feature tracker [SARAGIH, J. M., LUCEY, S., AND COHN, J. F. 2011. Deformable model fitting by regularized landmark mean-shift. IJCV 91, 2, 200-215]. Each detected feature fj=(uj, vj) is a 2D location in the image domain that corresponds to a consistent 3D vertex vj in the geometric face model. If F is the set of detected features in each RGB input frame, one may define a metric that enforces facial features in the synthesized views to be close to the detected features:
Elan(P)=Σfj∈F ωconf,j∥fj−Π(Φ(vj))∥22, (10)
The present embodiment uses 38 manually selected landmark locations concentrated in the mouth, eye, and nose regions of the face. Features are pruned based on their visibility in the last frame, and a confidence ωconf,j is assigned based on each feature's trustworthiness. This makes it possible to effectively prune wrongly classified features, which are common under large head rotations (>30°).
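A sketch of the landmark term of Eq. (10), assuming a standard pinhole intrinsic matrix K and the rigid transform Φ(v)=Rv+t described above; the variable names are illustrative.

```python
import numpy as np

def landmark_energy(features, conf, V_model, R, t, K):
    """Sparse feature alignment term of Eq. (10).

    features : detected 2D landmarks f_j, shape (38, 2)
    conf     : confidences w_conf,j, shape (38,)
    V_model  : corresponding model vertices v_j, shape (38, 3)
    R, t, K  : rotation (3, 3), translation (3,), camera intrinsics (3, 3)
    """
    X_world = V_model @ R.T + t              # Phi(v) = R v + t
    x_hom = X_world @ K.T                    # full perspective projection Pi
    proj = x_hom[:, :2] / x_hom[:, 2:3]      # dehomogenize to pixel coordinates
    diff = features - proj
    return np.sum(conf * np.sum(diff ** 2, axis=1))
```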
The final component of the objective function is a statistical regularization term that expresses the likelihood of observing the reconstructed face, and keeps the estimated parameters within a plausible range. Under the assumption of Gaussian distributed parameters, the interval [−3σ•,i,+3σ•,i] contains ≈99% of the variation in human faces that can be reproduced by the model. To this end, the model parameters α, β and δ are constrained to be statistically small compared to their standard deviation:
Ereg(P)=Σi(αi/σid,i)2+Σi(βi/σalb,i)2+Σi(δi/σexp,i)2. (11)
For the shape and reflectance parameters, σid,i and σalb,i are computed from the 200 high-quality scans. For the blend shape parameters, σexp,i may be fixed to 1.
In order to minimize the proposed energy, one needs to compute the analytical derivatives of the synthetic images with respect to the parameters P. This is non-trivial, since a derivation of the complete transformation chain in the image formation model is required. To this end, one also emits the barycentric coordinates during rasterization at every pixel in addition to the indices of the vertices of the underlying triangle. Differentiation of S(P) starts with the evaluation of the face model Mgeo and Malb, the transformation to world space via Φ, the illumination of the model with the lighting model L, and finally the projection to image space via Π. The high number of involved rendering stages leads to many applications of the chain rule and results in high computational costs.
The proposed energy E(P): ℝp→ℝ of Eq. (5) is non-linear in the parameters P, and finding the best set of parameters P* amounts to solving a non-linear least-squares problem in the p unknowns:
P*=argminP E(P). (12)
Even at the moderate image resolutions used in this embodiment (640×480), the energy gives rise to a considerable number of residuals: each visible pixel p∈V contributes 8 residuals (3 from the point-to-point term of Eq. (7), 2 from the point-to-plane term of Eq. (8) and 3 from the color term of Eq. (9)), while the feature term of Eq. (10) contributes 2·38 residuals and the regularizer of Eq. (11) p−33 residuals. The total number of residuals is thus m=8|V|+76+p−33, which can amount to about 180 K equations for a close-up frame of the face. To minimize a non-linear objective with such a high number of residuals in real-time, a data-parallel GPU-based Gauss-Newton solver is proposed that leverages the high computational throughput of modern graphics cards and exploits smart caching to minimize the number of global memory accesses.
The non-linear least-squares energy E(P) is minimized in a Gauss-Newton framework by reformulating it in terms of its residual r: ℝp→ℝm, with r(P)=(r1(P), . . . , rm(P))T. If it is assumed that one already has an approximate solution Pk, one seeks a parameter increment ΔP that minimizes the first-order Taylor expansion of r(P) around Pk. So one may approximate
E(Pk+ΔP)≈∥r(Pk)+J(Pk)ΔP∥22, (13)
for the update ΔP, with J(Pk) the m×p Jacobian of r(Pk) in the current solution. The corresponding normal equations are
JT(Pk)J(Pk)ΔP=−JT(Pk)r(Pk), (14)
and the parameters are updated as Pk+1=Pk+ΔP. The normal equations are solved iteratively using a preconditioned conjugate gradient (PCG) method, thus allowing for efficient parallelization on the GPU (in contrast to a direct solve). Moreover, the normal equations need not be solved until convergence, since the PCG step only appears as the inner loop (analysis) of a Gauss-Newton iteration. In the outer loop (synthesis), the face is re-rendered and the Jacobian is recomputed using the updated barycentric coordinates. Jacobi preconditioning is used, where the inverses of the diagonal elements of JTJ are computed in the initialization stage of the PCG.
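A dense CPU sketch of this Gauss-Newton/PCG scheme is given below for illustration; the iteration counts and the dense numpy Jacobian are simplifications of the data-parallel GPU solver described in the text.

```python
import numpy as np

def gauss_newton_pcg(residual, jacobian, P, outer_iters=7, pcg_iters=4):
    """Dense illustration of the Gauss-Newton/PCG scheme of Eqs. (13)-(14).
    residual(P) returns r in R^m, jacobian(P) returns J in R^(m x p).
    The iteration counts are placeholders; J^T J is never formed explicitly."""
    for _ in range(outer_iters):              # outer loop (synthesis): re-render, rebuild J
        r = residual(P)
        J = jacobian(P)
        b = -J.T @ r                          # right-hand side of Eq. (14)
        M_inv = 1.0 / np.maximum((J * J).sum(axis=0), 1e-12)  # Jacobi preconditioner: 1/diag(J^T J)
        dP = np.zeros_like(P)
        res = b.copy()                        # residual of the normal equations (dP = 0)
        z = M_inv * res
        d = z.copy()
        rz = res @ z
        for _ in range(pcg_iters):            # inner loop (analysis): PCG on J^T J dP = b
            Ad = J.T @ (J @ d)                # two matrix-vector products instead of forming J^T J
            alpha = rz / (d @ Ad)
            dP += alpha * d
            res -= alpha * Ad
            z = M_inv * res
            rz_new = res @ z
            d = z + (rz_new / rz) * d
            rz = rz_new
        P = P + dP                            # update P_{k+1} = P_k + dP
    return P
```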
Convergence may be accelerated by embedding the energy minimization in a multi-resolution coarse-to-fine framework. To this end, one successively blurs and resamples the input RGB-D sequence using a Gaussian pyramid with 3 levels and applies the image formation model on the same reduced resolutions. After finding the optimal set of parameters on the current resolution level, a prolongation step transfers the solution to the next finer level to be used as an initialization there.
The normal equations (14) are solved using a novel data-parallel PCG solver that exploits smart caching to speed up the computation. The most expensive task in each PCG step is the multiplication of the system matrix JT J with the previous descent direction. Precomputing JT J would take O(n3) time in the number of Jacobian entries and would be too costly for real-time performance, so instead one applies J and JT in succession. For the present problem J is block-dense because all parameters, except for β and γ, influence each residual (see
The key idea to adapting the parallel PCG solver to deal with a dense Jacobian is to write the derivatives of each residual in global memory, while pre-computing the right-hand side of the system. Since all derivatives have to be evaluated at least once in this step, this incurs no computational overhead. J, as well as JT, are written to global memory to allow for coalesced memory access later on when multiplying the Jacobian and its transpose in succession. This strategy allows to better leverage texture caches and burst load of data on modern GPUs. Once the derivatives have been stored in global memory, the cached data can be reused in each PCG iteration by a single read operation.
The convergence rate of this data-parallel Gauss-Newton solver for different types of facial performances is visualized in
As it is assumed that facial identity and reflectance for an individual remain constant during facial performance capture, one does not optimize for the corresponding parameters on-the-fly. Both are estimated in an initialization step by running the optimizer on a short control sequence of the actor turning his head under constant illumination.
In this step, all parameters are optimized and the estimated identity and reflectance are fixed for subsequent capture. The face does not need to be in rest for the initialization phase and convergence is usually achieved between 5 and 10 frames.
For the fixed reflectance, one does not use the values given by the linear face model, but may compute a more accurate skin albedo by building a skin texture for the face and dividing it by the estimated lighting to correct for the shading effects. The resolution of this texture is much higher than the vertex density for improved detail (2048×2048 in the experiments); the texture is generated by combining three camera views (front, 20° left and 20° right) using pyramid blending [ADELSON, E. H., ANDERSON, C. H., BERGEN, J. R., BURT, P. J., AND OGDEN, J. M. 1984. Pyramid methods in image processing. RCA engineer 29, 6, 33-41]. The final high-resolution albedo map is used for rendering.
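The de-lighting step can be sketched as a simple per-texel division, where the shading map is assumed to have been rendered from the estimated SH illumination and the model normals (e.g., with a routine like the irradiance sketch above, using unit albedo).

```python
import numpy as np

def correct_albedo(texture_rgb, shading_rgb, eps=1e-4):
    """Divide the blended skin texture by the predicted shading so that
    (approximately) only the albedo remains.  Inputs are (H, W, 3) float
    arrays in [0, 1]; the clamp eps is an assumption to avoid division by zero."""
    albedo = texture_rgb / np.maximum(shading_rgb, eps)
    return np.clip(albedo, 0.0, 1.0)
```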
The real-time capture of identity, reflectance, facial expression, and scene lighting, opens the door for a variety of new applications. In particular, it enables on-the-fly control of an actor in a target video by transferring the facial expressions from a source actor, while preserving the target identity, head pose, and scene lighting. Such face reenactment, for instance, can be used for video-conferencing, where the facial expression and mouth motion of a participant are altered photo-realistically and instantly by a real-time translator or puppeteer behind the scenes.
To perform live face reenactment, a setup is built consisting of two RGB-D cameras, each connected to a computer with a modern graphics card (see
A new performance for the target actor is synthesized by applying the 76 captured blend shape parameters of the source actor to the personalized target model for each frame of the target video. Since the source and target actor are tracked using the same parametric face model, the new target shapes can be easily expressed as
Mgeo(αt,δs)=aid+Eidαt+Eexpδs, (15)
where αt are the target identity parameters and δs the source expressions. This transfer influences neither the target identity nor the rigid head motion and scene lighting, which are all preserved. Since identity and expression are optimized separately for each actor, the blend shape activation might differ across individuals. In order to account for person-specific offsets, the blend shape response for the neutral expression is subtracted prior to transfer.
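A sketch of this transfer step is shown below; re-anchoring the relative expression at the target's own neutral response is one plausible reading of the person-specific offset compensation and should be taken as an assumption.

```python
import numpy as np

def transfer_expression(delta_src, delta_src_neutral, delta_tgt_neutral,
                        a_id, E_id, E_exp, alpha_tgt):
    """Apply the source expression to the target identity as in Eq. (15),
    after removing the person-specific neutral offset."""
    delta_rel = delta_src - delta_src_neutral             # source expression relative to its neutral
    delta_tgt = delta_tgt_neutral + delta_rel             # re-anchored at the target's neutral (assumption)
    M_geo = a_id + E_id @ alpha_tgt + E_exp @ delta_tgt   # Eq. (15)
    return M_geo.reshape(-1, 3)
```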
After transferring the blend shape parameters, the synthetic target geometry is rendered back into the original sequence using the target albedo and estimated target lighting as explained above.
Fine-scale transient skin detail, such as wrinkles and folds that appear and disappear with changing expression, is not part of the face model, but is important for a realistic re-rendering of the synthesized face. To include dynamic skin detail in the reenactment pipeline, wrinkles are modeled in the image domain and transferred from the source to the target actor. The wrinkle pattern of the source actor is extracted by building a Laplacian pyramid of the input source frame. Since the Laplacian pyramid acts as a band-pass filter on the image, the finest pyramid level will contain most of the high-frequency skin detail. The same decomposition is performed for the rendered target image, and the source detail level is copied to the target pyramid using the texture parameterization of the model. In a final step, the rendered target image is recomposed using the transferred source detail.
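The following sketch illustrates the detail transfer with OpenCV image pyramids; for brevity it assumes both images are already aligned in a common (texture) space, whereas the embodiment maps the detail through the model's texture parameterization.

```python
import cv2
import numpy as np

def laplacian_pyramid(img, levels=3):
    """Laplacian pyramid of an image; level 0 holds the finest detail."""
    gauss = [img.astype(np.float32)]
    for _ in range(levels - 1):
        gauss.append(cv2.pyrDown(gauss[-1]))
    lap = [gauss[i] - cv2.pyrUp(gauss[i + 1], dstsize=gauss[i].shape[1::-1])
           for i in range(levels - 1)]
    lap.append(gauss[-1])
    return lap

def transfer_wrinkles(source_frame, rendered_target, levels=3):
    """Replace the finest Laplacian level of the rendered target by the
    source's finest level (the high-frequency skin detail) and recompose."""
    src = laplacian_pyramid(source_frame, levels)
    tgt = laplacian_pyramid(rendered_target, levels)
    tgt[0] = src[0]                                     # copy high-frequency skin detail
    out = tgt[-1]
    for lvl in reversed(tgt[:-1]):                      # recompose coarse-to-fine
        out = cv2.pyrUp(out, dstsize=lvl.shape[1::-1]) + lvl
    return np.clip(out, 0.0, 1.0)
```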
The face model only represents the skin surface and does not include the eyes, teeth, and mouth cavity. While the eye motion of the underlying video is preserved, the teeth and inner mouth region are re-generated photo-realistically to match the new target expressions.
This is done in a compositing step, where the rendered face is combined with a teeth and inner mouth layer before blending the results in the final reenactment video (see
To render the teeth, two textured 3D proxies (billboards) are used for the upper and lower teeth that are rigged relative to the blend shapes of the face model and move in accordance with the blend shape parameters. Their shape is adapted automatically to the identity by means of anisotropic scaling with respect to a small, fixed number of vertices. The texture is obtained from a static image of an open mouth with visible teeth and is kept constant for all actors.
A realistic inner mouth is created by warping a static frame of an open mouth in image space. The static frame is recorded in the calibration step and is illustrated in
The three image layers, produced by rendering the face and teeth and warping the inner mouth, need to be combined with the original background layer and blended into the target video. Compositing is done by building a Laplacian pyramid of all the image layers and performing blending on each frequency level separately. Computing and merging the Laplacian pyramid levels can be implemented efficiently using mipmaps on the graphics hardware. To specify the blending regions, binary masks are used that indicate where the face or teeth geometry is. These masks are smoothed on successive pyramid levels to avoid aliasing at layer boundaries, e.g., at the transition between the lips, teeth, and inner mouth.
Face reenactment exploits the full potential of the inventive real-time system to instantly change model parameters and produce a realistic live rendering. The same algorithmic ingredients can also be applied in lighter variants of this scenario where one does not transfer model parameters between video streams, but modifies the face and scene attributes for a single actor captured with a single camera. Examples of such an application are face re-texturing and re-lighting in a virtual mirror setting, where a user can apply virtual make-up or tattoos and readily find out how they look under different lighting conditions. This requires adapting the reflectance map and illumination parameters on the spot, which can be achieved with the rendering and compositing components described before. Since one only modifies the skin appearance, the virtual mirror does not require the synthesis of a new mouth cavity and teeth. An overview of this application is shown in
A multi-linear PCA model based on [V. Blanz and T. Vetter. A morphable model for the synthesis of 3d faces. In Proc. SIGGRAPH, pages 187-194. ACM Press/Addison-Wesley Publishing Co., 1999; O. Alexander, M. Rogers, W. Lambeth, M. Chiang, and P. Debevec. The Digital Emily Project: photoreal facial modeling and animation. In ACM SIGGRAPH Courses, pages 12:1-12:15. ACM, 2009; C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou. Facewarehouse: A 3D facial expression database for visual computing. IEEE TVCG, 20(3):413-425, 2014] is used. The first two dimensions represent facial identity, i.e., geometric shape and skin reflectance, and the third dimension controls the facial expression. Hence, a face is parameterized as:
Mgeo(α,δ)=aid+Eid·α+Eexp·δ, (16)
Malb(β)=aalb+Ealb·β. (17)
This prior assumes a multivariate normal probability distribution of shape and reflectance around the average shape aid∈ℝ3n and reflectance aalb∈ℝ3n. The shape Eid∈ℝ3n×80, reflectance Ealb∈ℝ3n×80, and expression Eexp∈ℝ3n×76 bases and the corresponding standard deviations σid∈ℝ80, σalb∈ℝ80, and σexp∈ℝ76 are given. The model has 53 K vertices and 106 K faces. A synthesized image CS is generated through rasterization of the model under a rigid model transformation Φ(v) and the full perspective transformation Π(v). Illumination is approximated by the first three bands of Spherical Harmonics (SH) [23] basis functions, assuming Lambertian surfaces and smooth distant illumination, and neglecting self-shadowing.
Synthesis is dependent on the face model parameters α, β, δ, the illumination parameters γ, the rigid transformation R, t, and the camera parameters K defining Π. The vector of unknowns P is the union of these parameters.
Given a monocular input sequence, all unknown parameters P are reconstructed jointly with a robust variational optimization. The objective is highly non-linear in the unknowns and has the following components:
E(P)=wcolEcol(P)+wlanElan(P)+wregEreg(P).
The data term measures the similarity between the synthesized imagery and the input data in terms of photoconsistency Ecol and facial feature alignment Elan. The likelihood of a given parameter vector P is taken into account by the statistical regularizer Ereg. The weights wcol, wlan, and wreg balance the three different sub-objectives. In all of the experiments, wcol=1, wlan=10, and wreg=2.5·10−5.
In order to quantify how well the input data is explained by a synthesized image, the photometric alignment error may be measured on pixel level:
Ecol(P)=(1/|V|)Σp∈V∥CS(p)−CI(p)∥2,
where CS is the synthesized image, CI is the input RGB image, and p∈V denotes all visible pixel positions in CS. The ℓ2,1-norm [12] is used instead of a least-squares formulation to be robust against outliers: the distance in color space is based on ℓ2, while in the summation over all pixels an ℓ1-norm is used to enforce sparsity.
In addition, feature similarity may be enforced between a set of salient facial feature point pairs detected in the RGB stream:
Elan(P)=(1/|F|)Σfj∈F ωconf,j∥fj−Π(Φ(vj))∥22,
To this end, a state-of-the-art facial landmark tracking algorithm [J. M. Saragih, S. Lucey, and J. F. Cohn. Deformable model fitting by regularized landmark mean-shift. IJCV, 91(2):200-215, 2011] may be employed. Each feature point fj∈F⊂ℝ2 comes with a detection confidence ωconf,j and corresponds to a unique vertex vj∈ℝ3 of the face prior Mgeo(α,δ). This helps avoid local minima in the highly-complex energy landscape of Ecol(P).
Plausibility of the synthesized faces may be enforced based on the assumption of a normally distributed population. To this end, the parameters are enforced to stay statistically close to the mean:
Ereg(P)=Σi=1..80[(αi/σid,i)2+(βi/σalb,i)2]+Σi=1..76(δi/σexp,i)2.
This commonly-used regularization strategy prevents degenerations of the facial geometry and reflectance, and guides the optimization strategy out of local minima.
The proposed robust tracking objective is a general unconstrained non-linear optimization problem. This objective is minimized in real-time using a data-parallel GPU-based Iteratively Reweighted Least Squares (IRLS) solver. The key idea of IRLS is to transform the problem, in each iteration, into a non-linear least-squares problem by splitting the norm into two components:
∥r(P)∥2=(∥r(Pold)∥2)−1·∥r(P)∥22.
Here, r(•) is a general residual and Pold is the solution computed in the last iteration. Thus, the first part is kept constant during one iteration and updated afterwards. Each single iteration step is implemented using the Gauss-Newton approach. A single GN step is taken in every IRLS iteration, and the corresponding system of normal equations JTJδ*=−JTF is solved based on PCG to obtain an optimal linear parameter update δ*. The Jacobian J and the system's right-hand side −JTF are precomputed and stored in device memory for later processing. The multiplication of the old descent direction d with the system matrix JTJ in the PCG solver may be split up into two successive matrix-vector products.
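The reweighting underlying this IRLS scheme can be sketched as follows for the ℓ2,1 color term; the per-pixel weight layout is an assumption.

```python
import numpy as np

def irls_weights(res_color, eps=1e-6):
    """Per-pixel IRLS weights for the l_{2,1} color term: each pixel's 3-vector
    residual at P_old contributes 1/||r_p(P_old)||_2, which is kept constant
    during the following Gauss-Newton step.
    res_color : (|V|, 3) color residuals C_S(p) - C_I(p) evaluated at P_old."""
    norms = np.linalg.norm(res_color, axis=1)
    return 1.0 / np.maximum(norms, eps)

def reweighted_residuals(res_color, weights):
    """Residuals entering the least-squares step: sqrt(w_p) * r_p, so that
    sum_p w_p * ||r_p||_2^2 equals the l_{2,1} cost at the linearization point."""
    return (np.sqrt(weights)[:, None] * res_color).ravel()
```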
In order to include every visible pixel p∈V in CS in the optimization process, all visible pixels in the synthesized image are gathered using a parallel prefix scan. The computation of the Jacobian J of the residual vector F and the gradient JTF of the energy function is then parallelized across all GPU processors. This parallelization is feasible since all partial derivatives and gradient entries with respect to a variable can be computed independently. During evaluation of the gradient, all components of the Jacobian are computed and stored in global memory. In order to evaluate the gradient, a two-stage reduction is used to sum up all local per-pixel gradients. Finally, the regularizer and the sparse feature term are added to the Jacobian and the gradient.
Using the computed Jacobian J and the gradient JT F, the corresponding normal equation JTJΔx=−JTF is solved for the parameter update Δx using a preconditioned conjugate gradient (PCG) method. A Jacobi preconditioner is applied that is precomputed during the evaluation of the gradient. To avoid the high computational cost of JT J, the GPU-based PCG method splits up the computation of JT Jp into two successive matrix-vector products.
In order to increase convergence speed and to avoid local minima, a coarse-to-fine hierarchical optimization strategy is used. During online tracking, only the second and third levels are considered, where one and seven Gauss-Newton steps are run on the respective levels. Within a Gauss-Newton step, four PCG iterations are always run.
The complete framework is implemented using DirectX for rendering and DirectCompute for optimization. The joint graphics and compute capability of DirectX11 enables the processing of rendered images by the graphics pipeline without resource mapping overhead. In the case of an analysis-by-synthesis approach, this is essential to runtime performance, since many rendering-to-compute switches are required.
For the present non-rigid model-based bundling problem, the non-zero structure of the corresponding Jacobian is block dense (cf.
where Jf is the per-frame Jacobian matrix and Ff the corresponding residual vector.
As for the parameter space, another promoter function ψf is introduced that lifts a local residual vector to the global residual vector. In contrast to the parameter promoter function, this function varies in every Gauss-Newton iteration since the number of residuals might change. The computation of JT(P)·J(P)·x is split up into two successive matrix-vector products, where the second multiplication is analogous to the computation of the gradient. The first multiplication is as follows:
Using this scheme, the normal equations can be efficiently solved.
The Gauss-Newton framework is embedded in a hierarchical solution strategy. This hierarchy allows preventing convergence to local minima.
After optimization on a coarse level, the solution is propagated to the next finer level using the parametric face model. In experiments, the inventors used three levels with 25, 5, and 1 Gauss-Newton iterations for the coarsest, the medium and the finest level, respectively, each with 4 PCG steps. The present implementation is not restricted to a specific number k of used key frames. The processing time is linear in the number of key frames. In the experiments, k=6 key frames were used to estimate the identity parameters, resulting in a processing time of a few seconds (˜20 s).
To estimate the identity of the actors in the heavily under-constrained scenario of monocular reconstruction, a non-rigid model-based bundling approach is used. Based on the proposed objective, one jointly estimates all parameters over k key-frames of the input video sequence. The estimated unknowns are the global identity {α,β} and intrinsics K as well as the unknown per-frame pose {δk, Rk, tk}k and illumination parameters {γk}k. A similar data-parallel optimization strategy as proposed for model-to-frame tracking is used, but the normal equations are jointly solved for the entire keyframe set. For the non-rigid model-based bundling problem, the non-zero structure of the corresponding Jacobian is block dense. The PCG solver exploits the non-zero structure for increased performance. Since all keyframes observe the same face identity under potentially varying illumination, expression, and viewing angle, one may robustly separate identity from all other problem dimensions. One may also solve for the intrinsic camera parameters of Π, thus being able to process uncalibrated video footage.
To transfer the expression changes from the source to the target actor while preserving the person-specific characteristics of each actor's expressions, a sub-space deformation transfer technique is used that operates directly in the space spanned by the expression blend shapes. This not only allows for the precomputation of the pseudo-inverse of the system matrix, but also drastically reduces the dimensionality of the optimization problem, allowing for fast real-time transfer rates. Assuming the source identity αS and the target identity αT fixed, the transfer takes as input the neutral δNS, the deformed source δS, and the neutral target δNT expression. The output is the transferred facial expression δT directly in the reduced sub-space of the parametric prior.
One first computes the source deformation gradients Ai∈ℝ3×3 that transform the source triangles from neutral to deformed. The deformed target v̂i=Mi(αT,δT) is then found based on the un-deformed state vi=Mi(αT,δNT) by solving a linear least-squares problem. Let (i0, i1, i2) be the vertex indices of the i-th triangle, V=[vi1−vi0, vi2−vi0] and V̂=[v̂i1−v̂i0, v̂i2−v̂i0], then the optimal unknown target deformation δT is the minimizer of:
E(δT)=Σi=1..|F|∥AiV−V̂∥F2.
This problem can be rewritten in the canonical least-squares form by substitution:
E(δT)=∥AδT−b∥22. (23)
The matrix A∈ℝ6|F|×76 is constant and contains the edge information of the template mesh projected to the expression sub-space. Edge information of the target in neutral expression is included in the right-hand side b∈ℝ6|F|. b varies with δS and is computed on the GPU for each new input frame. The minimizer of the quadratic energy can be computed by solving the corresponding normal equations. Since the system matrix is constant, one may precompute its pseudo-inverse using a singular value decomposition (SVD). Later, the small 76×76 linear system is solved in real-time. No additional smoothness term is needed, since the blend shape model implicitly restricts the result to plausible shapes and guarantees smoothness.
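A compact sketch of this precomputation, using an SVD-based pseudo-inverse; the class interface is illustrative only.

```python
import numpy as np

class SubspaceDeformationTransfer:
    """Precompute the pseudo-inverse of the constant system matrix A (6|F| x 76)
    once, so that each frame only needs one small matrix-vector product."""

    def __init__(self, A):
        U, s, Vt = np.linalg.svd(A, full_matrices=False)   # SVD computed offline
        self.A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T        # Moore-Penrose pseudo-inverse, (76, 6|F|)

    def solve(self, b):
        """Minimizer of ||A delta_T - b||_2^2, Eq. (23), for the current frame."""
        return self.A_pinv @ b                             # transferred expression delta_T, (76,)
```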
In order to synthesize a realistic target mouth region, one retrieves and warps the best matching mouth image from the target actor sequence. It is assumed that sufficient mouth variation is available in the target video. In this way, the appearance of the target mouth is maintained. This leads to much more realistic results than either copying the source mouth region or using a generic 3D teeth proxy.
The inventive approach first finds the best fitting target mouth frame based on a frame-to-cluster matching strategy with a novel feature similarity metric. To enforce temporal coherence, a dense appearance graph is used to find a compromise between the last retrieved mouth frame and the target mouth frame (cf.
The similarity metric according to the present embodiment is based on geometric and photometric features. The used descriptor K={R,δ,F,L} of a frame is composed of the rotation R, expression parameters δ, landmarks F, and a Local Binary Pattern (LBP) L. These descriptors KS are computed for every frame in the training sequence. The target descriptor KT consists of the result of the expression transfer and the LBP of the frame of the driving actor. The distance between a source and a target descriptor is measured as follows:
D(KT,KtS,t)=Dp(KT,KtS)+Dm(KT,KtS)+Da(KT,KtS,t).
The first term Dp measures the distance in parameter space:
Dp(KT,KtS)=∥δT−δtS∥22+∥RT−RtS∥F2.
The second term Dm measures the differential compatibility of the sparse facial landmarks:
Dm(KT,KtS)=Σ(i,j)∈Ω(∥FiT−FjT∥2−∥Fi,tS−Fj,tS∥2)2.
Here Ω is a set of predefined landmark pairs, defining distances such as between the upper and lower lip or between the left and right corner of the mouth. The last term Da is an appearance measurement term composed of two parts:
Da(KT,KtS,t)=Dl(KT,KtS)+ωc(KT,KtS)Dc(τ,t).
τ is the last retrieved frame index used for the reenactment in the previous frame. Dl(KT,KtS) measures the similarity based on LBPs that are compared via a Chi-squared distance. Dc(τ,t) measures the similarity between the last retrieved frame τ and the video frame t based on RGB cross-correlation of the normalized mouth frames. The mouth frames are normalized based on the model's texture parameterization (cf.
Utilizing the proposed similarity metric, one may cluster the target actor sequence into k=10 clusters using a modified k-means algorithm that is based on the pairwise distance function D. For every cluster, one selects the frame with the minimal distance to all other frames within that cluster as a representative. During runtime, one measures the distances between the target descriptor KT and the descriptors of cluster representatives, and chooses the cluster whose representative frame has the minimal distance as the new target frame.
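A k-medoids style sketch of this clustering and retrieval is given below; it takes the pairwise metric D as a callable and, for simplicity, ignores the temporal appearance term Da.

```python
import numpy as np

def cluster_mouth_frames(descriptors, distance, k=10, iters=10, seed=0):
    """Cluster the target sequence under the pairwise metric D and return the
    index of each cluster's representative frame (the frame with minimal
    summed distance to all other frames in its cluster)."""
    rng = np.random.default_rng(seed)
    n = len(descriptors)
    D = np.array([[distance(descriptors[i], descriptors[j]) for j in range(n)]
                  for i in range(n)])                      # pairwise distance table
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(iters):
        labels = np.argmin(D[:, medoids], axis=1)          # assign each frame to nearest representative
        new_medoids = []
        for c in range(len(medoids)):
            members = np.flatnonzero(labels == c)
            if members.size:
                sub = D[np.ix_(members, members)]          # distances within this cluster
                new_medoids.append(members[np.argmin(sub.sum(axis=1))])
        medoids = np.array(new_medoids)
    return medoids

def retrieve_target_frame(target_desc, descriptors, medoids, distance):
    """Choose the cluster whose representative is closest to the target descriptor K_T."""
    dists = [distance(target_desc, descriptors[m]) for m in medoids]
    return int(medoids[int(np.argmin(dists))])
```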
Temporal coherence may be improved by building a fully-connected appearance graph of all video frames. The edge weights are based on the RGB cross-correlation between the normalized mouth frames, the distance in parameter space Dp, and the distance of the landmarks Dm. The graph enables finding an in-between frame that is both similar to the last retrieved frame and to the retrieved target frame (see
Finally, the new output frame is composed by alpha blending between the original video frame, the illumination-corrected, projected mouth frame, and the rendered face model.