The present disclosure relates generally to imagery capture and processing and more particularly to face tracking using captured imagery.
Face tracking allows facial expressions and head movements to be used as an input mechanism for virtual reality and augmented reality systems, thereby supporting a more immersive user experience. A conventional face tracking system captures images and depth data of the user's face and fits a generative model to the captured image or depth data. To fit the model to the captured data, the face tracking system defines and optimizes an energy function to find a minimum that corresponds to the correct face pose. However, conventional face tracking systems typically have accuracy and latency issues that can result in an unsatisfying user experience.
The present disclosure may be better understood, and its numerous features and advantages made apparent to those skilled in the art by referencing the accompanying drawings. The use of the same reference symbols in different drawings indicates similar or identical items.
The following description is intended to convey a thorough understanding of the present disclosure by providing a number of specific embodiments and details involving estimating a pose of a face by fitting a generative face model mesh to a depth map based on vertices of the face model mesh that are estimated to be visible from the point of view of a depth camera. It is understood, however, that the present disclosure is not limited to these specific embodiments and details, which are examples only, and the scope of the disclosure is accordingly intended to be limited only by the following claims and equivalents thereof. It is further understood that one possessing ordinary skill in the art, in light of known systems and methods, would appreciate the use of the disclosure for its intended purposes and benefits in any number of alternative embodiments, depending upon specific design and other needs.
In some embodiments, the face model mesh is parameterized by a set of identity and expression coefficients that indicate how to non-rigidly deform the vertices of the face model mesh to fit the depth map. In some embodiments, the face tracking module bicubically interpolates the depth map to smooth intersections at pixel boundaries. The face tracking module adjusts the identity and expression coefficients to better match the depth map. The face tracking module then minimizes an energy function based on the distance of each visible vertex of the face model mesh to the depth map to identify the face model mesh that most closely approximates the pose of the face.
The depth camera 105, in one embodiment, uses a modulated light projector (not shown) to project modulated light patterns into the local environment, and uses one or more imaging sensors 106 to capture reflections of the modulated light patterns as they reflect back from objects in the local environment 112. These modulated light patterns can be either spatially-modulated light patterns or temporally-modulated light patterns. The captured reflections of the modulated light patterns are referred to herein as “depth images” 115 and are made up of a three-dimensional (3D) point cloud having a plurality of points. In some embodiments, the depth camera 105 calculates the depths of the objects, that is, the distances of the objects from the depth camera 105, based on the analysis of the depth images 115.
The face tracking module 110 receives a depth image 115 from the depth camera 105 and generates a depth map based on the depth image 115. The face tracking module 110 identifies a pose of the face 120 by fitting a face model mesh to the pixels of the depth map that correspond to the face 120. In some embodiments, the face tracking module 110 estimates parameters θ = {α, β, t, q} ∈ ℝ^D of the generative face model mesh to explain the data from an RGB-D pair, denoted herein as 𝒟. In some embodiments, the face tracking module 110 leverages the parameters θ̂ inferred in the previous frame received from the depth camera. In some embodiments, the model is parameterized by a set of identity coefficients α(θ) ∈ ℝ^H, a set of expression weights, or coefficients, β(θ) ∈ ℝ^K, a three-dimensional (3D) position of the head t(θ) ∈ ℝ³, and a quaternion indicating the 3D rotation of the head q(θ) ∈ ℝ⁴. The identity and expression coefficients indicate how to non-rigidly deform the 3D positions (vertices) of the face model mesh to fit corresponding pixels of the depth map. In some embodiments, the face model mesh is a triangular mesh model. The face tracking module 110 models the deformation of the 3D positions of the face model mesh using a bi-linear (PCA) basis of the N 3D vertex positions, where V_μ ∈ ℝ^{3×N} represents the mean face, {V_h^id}_{h=1}^H ⊆ ℝ^{3×N} is a set of vertex offsets that can change the identity of the face, and {V_k^exp}_{k=1}^K ⊆ ℝ^{3×N} is a set of vertex offsets that can change the expression of the face. Under a set of parameters θ, the face tracking module 110 calculates the deformed and repositioned vertices of the face model mesh as
V(θ) = R(θ)(V_μ + Σ_{k=1}^K β_k(θ) V_k^exp + Σ_{h=1}^H α_h(θ) V_h^id) + t(θ)   (1)
where R(θ) = R(q(θ)) ∈ ℝ^{3×3} maps the quaternion in θ into a rotation matrix.
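By way of illustration only, the deformation of Equation (1) can be transcribed as the following minimal Python sketch; the array shapes, the NumPy dependency, and the helper names are assumptions of the sketch rather than part of the disclosure:

```python
import numpy as np

def quat_to_rotation(q):
    """Map a quaternion q = (w, x, y, z) to a 3x3 rotation matrix R(q)."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def deform_mesh(V_mu, V_exp, V_id, beta, alpha, t, q):
    """Equation (1): deform and reposition the face model mesh.

    V_mu:  (3, N) mean face vertices
    V_exp: (K, 3, N) expression blendshape offsets, beta: (K,) coefficients
    V_id:  (H, 3, N) identity blendshape offsets, alpha: (H,) coefficients
    t: (3,) head position, q: (4,) head rotation quaternion
    """
    V = (V_mu
         + np.tensordot(beta, V_exp, axes=1)   # expression deformation
         + np.tensordot(alpha, V_id, axes=1))  # identity deformation
    return quat_to_rotation(q) @ V + t[:, None]
```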
The face tracking module 110 estimates the parameters θ of the face model mesh given the depth image by solving a probabilistic inference problem of the form

θ* = arg max_θ p(θ | 𝒟) = arg max_θ p(𝒟 | θ) p(θ)   (2)

where p(𝒟 | θ) is the likelihood of the captured data under the parameters θ and p(θ) is a prior over the parameters.
In some embodiments, the face tracking module 110 assumes that the likelihood and prior are functions belonging to the exponential family, and uses the negative log form to rewrite the maximization problem of Equation (2) as the minimization of an energy

E(θ) = −log p(𝒟 | θ) − log p(θ).   (3)
To facilitate increased efficiency of minimizing the energy, the face tracking module 110 includes only the vertices of the face model mesh that are estimated to be visible from the point of view of the depth camera 105, and bicubically interpolates the depth map associated with 𝒟, allowing the face tracking module 110 to jointly optimize the pose and blendshape estimation of the face 120 using a smooth and differentiable energy. Based on the energy, the face tracking module 110 estimates a current pose 140 of the face 120.
In some embodiments, the face tracking module 110 uses the current pose estimate 140 to update graphical data 135 on a display 130. In some embodiments, the display 130 is a physical surface, such as a tablet, mobile phone, smart device, display monitor, array(s) of display monitors, laptop, signage and the like or a projection onto a physical surface. In some embodiments, the display 130 is planar. In some embodiments, the display 130 is curved. In some embodiments, the display 130 is a virtual surface, such as a three-dimensional or holographic projection of objects in space including virtual reality and augmented reality. In some embodiments in which the display 130 is a virtual surface, the virtual surface is displayed within an HMD of a user. The location of the virtual surface may be relative to stationary objects (such as walls or furniture) within the local environment 112 of the user.
The memory 205 is a memory device generally configured to store data, and therefore may be a random access memory (RAM) module, a non-volatile memory device (e.g., flash memory), and the like. The memory 205 may form part of a memory hierarchy of the face tracking system 100 and may include other memory modules, such as additional caches not illustrated in FIG. 2.
The visibility estimator 210 is a module configured to estimate whether a vertex of the face model mesh is visible from the point of view of the depth camera by determining to what degree the associated normal is facing toward or away from the depth camera. In some embodiments, the visibility estimator 210 computes a smooth per-vertex visibility weight using a logistic function of the normal's component along the camera axis, of the form

v_n(θ) = 1/(1 + exp(δ([N_n(θ)]_z − ν)))   (5)

where N_n(θ) is the normal vector of vertex n and [N_n(θ)]_z is its component along the viewing direction of the depth camera. The parameters δ and ν respectively control the curvature and where the value 0.5 is reached.
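A minimal sketch of this visibility weight, assuming the logistic form reconstructed in (5), normals expressed in camera coordinates, and a camera looking along +z (the default values of δ and ν are illustrative):

```python
import numpy as np

def visibility_weights(normals, delta=50.0, nu=0.0):
    """Soft per-vertex visibility v_n in [0, 1], per Equation (5).

    normals: (3, N) unit vertex normals in camera coordinates; with the
    camera looking along +z, normals facing the camera have a negative
    z-component and receive weights near 1. delta controls the curvature
    of the logistic and nu shifts where the value 0.5 is reached.
    """
    nz = normals[2]
    return 1.0 / (1.0 + np.exp(delta * (nz - nu)))
```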
The energy minimizer 215 is a module configured to formulate and minimize an energy function describing the difference between the face model mesh and the depth map of the face. The energy function may be defined as

E_data(θ) = Σ_{n∈O(θ)} ([V_n(θ)]_z − D(Π(V_n(θ))))²   (6)

where O(θ) ⊆ {1, . . . , N} is the set of visible vertex indices under θ, V_n(θ) ∈ ℝ³ is the position of the n'th vertex under pose θ, [·]_z denotes the depth component of a 3D point, Π: ℝ³ → ℝ² projects onto the 2D image domain, and D(·) returns the depth of the closest pixel in the depth image associated with 𝒟. However, D(Π(V_n(θ))) is a piecewise constant mapping and is usually held fixed during optimization. Further, obtaining the set O(θ) requires an explicit rendering and endows the function with discontinuities. Accordingly, (6) is only smooth and differentiable once each vertex is associated with a specific depth value, in which case rendering and explicit correspondences must be re-established every time the pose θ is updated.
To facilitate more efficient pose estimation without necessitating rendering, based on the visibility estimates of the visibility estimator 210, the energy minimizer 215 replaces the sum of (6) over an explicit set of vertices O(θ) with a sum over all vertices {1, . . . , N}, using the visibility term to turn the individual terms on and off:

E_data(θ) = Σ_{n=1}^N v_n(θ)([V_n(θ)]_z − D(Π(V_n(θ))))²   (7)
In some embodiments, the energy minimizer 215 allows D(·) to bicubically interpolate the depth map associated with 𝒟, such that the energy is fully differentiable and well defined. In some embodiments, to handle outliers, the energy minimizer 215 can apply any smooth robust kernel ψ to the residuals:

E_data(θ) = Σ_{n=1}^N v_n(θ) ψ([V_n(θ)]_z − D(Π(V_n(θ))))   (8)
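The data term of (7) and (8) might be sketched as follows; the pinhole intrinsics (fx, fy, cx, cy), the pseudo-Huber kernel standing in for ψ, and the use of SciPy's RectBivariateSpline (kx = ky = 3 gives a bicubic spline) are assumptions of the sketch:

```python
import numpy as np
from scipy.interpolate import RectBivariateSpline

def data_energy(V, vis, depth_map, fx, fy, cx, cy, scale=0.01):
    """Visibility-weighted data term, a sketch of Equations (7)-(8).

    V: (3, N) deformed vertices in camera coordinates; vis: (N,) weights
    from the visibility estimator; depth_map: (H, W) depths in meters.
    """
    h, w = depth_map.shape
    # Bicubic spline: D(.) becomes smooth and differentiable between pixels.
    D = RectBivariateSpline(np.arange(h), np.arange(w), depth_map, kx=3, ky=3)
    # Pinhole projection Pi: R^3 -> R^2 (u along columns, v along rows).
    u = fx * V[0] / V[2] + cx
    v = fy * V[1] / V[2] + cy
    r = V[2] - D.ev(v, u)  # vertex depth minus interpolated map depth
    psi = scale**2 * (np.sqrt(1.0 + (r / scale)**2) - 1.0)  # smooth robust kernel
    return np.sum(vis * psi)
```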
The landmarks module 220 is a module configured to detect and localize distinctive features (e.g., nose tip, eye corners), referred to as landmarks, of human faces. Landmarks provide strong constraints, both for the general alignment of the face and for estimating the identity and expression of the face. However, detected landmarks can be slightly incorrect, or may be estimated if they are not directly visible by the depth camera 105, resulting in residuals in the image domain. The landmarks module 220 thus defines L facial landmarks {f_l}_{l=1}^L ⊆ ℝ², confidence weights {w_l}_{l=1}^L ⊆ ℝ, and associated vertex indices {η_l}_{l=1}^L ⊆ {1, . . . , N}. The landmarks module 220 minimizes an energy function that reduces the variation in landmarks due to the distance of the face from the depth camera 105:

E_lan(θ) = (M_d(θ)/f)² Σ_{l=1}^L w_l ∥f_l − Π(V_{η_l}(θ))∥²   (9)
where Md(θ) is the average depth of the face and f is the focal length of the depth camera 105.
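Under the reconstructed form of (9), the landmark term admits a short sketch; the array layout is illustrative:

```python
import numpy as np

def landmark_energy(proj, landmarks, weights, mean_depth, f):
    """Sketch of the landmark term of Equation (9).

    proj: (L, 2) projected landmark vertices Pi(V_{eta_l}(theta));
    landmarks: (L, 2) detected 2D landmarks f_l; weights: (L,) confidence
    weights w_l. Scaling pixel residuals by mean_depth / f converts them
    to approximately metric residuals, so the term does not shrink as the
    face moves away from the depth camera.
    """
    r = landmarks - proj  # (L, 2) residuals in the image domain
    return (mean_depth / f)**2 * np.sum(weights * np.sum(r * r, axis=1))
```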
The regularizer 225 is a module configured to adjust the identity and expression coefficients to avoid over-fitting the depth map. The regularizer 225 normalizes the eigenvectors (blendshapes) so that the coefficients follow a standard normal distribution. While expression coefficients generally do not follow a Gaussian distribution, identity parameters roughly do. Performing statistical regularization by simply minimizing the L2 norm of the identity parameters would effectively encourage solutions to be close to the maximum likelihood estimate (MLE) of the multivariate normal distribution, which is the mean face; in high dimensions, however, the vast majority of the probability mass lies away from the mean. The regularizer 225 therefore performs the statistical regularization using
E_reg(θ) = Σ_{k=1}^K β_k(θ)² + (Σ_{h=1}^H α_h(θ)² − E[χ²_H])²   (10)

where the χ² distribution has H degrees of freedom, so that E[χ²_H] = H. This constraint effectively encourages the solution [α_1(θ), . . . , α_H(θ)] to remain close to the "shell" at a distance √H from the mean face, which is where the vast majority of faces lie in high dimensions.
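Equation (10) transcribes directly, using E[χ²_H] = H:

```python
import numpy as np

def statistical_regularizer(alpha, beta):
    """Equation (10): L2 penalty on the expression coefficients beta plus
    a 'shell' penalty keeping the identity coefficients alpha at squared
    norm near E[chi^2_H] = H, i.e. near distance sqrt(H) from the mean face.
    """
    H = alpha.shape[0]
    return np.sum(beta**2) + (np.sum(alpha**2) - H)**2
```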
In some embodiments, the regularizer 225 incorporates temporal regularization over all the entries of θ during the joint optimization by adding the following temporal regularization term to the energy:
E_temp(θ) = ω_id ∥α(θ̂) − α(θ)∥₂² + ω_exp ∥β(θ̂) − β(θ)∥₂² + ω_trans ∥t(θ̂) − t(θ)∥₂² + ω_rot ∥q(θ̂) − q(θ)∥₂²   (11)

where q(θ) ∈ ℝ⁴ is the sub-vector of rotational parameters in quaternion form, and θ̂ is the solution from the previous frame.
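A sketch of (11), assuming θ is stored as a dictionary of parameter groups (an illustrative layout, not part of the disclosure):

```python
import numpy as np

def temporal_regularizer(theta, theta_prev, w_id, w_exp, w_trans, w_rot):
    """Equation (11): penalize deviation of each parameter group from the
    previous frame's solution theta_prev."""
    sq = lambda a, b: np.sum((a - b)**2)
    return (w_id    * sq(theta['alpha'], theta_prev['alpha'])
          + w_exp   * sq(theta['beta'],  theta_prev['beta'])
          + w_trans * sq(theta['t'],     theta_prev['t'])
          + w_rot   * sq(theta['q'],     theta_prev['q']))
```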
In some embodiments, the face tracking module 110 optimizes the overall energy function by recasting it as a sum of M squared residuals

E(θ) = r(θ)ᵀ r(θ)   (12)

where r(θ) ∈ ℝ^M. In some embodiments, the face tracking module 110 computes the Jacobian J(θ) ∈ ℝ^{M×D} and performs Levenberg updates (which are variants of Gauss-Newton updates) as

θ ← θ − (J(θ)ᵀ J(θ) + λ I_{D×D})⁻¹ J(θ)ᵀ r(θ)   (13)

where λ is a damping term that can be raised repeatedly when steps fail in order to achieve very small, gradient-descent-like updates. The face tracking module 110 initializes θ ← θ̂ using the parameters from the previous frame. In some embodiments, the face tracking module 110 performs this Levenberg optimization on a GPU. The face tracking module 110 generates a current pose estimate 140 of the face based on the optimized energy.
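A compact sketch of the Levenberg loop of (12) and (13) follows; a finite-difference Jacobian is used here for brevity, whereas the module described above would compute the Jacobian analytically and may run on a GPU:

```python
import numpy as np

def levenberg(residual_fn, theta0, iters=20, lam=1e-3, eps=1e-6):
    """Minimize E(theta) = r(theta)^T r(theta) per Equations (12)-(13)."""
    theta = np.asarray(theta0, dtype=float).copy()
    r = residual_fn(theta)
    for _ in range(iters):
        # Numerical Jacobian J(theta) in R^{M x D}, one column per parameter.
        J = np.stack(
            [(residual_fn(theta + eps * np.eye(theta.size)[d]) - r) / eps
             for d in range(theta.size)], axis=1)
        step = np.linalg.solve(J.T @ J + lam * np.eye(theta.size), J.T @ r)
        cand = theta - step
        r_cand = residual_fn(cand)
        if r_cand @ r_cand < r @ r:
            theta, r, lam = cand, r_cand, lam * 0.5  # accept; relax damping
        else:
            lam *= 10.0  # reject; raise damping toward tiny gradient-descent-like steps
    return theta
```

Initializing theta0 with the previous frame's solution θ̂, as described above, typically lets the loop converge in a few iterations.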
Assuming that the depth camera is directly facing the face model mesh 305, for the pose 310 the visibility estimator 210 estimates that the vector 312 is at a 45° angle to the depth camera, that vector 314 is at a 2° angle to the depth camera, that vector 316 is at a 2° angle to the depth camera, and that vector 318 is at a 10° angle to the depth camera. The visibility estimator 210 assigns a value of 1 to the vertices associated with each of vectors 312, 314, 316, and 318, because each vector is estimated to be pointing toward the depth camera.
For the pose 320, however, the face model mesh 305 is rotated to the left, such that the visibility estimator 210 estimates that vector 312 is at a 10° angle to the depth camera, that vector 314 is at a 20° angle to the depth camera, that vector 316 is at a −20° angle to the depth camera, and that vector 318 is at a −45° angle to the depth camera. The visibility estimator 210 therefore assigns a value of 1 to the vertices associated with each of vectors 312 and 314, because vectors 312 and 314 are estimated to be pointing toward the depth camera. However, the visibility estimator assigns a value of 0 to the vertices associated with each of vectors 316 and 318 when the face model mesh 305 is in pose 320, because vectors 316 and 318 are estimated to be pointing away from the depth camera. By assigning values of 0 or 1 to the vertices of the face model mesh 305 based on whether the normal vector associated with each vertex is estimated to face toward or away from the depth camera, the face tracking module 110 can smoothly turn on and off data terms of the energy function without rendering the face model mesh 305.
In some embodiments, certain aspects of the techniques described above may be implemented by one or more processors of a processing system executing software. The software comprises one or more sets of executable instructions stored or otherwise tangibly embodied on a non-transitory computer readable storage medium. The software can include the instructions and certain data that, when executed by the one or more processors, manipulate the one or more processors to perform one or more aspects of the techniques described above. The non-transitory computer readable storage medium can include, for example, a magnetic or optical disk storage device, solid state storage devices such as Flash memory, a cache, random access memory (RAM), or other volatile or non-volatile memory devices, and the like. The executable instructions stored on the non-transitory computer readable storage medium may be in source code, assembly language code, object code, or another instruction format that is interpreted or otherwise executable by one or more processors.
A computer readable storage medium may include any storage medium, or combination of storage media, accessible by a computer system during use to provide instructions and/or data to the computer system. Such storage media can include, but is not limited to, optical media (e.g., compact disc (CD), digital versatile disc (DVD), Blu-Ray disc), magnetic media (e.g., floppy disc, magnetic tape, or magnetic hard drive), volatile memory (e.g., random access memory (RAM) or cache), non-volatile memory (e.g., read-only memory (ROM) or Flash memory), or microelectromechanical systems (MEMS)-based storage media. The computer readable storage medium may be embedded in the computing system (e.g., system RAM or ROM), fixedly attached to the computing system (e.g., a magnetic hard drive), removably attached to the computing system (e.g., an optical disc or Universal Serial Bus (USB)-based Flash memory), or coupled to the computer system via a wired or wireless network (e.g., network accessible storage (NAS)).
Note that not all of the activities or elements described above in the general description are required, that a portion of a specific activity or device may not be required, and that one or more further activities may be performed, or elements included, in addition to those described. Still further, the order in which activities are listed is not necessarily the order in which they are performed. Also, the concepts have been described with reference to specific embodiments. However, one of ordinary skill in the art appreciates that various modifications and changes can be made without departing from the scope of the present disclosure as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present disclosure.
Benefits, other advantages, and solutions to problems have been described above with regard to specific embodiments. However, the benefits, advantages, solutions to problems, and any feature(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as a critical, required, or essential feature of any or all the claims. Moreover, the particular embodiments disclosed above are illustrative only, as the disclosed subject matter may be modified and practiced in different but equivalent manners apparent to those skilled in the art having the benefit of the teachings herein. No limitations are intended to the details of construction or design herein shown, other than as described in the claims below. It is therefore evident that the particular embodiments disclosed above may be altered or modified and all such variations are considered within the scope of the disclosed subject matter. Accordingly, the protection sought herein is as set forth in the claims below.
The present application is related to and claims priority to the following co-pending application, the entirety of which is incorporated by reference herein: U.S. Provisional Patent Application Ser. No. 62/516,646 (Attorney Docket No. 1500-FEN008-PR), entitled “High Speed and High Fidelity Face Tracking,” filed Jun. 7, 2017.