3D modeling of human facial features is commonly used to create realistic 3D representations of people. For instance, virtual human representations such as avatars frequently make use of such models. Conventional applications for generating 3D faces typically require manual labeling of feature points. While such techniques may employ morphable model fitting, it would be desirable for them to permit automatic facial landmark detection and to employ multi-view stereo (MVS) technology.
The material described herein is illustrated by way of example and not by way of limitation in the accompanying figures. For simplicity and clarity of illustration, elements illustrated in the figures are not necessarily drawn to scale. For example, the dimensions of some elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference labels have been repeated among the figures to indicate corresponding or analogous elements. In the figures:
One or more embodiments or implementations are now described with reference to the enclosed figures. While specific configurations and arrangements are discussed, it should be understood that this is done for illustrative purposes only. Persons skilled in the relevant art will recognize that other configurations and arrangements may be employed without departing from the spirit and scope of the description. It will be apparent to those skilled in the relevant art that techniques and/or arrangements described herein may also be employed in a variety of systems and applications other than those described herein.
While the following description sets forth various implementations that may be manifested in architectures such as system-on-a-chip (SoC) architectures, for example, implementation of the techniques and/or arrangements described herein is not restricted to particular architectures and/or computing systems and may be implemented by any architecture and/or computing system for similar purposes. For instance, various architectures employing, for example, multiple integrated circuit (IC) chips and/or packages, and/or various computing devices and/or consumer electronic (CE) devices such as set top boxes, smart phones, etc., may implement the techniques and/or arrangements described herein. Further, while the following description may set forth numerous specific details such as logic implementations, types and interrelationships of system components, logic partitioning/integration choices, etc., claimed subject matter may be practiced without such specific details. In other instances, some material such as, for example, control structures and full software instruction sequences, may not be shown in detail in order not to obscure the material disclosed herein.
The material disclosed herein may be implemented in hardware, firmware, software, or any combination thereof. The material disclosed herein may also be implemented as instructions stored on a machine-readable medium, which may be read and executed by one or more processors. A machine-readable medium may include any medium and/or mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device). For example, a machine-readable medium may include read only memory (ROM); random access memory (RAM); magnetic disk storage media; optical storage media; flash memory devices; electrical, optical, acoustical or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.), and others.
References in the specification to “one implementation”, “an implementation”, “an example implementation”, etc., indicate that the implementation described may include a particular feature, structure, or characteristic, but every implementation may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same implementation. Further, when a particular feature, structure, or characteristic is described in connection with an implementation, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other implementations whether or not explicitly described herein.
Image capture module 102 includes one or more image capturing devices 104, such as a still or video camera. In some implementations, a single camera 104 may be moved along an arc or track 106 about a subject face 108 to generate a sequence of images of face 108 where the perspective of each image with respect to face 108 is different as will be explained in greater detail below. In other implementations, multiple imaging devices 104, positioned at various angles with respect to face 108 may be employed. In general, any number of known image capturing systems and/or techniques may be employed in capture module 102 to generate image sequences (see, e.g., Seitz et al., “A Comparison and Evaluation of Multi-View Stereo Reconstruction Algorithms,” In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2006) (hereinafter “Seitz et al.”).
Image capture module 102 may provide the image sequence to simulation module 110. Simulation module 110 includes at least a face detection module 112, a multi-view stereo (MVS) module 114, a 3D morphable face module 116, an alignment module 118, and a texture module 120, the functionality of which will be explained in greater detail below. In general, as will also be explained in greater detail below, simulation module 110 may be used to select images from among the images provided by capture module 102, perform face detection on the selected images to obtain facial bounding-boxes and facial landmarks, recover camera parameters and obtain sparse key-points, perform multi-view stereo techniques to generate a dense avatar mesh, fit the mesh to a morphable 3D face model, refine the 3D face model by aligning and smoothing it, and synthesize a texture image for the face model.
In various implementations, image capture module 102 and simulation module 110 may be adjacent to or in proximity to each other. For example, image capture module 102 may employ a video camera as imaging device 104 and simulation module 110 may be implemented by a computing system that receives an image sequence directly from device 104 and then processes the images to generate a 3D face model and texture image. In other implementations, image capture module 102 and simulation module 110 may be remote from each other. For example, one or more server computers that are remote from image capture module 102 may implement simulation module 110 where module 110 may receive image sequences from module 102 via, for example, the internet. Further, in various implementations, simulation module 110 may be provided by any combination of software, firmware and/or hardware that may or may not be distributed across various computing systems.
At block 202, multiple 2D images of a face may be captured and various ones of the images may be selected for further processing. In various implementations, block 202 may involve using a common commercial camera to record video images of a human face from different perspectives. For example, video may be recorded at different orientations spanning approximately 180 degrees around the front of a human head for a duration of about 10 seconds while the face remains still and maintains a neutral expression. This may result in approximately three hundred 2D images being captured (assuming a standard video frame rate of thirty frames per second). The resulting video may then be decoded and a subset of about 30 facial images may be selected either manually or by using an automated selection method (see, e.g., R. Hartley and A. Zisserman, "Multiple View Geometry in Computer Vision," Chapter 12, Cambridge University Press, Second Edition (2003)). In some implementations, the angle between adjacent selected images (as measured with respect to the subject being imaged) may be 10 degrees or smaller.
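As one hedged sketch of such an automated selection step, the fragment below assumes that an approximate view angle (e.g., camera or head yaw) has already been estimated for each decoded frame and keeps roughly one frame per angular step so that adjacent selections are spaced by no more than the chosen gap; the helper name and the angle estimates are illustrative assumptions, not part of the method described above.

```python
import numpy as np

def select_frames_by_angle(frame_angles_deg, max_gap_deg=10.0):
    """Pick frame indices so that adjacent selected views are spaced by at
    most about `max_gap_deg` degrees with respect to the subject."""
    angles = np.asarray(frame_angles_deg, dtype=float)
    order = np.argsort(angles)                    # sweep from one side to the other
    sorted_angles = angles[order]

    span = sorted_angles[-1] - sorted_angles[0]
    n_targets = int(np.ceil(span / max_gap_deg)) + 1
    targets = np.linspace(sorted_angles[0], sorted_angles[-1], n_targets)

    # Keep the frame closest to each evenly spaced target angle.
    picks = [int(order[np.argmin(np.abs(sorted_angles - t))]) for t in targets]
    return sorted(set(picks))

# Example: ~300 frames sweeping 180 degrees; a ~6 degree step keeps ~31 frames,
# consistent with selecting "about 30" images spaced 10 degrees or smaller.
yaw = np.linspace(-90.0, 90.0, 300)
print(len(select_frames_by_angle(yaw, max_gap_deg=6.0)))
```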
Face detection and facial landmark identification may then be performed on the selected images at block 204 to generate corresponding facial bounding boxes and identified landmarks within the bounding boxes. In various implementations, block 204 may involve applying known automated multi-view face detection techniques (see, e.g., Kim et al., “Face Tracking and Recognition with Visual Constraints in Real-World Videos”, In IEEE Conf. Computer Vision and Pattern Recognition (2008)) to outline the face contour and facial landmarks in each image using the face bounding-box to restrict the region in which landmarks are identified and to remove extraneous background image content. For instance,
At block 206, camera parameters may be determined for each image. In various implementations, block 206 may include, for each image, extracting stable key-points and using known automatic camera parameter recovery techniques, such as described in Seitz et al., to obtain a sparse set of feature points and camera parameters including a camera projection matrix. In some examples, face detection module 112 of system 100 may undertake block 204 and/or block 206.
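Purely for illustration (the description above relies on known recovery techniques such as those discussed in Seitz et al.), the sketch below shows one common way to recover relative camera parameters for a pair of selected facial images with OpenCV: stable key-points are detected and matched, an essential matrix is estimated, and a relative pose and a pair of projection matrices are derived. The intrinsic matrix K is assumed to be known, and all names here are illustrative.

```python
import cv2
import numpy as np

def recover_relative_pose(img_a, img_b, K):
    """Estimate the relative camera pose and projection matrices for two
    facial images. img_a, img_b: grayscale images; K: 3x3 intrinsic matrix
    (assumed pre-calibrated -- an assumption of this sketch)."""
    sift = cv2.SIFT_create()
    kp_a, desc_a = sift.detectAndCompute(img_a, None)
    kp_b, desc_b = sift.detectAndCompute(img_b, None)

    # Ratio-test matching of stable key-points.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = [m for m, n in matcher.knnMatch(desc_a, desc_b, k=2)
               if m.distance < 0.7 * n.distance]
    pts_a = np.float32([kp_a[m.queryIdx].pt for m in matches])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in matches])

    # Essential matrix and relative pose (rotation R, translation t).
    E, inliers = cv2.findEssentialMat(pts_a, pts_b, K, cv2.RANSAC, 0.999, 1.0)
    _, R, t, _ = cv2.recoverPose(E, pts_a, pts_b, K)

    # Projection matrices usable for triangulation: P_a = K[I|0], P_b = K[R|t].
    P_a = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P_b = K @ np.hstack([R, t])
    keep = inliers.ravel() == 1
    return P_a, P_b, pts_a[keep], pts_b[keep]
```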
At block 208, multi-view stereo (MVS) techniques may be applied to generate a dense avatar mesh from the sparse feature points and camera parameters. In various implementations, block 208 may involve performing known stereo homography and multi-view alignment and integration techniques for facial image pairs. For example, as described in WO2010133007 (“Techniques for Rapid Stereo Reconstruction from Images”), for a pair of images, optimized image point pairs obtained by homography fitting may be triangulated with the known camera parameters to produce a three-dimensional point in a dense avatar mesh. For instance, FIG. 4 illustrates a non-limiting example of multiple recovered cameras 402 (e.g., as specified by recovered camera parameters) as may be obtained at block 206 and a corresponding dense avatar mesh 404 as may be obtained at block 208. In some examples, MVS module 114 of system 100 may undertake block 208.
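For illustration, a minimal numpy sketch of the triangulation step is shown below: given a refined correspondence (x_a, x_b) and the two recovered projection matrices, classical linear (DLT) triangulation produces one 3D point of a dense avatar mesh. This generic stereo triangulation is offered only as an example; it is not asserted to be the specific procedure of WO2010133007.

```python
import numpy as np

def triangulate_point(P_a, P_b, x_a, x_b):
    """Linear (DLT) triangulation of a single correspondence.

    P_a, P_b: 3x4 camera projection matrices from the recovered parameters.
    x_a, x_b: matched 2D image points (u, v) in the two views.
    Returns the 3D point as a length-3 array.
    """
    A = np.stack([
        x_a[0] * P_a[2] - P_a[0],
        x_a[1] * P_a[2] - P_a[1],
        x_b[0] * P_b[2] - P_b[0],
        x_b[1] * P_b[2] - P_b[1],
    ])
    # The homogeneous solution is the right singular vector with the
    # smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

# Tiny self-check with two synthetic cameras observing the point (0, 0, 5).
P_a = np.hstack([np.eye(3), np.zeros((3, 1))])
P_b = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.0, 0.0, 5.0, 1.0])
x_a = (P_a @ X_true)[:2] / (P_a @ X_true)[2]
x_b = (P_b @ X_true)[:2] / (P_b @ X_true)[2]
print(triangulate_point(P_a, P_b, x_a, x_b))   # ~ [0, 0, 5]
```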
Returning to the discussion of
In various implementations, block 210 may involve learning a morphable face model from a face data set. For example, a face data set may include shape data (e.g., $(x, y, z)$ mesh coordinates in a Cartesian coordinate system) and texture data (red, green, and blue color intensity values) specifying each point or vertex in the dense avatar mesh. The shape and texture may be represented by respective column vectors $(x_1, y_1, z_1, x_2, y_2, z_2, \ldots, x_n, y_n, z_n)^T$ and $(R_1, G_1, B_1, R_2, G_2, B_2, \ldots, R_n, G_n, B_n)^T$, where $n$ is the number of feature points or vertices in a face.
A generic face may be represented as a 3D morphable face model using the following formula:

$$X = X_0 + \sum_i \alpha_i \lambda_i U_i \qquad (1)$$

where $X_0$ is the mean column vector, $\lambda_i$ is the $i$-th eigen-value, $U_i$ is the $i$-th eigen-vector, and $\alpha_i$ is the reconstructed metric coefficient associated with the $i$-th eigen-value. The model represented by Eqn. (1) may then be morphed into various shapes by adjusting the set of coefficients $\{\alpha_i\}$.
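By way of a small numeric illustration of Eqn. (1) (using randomly generated placeholder data in place of a learned face data set, so the shapes have no anatomical meaning), the sketch below reconstructs a shape vector from the mean shape, the eigen-vectors, the eigen-values, and a set of metric coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)

n_vertices = 1000          # placeholder size of the face mesh
n_modes = 20               # number of eigen-vectors retained

X0 = rng.normal(size=3 * n_vertices)                              # mean shape column vector
U = np.linalg.qr(rng.normal(size=(3 * n_vertices, n_modes)))[0]   # stand-in eigen-vectors
lam = np.sort(rng.random(n_modes))[::-1]                          # stand-in eigen-values

def morph(alpha):
    """Eqn. (1): X = X0 + sum_i alpha_i * lambda_i * U_i."""
    return X0 + U @ (alpha * lam)

neutral = morph(np.zeros(n_modes))          # alpha = 0 reproduces the mean face
variant = morph(rng.normal(scale=0.5, size=n_modes))
print(np.linalg.norm(variant - neutral))    # nonzero: the shape has been morphed
```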
Fitting the dense avatar mesh to the 3D morphable face model of Eqn. (1) may involve defining the morphable model vertices $S_{\mathrm{mod}}$ analytically as

$$S_{\mathrm{mod}} = P(X_0 + \alpha U \lambda) \qquad (2)$$

where $P \in \mathbb{R}^{3n \times 3K}$ is a projection that selects the $n$ vertices corresponding to feature points from the complete set of $K$ morphable model vertices. In Eqn. (2), the $n$ feature points are used to measure the reconstruction error.
During fitting, model priors may be applied, resulting in the following cost function:

$$E = \lVert P(X_0 + \alpha U \lambda) - S'_{\mathrm{rec}} \rVert + \eta \lVert \alpha \rVert \qquad (3)$$

where Eqn. (3) assumes that the probability of representing a qualified shape depends directly on the norm $\lVert \alpha \rVert$: larger values of $\alpha$ correspond to larger differences between a reconstructed face and the mean face. The parameter $\eta$ trades off the prior probability against the fitting quality in Eqn. (3) and may be determined iteratively by minimizing the following cost function:
where $\delta S = \lVert S_{\mathrm{mod}} - S'_{\mathrm{rec}} \rVert$ and $A = PU\lambda$. Applying a singular value decomposition to $A$ yields $A = U\,\mathrm{diag}(w_i)\,V^T$, where $w_i$ is the $i$-th singular value of $A$.
Eqn. (4) may be minimized when the following condition holds:
Using Eqn. (5), $\alpha$ may be iteratively updated as $\alpha \leftarrow \alpha + \delta\alpha$. In addition, in some implementations $\eta$ may be adjusted iteratively, where $\eta$ may be initially set to $w_0^2$ (i.e., the square of the largest singular value) and may be decreased toward the squares of the smaller singular values.
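The explicit forms of Eqns. (4) and (5) are not reproduced above. Purely as an illustrative sketch, the fragment below therefore assumes the standard regularized least-squares (ridge) form that is consistent with the surrounding description, minimizing $\lVert A\,\delta\alpha - r \rVert^2 + \eta \lVert \delta\alpha \rVert^2$ for a residual vector $r$ and expressing the solution through the singular values of $A$, while $\eta$ is decreased from $w_0^2$ toward the squares of the smaller singular values. The matrix sizes, data, and names are placeholders, not the patent's equations.

```python
import numpy as np

def coefficient_update(A, residual, eta):
    """One regularized update: minimize ||A*d_alpha - residual||^2 + eta*||d_alpha||^2.

    A        : maps coefficient changes to changes in the feature points
               (e.g., A = P U lambda in the notation above).
    residual : vector difference between the target feature points and the
               current morphable-model feature points (an assumption here).
    eta      : regularization weight.
    """
    Ua, w, Vt = np.linalg.svd(A, full_matrices=False)
    # Ridge-regression solution expressed through the singular values w_i.
    return Vt.T @ ((w / (w ** 2 + eta)) * (Ua.T @ residual))

# Illustrative schedule: eta starts at w_0^2 (square of the largest singular
# value) and is decreased toward the squares of the smaller singular values.
rng = np.random.default_rng(1)
A = rng.normal(size=(60, 20))        # placeholder: 20 feature points x 3 coords, 20 modes
target = rng.normal(size=60)
alpha = np.zeros(20)

w = np.linalg.svd(A, compute_uv=False)
for eta in (w[0] ** 2, w[len(w) // 2] ** 2, w[-1] ** 2):
    alpha = alpha + coefficient_update(A, target - A @ alpha, eta)

print(np.linalg.norm(target - A @ alpha))   # fitting error decreases as eta shrinks
```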
In various implementations, given the reconstructed 3D points provided at block 210 in the form of a reconstructed morphable face mesh, alignment at block 212 may involve searching for both the pose of a face and the metric coefficients needed to minimize the distance from the reconstructed 3D points to the morphable face mesh. The pose of a face may be provided by the transform from the coordinate frame of the neutral face model to that of the dense avatar mesh, where $R$ is a 3×3 rotation matrix, $t$ is a translation, and $s$ is a global scale. For any 3D vector $p$, the notation $T(p) = sRp + t$ may be employed.
The vertex coordinates of a face mesh in the camera frame are a function of both the metric coefficients and the face pose. Given metric coefficients $\{\alpha_1, \alpha_2, \ldots, \alpha_n\}$ and pose $T$, the face geometry in the camera frame may be provided by
In examples where the face mesh is a triangular mesh, any point on a triangle of the mesh may be expressed as a linear combination of the three triangle vertices measured in barycentric coordinates. Thus, any such point may be expressed as a function of $T$ and the metric coefficients. Furthermore, when $T$ is fixed, the point may be represented as a linear function of the metric coefficients.
The pose $T$ and the metric coefficients $\{\alpha_1, \alpha_2, \ldots, \alpha_n\}$ may then be obtained by minimizing
where $(p_1, p_2, \ldots, p_n)$ represent the points of the reconstructed face mesh, and $d(p_i, S)$ represents the distance from a point $p_i$ to the face mesh $S$. Eqn. (7) may be solved using an iterative closest point (ICP) approach. For instance, at each iteration, $T$ may be fixed and, for each point $p_i$, the closest point $g_i$ on the current face mesh $S$ may be identified. The error $E$ of Eqn. (7) may then be minimized and the reconstructed metric coefficients obtained using Eqns. (1)-(5). The face pose $T$ may then be found by fixing the metric coefficients $\{\alpha_1, \alpha_2, \ldots, \alpha_n\}$. In various implementations, this may involve building a k-d tree for the dense avatar mesh points, searching that tree for the dense points closest to the morphable face model vertices, and using least-squares techniques to obtain the pose transform $T$. The ICP iterations may continue until the error $E$ has converged and the reconstructed metric coefficients and pose $T$ are stable.
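As a non-limiting sketch of the pose step of this iteration (the coefficient step can reuse a regularized solve like the earlier fragment), the code below builds a k-d tree over the dense avatar mesh points, finds the closest dense point for each morphable-model vertex, and estimates the similarity transform T (scale s, rotation R, translation t) with a least-squares, Umeyama-style fit; the function names are illustrative, and in practice the closest-point search and the fit would be alternated with the coefficient update until E converges.

```python
import numpy as np
from scipy.spatial import cKDTree

def fit_similarity_transform(src, dst):
    """Least-squares similarity transform T(p) = s*R*p + t mapping src to dst."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_s, dst - mu_d
    U, D, Vt = np.linalg.svd(dst_c.T @ src_c)
    S = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:            # guard against a reflection
        S[2, 2] = -1.0
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / np.sum(src_c ** 2)
    t = mu_d - s * (R @ mu_s)
    return s, R, t

def icp_pose_step(model_vertices, dense_points):
    """One pose iteration: k-d tree search for the closest dense points, then a
    least-squares similarity fit of the model vertices to those points."""
    tree = cKDTree(dense_points)
    _, nearest = tree.query(model_vertices)
    return fit_similarity_transform(model_vertices, dense_points[nearest])

# Sanity check of the similarity fit with known correspondences.
rng = np.random.default_rng(2)
model = rng.normal(size=(500, 3))
R_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R_true *= np.sign(np.linalg.det(R_true))     # ensure a proper rotation
dense = 1.3 * model @ R_true.T + np.array([0.1, -0.2, 0.4])
s, R, t = fit_similarity_transform(model, dense)
print(round(s, 3))                           # -> 1.3
```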
Having aligned the dense avatar mesh (obtained from MVS processing at block 208) and the reconstructed morphable face mesh (obtained at block 210), the results may be refined or smoothed by fusing the dense avatar mesh to the reconstructed morphable face mesh. For instance,
In various implementations, smoothing the 3D face model may include creating a cylindrical plane around the face mesh and unwrapping both the morphable face model and the dense avatar mesh onto that plane. For each vertex of the dense avatar mesh, a triangle of the morphable face mesh that contains the vertex may be identified, and the barycentric coordinates of the vertex within that triangle may be found. A refined point may then be generated as a weighted combination of the dense point and corresponding points in the morphable face mesh. The refinement of a point $p_i$ in the dense avatar mesh may be provided by:
where $\alpha$ and $\beta$ are weights, $(q_1, q_2, q_3)$ are the three vertices of the morphable face mesh triangle containing the point $p_i$, and $(c_1, c_2, c_3)$ are the normalized areas of the three sub-triangles as illustrated in
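Because the refinement equation itself is not reproduced above, the following sketch assumes a simple convex blend $p_i' = \alpha\,p_i + \beta\,(c_1 q_1 + c_2 q_2 + c_3 q_3)$ with $\alpha + \beta = 1$, and shows one way the cylindrical unwrapping and the barycentric coordinates $(c_1, c_2, c_3)$ might be computed; the weights, the axis choice, and the helper names are assumptions for illustration only.

```python
import numpy as np

def unwrap_to_cylinder(points):
    """Unwrap 3D points (x, y, z) onto a cylindrical plane (theta, y) about the y axis."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    return np.stack([np.arctan2(x, z), y], axis=1)

def barycentric_2d(p, a, b, c):
    """Barycentric coordinates (c1, c2, c3) of 2D point p in triangle (a, b, c)."""
    T = np.array([[b[0] - a[0], c[0] - a[0]],
                  [b[1] - a[1], c[1] - a[1]]])
    l2, l3 = np.linalg.solve(T, p - a)
    return np.array([1.0 - l2 - l3, l2, l3])

def refine_point(p_dense, q1, q2, q3, bary, alpha=0.5, beta=0.5):
    """Weighted blend of a dense-mesh point with its barycentric counterpart
    on the morphable face mesh (alpha + beta = 1 is an assumption here)."""
    model_point = bary[0] * q1 + bary[1] * q2 + bary[2] * q3
    return alpha * p_dense + beta * model_point

# Toy usage: refine one dense vertex against a containing model triangle.
q1, q2, q3 = np.array([0.0, 0, 1]), np.array([1.0, 0, 1]), np.array([0.0, 1, 1])
p = np.array([0.3, 0.3, 1.1])
tri_2d = unwrap_to_cylinder(np.stack([q1, q2, q3]))
p_2d = unwrap_to_cylinder(p[None, :])[0]
bary = barycentric_2d(p_2d, *tri_2d)
print(refine_point(p, q1, q2, q3, bary))
```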
After generation of the smoothed 3D face mesh at block 212, the camera projection matrix may be used to synthesize a corresponding face texture by applying multi-view texture synthesis at block 214. In various implementations, block 214 may involve determining a final face texture (e.g., a texture image) using an angle-weighted texture synthesis approach where, for each point or triangle in the dense avatar mesh, projected points or triangles in the various 2D facial images may be obtained using a corresponding projection matrix.
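As a hedged sketch of one possible angle-weighted synthesis (not necessarily the exact approach of block 214), each mesh point is projected into every selected view with that view's projection matrix, a texture value is sampled there, and the samples are blended with weights equal to the cosine of the angle between the surface normal and the camera's principal axis, clamped so that back-facing views contribute nothing; nearest-pixel sampling and the per-view data layout are simplifying assumptions.

```python
import numpy as np

def sample_texture(image, u, v):
    """Nearest-pixel texture lookup (a simplification of proper interpolation)."""
    h, w = image.shape[:2]
    return image[int(np.clip(round(v), 0, h - 1)), int(np.clip(round(u), 0, w - 1))]

def blend_point_texture(point, normal, views):
    """Angle-weighted blend of texture samples for one mesh point.

    views: list of (P, principal_axis, image) tuples, where P is a 3x4
    projection matrix and principal_axis is the camera's unit viewing direction.
    """
    weighted, total = np.zeros(3), 0.0
    X = np.append(point, 1.0)                     # homogeneous coordinates
    n = normal / np.linalg.norm(normal)
    for P, axis, image in views:
        w = max(0.0, float(np.dot(n, -axis)))     # cos(angle); back-facing views get 0
        if w == 0.0:
            continue
        u, v, s = P @ X                           # project into this view
        weighted += w * np.asarray(sample_texture(image, u / s, v / s), dtype=float)
        total += w
    return weighted / total if total > 0 else weighted
```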
Texture values for points P1 and P2 may then be weighted by the cosine of the angle between the normal N and the principal axis of the respective cameras. For instance, the texture value of point P1 may be weighted by the cosine of the angle 710 formed between the normal N and the principal axis Z1 of camera C1. Similarly, although not shown in
Process 200 may conclude at block 216 where the smoothed 3D face model and the corresponding texture image may be combined using known techniques to generate a final 3D face model. For instance,
While the implementation of example process 200 as illustrated in
System 900 includes a processor 902 having one or more processor cores 904. Processor cores 904 may be any type of processor logic capable at least in part of executing software and/or processing data signals. In various examples, processor cores 904 may include CISC processor cores, RISC microprocessor cores, VLIW microprocessor cores, and/or any number of processor cores implementing any combination of instruction sets, or any other processor devices, such as a digital signal processor or microcontroller.
Processor 902 also includes a decoder 906 that may be used for decoding instructions received by, e.g., a display processor 908 and/or a graphics processor 910, into control signals and/or microcode entry points. While illustrated in system 900 as components distinct from core(s) 904, those of skill in the art may recognize that one or more of core(s) 904 may implement decoder 906, display processor 908 and/or graphics processor 910. In some implementations, processor 902 may be configured to undertake any of the processes described herein including the example process described with respect to
Processing core(s) 904, decoder 906, display processor 908 and/or graphics processor 910 may be communicatively and/or operably coupled through a system interconnect 916 with each other and/or with various other system devices, which may include but are not limited to, for example, a memory controller 914, an audio controller 918 and/or peripherals 920. Peripherals 920 may include, for example, a universal serial bus (USB) host port, a Peripheral Component Interconnect (PCI) Express port, a Serial Peripheral Interface (SPI) interface, an expansion bus, and/or other peripherals. While
In some implementations, system 900 may communicate with various I/O devices not shown in
System 900 may further include memory 912. Memory 912 may be one or more discrete memory components such as a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, flash memory device, or other memory devices. While
The devices and/or systems described herein, such as example system 100 represent several of many possible device configurations, architectures or systems in accordance with the present disclosure. Numerous variations of systems such as variations of example system 100 are possible consistent with the present disclosure.
The systems described above, and the processing performed by them as described herein, may be implemented in hardware, firmware, or software, or any combination thereof. In addition, any one or more features disclosed herein may be implemented in hardware, software, firmware, and combinations thereof, including discrete and integrated circuit logic, application specific integrated circuit (ASIC) logic, and microcontrollers, and may be implemented as part of a domain-specific integrated circuit package, or a combination of integrated circuit packages. The term software, as used herein, refers to a computer program product including a computer readable medium having computer program logic stored therein to cause a computer system to perform one or more features and/or combinations of features disclosed herein.
While certain features set forth herein have been described with reference to various implementations, this description is not intended to be construed in a limiting sense. Hence, various modifications of the implementations described herein, as well as other implementations, which are apparent to persons skilled in the art to which the present disclosure pertains are deemed to lie within the spirit and scope of the present disclosure.