Face alignment is a term used to describe a process for locating semantic facial landmarks, such as eyes, a nose, a mouth, and a chin. Face alignment is used for such tasks as face recognition, face tracking, face animation, and 3D face modeling. As these tasks are being applied more frequently in unconstrained environments (e.g., large numbers of personal photos uploaded through social networking sites), fully automatic, highly efficient and robust face alignment methods are increasingly in demand.
Most existing face alignment approaches are optimization-based or regression-based. Optimization-based methods are implemented to minimize an error function. In at least one existing optimization-based method, the entire face is reconstructed using an appearance model and the shape is estimated by minimizing a texture residual. In this example, the learned appearance models have limited expressive power to capture complex and subtle face image variations in pose, expression, and illumination.
Regression-based methods learn a regression function that directly maps image appearance to the target output. Complex variations may be learned from large training data. Many regression-based methods rely on a parametric model and minimize model parameter errors in the training. This approach is sub-optimal because small parameter errors do not necessarily correspond to small alignment errors. Other regression-based methods learn regressors for individual landmarks. However, because only local image patches are used in training and appearance correlation between landmarks is not exploited, such learned regressors are usually weak and cannot handle large pose variation and partial occlusion.
Optimization-based methods and regression-based methods also enforce a shape constraint, that is, the correlation between landmarks. Most existing methods use a parametric shape model to enforce the shape constraint. Given a parametric shape model, the model flexibility is often heuristically determined.
This document describes face alignment by explicit shape regression. A vectorial regression function is learned to infer the whole facial shape from an image and explicitly minimize alignment errors over a set of training data. The inherent shape constraint is naturally encoded into the regressor in a cascaded learning framework and applied from coarse to fine, without using a fixed parametric shape model. In one aspect, image features are indexed according to a current estimated shape to achieve invariance. Features are selected to form a regressor based on the features' correlation to randomly projected vectors that represent differences between known face shapes and corresponding estimated face shapes. The correlation-based feature selection results in selection of features that are highly correlated to the differences between the estimated face shapes and the known face shapes, and selection of features that are highly complementary to each other.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. The term “techniques,” for instance, may refer to device(s), system(s), method(s) and/or computer-readable instructions as permitted by the context above and throughout the document.
The detailed description is described with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The same numbers are used throughout the drawings to reference like features and components.
Face alignment by explicit shape regression refers to a regression-based approach that does not rely on parametric shape models. Rather, a regressor is trained by explicitly minimizing the alignment error over training data in a holistic manner by which the facial landmarks are regressed jointly in a vectorial output. Each regressed shape is a linear combination of the training shapes, and thus, shape constraint is realized in a non-parametric manner. Using features across the image for multiple landmarks is more discriminative than using only local patches for individual landmarks. Accordingly, from a large set of training data, it is possible to learn a flexible model with strong expressive power.
Face alignment by explicit shape regression, as described herein, includes a two-level boosted regressor to progressively infer the face shape within an image, an indexing method to index pixels relative to facial landmarks, and a correlation-based feature selection method to quickly identify a fern to be used as a second-level primitive regressor.
Regressor training module 104 processes each training image and corresponding known face shape 102 with an initial shape 106 to learn a set of regressors 108, which are output from the regressor training module 104.
The set of regressors 108 are then input to the alignment estimation module 110. Using the set of regressors 108, the alignment estimation module 110 is configured to estimate a face shape for an image having an unknown face shape 112. An estimated face shape 114 is output from the alignment estimation module 110.
Pixel indexing module 202 is configured to determine a number of features for a given image. In the described implementation, a feature is a number that represents the intensity difference between two pixels in an image. In an example implementation, each pixel is indexed relative to the currently estimated shape, rather than being indexed relative to the original image coordinates. This leads to geometric invariance and fast convergence in boosted learning.
Features can vary significantly from one image to another based on differences in scale or rotation. To achieve feature invariance against face scales and rotations, the pixel indexing module first computes a similarity transform to normalize a current shape to a mean shape. In an example implementation, the mean shape is estimated by performing a least squares fitting of all of the facial landmarks. Example facial landmarks may include, but are not limited to, an inner eye corner, an outer eye corner, a nose tip, a chin, a left mouth corner, a right mouth corner, and so on.
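The similarity normalization described above can be sketched with the standard closed-form least-squares alignment of two landmark sets. The function names and the numpy formulation below are illustrative, not taken from the document:

```python
import numpy as np

def similarity_transform(src, dst):
    """Least-squares similarity transform (scale, rotation, translation)
    mapping landmark set `src` onto `dst`; both are (L, 2) arrays.
    Illustrative sketch of the normalization step."""
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - src_mean, dst - dst_mean
    denom = (src_c ** 2).sum()
    # Closed-form 2-D solution: M = s * R for scale s and rotation R.
    a = (src_c * dst_c).sum() / denom
    b = (src_c[:, 0] * dst_c[:, 1] - src_c[:, 1] * dst_c[:, 0]).sum() / denom
    M = np.array([[a, -b], [b, a]])
    t = dst_mean - src_mean @ M.T
    return M, t
```

Applying `pts @ M.T + t` then maps a currently estimated shape into the mean-shape frame.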
While each pixel may be indexed using global coordinates (x, y) with reference to the currently estimated face shape, a pixel at a particular location with regard to a global coordinate system may have different semantic meanings across multiple images. Accordingly, in the techniques described herein, each pixel is indexed by local coordinates (δx, δy) with reference to a landmark nearest the pixel. This technique maintains greater invariance across multiple images, and results in a more robust algorithm.
In contrast, images 302(2) and 304(2) are shown each with two local coordinate systems having been overlaid. In each of these images, the local coordinate systems are defined such that the origin of each coordinate system corresponds to a particular facial landmark. For example, the upper coordinate system in both image 302(2) and image 304(2) is overlaid with its origin corresponding to the inner corner of the left eye. Similarly, the lower coordinate system is overlaid with its origin corresponding to the left corner of the mouth. Pixel “A” in image 302(2) is defined with reference to the upper coordinate system that is originated at the inner corner of the left eye, and has the same coordinates as pixel “A” in image 304(2). Similarly, pixel “B” in image 302(2) is defined with reference to the lower coordinate system that is originated at the left corner of the mouth, and has the same coordinates as pixel “B” in image 304(2).
Based on the local coordinate systems, pixels “A” and “B” in images 302(2) and 304(2) reference similar facial landmarks. For example, in both images, pixel “A” falls within the subject's eyebrow and pixel “B” falls just to the left of the corner of the subject's mouth.
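A minimal sketch of this local indexing, with hypothetical helper names: each sampled pixel is stored as a (nearest-landmark index, local offset) pair in the mean-shape frame, and is re-located in any image by adding the offset to that landmark's currently estimated position (the per-image similarity transform is omitted here for brevity):

```python
import numpy as np

def index_pixels(mean_shape, num_pixels, radius, rng):
    """Sample pixels as (anchor landmark id, local offset) pairs.
    Hypothetical sketch of the local indexing idea; `radius` bounds the
    offset from the anchor landmark."""
    offsets = rng.uniform(-radius, radius, size=(num_pixels, 2))
    anchors = rng.integers(0, len(mean_shape), size=num_pixels)
    return anchors, offsets

def locate_pixels(shape, anchors, offsets):
    """Map locally indexed pixels into image coordinates for the
    currently estimated shape (L, 2)."""
    return shape[anchors] + offsets
```

Because the offsets ride along with the landmarks, the same (anchor, offset) pair picks out semantically similar pixels (such as "A" and "B" above) across different images.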
Referring back to
Feature selection module 204 is configured to select F features from the P^2 features that are determined by the pixel indexing module 202. The F selected features will constitute a fern, which will then be used by the two-level boosted regression module as a second-level primitive regressor.
Two-level boosted regression module 206 is configured to learn a vectorial regression function, Rt, to update a previously-estimated face shape, St-1, to a new estimated face shape, St. The two-level boosted regression module 206 learns the first-level regressor, Rt, based on the image, I, and a previous estimated face shape, St-1. Each Rt is constructed from the primitive regressor ferns generated by the feature selection module 204, which are based on features indexed relative to the previous estimated face shape, St-1.
The two-level boosted regressor includes early regressors, which handle large shape variations, and are very robust, and later regressors, which handle small shape variations, and are very accurate. Accordingly, the shape constraint is automatically and adaptively enforced from coarse to fine.
Face shapes 502(1) and 502(2) illustrate a range of differences in yaw, which accounts for rotation around a vertical axis. In other words, the shape of a face in an image will differ as illustrated by example face shapes 502(1) and 502(2) depending on a degree to which the person's head is turned to the left or to the right.
Face shapes 504(1) and 504(2) illustrate a range of differences in roll, which accounts for rotation around an axis perpendicular to the display. In other words, the shape of a face in an image will differ as illustrated by example face shapes 504(1) and 504(2) depending on a degree to which the person's head is tilted to the left or to the right.
Face shapes 506(1) and 506(2) illustrate a range of differences in scale, which accounts for an overall size of the face. In other words, the shape of a face in an image will differ as illustrated by example face shapes 506(1) and 506(2) depending on a perceived distance between the camera and the person.
Example face shapes 602(1) and 602(2) illustrate a range of subtle differences in the face contour and mouth shape; example face shapes 604(1) and 604(2) illustrate a range of subtle differences in the mouth shape and nose tip; and example face shapes 606(1) and 606(2) illustrate a range of subtle differences in the position of the eyes and the tip of the nose.
An operating system 710, a face alignment application 712, and one or more other applications 714 are stored in memory 708 as computer-readable instructions, and are executed, at least in part, on processor 706.
Face alignment application 712 includes a regressor training module 104, training images 102, initial shapes 106, learned regressors 108, and an alignment estimation module 110. As described above, the regressor training module 104 includes a pixel indexing module 202, a feature selection module 204, and a two-level boosted regression module 206.
In an example implementation, training images 102 are maintained in a data store. Each training image includes an image, I, and a known shape, g. Initial shapes 106 include any number of shapes to be used as initial shape estimates during a training phase to learn the regressors, or when estimating a face shape for a non-training image. In an example implementation, initial shapes 106 are randomly sampled from a set of images with known face shapes. This set of images may be different from the set of training images. Alternatively, the initial shapes 106 may be mean shapes calculated from any number of known shapes. A variety of other techniques may be used to establish a set of one or more initial shapes 106. The initial shapes 106 may be used by the two-level boosted regression module 206 when learning the regressors, and may also be used by the alignment estimation module 110 when estimating a shape for an image with no known face shape.
Learned regressors 108 are output from the two-level boosted regression module 206. The learned regressors 108 are maintained and subsequently used by alignment estimation module 110 to estimate a shape for an image with no known face shape.
Although illustrated in
Computer-readable media includes at least two types of computer-readable media, namely computer storage media and communications media.
Computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules, or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store information for access by a computing device.
In contrast, communication media may embody computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave. As defined herein, computer storage media does not include communication media.
Regressors are learned during a training process using a large number of images (e.g., training images 102). For each image in the training data, the actual face shape is known. For example, the face shapes in the training data may be labeled by a human.
A face shape, S, is defined in terms of a number, L, of facial landmarks, each represented by an x and y coordinate, such that:
S=[x1,y1, . . . ,xL,yL].
Given an image of a face, the goal of face alignment is to estimate a shape, S, that is as close as possible to the true shape, Ŝ, thereby minimizing the value of:
∥S−Ŝ∥₂  (1)
At block 802, for each training image, I, its known shape, Ŝ, is identified. For example, two-level boosted regression module 206 selects training images and corresponding known shapes from training images 102.
At block 804, for each training image, an initial shape estimation, S0, is selected. For example, two-level boosted regression module 206 selects one or more shapes from initial shapes 106.
At block 806, a first level regression parameter, T, is defined. T may be defined as any number. However, selection of a particular value for T may impact both computational cost and accuracy. In an example implementation, T is defined such that T=10.
At block 808, a first level regression index, t, is initialized to t=1. The first level regression index is configured to increment from 1 to T.
At block 810, a second level regression parameter, K, is defined. K may be defined as any number. However, selection of a particular value for K may impact both computational cost and accuracy. In an example implementation, K is defined such that K=500.
At block 812, a number, P, of pixels, which are locally indexed, are randomly sampled from each training image based on estimated shape St-1 and the known shape of each training image. Locally indexed pixels are described above with reference to pixel indexing module 202 and
At block 814, for each training image, two-level boosted regression module 206 initializes a second level initial shape estimation, S20, such that S20=St-1.
At block 816, a second level regression index, k, is initialized to k=1. The second level regression index is configured to increment from 1 to K.
At block 818, a second level regression is performed to construct a second level regressor, rk. The second level regression is described in further detail below with reference to
At block 820, the second level regression index is incremented such that k=k+1.
At block 822, a determination is made as to whether or not a sufficient number of second level regressors have been constructed. If k<=K (the “No” branch from block 822), the processing continues as described above with reference to block 818.
At block 824, the first-level regressor, Rt, is constructed such that Rt=(r1, . . . , rk, . . . , rK).
At block 826, for each training image, a new shape estimation, St, is calculated such that St=S2K, the final second-level shape estimation.
At block 828, the first level regression index, t, is incremented such that t=t+1.
At block 830, a determination is made as to whether or not t is now greater than T. If t<=T (the “No” branch from block 830), then processing continues as described above with reference to block 812. However, if t>T, indicating that each of regressors R1-RT have been learned (the “Yes” branch from block 830), then processing is complete, as indicated by block 832.
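The control flow of blocks 802-832 can be sketched as two nested loops. To keep the loop structure visible, this sketch replaces the fern primitive regressor with a shrunken mean of the residuals; it is a toy stand-in under that assumption, not the document's regressor:

```python
import numpy as np

def train_cascade(S_true, S_init, T=3, K=5, beta=10.0):
    """Skeleton of the two-level training loop (blocks 802-832).
    S_true, S_init: (N, 2L) arrays of known and initial shapes.
    The primitive regressor here is a shrunken mean of the residuals,
    a deliberate simplification of the fern regressor."""
    N = len(S_true)
    S = S_init.copy()                        # per-image estimates, blocks 804/814
    cascade = []
    for t in range(T):                       # first level, blocks 808-830
        stage = []
        for k in range(K):                   # second level, blocks 816-822
            Y = S_true - S                   # regression target (block 902)
            r = Y.mean(axis=0) / (1.0 + beta / N)  # toy primitive regressor
            S = S + r                        # update second-level estimate
            stage.append(r)
        cascade.append(stage)                # Rt = (r1, ..., rK), block 824
    return cascade, S
```

Even in this degenerate form, the additive structure mirrors equation (2): each stage contributes an increment on top of the previous estimate.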
As illustrated in
St=St-1+Rt(I,St-1), t=1, . . . ,T  (2)
As described below with reference to
For example, given N training images with known face shapes, {(Ii, Ŝi)}i=1N, where Ii is the ith training image and Ŝi is the known face shape of the ith training image, the regressors (R1, . . . , Rt, . . . , RT) are sequentially learned until the training error no longer decreases. That is, each regressor Rt is learned by explicitly minimizing the sum of alignment errors such that:
Rt=arg minR Σi=1N ∥Ŝi−(Sit-1+R(Ii,Sit-1))∥  (3)
where Sit-1 is the shape estimated in the previous stage.
As discussed above, regressing the entire shape, which may be as large as dozens of landmarks, is a difficult task, especially in the presence of large image appearance variations and rough shape initializations. To address this challenge, each weak regressor, Rt, is learned by a second level boosted regression such that Rt=(r1, . . . , rk, . . . , rK). In this second level, the shape-indexed image features are fixed, such that they are indexed only relative to St-1.
At block 902, for each training image, a regression target, Y, is calculated such that Y=Ŝ−S2k-1.
That is, Y is defined as the difference between the known face shape of the training image and the current estimated face shape.
At block 904, a feature parameter, F, is defined. F represents a number of features to be selected for use as a fern regressor. F may be defined as any number. However, selection of a particular value for F may impact both computational cost and accuracy. In an example implementation, F is defined such that F=5.
At block 906, a feature index, f is initialized to f=1. The feature index is configured to increment from 1 to F.
At block 908, for each training image, the regression target, Y, is projected to a random direction to generate a scalar value.
At block 910, a particular feature is selected from the P^2 features calculated for each training image, such that the selected feature has the highest correlation, among the calculated features, to the scalar values generated at block 908.
At block 912, the feature index is incremented such that f=f+1.
At block 914, a determination is made as to whether or not a sufficient number of features have been selected. If f<=F (the “No” branch from block 914), the processing continues as described above with reference to block 908, to select another feature.
At block 916, when it is determined that f>F, indicating that the desired number of features have been selected (the “Yes” branch from block 914), a fern regressor, rk, is constructed using the F selected features.
At block 918, for each training image, a new second level estimated face shape, S2k, is generated according to rk. Processing then continues as described above with reference to block 820 of
As described with reference to
The random projection (see block 908 of
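The projection-and-correlation selection of blocks 904-914 might be sketched as follows; the function name and array layout (an (N, P*P) feature matrix and an (N, 2L) target matrix) are assumptions for illustration:

```python
import numpy as np

def select_features(features, Y, F=5, rng=None):
    """Correlation-based selection (blocks 904-914): for each of F
    rounds, project the target Y onto a random direction and keep the
    pixel-difference feature most correlated with the projection.
    Illustrative sketch, not the document's code."""
    rng = rng or np.random.default_rng()
    chosen = []
    for _ in range(F):
        v = rng.normal(size=Y.shape[1])
        y_proj = Y @ v                            # block 908: scalar targets
        # Correlation of every feature column with the projected target.
        fc = features - features.mean(axis=0)
        yc = y_proj - y_proj.mean()
        corr = (fc * yc[:, None]).sum(axis=0) / (
            np.linalg.norm(fc, axis=0) * np.linalg.norm(yc) + 1e-12)
        chosen.append(int(np.argmax(np.abs(corr))))  # block 910
    return chosen
```

Scoring against a fresh random projection in each round is what encourages the F selected features to be complementary rather than redundant.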
As described herein, in an example implementation, each primitive regressor, r, is implemented as a fern. A fern is a composition of F features (e.g., F=5) and thresholds that divide the feature space (and all training samples) into 2^F bins. Each bin, b, is associated with a regression output δSb that minimizes the alignment error of training samples Ωb falling into the bin such that:
δSb=arg minδS Σi∈Ωb ∥Ŝi−(Si+δS)∥  (4)
where Si denotes the shape estimated in the previous step.
The solution to equation (4) is the mean of shape differences:
δSb=(1/|Ωb|) Σi∈Ωb (Ŝi−Si)  (5)
In an example implementation, over-fitting may occur if there is insufficient training data in a particular bin. To account for such over-fitting, a free shrinkage parameter, β, is used. When the bin has sufficient training samples, the shrinkage parameter has little effect, but when there is insufficient training data, the estimation is adaptively reduced according to:
δSb=(1/(1+β/|Ωb|)) (1/|Ωb|) Σi∈Ωb (Ŝi−Si)  (6)
The number, F, of features in a fern and the shrinkage parameter, β, adjust the trade-off between fitting power in training and generalization ability when testing. In an example implementation, F=5 and β=1000.
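A fern of this kind can be sketched as follows; the class name and array layout are illustrative, with each bin's output shrunk per the adaptive reduction described above:

```python
import numpy as np

class Fern:
    """Primitive fern regressor: F (feature, threshold) pairs split the
    training samples into 2**F bins; each bin stores the shrunken mean
    of its shape residuals. Illustrative sketch."""
    def __init__(self, thresholds, beta=1000.0):
        self.thresholds = np.asarray(thresholds)  # one per selected feature
        self.beta = beta
        self.outputs = None

    def _bin(self, feats):
        # Each sample's F threshold tests form the bits of its bin index.
        bits = (feats > self.thresholds).astype(int)
        return bits @ (1 << np.arange(len(self.thresholds)))

    def fit(self, feats, Y):
        F = len(self.thresholds)
        self.outputs = np.zeros((2 ** F, Y.shape[1]))
        bins = self._bin(feats)
        for b in range(2 ** F):
            members = Y[bins == b]
            if len(members):
                # Shrunken mean of residuals; small bins are damped.
                shrink = 1.0 / (1.0 + self.beta / len(members))
                self.outputs[b] = shrink * members.mean(axis=0)
        return self

    def predict(self, feats):
        return self.outputs[self._bin(feats)]
```

With β=0 the fern reduces to plain per-bin means; larger β pulls sparsely populated bins toward a zero update, matching the trade-off the document describes.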
At block 1002, an image is received. For example, as illustrated in
At block 1004, an initial shape estimation, S0, is selected. For example, alignment estimation module 110 selects an initial shape from initial shapes 106.
At block 1006, a two-level cascaded regression is performed to estimate a face shape. For example, alignment estimation module 110 applies learned regressors 108 to image 112 to determine an estimated face shape 114.
At block 1008, the estimated face shape is output. For example, the alignment estimation module 110 returns the estimated face shape to a calling application.
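The test-time flow of blocks 1002-1008 reduces to iterating the additive update of equation (2); the callable signatures below are assumptions for illustration:

```python
import numpy as np

def estimate_shape(image_feats_fn, cascade, S0):
    """Apply a learned cascade at test time (blocks 1002-1008):
    starting from initial shape S0, add each stage's increment in turn.
    `image_feats_fn(S)` stands in for shape-indexed feature extraction;
    each stage is a callable returning a shape increment. Sketch under
    those assumptions."""
    S = S0.copy()
    for stage in cascade:              # equation (2): St = St-1 + Rt(I, St-1)
        S = S + stage(image_feats_fn(S), S)
    return S
```

Note that the features are re-extracted from the updated shape before each stage, which is what makes the indexing shape-relative rather than image-relative.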
As described above, shape constraint is defined as the correlation between landmarks. According to the explicit shape regression technique described herein, the correlation between landmarks is preserved by learning a vector regressor and explicitly minimizing the shape alignment error (as given in Equation (1)). Because each shape update is additive and each shape increment is the linear combination of certain training shapes, {Ŝi} (as shown in Equations (5) and (6)), the final regressed shape, S, can be expressed as the initial shape, S0, plus the linear combination of all training shapes, or:
S=S0+Σi=1N wiŜi  (7)
Accordingly, as long as the initial shape, S0, is selected from the training shapes, the regressed shape is constrained to reside in the linear subspace constructed by all of the training shapes. Furthermore, any intermediate shape in the regression also satisfies the constraint. According to the techniques described herein, rather than being heuristically determined, the intrinsic dimension of the subspace is adaptively determined during the learning phase.
Although the subject matter has been described in language specific to structural features and/or methodological operations, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or operations described. Rather, the specific features and acts are disclosed as example forms of implementing the claims.