The present application relates to a method and a system for verifying face images, in particular, to a method and a system for verifying face images based on canonical images.
Face images in the wild will undergo large intra-personal variations, such as in poses, illuminations, occlusions and resolutions. Dealing with variations of face images is the key challenge in many face-related applications.
To deal with face variation, there are methods for face normalization in which an image in a canonical view (with a frontal pose and neutral lighting) is recovered from a face image under a large pose and a different lighting. The face normalization methods can be generally separated into two categories: 3D-based and 2D-based face reconstruction methods. The 3D-based methods aim to recover the frontal pose by 3D geometrical transformations. The 2D-based methods infer the frontal pose with graphical models, such as Markov Random Fields (MRF), where the correspondences are learned from images in different poses. The above methods have certain limitations: capturing 3D data adds cost and requires additional resources, and 2D face synthesis depends heavily on good alignment, while the results are often not smooth on real-world images. Furthermore, these methods were mostly evaluated on face images collected under controlled conditions, either using 3D information or in a controlled 2D environment.
Therefore, to address one or more of the above problems, it is desirable to provide a system and a method for verifying face images based on canonical images in which the canonical image for each identity can be automatically selected or synthesized so that the intra-person variances are reduced, while the inter-person discriminative capabilities are maintained.
The present application proposes a new face reconstruction network that can reconstruct canonical images from face images in arbitrary wild conditions. These reconstructed images may dramatically reduce the intra-personal variances, while maintaining the inter-personal discriminative capability. Furthermore, the face reconstruction approach can be used for face verification.
In an aspect of the present application, a method for verifying face images based on canonical images is disclosed. The method may comprise:
a step of retrieving, from a plurality of face images of an identity, a face image with a smallest frontal measurement value as a representative image of the identity;
a step of determining parameters of an image reconstruction network based on mappings between the retrieved representative image and the plurality of face images of the identity;
a step of reconstructing, by the image reconstruction network with the determined parameters, at least two input face images into corresponding canonical images respectively; and
a step of comparing the reconstructed canonical images to verify whether they belong to a same identity,
wherein the representative image is a frontal image and the frontal measurement value represents symmetry of each face image and sharpness of the image.
In another aspect of the present application, a system for verifying face images based on canonical images is disclosed. The system may comprise:
a retrieving unit configured to retrieve, from a plurality of face images of an identity, a face image with a smallest frontal measurement value as a representative image of the identity;
an image reconstruction unit configured to reconstruct the input face images into corresponding canonical images respectively;
a determining unit configured to determine parameters of the image reconstruction unit, wherein the parameters are determined based on mappings between the representative image retrieved by the retrieving unit and the plurality of face images of the identity; and
a comparing unit configured to compare the canonical images reconstructed by the image reconstruction unit to verify whether they belong to a same identity,
wherein the representative image is a frontal image and the frontal measurement value represents symmetry of each face image and sharpness of the image.
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When appropriate, the same reference numbers are used throughout the drawings to refer to the same or like parts.
As shown in
The retrieving unit 101 may retrieve, from a plurality of face images of an identity, a face image with a smallest frontal measurement value as a representative image of the identity, wherein the representative image is a frontal image and the frontal measurement value represents symmetry of each face image and sharpness of the image. Herein, the sharpness of an image refers to a rank of the matrix of the image.
In an embodiment of the present application, the retrieving unit 101 may include a calculating unit (not shown). The calculating unit may calculate the frontal measurement value of each of the plurality of face images, which will be discussed later. Those face images may be collected from an existing face database or the web. In another embodiment of the present application, the retrieving unit 101 may include a ranking unit (not shown) which may rank the frontal measurement values calculated by the calculating unit in increasing or decreasing order. Accordingly, the retrieving unit 101 may set the first or the last face image in the ranking as the representative image of the identity.
The determining unit 103 may determine parameters of the image reconstruction unit 105. The image reconstruction unit 105 may reconstruct any input face image into a corresponding canonical image, wherein the canonical image is a frontal face image under neutral illumination. As shown in
The comparing unit 107 may compare the canonical images reconstructed by the image reconstruction network 105 to verify whether they belong to a same identity. In one embodiment of the present application, the image reconstruction network 105 may comprise a plurality of layers of sub-networks, and the determining unit 103 may: determine preliminary parameters of each layer of the image reconstruction network based on the mappings by inputting an image training set, wherein an output of the previous layer of sub-network is inputted into a current layer of sub-network during the determining; compare an output of the last layer of sub-network with an expected target to obtain an error therebetween; and, based on the obtained error, fine-tune the preliminary parameters to finalize all parameters of the image reconstruction network. For example, as shown in
In one embodiment of the present application, as shown in
In one embodiment of the present application, the system 100 may further include a selecting unit (not shown) which may select one or more facial components from each of the reconstructed canonical images respectively to form one or more facial component pairs, each including facial components corresponding to the same face regions in the canonical images respectively. The acquiring unit 106 may acquire similarity between the facial component pairs and the determining unit 103 may determine the parameters of the image verification network 108 based on the similarity between the facial component pairs acquired by the acquiring unit 106. In an embodiment of the present application, the determining unit 103 may determine the parameters of the network 108 based on both the similarity between the reconstructed face images and the similarity between the facial component pairs.
In one embodiment of the present application, the system 100 may include one or more processors (not shown). Processors may include a central processing unit (“CPU”), a graphics processing unit (“GPU”), or other suitable information processing devices. Depending on the type of hardware being used, processors can include one or more printed circuit boards, and/or one or more microprocessor chips. Furthermore, processors can execute sequences of computer program instructions to perform the process 1000 that will be explained in greater detail below.
In summary, the present system has three key contributions. Firstly, to the best of our knowledge, canonical face images can be reconstructed by using only 2D information from face images in the wild. A new deep reconstruction network is introduced that combines representative face selection and face reconstruction, which shows state-of-the-art performance on face verification in the wild. Secondly, the reconstructed images are of high quality. Existing approaches can be significantly improved when they adopt the present method as a normalization step. Thirdly, a face dataset six times larger than the LFW dataset can be contributed.
At step S1001, a face image with a smallest frontal measurement value may be retrieved from a plurality of face images of the identity as a representative image of the identity.
At this step, the plurality of face images of the identity may be collected from sources such as existing image databases or the web. The plurality of face images are under arbitrary pose and illumination. Then, for each of the plurality of face images, a frontal measurement value is calculated, and the face image with the smallest value may be set to be the representative image of the identity. The representative image is a frontal face image of the identity under neutral illumination. In an embodiment of the application, after the frontal measurement values are calculated, these values may be ranked in decreasing order, and the image with the last value is set to be the representative image of the identity. Alternatively, the values may be ranked in increasing order, and the image with the first value is set to be the representative image of the identity.
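The two ranking variants described at this step can be sketched as follows. This is a minimal illustration that assumes the frontal measurement values have already been computed and are distinct; the function name `pick_representative`, the image ids, and the numeric values are hypothetical.

```python
def pick_representative(measurements):
    """Return the representative image id under both ranking variants.

    `measurements` maps a (hypothetical) image id to its precomputed
    frontal measurement value; a smaller value means "more frontal".
    """
    # Variant 1: rank in increasing order and take the first entry.
    increasing = sorted(measurements, key=measurements.get)
    first_of_increasing = increasing[0]

    # Variant 2: rank in decreasing order and take the last entry.
    decreasing = sorted(measurements, key=measurements.get, reverse=True)
    last_of_decreasing = decreasing[-1]

    # With distinct values both variants agree; ties could resolve differently.
    return first_of_increasing, last_of_decreasing


# Illustrative values only, not measured data.
values = {"img_a.png": 3.2, "img_b.png": -1.5, "img_c.png": 0.7}
by_increasing, by_decreasing = pick_representative(values)
```

Both variants return the image with the smallest value, so the choice of ordering only affects which end of the ranking is read.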
In particular, a plurality of face images of an identity $i$ are collected in a set of images $D_i$, in which a matrix $Y_i \in D_i$ denotes a face image in the set of face images $D_i$.
The above-mentioned frontal measurement value is formulated as the following equation (1):
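Equation (1)
$$M(Y_i) = \|Y_i P - Y_i Q\|_F - \lambda \|Y_i\|_*,$$
where $Y_i \in \mathbb{R}^{2a \times 2a}$, $\lambda$ is a constant coefficient, $\|\cdot\|_F$ is the Frobenius norm, $\|\cdot\|_*$ denotes the nuclear norm, which is the sum of the singular values of a matrix, and $P, Q \in \mathbb{R}^{2a \times 2a}$ are two constant matrices with $P = \mathrm{diag}([\mathbf{1}_a, \mathbf{0}_a])$ and $Q = \mathrm{diag}([\mathbf{0}_a, \mathbf{1}_a])$, where $\mathrm{diag}(\cdot)$ indicates a diagonal matrix.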
The $M(Y_i)$ in Eq. (1) represents the symmetry and sharpness of a face image of an identity. The first term in Eq. (1) measures the face's symmetry, which is the difference between the left half and the right half of the face. A smaller value of the first term indicates that the face is more symmetric. The second term in Eq. (1) measures the rank of the matrix of the face image, where rank means the maximum number of linearly independent columns in a matrix. For example, if a face image is blurred or shows a side face (in which case background appears on one side of the image, generally as blocks of solid color at a scale similar to “a big close-up”), the number of linearly independent columns is relatively small, and thus the value of the second term (with its minus sign) is relatively large. Therefore, a smaller value of Eq. (1) indicates that the face is more likely to be in frontal view, more symmetric, sharper, and with very little pose change. By combining symmetry and matrix rank in this measurement, a frontal image of the identity under neutral lighting can be obtained automatically and efficiently.
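The measurement can be sketched in NumPy as below. This is an illustrative sketch rather than the implementation of the present application: it assumes a grayscale image stored as a matrix with an even number of columns, and it assumes that the right half is mirrored horizontally before being subtracted from the left half, so that the first term reflects the stated left/right difference (Eq. (1) itself only specifies the selection matrices P and Q). The function name `frontal_measure` and the default `lam=1.0` are hypothetical.

```python
import numpy as np


def frontal_measure(Y, lam=1.0):
    """Symmetry term minus lam times the nuclear norm; smaller is more frontal."""
    Y = np.asarray(Y, dtype=float)
    if Y.ndim != 2 or Y.shape[1] % 2 != 0:
        raise ValueError("expected a 2-D image with an even number of columns")
    a = Y.shape[1] // 2

    left = Y[:, :a]           # columns selected by P
    right = Y[:, a:][:, ::-1]  # columns selected by Q, mirrored (assumption)

    symmetry = np.linalg.norm(left - right, ord="fro")
    nuclear = np.linalg.norm(Y, ord="nuc")  # sum of singular values
    return symmetry - lam * nuclear


# A left/right symmetric toy "image": the symmetry term vanishes.
symmetric = np.array([[1.0, 2.0, 2.0, 1.0],
                      [3.0, 4.0, 4.0, 3.0],
                      [0.0, 5.0, 5.0, 0.0],
                      [2.0, 2.0, 2.0, 2.0]])
# Same image with the right half replaced by a flat block (toy "side face").
side = symmetric.copy()
side[:, 2:] = 7.0

candidates = [side, symmetric]
representative_index = min(range(len(candidates)),
                           key=lambda k: frontal_measure(candidates[k], lam=0.0))
```

With `lam=0.0` only the symmetry term is active, which makes the effect of the first term easy to inspect in isolation; in practice the coefficient would be tuned.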
At step S1002, based on the mappings between the representative image retrieved at step S1001 and the plurality of face images of the identity, parameters of the image reconstruction network 105 (such as illustrated in
It is noted that the step of determining may be performed repeatedly for any identity. For example, in another embodiment of the application, for the identity $i$, the representative image $Y_i$ may be selected from the set of images $D_i$ by a sparse linear combination $Y_i = \alpha_{i1} D_{i1} + \alpha_{i2} D_{i2} + \dots + \alpha_{ik} D_{ik}$, with $D_{ik}$ being the k-th image in the set $D_i$ (herein, also referred to as face selection, shown in
This is to maintain the discriminative ability of the reconstructed frontal view images. Thus, the face selection can be formulated as below:
where $M(Y_i)$ is defined in Eq. (1). The optimization problem in Eq. (2) is not convex in $Y$ and $\alpha$ jointly. However, if $Y$ is fixed, the problem with regard to $\alpha$ is a Lasso problem, which is convex; and if $\alpha$ is fixed, the function of $Y$ separates into a convex term plus a concave term, the latter being the negative nuclear norm. This can be solved by the concave-convex procedure (CCCP).
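For the sub-problem in which Y is fixed, any standard Lasso solver can be used. The sketch below uses iterative soft-thresholding (ISTA), which is one generic choice and is not stated in the text. Because Eq. (2) is not reproduced here, the objective 0.5·‖y − Dα‖² + η‖α‖₁ is an assumed form, with the images of the set vectorized as the columns of `D` and the fixed target vectorized as `y`; the names `lasso_ista`, `D`, `y`, and `eta` are hypothetical.

```python
import numpy as np


def soft_threshold(x, t):
    """Element-wise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def lasso_ista(D, y, eta, n_iter=500):
    """Minimise 0.5 * ||y - D @ alpha||^2 + eta * ||alpha||_1 by ISTA."""
    D = np.asarray(D, dtype=float)
    y = np.asarray(y, dtype=float)
    alpha = np.zeros(D.shape[1])

    # Lipschitz constant of the gradient: largest eigenvalue of D^T D.
    lipschitz = np.linalg.norm(D, ord=2) ** 2
    if lipschitz == 0.0:
        return alpha  # all-zero dictionary: the minimiser is alpha = 0

    for _ in range(n_iter):
        grad = D.T @ (D @ alpha - y)
        alpha = soft_threshold(alpha - grad / lipschitz, eta / lipschitz)
    return alpha


# Toy example with an orthonormal dictionary: the solution is the
# soft-thresholded projection of y, so small coefficients become exactly zero.
alpha = lasso_ista(np.eye(3), np.array([2.0, 0.05, 0.0]), eta=0.1)
```

The sparsity of the returned coefficients is what makes the combination of images in the set sparse.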
At step S1003, at least two input face images are reconstructed into their corresponding canonical images by the image reconstruction network. That is, the image reconstruction network may reconstruct any face images under arbitrary pose into corresponding canonical images which are frontal and under neutral illumination (herein also referred to as face recovery, shown in
where i is the index of identity and k indicates the k-th sample of identity i, X0 and Y denote a training image and a target image respectively. W is a set of parameters of the image reconstruction network.
In an embodiment of the present application, the parameters of the image reconstruction network may also be determined based on transformation between an input face image and the corresponding canonical images reconstructed by the network 105. Then, any face images can be reconstructed by using the image reconstruction network in which the parameters have been determined. The mapping means a transformation from one vector to another vector. Herein, the mapping may refer to a sequential non-linear mapping to transform an input image of the plurality of face images of the identity to the canonical view image of the same identity.
As shown in
Referring to
Equation (4)
$$X_{q,uv}^{l+1} = \sigma\Big(\sum_{p=1}^{I} W_{pq,uv}^{l} \circ (X_{p}^{l})_{uv} + b_{q}^{l}\Big),$$
where $W_{pq,uv}^{l}$ and $(X_{p}^{l})_{uv}$ denote the filter and the image patch at the image location $(u,v)$, respectively, and $p, q$ are the indexes of the input and output channels, with the sum running over the input channels. For instance, in the first convolutional layer, $p=1$ and $q=1,\dots,32$. Thus, $X_{q,uv}^{l+1}$ indicates the q-th channel output at the location $(u,v)$, which is the input to the (l+1)-th layer. $\sigma(x)=\max(0,x)$ is the rectified linear function and $\circ$ indicates the element-wise product. The bias vectors are denoted as $b$. At the fully-connected layer, the face image
Equation (5)
$$\bar{Y} = W^{L} X^{L} + b^{L},$$
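A forward pass through one layer of the form of Eq. (4), followed by a linear fully-connected output as in Eq. (5), can be sketched in NumPy as below. The sketch assumes that the element-wise product of each filter with its patch is summed to a scalar, that a stride of one with no padding is used, and that the last layer is an affine map of the flattened features; the array layouts and the function names `locally_connected_layer` and `fully_connected_output` are illustrative choices, not taken from the text.

```python
import numpy as np


def relu(x):
    """Rectified linear function sigma(x) = max(0, x)."""
    return np.maximum(0.0, x)


def locally_connected_layer(X, W, b):
    """One layer in the style of Eq. (4).

    X: (P, H, Wd) input channels.
    W: (P, Q, H_out, W_out, k, k), one k-by-k filter per input channel p,
       output channel q and location (u, v).
    b: (Q,) bias per output channel.
    Returns an array of shape (Q, H_out, W_out).
    """
    P, Q, H_out, W_out, k, _ = W.shape
    out = np.zeros((Q, H_out, W_out))
    for u in range(H_out):
        for v in range(W_out):
            patch = X[:, u:u + k, v:v + k]  # (P, k, k) patch at (u, v)
            # Element-wise product summed over p and the patch (assumption).
            out[:, u, v] = np.einsum("pqij,pij->q", W[:, :, u, v], patch) + b
    return relu(out)


def fully_connected_output(X_last, W_L, b_L):
    """Affine reconstruction from the flattened last-layer features."""
    return W_L @ X_last.ravel() + b_L


# Toy shapes: 1 input channel, 2 output channels, 3x3 input, 2x2 filters.
X0 = np.ones((1, 3, 3))
W1 = np.ones((1, 2, 2, 2, 2, 2))
b1 = np.array([0.0, -10.0])
X1 = locally_connected_layer(X0, W1, b1)
Y_bar = fully_connected_output(X1, np.ones((3, X1.size)), np.zeros(3))
```

In the toy example the second output channel is driven negative by its bias and is therefore zeroed by the rectifier, which shows the effect of σ.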
In an embodiment of the present application, the face selection and the face recovery may be jointly learned by combining Eq. (2) and Eq. (3) and optimized separately for each identity as below:
where $\gamma, \tau, \lambda, \eta'$ are the balancing parameters of the regularization terms. Eq. (6) indicates that each chosen image $Y_i$ must be a frontal image, maintain discriminativeness, and minimize the loss error. The values of $Y_i, \alpha_i, W$ are searched iteratively using the following steps:
where $\tilde{U}$ and $\tilde{V}$ are the truncations of $U$ and $V$ to the first $\mathrm{rank}(Y_i^t)$ columns, wherein $Y_i^t = U \Sigma V^T$ is the SVD of $Y_i^t$.
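The truncated factors described here can be computed directly from a singular value decomposition. The NumPy sketch below returns the two truncated matrices and the numerical rank; their product is a standard subgradient of the nuclear norm at the given matrix, which is the kind of quantity a concave-convex procedure linearizes. How the factors enter the update steps is not shown and is not assumed; the function name `truncated_svd_factors` is hypothetical.

```python
import numpy as np


def truncated_svd_factors(Y):
    """Return (U_t, V_t, r): U and V truncated to the first rank(Y) columns."""
    Y = np.asarray(Y, dtype=float)
    # Thin SVD: Y = U @ diag(s) @ Vt, singular values sorted in decreasing order.
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    r = int(np.linalg.matrix_rank(Y))  # numerical rank
    U_t = U[:, :r]
    V_t = Vt[:r, :].T
    return U_t, V_t, r


# Rank-one toy matrix: U_t @ V_t.T equals Y divided by its only singular value.
Y_t = np.outer([1.0, 2.0], [3.0, 4.0])
U_t, V_t, r = truncated_svd_factors(Y_t)
subgradient = U_t @ V_t.T
```

Individual singular vectors are only defined up to sign, but the product of the truncated factors is unaffected because the signs cancel.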
In an embodiment of the present application, in order to speed up the above iterative procedure of the three steps, a simple and practical training procedure, as illustrated in the following algorithm, is devised to first estimate W by using all the training examples and then to select the target for each identity.
And then, at step S1004, the canonical images reconstructed at step S1003 are compared to verify whether they belong to a same identity, that is, to verify whether face images corresponding to the canonical images respectively belong to the same identity.
In an embodiment of the application, the method 1000 may further comprise a step of acquiring a similarity between any two reconstructed canonical images to determine parameters of an image verification network, and the architecture of the network is shown in
In another embodiment of the application, the method 1000 may further comprise a step of selecting one or more facial components from each of the reconstructed canonical images respectively to form one or more facial component pairs, each including facial components corresponding to the same face regions in the canonical images respectively. In another embodiment of the application, the method 1000 may further comprise a step of acquiring the similarity between the facial component pairs to train the parameters of the image verification network.
According to the present application, the image verification network is developed to learn hierarchical feature representations from pairs of reconstructed canonical face images. These features are robust for face verification, since the reconstructed images already remove large face variations. It also has potential applications to other problems, such as face hallucination, face sketch synthesis and recognition.
As shown in
Referring to
In the image verification network, each CNN learns a joint representation of the facial component pairs or the face images to train the preliminary parameters of each layer of the CNNs. During the training, an output of the previous layer is inputted into a current layer. Then, an output of the last layer and an expected target are compared to obtain an error. Then, based on the obtained error, the logistic regression layer fine-tunes the preliminary parameters so as to concatenate all the joint representations as features to predict whether the two face images belong to the same identity.
In particular, as to the training of the image verification network, the filters are first trained by unsupervised feature learning. Then the image verification network is fine-tuned by using stochastic gradient descent (SGD) combined with back-propagation, which is known in the art. Similar to the training of the image reconstruction network, the back-propagation errors are passed backwards and then the fully-connected weights or filters are updated in each layer. An entropy error is adopted instead of the loss error because the labels y need to be predicted as below:
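Equation (8)
$$Err = y \log \bar{y} + (1 - y) \log(1 - \bar{y}),$$
where $\bar{y}$ denotes the predicted label and $y$ denotes the true label.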
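A logistic-regression output trained with an entropy error by stochastic gradient descent can be sketched as follows. The sketch uses the standard binary cross-entropy written with a leading minus sign so that it is minimized; this sign convention, the single-sample update, the learning rate, and the names `entropy_error` and `sgd_step` are assumptions of the sketch and are not details taken from the network described above.

```python
import numpy as np


def sigmoid(z):
    """Logistic function mapping a score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def entropy_error(y, y_hat, eps=1e-12):
    """Mean binary cross-entropy between labels y and predictions y_hat."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))


def sgd_step(w, b, x, y, lr=0.1):
    """One stochastic gradient step on a single (feature, label) pair."""
    y_hat = sigmoid(w @ x + b)
    # For a sigmoid output with cross-entropy, d(error)/d(score) = y_hat - y.
    delta = y_hat - y
    return w - lr * delta * x, b - lr * delta


# Toy pair of concatenated features labelled "same identity" (y = 1).
x = np.array([1.0, 2.0])
w, b = np.zeros(2), 0.0
before = entropy_error(np.array([1.0]), np.array([sigmoid(w @ x + b)]))
w, b = sgd_step(w, b, x, 1.0)
after = entropy_error(np.array([1.0]), np.array([sigmoid(w @ x + b)]))
```

After one step the predicted probability for the positive pair moves above 0.5, so `after` is smaller than `before`.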
The embodiments of the present application may be implemented using certain hardware, software, or a combination thereof. In addition, the embodiments of the present invention may be adapted to a computer program product embodied on one or more computer readable storage media (comprising but not limited to disk storage, CD-ROM, optical memory and the like) containing computer program codes.
In the foregoing descriptions, various aspects, steps, or components are grouped together in a single embodiment for purposes of illustrations. The disclosure is not to be interpreted as requiring all of the disclosed variations for the claimed subject matter. The following claims are incorporated into this Description of the Exemplary Embodiments, with each claim standing on its own as a separate embodiment of the disclosure.
Embodiments and implementations of the present application have been illustrated and described, and it should be understood that various other changes may be made therein without departing from the scope of the application.
The application is filed under 35 U.S.C. §111(a) as a continuation of International Application No. PCT/CN2014/000389, filed Apr. 11, 2014, entitled “Methods and Systems for Verifying Face Images Based on Canonical Images,” which is incorporated herein by reference in its entirety for all purposes.
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/CN2014/000389 | Apr 2014 | US |
Child | 15282851 | | US |