The present invention relates to face recognition apparatus and methods and, more particularly, to apparatus and methods for automatic face recognition using a computer programmed with software capable of receiving at least two digital images of a person and comparing those images to reach a conclusion as to the identity of the persons depicted therein.
Face recognition systems currently exist; however, it remains a challenge for such systems to adequately account for changes in pose, illumination, and facial expression. Because real-world images and videos often exhibit many such changes, it is desirable to provide a face recognition system that can adequately account for them.
Face recognition has remained an active research topic in computer vision for decades. In recent years, more and more research effort has been devoted to face recognition under uncontrolled settings. Face recognition can be categorized into two tasks: face identification and face verification. The former attempts to recognize the identity of a probe face based on a set of gallery face images with known identities. The latter attempts to determine whether or not a pair of faces is from the same subject.
Currently, more and more application systems exploit face recognition technologies, and most of them, if not all, require a controlled environment. When dealing with face recognition in an uncontrolled environment, various visual complications can affect the robustness of face recognition algorithms, such as changes in pose, illumination, and expression. Among these variations, pose variation is one of the most challenging, as shown in
To address this problem, decades of research have produced many benchmarks and effective algorithms. In general, the ways in which previous work on uncontrolled face recognition handles pose variation can be categorized into four kinds: (1) explicitly regularizing pose changes; (2) aligning faces to relieve in-plane pose transformation; (3) using higher-level information; and (4) building face part correspondences robust to pose variation. These are discussed in turn below.
A straightforward method to relieve the influence of pose changes is to explicitly regularize the pose difference. Prabhu et al. proposed a pose-invariant face recognition algorithm using a 3D generic elastic model. In their work, they generate a per-identity 3D face model from a frontal enrollment face image. With the 3D face model, they can then synthesize 2D views under different poses for matching. Faces from different viewpoints can thus be explicitly regularized to the same pose using the 3D face model for verification.
Yin et al. dealt with pose variation in a similar manner by bridging two testing face images to the same pose. They collected a generic identity dataset at the offline stage which presents the appearance of the same identities under different intra-personal settings (pose and illumination). In the testing stage, the probe face is first associated with a similar identity in the database. The algorithm then predicts the new appearance of the probe face under different intra-personal settings. Despite the intuitive motivation of these methods, one of their drawbacks is the data collection step, which requires non-trivial effort to obtain a per-identity frontal enrollment face image or a generic identity dataset.
Another line of research in dealing with pose changes is face alignment. In practice, pose changes can introduce two kinds of transformations: in-plane transformation and out-of-plane transformation. Face alignment algorithms usually transform faces into a normalized pose through a similarity transformation, which handles the in-plane transformation. Cao et al. proposed a learning-based descriptor for face recognition which utilized descriptors extracted from fiducial points. Chen et al. also showed that densely extracting high-dimensional descriptors around face landmarks can greatly improve face recognition performance.
As reported by Huang et al., a strong face alignment algorithm can be an effective preprocessing step for face recognition algorithms. However, building a face alignment system robust to different poses, illumination, expression, etc. is by itself a very challenging problem which requires a great deal of engineering effort. As a matter of fact, most state-of-the-art face alignment systems, even those with published papers, are often not fully accessible to the research community (an exception is the recent work of Xiong and De la Torre, who shared their code online). As a result, algorithms that require strongly aligned faces may not be practical when one wants to build an end-to-end functioning system for face recognition. Moreover, face alignment algorithms by design are more effective in handling in-plane transformation, so the residual misalignments after alignment can still affect the comparison between faces.
Higher-level representations have also attracted researchers' attention. In the area of face recognition, Kumar et al. proposed to use visual attributes to describe faces and to predict identities. Visual attributes are face appearance labels such as age, gender, jaw shape, nose size, etc. Since attributes are higher-level representations, they are robust to low-level appearance changes due to pose variations. However, in practice, collecting a sufficient number of attribute labels can be expensive, and pose variations can still have a negative influence since the attributes of the testing image are inferred from low-level image descriptors.
Without resorting to predefined 3D face models, carefully designed offline databases, or expensive high-level representations, another line of research treats pose-variant face recognition as an invariant matching problem by exploiting a part-based representation of the face. See, e.g., J. Wright and G. Hua, "Implicit elastic matching with randomized projections for pose variant face recognition", CVPR, 2009, and G. Hua and A. Akbarzadeh, "A robust elastic and partial matching metric for face recognition", ICCV, 2009. Many of these robust matching algorithms are built on a part-based representation of the face using local image descriptors. The algorithms either explicitly identify the correspondences between part descriptors or implicitly build them when computing the part-based representation of the face. Although identifying how parts from two faces should be compared cannot perfectly regularize the pose changes, it helps to handle both the in-plane and the out-of-plane transformation.
Arashloo et al. proposed a method based on image matching with a Markov random field (MRF) model to conduct pose-invariant face recognition. In this method, when verifying two faces, one of them is deformed to match the other by minimizing the energy of the image matching. The algorithm achieved state-of-the-art performance in face identification without requiring face alignment. One disadvantage of this method is that the dense image matching with the MRF model can be computationally expensive.
For a more complete understanding of the present invention, reference is made to the following detailed description of an exemplary embodiment considered in conjunction with the accompanying drawings, in which:
In one aspect of the present disclosure, a programmed computer computes a fixed-dimensional numerical signature from a single digital facial image or a set/track of face images from a human subject. This signature is invariant to visual variations induced by pose, illumination, and facial expression changes and can subsequently be used for face verification, identification, and detection in real-world photos and videos. The face recognition system based on this model utilizes a probabilistic elastic part model to achieve recognition accuracy on real-world, non-posed facial image datasets.
In another embodiment, a method for automatically categorizing a first digital image of a person and a second digital image of a person as either images of the same person or images of different persons, using a computer programmed with digital processing software, includes the steps of: receiving the first digital image in the programmed computer as input to the digital processing software; the digital processing software partitioning the first digital image into a plurality of sub-parts, each having a plurality of pixels and a location relative to the first digital image, each pixel having a value corresponding to the appearance thereof on a scale of visual values; for each of the plurality of sub-parts of the digital image, extracting a local descriptor based upon the appearance of the sub-part; augmenting each local descriptor with its location in the first digital image, thereby transforming the first image into a set of spatial-appearance descriptors; identifying one descriptor from the set of spatial-appearance descriptors to describe each part in a maximum likelihood sense; concatenating the appearance parts of the identified spatial-appearance descriptors in an order of the components to build a probabilistic elastic part (PEP) representation of the first digital image; performing the preceding steps for the second digital image; and calculating a similarity measure between the PEP representations of the first digital image and the second digital image to quantify the degree of similarity between the first image and the second image.
In another embodiment, the plurality of sub-parts are overlapping.
In another embodiment, the method further includes the step of reproducing the first digital image at a plurality of scales.
In another embodiment, the local descriptor is a Local Binary Pattern (LBP).
In another embodiment, the local descriptor is a scale-invariant feature transform (SIFT).
In another embodiment, the visual values are greyscale values.
In another embodiment, the first digital image and the second digital image are facial images.
In another embodiment, the first digital image is a set of digital images and the steps A-F are conducted for each of the set of digital images.
In another embodiment, the set of digital images are a plurality of frames from a video clip.
In another embodiment, the first digital image includes a plurality of digital images and further including the step of training a Gaussian mixture model (GMM) with the spatial-appearance descriptors from the plurality of digital images.
In another embodiment, each mixture component of the GMM is constrained to be a spherical Gaussian.
In another embodiment, the spherical Gaussians balance the impact of appearance and spatial location.
In another embodiment, further including the steps of obtaining a training set of images containing matching and non-matching facial image pairs; training an SVM classifier on the difference vectors associated with the training set of images; subsequently receiving new digital images into the computer; and distinguishing digital images of persons that match images of persons in the training set from digital images of persons that do not match images of persons in the training set.
In another embodiment, further including the step of classifying the first image as either the same person as a person appearing in the second image or a different person based upon the quantified similarity between the first image and the second image.
In another embodiment, the first digital image is a plurality of digital images of a single person.
In another embodiment, the plurality of digital images of the single person are images having differences in at least one of scale, pose, illumination or facial expression.
In another embodiment, further including applying a joint Bayesian adaptation to adapt the PEP-model to better fit the features of the pair of faces/face tracks by Bayesian maximum a posteriori parameter estimation.
In another embodiment, parameters of the PEP model may be learned using Expectation-Maximization (EM) from densely extracted spatial-appearance local descriptors from training face images.
In another embodiment, further including the step of constructing a horizontally flipped face image to be computed into the PEP representation.
In another embodiment, the step of calculating a similarity measure is performed by training an SVM on top of an element-wise absolute difference vector of the PEP representations of the first digital image and the second digital image.
Aspects of the present disclosure include face recognition apparatus and methods that account for changes in pose, illumination and face expression in images of the person subject to identification. The present disclosure presents a face recognition system capable of analyzing real-world images and videos captured without regard to pose, facial expression and illumination, which is also adequately flexible so that it can take into account new observations and factor out old representations. The present disclosure addresses the challenges to facial recognition which arise in the context of pose variant face verification under uncontrolled settings.
We propose another robust matching scheme to conduct pose-invariant face verification without requiring a strong face alignment component in H. Li, G. Hua, Z. Lin, J. Brandt and J. Yang, "Probabilistic elastic matching for pose variant face verification", CVPR, 2013, which article is incorporated by reference herein in its entirety. Therein a probabilistic elastic matching scheme is disclosed which can handle image-based, video-based, or even mixed image-video face verification in a unified framework. The probabilistic elastic matching is achieved based on a pose-invariant face representation produced from a probabilistic elastic part (PEP) model.
The present disclosure presents a process and apparatus (a software implementation of a new algorithm combined with a computing unit) to compute a fixed-dimensional numerical signature from either a single face image or a set/track of face images from a human subject. This signature is invariant to visual variations induced by pose, illumination, and facial expression changes, and it can subsequently be used for face verification, identification, and detection in real-world photos and videos. The face recognition system based on this model, namely the probabilistic elastic part model, has achieved top accuracy on several real-world face recognition benchmark datasets.
In accordance with the present disclosure, a probabilistic elastic part (PEP) model is used, which is a Gaussian mixture model (GMM) learned from a pool of local descriptors from all face images in the training corpus to capture the spatial-appearance distribution. Each mixture component of the GMM naturally defines a part. Given a face pair, the PEP-model builds a PEP representation for each face by sequentially concatenating part descriptors identified by each Gaussian component in a maximum likelihood sense. A difference vector is then calculated as the element-wise absolute difference between the two PEP representations. For face verification, we train an SVM on the difference vectors of all the feature pairs to decide if a pair of faces/face tracks is matched or not. We further propose a joint Bayesian adaptation algorithm to adapt the universally trained GMM to better model the pose variations between the target pair of faces/face tracks, which consistently improves face verification accuracy. Experiments show that the method achieved comparable performance to the state-of-the-art in the most restricted protocol on Labeled Faces in the Wild (LFW) and outperformed the best performance on YouTube video face database.
In our framework, each face is represented as a sequence of face part descriptors through the PEP model. In contrast to approaches where the part model is induced with heuristics, our PEP-model is automatically learned from data in a more principled way. With the PEP-model, we can build the PEP-representation for a face and conduct a robust matching between two faces without relying on strong face alignment algorithms. Moreover, our algorithm complements any state-of-the-art face alignment system due to its capability of dealing with residual misalignments. Compared to other representations, the PEP-representation is more general in that a single face image or a face track has a unified representation, i.e., the resulting vector representation is of the same dimension no matter whether the input is a single face image or multiple face images. This is especially compelling for video face verification as it does not need to conduct exhaustive pair-wise comparison of image frames.
In a first embodiment of the present disclosure, a computer may be programmed with software instructions that first compute a part-based representation for a single digital face image or face track. Each digital face image is densely partitioned into overlapping patches at multiple scales. A local descriptor such as a Local Binary Pattern (LBP) or scale-invariant feature transform (SIFT) descriptor is subsequently extracted from each patch. Each local descriptor is augmented with its location in the face image, and hence a face image is initially transformed into a set of spatial-appearance descriptors. In video-based face recognition, a face track is first transformed into a set of spatial-appearance descriptors extracted from all video frames. After that, the PEP model identifies one descriptor from the pool to describe each part in a maximum likelihood sense.
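By way of a non-limiting illustration, the extraction step may be sketched as follows, assuming scikit-image's uniform LBP as the appearance descriptor; the patch size, stride, scales, and LBP parameters are illustrative choices rather than values used in the original system.

    import numpy as np
    from skimage.transform import resize
    from skimage.feature import local_binary_pattern

    def spatial_appearance_descriptors(face, patch=24, stride=8, scales=(1.0, 0.75, 0.5)):
        """Densely extract [appearance | location] descriptors from a grayscale face (uint8).

        Each row of the returned array is the uniform-LBP histogram of one patch,
        augmented with the patch center (x, y) normalized to [0, 1]."""
        feats = []
        for s in scales:
            h, w = int(face.shape[0] * s), int(face.shape[1] * s)
            img = (resize(face, (h, w), anti_aliasing=True) * 255).astype(np.uint8)
            lbp = local_binary_pattern(img, P=8, R=1, method="uniform")   # codes 0..9
            for y in range(0, h - patch + 1, stride):
                for x in range(0, w - patch + 1, stride):
                    hist, _ = np.histogram(lbp[y:y + patch, x:x + patch],
                                           bins=10, range=(0, 10), density=True)
                    loc = [(x + patch / 2.0) / w, (y + patch / 2.0) / h]  # scale-normalized location
                    feats.append(np.concatenate([hist, loc]))
        return np.asarray(feats)

For a face track, the same routine may simply be applied to every frame and the resulting descriptor sets stacked together.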
To build the PEP-model for pose variant face verification, given a set of training images, the programmed computer trains a Gaussian mixture model (GMM) with the spatial-appearance descriptors from the training images. In speech recognition, such a GMM is also called a Universal Background Model (UBM). In the framework of the present disclosure, each mixture component of the GMM is constrained to be a spherical Gaussian to balance the impact of appearance and spatial location. Since each Gaussian component in the GMM intuitively describes the appearance and spatial distribution of a kind of facial part, the GMM is named the probabilistic elastic part model (PEP-model).
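One possible realization of this training step is scikit-learn's GaussianMixture with a spherical covariance type; the component count and iteration budget below are illustrative assumptions, not parameters of the original system.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_pep_model(descriptor_sets, n_components=256, seed=0):
        """Fit the PEP-model: a spherical-covariance GMM (a UBM) over the pooled
        spatial-appearance descriptors of all training face images."""
        pool = np.vstack(descriptor_sets)                 # descriptors from every training face
        pep = GaussianMixture(n_components=n_components,
                              covariance_type="spherical",  # one sigma_k^2 per component
                              max_iter=200,
                              random_state=seed)
        return pep.fit(pool)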
In building the PEP-representation for a digital face image/face track, each component of the PEP-model identifies a spatial-appearance descriptor (extracted from an image patch) from the descriptors extracted from the face image or face track. The appearance parts of the identified descriptors are then concatenated in the order of the components to build the PEP-representation. The pose-invariance is introduced in the descriptor selection stage: since a Gaussian component represents a facial part, it will consistently identify the spatial-appearance descriptor describing that facial part from the probe face, with elastic robustness to appearance changes and spatial offsets. When matching two faces for face verification, the element-wise absolute difference vector between their PEP-representations represents the difference between the two faces.
An SVM classifier is then trained on the difference vectors given a set of training matching/non-matching face/face track pairs, which is subsequently used to verify any new face/face track pairs. Since the PEP-model builds a consistent form for a pair of face images or face tracks, the matching framework can be used for both image-to-image and video-to-video face verification without any modification.
As shown in experiments, the proposed robust matching with the probabilistic elastic part model, namely probabilistic elastic matching (PEM), achieved state-of-the-art performance on LFW (working under the most restricted protocol) and outperformed top algorithms on the YouTube Video Face Dataset. To make PEM adaptive to each pair of faces, we further propose a joint Bayesian adaptation scheme to adapt the PEP-model to better fit the features of the pair of faces/face tracks by Bayesian maximum a posteriori parameter estimation.
We call such an adapted matching algorithm adaptive probabilistic elastic matching (APEM). APEM adapts the universally trained PEP-model to each face pair, biasing the Gaussian components toward the spatial-appearance subspace spanned by the face pair, which helps build PEP-representations that are more robust to pose changes. Hence it can achieve better verification accuracy. In our experiments, it consistently improves face verification performance over PEM at the cost of additional computation. Our experiments even show that our PEM and APEM algorithms, when applied to face verification with unaligned faces, i.e., raw face images extracted from the Viola-Jones face detector, can achieve decent performance or even outperform some state-of-the-art algorithms, such as the bio-inspired V1 features with multiple kernel learning applied to faces aligned with the funneling method under the most restricted protocol in LFW. This provides strong evidence that our PEM and APEM algorithms can better handle pose variations.
Hence, 1) we propose to use a universally trained PEP-model on spatial-appearance features as a bridge to build pose-invariant PEP-representations for both image and video face verification; 2) we show that the joint Bayesian adaptation of the PEP-model on the pair of faces/face tracks to be verified can further improve the matching; and 3) we achieve state-of-the-art face verification accuracy on both LFW (the most restricted protocol in image restricted setting), and the YouTube Faces benchmarks.
The GMM for visual recognition, the current state-of-the-art face verification algorithms, and the YouTube video face dataset are now discussed. The Gaussian mixture model may be used for various visual recognition tasks, including face recognition and scene recognition. While one may focus on modeling the holistic appearance of the face with a GMM, one may also exploit the bag-of-local-descriptors representation and use a GMM to model the local appearances of the image. In such frameworks, a GMM is the probabilistic representation of an image, which is then encoded into a super-vector representation for classification. As used herein, the universally trained GMM is a probabilistic general representation of the human face; each Gaussian component models the spatial-appearance distribution of a facial part. In terms of model adaptation, one may also leverage the GMM and Bayesian adaptation paradigm to learn adaptive representations, wherein super-vector representations are adopted for building the final classification model. In accordance with an embodiment of the present disclosure, joint spatial-appearance modeling is conducted using spherical Gaussians as the mixture components, and, in contrast to applying Bayesian adaptation to a single image, a joint Bayesian adaptation is conducted on a pair of faces/face tracks to better build the correspondences of the local descriptors in the two face images/face tracks.
One of the special features of the PEP-model is its spherical Gaussian components, which explicitly address the unbalanced dimensionality between the appearance and spatial constraints of a spatially augmented descriptor. A GMM with regular Gaussian components trained over the spatial-appearance features may not be desirable for building the PEP-model, because face structures that are similar in appearance but vary spatially, e.g., the left eye and the right eye, could be mixed into the same Gaussian component under a weak spatial constraint, as the dimensionality of the spatial location is relatively small compared to the size of the appearance descriptor. If a GMM with spherical Gaussian components is used as the PEP-model, the strength of the spatial constraint can be tuned by scaling the location units, which helps balance the influence of appearance and spatial constraint in learning the facial parts.
Previous works on image-based face verification have mostly reported their performance over the Labeled Faces in the Wild (LFW) dataset. The LFW benchmark has three protocols in the Image-Restricted Training setting for a 10-fold cross-validation evaluation. The most restricted protocol does not allow any additional datasets to be used for face alignment, feature extraction, or building the recognition model. The less restricted protocol allows the use of additional datasets for face alignment and feature extraction, but not for building the recognition model, while the least restricted protocol allows additional datasets to be exploited for all three tasks. The current state-of-the-art on the most restricted protocol is the fisher vector faces work presented by Simonyan et al., which achieved an average accuracy of 0.8747±0.0149.
Most recent works have focused on the less restricted and least restricted protocols, which have pushed the recognition accuracy as high as 0.9517±0.0113. They leveraged additional data sources or strong face alignment algorithms trained from external data sources. We focused our experiments on the most restricted protocol on LFW, as our interest is the design of a robust matching method for pose variant face verification. Besides the fact that our method does not exploit any outside training data or side information, the method of the present disclosure, on the one hand, is flexible with respect to local descriptor choices, so that it could benefit from specialized local descriptors; on the other hand, it can address residual misalignments, so that it can complement and benefit from a strong face alignment system. Restricting the evaluation to the most restricted protocol enables an objective evaluation of the capacity of our proposed part model and representation. Our method only employed simple visual features such as LBP and SIFT. We also observed consistent improvement when fusing the results from these two types of features together, suggesting that we can further improve face verification accuracy with the proposed method by fusing more types of features, or by feature learning.
While a number of state-of-the-art methods on LFW may not be applied to video-based face verification directly, our work can handle the video-based setting without modification. Wolf et al. published a video face verification benchmark, namely YouTube Faces, which has been widely recognized and evaluated in recent years. There are various ways to interpret the video-based setting. Wolf et al. treat each video as a set of face images and compute a set-to-set similarity; Zhen et al. take a spatial-temporal block based representation for each video and utilize multiple face region descriptors within a metric learning framework; in the framework of the present disclosure, we have a consistent PEP-representation for both video and image. Without exploiting temporal information or an extra reference dataset, our method uses the PEP-model to build a pose-invariant representation and hence identify local correspondences between face parts across frames. Our algorithm outperformed the state-of-the-art methods on the YouTube Faces dataset.
According to an embodiment of the invention, the probabilistic elastic part (PEP) model is employed. The PEP model learns a Gaussian mixture model over spatial-appearance descriptors (i.e., local descriptors augmented by their locations in the face image). By constraining the Gaussian components to be spherical, the PEP model balances the impact of the spatial and appearance parts and forces the allocation of a Gaussian component to each local region of the image. Given densely extracted spatial-appearance local descriptors from training face images, the parameters of the PEP model may be learned using Expectation-Maximization (EM). The third column of
The PEP representation presents numerous advantages when compared with existing part-based representations for face recognition. For instance, the parts of the PEP model are automatically learned from data instead of being hand-crafted. Additionally, the PEP model generates a single fixed-dimension representation given a varying number of face images from a subject; it unifies image and video based face recognition in a single representation. Here, the only difference is that in the video case, the maximum likelihood part descriptor is identified from all video frames. Further, when building the representation from multiple face images, e.g., a track of face images from a video, the PEP representation integrates the visual information from all of these face images together instead of selecting a single best frame to produce it.
Moreover, the PEP representation is additive, i.e., when additional face images of one specific person are available, the representation can be updated without revisiting the images that produced the original representation. This stems from the property of the max operation when identifying the maximum likelihood part descriptors, which can naturally be incrementally performed. Also, when applied for face identification, the PEP representation allows the gallery face database to scale linearly with the number of subjects, instead of number of images.
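The additive property can be sketched as follows: it suffices to cache, per Gaussian component, the best log-likelihood seen so far and its descriptor, and to fold each new image into that cache. The helper below is a hypothetical illustration and assumes a spherical GMM fitted as in the earlier sketch (means_ of shape (K, m), covariances_ of shape (K,)).

    import numpy as np

    def component_log_prob(pep, X):
        """log N(x | mu_k, sigma_k^2 I) for every descriptor and component, shape (N, K)."""
        mu, var = pep.means_, pep.covariances_            # (K, m), (K,)
        m = X.shape[1]
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        return -0.5 * (d2 / var + m * np.log(2.0 * np.pi * var))

    def update_pep_state(pep, new_descriptors, state=None):
        """Fold a new image's descriptors into the per-component max-likelihood cache."""
        K, m = pep.means_.shape
        if state is None:
            state = (np.full(K, -np.inf), np.zeros((K, m)))
        best_ll, best_f = state
        ll = component_log_prob(pep, new_descriptors)     # (N, K)
        idx = ll.argmax(axis=0)                           # best new descriptor per component
        gain = ll[idx, np.arange(K)] > best_ll            # components improved by the new image
        best_ll[gain] = ll[idx, np.arange(K)][gain]
        best_f[gain] = new_descriptors[idx[gain]]
        return best_ll, best_f

The final PEP representation is then the concatenation of the cached appearance parts; images already folded in never need to be revisited, and a gallery stores one such cache per subject.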
For image based face verification, we represent each face image as a set of spatial-appearance descriptors. As shown in
The face image is hence initially transformed into an ensemble of these spatial-appearance descriptors, i.e., F = {f_i}_{i=1}^{N}.
In video based face verification, the task is to verify whether or not two tracks of faces are from the same person (assuming each track of faces is the face of a single person). We adopt the same part-based representation for a face track by repeating the feature extraction pipeline in
The exact steps of the proposed probabilistic elastic matching method are illustrated in
Given a face/face track pair, both of which are represented as sets of spatial-appearance descriptors, we build a PEP-representation for each face. Given one of the faces/face tracks, for each Gaussian component in the PEP-model, we identify the spatial-appearance descriptor that induces the highest probability on that Gaussian component. We concatenate the descriptors identified by the Gaussian components to build the PEP-representation for the face/face track. In this process, given a face pair, the descriptors identified by the same Gaussian component should be from the same facial part. We call such a pair of descriptors a corresponding feature pair. The absolute element-wise difference vector between two PEP-representations incorporates the comparisons between all corresponding feature pairs, and it is subsequently fed into an SVM classifier for prediction.
An additional improvement is to conduct a joint Bayesian adaptation step that adapts the PEP-model to the union of the spatial-appearance descriptors from both face images/tracks, constrained a priori by the parameters of the original PEP-model, to form a new adapted PEP-model (APEP-model). We can then use the APEP-model instead of the universally trained PEP-model to build the PEP-representations. Since the probabilistic distribution described by the APEP-model is biased towards the spatial-appearance subspace spanned jointly by the face pair, the feature correspondences built by the APEP-model are more accurate.
We call the proposed approach using universally trained PEP-model to conduct elastic matching to be probabilistic elastic matching (PEM), and the approach using APEP-model to build the corresponding feature pairs to be adaptive probabilistic elastic matching (APEM). We proceed with detailed description of the key steps including the training of the PEP-model (Section 4.1), the building of PEP-representation (Section 4.2), the joint Bayesian adaptation algorithm for the APEM (Section 4.3) and a straightforward multiple feature fusion framework (Section 5).
As we have mentioned, the universally trained GMM is widely used in the area of speech recognition [31]. In our method, to balance the impact of the appearance and spatial location, we constrain the GMM to have spherical Gaussian components, i.e.,
where Θ = (ω_1, μ_1, σ_1, . . . , ω_K, μ_K, σ_K); K is the number of Gaussian mixture components; I is an identity matrix; ω_k is the mixture weight of the k-th Gaussian component; N(μ_k, σ_k² I) is a spherical Gaussian with mean μ_k and variance σ_k² I; and f is an m-dimensional spatial-appearance feature vector, i.e., f = [a^T l^T]^T, where a is the appearance part and l is the location part.
To fit such a GMM over the training set X = {f_1, f_2, . . . , f_M}, we resort to the Expectation-Maximization (EM) algorithm to obtain an estimate of the parameters of the GMM by maximizing the likelihood of the training descriptors. Formally,
The EM algorithm consists of the E-step which computes the expected log-likelihood and the M-step which updates parameters to maximize this expected log-likelihood [41]. Specifically, in our case, in the E-step, we calculate
where P(k|fi) is defined as
which is the posterior probability that the k-th Gaussian component generated feature f_i. In the M-step, the parameter set Θ is updated as
These two steps are iterated until convergence, at which time we obtain the GMM. Note that variances along different dimensions are indeed taken into consideration through Equation 9.
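The numbered equations referenced above are not reproduced here; for orientation, the standard EM updates for a GMM with spherical components take the following form, a sketch consistent with the description, with M training descriptors of dimension m:

    \begin{aligned}
    \text{E-step:}\quad & P(k \mid f_i) = \frac{\omega_k\, \mathcal{N}(f_i;\, \mu_k, \sigma_k^2 I)}{\sum_{j=1}^{K} \omega_j\, \mathcal{N}(f_i;\, \mu_j, \sigma_j^2 I)}, \\
    \text{M-step:}\quad & n_k = \sum_{i=1}^{M} P(k \mid f_i), \qquad
      \omega_k \leftarrow \frac{n_k}{M}, \qquad
      \mu_k \leftarrow \frac{1}{n_k} \sum_{i=1}^{M} P(k \mid f_i)\, f_i, \\
    & \sigma_k^2 \leftarrow \frac{1}{m\, n_k} \sum_{i=1}^{M} P(k \mid f_i)\, \lVert f_i - \mu_k \rVert^2 .
    \end{aligned}

The last update averages the squared deviations over all m dimensions, which is how the variances along different dimensions enter the single spherical variance mentioned above.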
As shown in
After we obtain the K-component PEP-model trained over the training spatial-appearance descriptors, we exploit it to form a PEP-representation in the form of a D = m_a × K dimensional vector for a face image/track, where m_a is the dimensionality of the appearance descriptor, e.g., LBP or SIFT.
Formally, we first transform a face/face track into a set of spatial-appearance descriptors F = {f_1, f_2, . . . , f_N}.
First we let each Gaussian component (ω_k, N(μ_k, σ_k² I)) commit one descriptor f_{g_k(F)} from F, such that
The face/face track is then represented as a sequence of K m_a-dimensional appearance descriptors, i.e., [a_{g_1} a_{g_2} . . . a_{g_K}], which is the PEP-representation of F. Note that in this representation, we keep only the appearance descriptors, since the spatial components are already taken into consideration in the descriptor selection stage (Equation 10). As shown in
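A sketch of this selection-and-concatenation step is shown below, assuming a spherical GMM fitted as in the earlier sketch and descriptors laid out as [appearance | location]; the variable names are illustrative.

    import numpy as np

    def pep_representation(pep, descriptors, m_a):
        """Build the D = m_a * K PEP-representation of one face image or face track.

        `descriptors`: (N, m) array of spatial-appearance descriptors (all frames pooled
        for a face track); `m_a`: dimensionality of the appearance part."""
        mu, var = pep.means_, pep.covariances_            # (K, m), (K,)
        m = descriptors.shape[1]
        d2 = ((descriptors[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        log_prob = -0.5 * (d2 / var + m * np.log(2.0 * np.pi * var))   # (N, K)
        g = log_prob.argmax(axis=0)                       # per-component selection, cf. Equation 10
        return np.concatenate([descriptors[gk, :m_a] for gk in g])     # appearance parts only

For a face track, the descriptor array is simply the union of the descriptors of all frames, so the same routine serves both the image and the video cases.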
To present intuitive understanding of the PEP-representation, we visualize the PEP-representations by aligning the image patches associated with the selected descriptors to the mean locations of the facial parts, as shown in
With the PEP-representations, given the i-th pair of faces/face tracks (F and F′), we take the difference of the two vectors produced from the PEP-representations, i.e.,
d_i = [Δa_{g_1}^T Δa_{g_2}^T . . . Δa_{g_K}^T]^T, (11)
where Δa_{g_k} = |a_{g_k(F)} − a_{g_k(F′)}|, which serves as the matching vector of a pair of faces/face tracks for face verification.
After building the representations for all the training pairs, a kernel SVM classifier, i.e.,
is then trained over C training difference vectors {d_1, d_2, . . . , d_C} with the Gaussian Radial Basis Function (RBF) kernel, i.e.,
k(d_i, d_j) = exp(−γ ‖d_i − d_j‖²), γ > 0, (13)
where i, j = 1, . . . , C. Given the difference vector d_t of a testing face/face track pair, the SVM predicts its label.
We employed LibSVM [42] to train the SVM classifier. We call the overall matching algorithm using the PEP-model probabilistic elastic matching (PEM).
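A minimal sketch of this classification step, using scikit-learn's SVC (which wraps LibSVM); the kernel parameters are illustrative and would in practice be selected by cross-validation.

    import numpy as np
    from sklearn.svm import SVC

    def difference_vector(rep_a, rep_b):
        """Element-wise absolute difference of two PEP-representations (cf. Equation 11)."""
        return np.abs(rep_a - rep_b)

    def train_pem_classifier(rep_pairs, labels, gamma=1.0, C=10.0):
        """Train the RBF-kernel SVM over difference vectors of matching / non-matching pairs."""
        D = np.vstack([difference_vector(a, b) for a, b in rep_pairs])
        clf = SVC(kernel="rbf", gamma=gamma, C=C)
        clf.fit(D, labels)          # labels: +1 matched, -1 non-matched
        return clf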
Prior work applying GMMs with Bayesian adaptation to visual recognition [33], [34] has operated either at the class level or at the image level. To make the matching process adaptive for each face/face track pair, we propose a joint Bayesian adaptation on the union of the bag of spatial-appearance descriptors from the faces/face tracks pair. In the joint adaptation process, the parameters of the universally trained GMM build the prior distribution for the parameters of the jointly adapted GMM under a Bayesian maximum a posteriori (MAP) framework.
We denote the universally learned GMM parameter set as Θ_b and the parameter set of the GMM after joint adaptation as Θ_p, where Θ_x = {ω_{x1}, μ_{x1}, σ_{x1}, . . . , ω_{xK}, μ_{xK}, σ_{xK}}, x ∈ {b, p}. Given a face/face track pair F and F′, the adaptive GMM is trained over the joint descriptor set X_p = {f_1, f_2, . . .}, which is the union of the descriptor sets X_q of F and X_s of F′, where |X_p| = |X_q| + |X_s|. Upon X_p, a MAP estimate for Θ_p can be obtained by maximizing the log-likelihood L(Θ_p),
L(Θ_p) = ln P(X_p | Θ_p) + ln P(Θ_p | Θ_b). (14)
The conjugate prior distribution of Θp is composed from the PEP-model parameter Θb [33], [34], [41], i.e.,
(ω_{p1}, . . . , ω_{pK}) ~ Dir(Tω_{b1}, . . . , Tω_{bK}) (15)
μ_{pk} ~ N(μ_{bk}, (σ_{bk}²/γ) I) (16)
The prior distribution over the mixture weights is a Dirichlet distribution. The parameter T can be interpreted as the count of descriptors introduced by the universally learned model. The prior distribution for the mean μ_{pk} is a spherical Gaussian distribution with variance smoothed by the parameter γ. We could also use a Normal-Wishart distribution over the variance as in [34], [41]. However, in order to stabilize the adapted GMM, we constrain the adapted variance to be the same as that of the universal model, i.e., σ_{pk}² = σ_{bk}².
With these priors, the parameters of the adapted GMM can be estimated by a Bayesian EM algorithm [33], [34], [41], i.e., in the E-step, we calculate
and in M-step, we update Θp as
where
α = N/(N + T), β_k = n_k/(n_k + γ) (22)
The adapted GMM can be interpreted as a mixture of facial part models, like the universally learned PEP-model. In our framework, we name the adapted GMM the APEP-model, following the same terminology. After we obtain the APEP-model for a given pair of faces/face tracks, we conduct APEM to build the PEP-representations and the difference vector. We observe that APEM improves some feature correspondences, as shown in
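One possible realization of this joint adaptation is sketched below, following the standard MAP (relevance-factor) updates for GMM weights and means, which are consistent with the priors above; the variances are kept fixed as stated, and T, gamma, and the iteration count are illustrative assumptions rather than values from the original work.

    import numpy as np

    def joint_bayesian_adapt(pep, descr_a, descr_b, T=100.0, gamma=16.0, n_iter=3):
        """MAP-adapt the PEP-model to the union of descriptors from one face/face-track pair.

        `pep` is the universally trained spherical GMM; returns adapted (weights, means, variances)."""
        X = np.vstack([descr_a, descr_b])                 # joint descriptor set X_p
        N, m = X.shape
        w, mu = pep.weights_.copy(), pep.means_.copy()
        var = pep.covariances_                            # fixed: sigma_pk^2 = sigma_bk^2
        for _ in range(n_iter):
            # E-step: posteriors P(k | f_i) under the current adapted parameters
            d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
            log_p = np.log(w)[None, :] - 0.5 * (d2 / var + m * np.log(2.0 * np.pi * var))
            log_p -= log_p.max(axis=1, keepdims=True)
            post = np.exp(log_p)
            post /= post.sum(axis=1, keepdims=True)       # (N, K)
            # M-step with conjugate priors centered on the universal model
            n_k = post.sum(axis=0)
            alpha, beta = N / (N + T), n_k / (n_k + gamma)        # cf. Equation 22
            w = alpha * (n_k / N) + (1.0 - alpha) * pep.weights_
            w /= w.sum()
            Ef = (post.T @ X) / np.maximum(n_k, 1e-12)[:, None]   # per-component data means
            mu = beta[:, None] * Ef + (1.0 - beta)[:, None] * pep.means_
        return w, mu, var

The adapted weights and means can then be substituted for the universal ones when building the PEP-representations of the two faces, which is the APEM variant described above.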
As shown in Section 4.2, the PEP-representations, i.e., the feature correspondences, are the key to handling the pose variations. How well the spatial-appearance descriptor from the same facial part can be located by a Gaussian component relies highly on the construction of the PEP-model. The Gaussian distribution affects the responses as shown in
From the visualization (
Obviously, the strength of the spatial constraint plays an important role in the PEP-model learning as well as in the PEP-representation building step. With the location augmented descriptor, it is a well-recognized problem that the spatial constraint from the augmented location l can be too weak to make a difference, because in practice the dimension m_a of the appearance feature a can be considerably larger than the dimension of the location feature l, which is m_l = 2 in our experiments.
Here we argue and demonstrate that constraining each mixture component in the PEP-model to be a spherical Gaussian can help address this issue, as it establishes a balance between the spatial and appearance constraints. Taking the k-th Gaussian component P(f | ω_k, μ_k, σ_k² I) as an example, the generative probability of descriptor f is
where μ_k^a and μ_k^l are the appearance and location parts of μ_k, respectively, such that μ_k = [μ_k^{aT} μ_k^{lT}]^T.
As illustrated in
Note that if the PEP-model has regular Gaussian components, one cannot address this issue by scaling a. This can be observed by checking the equations in the EM algorithm: if a is scaled, the corresponding means and covariances will be scaled proportionally. Then the probability of f over each of the Gaussian components will be scaled in the same way. As a result, P(k|f_i) is unchanged (Equation 6), which means the EM estimates will undesirably remain the same; the scaling only rescales the mean and variance estimates. This does not help balance the influence of the appearance and the location.
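With spherical components, by contrast, the balance can be adjusted simply by multiplying the two location dimensions by a factor before the PEP-model is trained and before descriptors are selected; the factor lam below is an illustrative tuning knob, not a value used in the original work.

    import numpy as np

    def scale_locations(descriptors, lam=2.0, m_l=2):
        """Scale the trailing m_l location dimensions of [appearance | location] descriptors.

        Under spherical components, lam > 1 strengthens and lam < 1 weakens the
        spatial constraint relative to the appearance part."""
        scaled = np.array(descriptors, dtype=float, copy=True)
        scaled[:, -m_l:] *= lam
        return scaled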
In visual recognition, different kinds of multiple feature fusion techniques are widely adopted [8], [13]. In this paper, we augment our PEM/APEM by a simple multiple feature post-fusion framework to combine the effectiveness of different features using a linear SVM.
To post-fuse multiple features, we repeat the proposed pipeline over all face/face track pairs using D different types of local descriptors to obtain D confidence scores for each face/face track pair p_i as a score vector
s_i = [s_i^1, s_i^2, . . . , s_i^D], (24)
where s_i^d is the confidence score produced with the d-th type of local descriptor. The linear SVM is then trained over these score vectors to produce the final decision.
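One possible realization of this post-fusion step is sketched below, assuming the D per-feature confidence scores come from the decision values of the per-descriptor PEM classifiers; scikit-learn's LinearSVC is used for the fusion model, and C is an illustrative choice.

    import numpy as np
    from sklearn.svm import LinearSVC

    def train_fusion(score_vectors, labels):
        """Train the linear fusion SVM over per-pair score vectors s_i (cf. Equation 24).

        `score_vectors`: array of shape (num_pairs, D), one confidence score per descriptor type."""
        fusion = LinearSVC(C=1.0)
        fusion.fit(np.asarray(score_vectors), labels)
        return fusion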
Extensive experiments were performed over two challenging datasets, Labeled Faces in the Wild (LFW) [14] and the YouTube Faces Database [15].
Considering the fact that human faces are symmetric in general, we generate a horizontally flipped version of every image in the dataset. As the proposed framework can handle a face and a face track in a unified representation, a single face image under this setting is regarded as a two-frame video from symmetric viewpoints. Unlike previous work using the same technique [39], which needs to repeat the same pipeline over the four possible combinations between flipped and original faces and take the average distance as the measurement, the PEP-representation is better suited to utilizing the flipped face by simply replacing the occluded facial parts with the ones from the flipped face, as shown in
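This flipping step can be sketched in a few lines; the helper below is hypothetical and simply pairs each image with its mirror so that the pooled descriptors of both views feed one PEP-representation.

    import numpy as np

    def two_view_track(face):
        """Treat a single face image as a two-frame 'track': the image and its mirror.

        Descriptors from both views are pooled before building the PEP-representation,
        so a component whose facial part is occluded in one view can pick it from the other."""
        return [face, np.fliplr(face)]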
The Labeled Faces in the Wild (LFW) [14] dataset is designed to help address the unconstrained face verification problem. This challenging dataset contains more than 13,000 images of 5,749 people. In general there are two training paradigms over LFW, image-restricted and image-unrestricted. By design, the image-restricted paradigm does not allow experimenters to use the name of a person to infer whether two face images are matched or non-matched, while in the image-unrestricted paradigm experimenters may form as many matched or non-matched face pairs as desired for training. Over LFW, researchers are expected to explicitly state the training method they used and report performance over 10-fold cross-validation. In our experiments, we followed the most restricted protocol, in which detected faces are aligned with the funneling method [43].
To better investigate our PEM/APEM approach to pose variant face verification, we introduce a baseline algorithm that shows how well a trivial location-based feature pair matching scheme performs. The baseline algorithm provides a basis of comparison for evaluating the effectiveness of the PEP-model or adapted PEP-model. Formally, F and F′ are representations of two faces, both having N descriptors, i.e., F = {f_1 . . . f_N} and F′ = {f_1′ . . . f_N′}, where f_n and f_n′ are two spatial-appearance descriptors from the n-th local patch at the same location. Similar to Section 4.2, the difference vector between faces F and F′ is d(F, F′) = [|f_1 − f_1′|^T . . . |f_N − f_N′|^T]^T. Then we train an SVM classifier over the training difference vectors to predict whether a testing face/face track pair is matched.
In our experiments, images are center cropped to 150×150 before feature extraction. As shown in
As shown in Table 1 and
Compared to the baseline algorithm that does not use the PEP-model, the performance improvement can be attributed to the PEP-representations alleviating pose variations, as shown in
This work is a general framework which can handle both image and video based face verification without modification. Wolf et al. [15] published the YouTube Faces Dataset (YTFaces) for studying the problem of unconstrained face recognition in videos. The dataset contains 3,425 videos of 1,595 different people. On average, a face track from a video clip consists of 181.3 frames of faces. Faces are detected by the Viola-Jones detector and aligned by fixing the coordinates of automatically detected facial feature points [15]. The protocols are similar to LFW; for the same purpose, we focus on the restricted video face verification paradigm.
In the video face experiments, each image frame is center cropped to 100×100. Then descriptors are extracted for each frame in the same way as in Section 6.2.2. For each video, for efficiency, we randomly sampled 10 frames as the face track. In the joint Bayesian adaptation stage, to ease the computational burden, 10% of the descriptors are sampled randomly from each face track to be combined into the joint descriptor set.
As shown in Table 2 and
In this paper, we proposed a probabilistic elastic part model to build pose-invariant probabilistic elastic part representation, with an additional joint Bayesian adaptation component as a general framework for both image and video based face verification. Extensive experiments were performed in which PEM/APEM achieved state-of-the-art performance on two standard face verification benchmark datasets, most restricted LFW and restricted YouTube Faces dataset.
This work is supported by US National Science Foundation Grant IIS-1350763, a Google Research Faculty Award, gift grants from both Adobe Research and NEC Labs, and Stevens Institute of Technology faculty startup funds for Gang Hua.
In this disclosure, various functions and operations may be described as being performed by or caused programmatically by software code. Those skilled in the art will recognize that the functions and calculations described above result from execution of the code/instructions by a processor, such as a microprocessor. Alternatively, or in combination, the functions and operations can be implemented using special purpose circuitry, with or without software instructions, such as an Application-Specific Integrated Circuit (ASIC) or a Field-Programmable Gate Array (FPGA).
A machine readable medium can be used to store software and data which when executed by a computer, e.g., the microprocessor thereof, causes the computer to perform methods disclosed above. The executable software and data may be stored in various places including for example ROM, volatile RAM, non-volatile memory, a server, a network and/or cache. The data and instructions required to carry out the above-described methods can be obtained in their entirety prior to execution of the method steps. Alternatively, portions of the data and instructions can be obtained dynamically, just in time, when needed for execution.
The computer-readable media may store the program instructions. In general, a tangible machine readable medium includes any mechanism that provides (i.e., stores and/or transmits) information in a form accessible by a machine (e.g., a computer, network device, personal digital assistant, etc.).
The present application claims the benefit of U.S. Provisional Patent Application Ser. No. 61/932,532 filed on Jan. 28, 2014, the disclosure of which is incorporated herein by reference in its entirety.
Some of the research performed in the development of the disclosed subject matter was supported by Grant No. IIS-1350763 from the U.S. National Science Foundation. The U.S. government may have certain rights with respect to this application.