1. Technical Field
This invention is directed towards a system and method for face recognition. More particularly, this invention relates to a system and method for face recognition using synthesized training images.
2. Background Art
Face recognition systems essentially operate by comparing some type of model image of a person's face (or a representation thereof) to an image or representation of the person's face extracted from an input image. In the past, these systems, especially those that attempt to recognize a person at various face poses, required a significant number of training images to train them to recognize a particular person's face. The general approach is to use a set of sample images of the subject's face at different poses to train a recognition classifier. Thus, numerous face images at varying poses of each person to be recognized must be captured and input to train such systems. Such a significant set of sample images is often difficult, if not impossible, to obtain. Capturing sample images may be complicated by the lack of “controlled” capturing conditions, such as consistent lighting, and by the limited availability of the subject for generating the sample images. Capturing numerous training images may be practical in the case of security applications or the like, where the subject to be recognized is likely to be readily available to generate the training image set, but it may prove impractical for various consumer applications.
The system and method according to the present invention, however, allows for face recognition even in the absence of a significant amount of training data. Further, it can recognize faces at various pose angles even without actual training images exhibiting the corresponding pose. This is accomplished by synthesizing training images depicting a subject's face at a variety of poses from a small number (e.g., two) of actual images of the subject's face. The present invention thus overcomes the aforementioned limitations of prior face recognition systems with a system and method that requires the capture of only one or two images of each person to be recognized. Although the capture of two training images of the person sought to be recognized is preferred, even one training image will allow for the synthesis of numerous training images.
The system and process according to the present invention requires the input of at least one image of the face of a subject. If more than one image is input, each input should have a different pose or orientation (e.g., the images should differ in orientation by at least 15 degrees or so). Preferably two images are input—one frontal view and one profile view.
The system and process according to the present invention also employs a generic 3-D graphic face model. The generic face model is preferably a conventional polygon model that depicts the surface of the face as a series of vertices defining a “facial mesh”.
Once the actual face image(s) and the generic 3-D graphic face model have been input, an automatic deformation technique is used to create a single, specific 3-D face model of the subject from the generic model and the images. More specifically, to deform the generic face model into the specific model, an auto-fitting technique is adopted. In this technique, feature point sets are extracted from the subject's frontal and profile images. Then the generic face model is modified into the specific face model by comparing and mapping between the two groups of feature point sets. In the preferred frontal/profile embodiment of the present invention, symmetry of the face is assumed. For example, if the right-side profile is input, it is assumed the left side of the face mirrors the right side. If more than two images are used to create the specific model, it is preferred to apply the automatic deformation technique to two of the images (preferably the frontal and profile images) and the generic model to create the specific 3-D face model, and then to refine the model using the additional images. Alternately, all images could be used to create the 3-D model without the refinement step; however, this would be more time consuming and processing intensive.
A subdivision spline surface construction technique is next used to “smooth” the specific 3-D face model. Essentially, the specific 3-D face model is composed of a series of facets which are defined by the aforementioned vertices. This facet-based representation is replaced with a spline surface representation. The spline surface representation essentially provides more rounded and realistic surfaces to the previously faceted face model using Bézier patches.
Once the subdivision spline surface construction technique is used to “smooth” the specific 3-D face model, a multi-direction texture mapping technique is used to endow texture or photometric detail to the face model to create a texturized, smoothed, specific, 3-D face model. This technique adds realism to the synthetic human faces. Essentially, the input images are used to assign a color intensity to each pixel (or texel) of the 3-D face model using conventional texture mapping techniques. More particularly, for each Bézier surface patch of the face surface, a corresponding “texture patch” is determined by first mapping the boundary curve of the Bézier patch to the face image. In the preferred embodiment employing frontal and profile input images, the face image chosen to provide the texture information depends on the preferred direction of the Bézier patch. When the angle between the patch direction and the Y-Z plane is less than 30 degrees, the frontal face image is used for the mapping; otherwise, the profile image is used. In addition, facial symmetry is assumed, so the color intensities associated with the profile input image are used to texturize the opposite side of the 3-D model.
Once a 3-D face model of a specific subject is obtained, realistic individual virtual faces or 2-D face images, at various poses, can be easily synthesized using conventional computer graphics techniques (for example, using CAD/CAM model rotation). These techniques are used to create groups of training images for input into a “recognizer” to allow for training of the recognizer. Optionally, the illumination of the generated images can also be synthetically varied to produce each image under various lighting conditions. In this way, subjects can be recognized regardless of the illumination characteristics associated with an input image.
The specific features, aspects, and advantages of the present invention will become better understood with regard to the following description, appended claims and accompanying drawings.
In the following description of the preferred embodiments of the present invention, reference is made to the accompanying drawings, which form a part hereof, and in which are shown, by way of illustration, specific embodiments in which the invention may be practiced. It is understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the present invention.
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation,
The computer 110 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media discussed above and illustrated in
The computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110, although only a memory storage device 181 has been illustrated in
When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110.
The exemplary operating environment having now been discussed, the remaining parts of this description section will be devoted to a description of the program modules embodying the invention.
The system and method according to the present invention requires the capture of only one or two images of each person being recognized. The capture of two training images of the person sought to be recognized is preferred, though one training image will allow for the synthesis of numerous training images.
By way of overview, and as shown in
Thus, the system and method according to the present invention has the advantage of requiring only a small amount of actual training data to train a recognition classifier. This minimizes the cost and effort required to obtain the training data and makes such recognition systems practical for even low-cost consumer applications.
The following paragraphs discuss in greater detail the various components of the system and method according to the present invention.
1.0 Inputting Actual Face Image(s)
The system and process according to the present invention requires the input of at least one image of the face of a subject. If more than one image is input, each input should have a different pose or orientation (e.g., the images should differ in orientation by at least 15 degrees or so). Preferably two images are input—one frontal view and one profile view.
2.0 Creating a Specific 3-D Face Model
As stated previously, a deformation technique is used to align the input images with a generic 3-D graphic face model to produce a 3-D face model specific to the person depicted in the images. More particularly, once the images have been input, a generic 3-D face model is modified to adopt the specific person's characteristics according to the features extracted automatically from the person's images. To this end, human facial features are extracted from the frontal image, and then from the profile image, if available.
2.1 Extraction of Frontal Facial Features
In order to extract the facial features in the frontal face images, a deformable template is employed to extract the location and shape of the salient facial organs such as eyes, mouth and chin. Examples of templates for the eye, the mouth and the chin are illustrated in
The template matching process entails finding the minimum of a cost function. The cost function includes integral terms along four template curves.
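A plausible form for each such curve term, consistent with the definitions that follow (this normalized line integral is an assumed reconstruction), is:

$$E_{c_i}=\frac{1}{|\Gamma_i|}\int_{\Gamma_i}\Phi_c(\bar{x})\,ds,\qquad i=1,\ldots,4$$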
where $\Gamma_i$ is the template curve that describes the lip shape, $|\Gamma_i|$ is its length, and $\Phi_c(\bar{x})$ is the gray level of the image at a point $\bar{x}$ on the template. The punishment function is:
$$E_{temp}=k_{12}\left((h_1-h_2)-\overline{(h_1-h_2)}\right)^2+k_{34}\left((h_3-h_4)-\overline{(h_3-h_4)}\right)^2$$
where $k_{12}$ and $k_{34}$ are the elastic coefficients, and $\overline{(h_1-h_2)}$ and $\overline{(h_3-h_4)}$ are the average thicknesses of the lips. Combining the equations above yields the final cost function.
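One plausible form of this combination, using the weight coefficients defined in the next sentence (the exact grouping is an assumed reconstruction), is:

$$E=\sum_{i}\left(C_i\,E_{c_i}+K_i\,E_{temp,i}\right)$$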
where $C_i$ and $K_i$ are weight coefficients. Similar procedures are used to extract other facial features, as is well known in the art.
2.2 Extraction of Profile Facial Features
A prescribed number of feature points are next defined in the profile model. For example, in tested embodiments of the present invention, thirteen feature points were defined in the profile model, as shown in
2.2.1 Detection of Profile Outline
Color information is effective for image segmentation. Color has three properties: lightness, hue and saturation. Hue remains approximately constant under different lighting conditions. In YUV color space, hue is defined as the angle of the (U, V) vector. Colors cluster tightly in the hue distribution; even different images captured under varying lighting conditions have similar hue histogram shapes. A suitable hue threshold is selected from the hue histogram by a moment-based threshold-setting approach. A thresholding operation is carried out on the profile image, producing a binary image in which the white region contains the profile and the black region denotes the background and hair. The profile outline is then located by the Canny edge extraction approach.
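By way of illustration, the following is a minimal sketch of this outline-detection step, assuming OpenCV and NumPy are available. The fixed `hue_threshold_deg` argument and the simple less-than test stand in for the moment-based threshold selection described above:

```python
import numpy as np
import cv2  # OpenCV, used for color conversion and Canny edge extraction

def extract_profile_outline(bgr_image, hue_threshold_deg):
    """Threshold the profile image on hue in YUV space, then locate
    the profile outline with a Canny edge detector."""
    yuv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2YUV).astype(np.float32)
    # Center U and V around zero (OpenCV stores them offset by 128).
    u = yuv[:, :, 1] - 128.0
    v = yuv[:, :, 2] - 128.0
    # Hue as the angle of the (U, V) vector, in degrees.
    hue = np.degrees(np.arctan2(v, u)) % 360.0
    # Thresholding produces a binary image: white for the profile
    # region, black for background and hair.
    binary = np.where(hue < hue_threshold_deg, 255, 0).astype(np.uint8)
    # The profile outline is located by Canny edge extraction.
    edges = cv2.Canny(binary, 50, 150)
    return binary, edges
```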
2.2.2 Location of Profile Feature Points
Based on the observation that many feature points are turning points in the profile outline, a conventional polygonal approximation method is used to detect some feature points, as shown in
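A minimal sketch of this step, assuming OpenCV's Douglas-Peucker routine (`cv2.approxPolyDP`) as the conventional polygonal approximation, with an assumed tolerance parameter `epsilon_frac`:

```python
import numpy as np
import cv2

def profile_turning_points(outline_points, epsilon_frac=0.01):
    """Approximate the profile outline by a polygon; the polygon's
    vertices are the turning points used as candidate feature points."""
    contour = outline_points.reshape(-1, 1, 2).astype(np.float32)
    # Tolerance expressed as a fraction of the outline's arc length.
    epsilon = epsilon_frac * cv2.arcLength(contour, False)
    approx = cv2.approxPolyDP(contour, epsilon, False)
    return approx.reshape(-1, 2)
```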
2.3 Modifying the Generic 3-D Graphic Face Model
A conventional generic 3-D mesh model is used to reflect facial structure. In tested embodiments of the present invention, the whole generic 3-D mesh model used consists of 1229 vertices and 2056 triangles. In order to reflect the smoothness of the real human face, polynomial patches are used to represent the mesh model. Such a mesh model is shown in
It is necessary to adjust the general model to match the specific human face in accordance with the input human face images to produce the aforementioned specific 3-D face model. A parameter fitting process is preferably used to accomplish this task.
To adjust the whole generic human face model automatically when one or several vertices are moved, a deformable face model is preferably adopted. Two approaches could be employed. One is a conventional elastic mesh model, in which each line segment is considered to be an elastic object. In this technique, the 3-D locations of the extracted frontal and profile feature points are used to replace the corresponding feature points in the generic model. Then a set of nonlinear equations is solved for each movement to determine the proper locations of all the other points in the model.

Another method that could be used to modify the generic model is an optimizing mesh technique, in which the deformation is governed by optimization criteria. This technique is implemented as follows. Let the set $V=\{v_0,v_1,\ldots,v_n,\bar{v}_1,\bar{v}_2,\ldots,\bar{v}_m\}$ be the vertices of the 3-D mesh, where $\bar{v}_1,\bar{v}_2,\ldots,\bar{v}_m$ are fixed vertices, which do not change when other vertices are moved. Suppose that the vertex $v_0$ is moved to $v'_0$. The corresponding shifts of the other vertices $v_1,v_2,\ldots,v_n$ must then be determined. To do this, the balance status is considered to be achieved in the sense of minimizing the weighted sum of the displacements of all vertices and the length changes of all edges. Let $v'_1,v'_2,\ldots,v'_n$ be the new positions of vertices $v_1,v_2,\ldots,v_n$, let $T=(x'_1,y'_1,z'_1,x'_2,y'_2,z'_2,\ldots,x'_n,y'_n,z'_n)^T$ be the coordinate vector of $v'_1,v'_2,\ldots,v'_n$, and let $e'_1,e'_2,\ldots,e'_E$ be the edge vectors at balance, where $E$ is the number of edges in the space mesh. This can be represented as a minimization problem over $T$.
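A plausible form of this minimization, consistent with the description above (an assumed reconstruction combining the vertex-displacement and edge-length-change terms), is:

$$\min_{T}\;\;c\sum_{i=1}^{n}\left\|v'_i-v_i\right\|^2\;+\;\sum_{j=1}^{E}a_j\left(\left\|e'_j\right\|-\left\|e_j\right\|\right)^2$$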
where $c,a_1,a_2,\ldots,a_E$ are the weight coefficients. The vector $T$ can be determined by solving the minimization problem.
To reduce the complexity of the computation, some simplification can occur. A direct approach is to fix those vertices that are far from the moved vertex. The number of edges on the minimal path between two vertices is defined to be the distance between the vertices. Generally, a larger distance corresponds to a smaller effect, so a distance threshold can be defined: those vertices whose distance exceeds the threshold are treated as fixed vertices.
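A minimal sketch of this simplification, assuming the mesh connectivity is given as an adjacency map from vertex indices to neighbor indices:

```python
from collections import deque

def movable_vertices(adjacency, moved_vertex, distance_threshold):
    """Breadth-first search over the mesh graph from the moved vertex;
    distance is counted in edges along the minimal path. Vertices
    farther than `distance_threshold` are treated as fixed."""
    dist = {moved_vertex: 0}
    queue = deque([moved_vertex])
    while queue:
        v = queue.popleft()
        if dist[v] == distance_threshold:
            continue  # neighbors would exceed the threshold
        for w in adjacency[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return set(dist)  # vertices allowed to move; all others stay fixed
```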
Regardless of which of the foregoing adjustment techniques is employed, it is preferred that the deformation of the generic model be implemented on two levels: a coarse level deformation and a fine level deformation. Both deformations follow the same deformation mechanism described above. In the coarse level deformation, a set of vertices in the same relative area is moved together. This set can be defined without restriction, but generally it should consist of an organ or a part of an organ of the face (e.g., eye, nose, mouth, etc.). In the fine level deformation, a single vertex of the mesh is moved to a new position, and the facial mesh vertices are adjusted vertex by vertex. The 3-D coordinates of the vertices surrounding the moved vertex are calculated by one of the aforementioned deformation techniques.
It is noted that prior to performing the deformation process, the 3-D face meshes are scaled to match the image's size. Then the facial contour is adjusted as well as the center positions of the organs using a coarse level deformation. The fine level deformation is then used to perform local adjustment.
The two level deformation process is preferably performed iteratively, until the model matches all the extracted points of the face images of the specific subject.
3.0 Smoothing the Specific 3-D Face Model Using a Subdivision Spline Surface Construction Technique
A subdivision spline surface construction technique is next used to “smooth” the specific 3-D face model. Essentially, the specific 3-D face model is composed of a series of facets which are defined by the aforementioned vertices. This facet-based representation is replaced with a spline surface representation. The spline surface representation essentially provides more rounded and realistic surfaces to the previously faceted face model using Bézier patches.
In the construction of this subdivision spline surface representation, a radial basis function interpolation surface over the mesh is generated by the subdivision method. Generating the subdivision spline surface S can be considered a polishing procedure similar to mesh refinement. From each face with n edges, a collection of n bi-quadratic Bézier patches is constructed. A bi-quadratic Bézier patch is illustrated in
where $t_i=n_{i+1}-n_{i+2}$ is the number of vertices of face $F_i$.
Through the subdivision procedure described above, the mesh model of the face is reconstructed as a smooth spline surface, as shown in
4.0 Endowing Texture (or Photometric) Detail Using Multi-Direction Texture Mapping Techniques
Once the subdivision spline surface construction technique is used to “smooth” the specific 3-D face model, a multi-direction texture mapping technique is used to endow texture or photometric detail to the face model to create a texturized, smoothed, specific, 3-D face model. This technique adds realism to the synthetic human faces. Essentially, the input images are used to assign a color intensity to each pixel (or texel) of the 3-D face model using conventional texture mapping techniques. More particularly, for each Bézier surface patch of the face surface, a corresponding “texture patch” is determined by first mapping the boundary curve of the Bézier patch to the face image. In addition, facial symmetry is assumed, so the color intensities associated with the profile input image are used to texturize the opposite side of the 3-D model.
In the preferred embodiment employing frontal and profile input images, the face image chosen to provide the texture information depends on the preferred direction of the Bézier patch. When the angle between the patch direction and the Y-Z plane is less than 30 degrees, the frontal face image is used for the mapping; otherwise, the profile image is used.
More specifically, let $S(u,v)=\sum_{i=0}^{2}\sum_{j=0}^{2}B_i^2(u)\,B_j^2(v)\,P_{ij}$ be the bi-quadratic Bézier patch, where the $P_{ij}$ are its control points and the $B_i^2(\cdot)$ are the quadratic Bernstein polynomials. The tangent plane at a point of the patch can be represented as the span of the pair of partial-derivative vectors $\partial S/\partial u$ and $\partial S/\partial v$.
The direction of a Bézier patch can be estimated from the average taken over the points of the patch.
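One plausible formula, under the assumption that the patch direction is taken to be the surface normal averaged over the patch, is:

$$\mathbf{d}=\int_0^1\!\!\int_0^1\frac{\partial S}{\partial u}(u,v)\times\frac{\partial S}{\partial v}(u,v)\,du\,dv$$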
According to the direction of each patch, texture information is selected from frontal and profile view images of the individual human face, as shown in
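The following is a minimal sketch of this selection logic, assuming NumPy, a 3×3 grid of Bézier control points, and corner finite differences as a stand-in for the averaging formula above:

```python
import numpy as np

def texture_source_for_patch(control_points):
    """Pick the frontal or profile image as the texture source for one
    bi-quadratic Bezier patch. `control_points` is a 3x3x3 array (an
    assumed layout: a 3x3 grid of 3-D control points)."""
    P = np.asarray(control_points, dtype=float)
    normals = []
    for u, v in [(0, 0), (0, 2), (2, 0), (2, 2)]:
        # Finite-difference stand-ins for dS/du and dS/dv at the corners.
        du = P[min(u + 1, 2), v] - P[max(u - 1, 0), v]
        dv = P[u, min(v + 1, 2)] - P[u, max(v - 1, 0)]
        normals.append(np.cross(du, dv))
    d = np.mean(normals, axis=0)
    d /= np.linalg.norm(d)
    # The angle between the patch direction and the Y-Z plane equals
    # arcsin(|d_x|) for a unit direction vector d.
    angle_to_yz_plane = np.degrees(np.arcsin(abs(d[0])))
    return "frontal" if angle_to_yz_plane < 30.0 else "profile"
```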
5.0 Synthesizing Various 2-D Face Pose Images
Once a 3-D face model of a specific subject is obtained, realistic individual virtual faces or 2-D face images, at various poses, can be easily synthesized using conventional computer graphics techniques (for example, using CAD/CAM model rotation). For example, referring to
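As an illustration, the following is a minimal sketch of such a synthesis step, assuming NumPy and a simple pinhole-style projection; texture resampling and rendering are omitted:

```python
import numpy as np

def synthesize_pose(vertices, yaw_deg, pitch_deg, roll_deg, focal=500.0):
    """Rotate the specific 3-D face model and project it onto a 2-D
    image plane. `vertices` is an (N, 3) array of model points and
    `focal` is an assumed focal length."""
    yaw, pitch, roll = np.radians([yaw_deg, pitch_deg, roll_deg])
    # Elementary rotations about the Y (yaw), X (pitch) and Z (roll) axes.
    Ry = np.array([[np.cos(yaw), 0, np.sin(yaw)],
                   [0, 1, 0],
                   [-np.sin(yaw), 0, np.cos(yaw)]])
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(pitch), -np.sin(pitch)],
                   [0, np.sin(pitch), np.cos(pitch)]])
    Rz = np.array([[np.cos(roll), -np.sin(roll), 0],
                   [np.sin(roll), np.cos(roll), 0],
                   [0, 0, 1]])
    rotated = vertices @ (Rz @ Rx @ Ry).T
    # Push the model in front of the camera and project perspectively.
    z = rotated[:, 2] + 10.0 * focal
    return focal * rotated[:, :2] / z[:, None]
```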
The foregoing techniques can be used to create groups of training images for input into a “recognizer” of a face recognition system. For example, synthesized 2-D images could be used as training images for a “recognizer” like that described in the co-pending patent application entitled “Pose-Adaptive Face Recognition System and Process”. This application, which has some common inventors with the present application and the same assignee, was filed on ______ and assigned Ser. No. ______. The subject matter of this co-pending application is hereby incorporated by reference.
In tested embodiments of the present system and method employing the recognition system of the co-pending application, synthesized training image groups were generated for every in-plane rotation (clockwise/counter-clockwise) of plus or minus 10-15 degrees and every out-of-plane rotation (up and down/right and left) of plus or minus 15-20 degrees, with increments of about 3-7 degrees within a group. The resulting synthetic images for each group were used to train a component of the recognizer to identify input images corresponding to the modeled subject having a pose angle within the group.
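For illustration, a sketch of enumerating one such set of training poses, assuming 5-degree increments (within the 3-7 degree range stated above) and the full plus-or-minus extents:

```python
import itertools

in_plane = range(-15, 16, 5)      # clockwise/counter-clockwise roll angles
out_of_plane = range(-20, 21, 5)  # up-down (pitch) and left-right (yaw) angles

# Every combination of yaw, pitch and roll defines one synthesized pose.
poses = [(yaw, pitch, roll)
         for yaw, pitch in itertools.product(out_of_plane, out_of_plane)
         for roll in in_plane]
```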
While the invention has been described in detail by specific reference to preferred embodiments thereof, it is understood that variations and modifications thereof may be made without departing from the true spirit and scope of the invention. For example, synthetic images generated by the present system and process could be employed as training images for recognition systems other than the one described in the aforementioned co-pending application. Further, the synthetic images generated by the present system and process could be used for any purpose where having images of a person at various pose angles is useful.
|  | Number | Date | Country |
| --- | --- | --- | --- |
| Parent | 09728936 | Dec 2000 | US |
| Child | 11053337 | Feb 2005 | US |