This disclosure relates generally to the field of image processing. More particularly, but not by way of limitation, this disclosure relates to a technique for identifying landmark features in a digital image.
Facial landmark detection is the process of analyzing an image to identify and locate reference nodes (aka landmarks) on a face and is a critical step in computer vision that precedes many important tasks such as expression recognition, face recognition, red-eye removal, face alignment and face tracking. In general, obtaining an estimate of facial landmarks is an ill-conditioned problem due to pose, illumination, and expression variations. These factors compromise the performance of most facial landmark detection methods. It would be beneficial therefore to provide a mechanism that provides robust, high-quality landmark detection capability.
In one embodiment the inventive concept provides a method to identify the location of landmarks in a digital image. The method includes obtaining a bounding box that delimits a portion of the image (the contents of which include a first group of pixels). A candidate landmark vector may be generated for each of a second group of pixels (the second group of pixels consisting of a subset of the first group of pixels). For landmark accuracy and robustness, these candidate landmark vectors are generally highly dimensioned. In accordance with this disclosure, the dimension of the candidate landmark vectors may be reduced at run-time using a priori model data. This model data may be based on positive and negative landmark exemplars: positive exemplars corresponding to image portions known to include the target landmark, and negative exemplars corresponding to image portions known not to include the target landmark.
After dimensional reduction, model data may be applied to the candidate landmark vectors to generate positive and negative sets of likelihood values, where there is one positive and one negative likelihood value for each candidate landmark vector. Corresponding positive and negative likelihood values may be combined to generate an overall likelihood value for each pixel in the second group of pixels. That pixel (from the second group of pixels) having the largest overall likelihood value may be designated as the most likely landmark pixel. In another embodiment, each candidate landmark vector may be normalized (e.g., in accordance with the mean and variance of its constituent components' values) prior to dimensional reduction operations. A computer executable program to implement the method may be stored in any media that is readable and executable by a computer system.
This disclosure pertains to systems, methods, and computer readable media to accurately localize landmark points, such as facial features (e.g., eyes, nose, chin), in images. In general, techniques are disclosed for generating landmark models based on exemplar portions of images (“patches”) that are known to include the target landmark (forming a “positive” set of landmark vectors) and those that do not (forming a “negative” set of landmark vectors). These sets of positive and negative landmark vectors, along with other landmark statistics, form a landmark model. When an unknown image is received, candidate landmark vectors may be generated based on the image's content (or portion thereof). The landmark models then may be used to rapidly reduce the dimensionality of the candidate vectors and, in so doing, improve the speed at which landmark identification may be performed. The landmark models may then be applied to the reduced-dimensioned candidate landmark vectors to identify the most likely location of the target landmark.
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the inventive concept. As part of this description, some of this disclosure's drawings represent structures and devices in block diagram form in order to avoid obscuring the invention. In the interest of clarity, not all features of an actual implementation are described in this specification. Moreover, the language used in this disclosure has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter, resort to the claims being necessary to determine such inventive subject matter. Reference in this disclosure to “one embodiment” or to “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention, and multiple references to “one embodiment” or “an embodiment” should not be understood as necessarily all referring to the same embodiment.
It will be appreciated that in the development of any actual implementation (as in any development project), numerous decisions must be made to achieve the developers' specific goals (e.g., compliance with system- and business-related constraints), and that these goals may vary from one implementation to another. It will also be appreciated that such development efforts might be complex and time-consuming, but would nevertheless be a routine undertaking for those of ordinary skill in the design and implementation of image processing and analysis systems having the benefit of this disclosure.
Referring to
As described in more detail below, landmark detector 135 operates on region of interest 130 to identify the most likely location of the target landmark (e.g., output 140). In the following, landmark detection is described in terms of locating a facial landmark (e.g., an eye or a corner of a mouth). More specifically, the target landmark in the description below is the center of an eye. It will be recognized that the techniques disclosed herein may also be used to detect other facial landmarks (e.g., eyebrows, nose, chin, mouth). It will also be recognized that techniques in accordance with this disclosure may also be applied to detecting other types of landmarks (e.g., finger-tips, shoulder points, triangle vertices).
Referring to
In the following, an illustrative patch training operation will be described based on evaluating a set of positive landmark patches to generate positive baseline landmark data. Each of the described actions should be repeated for the corresponding negative landmark patches to generate negative baseline landmark data. The combination of positive and negative baseline landmark data constitutes a landmark model in accordance with this disclosure.
Referring to
Each patch may then be represented as a 1D vector (block 410). Table 1 illustrates one approach to converting a 2D patch into a 1D patch vector in accordance with block 410. As shown, a 2D positive landmark patch organized as a (32×32) array of pixel values may be converted into a 1D positive patch vector having 1024 component values by concatenating the patch's pixels on a row-by-row basis. While a row-by-row approach has been adopted here, patch vectors may also be generated by concatenating a patch's pixels in any way that allows the target landmark to be detected. This includes not using every pixel in a patch. For example, if a patch is large, every other or every third pixel may be used to generate the patch vector.
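The row-by-row conversion described above can be sketched as follows. This is a minimal illustration only: the function name `patch_to_vector` and its `step` parameter are hypothetical, and a small (4×4) patch stands in for the (32×32) patch of the example.

```python
def patch_to_vector(patch, step=1):
    """Concatenate a 2D patch's pixels row by row into a 1D list.

    `step` is a hypothetical parameter: it keeps every `step`-th pixel in
    each direction (step=2 uses every other pixel, step=3 every third).
    """
    return [pixel for row in patch[::step] for pixel in row[::step]]


# A tiny 4x4 stand-in for a (32x32) landmark patch.
patch = [[r * 4 + c for c in range(4)] for r in range(4)]

full = patch_to_vector(patch)        # all 16 pixels, row by row
sparse = patch_to_vector(patch, 2)   # every other pixel in each direction
```

With `step=1`, a (32×32) patch yields the 1024-component vector described in the text; larger steps trade detail for a shorter vector.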
The collection of (2500) positive patch vectors may next be adjusted (block 415). In one embodiment, patch vector space adjustment operations in accordance with block 415 may include determining a mean ($\mu$) and variance ($\sigma^2$) for each patch vector (block 420), using this information to normalize individual patch vector component values (block 425), and reducing the size or dimension of the patch vector space to a desired value (block 430). By way of example, and in accordance with block 420, patch vector $\vec{p}_1$ yields mean $\mu_1$ and variance $\sigma_1^2$, patch vector $\vec{p}_2$ yields mean $\mu_2$ and variance $\sigma_2^2$, and so on, where $\mu_i$ represents the mean of all the individual component values in patch vector $\vec{p}_i$ and $\sigma_i^2$ represents the variance of all the individual component values in patch vector $\vec{p}_i$. Letting $\vec{p}_i[j]$ represent the value of the j-th component of patch vector $\vec{p}_i$, patch vector normalization in accordance with block 425 may be accomplished as follows.
$$\vec{p}_i[j] \leftarrow \frac{\vec{p}_i[j] - \mu_i}{\sigma_i} \qquad \text{(EQ. 1)}$$
where the “arrow” (←) notation indicates that the component value on the left (arrowhead side) is replaced by a value as determined by the formula on the right. It has been found that normalization in accordance with block 425 and EQ. 1 may aid in reducing the effect of illumination and contrast differences between the different patches/patch vectors. The collection of N normalized patch vectors may be aggregated to form a matrix having dimensions (N×1024). This matrix may be used to generate a square covariance matrix ‘A’ having dimensions (1024×1024) (block 430).
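A minimal sketch of blocks 420 through 430 follows, assuming that normalization subtracts each patch vector's own mean and divides by its own standard deviation. The function names, the `eps` guard against division by zero, and the reduced sizes (50 vectors of 16 components rather than N vectors of 1024) are illustrative assumptions.

```python
import numpy as np


def normalize_patch_vectors(vectors, eps=1e-8):
    """Normalize each row (one patch vector) by its own mean and std dev."""
    vectors = np.asarray(vectors, dtype=np.float64)
    mu = vectors.mean(axis=1, keepdims=True)      # one mean per patch vector
    sigma = vectors.std(axis=1, keepdims=True)    # one std dev per patch vector
    return (vectors - mu) / (sigma + eps)         # eps avoids division by zero


def covariance_matrix(normalized):
    """Return the (D x D) covariance of an (N x D) matrix of patch vectors."""
    return np.cov(normalized, rowvar=False)       # columns are the variables


rng = np.random.default_rng(0)
patches = rng.integers(0, 256, size=(50, 16))     # 50 vectors, 16 components
normalized = normalize_patch_vectors(patches)
A = covariance_matrix(normalized)                 # (16 x 16), symmetric
```

For the illustrative (32×32) patches, `normalized` would have dimensions (N×1024) and `A` would have dimensions (1024×1024).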
To make implementing landmark detection operations in accordance with this disclosure better suited to platforms having limited computational capabilities and/or to increase the speed of landmark detection operations, the dimensionality of the patch vector space resulting from operations in accordance with block 415 may be reduced (block 435). In one embodiment this may be accomplished by performing singular value decomposition (SVD) on covariance matrix A (block 440). (The combination of forming covariance matrix A and applying SVD to it is sometimes referred to as principal component analysis, PCA.) After SVD has been performed, the first ‘n’ eigenvectors ($\vec{e}$) thereof may be selected to form a reduced dimensionality patch vector space (block 445). In the illustrative implementation using landmark patches composed of (32×32) arrays of 8 bit Y channel pixel values, and a (1024×1024) covariance matrix whose individual component values are 32 bit values, it has been found that the first 128 eigenvectors (i.e., n=128) provide sufficient information to enable robust landmark detection operations. The result of operations in accordance with block 435 is a positive patch projection matrix ‘P’ having dimensions (1024×128), each component represented by a 32 bit value. The choice of how many eigenvectors to use to form a reduced dimensionality vector space (embodied in positive patch projection matrix P) is a matter of design choice. For example, fewer eigenvectors may be selected (resulting in a smaller dimensioned projection matrix) if less accurate or robust landmark detection is tolerable. Similarly, more eigenvectors may be selected (resulting in a larger projection matrix) if more accurate or robust landmark detection is needed.
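The dimensionality reduction described above can be sketched with NumPy's `np.linalg.svd`. This is an illustration under stated assumptions: the function name `projection_matrix` is hypothetical, random data stands in for real patch vectors, and 16 components reduced to n=4 stand in for 1024 components reduced to n=128.

```python
import numpy as np


def projection_matrix(A, n):
    """Return a (D x n) matrix whose columns are the n leading eigenvectors
    of the symmetric covariance matrix A (D x D), obtained via SVD."""
    # For a symmetric positive semi-definite matrix, the left singular
    # vectors are its eigenvectors, ordered by decreasing singular value.
    U, _, _ = np.linalg.svd(A)
    return U[:, :n]


rng = np.random.default_rng(0)
samples = rng.normal(size=(200, 16))      # stand-in for normalized patch vectors
A = np.cov(samples, rowvar=False)         # (16 x 16) covariance matrix
P = projection_matrix(A, n=4)             # (16 x 4) projection matrix
reduced = samples @ P                     # each vector now has 4 components
```

Choosing a larger `n` keeps more of the variance at the cost of a larger projection matrix, which mirrors the design trade-off noted in the text.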
By viewing positive patch projection matrix P as a Gaussian distributed vector space, mean and variance values ($\mu$ and $\sigma^2$) for each of the 128 (1024×1) vectors in positive patch projection matrix P may be found (block 450), the collections of which may be represented in vector notation as:
The aggregate of positive patch projection matrix P and the above model statistics form a positive patch baseline dataset.
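One way to hold a baseline dataset in code is sketched below. The `PatchBaseline` container and `build_baseline` helper are hypothetical names, and the statistics are taken per column of P, following the description of block 450; a small (4×3) matrix stands in for the (1024×128) matrix of the example.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class PatchBaseline:
    """Hypothetical container for one (positive or negative) baseline dataset."""
    projection: np.ndarray   # (D x n) patch projection matrix P
    mean: np.ndarray         # n means, one per column vector of P
    variance: np.ndarray     # n variances, one per column vector of P


def build_baseline(P):
    """Bundle a projection matrix with its per-vector mean and variance."""
    P = np.asarray(P, dtype=np.float64)
    return PatchBaseline(P, P.mean(axis=0), P.var(axis=0))


P = np.arange(12.0).reshape(4, 3)   # tiny stand-in for a (1024 x 128) matrix
baseline = build_baseline(P)
```

A landmark model would then pair two such objects, one built from positive patches and one from negative patches.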
As noted above, each of the above actions should be repeated for a corresponding set of ‘M’ negative landmark patches to generate a negative patch baseline dataset. In general, it may be useful to use the same number of positive and negative patches (i.e., N=M), but this is not necessary. For example, if the collection of positive landmark patches is less “accurate” in identifying or characterizing the target landmark than the corresponding collection of negative landmark patches is in identifying the absence of the target landmark, it may be useful to use more positive patches to form the positive patch baseline dataset than are used to form the negative patch baseline dataset. The combination of positive and negative baseline landmark datasets constitutes a landmark model in accordance with this disclosure. Illustrative landmark training phase 400 is summarized in Table 2.
Referring to
Actions in accordance with one embodiment of block 510 may be understood by referring to
Once ROI 605 has been obtained and area 610 selected, evaluation window 615 may be centered about selected pixels within area 610 and a candidate landmark vector generated for each such selected pixel. Referring to
Continuing the numeric example begun above, if ROI 605 measures (128×128) pixels, area 610 (64×64) pixels, evaluation window 615 (32×32) pixels, and if evaluation window 615 is centered about every pixel within area 610 that is possible while keeping evaluation window 615 wholly within area 610, 1024 candidate landmark vectors may be generated. The collection of all such vectors may be represented as matrix C as follows:
where ‘d’ and ‘g’ represent individual component values within the identified candidate landmark vectors.
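Candidate vector generation can be sketched as a sliding window over the selected area. The function name `candidate_vectors` and the convention of counting every offset at which the window fits are assumptions: under this convention an area of side a and a window of side w give (a − w + 1) placements per axis, and other centering conventions for an even-sized window give slightly different counts (the 1024 vectors of the example correspond to 32 placements per axis).

```python
def candidate_vectors(area, window):
    """Return (centers, vectors) for every placement of a window x window
    evaluation window that lies wholly inside the 2D list `area`."""
    rows, cols = len(area), len(area[0])
    centers, vectors = [], []
    for top in range(rows - window + 1):
        for left in range(cols - window + 1):
            # Concatenate the window's pixels row by row, as in training.
            vec = [area[r][c]
                   for r in range(top, top + window)
                   for c in range(left, left + window)]
            centers.append((top + window // 2, left + window // 2))
            vectors.append(vec)
    return centers, vectors


# A 6x6 area with a 4x4 window stands in for the (64x64)/(32x32) example.
area = [[r * 6 + c for c in range(6)] for r in range(6)]
centers, C = candidate_vectors(area, window=4)
```

Each row of `C` is one candidate landmark vector, and `centers` records which pixel that vector belongs to.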
Referring again to
[Ppos] represents the positive patch vector projection matrix and [Pneg] the negative patch vector projection matrix (see Table 2). At this point, there are two candidate landmark vector sets: positive candidate landmark vector set [X] having 1024 vectors (each having 128 components); and a negative candidate landmark vector set [Z] having 1024 vectors (each of which also has 128 components).
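The dimensions given above are consistent with a plain matrix product: a (1024×1024) candidate matrix times a (1024×128) projection matrix yields 1024 vectors of 128 components each. The sketch below assumes that form; the function name is hypothetical, and random orthonormal matrices stand in for real projection matrices at reduced size.

```python
import numpy as np


def project_candidates(C, P_pos, P_neg):
    """Project every candidate vector (row of C) with both projection matrices."""
    C = np.asarray(C, dtype=np.float64)
    X = C @ P_pos   # positive candidate landmark vector set
    Z = C @ P_neg   # negative candidate landmark vector set
    return X, Z


rng = np.random.default_rng(0)
C = rng.normal(size=(9, 16))                        # 9 candidates, 16 components
P_pos = np.linalg.qr(rng.normal(size=(16, 4)))[0]   # stand-in (16 x 4) matrices
P_neg = np.linalg.qr(rng.normal(size=(16, 4)))[0]
X, Z = project_candidates(C, P_pos, P_neg)          # each (9 x 4)
```

Because the projection matrices are computed ahead of time, the run-time cost of this step is two matrix multiplications.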
A likelihood value for each vector in positive and negative candidate landmark vector sets may now be determined (block 525). Likelihood values for positive candidate landmark vectors represent the likelihood that the corresponding positive candidate landmark vector is the target landmark (e.g., the center of an eye). Likelihood values for negative candidate landmark vectors represent the likelihood that the corresponding negative candidate landmark vector is not the target landmark.
In one embodiment, the likelihood that pixel ‘i’ in evaluation window 615 (represented by positive candidate landmark vector $\vec{x}_i$ in [X]) is the target landmark may be given by:
where $x_j$ represents the j-th component value in $\vec{x}_i$, and $\mu_i$ and $\sigma_i^2$ represent the mean and variance values for the positive patch projection matrix's i-th vector, $\vec{x}_i$ (see Table 2). In like fashion, the likelihood that pixel ‘i’ is not the target landmark may be given by:
where $z_j$ represents the j-th component value in $\vec{z}_i$ (that vector in negative candidate landmark vector set [Z] corresponding to pixel ‘i’), and $m_i$ and $s_i^2$ represent the mean and variance values for the negative patch projection matrix's i-th vector, $\vec{z}_i$ (see Table 2).
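As a rough illustration of a Gaussian likelihood computation of the kind described above, the sketch below treats each component of a reduced candidate vector as an independent Gaussian with the model's mean and variance. This product-of-densities form is an assumption made for illustration, not a restatement of EQS. 2 and 3, and the function name is hypothetical.

```python
import math


def gaussian_likelihood(x, mean, variance):
    """Product of per-component Gaussian densities (independence assumed)."""
    likelihood = 1.0
    for xj, mu_j, var_j in zip(x, mean, variance):
        density = (math.exp(-((xj - mu_j) ** 2) / (2.0 * var_j))
                   / math.sqrt(2.0 * math.pi * var_j))
        likelihood *= density
    return likelihood


# Two-component toy model with zero mean and unit variance.
at_mean = gaussian_likelihood([0.0, 0.0], [0.0, 0.0], [1.0, 1.0])
off_mean = gaussian_likelihood([1.0, 2.0], [0.0, 0.0], [1.0, 1.0])
```

For vectors with many components, the product can underflow; summing log-densities is a common alternative when only a ranking of candidates is needed.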
The overall likelihood that the i-th pixel in evaluation window 615 is the target landmark may now be given as the sum of the corresponding positive and negative likelihood values as expressed in EQS. 2 and 3 (block 530):
$P_o(i) = P(i) + \bar{P}(i)$, where $\bar{P}(i)$ here denotes the negative likelihood value of EQ. 3.
While positive and negative likelihood values P(i) and
Once an overall likelihood value for each corresponding pair of candidate landmark vectors has been found, that pixel corresponding to the highest overall likelihood value may be selected as the most likely candidate location (block 535).
In one embodiment, landmark detection application phase 500 only identifies that pixel having the largest overall likelihood value. In another embodiment, both the most likely pixel and the corresponding overall likelihood value may be identified. In still another embodiment, if the most likely pixel's overall likelihood value (e.g., selected in accordance with block 535) is less than a specified threshold, an indication that the target landmark was not found may be returned by operation 500.
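The selection step, including the optional threshold, can be sketched as follows. The function name `most_likely_landmark`, the use of `None` to signal "not found", and the toy likelihood values are illustrative assumptions.

```python
def most_likely_landmark(pixels, positive, negative, threshold=None):
    """Combine per-pixel likelihoods and pick the best candidate.

    Returns (pixel, overall_likelihood) for the highest-scoring pixel, or
    None if a threshold is given and the best score falls below it.
    """
    # Overall likelihood is the sum of positive and negative values.
    overall = [p + n for p, n in zip(positive, negative)]
    best = max(range(len(overall)), key=overall.__getitem__)
    if threshold is not None and overall[best] < threshold:
        return None   # target landmark not found
    return pixels[best], overall[best]


pixels = [(0, 0), (0, 1), (1, 0)]
positive = [0.125, 0.5, 0.25]
negative = [0.25, 0.25, 0.25]

found = most_likely_landmark(pixels, positive, negative)
missed = most_likely_landmark(pixels, positive, negative, threshold=0.9)
```

Returning both the pixel and its score covers the second embodiment; a caller interested only in the location can ignore the score.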
The above discussion is presented in terms of locating a single landmark. Referring to
Referring to
Processor 805 may execute instructions necessary to carry out or control the operation of many functions performed by device 800 (e.g., the processing of images in accordance with operations 500 and 700). Processor 805 may, for instance, drive display 810 and receive user input from user interface 815. User interface 815 can take a variety of forms, such as a button, keypad, dial, a click wheel, keyboard, display screen and/or a touch screen. Processor 805 may be a system-on-chip such as those found in mobile devices and include a dedicated graphics processing unit (GPU). Processor 805 may be based on reduced instruction-set computer (RISC) or complex instruction-set computer (CISC) architectures or any other suitable architecture and may include one or more processing cores. Graphics hardware 820 may be special purpose computational hardware for processing graphics and/or assisting processor 805 in processing graphics information. In one embodiment, graphics hardware 820 may include a programmable graphics processing unit (GPU).
Sensor and camera circuitry 850 may capture still and video images that may be processed to generate images in accordance with this disclosure and may, for example, incorporate image processing pipeline 100. Output from camera circuitry 850 may be processed, at least in part, by video codec(s) 855 and/or processor 805 and/or graphics hardware 820, and/or a dedicated image processing unit incorporated within circuitry 850. Images so captured may be stored in memory 860 and/or storage 865. Memory 860 may include one or more different types of media used by processor 805, graphics hardware 820, and image capture circuitry 850 to perform device functions. For example, memory 860 may include memory cache, read-only memory (ROM), and/or random access memory (RAM). Storage 865 may store media (e.g., audio, image and video files), computer program instructions or software, preference information, device profile information, and any other suitable data. Storage 865 may include one or more non-transitory storage media including, for example, magnetic disks (fixed, floppy, and removable) and tape, optical media such as CD-ROMs and digital video disks (DVDs), and semiconductor memory devices such as Electrically Programmable Read-Only Memory (EPROM) and Electrically Erasable Programmable Read-Only Memory (EEPROM). Memory 860 and storage 865 may be used to retain computer program instructions or code organized into one or more modules and written in any desired computer programming language. When executed by, for example, processor 805, such computer program code may implement one or more of the methods described herein.
It is to be understood that the above description is intended to be illustrative, and not restrictive. The material has been presented to enable any person skilled in the art to make and use the invention as claimed and is provided in the context of particular embodiments, variations of which will be readily apparent to those skilled in the art (e.g., some of the disclosed embodiments may be used in combination with each other). The disclosed landmark detection operations perform a run-time reduction of the landmark vector space's dimensionality, whereas prior art approaches do not. As a result, the run-time complexity for landmark detection may be reduced from XXX (prior art) to XXX: where ‘N’ represents the number of candidate landmark vectors, ‘P’ the size of the candidate landmark vectors, and ‘K’ the number of eigenvectors used for the projection operation. The scope of the invention therefore should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled. In the appended claims, the terms “including” and “in which” are used as the plain-English equivalents of the respective terms “comprising” and “wherein.”
Number | Date | Country | |
---|---|---|---|
20140044359 A1 | Feb 2014 | US |