(A) Field of the Invention
The present invention relates to a method of performing automatic figure segmentation on pictures of animals or other life forms, and more particularly to a method of automatically segmenting such a picture into a corresponding tri-map of foreground, background, and unknown regions.
(B) Description of the Related Art
Image segmentation and matting are active topics in computer vision and pattern recognition because of their potential applications in background substitution, general object recognition, and content-based retrieval. Unlike objects in video, static images provide no temporal correlation between consecutive frames, which leaves the problem severely under-constrained. Therefore, user interaction, such as a scribble interface, is usually required to produce a complete labeling of the pixels.
The single-image matting approach is typical of static image segmentation. This approach assumes that the intensity of each pixel (x_i, y_i) in an input image is a linear combination of a foreground color F_i and a background color B_i:
I_i = α_i F_i + (1 − α_i) B_i,
where α_i is referred to as the pixel's partial opacity value, or alpha matte. For each pixel in a color image there are three compositing equations (one per color channel) in seven unknowns: three for F_i, three for B_i, and one for α_i. Natural image matting, moreover, poses no restrictions on the background and is therefore inherently under-constrained. To resolve the ambiguity, the user is required to provide additional information in the form of a tri-map or a set of brush strokes (scribbles). Accordingly, automatic segmentation is not attainable with such methods.
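To make the counting concrete, the following minimal numpy sketch (illustrative only; the colors and names are ours, not the patent's) composites one RGB pixel and shows that a different (F, B, α) triple reproduces the same observation, which is why the problem cannot be solved without extra information such as a tri-map or scribbles:

```python
import numpy as np

def composite(alpha, fg, bg):
    """Compositing model: I_i = alpha_i * F_i + (1 - alpha_i) * B_i."""
    return alpha * fg + (1.0 - alpha) * bg

fg = np.array([0.9, 0.2, 0.1])   # assumed foreground color F_i
bg = np.array([0.1, 0.3, 0.8])   # assumed background color B_i
obs = composite(0.6, fg, bg)     # observed pixel I_i

# A different (F, B, alpha) triple explains the same observation:
# 3 equations, 7 unknowns.
assert np.allclose(obs, composite(1.0, obs, np.zeros(3)))
```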
(C) Summary of the Invention
A method for achieving segmentation of a picture according to one aspect of the present invention comprises: determining a first foreground of a picture based on a predetermined mask; applying Gaussian Mixture Models with weighted data (GMM-WD) to the first foreground to generate a second foreground; determining a first background of the picture based on the second foreground; applying the GMM-WD to the first background to generate a second background; and determining an unknown region based on the second background and the second foreground.
Another aspect of the present invention provides a method of processing image matting of a picture, comprising: identifying a foreground of the picture; identifying a background of the picture; applying Gaussian Mixture Models with weighted data (GMM-WD) to the foreground and the background to generate a second foreground and a second background; identifying an unknown region by excluding the second foreground and the second background from the picture; and performing an image matting process based on the second foreground, the second background, and the unknown region.
Another aspect of the present invention provides a system for automatic generation of a tri-map of a picture, comprising: means for identifying a first foreground of the picture based on a predetermined mask; means for identifying a first background of the picture based on the predetermined mask; means for applying Gaussian Mixture Models with weighted data (GMM-WD) to the first foreground and the first background to generate a second foreground and a second background; and means for identifying an unknown region by excluding the second foreground and the second background from the picture.
The foregoing has outlined rather broadly the features of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features of the invention will be described hereinafter, and form the subject of the claims of the invention. It should be appreciated by those skilled in the art that the conception and specific embodiment disclosed may be readily utilized as a basis for modifying or designing other structures or processes for carrying out the same purposes of the present invention. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims.
(D) Detailed Description of the Preferred Embodiments
The present invention relates to a method of performing automatic figure segmentation on pictures of animals or other life forms by segmenting the picture into a corresponding tri-map comprising foreground, background, and unknown regions. For a better understanding of the present invention, the steps and the composition are disclosed in the following discussion. Well-known techniques are not described in detail, so that unnecessary limitations on the present invention are avoided. The disclosure of the preferred embodiments is intended to help those skilled in the art understand the present invention. Accordingly, the scope of the present invention shall not be limited to the embodiments disclosed but shall be properly determined by the claims set forth below.
Eye detection 12 applies an eye detection algorithm to the determined facial region of the input image, based on a classifier and a facial mask, to detect the eyes. In one preferred embodiment, eye detection 12 is based on the method proposed by Castrillon. The classifiers and the facial mask are trained on training sets, and the results are stored in the facial database.
By comparing the stored facial mask with the input image, together with the face detection and eye detection results for the input image, the stored facial mask can be normalized or rescaled and a face region of the input image can be determined. For example, when the input image is of a human, the average eye width over the human training sets is 89.4 millimeters and the average face height is 121.3 millimeters; the eye width is therefore 89.4/121.3 of the face height. Taking the center of the detected eye rectangle as the nasion (n), the point at the junction of the nasal root and the nasofrontal suture, the face region of the input image may be automatically normalized or rescaled by the aforesaid algorithms and the location of the center of the detected eyes.
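As a rough illustration of this normalization, the sketch below estimates a face rectangle from two detected eye centers using the 89.4/121.3 ratio quoted above. The function name, the 1.2 width margin, and the vertical placement of the rectangle are assumptions for illustration, not values from the patent:

```python
# Illustrative sketch: rescale a face region from the detected eye geometry,
# assuming the training-set ratio eye_width / face_height = 89.4 / 121.3.
EYE_TO_FACE = 89.4 / 121.3

def face_region_from_eyes(eye_left, eye_right):
    """Estimate an axis-aligned face rectangle from two eye centers (x, y)."""
    cx = (eye_left[0] + eye_right[0]) / 2.0   # center of the eye rectangle (nasion x)
    cy = (eye_left[1] + eye_right[1]) / 2.0   # nasion y
    eye_width = abs(eye_right[0] - eye_left[0])
    face_height = eye_width / EYE_TO_FACE     # invert the trained ratio
    face_width = 1.2 * eye_width              # assumed margin around the eyes
    return (cx - face_width / 2, cy - 0.4 * face_height,
            face_width, face_height)          # (x, y, w, h)
```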
The present invention further provides methods for skin region segmentation 13 and hair region segmentation 14. Skin and hair generally include shadows, and the colors of the skin and hair therefore appear inconsistent. The color inconsistency may be solved by a shadow analysis and a color distortion analysis based on a preselected training set. In one aspect of the present invention, skin region segmentation 13 first selects skin regions from a training set based on the two rectangular areas below the determined eye region, as shown in the accompanying drawing, and computes for each pixel the brightness ratio

α_i = (r_i μ_r/σ_r² + g_i μ_g/σ_g² + b_i μ_b/σ_b²) / ((μ_r/σ_r)² + (μ_g/σ_g)² + (μ_b/σ_b)²),

where α_i is the current brightness with respect to the actual brightness of a particular pixel (x_i, y_i), (r_i, g_i, b_i) is the RGB color vector of the pixel, (μ_r, μ_g, μ_b) is the mean color, and (σ_r, σ_g, σ_b) is the color standard deviation of the selected training set. If α_i is smaller than 1, the area of skin or hair is shadowed; if α_i is larger than 1, the area of skin or hair is highlighted.
The skin region segmentation 13 further performs a color distortion analysis as follows:

CD_i = sqrt( ((r_i − α_i μ_r)/σ_r)² + ((g_i − α_i μ_g)/σ_g)² + ((b_i − α_i μ_b)/σ_b)² ),

where CD_i is the color distortion of the pixel (x_i, y_i). If the computed color distortion is smaller than a preset threshold and α_i falls within a predetermined range, the pixel matches the predetermined model and is classified into the corresponding region of the training set. The skin region is accordingly determined.
The hair region segmentation 14 determines the hair region using the same method as the skin region segmentation 13 disclosed above.
Some skin and hair regions may contain foreign objects, and a foreign object may be included in or excluded from the region by controlling the number of standard deviations around the mean color value within a particular region. The pixels of foreign objects may be excluded when their color distributions fall outside a specific number of standard deviations around the mean value of the training set. The skin region segmentation 13 and the hair region segmentation 14 may exclude pixels whose colors lie one, two, or three standard deviations away from the mean of the color distribution of the training sets, as shown in panels (b), (c), and (d) of the accompanying drawing.
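A minimal sketch of the shadow and color distortion test described above, assuming the conventional brightness-distortion formulation; the threshold value and the α range below are illustrative assumptions rather than the patent's trained values:

```python
import numpy as np

def classify_skin(pixels, mean, std, cd_thresh=3.0, alpha_range=(0.5, 1.5)):
    """pixels: (N, 3) RGB array; mean, std: per-channel training statistics."""
    mu = mean / std                     # normalized mean color
    z = pixels / std                    # normalized pixel colors
    # Brightness ratio alpha_i = sum(c_i * mu_c / sigma_c^2) / sum((mu_c / sigma_c)^2)
    alpha = (z * mu).sum(axis=1) / (mu * mu).sum()
    # Color distortion CD_i: distance from the brightness-scaled mean color
    cd = np.sqrt(((z - alpha[:, None] * mu) ** 2).sum(axis=1))
    in_range = (alpha > alpha_range[0]) & (alpha < alpha_range[1])
    return (cd < cd_thresh) & in_range  # True where the pixel matches the model
```

The same routine applies to hair pixels; tightening `cd_thresh` toward one standard deviation excludes foreign objects, as described above.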
Some facial regions, such as the eyes, nose, and mouth, may contain shadows that cannot be easily classified. When the main figures in the picture are humans, the skin region segmentation 13 and the hair region segmentation 14 may apply a pentagon-shaped face mask over the determined eyes, and the pixels located within the pentagon are thereby defined as skin pixels.
Hair regions may contain translucent pixels, which creates some uncertainty when the hair and skin segmentation algorithm is applied. The hair region segmentation 14 therefore assumes hair regions to be adjacent to a skin region determined by the skin region segmentation 13, which restricts the possible locations of hair regions. In addition, uncertain hair regions may be classified as unknown areas, or may be reclassified by assigning each pixel of the uncertain hair region as possibly belonging to the foreground, background, or unknown area, similar to the technique used in the process discussed below.
The body mask application process 120 determines a possible body region of a body mask based on a training process that collects the eye widths detected by eye detection 12, the position of the menton detected by face detection 11, and the corresponding body pixels for all images in the training sets. The body mask application process 120 automatically aligns and selects a body region of the input image based on the collected information, thus determining the location probabilities W(x_i, y_i) of the body region of a body mask, and may rescale the selected body region. In one embodiment of the present invention, the bottom of the body mask can be extended to the bottom of the mask if the figures in the picture include humans.
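One plausible rendering of this alignment step is sketched below: a trained location prior W(x, y) is scaled by the detected eye width and translated onto the input image grid. The prior `W_prior`, the anchoring of the eyes at the top center of the prior, and all names are assumptions for illustration:

```python
import numpy as np

def align_body_prior(W_prior, prior_eye_width, eye_width, eye_center, out_shape):
    """Scale/translate a [0,1] body-location prior onto the input image grid."""
    scale = eye_width / prior_eye_width
    H, W = out_shape
    ys, xs = np.mgrid[0:H, 0:W]
    # Map each output pixel back into prior coordinates
    # (assumes the eyes sit at the top center of the prior mask).
    px = ((xs - eye_center[0]) / scale + W_prior.shape[1] / 2).astype(int)
    py = ((ys - eye_center[1]) / scale).astype(int)
    ok = (px >= 0) & (px < W_prior.shape[1]) & (py >= 0) & (py < W_prior.shape[0])
    out = np.zeros(out_shape)
    out[ok] = W_prior[py[ok], px[ok]]
    return out  # W(x_i, y_i): probability that pixel i lies in the body region
```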
As shown in the accompanying drawing, according to Bayes' theorem, the probability of each pixel belonging to the background thus becomes:
Similarly, the probability of each pixel belonging to the body region can be expressed as follows:
As shown in the accompanying drawing, the colors of the body region and of the background are each modeled by a mixture of Gaussian density functions. A pixel ρ(r_i, g_i, b_i) may be represented through a linear combination of the density functions as:

ρ(r_i, g_i, b_i) = Σ_{j=1}^{J} P_j N(x_i; μ_j, Σ_j),

where x_i denotes the color vector (r_i, g_i, b_i) and N(·; μ_j, Σ_j) is the j-th Gaussian density with mean μ_j and covariance Σ_j.
The summation of the P_j is equal to 1. Accordingly, there are J Gaussian distributions contributing to the color distribution ρ(r_i, g_i, b_i). The formula implies that a pixel i with color (r_i, g_i, b_i) may have its color drawn from one of the J Gaussian distributions with probability P_j, j = 1, 2, . . . , J. The probability of a pixel belonging to the body region is represented by w_i = ρ(background = 0 | x_i, y_i); when w_i = 1, the pixel belongs to the body region. Each color pixel of the body mask determined above may be included in the training samples for GMM model estimation. In the training process, the pixels to be included can be determined manually, by identifying possible color distributions within a body region, or automatically, by utilizing the Expectation-Maximization (EM) algorithm, as discussed below.
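The mixture density can be evaluated directly, as the sketch below does for a single RGB color. The two modes, their weights, and covariances are placeholder values for illustration, not trained parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, weights, means, covs):
    """Evaluate rho(x) = sum_j P_j * N(x; mu_j, Sigma_j) for an RGB color x."""
    return sum(P * multivariate_normal.pdf(x, mean=mu, cov=S)
               for P, mu, S in zip(weights, means, covs))

weights = [0.6, 0.4]                          # P_j, summing to 1
means = [np.array([180., 140., 120.]),        # e.g. a skin-like mode
         np.array([60., 50., 40.])]           # e.g. a hair-like mode
covs = [np.eye(3) * 400., np.eye(3) * 900.]
print(gmm_density(np.array([170., 135., 115.]), weights, means, covs))
```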
Given the set X of the N training vectors, the typical Maximum Likelihood (ML) formulation maximizes the likelihood function Π_{i=1}^{N} p(x_i | Θ) to compute a specific estimate of the unknown parameter Θ.
As shown in the accompanying drawing, each training sample is further associated with a weight. Assuming the weight function W(ρ_i, w_i) = ρ_i w_i is adopted in the present invention, the likelihood function becomes:
The parameters are determined where the likelihood function attains its maximum, i.e., Θ* = arg max_Θ L(Θ).
If j_i denotes the mixture from which x_i is generated, the log-likelihood function becomes:
where j_i takes integer values within the interval [1, J] and the unknown parameter vector is denoted by Θ = [θ^T, P^T]^T. The EM algorithm for the parameters can thus be described by E-step 23, as illustrated below. Assuming that θ(t) is available at step t+1 of the iteration, the expected value can be computed as:
Dropping the index i from j_i and summing over all J possible values of j, an auxiliary objective function can be derived as below:
After the expectation in E-step 23 is calculated, the process proceeds to M-step 24, in which the mixture weights are normalized as below:
Q(θ^t, θ^{t+1}) is maximized by setting its first derivative to zero:
A Lagrange multiplier μ is introduced to solve the partial derivative equation, as below:
The Lagrange multiplier μ can be determined as:
Accordingly, by maximizing Q(θ^t, θ^{t+1}) and setting its first derivative to zero, we derive an iterative update scheme for the mixture weights:
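As a concrete rendering of E-step 23 and M-step 24, the following sketch implements EM for a Gaussian mixture in which each sample x_i carries a weight w_i. It uses the standard weighted-EM updates, which we assume correspond to the update scheme derived above; initialization and the regularization constant are our assumptions:

```python
import numpy as np
from scipy.stats import multivariate_normal

def weighted_em(X, w, J, iters=50, seed=0):
    """EM for a GMM where sample i carries weight w_i (GMM-WD sketch)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mu = X[rng.choice(N, J, replace=False)]          # initial means
    cov = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * J)
    P = np.full(J, 1.0 / J)                          # mixture weights P_j
    for _ in range(iters):
        # E-step 23: responsibilities gamma_ij = P(j | x_i, theta^t)
        lik = np.stack([P[j] * multivariate_normal.pdf(X, mu[j], cov[j])
                        for j in range(J)], axis=1)
        gamma = lik / (lik.sum(axis=1, keepdims=True) + 1e-12)
        g = gamma * w[:, None]                       # weight each sample by w_i
        # M-step 24: weighted updates of P_j, mu_j, Sigma_j
        Nj = g.sum(axis=0)
        P = Nj / Nj.sum()                            # normalized mixture weights
        mu = (g.T @ X) / Nj[:, None]
        for j in range(J):
            D = X - mu[j]
            cov[j] = (g[:, j, None] * D).T @ D / Nj[j] + 1e-6 * np.eye(d)
    return P, mu, cov
```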
The present invention further defines an auxiliary function to measure the contribution of a training sample x_i to mixture j, as below:
As shown in the accompanying drawing, J_body and J_background are the numbers of Gaussian mixtures for the body and background color models, respectively, and P_j^body and P_j^background are the mixture weights of the body and background GMM models, respectively.
As shown in the accompanying drawing, a discriminant value is computed for each body mixture, where λ_j^body is the sum of the eigenvalues of Σ_j^body and λ_j^background is the sum of the eigenvalues of Σ_j^background. A body mixture with less color similarity to the background mixtures obtains a larger discriminant value. The new mixture weight is defined as below:
A new body mask w′_i may be derived from the new mixture weights and the original density functions as a new hypothetical body mask. Since the new mixture weights are not optimal with respect to the parameters of the original density functions, the process returns to step 22, replaces the previously determined body mask with the new hypothetical body mask, and applies the GMM with weighted data model again to refine the background and body determinations based on the new hypothesized body mask w′_i, i.e.,
The reweighting process may be applied to any region of the picture and is not limited to the body region of the foreground or to the background. In one of the preferred embodiments, a pixel i with w_i < 0.9 may be rejected to provide a stricter criterion for comparison. In general, applying the reweighting once to all mixtures is sufficient.
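The following sketch conveys the reweighting idea: score each body mixture by its color dissimilarity to the closest background mixture, then renormalize the mixture weights. The Euclidean-over-trace distance is our assumption for illustration; the patent's exact discriminant is given by its own equations:

```python
import numpy as np

def reweight_body_mixtures(P_body, mu_body, cov_body, mu_bg, cov_bg):
    """Boost body mixtures that are dissimilar from all background mixtures."""
    J = len(P_body)
    disc = np.empty(J)
    for j in range(J):
        lam_body = np.trace(cov_body[j])        # sum of eigenvalues of Sigma_j^body
        d = [np.linalg.norm(mu_body[j] - m) ** 2 / (lam_body + np.trace(S))
             for m, S in zip(mu_bg, cov_bg)]    # distance to each background mixture
        disc[j] = min(d)                        # least similar -> larger discriminant
    new_P = P_body * disc
    return new_P / new_P.sum()                  # renormalized mixture weights
```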
As shown in the accompanying drawing, two threshold values T1 and T2 are predetermined based on the information collected from the training sets. If ρ_i > T1, pixel i belongs to the body region; if ρ_i < T2, pixel i belongs to the background region; otherwise, pixel i is assigned to the unknown region.
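A minimal sketch of this tri-map construction; the threshold values below are illustrative stand-ins for the trained T1 and T2:

```python
import numpy as np

FOREGROUND, BACKGROUND, UNKNOWN = 1, 0, 2

def make_trimap(rho, T1=0.8, T2=0.2):
    """rho: per-pixel body-region probability map in [0, 1]."""
    trimap = np.full(rho.shape, UNKNOWN, dtype=np.uint8)
    trimap[rho > T1] = FOREGROUND     # confident body pixels
    trimap[rho < T2] = BACKGROUND     # confident background pixels
    return trimap
```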
Image matting 16 then generates an alpha matte from the tri-map, based on the determined facial region, the eye region, and the closed-form matting algorithm proposed by Anat Levin, Dani Lischinski, and Yair Weiss (IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), Vol. 30, No. 2, 2008).
Image composition 17 combines each segmented part of the picture with the new background image or other pictures, as shown in the accompanying drawing.
In one embodiment of the present invention, namely the probabilistic body mask, the body regions of 36 human images are manually labeled as ground-truth training sets to train the location prior model. Human figures are detected by the face and eye detectors based on the enhanced state-of-the-art AdaBoost detectors mentioned above. Next, the prior mask is geometrically aligned and cropped to a proper size; the cropped prior mask should not be larger than the input image, as shown in the accompanying drawing.
The average proportion of the elapsed time of body/background segmentation to the total elapsed time is more than 85%. The bottleneck of the computational performance is attributed to the time-consuming GMM model parameter estimation for iterative body/background segmentation. To speed up the process, the following efficiency enhancement strategies are provided:
Strategy 1 skips the reweighting process if the body/background mask can provide a reasonable hypothesis. In other words, mixture reweighting is not necessary for every input image.
Strategy 2 downsamples the training sets, which accelerates the GMM process. When all pixels with nontrivial body/background mask values are used to estimate the parameters of the GMM, the process becomes very time-consuming; downsampling is therefore a reasonable technique to reduce the computational cost.
Strategy 3 sets the maximum number of iterations of the EM algorithm to 50. The EM algorithm may converge only to a local maximum of the observed-data likelihood function, depending on the initial values. In some cases the EM algorithm converges slowly; hence the iterative process can simply be terminated if it has not converged after a predetermined number of iterations. Other methods have been proposed to accelerate the traditional EM algorithm, such as conjugate gradient methods or modified Newton-Raphson techniques.
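Strategies 2 and 3 can be combined in front of the weighted-EM routine sketched earlier (`weighted_em` refers to that sketch). The sampling rate and the nontriviality cutoff below are illustrative assumptions:

```python
import numpy as np

def fast_gmm_wd(X, w, J, sample_rate=0.1, max_iter=50, seed=0):
    """Strategy 2 (downsample nontrivial-mask pixels) + Strategy 3 (cap EM at 50)."""
    rng = np.random.default_rng(seed)
    keep = w > 1e-3                                  # nontrivial mask values only
    idx = np.flatnonzero(keep)
    idx = rng.choice(idx, max(1, int(len(idx) * sample_rate)), replace=False)
    return weighted_em(X[idx], w[idx], J, iters=max_iter, seed=seed)
```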
The strategies mentioned above can be performed without substantial loss of precision in the picture segmentation, as demonstrated on the input image shown in the accompanying drawing.
The above-described embodiments of the present invention are intended to be illustrative only. Those skilled in the art may devise numerous alternative embodiments without departing from the scope of the following claims. Accordingly, the scope of the present invention shall not be limited to the embodiments disclosed but shall be properly determined by the claims set forth below.