The following relates to the image processing arts, image retrieval arts, image archiving arts, and related arts.
Automated tagging or classification of images is useful for diverse applications such as image archiving, image retrieval, and so forth. In a typical approach, a number of image key points or patches are selected across the image. Each key patch is quantified by a feature vector having elements quantitatively representing aspects of the key patch. These feature vectors are then used as inputs to a trained classifier that outputs a class label (or vector of class label probabilities, in the case of a soft classifier) for the image.
A problem with this global approach is that it is not well-suited to tagging images containing multiple subjects. For example, an image showing a person may be accurately tagged with the “person” class, while an image showing a dog may be accurately tagged with the “dog” class—but a single image showing a person walking a dog is less likely to be accurately classified.
A known approach for such problems is to segment the image into smaller regions, and to classify the regions separately. Since the size of the subject or subjects shown in the image is not known, the segmentation into regions may employ different scales of region size. Since the different region sizes have different numbers of pixels, it is also known to use different classifiers for the different region sizes, for example an image-scale classifier, a patch-scale classifier (operative for an image region containing a single patch), and additional “mid-scale” classifiers for various intermediate region sizes.
Such segmentation approaches have numerous deficiencies. First, there is no basis for knowing a priori which region size is best for classifying a given subject in an image. For example, in the aforementioned example of an image of a person walking a dog one might suspect that the optimal region size for classifying the dog is the region size just encompassing the dog in the image. But, if it turns out that the dog's snout is the most “characteristic” feature of the dog (for example, possibly the feature that best distinguishes images of dogs from images of cats) then the optimal region size might be the region size that just encompasses the dog's snout.
Moreover, some correlations between classifications of overlapping or contained regions of different scales might be expected. For example, the image region encompassing the dog may be (correctly) classified as “dog” while the smaller-scale regions that make up the region encompassing the dog might be (erroneously) misclassified as something other than “dog”. In some cases, a correlation may be found in which this pattern of a larger region classifying as “dog” and its constituent smaller regions not classifying as “dog” may itself be characteristic of images of dogs. Existing image region classifiers do not provide any principled way to identify and utilize such correlations between image regions of different size scales.
In some illustrative embodiments disclosed as illustrative examples herein, an image classifier comprises a digital processor configured to perform operations comprising recursively partitioning an image into a tree of image regions having the image as a tree root and at least one image patch in each leaf image region of the tree, classifying the image regions using at least one classifier, and outputting classification values for the image regions based on the classifying and on weights assigned to the nodes and edges (or node and edge features) of the tree.
In some illustrative embodiments disclosed as illustrative examples herein, an image classification method comprises: recursively partitioning an image into a tree of image regions having the image as a tree root and at least one image patch in each leaf image region of the tree, the tree having nodes defined by the image regions and edges defined by pairs of nodes connected by edges of the tree; assigning unary classification potentials to nodes of the tree; assigning pairwise classification potentials to edges of the tree; and labeling the image regions of the tree of image regions based on optimizing an objective function comprising an aggregation of the unary classification potentials and the pairwise classification potentials.
In some illustrative embodiments disclosed as illustrative examples herein, a storage medium stores instructions executable by a digital processor to perform an image classification method comprising: recursively partitioning an image into a tree of image regions having the image as a tree root and at least one image patch in each leaf image region of the tree, the tree having nodes defined by the image regions and edges defined by pairs of nodes connected by edges of the tree; assigning classification values to each node using a plurality of classifiers trained at different image size scales; assigning a unary classification potential to each node based on the classification values assigned using the plurality of classifiers; assigning a pairwise classification potential to each edge of the tree based on the classification values assigned using the plurality of classifiers trained at different image size scales and on inheritance constraints defined by a hierarchy of classes; and labeling the image regions of the tree of image regions based on optimizing an objective function comprising an aggregation of the unary classification potentials and the pairwise classification potentials.
In the following, the terms “optimization”, “minimization”, and similar phraseology are to be broadly construed as one of ordinary skill in the art would understand these terms. For example, these terms are not to be construed as being limited to the absolute global optimum value, absolute global minimum, or so forth. For example, minimization of a function may employ an iterative minimization algorithm that terminates at a stopping criterion before an absolute minimum is reached. It is also contemplated for the optimum or minimum value to be a local optimum or local minimum value.
With reference to
The computer 10 or other digital processing device includes or has access to a storage medium (not shown) that stores instructions that are executable by the digital processor to perform the image region classification operations and related process operations as disclosed herein. The storage medium may, for example, include one or more of the following: a hard disk drive or other magnetic storage medium; an optical disk or other optical storage medium; a random access memory (RAM), read-only memory (ROM), flash memory or other electronic memory medium; or so forth.
With reference to
With continuing reference to , and is made up of a set of nodes denoted
, where each node is denoted xl,(i,j) where the index l is the level of the node in the tree structure or model
and the indices (i,j) denote a grid position in the image.
as a quadtree. An edge of the tree structure connects any given node (excepting the root node) with the “parent” node that was partitioned to form the given node.
A consequence of this recursive partitioning is that a node (that is, an image region) on a level k is connected to a node on level k+1 if and only if the regions corresponding to these nodes overlap. More formally, in the quadtree model a parent node xk,(i,j) is connected to the following four child nodes: xk+1,(2i,2j), xk+1,(2i,2j+1), xk+1,(2i+1,2j), and xk+1,(2i+1,2j+1). Equivalently, a child node xk,(i,j) is connected with a single parent node xk−1,(⌊i/2⌋,⌊j/2⌋), where ⌊·⌋ denotes rounding down to the nearest integer. Other partitionings besides the illustrative quadtree 2×2 splitting can be employed; for example, each image region can be split into nine regions using a 3×3 splitting. For images that have a substantially non-square aspect ratio, an anisotropic splitting such as a 2×3 recursive splitting is contemplated. In yet other embodiments, a tree-based split that is not grid-structured is contemplated. For example, a segmentation algorithm could be applied to produce small regions of various shapes, which are gradually combined into larger regions and eventually into the entire image.
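The parent and child index relations of the quadtree can be illustrated with a minimal Python sketch. The function names and the tuple representation (level, i, j) are assumptions made for illustration only; the root is taken to be level 0 with grid position (0, 0).

```python
def children(level, i, j):
    """Return the four child nodes of node (level, i, j) under 2x2 splitting."""
    return [(level + 1, 2 * i + di, 2 * j + dj) for di in (0, 1) for dj in (0, 1)]


def parent(level, i, j):
    """Return the parent of node (level, i, j); the root (level 0) has none."""
    if level == 0:
        return None
    return (level - 1, i // 2, j // 2)


def build_quadtree(depth):
    """Enumerate all nodes and parent-child edges for levels 0..depth."""
    nodes = [(l, i, j)
             for l in range(depth + 1)
             for i in range(2 ** l)
             for j in range(2 ** l)]
    # Every node except the root contributes exactly one edge, to its parent.
    edges = [(parent(*n), n) for n in nodes if n[0] > 0]
    return nodes, edges
```

Because each non-root node has exactly one parent, the number of edges is always one less than the number of nodes, which is the tree property relied upon later for exact inference.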
Moreover, three-dimensional images can be similarly processed by splitting in three dimensions, yielding nodes designated xl,(i,j,z) where z denotes the grid position in the third dimension. For example, an octree structure or model can be formed for a three-dimensional image by using a 2×2×2 splitting in which each image region is recursively split in isotropic fashion into eight child regions.
In the tree model or structure, there are no edges between nodes at a given level (see the quadtree of
With continuing reference to
With continuing reference to
With continuing reference to . As disclosed herein, during image region classification the hierarchy of classes 30 is enforced or applied via pairwise (edge) potentials along the edges of the tree structure to ensure that nodes on higher levels of the tree structure (that is, nodes with smaller values of l) will always be assigned to a class that is at least as generic as its child nodes at lower levels (e.g., l+1) that are subsumed by the larger image region at the higher level. This is analogous to the notion of inheritance in object-oriented programming.
In the mathematical notation used herein, class inheritance is represented by the notation a<b where class a is less specific than class b. Using the illustrative hierarchy of classes shown in
The output of the processing of the training images 20 performed by processing components 22, 24, 26 is a training set of image regions at different size scales with class labels 32. In a variant embodiment of the classifier training system shown in
The image regions classifier disclosed herein employs a smooth combination of classifiers trained at different size scales in combination with enforcement of class inheritance constraints across levels of the tree structure. The inheritance constraints can be understood as imposing on the class ck+1 of an image region xk+1 and a class ck of a larger image region xk that subsumes the image region xk+1 the inheritance constraint that either ck<ck+1 or ck=ck+1. Toward this end, the classifications of the image regions defined by the tree structure are optimized together by optimizing a suitable objective function comprising an aggregation of (i) unary potentials for the nodes that embody the combination of classifiers trained at different size scales and (ii) pairwise potentials for the edges that embody the class inheritance constraints.
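The inheritance constraint (either ck<ck+1 or ck=ck+1 along every edge) can be checked as in the following sketch. The class names in `PARENT` are a purely hypothetical hierarchy chosen for illustration and are not the hierarchy of classes 30; the function names are likewise assumptions.

```python
# Hypothetical class hierarchy, stored as child -> parent.
# None marks the most generic class.
PARENT = {
    "object": None,
    "animal": "object",
    "person": "object",
    "dog": "animal",
    "cat": "animal",
}


def at_least_as_generic(a, b):
    """True if a == b or a is an ancestor of b (a < b in the text's notation)."""
    while b is not None:
        if a == b:
            return True
        b = PARENT[b]
    return False


def satisfies_inheritance(labels, edges):
    """Check the constraint on every (parent_region, child_region) edge."""
    return all(at_least_as_generic(labels[p], labels[c]) for p, c in edges)
```

For example, labeling a parent region "animal" and its child region "dog" satisfies the constraint, whereas the reverse assignment does not.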
Thus, the training of the image regions classifier entails two components: (i) generating the classifiers trained at different size scales; and (ii) training parameters of the objective function which incorporates these classifiers trained at different size scales.
With reference to
With continuing reference to
The unary potentials 52 are suitably parameterized by the classifiers 44, 46, 48 trained at different size scales weighted by node weightings 60. To begin with, a single classifier is considered. The unary potential for a region xk,(i,j) is then suitably written as:
$$E(x_{k,(i,j)}; y_{k,(i,j)}) = \left\langle \Phi^{1}(x_{k,(i,j)}; y_{k,(i,j)}),\ \theta^{\mathrm{nodes}} \right\rangle \qquad (1)$$
where yk,(i,j) is the class label to which the region xk,(i,j) is assigned by the classifier, Φ1 is a joint feature map of the nodes x and assignments y, and θnodes denotes the node weightings 60 and is suitably represented as a parameter vector parameterizing Φ1(xk,(i,j);yk,(i,j)), which represents the class probabilities generated by the classifier.
In one illustrative case, it is desired to ensure that the class label given to xk,(i,j) has a high probability according to the first-order classifier. Specifically, suppose that the classifier returns a vector of probabilities Px; then the joint feature map may suitably be written as:
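One form of the joint feature map referred to as Equation (2), consistent with the simplification E(x;y)=Px,yθy, is sketched here as an assumption, with the single nonzero entry placed in the position corresponding to the class y:

$$\Phi^{1}(x; y) = \big(0,\ \ldots,\ 0,\ P_{x,y},\ 0,\ \ldots,\ 0\big)^{\top}$$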
In words, if x is assigned the class y, then the probability given by the first-order model should be high (weighted by the node weights θynodes 60). To simplify, E(x;y)=Px,yθy. The formulation of Equation (2) expresses this as a linear function of the node weights θnodes 60. If P represents a probability, then the product of probabilities should be maximized, or equivalently, the sum of log-probabilities should be maximized. More generally, if P does not represent a probability, then either a sum or product may be maximized.
This unary potential is readily extended to parameterization by a plurality of classifiers. In this case the multiple classifiers return probabilities Px1, . . . , PxC (for instance, features based on histograms of orientations; features based on RGB statistics, or so forth). In this case, the joint feature map can be written as a concatenation of the individual feature maps defined by each classifier (and thus, it will be nonzero in exactly C locations). In the illustrated embodiment, the unary potential is parameterized by the set of classifiers 44, 46, 48 trained at different size scales. However, it is contemplated to employ parameterization by more than or fewer than the illustrated three classifiers 44, 46, 48 trained at different size scales. Still further, it is contemplated to employ parameterization by a plurality of classifiers that are different in ways other than the size scale used for training, such as for example differing by the employed classification algorithm.
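The concatenated feature map and the resulting unary potential can be sketched in Python as follows. This is a minimal illustration under stated assumptions: `probs` holds one probability vector per classifier, the class label `y` is an integer index, and the function names are hypothetical.

```python
def unary_feature_map(probs, y):
    """Concatenate one block per classifier; each block is zero except at class y.

    probs: list of C probability vectors (one per classifier), each of length K
    y:     index of the candidate class label
    """
    num_classes = len(probs[0])
    phi = []
    for p in probs:
        block = [0.0] * num_classes
        block[y] = p[y]
        phi.extend(block)
    return phi


def unary_potential(probs, y, theta_nodes):
    """Inner product of the joint feature map with the node weights."""
    phi = unary_feature_map(probs, y)
    return sum(f * w for f, w in zip(phi, theta_nodes))
```

With C classifiers and nonzero probabilities, the feature vector returned for a given label has exactly C nonzero entries, one per classifier block.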
Another generalization of the unary potential model is to learn a separate parameterization for each level of the hierarchy of classes 30. In this case, a copy of Φ1(x,y) is processed for each level of the hierarchy, and an indicator function is used to “select” the current level.
The pairwise potentials for the edge connecting image region xk (with the remainder of the index suppressed) at level k with image region xk+1 at level k+1 can be written as:
$$E(x_k, x_{k+1}; y_k, y_{k+1}) = \left\langle \Phi^{2}(x_k, x_{k+1}; y_k, y_{k+1}),\ \theta^{\mathrm{edges}} \right\rangle \qquad (3)$$
Here the joint feature map Φ2 expresses two properties: first, the constraints of the hierarchy of classes 30 should be strictly enforced; secondly, nodes assigned to the same class on different levels should have similar probabilities (again using the probabilities P(xk) and P(xk+1) returned by the classifier). To achieve these goals, the indicator function H is suitably defined as:
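A definition of H consistent with the surrounding description (strict enforcement of the hierarchy, no cost for a more specific child class, and a parameterized cost only when both labels agree) might be, as an assumption rather than a reproduction of Equation (4):

$$H(y_k, y_{k+1}) = \begin{cases} \infty & \text{if neither } y_k = y_{k+1} \text{ nor } y_k < y_{k+1}, \\ 1 & \text{if } y_k = y_{k+1}, \\ 0 & \text{if } y_k < y_{k+1}. \end{cases}$$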
This indicator function enforces the hierarchical constraints of the hierarchy of classes 30. There is no cost associated to assigning a child node to a more specific class—thus we are only parameterizing the cost when both class labels are the same. The joint feature map Φ2 incorporates the indicator function H of Equation (4) as follows:
$$\Phi^{2}(x_k, x_{k+1}; y_k, y_{k+1}) = -H(y_k, y_{k+1})\,\left| P_{x_k} - P_{x_{k+1}} \right| \qquad (5)$$
where |p| is the elementwise absolute value of p.
Like the unary potentials, the pairwise (edge) potentials are readily extended to parameterization by a plurality of classifiers by writing the joint feature map as a concatenation of the individual feature maps. The generalization of learning a separate parameterization for each level of the hierarchy of classes 30 can also be made for the pairwise potentials, for example by processing a copy of Φ2(x,y) for each level of the hierarchy, and using an indicator function to “select” the current level.
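The pairwise potential for a single classifier can be sketched in Python as below. The three-valued indicator used here (infinite for a hierarchy violation, 1 for equal labels, 0 for a more specific child label) is an assumption consistent with the description of H, and the predicate `at_least_as_generic` is a hypothetical callable supplied by the caller.

```python
import math


def hierarchy_indicator(y_parent, y_child, at_least_as_generic):
    """Assumed form of H: inf on violation, 1 if labels agree, 0 otherwise."""
    if not at_least_as_generic(y_parent, y_child):
        return math.inf
    return 1.0 if y_parent == y_child else 0.0


def pairwise_potential(p_parent, p_child, y_parent, y_child,
                       theta_edges, at_least_as_generic):
    """Inner product of -H * |P_parent - P_child| with the edge weights."""
    h = hierarchy_indicator(y_parent, y_child, at_least_as_generic)
    if h == math.inf:
        return -math.inf  # forbidden assignment under maximization
    if h == 0.0:
        return 0.0        # child is more specific: no cost
    return -sum(w * abs(a - b)
                for w, a, b in zip(theta_edges, p_parent, p_child))
```

Returning negative infinity for a violating pair, rather than multiplying infinity by a possibly zero difference, avoids an undefined product while having the same effect on the maximization.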
Having defined the unary and pairwise potentials, a suitable optimization function to be maximized (in this embodiment; other formulations may employ an optimization function to be minimized) may be written as follows:
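A form of this objective consistent with the unary and pairwise potentials of Equations (1) and (3) can be sketched as follows; the hat on y and the index sets written as nodes and edges are notational assumptions rather than symbols taken from the text:

$$\hat{y}(x; \theta) = \underset{y}{\arg\max} \left[ \sum_{n \in \mathrm{nodes}} \left\langle \Phi^{1}(x_n; y_n),\ \theta^{\mathrm{nodes}} \right\rangle + \sum_{(k,\,k+1) \in \mathrm{edges}} \left\langle \Phi^{2}(x_k, x_{k+1}; y_k, y_{k+1}),\ \theta^{\mathrm{edges}} \right\rangle \right]$$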
where θ is a concatenation of the node and edge weight feature vectors (θnodes; θedges) and y(x) is the assignment given to x under (that is, the full set of labels). As the nodes in the image partition model
form a tree structure, the energy expressed in Equation (6) can be maximized using max-sum belief propagation, for example as set forth in Aji et al., “The generalized distributive law”, IEEE Trans. on Information Theory vol. 46 no. 2, pages 325-43 (2000) which is incorporated herein by reference in its entirety. The running time for this maximization algorithm is linear in the number of nodes and quadratic in the number of classes.
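Max-sum message passing on a tree can be sketched in Python as follows. This is a generic illustration, not the cited implementation: the argument names, the dictionary-based tree representation, and the callable `pairwise(parent, child, y_parent, y_child)` are all assumptions.

```python
def max_sum_tree(root, children, unary, pairwise, num_classes):
    """Exact maximization of summed node and edge scores on a tree.

    children: dict mapping node -> list of child nodes
    unary:    dict mapping node -> list of per-class scores
    pairwise: callable(parent, child, y_parent, y_child) -> score
    """
    # Order the nodes so that every parent precedes its children.
    order = [root]
    for node in order:
        order.extend(children.get(node, []))

    score = {}  # score[n][y]: best total for the subtree of n when n takes label y
    best = {}   # best[c][y_parent]: best label of child c given the parent's label
    for node in reversed(order):  # upward pass, leaves first
        score[node] = list(unary[node])
        for child in children.get(node, []):
            best[child] = []
            for yp in range(num_classes):
                cand = [score[child][yc] + pairwise(node, child, yp, yc)
                        for yc in range(num_classes)]
                yc_best = max(range(num_classes), key=cand.__getitem__)
                best[child].append(yc_best)
                score[node][yp] += cand[yc_best]

    labels = {root: max(range(num_classes), key=score[root].__getitem__)}
    for node in order:  # downward pass, root first
        for child in children.get(node, []):
            labels[child] = best[child][labels[node]]
    return labels, score[root][labels[root]]
```

Each edge is visited once and requires a comparison over every pair of parent and child labels, which gives the cost that is linear in the number of nodes and quadratic in the number of classes.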
The loss function specifies “how bad” a given assignment is compared to the correct assignment. For computational purposes during training, it can be advantageous for the loss function to decompose as a sum over nodes and edges in the tree model. To perform the training, the labels are provided using existing datasets. One option is to assign the class “multiple” to all image regions with which multiple bounding boxes intersect, to assign “background” to all image regions with which no bounding boxes intersect, and to assign a specific class label to all other image regions. In a suitable approach, a loss function is defined which is of the form:
One suitable loss function having this form is the Hamming loss function, which takes the value 0 when a region is correctly assigned and 1 divided by the number of regions otherwise. When multiple classes are observed in a single region, such that the correct label is “multiple”, no penalty is incurred if one of these specific classes is chosen. The loss may be scaled so that each level of the graphical model makes an equal contribution, that is, so that a mistake on level k makes four times the contribution of a mistake on level k+1.
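The unscaled Hamming loss, including the exemption for regions whose correct label is “multiple”, can be sketched in Python as below. The dictionary-based inputs and the `observed` argument (the set of specific classes seen in each region) are assumptions for illustration; the per-level scaling is omitted.

```python
def hamming_loss(predicted, truth, observed=None):
    """Fraction of regions labeled incorrectly.

    predicted, truth: dicts mapping region id -> label
    observed: optional dict mapping region id -> set of specific classes seen
              in that region, consulted when the true label is "multiple"
    """
    observed = observed or {}
    errors = 0
    for region, y_true in truth.items():
        y_pred = predicted[region]
        if y_pred == y_true:
            continue
        if y_true == "multiple" and y_pred in observed.get(region, set()):
            continue  # choosing one of the observed classes is not penalized
        errors += 1
    return errors / len(truth)
```

Because the loss is a sum of per-region terms, it decomposes over the nodes of the tree model.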
Structured learning can be used, for example in a framework described in Tsochantaridis et al., “Support vector machine learning for interdependent and structured output spaces”, in Predicting Structured Data, pages 823-30 (2004), which is incorporated herein by reference in its entirety. Given a training set of N labeled examples, the goal is to solve:
The structured learning algorithm described in Tsochantaridis et al. may be used to solve Equation (8) for the node and edge weights θ=(θnodes;θedges) that minimize the sum of the empirical risk and regularization terms. In addition to solving Equation (8), it is also desired to solve:
that is, for a given value of θ, an assignment can be found which is consistent with the tree structure (Equation (6)), yet incurs a low loss. This procedure is known as column-generation, and can easily be solved in this scenario, as long as Δ() decomposes as a sum over the nodes and edges in the tree model (which is true of the Hamming loss function).
With reference to
The classifiers 44, 46, 48 as trained by the corresponding classifier trainers 34, 36, 38 (see
The output of the objective function optimization performed by the class labeler 72 is the full set of labels for the image regions including the full image (which corresponds to the image “region” at level 0) that maximizes the objective function of
includes labels for each image region of every level of the tree structure, optimized using the optimization function of Equation (6) to simultaneously satisfy the node potentials (that are indicative of how well the assigned labels match the individual image regions considered separately) and the pairwise (edge) potentials that ensure compliance with the inheritance constraints defined by a hierarchy of classes for nodes connected by edges of the tree structure. The assigned set of labels
are generally probabilistic in nature, and optionally a post-processor 76 employs thresholding, selection of the “most probable” label for each image region, a user review of the labeling performed using a graphical user interface (GUI) implemented on the computer 10, or other post-processing to assign the final labels to the image regions.
The disclosed image region classifier techniques have been applied to images from the Visual Object Classes (VOC) challenge VOC 2007 and VOC 2008 datasets. In these tests, images from the VOC 2008 dataset were used for training and validation, and images from the VOC 2007 dataset for testing. This presents a learning scenario in which there is a realistic difference between the training and test sets. The training, validation, and test sets contain 2113, 2227, and 4952 images respectively. The validation set is used to choose the optimal value of the regularization constant λ. Scale-invariant feature transform (SIFT)-like features were extracted for the model on uniform grids at 5 different scales. See Lowe, “Distinctive image features from scale-invariant keypoints”, IJCV, 60(2):91-110 (2004). The methodology of Csurka and Perronnin, “A simple high performance approach to semantic segmentation”, in British Machine Vision Conference (BMVC) (2008) (hereinafter “the BMVC reference”), based on Fisher vectors, was used to extract a signature for each region. An image key patch is considered to belong to an image region of the tree structure if (1) the center of the image key patch lies in the image region and (2) overlap of the image key patch with the image region is at least 25%. Three different first-order classifiers were used as the classifiers 44, 46, 48, based on sparse logistic regression: one classifier which has been trained to classify the entire collection of features in an image (the ‘image-level’ classifier 44); one classifier which has been trained on bounding boxes (the ‘mid-level’ classifier 46); and one classifier which has been trained on individual patches (the ‘patch-level’ classifier 48). The baseline to which the classifier of
As another comparison, the aforementioned “maximum probability selector” approach of the BMVC reference is modified by using the image prior as also defined in the BMVC reference. The effect of this modification is to reject labelings at the patch level which are inconsistent with the probability scores at the image level.
As yet another comparison, a “non-learning” approach was employed, which is implemented as the classifier of
Classification scores for the classes “background” and “multiple” were extracted automatically from the first-order scores: the probability of belonging to the background is 1 minus the highest probability of belonging to any other class; the probability of belonging to multiple classes is twice the product of the two highest probabilities, capped at 1 (so that if two classes have probabilities greater than 0.5, the resulting score will be greater than 0.5 also). Structured learning was performed using the “Bundle Methods for Risk Minimization” code as described in Teo et al., “A scalable modular convex solver for regularized risk minimization”, in KDD 2007 (Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining). This solver requires only that we specify the feature representation Φ, the loss Δ, and a column-generation procedure.
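The derivation of the “background” and “multiple” scores from the first-order scores can be sketched in Python as follows. The function name and the dictionary input are assumptions for illustration; at least two specific classes are assumed to be present.

```python
def derived_scores(class_probs):
    """Compute (background, multiple) scores from per-class probabilities.

    class_probs: dict mapping specific class name -> probability
    """
    top = sorted(class_probs.values(), reverse=True)
    background = 1.0 - top[0]                    # 1 minus the highest probability
    multiple = min(1.0, 2.0 * top[0] * top[1])   # twice the top-two product, capped at 1
    return background, multiple
```

For instance, with probabilities 0.7 and 0.6 for the two strongest classes, the “multiple” score is 0.84, which exceeds 0.5 as described.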
A performance comparison between the learning and non-learning versions of our approach, as well as the baseline is shown in Table 1, which tabulates performance of the learned classifier of
Table 2 shows the contribution to the loss made by each level of the tree structure. It is seen that the performance for the learned classifier of
With reference to |=22), with the class labels shown along the abscissa of
| classes was learned for each image level 0, 1, 2, 3, and for each type of classifier 44, 46, 48 being used (the dashed line corresponds to zero). The learned weights shown in
It will be appreciated that various of the above-disclosed and other features and functions, or alternatives thereof, may be desirably combined into many other different systems or applications. Also that various presently unforeseen or unanticipated alternatives, modifications, variations or improvements therein may be subsequently made by those skilled in the art which are also intended to be encompassed by the following claims.
Number | Name | Date | Kind |
---|---|---|---|
20080075367 | Winn et al. | Mar 2008 | A1 |
20100128984 | Lempitsky et al. | May 2010 | A1 |
Entry |
---|
Kolmogorov et al., “What Energy Functions Can Be Minimized via Graph Cuts,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, No. 2, Feb. 2004. |
Ramanan et al. “Tracking People by Learning their Appearance” IEEE Pattern Analysis and Machine Intelligence (PAMI). Jan. 2007. |
Szeliski et al., “A Comparative Study of Energy Minimization Methods for Markov Random Fields with Smoothness-Based Priors,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, No. 6, Jun. 2008. |
Lazebnik et al., “Beyond Bags of Features: Spatial Pyramid Matching for recognizing Natural Scene Categories,” Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 2, pp. 2169-2178 (2006). |
Scharstein et al., “Learning Conditional Random Fields for Stereo,” IEEE Conference on Computer Vision and Pattern Recognition, MN, pp. 1-8, Jun. 2007. |
Teo et al., “A Scalable Modular Convex Solver for Regularized Risk Minimization,” Proceedings of the 13th ACM SIGKDD Intl. Conference on Knowledge Discovery and Data Mining, CA, pp. 727-736 (2007). |
Lempitsky et al., “LogCut—Efficient Graph cut Optimization for Markov Random Fields,” IEEE 11th Intl. Conference on Computer Vision, Rio de Janeiro, pp. 1-8, Oct. 2007. |
Shi et al., “Normalized cuts and Image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, No. 8, Aug. 2000. |
Stenger et al., “Model-Based Hand Tracking Using a Hierarchical Bayesian Filter,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, No. 9, Sep. 2006. |
Li et al., Multiresolution image classification by hierarchical modeling with two-dimensional hidden Markov model IEEE Trans. Inform. Theory 46 1826-1841 (2000). |
Shotton et al., “TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context,” ISSN vol. 81, No. 1, Jan. 2009. |
Tsochantaridis et al., “Support Vector Machine Learning for Interdependent and structured Output Spaces,” Proceedings of the 21st Intl. Conference on Machine Learning, Canada (2004). |
Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” International Journal of Computer Vision (2004). |
Sivic et al., “Unsupervised Discovery of Visual Object Class Hierarchies,” IEEE Conference on Computer Vision and Pattern Recognition, AK, pp. 1-8, Jun. 2008. |
Yedidia et al., “Generalized Belief Propagation”, Advances in Neural Information Processing Systems (NIPS), vol. 13, pp. 689-695, Dec. 2000. |
Kohli et al., “On Partial Optimality in Multi-label MRFs,” Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland (2008). |
Ishikawa, “Exact optimization for Markov random fields with convex priors,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, No. 10., Oct. 2003. |
Boykov et al., “An Experimental Comparison of Min-cut/Max-Flow Algorithms for Energy Minimization in Vision,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, No. 9, Sep. 2004. |
Cour et al., “Learning Spectral Graph Segmentation,” IEEE International Conference on Artificial Intelligence and Statistics (AISTAT) 2005. |
Csurka et al., “A Simple High Performance Approach to Semantic Segmentation,” British Machine Vision Conference, Leeds, UK, Sep. 1-4, 2008. |
Everingham et al., “The PASCAL Visual Object Classes Challenge 2007 (VOC2007), Part 1—Challenge & Classification Task,” In 1st PASCAL Challenge Workshop, at http://www.pascalnetwork.org/challenges/VOC/voc2007/workshop/index.html; (2007). |
Everingham et al., “The PASCAL Visual Object Classes Challenge 2007 (VOC2007), Part 2—Detection Task,” In 1st PASCAL Challenge Workshop, at http://www.pascalnetwork.org/challenges/VOC/voc2007/workshop/index.html; (2007). |
Everingham et al., “Visual Object Classes Challenge 2008 (VOC2008), VOC Results,” at http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2008/, (2008). |
Felzenszwalb et al., “A Discriminatively Trained, Multiscale Deformable Part Model,” Computer Vision and Pattern Recognition 2008, CVPR 2008, Conference on Publication, Jun. 2008. |
Finley et al., “Training Structural SVMs when Exact Inference is Intractable,” Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland (2008). |
He et al., “Multiscale Conditional Random Fields for Image Labeling,” Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '04) (2004). |
Grauman et al., “The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features,” Tenth IEEE International Conference on Computer Vision, ICCV 2005, vol. 2, Oct. 2005. |
Aji et al., “The Generalized Distributive Law and Free Energy Minimization,” In Proceedings of the Allerton Conference on Communication, Control, and Computing (pp. 672-681). Urbana, IL: University of Illinois, 2001. |
Boykov et al., “Fast Approximate Energy Minimization via Graph Cuts,” IEEE Transactions on PAMI, vol. 23, No. 11, pp. 1222-1239 (2001). |
Number | Date | Country | |
---|---|---|---|
20110052063 A1 | Mar 2011 | US |