With the proliferation of digital photography, automatic image categorization is becoming increasingly important. Such categorization can be defined as the automatic classification of images into predefined semantic concepts or categories.
Before a learning machine can perform classification, it needs to be trained, and the training samples need to be accurately labeled. The labeling process can be both time-consuming and error-prone. Fortunately, multiple instance learning (MIL) allows for coarse labeling at the image level, instead of fine labeling at the pixel/region level, which significantly improves the efficiency of image categorization.
In the MIL framework, there are two levels of training inputs: bags and instances. A bag is composed of multiple instances. A bag (e.g., an image) is labeled positive if at least one of its instances (e.g., a region in the image) falls within the concept being sought, and it is labeled negative if all of its instances are negative. The efficiency of MIL lies in the fact that during training, a label is required only for a bag, not for the instances in the bag. In the case of image categorization, a labeled image (e.g., a “beach” scene) is a bag, and the different regions inside the image are the instances. Some of the regions are background and may not relate to “beach”, but other regions, e.g., sand and sea, do relate to “beach”. On close examination, one can see that sand and sea are not statistically independent; rather, they tend to appear together in an image of a “beach”. Such co-existence, or concurrency, can significantly boost the belief that an instance (e.g., the sand or the sea) belongs to a “beach” scene. Therefore, in this “beach” scene, there exists an order-2 concurrent relationship between the sea instance (region) and the sand instance (region). Similarly, there also exist higher-order (e.g., order-4) concurrent relationships between instances such as sand, sea, people, and sky.
Existing MIL-based image categorization procedures assume that the instances in a bag are independent and do not explore such concurrent relationships between instances. Although this independence assumption significantly simplifies modeling and computation, it ignores the hidden information encoded in the semantic linkage among instances, as described in the above “beach” example.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
The concurrent multiple instance learning technique described herein learns image categories or labels. Unlike existing MIL algorithms, in which the individual instances in a bag are assumed to be independent of each other, the technique models the inter-dependency between instances in an image. The concurrent multiple instance learning technique encodes the inter-dependency between instances (e.g., regions in an image) in order to predict a label for a future instance and, if desired, the label for an image determined from the labels of these instances. More specifically, in one embodiment, concurrent tensors are used to explicitly model the inter-dependency between instances to better capture an image's inherent semantics. In one embodiment, rank-1 tensor factorization is applied to obtain the label of each instance. Furthermore, in one embodiment, a Reproducing Kernel Hilbert Space (RKHS) is employed to extend instance label prediction to the whole feature space in order to determine the label of an image. Additionally, in one embodiment, a regularizer is introduced, which avoids overfitting and significantly improves a learning machine's generalization capability, similar to that in SVMs.
In the following description of embodiments of the disclosure, reference is made to the accompanying drawings which form a part hereof, and in which are shown, by way of illustration, specific embodiments in which the technique may be practiced. It is understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the disclosure.
The specific features, aspects, and advantages of the disclosure will become better understood with regard to the following description, appended claims, and accompanying drawings where:
In the following description of the concurrent multiple instance learning technique, reference is made to the accompanying drawings, which form a part hereof, and in which are shown, by way of illustration, examples by which the concurrent multiple instance learning technique described herein may be practiced. It is to be understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the claimed subject matter.
The following section provides an overview of the concurrent multiple instance learning technique, a brief description of MIL in general, an exemplary architecture wherein the technique can be practiced, exemplary processes employing the technique and details of various implementations of the technique.
1.1 Overview of the Technique
The concurrent multiple instance learning technique encodes the inter-dependency between instances (e.g. regions in an image) in order to predict a label for a future instance, and, if desired, the label for an image determined from the labels of these instances. The concurrent multiple instance learning technique has at least three major contributions to image and region labeling. First, the technique, in one embodiment, uses a concurrent tensor to model the semantic linkage between instances in a set of images. Based on the concurrent tensor, rank-1 supersymmetric non-negative tensor factorization (SNTF) can be applied to estimate the probability of each instance being relevant to a target category. Second, in one embodiment, the technique formulates label prediction processes in a regularization framework, which avoids overfitting, and significantly improves a learning machine's generalization capability, similar to that in Support Vector Machines (SVMs). Third, the technique, in one embodiment, uses Reproducing Kernel Hilbert Space (RKHS) to extend predicted labels to the whole feature space based on a generalized representer theorem. The technique achieves high classification accuracy on both bags (images) and instances (regions of images), is robust to different data sets, and is computationally efficient.
The concurrent multiple instance learning technique can be used in any type of video or image categorization, such as, for example, automatically assigning metadata to images. The labels can be used to index images for the purposes of image and video management (e.g., grouping). The technique can also be used to associate advertisements with a user's search strings in order to display relevant advertisements to a person searching for information on a computer network. Many other applications are also possible.
1.2 Multiple Instance Learning Background
This section provides some background information on generic multiple instance learning useful to understanding the concurrent multiple instance learning technique described herein.
1.2.1 Bag Level Multiple Instance Classification
Existing MIL-based image categorization approaches can be divided into two categories according to their classification level: bag level or instance level. The bag level line of research aims at predicting the bag label and hence does not try to gain insight into instance labels. For example, in some techniques, a standard support vector machine (SVM) is used to predict a bag label with so-called multiple instance (MI) kernels, which are designed for bags. Other bag level techniques adapt boosting to multiple instance learning; another example is Ensemble-EMDD, a multiple instance learning algorithm.
1.2.2 Instance Level Multiple Instance Classification
Other research (instance level) first attempts to infer hidden instance labels and then predicts a bag label. For example, the Diverse Density (DD) approach employs a scaling and gradient search algorithm to find prototype points in instance space with a maximal DD value. This DD-based algorithm is computationally expensive, and overfitting may occur due to the lack of a regularization term in the DD measure. Other instance level techniques adapt MIL into a boosting framework, where a noisy-or model is used to combine instance labels into bag labels. Yet other techniques extend the DD framework, seeking P(y_i = 1 | B_i = {B_i1, B_i2, . . . , B_in}), the conditional probability of the label of the ith bag being positive, given the instances in the bag. They use a Logistic Regression (LR) algorithm to estimate the corresponding probability for an instance, P(y_ij = 1 | B_ij), and then use a combination function (called softmax) to combine the P(y_ij = 1 | B_ij) in a bag to estimate P(y_i = 1 | B_i):

$P(y_i = 1 \mid B_i) = \operatorname{softmax}_\alpha(S_{i1}, S_{i2}, \ldots, S_{in}) = \frac{\sum_{j=1}^{n} S_{ij}\, e^{\alpha S_{ij}}}{\sum_{j=1}^{n} e^{\alpha S_{ij}}}$   (1)

where S_ij = P(y_ij = 1 | B_ij). The combining function encodes the multiple instance assumption in this MIL algorithm.
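For illustration, one widely used form of this softmax combining function, with a sharpness parameter α (as α grows, the combination approaches the maximum over instances, recovering the hard multiple-instance rule), can be sketched in a few lines of Python. This is only a sketch: the function name, the value of α, and the example probabilities are assumptions of this example, not taken from the original disclosure.

```python
import numpy as np

def softmax_combine(s, alpha=10.0):
    """Combine instance probabilities S_ij into a bag probability P(y_i=1|B_i).

    s     : 1-D array of S_ij = P(y_ij = 1 | B_ij) for the instances in bag B_i.
    alpha : sharpness parameter (illustrative value); larger alpha weights the
            most positive instance more heavily, approaching max(s).
    """
    w = np.exp(alpha * s)
    return float(np.sum(s * w) / np.sum(w))

# A bag is positive if at least one of its instances is likely positive:
print(softmax_combine(np.array([0.05, 0.10, 0.92])))  # approx. 0.92
```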
1.3 Exemplary Environment for Employing the Concurrent Multiple Instance Learning Technique.
1.4 Exemplary Architecture Employing the Concurrent Multiple Instance Learning Technique.
One exemplary architecture that includes a concurrent multiple instance learning module 200 (residing on a computing device 600 such as discussed later with respect to
1.5 Exemplary Processes Employing the Concurrent Multiple Instance Learning Technique.
An exemplary process employing the concurrent multiple instance learning technique is shown in
Another exemplary process employing the concurrent multiple instance learning technique is shown in
It should be noted that many alternative embodiments to the discussed embodiments are possible, and that steps and elements discussed herein may be changed, added, or eliminated, depending on the particular embodiment. These alternative embodiments include alternative steps and alternative elements that may be used, and structural changes that may be made, without departing from the scope of the disclosure.
1.6 Exemplary Embodiments and Details.
Various alternate embodiments of the concurrent multiple instance learning technique can be implemented. The following paragraphs provide details and alternate embodiments of the exemplary architecture and processes presented above. In this section, possible embodiments of the technique are discussed in detail, including the technique's ability to infer the underlying instance labels.
1.6.1 Notation
In order to understand the following detailed description of various embodiments of the technique (such as those shown, for example, in
Let B_i denote the ith bag, B_i+ a positive bag, and B_i− a negative one. One can denote the bag set as B = {B_i}, the positive bag set as B+ = {B_i+}, and the negative bag set as B− = {B_i−}. Let I denote the set of instances and n_I = |I| the number of all instances. An instance I_j ∈ I, 1 ≤ j ≤ n_I, is denoted as I_j+ when it is positive and as I_j− when negative. I_j can also be denoted as B_ij to emphasize I_j ∈ B_i, and as B_ij+ if it is in a positive bag. Here, the subscript j is a global index for instances and does not relate to a specific bag. Let p(I_j) denote the probability of I_j being a positive instance. The symbol p(I_j) is equivalent to P(y_ij = 1 | B_ij) in equation (1).
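For concreteness, this notation can be mirrored with simple data structures; the following sketch is purely illustrative (the variable names and toy sizes are assumptions of this example, not part of the original disclosure):

```python
import numpy as np

features = np.random.rand(6, 4)       # instance set I: n_I = 6 instances, 4-D features
bags     = [[0, 1, 2], [3, 4], [5]]   # B = {B_i}: each bag lists global instance indices j
labels   = [+1, -1, +1]               # bag-level labels: B_1, B_3 positive, B_2 negative

n_I = features.shape[0]               # n_I = |I|
p = np.full(n_I, 0.5)                 # p(I_j): probability of instance j being positive
```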
1.6.2 Concurrent Hypergraph Representation
In some embodiments, the concurrent multiple instance learning technique employs hypergraphs in order to determine image region categories.
Based on the concurrent hypergraph G 500, a tensor and its corresponding algebra can naturally be used as a mathematical tool to represent and learn the concurrent relationships between instances. The tensor entries are associated with the hyperedges in G 500. As will be detailed in the following sections, with the tensor representation, rank-one super-symmetric non-negative tensor factorization (SNTF) can then be applied to obtain p(y_ij = 1 | B_ij), i.e., the probability of an instance B_ij being positive. Once the instance labels are obtained, the image (e.g., bag) label can be directly computed (for example, by using the combination function shown in equation (1)).
1.6.3 Concurrent Relations in MIL
As illustrated in
The term p(I_{i1}+ ∧ I_{i2}+ ∧ . . . ∧ I_{in}+) denotes the probability that the instances I_{i1}, I_{i2}, . . . , I_{in} are simultaneously positive, i.e., their order-n concurrent relation with respect to the target concept:

$p\bigl(I_{i_1}^+ \wedge I_{i_2}^+ \wedge \cdots \wedge I_{i_n}^+\bigr)$   (2)

Typically, the logic operation “∧” in equation (2) can be estimated by “min”, so one has

$p\bigl(I_{i_1}^+ \wedge I_{i_2}^+ \wedge \cdots \wedge I_{i_n}^+\bigr) = \min\bigl\{p(I_{i_1}^+),\, p(I_{i_2}^+),\, \ldots,\, p(I_{i_n}^+)\bigr\}$   (3)

Adopting a noisy-or model, the probability that not all instances in a positive bag missed the target concept is

$p\bigl(B_i^+\bigr) = 1 - \prod_{j}\bigl(1 - p(B_{ij}^+)\bigr)$   (4)

and likewise, for a negative bag, the probability that all of its instances miss the target concept is

$p\bigl(B_i^-\bigr) = \prod_{j}\bigl(1 - p(B_{ij})\bigr)$   (5)

Concatenating equations (2), (3), (4) and (5) together expresses the order-n concurrent relation in terms of the individual instance probabilities p(I_j) and the observed bag labels. The causal probability of an individual instance on a potential target, p(I_{ij}+ | B_{ij}), is precisely the unknown quantity p(I_j) that the technique seeks to estimate. Consequently, p(I_{i1}+ ∧ I_{i2}+ ∧ . . . ∧ I_{in}+) incorporates both the bag labels and the concurrent relations among instances; it is this quantity that the concurrent tensor described in the next section encodes.
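The “min” estimate of the logic “∧” and the noisy-or bag probabilities are straightforward to compute from the instance probabilities p(I_j). The sketch below is illustrative (the function names are assumptions of this example, and the equation references follow the numbering above):

```python
import numpy as np

def concurrent_min(p_inst):
    """Order-n concurrency of a set of instances via the 'min' estimate (eq. (3))."""
    return float(np.min(p_inst))

def bag_prob_positive(p_inst):
    """Noisy-or: probability that not all instances missed the target (eq. (4))."""
    return float(1.0 - np.prod(1.0 - p_inst))

def bag_prob_negative(p_inst):
    """Probability that every instance in a negative bag misses the target (eq. (5))."""
    return float(np.prod(1.0 - p_inst))

p = np.array([0.9, 0.8, 0.7])     # p(I_j) for three co-occurring regions (sand, sea, sky)
print(concurrent_min(p))          # 0.7   -- order-3 concurrent relation
print(bag_prob_positive(p))       # 0.994 -- the containing bag is very likely positive
```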
1.6.4 Representation of Concurrent Relations as Tensors
There has been considerable interest in learning with higher-order relations in many different applications, such as model selection problems and multi-way clustering. Hypergraphs and their tensors are natural ways to represent concurrent relationships between instances (e.g., the concurrent relationships shown in
As shown in
An n-order tensor τ of dimension [d_1] × [d_2] × . . . × [d_n], indexed by n indices i_1, i_2, . . . , i_n with 1 ≤ i_j ≤ d_j, is of rank-1 if it can be expressed as the generalized outer product of n vectors:

$\tau = v_1 \otimes v_2 \otimes \cdots \otimes v_n, \quad v_i \in \mathbb{R}^{d_i}$

A tensor τ is called super-symmetric when its entries are invariant under any permutation of their indices. For such a super-symmetric tensor, its rank-1 factorization has a symmetric form: τ = v^{⊗n} = v ⊗ v ⊗ . . . ⊗ v. A direct gradient descent based approach was adopted in the present technique to factor tensors, as will be discussed in greater detail below.
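As a quick illustration of these definitions (not the patented factorization routine itself), a rank-1 super-symmetric tensor can be built with a generalized outer product in a few lines of NumPy:

```python
import numpy as np
from functools import reduce

def rank1_tensor(vectors):
    """Generalized outer product v_1 (x) v_2 (x) ... (x) v_n: a rank-1 tensor."""
    return reduce(lambda acc, v: np.multiply.outer(acc, v), vectors)

v = np.array([0.9, 0.2, 0.6])
T = rank1_tensor([v, v, v])                  # super-symmetric: T = v (x) v (x) v
assert T.shape == (3, 3, 3)
assert np.isclose(T[0, 1, 2], T[2, 0, 1])    # entries invariant under index permutation
assert np.isclose(T[0, 1, 2], 0.9 * 0.2 * 0.6)
```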
Once the concurrent relations are represented in an n-order tensor form (e.g., as shown in
Since the bag label and the concurrent relation information have been incorporated into T, this concurrent tensor is a supervised measure instead of an unsupervised affinity measure.
Given the concurrent tensor T, the technique seeks to estimate p(I_j), i.e., the probability of instance I_j being a positive instance. The desired probabilities form a nonnegative 1 × n_I vector P = [p(I_1), p(I_2), . . . , p(I_{n_I})]. It is an over-determined problem to solve for the n_I unknown variables p(I_j), 1 ≤ j ≤ n_I, and it is computationally expensive to find an optimal solution for the probability vector P if it is exhaustively searched for in the n_I-dimensional space R^{n_I}.
Alternatively, in one embodiment, the technique relaxes the non-differentiable operation “min” to a differentiable function, and a gradient search algorithm is then adopted to efficiently search for the optimal solution to P. The logic “∧” can also be estimated by a T-norm function. More specifically, the multiplication operation has been proven to be such an operator, and the “min” operator is an upper bound of the “multiplication” operator:

$p(I_{i_1}^+)\, p(I_{i_2}^+) \cdots p(I_{i_n}^+) \;\le\; \min\bigl\{p(I_{i_1}^+),\, p(I_{i_2}^+),\, \ldots,\, p(I_{i_n}^+)\bigr\}$

Therefore an alternative solution is to use “multiplication” to estimate the logic “∧”:

$T_{i_1, i_2, \ldots, i_n} = p(I_{i_1})\, p(I_{i_2}) \cdots p(I_{i_n})$   (11)

In this form, the set of n_I^n equations can be represented in a compact tensor form:

$T = P^{\otimes n} = \underbrace{P \otimes P \otimes \cdots \otimes P}_{n}$   (12)

The above equation can be translated to the fact that T is a rank-1 super-symmetric tensor, and P can be calculated given the concurrent tensor T. Equation (12) is an over-determined multi-linear system with n_I^n equations like (11). This problem can be solved by searching for an optimal solution P that approximates the tensor T in light of the least-squares criterion, and the obtained P best reflects the semantic linkage among instances represented by T.
In order to find the best solution to P, one considers the following least-squares problem:

$\min_{P \ge 0}\ C(P) = \bigl\|T - P^{\otimes n}\bigr\|_F^2$   (13)

where ‖·‖_F² is the squared Frobenius norm, defined as ‖K‖_F² = ⟨K, K⟩ = Σ_{i_1, . . . , i_n} K²_{i_1, . . . , i_n}. The most direct approach is to form a gradient descent scheme. To that end, the gradient function with respect to P is derived first. Noting that the differential commutes with the inner-product operation ⟨·,·⟩, i.e., d⟨K, K⟩ = 2⟨K, dK⟩, and using the identity d(P^{⊗n}) = (dP) ⊗ P^{⊗(n−1)} + . . . + P^{⊗(n−1)} ⊗ (dP), one has

$dC = 2\bigl\langle P^{\otimes n} - T,\ d\bigl(P^{\otimes n}\bigr) \bigr\rangle$   (14)

Then the partial derivative with respect to p_j (the jth entry of P) is:

$\frac{\partial C}{\partial p_j} = 2 \sum_{r=1}^{n} \bigl\langle P^{\otimes n} - T,\ P^{\otimes(r-1)} \otimes e_j \otimes P^{\otimes(n-r)} \bigr\rangle = 2 \sum_{r=1}^{n} \sum_{S:\, i_r = j} \bigl(P^{\otimes n} - T\bigr)_S \prod_{i_q \in S/i_r} p_{i_q}$   (15)

where e_j is the standard vector (0, 0, . . . , 1, 0, . . . , 0) with 1 in the jth coordinate, S represents an n-tuple index {i_1, . . . , i_n}, and S/i_r denotes {i_1, . . . , i_{r−1}, i_{r+1}, . . . , i_n}.
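A minimal sketch of such a direct gradient descent scheme for the order n = 3 case follows (NumPy). The step size, iteration count, and initialization are assumptions of this example; the gradient line collapses the double sum in equation (15) using the super-symmetry of the residual:

```python
import numpy as np

def factor_rank1(T, steps=3000, lr=0.01):
    """Direct gradient descent for min_P ||T - P^(x)3||_F^2 with P in [0, 1]^n_I.

    T is an order-3 super-symmetric concurrent tensor; the returned vector
    holds the estimated probabilities p(I_j) of each instance being positive.
    """
    n_I = T.shape[0]
    P = np.full(n_I, 0.5)                              # neutral initialization
    for _ in range(steps):
        R = np.einsum('i,j,k->ijk', P, P, P) - T       # residual P^(x)3 - T
        grad = 6.0 * np.einsum('ijk,j,k->i', R, P, P)  # eq. (15), collapsed by symmetry
        P = np.clip(P - lr * grad, 0.0, 1.0)           # keep probabilities in [0, 1]
    return P

# Sanity check on a noiseless rank-1 tensor built from known probabilities:
p_true = np.array([0.9, 0.1, 0.7, 0.8])
T = np.einsum('i,j,k->ijk', p_true, p_true, p_true)
print(np.round(factor_rank1(T), 2))                    # approx. [0.9, 0.1, 0.7, 0.8]
```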
With this gradient, a direct gradient descent scheme (such as the sketch above) can be applied to form an iterative algorithm that searches for the best solution P. However, this solution for P is limited to the available set of instances and does not naturally extend to the case where novel examples need to be classified. The following section describes how to extend the solution P to the whole feature space in a natural way, i.e., how to find an optimal function p(x), defined on the whole feature space, that gives the probability of an instance being positive; an optimization-based approach is employed to find the optimal p(x) in a Reproducing Kernel Hilbert Space (RKHS).
1.6.5 A Kernelization Framework
The description in this section relates to boxes 214 and 216 of
To begin, the objective cost function in problem (13) is rewritten. Given the function p(x), the probability vector P in (13) can be written as P = [p(I_1), p(I_2), . . . , p(I_{n_I})]^T. Therefore, the cost function in (13) can be rewritten as

$C\bigl(p(x), \{I_i\}_{i=1}^{n_I}\bigr) = \bigl\|T - P^{\otimes n}\bigr\|_F^2, \quad P = \bigl[p(I_1), \ldots, p(I_{n_I})\bigr]^T$   (16)

Note that, different from (13), C(p(x), {I_i}_{i=1}^{n_I}) is a functional of p(x) evaluated at the instances {I_i}. To avoid overfitting, a regularization term on the RKHS norm of p is added, yielding the regularized objective

$F\bigl(p(x), \{I_i\}_{i=1}^{n_I}\bigr) = C\bigl(p(x), \{I_i\}_{i=1}^{n_I}\bigr) + \lambda\, \|p\|_{\mathcal{H}}^2$   (17)
where λ is a parameter that trades off the two components.
Since the above objective function F(p(x), {I_i}_{i=1}^{n_I}) consists of a cost term evaluated point-wise on the instances plus an RKHS-norm regularizer, the generalized representer theorem guarantees that its minimizer admits an expansion over the instances:

$p(x) = \sum_{i=1}^{n_I} a_i\, k(x, I_i)$   (18)

where k(·,·) is a Mercer kernel associated with the RKHS ℋ.
Let K = [k(I_i, I_j)]_{n_I × n_I} denote the kernel (Gram) matrix computed over the instance features, for example with a Gaussian kernel, and let a = [a_1 a_2 . . . a_{n_I}]^T denote the coefficient vector, so that P = Ka. Substituting (18) into (17), the following optimization problem is obtained:

$\min_{a}\ \bigl\|T - (Ka)^{\otimes n}\bigr\|_F^2 + \lambda\, a^T K a$   (19)
To solve it, the gradient of F(a) with respect to a is derived:

$\nabla_a F = K^T\, \nabla_P C + 2\lambda\, K a$   (20)

where ∇_P C is the gradient of the cost function C(p(x), {I_i}_{i=1}^{n_I}) with respect to P = Ka, given entrywise by equation (15).
With this obtained gradient, an L-BFGS quasi-Newton method can be used to solve this optimization problem. L-BFGS is a standard optimization algorithm that can be used to find the optimal p(x) in equation (17): it searches the space allowed by the constraints of equation (17) along the gradient direction of equation (20), and by building up an approximation scheme through successive evaluations of that gradient, it avoids the explicit estimation of a Hessian matrix. L-BFGS has been shown to converge faster when learning the parameters a than traditional scaling algorithms. It should be noted, however, that other methods can also be used to solve this optimization problem.
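For concreteness, the sketch below solves an order n = 2 instance of problem (19) with SciPy's L-BFGS-B optimizer and a Gaussian kernel. It is a minimal illustration under stated assumptions (order 2 rather than general n, illustrative names and toy data), not the full implementation described above:

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(X, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2) over instance features."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * d2)

def fit_kernel_labels(X, T, lam=1e-3, gamma=1.0):
    """Order-2 sketch of (19): min_a ||T - (Ka)(Ka)^T||_F^2 + lam * a^T K a."""
    K = gaussian_kernel(X, gamma)

    def objective(a):
        P = K @ a                                  # eq. (18): p(I_i) = sum_j a_j k(I_i, I_j)
        R = np.outer(P, P) - T                     # residual of the rank-1 fit
        grad_P = 4.0 * (R @ P)                     # nabla_P C for the order-2 case
        grad_a = K @ grad_P + 2.0 * lam * (K @ a)  # eq. (20): chain rule through P = Ka
        return np.sum(R ** 2) + lam * (a @ K @ a), grad_a

    a0 = np.full(X.shape[0], 0.1)                  # nonzero start (a = 0 is a saddle point)
    res = minimize(objective, a0, jac=True, method='L-BFGS-B')
    return res.x, K

X = np.random.rand(5, 4)                           # five instances with 4-D features
p_true = np.array([0.9, 0.2, 0.7, 0.1, 0.8])
T = np.outer(p_true, p_true)                       # order-2 concurrent tensor
a, K = fit_kernel_labels(X, T)
print(np.round(K @ a, 2))                          # approx. p_true, up to regularization
```

A novel instance x is then labeled by p(x) = Σ_j a_j k(x, I_j), which extends the learned instance labels to the whole feature space as described above.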
The concurrent multiple instance learning technique is designed to operate in a computing environment. The following description is intended to provide a brief, general description of a suitable computing environment in which the concurrent multiple instance learning technique can be implemented. The technique is operational with numerous general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable include, but are not limited to, personal computers, server computers, hand-held or laptop devices (for example, media players, notebook computers, cellular phones, personal data assistants, voice recorders), multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
Device 600 may also contain communications connection(s) 612 that allow the device to communicate with other devices. Communications connection(s) 612 is an example of communication media. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal, thereby changing the configuration or state of the receiving device of the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. The term computer readable media as used herein includes both storage media and communication media.
Device 600 may have various input device(s) 614 such as a display, a keyboard, mouse, pen, camera, touch input device, and so on. Output device(s) 616 such as speakers, a printer, and so on may also be included. All of these devices are well known in the art and need not be discussed at length here.
The concurrent multiple instance learning technique may be described in the general context of computer-executable instructions, such as program modules, being executed by a computing device. Generally, program modules include routines, programs, objects, components, data structures, and so on, that perform particular tasks or implement particular abstract data types. The concurrent multiple instance learning technique may be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
It should also be noted that any or all of the aforementioned alternate embodiments described herein may be used in any combination desired to form additional hybrid embodiments. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. The specific features and acts described above are disclosed as example forms of implementing the claims.