1. Technical Field
This description generally relates to visual analysis of images.
2. Background
In the field of image analysis, images are often analyzed based on visual features, such as shapes, colors, and textures. These features can be detected in an image, and the content of the image can be inferred from the detected features.
In one embodiment a method for creating a visual vocabulary comprises extracting a plurality of descriptors from one or more labeled images; clustering the descriptors into augmented-space clusters in an augmented space, wherein the augmented space includes visual similarities and label similarities; generating a descriptor-space cluster in a descriptor space based on the augmented-space clusters, wherein one or more augmented-space clusters are associated with the descriptor-space cluster; and generating, based on the augmented-space clusters, augmented-space classifiers for the augmented-space clusters that are associated with the descriptor-space cluster.
In one embodiment a device for generating a visual vocabulary comprises one or more computer-readable media configured to store labeled images, and one or more processors that are coupled to the one or more computer-readable media and that are configured to cause the device to extract descriptors from one or more labeled images, wherein the labels include semantic information, and wherein the extracted descriptors include visual information; augment the descriptors with the semantic information from the labels; generate clusters of descriptors in an augmented space based on the semantic information and the visual information of the descriptors; generate a respective augmented-space classifier for each of the clusters of descriptors in the augmented space; generate clusters of descriptors in a descriptor space based on the clusters of descriptors in the augmented space, wherein two or more clusters of descriptors in the augmented space are associated with a corresponding cluster of descriptors in the descriptor space; and associate the two or more augmented-space classifiers for the two or more clusters of descriptors in the augmented space with the corresponding cluster of descriptors in the descriptor space.
In one embodiment a method for encoding a descriptor comprises obtaining a descriptor; mapping the descriptor to a descriptor-space cluster in a descriptor space; applying a plurality of augmented-space classifiers that are associated with the descriptor-space cluster to the descriptor to generate respective augmented-space-classification scores; and generating a descriptor representation that includes the augmented-space-classification scores.
The following disclosure describes certain explanatory embodiments. Other embodiments may include alternatives, equivalents, and modifications. Additionally, the explanatory embodiments may include several novel features, and a particular feature may not be essential to some embodiments of the devices, systems, and methods described herein.
Descriptors are extracted from one or more labeled images 111 by a descriptor-extraction module 100. The descriptors are initially defined in a descriptor space 101. The descriptor space 101 is a vector space that is defined by the basis vectors of the native attributes of the descriptors.
Modules (e.g., the descriptor-extraction module 100, an augmentation module 110, a classifier-training module 120) include logic, computer-readable data, or computer-executable instructions, and may be implemented in software (e.g., Assembly, C, C++, C#, Java, BASIC, Perl, Visual Basic), hardware (e.g., customized circuitry), or a combination of software and hardware. In some embodiments, the system includes additional or fewer modules, the modules are combined into fewer modules, or the modules are divided into more modules. Though the computing device or computing devices that execute a module perform the operations, for purposes of description a module may be described as performing one or more operations.
The descriptors and the labels 112 from the images 111 are obtained by an augmentation module 110, which maps the descriptors from the descriptor space 101 to an augmented space 102 (e.g., a topological space) based on the semantic information in the labels 112. For example, a label 112 of an image 111 (or the label 112 of a region of an image 111) may be associated with all of the descriptors that were extracted from the image 111. Therefore, if a first image is associated with the label “dog,” all of the descriptors that were extracted from the first image may be associated with the label “dog.” Additionally, each descriptor may be associated with one or more labels. Also, a distance between the labels is defined, for example according to an ontology. For example, “cat” and “dog” may be closer than “cat” and “truck.”
Sometimes semantic labels are assigned to one or more specific regions of an image (e.g., a label is assigned to some regions and not to other regions). Thus, an image may include labels that are generally assigned to the whole image and labels that are assigned to one or more regions in the image. The regions may be disjoint or overlapping. In some embodiments, if a descriptor extracted from an image is in one or more labeled regions of the image, then the descriptor is associated with the one or more labels that are assigned to those regions. Otherwise, if the descriptor is not from a labeled region but the image has one or more generally assigned labels, then the descriptor is associated with the one or more generally assigned labels. In some embodiments, the descriptor is associated with all of the labels, if any, of the regions that include the descriptor and with all of the generally assigned labels, if any, of the image from which the descriptor was extracted. Other embodiments may use other techniques to associate labels with descriptors.
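The association rules above can be sketched in a few lines of code. This is a minimal illustration, not part of the described embodiments; the rectangular-region representation, the function name, and the example labels are assumptions.

```python
def labels_for_descriptor(point, region_labels, image_labels):
    """Associate a descriptor with semantic labels. `region_labels`
    maps assumed rectangular regions (x0, y0, x1, y1) to a label.
    A descriptor inside one or more labeled regions takes those
    regions' labels; otherwise it falls back to the image's
    generally assigned labels."""
    x, y = point
    hits = [label for (x0, y0, x1, y1), label in region_labels.items()
            if x0 <= x <= x1 and y0 <= y <= y1]
    return hits if hits else list(image_labels)
```

For example, with a single "dog" region covering (0, 0)-(10, 10), a descriptor at (5, 5) is associated with "dog", while a descriptor at (50, 50) inherits the image-level labels.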
Moreover, although the augmented space 102 illustrated by
The augmentation module 110 then clusters the descriptors in the augmented space 102 to form A-space clusters 117. The descriptors may be clustered by using, for example, k-means clustering, or an expectation-maximization algorithm. Also, D-space clusters 118 (which include D-space clusters 118A-B in this example) are generated based on the A-space clusters 117, for example by agglomerating the A-space clusters 117 that overlap when projected into the descriptor space 101.
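As a concrete illustration of clustering in the augmented space, the sketch below runs plain k-means on augmented vectors that concatenate one visual coordinate with one numeric label coordinate. The two-dimensional vectors and the fixed initial centers are illustrative assumptions, not the embodiments' actual descriptors.

```python
def kmeans(points, centers, iters=10):
    """Minimal k-means over augmented-space vectors (tuples)."""
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[j].append(p)
        # Recompute each center as the mean of its assigned points.
        centers = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return centers, clusters
```

Augmented vectors of the form (visual value, label id), such as (0.0, 0.0), (0.1, 0.0), (5.0, 1.0), and (5.1, 1.0), separate into two A-space clusters under this sketch.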
A classifier-training module 120 then trains a respective A-space classifier (e.g., A-space classifiers 1-5) for each of the A-space clusters 117. In some embodiments, a classifier is a binary classifier. The classifier-training module 120 may train an A-space classifier with a one-against-all scheme by using the descriptors contained in the corresponding A-space cluster 117 as a positive sample set and the descriptors in the other A-space clusters 117 as a negative sample set. Accordingly, the discriminant information contained in the descriptors of an A-space cluster 117 is encoded into the corresponding classifier. This may prevent the loss of any significant semantic information. Also, in some embodiments a respective D-space classifier is trained for each of the D-space clusters 118.
A classifier-organization module 130 associates each D-space cluster 118 with the A-space classifiers of the component A-space clusters 117 of the D-space cluster 118. Assuming that there are K D-space clusters 118, then the k-th D-space cluster 118 has Mk classifiers associated with it. If Mk=1, then there is only one classifier associated with the D-space cluster 118. This one classifier may be a null classifier, and the output of the classifier may be 1. If Mk>1, then there are Mk classifiers, ym, m=1, . . . Mk, associated with the k-th D-space cluster 118. The Mk classifiers of the k-th D-space cluster 118 are the classifiers of the A-space clusters that compose the k-th D-space cluster 118. Thus, in
Therefore, in
The flow starts in block 200, where descriptors are extracted from one or more images. Next, in block 210, the descriptors are mapped to augmented space. The flow then proceeds to block 220, where augmented-space clusters are generated.
Next, in block 230, the augmented-space clusters are mapped to the descriptor space. For example, the augmented-space clusters may be projected into the descriptor space. Then in block 240, descriptor-space clusters are generated based on the augmented-space clusters, for example by agglomerating the augmented-space clusters' projections in the descriptor space through an agglomerative-type clustering of the clusters or by a divisive clustering method.
Following, in block 250, a respective classifier is trained for each augmented-space cluster. Any applicable classifier-learning method may be used to train the classifiers. For example, let x be a descriptor representation in a descriptor space. In some example embodiments, the binary classifier is a linear SVM classifier, where
y(x)=sgn(w·x+b), (1)
and where w and b denote the normal vector to the optimal separating hyperplane and the bias found by the SVM, respectively.
Some embodiments use an AdaBoost-like method, for example
y(x)=sgn(Σt=1Tht(xt)), (2)
where xt is an element of x, and
ht(xt)=vt if xt>θt, and ht(xt)=ut otherwise, (3)
where vt, ut, and θt are parameters of a stump classifier generated by AdaBoost learning.
Finally, in block 260, the descriptor-space clusters are associated with the applicable augmented-space classifiers (e.g., the augmented-space classifiers of the component augmented-space clusters of a corresponding descriptor-space cluster). For a D-space cluster containing only one (Mk=1) A-space cluster, a null classifier may be associated with the D-space cluster. The null classifier outputs 1 if the D-space cluster is activated and outputs 0 otherwise. In some embodiments, the activation of a cluster occurs when the D-space cluster is selected as the nearest D-space cluster to an input descriptor based on a standard k-means nearest-centroid assignment process. The final visual vocabulary (also referred to herein as “FVV”) includes the classifiers associated with the D-space clusters.
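The encoding behavior described in this paragraph can be sketched as follows. The scalar descriptors and lambda classifiers are illustrative stand-ins, and `None` marks an assumed representation of the null classifier.

```python
def encode(x, d_centroids, classifiers):
    """Activate the nearest D-space cluster (nearest-centroid
    assignment) and fill its slots with its A-space classifier
    scores; every other cluster contributes zeros, and a cluster
    whose only classifier is the null classifier contributes 1."""
    k = min(range(len(d_centroids)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(x, d_centroids[j])))
    encoding = []
    for j, fns in enumerate(classifiers):
        if j != k:
            encoding.extend([0.0] * len(fns))
        elif fns == [None]:          # null classifier: Mk = 1
            encoding.append(1.0)
        else:
            encoding.extend(f(x) for f in fns)
    return encoding
```

With two D-space clusters, one holding only a null classifier and one holding two classifiers, an input near the first cluster encodes as [1.0, 0.0, 0.0].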
The flow then moves to block 330, where descriptor-space clusters are generated based on the mapped augmented-space clusters. Next, in block 335, it is determined if a classifier has been generated for each descriptor-space cluster. If not (block 335=no), then the flow proceeds to block 340, where a classifier is generated for the next descriptor-space cluster, and then the flow returns to block 335. If yes (block 335=yes), then the flow proceeds to block 345, where the augmented-space classifiers of the augmented-space clusters that compose a descriptor-space cluster are associated with the descriptor-space cluster or its classifier.
The flow starts in block 400, where a D-space cluster and its corresponding Mk A-space clusters are obtained. Next, in block 405, a counter i is set to 0. The flow then moves to block 410, where it is determined whether all Mk A-space clusters have been considered (i=Mk). If not (block 410=no), then the flow proceeds to block 415, where the i-th A-space cluster is set as the positive sample set. Next, in block 420, the A-space clusters, other than the i-th A-space cluster, that are associated with the D-space cluster are set as a negative sample set. Following, in block 425, samples from other D-space clusters 491 are added as an external negative sample set. The flow then moves to block 430, where a classifier for the i-th A-space cluster is trained using the selected positive and negative samples.
Next, in block 435, the counter i is incremented, and then the flow returns to block 410. If in block 410 it is determined that all Mk A-space clusters have been considered (i=Mk), then the flow proceeds to block 440, where the Mk A-space classifiers are output.
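The loop of blocks 410-440 can be sketched as below. The `train` callable is a placeholder for any applicable classifier learner; the toy nearest-mean learner in the example is an assumption for illustration, not the method of the embodiments.

```python
def train_a_space_classifiers(a_clusters, external_negatives, train):
    """One-against-all loop: for each A-space cluster of a D-space
    cluster, use its descriptors as the positive set, and the
    sibling A-space clusters plus samples from other D-space
    clusters as the negative set."""
    classifiers = []
    for i, positives in enumerate(a_clusters):
        negatives = [d for j, cl in enumerate(a_clusters) if j != i for d in cl]
        negatives += list(external_negatives)
        classifiers.append(train(positives, negatives))
    return classifiers

def nearest_mean(pos, neg):
    """Toy stand-in learner: classify by the nearer class mean."""
    mp, mn = sum(pos) / len(pos), sum(neg) / len(neg)
    return lambda x: 1 if abs(x - mp) < abs(x - mn) else -1
```

For example, with two A-space clusters of scalar descriptors and one external negative sample, each trained classifier accepts its own cluster and rejects its sibling.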
Following, in block 520, the descriptor x is scored using the A-space classifiers that are associated with the activated D-space cluster, for example the Mk A-space classifiers, ym, m=1, . . . , Mk, that are associated with the k-th D-space cluster. The output is the classification result of the Mk classifiers, [y1(x), y2(x), . . . , yMk(x)].
Finally, in block 530, the A-space-classifier scores are aggregated. So the encoding V of the descriptor x is given by
V=[0, . . . ,0,y1(x),y2(x), . . . ,yMk(x),0, . . . ,0]. (4)
In some embodiments, the encoding operations activate the J D-space clusters (J ≤ K) nearest to the input descriptor x. The output of each activated D-space cluster is then generated. The output of the j-th D-space cluster is an intermediate encoding Vj, which may be calculated according to
Vj=[0, . . . ,0,yj,1(x),yj,2(x), . . . ,yj,Mj(x),0, . . . ,0]. (5)
The outputs of all the activated D-space clusters may be aggregated to get the final encoding V of the descriptor x, where
V=Σj=1Jpj·Vj, (6)
and where pj is a weight that indicates the significance of the corresponding D-space cluster.
Some embodiments determine the weights based on the respective distances between the input descriptor x and the D-space clusters, for example according to
pj=exp(−dj/σ)/Z, (7)
where σ is a constant, dj is a distance between the descriptor x and the j-th D-space cluster, and Z is a normalization parameter that makes Σj=1Jpj=1. As stated previously, in some embodiments the distance is a Euclidean distance, dj=∥x−cj∥, where cj denotes the center (or centroid) of the j-th D-space cluster.
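The weight computation and the aggregation of equation (6) can be sketched together as below. The exponential decay of the weights with distance is an illustrative assumption; any weighting that is normalized to sum to 1 fits the same structure.

```python
import math

def aggregate(intermediates, distances, sigma=1.0):
    """Aggregate the activated D-space clusters' intermediate
    encodings V_j into V = sum_j p_j * V_j, with weights assumed
    to decay exponentially with the cluster distances d_j and
    normalized so that they sum to 1."""
    raw = [math.exp(-d / sigma) for d in distances]
    z = sum(raw)
    weights = [r / z for r in raw]
    dim = len(intermediates[0])
    return [sum(w * v[i] for w, v in zip(weights, intermediates))
            for i in range(dim)]
```

With two equidistant clusters the weights are both 0.5, so the intermediate encodings are simply averaged.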
Additionally, in some embodiments, the encoding further describes attribute features. The set {zk}, k=1, . . . , C, represents the semantic labels used to create a semantic subspace in an augmented space. In the augmented space, each generated A-space cluster may contain one or more semantic labels. A C-dimensional label histogram B can be generated from an A-space cluster, for example according to
B=[b1,b2, . . . ,bC], (8)
where bi is a count of samples with the label zi in the A-space cluster. For example, such a label histogram may be built for each A-space cluster during vocabulary learning. Then each histogram is associated with a classifier learned from its corresponding A-space cluster. As a result, a classifier outputs not only a classification decision, but also a histogram of labels, which can be considered to be a set of semantic attributes associated with the decision.
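The label histogram of equation (8) can be sketched as follows; the list-of-label-lists representation of a cluster's samples is an assumption for illustration.

```python
def label_histogram(sample_labels, label_set):
    """Equation (8) sketch: b_i counts the samples in an A-space
    cluster that carry label z_i. `sample_labels` holds the label
    list of each sample in the cluster."""
    return [sum(z in labels for labels in sample_labels)
            for z in label_set]
```

For a cluster with samples labeled ["dog"], ["dog", "cat"], and ["cat"], and the label set ["dog", "cat", "truck"], the histogram is [2, 2, 0].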
Given an input descriptor x, its semantic attributes may be extracted by using the learned attribute histograms during an encoding phase. Some embodiments generate a C-dimensional attribute-feature vector according to
A=(1/Z)Σm=1Mkym(x)·Bm, (9)
where Bm is the attribute histogram associated with the m-th classifier of the activated D-space cluster, and where Z is a normalization constant (e.g., for an L1 normalization).
Some embodiments activate the J nearest D-space clusters. These embodiments can generate a C-dimensional attribute-feature vector through a weighted linear combination of J individual attribute-feature vectors, for example according to
A=Σj=1Jpj·Aj, (10)
where Aj is the attribute-feature vector generated from the j-th D-space cluster (e.g., according to equation (9)), and where pj are the weights (e.g., according to equation (7)).
Finally, attribute-feature vectors generated according to equation (9) or equation (10) can be combined with a bag-of-visual feature vector generated according to equation (4) or equation (6), respectively, via a concatenation or a weighted concatenation, for example. The combined feature representation may provide enhanced discriminative power and may be used for general image recognition and retrieval.
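A minimal sketch of this combination step follows; the scalar weight `alpha` is an assumed knob for the weighted concatenation and is not named in the description.

```python
def combine(v, a, alpha=1.0):
    """Concatenate a bag-of-visual-feature encoding V with an
    attribute-feature vector A, optionally scaling A by `alpha`
    for a weighted concatenation."""
    return list(v) + [alpha * x for x in a]
```

With alpha=1.0 this is a plain concatenation; other values bias the combined representation toward or away from the attribute features.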
A descriptor-encoding module 650 then obtains the descriptor 613 and the A-space classifier(s) that are associated with the activated D-space cluster 618, and, based on them, generates a descriptor encoding 616, for example according to equation (4) or equation (9).
A descriptor-encoding module 750 then obtains the descriptor 713 and the A-space classifier(s) that are associated with the two activated D-space clusters 718, and, based on them, generates a descriptor encoding 716, for example according to equation (6) or equation (10).
The storage/memory 913 includes one or more computer-readable or computer-writable media, for example a computer-readable storage medium or a transitory computer-readable medium. A computer-readable storage medium is a tangible article of manufacture, for example a magnetic disk (e.g., a floppy disk, a hard disk), an optical disc (e.g., a CD, a DVD, a Blu-ray), a magneto-optical disk, magnetic tape, and semiconductor memory (e.g., a non-volatile memory card, flash memory, a solid-state drive, SRAM, DRAM, EPROM, EEPROM). A transitory computer-readable medium, for example a transitory propagating signal (e.g., a carrier wave), carries computer-readable information. The storage/memory 913 is configured to store computer-readable data or computer-executable instructions. The components of the vocabulary-generation device 910 communicate via a bus.
The vocabulary-generation device 910 also includes a descriptor-extraction module 914, an augmentation module 915, a classifier-training module 916, a classifier-organization module 917, and an encoding module 918. In some embodiments, the vocabulary-generation device 910 includes additional or fewer modules, the modules are combined into fewer modules, or the modules are divided into more modules. The descriptor-extraction module 914 includes instructions that, when executed by the vocabulary-generation device 910, cause the vocabulary-generation device 910 to obtain one or more images (e.g., from the image-storage device 920) and extract one or more descriptors from the images. The augmentation module 915 includes instructions that, when executed by the vocabulary-generation device 910, cause the vocabulary-generation device 910 to map descriptors to an augmented space, generate descriptor clusters in the augmented space, or generate descriptor clusters in the descriptor space. The classifier-training module 916 includes instructions that, when executed by the vocabulary-generation device 910, cause the vocabulary-generation device 910 to train augmented-space classifiers for the augmented-space clusters or train descriptor-space classifiers for the descriptor-space clusters. The classifier-organization module 917 includes instructions that, when executed by the vocabulary-generation device 910, cause the vocabulary-generation device 910 to associate augmented-space classifiers with respective ones of the descriptor-space clusters. The encoding module 918 includes instructions that, when executed by the vocabulary-generation device 910, cause the vocabulary-generation device 910 to map descriptors to one or more descriptor-space clusters and encode descriptors with scores generated by the augmented-space classifiers that are associated with the activated one or more descriptor-space clusters.
The image-storage device 920 includes a CPU 922, storage/memory 923, I/O interfaces 924, and image storage 921. The image storage 921 includes one or more computer-readable media that are configured to store images. The image-storage device 920 and the vocabulary-generation device 910 communicate via a network 990.
The above-described devices, systems, and methods can be implemented by providing one or more computer-readable media that contain computer-executable instructions for realizing the above-described operations to one or more computing devices that are configured to read and execute the computer-executable instructions. Thus, the systems or devices perform the operations of the above-described embodiments when executing the computer-executable instructions. Also, an operating system on the one or more systems or devices may implement at least some of the operations of the above-described embodiments. Thus, the computer-executable instructions or the one or more computer-readable media that contain the computer-executable instructions constitute an embodiment.
Any applicable computer-readable medium (e.g., a magnetic disk (including a floppy disk, a hard disk), an optical disc (including a CD, a DVD, a Blu-ray disc), a magneto-optical disk, a magnetic tape, and semiconductor memory (including flash memory, DRAM, SRAM, a solid state drive, EPROM, EEPROM)) can be employed as a computer-readable medium for the computer-executable instructions. The computer-executable instructions may be stored on a computer-readable storage medium that is provided on a function-extension board inserted into a device or on a function-extension unit connected to the device, and a CPU provided on the function-extension board or unit may implement at least some of the operations of the above-described embodiments.
The scope of the claims is not limited to the above-described embodiments and includes various modifications and equivalent arrangements. Also, as used herein, the conjunction “or” generally refers to an inclusive “or,” though “or” may refer to an exclusive “or” if expressly indicated or if the context indicates that the “or” must be an exclusive “or.”