Image retrieval involves three important procedures, i.e., feature extraction, off-line indexing, and online retrieval. Among the three, off-line indexing organizes relevant images together to eliminate redundancy and make them easy to access during online retrieval. Therefore, the indexing strategy largely influences retrieval accuracy as well as time and memory costs. Many works have been published on extracting better image features and designing more accurate online retrieval algorithms, but effort toward better indexing strategies remains relatively limited.
Other works use an inverted index to index the image IDs in the database. The indexing is done in a per-image manner, and these methods do not extensively explore the correlations among database images either. In these methods, local invariant image features are extracted to capture low-level local content robust to local transformations. An image typically generates about 1,000 feature points, and database images are indexed using these local features.
Despite the great success of these approaches in local descriptor based image retrieval, most existing works follow the one-layer “descriptor to image” indexing structure. Although very effective, it has several obvious drawbacks. Firstly, image databases usually store multiple copies of similar objects or scenes, especially databases containing millions of images. A group of local descriptors may also appear frequently in multiple images. Although frequently appearing descriptors are down-weighted using inverse document frequency, the “descriptor to image” indexing has no strategy to eliminate such redundancy across images to save memory. In other words, the current indexing scheme incurs a potentially higher memory cost than necessary. Secondly, recent advances in large-scale image classification and saliency analysis may help with conducting robust similarity analysis among images. Because current indexing is performed for each image individually, it is not straightforward to embed complex database image relations into the current framework for online retrieval.
Most current image indexing systems for image retrieval view the database as a set of individual images. This limits the flexibility of the retrieval framework to conduct sophisticated cross-image analysis, resulting in higher memory consumption and sub-optimal retrieval accuracy.
In one aspect, systems and methods are disclosed to respond to a query for one or more images by, using a processor, applying an indexing strategy which processes images as grouplets rather than as individual images; generating a two-layer indexing structure with a group layer, each group associated with one or more images in an image layer; cross-indexing the images into two or more groups; and retrieving near-duplicate images with the cross-indexed images and the grouplets.
In another aspect, the system contains two procedures: 1) grouplet generation and 2) grouplet based indexing and retrieval. Because images within each grouplet are indexed and retrieved as one unit, they are required to be highly relevant to each other to ensure retrieval precision. To discover such grouplets in a large-scale image database, we build sparse graphs where the vertices are images and the links denote mutual k-Nearest Neighbor (kNN) relationships computed in different ways. Then, in such graphs, we seek the maximal cliques as grouplets. Each maximal clique is a subgraph in which any two vertices are linked; thus the images in it are highly relevant to each other. After generating different types of grouplets, we follow the classic BoWs (Bag-of-visual-Words) indexing procedure to index them, i.e., extracting local descriptors, computing TF (Term Frequency) vectors with a pooling strategy, and building inverted file indexes. During the online retrieval stage, we also follow the BoWs retrieval procedure: we extract only local descriptors from the query to retrieve relevant grouplets, then unpack them and rank the individual images.
Advantages of the system may include one or more of the following. Our method treats the database images as joint sets of groups. Each group consists of a set of images that have high correlation based on either local similarity or global semantic similarity. In contrast to most previous works, which index each individual image, we apply the indexing for each group. Because groups are constructed so that the images in a specific group are similar in some respect, the local descriptors in a group are highly redundant. Redundant descriptors only need to be indexed once, which significantly reduces memory usage. In the process of group construction, both global high-level features and local features are taken into account to support robust indexing.
Our approach shows better precision, efficiency, and memory cost, i.e., about 130% and 50% of the baseline BoWs model's memory cost when three types or one type of grouplets are considered, respectively. Therefore, we conclude that our approach is superior to existing indexing approaches in terms of precision, efficiency, and memory cost. The system seamlessly integrates various content analysis techniques. Our approach is largely different from many recent retrieval approaches working on feature extraction and online retrieval; those approaches can be integrated with our indexing strategy for better performance. Our online retrieval only extracts local descriptors, but is able to consider and integrate multiple image similarities. This is superior to many retrieval fusion approaches, which introduce extra computation and memory cost by fusing different features or multiple retrieval results during online retrieval.
Turning now to the figures,
Cross indexing consists of two main steps: 1) grouplet generation and 2) grouplet indexing. We formulate grouplet generation as seeking all maximal cliques in a sparse graph, where vertices are images and links denote the mutual k-Nearest Neighbor (kNN) relations computed with customized similarity measurements. As shown in
Our online retrieval follows the BoWs (Bag-of-visual Words) retrieval procedure and first extracts local descriptors from the query to retrieve relevant grouplets. Images are then retrieved from grouplets using the grouplet-image correspondences obtained during grouplet construction. Although only local descriptors are used for online query, images sharing similar local descriptors, similar object regions, and similar semantics could be retrieved, because the intermediate grouplet layer models sophisticated image relations. We test our approach on several image retrieval benchmark datasets. Compared with recent image retrieval algorithms, our approach shows lower memory cost, higher efficiency, and competitive accuracy. Retrieval on large-scale datasets further manifests such advantages.
We observe that in an image database, many images share strong relevance with each other. Rather than indexing these images individually, we propose to package them into one basic unit for indexing and retrieval. We call such basic units containing highly relevant images grouplets.
In contrast to traditional methods, which use an image ID to index each local feature descriptor, we use a group ID to index a local feature descriptor. The group layer index enables fast group search using an inverted index. Similar to previous work, we use a vocabulary tree structure to perform the first-layer descriptor indexing task. The second, image layer index allows retrieving images from the searched groups. The image layer index is naturally obtained in the group construction process.
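The two-layer structure described above can be sketched in Python. The class and method names are illustrative assumptions, and a plain dictionary-based inverted index stands in for the vocabulary tree:

```python
from collections import defaultdict

class TwoLayerIndex:
    """Sketch of a two-layer index: visual word -> grouplets -> images."""

    def __init__(self):
        self.word_to_groups = defaultdict(set)  # layer 1: visual word -> grouplet IDs
        self.group_to_images = {}                # layer 2: grouplet ID -> image IDs

    def add_grouplet(self, group_id, image_ids, visual_words):
        """Index one grouplet: its member images and the visual words it contains."""
        self.group_to_images[group_id] = list(image_ids)
        for w in visual_words:
            self.word_to_groups[w].add(group_id)

    def lookup(self, visual_word):
        """Return all images reachable from a query visual word via their grouplets."""
        images = set()
        for g in self.word_to_groups.get(visual_word, ()):
            images.update(self.group_to_images[g])
        return images
```

Because an image may belong to several groups (cross-indexing), a single visual-word lookup can reach the same image through more than one group.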
Module 102 of FIG. 3 shows an exemplary group construction using three different image similarity measurements: local feature similarity, semantic similarity, and sub-region similarity. Local feature similarity models local content similarity between images. Semantic similarity measures the similarity in semantic meaning between two images. As illustrated in
During the query, we first extract local features from the query image. Through each descriptor, we retrieve the corresponding groups via the descriptor-group indexing. We then find the images through the image-group indexing. The retrieval score of an image in the database is aggregated if the image is retrieved by multiple descriptors. We allow each image to be in multiple groups, which we call cross-connected image-to-group mapping. If each image belonged to at most one group, all images in a group would have exactly the same retrieval score, and the scheme could not differentiate images inside the same group. Our framework enables multiple groups to vote scores for one image, so the retrieval scores of two images can differ even when they are in the same group.
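The voting described in this paragraph can be sketched as follows. The group similarity scores are assumed to be already computed, and the function name is an illustrative assumption:

```python
from collections import defaultdict

def score_images(group_scores, group_to_images):
    """Propagate group-level query similarities to member images.

    group_scores: {group_id: similarity of this group to the query}
    group_to_images: {group_id: [image_id, ...]}; an image may appear in
    several groups and then accumulates a vote from each of them.
    """
    image_scores = defaultdict(float)
    for g, s in group_scores.items():
        for img in group_to_images[g]:
            image_scores[img] += s
    return dict(image_scores)
```

For example, if image "b" belongs to two retrieved groups while "a" and "c" each belong to one, "b" ends up with a higher score than its group-mates, which is exactly the differentiation the cross-connected mapping provides.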
In one embodiment, we have an image dataset D={d1, d2, . . . , dM}. In cross indexing, we represent the database as a collection of grouplets, i.e., G={G1, G2, . . . , GN} generated on D. We define a grouplet Ga as a collection of images, i.e.,
Ga:{di}iεGa, (1)
where |·| is the cardinality of Ga, i.e., the number of images in a grouplet. Because indexing more grouplets results in larger memory cost, we require that no grouplet be a subset of any other, to control the number of grouplets. We denote the collection of grouplets containing image di as Gi, Gi⊆G. Because di could belong to multiple grouplets, |Gi|≧1.
Based on such a grouplet representation in cross indexing, during online retrieval the similarity between query q and database image di could be formulated as

sim(q,di)=ΣGaεGi sim(q,Ga), (2)

where we use the similarities between grouplets and the query, i.e., sim(·), to vote the similarity between the query and database images. Therefore, Eq. (2) differs from the TF-IDF (Term Frequency-Inverse Document Frequency) similarity in inverted file indexing, which directly computes the similarity between the query and database images.
According to Eq. (2), images in the same grouplet would present more consistent similarity with the query than those in different grouplets. Therefore, the quality of the generated grouplets largely affects the similarity computation in image retrieval after cross indexing. To make the image retrieval valid, grouplets should embed discriminative relations among images to ensure that closely related images share more consistent similarities with the query. This guides the formulation of our grouplet generation, i.e.,
where D denotes the given distance matrix, which can be computed by customized measurements such as semantic meaning or local visual similarity,
where DIS(·) denotes the distance between two collections of grouplets. By replacing DIS(·) and D with the following:
SIM(Gi,Gj)=1−DIS(Gi,Gj)=|Gi∩Gj|/|Gi∪Gj| (4)
S(i,j)=1−D(i,j), S(i,j)=S(j,i), S(i,j)ε{0,1} (5)
we could reasonably simplify Eq. (3) as,
where SIM (·) denotes the similarity between two collections of grouplets, and matrix S can be seen as an undirected graph representing the customized relations among database images. Grouplet generation is hence equivalent to dividing this graph into subgraphs that satisfy: 1) images in the same subgraph should be highly relevant to each other and irrelevant images should appear in different subgraphs; 2) the number of subgraphs should be small to save memory.
According to graph theory, a clique in an undirected graph is defined as a subset of vertices in which every two vertices are connected. A maximal clique is a clique that cannot be extended by including one more adjacent vertex. Hence, optimizing Eq. (6) is equivalent to finding all maximal cliques in an undirected graph, i.e., images within a maximal clique are connected with one another, and a minimum number of cliques is generated. Therefore, grouplet generation can be reasonably solved by seeking all maximal cliques in the graph defined by S.
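Seeking all maximal cliques can be sketched with the classic Bron-Kerbosch algorithm with pivoting. This is a generic illustration; the disclosure itself employs the output-sensitive method of Makino et al. for large sparse graphs:

```python
def maximal_cliques(adj):
    """Bron-Kerbosch with pivoting.

    adj: {vertex: set of neighbors} for an undirected graph (every vertex
    must appear as a key). Returns all maximal cliques as frozensets.
    """
    cliques = []

    def expand(R, P, X):
        # R: current clique; P: candidates; X: already-processed vertices.
        if not P and not X:
            cliques.append(frozenset(R))  # R cannot be extended: maximal
            return
        # Pivot on the vertex covering the most candidates to prune branches.
        pivot = max(P | X, key=lambda u: len(adj[u] & P))
        for v in list(P - adj[pivot]):
            expand(R | {v}, P & adj[v], X & adj[v])
            P.remove(v)
            X.add(v)

    expand(set(), set(adj), set())
    return cliques
```

On the graph 0-1-2 (a triangle) with a pendant edge 2-3, this yields exactly the two maximal cliques {0,1,2} and {2,3}; the pendant vertex ends up in a small clique rather than being absorbed into an unrelated group.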
In one embodiment, a mutual kNN graph is used to reveal the relevance relations among images, and all maximal cliques in it are then sought as grouplets. Suppose di and dj are mutual kNNs of each other; then they satisfy
diεkNN(dj); djεkNN(di), (7)
where kNN(·) denotes the k-Nearest Neighbors of an image.
Based on the mutual kNN relations, we could build a sparse graph H=(V,S), where V is the vertex set, i.e., the database images, and S stores the edges among vertices: if di and dj are mutual kNNs of each other, then S(i,j)=1. The edges thus represent the mutual-kNN relationships, which reveal reliable relevance among images.
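Extracting the nonzero entries of S from precomputed kNN lists can be sketched as follows (the function name and input layout are assumptions):

```python
def mutual_knn_edges(knn):
    """Return the undirected mutual-kNN edges, i.e., the nonzero entries of S.

    knn: {image_id: list of its k nearest neighbors}. An edge (i, j) with
    i < j is emitted only when i appears in knn[j] AND j appears in knn[i].
    """
    edges = set()
    for i, neighbors in knn.items():
        for j in neighbors:
            if i != j and i in knn.get(j, ()):
                edges.add((min(i, j), max(i, j)))
    return edges
```

Note that one-directional neighbor relations (e.g., j is among i's neighbors but not vice versa) produce no edge, which is what makes the resulting graph sparse and its links reliable.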
Finding all maximal cliques in a graph is an NP-complete problem. Despite this hardness, plenty of efficient algorithms have been studied. Makino et al. propose an output-sensitive algorithm based on fast matrix multiplication, which finds about 100,000 maximal cliques per second in a sparse graph. In this paper, we employ this method to find maximal cliques. By constructing sparse graphs with a properly selected parameter k, the maximal cliques can be efficiently identified.
It can be observed that images sharing strong relevance are identified as one grouplet. An isolated image not similar to any others composes a grouplet containing only itself. This ensures the high relevance among images in each grouplet. As aforementioned, the parameter k decides the sparseness of the matrix S; hence it largely decides the number and quality of the generated grouplets. In cross indexing, the intermediate grouplet layer allows seamless integration of different image content analysis techniques by customizing the mutual kNN relations. We use three complementary clues to generate the final grouplet collection, i.e., G={G(l), G(r), G(g)}.
G(l) denotes grouplets generated with local descriptors. We could employ a vocabulary tree to compute BoWs models, build inverted indexes for database images, and finally compute the TF-IDF similarities to build the mutual kNN graph. Recent works on local descriptor based image search and image relation computation can also be used to improve the quality of G(l). Because local descriptors and the vocabulary tree are mainly used in partial-duplicate image search, G(l) effectively organizes partial-duplicate images into the same grouplet.
G(r) denotes grouplets generated with regional features. We first densely generate initial regions on an image through over-segmentation. After rejecting regions that are too large or too small, we compute a matrix storing the overlap rates among the remaining regions. Affinity Propagation is then applied to this matrix to cluster the regions. We finally keep at most 5 clusters and select the largest region in each of them to represent the image. Suppose the region collections of two images di and dj are {rm}mεi and {rn}nεj, respectively; we define the regional image similarity as:
where |·| is the cardinality of a set, i.e., the number of regions in di or dj, respectively, and s(·) returns the similarity between the feature vectors of two image regions. We could hence build a graph using the defined regional similarity Sr(·). Because regions tend to capture object-level clues and may eliminate the negative effects of background clutter, G(r) is expected to organize images with similar objects into the same grouplet.
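Since Eq. (8) is not reproduced above, the following is only one plausible instantiation of Sr(·): a symmetric best-match average over the two region sets, with the pairwise region similarity s(·) supplied by the caller. The function name and the exact aggregation are assumptions:

```python
def regional_similarity(regions_i, regions_j, s):
    """Assumed regional similarity: match each region of one image to its
    best counterpart in the other image, average over the set, and
    symmetrize. regions_i/regions_j are region feature collections; s(a, b)
    returns the similarity of two region features."""
    if not regions_i or not regions_j:
        return 0.0
    fwd = sum(max(s(a, b) for b in regions_j) for a in regions_i) / len(regions_i)
    bwd = sum(max(s(a, b) for a in regions_i) for b in regions_j) / len(regions_j)
    return 0.5 * (fwd + bwd)
```

Using best matches rather than all pairs keeps one shared object region influential even when the remaining regions (e.g., background) do not match.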
G(g) denotes grouplets generated with global similarity. We simply use the similarity computed with global features to construct the mutual kNN graph for G(g) generation. G(g) hence tends to organize images with similar global appearances into the same grouplet.
In cross indexing, we mix different types of grouplets together and then proceed to index them with the two-layer indexing structure. Because relevant images may be similar in multiple aspects, e.g., local and global appearance, there may exist redundant grouplets. To remove such redundancy and save memory cost, we define the similarity of two grouplets as
SG(Ga,Gb)=|Ga∩Gb|/|Ga∪Gb|, (9)
where |·| is the cardinality of a grouplet, i.e., its number of images. With Eq. (9), we discard the smaller grouplet if the similarity of two grouplets is larger than α. In this paper, we experimentally set α=0.8.
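The redundancy-removal rule of Eq. (9) can be sketched as follows. Processing grouplets from largest to smallest is an implementation choice assumed here, so that in each overlapping pair it is the smaller grouplet that gets discarded:

```python
def deduplicate_grouplets(grouplets, alpha=0.8):
    """Drop the smaller of any two grouplets whose Jaccard similarity
    |Ga ∩ Gb| / |Ga ∪ Gb| exceeds alpha.

    grouplets: list of sets of image IDs. Returns the kept grouplets,
    largest first.
    """
    kept = []
    for g in sorted(grouplets, key=len, reverse=True):
        if all(len(g & k) / len(g | k) <= alpha for k in kept):
            kept.append(g)
    return kept
```

With the paper's α=0.8, a 5-image grouplet that is entirely contained in a 6-image one (Jaccard 5/6 ≈ 0.83) is discarded, while unrelated grouplets are untouched.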
After removing the redundant grouplets, we follow the inverted file indexing paradigm to construct the grouplet index. We first extract and encode local descriptors into visual words with a vocabulary tree containing millions of visual words, then compute TF (Term Frequency) vectors of the grouplets. For grouplets containing only one image, we directly compute the L-1 normalized visual word histogram as the TF vector. For grouplets containing multiple images, we first compute the TF vector of each image and then employ the max pooling strategy, which is well suited to sparse TF vectors. For a grouplet G:{di}iεG, the TF value of visual word v in G is computed as

TFG(v)=maxiεG TFi(v),
where TFi denotes the L-1 normalized TF vector of database image di.
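The max pooling step can be sketched as follows, with TF vectors represented as sparse dictionaries (an assumed layout; absent visual words are implicitly zero):

```python
def grouplet_tf(image_tfs):
    """Max-pool the L-1 normalized TF vectors of a grouplet's images:
    TF_G(v) = max over images i of TF_i(v).

    image_tfs: list of sparse TF vectors, each {visual_word: tf_value}.
    """
    pooled = {}
    for tf in image_tfs:
        for v, val in tf.items():
            pooled[v] = max(pooled.get(v, 0.0), val)
    return pooled
```

Because each image's TF vector is sparse, the pooled vector only contains visual words seen in at least one member image, keeping the grouplet index compact.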
Based on the TF vectors of all grouplets, we index them in the grouplet index, where each cell records the TF value of a visual word and the ID of a grouplet. We further build the second-layer index to record the grouplet-image relations. The grouplet-image relation is acquired during the grouplet generation process.
Because of the two-layer indexing structure, our online retrieval procedure consists of two steps. The first step is almost identical to standard BoWs based image retrieval, i.e., extracting and quantizing SIFT descriptors into visual words and computing the TF-IDF similarity, i.e.,

sim(q,Ga)=Σv IDF(v)×min(TFq(v),TFGa(v)),
This process returns grouplets sharing similar local descriptors with the query.
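The first-step similarity can be sketched as follows, assuming the TF-IDF similarity takes the min-intersection (histogram intersection) form common in BoWs retrieval; the sparse-dictionary layout is an assumption:

```python
def grouplet_similarity(query_tf, group_tf, idf):
    """sim(q, Ga) = sum over visual words v of IDF(v) * min(TF_q(v), TF_Ga(v)).

    query_tf, group_tf: sparse TF vectors {visual_word: tf_value};
    idf: {visual_word: inverse document frequency weight}.
    """
    return sum(idf.get(v, 0.0) * min(tf_q, group_tf.get(v, 0.0))
               for v, tf_q in query_tf.items())
```

Only visual words present in the query contribute, so the cost of scoring a grouplet is proportional to the number of distinct query words, as in a standard inverted-file traversal.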
According to the grouplet-image relation recorded in the image index, we then unpack these grouplets into a list of single images. As illustrated in Eq. (2), the similarity between query q and database image di is computed by voting the similarities of q and the grouplets containing di.
Because we generate grouplets with different aspects of similarity, images consistent with the query in multiple aspects are returned first. This is superior to most existing local descriptor based image retrieval systems, which mainly focus on retrieving partial-duplicate images of the query. In addition, our retrieval strategy only extracts local descriptors for the query, and is hence also superior to most retrieval fusion strategies, which need to fuse either multiple features or multiple results during online retrieval.
One embodiment uses Cross Indexing with Grouplets, which views the database images as a set of grouplets, each defined as a group of highly relevant images, and builds a two-layer indexing structure to achieve efficient image retrieval. The number of grouplets is smaller than the number of images, naturally leading to lower memory cost, and defining each grouplet as a set of highly relevant images eliminates redundancy. Moreover, the definition of a grouplet can be based on customized relations, allowing seamless integration of advanced data mining techniques in off-line indexing. Our framework is instantiated with three different types of grouplets, obtained by seeking the maximal cliques in mutual kNN graphs defined by local similarities, regional relations, and global visual features, respectively. Extensive experiments on public benchmark datasets demonstrate the efficiency and superior performance of our approach.
The system may be implemented in hardware, firmware or software, or a combination of the three.
Preferably the invention is implemented in a computer program executed on a programmable computer having a processor, a data storage system, volatile and non-volatile memory and/or storage elements, at least one input device and at least one output device.
By way of example, a block diagram of a computer to support the system is discussed next. The computer preferably includes a processor, random access memory (RAM), a program memory (preferably a writable read-only memory (ROM) such as a flash ROM) and an input/output (I/O) controller coupled by a CPU bus. The computer may optionally include a hard drive controller which is coupled to a hard disk and CPU bus. Hard disk may be used for storing application programs, such as the present invention, and data. Alternatively, application programs may be stored in RAM or ROM. I/O controller is coupled by means of an I/O bus to an I/O interface. I/O interface receives and transmits data in analog or digital form over communication links such as a serial link, local area network, wireless link, and parallel link. Optionally, a display, a keyboard and a pointing device (mouse) may also be connected to I/O bus. Alternatively, separate connections (separate buses) may be used for I/O interface, display, keyboard and pointing device. Programmable processing system may be preprogrammed or it may be programmed (and reprogrammed) by downloading a program from another source (e.g., a floppy disk, CD-ROM, or another computer).
Each computer program is tangibly stored in a machine-readable storage media or device (e.g., program memory or magnetic disk) readable by a general or special purpose programmable computer, for configuring and controlling operation of a computer when the storage media or device is read by the computer to perform the procedures described herein. The inventive system may also be considered to be embodied in a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.
The invention has been described herein in considerable detail in order to comply with the patent statutes and to provide those skilled in the art with the information needed to apply the novel principles and to construct and use such specialized components as are required. However, it is to be understood that the invention can be carried out by specifically different equipment and devices, and that various modifications, both as to the equipment details and operating procedures, can be accomplished without departing from the scope of the invention itself.
This application claims priority to Provisional Applications Ser. Nos. 61/948,903, filed Mar. 6, 2014, and 62/030,677, filed Jul. 30, 2014, the contents of which are incorporated by reference.
Number | Date | Country
---|---|---
61/948,903 | Mar 2014 | US
62/030,677 | Jul 2014 | US