One way of comparing a first image item to a second image item is to compute local features associated with each image item, and then compare the features of the first image item with the features of the second image item. If the first image item includes features that are close to the second image's features, then the first image item likely visually resembles the second image item.
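To make the feature-level comparison concrete, the following sketch counts how many local features of one image find a close match in another; the toy 4-dimensional vectors, the `match_score` name, and the distance threshold are illustrative assumptions (real local descriptors, such as SIFT, are much higher-dimensional):

```python
import numpy as np

def match_score(features_a, features_b, threshold=0.5):
    """Count features of image A whose nearest feature in image B lies
    within `threshold` (Euclidean distance).  A higher count suggests
    the two images visually resemble each other."""
    count = 0
    for f in features_a:
        dists = np.linalg.norm(features_b - f, axis=1)
        if dists.min() < threshold:
            count += 1
    return count

# Toy 4-dimensional "features" for two images.
a = np.array([[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]])
b = np.array([[0.1, 0.0, 0.0, 0.0], [5.0, 5.0, 5.0, 5.0]])
print(match_score(a, b))  # 1: only the first feature of `a` finds a close match
```

The quadratic cost of this pairwise comparison is precisely the inefficiency that motivates the vocabulary-based approach discussed next.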
The above approach can be used to retrieve information from a database of image items. In this application, a retrieval system extracts the features of a query image item and then finds the image items in the database that have a similar feature set. One problem with this approach is that it consumes a significant amount of time to generate and compare a large quantity of image features. This approach also requires a considerable amount of memory to store the computed features.
One way of addressing the above technical issues is to cluster groups of related features of a source dataset into respective “words,” to thereby form a vocabulary. Comparison of a query image item with the source dataset can then be performed on a word-level, rather than a more elementary feature-level. Nevertheless, prior approaches have not adequately explored the vocabulary-generating operation in suitable detail, resulting in potential inefficiencies and limitations in such approaches. For example, prior approaches generate a new vocabulary for each dataset to be searched, and there is a mindset that the vocabulary should be as big as possible.
Functionality is described for generating a vocabulary from a source dataset of image items or other non-textual items. The vocabulary (and an associated index) serves as a tool for retrieving items from a target dataset in response to queries. The vocabulary can be used to retrieve items from a variety of different target datasets. For instance, the vocabulary can be used to retrieve items from a target dataset that has a different size than the source dataset. The vocabulary can also be used to retrieve items from a target dataset that has a different type than the source dataset. The vocabulary is referred to as a multi-use vocabulary in the sense that it can be used in conjunction with datasets other than the source dataset from which it originated.
In one illustrative case, a multi-use vocabulary is produced from a source dataset having at least an approximate minimum size. In addition, or alternatively, the multi-use vocabulary includes at least an approximate minimum number of words.
Additional exemplary implementations and features are described in the following.
The same numbers are used throughout the disclosure and figures to reference like components and features. Series 100 numbers refer to features originally found in
This disclosure sets forth an approach for generating and using a multi-use vocabulary based on non-textual data, such as, but not limited to, image data.
As a preliminary note, any of the functions described with reference to the figures can be implemented using software, firmware, hardware (e.g., fixed logic circuitry), manual processing, or a combination of these implementations. The terms “logic,” “module,” “component,” “system,” and “functionality” as used herein generally represent software, firmware, hardware, or a combination of these elements. For instance, in the case of a software implementation, the term “logic,” “module,” “component,” “system,” or “functionality” represents program code that performs specified tasks when executed on a processing device or devices (e.g., CPU or CPUs). The program code can be stored in one or more computer readable memory devices.
More generally, the illustrated separation of logic, modules, components, systems, and functionality into distinct units may reflect an actual physical grouping and allocation of software, firmware, and/or hardware, or can correspond to a conceptual allocation of different tasks performed by a single software program, firmware program, and/or hardware unit. The illustrated logic, modules, components, systems, and functionality can be located at a single site (e.g., as implemented by a processing device), or can be distributed over plural locations.
The terms “machine-readable media” and the like refer to any kind of medium for retaining information in any form, including various kinds of storage devices (magnetic, optical, static, etc.). The term machine-readable media also encompasses transitory forms for representing information, including various hardwired and/or wireless links for transmitting the information from one point to another.
Aspects of the functionality are described in flowchart form. In this manner of explanation, certain operations are described as constituting distinct blocks performed in a certain order. Such implementations are illustrative and non-limiting. Certain blocks described herein can be grouped together and performed in a single operation, and certain blocks can be performed in an order that differs from the order employed in the examples set forth in this disclosure. The blocks shown in the flowcharts can be implemented by software, firmware, hardware, manual processing, any combination of these implementations, and so on.
This disclosure includes the following sections. Section A describes an illustrative system for generating and using a multi-use vocabulary. Section B describes illustrative procedures that explain the operation of the system of Section A. Section C describes data processing functionality that can be used to implement any aspect of the system of Section A.
A. Illustrative System
The system includes two principal components: a vocabulary providing module 102 and a vocabulary application module 104. The purpose of the vocabulary providing module 102 is to generate a vocabulary 106 based on image data obtained from a source dataset 108 of image items. The purpose of the vocabulary application module 104 is to apply the vocabulary 106 for a prescribed end use. According to one end use, a user may input a query image item to the vocabulary application module 104. In response, the vocabulary application module 104 can use the vocabulary 106 to determine whether there are any image items in a target dataset 110 that match the query image item.
Before exploring each piece of the system 100 in detail, note that
In another case, one of the target datasets 112 may include a larger collection of image items than is provided in the source dataset 108; stated in other terms, the source dataset 108 may represent a subset of a more encompassing body of image data expressed in a target dataset. For example, one of the target datasets can include a large collection of image items taken of a particular general subject, such as houses within a particular district of a city, whereas the source dataset 108 can comprise a fraction of this large collection of image items. In another case, one of the target datasets 112 may include a smaller collection of image items than is provided in the source dataset 108; stated in other terms, this target dataset may represent a subset of a more encompassing collection of image data expressed in the source dataset 108.
In another case, one of the target datasets 112 may include a collection of image items of a first type and the source dataset 108 can include a collection of image items of a second type, where the first type differs from the second type. For example, one of the target datasets 112 can represent image items taken of houses in a particular city, while the source dataset 108 can represent image items taken of artwork in a museum. These two datasets have different types because the general themes and environments of their corresponding datasets differ. In another case, one of the target datasets 112 can have the same size and type as the source dataset 108, but the target dataset includes a different portion of data than the source dataset 108. For example, the target dataset can represent a first half of a collection of pictures taken of houses in a city, while the source dataset 108 can represent the second half of this collection. Still other kinds of target datasets can make use of the common vocabulary 106. In general, the source dataset 108 and the target datasets 112 can originate from any source (or sources) 114 of data items.
Because the single vocabulary 106 can be used in conjunction with multiple target datasets 112, it is referred to as a multi-use vocabulary. In other words, the vocabulary 106 can be viewed as a universal vocabulary because it is not restricted for use with the source dataset 108, but can be used with many other types of target datasets 112. To summarize the above explanation, the target datasets 112 can differ from the source dataset 108 in one or more respects. For instance, the target datasets 112 can have different sizes than the source dataset 108, different types than the source dataset 108, different selections of same-type data than the source dataset 108, and so on.
With this overview, it is now possible to explore the composition of the vocabulary providing module 102 in greater detail. The vocabulary providing module 102 includes a vocabulary characteristic determination module 116. The purpose of the vocabulary characteristic determination module 116 is to determine one or more characteristics of the vocabulary 106 which allow it to function in the multi-use or universal role described above. For instance, the vocabulary characteristic determination module 116 can determine a minimum approximate size of the source dataset 108 that should be used to provide a vocabulary 106 that can be used for multiple different target datasets 112. In addition, or alternatively, the vocabulary characteristic determination module 116 can determine a minimum approximate number of words that the vocabulary 106 should contain to be used for multiple different target datasets 112.
In one case, the vocabulary characteristic determination module 116 operates in a partially automated manner. For example, the vocabulary characteristic determination module 116 can generate various graphs and charts for a human user's consideration. The human user can then analyze this information to determine the nature of the vocabulary 106 that should be generated to ensure multi-use application. In another case, the vocabulary characteristic determination module 116 can operate in a more fully automated manner by automatically determining the characteristics of the vocabulary 106 that should be generated.
The vocabulary providing module 102 also includes a vocabulary generating module 118. The purpose of the vocabulary generating module 118 is to generate the vocabulary 106 from the source dataset 108. The vocabulary generating module 118 generates the vocabulary 106 based on the considerations identified by the vocabulary characteristic determination module 116.
The vocabulary generating module 118 can also provide an index 120. The index can describe the correspondence between words in the vocabulary 106 and words in individual images in a dataset (such as the source dataset 108 and/or one or more of the target datasets 112). The index 120 can be formed using an inverted file approach.
Now turning to the vocabulary application module 104, this module 104 accepts a query image item from a user and determines whether this query image item matches one or more image items in a target dataset or target datasets. It performs this task by determining features in the query image item and then determining words associated with those features. It then uses these words, in conjunction with the vocabulary 106 and the index 120, to determine whether any image items in a target dataset include the same or similar image content.
The system 100 can be physically implemented in various ways to suit different technical environments. In one case, the vocabulary providing module 102 and the vocabulary application module 104 can be implemented by a single processing device. For example, the vocabulary providing module 102 and the vocabulary application module 104 can represent two programs or discrete logic components implemented by a single computer device. In another case, the vocabulary providing module 102 and the vocabulary application module 104 can be implemented by two respective data processing devices, such as two respective computer devices. In this case, the first data processing device can provide the vocabulary 106 for use by the second data processing device. In any case, the vocabulary providing module 102 can operate in an offline manner (e.g., as a set-up or initialization task, not driven by user queries), while the vocabulary application module 104 can operate in an online manner (e.g., driven by user queries).
In one case, a user can interact with the vocabulary application module 104 in a local mode of operation. In this case, the user may directly interact with a local data processing device which provides the vocabulary application module 104. In another case, a user can interact with the vocabulary application module 104 in a network-accessible mode of operation. In this case, the user may use a local data processing device (not shown) to interact with a network-accessible vocabulary application module 104 via a network 122. The network 122 can represent a local area network, a wide area network (such as the Internet), or any other type of network or combination thereof.
B. Illustrative Procedures
B.1. Generation of a Vocabulary
In operation 202, the vocabulary generation module 118 extracts features from the image items in the source dataset 108. Each local feature is represented by a high-dimensional feature vector which describes a local region of a feature point. Different techniques can be used to determine and represent features. In one illustrative and non-limiting approach, operation 202 involves using a Scale Invariant Feature Transform (SIFT) technique in conjunction with a Difference of Gaussian (DoG) detector to extract and represent features. For background information regarding these known techniques, note, for instance: K. Mikolajczyk, et al., “Local Features for Object Class Recognition,” Proceedings of the 10th IEEE International Conference on Computer Vision, ICCV, 2005, pp. 1792-1799; and David G. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” International Journal of Computer Vision, Vol. 60, No. 2, 2004, pp. 91-110.
In operation 204, the vocabulary generation module 118 determines whether a vocabulary already exists. If not, in operation 206, the vocabulary generation module 118 generates the vocabulary 106 for the source dataset 108. The vocabulary generation module 118 forms the vocabulary 106 by grouping common features into respective units called words. In other words, the vocabulary generation module 118 operates by partitioning a feature space created in operation 202 into words.
Different approaches exist for partitioning the feature space. One approach is clustering. In particular, hierarchical clustering can be performed to reduce computation cost. In this approach, operation 206 involves splitting the feature space into a small number of subsets by clustering, and then splitting each of those subsets into smaller sets in turn. This process is repeated until one or more conditions are satisfied. Since the vocabulary generated in this way follows a tree structure, the vocabulary represents a vocabulary tree.
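The hierarchical splitting described above can be sketched as follows; the minimal `kmeans` helper, the branching factor of three, and the dictionary-based tree representation are all illustrative assumptions (an actual implementation may use a different clustering routine, such as the GCS algorithm discussed below):

```python
import numpy as np

def kmeans(X, k, iters=10, seed=0):
    """Minimal k-means used as the splitting step at each tree node."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def build_tree(X, branch=3, depth_limit=None, leaf_size=None, depth=0):
    """Recursively partition the feature space.  Splitting stops when the
    depth condition (D-tree) or the leaf-size condition (L-tree) is met."""
    if len(X) < branch:
        return {"leaf": True, "count": len(X)}
    if depth_limit is not None and depth >= depth_limit:
        return {"leaf": True, "count": len(X)}
    if leaf_size is not None and len(X) < leaf_size:
        return {"leaf": True, "count": len(X)}
    labels = kmeans(X, branch)
    children = [build_tree(X[labels == j], branch, depth_limit, leaf_size, depth + 1)
                for j in range(branch)]
    return {"leaf": False, "children": children}

def count_leaves(node):
    """Each leaf of the finished tree corresponds to one visual word."""
    if node["leaf"]:
        return 1
    return sum(count_leaves(c) for c in node["children"])

X = np.random.default_rng(1).normal(size=(200, 4))  # toy feature vectors
d_tree = build_tree(X, branch=3, depth_limit=2)     # "D-2"-style tree
l_tree = build_tree(X, branch=3, leaf_size=50)      # "L-50"-style tree
print(count_leaves(d_tree), count_leaves(l_tree))
```

The two stopping conditions produce different trees over the same data, mirroring the D-tree versus L-tree distinction drawn below.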
Generally, there are two types of conditions for use in terminating a clustering operation: a tree depth condition and a leaf size condition. The term tree depth refers to the number of levels of the vocabulary tree. The term leaf size refers to the number of features in a leaf node of the vocabulary tree. A vocabulary tree built by satisfying a depth condition is referred to herein as a “D-tree,” while a vocabulary tree built by satisfying a leaf size condition is referred to as an “L-tree.” For example, a “D-8” tree refers to a tree in which splitting terminates when the tree reaches the eighth level. An “L-100” tree refers to a tree in which splitting terminates when the feature number of a cluster is less than 100. These two methodologies reflect different conceptualizations of feature space partitioning. D-tree clustering generates words of similar feature space size, but the words may cover different numbers of features. L-tree clustering generates words that cover a similar number of features, but the words may have different feature space sizes.
In one illustrative and non-limiting implementation, operation 206 can use a Growing Cell Structures (GCS) algorithm to split features into five subsets. Background information on the GCS technique is provided in B. Fritzke, “Growing Cell Structures—A Self-Organizing Network in k Dimensions,” Artificial Neural Networks II, I. Aleksander & J. Taylor, eds., North-Holland, Amsterdam, 1992, pp. 1051-1056.
In operation 206, the vocabulary generation module 118 also creates the index 120 for the vocabulary 106. The index 120 provides a document list for each word. The list identifies scenes which contain the features that belong to the word. The index 120 thus forms an inverted file for the words in the vocabulary 106. If an image vocabulary already exists (as determined in operation 204), then operation 208 involves inserting the features of the dataset 108 into the existing vocabulary tree to form a new inverted file. Operation 208 can also involve forming aggregative results, such as an indication of the frequency of each word within a scene, the frequency of each word within a dataset, and so forth.
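One possible shape for such an inverted file is sketched below; the scene ids, word ids, and the `build_inverted_index` name are hypothetical, chosen only to illustrate the word-to-document mapping and the aggregative counts mentioned above:

```python
from collections import defaultdict

def build_inverted_index(scene_words):
    """Map each word id to the scenes containing it, with per-scene
    occurrence counts (the aggregative results mentioned above)."""
    index = defaultdict(dict)
    for scene_id, words in scene_words.items():
        for w in words:
            index[w][scene_id] = index[w].get(scene_id, 0) + 1
    return index

# Hypothetical word ids obtained by quantizing each scene's features.
index = build_inverted_index({
    "scene1": [3, 7, 7, 12],
    "scene2": [7, 12, 12],
})
print(index[7])   # {'scene1': 2, 'scene2': 1}
```

At query time, only the lists for the query's words need to be consulted, which is what makes word-level retrieval faster than exhaustive feature comparison.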
B.2. Application of the Vocabulary
There are several uses of the vocabulary formed in the procedure 200 of
In operation 402, the vocabulary application module 104 receives a query in the form of a query input image. A user enters this query input image with the goal of finding one or more image items in the target dataset which are the same as or which closely resemble the query input image. In a more particular case, the goal may be to find one or more image items which include an object which closely resembles an object of interest in the query input image.
In operation 404, the vocabulary application module 104 extracts features from the query image item in the same manner described above with respect to items in the source dataset 108.
In operation 406, the vocabulary application module 104 determines whether any words in the vocabulary 106 correspond to the features extracted from the query image item.
In operation 408, the vocabulary application module 104 identifies items in the target dataset which are associated with any matching words identified in operation 406.
In operation 410, the vocabulary application module 104 ranks the items identified in operation 408 in order of relevance. Operation 410 then involves outputting the ranked list of relevant items to the user for his or her consideration.
Different techniques can be used to assess relevance. According to one technique, given an image vocabulary, a query image q or a database document (scene) d can be represented as an N dimensional vector of words which correspond to the local features extracted from them. Each word has a weight associated with it. N is the number of words in the vocabulary (which is the same as the dimension of the query or document vector). The relevance between q and d can be calculated as the cosine of the angle between the two word vectors. That is:

sim(q, d) = (Σi wqi wdi) / (‖wq‖ ‖wd‖)

where wdi is the weight of the ith word in document d, wqi is the weight of the ith word in query q, and the sum runs over the N words of the vocabulary. The denominator in this equation represents the product of the norms of the query and document vectors.
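The cosine relevance can be sketched directly; the dense-list representation of the weight vectors is an assumption made here for clarity (a practical system would accumulate the dot product only over words shared by the query and the document, using the inverted file):

```python
import math

def cosine_relevance(wq, wd):
    """Cosine of the angle between the query and document word-weight
    vectors (each a list of length N, one weight per vocabulary word)."""
    dot = sum(q * d for q, d in zip(wq, wd))
    norm_q = math.sqrt(sum(q * q for q in wq))
    norm_d = math.sqrt(sum(d * d for d in wd))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

print(cosine_relevance([1, 0, 2], [2, 0, 4]))  # ~1.0 (parallel vectors)
print(cosine_relevance([1, 0, 0], [0, 1, 0]))  # 0.0 (no shared words)
```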
The weight of each word may take two factors into consideration: term frequency (TF) and inverse document frequency (IDF). Term frequency refers to the normalized frequency of a word in a document. In the present case, a large term frequency means that the word appears multiple times in the same scene, which indicates that the corresponding feature is more robust. Therefore, such features can be given higher weight. TF may be calculated as:

TF(ti, d) = ni / Nd

where ni is the number of occurrences of term ti in document d, and Nd is the total number of words in document d.
The motivation for using inverse document frequency is that terms which appear in many documents are not very useful for distinguishing a relevant document from a non-relevant one. In the present case, very common terms may correspond to noisy features. IDF can be calculated as:

IDF(ti) = log(|D| / |{d : ti ∈ d}|)

where |D| is the total number of documents in the database, and |{d : ti ∈ d}| is the number of documents in which ti appears. In text retrieval, if a word appears in too many documents (that is, if its IDF is small), the word is ignored in word matching, since it contributes little while introducing too much noise. Such words are called “stop words.” By deleting stop words from the index, both memory cost and retrieval time can be reduced. In image retrieval, analogously, a leaf of the vocabulary tree can be defined as a stop word if it appears in too many scenes.
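Stop-word pruning can be sketched as a document-frequency cutoff; the 0.5 fraction, the `find_stop_words` name, and the index layout are illustrative assumptions (in practice the cutoff would be tuned to the dataset):

```python
def find_stop_words(index, total_docs, max_doc_fraction=0.5):
    """Return words that appear in more than `max_doc_fraction` of all
    documents; such words have small IDF and can be dropped."""
    return {word for word, docs in index.items()
            if len(docs) / total_docs > max_doc_fraction}

# Hypothetical inverted index: word -> documents containing it.
index = {"w1": ["d1", "d2", "d3"], "w2": ["d1"]}
print(find_stop_words(index, total_docs=3))  # {'w1'}
```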
Finally, the weight for word ti in document d is defined as the product of TF and IDF:

wdi = TF(ti, d) · IDF(ti)
The weight of the query is calculated using the same function, considering the query as a document.
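Putting the pieces together, the TF-IDF weight can be sketched as follows, treating each document (and the query) as a plain list of word ids; the toy documents are assumptions for illustration:

```python
import math

def tf(word, doc_words):
    """Normalized frequency of `word` among the words of one document."""
    return doc_words.count(word) / len(doc_words)

def idf(word, all_docs):
    """Logarithm of (total documents / documents containing the word)."""
    containing = sum(1 for d in all_docs if word in d)
    return math.log(len(all_docs) / containing) if containing else 0.0

def weight(word, doc_words, all_docs):
    """TF-IDF weight: the product defined in the text above."""
    return tf(word, doc_words) * idf(word, all_docs)

docs = [["a", "b", "b"], ["b", "c"], ["c", "c", "d"]]
# "b" occurs twice among the three words of docs[0] (TF = 2/3) and
# appears in two of the three documents (IDF = log(3/2)).
print(round(weight("b", docs[0], docs), 4))  # 0.2703
```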
B. 3. Selecting Characteristics of a Desirable Vocabulary
In operation 702, the vocabulary characteristic determination module 116 determines retrieval performance. Retrieval performance can be measured in various ways. In one type of technique, a Success Rate at N (SR@N) measure can be used to represent the success of a retrieval operation. Namely, SR@N represents the probability of finding a correct answer within the top N results. Given n queries, SR@N is defined as:

SR@N = (1/n) Σq θ(N − pos(aq))

where pos(aq) is the position of the correct answer aq for the qth query, the sum runs over the n queries, and θ(·) is the Heaviside step function defined by θ(x) = 1 if x ≥ 0, and θ(x) = 0 otherwise. Generally, SR@N increases rapidly when N is small (e.g., smaller than five), and then increases at a slower rate for larger values of N. In the following, the success rate for N = 1 is used to measure performance; this metric is generally representative of the performance for other values of N.
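SR@N reduces to a simple count over the ranks of the correct answers; the example ranks below are hypothetical:

```python
def success_rate_at_n(positions, N):
    """Fraction of queries whose correct answer appears within the top
    N results; `positions` holds the 1-based rank of the correct
    answer for each query (pos <= N means the Heaviside term is 1)."""
    return sum(1 for p in positions if p <= N) / len(positions)

ranks = [1, 3, 2, 7, 1]               # correct-answer ranks for five queries
print(success_rate_at_n(ranks, 1))    # 0.4
print(success_rate_at_n(ranks, 3))    # 0.8
```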
To provide more insightful results, the retrieval performance can also be measured for multiple different clustering approaches. For example, retrieval performance can be determined for a D-tree clustering approach and an L-tree clustering approach. Retrieval performance can be assessed for yet additional types of clustering approaches. Moreover, results of several tests can be averaged together to provide more reliable results.
Operation 702 indicates that retrieval performance can be assessed against various considerations. One such consideration is the size of the vocabulary that is generated. Another consideration is the size of the source dataset used to generate the vocabulary. Another consideration is the type of the source dataset 108 in relation to the type of the target dataset.
Operation 702 can assess the retrieval performance as a function of each of the above considerations, that is, by isolating each consideration in turn. A human user or an automated analysis routine (or a semi-automated analysis routine) can then consider all of the results together to determine what factors play a role in producing a multi-use vocabulary. Namely, one goal is to determine the characteristics of a vocabulary that can be used in conjunction with multiple target datasets. Another goal is to ensure that the vocabulary is not unnecessarily large or complex. An unnecessarily large or complex vocabulary may be costly to generate and maintain, even though it may also qualify as a multi-use vocabulary. In view of these factors, the general goal is to provide an optimal vocabulary or an approximately optimal vocabulary.
Consider first the effect of vocabulary size.
As indicated in
Note that at a scale of 1,000,000 images, the D-9 tree and the L-2000 tree yield similar performance, but the vocabulary size of the D-9 tree is 4 million words while that of the L-2000 tree is 0.4 million. This verifies the earlier conclusion that approximately half a million words is a suitable vocabulary size. The reason for the increase of SR@1 at 1,000,000 images might be that there are too many features and some of the features introduce noise, causing the vocabulary quality to degrade.
There are two factors that affect the accuracy: vocabulary size and vocabulary quality. Vocabulary quality refers to the extent to which a vocabulary effectively reflects the distribution of a feature space. As discussed above in connection with
Consider the case in which the image items of type A are more general (producing more variation) than the image items of type B. Comparing
Returning finally to
The approximate values of 100,000 and 500,000 are representative of one particular scenario associated with one particular environment. Other vocabulary characteristics may be appropriate for different respective scenarios and environments. In general, for instance, a source dataset size can be selected to correspond to an approximate transition point at which further increases in size do not yield significant increases in performance, relative to increases in size prior to the transition point. A vocabulary size can be selected to correspond to an approximate transition point at which further increases in word number do not yield significant increases in performance, relative to increases in number prior to the transition point. These transition points generally correspond to the leveling-off (or elbow) points in the performance vs. size graphs described herein.
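The transition-point selection described above can be automated with a simple heuristic; the gain threshold and the SR@1 values below are hypothetical assumptions, not measurements from this disclosure:

```python
def transition_point(sizes, performance, min_gain=0.01):
    """Return the smallest size after which the per-step performance
    gain drops below `min_gain` (a crude elbow heuristic)."""
    for i in range(1, len(performance)):
        if performance[i] - performance[i - 1] < min_gain:
            return sizes[i - 1]
    return sizes[-1]

vocab_sizes = [50_000, 100_000, 500_000, 1_000_000]
sr_at_1 = [0.60, 0.72, 0.74, 0.745]   # hypothetical SR@1 measurements
print(transition_point(vocab_sizes, sr_at_1))  # 500000
```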
Although not described in detail herein, it has been found that term-weighting considerations (such as the use of TF, IDF, and stop words) may improve retrieval performance in some scenarios, but these improvements are not great. Thus, the term-weighting considerations can optionally be omitted in certain cases. Omitting these considerations reduces the complexity of the calculations and reduces the associated time-related and memory-related costs.
C. Illustrative Data Processing Functionality
The processing functionality 1802 can include a processing module 1804 for implementing various processing functions. The processing module 1804 can include volatile and non-volatile memory, such as RAM 1806 and ROM 1808, as well as one or more processors 1810. The processing functionality 1802 can perform various operations identified above when the processor(s) 1810 executes instructions that are maintained by memory (e.g., 1806, 1808, or elsewhere). The processing functionality 1802 also optionally includes various media devices 1812, such as a hard disk module, an optical disk module, and so forth.
The processing functionality 1802 also includes an input/output module 1814 for receiving various inputs from the user (via input modules 1816), and for providing various outputs to the user (via output modules). One particular output mechanism may include a presentation module 1818 and an associated graphical user interface (GUI) 1820. The processing functionality 1802 can also include one or more network interfaces 1822 for exchanging data with other devices via one or more communication conduits 1824. One or more communication buses 1826 communicatively couple the above-described components together.
In closing, a number of features were described herein by first identifying illustrative problems that these features can address. This manner of explication does not constitute an admission that others have appreciated and/or articulated the problems in the manner specified herein. Appreciation and articulation of the problems present in the relevant art(s) is to be understood as part of the present invention.
More generally, although the invention has been described in language specific to structural features and/or methodological acts, it is to be understood that the invention defined in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as illustrative forms of implementing the claimed invention.
This Application claims priority to Provisional Application Ser. No. 60/891,662, filed on Feb. 26, 2007. The Provisional Application is incorporated herein by reference in its entirety.