Systems and methods herein generally relate to searching image sources, and more particularly to using image queries to search accumulations of stored images.
It is challenging to search accumulations of stored images because the images within such collections are often not organized or classified, and many times the images lack captions or other effective text descriptions. Additionally, user convenience is enhanced when a user is allowed to simply present an undescribed image as a query and the system automatically locates similar images to produce an answer to the query image.
Therefore, the task of image retrieval, when given a query image, is to retrieve all images relevant to that query within a potentially very large database of images. This task was initially tackled with bag-of-features representations, large vocabularies, and inverted files, and later with feature encodings such as the Fisher vector or VLAD descriptors; more recently, the retrieval task has benefited from the success of deep learning representations, such as convolutional neural networks, that were shown to be both effective and computationally efficient for this task. Among retrieval methods, many have focused on retrieving the exact same instance as in the query image, such as a particular landmark or a particular object.
Another group of methods has concentrated on retrieving semantically related images, where “semantically related” is understood as displaying the same object category or sharing a set of tags. However, these previous methods make the strong assumption that all categories or tags are known in advance, which does not hold for complex scenes.
Various methods herein automatically identify similar images within a training database (that has training images with human-supplied text captions). The similar images are identified by semantically matching the human-supplied text captions (for example, using a processor device electrically connected to an electronic computer storage device that stores the training database). For example, to identify similar images, the process of matching image pairs can be based on a threshold of similarity (e.g., using a hard separation strategy).
These methods also automatically train an image representation function. The image representation function is based on a deep network that transforms image data (and potentially captions) into vectorial representations in an embedding space. Further, the training modifies the weights of the deep network so that the image representation function will produce more similar vectors for similar images, and less similar vectors for dissimilar images, where the similar and dissimilar images are identified by leveraging the human-supplied text captions.
The process of identifying similar images produces matching image triplets consisting of a query image (sometimes also known as an anchor), a relevant image (chosen because it is similar to the query according to the captions), and a non-relevant image (dissimilar according to the captions). More specifically, the training process uses the processor to automatically select a similar image within the training database that is similar to one of the training images within the training database, select a dissimilar image within the training database that is not similar to that training image, and then automatically adjust the weights of the deep network so that the image representation function produces similar vectors for the similar image and the training image, and produces dissimilar vectors for the dissimilar image and the training image. The training repeats the processes of identifying the similar and dissimilar images based on textual captions and adjusting the weights of the image representation function, for thousands of other training image triplets. The image representations produced by the learned image representation function can be compared using distances such as the Euclidean distance or similarity functions such as the dot product.
At some point after training, these methods automatically apply the trained function to second images in a non-training second database to produce second vectors for the second images. The second database is stored in the same or different electronic computer storage device, and is different from the training database. These methods receive (e.g., into the same, or a different, processor device) a query image, with or without captions, and an instruction to find second images in the second database that match the query image. To find images that match the query image, these methods automatically (e.g., using the processor device) apply the trained function to the query image to produce a query vector. This allows these methods to automatically rank the second images based on how closely the second vectors match the query vector, using the processor device, and automatically output (e.g., from the processor device) the top ranking ones of the second images as a response to the query image.
Systems herein include, among other components, one or more electronic computer storage devices that store one or more training databases (having training images with human-supplied text captions) and non-training databases used for deployment, one or more processor devices electrically connected to the electronic computer storage device, one or more input/output devices electrically connected to the processor device, etc.
The processor devices automatically identify similar images within the training database by semantically matching the human-supplied text captions. For example, a process of matching image pairs based on a threshold of similarity (e.g., using a hard separation strategy) can be used to identify similar images.
The processor devices automatically train an image representation function, which processes image data (and potentially captions) into vectors. For example, the processor devices modify the weights of the deep network during training, so that the image representation function will produce more similar vectors for similar images, and less similar vectors for dissimilar images.
The process of identifying similar images produces matching image triplets consisting of a query image (sometimes also known as an anchor), a relevant image (chosen because it is similar to the query according to the captions), and a non-relevant image (dissimilar according to the captions). More specifically, the processor devices automatically select a similar image within the training database that is similar to one of the training images within the training database, select a dissimilar image within the training database that is not similar to that training image, and then automatically adjust the weights of the deep network, so that the image representation function produces similar vectors for the similar image and the training image, and produces dissimilar vectors for the dissimilar image and the training image. During training, the processor devices repeat the processes of identifying the similar and dissimilar images and adjusting the weights of the image representation function, for thousands of other training image triplets. The image representations produced by the learned image representation function can be compared using distances such as the Euclidean distance or similarity functions such as the dot product.
After training, the processor devices automatically apply the trained function to second images in a non-training second database to produce second vectors for the second images. For example, the second database may or may not have captions, can be stored in the same or different electronic computer storage devices, and is different from the training database because the second database is a live, actively used database.
The input/output devices will receive a query image (with or without captions) and an instruction to find the second images in the second database that match the query image. The processor devices automatically apply the trained function to the query image to produce a query vector. The processor devices then automatically rank the second images based on how closely the second vectors match the query vector. Finally, the input/output devices automatically output top ranking ones of the second images as a response to the query image.
These and other features are described in, or are apparent from, the following detailed description.
Various exemplary systems and methods are described in detail below, with reference to the attached drawing figures, in which:
The systems and methods described herein focus on the task of semantic retrieval on images that display realistic and complex scenes, where it cannot be assumed that all the object categories are known in advance, and where the interaction between objects can be very complex.
Following the standard image retrieval paradigm that targets efficient retrieval within databases of potentially millions of images, these systems and methods learn a global and compact visual representation tailored to the semantic retrieval task that, instead of relying on a predefined list of categories or interactions, implicitly captures information about the scene objects and their interactions. However, directly acquiring enough semantic annotations from humans to train such a model is not required. These methods use a similarity function based on captions produced by human annotators as a good computable surrogate of the true semantic similarity, and such a surrogate provides the information needed to learn a semantic visual representation.
This disclosure presents a model that leverages the similarity between human-generated region-level captions, i.e., privileged information available only at training time, to learn how to embed images in a semantic space, where the similarity between embedded images is related to their semantic similarity. Therefore, learning a semantic representation significantly improves over a model pre-trained on industry standard platforms.
Another variant herein leverages the image captions explicitly and learns a joint embedding for the visual and textual representations. This allows a user to add text modifiers to the query in order to refine the query or to adapt the results towards additional concepts.
For example, as shown in
One underlying visual representation is the ResNet-101 R-MAC network. This network is designed for retrieval and can be trained in an end-to-end manner. The methods herein learn the optimal weights of the model (the convolutional layers and the projections in the R-MAC pipeline) that preserve the semantic similarity. As a proxy for the true semantic similarity, these methods leverage the tf-idf-based BoW representation over the image captions. Given two images with captions, the methods herein define their proxy similarity s as the dot product between their tf-idf representations.
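For example, a non-limiting sketch of such a caption-based proxy similarity is shown below (written in Python and assuming scikit-learn's TfidfVectorizer is available; the example captions are hypothetical and are not part of any training database described herein):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Hypothetical example captions (one per image); in practice the captions
    # come from human annotators in the training database.
    captions = [
        "a man riding a horse on a beach",
        "a person rides a brown horse by the sea",
        "a bowl of fruit on a wooden table",
    ]

    # TfidfVectorizer l2-normalizes each row by default (norm='l2'), so a dot
    # product between rows acts as a cosine-style similarity in [0, 1].
    tfidf = TfidfVectorizer().fit_transform(captions)

    # Proxy semantic similarity s(i, j) = dot product of tf-idf representations.
    s = (tfidf @ tfidf.T).toarray()
    print(s[0, 1], s[0, 2])  # the two horse captions should score higher than horse vs. fruit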
To train the network, this disclosure presents a method to minimize the empirical loss of the visual samples over the training data. If q denotes a query image, d+ a semantically similar image to q, and d− a semantically dissimilar image, the empirical loss is defined as L = Σ_q Σ_{d+,d−} L_υ(q,d+,d−), where

L_υ(q,d+,d−) = ½ max(0, m − ϕ_q^T ϕ_+ + ϕ_q^T ϕ_−)   (Equation 1),

where m is the margin and ϕ: I→ℝ^D is the function that embeds the image into a vectorial space, i.e., the output of the model. In what follows, ϕ_q, ϕ_+, and ϕ_− denote ϕ(q), ϕ(d+), and ϕ(d−). The methods herein optimize this loss with a three-stream network using stochastic optimization (ADAM).
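By way of non-limiting illustration, the loss of Equation 1 could be sketched as follows (written here using PyTorch; the function name, margin value, and tensor names are assumptions for illustration only and are not prescribed by this disclosure):

    import torch

    def visual_triplet_loss(phi_q, phi_pos, phi_neg, m=0.1):
        """Sketch of Equation 1: L_v = 0.5 * max(0, m - phi_q.phi_+ + phi_q.phi_-).

        phi_q, phi_pos, phi_neg are the embedding vectors of the query, relevant,
        and non-relevant images (the outputs of the three streams); the margin
        value m used here is an arbitrary assumption."""
        sim_pos = torch.dot(phi_q, phi_pos)   # similarity to the relevant image
        sim_neg = torch.dot(phi_q, phi_neg)   # similarity to the non-relevant image
        return 0.5 * torch.clamp(m - sim_pos + sim_neg, min=0.0)

    # Usage sketch: the loss is summed over sampled (q, d+, d-) triplets and
    # minimized with a stochastic optimizer such as torch.optim.Adam.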
To select the semantically similar d+ and dissimilar d− images, a hard separation strategy was adopted. Similar to other retrieval works that evaluate retrieval without strict labels, the methods herein considered the nearest k neighbors of each query according to the similarity s as relevant, and the remaining images as irrelevant. This was helpful, as now the goal is to separate relevant images from irrelevant ones given a query, instead of producing a global ranking. In the experiments, the methods herein used k=32, although other values of k led to very similar results. Finally, note that the caption annotations are only needed at training time to select the image triplets, and are not needed at test time.
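One possible, non-limiting sketch of this hard separation strategy is shown below (the sampling scheme, number of triplets per query, and variable names are illustrative assumptions):

    import numpy as np

    def mine_triplets(s, k=32, n_triplets_per_query=5, rng=None):
        """Hard-separation sketch: for each query, the k most caption-similar
        images are treated as relevant and all remaining images as irrelevant."""
        rng = rng or np.random.default_rng(0)
        n = s.shape[0]
        triplets = []
        for q in range(n):
            order = np.argsort(-s[q])           # indices sorted by decreasing caption similarity
            order = order[order != q]           # drop the query itself
            relevant, irrelevant = order[:k], order[k:]
            for _ in range(n_triplets_per_query):
                d_pos = rng.choice(relevant)    # a semantically similar image
                d_neg = rng.choice(irrelevant)  # a semantically dissimilar image
                triplets.append((q, d_pos, d_neg))
        return triplets

As noted above, the caption-based similarity matrix s is only needed here, at training time, to mine the triplets.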
In the previous formulations, the disclosure only used the textual information (i.e., the human captions) as a proxy for the semantic similarity in order to build the triplets of images (query, relevant, and irrelevant) used in the loss function. The methods herein provide a way to leverage the text information in an explicit manner during the training process. This is done by building a joint embedding space for both the visual representation and the textual representation, using two newly defined losses that operate over the text representations associated with the images:
L_t1(q,d+,d−) = ½ max(0, m − ϕ_q^T θ_+ + ϕ_q^T θ_−)   (Equation 2), and

L_t2(q,d+,d−) = ½ max(0, m − θ_q^T ϕ_+ + θ_q^T ϕ_−)   (Equation 3),

As before, m is the margin, ϕ: I→ℝ^D is the visual embedding of the image, and θ: τ→ℝ^D is the function that embeds the text associated with the image into a vectorial space of the same dimensionality as the visual features. The methods herein define the textual embedding as θ(t)=Wt, where t is the l2-normalized tf-idf vector and W is a learned matrix that projects t into a space associated with the visual representation.
The goal of these two textual losses is to explicitly guide the visual representation towards the textual one, which is the more informative representation. In particular, the loss in Equation 2 enforces that text representations can be retrieved using the visual representation as a query, implicitly improving the visual representation, while the loss in Equation 3 ensures that image representations can be retrieved using the textual representation, which is particularly useful if text information is available at query time. All three losses (the visual and the two textual ones) can be learned simultaneously using a siamese network with six streams—three visual streams and three textual streams. Interestingly, by removing the visual loss (Eq. (1)) and keeping only the joint losses (particularly Eq. (2)), one recovers a formulation similar to popular joint embedding methods such as WSABIE or DeViSE. In this case, however, retaining the visual loss is important as the methods herein target a query-by-image retrieval task, and removing the visual loss leads to inferior results.
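A non-limiting sketch of how the joint embedding and the three losses could be realized follows (the projection module, margin value, and equal weighting of the losses are assumptions for illustration, not requirements of this disclosure):

    import torch
    import torch.nn as nn

    class TextEmbedding(nn.Module):
        """Sketch of theta(t) = W t: projects an l2-normalized tf-idf vector
        into the D-dimensional space shared with the visual embedding."""
        def __init__(self, vocab_size, dim):
            super().__init__()
            self.W = nn.Linear(vocab_size, dim, bias=False)  # the learned matrix W

        def forward(self, t):
            return self.W(t)

    def joint_loss(phi_q, phi_p, phi_n, th_q, th_p, th_n, m=0.1):
        """Sum of the visual loss (Eq. 1) and the two textual losses (Eqs. 2-3)
        for one six-stream example (three visual streams, three textual streams).
        The margin value and equal weighting are illustrative assumptions."""
        def hinge(sim_pos, sim_neg):
            return 0.5 * torch.clamp(m - sim_pos + sim_neg, min=0.0)
        l_v  = hinge(torch.dot(phi_q, phi_p), torch.dot(phi_q, phi_n))  # Eq. 1
        l_t1 = hinge(torch.dot(phi_q, th_p),  torch.dot(phi_q, th_n))   # Eq. 2
        l_t2 = hinge(torch.dot(th_q, phi_p),  torch.dot(th_q, phi_n))   # Eq. 3
        return l_v + l_t1 + l_t2

Consistent with the discussion above, dropping l_v and keeping only the joint terms in this sketch would recover a WSABIE/DeViSE-like formulation, whereas retaining l_v keeps the query-by-image retrieval goal explicit.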
The following validates the representations produced by the semantic embeddings on the semantic retrieval task and quantitatively evaluates them in two different scenarios. In the first, an evaluation determines how well the learned embeddings are able to reproduce the semantic similarity surrogate based on the human captions. In the second, the models are evaluated using some triplet-ranking annotations acquired from users, by comparing how well the visual embeddings agree with the human decisions on all these triplets. This second scenario also considers the case where text is available at test time, showing how, by leveraging the joint embedding, the results retrieved for a query image can be altered or refined using a text modifier.
The models were benchmarked with two metrics that evaluated how well they correlated with the tf-idf proxy measure, which is the task the methods herein optimized for, as well as with the user agreement metric. Although the latter corresponded to the exact task that the methods herein wanted to address, the metrics based on the tf-idf similarity provided additional insights about the learning process and allowed cross-validation of the model parameters. The approach was evaluated using normalized discounted cumulative gain (NDCG) and Pearson's correlation coefficient (PCC). Both measures are designed to evaluate ranking tasks. PCC measures the correlation between ground-truth and predicted ranking scores, while NDCG can be seen as a weighted mean average precision, where every item has a different relevance, which in this case is the relevance of one item with respect to the query, measured as the dot product between their tf-idf representations.
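For illustration only, simplified sketches of these two metrics are shown below (the exact discounting and truncation conventions used in the experiments may differ; the function names are hypothetical):

    import numpy as np
    from scipy.stats import pearsonr

    def ndcg_at_r(retrieved_rel, R):
        """NDCG@R sketch: the graded relevance of each retrieved item is its
        tf-idf dot product with the query; 'retrieved_rel' is in retrieval order."""
        rel = np.asarray(retrieved_rel, dtype=float)
        discounts = 1.0 / np.log2(np.arange(2, R + 2))       # positions 1..R
        dcg = np.sum(rel[:R] * discounts)
        idcg = np.sum(np.sort(rel)[::-1][:R] * discounts)    # ideal ordering
        return dcg / idcg if idcg > 0 else 0.0

    def pcc(predicted_scores, ground_truth_scores):
        """PCC sketch: linear correlation between predicted and ground-truth scores."""
        return pearsonr(predicted_scores, ground_truth_scores)[0]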
To evaluate the method, a second database of ten thousand images was used, of which the first one thousand served as queries. The query image is removed from the results. Finally, because of particular interest in the top results, results using the full list of 10k retrieved images were not reported. Instead, NDCG and PCC were reported after retrieving the top R results, for different values of R, and the results were plotted.
Different versions of the embedding were evaluated. Each version is denoted herein by a tuple of the form ({V, V+T}, {V, V+T}). The first element denotes whether the model was trained using only visual embeddings (V), as shown in Equation 1, or joint visual and textual embeddings (V+T), as shown in Equations 1-3. The second element denotes whether, at test time, one queries only with an image, using its visual embedding (V), or with an image and text, using its joint visual and textual embedding (V+T). In all cases, the database consists only of images represented with visual embeddings, with no textual information.
This approach was compared to the ResNet-101 R-MAC baseline, pre-trained on ImageNet, with no further training, and to a WSABIE-like model, that seeks a joint embedding optimizing the loss in Equation 2, but does not explicitly optimize the visual retrieval goal of Equation 1.
The following discusses the effect of training in the task of simulating the semantic similarity surrogate function and
A first observation is that all forms of training improve over the ResNet baseline. Of these, WSABIE obtains the smallest improvement, as it does not directly optimize the retrieval end goal and only focuses on the joint embedding. All methods that optimize the end goal obtain significantly better accuracies. A second observation is that, when the query consists only of one image, training the model to explicitly leverage the text embeddings (models denoted with (V+T, V)) does not seem to bring a noticeable quantitative improvement over (V, V). However, this allows one to query the dataset using both visual and textual information (V+T, V+T). Using the text to complement the visual information of the query leads to significant improvements.
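Although this disclosure does not prescribe any particular way of combining the visual and textual query embeddings, one simple illustrative sketch, assuming a weighted sum of the l2-normalized embeddings in the joint space, is:

    import numpy as np

    def combined_query(phi_img, theta_text, alpha=0.5):
        """Hypothetical (V+T) query sketch: blend the image embedding phi(q) and
        the text embedding theta(q), which live in the same joint space.
        The blending weight alpha is an assumption, not part of the disclosure."""
        phi = phi_img / np.linalg.norm(phi_img)
        theta = theta_text / np.linalg.norm(theta_text)
        q = alpha * phi + (1.0 - alpha) * theta
        return q / np.linalg.norm(q)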
Table 1 (shown above) shows the results of evaluating the methods on the human agreement score and shows the comparison of the methods and baselines evaluated according to User-study (US) agreement score and area under curve (AUC) of the NDCG and PCC curves (i.e. NDCG AUC and PCC AUC). As with NDCG and PCC, learning the embeddings brings substantial improvements in the user agreement score. In fact, most trained models actually outperform the score of the tf-idf over human captions, which was used as a “teacher” to train the model, following the learning with privileged information terminology. The model leverages both the visual features as well as the tf-idf similarity during training, and, as such, it is able to exploit the complementary information that they offer. Using text during testing does not seem to help on the user agreement task, but does bring considerable improvements in the NDCG and PCC metrics. However, having a joint embedding can be of use, even if quantitative results do not improve, for instance for refining the query, see
In
These methods also automatically train an image representation function, as shown in item 302. The image representation function processes image data (potentially in combination with the captions) into vectors. Further, the training in item 302 modifies the weights of the image representation function so that the image representation function will produce more similar vectors for similar images, and less similar vectors for dissimilar images (for example, again using the processor device).
The process of identifying similar images in item 300 produces matching image pairs, so the training in item 302 can be performed using such matching image pairs. More specifically, the training process in item 302 uses the processor to automatically select a similar image within the training database that is similar to one of the training images within the training database, select a dissimilar image within the training database that is not similar to that training image, and then automatically adjust the weights of the image representation function so that the image representation function produces similar vectors for the similar image and the training image, and produces dissimilar vectors for the dissimilar image and the training image. The training in item 302 repeats the processes of identifying the similar and dissimilar images and adjusting the weights of the image representation function, for thousands of other training images. The image representation function that is trained to produce the similar vectors for the similar images comprises a “trained function.”
At some point after training, these methods automatically apply the trained function to second images in a non-training second database to produce second vectors for the second images, as shown in item 304. The second database is stored in the same or different electronic computer storage device, and is different from the training database. As shown in item 306, these methods receive (e.g., into the same, or a different, processor device) a query image, with or without captions, and an instruction to find second images in the second database that match the query image. To find images that match the query image, these methods automatically (e.g., using the processor device) apply the trained function to the query image to produce a query vector, in item 308. This allows these methods, in item 310, to automatically rank the second images based on how closely the second vectors match the query vector, using the processor device, and automatically output (e.g., from the processor device) the top ranking ones of the second images as a response to the query image, in item 312.
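For example, items 308-312 could be sketched as follows (a non-limiting illustration; the function and variable names are hypothetical, and the dot product is used as the similarity, consistent with the comparison functions described above):

    import numpy as np

    def rank_database(query_vector, second_vectors, top_n=10):
        """Sketch of items 308-312: score each second image by the dot product
        between its precomputed second vector and the query vector, then return
        the indices and scores of the top-ranking second images."""
        scores = second_vectors @ query_vector   # one similarity score per second image
        order = np.argsort(-scores)              # sort by descending similarity
        return order[:top_n], scores[order[:top_n]]

    # Usage sketch (shapes assumed): second_vectors has one row per second image,
    # and query_vector is produced by applying the trained function to the query image.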
The hardware described herein plays a significant part in permitting the foregoing method to be performed, rather than functioning solely as a mechanism for permitting a solution to be achieved more quickly (i.e., through the utilization of a computer for performing calculations). As would be understood by one ordinarily skilled in the art, the processes described herein cannot be performed by a human alone (or one operating with a pen and a pad of paper) and instead such processes can only be performed by a machine (especially when the volume of data being processed, and the speed at which such data needs to be evaluated, is considered). For example, if one were to manually attempt to adjust a vector-producing function, the manual process would be sufficiently inaccurate and take an excessive amount of time so as to render the manual classification results useless. Specifically, processes such as applying thousands of training images to train a function, calculating vectors of non-training images using the trained function, electronically storing revised data, etc., require the utilization of different specialized machines, and humans performing such processing would not produce useful results because of the time lag, inconsistency, and inaccuracy humans would introduce into the results.
Further, such machine-only processes are not mere “post-solution activity” because the methods utilize machines at each step and cannot be performed without machines. The function training processes, and the processes of using the trained function to embed vectors, are integral with the process performed by the methods herein, and are not mere post-solution activity, because the methods herein rely upon the training and vector embedding and cannot be performed without such electronic activities. In other words, these various machines are integral with the methods herein because the methods cannot be performed without the machines (and cannot be performed by humans alone).
Additionally, the methods herein solve many highly complex technological problems. For example, as mentioned above, human image classification is slow and very user intensive; and further, automated systems that ignore human image classification suffer from accuracy loss. Methods herein solve this technological problem by training a function using a training set that includes human image classification. In doing so, the methods and systems herein greatly encourage the user to conduct image searches without the use of captions, thus allowing users to perform searches that machines were not capable of performing previously. By granting such benefits, the systems and methods herein solve a substantial technological problem that users experience today.
As shown in
The input/output device 214 is used for communications to and from the computerized device 200 and comprises a wired device or wireless device (of any form, whether currently known or developed in the future). The tangible processor 216 controls the various actions of the computerized device. A non-transitory, tangible, computer storage medium device 210 (which can be optical, magnetic, capacitor based, etc., and is different from a transitory signal) is readable by the tangible processor 216 and stores instructions that the tangible processor 216 executes to allow the computerized device to perform its various functions, such as those described herein. Thus, as shown in
The one or more printing engines 240 are intended to illustrate any marking device that applies a marking material (toner, inks, etc.) to continuous media or sheets of media, whether currently known or developed in the future and can include, for example, devices that use a photoreceptor belt or an intermediate transfer belt, or devices that print directly to print media (e.g., inkjet printers, ribbon-based contact printers, etc.).
Therefore, as shown above, systems herein include, among other components, one or more electronic computer storage devices 210 that store one or more training databases (having training images with human-supplied text captions) and non-training databases, one or more processor devices 224 electrically connected to the electronic computer storage device, one or more input/output devices 214 electrically connected to the processor device, etc.
The processor devices 224 automatically identify similar images within the training database by semantically matching the human-supplied text captions. For example, a process of matching image pairs based on a threshold of similarity (e.g., using a hard separation strategy) can be used to identify similar images.
The processor devices 224 automatically train an image representation function, which processes image data (and potentially captions) into vectors. For example, the processor devices 224 modify the weights of the image representation function during training, so that the image representation function will produce more similar vectors for similar images, and less similar vectors for dissimilar images.
The process of identifying similar images produces matching image pairs, so the training can be performed using such matching image pairs. More specifically, the processor devices 224 automatically select a similar image within the training database that is similar to a training image within the training database, select a dissimilar image within the training database that is not similar to the training image, and then automatically adjust the weights of the image representation function, so that the image representation function produces similar vectors for the similar image and the training image, and produces dissimilar vectors for the dissimilar image and the training image. During training, the processor devices 224 repeat the processes of identifying the similar and dissimilar images and adjusting the weights of the image representation function, for thousands of other training images. The image representation function that is trained to produce the similar vectors for the similar images comprises a “trained function.”
After training, the processor devices 224 automatically apply the trained function to second images in a non-training second database to produce second vectors for the second images. For example, the second database may or may not have captions, can be stored in the same or different electronic computer storage devices, and is different from the training database because the second database is a live, actively used database.
The input/output devices 214 will receive a query image (with or without captions) and an instruction to find the second images in the second database that match the query image. The processor devices 224 automatically apply the trained function to the query image to produce a query vector. The processor devices 224 then automatically rank the second images based on how closely the second vectors match the query vector. Finally, the input/output devices 214 automatically output top ranking ones of the second images as a response to the query image.
While some exemplary structures are illustrated in the attached drawings, those ordinarily skilled in the art would understand that the drawings are simplified schematic illustrations and that the claims presented below encompass many more features that are not illustrated (or potentially many less) but that are commonly utilized with such devices and systems. Therefore, Applicants do not intend for the claims presented below to be limited by the attached drawings, but instead the attached drawings are merely provided to illustrate a few ways in which the claimed features can be implemented.
Many computerized devices are discussed above. Computerized devices that include chip-based central processing units (CPU's), input/output devices (including graphic user interfaces (GUI), memories, comparators, tangible processors, etc.) are well-known and readily available devices produced by manufacturers such as Dell Computers, Round Rock Tex., USA and Apple Computer Co., Cupertino Calif., USA. Such computerized devices commonly include input/output devices, power supplies, tangible processors, electronic storage memories, wiring, etc., the details of which are omitted herefrom to allow the reader to focus on the salient aspects of the systems and methods described herein. Similarly, printers, copiers, scanners and other similar peripheral equipment are available from Xerox Corporation, Norwalk, Conn., USA and the details of such devices are not discussed herein for purposes of brevity and reader focus.
The terms printer or printing device as used herein encompasses any apparatus, such as a digital copier, bookmaking machine, facsimile machine, multi-function machine, etc., which performs a print outputting function for any purpose. The details of printers, printing engines, etc., are well-known and are not described in detail herein to keep this disclosure focused on the salient features presented. The systems and methods herein can encompass systems and methods that print in color, monochrome, or handle color or monochrome image data. All foregoing systems and methods are specifically applicable to electrostatographic and/or xerographic machines and/or processes.
Further, the terms automated or automatically mean that once a process is started (by a machine or a user), one or more machines perform the process without further input from any user. In the drawings herein, the same identification numeral identifies the same or similar item.
It will be appreciated that the above-disclosed and other features and functions, or alternatives thereof, may be desirably combined into many other different systems or applications. Various presently unforeseen or unanticipated alternatives, modifications, variations, or improvements therein may be subsequently made by those skilled in the art which are also intended to be encompassed by the following claims. Unless specifically defined in a specific claim itself, steps or components of the systems and methods herein cannot be implied or imported from any above example as limitations to any particular order, number, position, size, shape, angle, color, or material.