The disclosed subject matter relates to mobile visual search technology, and more particularly to an automated Active Query Sensing system for sensing surrounding scenes or different views of objects while forming a subsequent search query.
As mobile handheld devices have become pervasive, new technologies such as mobile media search have emerged. One such example includes systems for searching information about products, locations, or landmarks by taking a photograph of the target of interest near the mobile user. The captured image can be used as a query, sent via the mobile network to a server in which reference images of the candidate objects or locations are matched in order to recognize the true target. Such mobile visual search functionalities have been used in commercial systems, such as “snaptell” (http://www.snaptell.com), Nokia “point and find” (http://www.pointandfind.nokia.com), “kooaba” (http://www.kooaba.com), as well as others. Taking mobile location search as an example, certain mobile location search systems can offer features and services complementary to GPS or network-based localization, in part because the recognized location in mobile location search systems can be more precise, and no satellite or cellular network infrastructure is needed. Similar cases also exist in other related scenarios such as mobile product search and mobile poster search.
Certain mobile visual search systems are built on image matching, the success of which can depend on several factors, including the separability of image content associated with different targets (inter-class distance), the divergence of content among reference images of the same target (within-class variation), and the distortion added to the query image during the mobile imaging process. Ideally, every reference image of a target could be used as a query to successfully recognize the true target and properly reject other targets. In practice, however, such systems can suffer from unreliability and failed queries, which can result in user frustration.
Systems and methods for automatically determining an improved view for a visual query in a mobile search are disclosed herein.
In some embodiments, methods for automatically determining an improved view for a visual query in a mobile search system include obtaining at least one result data set based on a prior visual query, wherein the at least one result data set includes at least a top result and one or more other results; retrieving at least one distinctiveness measurement for one or more views of one or more objects in the at least one result data set; and determining the improved view based on the retrieved at least one distinctiveness measurement.
In certain embodiments, determining the improved view utilizes an information gain maximization process and/or a majority voting process. In further optional embodiments, the method can include estimating a viewing angle of the prior query and determining outlier results, if any, from the at least one result data set based on the estimated viewing angle of the prior query, and removing from the at least one result data set the determined outlier results using machine learning classification (e.g., Support Vector Machine) and/or local feature matching based image alignment. In certain embodiments, the estimate is refined based on machine learning classification using image matching results from comparing the prior query and one or more results in the at least one result data set.
In other embodiments, the method can include providing a suggested view change for the current visual query to a mobile search system user based on the difference between the viewing angle of the prior query and the determined improved view. The method can also include providing images of one or more views of one or more results from the at least one result data set, and/or prompting the user to indicate whether the improved view is to be determined. In some embodiments, the retrieving and determining can be initiated after the user has indicated that the improved view is to be determined. In other embodiments, the retrieving and determining can be initiated independent of whether the user has indicated that the improved view is to be determined, and the suggested view change can be provided after the user has indicated that the improved view is to be determined.
In some embodiments, the distinctiveness measurements of at least one view for the one or more results in the at least one result data set are pre-computed using one or both of content based view distinctiveness prediction and training performance based view distinctiveness prediction.
Some embodiments involve visual location queries. In such embodiments, the method can further include removing from the at least one result data set the top result and all other results geographically close to the top result, wherein the top result of the at least one result data set is deemed incorrect.
In other embodiments, non-transitory computer-readable media have a set of instructions programmed to perform the methods for automatically determining an improved view for a visual query in a mobile search system described above.
Another embodiment provides an active query sensing system for automatically determining an improved current view for a visual query in a mobile search. The system can include a mobile visual search device configured to obtain at least one result data set based on a prior visual query, where the data set includes at least a top result and one or more other results. The system can also include a determination module, configured to retrieve at least one distinctiveness measurement for each of the results in the data set, and determine the improved view based on the retrieved distinctiveness measurement. The system can also include a user interface module coupled to the determination module, configured to provide images of one or more views of results from the data set. The system can also include a distinctive view learning module, coupled to the determination module, and configured to pre-compute the distinctiveness measurements using one or both of content based view distinctiveness prediction and training performance based view distinctiveness prediction.
Embodiments of the disclosed subject matter can automatically determine a suggested improved view angle for a visual query in a mobile search system after a user has indicated that the top returned result of an initial search list of candidate locations is incorrect. The view angle of the initial query can be estimated based on information in the query image. An offline distinctive-view (i.e., the view from a given location most likely to return a correct search result) learning system analyzes images of known locations to determine a most recognizable viewing angle for such locations. The most distinctive view for each remaining candidate location can be retrieved from the offline distinctive-view learning system and majority voting can be performed to determine the likely most distinctive view for the search location. This can then be compared with the estimated actual view to provide the user with suggested instructions (e.g., turn right 90 degrees) for improving the chances of subsequent query success over random view selection.
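By way of illustration, the overall online suggestion flow described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the search_engine and view_learner objects and all method names are hypothetical placeholders, and views are assumed to be represented as numeric angles.

```python
# A minimal sketch of the online AQS suggestion flow. All names are
# hypothetical placeholders, not part of the disclosure.

def suggest_view_change(query_image, search_engine, view_learner, n_candidates=10):
    """Suggest a view change after the user rejects the top result."""
    # 1. Use the failed query as a probe to obtain candidate locations.
    candidates = search_engine.top_locations(query_image, n=n_candidates)

    # 2. Drop the rejected top result and locations geographically close to it
    #    (is_near is an assumed helper on the candidate object).
    rejected = candidates[0]
    candidates = [c for c in candidates[1:] if not rejected.is_near(c)]

    # 3. Estimate the viewing angle of the failed query.
    current_view = view_learner.estimate_view(query_image, candidates)

    # 4. Retrieve the pre-computed most distinctive view of each candidate
    #    and take a majority vote.
    votes = [view_learner.distinctive_view(c) for c in candidates]
    suggested_view = max(set(votes), key=votes.count)

    # 5. The angular difference translates into a user instruction,
    #    e.g., +90 degrees -> "turn right 90 degrees".
    return suggested_view - current_view
```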
Although the descriptions herein are focused on using mobile visual search to determine locations, the disclosed subject matter can be applied more generally to search various objects and scenes in the physical world, including but not limited to products and 3D objects, in addition to geographical locations.
Different views of 3D objects, locations, etc. produce images with varying degrees of distinctiveness, i.e., the degree to which a particular object or location can be identified based on the information contained in the image. As used herein, the term “view” refers to the orientation, scale, and location for taking a picture as a visual query input.
When these types of searches fail, incorrect locations with visually similar appearances are typically returned as the top match. This performance appears to be generally consistent with the modest accuracy (0.4-0.7 average precision) reported in some prior art systems for mobile location search.
The disclosed subject matter focuses on a novel aspect of improving the mobile visual search experience based on the existence of unique preferred views for successful recognition of each target location. For example, some views of a location contain unique “signature” attributes that are distinctively different from those of other locations. Other views can contain common objects (trees, walls, etc.) that are much less distinctive. When searching for specific targets, queries using such unique preferred views typically lead to much more robust recognition results. To this end, the disclosed subject matter includes an automated Active Query Sensing (“AQS”) system to automatically determine an improved view for visual sensing when formulating a visual query.
Fully automatic AQS can be difficult to achieve for the initial query when the user location is initially unknown to the system. In such cases, location-specific information helpful for determining an improved view for a visual query can initially be unavailable. Although some prior information (e.g., GPS data, base station tags, previously identified locations, and trajectories) can be available for predicting the likely current location, that information is not always reliable. The disclosed subject matter can improve the success rate of subsequent queries when prior queries have failed. Specifically, the disclosed AQS system can include two components: offline distinctive view learning and online active query sensing.
First, the disclosed systems provide automatic methods for assessing the “distinctiveness” of views associated with a location. Such distinctiveness measures are determined using an offline analysis of the matching scores between a given view and other images (including those of the same location and different locations), unique image features contained in the view, or combinations thereof. The proposed distinctiveness measure can provide much more reliable predictions about improved query views for mobile location recognition, compared with alternatives using random selection or the dominant view.
Second, the disclosed systems can use the prior query as a “probe” to narrow down the search space and form a small set of candidate locations, to which the prior query is aligned. The optimal view change (e.g., turn to the right of the prior query view) is then estimated in order to predict an improved or best view for the next query.
The most likely view of the current query image 318 can be estimated 320 through coarse classification of the query image to some of the predefined views (e.g., side, front-back) and then refined by aligning the query image to the panorama or the multi-view image set associated with each of the top candidate locations. Such an alignment process can also be used to filter out outlier locations 322 that are inconsistent with the majority of the candidate locations in terms of the prediction of the current query view.
The filtered candidate location set can then be used 324 to retrieve 325 the distinctive views 326 associated with each possible location 328, which have been pre-computed offline 330. A majority voting process 332 can be used to determine the suggested view 334 for the next query. The difference between the suggested query view and the predicted current view can be used 335 to suggest 336 a view change 338 to the user 310, who can then turn the mobile location search device 306 according to the suggested change 338, and submit the next query.
Optionally, the initial set of candidate locations 316 can be reduced by removing the top returned location deemed incorrect by the user and/or locations nearby 340. Such locations can be the duplicates of or closely related to the incorrect location, and therefore unlikely to be correct locations.
In an example embodiment, approximate image matching is achieved in a million-scale image database using bag of visual words (“BoW”) with an inverted indexing technique. The embodiment uses a hierarchical tree based structure for efficient codebook construction and visual local feature quantization. In addition, multi-path search and spatial verification can be incorporated to improve accuracy.
Local Feature Extraction:
Both interest point detection and dense sampling can be used in building the search system. The former can be based on Difference of Gaussian (“DoG”), while the latter can be based on a multi-scale sliding window with three scales and fixed steps, producing approximately equal numbers of local features.
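The following Python sketch, using OpenCV, illustrates the two extraction strategies (DoG interest points and a fixed-step, three-scale dense grid). The step and scale values are illustrative assumptions, not values from the disclosure.

```python
import cv2

def extract_local_features(image_path):
    """DoG interest points plus a dense multi-scale grid (a rough sketch)."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()

    # Interest point detection via Difference of Gaussian (the SIFT detector).
    dog_kps = sift.detect(img, None)

    # Dense sampling: a fixed-step grid at three scales (illustrative values).
    dense_kps = [cv2.KeyPoint(float(x), float(y), float(size))
                 for size in (16, 24, 32)               # three scales
                 for y in range(0, img.shape[0], 16)    # fixed step
                 for x in range(0, img.shape[1], 16)]

    # One 128-D SIFT descriptor per keypoint, from both strategies.
    _, descriptors = sift.compute(img, dog_kps + dense_kps)
    return descriptors
```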
Local Feature Clustering:
Hierarchical K-Means Clustering can be used to build the million-scale visual vocabulary. There are two basic settings in building the Vocabulary Tree: (1) the Branching Factor B controls how many clusters are built to partition a given set of local features into the lower hierarchy, and (2) the Hierarchical Layer H controls the number of hierarchical layers in the tree. There is a tradeoff between speed and quantization accuracy in choosing different B and H values. In an example configuration, with empirical validation, B=10 and H=6 can be set to construct a final codebook of approximately 1 million codewords.
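A minimal sketch of hierarchical vocabulary construction with B=10 and H=6 follows. It uses scikit-learn's KMeans purely for illustration; a production million-word tree requires a far more scalable implementation. The descriptor input is assumed to be a NumPy array of SIFT vectors.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary_tree(descriptors, branching=10, depth=6):
    """Recursively partition descriptors into B clusters per level.
    Leaves act as visual words; B=10, H=6 gives up to 10^6 codewords."""
    node = {"centers": None, "children": []}
    if depth == 0 or len(descriptors) < branching:
        return node                       # leaf node: one visual word
    km = KMeans(n_clusters=branching, n_init=3).fit(descriptors)
    node["centers"] = km.cluster_centers_
    for b in range(branching):
        subset = descriptors[km.labels_ == b]
        node["children"].append(
            build_vocabulary_tree(subset, branching, depth - 1))
    return node
```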
Quantization:
Soft quantization can improve the retrieval precision.
In an example configuration, soft quantization performs better when the codebook size is small. But as the codebook size increases, the performance of soft quantization degrades, and it is outperformed by hard quantization. Greedy N-Best Path (“GNP”) can be used to reduce quantization errors by searching multiple paths over the quantization tree. By using GNP with 10 paths, a further gain in average precision (2%) can be achieved.
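A sketch of the GNP idea over the tree structure built above: rather than following the single closest branch at each level, the N best branches are kept. The node layout matches the hypothetical build_vocabulary_tree above.

```python
import numpy as np

def gnp_quantize(descriptor, tree, n_paths=10):
    """Greedy N-Best Path: keep the n_paths closest branches at each level
    instead of a single greedy branch, reducing quantization error."""
    frontier = [(0.0, tree)]  # (distance to its center, node), starting at root
    while any(node["children"] for _, node in frontier):
        scored = []
        for dist, node in frontier:
            if not node["children"]:          # already a leaf: keep as-is
                scored.append((dist, node))
                continue
            dists = np.linalg.norm(node["centers"] - descriptor, axis=1)
            scored.extend(zip(dists, node["children"]))
        scored.sort(key=lambda t: t[0])       # sort by distance only
        frontier = scored[:n_paths]
    return [node for _, node in frontier]     # candidate leaf visual words
```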
Spatial Verification:
Spatial matching can also be incorporated into the image matching. Using spatial matching, a point in one image is considered to match a point in another image only if a sufficient number of nearby local features are also matched.
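A brute-force sketch of this neighborhood-consistency check follows; the radius and support threshold are illustrative assumptions, not values from the disclosure. Keypoint positions are assumed to be (x, y) arrays.

```python
import numpy as np

def spatially_verified(matches, kps_q, kps_r, radius=30.0, min_support=3):
    """Keep a match only if enough of its spatial neighbors also match.
    matches: list of (query_idx, ref_idx); kps_q/kps_r: (N, 2) arrays."""
    verified = []
    for i, (qi, ri) in enumerate(matches):
        support = 0
        for j, (qj, rj) in enumerate(matches):
            if i == j:
                continue
            # a supporting match must be nearby in BOTH images
            if (np.linalg.norm(kps_q[qi] - kps_q[qj]) < radius and
                    np.linalg.norm(kps_r[ri] - kps_r[rj]) < radius):
                support += 1
        if support >= min_support:
            verified.append((qi, ri))
    return verified
```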
Inverted Indexing:
Finally, a Histogram Intersection Kernel (“HIK”) can be implemented in the image matching system together with inverted indexing to ensure scalability to the million-scale database. Once a certain number of local descriptors from a query are assigned to a given visual word, all images indexed by this visual word can be assigned a corresponding score.
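The following sketch illustrates inverted-index scoring with a histogram-intersection style contribution. The index layout (word mapping to a list of (image_id, count) pairs) is an assumed, simplified representation.

```python
from collections import defaultdict

def score_images(query_words, inverted_index, idf):
    """Accumulate per-image scores via the inverted index: every image
    indexed under a visual word hit by the query receives a contribution."""
    scores = defaultdict(float)
    for word, q_count in query_words.items():      # word -> count in query
        for image_id, r_count in inverted_index.get(word, []):
            # histogram-intersection style contribution, IDF weighted
            scores[image_id] += idf[word] * min(q_count, r_count)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```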
The final one million codeword configuration of an example embodiment of the disclosed subject matter is shown in Table 1. The final system achieves an average precision of 0.7961 on a validation set using the reference images as queries.
The disclosed AQS system can be used independently of the image matching subsystem if the query view is location dependent.
Offline Distinctive View Learning
As discussed above, each location has certain views that are more distinctive and can be used for successful retrieval.
Two approaches can be used to pre-compute distinctive views of a given set of locations: content-based view distinctiveness prediction and training performance based view distinctiveness prediction. The former explores the unique attributes contained in each view, such as distinct objects, local features, etc., while the latter predicts the test performance by assessing the query effectiveness over a training data set. Embodiments of the disclosed subject matter can incorporate either or both. Note that in the discussion below, it is assumed that the continuous space of views can be appropriately discretized, for example, to a finite set of choices (e.g., the six angles used in the NAVTEQ data set). Other discretizations can be used and will be apparent to a person of ordinary skill in the art.
Content Based View Distinctiveness Prediction:
With the BoW representation, distinctive visual words typically have a better discriminating power than words that appear frequently in the databases. A word can be considered more distinctive if its frequency of occurrence in the database images (documents) is low. Extending this concept, a TF-IDF related content-based feature can be defined as follows:
F(k) = count(word_i | IDF(word_i) > (k/K) × IDF_max),  (1)
where k = 1, 2, . . . , K−1. If K is set to 10, then the above feature counts the number of visual words whose IDF exceeds certain thresholds (from 10% to 90% of IDF_max, in 10% increments). As a result, images of distinctive views will have more words with high IDF than other images.
A Support Vector Machine (“SVM”) based classifier can be trained and its classification score used to predict the distinctiveness of an image. For example, a subset of geo-tagged locations sampled from Google Street View can be used as a labeled training set to train the SVM classifier. Since the feature dimension is kept low (10 if K=10), a training set of such a size is adequate.
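A sketch of the content-based feature of Equation 1 and its use with an SVM follows. Here idf is assumed to be an array of per-word IDF values indexed by visual word id, and the training data in the trailing comment is hypothetical. Note that a literal reading of k = 1 . . . K−1 yields K−1 feature dimensions.

```python
import numpy as np
from sklearn.svm import SVC

def content_feature(word_ids, idf, K=10):
    """Equation 1: F(k) counts the words in an image whose IDF exceeds
    (k/K) of the maximum IDF, for k = 1 .. K-1."""
    idf_max = idf.max()
    word_idf = idf[np.asarray(word_ids)]   # IDF of each word in the image
    return np.array([(word_idf > k / K * idf_max).sum() for k in range(1, K)])

# Hypothetical training: per-image word-id lists and distinctiveness labels.
# clf = SVC(probability=True).fit(
#     [content_feature(w, idf) for w in image_word_lists], labels)
```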
Training Performance Based View Distinctiveness Prediction:
Each location is associated with a finite set of reference images captured in different views. Each of the reference images can be used to query the database and evaluate its capability in retrieving related images of the same location, or other locations sharing overlapped scenes. Although there can be a gap between such training performance and the real test performance when querying by new images that have not been seen before, the score distributions of relevant (positive) images and irrelevant (negative) images can serve as an approximate measure.
An ideal score distribution is one that has maximal separation between the scores of the positive results and those of the negative ones. One measure that can be used to assess the query effectiveness of a view is the Average Precision (“AP”):

AP = (1/N_relevant) × Σ_{r=1}^{N_relevant} P(r),  (2)
where N_relevant is the number of documents relevant to the current query; r indexes the rth relevant document; and P(r) is the precision at the cut-off rank of document r. In the literature, there are some subtle variations in the definition of AP. The one used above is also called full-length AP.
The other method, called Saliency and defined below, is similar to AP with several modifications. First, the ratio of the positive score statistics to that of the negative scores is computed. Second, the actual score values are incorporated in the measure. These modifications capture the score separation between the positive and negative classes:

Saliency = [ Σ_{j=1}^{N} (1/j) Σ_{i=1}^{j} score(i) · rel(i) ] / [ Σ_{j=1}^{N} (1/j) Σ_{i=1}^{j} score(i) · (1 − rel(i)) ],  (3)
where N is the number of returned locations, which can be a fixed size or adjusted based on the number of positive samples; score(j) is the location matching score, which is the maximal score over its six views; and rel(j) is the relevance judgment of the jth returned location, which is 1 for correct locations and 0 for incorrect ones. Other statistical measures, such as KL Divergence, can also be used.
Note that the numerator in Equation 3 above is very similar to that of AP (described in Equation 2), except that the score values are used instead of binary values (1 for positive and 0 for negative) and the inner average is repeated for every sample, not just the positive points. Despite the simplicity of the above Saliency measure, it can yield high prediction accuracy.
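Both training-performance measures can be sketched in a few lines. The saliency function below implements the reconstructed form of Equation 3 given above and should be read as an approximation of the disclosed measure, not a definitive implementation; inputs are rank-ordered score and binary relevance lists.

```python
import numpy as np

def average_precision(rel):
    """Full-length AP (Equation 2) from a binary relevance list in rank order."""
    rel = np.asarray(rel, dtype=float)
    precision_at = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return (precision_at * rel).sum() / max(rel.sum(), 1.0)

def saliency(scores, rel):
    """Saliency-style measure: ratio of positive to negative score statistics,
    with the AP-like inner average taken at every rank (a sketch of Eq. 3)."""
    scores = np.asarray(scores, dtype=float)
    rel = np.asarray(rel, dtype=float)
    ranks = np.arange(1, len(scores) + 1)
    pos = (np.cumsum(scores * rel) / ranks).sum()
    neg = (np.cumsum(scores * (1 - rel)) / ranks).sum()
    return pos / max(neg, 1e-9)
```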
The offline measures of distinctiveness for each view can also be used to “grade” the searchability of a location. Based on the search results of the associated views, a location can be categorized into groups according to how many of its views lead to successful search (e.g., locations searchable by all views, locations searchable by only some views, and locations unsearchable by any view).
The above analysis can also be generalized, with certain caveats. First, the offline analysis is not limited to the discrete views that have been indexed in the database; in practice, users can sample the view space in a more flexible manner. Second, there can be a generalization gap between offline analysis based on training performance and real-world online testing. Nonetheless, the offline analysis offers an approximate yet systematic process for discovering the preferred query views for searchable locations.
Online View Estimation and Active Query Sensing
Modules for online view estimation and active query view suggestion are described in this section. An example process is summarized in Algorithm 1. Given a prior query that fails to recognize the correct location, the objective is to develop automatic methods that can estimate the likely view captured by the prior query, and from the candidate location set, discover an improved or the best view for the next query.
Using the image matching subsystem, a small set of the top-N most likely locations can first be identified. In the case in which the user has indicated that the top matched location is incorrect, that location and locations geographically close to it can be removed, as the first location has been deemed incorrect by the user. The definition of “geographically close” can vary depending on the circumstances: the system can have a default value (e.g., 50 meters), prompt or allow the user to set a value, or both. Next, an SVM classifier can be employed to assign the prior query image to one of a few rough orientations, followed by refinement based on image matching. Algorithm 1 shows the working pipeline of an embodiment of the active query sensing system of the disclosed subject matter. Some key components of Algorithm 1 are explained in detail below.
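A sketch of the candidate-pruning step, assuming each location is represented by a (latitude, longitude) pair in degrees; the 50 meter default mirrors the example above.

```python
import math

def remove_near(candidates, rejected, radius_m=50.0):
    """Drop every candidate within radius_m meters of the rejected location."""
    def haversine(a, b):
        lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
        h = (math.sin((lat2 - lat1) / 2) ** 2 +
             math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
        return 2 * 6371000 * math.asin(math.sqrt(h))   # distance in meters
    return [c for c in candidates if haversine(c, rejected) > radius_m]
```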
Viewing Angle Prediction:
Although the visual content in different views of the location database can be very diverse, there exist general patterns differentiating the views from one another. For example, side views tend to contain more features related to buildings, trees, and the sides of parked vehicles, while other views (e.g., front) have more attributes like skylines, streets, and front/back views of vehicles. Such differences tend to be holistic, reflecting the overall characteristics of the scenes, thus motivating the choice of the GIST descriptor for view classification.
The SVM classifiers can be trained offline based on GIST features extracted from 3000 images (500 for each view) randomly chosen from a database. Given an online query, the classifier is used to predict the current viewing angle in a one-versus-all manner. GIST features are efficient in describing the global configuration of images. However, GIST-based classification alone can confuse visually similar views, so the estimate can be refined by a maximal voting scheme over the image matching scores:

θ* = argmax_{θ_i} Σ_{l} P(θ_i | l, q) · P(l | q),  (4)
where θ_i is a candidate view under consideration, P(θ_i | l, q) is the matching score between query q and view θ_i of location l, and P(l | q) is the prior of location l. The prior can be obtained from additional metadata about locations, such as GPS, or from history data about the user's locations. The default is a constant for all locations.
This refinement is based on the principle that similar views, even from different locations, typically have similar visual content (e.g., skylines, the side of a truck, etc.), which is more likely to be included in the top image match results. Therefore, the final angle prediction method can be based on the combination of local features (scale-invariant feature transform (SIFT) for image matching) and global features (GIST for SVM classification). This approach is robust across application scenarios. It should be noted that when the solution space for view prediction (and alignment) is large, a more sophisticated correspondence matching method, such as RANSAC, can be useful to reliably align the query image to the panorama associated with each location.
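Equation 4 can be sketched as follows, assuming a precomputed matrix of matching scores P(θ_i | l, q) over the top candidate locations.

```python
import numpy as np

def predict_query_view(match_scores, location_prior=None):
    """Equation 4 as a sketch: pick the view maximizing the prior-weighted
    sum of matching scores. match_scores[l][i] = P(theta_i | l, q)."""
    match_scores = np.asarray(match_scores)        # shape: (locations, views)
    if location_prior is None:                     # default: constant prior
        location_prior = np.full(len(match_scores), 1.0 / len(match_scores))
    view_scores = np.asarray(location_prior) @ match_scores  # sum over locations
    return int(np.argmax(view_scores))             # index of the predicted view
```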
Once the current query view is estimated, it can be used to filter out the outlier locations that do not share a consistent view estimation.
Majority Voting for View Suggestion:
Given the filtered candidate set of locations, a majority voting scheme can be used to estimate the most beneficial view to be used for the next query. It can be expressed as:

θ* = argmax_{θ_i} Σ_{l} H_distinctiveness(θ_i | l, q) · P(l | q),  (5)
where H_distinctiveness(θ_i | l, q) outputs 1 if the distinctiveness of view θ_i at location l is greater than a threshold, and 0 otherwise. P(l | q) can be used to model the prior of location l given query q. Such a prior can be obtained from rough GPS information, the history of the mobile user's locations, etc. The default P(l | q) is a constant for all locations, reducing the equation to a simple majority vote.
The scheme takes into account the distinctiveness of each view with respect to each remaining candidate location, as well as the location priors. With the estimated improved query view and the view angle of the current query, suggestions can be made to inform the user of an improved way of turning or moving the camera phone for the subsequent visual search.
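Equation 5 reduces to a few lines, assuming a precomputed location-by-view matrix of distinctiveness values; with a constant prior this is exactly a majority vote.

```python
import numpy as np

def suggest_next_view(distinctiveness, prior=None, threshold=0.5):
    """Equation 5 as a sketch: each candidate location votes (weighted by its
    prior) for every view whose distinctiveness exceeds the threshold."""
    H = (np.asarray(distinctiveness) > threshold).astype(float)  # (locations, views)
    if prior is None:
        prior = np.full(H.shape[0], 1.0 / H.shape[0])  # constant -> majority vote
    return int(np.argmax(np.asarray(prior) @ H))  # view with most weighted votes
```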
AQS can also be used to maximize the information gain about the target location when selecting additional query views. In an example embodiment, where L is the set of database locations, V is the set of possible views, and M is the set of possible query images, the goal is to find the target location l ∈ L given the user queries.
The user can take a query image from a certain viewing angle at time t=1. Then for each iteration t, if the correct location is not ranked first among the results, an additional query view is suggested.
Defining q_t as the query taken at iteration t, there are two parts in q_t = (m_t, v_t): m_t, the actual image used for the query, and v_t, the viewing angle used in capturing the image. Both components are included because the actual images captured in a certain view direction at a location can still vary due to changes of time, lighting, traffic, or even the devices used. If V_{t−1} is defined as the set of views already tried, then the remaining candidate views satisfy v_t ∈ V − V_{t−1}. For simplicity, an embodiment will first be described in which the view of the query image is known, followed by an embodiment in which the viewing angle of the query is unknown.
Supposing queries have failed for iterations 1 . . . t−1, with query set Q_{t−1}, the expected information gain (“IG”) can be used as a criterion to select the query viewing angle v_t:

v_t* = argmax_{v_t ∈ V − V_{t−1}} E_{m_t}[ IG(l; q_t | Q_{t−1}) ].  (6)
The term being maximized represents the expected information gain after a specific view angle v_t is chosen in iteration t. The expectation is computed over the possible images under different imaging conditions, as discussed above. From the definition of information gain,

IG(l; q_t | Q_{t−1}) = H(l | Q_{t−1}) − H(l | Q_{t−1}, q_t),  (7)

so the quantity to be maximized can be written as

E_{m_t}[ IG(l; q_t | Q_{t−1}) ] = Σ_{m_t ∈ M} p(m_t | Q_{t−1}) · [ H(l | Q_{t−1}) − H(l | Q_{t−1}, q_t) ],  (8)
where

H(l | Q_{t−1}) = −Σ_{l ∈ L} p(l | Q_{t−1}) log p(l | Q_{t−1})

is the entropy of p(l | Q_{t−1}); H(l | Q_{t−1}) is a constant given Q_{t−1}. p(l | Q_{t−1}) can be modeled as

p(l | Q_{t−1}) = (1/Z) × Σ_{i=1}^{t−1} w(q_i) · p(l | q_i), with Z = Σ_{i=1}^{t−1} w(q_i),  (9)
in which p(l | q_i) can be directly approximated using the score distribution of locations given query q_i. In some embodiments, in each iteration, p(l | q_i) is set to 0 for all the locations that have been determined incorrect by the user. w(q_i) indicates the “weight” or “quality” of query q_i, which can be estimated by analyzing the content or quality of q_i. Less informative images, e.g., images containing mostly trees, cars, etc., and images of low quality, typically have lower influence on location prediction.
Without actually capturing the query image m_t, and even with the angle v_t fixed, Equation 9 cannot be directly applied to determine p(l | Q_{t−1}, q_t). However, m_t can be approximated using the reference images stored in the database, query images submitted earlier by users from the same angle v_t at a similar time, etc. Denoting the approximated query by

q̂_t = (m̂_t, v_t),  (10)

p(l | Q_{t−1}, q_t) can be approximated as

p(l | Q_{t−1}, q_t) ≈ [ Σ_{i=1}^{t−1} w(q_i) · p(l | q_i) + w(q̂_t) · p(l | q̂_t) ] / [ Σ_{i=1}^{t−1} w(q_i) + w(q̂_t) ],  (11)
where w(q_i) can be modeled using various well-known methods, such as Saliency. The first term of Equation 8, p(m_t | Q_{t−1}), can likewise be expanded using the approximation mentioned above:

p(m_t | Q_{t−1}) ≈ Σ_{l ∈ L} p(m̂_t | l, v_t) · p(l | Q_{t−1}).  (12)
Assuming the newly captured image at angle v_t and location l can be approximated by the existing reference image corresponding to the same location and view angle in the database, the first term in Equation 12, p(m̂_t | l, v_t), becomes deterministic, and the selection criterion of Equation 6 becomes:

v_t* = argmax_{v_t ∈ V − V_{t−1}} Σ_{l ∈ L} p(l | Q_{t−1}) · [ H(l | Q_{t−1}) − H(l | Q_{t−1}, (m̂_t(l, v_t), v_t)) ].  (13)
If the entropy reduction term in the above equation is further approximated with the Saliency measure previously introduced, then the same method based on majority voting described in Equation 5 is obtained. This can be beneficial because the majority voting method is straightforward and simple to compute.
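Under the deterministic approximation above, view selection reduces to picking the untried view with the largest expected entropy reduction, as in the following sketch. The per-view simulated posteriors are assumed to be built offline from the reference images; all names are illustrative.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution over locations."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def select_view_by_ig(p_l, p_l_given_view, tried):
    """Pick the untried view with the largest entropy reduction.
    p_l: current posterior over locations, shape (L,).
    p_l_given_view[v]: posterior after a simulated query from view v,
    built from reference images in the database (an assumption).
    tried: set of view indices already attempted."""
    h_now = entropy(p_l)
    gains = {v: h_now - entropy(p_l_given_view[v])
             for v in range(len(p_l_given_view)) if v not in tried}
    return max(gains, key=gains.get)
```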
When the query viewing angle is unknown, the estimation method described above (see “Online View Estimation and Active Query Sensing”) can be applied to predict the most likely view angle of the initially submitted query. Alternatively, the query image can be aligned to the associated reference images of each location. This can result in different aligned angles with respect to different candidate locations, because the query image can look very similar to one view of a specific candidate location but less similar to a different view of another candidate location. In such a case, instead of maximizing the expected information gain by choosing a single best angle v_t, the optimal relative change of the view angle can be found to maximize the information gain. If the majority of the candidate locations agree on the optimal angle change, a consistent turning action can be recommended.
An analysis of a prior art system without the benefit of the disclosed subject matter was performed.
The mobile location search scenarios were simulated by creating a test set using the Google Street View interface. Queries were manually cropped from Google Street View for 226 randomly chosen locations covered by the above-mentioned routes in NYC. Although such test images are less ideal compared to real photos captured by mobile phones, they were used for initial testing since the Google images are quite different from the reference images in the NAVTEQ data set, and challenging conditions (e.g., occlusion and time change) can be presented.
For each location, six query images were cropped from viewing angles similar to the view orientations used in the database.
For each simulated query image from each of the random locations, the most likely location (among the 50,000 locations in the database) was returned having the highest aggregated matching scores between the query image and the multiple views associated with the location. Details of the matching process are described in more detail above. A returned location is considered correct if it is within a distance threshold from the query location. Setting the appropriate threshold involves consideration of several factors, such as the application requirements and the visual overlap of reference images. It was set to 200 meters in this initial study since two locations can still share overlapped views at this distance in the data set.
Another finding of the case study is location dependence: which views lead to successful search varies from location to location.
The performance of the components and overall system of an embodiment of the disclosed subject matter was evaluated using the NAVTEQ NYC data set (about 300,000 images, 50,000 locations). The test queries are the 1,356 images over 226 locations randomly cropped from the Google Street View interface as described above. Out of the 226 locations, 11.1% were found to be unsearchable by any of the views and thus were discarded. The remaining 201 locations are searchable by at least one view angle. The proportions of locations searchable by various numbers of views are shown in
First the “dominance” of each view of the query was analyzed. Table 2 shows the percentages of successful searches over the 201 test locations by each of the six views. Each view has a reasonable chance of success (between 35% and 65%), while View 2 (left) has the lowest rate. This can be due to the relatively low quality of the camera used for View 2 in the database. View 4 (front), the one pointing to the front of the imaging vehicle, has the highest success rate, as it appears to cover highly visible objects (e.g., buildings on both sides) as well as distinctive features such as skylines.
Distinctive View Prediction:
Next, the performance of predicting search robustness using offline distinctiveness analysis was evaluated. Table 3 shows the percentages of successful searches over the 201 test locations using different methods to predict an improved view for each test location. Two types of proposed methods, training performance based (AP and Saliency) and content based (IDF SVM classifier), were compared against random view selection and a baseline that always chooses the dominant view (front). Among all the competing approaches, the distinctiveness measure (as defined in Equation 3), which incorporates the score statistics ratio between the positive and negative training groups, achieved the highest performance (84%) by a large margin (the next best, 68%, was achieved by AP).
For each location, the external mobile query with the most distinctive angle predicted offline was selected, and it was then tested whether the true location could be found using this query. Table 3 shows the robustness validation of the different view discrimination measurements. For the content based approach (SVM based on statistics of distinctive features), K=10 was set in Equation 1. For each test location, the SVM classifier produced probability based classification results, and the viewing angle with the largest probability of being discriminative was predicted to be the most distinctive viewing angle. The result is based on five-fold cross validation on the test set. As shown in Table 3, the Saliency measurement obtains the highest score.
Query View Prediction:
For the view estimation module applied to test queries, it was found that the GIST based SVM classifier was able to achieve 86.5% classification accuracy over the 1,356 test image queries using only Views 1-4. When Views 5 and 6 are added, they cause confusion with views of highly similar content (View 1 with View 5, and View 2 with View 6). This can be due to the symmetry between the views (180 degree opposite directions) giving rise to similar visual content. To resolve this, the maximal voting scheme based on image matching scores (as described in Equation 4) was applied. This kept the view estimation accuracy as high as 82.1% among all six view angles.
Active Query View Sensing:
The effectiveness of the example AQS system in helping users choose an improved view for subsequent queries after the first query fails was evaluated. The simulated system was initialized with a randomly chosen viewing angle for the first visual search.
Location Difficulty Level Prediction:
How well the proposed distinctiveness measure can be used to predict the difficulty level of each location, in terms of location recognition, was further evaluated. Accurate prediction of this kind can be used to construct a confidence distribution map over different areas.
The procedure of an example AQS system includes: capturing an initial query image and submitting it for search; upon the user indicating that the top returned location is incorrect, removing that location and locations geographically close to it from the candidate set; estimating the viewing angle of the failed query; retrieving the pre-computed distinctive views of the remaining candidate locations; performing majority voting to determine a suggested view for the next query; and presenting the suggested view change to the user.
The hardware architecture of an example embodiment of the disclosed subject matter will now be discussed. The example embodiment includes a search server and client applications running on the iPhone 4S. The mobile applications communicate with the server through Wi-Fi or cellular data services, using a client-server architecture for the communication process. The query and search results are uploaded and downloaded through PHP over HTTP. The client program processes and compresses the image with Objective-C's CGImage class. To support multiple simultaneous users, this example embodiment uses the built-in iOS device ID. Various other devices and configurations can be used and will be apparent to a person of ordinary skill in the art.
Using embodiments of the disclosed subject matter, over 0.3 million images can be searched within 2 seconds over Wi-Fi, including all end-to-end processes: query image uploading, communication, feature extraction, searching, and download of panorama and map information for the search results. To further speed up system response, state-of-the-art techniques such as extracting and sending compact descriptors instead of the query image can also be used.
The methods for automatically determining an improved current view for a visual query in a mobile location search, described above, can be implemented as computer software using computer-readable instructions and physically stored in computer-readable media. The computer software can be written in any suitable computer languages, as would be apparent to one of ordinary skill in the art. The software instructions can be executed on various types of computers.
For example, the methods can be implemented on a computer system having architecture 1600, which includes one or more processors and the storage, interface, and network components described below.
Processor(s) 1601 (also referred to as central processing units, or CPUs) optionally contain a cache memory unit 1602 for temporary local storage of instructions, data, or computer addresses. Processor(s) 1601 are coupled to storage devices including memory 1603. Memory 1603 includes random access memory (RAM) 1604 and read-only memory (ROM) 1605. As is well known in the art, ROM 1605 acts to transfer data and instructions uni-directionally to the processor(s) 1601, and RAM 1604 is used typically to transfer data and instructions in a bi-directional manner. Both of these types of memories can include any suitable computer-readable media described below.
A fixed storage 1608 is also coupled bi-directionally to the processor(s) 1601, optionally via a storage control unit 1607. It provides additional data storage capacity and can also include any of the computer-readable media described below. Storage 1608 can be used to store operating system 1609, EXECs 1610, application programs 1612, data 1611 and the like and is typically a secondary storage medium (such as a hard disk) that is slower than primary storage. It should be appreciated that the information retained within storage 1608, can, in appropriate cases, be incorporated in standard fashion as virtual memory in memory 1603.
Processor(s) 1601 is also coupled to a variety of interfaces such as graphics control 1621, video interface 1622, input interface 1623, output interface 1624, storage interface 1625, and these interfaces in turn are coupled to the appropriate devices. In general, an input/output device can be any of: video displays, track balls, mice, keyboards, microphones, touch-sensitive displays, transducer card readers, magnetic or paper tape readers, tablets, styluses, voice or handwriting recognizers, biometrics readers, or other computers. Processor(s) 1601 can be coupled to another computer or telecommunications network 1630 using network interface 1620. With such a network interface 1620, it is contemplated that the CPU 1601 might receive information from the network 1630, or might output information to the network in the course of performing the above-described method. Furthermore, method embodiments of the present disclosure can execute solely upon CPU 1601 or can execute over a network 1630 such as the Internet in conjunction with a remote CPU 1601 that shares a portion of the processing.
According to various embodiments, when in a network environment, i.e., when computer system 1600 is connected to network 1630, computer system 1600 can communicate with other devices that are also connected to network 1630. Communications can be sent to and from computer system 1600 via network interface 1620. For example, incoming communications, such as a request or a response from another device, in the form of one or more packets, can be received from network 1630 at network interface 1620 and stored in selected sections in memory 1603 for processing. Outgoing communications, such as a request or a response to another device, again in the form of one or more packets, can also be stored in selected sections in memory 1603 and sent out to network 1630 at network interface 1620. Processor(s) 1601 can access these communication packets stored in memory 1603 for processing.
In addition, embodiments of the present disclosure further relate to computer storage products with a computer-readable medium that have computer code thereon for performing various computer-implemented operations. The media and computer code can be those specially designed and constructed for the purposes of the present disclosure, or they can be of the kind well known and available to those having skill in the computer software arts. Examples of computer-readable media include, but are not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and holographic devices; magneto-optical media such as optical disks; and hardware devices that are specially configured to store and execute program code, such as application-specific integrated circuits (ASICs), programmable logic devices (PLDs) and ROM and RAM devices. Examples of computer code include machine code, such as produced by a compiler, and files containing higher-level code that are executed by a computer using an interpreter. Those skilled in the art should also understand that the term “computer readable media” as used in connection with the presently disclosed subject matter does not encompass transmission media, carrier waves, or other transitory signals.
As an example and not by way of limitation, the computer system having architecture 1600 can provide functionality as a result of processor(s) 1601 executing software embodied in one or more tangible, computer-readable media, such as memory 1603. The software implementing various embodiments of the present disclosure can be stored in memory 1603 and executed by processor(s) 1601. A computer-readable medium can include one or more memory devices, according to particular needs. Memory 1603 can read the software from one or more other computer-readable media, such as mass storage device(s) 1635 or from one or more other sources via communication interface. The software can cause processor(s) 1601 to execute particular processes or particular parts of particular processes described herein, including defining data structures stored in memory 1603 and modifying such data structures according to the processes defined by the software. In addition or as an alternative, the computer system can provide functionality as a result of logic hardwired or otherwise embodied in a circuit, which can operate in place of or together with software to execute particular processes or particular parts of particular processes described herein. Reference to software can encompass logic, and vice versa, where appropriate. Reference to a computer-readable media can encompass a circuit (such as an integrated circuit (IC)) storing software for execution, a circuit embodying logic for execution, or both, where appropriate. The present disclosure encompasses any suitable combination of hardware and software.
For example, the mobile visual search device 1710 can be configured to obtain a result data set in response to an initial visual query. The determination module 1720 can receive the result data set from the mobile visual search device 1710, and can be configured to retrieve distinctiveness measurements for results in the result data set. The determination module 1720 can determine the improved view based on the retrieved distinctiveness measurements.
The system 1700 can also include a distinctive view learning module 1730 configured to pre-compute the distinctiveness measurements of views for the results in the result data set using one or both of content based view distinctiveness prediction and training performance based view distinctiveness prediction as described above.
The system 1700 can also include a user interface module 1740 configured to provide images of views of results from the result data set.
While this disclosure has described several example embodiments, there are alterations, permutations, and various substitute equivalents, which fall within the scope of the disclosure. It will thus be appreciated that those skilled in the art will be able to devise numerous systems and methods which, although not explicitly shown or described herein, embody the principles of the disclosure and are thus within the spirit and scope thereof.
This application claims priority to U.S. Provisional Patent Application Ser. No. 61/477,844, filed on Apr. 21, 2011, the entirety of the disclosure of which is explicitly incorporated by reference herein.
Filing Document | Filing Date | Country | Kind | 371(c) Date
---|---|---|---|---
PCT/US2012/033833 | 4/16/2012 | WO | 00 | 4/16/2014
Number | Date | Country
---|---|---
61477844 | Apr 2011 | US