The present disclosure relates generally to image matching during processing of visual search requests and, more specifically, to improving feature matching accuracy during processing of a visual search request.
Mobile query-by-capture applications (or “apps”) are growing in popularity. SnapTell is a music, book, video or video game shopping app that allows searching for price comparisons based on a captured image of the desired product. Vuforia is a platform for app development including vision-based image recognition. Google and Baidu likewise offer visual search capabilities.
In general, the performance of processing visual search requests is very dependent upon the quality of point matching. In particular, the need to avoid false positive matches during processing visual search requests can dramatically increase the number of points that must be correlated in order to reliably determine a match.
There is, therefore, a need in the art for improved visual search request processing.
To improve precision of visual search processing, SIFT points within a query image are forward matched to features in each of a plurality of repository images, and SIFT points within each repository image are backward matched to features within the query image. Forward-only, backward-only and forward-and-backward matches may be weighted differently in determining an image match. Two-way matching may be triggered by a query image bit rate in excess of a threshold or by a sum of weighted distances between matching points exceeding a threshold. Significant performance gains in eliminating false positive matches are achieved.
Before undertaking the DETAILED DESCRIPTION below, it may be advantageous to set forth definitions of certain words and phrases used throughout this patent document: the terms “include” and “comprise,” as well as derivatives thereof, mean inclusion without limitation; the term “or” is inclusive, meaning and/or; the phrases “associated with” and “associated therewith,” as well as derivatives thereof, may mean to include, be included within, interconnect with, contain, be contained within, connect to or with, couple to or with, be communicable with, cooperate with, interleave, juxtapose, be proximate to, be bound to or with, have, have a property of, or the like; and the term “controller” means any device, system or part thereof that controls at least one operation, where such a device, system or part may be implemented in hardware that is programmable by firmware or software. It should be noted that the functionality associated with any particular controller may be centralized or distributed, whether locally or remotely. Definitions for certain words and phrases are provided throughout this patent document; those of ordinary skill in the art should understand that in many, if not most, instances such definitions apply to prior as well as future uses of such defined words and phrases.
For a more complete understanding of the present disclosure and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which like reference numerals represent like parts:
The following documents and standards descriptions are hereby incorporated into the present disclosure as if fully set forth herein:
Mobile visual search applications using Content-Based Image Retrieval (CBIR) and Augmented Reality (AR) are gaining popularity, with important business value for a variety of players in the mobile computing and communication fields. One key technology enabling such applications is a compact image descriptor that is robust to image recapturing variations and efficient for indexing and query transmission over the air. As part of on-going Moving Picture Experts Group (MPEG) standardization efforts, definitions for Compact Descriptors for Visual Search (CDVS) are being promulgated (see [REF1] and [REF2]).
Visual search server 102 includes one or more processor(s) 110 coupled to a network connection 111 over which signals corresponding to visual search requests may be received and signals corresponding to visual search results may be selectively transmitted. The visual search server 102 also includes memory 112 containing an instruction sequence for processing visual search requests in the manner described below, and data used in the processing of visual search requests. In the example shown, the visual search server 102 further includes a communications interface for connection to image database 101.
User device 105 is a mobile phone and includes an optical sensor (not visible in the figure) for capturing query images.
Referring back to the figure, the visual search server 102 receives the visual search request over the communications channel(s) of network 100, and includes descriptor decoding functionality 140 for decoding the query image descriptors within the visual search request. Descriptor matching functionality 141 provides two-way matching of local features between the query image and repository images within database 101, as described in further detail below, and search results functionality 142 returns information regarding the matching repository image(s), if any, over the communications channel(s) of network 100 to the mobile device 105. Results processing and display functionality 133 within the mobile device 105 receives the search results from the visual search server 102 and displays information regarding those results (which may be the matching image(s) themselves or merely descriptors identifying the content of the query image) to the user.
Among the objectives in implementing a visual search process are:
front end real time performance, accommodating, for example, 640×480 pixel images at 30 frames per second (fps) for video;
a low bit rate over the air, achieving 100× compression with respect to the images forming the basis of a search request or 10× compression of the raw feature information;
greater than 95% accuracy in pair-wise matching (verification) of a search image and greater than 90% precision in correct identification of the matching image(s); and
indexing and search efficiency allowing a real time backend response from an image repository including as many as 100 million images.
Key to mobile visual search and AR applications is the use of compact descriptors that are robust to image recapturing variations (e.g., from slightly different perspectives) and efficient for indexing and query transmission over the air, an area that is part of on-going Moving Picture Experts Group (MPEG) standardization efforts. In a CDVS system, visual queries include two parts: global descriptors, and local descriptors for distinctive image regions (or points of interest) within the image together with the associated coordinates for those regions within the image. A local descriptor consists of a selection of (for example) SIFT points [REF7] based upon local key point descriptors, compressed through a multi-stage vector quantization (VQ) scheme. A global descriptor is derived by quantizing the Fisher Vector computed from up to 300 SIFT points, which essentially captures the distribution of the SIFT points in SIFT space.
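By way of illustration only, the following Python sketch extracts SIFT keypoints and retains a budget of up to 300 points per image. It uses OpenCV's SIFT implementation as a stand-in for the test model's extractor; the function name and the response-based point selection are assumptions for illustration, not the CDVS relevance-based selection algorithm.

```python
import cv2
import numpy as np

def extract_local_features(image_path, max_points=300):
    # Detect SIFT keypoints and their 128-dimensional descriptors.
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(img, None)
    # Keep the strongest max_points keypoints by detector response
    # (an assumed proxy for the test model's relevance-based selection).
    order = np.argsort([-kp.response for kp in keypoints])[:max_points]
    coords = np.array([keypoints[i].pt for i in order])  # (x, y) per point
    return coords, descriptors[order]
```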
The matching of repository images with a query image during a visual query search may be processed in multiple stages or steps. In the first step, local features (e.g., SIFT points) from the query image are matched to those of one or more repository side images. If the number of matched SIFT points, n_sift_matched, is below a certain threshold, the two images are declared a non-matching pair. Otherwise, geometric consistency among the matched SIFT points is checked, along with the global descriptor difference, and only when certain thresholds are crossed are the two images declared a matching pair.
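A minimal sketch of the first-stage early rejection follows, assuming brute-force nearest-neighbor matching under a distance-ratio test; the helper names, ratio and threshold values are hypothetical placeholders, not values from the test model.

```python
import numpy as np

def count_matched_sift(query_desc, repo_desc, ratio=0.8):
    # Count query descriptors whose nearest repository descriptor passes
    # the distance-ratio test against the second-nearest one.
    n_matched = 0
    for d in query_desc:
        dists = np.sort(np.linalg.norm(repo_desc - d, axis=1))
        if dists[0] < ratio * dists[1]:
            n_matched += 1
    return n_matched

# Stage 1 early rejection: too few matched points means the pair is
# declared non-matching before the costlier later stages run.
N_SIFT_THRESHOLD = 8  # hypothetical value, not taken from the test model

def passes_first_stage(query_desc, repo_desc):
    return count_matched_sift(query_desc, repo_desc) >= N_SIFT_THRESHOLD
```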
In processing a query image, a short list of matches may be retrieved based on a global feature [REF5] that captures the local descriptor distribution, using global descriptor matching 201 with global descriptors from image database 101. To ensure adequate recall performance, this short list is usually large, containing hundreds of images. In the second step, therefore, local descriptors are utilized in a re-ranking process that identifies the true matches within the short list. Coordinate decoding 202 and local descriptor decoding 203 are performed on the local descriptors from the image search query, and local descriptor re-encoding 204 may optionally be performed in software (S-mode) only. Top match comparison 205 over the short list of top matches from global descriptor matching 201 is then performed using feature matching 206 and geometric verification 207 to determine the retrieved image(s) information. As the size of image database 101 grows, especially in real world applications where repositories typically consist of billions of images, the short list grows dramatically in size and the ultimate retrieval accuracy depends heavily upon the performance of the local feature based re-ranking.
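The two-step retrieval flow described above might be sketched as follows; the function signature, the shortlist size, and the caller-supplied re-ranking score are illustrative assumptions rather than the test model's interfaces.

```python
import numpy as np

def retrieve(query_global, query_local, repo_globals, repo_locals,
             rerank, shortlist_size=500):
    # Step 1: rank every repository image by global-descriptor distance
    # and keep a generous shortlist to protect recall.
    g_dists = np.linalg.norm(repo_globals - query_global, axis=1)
    shortlist = np.argsort(g_dists)[:shortlist_size]
    # Step 2: re-rank only the shortlist using the (costlier) local
    # feature matching score supplied by the caller.
    scored = [(int(i), rerank(query_local, repo_locals[i])) for i in shortlist]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```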
The performance of a CDVS system is very dependent on the quality of SIFT point matches. While a one-way match process may be employed (comparing repository image SIFT points to those within the query image), the results produced can lack robustness, generating an unacceptably high number of false positive (FP) pairs for the query. To improve performance, the present disclosure introduces two-way SIFT point matching and verification: first "forward," between SIFT points in the query image and those in the repository image, and then "backward," between SIFT points in a repository image and those in the query image, before proceeding to determination of geometric consistency and global descriptor comparison to prune off inconsistent matches.
To address the disadvantages described above and to improve visual search performance, a two-way key point feature matching based solution is described. Given a pair of images, a query image Iq and one repository image I0 within the set of repository images I0, I1, . . . , Ir, the task is to determine whether the pair is a matching pair or a non-matching pair, i.e., whether the two images contain the same visual object or set of objects, although the viewpoint, scale and/or orientation of the objects may differ between the two images. Such matching is performed using the SIFT point descriptors from the two images.
Let Mf denote the set of forward matches, from SIFT points in the query image to features in the repository image, and let Mb denote the set of backward matches, from SIFT points in the repository image to features in the query image. Given these sets, three different types of matches may be recognized:
Mf only: matches belonging to Mf that are not in Mb;
Mb only: matches belonging to Mb that are not in Mf; and
Mf∩Mb: matches that belong to both Mf and Mb.
The last group, the set Mf∩Mb, is illustrated in the accompanying drawings.
As compared to considering only the set of matches Mf, the two-way matched descriptors, i.e., the matches in the set Mf∩Mb, should be more reliable in determining a "true" match (avoiding false positives), because that set satisfies both the forward matching and the backward matching criteria. In the present disclosure, different weighted combinations of the sets of matches described above may be utilized according to different image level matching criteria. In one scenario, referred to as a two-way only scenario, only the set of matches Mf∩Mb is used during the query. In another scenario, referred to as a weighted two-way scenario, the set of matches Mf∪Mb may be used in computing the final match score, with a higher weight assigned to matches in Mf∩Mb than to Mf only and Mb only matches, as illustrated in the sketch below.
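A minimal sketch of computing the three match sets follows, assuming brute-force nearest-neighbor matching with a distance-ratio test in each direction; the helper names and ratio value are illustrative assumptions.

```python
import numpy as np

def nn_matches(desc_a, desc_b, ratio=0.8):
    # One-way matches from desc_a into desc_b under a distance-ratio test.
    matches = set()
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        j1, j2 = np.argsort(dists)[:2]
        if dists[j1] < ratio * dists[j2]:
            matches.add((i, int(j1)))
    return matches

def two_way_match_sets(query_desc, repo_desc):
    m_f = nn_matches(query_desc, repo_desc)                # forward: Mf
    # Backward matches, reoriented to (query index, repository index).
    m_b = {(q, r) for (r, q) in nn_matches(repo_desc, query_desc)}
    both = m_f & m_b                                       # Mf ∩ Mb
    return m_f - both, m_b - both, both    # Mf only, Mb only, Mf ∩ Mb
```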
Note that in order to obtain the local descriptor matching score, a geometric consistency check is first performed using the Logarithmic Distance Ratio (LDR) approach [REF9]. The geometric consistency check yields the subset of matches that pass the check, referred to as inliers. The final matching score is computed as the sum of the weights of the matches among the inliers. In one approach, the weight of a match is computed as a function of the distance between SIFT descriptors. To incorporate the two-way match information, these weights are post-multiplied by a factor, such as a value of 1 for matches belonging to Mf∩Mb and a value of 0.5 for matches belonging to the Mf only and Mb only sets.
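The weighting scheme just described might be sketched as follows. The LDR geometric check itself is not reproduced here, so the inliers are assumed to be supplied by the caller, and the per-match base weights are assumed to be precomputed.

```python
def local_matching_score(inliers, weight, two_way_matches,
                         two_way_factor=1.0, one_way_factor=0.5):
    # Sum per-match weights over the geometric inliers, post-multiplying
    # by 1.0 for two-way (Mf ∩ Mb) matches and 0.5 for one-way matches.
    score = 0.0
    for m in inliers:
        factor = two_way_factor if m in two_way_matches else one_way_factor
        score += weight[m] * factor
    return score

# Example: two inliers, one of which was matched in both directions.
weights = {(0, 3): 0.4, (1, 7): 0.3}
print(local_matching_score(weights.keys(), weights, {(0, 3)}))  # 0.4 + 0.15
```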
If m and n are the numbers of SIFT descriptors in the query and repository images Iq and I0, the algorithmic complexity of two-way and two-way weighted SIFT matching is O(mn), the same as for the one-way match approach, adding only one additional (distance) sorting and inconsistency elimination. The two-way weighted scenario requires slightly more computation than the two-way only scenario, since the geometric consistency check is performed on the set Mf∪Mb rather than on just Mf∩Mb as in the two-way only approach. Thus, the approach described above incurs no extraction time complexity penalty, no communication overhead or change to the bit stream for a search request, no memory cost, and no significant computational penalty.
In one embodiment of the present disclosure, the matches Mf are computed first and a coarse measure of the matching score Sw is estimated based on the individual matching scores of the matches in Mf. In one such implementation, Sw is the sum of the weights associated with each match, where the weight of a match is computed as a function of d1 and d2, d1 being the distance of a keypoint in one image to the closest match in the second image and d2 being the distance of that keypoint to the second closest match. The two-way matching procedure may be performed only if Sw falls within a certain range (e.g., [6,16] in one implementation); otherwise, the baseline one-way matching is performed instead. This optimization significantly reduces the computational complexity of the technique.
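A sketch of this gating optimization follows. Because the exact weight function of d1 and d2 is not reproduced in this document, the sketch substitutes an assumed ratio-based weight (1 - d1/d2), which should be understood as a placeholder rather than the test model's formula.

```python
import numpy as np

S_W_RANGE = (6.0, 16.0)  # trigger range used in one implementation

def coarse_score(query_desc, repo_desc, ratio=0.8):
    # Estimate Sw from the forward matches Mf only.
    s_w = 0.0
    for d in query_desc:
        d1, d2 = np.sort(np.linalg.norm(repo_desc - d, axis=1))[:2]
        if d2 > 0 and d1 < ratio * d2:
            s_w += 1.0 - d1 / d2   # assumed weight as a function of d1, d2
    return s_w

def should_two_way_match(s_w):
    # Two-way matching runs only when Sw falls inside the trigger range;
    # otherwise the baseline one-way matching result is kept.
    return S_W_RANGE[0] <= s_w <= S_W_RANGE[1]
```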
Based on experiments conducted on MPEG CDVS Test Model version 6.1, the retrieval results for the weighted two-way scenario are provided in TABLE I below:
The matching results are provided in TABLE II below, where the columns labeled "TPR" indicate the true positive match rate and the columns labeled "FPR" indicate the false positive match rate:
Note that the two-way matching is only activated for higher bit rates, e.g., 4K to 16K, which can be controlled with a specific software switch. The above matching results are obtained when the two-way matching is selectively turned on based on the value of Sw. The two-way matching can alternatively be selectively turned on based on another measure of the matching scores of individual descriptors, or on any other quality or compatibility measure of the two images being matched.
The process 400 begins with determination of SIFT points for the query image and transmission of the query to the visual search server (step 401). Forward matching of features from the query image to one of the repository images (step 402) is performed, optionally followed by a determination of whether the sum of weights Sw for the matches lies outside a predetermined range (step 403). If so, the process skips to a determination of whether any repository images have not yet been compared to the query image (step 406). If not, however, the process instead proceeds to backward matching of features from the repository image to the query image (step 404), and optionally to weighting of the forward-and-backward matches differently from the forward-only and backward-only matches (step 405), before determining whether any repository images have not yet been compared (step 406). Once the query image features have been compared to the features of all repository images, the depicted portion of the overall visual search query processing terminates (step 407).
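Wiring the pieces together, the depicted steps might be sketched as the following loop. The callables stand for the routines sketched earlier and are passed in as parameters; this decomposition is illustrative, not the actual implementation of process 400.

```python
def compare_against_repository(query, repo_images, forward_match,
                               backward_match, coarse_score, weighted_score,
                               s_w_range=(6.0, 16.0)):
    results = []
    for repo in repo_images:                     # loop closed by step 406
        m_f = forward_match(query, repo)         # step 402
        s_w = coarse_score(m_f)
        if not (s_w_range[0] <= s_w <= s_w_range[1]):
            results.append((repo, s_w))          # step 403: keep one-way score
            continue
        m_b = backward_match(query, repo)        # step 404
        results.append((repo, weighted_score(m_f, m_b)))  # step 405
    return results                               # step 407: all images compared
```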
As shown, the elimination of SIFT level false positive pairs translates into significant image level FPR improvement at higher rates. Lower bit rates do not provide enough local descriptors for this scheme to take effect.
Although the present disclosure has been described with an exemplary embodiment, various changes and modifications may be suggested to one skilled in the art. It is intended that the present disclosure encompass such changes and modifications as fall within the scope of the appended claims.
This application hereby incorporates by reference U.S. Provisional Patent Application No. 61/750,684, filed Jan. 9, 2013, entitled “TWO WAY LOCAL FEATURE MATCHING TO IMPROVE VISUAL SEARCH ACCURACY,” U.S. Provisional Patent Application No. 61/812,646, filed Apr. 16, 2013, entitled “TWO WAY LOCAL FEATURE MATCHING TO IMPROVE VISUAL SEARCH ACCURACY,” and U.S. Provisional Patent Application Ser. No. 61/859,037, filed Jul. 26, 2013, entitled “TWO WAY LOCAL FEATURE MATCHING TO IMPROVE VISUAL SEARCH ACCURACY.”