Augmented reality (AR) involves superposing information directly onto a camera view of real world objects. Recently there has been tremendous interest in developing AR type applications for mobile platforms, such as a mobile phone. One type of AR application that is of interest is vision-based AR, i.e., processing the pixels in the camera (view) frames to both detect and track points of interest (POI) to the user.
Vision-based AR uses object detection that involves not only the recognition (or not) of a reference object in the query image captured by camera but also computing the underlying spatial transformation of the object between reference and query. One important consideration in the design of a vision-based AR system is the size and composition of the database (DB) of features derived from images of reference objects. Another important consideration is the query process in which the descriptions of query features are matched against those of reference images.
A database for object recognition is generated by performing at least one of intra-object pruning and inter-object pruning, as well as keypoint clustering and selection. Intra-object pruning removes similar and redundant keypoints within an object and different views of the same object, and may be used to generate and associate a significance value, such as a weight, with respect to remaining keypoint descriptors. Inter-object pruning retains the most informative set of descriptors across different objects, by characterizing the discriminability of the keypoint descriptors for all of the objects and removing keypoint descriptors with a discriminability that is less than a threshold.
A match between a query image and information related to images of objects stored in a database is performed by retrieving nearest neighbors from the database and determining the quality of the match for the retrieved neighbors. The quality of the match is used to generate an object candidate set, which is used to remove outliers. A confidence level for each query feature may also be used to remove outliers. The search may be performed on a mobile platform, which downloads a geographically relevant portion of the database from a central server.
As used herein, a mobile platform refers to a device such as a cellular or other wireless communication device, personal communication system (PCS) device, personal navigation device (PND), Personal Information Manager (PIM), Personal Digital Assistant (PDA), laptop or other suitable mobile device which is capable of receiving wireless communication and/or navigation signals, such as navigation positioning signals. The term “mobile platform” is also intended to include devices which communicate with a personal navigation device (PND), such as by short-range wireless, infrared, wireline connection, or other connection—regardless of whether satellite signal reception, assistance data reception, and/or position-related processing occurs at the device or at the PND. Also, “mobile platform” is intended to include all devices, including wireless communication devices, computers, laptops, etc. which are capable of communication with a server, such as via the Internet, WiFi, or other network, and regardless of whether satellite signal reception, assistance data reception, and/or position-related processing occurs at the device, at a server, or at another device associated with the network. Any operable combination of the above are also considered a “mobile platform.”
A satellite positioning system (SPS) typically includes a system of transmitters positioned to enable entities to determine their location on or above the Earth based, at least in part, on signals received from the transmitters. Such a transmitter typically transmits a signal marked with a repeating pseudo-random noise (PN) code of a set number of chips and may be located on ground based control stations, user equipment and/or space vehicles. In a particular example, such transmitters may be located on Earth orbiting satellite vehicles (SVs) 102, illustrated in
In accordance with certain aspects, the techniques presented herein are not restricted to global systems (e.g., GNSS) for SPS. For example, the techniques provided herein may be applied to or otherwise enabled for use in various regional systems, such as, e.g., Quasi-Zenith Satellite System (QZSS) over Japan, Indian Regional Navigational Satellite System (IRNSS) over India, Beidou over China, etc., and/or various augmentation systems (e.g., a Satellite Based Augmentation System (SBAS)) that may be associated with or otherwise enabled for use with one or more global and/or regional navigation satellite systems. By way of example but not limitation, an SBAS may include an augmentation system(s) that provides integrity information, differential corrections, etc., such as, e.g., Wide Area Augmentation System (WAAS), European Geostationary Navigation Overlay Service (EGNOS), Multi-functional Satellite Augmentation System (MSAS), GPS Aided Geo Augmented Navigation or GPS and Geo Augmented Navigation system (GAGAN), and/or the like. Thus, as used herein an SPS may include any combination of one or more global and/or regional navigation satellite systems and/or augmentation systems, and SPS signals may include SPS, SPS-like, and/or other signals associated with such one or more SPS.
The mobile platform 100 is not limited to use with an SPS for position determination, as position determination techniques described herein may be implemented in conjunction with various wireless communication networks, including cellular towers 104 and wireless communication access points 106, such as a wireless wide area network (WWAN), a wireless local area network (WLAN), or a wireless personal area network (WPAN). Further, the mobile platform 100 may access one or more servers to obtain data, such as reference images and reference features from a database, using various wireless communication networks via cellular towers 104 and wireless communication access points 106, or using satellite vehicles 102 if desired. The terms “network” and “system” are often used interchangeably. A WWAN may be a Code Division Multiple Access (CDMA) network, a Time Division Multiple Access (TDMA) network, a Frequency Division Multiple Access (FDMA) network, an Orthogonal Frequency Division Multiple Access (OFDMA) network, a Single-Carrier Frequency Division Multiple Access (SC-FDMA) network, Long Term Evolution (LTE), and so on. A CDMA network may implement one or more radio access technologies (RATs) such as cdma2000, Wideband-CDMA (W-CDMA), and so on. Cdma2000 includes IS-95, IS-2000, and IS-856 standards. A TDMA network may implement Global System for Mobile Communications (GSM), Digital Advanced Mobile Phone System (D-AMPS), or some other RAT. GSM and W-CDMA are described in documents from a consortium named “3rd Generation Partnership Project” (3GPP). Cdma2000 is described in documents from a consortium named “3rd Generation Partnership Project 2” (3GPP2). 3GPP and 3GPP2 documents are publicly available. A WLAN may be an IEEE 802.11x network, and a WPAN may be a Bluetooth network, an IEEE 802.15x network, or some other type of network. The techniques may also be implemented in conjunction with any combination of WWAN, WLAN and/or WPAN.
Additionally, because the database 212 may include objects that are captured in multiple views, and, additionally, each object may possess local features that are similar to features found in other objects, it is desirable that the database 212 is pruned to retain only the most distinctive features and, as a consequence, a representative minimal set of features to reduce storage requirements while improving recognition performance or at least not harming recognition performance. For example, an image in VGA resolution (640 pixels×480 pixels) that undergoes conventional Scale Invariant Feature Transform (SIFT) processing would result in around 2500 d-dimensional SIFT features with d≈128. Assuming 2 bytes per feature element, storage of the SIFT features from one image in VGA resolution would require approximately 2500×128×2 bytes, or 625 KB, of memory. Accordingly, even with a limited set of objects, the storage requirements may be large. For example, the ZuBuD database has only 201 unique POI building objects with five views per object, resulting in a total of 1005 images and a memory requirement that is on the order of hundreds of megabytes. It is desirable to reduce the number of features stored in the database, particularly where a local database 153 will be stored on the client side, i.e., mobile platform 100.
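The storage estimate above can be checked with a few lines of arithmetic. The following is a minimal sketch only; the function name `sift_storage_bytes` is hypothetical, and the assumption that every ZuBuD image yields about 2500 features is made purely for illustration.

```python
def sift_storage_bytes(num_features, dims=128, bytes_per_element=2):
    """Raw storage in bytes for uncompressed descriptors."""
    return num_features * dims * bytes_per_element

per_image = sift_storage_bytes(2500)   # one VGA image
print(per_image, per_image / 1024)     # 640000 625.0

# Illustrative assumption: about 2500 features for each of the 1005 ZuBuD images.
total = sift_storage_bytes(2500 * 1005)
print(f"{total / 1024 ** 2:.0f} MB")   # several hundred megabytes
```

Under these assumptions the uncompressed descriptors alone occupy several hundred megabytes, which is consistent with the order of magnitude stated above.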
The tagged imagery 252 is processed by extracting features from the geo-tagged imagery, pruning the features in the database, as well as determining and assigning a significance for the features, e.g., in the form of a weight (254). The extracted features are to provide a recognition-specific representation of the images, which can be used later for comparison or matching to features from a query image. The representation of the images should be robust and invariant to a variety of imaging conditions and transformations, such as geometric deformations (e.g., rotations, scale, translations etc.), filtering operations due to motion blur, bad optics etc., as well as variations in illumination and changes in pose. Such robustness cannot be achieved by comparing the image pixel values and thus, an intermediate representation of image content that carries the information necessary for interpretation is used. Features may be extracted using a well known technique, such as Scale Invariant Feature Transform (SIFT), which localizes features and generates their descriptions. If desired, other techniques, such as Speeded Up Robust Features (SURF), Gradient Location-Orientation Histogram (GLOH), Compressed Histogram of Gradients (CHoG) or other comparable techniques may be used. Extracted features are sometimes referred to herein as keypoints, which may include feature location, scale and orientation when SIFT is used, and the descriptions of the features are sometimes referred to herein as keypoint descriptors or simply descriptors. The extracted features may be compressed either before pruning the database or after pruning the database. Compressing the features may be performed by exploiting the redundancies that may be present along the feature dimensions, e.g., using principal component analysis to reduce the descriptor dimensionality from N to D, where D<N, such as from 128 to 32.
Other techniques may be used for compressing the features, such as entropy coding based methods. Additionally, object metadata for the reference objects, such as geo-location or identification, is extracted and associated with the features (256) and the object metadata and associated features are indexed and stored in the database 212 (258).
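The dimensionality reduction from N=128 to D=32 mentioned above can be illustrated with a short principal component analysis sketch. This is one possible formulation, assuming NumPy is available; the helper names `fit_pca` and `compress` are hypothetical, and random data stands in for real SIFT descriptors.

```python
import numpy as np

def fit_pca(descriptors, out_dims):
    """Return (mean, basis); basis has shape (N, out_dims)."""
    mean = descriptors.mean(axis=0)
    centered = descriptors - mean
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:out_dims].T

def compress(descriptors, mean, basis):
    """Project descriptors onto the retained principal directions."""
    return (descriptors - mean) @ basis

rng = np.random.default_rng(0)
reference = rng.random((500, 128))      # stand-in for real SIFT descriptors
mean, basis = fit_pca(reference, 32)
compressed = compress(reference, mean, basis)
print(compressed.shape)                 # (500, 32)
```

The same `mean` and `basis` would have to be applied to query descriptors so that reference and query features are compared in the same reduced space.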
Inter-object pruning (320) is used to retain the most informative set of descriptors across different objects, by characterizing the discriminability of the keypoint descriptors for all of the objects and removing keypoint descriptors with a discriminability that is less than a threshold. Inter-object pruning (320) helps improve classification performance and confidence by discarding keypoints in the database that appear in several different objects.
Location based pruning and keypoint clustering (340) is used to help ensure that the final set of pruned descriptors have good information content and provide good matches across a range of scales. Location based pruning removes keypoint location redundancies within each view for each object. Additionally, keypoints are clustered based on location within each view for each object and a predetermined number of keypoints within each cluster is retained. The location based pruning and/or keypoint clustering (340) may be performed after the inter-object pruning (320), followed by associating the remaining keypoint descriptors with objects and storing in the database 212. If desired, however, as illustrated with the broken lines in
Additionally, if desired, the database 212 may be pruned using only one of the intra-object pruning, e.g., where the data is limited in the number of reference objects it contains, or the inter-object pruning.
The server 210 includes a server control unit 220 that is connected to and communicates with the external interface 214 and the user interface 216. The server control unit 220 accepts and processes data from the external interface 214 and the user interface 216 and controls the operation of those devices. The server control unit 220 may be provided by a processor 222 and associated memory 224, software 226, as well as hardware 227 and firmware 228 if desired. The server control unit 220 includes an intra-object pruning unit 230, an inter-object pruning unit 232 and a location based pruning and keypoint clustering unit 234, which are illustrated as separate from the processor 222 for clarity, but may be within the processor 222. It will be understood as used herein that the processor 222 can, but need not necessarily include, one or more microprocessors, embedded processors, controllers, application specific integrated circuits (ASICs), digital signal processors (DSPs), and the like. The term processor is intended to describe the functions implemented by the system rather than specific hardware. Moreover, as used herein the term “memory” refers to any type of computer storage medium, including long term, short term, or other memory associated with the mobile platform, and is not to be limited to any particular type of memory or number of memories, or type of media upon which memory is stored.
The methodologies described herein may be implemented by various means depending upon the application. For example, these methodologies may be implemented in software 226, hardware 227, firmware 228 or any combination thereof. For a hardware implementation, the processing units may be implemented within one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, micro-controllers, microprocessors, electronic devices, other electronic units designed to perform the functions described herein, or a combination thereof.
For a firmware and/or software implementation, the methodologies may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. Any machine-readable medium tangibly embodying instructions may be used in implementing the methodologies described herein. For example, software codes may be stored in memory 224 and executed by the processor 222. Memory may be implemented within the processor unit or external to the processor unit. As used herein the term “memory” refers to any type of long term, short term, volatile, nonvolatile, or other memory and is not to be limited to any particular type of memory or number of memories, or type of media upon which memory is stored.
For example, software 226 codes may be stored in memory 224 and executed by the processor 222 and may be used to run the processor and to control the operation of the mobile platform 100 as described herein. A program code stored in a computer-readable medium, such as memory 224, may include program code to extract keypoints and generate keypoint descriptors from a plurality of images and to perform intra-object and/or inter-object pruning as described herein, as well as program code to cluster keypoints in each image based on location and retain a subset of keypoints in each cluster of keypoints; program code to associate remaining keypoints with an object identifier; and program code to store the associated remaining keypoints and object identifier in the database.
If implemented in firmware and/or software, the functions may be stored as one or more instructions or code on a computer-readable medium. Examples include computer-readable media encoded with a data structure and computer-readable media encoded with a computer program. Computer-readable media includes physical computer storage media. A storage medium may be any available medium that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer; disk and disc, as used herein, includes compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The server 210 prunes the database by at least one of intra-object pruning, inter-object pruning, and location based pruning and/or keypoint clustering. The server may employ an information-theoretic approach or a distance comparison approach for database pruning. The distance comparison approach may be based on, e.g., Euclidean distance comparisons. The information-theoretic approach to database pruning models keypoint distribution probabilities to quantify how informative a particular descriptor is with respect to the objects in the given database. Before describing database pruning by server 210, it is useful to briefly review the mathematical notations to be used. Let M denote the number of unique objects, i.e., points of interest (POI), in the database. Let the number of image views for the ith object be denoted by $N_i$. Let the total number of descriptors across the $N_i$ views of the ith object be denoted by $K_i$. Let $f_{i,j}$ represent the jth descriptor for the ith object, where $j = 1 \ldots K_i$ and $i = 1 \ldots M$. Let the set $S_i$ contain the $K_i$ descriptors for the ith object such that $S_i = \{f_{i,j};\ j = 1 \ldots K_i\}$. By pruning the database, the cardinality of the descriptor set for each object is significantly reduced while high recognition accuracy is maintained.
In the information-theoretic approach to database pruning, a source variable X is defined as taking integer values from 1 to M, where X=i indicates that the ith object from the database was selected. Let the probability of X selecting the ith object be denoted by $pr(X = i)$. Recall that the set $S_i$ contains the $K_i$ descriptors for the ith object such that $S_i = \{f_{i,j};\ j = 1 \ldots K_i\}$. Let $\tilde{S}_i$ represent the pruned descriptor set for the ith object. The pruning criterion can then be stated as:
$$\max_{\tilde{S}} \; I(X; \tilde{S}) \quad \text{such that } |\tilde{S}_i| = \tilde{K}_i, \qquad \text{(eq. 1)}$$
where $\tilde{S} = \{\tilde{S}_1 \ldots \tilde{S}_M\}$ and $i = 1 \ldots M$.
The term $I(X; \tilde{S})$ represents the mutual information between X and $\tilde{S}$. The term $\tilde{K}_i$ denotes the desired cardinality of the pruned set $\tilde{S}_i$. In other words, to form the pruned database, it is desired to retain the descriptors from the original database that maximize the mutual information between X and the pruned database $\tilde{S}$. With such a criterion, features that are less informative about the occurrence of a database object in the input image may be eliminated. It is noted that direct maximization is prohibitive because it involves the joint and conditional distributions of descriptors given the entire database and is computationally expensive even for small M and $K_i$. Accordingly, it may be assumed that each descriptor is a statistically independent event, which implies that the mutual information in eq. 1 can be expressed as:
$$I(X; \tilde{S}) = \sum_{i=1}^{M} \sum_{f_{i,j} \in \tilde{S}_i} I(X; f_{i,j}) \qquad \text{(eq. 2)}$$
With the assumption of statistical independence of individual descriptors, the mutual information $I(X; \tilde{S})$ is expressed as the summation of the mutual information provided by individual descriptors in the pruned set. Maximizing the individual mutual information component $I(X; f_{i,j})$ in eq. 2 is equivalent to minimizing the conditional entropy $H(X \mid f_{i,j})$, which is a measure of randomness about the source variable X given the descriptor $f_{i,j}$. Therefore, lower conditional entropy for a particular descriptor implies that it is statistically more informative. The conditional entropy $H(X \mid f_{i,j})$ is given as:
$$H(X \mid f_{i,j}) = -\sum_{k=1}^{M} p(X = k \mid f_{i,j}) \log_2 p(X = k \mid f_{i,j}) \qquad \text{(eq. 3)}$$
where $p(X = k \mid f_{i,j})$ is the conditional probability of the source variable X being equal to the kth object given the occurrence of descriptor $f_{i,j}$ ($i = 1 \ldots M$ and $j = 1 \ldots K_i$). In a perfectly deterministic case, where the occurrence of a particular descriptor $f_{i,j}$ is associated with only one object in the database, the conditional entropy goes to 0; whereas, if a specific descriptor is equally likely to appear in all the M database objects, then the conditional entropy is highest and is equal to $\log_2 M$ bits (assuming all objects are equally likely, i.e., $pr(X = k) = 1/M$). It is to be noted that selection of features based on the criterion that $H(X \mid f_{i,j}) < \gamma$, where $\gamma$ is set to, e.g., 1 bit, fails to consider keypoint properties such as scale and location in the selection of the pruned descriptor set. Moreover, additional information may be imparted into the feature selection by associating a weighting factor with each descriptor, denoted by $w_{i,j}$ and initialized to $w_{i,j} = 1/K_i$, where $j = 1 \ldots K_i$.
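The two limiting cases of the conditional entropy described above can be verified numerically. The following is a minimal sketch; the function name `conditional_entropy` and the example probability vectors are illustrative assumptions, not values from the database.

```python
import math

def conditional_entropy(posterior):
    """H(X | f) in bits for a list of p(X=k | f) values that sum to 1."""
    return sum(-p * math.log2(p) for p in posterior if p > 0.0)

M = 4
deterministic = [1.0, 0.0, 0.0, 0.0]   # descriptor associated with one object only
uniform = [1.0 / M] * M                # descriptor equally likely in all M objects

h_det = conditional_entropy(deterministic)   # 0 bits
h_uni = conditional_entropy(uniform)         # log2(4) = 2 bits

gamma = 1.0                                  # example threshold of 1 bit
print(h_det < gamma, h_uni < gamma)          # True False
```

With a threshold of 1 bit, the descriptor tied to a single object would be retained and the descriptor shared equally by all objects would be discarded.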
One or more of the matching keypoint descriptors within the set is removed leaving one or more keypoint descriptors (308), which helps retain the most significant keypoints that are related to the object for object detection. For example, the matching keypoint descriptors may be compounded into a single keypoint descriptor, e.g., by averaging or otherwise combining the keypoint descriptors, and all of the matching keypoint descriptors in the set may be removed. Thus, where the matching keypoint descriptors are compounded, the remaining keypoint descriptor is a new keypoint descriptor that is not from the set of matching keypoint descriptors. Alternatively, one or more keypoint descriptors from the set of matching keypoint descriptors may be retained, while the remainder of the set is removed. The one or more keypoint descriptors to be retained may be selected based on the dominant scale or the view that the keypoint belongs to (e.g., it may be desired to retain the keypoints from a front view of the object), or they may be selected randomly. If desired, the keypoint location, scale information, and object and view association of the remaining keypoint descriptors may be retained, which may be used for geometry consistency tests during outlier removal.
The significance of keypoint descriptors is determined and assigned to each remaining keypoint descriptor. For example, a weight may be determined and assigned to the one or more remaining keypoint descriptors (310). Where only one keypoint descriptor remains, the provided descriptor weight $w_{i,j}$ may be based on the number of matching keypoint descriptors in the set ($L_j$) with respect to the total number of possible keypoint descriptors ($K_i$), e.g., $w_{i,j} = L_j / K_i$.
If there are additional keypoint descriptors for the ith object (312), the next keypoint descriptor is selected (313) and the process returns to block 306. When all of the keypoint descriptors for the ith object are completed, it is determined whether there are additional objects (314). If there are more objects, the next object is selected (315) and the process returns to block 304, otherwise, the intra-object pruning is finished (316).
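The intra-object pruning loop for a single object can be sketched as follows. This is an illustration under stated assumptions rather than the definitive procedure: two descriptors are treated as matching when their Euclidean distance is below a threshold, each matching set is compounded by averaging, and the weight is the set size divided by the total descriptor count. The function name `intra_object_prune` is hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math

def intra_object_prune(descriptors, threshold):
    """Greedy intra-object pruning for one object.

    descriptors: list of equal-length float lists (all views of the object).
    Returns a list of (compound_descriptor, weight) pairs.
    """
    total = len(descriptors)                  # K_i
    remaining = list(range(total))
    pruned = []
    while remaining:
        seed = remaining[0]
        # Set of descriptors matching the seed (assumed test: distance < threshold).
        matches = [j for j in remaining
                   if j == seed
                   or math.dist(descriptors[seed], descriptors[j]) < threshold]
        dims = len(descriptors[seed])
        # Compound the matching set into a single descriptor by averaging.
        mean = [sum(descriptors[j][d] for j in matches) / len(matches)
                for d in range(dims)]
        pruned.append((mean, len(matches) / total))   # weight = L_j / K_i
        matched = set(matches)
        remaining = [j for j in remaining if j not in matched]
    return pruned

descs = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.0, 0.1]]
result = intra_object_prune(descs, 0.4)
print(len(result), [w for _, w in result])    # 2 [0.75, 0.25]
```

Because every descriptor ends up in exactly one matching set, the weights of the remaining descriptors sum to 1 for each object in this sketch.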
The probability of belonging to a given object may be quantified for each descriptor $f = f_{i,j}$ ($i = 1 \ldots M$; $j = 1 \ldots K_i$) in the database as follows. The nearest neighbors are retrieved from the descriptor database of the keypoint descriptors remaining after intra-object pruning. The nearest neighbors may be retrieved using a search tree, e.g., using Fast Library for Approximate Nearest Neighbors (FLANN), and are retrieved based on an L2 norm less than a predetermined distance $\epsilon$. The nearest neighbors are binned with respect to the object ID and may be denoted by $f_{k,n}$, where k is the object ID and n is the nearest neighbor index. The nearest neighbors are used to compute the conditional probabilities $p(f = f_{i,j} \mid X = k)$, where $k = 1 \ldots M$. A mixture of Gaussians may be used to model the conditional probability and is provided as:
The probability of belonging to a given object is then used to compute the recognition-specific information content for each keypoint descriptor (324). The recognition-specific information content for each keypoint descriptor may be computed by determining the posterior probability $p(X = k \mid f = f_{i,j})$ using Bayes rule as follows:
$$p(X = k \mid f = f_{i,j}) = \frac{p(f = f_{i,j} \mid X = k)\; pr(X = k)}{\sum_{l=1}^{M} p(f = f_{i,j} \mid X = l)\; pr(X = l)}$$
The posterior probability can then be used to compute the conditional entropy $H(X \mid f_{i,j})$ for an object, given a specific descriptor, as described in eq. 3 above. A lower conditional entropy for a particular descriptor implies that it is statistically more informative. Thus, for each object, keypoint descriptors are selected where the entropy is less than a predetermined threshold, i.e., $H(X \mid f_{i,j}) < \gamma$ bits, and the remainder of the keypoint descriptors are removed (326). The object and view identification is maintained for the selected keypoint descriptors (328) and the inter-object pruning is finished (330). For example, for indexing purposes and geometric verification purposes (post descriptor matching), the object and view identification may be tagged with the selected feature descriptor in the pruned database.
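The inter-object pruning steps (nearest neighbor retrieval, binning by object, posterior computation and entropy thresholding) can be sketched as below. Several choices here are assumptions made for illustration: a brute-force search stands in for FLANN, the likelihood is an unnormalized sum of isotropic Gaussian kernels with a hypothetical bandwidth `sigma`, and all objects have equal priors. Function names are hypothetical.

```python
import math

def posterior_over_objects(f, database, num_objects, eps, sigma):
    """p(X=k | f) from neighbors of f within distance eps, assuming equal priors.

    database: list of (object_id, descriptor) pairs, object_id in 0..num_objects-1.
    The database must contain f itself so that the normalizer is nonzero.
    """
    likelihood = [0.0] * num_objects
    for obj, g in database:
        d = math.dist(f, g)
        if d < eps:
            # Gaussian kernel contribution, binned by object ID (assumed form).
            likelihood[obj] += math.exp(-d * d / (2.0 * sigma * sigma))
    norm = sum(likelihood)
    return [v / norm for v in likelihood]

def entropy_bits(p):
    return sum(-q * math.log2(q) for q in p if q > 0.0)

def inter_object_prune(database, num_objects, eps=0.5, sigma=0.2, gamma=1.0):
    """Keep descriptors whose conditional entropy H(X | f) is below gamma bits."""
    kept = []
    for obj, f in database:
        post = posterior_over_objects(f, database, num_objects, eps, sigma)
        if entropy_bits(post) < gamma:
            kept.append((obj, f))     # object ID stays attached to the descriptor
    return kept

db = [(0, [0.0, 0.0]), (1, [0.0, 0.0]), (2, [0.0, 0.0]),   # shared by 3 objects
      (0, [1.0, 1.0]), (1, [3.0, 3.0])]                    # distinctive
kept = inter_object_prune(db, num_objects=3)
print(kept)   # only the two distinctive descriptors remain
```

In this toy database the descriptor that appears in all three objects has an entropy of log2(3) bits, which exceeds the 1 bit threshold, so all three copies are removed.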
Using the information-theoretic approach to pruning the database as described above, the achievable database size reduction is lower bounded by
Besides database reduction, the information-optimal approach provides a formal framework to incrementally add or remove descriptors from the pruned set given feedback from a client mobile platform about recognition confidence level, or given system constraints, such as memory usage on the client, etc.
Using the information-optimal approach with the ZuBuD database, which has 201 objects and 5 views per object, from which approximately 1 million SIFT features were extracted, the feature dataset was reduced by approximately 8× to 40× based on a distance threshold of 0.4 for intra-object pruning and inter-object pruning and using 20 clusters (kc) per database image view and 3 to 15 keypoints (kl) per cluster, without significantly reducing recognition accuracy.
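The location based clustering and selection (340), with kc clusters per view and kl keypoints retained per cluster, can be sketched for one image view as follows. This is a minimal sketch under assumptions: plain k-means on keypoint coordinates is used for clustering, and the keypoints retained in each cluster are those with the largest weight, which is one possible selection rule. The function names are hypothetical.

```python
import math

def kmeans_labels(points, k, iterations=20):
    """Plain Lloyd's k-means on 2-D points (non-empty list); returns one label per point."""
    k = min(k, len(points))
    step = len(points) // k
    # Deterministic initialization: k evenly spaced points serve as initial centres.
    centres = [points[i * step] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: math.dist(p, centres[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centres[c] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return labels

def cluster_and_select(keypoints, k_c, k_l):
    """keypoints: list of (x, y, weight). Keep the k_l heaviest keypoints per cluster."""
    labels = kmeans_labels([(x, y) for x, y, _ in keypoints], k_c)
    kept = []
    for c in sorted(set(labels)):
        members = [kp for kp, lab in zip(keypoints, labels) if lab == c]
        members.sort(key=lambda kp: kp[2], reverse=True)   # assumed ranking by weight
        kept.extend(members[:k_l])
    return kept

kps = [(0, 0, 0.1), (1, 0, 0.5), (0, 1, 0.3),
       (10, 10, 0.2), (11, 10, 0.9), (10, 11, 0.4)]
selected = cluster_and_select(kps, k_c=2, k_l=1)
print(selected)   # [(1, 0, 0.5), (11, 10, 0.9)]
```

Retaining a fixed number of keypoints per spatial cluster keeps the pruned set spread across the view instead of concentrated in one textured region.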
As discussed above, the server 210 may employ a distance comparison approach to perform the database pruning, as opposed to the information-theoretic approach. The distance comparison approach similarly uses intra-object pruning, inter-object pruning, and location based pruning and keypoint clustering, but as illustrated in
Inter-object pruning 320 may then be performed to eliminate the keypoints that repeat across multiple objects. As discussed above, it is desirable to remove repeating keypoint features across multiple objects that might otherwise confuse the classifier. The inter-object pruning, which may be used with the distance comparison approach to pruning the database, identifies keypoint descriptors $f_{i_1,l}$ and $f_{i_2,m}$ (where $l = 1 \ldots K_{i_1}$, $m = 1 \ldots K_{i_2}$) that do not belong to the same object, and checks to determine if the distance, e.g., Euclidean distance, between the features is less than a threshold, i.e., $\lVert f_{i_1,l} - f_{i_2,m} \rVert_{L}$
Using the distance comparison approach with the ZuBuD database, which has 201 objects and 5 views per object, from which approximately 1 million SIFT features were extracted, the feature dataset was reduced by approximately 80% based on threshold values τδ=0.15. Using the pruned database as a reference database, 115 query images provided as part of ZuBuD, were tested and a 100% recognition accuracy was achieved. Thus, using this approach, the size of the SIFT keypoint database may be reduced by approximately 80% without sacrificing object recognition accuracies.
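The distance comparison form of inter-object pruning can be sketched as follows. The sketch assumes that a descriptor is dropped whenever a descriptor from a different object lies within the threshold, so both members of a close cross-object pair are removed; it uses a brute-force search for clarity. The function name and the toy data are hypothetical.

```python
import math

def prune_cross_object_duplicates(database, tau):
    """Drop every descriptor that lies within tau of a descriptor from another object.

    database: list of (object_id, descriptor) pairs.
    """
    kept = []
    for obj_a, f_a in database:
        repeated = any(obj_b != obj_a and math.dist(f_a, f_b) < tau
                       for obj_b, f_b in database)
        if not repeated:
            kept.append((obj_a, f_a))
    return kept

db = [(0, [0.0, 0.0]), (1, [0.05, 0.0]),    # near-duplicates across objects 0 and 1
      (0, [1.0, 0.0]), (1, [0.0, 1.0])]     # distinctive descriptors
pruned = prune_cross_object_duplicates(db, tau=0.15)
print(pruned)   # [(0, [1.0, 0.0]), (1, [0.0, 1.0])]
```

For a database of the size discussed above, the quadratic brute-force loop would normally be replaced by a search tree, but the pruning rule itself stays the same.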
Referring back to
The mobile platform 100 retrieves an image captured by the camera 120 (406) and extracts features and generates their descriptors (408). As discussed above, features may be extracted using Scale Invariant Feature Transform (SIFT) or other well known techniques, such as Speeded Up Robust Features (SURF), Gradient Location-Orientation Histogram (GLOH), or Compressed Histogram of Gradients (CHoG). In general, SIFT keypoint extraction and descriptor generation includes the following steps: a) the input color images are converted to grayscale and a Gaussian pyramid is built by repeated convolution of the grayscale image with Gaussian kernels of increasing scale, the resulting images forming the scale-space representation, b) difference of Gaussian (also known as DoG) scale-space images are computed, and c) local extrema of the DoG scale-space images are computed and used to identify the candidate keypoint parameters (location and scale) in the original image space. The steps (a) to (c) are repeated for various upsampled and downsampled versions of the original image. For each candidate keypoint, an image patch around the point is extracted and the direction of its significant gradient is found. The patch is then rotated according to the dominant gradient orientation and keypoint descriptors are computed. The descriptor generation is done by 1) splitting the image patch around the keypoint location into D1×D2 regions, 2) binning the gradients into D3 orientation bins, and 3) vectorizing the histogram values to form the descriptor of dimension D1·D2·D3. The traditional SIFT description uses D1=D2=4 and D3=8, resulting in a 128-dimensional descriptor. After the SIFT keypoints and descriptors are generated, they are stored in a SIFT database which is used for the matching process.
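The three descriptor generation steps (splitting the patch into D1×D2 regions, binning gradients into D3 orientation bins, and vectorizing) can be illustrated with a deliberately simplified sketch. It assumes the patch has already been rotated to its dominant orientation and omits the Gaussian weighting, interpolation and normalization that full SIFT implementations apply. The function name `patch_descriptor` is hypothetical.

```python
import math

def patch_descriptor(patch, d1=4, d2=4, d3=8):
    """Gradient-orientation histogram descriptor of length d1 * d2 * d3.

    patch: square list of rows of grayscale values.
    """
    size = len(patch)
    hist = [[[0.0] * d3 for _ in range(d2)] for _ in range(d1)]
    for r in range(1, size - 1):
        for c in range(1, size - 1):
            # Central-difference gradient at pixel (r, c).
            dx = patch[r][c + 1] - patch[r][c - 1]
            dy = patch[r + 1][c] - patch[r - 1][c]
            magnitude = math.hypot(dx, dy)
            angle = math.atan2(dy, dx) % (2.0 * math.pi)
            region_r = r * d1 // size              # step 1: which region
            region_c = c * d2 // size
            bin_index = int(angle / (2.0 * math.pi) * d3) % d3   # step 2: which bin
            hist[region_r][region_c][bin_index] += magnitude
    # Step 3: vectorize the d1 x d2 histograms of d3 bins each.
    return [v for row in hist for cell in row for v in cell]

# A horizontal intensity ramp: every gradient points along +x (orientation bin 0).
ramp = [[float(c) for c in range(16)] for _ in range(16)]
descriptor = patch_descriptor(ramp)
print(len(descriptor))   # 128
```

For the ramp patch, all gradient energy falls into the first orientation bin of each of the 16 regions, so only every eighth entry of the 128-dimensional vector is nonzero.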
The extracted features are matched against the downloaded local database and confidence levels are generated per query descriptor (410) as discussed below. The confidence level for each descriptor can be a function of the posterior probability, distance ratios, distances, or some combination thereof. Outliers are then removed (420) using the confidence levels, with the remaining objects considered a match to the query image as discussed below. The outlier removal may include geometric filtering in which the geometry transformation between the query image and the reference matching image may be determined. The result may be used to render a user interface, e.g., render 3D game characters/actions on the input image or augment the input image on a display, using the metadata for the object that is determined to be matching (430).
The nearest neighbor descriptors for Qj are binned with respect to the object identification, e.g., denoted by fi,n, where i is the object identification and n is the nearest neighbor index (411a). The resulting nearest neighbors and distance measures binned with respect to the object are provided to a confidence level calculation block (418) as well as to determine the quality of the match (412), which may be determined using a posterior probability (412a), distance ratios (412b), or distances (412c) as illustrated in
The resulting posterior probability is provided to the confidence level calculation block (418) as well as to compute the probability p(Q=i) (413) indicating how likely the query image is to belong to one of the objects in the database as follows:
The probability p(Q=i) is provided to create the object candidate set (416). The posterior probability $p(Q = i \mid f = f_{i,n})$ can also be used in a client feedback process to provide useful information that can improve pruning.
Additionally, instead of using the posterior probability (412a), the quality of the match between the retrieved nearest neighbors and the query keypoint descriptors may be determined based on a distance ratio test (412b). The distance ratio test is performed by identifying two nearest neighbors based on Euclidean distance between the d-dimensional SIFT descriptors (d=128 for traditional SIFT). The ratio of distances of the query keypoint to the closest neighbor and the next closest neighbor is then computed and a match is established if the distance ratio is less than a pre-selected threshold. A randomized kd-tree, or any such search tree method, may be used to perform the nearest neighbor search. At the end of this step, a list of pairs of reference object and input image keypoints (and their descriptors) is identified and provided. It is noted that the distance ratio test will have a certain false alarm rate given the choice of threshold. For example, for one specific image, a threshold equal to 0.8 resulted in a 4% false alarm rate. Reducing the threshold allows reduction of the false alarm rate but results in fewer descriptor matches and reduces confidence in declaring a potential object match. The confidence level (418) may be computed based on distance ratios, e.g., by generating numbers between 0 (worst) and 100 (best) depending upon the distance ratio, for example, using a one-to-one mapping function, where a confidence level of 0 would correspond to a distance ratio close to 1, and a confidence level of 100 would correspond to a distance ratio close to 0.
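The distance ratio test and the mapping from ratio to confidence level can be sketched as follows. The sketch uses a brute-force search in place of a kd-tree, and the linear mapping in `confidence_from_ratio` is only one possible one-to-one mapping; both function names are hypothetical. At least two reference descriptors are required.

```python
import math

def ratio_test(query, references, threshold=0.8):
    """Return (index_of_best_match, ratio), or None if the match is rejected."""
    ranked = sorted(range(len(references)),
                    key=lambda i: math.dist(query, references[i]))
    best, second = ranked[0], ranked[1]
    d1 = math.dist(query, references[best])
    d2 = math.dist(query, references[second])
    ratio = d1 / d2 if d2 > 0.0 else 1.0
    return (best, ratio) if ratio < threshold else None

def confidence_from_ratio(ratio):
    """Linear map: ratio near 0 gives 100 (best), ratio near 1 gives 0 (worst)."""
    return 100.0 * (1.0 - min(max(ratio, 0.0), 1.0))

query = [0.0, 0.0]
clear = ratio_test(query, [[0.1, 0.0], [1.0, 0.0], [0.9, 0.0]])
ambiguous = ratio_test(query, [[1.0, 0.0], [0.0, 1.05]])
print(clear[0], round(confidence_from_ratio(clear[1]), 1))   # 0 88.9
print(ambiguous)                                             # None
```

In the second case the two closest references are almost equally distant from the query, so the ratio is close to 1 and the match is rejected.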
The quality of the match (412) between the retrieved nearest neighbors and the query keypoint descriptors may also be determined based on distance (412c). The distance test is performed, e.g., by identifying the Euclidean distance between keypoint descriptors from the query image and the reference database, where any two keypoint descriptors fi,l and fi,m (where l, m=1 . . . K) are determined to be a match if the Euclidean distance between the features is less than a threshold, i.e., |fi,l-fi,m∥L
The potential matching object set is selected (416) from the top matches, i.e., the objects with the highest probability p(Q=i). Additionally, a confidence measure can be calculated based on the probabilities, for example, using entropy which is given by:
The object candidate set and confidence measure are used in the outlier removal (420). If the confidence score from equation 8 is less than a pre-determined threshold, then the query object can be presumed to belong to a new or unseen content category, which can be used in a client feedback process for an incremental learning stage, discussed below. Note that in the above example, the confidence score is defined based on the classification accuracy, but it could also be a function of other quality metrics.
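By way of illustration only, selecting the top candidates and deriving an entropy-based confidence measure from the probabilities p(Q=i) may be sketched as follows. The exact form of equation 8 is not reproduced here; the sketch assumes the standard Shannon entropy, normalized so that a score of 1 indicates that a single object dominates and a score of 0 indicates a uniform distribution. The function names `top_candidates` and `entropy_confidence` are hypothetical.

```python
import math


def top_candidates(probs, k=3):
    """Indices of the k objects with the highest probability p(Q=i)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]


def entropy_confidence(probs):
    """Confidence in [0, 1] from normalized Shannon entropy (assumed form).

    Returns 1.0 when one object carries all the probability mass and 0.0
    when all objects are equally likely.
    """
    total = sum(probs)
    if len(probs) < 2 or total <= 0:
        raise ValueError("need at least two objects with positive total mass")
    p = [x / total for x in probs]
    entropy = -sum(x * math.log(x) for x in p if x > 0)
    return 1.0 - entropy / math.log(len(p))
```

Under this assumed form, a score below a pre-determined threshold would flag the query as possibly belonging to a new or unseen content category.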
A confidence level computation (418) for each query descriptor is performed using the binned nearest neighbors and distance measures from (411a) and, e.g., the posterior probabilities from (412a). The confidence level computation indicates the importance of the contribution of each query descriptor towards overall recognition. The confidence level may be denoted by Ci(Qj), where Ci(Qj) is a function of p(Q=i|f=Qj) and the distances to the nearest neighbors fi,n. The probabilities p(Q=i|f=Qj) may be generalized by considering i as a two-tuple with the first element representing the object identification and the second element representing the view identification.
To refine the candidate set from (416), an outlier removal process is used (420). The outlier removal (420) receives the top candidates from the created candidate set (416) as well as the stored confidence level Ci(Qj) for each query keypoint descriptor, which is used to initialize the outlier removal steps, i.e., by providing a weight to the query descriptors that are more important in the object recognition task. The confidence level can be used to initialize RANSAC based geometry estimation with the keypoints that matched well or contributed well to the recognition so far. The outlier removal process (420) may include distance filtering (422), orientation filtering (424), or geometric filtering (426), or any combination thereof. Distance filtering (422) includes identifying the number of keypoint matches between the query and database image for each object candidate and each of its views in the candidate set. The distance filtering (422) may be influenced by the confidence levels determined in (418). The object-view combinations with the maximum number of matches may then be chosen for further processing, e.g., by orientation filtering (424) or geometric filtering (426), or the best match may be provided as the closest object match.
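By way of illustration only, distance filtering (422) may be sketched as a tally of keypoint matches per object-view combination. The sketch assumes that each match contributes its confidence level as a weight, which is only one way the confidence levels from (418) might influence the filtering; passing a confidence of 1.0 for every match reduces it to a plain match count. The function name `distance_filter` and the tuple layout are hypothetical.

```python
from collections import defaultdict


def distance_filter(matches, keep=1):
    """Rank object-view combinations by confidence-weighted match count.

    `matches` is an iterable of (object_id, view_id, confidence), one entry
    per matched query keypoint. Returns the `keep` best combinations as
    ((object_id, view_id), score) pairs, best first.
    """
    scores = defaultdict(float)
    for object_id, view_id, confidence in matches:
        scores[(object_id, view_id)] += confidence
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:keep]
```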
Orientation filtering (424) computes the histogram of the descriptor orientation difference between the query image and the candidate object-view combination in the database and finds the object-view combinations with a large number of inliers that fall within θ0 degrees, where θ0 is a suitably chosen threshold, such as 100 degrees. The object-view combinations within the threshold may then be chosen for further processing, e.g., by distance filtering (422) if orientation filtering is performed first, or by geometric filtering (426). Alternatively, the object-view combination within a suitably tight threshold may be provided as the closest object match.
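By way of illustration only, orientation filtering (424) for a single object-view combination may be sketched as follows. The description does not state what the inliers are measured against; the sketch assumes that a match is an inlier when its orientation difference lies within θ0 degrees of the peak of the histogram. The function names, the 10 degree bin width, and the input layout are hypothetical.

```python
def angle_diff(a, b):
    """Difference a - b in degrees, wrapped to the range [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def orientation_inliers(pairs, theta0=100.0, bin_width=10.0):
    """Indices of matches whose orientation difference is near the peak.

    `pairs` is a list of (query_angle_deg, reference_angle_deg), one entry
    per matched keypoint. A histogram of the wrapped differences is built,
    and matches within `theta0` degrees of the center of the most populated
    bin are kept (assumed interpretation of the threshold).
    """
    if not pairs:
        return []
    diffs = [angle_diff(q, r) for q, r in pairs]
    n_bins = int(round(360.0 / bin_width))
    hist = [0] * n_bins
    for d in diffs:
        hist[int((d + 180.0) // bin_width) % n_bins] += 1
    peak_bin = max(range(n_bins), key=hist.__getitem__)
    peak = -180.0 + (peak_bin + 0.5) * bin_width
    return [i for i, d in enumerate(diffs) if abs(angle_diff(d, peak)) < theta0]
```

The number of indices returned for each object-view combination could then be compared across candidates to select those with a large number of inliers.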
Geometric filtering (426) is used to verify affinity and/or estimate homography. During geometric filtering, a transformation model is fit between the matching keypoint spatial coordinates in the query image and the potential matching images from the database. An affine model may be fit, which incorporates transformations such as translation, scaling, shearing, and rotation. A homography based model may also be fit, where homography defines the mapping between two perspectives of the same object and preserves collinearity of points. In order to estimate the affine and the homography models, a RANdom SAmple Consensus (RANSAC) optimization approach may be used. For example, the RANSAC method is used to fit an affine model to the list of pairs of keypoints that pass the distance ratio test. The set of inliers that pass the affine test may be used to compute the homography and estimate the pose of the query object with respect to a chosen reference database image. If a sufficient number of inliers match from the affinity model and/or homography model, the object is provided as the closest object match. If desired, the geometric transformation model may be used as input to a tracking and augmentation block (430, shown in
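By way of illustration only, fitting an affine model with RANSAC during geometric filtering (426) may be sketched as follows. The sketch covers only the affine step, not the homography or pose estimation. The function names, the iteration count, the inlier tolerance, and the fixed random seed are all assumptions chosen for the example.

```python
import random


def _det3(a):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
            - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
            + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))


def _solve3(m, rhs):
    """Solve a 3x3 linear system by Cramer's rule; None if near-singular."""
    d = _det3(m)
    if abs(d) < 1e-9:
        return None  # the three source points are (nearly) collinear
    out = []
    for col in range(3):
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][col] = rhs[r]
        out.append(_det3(mc) / d)
    return out


def fit_affine(sample):
    """Exact affine model from three ((x, y), (u, v)) correspondences."""
    m = [[x, y, 1.0] for (x, y), _ in sample]
    row_u = _solve3(m, [u for _, (u, v) in sample])
    row_v = _solve3(m, [v for _, (u, v) in sample])
    if row_u is None or row_v is None:
        return None
    return row_u, row_v  # u = a*x + b*y + c, v = d*x + e*y + f


def apply_affine(model, point):
    (a, b, c), (d, e, f) = model
    x, y = point
    return a * x + b * y + c, d * x + e * y + f


def ransac_affine(pairs, iterations=200, tol=3.0, seed=0):
    """Return (model, inlier_indices) for the best affine model found."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    if len(pairs) < 3:
        return best_model, best_inliers
    for _ in range(iterations):
        model = fit_affine(rng.sample(pairs, 3))
        if model is None:
            continue
        inliers = []
        for i, (src, dst) in enumerate(pairs):
            u, v = apply_affine(model, src)
            if (u - dst[0]) ** 2 + (v - dst[1]) ** 2 <= tol ** 2:
                inliers.append(i)
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers
```

The length of the returned inlier list could be compared against a minimum count to decide whether the candidate is provided as the closest object match, and the inlier pairs could be passed on to a homography estimate.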
The mobile platform includes a means for capturing an image, such as camera 120, which may produce still or moving images that are displayed by the mobile platform 100. The mobile platform 100 may also include a means for determining the direction that the viewer is facing, such as orientation sensors 130, e.g., a tilt corrected compass including a magnetometer, accelerometers and/or gyroscopes.
Mobile platform 100 may include a receiver 140 that includes a satellite positioning system (SPS) receiver that receives signals from SPS satellite vehicles 102 (
The orientation sensors 130, camera 120, SPS receiver 140, and wireless transceiver 145 are connected to and communicate with a mobile platform control 150. The mobile platform control 150 accepts and processes data from the orientation sensors 130, camera 120, SPS receiver 140, and wireless transceiver 145 and controls the operation of the devices. The mobile platform control 150 may be provided by a processor 152 and associated memory 154, hardware 156, software 158, and firmware 157. The mobile platform control 150 may also include a means for generating an augmentation overlay for a camera view image, such as an image processing engine 155, which is illustrated separately from processor 152 for clarity, but may be within the processor 152. The image processing engine 155 determines the shape, position and orientation of the augmentation overlays that are displayed over the captured image. It will be understood as used herein that the processor 152 can, but need not necessarily, include one or more microprocessors, embedded processors, controllers, application specific integrated circuits (ASICs), digital signal processors (DSPs), and the like. The term processor is intended to describe the functions implemented by the system rather than specific hardware. Moreover, as used herein the term “memory” refers to any type of computer storage medium, including long term, short term, or other memory associated with the mobile platform, and is not to be limited to any particular type of memory or number of memories, or type of media upon which memory is stored.
The mobile platform 100 also includes a user interface 110 that is in communication with the mobile platform control 150, e.g., the mobile platform control 150 accepts data from and controls the user interface 110. The user interface 110 includes a means for displaying images, such as a digital display 112. The display 112 may further display control menus and positional information. The user interface 110 further includes a keypad 114 or other input device through which the user can input information into the mobile platform 100. In one embodiment, the keypad 114 may be integrated into the display 112, such as a touch screen display. The user interface 110 may also include a microphone and speaker, e.g., when the mobile platform 100 is a cellular telephone. Additionally, the orientation sensors 130 may be used as the user interface by detecting user commands in the form of gestures.
The methodologies described herein may be implemented by various means depending upon the application. For example, these methodologies may be implemented in hardware 156, firmware 157, software 158, or any combination thereof. For a hardware implementation, the processing units may be implemented within one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, micro-controllers, microprocessors, electronic devices, other electronic units designed to perform the functions described herein, or a combination thereof.
For a firmware and/or software implementation, the methodologies may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. Any machine-readable medium tangibly embodying instructions may be used in implementing the methodologies described herein. For example, software codes may be stored in memory 154 and executed by the processor 152. Memory may be implemented within the processor unit or external to the processor unit. As used herein the term “memory” refers to any type of long term, short term, volatile, nonvolatile, or other memory and is not to be limited to any particular type of memory or number of memories, or type of media upon which memory is stored.
For example, software codes 158 may be stored in memory 154 and executed by the processor 152 and may be used to run the processor and to control the operation of the mobile platform 100 as described herein. A program code stored in a computer-readable medium, such as memory 154, may include program code to perform a search of a database using extracted keypoint descriptors from a query image to retrieve neighbors; program code to determine the quality of match for each retrieved neighbor with respect to the associated keypoint descriptor from the query image; program code to use the determined quality of match for each retrieved neighbor to generate an object candidate set; program code to remove outliers from the object candidate set using the determined quality of match for each retrieved neighbor to provide the at least one best match; and program code to store the at least one best match.
If implemented in firmware and/or software, the functions may be stored as one or more instructions or code on a computer-readable medium. Examples include computer-readable media encoded with a data structure and computer-readable media encoded with a computer program. Computer-readable media includes physical computer storage media. A storage medium may be any available medium that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The different curves in
Although the present invention is illustrated in connection with specific embodiments for instructional purposes, the present invention is not limited thereto. Various adaptations and modifications may be made without departing from the scope of the invention. Therefore, the spirit and scope of the appended claims should not be limited to the foregoing description.