The present disclosure relates to image recognition or search techniques. More precisely, the present disclosure relates to pruning of local descriptors that are encoded for image searching.
Various computer or machine based applications offer the ability to perform image searches. Image searching involves searching for an image or images based on the input of an image or images. Image searching is relevant in many fields, including the fields of computer vision, object recognition, video tracking, and video-based location determination/mapping.
Once the local image patches are computed at block 122, block 123 computes a local descriptor for each of the image patches. Each local descriptor may be computed using an algorithm applied to each patch, such as the Scale Invariant Feature Transform (SIFT) algorithm. The result is a local descriptor vector (e.g., a SIFT vector $x_i$ of size 128) for each patch. Once a local descriptor vector has been computed for each image patch, the resulting set of local descriptor vectors is provided for further processing at block 124. Examples of discussions of local descriptor extraction include: David Lowe, Distinctive Image Features From Scale Invariant Keypoints, 2004; and K. Mikolajczyk, A Comparison of Affine Region Detectors, 2006.
Block 130 may receive the local descriptors computed at block 120 and encode these local descriptors into a single global image feature vector. An example of a discussion of such encoders is in Ken Chatfield et al., The devil is in the details: an evaluation of recent feature encoding methods, 2011. Examples of image feature encoders include: bag-of-words encoder (an example of which is Josef Sivic et al., Video Google: A Text Retrieval Approach to Object Matching in Videos, 2003); Fisher encoder (an example of which is Florent Perronnin et al., Improving the Fisher Kernel for Large-Scale Image Classification, 2010), and VLAD encoder (example of which is Jonathan Delhumeau et al., Revisiting the VLAD image representation, 2013).
These encoders depend on specific models of the distribution of the local descriptors obtained in block 120. For example, the Bag-of-Words and VLAD encoders use a codebook model obtained using K-means, while the Fisher encoder is based on a Gaussian Mixture Model (GMM).
Block 140 may receive the global feature vector computed at block 130 and perform image searching on it. Image search techniques can be broadly split into two categories: semantic search and image retrieval.
Block 142 may perform the image retrieval algorithm based on the global image feature vector of the input image and the feature vectors in the Large Feature Database 145. For example, block 142 may calculate the Euclidean distance between the global image feature vector of the input image and each of the feature vectors in the Large Feature Database 145. The result of the computation at block 142 may be output at the Output Search Results block 146. If multiple results are returned, the results may be ranked and the ranking may be provided along with the results. The ranking may be based on the distance between the input global feature vector and the feature vectors of the resulting images (e.g., the rank may increase with increasing distance).
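A minimal sketch of such a distance-based ranking is shown below. The function and array names are assumptions chosen for illustration, and plain Euclidean distance over the global feature vectors is used as described above; this is a sketch, not a definitive implementation of blocks 142-146.

```python
import numpy as np

def rank_by_euclidean_distance(query_vector, database_vectors):
    """Rank database images by Euclidean distance to the query's global feature vector.

    query_vector:     (d,) global feature vector of the input image.
    database_vectors: (N, d) matrix of global feature vectors, one row per database image.
    Returns the database indices sorted from closest to farthest, and the sorted distances.
    """
    distances = np.linalg.norm(database_vectors - query_vector, axis=1)
    order = np.argsort(distances)            # smaller distance -> better rank
    return order, distances[order]

# Toy usage: 5 database images with 4-dimensional global features.
rng = np.random.default_rng(0)
database = rng.normal(size=(5, 4))
query = rng.normal(size=4)
order, dists = rank_by_euclidean_distance(query, database)
print(order, dists)
```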
Generally, in image search retrieval methods the search system may be given an image of a scene, and the system aims to find all images of the same scene, even images of the same scene that were altered due to a task-related transformation. Examples of such transformations include changes in scene illumination, image cropping, scaling, wide changes in the perspective of the camera, high compression ratios, or picture-of-video-screen artifacts.
The Large Feature Database 160 may consist of global feature vectors of each of the images in a Large Image Search Database 158. The Large Image Search Database 158 may contain all of the images searched during an image retrieval search. The compute feature vector for each image block 159 may compute a global feature vector for each image in the Large Image Search Database 158 in accordance with the techniques described relative to
The problem with the existing extraction of local image descriptors and encoding of such local descriptors is that each local descriptor is assigned either to one of the codewords from the K-means codebook (in the case of bag-of-words or VLAD) or to the GMM mixture components via soft-max weights (in the case of the Fisher encoder). This poses a problem because there are local descriptors that are too far away from all codewords or GMM mixture components for the assignment to be reliable. Despite this limitation, existing schemes must assign these too-faraway descriptors. The assignment of these too-faraway descriptors results in a degradation of the quality of the search based on such encodings. Therefore, there is a need to prune these too-faraway local descriptors in order to improve the quality of the resulting search results.
Avila et al., Pooling in image representation: the visual codeword point of view, 2013 and Avila et al., BOSSA: Extended Bow Formalism for Image Classification, 2011 discuss keeping a histogram of distances between local descriptors found in an image and the codewords of the codebook; however, this does not cure the existing problems. First, the Avila methods do not relate to pruning local descriptors. Instead, the Avila methods relate to creating sub-bins per Voronoi cell by defining up to five distance thresholds from the cell's codeword. Avila does not consider using Mahalanobis metrics to compute the distances. The Avila methods are also limited to bag-of-words aggregators. Moreover, the Avila methods do not consider soft weight extensions of local descriptor pruning.
Likewise, the following do not describe a model of distribution of local descriptors (e.g., a codebook or a GMM model) and do not describe pruning in the local descriptor space (e.g., on a per-cell basis or otherwise): U.S. Pat. No. 8,705,876; Zhiang Wu et al., A novel noise filter based on interesting pattern mining for bag-of-features images, 2013; Sambit Bakshi et al., Postmatch pruning of SIFT pairs for iris recognition; Saliency Based Space Variant Descriptor; Dounia Awad et al., Saliency Filtering of SIFT Detectors: Application to CBIR; Eleonora Vig et al., Space-variant Descriptor Sampling for Action Recognition Based on Saliency and Eye Movements.
There is a need for a mechanism that allows the pruning of local descriptors in order to improve searching performance.
An aspect of present principles is directed to methods, apparatus and systems for processing an image for image searching. The apparatus or systems may include a memory, a processor, and a local descriptor pruner configured to prune at least a local descriptor based on a relationship of the local descriptor and a codeword to which the local descriptor is assigned, wherein the local descriptor pruner assigns a weight value for the local descriptor based on the relationship of the local descriptor and the codeword and wherein the weight value is utilized by an image encoder during encoding. The method may include pruning a local descriptor based on a relationship of the local descriptor and a codeword to which the local descriptor is assigned, wherein pruning of the local descriptor includes assigning a weight value for the local descriptor based on the relationship of the local descriptor and the codeword and wherein the weight value is utilized during encoding of the pruned local descriptor.
An aspect of present principles is directed to the local descriptor pruner assigning, based on a determination by the local descriptor pruner, a hard weight value that is either 1 or 0, or a soft weight value that is between 0 and 1. In one example, the soft weight value is determined based on either exponential weighting or inverse weighting. The weight may be based on a distance between the local descriptor and the codeword. Alternatively, the weight may be based on the following equation: $w_k(x) = [[(x - c_k)^T M_k^{-1} (x - c_k) \le \gamma \sigma_k^2]]$, wherein $k$ is an index value, $x$ is the local descriptor, $c_k$ is the assigned codeword, $\gamma$, $\sigma_k$, and $M_k$ are parameters computed prior to initialization, and $[[\cdot]]$ evaluates to 1 if the condition is true and 0 otherwise. Alternatively, the weight may be based on a probability value determined from a GMM model evaluated at the local descriptor. Alternatively, the weight may be based on a parameter that is computed from a training set of images. The image encoder may be at least one selected from the group of a Bag of Words encoder, a Fisher encoder or a VLAD encoder. There may further be an image searcher configured to retrieve at least an image result based on the results of the image encoder. There may further be a local descriptor extractor configured to compute at least an image patch and to extract a local descriptor for the image patch.
The features and advantages of the present invention may be apparent from the detailed description below when taken in conjunction with the Figures described below:
Examples of the present invention relate to an image processing system that includes a local descriptor pruner for pruning local descriptors based on a relationship between the local descriptor and a codeword to which the local descriptor is assigned. The local descriptor pruner assigns a weight value for the local descriptor based on the relationship of the local descriptor and the codeword, and this weight value is then utilized during encoding.
Examples of the present invention also relate to a method for pruning local descriptors based on a relationship between the local descriptor and a codeword. The method assigns a weight value for the local descriptor based on the relationship of the local descriptor and the codeword and the weight value is utilized by an image encoder during encoding.
In one example, the local descriptor pruner or the pruning method can assign a hard weight value that is either 1 or 0. In one example, the local descriptor pruner or the pruning method can assign a soft weight value that is between 0 and 1. In one example, the local descriptor pruner or the pruning method can determine the soft weight value based on either exponential weighting or inverse weighting. In one example, the local descriptor pruner or the pruning method can determine the weight based on a distance between the local descriptor and the codebook cell. In one example, the local descriptor pruner or the pruning method can determine the weight based on the following equation: $w_k(x) = [[(x - c_k)^T M_k^{-1} (x - c_k) \le \gamma \sigma_k^2]]$, where $k$ is an index value, $x$ is the local descriptor, $c_k$ is the assigned codeword, $\gamma$, $\sigma_k$, and $M_k$ are parameters computed prior to initialization, and $[[\cdot]]$ evaluates to 1 if the condition is true and 0 otherwise. In one example, the local descriptor pruner or the pruning method can determine the weight based on a probability value determined from a GMM model evaluated at the local descriptor. In one example, the local descriptor pruner or the pruning method can determine the weight based on a parameter that is computed from a training set of images.
In one example, the image encoder is at least one of a Bag of Words encoder, a Fisher Encoder or a VLAD encoder. The encoding of the method can be based on at least one of a Bag of Words encoder, a Fisher Encoder or a VLAD encoder.
In one example, the system or method further comprise an image searcher or an image searching method for retrieving an image based on the results of the image encoder or image encoding, respectively.
In one example, the system or method further comprise a local descriptor extractor or a local descriptor extracting method for computing at least an image patch and configured to extract a local descriptor for the image patch.
Scalars, vectors and matrices may be denoted by standard, underlined, and underlined uppercase typeface, respectively (e.g., scalar $a$, vector $\underline{a}$ and matrix $\underline{A}$). A variable $v_k$ may be used to denote a vector from a sequence $v_1, v_2, \ldots, v_N$, and $v^k$ to denote the k-th coefficient of vector $v$. The notation $[a_k]_k$ (respectively, $[a^k]_k$) denotes concatenation of the vectors $a_k$ (scalars $a^k$) to form a single column vector. The notation $[[\cdot]]$ denotes the evaluation to 1 if the condition is true and 0 otherwise.
The present invention may be implemented on any electronic device or combination of electronic devices. For example, the present invention may be implemented on any of variety of devices including a computer, a laptop, a smartphone, a handheld computing system, a remote server, or on any other type of dedicated hardware. Various examples of the present invention are described below with reference to the figures.
Exemplary Process of Image Searching with Local Descriptor Pruning
The Extract Local Descriptors block 220 may receive the input image from Input Image block 210. The Extract Local Descriptors block 220 may extract local descriptors in accordance with the processes described in connection with
The Extract Local Descriptors block 220 may compute one or more patches for the input image. In one example, the image patches may be computed using a dense detector, an example of which is shown in
For each image patch, the Extract Local Descriptors block 220 extracts a local descriptor using a local descriptor extraction algorithm. For example, the Extract Local Descriptors block 220 may extract local descriptors in accordance with the processes described in connection with
In one example, the Extract Local Descriptors block 220 extracts the local descriptors for each image patch by using a Scale Invariant Feature Transform (SIFT) algorithm on each image patch resulting in a corresponding SIFT vector for each patch. The SIFT vector may be of any number of entries. In one example, the SIFT vector may have 128 entries. In one example, the Extract Local Descriptors block 220 may compute N image patches for an image (image patch i, where i=1, 2, 3 . . . N). For each image patch i, a SIFT vector of size 128 is computed. At the end of processing, the Extract Local Descriptors block 220 outputs N SIFT local descriptor vectors, each SIFT local descriptor vector of size 128. In another example, the Extract Local Descriptors block 220 may use an algorithm other than the SIFT algorithm, such as, for example: Speeded Up Robust Features (SURF), Gradient Location and Orientation Histogram (GLOH), Local Energy based Shape Histogram (LESH), Compressed Histogram of Gradients (CHoG); Binary Robust Independent Elementary Features (BRIEF), Discriminative Binary Robust Independent Elementary Features (D-BRIEF) or the Daisy descriptor.
The output of the Extract Local Descriptors block 220 may be a set of local descriptor vectors. In one example, the output of the Extract Local Descriptors block 220 may be a set $I = \{x_i \in \mathbb{R}^d\}_i$ of local SIFT descriptor vectors, where each $x_i$ represents a local descriptor vector computed for a patch of the inputted image.
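The following sketch illustrates dense SIFT-style local descriptor extraction. It assumes OpenCV's `cv2.SIFT_create` (available in opencv-python 4.4 and later) and a simple fixed-step grid of keypoints; the grid step and patch size are illustrative assumptions, not values prescribed by the present disclosure.

```python
import cv2
import numpy as np

def extract_dense_sift(gray_image, step=8, patch_size=16):
    """Compute SIFT descriptors on a dense grid of patches of a grayscale image.

    Returns an (N, 128) array: one 128-dimensional local descriptor per patch.
    """
    h, w = gray_image.shape
    keypoints = [cv2.KeyPoint(float(x), float(y), float(patch_size))
                 for y in range(patch_size // 2, h - patch_size // 2, step)
                 for x in range(patch_size // 2, w - patch_size // 2, step)]
    sift = cv2.SIFT_create()
    _, descriptors = sift.compute(gray_image, keypoints)
    return descriptors

# Toy usage on a synthetic grayscale image.
image = (np.random.default_rng(0).random((128, 128)) * 255).astype(np.uint8)
descriptors = extract_dense_sift(image)
print(descriptors.shape)  # (N, 128)
```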
The Prune Local Descriptors block 230 receives the local descriptors from the Extract Local Descriptors block 220. The Prune Local Descriptors block 230 prunes the received local descriptors to remove those that are too far away from either the codewords or the GMM mixture components of an encoder. The pruning of such too-far-away local descriptors prevents degradation of image search quality. The present invention thus allows the return of more reliable image search results by pruning local descriptors that are too far away to be visually informative. This is particularly beneficial in multi-dimensional descriptor spaces, and especially in high-dimensional local descriptor spaces, because in those spaces cells are almost always unbounded, meaning that they have infinite volume. Yet only a part of this volume is visually informative. The present invention allows the system to isolate this visually informative information by pruning non-visually-informative local descriptors.
The Prune Local Descriptors block 230 may employ a local-descriptor pruning method applicable to any subsequently used encoding method (BOW, VLAD and Fisher). In one example, the Prune Local Descriptors block 230 may receive a signal indicating the encoder that is utilized. Alternatively, the Prune Local Descriptors block 230 may prune the local descriptor vectors independently of the subsequent encoding method. Generally, the Prune Local Descriptors block 230 may prune local descriptor vectors for any feature encoding method based on local descriptors where each local descriptor is related to a cell $C_k$ or mixture component/soft cell $(\beta_k, c_k, \Sigma_k)$, where $k$ denotes the index. In one example, cell $C_k$ denotes the Voronoi cell $\{x \mid x \in \mathbb{R}^d,\, k = \arg\min_j \|x - c_j\|\}$ associated with codeword $c_k$. In another example, $(\beta_i, c_i, \Sigma_i)$ denotes the soft cell of the i-th GMM component, where $\beta_i$ is the prior weight, $c_i$ is the mean vector, and $\Sigma_i$ is the covariance matrix (assumed diagonal).
In one example, a cell $C_k$ may denote the Voronoi cell $\{x \mid x \in \mathbb{R}^d,\, k = \arg\min_j \|x - c_j\|\}$ associated with a codeword $c_k$. The codewords $c_k$ may be learned using a set of training SIFT (or any other type of local descriptor) vectors from a set of training images and are kept fixed during encoding. The learning of the codewords $c_k$ may be performed at an initialization stage using K-means, where, for example, each codeword may be computed as the average of all the SIFT vectors assigned to cell number $k$. For example, for a codeword $c_1$, each SIFT vector $x_i$ ($i = 1, 2, 3, 4, \ldots$) that is closer to $c_1$ than to any other $c_k$, where $k$ is a number other than 1, is assigned to cell number 1. Once all the $c_k$ are computed, the process is repeated until convergence, since changing the $c_k$ changes which SIFT vectors are closest to which $c_k$.
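A hedged sketch of this codebook learning step follows. It uses scikit-learn's `KMeans` as a stand-in for the K-means procedure described above; the codebook size and the function names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_codebook(training_descriptors, num_codewords=64, seed=0):
    """Learn codewords c_k by running K-means over pooled training local descriptors.

    training_descriptors: (M, d) array of local descriptors from a set of training images.
    Returns a (num_codewords, d) array whose row k is the codeword c_k (center of cell C_k).
    """
    kmeans = KMeans(n_clusters=num_codewords, n_init=10, random_state=seed)
    kmeans.fit(training_descriptors)
    return kmeans.cluster_centers_

# Toy usage with random 128-dimensional "SIFT-like" training descriptors.
rng = np.random.default_rng(0)
train = rng.normal(size=(5000, 128))
codebook = learn_codebook(train, num_codewords=64)
print(codebook.shape)  # (64, 128)
```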
In one example, each soft cell $C_i$ is defined by the parameters $\beta_i$ (prior weight), $c_i$ (mean vector), and $\Sigma_i$ (covariance matrix). These parameters for all the cells $i = 1, 2, 3, \ldots, L$ are the output of a GMM learning algorithm implemented, for example, using standard approaches like the Expectation Maximization algorithm. When pruning descriptors based on GMM models, the same approach used for hard cells can be used: soft and hard weights $w_i(x)$ can be computed based on the distance between $x$ and $c_i$. An alternate hard pruning approach tailored to GMM models is to apply a threshold (learned experimentally so as to maximize the mAP on a training set) on the probability value $p(x)$ produced by the GMM model at the point $x$. A soft-pruning approach might instead use the probability itself or a mapping of this probability. A possible mapping is $p(x)^a$ for some value of $a$ between 0 and 1.
In one example, the Prune Local Descriptors block 230 prunes the local descriptors of the inputted image based on a determination of whether the local descriptors are too far away from their assigned cells or soft cells. For example, the block 230 determines whether the local descriptors are too far from the codeword of cell $C_k$ or a mixture component $(\beta_k, c_k, \Sigma_k)$ (soft cell). In one example, the Prune Local Descriptors block 230 may prune the local descriptors by removing the local descriptors whose distance to the codeword $c_k$ at the center of the containing cell $C_k$ exceeds a threshold.
In one example, the pruning process may receive a codebook including codewords relating to cells or soft cells at block 232. The codebook may either be received from local storage or through a communication link with a remote location. The codebook may be initialized at or before the initialization of the pruning process. In one example, a codebook $\{c_k\}_k$ defines Voronoi cells $\{C_k\}_k$, where $k$ denotes the index of the cell. In another example, a codebook may include soft cells $C_i$ defined by the parameters $\beta_i$ (prior weight), $c_i$ (mean vector), and $\Sigma_i$ (covariance matrix).
In one example, the pruning process assigns at block 233 each local descriptor to a cell or a soft cell received at block 232. In one example, the pruning process may assign each local descriptor to a cell by locating the cell whose codeword has the closest Euclidean distance to the local descriptor.
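A small sketch of this nearest-codeword assignment is given below; the names are assumed for illustration, and the computation simply takes the arg-min of Euclidean distances to the codewords.

```python
import numpy as np

def assign_to_cells(descriptors, codebook):
    """Assign each local descriptor to the Voronoi cell whose codeword is closest.

    descriptors: (N, d) local descriptors of one image.
    codebook:    (L, d) codewords c_1..c_L.
    Returns an (N,) array of cell indices k = argmin_j ||x - c_j||.
    """
    # Pairwise Euclidean distances between descriptors and codewords: shape (N, L).
    distances = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    return np.argmin(distances, axis=1)

# Toy usage.
rng = np.random.default_rng(0)
cells = assign_to_cells(rng.normal(size=(10, 16)), rng.normal(size=(4, 16)))
print(cells)
```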
In one example, the assigned local descriptors are pruned at block 234. In one example, the pruning process at block 234 evaluates each local descriptor to determine whether that local descriptor is too far away from its assigned cell or soft cell. In one example, the pruning process determines whether the local descriptor is too far away based on whether the distance between that local descriptor and the center or codeword of its assigned cell or soft cell exceeds a calculated or predetermined threshold. In an illustrative example, if a local descriptor is assigned to cell no. 5, the pruning process at block 234 may test whether the Euclidean distance between that local descriptor vector and codeword vector no. 5 exceeds a threshold. In another example, the pruning process may determine a probability value for the local descriptor relative to a cell or a soft cell. The pruning process may determine whether the probability value is below or above a certain threshold and prune local descriptors based on this determination. In an illustrative example, a GMM model may yield a probability value for a local descriptor $x$, where the probability value is between 0 and 1. In one example, the pruning process may prune the local descriptor $x$ if the probability value is lower than a certain threshold (e.g., less than thresh = 0.01). The value of this threshold can be determined experimentally using a training set.
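The GMM-probability variant just described might be sketched as follows. The use of scikit-learn's `GaussianMixture`, the use of the GMM density for p(x), and the threshold value in the toy call are all assumptions for illustration; as noted above, the threshold would normally be chosen experimentally on a training set.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def prune_by_gmm_probability(descriptors, gmm, threshold=0.01):
    """Keep only local descriptors whose GMM value p(x) reaches the threshold.

    descriptors: (N, d) local descriptors of one image.
    gmm:         a fitted GaussianMixture with diagonal covariances.
    Returns the retained descriptors and a hard 0/1 weight per descriptor.
    """
    p = np.exp(gmm.score_samples(descriptors))  # GMM density evaluated at each descriptor x
    weights = (p >= threshold).astype(float)    # hard pruning weights
    return descriptors[weights > 0], weights

# Toy usage: fit a small diagonal-covariance GMM on random training descriptors.
rng = np.random.default_rng(0)
train = rng.normal(size=(2000, 16))
gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(train)
kept, w = prune_by_gmm_probability(rng.normal(size=(50, 16)), gmm, threshold=1e-12)
print(kept.shape, int(w.sum()))
```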
In one example, each local descriptor may be pruned by assigning a hard weight value (1 or 0) based on whether the local descriptor exceeds a threshold distance between the local descriptor and its assigned cell or soft cell. Alternatively, the local descriptors may be pruned by assigning a soft weight value (between 0 and 1) to each local descriptor based on the distance between the local descriptor and its assigned cell or soft cell.
In one example, each local descriptor $x$ may be pruned based on whether the distance between local descriptor $x$ and its assigned codeword $c_k$ exceeds the threshold determined by the following distance-to-$c_k$ condition:
$(x - c_k)^T M_k^{-1} (x - c_k) \le \gamma \sigma_k^2$. (Equation 1)
The parameters $\gamma$, $\sigma_k$, and $M_k$ may be computed prior to initialization and may be either stored locally or received via a communication link.
In one example, the value of $\gamma$ is determined experimentally by cross-validation and the parameter $\sigma_k$ is computed from the variance of a training set of local descriptors as follows:
$\sigma_k^2 = \operatorname{mean}_{x \in \mathcal{T} \cap C_k} \left( (x - c_k)^T M_k^{-1} (x - c_k) \right)$ (Equation 2)
where $\mathcal{T}$ denotes the set of training local descriptors.
In one example, the matrix $M_k$ can be any of the following:
Anisotropic $M_k$: the empirical covariance matrix computed from $\mathcal{T} \cap C_k$;
Axis-aligned $M_k$: the same as the anisotropic $M_k$, but with all elements outside the diagonal set to zero;
Isotropic $M_k$: a diagonal matrix $\sigma_k^2 I$ with $\sigma_k^2$ equal to the mean diagonal value of the axis-aligned $M_k$.
While the anisotropic variant may offer the most geometrical modelling flexibility, it may also increase computational cost. The isotropic variant, on the other hand, enjoys practically null computational overhead, but may have the least modelling flexibility. The axis-aligned variant offers a compromise between the two approaches.
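The sketch below shows, under assumed array names, how the three $M_k$ variants and $\sigma_k^2$ of Equation 2 might be computed from the training descriptors assigned to one cell; the small ridge added before inversion is an implementation convenience, not part of the original formulation.

```python
import numpy as np

def cell_metric_and_sigma(cell_descriptors, codeword, variant="axis-aligned"):
    """Compute M_k for one cell and sigma_k^2 as the mean Mahalanobis distance (Equation 2).

    cell_descriptors: (M, d) training descriptors assigned to cell C_k.
    codeword:         (d,) codeword c_k.
    variant:          "anisotropic", "axis-aligned", or "isotropic".
    """
    centered = cell_descriptors - codeword
    cov = np.cov(centered, rowvar=False)                      # anisotropic M_k
    if variant == "axis-aligned":
        M = np.diag(np.diag(cov))                             # drop off-diagonal elements
    elif variant == "isotropic":
        M = np.mean(np.diag(cov)) * np.eye(len(codeword))     # mean diagonal value times I
    else:
        M = cov
    M_inv = np.linalg.inv(M + 1e-9 * np.eye(len(codeword)))   # small ridge for stability
    mahalanobis_sq = np.sum(centered @ M_inv * centered, axis=1)
    sigma_sq = mahalanobis_sq.mean()                          # sigma_k^2
    return M, sigma_sq

# Toy usage.
rng = np.random.default_rng(0)
cell = rng.normal(size=(200, 8))
M, sigma_sq = cell_metric_and_sigma(cell, cell.mean(axis=0), variant="isotropic")
print(M.shape, sigma_sq)
```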
In one example, the pruning carried out by Equation 1 can be implemented by means of 1/0 weights as follows, where $[[\cdot]]$ is the indicator function that evaluates to one if the condition is true and zero otherwise:
$w_k(x) = [[(x - c_k)^T M_k^{-1} (x - c_k) \le \gamma \sigma_k^2]]$ (Equation 3)
In another example, the pruning carried out by Equation 1 can be implemented using soft weights.
In another example, the soft-weights may be computed using exponential weighting, where
In another example, the soft-weights may be computed using inverse weighting, where
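The exact exponential and inverse weighting formulas are given by the equations referenced above (not reproduced here); the sketch below shows one plausible form of each alongside the hard weights of Equation 3, and the specific soft-weight expressions should be read as assumptions for illustration rather than the disclosed formulas.

```python
import numpy as np

def pruning_weights(descriptors, codeword, M_inv, sigma_sq, gamma=1.0, mode="hard"):
    """Per-descriptor pruning weights w_k(x) for descriptors assigned to cell C_k.

    mode="hard":        1 if (x - c_k)^T M_k^{-1} (x - c_k) <= gamma * sigma_k^2, else 0.
    mode="exponential": assumed soft form exp(-d2 / (gamma * sigma_k^2)).
    mode="inverse":     assumed soft form 1 / (1 + d2 / (gamma * sigma_k^2)).
    """
    centered = descriptors - codeword
    d2 = np.sum(centered @ M_inv * centered, axis=1)  # squared Mahalanobis distances
    if mode == "hard":
        return (d2 <= gamma * sigma_sq).astype(float)
    if mode == "exponential":
        return np.exp(-d2 / (gamma * sigma_sq))
    return 1.0 / (1.0 + d2 / (gamma * sigma_sq))
```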
In one example, the pruned local descriptors are outputted at block 235.
The Encode Pruned Descriptors block 240 may receive the pruned local descriptors from the Prune Local Descriptors block 230. The Encode Pruned Descriptors block 240 may compute image feature vectors by encoding the pruned local descriptors received from the Prune Local Descriptors block 230. The Encode Pruned Descriptors block 240 may use an algorithm such as a Bag-of-Words (BOW), Fisher or VLAD algorithm, or any other algorithm based on a codebook obtained from any clustering algorithm such as K-means or from a GMM model. The Encode Pruned Descriptors block 240 may encode the pruned local descriptors in accordance with the process described in
In one example, the Encode Pruned Descriptors block 240 may utilize a bag-of-words (BOW) encoder. The BOW encoder may be based on a codebook $\{c_k \in \mathbb{R}^d\}_{k=1}^{L}$ obtained by applying K-means to all the local descriptors $\mathcal{T} = \cup_t I_t$ of a set of training images. Letting $C_k$ denote the Voronoi cell $\{x \mid x \in \mathbb{R}^d,\, k = \arg\min_j \|x - c_j\|\}$ associated with codeword $c_k$, the resulting feature vector for image $I$ may be
where $[[\cdot]]$ is the indicator function that evaluates to 1 if the condition is true and 0 otherwise and where $[a_k]_k$ denotes concatenation of the vectors $a_k$ (scalars $a^k$) to form a single column vector.
In another example, the Encode Pruned Descriptors block 240 may utilize a Fisher encoder that may rely on a GMM model also trained on $\mathcal{T}$. Letting $\beta_i$, $c_i$, $\Sigma_i$ denote, respectively, the i-th GMM component's prior weight, mean vector, and covariance matrix (assumed diagonal), the first-order Fisher feature vector may be
In another example, the Encode Pruned Descriptors block 240 may use a hybrid combination of BOW and Fisher techniques called VLAD. The VLAD encoder may offer a compromise between the Fisher encoder's performance and the BOW encoder's processing complexity. The VLAD encoder may, similarly to the state-of-the-art Fisher aggregator, encode residuals $x - c_k$, but may also hard-assign each local descriptor to a single cell $C_k$ instead of using a costly soft-max assignment as in Equation (9). The resulting VLAD encoding may be
where $\Phi_k$ are orthogonal PCA rotation matrices obtained from the training descriptors $\mathcal{T} \cap C_k$ in the Voronoi cell. After computing the sub-vectors $r_k^B$, $r_k^F$, or $r_k^V$, these are stacked as in Equations (6), (8) and (10) to obtain a single large vector $r^B$, $r^F$ or $r^V$ (we use $r$ to denote any of these variants). Two normalization steps are applied as per the standard approach in the literature: a power-normalization step, where each entry $r^i$ of $r$ is substituted by $\operatorname{sign}(r^i)\,|r^i|^a$ (common values of $a$ are 0.2 or 0.5), and an l2-normalization step where every entry of the power-normalized vector is divided by the Euclidean norm of the power-normalized vector.
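As a sketch of how the pruning weights enter such an encoder, the weighted VLAD-style aggregation below follows the residual-plus-hard-assignment description above; omitting the per-cell PCA rotations $\Phi_k$ and the post-hoc normalizations is a simplifying assumption, and the function names are illustrative.

```python
import numpy as np

def encode_weighted_vlad(descriptors, codebook, weights, assignments):
    """VLAD-style encoding in which each residual is scaled by its pruning weight.

    descriptors: (N, d) local descriptors; codebook: (L, d) codewords;
    weights:     (N,) pruning weights w_k(x); assignments: (N,) cell index per descriptor.
    Residuals x - c_k are aggregated per cell and the sub-vectors are stacked.
    """
    L, d = codebook.shape
    sub_vectors = np.zeros((L, d))
    for x, w, k in zip(descriptors, weights, assignments):
        sub_vectors[k] += w * (x - codebook[k])   # weighted residual aggregation per cell
    return sub_vectors.reshape(-1)                # stack the sub-vectors r_k

# Toy usage.
rng = np.random.default_rng(0)
X, C = rng.normal(size=(20, 8)), rng.normal(size=(4, 8))
r = encode_weighted_vlad(X, C, np.ones(20), rng.integers(0, 4, size=20))
print(r.shape)  # (32,)
```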
The Search Encoded Images block 250 receives the feature vector(s) computed by the Encode Pruned Descriptors block 240. The Search Encoded Images block 250 may perform a search for one or more images by comparing the feature vector(s) received from the Encode Pruned Descriptors block 240 with the feature vectors of a search images database. The Search Encoded Images block 250 may perform an image search in accordance with the processes described in
Exemplary Image Processing System
The display 320 may allow the user to interact with image processing device 310, including, for example, inputting criteria for performing an image search. The display 320 may also display the output of an image search.
The image processing device 310 includes memory 330 and processor 340 that allow the performance of local descriptor pruning 350. The image processing device 310 further includes any other software or hardware necessary to perform local descriptor pruning 350.
The image processing device 310 executes the local descriptor pruning 350 processing. In one example, the image processing device 310 performs the local descriptor pruning 350 based on an initialization of an image search process by a user either locally or remotely. The local descriptor pruning 350 executes the pruning of local descriptors in accordance with the processes described in
In one example, the image processing device 310 may store all the information necessary to perform the local descriptor pruning 350. For example, the image processing device 310 may store and execute the algorithms and database information necessary to execute the local descriptor pruning 350 processing. Alternatively, the image processing system 310 may receive via a communication link one or more of the algorithms and database information to execute the local descriptor pruning 350 processing.
Each of the processing of extract local descriptors 360, encode pruned local descriptors 370, and perform image search 380 may be executed in whole or in part on image processing device 310. Alternatively, each of the extract local descriptors 360, encode pruned local descriptors 370, and perform image search 380 may be executed remotely and their respective results may be communicated to image processing device 310 via a communication link. In one example, the image processing device may receive an input image and execute extract local descriptors 360 and prune local descriptors 350. The results of prune local descriptors 350 may be transmitted via a communication link. The encode pruned local descriptors 370 and perform image search 380 may be executed remotely, and the results of perform image search 380 may be transmitted to image processing device 310 for display on display 320. The dashed boxes of extract local descriptors 360, encode pruned local descriptors 370, and perform image search 380 thus indicate that these processes may be executed on image processing device 310 or may be executed remotely. The extract local descriptors 360, encode pruned local descriptors 370, and perform image search 380 processes may be executed in accordance with the processes described in relation to
In one example, image encoders operate on the local descriptors $x \in \mathbb{R}^d$ extracted from each image. Images may be represented as a set $I = \{x_i \in \mathbb{R}^d\}_i$ of local SIFT descriptors extracted densely or with a Hessian Affine region detector.
In one example, local descriptors may be encoded using a BOW encoder. The BOW encoder may be based on a codebook $\{c_k \in \mathbb{R}^d\}_{k=1}^{L}$ obtained by applying K-means to all the local descriptors $\mathcal{T} = \cup_t I_t$ of a set of training images. Letting $C_k$ denote the Voronoi cell $\{x \mid x \in \mathbb{R}^d,\, k = \arg\min_j \|x - c_j\|\}$ associated with codeword $c_k$, the resulting feature vector for image $I$ may be
where $[[\cdot]]$ is the indicator function that evaluates to 1 if the condition is true and 0 otherwise.
In another example, local descriptors may be encoded using a Fisher encoder. The Fisher encoder relies on a GMM model also trained on $\mathcal{T}$. Letting $\beta_i$, $c_i$, $\Sigma_i$ denote, respectively, the i-th GMM component's prior weight, mean vector, and covariance matrix (assumed diagonal), the first-order Fisher feature vector may be
In another example, local descriptors may be encoded using a hybrid combination of BOW and Fisher techniques called VLAD, which may offer a compromise between the Fisher encoder's performance and the BOW encoder's processing complexity. This hybrid encoder, similarly to the state-of-the-art Fisher aggregator, may encode residuals $x - c_k$, but may also hard-assign each local descriptor to a single cell $C_k$ instead of using a costly soft-max assignment as in Equation 15. The resulting VLAD encoding may be
where $\Phi_k$ are orthogonal PCA rotation matrices obtained from the training descriptors $\mathcal{T} \cap C_k$ in the Voronoi cell.
In one example, the following power-normalization and l2 normalization post-processing stages may be applied to any of the feature vectors r in Equations (12), (14) and (16):
$p = [h_a(r^j)]_j$, (Equation 18)
$n = g(p)$. (Equation 19)
Here the scalar function $h_a(x)$ and the vector function $g(v)$ carry out power normalization and l2 normalization, respectively:
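Since the definitions of $h_a$ and $g$ are not reproduced here, the sketch below implements one common reading of power normalization followed by l2 normalization, consistent with the signed-power description given earlier; it is a hedged illustration rather than the disclosed definition.

```python
import numpy as np

def power_and_l2_normalize(r, a=0.5):
    """Apply power normalization, then l2 normalization, to a feature vector r.

    Each entry r_j is replaced by sign(r_j) * |r_j|**a (a is commonly 0.2 or 0.5),
    and the resulting vector is divided by its Euclidean norm.
    """
    p = np.sign(r) * np.abs(r) ** a        # power normalization h_a, applied entrywise
    norm = np.linalg.norm(p)
    return p / norm if norm > 0 else p     # l2 normalization g

# Toy usage.
print(power_and_l2_normalize(np.array([4.0, -1.0, 0.0, 9.0])))
```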
In one example, the present invention employs a local-descriptor pruning method applicable to all three feature encoding methods described above (BOW, VLAD and Fisher), and in general to feature encoding methods based on stacking sub-vectors $r_k$, where each sub-vector is related to a cell $C_k$ or mixture component $(\beta_k, c_k, \Sigma_k)$ (these can be thought of as soft cells).
Unlike the case for low-dimensional sub-spaces, the cells $C_k$ in high-dimensional local-descriptor spaces are almost always unbounded, meaning that they have infinite volume. Yet only a part of this volume is informative visually.
In one example, the visually informative information is isolated by removing the local descriptors that are too far away from the cell center $c_k$ when constructing the sub-vectors $r_k$ in Equations (13), (15) and (17). In one example, the pruning is performed by restricting the summations in Equations (13), (15) and (17) only to those vectors $x$ that are in the cell $C_k$ and satisfy the following distance-to-$c_k$ condition:
$(x - c_k)^T M_k^{-1} (x - c_k) \le \gamma \sigma_k^2$. (Equation 22)
The value of $\gamma$ is determined experimentally by cross-validation and the parameter $\sigma_k$ is computed from a training set of local descriptors as follows:
$\sigma_k^2 = \operatorname{mean}_{x \in \mathcal{T} \cap C_k} \left( (x - c_k)^T M_k^{-1} (x - c_k) \right)$ (Equation 23)
The matrix $M_k$ can be any of the following:
Anisotropic $M_k$: the empirical covariance matrix computed from $\mathcal{T} \cap C_k$;
Axis-aligned $M_k$: the same as the anisotropic $M_k$, but with all elements outside the diagonal set to zero;
Isotropic $M_k$: a diagonal matrix $\sigma_k^2 I$ with $\sigma_k^2$ equal to the mean diagonal value of the axis-aligned $M_k$.
While the anisotropic variant offers the most geometrical modelling flexibility, it also drastically increases the computational cost. The isotropic variant, on the other hand, enjoys practically null computational overhead, but also the least modelling flexibility. The axis-aligned variant offers a compromise between the two approaches.
In one example, the pruning carried out by Equation 22 can be implemented by means of 1/0 weights:
$w_k(x) = [[(x - c_k)^T M_k^{-1} (x - c_k) \le \gamma \sigma_k^2]]$ (Equation 24)
The weights $w_k(x)$ can be applied to the summation terms in Equations (13), (15) and (17). For example, for Equation 13 the weights would be used as follows:
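For the BOW case, applying the weights to the summation terms might look like the sketch below; the function names are assumed and any normalization factors appearing in the original equation are omitted.

```python
import numpy as np

def encode_weighted_bow(assignments, weights, num_codewords):
    """Bag-of-words histogram in which each descriptor contributes its pruning weight.

    assignments: (N,) cell index per local descriptor.
    weights:     (N,) pruning weights w_k(x), hard 0/1 or soft values in [0, 1].
    Returns an (L,) histogram: entry k sums the weights of descriptors assigned to cell k.
    """
    histogram = np.zeros(num_codewords)
    np.add.at(histogram, assignments, weights)  # weighted count per Voronoi cell
    return histogram

# Toy usage.
print(encode_weighted_bow(np.array([0, 2, 2, 1]), np.array([1.0, 0.5, 1.0, 0.0]), 4))
```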
In another example, the pruning carried out by Equation 22 can be implemented using soft weights.
In another example, the soft-weights may be computed using exponential weighting, where
In another example, the soft-weights may be computed using inverse weighting, where
The experiments underlying
The experiments underlying
The experiments underlying
Table 1 provides a summary of results for all variants, where each variant is specified by a choice of weight type (hard, exponential or inverse), metric type (isotropic, anisotropic or axis-aligned), and local detector (dense or Hessian affine).
The best result overall is obtained using axis-aligned exponential weighting (74.28% and 67.02% for dense and Hessian affine detections, respectively). Nonetheless, hard pruning yields improvements relative to the baseline, and it is less computationally demanding than soft pruning. The best mAP for hard pruning is obtained using the axis-aligned approach for both the dense and Hessian affine detectors (66.40% and 73.56%, respectively). As illustrated in
Numerous specific details have been set forth herein to provide a thorough understanding of the present invention. It will be understood by those skilled in the art, however, that the examples above may be practiced without these specific details. In other instances, well-known operations, components and circuits have not been described in detail so as not to obscure the present invention. It can be appreciated that the specific structural and functional details disclosed herein may be representative and do not necessarily limit the scope of the present invention.
Various examples of the present invention may be implemented using hardware elements, software elements, or a combination of both. Some examples may be implemented, for example, using a computer-readable medium or article which may store an instruction or a set of instructions that, if executed by a machine, may cause the machine to perform a method and/or operations in accordance with the embodiments. Such a machine may include, for example, any suitable processing platform, computing platform, computing device, processing device, computing system, processing system, computer, processor, or the like, and may be implemented using any suitable combination of hardware and/or software. The computer-readable medium or article may include, for example, any suitable type of memory unit, memory device, memory article, memory medium, storage device, storage article, storage medium and/or storage unit. The instructions may include any suitable type of code, such as source code, compiled code, interpreted code, executable code, static code, dynamic code, encrypted code, and the like, implemented using any suitable high-level, low-level, object-oriented, visual, compiled and/or interpreted programming language.