This disclosure relates to retrieving images related to a search image.
Image search methods generally fall into two categories: semantic search and image retrieval. In the first category, semantic search seeks to retrieve images containing visual concepts embodied in a search word or string. For example, the user might want to find images containing cats. In the second category, image retrieval seeks to find all images of the same scene, even when those images have undergone some task-related transformation relative to a search or query image. Examples of simple transformations include changes in scene illumination, image cropping, or scaling. More challenging transformations include wide changes in camera perspective, high compression ratios, or picture-of-video-screen artifacts.
Common to both semantic search and image retrieval methods is the need to encode the image into a single, fixed-dimensional feature vector. Many successful image feature encoders currently exist, and these generally operate on fixed-dimensional local descriptor vectors extracted from densely or sparsely sampled local regions of the search image. The feature encoder aggregates these local descriptors to produce a higher-dimensional image feature vector. Examples of such feature encoders include the Bag-of-Words encoder, the Fisher encoder and the VLAD encoder. All these encoders perform common parametric post-processing steps that apply element-wise power normalization and subsequent ℓ2 normalization. These encoders also depend on specific models of the data distribution in the local descriptor space. The Bag-of-Words and VLAD encoders use a model having a code book obtained using K-means, while the Fisher encoder uses a Gaussian Mixture Model (GMM). In both cases, the model defining the encoder uses an optimization objective unrelated to the image search task.
In the case of semantic search, recent work has focused on learning the feature encoder parameters to make the encoder better suited for its intended purpose. A natural learning objective that finds applicability in this situation is the max-margin objective otherwise used to learn support vector machines. Past efforts have enabled learning of the components of the GMM used in the Fisher encoder by optimizing, relative to the GMM mean and variance parameters, the same objective that produces a linear classifier commonly used to carry out the semantic search. Past approaches based on deep Convolutional Neural Networks (CNNs) can also be interpreted as feature learning methods, and these now define the state-of-the-art baseline in semantic search. Indeed, the Fisher encoder can itself be interpreted as a deep network, since both consist of alternating layers of linear and non-linear operations.
For the image retrieval task, however, there have been few efforts to apply feature learning. One existing proxy approach uses the max-margin objective, thus yielding feature encoders learned for semantic search. Although the two search tasks are not the same, the max-margin objective approach yields improved image retrieval results, since both semantic search and image retrieval are based on human visual interpretations of similarity. Another approach to applying a learning objective to image retrieval focuses on learning the local descriptor vectors at the input of the feature encoder. The optimization objective used in this case is engineered to enforce matching, based on the learned local descriptors, of small image blocks centered on the same point in 3-D space but taken from images with different perspectives. One reason why these two approaches circumvent the actual task of image retrieval is the lack of objective functions that are good surrogates for the mean Average Precision (mAP) measure commonly used to evaluate image retrieval systems. Surrogate objectives are necessary because the mAP measure is non-differentiable, as it depends on a ranking of the searched images.
Thus, a need exists for an image retrieval technique with a feature learning method that overcomes the aforementioned disadvantages.
Briefly, in accordance with an aspect of the present principles, a method for retrieving at least one search image matching a query image includes extracting a set of search images. Thereafter, the query image is encoded into a query image feature vector and the search images are encoded into search image feature vectors, both using an optimized encoding process that makes use of learned encoding parameters. The distances between the query image feature vector and the search image feature vectors are computed, and the search images are ranked based on the computed distances. At least one highest-ranked search image is retrieved based on the ranking.
It is an object of the present principles to provide image retrieval with feature learning.
It is another object of the present principles to provide image retrieval with feature learning using a learning objective not dependent on image ranking.
It is another object of the present principles to provide image retrieval with feature learning using a learning objective minimized with a gradient-based optimization strategy, the resulting objective being applied to select the power-normalization parameters of the encoder so as to improve image retrieval.
Further, it is another object of the present principles to provide image retrieval with feature learning using a learning objective that makes use of an offset term in connection with per-cell rotation when aggregating local descriptors to yield the feature vector for the query image.
In accordance with an aspect of the present principles, an image retrieval method and apparatus makes use of a learning objective that serves as a good surrogate for the mean Average Precision (mAP) measure to improve the quality of the image retrieval. Before proceeding to describe the image search technique of the present principles, the following discussion on notation will prove useful.
In the discussion that follows, for a vector-valued function f(x), ∂f/∂x denotes the Jacobian matrix with (i,j)-th entry ∂f_i/∂x_j.
As described in detail hereinafter, the processor 12 performs various functions associated with the image retrieval with feature learning in accordance with the present principles. First, upon receipt of a query image for querying a database of images (i.e., "searched images") to retrieve images therefrom constituting a match with the query image, the processor 12 will compute a feature vector for the query image. In this context, the processor 12 acts as an encoder to encode the query image to yield an image feature vector using one of the encoding techniques described above. Thereafter, the processor 12 will compute a distance, typically the Euclidean distance, between the feature vector associated with the query image and a feature vector for each search image in a database of search images (not shown). The searched images in the database may already exist in encoded form or may require encoding in the same manner as the query image, in which case the processor 12 will perform the encoding prior to computing the distances. The processor 12 will then sort (e.g., rank) the searched images in the database based on the computed distances.
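By way of illustration, the following Python sketch shows how the ranking step might be carried out; the names and the stand-in random features are hypothetical, and the encoding step is assumed to have already produced the feature vectors:

```python
import numpy as np

def rank_search_images(query_feature, search_features):
    """Rank searched images by ascending Euclidean distance to the query.

    query_feature:   (P,) feature vector of the query image
    search_features: (N, P) feature vectors of the N searched images
    Returns the indices of the searched images, best match first.
    """
    distances = np.linalg.norm(search_features - query_feature, axis=1)
    return np.argsort(distances)  # ascending: smallest distance ranks first

# Example with random stand-in features (P = 128, N = 1000).
rng = np.random.default_rng(0)
query = rng.standard_normal(128)
database = rng.standard_normal((1000, 128))
ranking = rank_search_images(query, database)
top_match = ranking[0]  # the highest-ranked search image
```

When the feature vectors are ℓ2-normalized, ranking by ascending Euclidean distance is equivalent to ranking by descending inner product, which is the form used in the learning objective described hereinafter.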
The memory 14 stores program instructions for the processor 12. Further, the memory stores data supplied to, as well as data generated by, the processor 12. In this regard, the memory 14 stores: (1) the learned encoding parameters, in particular α and d, associated with the encoding of the query image by the processor 12, (2) the encoded feature vectors for all the searched images, as well as (3) the searched images themselves.
The processor 12 and the memory 14 also interact with each other during learning of the encoding parameters. As described in detail hereinafter, the processor 12 establishes a learning objective, i.e., a measure of the quality of the search. The processor 12 thereafter seeks to minimize that learning objective over pairs or triplets in a training set of images, typically by implementing a gradient-based optimization strategy, such as, but not limited to, Stochastic Gradient Descent (SGD), over the pairs/triplets in the training set, in order to learn the optimized encoding parameters, in particular α and d. Rather than make use of Stochastic Gradient Descent, other optimization techniques could be used, such as gradient descent, Newton descent, conjugate gradient methods, Levenberg-Marquardt minimization, BFGS, and hybrid mixes. The memory 14 stores the local descriptors for all the pairs or triplets of images in the training set. Further, the memory 14 stores the optimized learned parameters obtained from the gradient-based optimization.
To understand the manner in which the processor 12 computes feature vectors by encoding, the following discussion will prove helpful. Image encoders operate on the local descriptors x ∈ R^d extracted from each image. Hence, for purposes of discussion, images are represented as a set I = {x_k ∈ R^d}_k of local SIFT descriptors extracted densely or with the Hessian-Affine region detector. The Bag-of-Words (BOW) encoder constitutes one of the earliest image encoding methods and relies on a code book {c_k ∈ R^d}_{k=1}^K obtained by applying K-means to all the local descriptors ∪_t I_t of a set of training images. Letting C_k denote the Voronoi cell {x | x ∈ R^d, k = argmin_j |x − c_j|} associated with code word c_k, the resulting feature vector for image I is

n_BOW(I) = [#(I ∩ C_k)]_{k=1}^K,
where # yields the number of elements in the set.
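A minimal sketch of the BOW encoding described above, assuming a codebook already obtained with K-means (function and variable names are illustrative only):

```python
import numpy as np

def bow_encode(descriptors, codebook):
    """Bag-of-Words encoding: a histogram of hard assignments to code words.

    descriptors: (n, d) local descriptors extracted from one image
    codebook:    (K, d) code words c_k obtained with K-means
    Returns the K-dimensional vector [#(I ∩ C_k)]_k of per-cell counts.
    """
    # Squared distance between every descriptor and every code word.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    assignments = d2.argmin(axis=1)  # nearest code word per descriptor
    return np.bincount(assignments, minlength=codebook.shape[0]).astype(float)
```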
The Fisher encoder relies on a GMM model, also trained on ∪_t I_t. Letting β_k, c_k and Σ_k denote, respectively, the prior weight, mean vector and covariance matrix of the k-th Gaussian component, the Fisher encoder aggregates the residuals x − c_k weighted by the soft-max assignment of each local descriptor to each component (equation (2)).
A hybrid combination of the BOW and Fisher encoders, called the VLAD encoder, has been proposed that offers a good compromise between the performance of the Fisher encoder and the encoding complexity of the BOW encoder. Similar to the state-of-the-art Fisher encoder, the VLAD encoder encodes the residuals x − c_k, but it hard-assigns each local descriptor to a single cell C_k instead of using a costly soft-max assignment as in equation (2) for the Fisher encoder. It has also been suggested to incorporate several conditioning steps in the VLAD encoder to improve the performance of the feature encoding. The following equations define VLAD encoding:

r_k = Σ_{x ∈ I ∩ C_k} (x − c_k),   (3)
r′_k = Φ_k·r_k − d_k,   (4)
q = [r′_1ᵀ, . . . , r′_Kᵀ]ᵀ,   (5)
h = [h_{α_j}(q_j)]_j,   (6)
n_VLAD(I) = n(h).   (7)
Here, the scalar function h_α(x) and the vector function n(v) carry out power normalization and ℓ2 normalization, respectively:

h_α(x) = sign(x)·|x|^α,   (8)
n(v) = v / |v|_2.   (9)
The power normalization function defined in equation (8) is widely used as a post-processing stage for image features. This power normalization function serves to mitigate (respectively, enhance) the contribution of the larger (smaller) coefficients in the vector, as illustrated in the accompanying figure.
In all the approaches using power normalization, the α_j are kept constant for all entries in the vector, α_j = α, ∀j. This restriction comes from the fact that α is chosen empirically (often α = 0.5 or α = 0.2), and choosing a different value for each α_j is hence difficult. As described hereinafter, applying the feature learning method of the present principles to the optimization of the α_j can overcome this difficulty.
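The post-processing chain, including the per-entry generalization of the exponent, can be sketched as follows (a sketch only, assuming the signed-power form of equation (8)):

```python
import numpy as np

def power_normalize(v, alpha):
    """Element-wise signed power normalization h_alpha(x) = sign(x)|x|^alpha.

    alpha may be a scalar (the classical case alpha_j = alpha for all j) or a
    vector of per-entry exponents alpha_j learned as described herein.
    """
    return np.sign(v) * np.abs(v) ** alpha

def l2_normalize(v):
    """l2 normalization n(v) = v / |v|_2 of equation (9)."""
    return v / np.linalg.norm(v)

v = np.array([4.0, -0.25, 0.01, 1.0])
classical = l2_normalize(power_normalize(v, 0.5))                  # fixed alpha
per_entry = l2_normalize(power_normalize(v, np.array([0.2, 0.5, 0.3, 0.4])))
binary = power_normalize(v, 0.0)  # alpha_j = 0 yields sign(v): a binary code
```

Note that the last line reproduces the binary-descriptor observation made below: with α_j = 0, h_α reduces to the sign function.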
Experimentally, dense local descriptor sampling (previously shown to outperform sparsely sampled blocks, but for α_j = 0.2) with α_j = 0 yields very competitive performance, with the added advantage that the resulting descriptor is binary, as shown in the accompanying results.
Feature learning has been pursued in the context of image classification and for learning local descriptors akin to parametric variants of the SIFT descriptor. However, as discussed previously, few have pursued learning features specifically for the image retrieval task. As described below, an exemplary approach to feature learning in accordance with the present principles optimizes the parameters of VLAD feature encoding.
The main difficulty in learning for the image retrieval task lies in the non-smoothness and non-differentiability of the standard performance measures used to assess the quality of image retrieval, such as the mAP measure discussed previously. Present-day image retrieval quality assessment measures all depend on recall and precision computed over a ground-truth dataset containing known groups of matching images. A given query image serves as the starting point to obtain a ranking (i_k ∈ {1, . . . , N})_k of the N images in a dataset of searched images (for example, by an ascending sort of the feature distances of such images relative to the query feature). Given the ground-truth matches M = {i_{k_j}}_j for the query, the recall and precision at rank k are computed using the first k ranked images F_k = {i_1, . . . , i_k} as follows (where # denotes set cardinality):

r(k) = #(F_k ∩ M) / #M,   (10)
p(k) = #(F_k ∩ M) / k.   (11)
The average precision is then the area under the curve obtained by plotting p(k) versus r(k) for a single query image. A common performance measure is the mean, over all query images in the dataset, of the average precision. This mean Average Precision (mAP) measure, and all measures based on recall and precision, are non-differentiable and difficult to use in an optimization framework. The image retrieval with feature learning technique of the present principles instead makes use of a surrogate objective function, described next.
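For concreteness, average precision might be computed from a ranking as in the following sketch, which uses the common sum-over-hits approximation of the area under the precision-recall curve (illustrative code only):

```python
def average_precision(ranked_ids, ground_truth):
    """Average precision for one query (assumes ground_truth is non-empty).

    ranked_ids:   sequence of searched-image labels, best match first
    ground_truth: set M of labels that truly match the query
    """
    hits, ap = 0, 0.0
    for k, label in enumerate(ranked_ids, start=1):
        if label in ground_truth:
            hits += 1
            ap += hits / k  # precision p(k) accumulated at each recall step
    return ap / len(ground_truth)

# mAP is the mean of average_precision over all query images in the dataset.
```

The sorting and set-membership operations make plain why this measure is non-differentiable in the encoder parameters: an infinitesimal change in a feature can swap two ranks and change the sum discontinuously.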
To understand the surrogate objective of the present principles, assume receipt of a training set consisting of images labeled i=1, . . . , N. For each image i, also assume the labels Mi ⊂{1, . . . , N} of the images that are a match to image i. Further, assume that some feature encoding scheme has been chosen and parametrized by a vector θ that yields feature vectors ni(θ). The aim is to define a procedure to select good values for the parameters θ.
Consider the feature n_i of a given query image. Since feature vectors are normalized (|n_i|_2 = 1), the retrieval process consists of sorting the N images in descending order of the inner-product scores n_iᵀn_j.
Let H_i ⊂ {1, . . . , N} denote the union of a) the labels of the top-ranked images (except i) and b) the labels M_i of the true matches. Letting y_{ij} = 1 if j ∈ M_i and −1 otherwise, we propose the following learning objective:

e(θ) = (1/M) Σ_i Σ_{j ∈ H_i} φ(n_i(θ), n_j(θ), y_{ij}, b_i),   (12)
where M is the total number of terms in the double summation. Inspired by max-margin formulations, we use the hinge penalty
φ(n, m, y, b) = max(0, ε − y·(nᵀm − b)),   (13)
The parameters ε and b_i in φ(n_i, n_j, y_{ij}, b_i) promote higher scores n_iᵀn_j for positive pairs {i, j | j ∈ M_i} than for negative pairs {i, j | j ∈ H_i \ M_i}.
Parameter b_i shifts the penalty so that it "separates" positive scores from negative scores. Given the piece-wise linear nature of the hinge loss, the value of b_i minimizing the above expression is found at one of the vertices {β_{ij} | j ∈ H_i}, where β_{ij} = n_iᵀn_j − y_{ij}·ε. Thus, it suffices to compute the inner summation at all these candidate values for b_i and choose the best one.
In practice, setting b_i heuristically to either a) the average of the positive scores or b) the minimum positive score also worked well, simplifying the objective to

e(θ) = (1/M) Σ_i Σ_{j ∈ H_i} max(0, ε − y_{ij}·(n_i(θ)ᵀn_j(θ) − b_i)),   (14)

with b_i fixed by one of the above heuristics.
(The accompanying figure plots the resulting separation, using distinct markers for the positive scores, i.e., those with j ∈ M_i, and the negative scores.)
As mentioned previously, the formulation in equation (14) is similar to max-margin formulations used to learn linear SVM classifiers w. Feature learning approaches exist that use this same SVM objective to learn the encoder parameters θ for classification. Note that this is very different from the approach of the present principles since, in image retrieval, the retrieval scores are given by similarities between the features themselves, as exemplified by the n_iᵀn_j terms in the objective set forth in equation (14). Classification scores are instead given by similarities between the learned classifier vector w and the features n_i.
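A sketch of the simplified pairwise objective of equation (14), assuming ℓ2-normalized features and heuristically fixed offsets b_i (the names and the default margin value are illustrative):

```python
def hinge(score, y, b, eps=1e-2):
    """Hinge penalty of equation (13): max(0, eps - y * (score - b))."""
    return max(0.0, eps - y * (score - b))

def pairwise_objective(features, pairs, b, eps=1e-2):
    """Mean hinge penalty over labeled pairs, cf. equation (14).

    features: (N, P) l2-normalized feature vectors n_i
    pairs:    list of (i, j, y_ij) with y_ij = +1 for matches, -1 otherwise
    b:        per-query offsets b_i, e.g., the minimum positive score per i
    """
    total = 0.0
    for i, j, y in pairs:
        total += hinge(features[i] @ features[j], y, b[i], eps)
    return total / len(pairs)
```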
Stochastic Gradient Descent (SGD) is a well-established, robust optimization method offering advantages when computational time or memory space is the bottleneck. The image retrieval with feature learning technique of the present principles uses SGD to optimize the learning objective set forth in equation (14). Given the parameter estimate θ_t at iteration t, SGD substitutes the full gradient of the objective,

∇e(θ) = (1/M) Σ_i Σ_{j ∈ H_i} ∇φ_{ij}(θ),   (16)

by an estimate ∇φ_{i_t j_t}(θ_t) computed from a single (i, j) pair drawn at random at time t.
The resulting SGD update rule is

θ_{t+1} = θ_t − γ_t·∇φ_{i_t j_t}(θ_t),   (17)
where γ_t is a learning rate that can be made to decay with t, e.g., γ_t = γ_0/(t + t_0). SGD is guaranteed to converge to a local minimum for sufficiently small values of γ_t, and here we use constant values (γ_t = γ, ∀t) set by cross-validation.
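A schematic SGD loop implementing equation (17); a numerically estimated gradient stands in here for the analytic chain-rule gradient derived below, and all names are hypothetical:

```python
import numpy as np

def sgd(theta, pairs, loss_fn, gamma=1.0, steps=1000, seed=0):
    """Stochastic gradient descent over training pairs, cf. equation (17).

    theta:   initial encoder parameters (e.g., the exponents alpha_j), float array
    pairs:   list of training pairs (i, j, y_ij)
    loss_fn: loss_fn(theta, pair) -> scalar hinge penalty for that pair
    gamma:   constant learning rate (set by cross-validation herein)
    """
    rng = np.random.default_rng(seed)
    h = 1e-6
    for _ in range(steps):
        pair = pairs[rng.integers(len(pairs))]  # one random pair (i_t, j_t)
        grad = np.zeros_like(theta)
        for p in range(theta.size):  # central finite-difference gradient
            e = np.zeros_like(theta)
            e[p] = h
            grad[p] = (loss_fn(theta + e, pair) - loss_fn(theta - e, pair)) / (2 * h)
        theta = theta - gamma * grad  # the update rule of equation (17)
    return theta
```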
When the power normalization and ℓ2 normalization post-processing stages represented by equations (6) and (7) are used, the gradient in equation (16) required in equation (17) can be computed using the chain rule as follows, using the Jacobian notation introduced above:

∂φ_{ij}/∂θ = Σ_{k ∈ {i,j}} (∂φ_{ij}/∂n_k)·(∂n_k/∂h_k)·(∂h_k/∂θ),   (18)

where θ can contain the α_j parameters of the power normalization step or the offset parameters d = [d_k]_k of equation (4), and h_k denotes the power-normalized vector of image k prior to ℓ2 normalization. The partial derivatives in the above expression are given below, where k, ℓ ∈ {i, j}.
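Within this chain, only the derivative of the power normalization with respect to its exponent is nonstandard: by elementary calculus, ∂/∂α_j [sign(q_j)·|q_j|^{α_j}] = sign(q_j)·|q_j|^{α_j}·log|q_j|. A small sketch of that factor, assuming the signed-power form of equation (8) (names illustrative):

```python
import numpy as np

def dpower_dalpha(q, alpha):
    """Derivative of h_alpha(q) = sign(q)|q|^alpha with respect to alpha.

    q:     aggregated vector entries prior to normalization
    alpha: vector of per-entry exponents alpha_j
    d/dalpha [sign(q)|q|^alpha] = sign(q) * |q|^alpha * log|q|; entries with
    q = 0 are set to 0, since h vanishes identically there.
    """
    absq = np.abs(q)
    out = np.zeros_like(q, dtype=float)
    nz = absq > 0
    out[nz] = np.sign(q[nz]) * absq[nz] ** alpha[nz] * np.log(absq[nz])
    return out
```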
To better appreciate the image retrieval with feature learning technique of the present principles, and especially the application of the Stochastic Gradient Descent (SGD) algorithm, refer to the flowchart discussed below, in which local features are first extracted from an input image during step 602.
Following step 602, the extracted local features are aggregated into a single vector of size P (i.e., the feature vector) during step 604. Traditionally, the aggregation of the features to obtain the feature vector includes assigning each descriptor x_i to the closest code word c_k and rotating each sub-vector r_k by Φ_k, using the input parameters depicted in steps 606 and 608, respectively. Following aggregation of the local descriptors, power normalization is applied during step 610, typically with α = 0.2 or 0.5 as indicated in step 612. During step 614, ℓ2 normalization is applied, completing the encoding process. Thus, steps 602-614 collectively comprise the traditional encoding process, culminating in output of the feature vector during step 616.
The image retrieval with feature learning method of the present principles includes several improvements to the traditional encoding process. Rather than use a codebook 606 learned using K-means, the proposed method uses a codebook 618 learned by minimizing a task-related objective so as to pick good values for the code words {c_1, . . . , c_L}.
In addition, rather than simply rotating the sub-vectors as depicted in step 608 for conventional encoding, the image retrieval with feature learning method of the present principles learns per-cell matrices 620 that are not constrained to be orthogonal, again by minimizing a task-related objective. The method also makes use of a learned offset vector d, as indicated in step 622. Finally, instead of using a fixed value of α as in step 612, the method makes use of learned power normalization parameters α_1, α_2, . . . , α_P.
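Under the assumption that equation (4) maps each aggregated sub-vector through the learned per-cell matrix Φ_k and subtracts the learned offset d_k, the modified pipeline might look as follows (a sketch of the general shape under that assumption, not a definitive implementation; names illustrative):

```python
import numpy as np

def vlad_learned(descriptors, codebook, Phi, d, alpha):
    """VLAD-style aggregation with learned per-cell matrices and offsets.

    descriptors: (n, dim) local descriptors of one image
    codebook:    (K, dim) learned code words c_k
    Phi:         (K, dim, dim) learned per-cell matrices, not constrained
                 to be orthogonal
    d:           (K, dim) learned offset vectors (assumed subtracted here)
    alpha:       (K * dim,) learned per-entry power-normalization exponents
    """
    K, dim = codebook.shape
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)  # hard assignment of each descriptor to a cell
    r = np.zeros((K, dim))
    for k in range(K):
        cell = descriptors[assign == k]
        if len(cell):
            r[k] = Phi[k] @ (cell - codebook[k]).sum(axis=0) - d[k]
    v = r.ravel()
    v = np.sign(v) * np.abs(v) ** alpha  # learned power normalization
    return v / np.linalg.norm(v)         # l2 normalization
```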
Experimental testing of the image retrieval with feature learning technique of the present principles was undertaken using as a data set a collection of images known as INRIA Holidays containing 1491 high-resolution personal photos of various locations and objects divided into 800 groups of matching images. The retrieval performance in all the experimentation was measured by mAP (mean average precision), with the query image not included in the resulting ranked list.
To experimentally learn α, the sample dataset consisted of some 8000 (i, j) image pairs obtained from the INRIA Holidays images, composed of positive and negative pairs in equal number. For each image i, pairs (i, j) were built using all positive images belonging to M_i and an equal number of high-ranked negative images for the same image i. Experimentation was carried out using descriptors extracted with the Hessian-Affine detector [ ] and the dense detector [ ] separately. The learning rate parameter γ_t was kept fixed and equal to 1.0 in both cases.
In connection with the experimental testing discussed above, convergence plots were generated after 30 passes over the entire sample of image pairs, as shown in the accompanying figures.
The foregoing can be extended as follows. The learning objectives described in equations (12) and (14) result in minima that are very sensitive to the method used to select b_i. An alternative exists that dispenses with b_i and enforces correct ranking using image triplets. Given an image with label i, correct matches M_i and incorrect matches N_i, the alternate proposed objective is

e(θ) = (1/M) Σ_i Σ_{j ∈ M_i} Σ_{k ∈ N_i} max(0, ε − n_i(θ)ᵀ(n_j(θ) − n_k(θ))),

where M is the total number of terms in the triple summation,
and ε enforces some small, non-zero margin that can be held constant (e.g., ε=1e−2) or increased gradually during the optimization (e.g., between 0 and 1e−1).
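A sketch of the triplet penalty, assuming the standard ranking hinge over a query i, a correct match j ∈ M_i and an incorrect match k ∈ N_i (illustrative only):

```python
def triplet_hinge(n_i, n_j, n_k, eps=1e-2):
    """Ranking hinge for one triplet: max(0, eps - (n_i.n_j - n_i.n_k)).

    Penalizes the incorrect match k whenever its score comes within the
    margin eps of the correct match j, with no per-query offset b_i needed.
    """
    return max(0.0, eps - (n_i @ n_j - n_i @ n_k))
```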
In this case, the gradient with respect to parameter θ is given by the triplet analogue of equation (18) (equation (25)). The SGD update rule for this case operates, at each time instant t, on a triplet (i_t, j_t, k_t):

θ_{t+1} = θ_t − γ_t·∇φ_{i_t j_t k_t}(θ_t).
The binarization thresholds d = [d_k]_k in equation (4) can also be learned using gradients computed via equations (18) or (25) with θ = d; the required Jacobian (equation (28)) has entries involving |q_i|^{α−1}. Numerical issues due to powers of α − 1: the entries |q_i|^{α−1} in equation (28) can pose numerical problems when the q_i are close to zero. One way to avoid this is to keep the corresponding entry d_i fixed during the update step. This amounts to removing the i-th entry of ∇φ_{i,j,k}(θ) in equation (25), updating only d_j for j ≠ i.
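One way this guard might look in code (a sketch; the tolerance value is an assumption):

```python
import numpy as np

def masked_gradient(grad_d, q, tol=1e-8):
    """Freeze thresholds d_i whose corresponding q_i is near zero.

    grad_d: gradient of the objective with respect to the thresholds d
    q:      the entries appearing as |q_i|^(alpha - 1) in equation (28)
    tol:    entries with |q_i| < tol have their gradient entry removed,
            so the corresponding d_i stays fixed during the update
    """
    masked = grad_d.copy()
    masked[np.abs(q) < tol] = 0.0
    return masked
```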
The learning objectives proposed herein allow feature encoders to be learned that are robust to specific transformations in a structured manner. As discussed in the introduction, image retrieval applications are defined by a transformation that is inherent to the specific task.
A few examples of relevant applications include retrieval robust to the transformations noted in the introduction, such as changes in scene illumination, cropping or scaling, wide changes in camera perspective, high compression ratios, and picture-of-video-screen artifacts.
Although not discussed in detail, the proposed image retrieval objective can also be used to learn the code book {c_k}_k or the rotation matrices Φ_k in equation (4).
The foregoing describes a technique for image retrieval using a learning objective.
The implementations described herein may be implemented in, for example, a method or a process, an apparatus, a software program, a data stream, or a signal. Even if only discussed in the context of a single form of implementation (for example, discussed only as a method or a device), the implementation of features discussed may also be implemented in other forms (for example a program). An apparatus may be implemented in, for example, appropriate hardware, software, and firmware. The methods may be implemented in, for example, an apparatus such as, for example, a processor, which refers to processing devices in general, including, for example, a computer, a microprocessor, an integrated circuit, or a programmable logic device. Processors also include communication devices, such as, for example, Smartphones, tablets, computers, mobile phones, portable/personal digital assistants (“PDAs”), and other devices that facilitate communication of information between end-users.
Implementations of the various processes and features described herein may be embodied in a variety of different equipment or applications, particularly, for example, equipment or applications associated with data encoding, data decoding, view generation, texture processing, and other processing of images and related texture information and/or depth information. Examples of such equipment include an encoder, a decoder, a post-processor processing output from a decoder, a pre-processor providing input to an encoder, a video coder, a video decoder, a video codec, a web server, a set-top box, a laptop, a personal computer, a cell phone, a PDA, and other communication devices. As should be clear, the equipment may be mobile and even installed in a mobile vehicle.
Additionally, the methods may be implemented by instructions being performed by a processor, and such instructions (and/or data values produced by an implementation) may be stored on a processor-readable medium such as, for example, an integrated circuit, a software carrier or other storage device such as, for example, a hard disk, a compact diskette (“CD”), an optical disc (such as, for example, a DVD, often referred to as a digital versatile disc or a digital video disc), a random access memory (“RAM”), or a read-only memory (“ROM”). The instructions may form an application program tangibly embodied on a processor-readable medium. Instructions may be, for example, in hardware, firmware, software, or a combination. Instructions may be found in, for example, an operating system, a separate application, or a combination of the two. A processor may be characterized, therefore, as, for example, both a device configured to carry out a process and a device that includes a processor-readable medium (such as a storage device) having instructions for carrying out a process. Further, a processor-readable medium may store, in addition to or in lieu of instructions, data values produced by an implementation.
As will be evident to one of skill in the art, implementations may produce a variety of signals formatted to carry information that may be, for example, stored or transmitted. The information may include, for example, instructions for performing a method, or data produced by one of the described implementations. For example, a signal may be formatted to carry as data the rules for writing or reading the syntax of a described embodiment, or to carry as data the actual syntax-values written by a described embodiment. Such a signal may be formatted, for example, as an electromagnetic wave (for example, using a radio frequency portion of spectrum) or as a baseband signal. The formatting may include, for example, encoding a data stream and modulating a carrier with the encoded data stream. The information that the signal carries may be, for example, analog or digital information. The signal may be transmitted over a variety of different wired or wireless links, as is known. The signal may be stored on a processor-readable medium.
A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made. For example, elements of different implementations may be combined, supplemented, modified, or removed to produce other implementations. Additionally, one of ordinary skill will understand that other structures and processes may be substituted for those disclosed and the resulting implementations will perform at least substantially the same function(s), in at least substantially the same way(s), to achieve at least substantially the same result(s) as the implementations disclosed. Accordingly, these and other implementations are contemplated by this application.
Number | Date | Country | Kind
---|---|---|---
14306387.3 | Sep 2014 | EP | regional

Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/EP2015/069398 | 8/25/2015 | WO | 00