The present disclosure generally relates to a new bag-of-words image feature encoder, and more particularly to a new bag-of-words encoder whose encoding function is differentiable.
Image search methods can be broadly split into two categories. In the first category, such as semantic search, the search system is given a visual concept, and the aim is to retrieve images containing the visual concept. For example, the user might want to find images containing a cat.
In the second category, such as image retrieval, the search system is given an image of a scene, and the aim is to find all images of the same scene modulo some task-related transformation. Examples of simple transformations include changes in scene illumination, image cropping or scaling. More challenging transformations include wide changes in the perspective of the camera, high compression ratios, or picture-of-video-screen artifacts.
Common to both semantic search and image retrieval methods is the need to encode the image into a single, fixed-dimensional image feature vector. Many successful image feature encoders have been proposed, and these image feature encoders generally operate on fixed-dimensional local descriptor vectors extracted from densely or sparsely sampled local regions of the image. Such a feature encoder aggregates these local descriptors to produce a higher-dimensional image feature vector. Examples of such feature encoders include the conventional bag-of-words encoder, the Fisher encoder, and the VLAD encoder. All these encoding methods depend on specific models of the data distribution in the local-descriptor space. For bag-of-words and VLAD, the model is a codebook obtained using K-means, while the Fisher encoding is based on a Gaussian Mixture Model (GMM). In both cases, the model defining the encoding scheme is built in an unsupervised manner using an optimization objective unrelated to the image search task.
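The conventional bag-of-words aggregation described above can be sketched as follows; this is an illustrative NumPy sketch (the helper name `bow_encode` and the toy data are not from the disclosure), hard-assigning each local descriptor to its nearest code word and returning the normalized histogram of assignments.

```python
import numpy as np

def bow_encode(descriptors, codebook):
    """Conventional bag-of-words: hard-assign each local descriptor to its
    nearest code word and return the normalized histogram of assignments."""
    diff = descriptors[:, None, :] - codebook[None, :, :]
    nearest = np.sum(diff ** 2, axis=2).argmin(axis=1)   # Voronoi-cell assignment
    histogram = np.bincount(nearest, minlength=len(codebook)).astype(float)
    return histogram / len(descriptors)                  # divide by the descriptor count

# Toy example: four 2-D descriptors, K = 2 code words
descriptors = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]])
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
x = bow_encode(descriptors, codebook)   # → [0.5, 0.5]
```

The hard argmin assignment is exactly what makes this encoding non-differentiable with respect to the code words, which motivates the approximation introduced later in this disclosure.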
Recent work has focused on learning the feature encoder model to make it better suited to the task at hand. A natural learning objective to use in this situation is the max-margin objective otherwise used to learn support vector machines. Notably, in Vladyslav Sydorov, Mayu Sakurada, and CH Lampert, Deep Fisher Kernels End to End Learning of the Fisher Kernel GMM Parameters, in Computer Vision and Pattern Recognition, 2014, the system learns the components of the GMM used in the Fisher encoding by optimizing, relative to the GMM mean and variance parameters, the same objective that produces the linear classifier commonly used to carry out semantic search. Approaches based on deep Convolutional Neural Networks (CNNs) can also be interpreted as feature learning methods, and these define the new state-of-the-art baseline in semantic search. Indeed, Sydorov et al. discuss how the Fisher encoder can be interpreted as a deep network, since both consist of alternating layers of linear and non-linear operations.
The main reason why Sydorov et al. use the Fisher encoder is its differentiability. The bag-of-words model, on the other hand, is not differentiable, yet offers a number of advantages including interpretability and low computational cost. Therefore, it would be desirable to have a differentiable bag-of-words encoder.
In accordance with an aspect of the present invention, a method for processing in an encoder is disclosed. The method comprises receiving, by the encoder, a set of local descriptors derived from an image; obtaining, by the encoder, K code words, wherein K>1; and determining, by the encoder, a first element of a bag-of-words image feature vector by using a differentiable function having a difference between each of the local descriptors and one of the K code words as a first parameter, wherein each of the K code words is used in the differentiable function for determining a different element of the bag-of-words image feature vector.
In accordance with an aspect of the present invention, an image feature encoder is disclosed. The image feature encoder comprises memory means for storing an image and a set of local descriptors derived from the image; and processing means, characterized in that the processing means is configured to: receive a set of local descriptors derived from an image; obtain K code words, wherein K>1; and determine a first element of a bag-of-words image feature vector by using a differentiable function having a difference between each of the local descriptors and one of the K code words as a first parameter, wherein each of the K code words is used in the differentiable function for determining a different element of the bag-of-words image feature vector.
In one embodiment, the memory means is a memory and the processing means is a processor.
In one embodiment, the differentiable function is a power function with a base of 2 or more, such as an exponential function. The exponential function may have, as the first parameter, a sum of exponentials of a norm of the difference between each of the local descriptors and the code word used for determining the first element of the bag-of-words image feature vector. The differentiable function may have a covariance matrix as a second parameter. The differentiable function may include a dividing function that divides the differentiable function by the number of the local descriptors.
In one embodiment, obtaining K code words comprises retrieving the K code words from a memory.
In another embodiment, the code word used for determining the first element of the bag-of-words image feature vector may be updated according to a derivative of the first element of the image feature vector with respect to that code word, and the updated code word is used to update the first element of the bag-of-words image feature vector.
In another embodiment, other bag-of-words image feature vectors are respectively determined in a similar manner for different sets of local descriptors derived from other images; all the determined image feature vectors are used to determine a classifier by optimizing a second function having the bag-of-words image feature vectors and the classifier as parameters; and the classifier is used to classify an image as including or not including a particular scene.
In yet another aspect of the invention, a non-transitory computer readable medium is disclosed. The non-transitory computer readable medium has stored thereon instructions of program code for executing steps of methods disclosed herein according to the principles of the invention.
The aforementioned brief summary of exemplary embodiments of the present invention is merely illustrative of the inventive concepts presented herein, and is not intended to limit the scope of the present invention in any manner.
The image feature encoding system 10 may be any device, such as an online server, a cell phone, a PC, a laptop, or a tablet, that requires image feature encoding for, for example, the purpose of searching images for a scene or a visual concept.
The processor 101 may include one or more processing units, such as microprocessors, digital signal processors, or a combination thereof. The processor 101 is operative or configured to execute software/firmware code, including software/firmware code for implementing the new image feature encoding function according to the principles of the embodiment of the invention. For example, the processor 101 is operative or configured to encode a set of local descriptors from an image to generate a bag-of-words image feature vector having K elements by obtaining K code words respectively corresponding to different elements of the bag-of-words image feature vector, wherein K>1, and determining a first element of the bag-of-words image feature vector as a differentiable function of a difference between each of the local descriptors and the code word corresponding to the first element of the bag-of-words image feature vector. Other elements of the bag-of-words image feature vector can be determined in a similar manner.
For another example, the processor 101 is operative or configured to perform image searching by obtaining K code words, wherein K>1; generating bag-of-words image feature vectors, each having K elements and derived from a different set of local descriptors derived respectively from a different one of N training images, at least one of the N training images comprising a particular scene and at least one not comprising the particular scene, each element of each bag-of-words image feature vector corresponding to a different one of K code words, wherein N>1, each element of one of the bag-of-words image feature vectors is determined from a differentiable function having a first parameter, which is a difference between each of the set of descriptors from which the one of the bag-of-words image vectors is derived and a corresponding code word; determining a classifier comprising K elements by optimizing a second function having the bag-of-words image feature vectors and the classifier as parameters; and classifying an image as including the particular scene according to the classifier.
The processor 101 is also operative or configured to perform and/or enable other functions of the image feature encoding system 10 including, but not limited to, detecting and responding to user inputs from the user input terminal 109 for operating and/or maintaining the image feature encoding system 10, displaying user menus or instructions, training images, or other images on the display device 107, reading and writing data including images, local descriptors, image feature vectors, constants, and matrices from and to the memory 105, and/or other functions.
The memory 105 is operative to perform data storage functions of the image feature encoding system 10. According to an exemplary embodiment, the memory 105 stores data including, but not limited to, images, local descriptors, image feature vectors, matrices, image classifiers, software code, and/or other data. The memory 105 may include volatile and/or non-volatile memory regions and storage devices such as hard disk drives and DVD drives. A part of the memory is a non-transitory program storage device readable by the processor 101, tangibly embodying a program of instructions executable by the processor 101 to perform program steps as described herein according to the principles of the invention.
The network interface 103 is operative or configured to perform network interfacing functions of the image feature encoding system 10. According to an exemplary embodiment, the network interface 103 is operative or configured to receive signals, such as images, from a server, such as a Google server. The network interface 103 may include wireless, such as Wi-Fi, and/or wired, such as Ethernet, interfaces.
The inventors recognize that the gradient of the encoded image feature is necessary to learn the model for a specific task, and they accordingly derive a differentiable bag-of-words encoder. The proposed bag-of-words formulation is an approximation to the conventional bag-of-words formulation, yet its performance in semantic search is comparable with that of the conventional formulation.
In accordance with an aspect of the present principles, a new bag-of-words encoding method and an apparatus making use of the new bag-of-words encoding method are disclosed. Before proceeding to describe the new bag-of-words encoding method of the present principles, the following discussion on notation will prove useful.
Scalars, vectors and matrices are denoted, respectively, using standard, bold, and uppercase-bold typeface (e.g., scalar a, vector a and matrix A). The symbol v_k denotes a vector from a sequence v_1, v_2, …, v_N, while v_k (in scalar typeface) denotes the k-th coefficient of vector v. The symbol [a_i]_i (respectively, [a_{i,j}]_{i,j}) denotes concatenation of scalars a_i (a_{i,j}) to form a single vector (matrix). Finally, the symbol ∂f/∂g denotes the Jacobian matrix with (i,j)-th entry ∂f_i/∂g_j.
In the following, the semantic search approach is used as an example for explaining the state-of-the-art approach for feature learning. The present principles of the invention can also be applied to the image retrieval approach. In the context of a binary classification problem, a set of N training images is given, from which a training set of annotated images {I_i, y_i}_i, y_i ∈ {1, −1}, is obtained in a conventional manner, where the index i ranges from 1 to N, N>1. At least one training image contains a particular visual concept, such as a cat, and at least one does not. For a training image containing the particular concept, y_i is assigned 1, and for a training image that does not contain the particular concept, y_i is assigned −1. Each annotated image I_i consists of a set of local descriptors {s_j ∈ ℝ^d}_{j=1}^{M_i}, which are encoded to produce an image feature vector x(Θ; I_i) ∈ ℝ^D (or simply x_i(Θ)), where s_j is a vector of size d and x_i is a vector of size D. The symbol M_i represents the total number of local descriptors in annotated image I_i. The encoding process depends on the parameters represented by Θ. For example, for the case of bag-of-words encoding, Θ represents the codebook; for Fisher encoding, Θ represents all the GMM parameters.
The encoded feature vectors are used to learn a linear classifier, w, which is a vector having the same number of elements as an image feature vector, by minimizing the following function:

R(w, Θ) = (λ/2) ‖w‖² + (1/N) Σ_{i=1}^{N} C(y_i wᵀ x_i(Θ)),   (1)

where λ is a regularization constant,
C(z) can denote the hinge loss max(0, 1−z) or the logistic loss log(1 + exp(−z)),
and N is the total number of training images. According to the principles of the embodiment of the invention, feature learning is performed by jointly learning w and Θ so as to minimize this same cost:
argmin_{w,Θ} R(w, Θ).   (2)
In one embodiment, a Stochastic Gradient Descent (SGD)-based block-coordinate descent approach is used to optimize equation (1), where one or more SGD steps are first applied to optimize with respect to w, and subsequently to optimize with respect to Θ. The number of SGD steps applied per block can be chosen experimentally.
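The w-block of this alternating optimization can be sketched as follows. This is a minimal, illustrative NumPy sketch of SGD on the max-margin objective of equation (1) with respect to w only (the Θ-block would additionally require the encoder Jacobian of equation (6)); the hinge loss choice, regularization constant `lam`, and step size `eta` are illustrative assumptions, not values from the disclosure.

```python
import numpy as np

def hinge(z):
    """Hinge loss C(z) = max(0, 1 - z)."""
    return np.maximum(0.0, 1.0 - z)

def objective(w, X, y, lam=0.01):
    """R(w) = (lam/2)||w||^2 + mean hinge loss over the encoded features X."""
    return 0.5 * lam * np.dot(w, w) + np.mean(hinge(y * (X @ w)))

def sgd_step_w(w, x_i, y_i, lam=0.01, eta=0.1):
    """One SGD step on w for a single sample; the hinge loss contributes
    subgradient -y_i * x_i only when the margin y_i w^T x_i is below 1."""
    grad = lam * w
    if y_i * np.dot(w, x_i) < 1.0:
        grad = grad - y_i * x_i
    return w - eta * grad

# Tiny separable illustration with two encoded feature vectors
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
w = np.zeros(2)
for _ in range(200):
    for x_i, y_i in zip(X, y):
        w = sgd_step_w(w, x_i, y_i)
before = objective(np.zeros(2), X, y)
after = objective(w, X, y)
assert after < before   # training reduces the objective
```

In the full block-coordinate scheme, a few such steps on w would alternate with steps on Θ using the gradients derived below.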
In order to obtain the SGD update rule for equation (2) with respect to Θ, equation (1) is re-written as

R(w, Θ) = Σ_{i=1}^{N} R_i(w, Θ),   (3)

where

R_i(w, Θ) = (1/N) (C(y_i wᵀ x_i(Θ)) + (λ/2) ‖w‖²),   (4)

with λ a regularization constant.
Letting ∂f/∂g denote the Jacobian of a possibly vector-valued function f with respect to its parameters g, the SGD update rule for the optimization with respect to Θ in equation (1) is

Θ ← Θ − η ∂R_i/∂Θ,   (5)

where η is a step-size parameter and i indexes a randomly selected training image.
The partial derivative ∂R_i/∂Θ can be obtained via back-propagation as follows:

∂R_i/∂Θ = (∂R_i/∂x_i) (∂x_i/∂Θ),   (6)

where ∂R_i/∂x_i depends on the penalty function C(z) used, and ∂x_i/∂Θ depends on the encoding function used.
Since Θ represents more than one parameter, the partial derivative can be computed per parameter.
In the codebook learning for bag-of-words encoding, the parameters include the code words in a codebook, Θ = [c_k]_k, where c_k is a code word, which is a vector.
Let C_k denote the Voronoi cell associated with code word c_k, and define an indicator function as follows:

v_k(s) = 1 if s ∈ C_k, and v_k(s) = 0 otherwise.   (7)
Then we can write the conventional bag-of-words encoding function as

x_k^BoW = (1/M) Σ_{j=1}^{M} v_k(s_j), k = 1, …, K,   (8)
where x^BoW denotes the conventional bag-of-words image feature vector derived from the local descriptors of an image, and M is the total number of local descriptors of that image.
It is clear that equation (8) is not differentiable with respect to a code word c_k. According to the principles of the invention, an approximation of equation (8) is provided so that the equation becomes differentiable. One example of the approximation is to substitute v_k(s) in equation (8) by a differentiable function. This substituted differentiable function should preserve the concept of a bag, e.g., the cell as used in equation (7). Examples of such differentiable functions are power functions with a base of 2 or more, such as an exponential function. Although the bags under a substituted differentiable function may overlap with one another, a substituted differentiable function gives more weight to a local descriptor that is closer to the code word associated with a bag. As such, a local descriptor has the strongest association with the closest code word and the concept of a bag is preserved. In the following example, the substituted differentiable function is an un-normalized Gaussian exp(−(s − c_k)ᵀ R_k^{-1} (s − c_k)), where R_k is a covariance or correlation matrix, which is element dependent. By doing so, equation (8) can be rewritten as follows:

x_k^exp = (1/M) Σ_{j=1}^{M} exp(−(s_j − c_k)ᵀ R_k^{-1} (s_j − c_k)), k = 1, …, K.   (9)
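The exp-weighted encoding of equation (9) can be sketched as follows. This is an illustrative NumPy sketch assuming the factorization R_k^{-1} = Q_k Q_kᵀ introduced below; the function name `soft_bow_encode` and the toy data are hypothetical.

```python
import numpy as np

def soft_bow_encode(descriptors, codebook, Q_list):
    """Differentiable bag-of-words of equation (9): element k averages, over
    local descriptors s_j, the weight exp(-(s_j - c_k)^T Q_k Q_k^T (s_j - c_k)),
    with R_k^{-1} = Q_k Q_k^T."""
    x = np.empty(len(codebook))
    for k, (c_k, Q_k) in enumerate(zip(codebook, Q_list)):
        d = descriptors - c_k                          # differences s_j - c_k
        x[k] = np.mean(np.exp(-np.sum((d @ Q_k) ** 2, axis=1)))
    return x

# Toy example: one descriptor sitting exactly on the first code word
descriptors = np.array([[0.0, 0.0]])
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
Q_list = [np.eye(2), np.eye(2)]
x = soft_bow_encode(descriptors, codebook, Q_list)   # → [1.0, exp(-2) ≈ 0.1353]
```

Note that, unlike the hard-assignment histogram, every code word receives a non-zero contribution from every descriptor, with the weight decaying smoothly with distance.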
x^exp represents the bag-of-words image feature vector encoded using equation (9), each element of which is a function of the differences between the local descriptors derived from an image and the corresponding code word c_k according to the principles of the invention. Element k of x^exp is associated with the code word c_k, and each element of x^exp can be computed sequentially or at the same time as one or more other elements. The resulting representation has the advantage that it is differentiable and that its derivatives with respect to c_k and R_k are non-zero almost everywhere. In order to (1) enforce the positive-definiteness of R_k^{-1} and (2) avoid differentiating with respect to a matrix inverse, we let R_k^{-1} = Q_k Q_kᵀ and differentiate with respect to Q_k. In this case, the parameters represented by Θ include the code words c_k and the matrices Q_k. The Jacobians required in equation (6) for Θ = c_k or Q_k are given below (note that ∂x_j^exp/∂c_k = 0 for j ≠ k):

∂x_k^exp/∂c_k = (2/M) Σ_{j=1}^{M} g_j Q_k Q_kᵀ d_j,   (10)

∂x_k^exp/∂Q_k = −(2/M) Σ_{j=1}^{M} g_j d_j d_jᵀ Q_k,   (11)
where we have used the definitions d_j = s_j − c_k and g_j = exp(−d_jᵀ Q_k Q_kᵀ d_j) for convenience.
Expression (6) using (10), summed over all samples and set to a zero vector, produces an expression relating each code word c_k to a weighted combination of the local descriptors, where the α_i are constants depending on y_i.
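The derivative with respect to a code word, of the form given in equation (10), can be verified against central finite differences. The following NumPy sketch is illustrative (names and random toy data are hypothetical); it assumes the exp-weighted element of equation (9) with R_k^{-1} = Q_k Q_kᵀ.

```python
import numpy as np

def grad_ck(descriptors, c_k, Q_k):
    """Analytic derivative of element x_k of the exp-weighted feature
    (equation (9)) with respect to its code word c_k, using
    R_k^{-1} = Q_k Q_k^T; one of the Jacobians needed in equation (6)."""
    M = len(descriptors)
    A = Q_k @ Q_k.T
    d = descriptors - c_k                              # d_j = s_j - c_k
    g = np.exp(-np.einsum('ji,ik,jk->j', d, A, d))     # g_j = exp(-d_j^T A d_j)
    return (2.0 / M) * (g[:, None] * (d @ A)).sum(axis=0)

def x_k(descriptors, c_k, Q_k):
    """Element k of the exp-weighted bag-of-words feature."""
    d = descriptors - c_k
    return np.mean(np.exp(-np.sum((d @ Q_k) ** 2, axis=1)))

# Central finite-difference check of the analytic gradient
rng = np.random.default_rng(0)
S = rng.normal(size=(5, 3))
c = rng.normal(size=3)
Q = rng.normal(size=(3, 3))
h = 1e-6
num = np.array([(x_k(S, c + h * e, Q) - x_k(S, c - h * e, Q)) / (2 * h)
                for e in np.eye(3)])
assert np.allclose(grad_ck(S, c, Q), num, atol=1e-5)
```

Such a numerical check is a standard way to validate hand-derived encoder Jacobians before plugging them into the SGD updates.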
Plugging our proposed approximate bag-of-words model into the learning machinery described in equations (3)-(6) results in a non-convex optimization problem, and for this model the initialization is important. In fact, if the initialized (i.e., before learning) approximate bag-of-words model produces results that are substantially equal to those of the conventional (non-approximate) bag-of-words model, then the learning process starts from the conventional baseline and can be expected to improve upon it. Hence the initialization scheme affects the results. We describe the proposed initialization scheme here and show in Table 1 that the proposed approximation using exponential weighting according to the principles of the embodiment of the invention indeed produces results that are substantially equal to those of the conventional bag-of-words feature encoder. The results displayed in Table 1 are for the class cow of the Pascal Visual Object Classes (VOC) dataset.
Our proposed initialization is carried out by first learning a codebook using K-means. In this embodiment, a user specifies the number of code words c_k, for example L (i.e., k ranges from 1 to L), to the K-means algorithm, and all the local descriptors of the N training images are provided as input. The algorithm then randomly selects L local descriptors as the L initial code words. Next, the algorithm assigns each local descriptor to its closest code word and substitutes each code word c_k by the mean vector of all the local descriptors s_j assigned to it. The algorithm iterates until a difference, such as the Euclidean distance, between the present result and the previous result is below a threshold.
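The K-means initialization just described can be sketched as follows (Lloyd's algorithm); the function name, stopping threshold, and toy data are illustrative assumptions.

```python
import numpy as np

def kmeans_codebook(descriptors, L, iters=100, tol=1e-8, seed=0):
    """Lloyd's K-means, matching the described initialization: pick L
    descriptors at random as code words, then alternately assign each
    descriptor to its nearest code word and replace each code word by the
    mean of its assigned descriptors, until the change falls below tol."""
    rng = np.random.default_rng(seed)
    codebook = descriptors[rng.choice(len(descriptors), L, replace=False)].copy()
    for _ in range(iters):
        dist = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        new = np.array([descriptors[assign == k].mean(axis=0)
                        if np.any(assign == k) else codebook[k]
                        for k in range(L)])
        if np.linalg.norm(new - codebook) < tol:
            break
        codebook = new
    return codebook

# Two tight, well-separated clusters: the learned code words should land
# near the cluster means (0, 0) and (5, 5)
rng = np.random.default_rng(3)
S = np.vstack([rng.normal(size=(20, 2)) * 0.1,
               rng.normal(size=(20, 2)) * 0.1 + 5.0])
cb = kmeans_codebook(S, 2)
```

The resulting code words serve as the initial c_k, from which the per-cell empirical covariance matrices R_k′ can then be computed.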
According to the principles of the embodiment of the invention, the resulting c_k from the K-means algorithm are used as the initial code words, and the correlation matrices are initialized according to the following formula:
R_k = α_k R_k′,
where the R_k′ are the empirical covariance (correlation) matrices computed using {s_j | s_j ∈ C_k} in a conventional manner. The scale factors α_k ensure that, at initialization time, an adequate number of samples s_j produce a non-negligible weight in the summations of (9). They can be chosen by means of a numerical optimization to satisfy the following equation:

(1/N) Σ_{i=1}^{N} x^BoW(I_i) = (1/N) Σ_{i=1}^{N} x^exp(I_i).   (17)
The two averages need not be exactly equal; it is sufficient that they are substantially the same. In one embodiment, a user is allowed to specify the maximum allowable difference, such as the Euclidean distance, between the two average image feature vectors.
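The disclosure does not specify the numerical optimization used to choose the α_k. One plausible sketch, assuming the matching is done element-wise and exploiting the fact that the exp-weighted element increases monotonically with α_k, is a one-dimensional bisection; all names, bracket bounds, and toy data below are illustrative assumptions.

```python
import numpy as np

def calibrate_alpha(descriptors, c_k, Rp_inv, target, lo=1e-3, hi=1e3):
    """Choose the scale factor alpha_k in R_k = alpha_k * R_k' by bisection,
    so that the exp-weighted element (with R_k^{-1} = R_k'^{-1} / alpha_k)
    matches a target value, e.g. the conventional bag-of-words element.
    The element increases monotonically with alpha_k, so bisection applies."""
    d = descriptors - c_k
    q = np.einsum('ji,ik,jk->j', d, Rp_inv, d)   # (s_j - c_k)^T R_k'^{-1} (s_j - c_k)
    soft = lambda a: np.mean(np.exp(-q / a))
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if soft(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Illustrative calibration against a target average weight of 0.5
rng = np.random.default_rng(2)
d0 = rng.normal(size=(50, 3))
c = np.zeros(3)
alpha = calibrate_alpha(d0, c, np.eye(3), 0.5)
```

In practice one such scalar search per code word suffices, since each α_k only affects element k of the feature vector.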
At step 615, the processor 101 is operative or configured to determine a first element of a bag-of-words image feature vector by using a differentiable function having a difference between each of the local descriptors and one of the K code words as a first parameter, wherein each of the K code words is used in the differentiable function for determining a different element of the bag-of-words image feature vector. Other elements of the image feature vector may be determined in a similar manner. In one embodiment, the differentiable function includes an exponential function. In another embodiment, the differentiable function has, as the first parameter, a sum of exponentials of a norm of the difference between each of the local descriptors and the code word used for determining the first element of the bag-of-words image feature vector. In yet another embodiment, the differentiable function further has a covariance matrix as a second parameter. The differentiable function can include a dividing function dividing the differentiable function by the number of local descriptors. In yet another embodiment, the differentiable function is the one shown in equation (9). In yet another embodiment, the differentiable function is a power function with a base of 2 or more.
According to the principles of the embodiment of the invention, a code word in the memory 105 may be updated according to a rule. For example, a code word can be updated according to a gradient or a derivative of the element of the image feature vector determined by using the code word, taken with respect to the code word, such as the one specified in equation (5), and steps 615 and/or 610 are repeated to update the element of the bag-of-words image feature vector using the updated code word. The gradient or the derivative can be obtained using equation (10). As such, the code word can be optimized according to the principles of the embodiment of the invention. Even when the processor 101 obtains the initial K code words from an external device, the processor 101 may update the code words in the memory 105 and, in the next iteration, obtain the code words by retrieving the updated code words from the memory 105.
In one embodiment, the differentiable function further has a covariance matrix, Qk, as a parameter. A different covariance matrix is used in the differentiable function to determine a different element of the bag-of-words image feature vector. In this embodiment, the covariance matrix, Qk, can be updated according to a gradient or derivative of the element of the image feature vector determined by using the covariance matrix, Qk, such as the one specified in equation (5) where Qk is one of the parameters, Θ, and the updated covariance matrix is used to update the element of the bag-of-words image feature vector. The gradient or the derivative can be obtained using equation (11). Thus, both the code words and the covariance matrixes can be optimized according to the principles of the embodiment of the invention. Other elements of the image feature vector can be determined in a similar manner.
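As with the code words, the derivative with respect to Q_k, of the form given in equation (11), can be verified entry-by-entry against finite differences. The following NumPy sketch is illustrative (names and random toy data are hypothetical) and assumes the factorization R_k^{-1} = Q_k Q_kᵀ.

```python
import numpy as np

def grad_Qk(descriptors, c_k, Q_k):
    """Analytic derivative of element x_k of the exp-weighted feature with
    respect to Q_k, where R_k^{-1} = Q_k Q_k^T keeps R_k^{-1} positive
    semi-definite and avoids differentiating through a matrix inverse."""
    M = len(descriptors)
    d = descriptors - c_k
    g = np.exp(-np.sum((d @ Q_k) ** 2, axis=1))    # g_j = exp(-d_j^T Q Q^T d_j)
    return (-2.0 / M) * sum(gj * np.outer(dj, dj) @ Q_k
                            for gj, dj in zip(g, d))

def x_k(descriptors, c_k, Q_k):
    d = descriptors - c_k
    return np.mean(np.exp(-np.sum((d @ Q_k) ** 2, axis=1)))

# Finite-difference check of every entry of the gradient
rng = np.random.default_rng(1)
S, c, Q = rng.normal(size=(4, 2)), rng.normal(size=2), rng.normal(size=(2, 2))
G = grad_Qk(S, c, Q)
h = 1e-6
for i in range(2):
    for j in range(2):
        E = np.zeros((2, 2)); E[i, j] = h
        num = (x_k(S, c, Q + E) - x_k(S, c, Q - E)) / (2 * h)
        assert abs(G[i, j] - num) < 1e-5
```

With both gradients in hand, c_k and Q_k can be updated jointly within the same SGD loop.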
At step 715, the processor 101 is operative or configured to generate bag-of-words image feature vectors, each having K elements and derived from a different set of local descriptors derived respectively from a different one of N training images, at least one of the N training images comprising a particular scene and at least one not comprising the particular scene, each element of each bag-of-words image feature vector is determined by using a different one of K code words, wherein N>1, each element of one of the bag-of-words image feature vectors is determined from a differentiable function having a first parameter, which is a difference between each of the set of descriptors, from which the one of the bag-of-words image vectors is derived, and a code word used to determine the element of one of the bag-of-words image vectors.
In one embodiment, the particular scene is a visual concept and a user can just type in the visual concept, such as a cat, and the system will search images containing the visual concept. In another embodiment, the particular scene is an image of a scene and the system will search for images containing the scene modulo some task-related transformation.
In one embodiment, the differentiable function includes a power function with a base of 2 or more, such as but not limited to an exponential function. In another embodiment, the differentiable function has, as a parameter, a sum of exponentials of a norm of the difference between each of the local descriptors and the code word used to determine the element of the image feature vector. In yet another embodiment, the differentiable function further has a covariance matrix as a parameter. The differentiable function can include a dividing function dividing the differentiable function by the number of local descriptors. In yet another embodiment, the differentiable function is the one shown in equation (9).
At step 720, the processor 101 is operative or configured to determine a classifier w comprising K elements by optimizing a second function having the bag-of-words image feature vectors and the classifier as parameters. In one embodiment, the function defined in equation (1) or (2) is the second function.
In one embodiment, the processor 101 is further operative or configured to compute a gradient or derivative of a first element of a first one of the bag-of-words image feature vectors with respect to a first one of the code words used to determine the first element of the first one of the bag-of-words image vectors; update the first one of the code words in the memory 105 according to the derivative; re-determine the first element of the first one of the image feature vectors; and re-compute the classifier w with the updated first element of the first one of the image feature vectors and the other elements of the first one of the image feature vectors. An example of computing a derivative is shown in equation (10), an example of updating the code word is shown in equation (5) where the code word is one of the parameters, Θ, and an example of re-computing the classifier w is shown in equation (1) or (2). Other code words can be updated in a similar manner and the classifier is re-computed accordingly. As such, a code word can be optimized according to the principles of the embodiment of the invention.
In another embodiment, the differentiable function further has a covariance matrix, Q_k, as a second parameter. A different covariance matrix is used in the differentiable function to determine a different element of one of the bag-of-words image feature vectors. The processor 101 is further operative or configured to compute a gradient or derivative of a first element of a first one of the bag-of-words image feature vectors with respect to a covariance matrix, Q_k, used to determine the first element of the first one of the bag-of-words image vectors; update the covariance matrix, Q_k, in the memory 105 according to the derivative; re-determine the first element of the first one of the image feature vectors; and re-compute the classifier w with the updated first element of the first one of the image feature vectors and the other elements of the first one of the image feature vectors. An example of computing a derivative is shown in equation (11), an example of updating the covariance matrix, Q_k, is shown in equation (5) where Q_k is one of the parameters, Θ, and an example of re-computing the classifier w is shown in equation (1) or (2). Other covariance matrices can be updated in a similar manner and the classifier is re-computed accordingly. As such, both a code word and a covariance matrix can be optimized according to the principles of the embodiment of the invention.
At step 725, the processor 101 is operative or configured to classify an image as including the particular scene according to the classifier. For example, if the dot product of the classifier and the image feature vector of the image is positive, the image is classified to contain the particular scene, and if the dot product is not positive, the image is classified to not contain the particular scene.
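The dot-product classification rule of step 725 can be sketched as follows; the classifier weights and feature values shown are illustrative, hypothetical numbers.

```python
import numpy as np

def classify(w, feature):
    """Classify an encoded image feature vector with the learned linear
    classifier w: a positive dot product means the particular scene is
    present; otherwise it is absent."""
    return 1 if np.dot(w, feature) > 0 else -1

# Illustrative classifier and feature vectors (hypothetical values)
w = np.array([0.5, -0.25])
assert classify(w, np.array([1.0, 0.0])) == 1    # positive dot product
assert classify(w, np.array([0.0, 1.0])) == -1   # non-positive dot product
```
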
In another embodiment, a covariance matrix is a function of an empirical matrix computed from the local descriptors assigned to or associated with a first code word used to determine the same element of a bag-of-words image feature vector as the covariance matrix. The covariance matrix may be a product of a constant and the empirical matrix. The constant is a scale factor, which is selected to ensure that, at initialization, the local descriptors produce a non-negligible weight in the summations defined in equation (9). In one embodiment, the constant is selected such that an average of image feature vectors derived using a conventional bag-of-words encoding technique is substantially the same as an average of the bag-of-words image feature vectors derived according to the principles of the embodiment of the invention, as shown in equation (17). In one embodiment, the covariance matrix is singular.
The implementations described herein may be implemented in, for example, a method or a process, an apparatus, a software program, a data stream, or a signal. Even if only discussed in the context of a single form of implementation (for example, discussed only as a method or a device), the implementation of features discussed may also be implemented in other forms (for example a program). An apparatus may be implemented in, for example, appropriate hardware, software, and firmware. The methods may be implemented in, for example, an apparatus such as, for example, a processor, which refers to processing devices in general, including, for example, a computer, a microprocessor, an integrated circuit, or a programmable logic device. Processors also include communication devices, such as, for example, Smartphones, tablets, computers, mobile phones, portable/personal digital assistants (“PDAs”), and other devices that facilitate communication of information between end-users.
Implementations of the various processes and features described herein may be embodied in a variety of different equipment or applications, particularly, for example, equipment or applications associated with data encoding, data decoding, view generation, texture processing, and other processing of images and related texture information and/or depth information. Examples of such equipment include an encoder, a decoder, a post-processor processing output from a decoder, a pre-processor providing input to an encoder, a video coder, a video decoder, a video codec, a web server, a set-top box, a laptop, a personal computer, a cell phone, a PDA, and other communication devices. As should be clear, the equipment may be mobile and even installed in a mobile vehicle.
Additionally, the methods may be implemented by instructions being performed by a processor, and such instructions (and/or data values produced by an implementation) may be stored on a processor-readable medium such as, for example, an integrated circuit, a software carrier or other storage device such as, for example, a hard disk, a compact diskette (“CD”), an optical disc (such as, for example, a DVD, often referred to as a digital versatile disc or a digital video disc), a random access memory (“RAM”), or a read-only memory (“ROM”). The instructions may form an application program tangibly embodied on a processor-readable medium. Instructions may be, for example, in hardware, firmware, software, or a combination. Instructions may be found in, for example, an operating system, a separate application, or a combination of the two. A processor may be characterized, therefore, as, for example, both a device configured to carry out a process and a device that includes a processor-readable medium (such as a storage device) having instructions for carrying out a process. Further, a processor-readable medium may store, in addition to or in lieu of instructions, data values produced by an implementation.
As will be evident to one of skill in the art, implementations may produce a variety of signals formatted to carry information that may be, for example, stored or transmitted. The information may include, for example, instructions for performing a method, or data produced by one of the described implementations. For example, a signal may be formatted to carry as data the rules for writing or reading the syntax of a described embodiment, or to carry as data the actual syntax-values written by a described embodiment. Such a signal may be formatted, for example, as an electromagnetic wave (for example, using a radio frequency portion of spectrum) or as a baseband signal. The formatting may include, for example, encoding a data stream and modulating a carrier with the encoded data stream. The information that the signal carries may be, for example, analog or digital information. The signal may be transmitted over a variety of different wired or wireless links, as is known. The signal may be stored on a processor-readable medium.
A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made. For example, elements of different implementations may be combined, supplemented, modified, or removed to produce other implementations. Additionally, one of ordinary skill will understand that other structures and processes may be substituted for those disclosed and the resulting implementations will perform at least substantially the same function(s), in at least substantially the same way(s), to achieve at least substantially the same result(s) as the implementations disclosed. Accordingly, these and other implementations are contemplated by this application.
Number | Date | Country | Kind |
---|---|---|---|
14 306 686.8 | Oct 2014 | EP | regional |