The present invention relates to systems and methods for determining image representations at a pixel level.
Sparse coding refers to a general class of techniques that automatically select a sparse set of vectors from a large pool of possible bases to encode an input signal. While originally proposed as a possible computational model for the efficient coding of natural images in the visual cortex of mammals, sparse coding has been successfully applied to many machine learning and computer vision problems, including image super-resolution and image restoration. More recently, it has gained popularity among researchers working on image classification, due to its state-of-the-art performance on several image classification problems.
Many image classification methods apply classifiers based on a Bag-of-Words (BoW) image representation, where vector-quantization (VQ) is applied to encode the pixels or descriptors of local image patches, after which the codes are linearly pooled within local regions. In this approach, prior to encoding, a codebook is learned with an unsupervised learning method, which summarizes the distribution of signals by a set of “visual words.” The method is very intuitive because the pooled VQ codes represent the image through the frequencies of these visual words.
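The VQ-then-pool pipeline described above can be sketched as follows. This is a minimal illustration, not the method of any particular prior system: the function name `bow_histogram`, the toy codebook, and the toy descriptors are all assumptions made for the example, and the codebook is taken as given rather than learned.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Vector-quantize each descriptor to its nearest visual word, then pool linearly."""
    # Squared Euclidean distance between every descriptor (n, d) and every word (k, d).
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                      # hard assignment: one word per descriptor
    counts = np.bincount(nearest, minlength=codebook.shape[0])
    return counts / counts.sum()                     # average pooling gives word frequencies

# Toy data (assumed): three visual words and four local descriptors in 2-D.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
descriptors = np.array([[0.1, 0.0], [0.9, 1.1], [1.0, 0.9], [0.0, 0.8]])
hist = bow_histogram(descriptors, codebook)          # -> [0.25, 0.5, 0.25]
```

Replacing the hard `argmin` assignment with a sparse code per descriptor is the substitution that the sparse-coding variants of BoW make.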
Sparse coding can easily be plugged into the BoW framework as a replacement for vector quantization. One approach uses sparse coding to construct high-level features, showing that the resulting sparse representations perform much better than conventional representations, e.g., raw image patches. A two-stage approach has also been used, in which a sparse coding model is applied over hand-crafted SIFT features, followed by spatial pyramid max pooling. When applied to general image classification tasks with a simple linear classifier, this approach has achieved state-of-the-art performance on several benchmarks. However, this is achieved using sparse coding on top of hand-designed SIFT features.
A limitation of the above approaches is that they encode local patches independently, ignoring the spatial neighborhood structure of the image.
In one aspect, a two-layer sparse coding model is used for modeling high-order dependency of patches in the same local region of an image. The first layer encodes individual patches, and the second layer then jointly encodes the set of patches that belong to the same group (i.e., image or image region). Accordingly, the model has two levels of codebooks, one for individual patches, and another for sets of patches. In a codebook learning phase, the model learns the two codebooks jointly, where each code in the higher-level codebook represents a dependency pattern among the low-level code words.
In another aspect, a method processes an image having a plurality of pixels by capturing an image using an image sensor; forming a first layer to encode local patches on an image region; and forming a second layer to jointly encode patches from the image region.
In yet another aspect, a system for processing an image having a plurality of pixels includes an image sensor to capture an image; a first layer to encode local patches on an image region; and a second layer to jointly encode patches from the same image region.
Advantages of the preferred embodiment may include one or more of the following. The system uses fully automatic methods to learn features from the pixel level. The system is advantageous in terms of both modeling and computation. Because the individual patches of the same group are jointly encoded, the first-layer codebook yields a more invariant representation compared with standard sparse coding. Moreover, the use of a higher-level codebook, whose codewords directly model the statistical dependency of the first-layer codewords, allows the method to encode more complex visual patterns. Computationally, the encoding optimization is jointly convex over both layers. Finally, the method generates sparse representations on the image pixel level, which shows the promise of learning features fully automatically. The unsupervised two-layer coding scheme generates image representations that are more invariant and discriminative than those obtained through one-layer coding, leading to improved accuracy on image classification tasks.
More details of hierarchical sparse coding are discussed next. In one embodiment, $x_1, \ldots, x_n \in \mathbb{R}^d$ represents a set of n patches within an image. For ease of discussion, the spatial information of the patches is not used; however, it is straightforward to incorporate a dependence on location. The goal is to obtain a sparse representation for this set of patches. In one embodiment, $X = [x_1\ x_2\ \cdots\ x_n] \in \mathbb{R}^{d \times n}$ represents the set of patches in matrix form. Let $B \in \mathbb{R}^{d \times p}$ be a dictionary of codewords for the first level (the patch level), as in standard sparse coding. In addition, a second-level or set-level dictionary $\Phi = (\phi_1\ \phi_2\ \cdots\ \phi_q) \in \mathbb{R}_+^{p \times q}$ can be used, where each element of $\Phi$ is non-negative. The set-level codebook $\Phi$ will be used to model the statistical dependencies among the representations of the patches $x_i$ at the patch level.
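Sparse representations can be determined simultaneously at the patch level and the set level by carrying out the following optimization:

$$(\hat{W}, \hat{\alpha}) = \arg\min_{W,\ \alpha \succeq 0}\ L(W,\alpha) + \frac{\lambda_1}{n}\|W\|_1 + \gamma\|\alpha\|_1 \qquad (1)$$

where

$$L(W,\alpha) = \frac{1}{n}\sum_{i=1}^{n}\left\{\frac{1}{2}\|x_i - B w_i\|_2^2 + \lambda_2\, w_i^\top \Omega(\alpha)\, w_i\right\}. \qquad (2)$$

Here $W = (w_1\ w_2\ \cdots\ w_n) \in \mathbb{R}^{p \times n}$ is the patch-level representation, $\|W\|_1 = \sum_{i=1}^{n}\|w_i\|_1$, $\alpha \in \mathbb{R}^q$ is the set-level representation, and

$$\Omega(\alpha) = \left(\sum_{k=1}^{q} \alpha_k \operatorname{diag}(\phi_k)\right)^{-1}.$$

The $\ell_1$ penalty on each $w_i$ and $\alpha$ encourages sparsity in the representations at both levels.
Taking $\lambda_2 = 0$ reduces the procedure to standard sparse coding, which encodes each patch independently. On the other hand, if $\lambda_2 > 0$, then the term involving $\alpha$ implements a type of weighted $\ell_2$ regularization of $w_i$. Note, however, that pooling these terms together results in an expression of the form

$$\frac{1}{n}\sum_{i=1}^{n} w_i^\top \Omega(\alpha)\, w_i = \operatorname{tr}\big(S(W)\,\Omega(\alpha)\big)$$

where

$$S(W) = \frac{1}{n}\sum_{i=1}^{n} w_i w_i^\top$$

is the sample covariance of the patch-level representations. Thus, the loss function $L(W,\alpha)$ may be written more succinctly as

$$L(W,\alpha) = \frac{1}{2n}\|X - BW\|_F^2 + \lambda_2 \operatorname{tr}\big(S(W)\,\Omega(\alpha)\big).$$

If the $w_i$ vectors were sampled independently from a zero-mean Gaussian with covariance matrix $\Sigma(\alpha) = \Omega(\alpha)^{-1}$, the negative log-likelihood of W would be proportional to $\operatorname{tr}(S(W)\,\Omega(\alpha))$, plus a term that does not depend on W. Thus, the set-level code can be seen to model the covariance structure of the patch-level representations.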
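The pooling identity behind the succinct form of the loss can be checked numerically. The sketch below assumes the loss is the per-patch average of a half squared reconstruction error plus a $\lambda_2$-weighted quadratic form in a diagonal $\Omega(\alpha)$, with $\Omega(\alpha)^{-1} = \operatorname{diag}(\Phi\alpha)$; the normalization constants, the toy dimensions, and the function names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, q, n = 16, 8, 4, 20            # toy sizes (assumed)
X = rng.standard_normal((d, n))      # patches as columns
B = rng.standard_normal((d, p))      # patch-level codebook
W = rng.standard_normal((p, n))      # patch-level codes
Phi = rng.random((p, q)) + 0.1       # nonnegative set-level codebook
alpha = rng.random(q) + 0.1          # nonnegative set-level code
lam2 = 0.5

# Omega(alpha) is diagonal, so only its diagonal is stored.
omega_diag = 1.0 / (Phi @ alpha)

def loss_sum_form(X, B, W, omega_diag, lam2):
    """Average over patches of 0.5*||x_i - B w_i||^2 + lam2 * w_i^T Omega w_i."""
    residual = 0.5 * ((X - B @ W) ** 2).sum(axis=0)
    penalty = lam2 * (omega_diag[:, None] * W ** 2).sum(axis=0)
    return (residual + penalty).mean()

def loss_trace_form(X, B, W, omega_diag, lam2):
    """Same loss written with the pooled matrix S(W) = (1/n) sum_i w_i w_i^T."""
    n = W.shape[1]
    S = W @ W.T / n
    fit = np.linalg.norm(X - B @ W, "fro") ** 2 / (2 * n)
    return fit + lam2 * np.trace(S @ np.diag(omega_diag))

loss_a = loss_sum_form(X, B, W, omega_diag, lam2)
loss_b = loss_trace_form(X, B, W, omega_diag, lam2)
```

The two values agree up to floating-point error, which illustrates that the set-level code interacts with the patch-level codes only through the pooled matrix S(W).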
Hierarchical sparse coding, as defined above, is similar to but fundamentally different from the group sparse coding procedure. Group sparse coding incorporates a group lasso penalty $\|W\|_2$ to encourage similar sparsity patterns for the patches in a group, but it constructs no second codebook at a higher level. Experimental results show that using the set-level codebook yields a hierarchical coding scheme that is interpretable, where the set-level codebook is effectively a shift-invariant representation of correlated patch-level bases.
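Importantly, the encoding optimization problem above is jointly convex in both W and α. To see this, recall that the matrix-fractional function $f(x,Y) = x^\top Y^{-1} x$ is jointly convex as a function of the vector x and the positive-definite matrix Y, and $\sum_{k=1}^{q} \alpha_k \operatorname{diag}(\phi_k)$ is affine in α. Each regularization term is therefore a jointly convex function composed with an affine map, which preserves convexity, and the remaining terms of the objective are convex.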
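A quick numerical sanity check of this joint convexity, restricted to the diagonal case that arises here, is sketched below. It tests the convexity inequality on random pairs of points; it is an illustration rather than a proof, and the dimension, sample count, and tolerance are arbitrary assumptions.

```python
import numpy as np

def matrix_fractional_diag(x, y):
    """f(x, Y) = x^T Y^{-1} x for Y = diag(y) with y > 0."""
    return float(np.sum(x ** 2 / y))

rng = np.random.default_rng(1)
violations = 0
for _ in range(1000):
    x1, x2 = rng.standard_normal(5), rng.standard_normal(5)
    y1, y2 = rng.random(5) + 0.1, rng.random(5) + 0.1   # strictly positive diagonals
    t = rng.random()
    # Value at the convex combination of the two (x, y) points ...
    mixed = matrix_fractional_diag(t * x1 + (1 - t) * x2, t * y1 + (1 - t) * y2)
    # ... must not exceed the same combination of the two function values.
    bound = t * matrix_fractional_diag(x1, y1) + (1 - t) * matrix_fractional_diag(x2, y2)
    if mixed > bound + 1e-9:
        violations += 1
```

No violations are expected, since each summand $x_l^2 / y_l$ is jointly convex for $y_l > 0$.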
An alternating optimization procedure can be used to actually compute the solution, by iteratively optimizing W with α fixed, and then optimizing α with W fixed. The details of these optimizations are described next.
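The optimization of the patch-level representation W for fixed α can be seen as a modified elastic net problem, using a weighted $\ell_2$ norm regularization. Specifically, the optimization

$$\min_{W}\ \frac{1}{n}\sum_{i=1}^{n}\left\{\frac{1}{2}\|x_i - B w_i\|_2^2 + \lambda_2\, w_i^\top \Omega(\alpha)\, w_i + \lambda_1\|w_i\|_1\right\}$$

is a generalized elastic net problem. It can be transformed into a canonical LASSO problem as

$$\min_{W}\ \frac{1}{n}\sum_{i=1}^{n}\left\{\frac{1}{2}\|\tilde{x}_i - \tilde{B} w_i\|_2^2 + \lambda_1\|w_i\|_1\right\}$$

where

$$\tilde{x}_i = \begin{pmatrix} x_i \\ 0_{p \times 1} \end{pmatrix}, \qquad \tilde{B} = \begin{pmatrix} B \\ \big(2\lambda_2\,\Omega(\alpha)\big)^{1/2} \end{pmatrix},$$

and $0_{p \times 1}$ denotes a vector of p zeros. Fast methods based on iterative soft thresholding are available for efficiently solving this quadratic program.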
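The reduction to a canonical LASSO problem can be sketched for a single patch as follows. The sketch assumes the per-patch objective $\tfrac{1}{2}\|x - Bw\|_2^2 + \lambda_2 w^\top \Omega w + \lambda_1\|w\|_1$ with diagonal $\Omega$, stacks a zero block under x and the block $(2\lambda_2\Omega)^{1/2}$ under B, and then runs plain iterative soft thresholding (ISTA). The function names, toy data, and iteration count are illustrative assumptions; a production solver would typically use an accelerated variant.

```python
import numpy as np

def soft_threshold(v, t):
    """Element-wise soft-thresholding operator."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def augment(x, B, omega_diag, lam2):
    """Stack x over p zeros and B over (2*lam2*Omega)^(1/2), Omega = diag(omega_diag)."""
    p = B.shape[1]
    x_aug = np.concatenate([x, np.zeros(p)])
    B_aug = np.vstack([B, np.diag(np.sqrt(2.0 * lam2 * omega_diag))])
    return x_aug, B_aug

def ista_lasso(x, B, lam1, n_iter=500):
    """Minimize 0.5*||x - B w||^2 + lam1*||w||_1 by iterative soft thresholding."""
    step = 1.0 / np.linalg.norm(B, 2) ** 2       # 1/L with L the largest eigenvalue of B^T B
    w = np.zeros(B.shape[1])
    for _ in range(n_iter):
        grad = B.T @ (B @ w - x)
        w = soft_threshold(w - step * grad, step * lam1)
    return w

def elastic_objective(w, x, B, omega_diag, lam1, lam2):
    return (0.5 * np.sum((x - B @ w) ** 2)
            + lam2 * np.sum(omega_diag * w ** 2)
            + lam1 * np.sum(np.abs(w)))

def lasso_objective(w, x_aug, B_aug, lam1):
    return 0.5 * np.sum((x_aug - B_aug @ w) ** 2) + lam1 * np.sum(np.abs(w))

# Toy problem (assumed sizes and parameters).
rng = np.random.default_rng(0)
d, p = 12, 6
x = rng.standard_normal(d)
B = rng.standard_normal((d, p))
omega_diag = 1.0 / (rng.random(p) + 0.5)
lam1, lam2 = 0.1, 0.5

x_aug, B_aug = augment(x, B, omega_diag, lam2)
w_hat = ista_lasso(x_aug, B_aug, lam1)
```

Because the augmented and original objectives agree for every w, any off-the-shelf LASSO solver can be substituted for `ista_lasso` without changing the minimizer.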
The optimization problem for updating the set-level representation α with W fixed is

$$\min_{\alpha \succeq 0}\ \lambda_2 \operatorname{tr}\big(S(W)\,\Omega(\alpha)\big) + \gamma\|\alpha\|_1.$$

Again, the method transforms it into another formulation in order to take advantage of well-developed LASSO solvers,

$$\min_{\sigma \succeq 0,\ \alpha \succeq 0}\ \lambda_2 \operatorname{tr}\big(S(W)\,\Sigma^{-1}\big) + \lambda_3\left[\|\sigma - \Phi\alpha\|_2^2 + \lambda_4\|\alpha\|_1\right]$$

where Σ is a diagonal matrix with $\operatorname{diag}(\Sigma) = \sigma$ and $\lambda_4 = \gamma/\lambda_3$. This optimization is jointly convex with respect to both σ and α. As $\lambda_3 \to \infty$, this formulation is equivalent to the original one. In the implementation, $\lambda_3$ is a very large number.
An alternating minimization procedure is used, which alternates between the updates of σ and α. For fixed α, the optimization for each element of σ can be done independently, which implies that it is very fast to solve these one-dimensional optimization problems. On the other hand, the optimization for α is a standard nonnegative LASSO problem, which can also be efficiently solved.
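The alternation between σ and α can be sketched as below. The sketch assumes a relaxed objective of the form $\lambda_2 \sum_l S_{ll}/\sigma_l + \lambda_3\|\sigma - \Phi\alpha\|_2^2 + \gamma\|\alpha\|_1$ with $\sigma > 0$ and $\alpha \ge 0$; that exact form, the bisection solver for the one-dimensional σ problems, the projected soft-thresholding solver for α, and all parameter values are assumptions chosen for illustration, not the source's implementation.

```python
import numpy as np

def update_sigma(s_diag, c, lam2, lam3, n_bisect=80):
    """Element-wise minimizer of lam2*s_l/sigma_l + lam3*(sigma_l - c_l)^2 over sigma_l > 0."""
    a = lam2 * s_diag
    lo = c.copy()                                   # derivative is <= 0 at sigma = c
    hi = c + np.cbrt(a / (2.0 * lam3))              # derivative is >= 0 at this point
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        g = 2.0 * lam3 * mid ** 2 * (mid - c) - a   # same sign as the derivative for mid > 0
        hi = np.where(g >= 0, mid, hi)
        lo = np.where(g < 0, mid, lo)
    return np.maximum(hi, 1e-12)

def update_alpha(sigma, Phi, alpha, lam4, n_iter=200):
    """Nonnegative LASSO: min_{alpha >= 0} ||sigma - Phi alpha||^2 + lam4 * sum(alpha)."""
    step = 1.0 / (2.0 * np.linalg.norm(Phi, 2) ** 2)
    for _ in range(n_iter):
        grad = 2.0 * Phi.T @ (Phi @ alpha - sigma)
        alpha = np.maximum(alpha - step * (grad + lam4), 0.0)   # prox of lam4*sum + (alpha >= 0)
    return alpha

def objective(s_diag, sigma, Phi, alpha, lam2, lam3, gamma):
    return (lam2 * np.sum(s_diag / sigma)
            + lam3 * np.sum((sigma - Phi @ alpha) ** 2)
            + gamma * np.sum(alpha))

# Toy problem (assumed sizes and parameters; lam3 is kept moderate for readability).
rng = np.random.default_rng(0)
p, q, n = 6, 3, 30
W = rng.standard_normal((p, n))
s_diag = (W ** 2).mean(axis=1)                      # diagonal of S(W)
Phi = rng.random((p, q))
lam2, lam3, gamma = 0.5, 10.0, 0.1
lam4 = gamma / lam3

alpha = np.ones(q)
sigma = Phi @ alpha
history = [objective(s_diag, sigma, Phi, alpha, lam2, lam3, gamma)]
for _ in range(10):
    sigma = update_sigma(s_diag, Phi @ alpha, lam2, lam3)
    alpha = update_alpha(sigma, Phi, alpha, lam4)
    history.append(objective(s_diag, sigma, Phi, alpha, lam2, lam3, gamma))
```

Each σ update solves p independent scalar problems, and each α update is a descent step sequence on a convex problem, so the recorded objective values in `history` should not increase.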
Effective image coding requires high-quality codebooks B and Φ. Next, methods to learn the codebooks to capture the structural information of data are discussed.
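In one embodiment, let $\mathcal{X} = (X_1, \ldots, X_m)$ be m image patch sets, obtained from local regions of training images. The formulation of codebook learning aims at solving the following optimization problem:

$$\min_{B,\ \Phi}\ \frac{1}{m}\sum_{j=1}^{m}\ \min_{W_j,\ \alpha_j,\ \sigma_j}\ L(W_j, \alpha_j, \sigma_j;\ X_j, B, \Phi) \quad \text{subject to}\quad \|b_i\|_2 \le 1,\ \ \Phi \succeq 0,\ \ \textstyle\sum_{l}\phi_{kl} = 1,$$

where $b_i$ denotes the ith column of B and

$$L(W_j, \alpha_j, \sigma_j;\ X_j, B, \Phi) = \frac{1}{2n}\|X_j - BW_j\|_F^2 + \frac{\lambda_1}{n}\|W_j\|_1 + \lambda_2 \operatorname{tr}\big(S(W_j)\,\Sigma_j^{-1}\big) + \lambda_3\left[\|\sigma_j - \Phi\alpha_j\|_2^2 + \lambda_4\|\alpha_j\|_1\right],$$

where $\Sigma_j$ is a diagonal matrix and $\operatorname{diag}(\Sigma_j) = \sigma_j$. The objective function is the same as the one in the coding phase if the codebooks are given. One important feature of the above formulation is that the set-level dictionary Φ is required to be nonnegative.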
The optimization problem can be solved by iteratively alternating the following two steps: 1) given the codebooks B and Φ, compute the optimal coding using the methods described above; 2) given the new coding, re-optimize the codebooks. One implementation of Step 2) allows B and Φ to be optimized independently.
For solving B, the optimization problems can be solved via their dual formulation, which become a convex optimization with solely nonnegative constraints. A projected Newton method can efficiently solve the resulting optimization. The projected Newton method has superlinear convergence rate under fairly mild conditions.
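Optimizing Φ is more interesting due to the extra nonnegative constraint on its elements. Fortunately, the optimization is still convex. A projected gradient method is used for solving the optimization problem. For the projected gradient, each iteration step consists of two sub-steps. First, each column $\phi_k$ of Φ goes one step along the negative gradient direction

$$(\phi_k)^{1/2} = \phi_k - \eta\,\nabla_{\phi_k}$$

where $\nabla_{\phi_k}$ is the gradient of the objective with respect to $\phi_k$ and η is the step size. Second, $(\phi_k)^{1/2}$ is projected back onto the feasible set by solving

$$\min_{\phi_k}\ \|\phi_k - (\phi_k)^{1/2}\|_2^2 \quad \text{subject to}\quad \sum_{l=1}^{p}\phi_{kl} = 1,\ \ \phi_{kl} \ge 0,$$

where $\phi_{kl}$ is the lth element of $\phi_k$. This optimization projects $(\phi_k)^{1/2}$ onto a probabilistic simplex, and it can be solved very efficiently.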
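One efficient way to carry out the simplex projection is the standard sort-based algorithm, sketched below together with a single projected gradient step for one column. The source does not state which projection algorithm is used, so the sort-based routine is an assumption, as are the function names and the example gradient and step size.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {u : u >= 0, sum(u) = 1} using one sort."""
    u = np.sort(v)[::-1]                          # entries in decreasing order
    css = np.cumsum(u) - 1.0
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]      # last index kept strictly positive
    theta = css[rho] / (rho + 1.0)                # common shift applied to all entries
    return np.maximum(v - theta, 0.0)

def projected_gradient_step(phi_k, grad_k, eta):
    """Gradient sub-step followed by projection back onto the simplex."""
    half = phi_k - eta * grad_k                   # the intermediate point (phi_k)^{1/2}
    return project_to_simplex(half)

example = project_to_simplex(np.array([0.5, 0.8, -0.2]))     # -> [0.35, 0.65, 0.0]

phi_k = np.array([0.2, 0.3, 0.5])
grad_k = np.array([2.0, -1.0, 0.5])               # hypothetical gradient values
phi_new = projected_gradient_step(phi_k, grad_k, eta=0.1)    # -> [0.05, 0.45, 0.5]
```

The projection costs one sort per column, so the per-iteration cost of updating Φ is dominated by computing the gradient.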
The hierarchical sparse coding is readily applicable to learning image representations for classification. As revealed by the data encoding procedure discussed above, the whole model operates on a set X of image patches in a local region, first nonlinearly mapping each x from the region to its sparse code w, and then (implicitly) pooling the codes of the set to obtain Σ, which is akin to the sample (diagonal) covariance of the sparse codes in that region and corresponds to a form of “energy pooling”. In the next level, the model encodes Σ nonlinearly to obtain the sparse code α for the set X. The encoding procedure is implemented by solving a jointly convex optimization problem.
Next, the modeling of spatial dependence is discussed. A slight modification leads to a more general formulation, in which Σ acts as the sample covariance not for only one region, but for several neighboring regions jointly. The learned bases Φ will then capture the spatial dependence among several regions. Without loss of generality, consider a joint model for 2×2 local regions: if each region contains n patches, let X and W denote all the 4n patches and their first-layer codes in these 4 regions. Then $L(W,\alpha)$ in (2) is modified as

$$L(W,\alpha) = \frac{1}{8n}\|X - BW\|_F^2 + \frac{\lambda_2}{4}\sum_{s=1}^{2}\sum_{t=1}^{2} \operatorname{tr}\Big(S\big(W^{(s,t)}\big)\,\Omega^{(s,t)}(\alpha)\Big)$$

where $W^{(s,t)}$ collects the first-layer codes of the patches in the (s,t)-th region and

$$\Omega^{(s,t)}(\alpha) = \left(\sum_{k=1}^{q} \alpha_k \operatorname{diag}\big(\phi_k^{(s,t)}\big)\right)^{-1}$$

is the inverse diagonal covariance for the (s,t)-th region, s=1,2, t=1,2. In this model, each local descriptor has its own first-level coding, while the 2×2 regions share the joint second-layer coding α. Each basis $\phi_k = [\phi_k^{(1,1)}, \phi_k^{(1,2)}, \phi_k^{(2,1)}, \phi_k^{(2,2)}] \in \mathbb{R}^{p \times 4}$ describes a spatial co-occurrence pattern across 2×2 regions.
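Hierarchical convolution coding is discussed next. This improvement convolves the above joint model over an image. Again, without loss of generality, let the image be partitioned into 4×4 regions, indexed by (s,t); convolving the two-layer hierarchical coding over every 2×2 region neighborhood then leads to 3×3 coding results $\alpha^{(u,v)} \in \mathbb{R}^q$, where u=1,2,3 and v=1,2,3. Here each (u,v) indexes a “receptive field” of the hierarchical coding. X and W denote all the patches and their first-layer codes in the image, and α denotes the collection of all $\alpha^{(u,v)}$. Then $L(W,\alpha)$ in (2) is modified, with normalization constants absorbed into $\lambda_2$, as

$$L(W,\alpha) = \frac{1}{2N}\|X - BW\|_F^2 + \lambda_2 \sum_{s,t}\sum_{u,v} \psi\big(W^{(s,t)}, \alpha^{(u,v)}\big)$$

where N is the total number of patches in the image and $\psi(W^{(s,t)}, \alpha^{(u,v)})$ is defined to be zero if the (s,t)-region is not in the (u,v) receptive field, otherwise

$$\psi\big(W^{(s,t)}, \alpha^{(u,v)}\big) = \operatorname{tr}\Big(S\big(W^{(s,t)}\big)\,\Omega^{(s,t)}\big(\alpha^{(u,v)}\big)\Big)$$

where

$$\Omega^{(s,t)}\big(\alpha^{(u,v)}\big) = \left(\sum_{k=1}^{q} \alpha_k^{(u,v)} \operatorname{diag}\big(\phi_k^{(r(s,t,u,v))}\big)\right)^{-1}.$$

Here, r(s,t,u,v) indexes the relative position of the (s,t) region in the (u,v) receptive field. The coding method and codebook learning method are basically the same as those described in the previous section.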
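The indexing of receptive fields and relative positions can be made concrete with a short enumeration. The sketch below uses the 4×4 region grid and 2×2 window from the example above with 1-based indices; the function names and the choice to express the relative position as a (row, column) pair are assumptions for illustration.

```python
def receptive_fields(grid=4, window=2):
    """Map each receptive field (u, v) to the regions (s, t) it covers.

    The value stored for each region is its relative position inside the
    receptive field, playing the role of r(s, t, u, v). Indices are 1-based.
    """
    fields = {}
    for u in range(1, grid - window + 2):
        for v in range(1, grid - window + 2):
            fields[(u, v)] = {
                (s, t): (s - u + 1, t - v + 1)
                for s in range(u, u + window)
                for t in range(v, v + window)
            }
    return fields

def fields_covering(region, fields):
    """List the receptive fields whose window contains the given region."""
    return [uv for uv, regions in fields.items() if region in regions]

fields = receptive_fields()
num_fields = len(fields)                          # 3 x 3 = 9 receptive fields
corner_cover = fields_covering((1, 1), fields)    # a corner region lies in one field
center_cover = fields_covering((2, 2), fields)    # an interior region lies in four fields
```

A region that falls in several receptive fields contributes one penalty term per field, each using the codeword block that matches its relative position in that field.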
Next, Image Representation is discussed. The system samples image patches densely at a grid of locations. One embodiment partitions the patches into different non-overlapping regions based on their spatial locations, and then treats each window of several regions as a receptive field. For example, a typical setting can be
Each receptive field gives rise to a q-dimensional second-layer code vector. The system pools the second-layer code vectors by using max pooling. In order to obtain better shift and scale invariance, the system partitions each image at different scales, for example, into 1×1 and 2×2 blocks, and pools the second-layer codes within each block. The system concatenates the block-wise results to form the image representation.
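The multi-scale max pooling step can be sketched as follows, assuming the second-layer codes have already been arranged on a regular grid of receptive fields with at least as many rows and columns as the finest partition. The array layout (H, W, q), the function name, and the toy input are assumptions for the example.

```python
import numpy as np

def pyramid_max_pool(codes, levels=(1, 2)):
    """Max-pool second-layer codes within g x g blocks for each g in levels.

    codes has shape (H, W, q): one q-dimensional code per receptive field.
    Returns the concatenation of the block-wise maxima over all levels.
    """
    H, W, q = codes.shape
    pooled = []
    for g in levels:
        row_edges = np.linspace(0, H, g + 1).astype(int)
        col_edges = np.linspace(0, W, g + 1).astype(int)
        for i in range(g):
            for j in range(g):
                block = codes[row_edges[i]:row_edges[i + 1],
                              col_edges[j]:col_edges[j + 1]]
                pooled.append(block.reshape(-1, q).max(axis=0))
    return np.concatenate(pooled)

# Toy input (assumed): a 4 x 4 grid of receptive fields with q = 3.
codes = np.arange(48, dtype=float).reshape(4, 4, 3)
feature = pyramid_max_pool(codes)                 # length (1 + 4) * 3 = 15
```

With 1×1 and 2×2 partitions the image representation has 5q dimensions, and adding finer partitions grows it by q per additional block.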
Although a two-layer model is described, multi-layer models can be used for “deep learning”. Such systems can learn a stack of sparse coding models.
The architecture of two-layer convolution coding has an interesting analogy to sparse coding on a SIFT feature vector. For each SIFT descriptor, its receptive field contains 4×4 smaller non-overlapping regions; within each region, responses of an 8-dimensional coding, corresponding to a histogram of 8 orientations, are pooled together. A SIFT descriptor then results from concatenating the 4×4 pooling results, yielding a 4×4×8 = 128-dimensional vector. Sparse coding is then the second-layer coding applied on top of SIFT. Sparse coding on SIFT leads to state-of-the-art results on a number of image classification benchmarks. The method presented here follows a similar processing architecture, but is a fully automatic approach that learns features from raw pixels.
The hierarchical sparse coding produces image representations that improve accuracy on the MNIST hand-written digit recognition problem and the Caltech101 object recognition benchmark. The Caltech101 result is the first time such a result has been achieved by automatically learning features from the pixel level, rather than by using hand-designed descriptors. The results show that automatically learning features from image pixels is accurate and computationally efficient.
The invention may be implemented in hardware, firmware or software, or a combination of the three. Preferably the invention is implemented in a computer program executed on a programmable computer having a processor, a data storage system, volatile and non-volatile memory and/or storage elements, at least one input device and at least one output device.
By way of example, a block diagram of a computer to support the system is discussed next in
Each computer program is tangibly stored in a machine-readable storage media or device (e.g., program memory or magnetic disk) readable by a general or special purpose programmable computer, for configuring and controlling operation of a computer when the storage media or device is read by the computer to perform the procedures described herein. The inventive system may also be considered to be embodied in a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.
The invention has been described herein in considerable detail in order to comply with the patent Statutes and to provide those skilled in the art with the information needed to apply the novel principles and to construct and use such specialized components as are required. However, it is to be understood that the invention can be carried out by specifically different equipment and devices, and that various modifications, both as to the equipment details and operating procedures, can be accomplished without departing from the scope of the invention itself.
This application claims priority to provisional application Ser. No. 61/350,653 filed on Jun. 2, 2010, the content of which is incorporated by reference.
Number | Date | Country
---|---|---
61350653 | Jun 2010 | US