This invention relates generally to learning dictionaries, and more particularly to learning the dictionaries in an online setting for coding images.
Sparse Coding
Sparse coding represents data vectors with a sparse linear combination of atoms from a possibly overcomplete dictionary. It is a powerful tool in data representation and has been shown to consistently outperform conventional vector quantization methods.
Sparse coding represents data as a sparse linear combination of some predefined atoms, which can be posed as the following optimization problem:

min_A { ‖X − BA‖_F² + λ_S ‖A‖_p },   (1.1)

where X is a matrix with data points as columns, B is a known, fixed and usually overcomplete set of bases, A is the set of coefficients such that X ≈ BA, and ‖·‖_F is the Frobenius norm. λ_S is a regularization parameter. The regularization term promotes sparsity in the coefficients of A, in that 0 ≤ p ≤ 1, with ‖A‖_p defined entrywise as

‖A‖_p = ( Σ_{i,j} |a_ij|^p )^(1/p).

When p = 1, the regularization term, and consequently the entire equation (1.1), is convex.
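As an illustration only (not part of the claimed method), the following is a minimal numpy sketch of solving problem (1.1) in the convex case p = 1 by iterative soft-thresholding (ISTA); the 1/2 scaling of the quadratic term, the step size, and the function names are our own choices.

    import numpy as np

    def soft_threshold(Z, tau):
        # Element-wise soft-thresholding: the proximal operator of the l1 norm.
        return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

    def sparse_code_ista(X, B, lam, n_iter=200):
        # Approximately solve min_A 0.5*||X - B A||_F^2 + lam*||A||_1 (p = 1 case of (1.1)).
        A = np.zeros((B.shape[1], X.shape[1]))
        L = np.linalg.norm(B, 2) ** 2 + 1e-12   # Lipschitz constant of the gradient
        for _ in range(n_iter):
            grad = B.T @ (B @ A - X)            # gradient of the quadratic data term
            A = soft_threshold(A - grad / L, lam / L)
        return A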
Note “sparsity” is a term of art in numerical analysis, and not a relative term. Conceptually, sparsity corresponds to data that are loosely coupled.
Dictionary Learning
Dictionary learning is often used in sparse coding applications because it offers more accurate, robust and data-dependent representations when compared to conventional sparsifying dictionaries, such as discrete cosine transforms (DCT) and wavelets. Dictionary learning for sparse coding is a powerful tool in many low level image processing applications, such as denoising, inpainting, and demosaicing. Dictionary learning finds a dictionary D̂ such that

D̂ = arg min_{D,A} { ‖X − DA‖_F² + λ_S ‖A‖_p }.   (1.2)
Dictionary learning determines a sparse set of coefficients A, while optimizing the bases in D to better represent the available data. The function to be minimized in equation (1.2) is not jointly convex in A and D, but is convex in one variable while keeping the other variable fixed. Hence, dictionary learning typically alternates between a sparse coding stage using greedy or convex methods, and a dictionary update stage.
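For illustration only, a batch sketch of this alternation, reusing the sparse_code_ista sketch above; the dictionary update shown here is a simple least-squares (MOD-style) fit with re-normalization, chosen for brevity, and is not the update used by the method described later.

    import numpy as np

    def learn_dictionary(X, k, lam, n_alt=30, seed=0):
        # Alternate a sparse coding stage (D fixed) and a dictionary update
        # stage (A fixed), as in equation (1.2).
        rng = np.random.default_rng(seed)
        D = rng.standard_normal((X.shape[0], k))
        D /= np.linalg.norm(D, axis=0, keepdims=True)       # unit-norm atoms
        A = None
        for _ in range(n_alt):
            A = sparse_code_ista(X, D, lam)                  # sparse coding stage
            D = X @ A.T @ np.linalg.pinv(A @ A.T)            # least-squares update of D
            D /= np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1e-12)
        return D, A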
Dictionaries to be learned for this purpose are typically overcomplete. This means that the dictionaries have a large number of columns compared to the ambient dimension. In other words, the dictionaries are usually “fat” matrices. Henceforth, when we refer to the “size” of the dictionary, we mean the number of atoms in the dictionary, or equivalently the number of columns.
Dictionary learning often has a computational complexity of O(k²m + km + ks²) ≈ O(k²), where k is the dictionary size and m and s are the dimension of the data and the sparsity of the coefficients (in the sparse coding stage), respectively. A sample complexity of dictionary learning is O(√k).
The size of the dictionary has a strong impact on the speed of the method, both in terms of computational and sample complexity. However, the size of the dictionary is typically selected and fixed before learning. Thus, a tradeoff has to be made here. A larger dictionary slows the learning method, but provides a highly sparse and redundant representation of the data, and a better fit. A smaller dictionary, on the other hand, does not provide as good a fit to the data, but speeds up the learning method.
The goal is to efficiently learn a dictionary that is “optimal” in size, so that the dictionary provides a sparse representation of the data, and it is not too large to burden computational resources. Also, it is desired to learn a small dictionary that provides a very good fit to the data.
Online Dictionary Learning
This problem has been addressed in the prior art. In an enhanced K-singular value decomposition (KSVD) method, agglomerative clustering discards similar atoms and atoms that are seldom used. While “pruning” of the dictionary might be helpful in some situations, it cannot be directly applied in an online setting, where the need for computational gains is the most.
In particular, in an online setting, the method might prune elements of the dictionary, which might become more important at a later stage. This is a possibility because learning is data dependent, and one cannot make too many accurate predictions about data not yet processed in the online setting.
One method predicts the optimum dictionary size for an Orthogonal Matching Pursuit (OMP), using notions of sphere covering. There, the atoms of the dictionary are assumed to be a subset of a known set of vectors, which is not the case in the usual learning scenario. Certain restrictive conditions on the dictionary are assumed, and it is not clear if they hold in general.
Along the same lines, a dictionary of appropriate size can be learned by selecting from an existing set of potential atoms. The relationship between a reconstruction accuracy ‖X − X̂‖_F² and sparsity is a direct function of the dictionary size.
Some dictionary atoms are unused and can be replaced by the data points themselves. However, this implies that there is no reduction in the size of the dictionary. The dictionary can be pruned by discarding atoms whose norms vanish, and a regularizer can also be included in the objective function to construct dictionaries of a smaller size.
The embodiments of the invention provide a method for learning a dictionary of atoms, online, as test samples are acquired. The resulting dictionary can be used by a number of low level image processing applications, such as denoising and inpainting, as well as sparse coding and representation of images.
While there has been extensive work on the development of dictionary learning methods to perform the above applications, the problem of selecting an appropriate dictionary size is difficult.
The embodiments of the invention provide a clustering based method that reduces a size of the dictionary while learning the dictionary as data samples are processed. The method learns the dictionary in an online setting, by synthesizing new atoms from atoms already in the dictionary.
The method performs as well as the prior art online dictionary learning methods, in terms of representation and reconstruction of images, while achieving considerable speedup in training times. In addition, the method learns a smaller and more representative dictionary. As an added advantage, the learned dictionary is more incoherent when compared with coherent prior art dictionaries.
For comparison purposes, the conventional online dictionary learning method, without clustering and merging, is shown in italics in steps 110, 120, 130, and 140 of the figure.
The dictionary D, as well as “history” matrices A, B are initialized 103-104. The dictionary and matrices are described in much greater detail below.
Samples are acquired or selected 110. From the selected samples, reconstruction coefficients are computed 120, and used to update 130 matrices A and B.
The iteration count t1 is compared 135 to a minimum number of iterations (minIters) to determine whether the minimum is exceeded. If not, the dictionary is updated 140. During the minimum number of iterations, an initial dictionary is learned from the training samples.
A count t2 is tested 150 against a termination condition, a maximum number of iterations (maxIters). If the condition is met, the dictionary 101 is output 160; otherwise, the next iteration proceeds at step 110.
If the result in step 135 is yes, then a density is estimated 200, and atoms in the dictionary are merged 175, as described below.
If 180 any atoms change, then the matrices A and B are reset 185, the kernel size is updated, and the method continues at step 140; otherwise, it goes directly to step 140.
The steps, described in greater detail below, can be performed in a processor connected to memory and input/output interfaces as known in the art.
Notation
We use the following notation. We let x_i ∈ R^n be the ith sample of the training data X ∈ R^(n×p) in the samples 102. The dictionary 101 is denoted by D, where R are the real numbers.
We start with an initial dictionary of size k0, so that initially D_0 ∈ R^(n×k0).
After resizing, the dictionary is denoted by D̄.
Superscripts di, ai indicate columns of the corresponding matrices denoted by capital letters D, A, B, etc.
As used herein, the raw data, acquired, received, or processed sequentially, are called data points x, and the data points are represented in the dictionary by atoms d.
The COLD Method
We outline our method, which employs clustering within the online dictionary learning framework.
Clustering and Merging
The clustering step applies a density estimator to determine clusters of the atoms in the current dictionary. For each cluster that contains more than one atom, we merge the multiple atoms of the cluster into a single atom.
In other words, after the clustering determines clusters of the atoms in the dictionary using a density estimator, we apply merging to combine atoms in a cluster into a single atom when there are multiple atoms in the cluster, using for example, the mean or mode of the atoms in the cluster.
We call our method COLD, Clustering based Online Learning of Dictionaries, implying both density estimation and merging, i.e., clustering (170)=density estimator (200)+merging (175).
The Density Estimator Method
For clustering, we use a density estimator procedure 200 as shown in
The procedure computes a kernel density estimate over the dictionary atoms, and assigns, to each data point, the mode of the kernel density estimate nearest in Euclidean distance to that point. The clustering is done on the empirical distribution of the dictionary atoms, and not on the data itself.
We briefly describe the density estimator based clustering method, and then outline our method, after providing an intuitive description of benefits.
The density estimator 200 method offers a non-parametric way to cluster data by associating with each data point x a corresponding mode of the kernel density estimate p(x) of the data. Specifically, given a kernel K(·) of size h, the d-dimensional density estimate p(x) 210 at x, given n data x_i, is

p(x) = (1/(n h^d)) Σ_{i=1..n} K((x − x_i)/h),

where h is the kernel size of the kernel K(·).
The kernel defines a space of influence, for example, an n-dimensional hyper-sphere with a radius h, or a hyper-cube with sides of length 2h. The kernel weights data within the space of influence according to their distances from the kernel center, using, e.g., a Gaussian function. The kernel thus determines the weight of nearby data for the density estimation. We use radially symmetric kernels, i.e., K(z) = c k(‖z‖²), where c is a normalization constant depending on the kernel used. For such kernels, the gradient ∇p(x) 220 of the density estimate p(x) is

∇p(x) = (2c/(n h^(d+2))) Σ_{i=1..n} (x_i − x) g(‖(x − x_i)/h‖²),

where g(·) is the negative derivative g(z) = −k′(z) of the kernel profile k(·). The shift vector v_x 230 at data x is

v_x = [ Σ_{i=1..n} x_i g(‖(x − x_i)/h‖²) ] / [ Σ_{i=1..n} g(‖(x − x_i)/h‖²) ] − x.   (2.2)
The shift vector v_x always points in the direction of the maximum increase of the density estimate.
The density estimator procedure alternates between two steps: determining the density estimate 210 and its gradient 220, and shifting the point along the resulting shift vector 230.
Thus, by successive computation of the shift vector and shifting of the data point along the vector, the point is guaranteed to converge to a location where the gradient of the density estimate is zero. The density estimator procedure is therefore a steepest ascent over the kernel density estimate.
For each atom, we initially center the kernel on the current atom and compute the shift vector 230 according to equation (2.2), using the kernel function and the atoms within the kernel window. We then translate the kernel from its initial location according to the shift vector. We repeat translating the kernel window until the translation becomes small or a maximum number of iterations is reached; this gives the final location of the kernel.
For each atom, we assign the final kernel position as the mode of the corresponding atom. As a post-process, we combine nearby modes to determine a cluster center. We then assign the atoms of the combined modes to the same cluster center.
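A minimal sketch of this per-atom procedure, assuming a Gaussian kernel; the merge tolerance merge_tol and the choice of the cluster mean as the merged atom are illustrative assumptions rather than requirements of the method.

    import numpy as np

    def mean_shift_modes(D, h, n_steps=100, tol=1e-6):
        # Run a kernel-density (mean-shift) ascent for each atom (column of D)
        # over the empirical distribution of the atoms themselves, using a
        # Gaussian kernel of size h.  Returns one mode per atom.
        modes = D.copy().astype(float)
        for j in range(D.shape[1]):
            x = modes[:, j]
            for _ in range(n_steps):
                w = np.exp(-np.sum((D - x[:, None]) ** 2, axis=0) / (2.0 * h ** 2))
                shift = (D @ w) / w.sum() - x        # shift vector v_x of equation (2.2)
                x = x + shift
                if np.linalg.norm(shift) < tol:
                    break
            modes[:, j] = x
        return modes

    def merge_atoms(D, h, merge_tol=None):
        # Cluster atoms whose modes (nearly) coincide and replace each cluster
        # by a single atom (the mean of its members), re-projected to unit norm.
        if merge_tol is None:
            merge_tol = h / 2.0                      # heuristic choice, not from the method
        modes = mean_shift_modes(D, h)
        centers, assignment = [], []
        for j in range(D.shape[1]):
            for c, center in enumerate(centers):
                if np.linalg.norm(modes[:, j] - center) < merge_tol:
                    assignment.append(c)
                    break
            else:
                centers.append(modes[:, j])
                assignment.append(len(centers) - 1)
        assignment = np.array(assignment)
        merged = []
        for c in range(len(centers)):
            atom = D[:, assignment == c].mean(axis=1)   # merge by the cluster mean
            merged.append(atom / max(np.linalg.norm(atom), 1e-12))
        return np.stack(merged, axis=1), assignment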
Intuition for Our Method
We provide the following intuition as to why the clustering of dictionary atoms learns smaller dictionaries without loss in accuracy.
Consider the data X 102. For simplicity, we assume the data to lie on a unit sphere in three dimensions, and the data form three clusters. The data are assumed to be pixel intensities from an image having a background (higher variance) and two objects. Though simplistic, this allows us to clearly describe the notion of “optimal dictionary size.”
We start with an initial dictionary of k0 atoms distributed randomly over the unit sphere. After training, the atoms of the dictionary align themselves according to the data. After alignment, some atoms are clustered in pairs or triplets; a smaller (but still overcomplete) dictionary would prevent this clumping.
When two or more atoms are similar (very close to each other), it is highly unlikely that more than one of the atoms is used to represent a data point simultaneously due to the sparsity constraint on the representation coefficients. In other words, when one of the atoms is selected (representation coefficient is non-zero), then, with a high likelihood, the other atoms are not. Therefore, only one atom in a cluster of atoms can be used to represent all of the data points in the cluster. The idea is that during the learning process, whenever dictionary atoms get “too close” to each other, i.e., appear similar, we can merge these atoms.
Cold
We first give a brief overview of our dictionary learning method, followed by a detailed description. Dictionary learning involves alternating between sparse coding and dictionary update steps. Because our method operates online, we process data points sequentially, and do not know new data points in advance.
Hence, in the sparse coding step in equation (1.1), we have a single data point x_t and not the matrix X, and we compute 120 the corresponding reconstruction coefficients α_t:

α_t = arg min_α { ‖x_t − D_{t−1} α‖_2² + λ_S ‖α‖_1 }.
To update the dictionary for known α_t's, we obtain the solution for:

D_t = arg min_D (1/t) Σ_{i=1..t} { ‖x_i − D α_i‖_2² + λ_S ‖α_i‖_1 }.
The coefficients are stored in “history” matrices A and B. The matrix A stores the sum of outer products between the sparse coefficients, while the matrix B does the same for the data points and the sparse coefficients of each point, i.e., A = Σ_{i=1..t} α_i α_i^T and B = Σ_{i=1..t} x_i α_i^T. The solution for the above equation is obtained column by column using the matrices A and B. The jth columns of the matrices are a_j and b_j, respectively, and

u_j = (1/A_jj)(b_j − D a_j) + d_j.

The dictionary atoms are then updated as:

d_j ← u_j / ‖u_j‖_2.
We restrict the dictionary atoms to lie on the surface of the unit Euclidean sphere, and not inside it. This prevents the atom norms from becoming zero, and enables merging of the atoms at a later stage. Of course, allowing the norms of atoms to reduce to zero, and then discarding those atoms, is another method of dictionary size reduction.
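A numpy sketch of this column-wise dictionary update, using the columns of A and B as described above; each atom is renormalized onto the unit sphere as stated, and the guard eps for unused atoms is our own assumption.

    import numpy as np

    def dictionary_update(D, A, B, n_passes=1, eps=1e-12):
        # Block-coordinate update of the atoms using the history matrices
        # A = sum_t alpha_t alpha_t^T and B = sum_t x_t alpha_t^T.
        D = D.copy()
        for _ in range(n_passes):
            for j in range(D.shape[1]):
                if A[j, j] < eps:
                    continue                                  # atom never used; leave it unchanged
                u = (B[:, j] - D @ A[:, j]) / A[j, j] + D[:, j]
                D[:, j] = u / max(np.linalg.norm(u), eps)     # project back onto the unit sphere
        return D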
Inputs:
x ∈ R^n
Initialize 103-104:
A ∈ R^(k0×k0) ← 0, B ∈ R^(n×k0) ← 0
Check 150
t2 ≤ maxIters
Select 110: Draw x_t from p(x)
Compute 120 the sparse coefficients α_t
Update 130
A ← A + α_t α_t^T
B ← B + x_t α_t^T
Check 135
t1 ≥ minIters.
Estimate 200 Density
Dictionary Changed 180
Update
A ← 0, B ← 0
Compute D_t by KSVD, with the matrices A and B
Output 160: Dt 101.
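Putting the steps of method Cold-I together, the following is an illustrative numpy sketch that reuses the sparse_code_ista, merge_atoms, and dictionary_update sketches given above; the flow-chart step numbers appear as comments, and the data_stream iterator, the fixed kernel size h, and all default values are our own assumptions.

    import numpy as np

    def cold_i(data_stream, D0, lam, h, min_iters, max_iters):
        # Online loop: sparse-code each incoming sample, accumulate the history
        # matrices A and B, cluster/merge the atoms after the warm-up, and
        # update the (possibly resized) dictionary.
        D = D0.copy()
        A = np.zeros((D.shape[1], D.shape[1]))
        B = np.zeros((D.shape[0], D.shape[1]))
        for t in range(1, max_iters + 1):                        # Check 150: t <= maxIters
            x = next(data_stream)                                # Select 110: draw x_t from p(x)
            alpha = sparse_code_ista(x[:, None], D, lam)[:, 0]   # Compute 120
            A += np.outer(alpha, alpha)                          # Update 130
            B += np.outer(x, alpha)
            if t >= min_iters:                                   # Check 135
                D_new, _ = merge_atoms(D, h)                     # Estimate 200 density, merge 175
                if D_new.shape[1] != D.shape[1]:                 # Dictionary changed 180
                    D = D_new                                    # Reset 185: delete the history
                    A = np.zeros((D.shape[1], D.shape[1]))
                    B = np.zeros((D.shape[0], D.shape[1]))
            D = dictionary_update(D, A, B)                       # Update dictionary 140
        return D                                                 # Output 160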
We see that method Cold-I uses density estimator clustering. The density estimator clustering of the dictionary atoms can be done at any stage before the sparse coding, or before the dictionary update.
We do not apply the density estimator until a minimum number of iterations (minIters) 135 have passed. During the initial minimum iterations, samples are selected from the set of samples 102 and used to learn an initial dictionary. This is because, in the online case, we need to wait until the dictionary learning procedure has adapted to a sufficient amount of data to construct the initial dictionary, before we modify the dictionary.
One can think of the maximum number of iterations as the termination condition: we update the dictionary until the termination condition is reached.
In a degenerate case, after the first iteration all the dictionary atoms are perfectly aligned, so that the density estimator procedure results in a dictionary of size 1. To prevent this, we wait for minIters iterations during dictionary training. In most cases, waiting for k0 iterations, where k0 is the initial dictionary size, before performing the density estimator procedure constitutes a sufficient waiting time.
After the clustering, i.e., the density estimator and the merging procedure, if the dictionary is changed and smaller, i.e., the new dictionary D̄ has fewer atoms than D, then we reset the history matrices A and B.
It makes sense to treat the procedure as if restarting the learning method with the new dictionary as an initialization. This can be seen as analogous to periodically deleting the history in the conventional online learning method. The method could be improved by not discarding the history corresponding to the atoms that are retained in the dictionary, but such bookkeeping requires more computations, possibly canceling the gains acquired by our clustering.
The method can be made faster by avoiding the density estimator procedure 200 at every iteration. The method Cold-I is the simplest possible variant. We can instead apply the density estimator after every w iterations, where w can be predetermined. That way, we only perform the density estimator (or any other clustering) procedure once every w iterations.
Performing the density estimator after every w iterations might also be more beneficial, because it has the effect of allowing the dictionary to be trained after resizing, allowing the atoms to reorient sufficiently according to the data.
Also, many fast versions of the density estimator procedure or a blurring procedure are known. Using those procedures considerably speeds up the density estimator procedure, and thus the overall method.
Another way to speed up the method is to stop the density estimator based on convergence as a termination condition. In method Cold-I, as the kernel size approaches zero (h → 0), the density estimator does not affect the dictionary, because every atom in the dictionary is a mode of the empirical distribution. Hence, continuing to perform the density estimator only adds computations. To prevent this, we can monitor the change in the dictionary after every density estimator iteration, and stop when the difference between the “new” and “old” dictionaries falls below a threshold.
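A possible sketch of this stopping test, reusing the mean_shift_modes sketch above; the threshold tol is an illustrative choice.

    import numpy as np

    def density_estimator_converged(D, h, tol=1e-3):
        # Stop clustering once the mean-shift modes essentially coincide with
        # the atoms themselves, i.e. the density estimator no longer moves D.
        modes = mean_shift_modes(D, h)
        return np.linalg.norm(D - modes) < tol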
This minimization scheme is inherently non-convex, and so convergence to a global optimum cannot be guaranteed. However, we can prove that the method converges by reducing the kernel size h sequentially, as described below. As h → 0, the density estimator stops modifying the dictionary, and the convergence behavior of the conventional online learning method applies.
We assign the kernel size as a function of the number of iterations, h = h/(t − minIters + 1). In other words, the kernel size becomes smaller with each iteration. We can also use a constant kernel size to obtain smaller dictionaries.
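A one-line sketch of this shrinking schedule; the variable names are illustrative.

    def kernel_size(h0, t, min_iters):
        # Shrink the kernel with the iteration count: h_t = h_0 / (t - minIters + 1).
        return h0 / (t - min_iters + 1)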
Method Cold-II is a faster version of COLD that uses the density estimator less frequently, uses a faster implementation of the density estimator, and uses a constant h.
Method: Fast Cold-II
Inputs:
x ∈ R^n
Initialize:
A ∈ R^(k0×k0) ← 0, B ∈ R^(n×k0) ← 0
Check
t2<maxIters
Select: Draw x_t from p(x)
Compute the sparse coefficients α_t
Update
A ← A + α_t α_t^T
B ← B + x_t α_t^T
Check
t1 ≥ minIters and mod(t, w) = 0
Estimate
Dictionary Changed
Update
A ← 0, B ← 0
Compute D_t using KSVD, with the matrices A and B
Output: Dt
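Relative to the Cold-I sketch given earlier, the only changes in Cold-II are the gating of the clustering step and the constant kernel size; a minimal sketch of the gating test follows, where the period w is an input of our choosing.

    def should_cluster(t, min_iters, w):
        # Cold-II gating: cluster only every w-th iteration after the warm-up,
        # instead of at every iteration as in Cold-I; the kernel size h stays constant.
        return t >= min_iters and t % w == 0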
Remarks
The method starts with an overcomplete initial dictionary, and subsequently decreases the number of atoms in the dictionary to result in a dictionary that has far fewer atoms than the initial dictionary.
The method replaces a cluster of atoms in the dictionary with a newly synthesized atom. The clumping of atoms, and their subsequent clustering and merging, is data dependent, as it should be.
A question arises as to whether we can then increase the dictionary size adaptively as well, depending on the data distribution.
In some cases, the data arrive in the online setting such that the available dictionary atoms are insufficient to efficiently encode the data, where the efficiency of encoding is determined according to a metric of choice. This happens, for example, if the data point in question is (nearly) orthogonal to all the atoms present in the dictionary. In such cases, we add the current data point as an atom in the dictionary, increasing its size by 1.
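A sketch of this growth rule; the orthogonality threshold ortho_tol is an illustrative parameter, not specified by the method.

    import numpy as np

    def maybe_grow_dictionary(D, A, B, x, ortho_tol=0.1):
        # If the (normalized) data point is nearly orthogonal to every atom,
        # append it as a new atom and grow the history matrices to match.
        x_unit = x / max(np.linalg.norm(x), 1e-12)
        if np.max(np.abs(D.T @ x_unit)) < ortho_tol:          # poorly encoded by all current atoms
            D = np.concatenate([D, x_unit[:, None]], axis=1)
            k = D.shape[1]
            A_new = np.zeros((k, k)); A_new[:k - 1, :k - 1] = A
            B_new = np.zeros((B.shape[0], k)); B_new[:, :k - 1] = B
            A, B = A_new, B_new
        return D, A, B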
Analysis of the Method
This embodiment of COLD learns a more incoherent dictionary, achieves speedups in computational complexity, and is more likely to converge to a local minimum as a termination condition.
Increase in Incoherence
Incoherence of the dictionary atoms plays an important role in the theoretical guarantees of sparse coding methods. An incoherent dictionary prevents overfitting the data, thus improving performance. By merging similar atoms into one atom, we promote incoherence among the remaining dictionary atoms. Note that, like sparsity, incoherence is a term of art.
Any merger of atoms after the density estimator never leads to an increase in coherence of the dictionary for the following reason.
Assume the initial coherence of the dictionary is defined as

μ(D) = max_{i≠j} |<d_i, d_j>|,

where <·,·> denotes the conventional Euclidean inner product. Suppose the maximum in the above definition occurs for some fixed i and j. Then, we have

μ(D) = |<d_i, d_j>| = |cos θ_ij|,   (i)
where θ_ij is the angle between the dictionary atoms d_i and d_j, and (i) follows from the fact that the dictionary atoms are unit normed. Note that the dictionary atoms are unit-length vectors in the n-dimensional space, i.e., the atoms are on the n-dimensional unit sphere, and the angle between two atoms indicates their dissimilarity, i.e., the larger the angle, the more dissimilar the atoms.
If the coherence μ(D) is large, then one of two things is implied. Either the angle is small, θ_ij ≈ 0, meaning the atoms are similar (close to each other), in which case atoms d_i and d_j are merged, or the angle θ_ij ≈ π, in which case the atoms are not merged. Also, the atoms are not merged if μ(D) is small, implying nearly orthogonal (or equivalently, well separated) atoms. Thus, atoms are merged only when θ_ij ≈ 0. If the coherence of the “new” dictionary is μ(D̄), then

μ(D̄) ≤ μ(D),

where the inequality follows from the fact that the merging of the dictionary atoms removes atoms that have θ_ij ≈ 0, depending on the kernel size h.
The small shaded area 301 corresponds to the angle between the atoms, which decides the initial coherence μ(D).
The bumps 302 outside the disc correspond to the modes of the kernel density estimate over the atoms. The atoms after clustering are located at these modes.
In one embodiment, we perform clustering whenever the coherence score is greater than some predetermined threshold.
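For reference, a minimal sketch of the coherence score μ(D) used above; the threshold against which it is compared is application dependent and not fixed here.

    import numpy as np

    def mutual_coherence(D):
        # mu(D) = max_{i != j} |<d_i, d_j>| for unit-norm atoms (columns of D).
        G = np.abs(D.T @ D)
        np.fill_diagonal(G, 0.0)
        return float(G.max())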
Clustering and Merging
The accompanying figures illustrate the clustering of the dictionary atoms into clusters, and the merging of each cluster into a single atom.
Reduction in Complexity
As stated before, the dictionary learning procedure has a computational complexity of O(k0² + 2k0) ≈ O(k0²), where k0 is the dictionary size, and the sample complexity of dictionary learning is O(√k0). Smaller dictionaries automatically reduce the sample complexity.
The reduction in complexity depends on the version of density estimator clustering used. The conventional density estimator requires the computation of pairwise distances between the dictionary atoms, which for a dictionary of size k is O(k²). A fast density estimator procedure significantly reduces this complexity. We consider the conventional O(k²) case and show that even this achieves a reduction in complexity; faster density estimator procedures perform even better.
Assume that the number of iterations for training is maxIters = n. Here, n denotes the number of iterations, not to be confused with the same variable name used above to indicate the data dimension. Naturally, not every application of the density estimator results in a reduction of the dictionary size. Suppose M of them do, so that for every m_j iterations, j = 0, 1, . . . , M, the dictionary size reduces sequentially from k0 to k_j. Of course, Σ_{j=0..M} m_j = n. Considering the density estimator itself to have a (maximum) complexity of O(k_j²), the total complexity of COLD is less than that of conventional online dictionary learning provided that

Σ_{j=0..M} m_j (k_j² + k_j²) ≤ n k0².   (3.1)

This inequality strictly holds as long as m_j is large and k_j << k0 for j ≈ M. For a highly overcomplete dictionary, this holds in general, because in the initial stages the reduction in the size of the dictionary is profound.
This can be supported by empirical validation. Another thing to note is that equation (3.1) is the condition that needs to be satisfied if we use the basic density estimator procedure. Faster density estimator procedures considerably reduce the complexity by making the clustering cost linear in k_j, and we can reduce equation (3.1) to

Σ_{j=0..M} m_j k_j² ≤ n k0²,   (3.2)

corresponding to only the dictionary learning cost for a dictionary of size k_j. Of course, equation (3.2) always holds, because
k_j ≤ k0 for all j ≠ 0.
It might be interesting to consider a case where we merge only a single pair of atoms at every stage where the density estimator procedure is applied. In this case, because the size of the dictionary decreases by merely 1 at each such stage, the condition in equation (3.1) is not satisfied, and we obtain a slower method. In other words, the gain obtained by merely discarding a single atom is negligible compared to the cost incurred by a scheme that selects and merges only a single pair of atoms. Thus, the ability of the density estimator (or any other clustering method) to significantly reduce the dictionary size in a single pass is the key to the speed gains obtained by our method, when compared with the conventional online dictionary learning method.
Convergence
We state the following result.
The dictionary after the resizing is “close” to the dictionary before resizing. Assume that, after the density estimator procedure is applied, the jth column (atom) d_j of D is mapped into a corresponding mode d̄_j.
D_t = [d_1, d_2, . . . , d_l],
and
D̄_t = [d̄_1, d̄_2, . . . , d̄_l],
so that D_t and D̄_t have the same dimensions.
For a given kernel, after t iterations, if the kernel size used is h, then the difference ‖D_t − D̄_t‖_F is bounded by a quantity that vanishes as h → 0, where we denote by D̄_t the dictionary produced by the density estimator procedure at iteration t.
This is true because, for every column d_j of D, the associated mode d̄_j (after the density estimator method) lies within a distance of d_j that is controlled by the kernel size h; the bound on ‖D_t − D̄_t‖_F follows by applying this bound to every (i, j)th element of the matrix D_t − D̄_t.
We can also state:
because
Now, we can show that the density estimator produces a new dictionary that can still be used as a restart for the dictionary update step.
With the above definitions, and a recursively shrinking kernel size, from the triangle inequality we have

‖D_{t+1} − D̄_t‖_F ≤ ‖D_{t+1} − D_t‖_F + ‖D_t − D̄_t‖_F.
Hence, if we allow h → 0, we surely achieve convergence of the learned dictionary even with the density estimator procedure. This, as described before, is because, as h approaches zero, the density estimator stops modifying the dictionary. A similar situation holds with a constant h as well, but the analysis is harder. In both cases, the key is that the density estimator eventually stops changing the dictionary.
The Offline Setting
Although we describe the online dictionary learning setting, the same method can be applied to offline dictionary learning. To use clustering in the offline case, we simply apply the density estimator procedure after every sparse coding step in the KSVD method. We can apply the clustering procedure either before or after the dictionary update stage. A parallel analysis and similar experiments can be carried out in this scenario.
Reduced Processing Time
Our method reduces the time required to perform online dictionary learning by using inherent clustering of dictionary atoms. The choice of the clustering method is arbitrary.
Optimal Dictionary Size
In one embodiment, we use a density estimator procedure because it is non-parametric. The reconstruction error is not affected by the reduction in the size of the dictionary. Thus, we enable an “optimum-sized” dictionary to be learned.
This is shown in the following table, comparing the prior art online dictionary learning (ODL) with our COLD method.
The first column indicates the final dictionary size after learning. Note that in the case of ODL, the final size equals the initial size. We can see that, as the initial dictionary size increases, COLD is much faster, while the loss in MSE is negligible.
Convergence
Our dictionary converges to about 2× overcomplete. This suggests that a dictionary that is 2× overcomplete generally suffices to represent detail in most images acquired of natural scenes.
Another embodiment selects the kernel size parameter for density estimator clustering in a principled way. We can also examine a characteristic relationship between the choice of the kernel size and the reconstruction error. This, combined with a characterization of the error as a function of the sparsity parameter, yields a trade-off between the sparsity and the kernel size as a function of the desired reconstruction error, allowing us to optimize the parameters.
We can also increase the dictionary size while learning. One way to increase the size is to check if the dictionary at the current iteration performs adequately in the sparse coding stage, and if not, append the current data point as a new atom in the dictionary.
Our work differs from the prior art in several aspects.
Reduced Memory and Processing Time
First, we assume an online setting where a reduction in the computational time and memory requirements is most needed.
Clustering
Second, although we use clustering methods, we do not prune the dictionary by discarding atoms, but use a density based approach to synthesize new atoms from several atoms “near” to each other, i.e., similar. This offers resiliency to outliers in an online setting in the following sense.
If an atom is used rarely, then the atom is retained so long as there are not too many nearby atoms, so that outlier data that would otherwise be poorly represented are still well represented by the dictionary. The loss in redundancy arising from the clustering of atoms does not affect the coding accuracy.
No Assumptions
Third, we do not make restrictive assumptions on the dictionary or the data itself, except that the atoms in the dictionary lie on the unit sphere. This is a valid assumption to prevent the reconstruction coefficients from scaling arbitrarily. Also, by imposing this constraint, we ensure that the norms of the dictionary atoms do not reduce to zero.
Incoherency
Our method has the effect of preventing atoms of the dictionary from clumping together. This has another advantage.
Incoherent dictionaries perform better in terms of image representation than coherent (clumped) dictionaries, by preventing overfitting to the data. Incoherence of the dictionary atoms also plays a role in determining the performance of sparse coding methods. Because incoherence depends on the separation between the atoms, merging nearby dictionary atoms into a single atom improves incoherence.
We provide the following improvements:
1. We describe a new framework for iteratively reducing the size of the dictionary in an online learning setting;
2. We show that our method allows for faster learning and that it promotes incoherence between the dictionary atoms; and
3. We show that the smaller learned dictionary performs as well as a larger, “non-shrunk” dictionary, in low level image processing applications.
Although the invention has been described by way of examples of preferred embodiments, it is to be understood that various other adaptations and modifications may be made within the spirit and scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.
Mairal, J., Bach, F., Ponce, J., & Sapiro, G. (2010). Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11, 19-60.
Comaniciu, D., & Meer, P. (2002). Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5), 603-619.