The present invention relates to a system and method for the sparse representation of signals. The invention is particularly relevant for applications such as compression, regularization in inverse problems, feature extraction, denoising, separation of texture and cartoon content in images, signal analysis, signal synthesis, inpainting and restoration.
In recent years there has been a growing interest in the study of sparse representation of signals. Using an overcomplete dictionary that contains prototype signal-atoms, signals are described as sparse linear combinations of these atoms. Applications that use sparse representation are many and include compression, regularization in inverse problems, feature extraction, and more. Recent activity in this field concentrated mainly on the study of pursuit algorithms that decompose signals with respect to a given dictionary. Designing dictionaries to better fit the above model can be done by either selecting one from a pre-specified set of linear transforms, or by adapting the dictionary to a set of training signals. Both these techniques have been considered in recent years, but this topic is largely still open.
Using an overcomplete dictionary matrix D ∈ ℝ^{n×K} that contains K prototype signal-atoms for columns, {d_j}_{j=1}^{K}, it is assumed that a signal y ∈ ℝ^n can be represented as a sparse linear combination of these atoms. The representation of y may either be exact, y = Dx, or approximate, y ≈ Dx, satisfying ∥y − Dx∥_p ≤ ε. The vector x ∈ ℝ^K contains the representation coefficients of the signal y. In approximation methods, typical norms used for measuring the deviation are the l_p-norms for p = 1, 2 and ∞.
If n < K and D is a full-rank matrix, an infinite number of solutions are available for the representation problem, hence constraints on the solution must be set. The solution with the fewest nonzero coefficients is certainly an appealing representation. This sparsest representation is the solution of either
(P_0) min_x ∥x∥_0 subject to y = Dx, (1)
or
(P_{0,ε}) min_x ∥x∥_0 subject to ∥y − Dx∥_2 ≤ ε, (2)
where ∥·∥_0 is the l_0 norm, counting the nonzero entries of a vector.
Applications that can benefit from the sparsity and overcompleteness concepts (together or separately) include compression, regularization in inverse problems, feature extraction, and more. Indeed, the success of the JPEG2000 coding standard can be attributed to the sparsity of the wavelet coefficients of natural images. In denoising (removal of noise from noisy data so as to obtain the unknown or original signal), wavelet methods and shift-invariant variations that exploit overcomplete representation are among the most effective known algorithms. Sparsity and overcompleteness have been successfully used for dynamic range compression in images, separation of texture and cartoon content in images, inpainting (changing an image so that the change is not noticeable by an observer) and restoration, and more.
In order to use overcomplete and sparse representations in applications, one needs to fix a dictionary D, and then find efficient ways to solve (1) or (2). Recent activity in this field has been concentrated mostly on the study of so-called pursuit algorithms that represent signals with respect to a known dictionary, and approximate the solutions of (1) and (2). Exact determination of sparsest representations proves to be an NP-hard problem. Thus, approximate solutions are considered instead, and several efficient pursuit algorithms have been proposed in the last decade. The simplest ones are the Matching Pursuit (MP) and the Orthogonal Matching Pursuit (OMP) algorithms. Both are greedy algorithms that select the dictionary atoms sequentially. These methods are very simple, involving the computation of inner products between the signal and the dictionary columns, and possibly deploying some least squares solvers. Both (1) and (2) are easily addressed by changing the stopping rule of the algorithm.
A second well-known pursuit approach is the Basis Pursuit (BP). It suggests a convexification of the problems posed in (1) and (2), by replacing the l_0-norm with an l_1-norm. The Focal Under-determined System Solver (FOCUSS) is very similar, using the l_p-norm with p ≤ 1 as a replacement for the l_0-norm. Here, for p < 1 the similarity to the true sparsity measure is better, but the overall problem becomes non-convex, giving rise to local minima that may divert the optimization. Lagrange multipliers are used to convert the constraint into a penalty term, and an iterative method is derived based on the idea of iterated reweighted least-squares that handles the l_p-norm as a weighted l_2 one.
Both the BP and FOCUSS can be motivated based on Maximum A Posteriori (MAP) estimation and indeed several works used this reasoning directly. The MAP can be used to estimate the coefficients as random variables by maximizing the posterior P(x|y, D) ∝ P(y|D, x)P(x). The prior distribution on the coefficient vector x is assumed to be a super-Gaussian Independent Identically-Distributed (iid) distribution that favors sparsity. For the Laplace distribution this approach is equivalent to BP.
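The stated equivalence between MAP estimation with a Laplace prior and BP can be made explicit with a short worked derivation. It is a sketch under two assumptions that are not spelled out at this point in the text: additive Gaussian white noise of variance σ^2, and a Laplace prior with scale parameter λ (both symbols are illustrative here).

```latex
% Assumed model: y = Dx + v, v ~ N(0, \sigma^2 I), P(x) \propto \exp(-\lambda \|x\|_1)
\begin{aligned}
\hat{x}_{\mathrm{MAP}}
  &= \arg\max_x \; P(y \mid D, x)\, P(x) \\
  &= \arg\max_x \; \exp\!\Big(-\tfrac{1}{2\sigma^2}\|y - Dx\|_2^2\Big)\,
                    \exp\!\big(-\lambda \|x\|_1\big) \\
  &= \arg\min_x \; \tfrac{1}{2\sigma^2}\|y - Dx\|_2^2 + \lambda \|x\|_1 .
\end{aligned}
```

Under these assumptions the MAP estimate minimizes an l_1-penalized least-squares objective, which is the penalty (Lagrangian) form of the approximate l_1 problem that BP addresses.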
Extensive study of these algorithms in recent years has established that if the sought solution, x, is sparse enough, these techniques recover it well in the exact case. Further work considered the approximated versions and has shown stability in recovery of x. The recent front of activity revisits those questions within a probabilistic setting, obtaining more realistic assessments on pursuit algorithms performance and success. The properties of the dictionary D set the limits that may be assumed on the sparsity that consequently ensure successful approximation. Interestingly, in all the works mentioned so far, there is a preliminary assumption that the dictionary is known and fixed. There is a great need to address the issue of designing the proper dictionary in order to better fit the sparsity model imposed.
An overcomplete dictionary D that leads to sparse representations can either be chosen as a pre-specified set of functions, or designed by adapting its content to fit a given set of signal examples.
Choosing a pre-specified transform matrix is appealing because it is simpler. Also, in many cases it leads to simple and fast algorithms for the evaluation of the sparse representation. This is indeed the case for overcomplete wavelets, curvelets, contourlets, steerable wavelet filters, short-time-Fourier transforms, and more. Preference is typically given to tight frames that can easily be pseudo-inverted. The success of such dictionaries in applications depends on how suitable they are to sparsely describe the signals in question. Multiscale analysis with oriented basis functions and a shift-invariant property are guidelines in such constructions.
There is a need to develop a different route for designing dictionaries D based on learning, and to find the dictionary D that yields sparse representations for the training signals. Such dictionaries have the potential to outperform commonly used pre-determined dictionaries. With ever-growing computational capabilities, computational cost may become secondary in importance to the improved performance achievable by methods which adapt dictionaries for special classes of signals.
Sparse coding is the process of computing the representation coefficients, x, based on the given signal y and the dictionary D. This process, commonly referred to as “atom decomposition”, requires solving (1) or (2), and this is typically done by a “pursuit algorithm” that finds an approximate solution. Three popular pursuit algorithms are the Orthogonal Matching Pursuit (OMP), Basis Pursuit (BP) and the Focal Under-determined System Solver (FOCUSS).
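Orthogonal Matching Pursuit is a greedy step-wise regression algorithm. At each stage this method selects the dictionary element having the maximal projection onto the residual signal. After each selection, the representation coefficients with respect to the atoms chosen so far are found via least-squares. Formally, given a signal y ∈ ℝ^n, and a dictionary D with K l_2-normalized columns {d_k}_{k=1}^{K}, one starts by setting r_0 = y, k = 1, and performing the following steps:
(1) Select the index of the next dictionary element, i_k = argmax_w |⟨r_{k−1}, d_w⟩|;
(2) Update the current approximation, y_k = argmin_{y_k} ∥y − y_k∥_2^2 such that y_k ∈ span{d_{i_1}, d_{i_2}, …, d_{i_k}}; and
(3) Update the residual, r_k = y − y_k.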
The algorithm can be stopped after a predetermined number of steps, hence after having selected a fixed number of atoms. Alternatively, the stopping rule can be based on norm of the residual, or on the maximal inner product computed in the next atom selection stage.
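The greedy procedure described above (pick the atom most correlated with the residual, re-fit all chosen atoms by least squares, update the residual) can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `omp`, its parameters, and the residual tolerance `tol` are assumptions made for the example, and the columns of `D` are assumed to be l_2-normalized.

```python
import numpy as np

def omp(D, y, n_nonzero, tol=1e-10):
    """Greedy sparse coding of y over D (columns assumed l2-normalized)."""
    K = D.shape[1]
    residual = np.asarray(y, dtype=float).copy()
    support = []                      # indices of the atoms chosen so far
    coeffs = np.zeros(0)
    for _ in range(n_nonzero):        # stopping rule 1: fixed number of atoms
        if np.linalg.norm(residual) <= tol:
            break                     # stopping rule 2: small residual norm
        # Atom selection: largest absolute inner product with the residual.
        correlations = np.abs(D.T @ residual)
        if support:
            correlations[support] = 0.0   # never select the same atom twice
        support.append(int(np.argmax(correlations)))
        # Least-squares fit of y on all atoms chosen so far.
        coeffs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coeffs
    x = np.zeros(K)
    if support:
        x[support] = coeffs
    return x

# Usage on synthetic data: a 3-sparse signal over a random normalized dictionary.
rng = np.random.default_rng(0)
D_demo = rng.standard_normal((20, 50))
D_demo /= np.linalg.norm(D_demo, axis=0)
x_true = np.zeros(50)
x_true[[3, 17, 41]] = [1.5, -2.0, 0.7]
y_demo = D_demo @ x_true
x_hat = omp(D_demo, y_demo, n_nonzero=3)
print("nonzeros:", np.count_nonzero(x_hat),
      "residual norm:", np.linalg.norm(y_demo - D_demo @ x_hat))
```

Fixing `n_nonzero` gives a representation with an a priori bounded number of nonzero entries; replacing the loop bound by a residual threshold would target the approximate problem instead.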
OMP is an appealing algorithm that is very simple to implement. Unlike other methods, it can be easily programmed to supply a representation with an a priori fixed number of non-zero entries, a desired outcome in the training of dictionaries. There are several variants of the OMP that suggest (i) skipping the least-squares and using the inner product itself as a coefficient, or (ii) applying least-squares per every candidate atom, rather than just using inner products at the selection stage, or (iii) doing a faster and less precise search, where instead of searching for the maximal inner product, a nearly maximal one is selected, thereby speeding up the search.
Theoretical study has shown that the OMP solution is also the sparsest available one (solving (1)) if some conditions on the dictionary and on the exact solution prevail. More recent work has shown that the above is also true for the approximation version (2). These results and some later ones that apply to the basis pursuit and FOCUSS involve a key feature of the dictionary D called the mutual incoherence and defined as:
μ = max_{i≠j} |d_i^T d_j|. (3)
This measure quantifies how similar two columns of the dictionary can be. If the sparse representation to be found has fewer than O(1/μ) non-zeros, the OMP and its variants are guaranteed to succeed in recovering it.
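As a small illustration of this measure, the sketch below computes μ as the largest absolute inner product between two distinct l_2-normalized columns, which is the usual definition and is assumed here. The function name `mutual_coherence` is hypothetical.

```python
import numpy as np

def mutual_coherence(D):
    """Largest absolute inner product between two distinct normalized columns."""
    Dn = D / np.linalg.norm(D, axis=0, keepdims=True)
    gram = np.abs(Dn.T @ Dn)
    np.fill_diagonal(gram, 0.0)       # ignore each column's product with itself
    return float(gram.max())

# Example: the 2x2 identity joined with a normalized 2x2 Hadamard basis.
identity = np.eye(2)
hadamard = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
mu = mutual_coherence(np.hstack([identity, hadamard]))
print(mu)
```

For this two-basis dictionary every identity column has inner product of magnitude 1/√2 with every Hadamard column, so μ = 1/√2; an orthonormal basis on its own gives μ = 0.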
The Basis Pursuit (BP) algorithm proposes the replacement of the l_0-norm in (1) and (2) with an l_1-norm. Hence solutions of:
(P_1) min_x ∥x∥_1 subject to y = Dx, (4)
in the exact representation case, and
(P_{1,ε}) min_x ∥x∥_1 subject to ∥y − Dx∥_2 ≤ ε, (5)
in the approximate one, lead to the BP representations. Solution of (4) amounts to linear programming, and thus there exist efficient solvers for such problems.
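The reduction of the exact l_1 problem to linear programming can be written out explicitly. The following is the standard variable-splitting construction, shown here as an illustration; the auxiliary vectors u and v are introduced for the example and do not appear in the text.

```latex
% Split x = u - v with u, v >= 0; the l1 objective becomes linear.
\min_{u,\,v \in \mathbb{R}^{K}} \; \mathbf{1}^{T}(u + v)
\quad \text{subject to} \quad
\begin{bmatrix} D & -D \end{bmatrix}
\begin{bmatrix} u \\ v \end{bmatrix} = y,
\qquad u \ge 0, \;\; v \ge 0 .
```

At an optimum, u_i and v_i are never both positive, so 1^T(u + v) = ∥u − v∥_1 and the representation is recovered as x = u − v. Any general-purpose linear programming solver can then be applied.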
Recent research addressed the connection between the (P0) and (P1). The essential claims are quite similar to the ones of OMP, namely, if the signal representation to be found has fewer than O(1/μ) non-zeros, the BP is guaranteed to succeed in recovering it. Similar results exist for the approximated case, proving that recovered representations are very close to the original sparse one in case of high sparsity.
The Focal Under-determined System Solver (FOCUSS) is an approximating algorithm for finding the solutions of either (1) or (2), by replacing the l_0-norm with an l_p one for p ≤ 1.
For the exact case problem, (P_0), this method requires solving
(P_p) min_x ∥x∥_p subject to y = Dx. (6)
The use of a Lagrange multiplier vector λ ∈ ℝ^n here yields the Lagrangian function
ζ(x, λ) = ∥x∥_p + λ^T (y − Dx). (7)
Hence necessary conditions for a pair x, λ to be a solution of (6) are
∇_x ζ(x, λ) = pΠ(x)x − D^T λ = 0 and ∇_λ ζ(x, λ) = Dx − y = 0, (8)
where Π(x) is defined to be a diagonal matrix with |x_i|^{p−2} as its (i, i)th entry. The split of the l_p-norm derivative into a linear term multiplied by a weight matrix is the core of the FOCUSS method, and this follows the well-known idea of iterated reweighted least-squares. Several simple steps of algebra lead to the solution:
x = Π(x)^{−1} D^T (D Π(x)^{−1} D^T)^{−1} y. (9)
While it is impossible to get a closed form solution for x from the above result, an iterative replacement procedure can be proposed, where the right hand side is computed based on the currently known x_{k−1}, and this leads to the updating process,
x_k = Π(x_{k−1})^{−1} D^T (D Π(x_{k−1})^{−1} D^T)^{−1} y. (10)
A regularization can, and should, be introduced to avoid near-zero entries in the weight matrix Π(x).
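The iterative replacement procedure above can be sketched in a few lines. The code is an illustration under stated assumptions: the function name `focuss`, the starting point (the minimum l_2-norm solution), and the specific regularization (adding a small constant `eps` to the inverse weights |x_i|^{2−p}) are choices made for this example and are not taken from the text.

```python
import numpy as np

def focuss(D, y, p=0.5, n_iter=30, eps=1e-6):
    """Iterate x_k = W D^T (D W D^T)^{-1} y, where W = Pi(x_{k-1})^{-1}."""
    x = np.linalg.pinv(D) @ y                 # assumed start: minimum l2-norm solution
    for _ in range(n_iter):
        # Diagonal of the inverse weight matrix, |x_i|^(2-p); eps is the
        # assumed regularization that keeps the system well posed.
        w = np.abs(x) ** (2.0 - p) + eps
        DW = D * w                            # same as D @ diag(w)
        x = w * (D.T @ np.linalg.solve(DW @ D.T, y))
    return x

# Usage on synthetic data: every iterate satisfies D x = y by construction.
rng = np.random.default_rng(6)
D_demo = rng.standard_normal((10, 30))
D_demo /= np.linalg.norm(D_demo, axis=0)
x_true = np.zeros(30)
x_true[[2, 11, 25]] = [1.0, -0.8, 1.3]
y_demo = D_demo @ x_true
x_focuss = focuss(D_demo, y_demo)
print("constraint error:", np.linalg.norm(D_demo @ x_focuss - y_demo))
```

Because p < 1 makes the underlying problem non-convex, the point the iteration settles on depends on the starting point; the sketch makes no claim that it recovers the sparsest solution.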
For the treatment of (P_{0,ε}) via (P_{p,ε}), parallel expressions can be derived quite similarly, although in this case the determination of the Lagrange multiplier is more difficult and must be searched for within the algorithm.
Recent work analyzed the (Pp) problem and showed its equivalence to the (P0), under conditions similar in flavor to the sparsity conditions mentioned above. Hence, this approach too enjoys the support of some theoretical justification, like BP and OMP. However, the analysis does not say anything about local minima traps and prospects in hitting those in the FOCUSS-algorithm.
There has been some work in the field regarding the training of dictionaries based on a set of examples. Given such a set Y = {y_i}_{i=1}^{N}, we assume that there exists a dictionary D that gave rise to the given signal examples via sparse combinations, i.e., we assume that there exists D, so that solving (P_0) for each example y_k gives a sparse representation x_k. It is in this setting that the question is raised what the proper dictionary D is.
There is an intriguing relation between sparse representation and clustering (i.e., vector quantization). In clustering, a set of descriptive vectors {d_k}_{k=1}^{K} is learned, and each sample is represented by one of those vectors (the one closest to it, usually in the l_2 distance measure). One can think of this as an extreme sparse representation, where only one atom is allowed in the signal decomposition, and furthermore, the coefficient multiplying it must be 1. There is a variant of the vector quantization (VQ) coding method, called Gain-Shape VQ, where this coefficient is allowed to vary. In contrast, in sparse representations relevant to the invention, each example is represented as a linear combination of several vectors {d_k}_{k=1}^{K}. Thus, sparse representations can be referred to as a generalization of the clustering problem.
Since the K-Means algorithm (also known as the generalized Lloyd algorithm, GLA) is the most commonly used procedure for training in the vector quantization setting, it is natural to consider generalizations of this algorithm when turning to the problem of dictionary training. The K-Means process applies two steps per each iteration: (i) given {d_k}_{k=1}^{K}, assign the training examples to their nearest neighbor; and (ii) given that assignment, update {d_k}_{k=1}^{K} to better fit the examples.
The approaches to dictionary design that have been tried so far are very much in line with the two-step process described above. The first step finds the coefficients given the dictionary—a step we shall refer to as “sparse coding”. Then, the dictionary is updated assuming known and fixed coefficients. The differences between the various algorithms that have been proposed are in the method used for the calculation of coefficients, and in the procedure used for modifying the dictionary.
Maximum likelihood methods use probabilistic reasoning in the construction of D. The proposed model suggests that for every example y the relation
y=Dx+v, (11)
holds true with a sparse representation x and Gaussian white residual vector v with variance σ^2. Given the examples Y = {y_i}_{i=1}^{N}, these works consider the likelihood function P(Y|D) and seek the dictionary that maximizes it. Two assumptions are required in order to proceed. The first is that the measurements are drawn independently, readily providing
P(Y|D) = ∏_{i=1}^{N} P(y_i|D). (12)
The second assumption is critical and refers to the “hidden variable” x. The ingredients of the likelihood function are computed using the relation
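P(y_i|D) = ∫ P(y_i, x|D) dx = ∫ P(y_i|x, D)·P(x) dx. (13)
Returning to the initial assumption in (11), we have
P(y_i|x, D) = Const·exp{−(1/(2σ^2)) ∥Dx − y_i∥^2}. (14)
The prior distribution of the representation vector x is assumed to be such that the entries of x are zero-mean iid, with Cauchy or Laplace distributions. Assuming for example a Laplace distribution we get
P(y_i|D) = ∫ P(y_i|x, D)·P(x) dx = Const·∫ exp{−(1/(2σ^2)) ∥Dx − y_i∥^2}·exp{−λ∥x∥_1} dx, (15)
where λ > 0 is the parameter of the Laplace prior. This integration over x is difficult to evaluate, and indeed, it has been handled by replacing it with the extremal value of P(y_i, x|D). The overall problem turns into
D = argmax_D Σ_{i=1}^{N} max_{x_i} {P(y_i, x_i|D)} = argmin_D Σ_{i=1}^{N} min_{x_i} {∥Dx_i − y_i∥^2 + λ∥x_i∥_1}, (16)
where the factor 2σ^2 has been absorbed into λ.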
This problem does not penalize the entries of D as it does for the ones of xi. Thus, the solution will tend to increase the dictionary entries' values, in order to allow the coefficients to become closer to zero. This difficulty has been handled by constraining the l2-norm of each basis element, so that the output variance of the coefficients is kept at an appropriate level.
An iterative method was suggested for solving (16). It includes two main steps in each iteration: (i) calculate the coefficients x_i using a simple gradient descent procedure; and then (ii) update the dictionary using
D^{(n+1)} = D^{(n)} − η Σ_{i=1}^{N} (D^{(n)} x_i − y_i) x_i^T. (17)
This idea of iterative refinement, mentioned before as a generalization of the K-Means algorithm, was later used again by other researchers, with some variations.
A different approach to handle the integration in (15) has been suggested. It consisted in approximating the posterior as a Gaussian, enabling an analytic solution of the integration. This allows an objective comparison of different image models (basis or priors). It also removes the need for the additional re-scaling that enforces the norm constraint. However, this model may be too limited in describing the true behaviors expected. This technique and closely related ones have been referred to as approximated ML techniques.
There is an interesting relation between the maximum likelihood method and the Independent Component Analysis (ICA) algorithm. The latter handles the case of a complete dictionary (the number of elements equals the dimensionality) without assuming additive noise. The maximum likelihood method is then similar to ICA in that the algorithm can be interpreted as trying to maximize the mutual information between the inputs (samples) and the outputs (the coefficients).
The Method of Optimal Directions (MOD), a dictionary-training algorithm, follows more closely the K-Means outline, with a sparse coding stage that uses either the OMP or FOCUSS, followed by an update of the dictionary. The main contribution of the MOD method is its simple way of updating the dictionary.
Assuming that the sparse coding for each example is known, we define the errors e_i = y_i − Dx_i. The overall representation mean square error is given by
∥E∥_F^2 = ∥[e_1, e_2, …, e_N]∥_F^2 = ∥Y − DX∥_F^2. (18)
Here we have concatenated all the examples y_i as columns of the matrix Y, and similarly gathered the representation coefficient vectors x_i to build the matrix X. The notation ∥A∥_F stands for the Frobenius norm, defined as ∥A∥_F = √(Σ_{ij} A_{ij}^2).
Assuming that X is fixed, we can seek an update to D such that the above error is minimized. Taking the derivative of (18) with respect to D we obtain the relation (Y − DX)X^T = 0, leading to
D^{(n+1)} = Y X^{(n)T} · (X^{(n)} X^{(n)T})^{−1}. (19)
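The MOD dictionary update for fixed coefficients is a plain least-squares solve, which the following sketch illustrates. The function name `mod_update` is hypothetical, and the sketch assumes X X^T is invertible (every atom is used by at least some examples); the column normalization that the method also applies is left out here.

```python
import numpy as np

def mod_update(Y, X):
    """Least-squares dictionary for fixed X: D = Y X^T (X X^T)^{-1}."""
    # Solve (X X^T) D^T = X Y^T instead of forming an explicit inverse.
    return np.linalg.solve(X @ X.T, X @ Y.T).T

# Usage on synthetic data.
rng = np.random.default_rng(5)
X_demo = rng.standard_normal((4, 60))          # K x N coefficients
Y_demo = rng.standard_normal((6, 60))          # n x N training signals
D_mod = mod_update(Y_demo, X_demo)
# The optimality condition (Y - D X) X^T = 0 should hold up to round-off.
print(np.abs((Y_demo - D_mod @ X_demo) @ X_demo.T).max())
```

Solving the K×K linear system is the step whose cost grows with the number of atoms, which is relevant to the efficiency discussion of the MOD.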
In updating the dictionary, the update relation given in (19) is the best that can be achieved for fixed X. The iterative steepest descent update in (17) is far slower. Interestingly, in both stages of the algorithm, the difference is in deploying a second order (Newtonian) update instead of a first-order one. Looking closely at the update relation in (17), it could be written as
D^{(n+1)} = D^{(n)} + η E X^{(n)T} = D^{(n)} + η (Y − D^{(n)} X^{(n)}) X^{(n)T} = D^{(n)} (I − η X^{(n)} X^{(n)T}) + η Y X^{(n)T}. (20)
Using infinitely many iterations of this sort, and using small enough η, this leads to a steady state outcome, which is exactly the MOD update matrix (19). Thus, while the MOD method assumes known coefficients at each iteration, and derives the best possible dictionary, the ML method by Olshausen and Field only gets closer to this best current solution, and then turns to calculate the coefficients. Note, however, that in both methods a normalization of the dictionary columns is required and done.
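The steady-state claim can be checked numerically: repeating the first-order update with the coefficients held fixed should approach the least-squares dictionary Y X^T (X X^T)^{−1}. The sketch below does this on synthetic data; the function name `gradient_steady_state` and the step size η = 1/λ_max(X X^T) are assumptions chosen for the example (any sufficiently small η would serve).

```python
import numpy as np

def gradient_steady_state(Y, X, D0, n_steps=500):
    """Repeat D <- D + eta * (Y - D X) X^T with X held fixed."""
    eta = 1.0 / np.linalg.eigvalsh(X @ X.T).max()   # assumed "small enough" step
    D = D0.copy()
    for _ in range(n_steps):
        D = D + eta * (Y - D @ X) @ X.T
    return D

rng = np.random.default_rng(0)
X_fix = rng.standard_normal((4, 100))           # K x N coefficients, held fixed
Y_fix = rng.standard_normal((6, 100))           # n x N training signals
D_start = rng.standard_normal((6, 4))
D_grad = gradient_steady_state(Y_fix, X_fix, D_start)
D_least_squares = np.linalg.solve(X_fix @ X_fix.T, X_fix @ Y_fix.T).T
gap = np.linalg.norm(D_grad - D_least_squares)
print("distance to the least-squares update:", gap)
```

The number of first-order steps needed depends on the conditioning of X X^T, which is one way to see why the closed-form update is faster per outer iteration.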
The same researchers that conceived the MOD method also suggested a maximum a-posteriori probability (MAP) setting for the training of dictionaries, attempting to merge the efficiency of the MOD with a natural way to take into account preferences in the recovered dictionary. This probabilistic point of view is very similar to the ML methods discussed above. However, rather than working with the likelihood function P(Y|D), the posterior P(D|Y) is used. Using Bayes rule, we have P(D|Y) ∝ P(Y|D)P(D), and thus we can use the likelihood expression as before, and add a prior on the dictionary as a new ingredient.
This line of research considered several priors P(D) and, for each, proposed an update formula for the dictionary. The efficiency of the MOD in these methods is manifested in the efficient sparse coding, which is carried out with FOCUSS. The proposed algorithms in this family deliberately avoid a direct minimization with respect to D as in MOD, due to the prohibitive n×n matrix inversion required. Instead, iterative gradient descent is used.
When no prior is chosen, the update formula is the very one used in (17). A prior that constrains D to have a unit Frobenius norm leads to the update formula
D^{(n+1)} = D^{(n)} + η E X^T + η·tr(X E^T D^{(n)}) D^{(n)}. (21)
As can be seen, the first two terms are the same ones as in (17). The last term compensates for deviations from the constraint. This case allows different columns in D to have different norm values. As a consequence, columns with small norm values tend to be under-used, as the coefficients they need are larger and as such more penalized.
This led to the second prior choice, constraining the columns of D to have a unit l2-norm. The new update equation formed is given by
d_i^{(n+1)} = d_i^{(n)} + η (I − d_i^{(n)} d_i^{(n)T}) E · x_T^i, (22)
where x_T^i is the i-th column in the matrix X^T.
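The column-wise update under the unit l_2-norm prior can be sketched as a gradient step with the component along the current atom projected out. The function name `map_unit_norm_step` is hypothetical, and the form of the step (the projector I − d_i d_i^T applied to E times the i-th column of X^T) is the reading of the update assumed for this illustration.

```python
import numpy as np

def map_unit_norm_step(D, X, Y, eta):
    """One projected gradient step per atom: d_i + eta * (I - d_i d_i^T) E x_T^i."""
    E = Y - D @ X                              # current representation error
    D_new = np.empty_like(D)
    for i in range(D.shape[1]):
        d = D[:, i]
        g = E @ X[i, :]                        # E times the i-th column of X^T
        D_new[:, i] = d + eta * (g - d * (d @ g))   # (I - d d^T) g without forming the matrix
    return D_new

# Usage on synthetic data with unit-norm atoms.
rng = np.random.default_rng(7)
D_unit = rng.standard_normal((6, 4))
D_unit /= np.linalg.norm(D_unit, axis=0)
X_map = rng.standard_normal((4, 50))
Y_map = rng.standard_normal((6, 50))
D_next = map_unit_norm_step(D_unit, X_map, Y_map, eta=1e-3)
print(np.linalg.norm(D_next, axis=0))
```

For unit-norm atoms the step is orthogonal to the atom itself, so the column norms change only to second order in η; whether an explicit renormalization follows is not specified here.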
Compared to the MOD, this line of work provides slower training algorithms.
Recent work considered a dictionary composed as a union of orthonormal bases
D = [D_1, D_2, …, D_L],
where D_j ∈ ℝ^{n×n}, j = 1, 2, …, L, are orthonormal matrices. Such a dictionary structure is quite restrictive, but its updating may potentially be made more efficient.
The coefficients of the sparse representations X can be decomposed to L pieces, each referring to a different ortho-basis. Thus,
X = [X_1, X_2, …, X_L]^T,
where Xi is the matrix containing the coefficients of the orthonormal dictionary Di.
One of the major advantages of the union of ortho-bases is the relative simplicity of the pursuit algorithm needed for the sparse coding stage. The coefficients are found using the Block Coordinate Relaxation (BCR) algorithm. This is an appealing way to solve (P1,ε) as a sequence of simple shrinkage steps, such that at each stage Xi is computed, while keeping all the other pieces of X fixed. Thus, this evaluation amounts to a simple shrinkage.
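The shrinkage step can be illustrated as follows. The sketch works with the penalized form 0.5·∥Y − Σ_j D_j X_j∥_F^2 + λ Σ_j ∥X_j∥_1 rather than the constrained (P_{1,ε}) form, which is an assumption made for the example; the names `soft_threshold`, `bcr_sweep` and `bcr_objective` are hypothetical.

```python
import numpy as np

def soft_threshold(Z, t):
    """Entrywise shrinkage: move every entry toward zero by t, clipping at zero."""
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def bcr_objective(Y, bases, coeffs, lam):
    approx = sum(B @ Xb for B, Xb in zip(bases, coeffs))
    return 0.5 * np.linalg.norm(Y - approx) ** 2 + lam * sum(np.abs(Xb).sum() for Xb in coeffs)

def bcr_sweep(Y, bases, coeffs, lam):
    """One pass over the blocks; each block is solved exactly with the others fixed."""
    for j, Dj in enumerate(bases):
        # Residual with the contribution of every block except j removed.
        R = Y - sum(Di @ Xi for i, (Di, Xi) in enumerate(zip(bases, coeffs)) if i != j)
        # For orthonormal Dj the block minimizer is a shrinkage of Dj^T R.
        coeffs[j] = soft_threshold(Dj.T @ R, lam)
    return coeffs

# Usage: union of the identity and a random orthonormal basis.
rng = np.random.default_rng(3)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
bases_demo = [np.eye(6), Q]
Y_bcr = rng.standard_normal((6, 10))
coeffs_demo = [np.zeros((6, 10)), np.zeros((6, 10))]
f0 = bcr_objective(Y_bcr, bases_demo, coeffs_demo, 0.3)
coeffs_demo = bcr_sweep(Y_bcr, bases_demo, coeffs_demo, 0.3)
f1 = bcr_objective(Y_bcr, bases_demo, coeffs_demo, 0.3)
coeffs_demo = bcr_sweep(Y_bcr, bases_demo, coeffs_demo, 0.3)
f2 = bcr_objective(Y_bcr, bases_demo, coeffs_demo, 0.3)
print(f0, f1, f2)
```

Each block update is an exact minimization over that block, so the penalized objective cannot increase from one sweep to the next.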
Assuming known coefficients, the proposed algorithm updates each orthonormal basis D_j sequentially. The update of D_j is done by first computing the residual matrix
E_j = Y − Σ_{i≠j} D_i X_i.
Then, by computing the singular value decomposition of the matrix E_j X_j^T = UΛV^T, the update of the j-th ortho-basis is done by D_j = UV^T. This update rule is obtained by solving a constrained least squares problem with ∥E_j − D_j X_j∥_F^2 as the penalty term, assuming fixed coefficients X_j and error E_j. The constraint is over the feasible matrices D_j, which are forced to be orthonormal.
This way the proposed algorithm improves each matrix Dj separately, by replacing the role of the data matrix Y in the residual matrix Ej, as the latter should be represented by this updated basis.
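The SVD-based update of a single ortho-basis is an instance of the orthogonal Procrustes problem and can be sketched directly. The function name `update_orthobasis` is hypothetical; the inputs stand for the residual matrix and the coefficient block of the basis being updated.

```python
import numpy as np

def update_orthobasis(E_j, X_j):
    """Orthonormal D_j minimizing ||E_j - D_j X_j||_F, from the SVD of E_j X_j^T."""
    U, _, Vt = np.linalg.svd(E_j @ X_j.T)
    return U @ Vt

# Usage on synthetic data, compared against an arbitrary orthonormal matrix.
rng = np.random.default_rng(2)
E_demo = rng.standard_normal((5, 30))
X_block = rng.standard_normal((5, 30))
D_j = update_orthobasis(E_demo, X_block)
Q_other, _ = np.linalg.qr(rng.standard_normal((5, 5)))
err_new = np.linalg.norm(E_demo - D_j @ X_block)
err_other = np.linalg.norm(E_demo - Q_other @ X_block)
print(err_new, err_other)
```

Because D_j = UV^T is the minimizer over all orthonormal matrices, its error is never larger than that of any other orthonormal candidate, including the basis it replaces.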
Gribonval suggested a slightly different method. Apart from the fact that here the dictionary is structured, handling a union of orthonormal bases, it updates each orthonormal basis sequentially, and is thus reminiscent of the sequential updates done in the K-Means. Experimental results show weak performance compared to previous methods. This could partly be explained by the fact that the update of D_j depends strongly on the coefficients X_j.
In VQ, a codebook C that includes K codewords (representatives) is used to represent a wide family of vectors (signals) Y = {y_i}_{i=1}^{N} (N > K) by a nearest neighbor assignment. This leads to an efficient compression or description of those signals, as clusters in ℝ^n surrounding the chosen codewords. Based on the expectation maximization procedure, the K-Means can be extended to suggest a fuzzy assignment and a covariance matrix per each cluster, so that the data is modeled as a mixture of Gaussians.
The dictionary of VQ codewords is typically trained using the K-Means algorithm. We denote the codebook matrix by C = [c_1, c_2, …, c_K], the codewords being the columns. When C is given, each signal is represented as its closest codeword (under l_2-norm distance). We can write y_i = Cx_i, where x_i = e_j is a vector from the trivial basis, with all zero entries except a one in the j-th position. The index j is selected such that
∀k ≠ j, ∥y_i − Ce_j∥_2^2 ≤ ∥y_i − Ce_k∥_2^2.
This is considered as an extreme case of sparse coding in the sense that only one atom is allowed to participate in the construction of y_i and the coefficient is forced to be 1. The representation MSE per y_i is defined as
e_i^2 = ∥y_i − Cx_i∥_2^2, (23)
and the overall MSE is
E = Σ_{i=1}^{N} e_i^2 = ∥Y − CX∥_F^2. (24)
The VQ training problem is to find a codebook C that minimizes the error E, subject to the limited structure of X, whose columns must be taken from the trivial basis,
min_{C,X} {∥Y − CX∥_F^2} subject to ∀i, x_i = e_k for some k. (25)
The K-Means algorithm is an iterative method used for designing the optimal codebook for VQ. In each iteration there are two stages—one for sparse coding that essentially evaluates X, and one for updating the codebook.
The sparse coding stage assumes a known codebook C^{(J−1)}, and computes a feasible X that minimizes the value of (25). Similarly, the dictionary update stage fixes X as known, and seeks an update of C so as to minimize (25). Clearly, at each iteration either a reduction or no change in the MSE is ensured. Furthermore, at each such stage, the minimization step is optimal under the assumptions. As the MSE is bounded from below by zero, and the algorithm ensures a monotonic decrease of the MSE, convergence to at least a local minimum solution is guaranteed. Stopping rules for the above-described algorithm can vary a lot but are quite easy to handle.
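The two stages of K-Means described above can be sketched as follows. The function name `kmeans_codebook`, the random initialization from the training signals, and the fixed iteration count are assumptions made for the example; the recorded error is the sum of squared distances to the assigned codewords.

```python
import numpy as np

def kmeans_codebook(Y, K, n_iter=20, seed=0):
    """Alternate nearest-codeword assignment and centroid update on the columns of Y."""
    rng = np.random.default_rng(seed)
    # Assumed initialization: K distinct training signals.
    C = Y[:, rng.choice(Y.shape[1], size=K, replace=False)].astype(float)
    errors = []
    for _ in range(n_iter):
        # Sparse coding stage: squared distance of every signal to every codeword (K x N).
        d2 = ((Y[:, None, :] - C[:, :, None]) ** 2).sum(axis=0)
        labels = d2.argmin(axis=0)
        errors.append(float(d2.min(axis=0).sum()))
        # Codebook update stage: each codeword becomes the mean of its members.
        for k in range(K):
            members = Y[:, labels == k]
            if members.shape[1] > 0:       # an empty cluster keeps its codeword
                C[:, k] = members.mean(axis=1)
    return C, labels, errors

# Usage: three well-separated groups of 2-D points.
rng = np.random.default_rng(4)
Y_vq = np.hstack([rng.normal(c, 0.1, size=(2, 30)) for c in (-2.0, 0.0, 2.0)])
C_vq, labels_vq, errors_vq = kmeans_codebook(Y_vq, K=3)
print(errors_vq[0], errors_vq[-1])
```

The recorded errors form a non-increasing sequence, matching the monotonicity argument in the text. The returned labels are the assignments computed in the last iteration, before the final codebook update.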
Almost all previous methods can essentially be interpreted as generalizations of the K-Means algorithm, and yet, there are marked differences between these procedures. In the quest for a successful dictionary training algorithm, there are several desirable properties:
(i) Flexibility: the algorithm should be able to run with any pursuit algorithm, and this way enable choosing the one adequate for the run-time constraints, or the one planned for future usage in conjunction with the obtained dictionary. Methods that decouple the sparse-coding stage from the dictionary update readily have such a property. Such is the case with the MOD and the MAP based methods.
(ii) Simplicity: much of the appeal of a proposed dictionary training method has to do with how simple it is, and more specifically, how similar it is to K-Means. It is desirable to have an algorithm that may be regarded as a natural generalization of the K-Means. The algorithm should emulate the ease with which the K-Means is explainable and implementable. Again, the MOD seems to have made substantial progress in this direction, although there is still room for improvement.
(iii) Efficiency: the proposed algorithm should be numerically efficient and exhibit fast convergence. The above described methods are all quite slow. The MOD, which has a second-order update formula, is nearly impractical in reasonable dimensions, because of the matrix inversion step involved. Also, in all the above formulations, the dictionary columns are updated before turning to re-evaluate the coefficients. This approach inflicts a severe limitation on the training speed.
(iv) Well Defined Objective: for a method to succeed, it should have a well defined objective function that measures the quality of the solution obtained. This almost trivial fact was overlooked in some of the preceding work in this field. Hence, even though an algorithm can be designed to greedily improve the representation Mean Square Error (MSE) and the sparsity, it may happen that the algorithm leads to aimless oscillations in terms of a global objective measure of quality.
It is an object of the present invention to design a dictionary based on learning from training signals, wherein the dictionary yields sparse representations for a set of training signals. These dictionaries have the potential to outperform commonly used pre-determined dictionaries.
The invention thus relates to a novel system and algorithm for adapting dictionaries so as to represent signals sparsely. Given a set of training signals {y_i}_{i=1}^{N}, we seek the dictionary D that leads to the best possible representations for each member in this set with strict sparsity constraints. The invention introduces the K-SVD algorithm that addresses the above task, generalizing the K-Means algorithm. The K-SVD is an iterative method that alternates between sparse coding of the examples based on the current dictionary, and an update process for the dictionary atoms so as to better fit the data. The update of the dictionary columns is done jointly with an update of the sparse representation coefficients related to it, resulting in accelerated convergence. The K-SVD algorithm is flexible and can work with any pursuit method, thereby tailoring the dictionary to the application in mind.
The sparse representation problem can be viewed as a generalization of the VQ objective (25), in which we allow each input signal to be represented by a linear combination of codewords, which we now call dictionary elements. Therefore the coefficients vector is now allowed more than one nonzero entry, and these can have arbitrary values. For this case, the minimization corresponding to Equation (25) is that of searching the best possible dictionary for the sparse representation of the example set Y,
min_{D,X} {∥Y − DX∥_F^2} subject to ∀i, ∥x_i∥_0 ≤ T_0. (26)
A similar objective could alternatively be met by considering
min_{D,X} Σ_i ∥x_i∥_0 subject to ∥Y − DX∥_F^2 ≤ ε, (27)
for a fixed value ε. By disclosing the treatment for the first problem (26), any person skilled in the art would immediately realize that the treatment of (27) is very similar.
In the algorithm of the invention, we minimize the expression in (26) iteratively. First, we fix D and aim to find the best coefficient matrix X that can be found. As finding the truly optimal X is impossible, we use an approximation pursuit method. Any such algorithm can be used for the calculation of the coefficients, as long as it can supply a solution with a fixed and predetermined number of nonzero entries, T0.
Once the sparse coding task is done, a second stage is performed to search for a better dictionary. This process updates one column at a time, fixing all columns in D except one, dk, and finding a new column dk and new values for its coefficients that best reduce the MSE. This is markedly different from all the K-Means generalizations. K-Means generalization methods freeze X while finding a better D. The approach of the invention is different, as we change the columns of D sequentially, and allow changing the relevant coefficients. In a sense, this approach is a more direct generalization of the K-Means algorithm, because it updates each column separately, as done in K-Means.
The process of updating only one column of D at a time is a problem having a straightforward solution based on the singular value decomposition (SVD). Furthermore, allowing a change in the coefficients' values while updating the dictionary columns accelerates convergence, since the subsequent columns updates will be based on more relevant coefficients. The overall effect is very much in line with the leap from gradient descent to Gauss-Seidel methods in optimization.
A hypothetical alternative would be to skip the step of sparse coding, and use only updates of columns in D, along with their coefficients, applied in a cyclic fashion, again and again. This however will not work well, as the support of the representations will never be changed, and such an algorithm will necessarily fall into a local minimum trap.
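The single-column update described above can be sketched with a rank-one SVD approximation. This is an illustration under an explicit assumption: the approximation is computed only over the training signals that currently use the atom, so that the support of the representations is left unchanged, consistent with the remark that column updates alone never change the support. The function name `ksvd_atom_update` is hypothetical.

```python
import numpy as np

def ksvd_atom_update(Y, D, X, k):
    """Replace atom k of D and its coefficients in X (both modified in place)."""
    users = np.flatnonzero(X[k, :])            # signals whose representation uses atom k
    if users.size == 0:
        return D, X                            # unused atom: left as is in this sketch
    # Representation error of those signals with atom k's contribution removed.
    E_k = Y[:, users] - D @ X[:, users] + np.outer(D[:, k], X[k, users])
    # The best rank-one approximation of E_k gives the new atom and coefficients.
    U, s, Vt = np.linalg.svd(E_k, full_matrices=False)
    D[:, k] = U[:, 0]                          # unit-norm by construction
    X[k, users] = s[0] * Vt[0, :]              # written only on the existing support
    return D, X

# Usage on synthetic data with a random sparse coefficient matrix.
rng = np.random.default_rng(1)
D_k = rng.standard_normal((8, 5))
D_k /= np.linalg.norm(D_k, axis=0)
X_k = rng.standard_normal((5, 40)) * (rng.random((5, 40)) < 0.4)
Y_k = rng.standard_normal((8, 40))
zero_mask = (X_k == 0)
err_before = np.linalg.norm(Y_k - D_k @ X_k)
D_k, X_k = ksvd_atom_update(Y_k, D_k, X_k, k=2)
err_after = np.linalg.norm(Y_k - D_k @ X_k)
print(err_before, err_after)
```

Since the previous atom and its coefficients are themselves a rank-one candidate, the error cannot increase. A full training loop would alternate a sparse coding pass (with any pursuit method that returns at most T_0 nonzeros) with a sweep of this update over all atoms.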
The invention is useful for a variety of applications in signal processing including but not limited to: compression, regularization in inverse problems, feature extraction, denoising, separation of texture and cartoon content in images, signal analysis, signal synthesis, inpainting and restoration.
Typically, all the training signals involved are from the same family and thus have common traits. For example, the signals can all be pictures, music, speech, etc.
In the following detailed description of various embodiments, reference is made to the accompanying drawings that form a part thereof, and in which are shown by way of illustration specific embodiments in which the invention may be practiced. It is understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the present invention.
In the present invention, we address the problem of designing dictionaries, and introduce the K-SVD algorithm for this task. We show how this algorithm can be interpreted as a generalization of the K-Means clustering process, and demonstrate its behavior in both synthetic tests and in applications on real data.
The present invention relates to a signal processing method adapted for sparse representation of signals and a system for implementing said method, said system comprising:
The training signals are typically from the same family and thus all training signals share common traits and have common behavior patterns. For example, all training signals can be pictures, including pictures of human faces, or the training signals can be sound files including music files, speeches, and the like.
The purpose of the dictionary of the present invention is to discover the common building blocks with which all the training signals can be represented. All the training signals can be represented by linear combinations of the dictionary atoms (building blocks). The term “atom” as referred to herein means dictionary atom or signal-atom.
In some cases, the building blocks or some of the building blocks of the training signals are known or can be approximated intuitively, while in other cases the invention helps to discover them.
According to a preferred embodiment, the dictionary is updated one atom at a time. It is possible however to also update the dictionary a group of atoms at a time, for example two or three atoms at a time, or defining the group of atoms to be updated containing any number of atoms.
In one embodiment of the present invention, the dictionary is an overcomplete dictionary. An overcomplete dictionary contains more atoms (building blocks, functions) than strictly necessary to represent the signal space. An overcomplete dictionary thus allows a suitable representation of a signal with fewer encoded atoms. This is important for applications in which a low bit rate is required.
Signals can be represented in many forms. In one embodiment of the present invention, the representation of each training signal is a coefficient matrix. The representation of the training signals may take any other form such as a vector.
There are many ways to generate a coefficient matrix representing the training signals. In one embodiment of the present invention, the generation of the coefficients matrix is achieved by a pursuit algorithm. The pursuit algorithm can include: Orthogonal Matching Pursuit, Matching Pursuit, Basis Pursuit, FOCUSS or any combination or variation thereof.
Updating the dictionary can be performed sequentially or in any other order. In yet another embodiment of the present invention, the dictionary is updated in a predefined order of the signal-atoms. Depending on the application used and the nature of the training signals, updating the dictionary in a predefined order of signal-atoms will yield different results and thus can be exploited by the application.
In another embodiment of the present invention, only selected signal-atoms of said dictionary are updated. Again, depending on the nature of the application in mind, one may decide to leave certain signal-atoms (building blocks) fixed, and consequently only update the remaining signal-atoms.
In some cases, it may happen that a dictionary is built wherein two signal-atoms are very similar, though not equal, to each other. For the purpose of the application used, the similarity may be too great, so that the difference between the two atoms is considered negligible. The application will thus wish to modify one of the similar atoms. In yet another embodiment of the present invention, a signal-atom is modified when the difference between said signal-atom and another signal-atom is below a predefined value.
A signal-atom may be defined by the system as a building block for representing the training signals, but the actual signal-atom may never be used to construct any of the given training signals. One may thus wish to modify this atom. In a further embodiment of the present invention, a signal-atom is modified when it is not used in any representation.
A signal-atom may be found to be used only rarely to construct training signals. It may thus be preferred not to work with such a building block, and to modify this atom to one used more frequently in the representation of the training signals. In one embodiment of the present invention, a signal-atom is modified when its usage frequency in the representation of the training signals is below a predefined value.
When updating the dictionary, either a single atom or a group of atoms at a time, there are many possibilities to define the best results for the atom values. In yet another embodiment of the present invention, updating the group of atoms and their coefficients best reduces the Mean Square Error (MSE).
In some cases, again depending on the nature of the application used and of the training signals, it may be desired to design dictionaries with one or more custom properties. For example, the dictionary can be shift-invariant. A system is shift-invariant if f(x−α,y−β)→g(x−α,y−β) for arbitrary α and β. Another embodiment of the invention may design a dictionary with non-negative dictionary values, wherein each atom contains only non-negative entries. Another option is to force zeros in predetermined places in the dictionary. It is possible to design the dictionary with any matrix structure. Multiscale dictionaries or zeros in predefined places are two examples of a structure, but any structure can be used depending on the nature of the training signals and the application in mind. A person skilled in the art will easily design other properties in the dictionary according to the training signals and the nature of the application. Such custom properties are all considered to be within the scope of the present invention.
In yet another embodiment of the present invention, multiscale dictionaries are built. An image, for example, can be defined using multiscale dictionaries, wherein each dictionary represents the image in a different size. Obviously, a smaller image will show fewer details than a bigger image.
The invention can be used for a variety of applications, including but not limited to: compression, regularization in inverse problems, feature extraction, denoising, separation of texture and cartoon content in images, signal analysis, signal synthesis, inpainting and restoration.
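As mentioned previously, the objective function of the K-SVD is
$$\min_{D,X} \|Y - DX\|_F^2 \quad \text{subject to} \quad \forall i,\ \|x_i\|_0 \le T_0. \quad (28)$$
Let us first consider the sparse coding stage, where we assume that D is fixed, and consider the optimization problem as a search for sparse representations with coefficients summarized in the matrix X. The penalty term can be rewritten as
$$\|Y - DX\|_F^2 = \sum_{i=1}^{N} \|y_i - D x_i\|_2^2.$$
Therefore the problem posed in (28) can be decoupled to N distinct problems of the form
$$\min_{x_i} \|y_i - D x_i\|_2^2 \quad \text{subject to} \quad \|x_i\|_0 \le T_0, \qquad i = 1, 2, \ldots, N. \quad (29)$$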
This problem is adequately addressed by the pursuit algorithms mentioned before, and we have seen that if T0 is small enough, their solution is a good approximation to the ideal one that is numerically infeasible to compute.
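The decoupled sparse coding stage can be illustrated with a minimal sketch of Orthogonal Matching Pursuit, one of the pursuit algorithms named above. The function names `omp` and `sparse_code`, and the stopping tolerance, are assumptions made for illustration; this is a textbook greedy variant, not necessarily the exact implementation used in the experiments.

```python
import numpy as np

def omp(D, y, T0):
    """Greedy OMP sketch: approximate y with at most T0 atoms of D."""
    x = np.zeros(D.shape[1])
    support = []
    coeffs = np.zeros(0)
    residual = y.astype(float).copy()
    for _ in range(T0):
        if np.linalg.norm(residual) < 1e-12:   # already represented exactly
            break
        # pick the atom most correlated with the current residual
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support:                        # no new atom helps
            break
        support.append(k)
        # least-squares fit of y on all atoms selected so far
        coeffs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coeffs
    if support:
        x[support] = coeffs
    return x

def sparse_code(D, Y, T0):
    """Solve the N decoupled problems, one per column of Y."""
    return np.column_stack([omp(D, Y[:, i], T0) for i in range(Y.shape[1])])
```

Each column of Y is coded independently, which is what makes the decoupling useful in practice: the N problems can be solved in any order or in parallel.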
We now turn to the second, and slightly more involved, process of updating the dictionary together with the nonzero coefficients. Assume that both X and D are fixed, and we put in question only one column in the dictionary, dk, and the coefficients that correspond to it, the k-th row in X, denoted as xkT (this is not the vector xk, which is the k-th column in X). Returning to the objective function (28), the penalty term can be rewritten as
$$\|Y - DX\|_F^2 = \Big\|Y - \sum_{j=1}^{K} d_j x_T^j\Big\|_F^2 = \Big\|\Big(Y - \sum_{j \ne k} d_j x_T^j\Big) - d_k x_T^k\Big\|_F^2 = \|E_k - d_k x_T^k\|_F^2. \quad (30)$$
We have decomposed the multiplication DX to the sum of K rank-1 matrices. Among those, K-1 terms are assumed fixed, and one—the k-th—remains in question. The matrix Ek stands for the error for all the N examples when the k-th atom is removed.
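The rank-1 decomposition just described can be checked numerically. The following sketch uses random matrices of arbitrary, assumed sizes; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, N = 6, 10, 15                      # arbitrary sizes for illustration
D = rng.standard_normal((n, K))
X = rng.standard_normal((K, N))
Y = rng.standard_normal((n, N))

# DX written as a sum of K rank-1 matrices: column d_j times row j of X
rank1_sum = sum(np.outer(D[:, j], X[j, :]) for j in range(K))

k = 3
# E_k: error for all N examples when the k-th atom is removed
E_k = Y - (D @ X - np.outer(D[:, k], X[k, :]))

full_error = np.linalg.norm(Y - D @ X, "fro") ** 2
split_error = np.linalg.norm(E_k - np.outer(D[:, k], X[k, :]), "fro") ** 2
```

Here `rank1_sum` equals `D @ X`, and `full_error` equals `split_error` up to floating-point rounding.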
Here, it would be tempting to suggest the use of the SVD (Singular Value Decomposition) to find alternative dk and xkT. The SVD finds the closest rank-1 matrix (in Frobenius norm) that approximates Ek, and this will effectively minimize the error as defined in (30). However, such a step will be a mistake, because the new vector xkT is very likely to be filled, since in such an update of dk we do not enforce the sparsity constraint.
A remedy to the above problem, however, is simple and also quite intuitive. Define wk as the group of indices pointing to examples {yi} that use the atom dk, i.e., those where xkT(i) is nonzero. Thus,
$$w_k = \{\, i \mid 1 \le i \le N,\ x_T^k(i) \ne 0 \,\}. \quad (31)$$
Define Ωk as a matrix of size N×|wk|, with ones on the (wk(i), i)-th entries, and zeros elsewhere. Multiplying xkR=xkTΩk shrinks the row vector xkT by discarding the zero entries, resulting in the row vector xkR of length |wk|. Similarly, the multiplication YRk=YΩk creates a matrix of size n×|wk| that includes a subset of the examples that are currently using the dk atom. The same effect happens with ERk=EkΩk, implying a selection of error columns that correspond to examples that use the atom dk.
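As a small illustration of the selection matrix, the sketch below builds Ωk explicitly for a made-up coefficient row and shows that multiplying by it is the same as plain column indexing. The example values are assumptions chosen for readability.

```python
import numpy as np

x_row = np.array([0.0, 2.0, 0.0, -1.0, 0.5])   # row k of X, with N = 5
omega = np.flatnonzero(x_row)                   # indices of examples using atom k
N = x_row.size

# Omega has size N x |w_k| with ones on the (w_k(i), i)-th entries
Omega = np.zeros((N, omega.size))
Omega[omega, np.arange(omega.size)] = 1.0

x_R = x_row @ Omega                             # shrunken row of length |w_k|

Y = np.arange(15.0).reshape(3, 5)               # toy signal matrix, n = 3
Y_R = Y @ Omega                                 # same as Y[:, omega]
```

In an implementation one would normally index with `omega` directly rather than form Ωk, since the matrix product only copies columns.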
With this notation, we can now return to (30) and suggest minimization with respect to both dk and xkT, but this time force the solution of xkT to have the same support as the original xkT. This is equivalent to the minimization of
$$\|E_k \Omega_k - d_k x_T^k \Omega_k\|_F^2 = \|E_k^R - d_k x_R^k\|_F^2, \quad (32)$$
and this time it can be done directly via SVD. Taking the restricted matrix ERk, SVD decomposes it to ERk=UΔVT. We define the solution for dk as the first column of U, and the coefficient vector xkR as the first column of V multiplied by Δ(1, 1). In this solution, we necessarily have that (i) the columns of D remain normalized; and (ii) the support of all representations either stays the same or gets smaller by possible nulling of terms.
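The single-atom update can be sketched as follows. The function name `update_atom` and the in-place update convention are assumptions for illustration; the steps follow the restricted-SVD description above.

```python
import numpy as np

def update_atom(D, X, Y, k):
    """Update atom k of D and its nonzero coefficients in X, in place."""
    omega = np.flatnonzero(X[k, :])            # examples that use atom k
    if omega.size == 0:
        return                                  # unused atom: nothing to fit
    # restricted error matrix: residual on those examples without atom k
    E_R = Y[:, omega] - D @ X[:, omega] + np.outer(D[:, k], X[k, omega])
    U, s, Vt = np.linalg.svd(E_R, full_matrices=False)
    D[:, k] = U[:, 0]                           # new atom, unit l2-norm
    X[k, omega] = s[0] * Vt[0, :]               # new coefficients, same support
```

Because the previous pair (dk, xkR) is itself a rank-1 candidate, the best rank-1 approximation returned by the SVD cannot increase the Frobenius error on the restricted columns.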
This algorithm has been herein named “K-SVD” to parallel the name K-Means. While K-Means applies K computations of means to update the codebook, the K-SVD obtains the updated dictionary by K SVD computations, each determining one column. A full description of the algorithm is given in
In the K-SVD algorithm we sweep through the columns and always use the most updated coefficients as they emerge from preceding SVD steps. Parallel versions of this algorithm can also be considered, where all updates of the previous dictionary are done based on the same X. Experiments show that while this version also converges, it yields an inferior solution, and typically requires more than 4 times the number of iterations. These parallel versions and variations are all encompassed by the present invention.
An important question that arises is: Will the K-SVD algorithm converge? Let us first assume we can perform the sparse coding stage perfectly, retrieving the best approximation to the signal y, that contains no more than T0 nonzero entries. In this case, and assuming a fixed dictionary D, each sparse coding step decreases the total representation error ∥Y−DX∥2F, posed in (28). Moreover, at the update step for dk, an additional reduction or no change in the MSE is guaranteed, while not violating the sparsity constraint. Executing a series of such steps ensures a monotonic MSE reduction, and therefore, convergence to a local minimum is guaranteed.
Unfortunately, the above statement depends on the success of pursuit algorithms to robustly approximate the solution to (29), and thus convergence is not always guaranteed. However, when T0 is small enough relative to n, the OMP, FOCUSS, and BP approximating methods are known to perform very well. While OMP can be naturally used to get a fixed and pre-determined number of non-zeros (T0), both BP and FOCUSS require some slight modifications. For example, in using FOCUSS to lead to T0 non-zeros, the regularization parameter should be adapted while iterating. In those circumstances the convergence is guaranteed. We can ensure convergence by external interference—by comparing the best solution using the already given support to the one proposed by the new run of the pursuit algorithm, and adopting the better one. This way we shall always get an improvement. Practically, we saw in all our experiments that a convergence is reached, and there was no need for such external interference.
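The external safeguard mentioned above can be sketched as a comparison between the best solution on the existing support and the new pursuit output. The function name `guarded_code` is hypothetical, and the least-squares refit on the old support is one reasonable reading of "the best solution using the already given support".

```python
import numpy as np

def guarded_code(D, y, x_old, x_new):
    """Keep whichever representation of y has the smaller error."""
    support = np.flatnonzero(x_old)
    x_ref = np.zeros_like(x_old, dtype=float)
    if support.size:
        # best coefficients on the already given support
        coeffs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        x_ref[support] = coeffs
    err_ref = np.linalg.norm(y - D @ x_ref)
    err_new = np.linalg.norm(y - D @ x_new)
    return x_new if err_new <= err_ref else x_ref
```

With this guard the per-signal error never increases during the sparse coding stage, regardless of how well the pursuit algorithm performs.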
When the model order T0=1, this case corresponds to the gain-shape VQ, and as such it is important, as the K-SVD becomes a method for its codebook training. When T0=1, the coefficient matrix X has only one nonzero entry per column. Thus, computing the error ERk in (32) yields
$$E_k^R = E_k \Omega_k = \Big(Y - \sum_{j \ne k} d_j x_T^j\Big)\Omega_k = Y \Omega_k = Y_k^R. \quad (33)$$
This is because the restriction Ωk takes only those columns in Ek that use the dk atom, and thus, necessarily, they use no other atoms, implying that for all j≠k, xTjΩk=0.
The implication of the above outcome is that the SVD in the T0=1 case is done directly on the group of examples in wk. Also, the K updates of the columns of D become independent of each other, implying that a sequential process as before, or a parallel one, both lead to the same algorithm.
We could further constrain our representation stage and, beyond the choice T0=1, limit the nonzero entries of X to be 1. This brings us back to the classical clustering problem as described earlier. In this case we have that xkR is filled with ones, thus xkR=1T. The K-SVD then needs to approximate the restricted error matrix ERk=YRk by a rank-1 matrix dk·1T. The solution is the mean of the columns of YRk, exactly as K-Means suggests.
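The reduction to K-Means can be verified numerically: fitting d in the approximation of YRk by d·1T is a least-squares problem whose solution is the column mean. The data below is random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
Y_R = rng.standard_normal((4, 7))          # examples assigned to one atom
ones = np.ones((Y_R.shape[1], 1))          # column of ones, i.e. (1^T)^T

# min_d ||Y_R - d 1^T||_F  is equivalent to solving  ones @ d^T = Y_R^T
d_T, *_ = np.linalg.lstsq(ones, Y_R.T, rcond=None)
d = d_T.ravel()

cluster_mean = Y_R.mean(axis=1)            # what K-Means would compute
```

The vectors `d` and `cluster_mean` agree up to rounding.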
Just like the K-Means, the K-SVD algorithm is susceptible to local minimum traps. Our experiments show that improved results can be reached if the following variations are applied:
(i) When using approximation methods with a fixed number of coefficients, we found that FOCUSS proves to be the best in terms of getting the best out of each iteration. However, from a run-time point of view, OMP was found to lead to a far more efficient overall algorithm.
(ii) When a dictionary element is not being used “enough” (relative to the number of dictionary elements and to the number of samples) it could be replaced with the least represented data element, after being normalized (the representation is measured without the dictionary element that is going to be replaced). Since the number of data elements is much larger than the number of dictionary elements, and since our model assumption suggests that the dictionary atoms are of equal importance, such replacement is very effective in avoiding local minima and over-fitting.
(iii) Similar to the idea of removal of unpopular elements from the dictionary, we found that it is very effective to prune the dictionary from having too-close elements. If indeed such a pair of atoms is found (based on their absolute inner product exceeding some threshold), one of those elements should be removed and replaced with the least-represented signal.
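Variations (ii) and (iii) can be sketched together. The function name `replace_weak_atoms`, the thresholds `min_uses` and `max_coherence`, and the choice to measure the representation error with the full current dictionary (a simplification of the rule stated in (ii)) are all assumptions made for illustration.

```python
import numpy as np

def replace_weak_atoms(D, X, Y, min_uses=1, max_coherence=0.99):
    """Return a copy of D with rarely used or near-duplicate atoms replaced."""
    D = D.copy()
    # per-signal squared error (simplification: all atoms included)
    errors = np.sum((Y - D @ X) ** 2, axis=0)
    for k in range(D.shape[1]):
        rarely_used = np.count_nonzero(X[k, :]) < min_uses
        # absolute inner products with the atoms already checked
        coherence = np.abs(D[:, :k].T @ D[:, k])
        too_close = coherence.size > 0 and coherence.max() > max_coherence
        if rarely_used or too_close:
            worst = int(np.argmax(errors))      # least-represented signal
            D[:, k] = Y[:, worst] / np.linalg.norm(Y[:, worst])
            errors[worst] = 0.0                 # do not pick it twice
    return D
```

The sketch assumes the least-represented signal is nonzero; a subsequent sparse coding pass is needed before the replaced atoms carry meaningful coefficients.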
Similarly to the K-Means, we can propose a variety of techniques to further improve the K-SVD algorithm. Appealing examples on this list are multi-scale approaches and tree-based training where the number of columns K is allowed to increase during the algorithm. All these variations, adaptations and improvements are encompassed by the present invention.
We have first tried the K-SVD algorithm on synthetic signals, to test whether this algorithm recovers the original dictionary that generated the data, and to compare its results with other reported algorithms.
Step 1—generation of the data to train on: A random matrix D (referred to later on as the generating dictionary) of size 20×50 was generated with iid uniformly distributed entries. Each column was normalized to a unit l2-norm. Then, 1500 data signals $\{y_i\}_{i=1}^{1500}$ of dimension 20 were produced, each created by a linear combination of 3 different generating dictionary atoms, with uniformly distributed iid coefficients in random and independent locations. White Gaussian noise with varying Signal to Noise Ratio (SNR) was added to the resulting data signals.
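The data generation of Step 1 might be reproduced along the following lines. The uniform range [-1, 1], the definition of SNR as the ratio of mean signal power to noise power, and the function name `make_synthetic` are assumptions, since the text does not specify them.

```python
import numpy as np

def make_synthetic(n=20, K=50, N=1500, T0=3, snr_db=20.0, seed=0):
    """Generate a random dictionary and noisy sparse signals (sketch)."""
    rng = np.random.default_rng(seed)
    D = rng.uniform(-1.0, 1.0, size=(n, K))       # range is an assumption
    D /= np.linalg.norm(D, axis=0)                # unit l2-norm columns
    X = np.zeros((K, N))
    for i in range(N):
        # T0 different atoms in random, independent locations
        support = rng.choice(K, size=T0, replace=False)
        X[support, i] = rng.uniform(-1.0, 1.0, size=T0)
    Y = D @ X
    if not np.isinf(snr_db):
        # white Gaussian noise scaled to the requested SNR
        noise_power = np.mean(Y ** 2) / 10 ** (snr_db / 10)
        Y = Y + rng.normal(0.0, np.sqrt(noise_power), size=Y.shape)
    return D, X, Y
```

Passing `snr_db=np.inf` gives the noiseless case.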
Step 2—applying the K-SVD: The dictionary was initialized with data signals. The coefficients were found using OMP with fixed number of 3 coefficients. The maximum number of iterations was set to 80.
Step 3—comparison to other reported works: we implemented the MOD algorithm, and applied it on the same data, using OMP with fixed number of 3 coefficients, and initializing in the same way. We executed the MOD algorithm for a total number of 80 iterations. We also executed the MAP-based algorithm of Rao and Kreutz-Delgado (Kreutz-Delgado et al., Dictionary learning algorithms for sparse representation. Neural Computation. 15(2):349-396, 2003). This algorithm was executed as is, therefore using FOCUSS as its decomposition method. Here, again, a maximum of 80 iterations were allowed.
Results: the computed dictionary was compared against the known generating dictionary. This comparison was done by sweeping through the columns of the generating dictionary, and finding the closest column (in l2 distance) in the computed dictionary, measuring the distance via
$$1 - |d_i^T \tilde{d}_i|, \quad (34)$$
where di is a generating dictionary atom, and $\tilde{d}_i$ is its corresponding element in the recovered dictionary. A distance less than 0.01 was considered a success. All trials were repeated 50 times, and the number of successes in each trial was computed. The results for the three algorithms and for noise levels of 10 dB, 20 dB, 30 dB and ∞ dB (no noise) are displayed in
We should note that for different dictionary size (e.g., 20×30) and with more executed iterations, the MAP-based algorithm improves and gets closer to the K-SVD detection rates.
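The recovery criterion of (34) can be sketched as below. The function name `count_recovered` is hypothetical; for unit-norm atoms, choosing the closest column up to sign is done here by maximizing the absolute inner product, which is an assumption about how the matching was carried out.

```python
import numpy as np

def count_recovered(D_true, D_learned, tol=0.01):
    """Count generating atoms matched by some learned atom within tol."""
    D_true = D_true / np.linalg.norm(D_true, axis=0)
    D_learned = D_learned / np.linalg.norm(D_learned, axis=0)
    overlap = np.abs(D_true.T @ D_learned)       # |d_i^T d~_j| for all pairs
    distances = 1.0 - overlap.max(axis=1)        # distance (34) to best match
    return int(np.sum(distances < tol))
```

The absolute value makes the measure insensitive to the sign of an atom, and taking the maximum over columns makes it insensitive to their order.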
Several experiments have been conducted on natural image data, trying to show the practicality of the algorithm of the invention and the general sparse coding theme. These preliminary tests prove the concept of using such dictionaries with sparse representations.
Training Data: The training data was constructed as a set of 11,000 examples of block patches of size 8×8 pixels, taken from a database of face images (in various locations). A random collection of 500 such blocks, sorted by their variance, is presented in
Removal of the DC: Working with real image data, we preferred that all dictionary elements except one have a zero mean. For this purpose, the first dictionary element, denoted as the DC, was set to include a constant value in all its entries, and was not changed afterwards. The DC takes part in all representations, and as a result, all other dictionary elements remain with zero mean during all iterations.
Running the K-SVD: We applied the K-SVD, training a dictionary of size 64×441. The choice K=441 came from our attempt to compare the outcome to the overcomplete Haar dictionary of the same size. The coefficients were computed using the OMP with fixed number of coefficients, where the maximal number of coefficients is 10. A better performance can be obtained by switching to FOCUSS. The test was conducted using OMP because of its simplicity and fast execution. The trained dictionary is presented in
Comparison Dictionaries: The trained dictionary was compared with the overcomplete Haar dictionary which includes separable basis functions, having steps of various sizes and in all locations (total of 441 elements). In addition, we built an overcomplete separable version of the DCT dictionary by sampling the cosine wave in different frequencies to yield a total of 441 elements. The overcomplete Haar dictionary is presented in
Applications. The K-SVD results were used, denoted here as the learned dictionary, for two different applications on images. All tests were performed on one face image which was not included in the training set. The first application is filling-in missing pixels: random pixels in the image were deleted, and their values were filled using the various dictionaries decomposition. Then the compression potential of the learned dictionary decomposition was tested, and a rate-distortion graph was presented. These experiments will be described in more detail hereafter.
One random full face image was chosen, which consists of 594 non-overlapping blocks (none of which were used for training). For each block, the following procedure was conducted for r in the range {0.2, 0.9}:
(i) A fraction r of the pixels in each block, in random locations, were deleted (set to zero).
(ii) The coefficients of the corrupted block under the learned dictionary, the overcomplete Haar dictionary, and the overcomplete DCT dictionary were found using OMP with an error bound of ∥0.02·1∥2, where 1∈Rn is a vector of all ones (the input image is scaled to the dynamic range [0, 1], so this allows an error of ±5 gray-values in 8-bit images). All projections in the OMP algorithm included only the non-corrupted pixels, and for this purpose, the dictionary elements were normalized so that the non-corrupted indices in each dictionary element have a unit norm. The resulting coefficient vector of the block B is denoted xB.
(iii) The reconstructed block $\tilde{B}$ was chosen as $\tilde{B} = D \cdot x_B$.
(iv) The reconstruction error was set to $\sqrt{\|B - \tilde{B}\|_F^2 / 64}$ (64 is the number of pixels in each block). The mean reconstruction errors (for all blocks and all corruption rates) were computed, and are displayed in
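Steps (ii) and (iii) of the filling-in procedure can be sketched as a masked, error-bounded OMP. The function name `inpaint_block`, the `max_atoms` cap, and the way coefficients are mapped back to the unnormalized dictionary are assumptions for illustration, not the exact code used in the experiments.

```python
import numpy as np

def inpaint_block(D, block, mask, err_bound, max_atoms=None):
    """Fill in a flattened block; mask is True at non-corrupted pixels."""
    D_m = D[mask, :]                              # rows of surviving pixels
    norms = np.linalg.norm(D_m, axis=0)
    norms[norms == 0] = 1.0                       # guard against empty atoms
    D_m = D_m / norms                             # unit norm on surviving pixels
    y = block[mask].astype(float)
    support, coeffs = [], np.zeros(0)
    residual = y.copy()
    limit = max_atoms or int(mask.sum())
    while np.linalg.norm(residual) > err_bound and len(support) < limit:
        k = int(np.argmax(np.abs(D_m.T @ residual)))
        if k in support:
            break
        support.append(k)
        coeffs, *_ = np.linalg.lstsq(D_m[:, support], y, rcond=None)
        residual = y - D_m[:, support] @ coeffs
    x = np.zeros(D.shape[1])
    if support:
        x[support] = coeffs / norms[support]      # undo the renormalization
    return D @ x                                  # reconstruction on all pixels
```

For an 8×8 block scaled to [0, 1], the error bound described in step (ii) would be `np.linalg.norm(0.02 * np.ones(64))`.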
A compression comparison was conducted between the overcomplete learned dictionary, the overcomplete Haar dictionary, and the overcomplete DCT dictionary (as described before), all of size 64×441. In addition, a comparison was made to the regular (unitary) DCT dictionary (used by the JPEG algorithm). The resulting rate-distortion graph is presented in
In each test we set an error goal ε, and fixed the number of bits-per-coefficient Q. For each such pair of parameters, all blocks were coded in order to achieve the desired error goal, and the coefficients were quantized to the desired number of bits (uniform quantization, using upper and lower bounds for each coefficient in each dictionary based on the training set coefficients). For the overcomplete dictionaries, the OMP coding method was used. The rate value was defined as
where
with the same notation as before.
By sweeping through various values of ε and Q we get per each dictionary several curves in the R-D plane.
Although the invention has been described in detail, nevertheless changes and modifications, which do not depart from the teachings of the present invention, will be evident to those skilled in the art. Such changes and modifications are deemed to come within the purview of the present invention and the appended claims.
Number | Date | Country | |
---|---|---|---|
60668277 | Apr 2005 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 11910568 | Oct 2007 | US |
Child | 13425142 | US |