This application claims priority from “HIPAD—A fast Hybrid Interior-Point Alternating Directions algorithm for SVM and feature selection”, U.S. Provisional Application No. 61/536,092 of Qin, et al., filed Sep. 19, 2011, the contents of which are herein incorporated by reference in their entirety.
This disclosure is directed to methods for classifying data sets with large feature dimensions in machine learning applications.
Classification tasks on data sets with large feature dimensions are very common in real-world machine learning applications. Typical examples are microarray data for gene selection and text documents for topic classification. Despite the large number of features present in these data sets, usually only small subsets of the features are relevant to the particular learning task, and local correlations among the features are often observed. Hence, feature selection is required for good model interpretability. However, when data is scarce, it is crucial that little prediction accuracy be lost to errors incurred during optimization in the training phase. Data scarcity is common in biomedical applications, e.g., fMRI data and gene expression data for specific tumors. Another typical case is training classifiers at low levels of a class hierarchy in a hierarchical classification setting, where the training set is small because the labels are very specific. In addition, whenever prior domain knowledge is available, it should be used as much as possible to improve prediction accuracy.
The above considerations suggest three desired features for a learning algorithm: (1) The algorithm should be able to automatically select features/variables, which may be grouped and highly correlated. (2) The algorithm should be able to optimize efficiently and with high accuracy in the training phase. (3) The learning model should be flexible enough that domain knowledge can be easily incorporated. Existing methods are known that individually satisfy the above requirements. To handle local correlation among groups of features, elastic-net regularization has been successfully applied to support vector machines (ENSVM); however, incorporating extra domain knowledge into the ENSVM is challenging. An alternating direction method of multipliers (ADMM) has been proposed for the elastic-net SVM that can quickly find an approximate solution to the ENSVM, but ADMM is known to be slow to reach high accuracy. Interior-point methods (IPM) for the SVM are known to achieve high accuracy in their solutions with a polynomial iteration complexity, and the dual SVM formulation is independent of the feature space dimensionality. However, because a Newton system must be solved in each iteration, the efficiency of IPM quickly deteriorates with increasing feature dimension.
Exemplary embodiments of the invention as described herein generally are directed to classification tasks in the regime of scarce labeled training data in high dimensional feature space, where specific expert knowledge is also available. A hybrid learning algorithm according to an embodiment of the invention solves an elastic-net support vector machine (ENSVM) through an alternating direction method of multipliers (ADMM) in a first phase, followed by an interior-point method (IPM) for a standard SVM in a second phase. An algorithm according to an embodiment of the invention addresses the challenges of automatic feature selection, high optimization accuracy, and algorithmic flexibility for knowledge incorporation. Embodiments of the invention have been compared with existing methods on a collection of synthetic and real-world data.
According to an aspect of the invention, there is provided a method for training a classifier for selecting features in sparse data sets with high feature dimensionality, including providing a set of N data items x and associated labels y, each data item being a vector in Rm where m>>1, minimizing a functional of the data items x and associated labels y
formed from a program
subject to
to solve for hyperplane w and offset b of a classifier using an alternating direction method of multipliers that successively approximates w and b, auxiliary variables a and c, and multiplier vectors γ1 and γ2, wherein λ1, λ2, μ1, and μ2 are predetermined constants, e is a unit vector, and X and Y are respective matrix representations of the data items x and their associated labels y, and providing non-zero elements of the hyperplane vector w and those components of X and Y that correspond to the non-zero elements of the hyperplane vector w as arguments to a convex quadratic program solver that uses an interior point method to solve for hyperplane vector w and offset b, wherein the hyperplane vector w and offset b define a classifier that can associate each data item x with the correct label y.
According to a further aspect of the invention, the alternating direction method of multipliers includes, at each iteration k, updating (wk, bk) by solving
updating a from
wherein
updating c from
wherein Tλ(ω)=sgn(ω)max {0,|ω|−λ}, updating γ1 from γ1k+1←γ1k+μ1(e−Y(Xwk+1+bk+1e)−ak+1), and updating γ2 from γ2k+1←γ2k+μ2(wk+1−ck+1), where said updates are repeated until
wherein εtol is a predetermined constant.
According to a further aspect of the invention, the linear system
is solved using a preconditioned conjugate gradient method.
According to a further aspect of the invention, the convex quadratic program solver minimizes a quadratic program ½wTw+ceTξ with respect to w, b, s, and ξ subject to
wherein c is a predetermined penalty parameter for ξ, and yi and xi are components of
According to a further aspect of the invention, the convex quadratic program solver minimizes a quadratic program ½αTQα−eTα that is a dual of ½wTw+ceTξ, wherein ½αTQα−eTα is minimized with respect to α subject to
and c is a predetermined penalty parameter for ξ.
According to another aspect of the invention, there is provided a method for training a classifier for selecting features in sparse data sets with high feature dimensionality that incorporates prior domain knowledge, including providing a set of data items x and associated labels y, each data item being a vector in Rm where m>>1, a set of positive class constraints Bx≦d⇒wTx+b≧1, wherein BεRk1×m and dεRk1, and a set of negative class constraints Dx≦g⇒wTx+b≦−1, wherein DεRk2×m and gεRk2, minimizing a functional of the data items x and associated labels y
to solve for hyperplane w and offset b of a classifier using an alternating direction method of multipliers that successively approximates w and b, constraint variables u and v, auxiliary variables a, c, p and q, and Lagrangian multipliers γ1, γ2, γ3, γ4, γ5, and γ6, wherein λ1, λ2, ρ1, ρ2, ρ3, ρ4, μ1, μ2, μ3, and μ4 are predetermined constants, e is a unit vector, and X and Y are respective matrix representations of the data items x and their associated labels y, and providing non-zero elements of the hyperplane vector w, those components of X and Y that correspond to the non-zero elements of the hyperplane vector w, and ηu0≡dTu−b+1, ηv0≡gTv+b+1 as arguments to a convex quadratic program solver that uses an interior point method to solve for hyperplane vector w and offset b, wherein the hyperplane vector w and offset b define a classifier that can associate each data item x with the correct label y.
According to a further aspect of the invention, updating (wk, bk) comprises solving
where
κ1=λ2+μ2+ρ1+ρ3,
κ2=μ3+μ4,
rw=XTYγ1k+μ1XTY(e−ak)−γ2k+μ2ck+ρ3DTvk−ρ1BTuk,
and
rb=eTYγ1k+μ1eTY(e−ak)+γ3k+μ3(dTuk+1−qk)−γ4k−μ4(gTvk+1−pk).
According to a further aspect of the invention,
is solved using a preconditioned conjugate gradient method.
According to a further aspect of the invention, updating constraint variables u and v comprises solving (ρ1BBT+μ3ddT+μ5I)uk+1/2=ru, wherein ru=−ρ1Bwk+1+μ3dbk+1+μ3d(qk−1)−dγ3k+γ5+μ5sk, and (ρ3DDT+μ4ggT+μ6I)vk+1=rv, where rv=ρ3Dwk+1−μ4gbk+1−gγ4k−μ4g(1−pk)+γ6+μ6tk, and slack variables s and t are updated according to
According to a further aspect of the invention, the auxiliary variables a, c, p and q are updated according to
wherein
and Tλ(ω)=sgn(ω)max {0,|ω|−λ}.
According to a further aspect of the invention, the Lagrangian multipliers γ1, γ2, γ3, γ4, γ5, and γ6 are updated according to γ1k+1←γ1k+μ1(e−Y(Xwk+1+bk+1e)−ak+1), γ2k+1←γ2k+μ2(wk+1−ck+1), γ3k+1←γ3k+μ3(dTuk+1−bk+1−qk+1+1), γ4k+1←γ4k+μ4(gTvk+1+bk+1−pk+1+1), γ5k+1←γ5k+μ5(sk+1−uk+1), and γ6k+1←γ6k+μ6(tk+1−vk+1).
According to a further aspect of the invention, the convex quadratic program solver minimizes a quadratic program ½βTHβ+fTβ with respect to β subject to
wherein β=(w, b, ξ, u, v, ηu, ηv)εRM, M=m+N+k1+k2+3, f=(0, 0, ceT, 0, 0, ρ2, ρ4),
and the values of xl and xu are −∞ and +∞, respectively, for w, and xl=0, xu=∞ for the other components of β.
According to another aspect of the invention, there is provided a non-transitory program storage device readable by a computer, tangibly embodying a program of instructions executable by the computer to perform the method steps for training a classifier for selecting features in sparse data sets with high feature dimensionality that incorporates prior domain knowledge.
Exemplary embodiments of the invention as described herein generally include systems and methods for hybrid interior-point alternating directions for support vector machines and feature selection. Accordingly, while the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that there is no intent to limit the invention to the particular forms disclosed, but on the contrary, the invention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.
Mathematical Notation: All matrices are expressed in uppercase Latin letters. Lowercase Latin letters in bold denote vectors, and those in plain typeface denote scalars. Greek letters can be either vectors or scalars depending on the context. The sample-label pairs are denoted by (xi, yi), i=1, . . . , N, where N is the number of samples, and xiεRm. xi is the i-th entry in a vector x.
According to an embodiment of the invention, for data sets with many features, the high-dimensional feature space still poses a computational challenge for interior-point methods (IPM). Fortunately, many data sets of this kind, such as text documents, are sparse, and the resulting classifier w is also expected to be sparse, i.e., only a small subset of the features are expected to carry significant weights in classification. Ideally, then, the IPM should train a classifier only on the most important features. According to an embodiment of the invention, a two-phase algorithm can appropriately shrink the feature space to leverage the high accuracy of IPM while maintaining efficiency. Specifically, a first phase of an algorithm according to an embodiment of the invention can solve an elastic-net support vector machine (ENSVM), or doubly-regularized SVM (DrSVM). The elastic-net regularization performs feature selection with a grouping effect and has been shown to be effective on data sets with many but sparse features and high local correlations. This is the case for text classification data, microarray gene expression data, and fMRI data sets. The support of the weight vector w usually stabilizes well before the algorithm converges to the optimal solution. Taking advantage of that stabilization, the first phase of the hybrid algorithm can be terminated early, and a second phase can solve a standard SVM on the reduced feature set using a solver that implements an IPM.
1.1. ENSVM and ADMM
A support vector machine (SVM) can be written in the regularized regression form as:
where the first term is an averaged sum of the hinge losses and the second term can be viewed as a ridge regularization on w. One may see from this form that a standard SVM does not enforce sparsity in the solution, and w is expected to be completely dense. The ENSVM adds an L1 regularization on top of the ridge regularization term, giving:
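For illustration, a standard L2-regularized SVM in this regularized regression form may be written as follows, where the placement of the regularization constant λ2 is an assumption made here for concreteness:

\min_{w,b} \; \frac{1}{N}\sum_{i=1}^{N}\bigl[1 - y_i(w^T x_i + b)\bigr]_+ \;+\; \frac{\lambda_2}{2}\|w\|_2^2 ,

where [t]_+ = \max\{0, t\} denotes the hinge loss.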
The resulting regularization
is the same as that used in elastic-net regression, which can select highly correlated features in groups (i.e., the grouping effect) while still enforcing sparsity in the solution. This is useful for the classification of text documents, a task that is common in the hierarchical classification setting. Adopting the elastic-net regularization as in EQ. (2) confers the same benefit on SVM classifiers during training.
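By way of illustration, an elastic-net SVM consistent with this description, with the constant placement again stated as an assumption, has the form

\min_{w,b} \; \frac{1}{N}\sum_{i=1}^{N}\bigl[1 - y_i(w^T x_i + b)\bigr]_+ \;+\; \lambda_1\|w\|_1 \;+\; \frac{\lambda_2}{2}\|w\|_2^2 ,

so that the combined penalty \lambda_1\|w\|_1 + \tfrac{\lambda_2}{2}\|w\|_2^2 is exactly the elastic-net penalty.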
To approximately solve EQ. (2), embodiments of the invention use hybrid methods that efficiently combine an alternating directions method of multipliers (ADMM) with an IPM. ADMM is an extreme case of the inexact augmented Lagrangian (IAL) method for the structured unconstrained system
where both functions f( ) and g( ) are convex. These two functions can be decoupled by introducing an auxiliary variable y to convert EQ. (3) into an equivalent constrained formulation:
This technique is often called variable-splitting. Here, x and y are used as generic variables. An IAL method minimizes in each iteration the augmented Lagrangian of EQ. (4)
approximately, followed by an update to the Lagrange multiplier γ←γ−μ(Ax−y). The IAL method is guaranteed to converge to the optimal solution of EQ. (3), as long as the augmented Lagrangian is approximately minimized with increasing accuracy. ADMM can be viewed as a practical implementation of IAL, in which the approximate minimization of L(x, y, γ) is performed by alternating once between minimization with respect to x and minimization with respect to y. The general form of an ADMM according to an embodiment of the invention is stated in Algorithm 1, with reference to the steps of the flowchart of
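Written out with a generic penalty parameter μ, and with the sign convention matching the multiplier update above, the variable splitting and the augmented Lagrangian take the form

\min_{x} \; f(x) + g(Ax) \;\;\Longleftrightarrow\;\; \min_{x,y} \; f(x) + g(y) \;\;\text{s.t.}\;\; Ax = y ,

L(x, y, \gamma) = f(x) + g(y) - \gamma^T(Ax - y) + \frac{\mu}{2}\|Ax - y\|_2^2 ,

and one ADMM pass computes x^{k+1} ← argmin_x L(x, y^k, γ^k), then y^{k+1} ← argmin_y L(x^{k+1}, y, γ^k), then γ^{k+1} ← γ^k − μ(Ax^{k+1} − y^{k+1}).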
It can be shown that ADMM converges for the case of two-way splitting as in Algorithm 1:
Theorem 1. Consider EQ. (3), where both f and g are proper, closed, convex functions, and AεRn×m has full column rank. Then, starting with an arbitrary μ>0 and x0, y0εRm, the sequence {xk} generated by Algorithm 1 converges to a solution x* of EQ. (3), if EQ. (3) has a solution. If EQ. (3) does not have a solution, at least one of the sequences {xk} and {yk} diverges.
A variable-splitting and ADMM approach according to an embodiment of the invention applied to EQ. (2) uses auxiliary variables (a, c) and linear constraints so that the non-smooth hinge loss and L1-norm in the objective function are de-coupled, allowing optimization over each single auxiliary variable. Specifically, EQ. (2) can be transformed into an equivalent form:
The augmented Lagrangian,
is then alternately minimized with respect to (w, b), a, and c in each iteration, followed by an update to the Lagrange multipliers γ1 and γ2. The original formulation is thus decomposed into three subsystems: computing the proximal operator of the hinge loss function (with respect to a), solving a special linear system (with respect to (w, b)), and performing a soft-thresholding operation (with respect to c), all of which can be performed efficiently. In particular, the linear system at Line 3 of Algorithm 1 can be solved by a small number of preconditioned conjugate gradient (PCG) iterations; however, other methods as are known in the art may also be used, such as a Cholesky decomposition or an LU factorization.
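By way of illustration, with auxiliary variables a and c tied to the residuals e−Y(Xw+be)−a and w−c that drive the updates of the multipliers γ1 and γ2, the split formulation and its augmented Lagrangian take roughly the following shape; the exact constant placements are assumptions:

\min_{w,b,a,c} \; \frac{1}{N}\sum_{i=1}^{N}[a_i]_+ + \lambda_1\|c\|_1 + \frac{\lambda_2}{2}\|w\|_2^2 \quad\text{s.t.}\quad a = e - Y(Xw + be), \;\; c = w ,

L = \frac{1}{N}\sum_{i=1}^{N}[a_i]_+ + \lambda_1\|c\|_1 + \frac{\lambda_2}{2}\|w\|_2^2 + \gamma_1^T\bigl(e - Y(Xw + be) - a\bigr) + \frac{\mu_1}{2}\bigl\|e - Y(Xw + be) - a\bigr\|_2^2 + \gamma_2^T(w - c) + \frac{\mu_2}{2}\|w - c\|_2^2 .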
Before presenting a two-phase algorithm according to an embodiment of the invention for bottom-level training in Algorithm 2, the following operators are defined:
Both of these operators apply component-wise to the input vector w.
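As an illustrative sketch, a minimal C++ rendering of two such component-wise operators is given below: the soft-thresholding operator Tλ(ω)=sgn(ω)max{0,|ω|−λ} used for the c-update, and a hinge-loss proximal operator in its usual closed form, which is assumed here to be the second operator (with the threshold, e.g. 1/(Nμ1) in the a-update, supplied by the caller).

#include <algorithm>
#include <cmath>
#include <vector>

// Soft-thresholding T_lambda(w) = sgn(w) * max(0, |w| - lambda), component-wise.
std::vector<double> soft_threshold(const std::vector<double>& w, double lambda) {
    std::vector<double> out(w.size());
    for (std::size_t i = 0; i < w.size(); ++i) {
        double mag = std::max(0.0, std::fabs(w[i]) - lambda);
        out[i] = (w[i] > 0.0 ? mag : -mag);
    }
    return out;
}

// Proximal operator of t*max(0, .), component-wise:
// argmin_a  t*max(0, a) + 0.5*(a - v)^2  for each entry v_i.
// (Assumed form of the second operator; t plays the role of the threshold.)
std::vector<double> hinge_prox(const std::vector<double>& v, double t) {
    std::vector<double> out(v.size());
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (v[i] > t)        out[i] = v[i] - t;   // loss active, shrink by t
        else if (v[i] < 0.0) out[i] = v[i];       // loss inactive, keep value
        else                 out[i] = 0.0;        // clamp the middle range to zero
    }
    return out;
}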
An exemplary, non-limiting alternating direction algorithm for training an elastic-net support vector machine is as follows, with reference to the steps in the flowchart of
In line 3, above, the N appearing in the lower right of the matrix on the left-hand side of the linear system is a matrix with N's in the main diagonal, where N is the number of samples. Note that Theorem 1 does not apply to a general k-wise update of the variables, where k>2. The above application of ADMM to ENSVM according to an embodiment of the invention may first appear to employ a three-split of the variables, but the updates of a and c are actually independent from each other, given (w, b). Hence, these two updates can be viewed as a single update, performed simultaneously, leading to a two-way update of the variables (w, b) and (a, c), and Theorem 1 still applies in this case.
1.2 A Two-Phase Algorithm
A first phase of an algorithm according to an embodiment of the invention can appropriately reduce the feature space dimensionality without impacting the final prediction accuracy. As discussed above, the suitability of ADMM for the first phase depends on the support of the feature vector converging quickly. Here, this is demonstrated empirically.
An exemplary data set has 50 samples with 300 features each. An ADMM according to an embodiment of the invention converged in 558 iterations. The output classifier w contains only 13 non-zero features, and the feature support converged in fewer than 50 iterations. Hence, one should monitor the change in the signs and indices of the support and terminate the first phase promptly. An embodiment of the invention uses the following criterion to monitor the relative change in the iterates as a surrogate for the change in the support:
Thus, in the algorithm of
Upon starting a second phase of an algorithm according to an embodiment of the invention, the IPM should warm-start from the corresponding sub-vector of the ADMM solution. In the second phase, the IPM is applied to solve a standard SVM with L2-regularization, rather than the ENSVM used in the first phase. Although the ENSVM can be reformulated as a quadratic program (QP), the size of that system is larger than that of a standard SVM because of the additional linear constraints introduced by the L1-norm. In addition, since the feature support has been at least approximately identified in the first phase of the algorithm, enforcing sparsity in the reduced feature space becomes less critical.
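By way of illustration, the phase transition and the hand-off to the IPM can be sketched in C++ as follows; the exact form of the criterion of EQ. (7) and the zero tolerance are assumptions made here for concreteness.

#include <cmath>
#include <vector>

// Assumed form of the Phase-1 stopping test: relative change of the ADMM iterate.
bool support_stabilized(const std::vector<double>& w_new,
                        const std::vector<double>& w_old, double eps_tol) {
    double diff = 0.0, norm_old = 0.0;
    for (std::size_t i = 0; i < w_new.size(); ++i) {
        diff += (w_new[i] - w_old[i]) * (w_new[i] - w_old[i]);
        norm_old += w_old[i] * w_old[i];
    }
    return std::sqrt(diff) <= eps_tol * std::max(1.0, std::sqrt(norm_old));
}

// Indices of the non-zero entries of the ADMM solution; the columns of X and the
// entries of w_ADMM at these indices are what Phase 2 receives as its warm start.
std::vector<int> feature_support(const std::vector<double>& w, double zero_tol = 1e-8) {
    std::vector<int> idx;
    for (std::size_t i = 0; i < w.size(); ++i)
        if (std::fabs(w[i]) > zero_tol) idx.push_back(static_cast<int>(i));
    return idx;
}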
An exemplary, non-limiting hybrid interior point and alternating direction algorithm for bottom-level training of a classifier is as follows, with reference to the steps in the flowchart of
Note that in embodiments of the invention, the ADMM-ENSVM of Phase 1 will use the stopping criteria of EQ. (7) rather than looping over all k. According to embodiments of the invention, the system being solved in Phase 2 of Algorithm 3 is much smaller than the system being solved in Phase 1. An IPM according to an embodiment of the invention uses second order derivative information, so that it can converge much faster than a method using first order derivatives. However, an IPM needs to store a Hessian matrix of second derivatives, which requires n2 storage elements for an output vector of n elements. If the full vector output by Phase 1 is passed to an IPM solver according to an embodiment of the invention, the storage required for the Hessian matrix by the IPM solver would be prohibitive. Also, the solution of the Newton system of equations would be time consuming (order n2 in the number of features). However, the number of non-zero elements of the wADMM output by Phase 1 would typically be several orders of magnitude less than the full vector, and thus an IPM solver according to an embodiment of the invention would be able to handle the storage requirements for the non-zero elements.
An exemplary, non-limiting method according to an embodiment of the invention for solving an SVM using IPM makes use of an open source, publicly available QP solver known as OOQP, a software package for convex quadratic programming that is disclosed in E. Gertz, S. Wright, “Object-oriented software for quadratic programming”, ACM Transactions on Mathematical Software (TOMS), 29(1):58-81, 2003, the contents of which are herein incorporated by reference in their entirety. The OOQP Package is an object-oriented implementation of a primal-dual interior-point algorithm in C++ for the general convex QP system:
In a HIPAD method according to an embodiment of the invention, the SVM is formulated as a linearly equality-constrained convex QP. Hence, the first set of constraints in EQ. (8) can be ignored.
The primal formulation of an SVM (PSVM) according to an embodiment of the invention is
where c is a user-defined penalty parameter for ξ. The Hessian H from EQ. (8) is simply an augmented identity padded with zeros at the diagonal entries corresponding to (b, ξ, s).
The dual (DSVM) of the PSVM of EQ. (9) is
where Qij=yiyjxiTxj=
where SV is the set of sample indices corresponding to the support vectors. The optimal bias term b can be computed from the complementary slackness condition
αi(yi(xiTw+b)−1+ξi)=0.
Whether to solve EQ. (9) (SVM-P) or EQ. (10) (SVM-D) for a given data set depends on its dimensions as well as sparsity. If X is a sparse matrix, Q in (SVM-D) is still likely to be dense, whereas the Hessian matrix in (SVM-P) is the identity. The primal formulation (SVM-P), however, has a larger variable dimension and more constraints. In the context of the classification tasks considered herein, since IPM is used in the second phase, the number of non-zero features (m) is relatively small, and the number of observations (N) may be assumed small too. When m<N or X is sparse, embodiments of the invention may solve (SVM-P). For dense systems with m≧N, embodiments of the invention may solve (SVM-D), since computing Q=
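For reference, primal and dual SVM quadratic programs consistent with the objectives and constraints recited above may be written as

(SVM-P) \quad \min_{w,b,\xi} \; \tfrac{1}{2}w^T w + c\,e^T\xi \quad\text{s.t.}\quad y_i(x_i^T w + b) \ge 1 - \xi_i, \;\; \xi \ge 0 ,

(SVM-D) \quad \min_{\alpha} \; \tfrac{1}{2}\alpha^T Q \alpha - e^T\alpha \quad\text{s.t.}\quad y^T\alpha = 0, \;\; 0 \le \alpha \le c e , \qquad Q_{ij} = y_i y_j x_i^T x_j ,

with w = \sum_{i\in SV} \alpha_i y_i x_i and, for any index i with 0 < \alpha_i < c (so that \xi_i = 0), b = y_i - x_i^T w. The exact variable layout of EQS. (9) and (10), e.g. the slack variable s mentioned above, is an additional detail not captured in this sketch.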
Often, one has prior domain knowledge for specific classification tasks. Domain knowledge is most helpful when the training data does not form a comprehensive representation of the underlying unknown population, resulting in poor generalization performance of SVM on the unseen data from the same population. This often arises in situations where labeled training samples are scarce, while there is an abundance of unlabeled data. A simple two-dimensional example is used to illustrate this idea in
The optimal separating hyperplane obtained from SVM for the training points 63, 64 is clearly the line with equal distance to the two axes, the solid line 61 in
For high dimensional data, ENSVM performs feature selection during training to produce a simpler model and to achieve better prediction accuracy. However, the quality of the feature selection depends entirely on the training data. In cases like the one described above, the feature support identified by ENSVM may not form a good representation of the population. Hence, when domain knowledge about certain features is available, it should be taken into consideration during the training phase and the relevant features in the feature support should then be deemed important for classification.
Embodiments of the invention can incorporate domain knowledge in the form of class-membership information associated with features. In the context of the previous example, such knowledge could be implication relationships as follows: Every data sample x with feature x2 greater than 2 should be classified as negative, and every sample x which lies on the negative side of the hyperplane
should be classified to the positive class. Such information can be incorporated (or such rules enforced) in an SVM by adding equivalent linear constraints to the SVM QP formulation (KSVM). To be specific, the above information can be modeled with the linear implication
Bx≦d⇒wTx+b≧1, (11)
where BεRk1×m and dεRk1, and k1 is the number of positive class constraints. The implication of EQ. (11) can be represented by the following set of linear equations in u:
BTu+w=0,dTu−b+1≦0,u≧0 (12)
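The effect of the constraints of EQ. (12) can be checked directly: for any u satisfying EQ. (12) and any x with Bx≦d,

w^T x + b \;=\; -u^T B x + b \;\ge\; -u^T d + b \;\ge\; 1 ,

where the first inequality uses u≧0 together with Bx≦d, and the second uses dTu−b+1≦0. Every x in the knowledge region is therefore forced onto the positive side of the margin.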
For the sake of completeness, embodiments of the invention can consider the following linear implication for the negative class membership:
Dx≦g⇒wTx+b≦−1, DεRk2×m, gεRk2, (13)
where k2 is the number of negative class constraints, which can be similarly represented by the set of linear equations in v:
DTv−w=0,gTv+b+1≦0,v≧0. (14)
Hence, to incorporate the domain knowledge represented by EQS. (11) and (13) into an SVM according to an embodiment of the invention, one can add the linear constraints of EQS. (12) and (14) to EQ. (9) (SVM-P):
If the data from the two classes is not linearly separable with the augmented constraint set, one can modify EQ. (9) (SVM-P) as follows to form a soft-margin formulation:
The two L1-norm terms in the objective function come from the relaxation of the equality constraints (e.g. BTu+w=0) to the corresponding box constraints (e.g. −zu≦BTu+w≦zu). Compared to EQ. (9) (SVM-P), this formulation, with the L1 terms expressed as box constraints, increases both the variable dimension and the number of linear constraints by at least 2m. This is clearly undesirable when m is large, which is the scenario considered herein.
Note that if the quadratics ∥BTu+w∥22 and ∥DTv−w∥22 are penalized instead of their L1 counterparts, the above optimization formulation becomes a smaller convex QP. This is because the L1 norm, being a non-differentiable sum of absolute values, must be handled through auxiliary box constraints, which effectively doubles the number of constraints. Hence, embodiments of the invention consider the following primal formulation for domain knowledge incorporation (KSVM-P):
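By way of illustration, one formulation consistent with the quadratic penalties just described, with the variable vector β=(w, b, ξ, u, v, ηu, ηv) and the linear objective coefficients f=(0, 0, ceT, 0, 0, ρ2, ρ4) used below, is the following sketch; it is a plausible reconstruction rather than necessarily the exact form of EQ. (17):

\min_{w,b,\xi,u,v,\eta_u,\eta_v} \; \tfrac{1}{2}w^T w + c\,e^T\xi + \tfrac{\rho_1}{2}\|B^T u + w\|_2^2 + \rho_2\,\eta_u + \tfrac{\rho_3}{2}\|D^T v - w\|_2^2 + \rho_4\,\eta_v

\text{s.t.}\quad Y(Xw + be) \ge e - \xi, \quad d^T u - b + 1 \le \eta_u, \quad g^T v + b + 1 \le \eta_v, \quad \xi, u, v, \eta_u, \eta_v \ge 0 .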
Embodiments of the invention can combine ENSVM and KSVM and solve the combined system in a HIPAD framework. This combination can exploit domain knowledge to improve the feature selection, and hence the generalization performance, of HIPAD.
3.1 ADMM Phase
Embodiments of the invention can use ADMM to solve an elastic-net SVM that incorporates domain knowledge. First, EQS. (2) and (17) can be combined, and the resulting optimization formulation can be written in an equivalent unconstrained form by penalizing the violation of the inequality constraints through hinge losses in the objective function (ENK-SVM):
Variable-splitting can then be applied to decouple the L1-norms and hinge losses and obtain the following objective function
and equivalent constrained optimization formulation:
Embodiments of the invention then form the augmented Lagrangian L of the system of EQ. (19):
and minimize L with respect to w, b, c, a, p, q, u, and v individually and in order. For the sake of readability, the non-negativity constraints on u and v are not penalized in the augmented Lagrangian for the moment.
Given (ak, ck, pk, qk), solving for (w, b) again involves solving a linear system
where
κ1=λ2+μ2+ρ1+ρ3,
κ2=μ3+μ4,
rw=XTYγ1k+μ1XTY(e−ak)−γ2k+μ2ck+ρ3DTvk−ρ1BTuk,
and
rb=eTYγ1k+μ1eTY(e−ak)+γ3k+μ3(dTuk+1−qk)−γ4k−μ4(gTvk+1−pk).
As with the linear system of Algorithm 2, above, the N in the lower right of the matrix on the left-hand side is a matrix with diagonal elements equal to N. Similar to solving the linear system in Algorithm 2, the solution to the above linear system can be computed through a few PCG iterations, taking advantage of the fact that the left-hand-side matrix is of low rank. Note, however, that other methods as are known to those of skill in the art can be used to solve EQ. (21), such as a Cholesky decomposition or an LU factorization.
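By way of illustration, a minimal preconditioned conjugate gradient routine of the kind referred to here is sketched below; the interface, the Jacobi (diagonal) preconditioner, and the tolerance are assumptions, and a production implementation would call optimized BLAS/LAPACK kernels instead. The system matrix is supplied only through a matrix-vector product, which is where the special structure of the left-hand side pays off.

#include <cmath>
#include <functional>
#include <vector>

using Vec = std::vector<double>;

static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Solve A x = rhs for symmetric positive definite A given as a matvec,
// with Jacobi preconditioning from diag(A).
Vec pcg(const std::function<Vec(const Vec&)>& Amul, const Vec& rhs,
        const Vec& diagA, int max_iter = 100, double tol = 1e-8) {
    Vec x(rhs.size(), 0.0), r = rhs, z(rhs.size());
    for (std::size_t i = 0; i < r.size(); ++i) z[i] = r[i] / diagA[i];
    Vec p = z;
    double rz = dot(r, z);
    for (int k = 0; k < max_iter && std::sqrt(dot(r, r)) > tol; ++k) {
        Vec Ap = Amul(p);
        double alpha = rz / dot(p, Ap);
        for (std::size_t i = 0; i < x.size(); ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        for (std::size_t i = 0; i < r.size(); ++i) z[i] = r[i] / diagA[i];
        double rz_new = dot(r, z);
        double beta = rz_new / rz;
        rz = rz_new;
        for (std::size_t i = 0; i < p.size(); ++i) p[i] = z[i] + beta * p[i];
    }
    return x;
}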
To minimize the augmented Lagrangian with respect to u, embodiments of the invention can solve a convex quadratic formulation with non-negative constraints:
Obtaining a closed-form solution to the above formulation through a direct approach is challenging. However, by introducing a slack variable s and transferring the non-negative constraint on u to s, the formulation of EQ. (22) can be decomposed into two parts:
Penalizing the linear constraint u−s=0 in the new augmented Lagrangian, the new subsystem with respect to (u, s) is
Given an sk≧0, one can compute uk+1 by solving a k1×k1 linear system:
(ρ1BBT+μ3ddT+μ5I)uk+1/2=ru, (25)
where
ru=−ρ1Bwk+1+μ3dbk+1+μ3d(qk−1)−dγ3k+γ5+μ5sk
Embodiments of the invention can assume that B has full row rank. This is a reasonable assumption since otherwise there is at least one redundant domain knowledge constraint, which can be removed. The numbers of domain knowledge constraints, k1 and k2, are usually small, so EQ. (25) can be solved exactly and efficiently. An exemplary, non-limiting method for solving EQ. (25) uses a Cholesky factorization.
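By way of illustration, a minimal dense Cholesky solve for such a small k1×k1 symmetric positive definite system might look as follows; the flat row-major storage and the absence of pivoting are simplifying assumptions, and a production code would call LAPACK routines instead.

#include <cmath>
#include <vector>

// Solve A u = r for a dense SPD n-by-n matrix A (row-major, copied by value)
// via the factorization A = L L^T followed by forward and backward substitution.
std::vector<double> cholesky_solve(std::vector<double> A, std::vector<double> r, int n) {
    for (int j = 0; j < n; ++j) {                       // factor column by column
        for (int k = 0; k < j; ++k) A[j * n + j] -= A[j * n + k] * A[j * n + k];
        A[j * n + j] = std::sqrt(A[j * n + j]);
        for (int i = j + 1; i < n; ++i) {
            for (int k = 0; k < j; ++k) A[i * n + j] -= A[i * n + k] * A[j * n + k];
            A[i * n + j] /= A[j * n + j];
        }
    }
    for (int i = 0; i < n; ++i) {                       // forward solve  L y = r
        for (int k = 0; k < i; ++k) r[i] -= A[i * n + k] * r[k];
        r[i] /= A[i * n + i];
    }
    for (int i = n - 1; i >= 0; --i) {                  // backward solve L^T u = y
        for (int k = i + 1; k < n; ++k) r[i] -= A[k * n + i] * r[k];
        r[i] /= A[i * n + i];
    }
    return r;
}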
Embodiments of the invention can solve for the sk+1 corresponding to uk+1 by observing that EQ. (24) is separable in the elements of s. For each element si, an optimal solution to the one-dimensional quadratic formulation with a non-negative constraint on si is given by
which can be written in vector form:
Similarly, embodiments of the invention can solve for vk+1 by introducing a non-negative slack variable t and computing
(ρ3DDT+μ4ggT+μ6I)vk+1=rv, (26)
where
rv=ρ3Dwk+1−μ4gbk+1−gγ4k−μ4g(1−pk)+γ6+μ6tk.
Embodiments of the invention can solve for the tk+1 corresponding to vk+1 similar to the solution for s:
Now, given (wk+1, bk+1, uk+1, vk+1), the solutions for a and c are the same as in Lines 4 and 5 of Algorithm 2:
The subsystem with respect to q is
The solution is given by a (one-dimensional) proximal operator associated with the hinge loss:
Similarly, the subsystem with respect to p is:
and the solution is given by
The above steps are summarized in Algorithm 4, with reference to the steps of the flowchart in
Although there appear to be ten additional parameters (two λ's, four ρ's and four μ's) in the ADMM method for ENK-SVM, embodiments of the invention can set the ρ's to the same value and do the same for the μ's. Hence, in practice, there is only one additional parameter to tune, and experiments described below indicate that an algorithm according to an embodiment of the invention is insensitive to the μ's. In addition, note that due to the presence of additional constraints, the stopping criteria of EQ. (7) should not be used to break out of the loop over k.
3.2 IPM Phase
The Hessian of EQ. (17) (KSVM-P) with respect to the variables β=(w, b, ξ, u, v, ηu, ηv)εRM, M=m+N+k1+k2+3, is a highly sparse matrix:
An exemplary, non-limiting solver for solving EQ. (17) is the IPM QP solver available from www.mosek.com. The Mosek solver contains a highly efficient primal-dual IPM implementation that works very well for large sparse systems. To use the solver's interface from a HIPAD algorithm according to an embodiment of the invention, EQ. (17) is cast into the general form:
with H defined above, f=(0, 0, ceT, 0, 0, ρ2, ρ4), and
The box-constraint limit values xl and xu depend upon the component of β. For the first component, w, the values of xl and xu are −∞ and +∞, respectively, since w is unconstrained. For the other components, xl=0 and xu=∞.
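Schematically, such a solver accepts a convex QP with linear and box constraints of the form

\min_{\beta} \; \tfrac{1}{2}\beta^T H \beta + f^T\beta \quad\text{s.t.}\quad l_c \le A\beta \le u_c, \quad x_l \le \beta \le x_u ,

where the exact ordering and orientation of the constraint blocks passed to the solver interface are assumptions made here for concreteness.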
A two-phase algorithm for an elastic-net KSVM that incorporates domain knowledge according to an embodiment of the invention is presented in Algorithm 5, with reference to the steps of the flowchart in
3.3 Convergence
Recall from Remark 1 that the convergence of ADMM, as stated in Theorem 1, only applies to the case of two-way update of the variables. In a first phase of Algorithm 5, while the updates to a, c, q, p, s and t are independent from each other given (w, b, u, v), the updates to w, b, u, and v are not. However, one may observe from EQS. (21), (25), and (26) that the new iterate of (w, b, u, v) can be computed by solving an augmented linear system in (w, b, u, v) similar to Line 3 in Algorithm 4. With this modification, Algorithm 4 essentially employs a two-way update of the variables (w, b, u, v) and (a, c, p, q). Hence, convergence is guaranteed by Theorem 1. In practice, a sequential updating scheme for (w, b, u, v) is maintained.
An exemplary, non-limiting HIPAD algorithm according to an embodiment of the invention can be written in C/C++, and can use the CBLAS interface and the BLAS implementation provided by the GNU Scientific Library for basic linear algebra routines, where code for the ADMM phase is based on that of G. Ye, Y. Chen, and X. Xie, “Efficient variable selection in support vector machines via the alternating direction method of multipliers”, AISTATS, 2011, the contents of which are herein incorporated by reference in their entirety. Algorithms according to embodiments of the invention were tested on both synthetic and real data and compared with the LIBSVM solver disclosed in R. Fan, P. Chen, and C. Lin, “Working set selection using second order information for training support vector machines”, The Journal of Machine Learning Research, 6:1889-1918, 2005, the contents of which are herein incorporated by reference in their entirety, and with ADMM-ENSVM. Since LIBSVM does not support ENSVM, it was used to train a standard SVM. An exemplary, non-limiting HIPAD-ENK according to an embodiment of the invention was written in Matlab, and the Mosek QP IPM solver was used for an IPM-for-KSVM phase according to an embodiment of the invention through the solver's Matlab interface.
The transition condition at the end of a first phase of a HIPAD according to an embodiment of the invention is specified in EQ. (7), with εtol=10−2. The stopping criteria for ADMM-ENSVM are as follows:
where Fk is the objective value at the k-th iteration, ε1=10−5, and ε2=10−3.
4.1 Synthetic Data
Synthetic data sets were generated from real data sets as follows. For each real data set, the feature space was augmented to a dimensionality of M=10,000 (or 20,000). The original m features occupy the first m entries in the feature vector. The remaining features were generated as independent and identically distributed (i.i.d.) Gaussian. Table 1, shown in
This set of data simulates the presence of noisy features. The prediction accuracy of LIBSVM, which trains a standard SVM, falls because of the interference of the noise. In contrast, the results show that a HIPAD according to an embodiment of the invention is robust to noisy features, in that it can select the most important true features and eliminate the noisy features, producing the same, or even better, prediction accuracies as those on the original data, as shown in the last column of Table 2. A HIPAD according to an embodiment of the invention is also faster than ADMM-ENSVM on these dense data sets.
4.2. Real Data
A HIPAD algorithm according to an embodiment of the invention was tested on nine real data sets which are publicly available. Table 3, shown in
The parameters of HIPAD, ADMM-ENSVM, and LIBSVM were selected through cross-validation on the training data. The experimental results are summarized in Table 4, shown in
4.3. Simulation for Knowledge Incorporation
Synthetic data was generated to simulate the example presented at the beginning of the Domain Knowledge Incorporation section in a high dimensional feature space. Specifically, four groups of multivariate Gaussians K1, . . . , K4 were sampled from N(μ1+, Σ1), . . . , N(μ4+, Σ4) for the positive class and from N(μ1−, Σ1), . . . , N(μ4−, Σ4) for the negative class, for four disjoint blocks of feature values (xK
It was challenging to separate the training samples because the values of blocks K2 and K3 for the two classes are close to each other. However, blocks K1 and K4 in the test samples are well-separated. Hence, if one is given information about these two blocks as general knowledge for an entire population, the resulting SVM classifier would perform much better on the test data. Since the mean values of the distributions from which the entries in K1 and K4 are generated are known, the following information about the relationship between the block sample means and class membership can be supplied to the KSVM:
where Lj is the length of the Kj, j=1, . . . , 4, and the lowercase xi denotes the i-th entry of the sample x. Translating into the notation of (KSVM-P):
The information given here is not precise, in that a sample should belong to the positive (or negative) class only when the corresponding block sample mean well exceeds the distribution mean. This is consistent with real-world situations, where the domain or expert knowledge tends to be conservative and often does not come in exact form.
Two sets of synthetic data were simulated for ENK-SVM as described above, with (Ntrain=200, Ntest=400, m=10,000) for a dataset ksvm-s-10k and (Ntrain=500, Ntest=1,000, m=50,000) for a dataset ksvm-s-50k. The number of features in each of the four blocks (K1, K2, K3, K4) is 50 for both data sets. The parameters of a HIPAD-ENK (Algorithm 5) according to an embodiment of the invention are set as follows: λ1=0.1, λ2=1, ρ1=ρ3=50, ρ2=ρ4=25 in an ADMM phase according to an embodiment of the invention; c=0.05, ρ1=ρ3=5,000, ρ2=ρ4=2,500 in an IPM phase according to an embodiment of the invention. For LIBSVM, c=1.0. Experimental results are presented in Table 5, shown in
In fact, the feature supports identified by a HIPAD-ENK according to an embodiment of the invention also provide an indication of how useful (or valid) the given information is for the current classification task. To demonstrate that, d and g above were both changed to 0. This time, a HIPAD-ENK according to an embodiment of the invention did not include K1 and K4 in the feature support, as shown in
It is to be understood that embodiments of the present invention can be implemented in various forms of hardware, software, firmware, special purpose processes, or a combination thereof. In one embodiment, the present invention can be implemented in software as an application program tangibly embodied on a computer readable program storage device. The application program can be uploaded to, and executed by, a machine comprising any suitable architecture.
The computer system 141 also includes an operating system and micro instruction code. The various processes and functions described herein can either be part of the micro instruction code or part of the application program (or combination thereof) which is executed via the operating system. In addition, various other peripheral devices can be connected to the computer platform such as an additional data storage device and a printing device.
It is to be further understood that, because some of the constituent system components and method steps depicted in the accompanying figures can be implemented in software, the actual connections between the systems components (or the process steps) may differ depending upon the manner in which the present invention is programmed. Given the teachings of the present invention provided herein, one of ordinary skill in the related art will be able to contemplate these and similar implementations or configurations of the present invention.
While the present invention has been described in detail with reference to exemplary embodiments, those skilled in the art will appreciate that various modifications and substitutions can be made thereto without departing from the spirit and scope of the invention as set forth in the appended claims.
References Cited:
Forero, et al., "Consensus-Based Distributed Support Vector Machines," Journal of Machine Learning Research, 11, 2010, pp. 1663-1707.
Hui Zou, et al., "Regularization and Variable Selection Via the Elastic Net," J. R. Statistical Society B (2005) 67, Part 2, pp. 301-320.
Rong-En Fan, et al., "Working Set Selection Using Second Order Information for Training Support Vector Machines," Journal of Machine Learning Research 6 (2005), pp. 1889-1918.
Li Wang, et al., "The Doubly Regularized Support Vector Machine," Statistica Sinica 16 (2005), pp. 589-615.
Jonathan Eckstein, et al., "On the Douglas-Rachford Splitting Method and the Proximal Point Algorithm for Maximal Monotone Operators," Oct. 1989, pp. 1-34.
E. Michael Gertz, et al., "Object-Oriented Software for Quadratic Programming," ACM Transactions on Mathematical Software, vol. 29, No. 1, Mar. 2003, pp. 58-81.
Glenn Fung, et al., "Knowledge-Based Support Vector Machine Classifiers."
Jianing Shi, et al., "A Fast Hybrid Algorithm for Large-Scale l1-Regularized Logistic Regression," Journal of Machine Learning Research 11 (2010), pp. 713-741.
Gui-Bo Ye, et al., "Efficient Variable Selection in Support Vector Machines Via the Alternating Direction Method of Multipliers," Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), 2011, vol. 15, pp. 832-840.
Stephen Boyd, et al., "Distributed Optimization and Statistical Learning Via the Alternating Direction Method of Multipliers," Foundations and Trends in Machine Learning, vol. 3, No. 1 (2010), pp. 1-122.
David D. Lewis, et al., "RCV1: A New Benchmark Collection for Text Categorization Research," Journal of Machine Learning Research 5 (2004), pp. 361-397.
Publication Number: 20130073489 A1, Date: Mar. 2013, Country: US
Provisional Application Number: 61/536,092, Date: Sep. 2011, Country: US