The present invention relates generally to machine learning, and more particularly to training support vector machines.
Machine learning involves techniques to allow computers to “learn”. More specifically, machine learning involves training a computer system to perform some task, rather than directly programming the system to perform the task. The system observes some data and automatically determines some structure of the data for use at a later time when processing unknown data.
Machine learning techniques generally create a function from training data. The training data consists of pairs of input objects (typically vectors), and desired outputs. The output of the function can be a continuous value (called regression), or can predict a class label of the input object (called classification). The task of the learning machine is to predict the value of the function for any valid input object after having seen only a small number of training examples (i.e. pairs of input and target output).
One particular type of learning machine is a support vector machine (SVM). SVMs are well known in the art, for example as described in V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998; and C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery 2, 121-167, 1998. Although SVMs are well known, a brief description of SVMs is given here in order to aid in the following description of the present invention.
Consider the classification shown in
As can be seen from
As described above, SVMs determine a maximum margin hyperplane based on a set of support vectors. The maximum margin hyperplane is determined by minimizing a primal cost function. However, directly solving the minimization problem may be difficult because the constraints can be quite complex. Accordingly, a dual maximization problem can be solved instead of the primal problem. The maximum of the dual problem is equal to the minimum of the primal problem, but the constraints of the dual problem are typically much simpler than those of the primal problem. In order to train an SVM, an iterative optimization algorithm is used to maximize the dual problem. Typically, the optimization algorithm performs iterations until the dual problem converges at a maximum. However, it is desirable to expedite the SVM training process by early termination of the optimization algorithm before the optimum solution is reached, without losing accuracy of the resulting SVMs.
The present invention provides a method and apparatus for early termination in training of a support vector machine (SVM). In accordance with the principles of the present invention, the training of an SVM can be terminated earlier as the amount of training data grows. Accordingly, embodiments of the present invention utilize a termination criterion that varies based on the number of training data examples used to train the SVM.
In one embodiment of the invention, a support vector machine is iteratively trained based on training data using an objective function having primal and dual formulations. At each iteration, an SVM solution is updated in order to increase the value of the dual formulation. A termination threshold is then calculated based on the updated SVM solution. The termination threshold can increase sublinearly with respect to the number of training data examples. The termination threshold can be calculated based on the observed variance of the loss for the updated SVM solution. A duality gap between the value of the primal formulation and the value of the dual formulation is calculated based on the updated SVM solution. The termination threshold is compared to the duality gap, and when the duality gap is less than the termination threshold, the training is terminated.
These and other advantages of the invention will be apparent to those of ordinary skill in the art by reference to the following detailed description and the accompanying drawings.
The central role of optimization in the design of a machine learning algorithm derives naturally from a widely accepted mathematical setup of a learning problem. For example, a learning problem can be described as the minimization of the expected risk Q(f)=∫L(x,y,f)dP(x,y) in a situation where the ground truth probability distribution dP(x,y) is unknown, except for a finite sample {(x1,y1), . . . , (xn,yn)} of independently drawn examples. Statistical learning theory indicates this problem can be approached by minimizing the empirical risk Qn(f)=n−1 Σi L(xi,yi,f) subject to a restriction of the form Ω(f)<Mn. This leads to the minimization of the penalized empirical risk:
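Equation (1) itself is not reproduced here. The following is a reconstruction consistent with the definitions of Qn, Ω, and λn given above, and is offered only as a non-limiting illustration:

\min_{f}\; Q_n(f) + \lambda_n\,\Omega(f) \;=\; \min_{f}\; \frac{1}{n}\sum_{i=1}^{n} L(x_i, y_i, f) \;+\; \lambda_n\,\Omega(f)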
The penalized empirical risk expressed in Equation (1) can be minimized using various optimization algorithms. Embodiments of the present invention expedite this process by termination of such an optimization before reaching the optimum value. This “early termination” of an optimization algorithm is conceptually distinct from “early stopping” of an optimization algorithm. Early stopping interrupts the optimization algorithm when a cross-validation estimate reveals overfitting to the training data. Early termination terminates the optimization algorithm prior to convergence at the optimum value when it can be confidently asserted that the approximate solution will perform as well as the exact optimum.
The principles of the present invention can be applied to machine learning algorithms that admit a dual representation, and will be discussed more specifically herein in the context of a support vector machine (SVM) algorithm solved in dual formulation. One skilled in the art will recognize that the principles of the present invention may be similarly applied to any other machine learning methods that admit a dual representation.
In an SVM training method, consider n training patterns x1 . . . xn, and their associated labels y1 . . . yn=±1. Let Φ be a feature map that represents a pattern x as a point Φ(x) in a suitable Hilbert space H. What is sought is a linear decision function f·Φ(x), parameterized by f∈H, whose sign indicates the putative class of pattern x. In order to avoid minor technical complications in the discussion of the present invention, the decision function is described herein with no threshold. It is to be understood by those skilled in the art that the principles of the present invention can also be applied using a threshold. If φ(v)=max(0,1−v) is the hinge loss function, the minimization of the primal cost function can be expressed as:
This primal cost function P(f) is an adaptation of the penalized empirical risk (Equation (1)) with L(x,y,f)=φ(y f·Φ(x)), Ω(f)=∥f∥2/2, and λn=1/(nCn). This optimization problem admits a dual formulation:
where the function K(x,x′)=Φ(x)·Φ(x′) is called the kernel function. It is common to choose a kernel function, and to let Φ be implicitly defined by the choice of the kernel function. Let {circumflex over (f)} and {circumflex over (α)} be optimal solutions of problems (2) and (3), respectively. In this case, due to the strong duality property, for any feasible f and α,
Accordingly, the maximum of the dual formulation is equal to the minimum of the primal formulation, and the primal formulation (2) can be solved by optimizing the dual formulation (3).
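For concreteness, the primal problem (2), the dual problem (3), and the relation (4) are consistent with the following standard forms for a hinge-loss SVM with no threshold. These forms are offered only as a non-limiting reconstruction based on the definitions of φ, K, and Cn given above:

P(f) \;=\; \frac{1}{2}\lVert f\rVert^{2} \;+\; C_n \sum_{i=1}^{n} \varphi\bigl(y_i\, f\cdot\Phi(x_i)\bigr)

D(\alpha) \;=\; \sum_{i=1}^{n} \alpha_i \;-\; \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, K(x_i, x_j), \qquad 0 \le \alpha_i \le C_n

D(\alpha) \;\le\; D(\hat{\alpha}) \;=\; P(\hat{f}) \;\le\; P(f) \quad \text{for any feasible } f \text{ and } \alpha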
Typical modern SVM solvers iteratively maximize the dual cost function (3) and terminate when a small predefined threshold ε exceeds the L∞ norm of the projection of the gradient (∂D(α)/∂αi) onto the constraint polytope. This quantity can be easily calculated during the iterative process. In conventional SVM solvers, the threshold ε is typically specified prior to training the SVM as a relatively small value, typically in the range 10−4 to 10−2. Although some problems are capable of tolerating much larger thresholds, it is impossible to identify these problems prior to training, and using a large threshold is considered unreliable in conventional SVM solvers.
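For illustration only, a conventional fixed-threshold check of this kind might be implemented as follows. This is a minimal sketch assuming a box-constrained dual with bounds 0≤αi≤C and a precomputed dual gradient; the function and variable names are hypothetical and do not correspond to any particular solver:

import numpy as np

def projected_gradient_inf_norm(grad, alpha, C):
    # Project the dual gradient onto the feasible directions of the box
    # constraints 0 <= alpha_i <= C, then take the L-infinity norm.
    g = grad.copy()
    g[(alpha <= 0.0) & (g < 0.0)] = 0.0   # cannot decrease alpha_i at the lower bound
    g[(alpha >= C) & (g > 0.0)] = 0.0     # cannot increase alpha_i at the upper bound
    return float(np.max(np.abs(g)))

def fixed_threshold_stop(grad, alpha, C, eps=1e-3):
    # Conventional criterion: stop when the projected gradient norm falls
    # below a small threshold eps chosen before training begins.
    return projected_gradient_inf_norm(grad, alpha, C) < eps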
In other SVM training methods, the threshold ε is compared to a duality gap between the primal and dual formulations. The duality gap is the difference between the primal formulation and the dual formulation at the current values of f and α. As expressed in (4), at the optimal solution for the primal and dual formulations, the duality gap will be 0. In this case, optimization is terminated when the optimization method reaches a solution for which the duality gap P(f)−D(α) is less than the threshold ε. The strong duality property then guarantees that the primal value P(f) of that solution exceeds the optimal primal value by no more than ε.
Embodiments of the present invention utilize a threshold value ε that grows sublinearly with the number of training data examples n. Accordingly, the threshold ε changes during the training process based on the training data. This is possible because:
Letting ε grow makes the optimization coarser when the number of training examples increases. As a consequence, the asymptotic complexity of early-terminated optimization can be smaller than that of the exact optimization.
In order for the termination threshold ε to grow with the number of training examples, it is necessary to calculate a termination threshold based on the training data that guarantees nearly the same generalization performance as the exact optimization algorithm for finite training sets. Accordingly, values of the termination threshold ε must be determined to ensure that Q(
Accordingly, it can be assumed that, for any reasonable learning algorithm {circumflex over (f)}(Sn), the deviations Q({circumflex over (f)}(Sn))−Qn({circumflex over (f)}(Sn)) are larger than those prescribed by the central limit theorem:
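Relation (8) is not reproduced here; the order of the deviations suggested by the central limit theorem can be sketched, as a non-limiting illustration, as:

Q\bigl(\hat{f}(S_n)\bigr) - Q_n\bigl(\hat{f}(S_n)\bigr) \;\gtrsim\; \sqrt{\frac{\mathrm{Var}_{x,y}\, L\bigl(x, y, \hat{f}(S_n)\bigr)}{n}}

where S_n denotes the training sample {(x1,y1), . . . , (xn,yn)}.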
Let {circumflex over (f)}=arg minf∈H P(f) be the solution of the primal problem (2) and
The first ratio on the right hand side of (9) is close to unity because both divergences Q(
On the other hand, using (8),
Therefore, a termination threshold ε can be used that is proportional to the empirical approximation of the variance of the loss function measured on the training data, such that:
ε∼Cn√(n Varx,y L(x,y,f))≈Cn√(n Vark φ(yk f·Φ(xk))).  (12)
Accordingly, the duality gap can be compared to the termination threshold ε determined based on the observed variance of the loss function of the training data at each step in an SVM training method in order to determine whether to terminate the training method.
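As a non-limiting illustration, the threshold of expression (12) can be computed from the current SVM solution as follows. This sketch assumes a precomputed kernel matrix K, labels y in {−1,+1}, dual variables alpha, and the constant Cn, and uses the representer form f=Σi αi yi Φ(xi) to obtain the decision values; all names are hypothetical:

import numpy as np

def termination_threshold(K, y, alpha, C_n):
    # Decision values f . Phi(x_k) for every training example, assuming
    # the representer form f = sum_i alpha_i * y_i * Phi(x_i).
    decision = K @ (alpha * y)
    # Hinge losses phi(y_k * f . Phi(x_k)) = max(0, 1 - y_k * f . Phi(x_k)).
    losses = np.maximum(0.0, 1.0 - y * decision)
    # Empirical variance of the losses over the n training examples.
    n = len(y)
    loss_variance = np.var(losses)
    # Threshold proportional to C_n * sqrt(n * Var), as in expression (12).
    return C_n * np.sqrt(n * loss_variance)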
At step 502, an SVM solver is initialized, resulting in an initial SVM solution. The SVM solver is initialized by initializing the variables of the dual formulation to an initial value for all of the training data examples. This results in an initial solution for the SVM that will be updated with iterations of the SVM training method. This step is shown at 552 in the pseudo code of
At step 504, the SVM solution is updated based on training data. The SVM solution is updated by calculating an update step for a variable of the dual formulation in order to maximize the dual formulation as much as possible within certain constraints. This step is shown at 554 in the pseudo code of
At step 506, the termination threshold ε is determined based on the current SVM solution. As described above, the termination threshold ε can be determined based on the observed variance of the loss function of the training examples for the current SVM solution, as expressed in (12). Accordingly, the variance of the loss function can be approximated, and the approximation of the variance of the loss function can be used to determine the termination threshold ε. The variance of the loss can be calculated by calculating the losses for all training data examples and calculating the empirical variance of the losses.
At step 508, the duality gap is calculated for the current SVM solution. As described above, the duality gap is the difference between the primal formulation for the current SVM solution and the dual formulation for the current SVM solution. Accordingly, the current value of the dual formulation D(α) is calculated based on the current α, and the current α is used to calculate the current f in order to calculate the current value of the primal formulation P(f). The duality gap P(f)−D(α) can then be calculated.
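Under the usual representer form of the dual solution, which is consistent with the kernel expansion described above and is used here only as an illustration, these quantities can be written as:

f \;=\; \sum_{i=1}^{n} \alpha_i\, y_i\, \Phi(x_i), \qquad \lVert f\rVert^{2} \;=\; \sum_{i,j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, K(x_i, x_j)

so that the duality gap P(f)−D(α) can be evaluated entirely in terms of α, the labels, and the kernel function.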
At step 510, it is determined whether the current duality gap P(f)−D(α) is less than the current termination threshold ε. If the current duality gap P(f)−D(α) is less than the current termination threshold ε, the termination criterion is met, and the method proceeds to step 512. If the current duality gap P(f)−D(α) is not less than the current termination threshold ε, the termination criterion is not met, and the method returns to step 504 and performs another iterative update to the SVM solution. Steps 506-510 of
At step 512, when the termination criterion has been met, the SVM training method is terminated and the current SVM solution is output. For example, the SVM solution can be stored in memory or storage of a computer system in order to generate an SVM which can be used to classify data similar to the training data.
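The following sketch illustrates, in simplified form, how steps 502-512 can fit together. It uses a basic dual coordinate-ascent update and the variance-based threshold described above; it is offered as a non-limiting illustration rather than as the pseudo code of the figures, and the choice of update rule, the precomputed kernel matrix, and all names are assumptions:

import numpy as np

def train_svm_early_termination(K, y, C_n, max_iter=1000):
    # K   : (n, n) precomputed kernel matrix, K[i, j] = K(x_i, x_j)
    # y   : (n,) labels in {-1, +1}
    # C_n : box-constraint parameter of the dual, 0 <= alpha_i <= C_n
    n = len(y)
    alpha = np.zeros(n)                   # step 502: initial SVM solution
    Q = (y[:, None] * K) * y[None, :]     # Q[i, j] = y_i * y_j * K(x_i, x_j)

    for _ in range(max_iter):
        # Step 504: coordinate-ascent update of the dual variables,
        # increasing D(alpha) within the constraints 0 <= alpha_i <= C_n.
        for i in range(n):
            grad_i = 1.0 - Q[i] @ alpha   # partial derivative dD/dalpha_i
            if Q[i, i] > 0.0:
                alpha[i] = np.clip(alpha[i] + grad_i / Q[i, i], 0.0, C_n)

        decision = K @ (alpha * y)        # f . Phi(x_k) for all training examples
        losses = np.maximum(0.0, 1.0 - y * decision)
        norm_sq = alpha @ (Q @ alpha)     # ||f||^2

        # Step 506: variance-based termination threshold, as in (12).
        eps = C_n * np.sqrt(n * np.var(losses))

        # Step 508: duality gap P(f) - D(alpha) for the current solution.
        primal = 0.5 * norm_sq + C_n * np.sum(losses)
        dual = np.sum(alpha) - 0.5 * norm_sq
        gap = primal - dual

        # Step 510: terminate early once the gap falls below the threshold.
        if gap < eps:
            break

    # Step 512: the current solution defines the trained SVM.
    return alpha

For example, given a kernel matrix K and labels y, calling train_svm_early_termination(K, y, C_n=1.0) returns dual variables α from which a decision function of the form f·Φ(x)=Σi αi yi K(xi,x) can be formed.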
The method of
The steps of the method of
The foregoing Detailed Description is to be understood as being in every respect illustrative and exemplary, but not restrictive, and the scope of the invention disclosed herein is not to be determined from the Detailed Description, but rather from the claims as interpreted according to the full breadth permitted by the patent laws. It is to be understood that the embodiments shown and described herein are only illustrative of the principles of the present invention and that various modifications may be implemented by those skilled in the art without departing from the scope and spirit of the invention. Those skilled in the art could implement various other feature combinations without departing from the scope and spirit of the invention.