The disclosed embodiments generally relate to techniques for improving the performance of supervised-learning models, such as support vector machines (SVMs). More specifically, the disclosed embodiments provide a randomized technique that iteratively improves approximations for nonlinear SVM models.
Support vector machines (SVMs) comprise a popular class of supervised machine-learning techniques, which can be used for both classification and regression purposes. For large scale data sets, the task of allocating and computing the associated large kernels (e.g., Gaussian), which are used to solve the SVM model, becomes prohibitively expensive. More specifically, for such nonlinear kernels, the complexity of an SVM solution technique grows quadratically in memory space and cubically in running time as a function of the number of observations in the data set. This means it is impractical to use SVMs for larger data sets with more than hundreds of thousands of observations, which are becoming increasingly common in many application domains.
To remedy this computing-cost problem, practitioners make various types of approximations, such as: sampling data points; computing block-diagonal approximations of nonlinear kernels; and performing incomplete Cholesky factorizations. These approximations can significantly reduce computation costs, which makes it practical to analyze large data sets. Unfortunately, the use of such approximations generally produces suboptimal results during classification and regression operations. Moreover, there presently do not exist any techniques for effectively improving these suboptimal results.
Hence, what is needed is a technique for improving approximations for nonlinear SVMs.
The disclosed embodiments relate to a system that improves operation of a monitored system. During a training mode, the system uses a training data set comprising labeled data points received from the monitored system to train the SVM to detect one or more conditions-of-interest. While training the SVM model, the system makes approximations to reduce computing costs, wherein the approximations involve stochastically discarding points from the training data set based on an inverse distance to a separating hyperplane for the SVM model. Next, during a surveillance mode, the system uses the trained SVM model to detect the one or more conditions-of-interest based on monitored data points received from the monitored system. When one or more conditions-of-interest are detected, the system performs an action to improve operation of the monitored system.
In some embodiments, while training the SVM model, the system uses a block-diagonal approximation to initialize an active set of support vectors for the SVM model. Next, the system iteratively performs the following operations to improve the SVM model while SVM misclassifications continue to decrease by more than a minimum amount. First, the system randomly selects additional points from the training data set based on an inverse distance to the separating hyperplane for the SVM model. The system then solves a nonlinear kernel for the SVM model based on the active set of support vectors and the additional data points to compute a new active set of support vectors. Then, if the new active set of support vectors produces fewer misclassifications than the current active set, the system replaces the active set of support vectors with the new active set.
In some embodiments, while randomly selecting the additional points, the system selects an additional point x from the training data set with a probability P(x)=(μ+v·d(x))^−β, wherein d(x) represents a distance from x to the separating hyperplane, and μ, v and β represent associated parameters.
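For illustration only, the following Python sketch shows one way the iterative procedure described above could be organized. The helper functions solve_svm and distances_to_hyperplane, the attribute support_idx_, and the default parameter values are hypothetical placeholders rather than part of any actual implementation, and the fixed-size weighted draw is just one possible realization of the probability-based sampling.

```python
import numpy as np

def refine_svm(X, y, initial_support_idx, solve_svm, distances_to_hyperplane,
               mu=1.0, v=1.0, beta=2.0, n_sample=10_000, min_improvement=1):
    """Sketch of the iterative refinement loop described above.

    solve_svm(X, y, idx) and distances_to_hyperplane(model, X) are hypothetical
    helpers standing in for a nonlinear-SVM solver and the feature-space distance
    computation; model.support_idx_ is likewise a placeholder attribute.
    """
    active_idx = np.asarray(initial_support_idx)    # from the block-diagonal approximation
    model = solve_svm(X, y, active_idx)
    best_errors = int(np.sum(model.predict(X) != y))

    while True:
        # Favor points near the separating hyperplane: P(x) = (mu + v*d(x))**(-beta).
        d = distances_to_hyperplane(model, X)
        p = (mu + v * d) ** (-beta)
        p = p / p.sum()                             # normalize so a fixed-size sample can be drawn
        extra_idx = np.random.choice(len(X), size=n_sample, replace=False, p=p)

        # Solve the nonlinear SVM on the union of the active set and the new sample.
        new_model = solve_svm(X, y, np.union1d(active_idx, extra_idx))
        new_errors = int(np.sum(new_model.predict(X) != y))

        # Accept the new active set only if it reduces misclassifications.
        if new_errors < best_errors:
            model, active_idx = new_model, new_model.support_idx_

        # Stop once misclassifications no longer decrease by more than the minimum amount.
        if best_errors - new_errors <= min_improvement:
            break
        best_errors = new_errors

    return model, active_idx
```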
In some embodiments, the SVM model is formulated based on one of the following types of kernels: a linear kernel; a polynomial kernel; a hyperbolic tangent kernel; and a radial basis function kernel.
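For reference, the following Python snippet sketches standard textbook forms of these kernel functions; the parameter names and default values (degree, coef0, kappa, gamma) are illustrative, and particular embodiments may parameterize the kernels differently.

```python
import numpy as np

def linear_kernel(x, z):
    # k(x, z) = x . z
    return np.dot(x, z)

def polynomial_kernel(x, z, degree=3, coef0=1.0):
    # k(x, z) = (x . z + coef0) ** degree
    return (np.dot(x, z) + coef0) ** degree

def hyperbolic_tangent_kernel(x, z, kappa=0.01, coef0=-1.0):
    # k(x, z) = tanh(kappa * x . z + coef0)
    return np.tanh(kappa * np.dot(x, z) + coef0)

def rbf_kernel(x, z, gamma=0.1):
    # k(x, z) = exp(-gamma * ||x - z||**2)  (Gaussian / radial basis function)
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(z)) ** 2))
```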
In some embodiments, the monitored system comprises one of the following: a computer system; a database system; a website; an online customer-support system; a vehicle; an aircraft; a utility system asset; and a piece of machinery.
In some embodiments, the data points received from the monitored system include one or more of the following: time-series sensor signals; computer parameters; textual data; numerical data; and image data. In some embodiments, detecting the one or more conditions-of-interest involves detecting one or more of the following: an impending failure of the monitored system; a malicious-intrusion event in the monitored system; a preventive-maintenance condition for the monitored system; a fraud condition for the monitored system; a product-purchasing condition for the monitored system; and a consumer-attrition condition for the monitored system.
In some embodiments, performing the action to improve operation of the monitored system involves one or more of the following: sending a notification to an administrator of the monitored system; performing an action to stop a malicious-intrusion event in the monitored system; scheduling a maintenance operation for the monitored system; performing an action to stop an instance of fraud associated with the monitored system; performing an action to make relevant offers to customers associated with the monitored system; and performing an action to improve satisfaction of a customer associated with the monitored system.
The following description is presented to enable any person skilled in the art to make and use the present embodiments, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present embodiments. Thus, the present embodiments are not limited to the embodiments shown, but are to be accorded the widest scope consistent with the principles and features disclosed herein.
The data structures and code described in this detailed description are typically stored on a computer-readable storage medium, which may be any device or medium that can store code and/or data for use by a computer system. The computer-readable storage medium includes, but is not limited to, volatile memory, non-volatile memory, magnetic and optical storage devices such as disk drives, magnetic tape, CDs (compact discs), DVDs (digital versatile discs or digital video discs), or other media capable of storing code and/or data now known or later developed.
The methods and processes described in the detailed description section can be embodied as code and/or data, which can be stored in a computer-readable storage medium as described above. When a computer system reads and executes the code and/or data stored on the computer-readable storage medium, the computer system performs the methods and processes embodied as data structures and code and stored within the computer-readable storage medium. Furthermore, the methods and processes described below can be included in hardware modules. For example, the hardware modules can include, but are not limited to, application-specific integrated circuit (ASIC) chips, field-programmable gate arrays (FPGAs), and other programmable-logic devices now known or later developed. When the hardware modules are activated, the hardware modules perform the methods and processes included within the hardware modules.
During operation, customer-support system 124 receives various signals from application 120 and associated database system 122. Next, customer-support system 124 analyzes these signals using an associated SVM model 126 to produce information, which is presented to an analyst 111 through client system 115 to facilitate interactions with customers 102-104. For example, SVM model 126 can perform a classification operation based on the signals received from application 120 and database 122 to detect: a possible malicious-intrusion event; a possible fraudulent transaction; or a set of customer interactions that indicate possible dissatisfaction of a customer. Finally, a notification about a detected problem can be presented to analyst 111, which enables analyst 111 to take action to remedy the problem.
An SVM model can also be used to facilitate the operation of a prognostic-surveillance system. As illustrated in
During operation of prognostic-surveillance system 200, time-series signals 204 feed into a time-series database 206, which stores the time-series signals 204 for subsequent analysis. Next, the time-series signals 204 either feed directly from monitored system 202 or from time-series database 206 into analysis module 208. Analysis module 208 uses an associated SVM model 210 to analyze time-series signals 204 to detect various problematic conditions for monitored system 202. For example, analysis module 208 can be used to detect: an impending failure of the monitored system 202; a malicious-intrusion event in monitored system 202; or a condition indicating that preventive maintenance is required for the monitored system 202. A notification about a detected problem can then be sent to analyst 212, which enables analyst 212 to take action to remedy the problem.
We now present details of our new randomized technique, which iteratively improves approximations for nonlinear SVMs. As mentioned above, for large-scale data sets, allocating and computing a nonlinear (e.g., Gaussian) kernel for an SVM is often prohibitively expensive. To address this problem, we propose a novel technique. In the first step, it constructs a block-diagonal approximation of the kernel to find an initial set of support vectors S. It then generates new random samples of observations based on their proximity to the separating hyperplane, improving S after each iteration.
Let X be the input data set. Any point x∈X\S that is not a support vector can be safely dropped from the SVM model, because an inactive constraint can be dropped from an optimization problem without changing the optimal solution. Once an initial set of support vectors S has been found, we first drop all points in X\S from the data set. It is intuitively clear that any point that is far from the separating hyperplane (in the transformed feature space) has little chance of ever entering the set of optimal support vectors. Therefore, at the next iteration of our technique, we add points back with probability
P(x)=(μ+v·d(x))^−β  (1)
where d(x) is the distance from x to the hyperplane, and μ, v, β>0 are associated parameters. In other words, the closer the point is to the current separating hyperplane, the greater the chance it will be added back to the model. Then we solve the new model, and repeat.
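As a minimal illustration, the following snippet applies formula (1) as an independent inclusion probability for each candidate point; the distances and parameter values below are made up purely for demonstration.

```python
import numpy as np

# Distances from a handful of candidate points to the current separating hyperplane
# (values made up purely for demonstration).
d = np.array([0.1, 0.5, 1.0, 2.0, 5.0])

mu, v, beta = 1.0, 1.0, 2.0             # illustrative parameter values
p = (mu + v * d) ** (-beta)             # inclusion probability per formula (1)
p = np.clip(p, 0.0, 1.0)                # keep each probability in [0, 1]

added = np.random.rand(len(d)) < p      # independent Bernoulli draw per point
print(p)       # closer points receive higher probabilities
print(added)   # Boolean mask of points added back to the model
```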
Let us illustrate our approach on the airline on-time data set. Because it has approximately 123 million observations, solving a nonlinear SVM directly is out of the question (it is impractical with existing technology to allocate a 123-million-by-123-million square matrix). So we first construct a block-diagonal approximation to find an initial set of support vectors S0. Say, for example, S0 has 300 support vectors, which approximate the optimal solution (the optimal set of support vectors). We, of course, cannot allocate a nonlinear kernel for the original data set, but for, say, a 10,300-observation data set, we surely can. So at the next step, we randomly choose 10,000 observations X0, such that the probability of an observation being added to the new model is given by formula (1), and solve the SVM model on S0∪X0 observations, which gives us S1. The process is then repeated until some stopping criteria are met.
Imagine we have two sets of points and wish to construct a maximum margin separating hyperplane (see
Whenever the classes are not linearly separable (see
where xi are data samples (observations), M is the number of observations, yi are class labels, C is the misclassification penalty, and k(·, ·) is the nonlinear kernel function.
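For completeness, a standard statement of the soft-margin dual that these quantities describe, and which we believe corresponds to formulation (2), is

$$\max_{\alpha}\;\sum_{i=1}^{M}\alpha_i-\frac{1}{2}\sum_{i=1}^{M}\sum_{j=1}^{M}\alpha_i\alpha_j\,y_i y_j\,k(x_i,x_j)\quad\text{subject to}\quad\sum_{i=1}^{M}\alpha_i y_i=0,\qquad 0\le\alpha_i\le C,$$

where the αi are the dual coefficients and the observations with αi>0 are the support vectors; the exact form used in a particular embodiment may differ.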
Commonly, the following kernels are used in practice:
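By way of example, standard forms of such kernels are: a linear kernel k(x, z)=x^T z; a polynomial kernel k(x, z)=(x^T z+c)^d; a hyperbolic tangent kernel k(x, z)=tanh(κ x^T z+c); and a radial basis function (Gaussian) kernel k(x, z)=exp(−γ‖x−z‖^2), where c, d, κ and γ are kernel parameters; particular embodiments may parameterize these kernels differently.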
The biggest challenge in formulation (2) lies in constructing the quadratic matrix Q: qij≡k(xi, xj). Q can become prohibitively large even for medium data set sizes. To illustrate this, let us consider a one-million-observation data set, which nowadays would be viewed as rather small. It will require approximately 3.7 terabytes to store the lower (or upper) triangular part of Q. Note that this number (3.7 terabytes) does not depend upon the number of columns in the data set, because Q∈ℝ^{M×M}, and it grows quadratically with the number of observations M.
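As a back-of-the-envelope check on this figure, assuming 8-byte double-precision entries:

```python
M = 1_000_000                  # number of observations
entries = M * (M + 1) // 2     # lower-triangular part of Q, including the diagonal
bytes_needed = entries * 8     # 8 bytes per double-precision value
print(bytes_needed / 2**40)    # ~3.64 tebibytes, consistent with the ~3.7 terabytes cited above
```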
In this section we give a brief overview of the predictor-corrector interior-point method for SVM. As stated earlier, a nonlinear SVM formulation is a classical quadratic programming (QP) model. Let us consider the following standard QP formulation, which is identical to (2), except we no longer use SVM specific notation, but switch to the standard QP nomenclature:
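For reference, a standard statement of such a bounded, equality-constrained QP, which we believe corresponds to formulation (3), is

$$\min_{x}\;\tfrac{1}{2}x^{\top}Qx+c^{\top}x\quad\text{subject to}\quad Ax=b,\qquad l\le x\le u,$$

with b denoting the right-hand side of the equality constraints.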
Here, x is the vector of search variables, Q is a symmetric positive-semidefinite matrix, c represents the linear part of the objective function, l is the vector of lower bounds, u is the vector of upper bounds, and A is the matrix of linear equality constraints.
The dual program to (3) can be stated as follows:
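For completeness, a standard form of this dual, which we believe corresponds to (4), is

$$\max_{x,\,y,\,d_1,\,d_2}\;b^{\top}y+l^{\top}d_1-u^{\top}d_2-\tfrac{1}{2}x^{\top}Qx\quad\text{subject to}\quad Qx+c-A^{\top}y-d_1+d_2=0,\qquad d_1,d_2\ge 0,$$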
where d1 and d2 are the dual variables associated with the lower and upper bounds, respectively, and y is the vector of dual variables associated with the linear equality constraints.
The predictor-corrector interior-point algorithm will solve (twice at each step) the following system of equations, known as the reduced Karush-Kuhn-Tucker (KKT) system:
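Although the exact composition of the diagonal scaling term varies between implementations, a reduced KKT system consistent with the solve steps (7) and (8) below has the general form

$$\begin{bmatrix}Q+D & -A^{\top}\\ A & 0\end{bmatrix}\begin{bmatrix}\Delta x\\ \Delta y\end{bmatrix}=\begin{bmatrix}\rho_1\\ -\rho_2\end{bmatrix},$$

where D is a positive diagonal matrix assembled from the bound multipliers d1, d2 and their slacks, so that Q+D admits the Cholesky factorization LL^T used below; the precise form of (5) may differ.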
where the right-hand sides ρ1 and ρ2 are defined as follows:
During the predictor step, μ and the nonlinear delta terms are dropped, and the resultant system is solved for an initial estimate of the delta terms. During the corrector step, an estimate of μ is reinstated in the system, along with the nonlinear delta terms, and the system is solved again.
To solve KKT, one has to compute the Cholesky factorization
and then proceed to solve for Δy
AL^−T L^−1 A^T Δy = −ρ2 − AL^−T L^−1 ρ1  (7)
and, finally, restore Δx
Δx = L^−T L^−1 (ρ1 + A^T Δy)  (8)
Of course, no explicit inverses of the lower triangular matrix L and the upper triangular matrix L^T are computed; instead, one carries out forward and backward substitutions.
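For illustration, the following Python sketch carries out the solves in (7) and (8) using a Cholesky factorization and triangular solves (forward and backward substitutions) rather than explicit inverses; the function and variable names are our own, and H simply stands for whatever matrix the Cholesky factorization is applied to.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def kkt_step(H, A, rho1, rho2):
    """Solve (7) and (8) via Cholesky factorization and triangular solves.

    H is the symmetric positive-definite matrix being factored (H = L L^T),
    A the equality-constraint matrix, and rho1/rho2 the right-hand sides.
    No explicit inverse is formed; forward/backward substitutions are used.
    """
    L = cholesky(H, lower=True)                      # H = L L^T

    def apply_H_inv(b):
        # Compute L^{-T} (L^{-1} b) via forward then backward substitution.
        return solve_triangular(L, solve_triangular(L, b, lower=True),
                                lower=True, trans='T')

    # Equation (7):  A L^{-T} L^{-1} A^T dy = -rho2 - A L^{-T} L^{-1} rho1
    S = A @ apply_H_inv(A.T)                         # Schur complement
    rhs = -rho2 - A @ apply_H_inv(rho1)
    dy = np.linalg.solve(S, rhs)

    # Equation (8):  dx = L^{-T} L^{-1} (rho1 + A^T dy)
    dx = apply_H_inv(rho1 + A.T @ dy)
    return dx, dy
```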
Now we must recall that Q (the SVM kernel matrix) can be prohibitively large; and for most medium to large scale data inputs, it simply cannot be allocated. We next provide an approximation to the nonlinear SVM model, and then show how to improve it.
We consider the most typical case: "tall and skinny" matrices, where M≫N. When storing such matrices on a cluster of compute nodes, X is usually partitioned into a collection of row blocks
where Xp∈ℝ^{Mp×N} and the block row counts Mp sum to M.
Note that because each partition Xp does not necessarily have the same number of rows, correspondingly Qp can be of different sizes. See
Some of the obvious properties of the Q̃ matrix:
In particular, the computation of each diagonal block of Q̃ is carried out by each worker independently (an "embarrassingly parallel" method).
Introducing Q̃ into the reduced KKT system (5) makes the system tractable to store and solve. Understandably, we would no longer be solving the original nonlinear SVM model, but its block-diagonal approximation, which we will denote dSVM, where 'd' stands for "diagonal".
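For illustration only, the following Python sketch shows one way the diagonal blocks of Q̃ could be computed for a Gaussian kernel; the function names, the choice of gamma, and the use of np.array_split to form partitions are illustrative rather than prescriptive.

```python
import numpy as np

def gaussian_kernel_block(Xp, gamma):
    # Dense Gaussian kernel for a single row partition Xp (shape: Mp x N).
    sq = np.sum(Xp ** 2, axis=1)
    dists = sq[:, None] + sq[None, :] - 2.0 * (Xp @ Xp.T)
    return np.exp(-gamma * np.maximum(dists, 0.0))

def block_diagonal_kernel(partitions, gamma=0.1):
    """Compute the diagonal blocks of Q-tilde, one per partition.

    Each block depends only on its own partition, so in a distributed setting
    every worker can compute its block independently ("embarrassingly parallel").
    Off-diagonal blocks are treated as zero.
    """
    return [gaussian_kernel_block(Xp, gamma) for Xp in partitions]

# Illustrative usage with random data split into three row blocks:
X = np.random.rand(3000, 10)
partitions = np.array_split(X, 3)
Q_blocks = block_diagonal_kernel(partitions, gamma=0.5)
print([B.shape for B in Q_blocks])   # e.g. [(1000, 1000), (1000, 1000), (1000, 1000)]
```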
Having solved dSVM, we obtain a set of support vectors, which to some extent approximates the optimal solution. Let us consider a hyperplane w^T x+b=0 and an arbitrary observation g. The distance from g to the hyperplane is given by d(g)=|w^T g+b|/‖w‖.
It is intuitively clear that if the distance d is large, the chance of g being a support vector is small; therefore, we do not need to keep the observation in the optimization model. In the transformed feature space, the core expression |w^T g+b| translates to |Σi∈S αi yi k(xi, g)+b|, where the αi are the dual coefficients of the support vectors in S.
Let S be the initial set of support vectors, obtained by solving the dSVM. To improve it, we randomly choose N (e.g., N=20,000) observations from the input data set X, where each point x is drawn with probability P(x)=(μ+v·d(x))^−β, as in (1),
where μ, v, β>0 are associated parameters, whose values can be chosen via, e.g., Bayesian optimization. Let X0 be the resultant set. At the next step we solve the nonlinear SVM model on the union S∪X0. This procedure can be repeated a number of times. The stopping criteria can be, for example, that the number of misclassifications no longer decreases by more than a minimum amount, or that a maximum number of iterations has been reached.
During a training mode, the system uses a training data set comprising labeled data points received from the monitored system to train the SVM to detect one or more conditions-of-interest (step 602). While training the SVM model, the system makes approximations to reduce computing costs, wherein the approximations involve stochastically discarding points from the training data set based on an inverse distance to a separating hyperplane for the SVM model (step 604). Next, during a surveillance mode, the system uses the trained SVM model to detect the one or more conditions-of-interest based on monitored data points received from the monitored system (step 606). When one or more conditions-of-interest are detected, the system performs an action to improve operation of the monitored system (step 608).
We propose using a block-diagonal approximation to produce an initial set of support vectors. We also propose a way to generate random samples that gives a higher probability of inclusion to points that are closer to the separating hyperplane (in the transformed feature space). In contrast, the standard way of solving large-scale SVM models today relies on uniform random sampling of the input data, which produces significantly lower model accuracy than our new technique.
Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present invention. Thus, the present invention is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
The foregoing descriptions of embodiments have been presented for purposes of illustration and description only. They are not intended to be exhaustive or to limit the present description to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art. Additionally, the above disclosure is not intended to limit the present description. The scope of the present description is defined by the appended claims.