The invention relates to learning machines and, more particularly, to kernel-based techniques for implementing learning machines.
There are a number of known techniques for automating the classification of data based on an analysis of a set of training data. Of particular interest herein are kernel-based techniques such as support vector machines. The development of support vector machines has a history that dates back to 1965, when Chervonenkis and Vapnik developed an algorithm referred to as the generalized portrait method for constructing an optimal separating hyperplane. A learning machine using the generalized portrait method optimizes the margin between the training data and a decision boundary by solving a quadratic optimization problem whose solution can be obtained by maximizing the functional:
$$W(\alpha)=\sum_{i=1}^{l}\alpha_i-\frac{1}{2}\sum_{i,j=1}^{l}\alpha_i\alpha_j y_i y_j (x_i, x_j)$$
subject to the constraint that $\sum_{i=1}^{l} y_i\alpha_i=0$ and $\alpha_i\ge 0$. The Lagrange multipliers $\alpha_i$ define the separating hyperplane used by the learning machine. Supposing the optimal values for the multipliers are $\alpha_i^o$ and the corresponding value for the threshold is $b_o$, the equation for this hyperplane is $\sum_{i=1}^{l}\alpha_i^o y_i (x_i, x)+b_o=0$.
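The constraints and the hyperplane equation above can be checked numerically on a toy problem. The following is a minimal sketch, not the generalized portrait algorithm itself: the helper names, the two-point dataset, and its hand-derived multipliers are assumptions made for illustration, and `dual_objective` assumes the standard dual form of the functional (the sum of the multipliers minus half the quadratic term).

```python
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Euclidean inner product (u, v)."""
    return sum(a * b for a, b in zip(u, v))


def dual_objective(alpha: Sequence[float], y: Sequence[int], xs: Sequence[Vector]) -> float:
    """W(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j (x_i, x_j)."""
    n = len(xs)
    quad = sum(
        alpha[i] * alpha[j] * y[i] * y[j] * dot(xs[i], xs[j])
        for i in range(n)
        for j in range(n)
    )
    return sum(alpha) - 0.5 * quad


def is_feasible(alpha: Sequence[float], y: Sequence[int], tol: float = 1e-12) -> bool:
    """Check sum_i y_i alpha_i = 0 and alpha_i >= 0."""
    balanced = abs(sum(a * yi for a, yi in zip(alpha, y))) <= tol
    return balanced and all(a >= -tol for a in alpha)


def hyperplane_value(x: Vector, alpha: Sequence[float], y: Sequence[int],
                     xs: Sequence[Vector], b: float) -> float:
    """Left-hand side of sum_i alpha_i y_i (x_i, x) + b = 0."""
    return sum(a * yi * dot(xi, x) for a, yi, xi in zip(alpha, y, xs)) + b


# Hypothetical toy data: one point per class, symmetric about the origin.
XS = [(1.0, 1.0), (-1.0, -1.0)]
YS = [1, -1]
# With alpha_1 = alpha_2 = a the functional is 2a - 4a^2, maximized at a = 0.25.
ALPHA = [0.25, 0.25]
B = 0.0
```

On this toy problem both training points evaluate to plus or minus one, so each lies exactly on its margin.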
In 1992, Boser, Guyon, and Vapnik devised an effective means of constructing the separating hyperplane in a Hilbert space which avoids having to explicitly map the input vectors into the Hilbert space. Instead, the separating hyperplane is represented in terms of kernel functions which define an inner product in the Hilbert space. See Bernard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik, “A Training Algorithm for Optimal Margin Classifiers,” Proceedings of the Fifth Annual Workshop on Computational Learning Theory (July 1992). The quadratic optimization problem can then be solved by maximizing the functional:
$$W(\alpha)=\sum_{i=1}^{l}\alpha_i-\frac{1}{2}\sum_{i,j=1}^{l}\alpha_i\alpha_j y_i y_j K(x_i, x_j)$$
subject to the constraints
$$\sum_{i=1}^{l} y_i\alpha_i=0$$
and $\alpha_i\ge 0$.
In this case, the corresponding equation of the separating hyperplane is
$$\sum_{i=1}^{l}\alpha_i^o y_i K(x_i, x)+b_o=0. \tag{1}$$
In 1995, Cortes and Vapnik generalized the maximal margin idea for constructing the separating hyperplane in the image space when the training data is non-separable. See Corinna Cortes and Vladimir N. Vapnik, “Support Vector Networks,” Machine Learning, Vol. 20, pp. 273-97 (September 1995). The quadratic form of the optimization problem is expressed in terms of what is referred to as a “slack variable” which is non-zero for those points that lie on the wrong side of the margin, thereby allowing for an imperfect separation of the training data. By converting to the dual form, the quadratic optimization problem can again be expressed in terms of maximizing the following objective functional
$$W(\alpha)=\sum_{i=1}^{l}\alpha_i-\frac{1}{2}\sum_{i,j=1}^{l}\alpha_i\alpha_j y_i y_j K(x_i, x_j)$$
subject to the constraint $\sum_{i=1}^{l} y_i\alpha_i=0$ but with the new constraint that
$$0\le\alpha_i\le C.$$
Again, the corresponding equation of the separating hyperplane is given by equation (1) above. The equation is an expansion over those vectors for which $\alpha_i\neq 0$, these vectors being referred to in the art as “support vectors.” To construct a support vector machine one can use any positive definite function $K(x_i, x_j)$, with different choices creating different types of support vector machines. Support vector machines have proven to be useful for a wide range of applications, including problems in the areas of bioinformatics and text classification.
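The role of the support vectors and of the constraint $0\le\alpha_i\le C$ can be illustrated in a few lines. The following is a minimal sketch under stated assumptions: the Gaussian kernel, the function names, and the four-point dataset are hypothetical, and the multipliers are illustrative values chosen to satisfy the constraints rather than the output of an optimizer.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]
Kernel = Callable[[Vector, Vector], float]


def rbf_kernel(u: Vector, v: Vector, sigma: float = 1.0) -> float:
    """Gaussian kernel, one common positive definite choice (an assumption here)."""
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq / (2.0 * sigma ** 2))


def is_dual_feasible(alpha: Sequence[float], y: Sequence[int], c: float,
                     tol: float = 1e-9) -> bool:
    """Check sum_i y_i alpha_i = 0 and 0 <= alpha_i <= C."""
    balanced = abs(sum(a * yi for a, yi in zip(alpha, y))) <= tol
    boxed = all(-tol <= a <= c + tol for a in alpha)
    return balanced and boxed


def support_vector_indices(alpha: Sequence[float], tol: float = 1e-9) -> List[int]:
    """Indices i with alpha_i != 0; only these contribute to the expansion."""
    return [i for i, a in enumerate(alpha) if abs(a) > tol]


def decision_value(x: Vector, alpha: Sequence[float], y: Sequence[int],
                   xs: Sequence[Vector], b: float, kernel: Kernel) -> float:
    """sum over support vectors of alpha_i y_i K(x_i, x), plus b."""
    return sum(alpha[i] * y[i] * kernel(xs[i], x)
               for i in support_vector_indices(alpha)) + b


def classify(x: Vector, alpha: Sequence[float], y: Sequence[int],
             xs: Sequence[Vector], b: float, kernel: Kernel) -> int:
    """Label by the side of the hyperplane on which x falls."""
    return 1 if decision_value(x, alpha, y, xs, b, kernel) >= 0 else -1


# Hypothetical data; ALPHA is illustrative, not the result of training.
XS = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
YS = [1, 1, -1, -1]
ALPHA = [0.0, 0.8, 0.8, 0.0]
C = 1.0
B = 0.0
```

Only the two inner points carry non-zero multipliers, so the expansion sums over two terms regardless of how many training vectors there are.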
A method for training a support vector machine includes receiving input vectors from a labeled dataset that have been clustered into one or more clusters, the input vectors from the labeled dataset relating to a phenomenon of interest; creating simultaneously a t+1 space of a decision function and a t space of a correcting function where t is the number of clusters; specifying generalized constraints in the space of the correcting function which are dependent upon the clusters of input vectors in the dataset; generating in the space of the correcting function, a subset of the input vectors from the dataset using the decision function subject to the generalized constraints, the subset of the input vectors defining a separating hyperplane in a space of the decision function, the subset of the input vectors for use by the support vector machine in describing input vectors from an unlabeled dataset relating to the phenomenon of interest; and wherein the generalized constraints can be represented as
where there are t clusters and where $K_r$ is the kernel function defined on the correction space into which the input vectors within the cluster r are mapped, $\alpha_i$ are the coefficients determining the separating hyperplane in the separating space, and $T_r$ defines a set of indices of input vectors within a cluster r.
These and other advantages of the invention will be apparent to those of ordinary skill in the art by reference to the following detailed description and the accompanying drawings.
The training data for the learning machine are represented as input vectors in an input space, e.g., illustratively depicted in
These training input vectors 101, 102, . . . 106, are initially clustered either manually or using any of a number of known automated clustering techniques.
As further explained herein, a separating hyperplane 180 is constructed in the separating space 150 which defines the above-mentioned separating function. The separating hyperplane 180 is constructed in a manner which not only optimizes the margins 171, 172 in the separating space 150, but which also takes into account the correction functions of the corresponding clustered input vectors. Accordingly, in contrast with prior art techniques, the separating function can be generated in a manner which takes advantage of any a priori knowledge one may have about the global structure of the training data, including the noise characteristics in the input space 100. As further discussed herein, the disclosed technique can be seen as a generalization of the support vector machine framework; accordingly, the inventors refer to the technique as “SVM+”.
A preferred embodiment is further described with reference to
At step 230, the input training data $\{(x_i, y_i)\}_{i=1}^{l}$ is clustered to form $t\ge 1$ clusters. As mentioned above, the t clusters can be formed manually or using any of a number of known automated clustering techniques. The present invention is not limited to any specific clustering technique. Any clustering technique, including a hierarchical approach or a partitioning approach such as a k-means or k-medians clustering algorithm, can be readily adapted and utilized. Where the input training data is represented in vector form $x_1, \ldots, x_l$, these vectors would be the disjoint union of the t clusters of vectors $x_i$. Define the set of the indices of the input training vectors from cluster r,
It should be noted that although, for purposes of the following discussion, the clusters are assumed to depend on the values of the $x_i$, the clusters could readily be arranged so as to also depend on the values of the $y_i$.
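Once cluster assignments are available, the per-cluster index sets can be built directly from them. The following is a minimal sketch, assuming the cluster assignment of each training vector is already given as a list; the function names and sample assignments are hypothetical, and indices are zero-based here whereas the text counts from 1.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def cluster_index_sets(labels: Sequence[Hashable]) -> Dict[Hashable, List[int]]:
    """Map each cluster key r to the list T_r of indices assigned to it."""
    index_sets: Dict[Hashable, List[int]] = defaultdict(list)
    for i, r in enumerate(labels):
        index_sets[r].append(i)
    return dict(index_sets)


def is_disjoint_cover(index_sets: Dict[Hashable, List[int]], n: int) -> bool:
    """True when every index 0..n-1 appears in exactly one cluster."""
    flat = sorted(i for indices in index_sets.values() for i in indices)
    return flat == list(range(n))


# Hypothetical cluster assignments for six training vectors (t = 3).
CLUSTER_LABELS = [0, 0, 1, 1, 0, 2]
YS = [1, -1, 1, 1, -1, -1]

# Clusters that depend on the x_i only.
T = cluster_index_sets(CLUSTER_LABELS)

# Clusters that also depend on the y_i: key each vector by (cluster, class).
T_BY_CLASS = cluster_index_sets(list(zip(CLUSTER_LABELS, YS)))
```

Keying by the pair of cluster and class label is one simple way to let the clusters depend on the $y_i$ as well; it splits any cluster that contains both classes.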
At step 240, kernel functions are defined for the learning machine. As discussed above, the input training vectors $x_i\in X_r$ are simultaneously mapped into a separating space $z_i\in Z$ and a correction space $z_i^r\in Z_r$. Input vectors of different clusters are mapped to the same separating space Z but to different correction spaces $Z_r$. The separating space Z defines the separating function. The correction space $Z_r$ defines a correction function
$$\xi_i=\varphi_r(x_i, a),\quad a\in A_r,\quad x_i\in X_r,\quad r=1,\ldots,t \tag{2}$$
for each cluster $X_r$ from a set of admissible correcting functions, $\varphi_r(x_i, a)$, $a\in A_r$, for this cluster. It is assumed that the admissible set of correction functions $\xi=\varphi(x, a)$, $a\in A_r$, can be described as linear non-negative functions in $Z_r$ space:
$$\xi_i=\varphi_r(x_i, a_r)=(w_r, z_i^r)+d_r\ge 0,\quad r=1,\ldots,t.$$
Again, as noted above, the correction functions could also depend on the $y_i$ values, although this is not explicitly indicated here. The separating space Z and the correction spaces $Z_r$ are preferably represented as Hilbert spaces. In accordance with Mercer's theorem, there exists in the input vector space a positive definite function, referred to in the art as a kernel function, that defines a corresponding inner product in the separating space Z and the correction spaces $Z_r$. Accordingly, the corresponding inner product for the separating space can be defined by the kernel
$$(z_i, z_j)=K(x_i, x_j)$$
while the corresponding inner product in the correction space for cluster r can be defined by the kernels
$$(z_i^r, z_j^r)=K_r(x_i, x_j)\ge 0,\quad i, j\in T_r,\quad r=1,\ldots,t.$$
Note that although prior art techniques use a single kernel, the herein disclosed technique uses up to (t+1) distinct kernels.
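The use of one kernel for the separating space and one kernel per cluster for the correction spaces can be sketched as follows. This is an illustration under stated assumptions, not a prescribed choice: the polynomial kernel for the separating space, the Gaussian kernels (which are non-negative everywhere) for the correction spaces, their width parameters, and the four-point dataset are all hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
Kernel = Callable[[Vector, Vector], float]


def polynomial_kernel(u: Vector, v: Vector, c: float = 1.0, degree: int = 2) -> float:
    """((u, v) + c) ** degree, used here for the separating space."""
    return (sum(a * b for a, b in zip(u, v)) + c) ** degree


def make_rbf_kernel(sigma: float) -> Kernel:
    """Return a Gaussian kernel of width sigma; its values lie in (0, 1]."""
    def kernel(u: Vector, v: Vector) -> float:
        sq = sum((a - b) ** 2 for a, b in zip(u, v))
        return math.exp(-sq / (2.0 * sigma ** 2))
    return kernel


def gram(kernel: Kernel, xs: Sequence[Vector], indices: Sequence[int]) -> List[List[float]]:
    """Kernel matrix restricted to the given indices."""
    return [[kernel(xs[i], xs[j]) for j in indices] for i in indices]


# Hypothetical training vectors split into t = 2 clusters.
XS = [(0.0, 0.0), (1.0, 0.0), (4.0, 1.0), (5.0, 1.0)]
T = {0: [0, 1], 1: [2, 3]}

# One correction kernel per cluster, each with its own (assumed) width.
CORRECTION_KERNELS: Dict[int, Kernel] = {0: make_rbf_kernel(0.5), 1: make_rbf_kernel(2.0)}

# K is defined over all l vectors; each K_r only over the indices in T_r.
K = gram(polynomial_kernel, XS, range(len(XS)))
K_R = {r: gram(CORRECTION_KERNELS[r], XS, indices) for r, indices in T.items()}
```

Here there are three kernels in total, one for the separating space plus one for each of the two clusters.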
The present invention is not limited to any specific form of kernel function for either the separating space or the correction space(s). The most popular kernels used in machine learning today are the polynomial kernel of degree d
$$K(x_i, x_j)=((x_i, x_j)+C)^d$$
and the exponential kernel
An optimal separating function can then be constructed. The optimal separating hyperplane in the separating space can be found by minimizing the objective function
subject to the constraints
$$y_i[(w, z_i)+b]\ge 1-((w_r, z_i^r)+d_r),\quad i\in T_r,\quad r=1,\ldots,t,$$
and
$$(w_r, z_i^r)+d_r\ge 0,\quad i\in T_r,\quad r=1,\ldots,t.$$
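The two families of constraints above can be checked for a candidate solution. The following is a minimal sketch, assuming for simplicity that both the separating space and the correction spaces coincide with the input space (so $z_i = z_i^r = x_i$); the function names, the three-point dataset, and the candidate parameters are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Euclidean inner product (u, v)."""
    return sum(a * b for a, b in zip(u, v))


def correction_value(w_r: Vector, d_r: float, z_r: Vector) -> float:
    """Linear correction function (w_r, z_i^r) + d_r for one vector."""
    return dot(w_r, z_r) + d_r


def constraints_hold(xs: Sequence[Vector], ys: Sequence[int],
                     index_sets: Dict[int, List[int]],
                     w: Vector, b: float,
                     corrections: Dict[int, Tuple[Vector, float]],
                     tol: float = 1e-9) -> bool:
    """Check both constraint families for every vector in every cluster."""
    for r, indices in index_sets.items():
        w_r, d_r = corrections[r]
        for i in indices:
            xi_i = correction_value(w_r, d_r, xs[i])
            margin = ys[i] * (dot(w, xs[i]) + b)
            # The correction must be non-negative and must cover any margin shortfall.
            if xi_i < -tol or margin < 1.0 - xi_i - tol:
                return False
    return True


# Hypothetical data: the second vector sits inside the margin of w = (1, 0).
XS = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0)]
YS = [1, 1, -1]
T = {0: [0, 1], 1: [2]}
W = (1.0, 0.0)
B = 0.0

# Cluster 0 uses a correction that shrinks as the first coordinate grows.
CORRECTIONS = {0: ((-0.4, 0.0), 0.8), 1: ((0.0, 0.0), 0.0)}
# With all corrections identically zero, the constraints reduce to hard-margin ones.
ZERO_CORRECTIONS = {0: ((0.0, 0.0), 0.0), 1: ((0.0, 0.0), 0.0)}
```

In this example the hyperplane is rejected when the corrections are zero, because one vector lies inside the margin, but it is admitted once cluster 0 is given a non-trivial correction function.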
The corresponding Lagrangian is
It can then be shown that the optimal separating hypersurface in X (input) space has the form
where the coefficients αi maximize the quadratic form
subject to the constraints
where $|T_r|$ is the number of elements in cluster $T_r$ and where $\alpha_i\ge 0$ for $i=1,\ldots,l$. The inventors refer to the constraint in equation (5), which is the same as disclosed in the prior art, as a “balance” constraint. The t constraints in equation (6) will be recognized by those skilled in the art as what is referred to as a “lasso” constraint (see R. Tibshirani, “Regression Shrinkage and Selection via the Lasso,” J. R. Statist. Soc. (B), 58, 267-288 (1996)).
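As a hedged sketch of how constraints of this kind can arise, suppose the objective being minimized is $\frac{1}{2}(w,w)+C\sum_{r}\sum_{i\in T_r}[(w_r,z_i^r)+d_r]$. This specific form, and the additional multipliers $\mu_i$ attached to the non-negativity constraints, are assumptions made for illustration. Setting the derivatives of the resulting Lagrangian to zero then yields a balance condition, a per-cluster bound on the sum of the $\alpha_i$, and a kernel-weighted bound:

```latex
\begin{align*}
L &= \tfrac{1}{2}(w,w) + C\sum_{r=1}^{t}\sum_{i\in T_r}\bigl[(w_r,z_i^r)+d_r\bigr]
   - \sum_{r=1}^{t}\sum_{i\in T_r}\alpha_i\bigl\{y_i[(w,z_i)+b]-1+(w_r,z_i^r)+d_r\bigr\} \\
  &\quad - \sum_{r=1}^{t}\sum_{i\in T_r}\mu_i\bigl[(w_r,z_i^r)+d_r\bigr],
   \qquad \alpha_i\ge 0,\ \mu_i\ge 0, \\
\frac{\partial L}{\partial w}=0 &\;\Rightarrow\; w=\sum_{i=1}^{l}\alpha_i y_i z_i, \\
\frac{\partial L}{\partial b}=0 &\;\Rightarrow\; \sum_{i=1}^{l} y_i\alpha_i=0, \\
\frac{\partial L}{\partial d_r}=0 &\;\Rightarrow\; \sum_{i\in T_r}(\alpha_i+\mu_i)=|T_r|\,C
   \;\Rightarrow\; \sum_{i\in T_r}\alpha_i\le |T_r|\,C, \\
\frac{\partial L}{\partial w_r}=0 &\;\Rightarrow\; \sum_{i\in T_r}(\alpha_i+\mu_i)\,z_i^r=C\sum_{i\in T_r}z_i^r
   \;\Rightarrow\; \sum_{i\in T_r}\alpha_i K_r(x_i,x_j)\le C\sum_{i\in T_r}K_r(x_i,x_j),\quad j\in T_r.
\end{align*}
```

The last implication takes the inner product of both sides with $z_j^r$ and uses $\mu_i\ge 0$ together with the non-negativity of the correction kernels, $K_r(x_i,x_j)\ge 0$.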
The constraint in equation (7) can be rewritten in the form
where the $0\le A^*_{i,j}\le 1$ are elements of a symmetric matrix $A^*$ that satisfies the equalities
The inventors refer to the constraints defined in equation (8) as the matrix constraints. Note that the above matrix form of the constraint can be considered as a generalized form of the prior art constraint of $0\le\alpha_i\le C$. When $A^*_{i,j}=1$ if $i=j$ and $A^*_{i,j}=0$ if $i\neq j$, i.e., if the matrix $A^*$ is an $l\times l$ identity matrix, then the above generalized constraint defines the box constraints that are independent of the data, just as in the prior art. In this case, the lasso constraints in equation (6) are automatically satisfied. The disclosed technique, however, has the advantage of enabling a broader specification of constraints which can be dependent on the global structure of the data. The learning machine, accordingly, can consider the global structure of the training data, which prior art support vector machine techniques ignore.
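The relationship between the matrix constraints and the box constraints can be illustrated numerically. The following is a minimal sketch that assumes the matrix constraint takes the form $\sum_i A^*_{i,j}\alpha_i\le C$ for each j, applied here to a single cluster of two vectors; that form, the function names, and the averaging matrix are assumptions made for illustration.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def identity(n: int) -> Matrix:
    """n x n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matrix_constraints_hold(a_star: Matrix, alpha: Sequence[float], c: float,
                            tol: float = 1e-9) -> bool:
    """Assumed form: alpha_i >= 0 and sum_i A*[i][j] * alpha_i <= C for every j."""
    n = len(alpha)
    if any(a < -tol for a in alpha):
        return False
    return all(sum(a_star[i][j] * alpha[i] for i in range(n)) <= c + tol
               for j in range(n))


def box_constraints_hold(alpha: Sequence[float], c: float, tol: float = 1e-9) -> bool:
    """Prior-art constraint 0 <= alpha_i <= C."""
    return all(-tol <= a <= c + tol for a in alpha)


def lasso_constraint_holds(alpha: Sequence[float], c: float, tol: float = 1e-9) -> bool:
    """Per-cluster bound sum_i alpha_i <= |T_r| * C for a single cluster."""
    return sum(alpha) <= len(alpha) * c + tol


C = 1.0
# Hypothetical multipliers: the second one exceeds C.
ALPHA = [0.5, 1.5]
# Hypothetical symmetric matrix with entries in [0, 1] and unit column sums.
AVERAGING = [[0.5, 0.5], [0.5, 0.5]]
```

With the identity matrix the check coincides with the box constraints and rejects `ALPHA`; with the averaging matrix the same multipliers are admitted because only their weighted combination is bounded, while the lasso-type bound on their sum still holds.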
With reference again to
These values are the computed labels for the given data. Where the classifier is a binary classifier, the generated label can be simply output or used in the relevant application. Where the classifier is a multiclass classifier, the generated labels from a plurality of binary classifiers are combined at step 340 using known techniques. For example, and without limitation, the multiclass classifier can be comprised of a plurality of binary classifiers, trained as above, where each binary classifier could compare one class to all others or could compare classes pairwise. The final result can then be output at step 350.
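The two ways of combining binary classifiers mentioned above, one class against all others and pairwise comparison, can be sketched as follows. This is an illustration only: the scoring functions stand in for trained binary decision functions and are hypothetical, as are the class names and the voting rule used for the pairwise case.

```python
from collections import Counter
from typing import Callable, Dict, Sequence, Tuple

Vector = Sequence[float]
Scorer = Callable[[Vector], float]


def binary_label(value: float) -> int:
    """Label of a single binary classifier from its decision value."""
    return 1 if value >= 0 else -1


def one_vs_rest_label(x: Vector, scorers: Dict[str, Scorer]) -> str:
    """Pick the class whose one-against-all classifier gives the largest value."""
    return max(scorers, key=lambda name: scorers[name](x))


def pairwise_label(x: Vector, scorers: Dict[Tuple[str, str], Scorer]) -> str:
    """Each (positive, negative) classifier casts one vote; the most-voted class wins."""
    votes: Counter = Counter()
    for (pos, neg), scorer in scorers.items():
        votes[pos if binary_label(scorer(x)) == 1 else neg] += 1
    return votes.most_common(1)[0][0]


# Hypothetical one-dimensional decision functions for classes "a", "b", "c".
ONE_VS_REST: Dict[str, Scorer] = {
    "a": lambda x: x[0] - 1.0,
    "b": lambda x: -x[0] - 1.0,
    "c": lambda x: 1.0 - abs(x[0]),
}
PAIRWISE: Dict[Tuple[str, str], Scorer] = {
    ("a", "b"): lambda x: x[0],
    ("a", "c"): lambda x: x[0] - 1.0,
    ("b", "c"): lambda x: -x[0] - 1.0,
}
```

For example, `one_vs_rest_label((2.0,), ONE_VS_REST)` and `pairwise_label((2.0,), PAIRWISE)` both select class `"a"` with these stand-in functions.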
The machine learning methods disclosed herein may be readily adapted and utilized in a wide array of applications to construct a separating or decision function which describes (e.g., classifies, predicts, etc.) data in multidimensional space, the data corresponding to a phenomenon of interest, e.g., images, text, stock prices, etc. More specifically, the applications include, for example and without limitation, general pattern recognition (including image recognition, object detection, and speech and handwriting recognition), regression analysis and predictive modeling (including quality control systems and recommendation systems), data classification (including text and image classification and categorization), bioinformatics (including automated diagnosis systems, biological modeling, and bioimaging classification), data mining (including financial forecasting, database marketing), etc.
One skilled in the art will recognize that any suitable computer system may be used to execute the machine learning methods disclosed herein. The computer system may include, without limitation, a mainframe computer system, a workstation, a personal computer system, a personal digital assistant (PDA), or other device or apparatus having at least one processor that executes instructions from a memory medium.
The computer system may further include a display device or monitor for displaying operations associated with the learning machine and one or more memory mediums on which one or more computer programs or software components may be stored. For example, one or more software programs which are executable to perform the machine learning methods described herein may be stored in the memory medium. The one or more memory mediums may include, without limitation, CD-ROMs, floppy disks, tape devices, random access memories such as, but not limited to, DRAM, SRAM, EDO RAM, and Rambus RAM, non-volatile memories such as, but not limited to, hard drives and optical storage devices, and combinations thereof. In addition, the memory medium may be entirely or partially located in one or more associated computers or computer systems which connect to the computer system over a network, such as the Internet.
The machine learning methods described herein may also be executed in hardware, a combination of software and hardware, or in other suitable executable implementations. The learning machine methods implemented in software may be executed by the processor of the computer system or the processor or processors of the one or more associated computers or computer systems connected to the computer system.
After the computer-implemented SVM+ learning machine has been trained, the input device 410 of the computer system 400 inputs unlabeled data about the phenomenon of interest into the processor 420 in accordance with the methods disclosed herein. In the earlier mentioned embodiment wherein the SVM+ learning machine has been trained for use in pattern classification, the unlabeled input data may comprise a plurality of patterns to be classified by the computer system 400 using the trained SVM+ learning machine. In any case, the unlabeled data inputted into the processor 420 may also be collected and preprocessed in accordance with the preferred embodiment described earlier. The CPU 424 of the processor then executes the SVM+ learning machine program instructions disclosed herein using the coefficients $\{\alpha_i^o\}_{i=1}^{l}$ and the threshold $b_o$ generated above during training of the SVM+ learning machine, to describe the phenomenon of interest, or in other words, classify, analyze, mine or otherwise transform the unlabeled input data corresponding to the phenomenon of interest into a form that is useful for analysis, control and decision making. The processor 420 then outputs the result of the SVM+ learning machine (e.g., classified patterns) to the output device 430. In embodiments where the output device 430 comprises the display monitor, the display monitor may display the transformed data in a suitable manner so that a user can make some decision or take some action (e.g., identify an image, recognize a handwriting sample, buy or sell a stock, or control a manufacturing process, to name a few). Alternatively, the output device 430 may comprise a device that further processes the transformed data and automatically makes a decision or takes some action as a result of or in response to this data.
While exemplary drawings and specific embodiments of the present invention have been described and illustrated, it is to be understood that the scope of the present invention is not to be limited to the particular embodiments discussed. Thus, the embodiments shall be regarded as illustrative rather than restrictive, and it should be understood that variations may be made in those embodiments by workers skilled in the art without departing from the scope of the present invention as set forth in the claims that follow and their structural and functional equivalents.
This application is a continuation-in-part of application Ser. No. 11/252,487, filed Oct. 18, 2005. The entire disclosure of application Ser. No. 11/252,487 is incorporated herein by reference.
Number | Date | Country |
---|---|---|
20080313112 A1 | Dec 2008 | US |

Relation | Number | Date | Country |
---|---|---|---|
Parent | 11252487 | Oct 2005 | US |
Child | 12100220 | | US |