Method and Apparatus for Transductive Support Vector Machines

Information

  • Patent Application
  • 20070265991
  • Publication Number
    20070265991
  • Date Filed
    March 21, 2007
  • Date Published
    November 15, 2007
Abstract
Disclosed is a method for training a transductive support vector machine. The support vector machine is trained based on labeled training data and unlabeled test data. A non-convex objective function which optimizes a hyperplane classifier for classifying the unlabeled test data is decomposed into a convex function and a concave function. A local approximation of the concave function at a hyperplane is calculated, and the approximation of the concave function is combined with the convex function such that the result is a convex problem. The convex problem is then solved to determine an updated hyperplane. This method is performed iteratively until the solution converges.
Description

BRIEF DESCRIPTION OF THE DRAWINGS


FIG. 1 shows a 2-class data set;



FIG. 2 shows a 2-class data set classified using a maximum-margin hyperplane defined by support vectors;



FIGS. 3 and 4 illustrate mapping lower dimensional data into higher dimensional space so that the data becomes linearly separable;



FIG. 5 shows a 2-class data set including unlabeled data;



FIG. 6 illustrates three loss functions for unlabeled data;



FIG. 7 illustrates a Ramp Loss function and the decomposition of the Ramp Loss function into convex and concave Hinge Loss functions;



FIG. 8A illustrates a method for training a TSVM according to an embodiment of the present invention;



FIG. 8B is pseudo code of the method shown in FIG. 8A; and



FIG. 9 shows a high level block diagram of a computer capable of implementing embodiments of the present invention.





DETAILED DESCRIPTION

The principles of the present invention will be discussed herein in the context of a transductive support vector machine (TSVM) algorithm solved in dual formulation with an emphasis on two-class classification problems. One skilled in the art will recognize that the principles of the present invention may be applied to alternative problems, such as regression or multi-class classification, in a similar manner. The transduction methods described herein allow for large scale training on data with high dimensionality and a large number of examples.


Consider a set of L labeled training pairs $\mathcal{L}=\{(x_1,y_1),\ldots,(x_L,y_L)\}$, $x\in\mathbb{R}^n$, $y\in\{1,-1\}$, and an unlabeled set of U test vectors $\mathcal{U}=\{x_{L+1},\ldots,x_{L+U}\}$. Here, y is the label for the labeled data. SVMs have a decision function $f_\theta(\cdot)$ of the form:

$$f_\theta(x)=\omega\cdot\Phi(x)+b,$$


where θ=(ω,b) represents parameters of the hyperplane classifier that classifies the data, and Φ(x) is a feature map which maps real world data to a high dimensional feature space. SVMs and TSVMs can be used to classify any type of real world data into two classes. However, the real world data often cannot be classified with a linear hyperplane classifier, so the real world data is transformed to labeled training data and unlabeled test data in a high dimensional feature space. For example, in text classification, pieces of text are transformed into data points in a high dimensional feature space, so that a hyperplane classifier can be used to classify the data points corresponding to the text into two classes. Other examples of real world data that can be transformed into training data for classification include, but are not limited to, images (e.g., faces, objects, digits, etc.), sounds (speech, music signal, etc.), and biological components (proteins, types of cells, etc.). The transformation of data to the high dimensional feature space can be performed using a kernel function k(x1,x2)=Φ(x1)·Φ(x2), which implicitly defines the mapping into the higher dimensional feature space. The use of the kernel function to map data from a lower to a higher dimensionality is well known in the art, for example as described in V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
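As a minimal illustration of the identity k(x1,x2)=Φ(x1)·Φ(x2), the sketch below uses a homogeneous quadratic kernel on 2-dimensional inputs. The choice of kernel and the names `phi`, `dot`, and `quadratic_kernel` are assumptions made for illustration only; the application does not prescribe a particular kernel.

```python
import math

def phi(x):
    """Explicit feature map of the quadratic kernel on 2-D inputs (illustrative)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(u, v):
    """Ordinary inner product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))

def quadratic_kernel(x, z):
    """k(x, z) = (x . z)^2, computed without forming phi(x) or phi(z)."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))      # inner product in the 3-D feature space
implicit = quadratic_kernel(x, z)   # same quantity from the 2-D inputs
```

Both values agree up to floating-point rounding, which is why a kernel-based solver never needs to evaluate Φ explicitly.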


Given the training set $\mathcal{L}$ and the test set $\mathcal{U}$, the TSVM optimization problem attempts to find, among the possible binary vectors

$$\{\mathcal{Y}=(y_{L+1},\ldots,y_{L+U})\}$$

the one for which an SVM trained on $\mathcal{L}\cup(\mathcal{U}\times\mathcal{Y})$ yields the largest margin. Accordingly, the TSVM optimization problem attempts to assign to each example of the test set $\mathcal{U}$ the label that maximizes the margin.


The TSVM optimization problem is a combinatorial problem, but it can be approximated as finding an SVM separating the training set under constraints which force the unlabeled examples to be as far as possible from the margin. This can be expressed as minimizing:

$$\frac{1}{2}\|\omega\|^2 + C\sum_{i=1}^{L}\xi_i + C^*\sum_{i=L+1}^{L+U}\xi_i$$

subject to

$$y_i f_\theta(x_i) \ge 1-\xi_i, \quad i=1,\ldots,L$$

$$|f_\theta(x_i)| \ge 1-\xi_i, \quad i=L+1,\ldots,L+U$$

where ξi is the distance of a data point from the margin, C is the cost function for the labeled data, and C* is the cost function for the unlabeled data. The cost functions assign a cost or a penalty to the data points based on a location of the data point relative to the margin. This minimization problem is equivalent to minimizing:











$$J(\theta)=\frac{1}{2}\|\omega\|^2 + C\sum_{i=1}^{L} H_1\bigl(y_i f_\theta(x_i)\bigr) + C^*\sum_{i=L+1}^{L+U} H_1\bigl(|f_\theta(x_i)|\bigr), \qquad (1)$$







where the function H1(·)=max(0,1−·) is a classical Hinge Loss function for the labeled data. This classical Hinge Loss function is shown in graph 704 of FIG. 7. A standard loss function H1(|·|) for the unlabeled data is shown in graph 602 of FIG. 6. For C*=0 in Equation (1), the standard SVM optimization problem is obtained. For C*>0, unlabeled data within the margin is penalized based on the loss function for unlabeled data. This is equivalent to using the hinge loss on the unlabeled data as well, with the assumption that the label for an unlabeled example is yi=sign(fθ(xi)).
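The relationship between the Hinge Loss H1(·) and the loss H1(|·|) applied to unlabeled outputs can be checked numerically. The following is a minimal sketch; the function names and sample outputs are illustrative assumptions, not part of the application.

```python
def hinge(t):
    """Classical Hinge Loss H1(t) = max(0, 1 - t)."""
    return max(0.0, 1.0 - t)

def symmetric_hinge(z):
    """Loss H1(|z|) applied to the output z = f(x) of an unlabeled example."""
    return hinge(abs(z))

def sign(z):
    """Sign convention used here: zero is treated as positive."""
    return 1.0 if z >= 0 else -1.0

# H1(|z|) equals the hinge loss under the self-assigned label y = sign(z).
outputs = [-2.0, -0.5, 0.0, 0.25, 1.5]
pairs = [(symmetric_hinge(z), hinge(sign(z) * z)) for z in outputs]
```

Each pair holds two equal numbers, and the loss is largest for outputs near zero, that is, for unlabeled examples lying inside the margin.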



FIG. 6 illustrates three loss functions for unlabeled data. Conventional TSVMs based on Equation (1) use the loss functions shown in graphs 602 and 604 for unlabeled data. For example, the SVMLight algorithm, described in T. Joachims, “Transductive Inference for Text Classification Using Support Vector Machines”, International Conference on Machine Learning, ICML, 1999, assigns a Hinge Loss H1(·) (see graph 704 of FIG. 7) on labeled examples and a “Symmetric Hinge Loss” H1(|·|) (graph 602) on unlabeled data. The algorithm described in O. Chapelle and A. Zien, “Semi-Supervised Classification by Low Density Separation”, Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, 2005, uses a smooth version of the Symmetric Hinge Loss (graph 604) for the unlabeled data.


Graph 606 of FIG. 6 shows a loss function for unlabeled data in a TSVM according to an embodiment of the present invention. Given an unlabeled example x and using the notation z=fθ(x), this loss function 606 can be expressed as:

$$z \mapsto R_s(z)+R_s(-z), \qquad (2)$$


where s<1 is a hyper-parameter and Rs refers to a “Ramp Loss” function, which is a “cut” version of the Hinge Loss function. FIG. 7 illustrates the Ramp Loss function Rs, which is shown in graph 702. The s parameter controls where the Ramp Loss function 702 is cut, and as a consequence this parameter also controls the width of the flat portion of the loss function (606 of FIG. 6) used for transduction according to an embodiment of the present invention. When s=0, the loss function reverts to the Symmetric Hinge H1(|·|) shown in graph 602. When s≠0, a non-peaked loss function (graph 606) is generated. This non-peaked loss function 606 when s≠0 is referred to herein as the “Symmetric Ramp Loss” function.


As illustrated in FIG. 7, the Ramp Loss function 702 can be decomposed into the sum of the convex Hinge Loss function 704 and a concave loss function shown in graph 706. As described above, the Hinge Loss function 704 is used for the labeled data.


Training a TSVM using the Symmetric Ramp Loss function 606 expressed in Equation (2) is equivalent to training an SVM using the Hinge Loss function H1(·) 704 for labeled examples, and using the Ramp Loss Rs(·) 702 for unlabeled examples, where each unlabeled example appears as two examples labeled with both possible classes. Accordingly, by introducing:






$$y_i = 1, \quad i\in[L+1 \ldots L+U]$$

$$y_i = -1, \quad i\in[L+U+1 \ldots L+2U]$$

$$x_i = x_{i-U}, \quad i\in[L+U+1 \ldots L+2U],$$


it is possible to rewrite Equation (1) as:











$$J^s(\theta)=\frac{1}{2}\|\omega\|^2 + C\sum_{i=1}^{L} H_1\bigl(y_i f_\theta(x_i)\bigr) + C^*\sum_{i=L+1}^{L+2U} R_s\bigl(y_i f_\theta(x_i)\bigr). \qquad (3)$$







Accordingly, each of the unlabeled examples is duplicated in order to associate a cost with assigning each of the classes to that example. The minimization of the TSVM objective function expressed by Equation (3) will be considered hereinafter.
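The duplication of the unlabeled examples can be sketched as follows. This is a minimal illustration assuming the examples are held in plain Python lists; the function name `duplicate_unlabeled` and the sample data are hypothetical.

```python
def duplicate_unlabeled(labeled, unlabeled):
    """Return (xs, ys): L labeled examples followed by 2U pseudo-labeled ones.

    labeled   : list of (x, y) pairs with y in {+1, -1}
    unlabeled : list of x values
    With 0-based indexing, positions L..L+U-1 carry label +1 and positions
    L+U..L+2U-1 carry label -1, with xs[i] == xs[i - U] for the second copy.
    """
    xs = [x for x, _ in labeled] + list(unlabeled) + list(unlabeled)
    ys = [y for _, y in labeled] + [1] * len(unlabeled) + [-1] * len(unlabeled)
    return xs, ys

labeled = [((0.0, 1.0), 1), ((1.0, 0.0), -1)]        # L = 2
unlabeled = [(0.2, 0.9), (0.8, 0.1), (0.5, 0.5)]     # U = 3
xs, ys = duplicate_unlabeled(labeled, unlabeled)
```

The resulting training set has L+2U entries, which matches the upper summation limit of the second sum in Equation (3).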


One problem with TSVMs is that in high dimensions with few training data it is possible to classify all of the unlabeled examples as belonging to only one of the classes with a very large margin. This can lead to poor performance of a TSVM. In response to this problem, it is possible to constrain the solution of the TSVM objective function by introducing a balancing constraint, which assures that data are assigned to both classes. The balancing constraint enforces that a fraction of positives and negatives assigned to the unlabeled data is the same fraction as found in the labeled data. An example of a possible balancing constraint can be expressed as:











$$\frac{1}{U}\sum_{i=L+1}^{L+U} f_\theta(x_i) = \frac{1}{L}\sum_{i=1}^{L} y_i. \qquad (4)$$
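A quick way to see what Equation (4) demands is to compare the mean classifier output on the unlabeled examples with the mean label of the labeled examples. The sketch below is illustrative only; `balance_gap` and the sample values are assumptions, not taken from the application.

```python
def balance_gap(f_unlabeled, y_labeled):
    """Left side minus right side of Equation (4).

    f_unlabeled : decision values f(x_i) on the U unlabeled examples
    y_labeled   : labels y_i in {+1, -1} of the L labeled examples
    A value of 0 means the constraint holds exactly.
    """
    mean_output = sum(f_unlabeled) / len(f_unlabeled)
    mean_label = sum(y_labeled) / len(y_labeled)
    return mean_output - mean_label

y_labeled = [1, 1, -1, -1]           # balanced classes: mean label is 0
balanced = [1.5, -1.5, 0.5, -0.5]    # outputs average to 0
one_sided = [2.0, 1.0, 3.0, 2.0]     # every example pushed to the positive side
```

The one-sided outputs correspond to the degenerate solution described above, in which all unlabeled examples fall in one class with a large margin; the nonzero gap shows that the constraint rules it out.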







The TSVM optimization problem as expressed in Equation (3) is not convex, and minimizing a non-convex objective function can be very difficult. The “Concave-Convex Procedure” (CCCP) is a procedure for solving non-convex problems that can be expressed as the sum of a convex function and a concave function. The CCCP procedure is generally described in A. L. Yuille and A. Rangarajan, “The Concave-Convex Procedure (CCCP)”, Advances in Neural Information Processing Systems 14, MIT Press, Cambridge, Mass., 2002. However, in order to apply the CCCP procedure to the TSVM optimization problem, it is necessary to express the objective function in terms of a convex function and a concave function, and to take into account the balancing constraint.
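The mechanics of CCCP can be seen on a scalar toy problem that is unrelated to SVMs. The sketch below assumes the objective J(x) = x⁴ − 2x², split into the convex part x⁴ and the concave part −2x²; this objective and the name `cccp_scalar` are chosen purely for illustration.

```python
import math

def J(x):
    """Non-convex toy objective with local minima at x = -1 and x = +1."""
    return x ** 4 - 2 * x ** 2

def cccp_scalar(x0, tol=1e-12, max_iter=200):
    """Minimize J with CCCP, starting from x0.

    Convex part  : vex(x) = x**4
    Concave part : cav(x) = -2*x**2, with derivative cav'(x) = -4*x
    Each iteration replaces cav by its tangent at the current point x_t and
    solves the convex problem  min_x x**4 + cav'(x_t) * x.  Its stationarity
    condition 4*x**3 = 4*x_t gives x = cbrt(x_t) in closed form.
    """
    x = x0
    for _ in range(max_iter):
        x_new = math.copysign(abs(x) ** (1.0 / 3.0), x)  # real cube root
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

x_star = cccp_scalar(3.0)
```

Starting from x0 = 3, the iterates decrease toward the local minimum at x = 1, and the objective value never increases along the way. A negative starting point would lead to x = −1 instead, which illustrates that CCCP finds a local rather than a global solution.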



FIG. 8A illustrates a method for training the TSVM according to an embodiment of the present invention. The method optimizes the TSVM objective function in order to determine a hyperplane classifier that classifies all of the unlabeled data. FIG. 8B is pseudo code of the method of FIG. 8A. One skilled in the art will readily be able to associate portions of the pseudo code of FIG. 8B with the steps of FIG. 8A.


At step 802, the TSVM objective function is decomposed into a convex function and a concave function. A convex function is a function that always lies over its tangent, and a concave function is a function that always lies under its tangent. The TSVM objective function is expressed in Equation (3). As described above and illustrated in FIG. 7, the Ramp Loss (702) can be rewritten as the sum of two Hinge losses (704 and 706), such that:






$$R_s(z)=H_1(z)-H_s(z). \qquad (5)$$


H1(z) is convex and −Hs(z) is concave.
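Equation (5) and the loss of Equation (2) can be sketched in a few lines. The code assumes that Hs generalizes H1 as Hs(t) = max(0, s − t); the text defines only H1 explicitly, so this form, like the function names below, is an assumption.

```python
def hinge(t, s=1.0):
    """Generalized Hinge Loss H_s(t) = max(0, s - t); s = 1 gives H1 (assumed form)."""
    return max(0.0, s - t)

def ramp(t, s):
    """Ramp Loss R_s(t) = H1(t) - H_s(t), per Equation (5)."""
    return hinge(t, 1.0) - hinge(t, s)

def symmetric_ramp(z, s):
    """Loss of Equation (2) for an unlabeled output z: R_s(z) + R_s(-z)."""
    return ramp(z, s) + ramp(-z, s)

s = -0.5
values = {z: symmetric_ramp(z, s) for z in (-2.0, -0.25, 0.0, 0.25, 2.0)}
```

With s = −0.5 the loss is constant for |z| ≤ 0.5, which is the flat portion controlled by s, and each ramp is capped at 1 − s for strongly misclassified points. Under the assumed form of Hs, setting s = 0 gives H1(|z|) plus a constant of 1, which has the same minimizer as the Symmetric Hinge.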

Based on the decomposition of the Ramp Loss function for the unlabeled data, the TSVM objective function $J^s(\theta)$ can be decomposed into the sum of a convex function $J^s_{vex}(\theta)$ and a concave function $J^s_{cav}(\theta)$ as follows:














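$$
\begin{aligned}
J^s(\theta) &= \frac{1}{2}\|\omega\|^2 + C\sum_{i=1}^{L} H_1\bigl(y_i f_\theta(x_i)\bigr) + C^*\sum_{i=L+1}^{L+2U} R_s\bigl(y_i f_\theta(x_i)\bigr) \\
&= \underbrace{\frac{1}{2}\|\omega\|^2 + C\sum_{i=1}^{L} H_1\bigl(y_i f_\theta(x_i)\bigr) + C^*\sum_{i=L+1}^{L+2U} H_1\bigl(y_i f_\theta(x_i)\bigr)}_{J^s_{vex}(\theta)} \; \underbrace{-\,C^*\sum_{i=L+1}^{L+2U} H_s\bigl(y_i f_\theta(x_i)\bigr)}_{J^s_{cav}(\theta)}. \qquad (6)
\end{aligned}
$$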







The convex function $J^s_{vex}(\theta)$ of Equation (6) can be reformulated and expressed using dual variables α using the standard notation of SVMs. The dual variables α are Lagrangian variables corresponding to the constraints of the SVM.


At step 804, the balancing constraint is introduced to the decomposed objective function for the CCCP procedure. Enforcing the balancing constraint of Equation (4) can be achieved by introducing an extra Lagrangian variable α0 and an example (or data point) x0 explicitly defined by:








$$\Phi(x_0)=\frac{1}{U}\sum_{i=L+1}^{L+U}\Phi(x_i),$$




with label y0=1. Thus, if we denote by K the kernel matrix such that

$$K_{ij}=\Phi(x_i)\cdot\Phi(x_j),$$


the column corresponding to the example x0 is calculated as follows:










$$K_{i0}=K_{0i}=\frac{1}{U}\sum_{j=L+1}^{L+U}\Phi(x_j)\cdot\Phi(x_i) \quad \forall i. \qquad (7)$$







The computation of this column can be achieved efficiently by calculating it once, or by approximating Equation (7) using a known sampling method.
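Because Φ(xj)·Φ(xi) is simply the kernel entry Kji, the column of Equation (7) is the average of the unlabeled columns of the kernel matrix. The following minimal sketch computes it once from a precomputed matrix; the linear kernel, the list-of-rows layout, and the function names are assumptions made for illustration.

```python
def balancing_column(K, L, U):
    """Kernel entries K_i0 between every example x_i and the virtual example x0.

    K is the (L+U) x (L+U) kernel matrix as a list of rows, with the labeled
    examples first (0-based indices 0..L-1) and the unlabeled examples after
    them (indices L..L+U-1). Follows Equation (7).
    """
    unlabeled = range(L, L + U)
    return [sum(K[i][j] for j in unlabeled) / U for i in range(L + U)]

def linear_kernel_matrix(xs):
    """Kernel matrix for the linear kernel, where phi is the identity map."""
    return [[sum(a * b for a, b in zip(u, v)) for v in xs] for u in xs]

xs = [(1.0, 0.0), (0.0, 1.0), (2.0, 2.0), (4.0, 0.0)]  # L = 2 labeled, U = 2 unlabeled
K = linear_kernel_matrix(xs)
col = balancing_column(K, L=2, U=2)
```

For the linear kernel the result can be checked by hand: the mean of the two unlabeled points is (3, 1), and each entry of `col` is the inner product of an example with that mean.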


At step 806, a hyperplane classifier is initialized. As described above, a hyperplane classifier which classifies the data into two classes is defined in terms of θ=(ω,b). An initial estimate for the hyperplane classifier can be determined using an SVM solution on only the labeled data points. This step is shown at 850 in the pseudo code of FIG. 8B.


At step 808, an initial approximation of the concave function is calculated. This initial approximation is a local approximation of the concave function at the initialized hyperplane classifier. For example, a tangent of the concave function $J^s_{cav}(\theta)$ at the initial hyperplane classifier θ0 can be used as an initial estimate of the concave function $J^s_{cav}(\theta)$. A first order approximation of the concave part $J^s_{cav}(\theta)$ of the TSVM objective function can be calculated as:










$$\beta_i = y_i\,\frac{\partial J^s_{cav}(\theta)}{\partial f_\theta(x_i)} = \begin{cases} C^* & \text{if } y_i f_\theta(x_i) < s \\ 0 & \text{otherwise} \end{cases}, \qquad (8)$$







for unlabeled examples (i.e., i≧L+1). The concave function $J^s_{cav}(\theta)$ does not depend on labeled examples (i≦L), so βi=0 for all i≦L. The initial approximation $\beta_i^0$ can be calculated using Equation (8). This step is shown at 852 in the pseudo code of FIG. 8B.
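Equation (8) translates almost directly into code. The sketch below is a minimal illustration that assumes the decision values and labels are stored as plain lists, with the L labeled examples first and the 2U duplicated unlabeled examples after them; the function name and sample values are hypothetical.

```python
def compute_beta(decision_values, labels, L, C_star, s):
    """Equation (8): beta_i = C* if y_i * f(x_i) < s, else 0, for unlabeled i.

    decision_values and labels cover the L labeled examples followed by the
    2U duplicated unlabeled examples; beta_i is 0 for every labeled example.
    """
    beta = []
    for i, (f, y) in enumerate(zip(decision_values, labels)):
        if i >= L and y * f < s:
            beta.append(C_star)
        else:
            beta.append(0.0)
    return beta

# L = 1 labeled example; U = 2 unlabeled examples duplicated with labels +1 and -1.
f = [0.3, 1.2, -0.1, 1.2, -0.1]
y = [1, 1, 1, -1, -1]
beta = compute_beta(f, y, L=1, C_star=0.5, s=-0.3)
```

Only the copy whose pseudo-label strongly disagrees with the current output (here the −1 copy of the example with output 1.2) receives a nonzero βi.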


At step 810, a convex problem combining the convex function $J^s_{vex}(\theta)$ of the TSVM objective function and the approximation of the concave function at the current hyperplane is solved. The convex function $J^s_{vex}(\theta)$ is combined with the approximation of the concave function such that the resulting function remains convex. Since the resulting problem is convex, any efficient convex optimization algorithm can be applied. For example, the resulting problem may be solved using a known SVM algorithm. Therefore, as in known algorithms used in SVMs, the primal minimization problem can be transformed into a dual maximization problem. This step is shown at 854 in the pseudo code of FIG. 8B.
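One way to write the convex problem of step 810 explicitly is derived here from Equations (6) and (8) rather than quoted from the application. Because the decision function is affine in θ=(ω,b), the tangent of the concave part at the current iterate θ^t is linear in the outputs fθ(xi):

```latex
\begin{align*}
J^s_{cav}(\theta)
  &\approx J^s_{cav}(\theta^t)
   + \sum_{i=L+1}^{L+2U}
     \left.\frac{\partial J^s_{cav}}{\partial f_\theta(x_i)}\right|_{\theta^t}
     \bigl(f_\theta(x_i) - f_{\theta^t}(x_i)\bigr) \\
  &= \text{const} + \sum_{i=L+1}^{L+2U} \beta_i^t \, y_i \, f_\theta(x_i),
  \qquad \text{using } y_i^2 = 1 \text{ and }
  \beta_i^t = y_i \left.\frac{\partial J^s_{cav}}{\partial f_\theta(x_i)}\right|_{\theta^t}, \\
\theta^{t+1}
  &= \arg\min_\theta \; J^s_{vex}(\theta)
   + \sum_{i=L+1}^{L+2U} \beta_i^t \, y_i \, f_\theta(x_i).
\end{align*}
```

The added term is linear in θ, so the sum with the convex function stays convex, which is the property that step 810 relies on.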


At step 812, an updated hyperplane classifier is determined based on the solution to the convex problem. The parameters θ=(ω,b) of the hyperplane classifier are updated based on the solution to the convex problem of step 810. This step is shown at 856 and 858 in the pseudo code of FIG. 8B. At 856, ω is updated, and at 858, b is updated.


At step 814, an updated approximation for the concave function of the TSVM objective function is calculated based on the updated hyperplane classifier. Using Equation (8), a first order local approximation (i.e., the tangent) of the concave function is calculated at the updated hyperplane. This step is shown at 860 in the pseudo code of FIG. 8B.


At step 816, it is determined whether the solution has converged. It is possible to determine whether the solution has converged by comparing the updated approximation for the concave function (β^{t+1}) based on the updated hyperplane with the previous approximation for the concave function (β^t). If the updated approximation is equal to the previous approximation, then the solution has converged. This step is shown at 862 in the pseudo code of FIG. 8B. If the solution has converged, the method stops and the updated hyperplane is the final solution. If the solution has not converged, the method repeats steps 810-816 until the method converges.
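The iteration of steps 806 through 816 can be sketched as a generic driver. This is a minimal outline under stated assumptions: the convex solver and the tangent computation are supplied by the caller as placeholder callables, and the scalar toy problem used to exercise the loop is illustrative and is not a TSVM.

```python
def cccp_loop(theta0, solve_convex, compute_beta, max_iter=100):
    """Outer loop of steps 806-816 with the beta-based stopping rule.

    theta0       : initial classifier parameters (step 806)
    solve_convex : callable(beta) -> theta, solves the convex problem (810, 812)
    compute_beta : callable(theta) -> list, tangent of the concave part (808, 814)
    Both callables are placeholders supplied by the caller.
    Returns (theta, beta, number of iterations performed).
    """
    theta = theta0
    beta = compute_beta(theta)                 # step 808
    for iteration in range(1, max_iter + 1):
        theta = solve_convex(beta)             # steps 810 and 812
        new_beta = compute_beta(theta)         # step 814
        if new_beta == beta:                   # step 816: approximation unchanged
            return theta, beta, iteration
        beta = new_beta
    return theta, beta, max_iter

# Toy scalar instance (illustrative only): vex(w) = 0.5*(w - 1)**2 and
# cav(w) = -2*max(0, 0.5 - w), whose tangent slope is 2 if w < 0.5 else 0.
toy_beta = lambda w: [2.0 if w < 0.5 else 0.0]
toy_solve = lambda beta: 1.0 - beta[0]         # argmin of vex(w) + beta*w
w_a, beta_a, iters_a = cccp_loop(0.0, toy_solve, toy_beta)
w_b, beta_b, iters_b = cccp_loop(2.0, toy_solve, toy_beta)
```

The two starting points end at different stationary points (w = −1 and w = 1), so the quality of the initial classifier matters; this is consistent with initializing from an SVM trained on the labeled data.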


This method is guaranteed to converge to a solution in finite time because the variable β can only take a finite number of values and because J(θ^t) decreases with every iteration. As described herein, a CCCP-TSVM is used to classify data. Training a CCCP-TSVM amounts to solving a series of SVM optimization problems with L+2U variables. Although conventional SVM training has a worst case complexity of O((L+2U)^3), it typically scales quadratically. The CCCP-TSVM method described above also scales quadratically, similar to conventional SVM training. This is faster than the conventional TSVM training methods.


The steps of the method described herein may be performed by computers containing processors which are executing computer program code which defines the functionality described herein. Such computers are well known in the art, and may be implemented, for example, using well known computer processors, memory units, storage devices, computer software, and other components. A high level block diagram of such a computer is shown in FIG. 9. Computer 902 contains a processor 904 which controls the overall operation of computer 902 by executing computer program instructions which define such operation. The computer program instructions may be stored in a storage device 912 (e.g., magnetic disk) and loaded into memory 910 when execution of the computer program instructions is desired. Thus, the operation of computer 902 is defined by computer program instructions stored in memory 910 and/or storage 912 and the computer 902 will be controlled by processor 904 executing the computer program instructions. Computer 902 also includes one or more network interfaces 906 for communicating with other devices via a network. Computer 902 also includes input/output 908 which represents devices which allow for user interaction with the computer 902 (e.g., display, keyboard, mouse, speakers, buttons, etc.). One skilled in the art will recognize that an implementation of an actual computer will contain other components as well, and that FIG. 9 is a high level representation of some of the components of such a computer for illustrative purposes. One skilled in the art will also recognize that the functionality described herein may be implemented using hardware, software, and various combinations of hardware and software.


The foregoing Detailed Description is to be understood as being in every respect illustrative and exemplary, but not restrictive, and the scope of the invention disclosed herein is not to be determined from the Detailed Description, but rather from the claims as interpreted according to the full breadth permitted by the patent laws. It is to be understood that the embodiments shown and described herein are only illustrative of the principles of the present invention and that various modifications may be implemented by those skilled in the art without departing from the scope and spirit of the invention. Those skilled in the art could implement various other feature combinations without departing from the scope and spirit of the invention.

Claims
  • 1. A method for training a transductive support vector machine based on an objective function, the objective function based on a set of labeled training data classified into first and second classes and a set of unlabeled test data, comprising: decomposing the objective function into a sum of a concave function and a convex function; and determining a hyperplane classifier classifying the unlabeled data into the first and second classes by iteratively approximating the concave and convex functions.
  • 2. The method of claim 1, wherein said step of determining a hyperplane classifier comprises: (a) determining an initial hyperplane classifier; (b) calculating a local approximation of the concave function at the initial hyperplane classifier; (c) solving a convex problem formed by combining the local approximation to the concave function with the convex function; (d) determining an updated hyperplane classifier based on the solution to the convex problem; (e) calculating a local approximation of the concave function at the updated hyperplane classifier; and (f) repeating steps (c)-(e) until the local approximation of the concave function at the updated hyperplane classifier converges.
  • 3. The method of claim 2, wherein step (a) comprises: determining the initial hyperplane classifier using a support vector machine based on the labeled data.
  • 4. The method of claim 2, wherein each of steps (b) and (e) comprises: calculating a tangent of the concave function at a hyperplane classifier.
  • 5. The method of claim 2, wherein said step of determining a hyperplane classifier further comprises: (g) enforcing a balancing constraint for balancing a ratio of the unlabeled data classified into the first and second classes and a ratio of the labeled data classified into the first and second classes.
  • 6. The method of claim 1, wherein the objective function is based on a non-convex loss function for the unlabeled data, and said step of decomposing the objective function comprises: decomposing the loss function for the unlabeled data into a convex Hinge loss function and a concave Hinge loss function.
  • 7. The method of claim 6, wherein said loss function for the unlabeled data comprises a Symmetric Ramp Loss function achieved by duplicating all of the unlabeled data in order to associate a cost with classifying all of the unlabeled data into both of the first and second classes.
  • 8. The method of claim 1, further comprising: mapping data into a high dimensional feature space to generate the labeled training data and the unlabeled test data.
  • 9. The method of claim 8, further comprising: classifying the real world data using the hyperplane classifier.
  • 10. A computer readable medium storing computer program instructions for training a transductive support vector machine based on an objective function, the objective function based on a set of labeled training data classified into first and second classes and a set of unlabeled test data, said computer program instructions defining the steps comprising: decomposing the objective function into a sum of a concave function and a convex function; and determining a hyperplane classifier classifying the unlabeled data into the first and second classes by iteratively approximating the concave and convex functions.
  • 11. The computer readable medium of claim 10, wherein the computer program instructions defining the step of determining a hyperplane classifier comprise: computer program instructions defining the steps of: (a) determining an initial hyperplane classifier; (b) calculating a local approximation of the concave function at the initial hyperplane classifier; (c) solving a convex problem formed by combining the local approximation to the concave function with the convex function; (d) determining an updated hyperplane classifier based on the solution to the convex problem; (e) calculating a local approximation of the concave function at the updated hyperplane classifier; and (f) repeating steps (c)-(e) until the local approximation of the concave function at the updated hyperplane classifier converges.
  • 12. The computer readable medium of claim 11, wherein the computer program instructions defining step (a) comprise computer program instructions defining the step of: determining the initial hyperplane classifier using a support vector machine based on the labeled data.
  • 13. The computer readable medium of claim 11, wherein the computer program instructions defining each of steps (b) and (e) comprise computer program instructions defining the step of: calculating a tangent of the concave function at a hyperplane classifier.
  • 14. The computer readable medium of claim 11, wherein the computer program instructions defining the step of determining a hyperplane classifier further comprise computer program instructions defining the step of: (g) enforcing a balancing constraint for balancing a ratio of the unlabeled data classified into the first and second classes and a ratio of the labeled data classified into the first and second classes.
  • 15. The computer readable medium of claim 11, wherein the objective function is based on a non-convex loss function for the unlabeled data, and the computer program instructions defining the step of decomposing the objective function comprise computer program instructions defining the step of: decomposing the loss function for the unlabeled data into a convex Hinge loss function and a concave Hinge loss function.
  • 16. The computer readable medium of claim 15, wherein said loss function for the unlabeled data comprises a Symmetric Ramp Loss function achieved by duplicating all of the unlabeled data in order to associate a cost with classifying all of the unlabeled data into both of the first and second classes.
  • 17. The computer readable medium of claim 1, further comprising computer program instructions defining the step of: mapping data into a high dimensional feature space to generate the labeled training data and the unlabeled test data.
  • 18. The computer readable medium of claim 17, further comprising computer program instructions defining the step of: classifying the real world data using the hyperplane classifier.
  • 19. An apparatus for training a transductive support vector machine based on an objective function, the objective function based on a set of labeled training data classified into first and second classes and a set of unlabeled test data, comprising: means for decomposing the objective function into a sum of a concave function and a convex function; and means for determining a hyperplane classifier classifying the unlabeled data into the first and second classes by iteratively approximating the concave and convex functions.
  • 20. The apparatus of claim 19, wherein said means for determining a hyperplane classifier comprises: means for initializing parameters of a hyperplane classifier; means for calculating a local approximation of the concave function based on the parameters of the hyperplane classifier; means for solving a convex problem formed by combining the local approximation to the concave function with the convex function; and means for updating the parameters of the hyperplane classifier based on the solution to the convex problem.
  • 21. The apparatus of claim 20, wherein said means for determining a hyperplane classifier further comprises: means for determining when the parameters of the hyperplane classifier converge.
  • 22. The apparatus of claim 20, wherein said means for initializing parameters of a hyperplane classifier comprises: means for determining the parameters of the hyperplane classifier using a support vector machine based on the labeled data.
  • 23. The apparatus of claim 20, wherein said means for calculating a local approximation of the concave function comprises: means for calculating a tangent of the concave function at the parameters of the hyperplane classifier.
  • 24. The apparatus of claim 20, wherein said means for determining a hyperplane classifier further comprises: means for enforcing a balancing constraint for balancing a ratio of the unlabeled data classified into the first and second classes and a ratio of the labeled data classified into the first and second classes.
  • 25. The apparatus of claim 19, wherein the objective function is based on a non-convex loss function for the unlabeled data, and said means for decomposing the objective function comprises: means for decomposing the loss function for the unlabeled data into a convex Hinge loss function and a concave Hinge loss function.
  • 26. The apparatus of claim 25, wherein said loss function for the unlabeled data comprises a Symmetric Ramp Loss function achieved by duplicating all of the unlabeled data in order to associate a cost with classifying all of the unlabeled data into both of the first and second classes.
  • 27. The apparatus of claim 19, further comprising: means for mapping data into a high dimensional feature space to generate the labeled training data and the unlabeled test data.
  • 28. The apparatus of claim 27, further comprising: means for classifying the real world data using the hyperplane classifier.
Parent Case Info

This application claims the benefit of U.S. Provisional Application No. 60/747,225 filed May 15, 2006, the disclosure of which is herein incorporated by reference.

Provisional Applications (1)
Number Date Country
60747225 May 2006 US