1. Field of the Invention
The present disclosure relates to supervised learning, and more particularly to a system and method for supervised learning for classification that unifies generative and discriminative methods in a variational framework.
2. Description of Related Art
One of the most significant approaches to machine learning to emerge in recent years is the Support Vector Machine (SVM), which follows a discriminative classification paradigm. The SVM is based on Vapnik's structural risk minimization principle. This criterion attempts to fit the training data while at the same time minimizing the “complexity” of the classifier.
Similar ideas are incorporated in Relevance Vector Machine (RVM) classifiers, which in addition aim at a sparse solution. This, in theory, enables fast classification. Learning is, however, a demanding task that remains unsolved for many difficult problems.
It should be noted that, for the SVM, reduced set methods exist that subsequently prune support vectors to reduce the number of kernels needed, at the expense of some increase in error.
As an alternative to discriminative classification, generative classification ideas are valuable, typically when the classification task needs to output confidences or classification probabilities within a larger framework. For example, hidden Markov models are the basis of many large-scale speech recognition engines, where probabilistic reasoning based on language models is needed.
Therefore, a need exists for a system and method for supervised learning for classification that unifies generative and discriminative methods in a variational framework.
According to an embodiment of the present disclosure, a computer-implemented method for supervised learning for classification that unifies generative and discriminative methods in a variational framework includes providing training data for determining a classifier, defining a cost functional based on a kernel density, finding a function δ of the cost functional by searching for a zero crossing of the probability difference p(γ=0|X)−p(γ=1|X), wherein γ is a label for a given data point X, optimizing the cost functional using a gradient descent, and outputting the classifier comprising an optimized cost functional for classifying data.
According to an embodiment of the present disclosure, a computer-implemented method for classification that unifies generative and discriminative methods in a variational framework includes providing a trained classifier, providing data to be classified, and classifying the data to be classified using the trained classifier comprising a cost functional implementing a simultaneous mixed generative and discriminative determination.
According to an embodiment of the present disclosure, a computer readable medium embodying instructions executable by a processor to perform a method for supervised learning for classification that unifies generative and discriminative methods in a variational framework is provided. The method includes providing training data for determining a classifier, defining a cost functional based on a kernel density, finding a function δ by searching for a zero crossing, optimizing the cost functional using a gradient descent according to the function δ, and outputting the classifier comprising an optimized cost functional for classifying data. The method may further include performing a classification comprising providing a trained classifier, providing data to be classified, and classifying the data to be classified using the trained classifier comprising a cost functional implementing a simultaneous mixed generative and discriminative determination.
Preferred embodiments of the present invention will be described below in more detail, with reference to the accompanying drawings:
FIGS. 2B-D and 2F-H show sparse kernel density progressions for XOR and spiral problems, respectively, according to an embodiment of the present disclosure;
FIGS. 3B-D and 3F-H show sparse kernel boundary progressions for XOR and spiral problems, respectively, according to an embodiment of the present disclosure;
According to an embodiment of the present disclosure, a method for supervised learning for classification unifies generative and discriminative methods in a variational framework. The method defines a cost functional based on a kernel density estimate (KDE) of the training data, and a decision boundary. This cost functional is minimized by estimating a function in the form of a linear combination of kernels, a sparse kernel machine. Despite the variational formulation, it is shown that the complexity of the optimization problem can be reduced since the proposed cost functional can be computed analytically. As a result, training is computationally efficient. The method has been tested with a number of data sets, which illustrate its performance on classification tasks.
In supervised learning training data is given in the form of a set of input vectors Xi∈ and associated labels γi∈Y with i=1 . . . M, to make predictions of γ for new input vectors X. Depending on whether the space Y is finite or not, the task is either called classification or regression respectively. In the problem of classification where Y={0,1}, i.e., where the labels γi are of binary nature, generative and discriminative approaches may be used. Generative approaches are based on learning a model of the joint probability, p(X,γ), or equivalently the class conditional density, p(X|γ), of the inputs X and γ, and making a prediction by using Bayes rule to calculate p(γ|X), and estimating a classifier that picks the most likely parameter γ. Discriminative approaches are designed to infer the class mapping with help of some sort of discriminant function. Discriminative classifiers often achieve higher test set accuracy than generative classifiers at the price of making ‘hard’ binary decisions that are not associated to any probability measure. Generative classifiers have the advantage of being specifically designed to return a decision and the associated probability measure. In an effort to obtain the benefits of both methods, a method for supervised learning is implemented to estimate a classifier, a variational sparse kernel machine (
Classification and Confidence Measure
According to an embodiment of the present disclosure, the goal is to recover a function g, a classifier that predicts a label γ∈Y given a data point X. If the pairs {X,γ} are associated to a probability density p(X,γ), then one can choose g such that
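As a point of reference, and assuming the standard definition of a Bayes classifier for this choice of g, the rule selects the label with the largest posterior probability. The expression below is that textbook rule written in the notation of this section; it is offered as an assumed form rather than as a verbatim restatement of eq. (1):

$$ g(X) = \arg\max_{\gamma \in Y} \; p(\gamma \mid X) $$

For Y={0,1} this reduces to g(X)=1 whenever p(γ=1|X) > p(γ=0|X), and g(X)=0 otherwise.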
This choice for g is a Bayes classifier, which can be obtained as a solution of risk minimization where incorrect labels are penalized for a certain choice of weights. Typically, only a finite set of pairs {Xi,γi}, i=1 . . . M, the training data, is provided, while the probability density p(X,γ) and the conditional probability density p(X|γ) are unknown. Since the Bayes classifier can be arbitrarily complex, the training data may not be sufficient to uniquely identify the probability density p(X,γ) or p(γ|X); the recovery of the classifier g is therefore an ill-posed problem. A problem is ill-posed when a solution does not exist, when the solution is not unique, or when a small variation in the input data causes large variations in the output (lack of stability). Coping with the ill-posedness of the problem is a difficult task; however, in many applications it may be useful to render a stable estimation of g. One way to do so is to introduce regularization during the estimation of g. This can be formulated in an energy minimization setting by introducing additional terms that constrain the solution to belong to a certain manifold, or by explicitly parameterizing g or p(γ|X).
Before describing the details of how to carry out the minimization task, an aspect of the classification problem should be noted: When predicting a label γ for a given data point X, it is important to also have the likelihood p(γ|X). Indeed, if the choice of a label γ is supported by a large p(γ|X), then one can make a decision with high confidence as opposed to the case when p(γ|X) is uniform in γ, where any choice is arbitrary. The likelihood p(γ|X) is a confidence measure. The lone classifier g may not provide such information, unless one explicitly parameterizes g as a function of the confidence measure as in eq. (1) and recovers p(γ|X). According to an embodiment of the disclosure, a confidence measure is output, therefore this strategy is followed and the problem of estimating p(γ|X) may be posed as the following minimization problem:
where ψ is a cost functional (e.g., the Kullback-Leibler pseudo-distance or an Lp norm). As only a finite data set {Xi,γi}i=1 . . . M is available, the following approximation may be used:
This approximation is accurate for a large number of samples, and the needed size of the training set increases with the dimension of the input space. One way to cope with the limitation of the available data is to use the KDE of the probability density of the training data, as described below. Then, rather than using eq. (3), one can substitute p(X,γ) and p(γ|X) with the corresponding KDE approximations q(X,γ) and q(γ|X), respectively, directly into eq. (2) and solve for p̂(γ|X). This results in
where we chose the discrepancy function ψ to be the L2 norm. Eq. (4) forms the basis of the approach.
Estimating Confidence: Generative vs. Discriminative Approaches
The problem of finding a classifier g and a confidence measure p̂(γ|X) can be posed by minimizing eq. (4). This optimization problem corresponds to a discriminative approach, as the method estimates p(γ|X). The optimization procedure emphasizes the decision boundary as the probability density p(γ|X) tends to be “flat” away from it. Indeed, in many cases, p(γ|X) resembles a smooth step function when parameterized with respect to X. If one poses the alternative problem of estimating
then the emphasis would be larger in regions away from the decision boundary. This latter minimization corresponds to the generative paradigm where the solution is an estimate of p(X|γ).
According to an embodiment of the present disclosure, one can explicitly regulate the emphasis on the decision boundary and on the inner regions. The method linearly combines both terms into a single minimization problem as follows:
where μ0 is a scalar parameter that regulates how discriminative the solution will be, and δ is a nonnegative function of X and γ, taking values in [0,∞), that emphasizes the decision boundary.
Once p̂(X|γ) has been estimated, p̂(γ|X) is recovered via Bayes rule as
and the classifier g is obtained from eq. (1).
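The recovery step described above can be sketched in a few lines. This is a minimal illustration, assuming the class-conditional densities have already been evaluated at the query points and that class priors p(γ) are available (for instance from label frequencies); the function names and the numeric values are hypothetical.

```python
import numpy as np

def posterior_from_class_conditionals(px_given_y, priors):
    """Bayes rule: p(y|X) = p(X|y) p(y) / sum_y' p(X|y') p(y')."""
    px_given_y = np.asarray(px_given_y, dtype=float)
    priors = np.asarray(priors, dtype=float)
    joint = px_given_y * priors                    # p(X, y) for each label y
    evidence = joint.sum(axis=-1, keepdims=True)   # p(X)
    return joint / evidence

def classify(posterior):
    """Pick the most probable label for each query point."""
    return np.argmax(posterior, axis=-1)

# Hypothetical density values p(X|y=0), p(X|y=1) at two query points.
px_given_y = np.array([[0.30, 0.10],
                       [0.05, 0.20]])
priors = np.array([0.5, 0.5])   # assumed equal class priors

post = posterior_from_class_conditionals(px_given_y, priors)
labels = classify(post)
```

Each row of `post` is the confidence measure for one query point, so a row close to uniform signals a low-confidence decision.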
Thus, the method includes a confidence measure p̂(γ|X) associated to a classifier g, wherein the confidence measure can be selectively constrained to be more discriminative or generative depending on the classification problem at hand. Further, the cost functional in eq. (6) can be computed in analytic form for a certain choice of representation of the various terms. This enables an efficient training procedure. Detection can be made fast by using a sparse kernel representation for the solution p̂(X|γ).
Kernel Density Estimation
As described herein, the joint probability p(γ,X) is typically not given and, hence, needs to be estimated from a set of data samples {Xi,γi}, i=1 . . . M. Methods to solve such a task can be divided into parametric and nonparametric ones. Parametric methods provide a sparse data representation, but are typically based on assumptions that are too restrictive. Nonparametric methods make use of fewer assumptions, but their representation uses the entire data set, which makes them memory demanding. Nonparametric methods are described herein as a general example of estimating the joint probability p(γ,X). The nonparametric methods obtain a KDE q(X,γ) of p(X,γ) as a sum of nonnegative kernels k, taking values in [0,∞), each of which is located at a data point of the training set. If the kernels depend only on the norm of their argument, then the KDE approximation q(X,γ) of p(X,γ) takes the form
where N is the number of samples n∈{1, . . . , M} such that γn=γ. A number of different kernels may be used depending on the problem at hand. For example, radial basis functions such as Gaussian kernels may be used. In particular, consider Gaussian kernels with isotropic covariances $\Sigma_n = \sigma_n^2 I$, where I denotes the identity matrix in the input space. Thus, eq. (8) becomes
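with mean $X_n$ and variance $\sigma_n^2$, which can be written in shorthand notation as
$$q(X|\gamma) = \zeta(\gamma)^{T} k(X) \tag{10}$$
where $\zeta(\gamma) = [\zeta(\gamma)_1, \ldots, \zeta(\gamma)_M]^{T}$ with
and
$$k(X) = [k(X; X_1, \sigma_1^2), \ldots, k(X; X_M, \sigma_M^2)]^{T}. \tag{12}$$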
Constructing the KDE q(X|γ) amounts to choosing the variances $\{\sigma_n^2\}$, n=1 . . . M, of the radial basis k(X) (also referred to as the bandwidths). This task can be performed by standard methods in the literature, for example, the “plug-in” method, which estimates the optimal bandwidth by minimizing the asymptotic mean integrated squared error (AMISE).
Since the task is to estimate the confidence measure p(γ|X), the KDE q(X|γ) can be used directly after applying Bayes rule. Note that while the confidence measure obtained from the KDE would be highly accurate, it is memory demanding, as all the training samples need to be employed in the representation, and it is computationally intensive during prediction: the cost of evaluating it scales with the number of training samples M. To overcome these disadvantages, eq. (6) may be solved by seeking a sparse approximation of the KDE, which yields the variational sparse kernel machines, as described below.
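The full (non-sparse) KDE classifier described above can be illustrated as follows. This is a sketch under stated assumptions: every sample carries the same weight 1/M in the joint estimate q(X,γ), the bandwidth is fixed by hand rather than by a plug-in rule, and the XOR-like toy data and all function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(X, centers, sigma2):
    """Isotropic Gaussian k(X; X_n, sigma_n^2 I) for every query/center pair.

    X: (Q, d) queries, centers: (M, d), sigma2: (M,) variances.
    Returns an array of shape (Q, M).
    """
    d = X.shape[1]
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    norm = (2.0 * np.pi * sigma2) ** (-d / 2.0)
    return norm * np.exp(-0.5 * sq_dist / sigma2)

def kde_posterior(X, X_train, y_train, sigma2):
    """Return p(y|X) for y in {0, 1} from per-class kernel sums and Bayes rule."""
    K = gaussian_kernel(X, X_train, sigma2)                       # (Q, M)
    joint = np.stack([K[:, y_train == c].sum(axis=1) for c in (0, 1)], axis=1)
    joint /= len(y_train)           # q(X, y): assumed uniform weight 1/M
    return joint / joint.sum(axis=1, keepdims=True)

# Toy XOR-like data: two clusters per class (illustrative only).
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y_centers = np.array([0, 0, 1, 1])
X_train = np.repeat(centers, 25, axis=0) + 0.1 * rng.standard_normal((100, 2))
y_train = np.repeat(y_centers, 25)
sigma2 = np.full(len(X_train), 0.05)   # fixed bandwidth, chosen by hand

post = kde_posterior(centers, X_train, y_train, sigma2)
pred = post.argmax(axis=1)
```

Every prediction touches all M training samples, which is the memory and run-time cost that motivates the sparse approximation.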
Variational Sparse Kernel Machines
According to an embodiment of the present disclosure, the method uses an explicit representation of the solution p̂(X|γ) of eq. (6). This representation serves a number of purposes, including regularizing the estimation problem, making the classifier computationally efficient during prediction, providing a classifier with low memory demands, and forming the basis for the computation of eq. (6) in analytic form.
Writing p̂(X|γ) as
where
$$\alpha = [\alpha_1, \ldots, \alpha_S]^{T} \tag{14}$$
and
$$h(X; V, \sigma^2) = [h(X; V_1, \sigma_1^2), \ldots, h(X; V_S, \sigma_S^2)]^{T} \tag{15}$$
for S<<M, a compressed approximation of the KDE q(X|γ) may be determined by minimizing eq. (6). Given an explicit representation of p̂(X|γ), a solution may be identified by parameters α, V and σ². Furthermore, to guarantee that the estimated function is a valid probability density in X, it is imposed that
$$\alpha_m \geq 0 \quad \forall m = 1, \ldots, S \tag{16}$$
and that
The first constraint (eq. (16)) can be achieved by using an exponential map so that each αm is parameterized in λm as
$$\alpha_m = e^{\lambda_m}. \tag{18}$$
The second constraint (eq. (17)) can be achieved by adding the following term as a soft constraint in eq. (6):
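The two constraints can be sketched as follows. The exponential map follows eq. (18); the quadratic penalty on the sum of the weights, its weight `nu`, and the function names are illustrative assumptions, since the specific form of the soft-constraint term is not fixed by this sketch.

```python
import numpy as np

def alpha_from_lambda(lam):
    """Exponential map of eq. (18): alpha_m = exp(lambda_m) is always positive."""
    return np.exp(lam)

def normalization_penalty(lam, nu=10.0):
    """Assumed soft constraint nu * (sum_m alpha_m - 1)^2 and its gradient in lambda."""
    alpha = alpha_from_lambda(lam)
    residual = alpha.sum() - 1.0
    penalty = nu * residual ** 2
    # Chain rule: d alpha_m / d lambda_m = alpha_m.
    grad = 2.0 * nu * residual * alpha
    return penalty, grad

# Start from weights that sum to 1.2 and descend on the penalty alone.
lam = np.log(np.array([0.5, 0.3, 0.4]))
for _ in range(200):
    _, grad = normalization_penalty(lam)
    lam = lam - 0.05 * grad
alpha = alpha_from_lambda(lam)
```

Because the optimization runs over λ rather than α, positivity never has to be enforced explicitly, and the penalty only has to pull the weights toward unit sum.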
Now, to form the basis for the computation of eq. (6) in analytic form, h is chosen to be composed of Gaussian kernels as done for the KDE q(X|γ), i.e., h=k. Similarly, the same choice of representation is made for the boundary function δ. When all the explicit representations are substituted for the various terms in eq. (6), an expression including the following products of Gaussians in X and V is obtained:
$$k(X; V_m, \sigma_m^2)\, k(X; V_j, \sigma_j^2)\, k(X; X_i, \sigma_i^2) \tag{20}$$
and
$$k(X; V_m, \sigma_m^2)\, k(X; X_j, \sigma_j^2)\, k(X; X_i, \sigma_i^2). \tag{21}$$
It can be shown that the integral in X of each such product of Gaussians is again a product of two Gaussians evaluated at some combination of the data points Xi and/or the vectors Vm. For example, eq. (20) yields
An analytic form of the integral in eq. (6) can be obtained when the chosen kernels are Gaussians. This enables the determination of analytic forms for the gradients of the cost functional (6) with respect to the unknowns. Further, the minimization may be performed efficiently by gradient descent, an example of which is disclosed below. The sparseness of the explicit representation of p̂(X|γ) and the variational method used to recover its parameters motivate the name variational sparse kernel machines.
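The statement about the integral of a product of three Gaussians can be checked numerically. The sketch below works in one dimension for simplicity and uses the standard Gaussian product identity; the function names and the numeric arguments are illustrative, and the closed form shown is the textbook identity rather than a transcription of the source's own expression.

```python
import numpy as np

def gauss(x, mean, var):
    """One-dimensional Gaussian density with the given mean and variance."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def triple_product_integral(v_m, s_m, v_j, s_j, x_i, s_i):
    """Closed form of the integral over X of
    k(X; v_m, s_m) k(X; v_j, s_j) k(X; x_i, s_i), where s_* are variances."""
    # The product of the first two Gaussians is a constant times one Gaussian in X.
    s_mj = s_m * s_j / (s_m + s_j)
    mu_mj = (s_j * v_m + s_m * v_j) / (s_m + s_j)
    # Integrating that Gaussian against the third leaves a second Gaussian factor.
    return gauss(v_m, v_j, s_m + s_j) * gauss(mu_mj, x_i, s_mj + s_i)

# Compare against a plain Riemann sum on a wide, fine grid.
x = np.linspace(-20.0, 20.0, 400001)
dx = x[1] - x[0]
args = (0.3, 0.5, -0.4, 0.8, 1.1, 0.2)   # v_m, s_m, v_j, s_j, x_i, s_i
numeric = np.sum(gauss(x, args[0], args[1])
                 * gauss(x, args[2], args[3])
                 * gauss(x, args[4], args[5])) * dx
closed = triple_product_integral(*args)
```

The result is a product of two Gaussian evaluations that depend only on the centers and variances, which is what allows the cost functional to be evaluated without numerical integration.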
Estimating Sparse Kernel Machines
The method estimates the free parameters α̂, V̂ and σ̂² that minimize eq. (6). The variational nature of the formulation is exploited by employing a steepest gradient descent. The gradient ∇E of the cost functional E is determined as
with respect to each unknown, and the following set of equations is evolved
$$\lambda^{(t+1)} = \lambda^{(t)} - \varepsilon \nabla_{\lambda} E$$
$$\sigma^{(t+1)} = \sigma^{(t)} - \varepsilon \nabla_{\sigma} E$$
$$V^{(t+1)} = V^{(t)} - \varepsilon \nabla_{V} E \tag{24}$$
for the steepest choice of ε. Although this method is guaranteed to decrease the cost functional E, it is a local method and may therefore converge to a local minimum. Hence, the parameters need to be initialized not too far from the global minimum. To do so, a clustering algorithm is used that provides an initial estimate of the centers of the Gaussians as well as their spreads σ and the parameters λ.
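The update rule of eq. (24) can be illustrated on a reduced problem. The sketch below is not the full method: it assumes one-dimensional data, keeps only an L2 distance between the sparse model and the KDE (no boundary function δ and no normalization penalty), uses finite-difference gradients instead of analytic ones, a fixed step size, and a hand-picked initialization in place of clustering. All names are hypothetical.

```python
import numpy as np

def gauss(x, mean, var):
    """One-dimensional Gaussian density with the given mean and variance."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def cost(theta, x_train, s2):
    """Integral of (p_hat - q)^2 over X, up to a constant, in closed form.

    theta stacks lambda (log-weights), sigma (spreads) and v (centers).
    """
    lam, sigma, v = np.split(theta, 3)
    alpha, var = np.exp(lam), sigma ** 2
    # Integral of k_m * k_j over X is a Gaussian in the centers.
    A = gauss(v[:, None], v[None, :], var[:, None] + var[None, :])
    # Integral of k_m * q over X, with q the KDE of bandwidth variance s2.
    b = gauss(v[:, None], x_train[None, :], var[:, None] + s2).mean(axis=1)
    return alpha @ A @ alpha - 2.0 * alpha @ b

def numerical_gradient(f, theta, h=1e-6):
    """Central finite differences, standing in for the analytic gradient."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = h
        grad[i] = (f(theta + step) - f(theta - step)) / (2.0 * h)
    return grad

rng = np.random.default_rng(0)
x_train = np.concatenate([rng.normal(-2.0, 0.5, 100), rng.normal(2.0, 0.5, 100)])
s2 = 0.1                                   # KDE bandwidth (variance), fixed by hand
theta = np.concatenate([np.log([0.5, 0.5]), [1.0, 1.0], [-1.0, 1.0]])

def objective(t):
    return cost(t, x_train, s2)

history = [objective(theta)]
eps = 0.2                                  # fixed step size (assumption)
for _ in range(500):
    theta = theta - eps * numerical_gradient(objective, theta)
    history.append(objective(theta))
lam, sigma, v = np.split(theta, 3)
```

Here two kernels summarize two hundred samples, and the centers `v` drift from the initial guess toward the two cluster means as the cost in `history` decreases.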
Enhancing the Decision Boundary
A function δ that emphasizes the decision boundary was introduced above. This function is represented by a sum of Gaussians. In the case of binary classification, the decision boundary is naturally defined as the set of locations X where p(γ=0|X)=p(γ=1|X). One can obtain sample locations on the decision boundary by considering all possible pairs of data points and searching for a zero of p(γ=0|X)−p(γ=1|X) along the segment joining each pair.
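The pairwise search for boundary samples can be sketched with a bisection along each segment. This is one possible realization under assumptions: bisection is used as the zero-finding routine, only segments whose endpoints show a sign change are searched, and the toy difference function stands in for p(γ=0|X)−p(γ=1|X). All names are hypothetical.

```python
import numpy as np
from itertools import combinations

def boundary_samples(X_train, diff, tol=1e-6, max_iter=60):
    """Locate zeros of diff(X) on segments joining pairs of data points."""
    samples = []
    for a, b in combinations(range(len(X_train)), 2):
        fa, fb = diff(X_train[a]), diff(X_train[b])
        if fa * fb >= 0:
            continue          # no sign change: bisection cannot bracket a zero
        lo, hi = 0.0, 1.0     # position along the segment from point a to point b
        for _ in range(max_iter):
            mid = 0.5 * (lo + hi)
            fm = diff((1.0 - mid) * X_train[a] + mid * X_train[b])
            if fa * fm <= 0:
                hi = mid              # zero lies in the first half
            else:
                lo, fa = mid, fm      # zero lies in the second half
            if hi - lo < tol:
                break
        t = 0.5 * (lo + hi)
        samples.append((1.0 - t) * X_train[a] + t * X_train[b])
    return np.array(samples)

def toy_diff(X):
    """Stand-in for p(y=0|X) - p(y=1|X): the boundary is the line x0 = 0.5."""
    return np.tanh(2.0 * (X[0] - 0.5))

X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.2, 0.5]])
samples = boundary_samples(X_train, toy_diff)
```

The returned locations can then serve as centers for the Gaussians that represent δ. The number of pairs grows quadratically with the number of data points, so a practical variant might subsample the pairs.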
Experiments
To illustrate a method for classification according to an embodiment of the present disclosure, and to show its suitability for challenging classification problems, a number of simulations are presented below.
As a first example, consider the “XOR” problem—FIGS. 2A-D and
FIGS. 2B-D and FIGS. 3B-D show an example of the evolution of the method. FIGS. 2B-D and 3B-D, and FIGS. 2F-H and 3F-H, correspond to the training after 1, 3, and 5 iterations, respectively. FIGS. 2A-D show the sparse kernel density difference p̂(X|γ1)−p̂(X|γ2), and FIGS. 3A-D show the feature space partitioning based on the current iteration. In addition, the mean vectors Vm, standard deviations σm, and mixing parameters αm of the 14 sparse kernels are illustrated by means of the center, radius, and hue value of the displayed circles, respectively.
It can be clearly seen that, given a reasonable initialization, the learning algorithm converges to a close approximation of the KDE classifier. In the experiments, the initialization was taken from hierarchical clustering. It can be concluded from
Referring to
Exemplary implementations of the classifier may perform feature space analysis, and pattern recognition, and more particularly speech recognition, face detection, and object detection.
According to an embodiment of the present disclosure, a system and method for supervised learning for the purpose of classification exploits the advantages of generative and discriminative methods in a variational framework. A sparse representation of a confidence measure is recovered by means of a mixture of radial basis functions. The estimate of the confidence measure is a variational sparse kernel machine. The system and method regularize the estimation problem, make the classifier computationally efficient during prediction, keep its memory demands low, and allow the computation of eq. (6) to be carried out in analytic form.
It is to be understood that the present invention may be implemented in various forms of hardware, software, firmware, special purpose processors, or a combination thereof. In one embodiment, the present invention may be implemented in software as an application program tangibly embodied on a program storage device. The application program may be uploaded to, and executed by, a machine comprising any suitable architecture.
Referring to
The computer platform 501 also includes an operating system and microinstruction code. The various processes and functions described herein may either be part of the microinstruction code or part of the application program (or a combination thereof), which is executed via the operating system. In addition, various other peripheral devices may be connected to the computer platform such as an additional data storage device and a printing device.
It is to be further understood that, because some of the constituent system components and method steps depicted in the accompanying figures may be implemented in software, the actual connections between the system components (or the process steps) may differ depending upon the manner in which the present invention is programmed. Given the teachings of the present invention provided herein, one of ordinary skill in the related art will be able to contemplate these and similar implementations or configurations of the present invention.
Having described embodiments for a system and method for supervised learning for classification that unifies generative and discriminative methods in a variational framework, it is noted that modifications and variations can be made by persons skilled in the art in light of the above teachings. It is therefore to be understood that changes may be made in embodiments of the present disclosure that are within the scope and spirit thereof.
This application claims the benefit of Provisional Application No. 60/738,233 filed on Nov. 18, 2005 in the United States Patent and Trademark Office, the contents of which are herein incorporated by reference in their entirety.
Number | Date | Country
---|---|---
60738233 | Nov 2005 | US