This invention relates to learning machines and, more particularly, to supervised learning systems and methods using hidden information.
In the classical supervised machine learning paradigm, training examples are represented by vectors of attributes, a teacher supplies labels for each training example, and a learning machine learns a decision rule using this data.
In actuality, however, the teacher can supply, along with the training data, some additional information that will not be available at the test stage. Consider, for example, an algorithm that learns a decision rule for the prognosis of a disease in a year, given the current symptoms of a patient. In this example, additional information about the symptoms in six months can be provided along with the training data that contains the current symptoms and the outcome in a year. This additional information about the symptoms in six months may be helpful for predicting the outcome of the disease in a year.
Accordingly, a machine learning method that uses hidden information is needed.
A method is disclosed herein for use in describing a phenomenon of interest. The method comprises the steps of: providing training data relating to the phenomenon of interest and labels for labeling the training data; providing hidden information about the training data or directed distances obtained from the hidden information; and computing a decision rule for use in describing the phenomenon of interest using the training data, the labels, and the hidden information or directed distances.
A machine learning system is further disclosed herein for use in describing a phenomenon of interest. The machine learning system comprises: a first input module for providing training data relating to the phenomenon of interest and labels for labeling the training data; a second input module for providing hidden information about the training data or directed distances obtained from the hidden information; and a processor executing a first set of instructions for computing a decision rule for use in describing the phenomenon of interest using the training data, the labels for labeling the training data, and the hidden information about the training data or the directed distances.
Also disclosed herein is a method for training a learning machine. The method comprises the steps of: providing training data relating to a phenomenon of interest and labels for labeling the training data; providing hidden information about the training data or directed distances obtained from the hidden information; and computing a decision rule for use in describing the phenomenon of interest using the training data, the labels, and the hidden information or directed distances.
A supervised machine learning method and system is disclosed herein that uses training data, labels for the training data, and hidden information about the training data to generate an improved decision rule. The hidden information about the training data may belong to a space which is different from the training data space. In addition, the hidden information does not lead to additional training data; instead, it contains an additional description of the training data.
The machine learning method and system of the present disclosure, according to one embodiment, uses a support vector machine (SVM) type of algorithm. Other embodiments of the method and system may use other types of supervised learning algorithms. As is well known in the art, the SVM algorithm is a supervised learning algorithm which learns a decision rule, y=f(x), from a set of functions given training data.
$$(X,Y)=\{(x_i,y_i)\}_{i=1}^{l},\qquad y_i\in\{-1,1\}.$$
To construct a decision rule y=f(x), the SVM algorithm maps vectors x∈X to vectors z∈Z and finds a hyperplane that separates the images z_i of the training vectors x_i into two classes with a minimal number of errors. Among the many possible hyperplanes, the SVM algorithm selects the optimal one, which performs this separation with maximum margin. The hyperplane is specified by a weight vector w and a threshold b, which are found by solving the quadratic optimization problem (1).
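In its standard soft-margin form, problem (1) can be written as

$$\min_{w,\,b,\,\xi}\ \ \frac{1}{2}\,\|w\|^{2}+C\sum_{i=1}^{l}\xi_i
\qquad\text{subject to}\qquad
y_i\,(w\cdot z_i+b)\ \ge\ 1-\xi_i,\quad \xi_i\ \ge\ 0,\quad i=1,\dots,l, \tag{1}$$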
where C is fixed and the ξ_i are slack variables. The SVM algorithm finds the optimal hyperplane by solving the optimization problem (1) in the dual space. The SVM algorithm does not calculate the inner product in the Z space. Instead, it uses the “kernel trick.” According to Mercer's theorem, for every inner product in the Z space there exists a positive definite function K(x_i,x_j) (the kernel function) such that z_i·z_j=K(x_i,x_j) for all i, j=1, …, l. So, only the kernel function needs to be specified to learn a nonlinear decision rule. The decision rule for the SVM algorithm has the form of equation (2).
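In terms of the kernel function, equation (2) is commonly written as

$$f(x)=\operatorname{sign}\!\Big(\sum_{i=1}^{l}\alpha_i\,y_i\,K(x_i,x)+b\Big), \tag{2}$$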
where the coefficients α_i are obtained by maximizing the functional (3).
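A standard statement of this functional, in the dual variables, is

$$W(\alpha)=\sum_{i=1}^{l}\alpha_i-\frac{1}{2}\sum_{i,j=1}^{l}\alpha_i\,\alpha_j\,y_i\,y_j\,K(x_i,x_j),
\qquad\text{subject to}\qquad
\sum_{i=1}^{l}\alpha_i\,y_i=0,\quad 0\le\alpha_i\le C,\ \ i=1,\dots,l. \tag{3}$$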
With increasing training data, the SVM solution converges to the Bayesian solution.
The machine learning method of the present disclosure also uses a support vector machine plus (SVM+) algorithm, which learns from hidden information by modeling the slack variables with that information. At the training stage of the SVM+ algorithm, the following triplets are provided:
$$(X,X^*,Y)=\{(x_i,x_i^*,y_i)\}_{i=1}^{l},\qquad y_i\in\{-1,1\},$$
where the x_i* are additional information from a space X*, which is generally different from the space X. The goal is to use the additional information X* to find a decision rule y=f(x) (in the space X) that is better than the decision rule obtained without using the additional information. The space X in which the decision rule is constructed may be called the decision space, and the space X* may be called the correction space. Compared to conventional supervised learning methods, where the teacher only provides labels for the training vectors, the teacher in the SVM+ based learning method also supplies additional descriptions, as hidden information, for the training data, e.g., the training vectors.
The SVM+ algorithm is a generalization of the SVM algorithm. It allows the slack variables ξ_i in problem (1) above to be modeled as functions of the hidden information, i.e.,
$$\xi_i=\psi(x_i^*,\delta),\qquad \delta\in D,$$
where ψ(x_i^*,δ) belongs to some set of admissible functions in X*, called the correcting functions. This is a generalization of the SVM algorithm because the SVM algorithm corresponds to the case in which X*=X and ψ(x_i,δ) ranges over the set of all possible functions. Because the slacks are no longer free variables in the optimization problem in the SVM+ algorithm, the SVM+ algorithm can depend on fewer parameters than the SVM algorithm; consequently, the decision rule found by the SVM+ algorithm is selected from a set with a smaller capacity than in the SVM algorithm, which can lead to better generalization. Similar to the mapping of the vectors x_i∈X to the vectors z_i∈Z in the decision space, the vectors x_i^*∈X* in the correction space are mapped to vectors z_i^*∈Z*. To accomplish this, two different kernels are used: the decision kernel, represented by K(·,·), and the correction kernel, represented by K*(·,·). The correcting function has the form
$$\psi(x_i^*,\delta)=w^*\cdot z_i^*+d;\qquad w^*\in Z^*,\ d\in\mathbb{R}. \tag{4}$$
Using this mapping, the slacks can be written as ξ_i=w^*·z_i^*+d. This leads to the SVM+ problem formulation of equation (5).
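In the single-group form used in the present disclosure, formulation (5) can be written as

$$\min_{w,\,b,\,w^*,\,d}\ \ \frac{1}{2}\,\|w\|^{2}+\frac{\gamma}{2}\,\|w^*\|^{2}+C\sum_{i=1}^{l}\big(w^*\cdot z_i^*+d\big)$$

$$\text{subject to}\qquad y_i\,(w\cdot z_i+b)\ \ge\ 1-\big(w^*\cdot z_i^*+d\big),\qquad w^*\cdot z_i^*+d\ \ge\ 0,\qquad i=1,\dots,l, \tag{5}$$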
where γ and C are parameters. The dual optimization problem corresponding to equation (5), obtained using the kernel trick, is given by equation (6).
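One statement of this dual, consistent with formulation (5) above, is: maximize over the α_i and β_i the functional

$$\sum_{i=1}^{l}\alpha_i-\frac{1}{2}\sum_{i,j=1}^{l}\alpha_i\,\alpha_j\,y_i\,y_j\,K(x_i,x_j)-\frac{1}{2\gamma}\sum_{i,j=1}^{l}(\alpha_i+\beta_i-C)(\alpha_j+\beta_j-C)\,K^*(x_i^*,x_j^*)$$

$$\text{subject to}\qquad \sum_{i=1}^{l}\alpha_i\,y_i=0,\qquad \sum_{i=1}^{l}(\alpha_i+\beta_i-C)=0,\qquad \alpha_i\ge 0,\ \ \beta_i\ge 0,\ \ i=1,\dots,l. \tag{6}$$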
The SVM+ decision rule has the same form as the SVM decision rule (2). It differs in the way it determines the coefficients α_i. The coefficients β_i appear only in the correcting function.
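From the stationarity conditions of formulation (5), the correcting function takes the form

$$\psi(x^*)=\frac{1}{\gamma}\sum_{i=1}^{l}(\alpha_i+\beta_i-C)\,K^*(x_i^*,x^*)+d.$$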
The quadratic program (6) related to the SVM+ algorithm differs from that of the conventional SVM algorithm, but it can be solved using a known generalized sequential minimal optimization (SMO) procedure for the SVM+ algorithm.
The hidden information input module 12 is used at the training stage to allow hidden information X* to be inputted into the system by a teacher. The hidden information input module 12 is not used at the test stage because the decision rule used at the test stage depends only on the information in the training data X. The information in the hidden information X* is used only to estimate the coefficients α_i in the decision rule. The hidden information X* may be some specific and/or additional information or description about, or derived from, the training data. Alternatively, the hidden information X* may be the same data as the training data X. Using the hidden information produces a better decision rule in the space X of the conventional training data.
The hidden information may be used in two different modes in the training stage of the system.
The training stage of the system further includes an SVM+ module 18, or other suitable classifier module, which receives inputs from the training data module 10, the data labeling module 11, and the first or second set modules 14 and 17, to learn a decision rule y=f(x) for classifying the labeled training data (X, Y) using SVM+ algorithm parameters and the inputted training data X, the directed distances or hidden information D, and the labeling data Y. A decision rule output module 19 outputs the decision rule y=f(x) computed by the SVM+ module 18.
Referring still to
From the grid of SVM parameters, pairs of parameters are selected 204 and used, along with the labeling data Y 200 and the hidden information X* 202, to obtain an SVM cross-validation error 206 for initially classifying the labeling data Y and the hidden information X* in the space of hidden information. A determination 208 is made as to whether the selected pair of SVM parameters provides a minimum cross-validation error. If the selected pair of SVM parameters does not provide a minimum cross-validation error, another pair of SVM parameters is selected 204 and used along with the labeling data Y 200 and the hidden information X* 202 to perform another SVM cross-validation error determination 208. If the selected pair of SVM parameters does provide a minimum cross-validation error, this pair of selected parameters is used with the SVM algorithm to learn the decision rule y=F(x*) 210, as described above in equation (3), to provide a final classification rule. Then, directed distances d_i are calculated 212 from the decision rule y=F(x*) in the space of hidden information.
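By way of illustration only, steps 204-212 may be sketched in Python, assuming the scikit-learn library as one possible tooling choice; the radial basis function kernel, the parameter grid, and the function name directed_distances are illustrative and are not part of the disclosed method:

```python
# Illustrative sketch of steps 204-212: select SVM parameters by
# cross-validation in the space of hidden information X*, learn the
# decision rule y = F(x*), and return the directed distances d_i.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def directed_distances(X_star, Y, Cs=(0.1, 1.0, 10.0, 100.0), gammas=(0.01, 0.1, 1.0)):
    param_grid = {"C": list(Cs), "gamma": list(gammas)}         # grid of SVM parameters
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)  # cross-validation error 206
    search.fit(np.asarray(X_star), np.asarray(Y))               # keeps the pair with minimum error 208
    rule = search.best_estimator_                               # decision rule y = F(x*) 210
    return rule.decision_function(np.asarray(X_star))           # directed distances d_i 212
```

The returned values are the signed distances of each x_i^* to the separating rule learned in the space of hidden information.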
To illustrate the operation of the method, consider the following steps. First, an SVM algorithm is used with the labeled hidden information (X*, Y) to learn a decision rule $\hat f(x^*)$ in the space of hidden information.
Second, for each training vector x_i^*, an estimate of the directed distance to the decision rule $\hat f(x^*)$ is computed.
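One such estimate, consistent with the description above, is the signed value of the learned decision function at the point x_i^*:

$$d_i=\hat f(x_i^{*}),\qquad i=1,\dots,l.$$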
Finally, the SVM+ algorithm is used with the input (X, dα, Y), where dα denotes the set of directed distances computed in the previous step.
Also consider an example where we are given images of digits (X), their labels (Y), and poetic descriptions of those digits (X*). In the proposed learning system, the poetic descriptions are first used to estimate the directed distances, which are then used as the correction space in an SVM+ algorithm. In accordance with the method described above, a decision rule $\hat f(x^*)$ is first learned in the space X* of poetic descriptions.
Next, the values of the directed distances are found,
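for example, as the signed values of the decision rule $\hat f(x^*)$ learned in the space of poetic descriptions:

$$d_i^{P}=\hat f(x_i^{*}),\qquad i=1,\dots,l,$$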
where x_i^* is the poetic description for the training vector x_i.
Finally, a decision rule is obtained by using an SVM+ algorithm with the input (X, dP, Y).
From the pre-specified grid of SVM+ parameters, groups of parameters are selected 304 and used in the SVM+ algorithm, along with the training data X 300, the labeling data Y 302, and the directed distances or hidden information D 308, to obtain a decision rule that will be used further on for classifying (X, Y). More specifically, for each selected group of parameters, the SVM+ algorithm is solved as described above in equations (5) and (6) to obtain the decision rule. Equation (5) is the original formulation of SVM+ and uses hidden information to model the slacks in the training data. To find the decision rule, however, it is easier to solve an equivalent optimization problem, equation (6), which is the dual of the problem in equation (5). The SVM+ optimization problem given by equations (5) and (6) is a simplified form of the conventional SVM+ algorithm. The simplified SVM+ optimization disclosed herein does not require grouping information for the input data, as all of the input data is considered as one group. By considering one group for all of the input data, the number of parameters in the optimization problem is reduced.
A determination 310 is made as to whether the selected group of SVM+ parameters provides a minimum error rate on a validation set. If the selected group of SVM+ parameters does not provide a minimum error rate, another group of SVM+ parameters is selected 304 and used, along with the training data X 300, the labeling data Y 302, and the directed distances or hidden information D 308, to perform another SVM+ minimum error rate determination 310. If the selected group of SVM+ parameters does provide a minimum error rate on the validation set, this group of selected parameters is used with the SVM+ algorithm to learn the decision rule y=f(x) 312.
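By way of illustration only, the inner SVM+ step, i.e., solving the dual quadratic program stated above as equation (6), may be sketched in Python, assuming the cvxopt quadratic-programming solver as one possible tooling choice; the variable names are illustrative, the kernel matrices are taken as inputs, and recovery of the threshold b is omitted:

```python
# Illustrative sketch: solve the SVM+ dual (6) as a quadratic program.
# K is the l x l decision kernel matrix K(x_i, x_j); K_star is the l x l
# correction kernel matrix K*(x_i*, x_j*); y holds labels in {-1, +1}.
import numpy as np
from cvxopt import matrix, solvers

def svm_plus_dual(K, K_star, y, C=1.0, gamma=1.0):
    y = np.asarray(y, dtype=float)
    l = len(y)
    Q = np.outer(y, y) * K                      # y_i y_j K(x_i, x_j)
    Ks = K_star / gamma                         # (1/gamma) K*(x_i*, x_j*)
    # Stack the dual variables as u = [alpha; beta] and minimize 1/2 u'Pu + q'u,
    # which is the negative of the functional in (6) up to an additive constant.
    P = np.block([[Q + Ks, Ks], [Ks, Ks]]) + 1e-8 * np.eye(2 * l)  # small ridge for stability
    q = np.concatenate([-np.ones(l) - C * Ks @ np.ones(l),
                        -C * Ks @ np.ones(l)])
    G = -np.eye(2 * l)                          # alpha_i >= 0, beta_i >= 0
    h = np.zeros(2 * l)
    A = np.vstack([np.concatenate([y, np.zeros(l)]),  # sum_i alpha_i y_i = 0
                   np.ones(2 * l)])                   # sum_i (alpha_i + beta_i - C) = 0
    b = np.array([0.0, C * l])
    solvers.options["show_progress"] = False
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h), matrix(A), matrix(b))
    u = np.array(sol["x"]).ravel()
    return u[:l], u[l:]                         # coefficients alpha_i and beta_i
```

The returned coefficients α_i determine the SVM+ decision rule in the same way as in equation (2), while the β_i enter only the correcting function.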
The system and methods disclosed herein may be readily adapted and utilized in a wide array of applications to classify and/or predict data in multidimensional space, the data corresponding to a phenomenon of interest, e.g., images of objects and scenes obtained by cameras and other sensors, text, voice, stock prices, etc. More specifically, the applications include, for example and without limitation, general pattern recognition (including image recognition, object detection, and speech and handwriting recognition), regression analysis and predictive modeling (including quality control systems and recommendation systems), data classification (including text and image classification and categorization), bioinformatics (including protein classification, automated diagnosis systems, biological modeling, and bio-imaging classification), data mining (including financial forecasting and database marketing), etc.
One skilled in the art will recognize that the computer system disclosed herein may comprise, without limitation, a mainframe computer system, a workstation, a personal computer system, a personal digital assistant (PDA), or other device, apparatus, and/or system having at least one processor that executes instructions which perform the methods disclosed herein. The instructions may be stored in a memory medium.
The computer system may further include a display device or monitor for displaying operations associated with the methods described herein and one or more memory mediums on which one or more computer programs or software components may be stored. For example, one or more software programs which are executable to perform the methods described herein may be stored in the memory medium. The one or more memory mediums may include, without limitation, CD-ROMs, floppy disks, tape devices, random access memories such as but not limited to DRAM, SRAM, EDO RAM, and Rambus RAM, non-volatile memories such as, but not limited to, hard drives and optical storage devices, and combinations thereof. In addition, the memory medium may be entirely or partially located in one or more associated computers or computer systems which connect to the computer system over a network, such as the Internet.
The methods described herein may also be executed in hardware, a combination of software and hardware, or in other suitable executable implementations. The methods implemented in software may be executed by the processor of the computer system or the processor or processors of the one or more associated computers or computer systems connected to the computer system.
While exemplary drawings and specific embodiments of the present invention have been described and illustrated, it is to be understood that the scope of the present invention is not to be limited to the particular embodiments discussed. Thus, the embodiments shall be regarded as illustrative rather than restrictive, and it should be understood that variations may be made in those embodiments by workers skilled in the art without departing from the scope of the present invention as set forth in the claims that follow and their structural and functional equivalents.
This application claims the benefit of U.S. Provisional Application No. 61/026,868, filed Feb. 7, 2008. The disclosure of U.S. Provisional Application No. 61/026,868 is incorporated herein by reference.