The present invention is concerned with learning machines such as Support Vector Machines (SVMs).
The reference to any prior art in this specification is not, and should not be taken as, an acknowledgement or any form of suggestion that the prior art forms part of the common general knowledge.
A decision machine is a universal learning machine that, during a training phase, determines a set of parameters and vectors that can be used to classify unknown data. An example of a decision machine is the Support Vector Machine. A classification Support Vector Machine (SVM) is a universal learning machine that, during a training phase, determines a decision surface or “hyperplane”. The decision hyperplane is determined by a set of support vectors selected from a training population of vectors and by a set of corresponding multipliers. The decision hyperplane is also characterised by a kernel function.
Subsequent to the training phase the classification SVM operates in a testing phase during which it is used to solve a classification problem in order to classify test vectors on the basis of the decision hyperplane previously determined during the training phase.
Support Vector Machines find application in many and varied fields. For example, in an article by S. Lyu and H. Farid entitled “Detecting Hidden Messages using Higher-Order Statistics and Support Vector Machines” (5th International Workshop on Information Hiding, Noordwijkerhout, The Netherlands, 2002) there is a description of the use of an SVM to discriminate between untouched and adulterated digital images.
Alternatively, in a paper by H. Kim and H. Park entitled “Prediction of protein relative solvent accessibility with support vector machines and long-range interaction 3d local descriptor” (Proteins: structure, function and genetics, 2004 Feb. 15; 54(3):557-62) SVMs are applied to the problem of predicting high resolution 3D structure in order to study the docking of macro-molecules.
The mathematical basis of an SVM will now be explained. An SVM is a learning machine that, given m input vectors xi ∈ ℝn drawn independently from the probability distribution function p(x), together with an output value yi for every input vector xi, returns an estimated output value ƒ(xi) for any vector xi not in the input set.
The pairs (xi, yi), i=0, . . . , m, are referred to as the training examples. The resulting function ƒ(x) determines the hyperplane which is then used to estimate unknown mappings. Each vector in the training population comprises elements or “features” of a feature space associated with the classification problem.
With some manipulations of the governing equations the support vector machine can be phrased as the following Quadratic Programming problem:
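The displayed equations (1) to (6) referred to below are not reproduced in this text. By way of orientation only, the standard soft-margin dual problem, with which the surrounding description is consistent, may be written as

\[ \min_{\alpha}\ W(\alpha) = \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_{i}\alpha_{j}\, y_{i} y_{j}\, K(x_{i}, x_{j}) \;-\; \sum_{i=1}^{m} \alpha_{i} \]

subject to

\[ \sum_{i=1}^{m} \alpha_{i} y_{i} = 0, \qquad 0 \le \alpha_{i} \le C, \quad i = 1, \dots, m, \]

where C is a regularisation constant; the numbering (1)-(6) used elsewhere in the description is assumed to refer to this problem and its constraints.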
Here K(xi, xj) is the kernel function and can be viewed as a generalised inner product of two vectors. The result of training the SVM is the determination of the multipliers αi.
Suppose we train an SVM classifier with pattern vectors xi, and that r of these vectors are determined to be support vectors; denote them by xi, i=1, 2, . . . , r. The decision hyperplane for pattern classification then takes the form
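Equation (7) is not reproduced in this text; in its conventional form the decision function of a classification SVM is

\[ f(x) = \sum_{i=1}^{r} \alpha_{i}\, y_{i}\, K(x_{i}, x) + b , \tag{7} \]

which is the form assumed here.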
where αi is the Lagrange multiplier associated with pattern xi and K(·, ·) is a kernel function that implicitly maps the pattern vectors into a suitable feature space. The offset b can be determined independently of the αi.
Given equation (7), an unclassified sample vector x may be classified by calculating ƒ(x) and then returning −1 when the returned value is less than zero and 1 when it is greater than zero.
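By way of illustration only, the following minimal Python sketch applies this classification rule, assuming a Gaussian (RBF) kernel and assuming that the support vectors, multipliers, labels and offset have already been obtained from training; all names here are illustrative rather than part of the specification.

```python
import numpy as np

def rbf_kernel(u, v, gamma=0.1):
    # Gaussian (RBF) kernel; any valid kernel K(., .) could be substituted.
    return np.exp(-gamma * np.sum((np.asarray(u) - np.asarray(v)) ** 2))

def classify(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    # Evaluate f(x) = sum_i alpha_i * y_i * K(x_i, x) + b and return its sign.
    f = sum(a * y * kernel(sv, x)
            for sv, a, y in zip(support_vectors, alphas, labels)) + b
    return -1 if f < 0 else 1
```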
As previously mentioned, each vector in the training population comprises elements or “features” that correspond to features of a feature space associated with the classification problem. The training set may include hundreds of thousands of features. Consequently, compilation of a training set is often time consuming and may be labour intensive. For example, producing a training set to assist in determining whether or not a subject is likely to develop a particular medical condition may involve having thousands of people in a particular demographic fill out a questionnaire containing tens or even hundreds of questions. Similarly, generating a training set for use in classifying email messages as likely to be spam or not-spam typically involves the processing of thousands of email messages.
It will be realised that, given the considerable overhead often involved in compiling a training set, it would be advantageous to enhance the extraction of information associated with the training set.
It is an object of the invention to provide a method that enhances the extraction of information associated with a training set for a decision machine.
Where the dimensionality of the feature space from which the training vectors are derived exceeds the true dimensionality associated with the classification problem to be addressed, a number of sets of training vectors might be derived. The present inventor has conceived of a method for enhancing information extraction from a training set that involves forming a plurality of mutually orthogonal training sets. As a result, the classifications made by each decision machine are totally independent of each other, so that the chance of correct classification after multiple machines is maximized.
According to a first aspect of the present invention there is provided a method of operating at least one computational device to enhance extraction of information associated with a first training set of vectors, the method including operating said computational device to perform the step of:
(a) forming a plurality of mutually orthogonal training sets from said first training set.
The method will preferably include the step of:
(b) training each of a plurality of decision machines with a corresponding one of the plurality of mutually orthogonal training sets.
The method may also include the step of:
(c) extracting information about one or more test vectors with reference to the plurality of decision machines.
In a preferred embodiment the plurality of decision machines comprises a plurality of support vector machines.
Step (a) will usually include:
(i) centering and normalizing the first training set.
In the preferred embodiment step (a) includes:
(ii) iteratively solving a minimization problem with respect to a floating vector and with reference to the first training set to thereby determine a feature selection vector;
wherein iterations of the floating vector are derived from previous iterations of the feature selection vector so that an iteration of the floating vector and a previous iteration of the feature selection vector are orthogonal.
The minimization problem will preferably comprise a least squares problem.
Step (a) may further include:
(iii) setting elements of the feature selection vector to zero in the event that they fall below a threshold value.
The method will preferably also include:
(iv) setting elements of a next iteration of the floating vector to zero in the event that they correspond to above-threshold elements of a current iteration of the feature selection vector.
Preferably the method includes:
(v) applying iterations of the feature selection vector to the first training set to thereby form the plurality of mutually orthogonal training sets.
Step (a) may also include:
flagging termination of the method in the event that at least a predetermined number of elements of the feature selection vector are less than a predetermined tolerance.
The method may further include:
programming at least one computational device with computer-executable instructions corresponding to step (a) and storing the computer-executable instructions on a computer-readable medium.
According to a further aspect of the invention there is provided a method of operating at least one computational device to enhance extraction of information associated with a first training set of vectors, the method including operating said computational device to perform the step of:
(a) forming a plurality of mutually orthogonal training sets from said first training set;
(b) training each of a plurality of classification support vector machines with a corresponding one of the plurality of mutually orthogonal training sets; and
(c) classifying one or more test vectors with reference to the plurality of classification support vector machines.
In another aspect of the present invention there is provided a computer software product in the form of a medium bearing instructions for execution by one or more processors, including instructions to implement the above-described method.
According to a further aspect of the present invention there is provided a computational device programmed to perform the method. The computational device may for example be any one of the following.
Further preferred features of the present invention will be described in the following detailed description of an exemplary embodiment wherein reference will be made to a number of figures as follows.
Preferred features, embodiments and variations of the invention may be discerned from the following Detailed Description which provides sufficient information for those skilled in the art to perform the invention. The Detailed Description is not to be regarded as limiting the scope of the preceding Summary of the Invention in any way. The Detailed Description will make reference to a number of drawings as follows:
The present inventor has realised that a method for feature selection in the case of non-linear learning systems may be developed out of a least-squares approach. The minimization problem of equations (1-3) is equivalent to
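Equations (8) to (12) below are not reproduced in this text; at each point a form consistent with the surrounding description is shown and should be read as an assumed reconstruction rather than the original. On that reading, equation (8) is

\[ \min_{\alpha}\ \lVert K\alpha - e \rVert_{2}^{2} \tag{8} \]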
where the (i, j) entry in K is K(xi, xj), α is the vector of Lagrange multipliers and e is a vector of ones. The constraint equations (4-6) will also apply to (8). The notation outside the norm symbol indicates that it is the square of the 2-norm that is to be taken. The theory for a linear kernel, where K(xi, xj) = xiT·xj is a simple inner product of two vectors, will now be developed. Writing the input vectors as a matrix X = [x1, . . . , xk], it follows that e = XTb for some floating vector b. The problem set out above in (8) can then be rewritten as:
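Assumed form of equation (9):

\[ \min_{\alpha}\ \lVert X^{T}X\alpha - X^{T}b \rVert_{2}^{2} \tag{9} \]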
This is the normal equation formulation for the solution of
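Assumed form of equation (10):

\[ \min_{\alpha}\ \lVert X\alpha - b \rVert_{2}^{2} \tag{10} \]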
so that (9) and (10) are equivalent. The first step in the solution of (10) is to solve the underdetermined least squares problem that will have multiple solutions
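Assumed form of equation (11), the underdetermined system for the floating vector b:

\[ X^{T}b = e \tag{11} \]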
any solution is sufficient. However, the desired and feasible solution is
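Assumed form of equation (12):

\[ Pb = \begin{bmatrix} b_{1} \\ b_{2} \end{bmatrix}, \qquad b_{2} = 0 \tag{12} \]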
where P is an appropriate pivot matrix and b2=0. The size of b2 is determined by the rank of the matrix X, or the number of independent columns of X. Any method that gives a minimum 2-norm solution and meets the constraints of the SVM problem may be used to solve (12). It is in the solution of (11) that an opportunity for natural selection of the features arises since only the nonzero elements contribute to the solution. For example, suppose that the solution of (11) is bmin and that the non-zero elements of bmin=[b1, . . . , bn]T are b100, b1, b191, b202, b323, b344, etc. In that case only features xi,100, xi,1, xi,191, xi,202, xi,323, xi,344 etc. are used in the matrix X. The other features that make up X can be safely ignored without changing the performance of the SVM. Consequently, bmin may be referred to as a “feature selection vector”.
Numerically the difference between a zero element and a small element less than a predetermined minimum threshold value is negligible. For a computer implementation, all those elements less than the threshold can be disregarded without reducing the accuracy of the solution to the minimization problem set out in equation (8), and equivalently equation (9).
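A short Python sketch of these two steps follows, under the assumption made above that equation (11) has the form XTb = e, and using the fact that numpy's lstsq routine returns the minimum 2-norm solution of an underdetermined system; the column layout of X and the absolute-value thresholding are illustrative choices, not requirements of the specification.

```python
import numpy as np

def feature_selection_vector(X, tol=1e-3):
    # X: n_features x m_samples matrix whose columns are the training vectors.
    m = X.shape[1]
    e = np.ones(m)
    # Minimum 2-norm solution of the underdetermined system X^T b = e.
    b_min, *_ = np.linalg.lstsq(X.T, e, rcond=None)
    # Elements below the threshold are numerically negligible; zero them out.
    threshold = tol * np.max(np.abs(b_min))
    b_min[np.abs(b_min) < threshold] = 0.0
    return b_min
```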
A second motivation for this approach is the fact that equation (9) contains inner products that can be used to accommodate the mapping of data vectors into feature space by means of kernel functions. In this case the X matrix becomes [Φ(x1), . . . , Φ(xn)] so that the inner product XTX in (9) gives us the kernel matrix. The problem can therefore be expressed as in (8) with e=Φ(x)·Φ(b). To find b we must then solve the optimisation problem
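The displayed optimisation problem is not reproduced in this text; consistent with the surrounding description it is assumed to be of the form

\[ \min_{b}\ \lVert \kappa(b) - e \rVert_{2}^{2}, \qquad \kappa(b) = \left[\, \Phi(x_{1})\cdot\Phi(b), \ \dots, \ \Phi(x_{n})\cdot\Phi(b) \,\right]^{T} \]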
where Φ(xi)·Φ(b) is computed as K(xi, b).
Thus the method can be readily extended to kernel feature space in order to provide a direct method for feature selection in non-linear learning systems. A flowchart of a method incorporating the above approach is depicted in
At box 48 a classification for the test vector is calculated. The test result is then presented at box 50.
In the Support Vector Regression problem, the set of training examples is given by (x1, y1), (x2, y2), . . . , (xm, ym), xi ∈ ℝn, where yi may be either a real or binary value. In the case of yi ∈ {±1}, either the Support Vector Classification Machine or the Support Vector Regression Machine may be applied to the data. The goal of the regression machine is to construct a hyperplane that lies as “close” as possible to as many of the data points as possible. With some mathematics the following quadratic programming problem can be constructed; it is similar to that of the classification problem and can be solved in the same way.
Minimise ½λTDλ−λT (14)
subject to
λTg = 0
0 ≤ λi ≤ C
where
This optimisation can also be expressed as a least squares problem, and the same method for reducing the number of features can be used.
Where the dimensionality of the feature space from which the training vectors are derived exceeds the true dimensionality associated with the classification problem to be addressed, a number of sets of support vectors might be derived. Consequently, a number of different decision machines, such as support vector machines (SVMs), can be constructed, each defining a different decision hyperplane.
For example, if SVM1 has a decision surface ƒ1(x) and SVM2 has a decision surface ƒ2(x), then the classification of a test vector might be made by using ƒs(x) = ƒ1(x) + ƒ2(x). More generally, a decision surface ƒs(x) can be derived from SVMs SVM1, . . . , SVMn defining respective decision hyperplanes ƒ1(x), . . . , ƒn(x) as ƒs(x) = β1ƒ1(x) + β2ƒ2(x) + . . . + βnƒn(x), where the βi are scaling constants. Alternatively, confidence intervals associated with the classification capability of each of SVM1, . . . , SVMn might be calculated and the best estimating SVM used.
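For illustration only, the following sketch combines the decision surfaces of several previously trained SVMs with scaling constants β, using scikit-learn's SVC purely as a stand-in for SVM1, . . . , SVMn; the weights supplied in betas are arbitrary and are not prescribed by the specification.

```python
import numpy as np
from sklearn.svm import SVC

def composite_decision(svms, betas, x):
    # f_s(x) = beta_1*f_1(x) + beta_2*f_2(x) + ... + beta_n*f_n(x);
    # each f_i is the decision surface of a previously trained SVM.
    x = np.atleast_2d(x)
    f_s = sum(beta * svm.decision_function(x) for svm, beta in zip(svms, betas))
    return np.where(f_s < 0, -1, 1)

# Each element of svms would previously have been fitted on its own training
# set, e.g. SVC(kernel="rbf").fit(X_subset, y).
```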
A problem arises, however, in that it is not apparent how the sets of training vectors that are used to train each of the SVMs might be selected in order to improve the classification performance of the composite decision surface ƒs(x).
As previously mentioned, the present inventor has realised that it is advantageous for the SVM training data sets to be orthogonal to each other. By “orthogonal” it is meant that the features composing the vectors which make up the training set used for classification in one SVM are not evident or used in the second and successive machines. As a result the classifications made by each SVM are totally independent of each other so that the chance of correct classification after multiple machines is maximized. Mathematically
[Xn]TXm = [0], m ≠ n   (15)
where Xn and Xm are training data sets, in the form of matrices, derived from a large training data set and [0] is a matrix of zeroes. That is, the training sets that are derived are mutually orthogonal.
At box 102 of
At box 105 the feature selection method that was previously described is applied to calculate:
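The calculation displayed at box 105 is not reproduced in this text. Consistent with the feature selection method described above, it is assumed to produce bminn as a minimum 2-norm solution of equation (11) restricted to the features that have not been zeroed in the current floating vector bn, i.e. of the form

\[ b_{\min}^{\,n} = \arg\min_{b}\ \lVert b \rVert_{2} \quad \text{subject to} \quad X^{T}b = e, \ \ b_{j} = 0 \ \text{for every } j \text{ with } (b^{n})_{j} = 0 . \]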
At box 107 each of the elements of bminn is compared to a predetermined tolerance, for example the maximum element of bminn, i.e. max(bminn), multiplied by an arbitrary scaling factor “tol”. Here tol is a relatively small number. If at least P (where P is an appropriate integer value) of the elements of bminn are less than the tolerance then the procedure progresses to box 110, where the Boolean variable “Continue” is set to True. Alternatively, if fewer than P of the elements of bminn are less than or equal to the tolerance then the procedure proceeds to box 108, where Continue is set to False. In either event, the procedure then progresses to box 109.
At box 109 the significant elements of bminn are determined by comparing each element to a threshold, being tol multiplied by the largest element of bminn. The below-threshold elements of bminn are set to zero. Elements of a new floating vector, bn+1, corresponding to the above-threshold elements of bminn are also set to zero. The inner product of bn+1 and bminn will then be zero, indicating that they are orthogonal vectors.
At box 115 a sub-matrix of training vectors Xn is produced by applying a “reduce” operation to X. The reduce operation involves copying the elements of X to Xn and then setting to zero all the xj,i elements of Xn corresponding to elements of bn that equal zero. This operation effectively removes rows from the Xn sub-matrix. Alternatively, in another embodiment, rather than setting to zero all the xj,i elements of Xn corresponding to elements of bn that equal zero, those elements of Xn are instead removed, so that the rank of the matrix Xn is less than that of X.
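A sketch of the two variants of the “reduce” operation follows, assuming features are stored as rows of X and that the selector vector (bn or bminn, as appropriate) marks de-selected features with zeros; both functions are illustrative only.

```python
import numpy as np

def reduce_zero(X, selector):
    # Copy X and zero every row whose corresponding selector element is zero.
    Xn = X.copy()
    Xn[selector == 0, :] = 0.0
    return Xn

def reduce_remove(X, selector):
    # Alternative embodiment: drop the de-selected rows entirely.
    return X[selector != 0, :]
```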
At box 117 a support vector machine is trained with the Xn training set to produce an SVM that defines the first hyperplane ƒn=1(x).
The procedure then progresses to decision box 118. If the Continue variable was previously set to True at box 110 then the procedure progresses to box 119. Alternatively, if the Continue variable was previously set to False at box 108 then the procedure terminates.
At box 119 the counter variable n is incremented, and the procedure then proceeds through a further iteration from box 105. So long as at least P elements of bminn are greater than threshold, i.e. tol*max(bminn), at box 107, the method will continue to iterate. With each iteration a new SVM is trained from a subset training set matrix Xn, which is orthogonal to the previously generated training sets, to determine a new hyperplane ƒn(x).
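Bringing boxes 105 to 119 together, the following compact Python sketch shows one possible realisation, under the same assumptions as the earlier sketches: equation (11) is read as XTb = e, the minimum-norm solution is obtained with numpy's lstsq, scikit-learn's SVC stands in for the SVM training step, the zeroing variant of “reduce” is used, and the continuation test is written as described for box 107. The choices of tol, P, kernel and data layout are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def train_orthogonal_svms(X, y, tol=1e-3, P=5, max_iter=20):
    # X: n_features x m_samples (columns are training vectors); y: labels in {-1, +1}.
    n, m = X.shape
    e = np.ones(m)
    available = np.ones(n, dtype=bool)        # features not yet used by any SVM
    svms, training_sets = [], []
    for _ in range(max_iter):
        Xa = np.where(available[:, None], X, 0.0)            # restrict the solve to unused features
        b_min, *_ = np.linalg.lstsq(Xa.T, e, rcond=None)     # minimum 2-norm solution of Xa^T b = e
        threshold = tol * np.max(np.abs(b_min))
        selected = available & (np.abs(b_min) >= threshold)  # significant elements of b_min (box 109)
        if not np.any(selected):
            break
        # Box 107: Continue while at least P elements of b_min fall below the threshold.
        keep_going = np.sum(np.abs(b_min) < threshold) >= P
        Xn = np.where(selected[:, None], X, 0.0)             # box 115: the "reduce" operation
        svms.append(SVC(kernel="linear").fit(Xn.T, y))       # box 117: train an SVM on the reduced set
        training_sets.append(Xn)
        available &= ~selected                               # next floating vector is orthogonal
        if not keep_going:                                   # box 118
            break
    return svms, training_sets
```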
Since the features selected from X in each iteration of the procedure are always different, the SVM models will, due to the constraint in box 105 of
Apart from comprising a personal computer, as described above with reference to
The embodiments of the invention described herein are provided for purposes of explaining the principles thereof, and are not to be considered as limiting or restricting the invention since many modifications may be made by the exercise of skill in the art without departing from the scope of the invention as defined by the following claims.
Number | Date | Country | Kind |
---|---|---|---|
2004907341 | Dec 2004 | AU | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/AU05/01962 | 12/23/2005 | WO | 00 | 6/29/2007 |