This invention concerns data mining, that is, the extraction of information from “unlearnable” data sets. In a first aspect it concerns apparatus for such data mining, and in a further aspect it concerns a method for such data mining.
Learnable data sets are defined to be those from which information can be extracted using a conventional learning device such as a support vector machine, decision tree, regression, artificial neural network, evolutionary algorithm, k-nearest neighbour or clustering method.
To extract information from a data set, first a training sample is taken and a learning device is trained on the training sample using a supervised learning algorithm. Once trained the learning device, now called a predictor, can be used to process other samples of the data set, or the entire set.
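The train-then-predict workflow just described can be sketched as follows. This is a minimal illustrative example, not the apparatus of the invention: it uses a simple centroid classifier (one of the conventional learning devices mentioned above) and invented data; all names are hypothetical.

```python
# Minimal sketch of the train-then-predict workflow: train a "learning
# device" (here a centroid classifier) on a labelled training sample,
# then use the resulting predictor on unseen items of the data set.

def train_centroid(train):
    """Train on labelled pairs (x, y) with y in {+1, -1}; returns a predictor."""
    pos = [x for x, y in train if y == +1]
    neg = [x for x, y in train if y == -1]
    dim = len(train[0][0])
    c_pos = [sum(x[i] for x in pos) / len(pos) for i in range(dim)]
    c_neg = [sum(x[i] for x in neg) / len(neg) for i in range(dim)]

    def predictor(x):
        # Assign the label of the nearer class centroid.
        d_pos = sum((xi - ci) ** 2 for xi, ci in zip(x, c_pos))
        d_neg = sum((xi - ci) ** 2 for xi, ci in zip(x, c_neg))
        return +1 if d_pos < d_neg else -1

    return predictor

# Train on a small sample, then apply the predictor to unseen items.
sample = [((0.0, 0.1), +1), ((0.2, 0.0), +1), ((1.0, 0.9), -1), ((0.9, 1.1), -1)]
predictor = train_centroid(sample)
print(predictor((0.1, 0.1)), predictor((1.0, 1.0)))
```

Once trained, the same `predictor` function can be applied to other samples of the data set or to the entire set.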
Composite learning devices consist of several of the devices listed above together with a mixing stage that combines the outputs of the devices into a single output, for instance by a majority vote.
Data sets that cannot be successfully mined by such conventional means are termed “unlearnable”. The inventors have identified a class of “unlearnable” data which can be mined using a new technique; this class of data is termed “anti-learnable” data.
The invention is apparatus for data mining unlearnable data sets, comprising:
a learning device trained using a supervised learning algorithm to predict labels for each item of a training sample; and,
a reverser to apply negative weighting to labels predicted for other data from the data set using the learning device, if necessary.
This apparatus is able to data mine a class of unlearnable data, the anti-learnable data sets.
The apparatus may further comprise:
a further learning device trained using a further supervised learning algorithm to predict labels for each item of a further training sample; and,
a reverser to apply negative weighting to labels predicted for other data from the data set using at least one learning device.
Where there is more than one learning device the training samples may be distinct from each other.
The apparatus may be embodied in a neural network, or other statistical machine learning algorithm.
At least one of the learning devices may use the k-nearest neighbour method or be a support vector machine, or other statistical machine learning algorithm.
The reverser may operate automatically. The reverser may be implemented as a direct majority voting method or developed from the data using a supervised machine learning technique such as a perceptron or a support vector machine (SVM).
In a further aspect the invention is a method for extracting information from unlearnable data sets, the method comprising the steps of:
creating a finite training sample from the data set;
training a learning device using a supervised learning algorithm to predict labels for each item of the training sample;
processing other data from the data set to predict labels and determining whether the other data is learnable (predicted labels are better than random guessing) or anti-learnable (predicted labels are worse than random guessing); and,
applying negative weighting to the predicted labels if the other data is anti-learnable.
The effect of this method is to identify whether data is learnable or anti-learnable. A learning index may be calculated to determine the learnability type, and the type may be output from the calculation.
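The determination and reversal steps of the method can be sketched as follows. This is an illustrative sketch only, assuming accuracy as the performance measure and a threshold of 0.5 for random guessing; the function names are hypothetical.

```python
# Sketch of the determination/reversal steps: classify held-out performance
# as learnable or anti-learnable, and apply negative weighting (label
# reversal) in the anti-learnable case.

def learnability_type(predictions, labels, mu0=0.5):
    """Return 'learnable' if predictions beat random guessing, else 'anti-learnable'."""
    acc = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
    return "learnable" if acc > mu0 else "anti-learnable"

def apply_reverser(predictions, labels):
    """Negatively weight (flip) the predicted labels when the data is anti-learnable."""
    if learnability_type(predictions, labels) == "anti-learnable":
        return [-p for p in predictions]
    return list(predictions)

# A predictor that is consistently wrong on held-out data is anti-learnable;
# reversing its outputs yields better-than-random predictions.
preds, truth = [+1, -1, +1, -1], [-1, +1, -1, +1]
print(learnability_type(preds, truth))
print(apply_reverser(preds, truth))
```

The key point is that systematically wrong predictions carry as much information as systematically right ones; the reverser recovers that information.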
The method may comprise the further steps of:
training a further learning device using a further supervised learning algorithm to predict labels for each item of a further training sample;
processing other data from the data set to predict labels and determining whether the predicted labels of the first and further learning devices are learnable or anti-learnable; and,
applying negative weighting to the predicted labels of a learning device if the data is anti-learnable.
The method may comprise the step of training a reverser to apply the negative weighting automatically.
The method may include the further step of transforming anti-learnable data into learnable data for conventional processing. The transformation may employ a non-monotonic kernel transformation. This transformation may increase within-class similarities and decrease between-class similarities.
The method may comprise the additional step of using a learning device to further process the weighted data.
The method may be enhanced by reducing the size of the training samples, or by selecting a “less informative” representation (features) of the data, which reduces the performance of the predictors below the level of random guessing. Mercer kernels may be used for this purpose.
The method may be embodied in software.
A number of examples of the invention will now be described with reference to the accompanying drawings, in which:
FIGS. 14(a) and (b) are graphs of testing results for random 34% Hadamard data with different predictors.
Referring to
We know that each member of the population will be either male or female. We can choose to apply a label Y to each item of the population data to indicate the sex. For instance Y may be +1 for a male or −1 for female. Y is a 1-dimensional space of labels. There is a probability that each member of the population will either be male or female, and a statistical probability distribution can be constructed for the population.
If we were to mine the data to apply the Y sex determining label to each member of the population, the steps would be as follows:
First a training sample of the data would be taken and a learning device trained on the training sample using a supervised learning algorithm. Typically one type of pattern, or put another way one feature space, is selected for training. Once trained, the learning device should model the dependence of labels on patterns in the form of a deterministic relation, a function ƒ:X→R, where for each member of the training sample there is a probability of 1 that they are either male or female. The function ƒ is a predictor, and the trained learning device is now called a “predictor”.
The trained learning device can now be used to process other samples of the data set or the entire set. When this is done, if the data set is a learnable data set we expect to see a result similar to the plot shown at 24. This is less perfect than the training result because the predictor does not operate perfectly.
When the data set is anti-learnable the result is worse than random, as shown in plot 26. Anti-learning is therefore a property a dataset exhibits when mined with a learning device trained in a particular way.
Anti-learning manifests itself in both natural and synthetic data. It is present in some practically important cases of machine learning from very low training sample sizes.
We have already mentioned AROC as a metric measuring the performance of a predictor. However, other metrics are applicable here as well. For the purpose of illustration we shall introduce the accuracy.
The AROC can be computed via an evaluation of conditional probabilities [Bamber, 1975]:

AROC[ƒ,Z′]=P(ƒ(x)>ƒ(x′)|y=+1, y′=−1)+½P(ƒ(x)=ƒ(x′)|y=+1, y′=−1)
Here we assume that z=(x,y)εZ′ and z′=(x′,y′)εZ′ are drawn from the test distribution Pz′, i.e. the frequency count measure on a finite test set Z′⊂D. Clearly AROC[ƒ,Z′]=1 indicates perfect classification by the rule x|→ƒ(x)+b for a suitable threshold bεR, and the expected AROC for a classifier randomly allocating the labels is 0.5.
Another measure is the accuracy, which is a class-calibrated version of a complement of the test error, ignoring skewed conditional class probabilities. We define it as

ACC[ƒ,Z′]:=½P(ƒ(x)>0|y=+1)+½P(ƒ(x)<0|y=−1)
where we assume that z=(x,y)εZ′ are drawn from test distribution Pz′. The expected value for a random classifier is ACC[ƒ,Z′]=½ and perfect classification corresponds to ACC[ƒ,Z′]=1.
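The two performance measures can be computed directly from a finite test set. The sketch below is a straightforward reading of the definitions above (pairwise comparisons for AROC following [Bamber, 1975], and the average of per-class correct rates for the class-calibrated accuracy); function names are illustrative.

```python
# AROC and class-calibrated accuracy for a finite test set of
# (score, label) pairs with labels in {+1, -1}.

def aroc(scores, labels):
    """Area under the ROC curve via pairwise comparison:
    P(f(x) > f(x') | y=+1, y'=-1) + 0.5 * P(f(x) = f(x') | y=+1, y'=-1)."""
    pos = [s for s, y in zip(scores, labels) if y == +1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def acc(scores, labels, b=0.0):
    """Class-calibrated accuracy: average of the per-class correct rates,
    after thresholding the score f(x) + b at zero."""
    pos_ok = [s + b > 0 for s, y in zip(scores, labels) if y == +1]
    neg_ok = [s + b < 0 for s, y in zip(scores, labels) if y == -1]
    return 0.5 * sum(pos_ok) / len(pos_ok) + 0.5 * sum(neg_ok) / len(neg_ok)

print(aroc([2, 1, -1, -2], [+1, +1, -1, -1]))  # perfect ranking
print(acc([2, 1, -1, -2], [+1, +1, -1, -1]))   # perfect classification
```

A perfect classifier yields AROC=ACC=1 and a perfectly wrong (anti-learning) one yields AROC=ACC=0, with 0.5 as the random-guessing baseline in both cases.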
There are a number of steps in extracting information from unlearnable data sets, some of which may not always be required. The following description will address both essential and nonessential steps in the order in which they occur.
In a typical data mining task the selection of a suitable domain of patterns X is part of the problem. Referring again to
The goal of supervised learning is to select a predictor ƒ:X0→R mapping the measurement space 12 into real numbers. Such a selection is done on the basis of a finite training sample Z=((x1,y1), . . . ,(xm,ym))εD⊂X0×{±1} of examples with known labels. This is achieved using a supervised learning algorithm, Alg, in a training process. The training outputs a function ƒ=Alg(Z,param) which as a rule predicts labels of the training data set better than random guessing, μ(ƒ,Z)>0.5, typically almost perfectly, μ(ƒ,Z)≈1.0, where με{AROC, ACC} is a pre-selected performance measure.
The desire is to achieve a good prediction of labels on an independent test set Z′⊂D\Z not seen in training.
We say that the predictor f is learning (L-predictor) with respect to training on Z and testing on Z′ if μ(ƒ,Z)>0.5 and μ(ƒ,Z′)>0.5.
We say that the predictor f is anti-learning (AL-predictor) with respect to the training-testing pair (Z,Z′) if μ(ƒ,Z)>0.5 and μ(ƒ,Z′)<0.5.
We say that data set D is learnable (L-dataset) by algorithm Alg(.,param) if ƒ=Alg(Z,param) is an L-predictor for every training-test pair (Z,Z′), Z⊂D and Z′⊂D\Z, after exclusion of obvious pathological cases. Analogously we define the anti-learnable data set, AL-data set.
Taking into consideration various feature representations Φj: X0→Xj these concepts are extended to the kernel case. It is assumed that the predictor ƒ=Alg(Z,k,param) depends also on a kernel k, and has the following data expansion form:

ƒ(x)=Σi=1, . . . ,m αiyik(xi,x)+b
for xεX0, where αi,bεR are learnable parameters. For a range of popular algorithms such as support vector machines we have the additional assumption that αi≧0 for all i, and we write ƒεCONE(k,Z) in such a case. We say that k is an AL-kernel on D if the k-kernel machine ƒ defined as above is an AL-predictor for every training set Z⊂D. Analogously, we define the L-kernel on D. Equivalently we can talk about learnable (L-) and anti-learnable (AL-) feature representations, respectively.
Note that equivalently these concepts can be introduced by considering the feature space representation Φ(X0)⊂Xj and the class of kernel machines with the linear kernel on Xj.
Determination of whether data is of learning or anti-learning type is done empirically most of the time, depending on the learning algorithm and the selection of learning parameters. However, in some cases the link can be made directly to the kernel matrix [Kij]. An example here is the case of perfect anti-learning and the mirror concept of perfect learning: that is, μ(ƒ,Z)=1 in training and μ(ƒ,Z′)=0 in an independent test for the former, and μ(ƒ,Z)=μ(ƒ,Z′)=1 in both the training and an independent test for the latter, for every ƒεCONE(k,Z) and Z′⊂D\Z.
The following theorem is presented to assist in the determination:
Theorem 1 The following conditions for the Perfect Antilearning (PAL) are equivalent:
Moreover, the following conditions for Perfect Learning (PL) are equivalent
Corollary 3 PAL or PL, respectively, is equivalent to either of the following two conditions holding for V=0 or V=1, respectively, for every ƒεCONE(k,Z):
1. AROC[ƒ,Z′]=V for every Z′⊂D\Z containing examples of both classes.
2. There exists some bεR such that ACC[ƒ+b,Z′]=V for every Z′⊂D\Z.
The following algorithm is illustrated in
Given:
For l=1:n repeat steps 1-3:
Output: the learning index
and the data/algorithm learnability type
The outputs of all the predictors 32 are received at the reverser 34. If a predictor is AL, then its output will be negatively weighted by reverser 34 in the process of the final decision making. This is a different process to the classical algorithms using ensemble methods, such as boosting or bagging.
The following Single Sensor-Reverser Algorithm is used when there is a single predictor 32, and is illustrated in
Given:
Generate:
The main limitation of this algorithm is that it misclassifies the training set if the data is anti-learnable, i.e. gives μ=μ(ƒ,Z)≦μ0. The following algorithms are designed to overcome this limitation.
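The single sensor-reverser idea can be sketched as follows: estimate the sensor's performance μ on a validation split, and negatively weight its output whenever μ falls at or below the random-guessing threshold μ0. This is an illustrative sketch, not the algorithm as claimed; the names are hypothetical.

```python
# Sketch of a single sensor with a sign reverser: the wrapped predictor's
# output is negatively weighted when validation performance indicates
# anti-learning.

def sensor_with_reverser(f, validation, mu0=0.5):
    """Wrap a trained predictor f. Estimate accuracy mu on the validation
    pairs (x, y); if mu <= mu0 the sensor is anti-learning, so its output
    is multiplied by -1 in the final decision."""
    mu = sum(f(x) == y for x, y in validation) / len(validation)
    sign = +1 if mu > mu0 else -1
    return lambda x: sign * f(x)

# An anti-learning sensor: systematically wrong on held-out items.
f = lambda x: +1 if x > 0 else -1
validation = [(2, -1), (3, -1), (-1, +1), (-4, +1)]
g = sensor_with_reverser(f, validation)
print(g(5), g(-5))
```

Note the limitation described above: the reversed sensor now agrees with the held-out data but disagrees with the training set.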
The following Multi-Sensor with Sign Reverser algorithm is used when there is more than one predictor.
Given:
For i=1:nsens repeat steps 1-4:
For i=1:nsens create sensors by repeating steps 1-2:
A transformed kernel matrix [Kijλ]:=[λδij−Kij]1≦i,j≦m, where λ is the maximal eigenvalue of the symmetric matrix [Kij]1≦i,j≦m and δij is the Kronecker delta symbol;
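The kernel transformation in the step above can be sketched directly. This is an illustrative sketch: in practice an eigenvalue routine from a numerical library would be used, whereas here a simple power iteration stands in for it, and the function names are hypothetical.

```python
# Sketch of the transformed kernel [lam*delta_ij - K_ij], where lam is the
# maximal eigenvalue of the symmetric kernel matrix K.

def max_eigenvalue(K, iters=200):
    """Largest eigenvalue of a symmetric positive semidefinite matrix,
    estimated by power iteration (a library routine would normally be used)."""
    n = len(K)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(abs(x) for x in w)
        v = [x / lam for x in w]
    return lam

def flip_kernel(K):
    """Return [lam*delta_ij - K_ij]: off-diagonal similarities change sign,
    a non-monotonic transformation of the kind mentioned earlier for turning
    anti-learnable similarities into learnable ones."""
    lam = max_eigenvalue(K)
    n = len(K)
    return [[lam * (1.0 if i == j else 0.0) - K[i][j] for j in range(n)]
            for i in range(n)]

K = [[2.0, 0.0], [0.0, 1.0]]
F = flip_kernel(K)
print(F)
```

Because λ is the largest eigenvalue of K, the transformed matrix λI−K remains positive semidefinite, so it is still a valid kernel matrix.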
To understand the use of Mercer kernels in more detail, for simplicity let us consider a feature mapping Φ1:X0→X1. The Mercer kernel for this mapping is a symmetric function k1:X0×X0→R such that k1(x,x′)=<Φ1(x)|Φ1(x′)> for every x,x′εX0 and the symmetric matrix [k1(xi,xj)]1≦i,j≦l is positive definite for every finite selection of points x1, . . . , xl εX0.
Now, for simplicity let us consider a finite subset of the measurement space, Z=((x1,y1), . . . , (xm,ym))εD⊂X0×{±1}. It is convenient to introduce special notation for the symmetric matrix [Kij(1)]=[k1(xi,xj)]1≦i,j≦m, the so-called kernel matrix, representing the kernel k1 on the data of interest. The kernel matrix determines the feature mapping Φ1 on the data points x1, . . . , xm up to an isometry of the feature space.
These two properties allow us to concentrate on the kernel although, conceptually, we investigate the properties of various feature representations.
Examples of popular practical kernels include the linear kernel klin(x,x′)=<x|x′>, the polynomial kernels kd(x,x′)=(<x|x′>+1)d of an integer degree d=2,3, . . . , and the radial basis (RBF) kernel, k(x,x′)=exp(−∥x−x′∥2/σ2), where the parameter σ≠0.
Although the invention has been described with reference to particular examples it should be appreciated that it may be applied in many other situations and in more complex ways. For instance, although we have described binary labels, Y={±1}, the more general case of multi-category classification can be reduced to a series of binary classification tasks, thus our considerations extend to that situation as well. However, the case of regression, another practically important category of machine learning tasks which involves non-discrete labels, is beyond the scope of this specification.
In this section we present examples of anti-learning data.
Elevated XOR
Elevated XOR is a perfectly anti-learnable data set in 3 dimensions which encapsulates the main features of the anti-learning phenomenon, see
Response to Chemotherapy for Oesophageal Cancer
This is a natural data set, composed of microarray profiles of oesophageal cancer tissue. The data has been collected for the purpose of developing a molecular test for prediction of patient response to chemotherapy at the Peter MacCallum Cancer Centre in Melbourne [Duong et al., 2004]. Currently there is no test for such a prediction, and resolution of this issue is of critical importance for oesophageal cancer treatment. Each biopsy sample in the collection has been profiled for expression of 10,500 genes, see
The labels have been used in classification experiments reported in
Modeling Aryl Hydrocarbon Pathway in Yeast
This data consists of the combined training and test data sets used for task 2 of KDD Cup 2002 [Craven, 2002; Kowalczyk Raskutti, 2002]. The data set is based on experiments at the McArdle Laboratory for Cancer Research, University of Wisconsin, aimed at identification of yeast genes that, when knocked out, cause a significant change in the level of activity of the Aryl Hydrocarbon Receptor (AHR) signalling pathway. Each of the 4507 instances in the data set is represented by a sparse vector of 18330 features. Following the KDD Cup '02 setup terminology we experiment here with the either-task, discriminating 127 instances of the pooled “change” and “control” classes (labeled yi=+1) from the rest, i.e. “nc” (4380 instances labeled yi=−1). This data is heavily biased, with the proportions between the positive and negative labels m+:m−≈3%:97%. Hence we have implemented re-balancing via class-dependent regularisation constants in the SVM training:
for y=±1 and C>0. For instance, B=0 facilitates the case of “balanced proportions”, C+1:C−1=m−1:m+1, while B=+1 or B=−1 facilitates single-class learning, from the “positive” (+1SVM) or “negative” (−1SVM) class examples only, respectively.
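The exact formula for the class-dependent constants is not reproduced above, so the sketch below is only one hypothetical interpolation that realises the limiting behaviour stated in the text: at B=0 the ratio C+1:C−1 equals m−1:m+1, while B=+1 and B=−1 reduce to single-class learning.

```python
# A hypothetical sketch of class-dependent regularisation constants.
# NOT the formula of the specification: merely one interpolation that
# reproduces the stated limiting cases (B=0 balanced, B=+/-1 single-class).

def class_constants(m_pos, m_neg, C=1.0, B=0.0):
    """Return (C_pos, C_neg). B=0 gives C_pos:C_neg = m_neg:m_pos;
    B=+1 zeroes C_neg (positive-class-only learning); B=-1 zeroes C_pos."""
    m = m_pos + m_neg
    c_pos = C * (m_neg / m) * (1 + B)
    c_neg = C * (m_pos / m) * (1 - B)
    return c_pos, c_neg

# With the 3%:97% class proportions of the yeast data (127 vs 4380):
print(class_constants(127, 4380, C=1.0, B=0.0))
print(class_constants(127, 4380, C=1.0, B=1.0))
```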
In
The curves show averages of 30 trials. In the experiments we used one- and two-class SVMs and a simple centroid classifier. All plots but one are for the linear kernel (subscript d=1). The curve SVMd=2 is for the polynomial kernel of degree 2; plots for other degrees, d=3,4, were very close to this one (data not shown).
Tissue Growth Model
This is a synthetic data set, an abstract model of uncontrolled tissue growth (like cancer) designed to demonstrate two things:
The Tissue Growth Model is inspired by the existence of a real-life anti-learning microarray data set, and we now present a ‘microarray scenario’ which provably generates anti-learning data. We monitor tissue samples from an organ composed of l cell lines for detection of events where with time t the densities of cell lines depart from an equilibrium d0 according to the law d(t)=(di(t))=d0+(t−t0)νεRl. Here t0 is an unknown time, the start of the disease, and ν=(νi)εRl is a disease progression speed vector. (We assume Σidi(t)=Σid0,i=1, hence Σiνi=0.)
We need to discriminate between two growth patterns, CLASS−1 and CLASS+1, defined as follows. The cell lines are split into three families, A, B and C, of lA, lB and lC cell lines, respectively. CLASS−1 consists of abnormal growths in a single cell line of type A, say the jAεA cell line, resulting in the speed vector vjA=(νijA) with coordinates νijA˜l−1 for i=jA and νijA˜−1 otherwise. The CLASS+1 growths have one cell line of type B, say jBεB, strongly changing, which triggers a uniform decline in all cell lines of type C. This results in the speed vector vjB with coordinates νijB˜b(l−1) for i=jB, νijB˜l−1 for iεC, and νijB˜(lCbl)/(l−lC) otherwise, where bεR. We assume that our sample collection consists of all n=lA+lB possible such growth patterns.
The densities of cell lines are monitored indirectly, via a differential hybridization to a cDNA microarray chip which measures differences between pooled gene activity of cells of the diseased sample and the ‘healthy’ reference tissue, giving n labeled data points
Here M is an ng×l mixing matrix, where ng>>l is the number of monitored genes, and each column of M is interpreted as the genomic signature of a particular cell line, the difference between its transcription and the average of the reference tissue.
Mimicking High-Dimensional Distribution
This is an example where anti-learning data arise naturally, in the case of high-dimensional approximations. This example can also be solved analytically, giving independent evidence for the existence of the anti-learning phenomenon. On the basis of this example one can hypothesize that the immune system of a multi-cellular organism has the potential to force a pathogen to develop an anti-learning signature.
The experimental results demonstrating anti-learning in the mimicry problem are shown in FIGS. 10(a) and (b). These results show discrimination between background and imposter distributions. Curves plot the area under the ROC curve (AROC) for the independent test as a function of the fraction of the background class samples used for the estimation of the mean and std of the distribution. We plot means of 50 independent trials, for SVM filters trained on 50% of the data with regularization constants as indicated in the subscript, and for the Centroid classifier (Cntr). We have used n=1000 and n=5000 dimensional feature spaces respectively, and 100 samples in the background class and another 100 samples in the imposter class. In the background distribution a feature xi has been drawn independently from a normal N(μi,σi) where μi and σi were chosen independently from the uniform distributions on [−5,+5] and [0.5,1], respectively, i=1, . . . , n.
Learning-Features Removal
These two examples demonstrate that anti-learning can also be observed in public domain microarray data. These examples also show that real-life data are a mixture of “learning” and “anti-learning” features which compete with each other. Removal of anti-learning features enhances the performance of learning predictors. Conversely, removal of learning features increases anti-learning performance.
Medulloblastoma Survival
We have used microarray gene expression data, originally studied in [Pomeroy et al., 2002] and now available from Nature's web site. In our experiment we have used data set C only (60 samples containing data for 39 medulloblastoma, a brain cancer, survivors and 21 treatment failures). We have used 4459 features (genes) filtered from the supplied data as described in the Supplementary Information to the above publication.
The results are shown in
Prognosis of Outcome of Breast Cancer from Microarray Data
Here we use microarray gene expression data, originally studied in [van't Veer et al., 2002] and now available from Nature's web site. In our experiment we have used data for prognosis of breast cancer patients. This set of 97 samples contains 51 patients with poor prognosis (marked “<5YS” in the Sample Annotation_BR—1.txt file supplied with the data) and 46 patients with good prognosis (marked “>5YS”). We have used all available 24481 features (genes) without any preprocessing (see the cited publication for details and information on availability of the data). The results are shown in
Hadamard Matrices
Hadamard matrices contain mutually orthogonal rows of entries ±1, constructed with the recursion
Taking an arbitrary row i≠1 of Hn as the set of labels Y, and using the columns of the remaining matrix as data X, we obtain data Hadn−1=(X,Y)⊂Rn−1×{±1}. For instance, for n=4 and i=3 and the linear kernel on R3 we obtain
More generally, since the columns of the Hadamard matrix are orthogonal we obtain yiyj<xi,xj>=nδijyiyj−1<0 for i≠j. This means that the kernel matrix K obtained from Hadn−1 satisfies the conditions of perfect anti-learning. Note that K+c also satisfies the same conditions for any cεR.
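The Hadamard construction and the perfect anti-learning condition above can be checked mechanically. The sketch below uses the Sylvester recursion (so n must be a power of 2) and 0-based row indices; function names are illustrative.

```python
# Construct Hadamard data and verify the perfect anti-learning condition
# y_i * y_j * <x_i, x_j> = -1 < 0 for all i != j.

def hadamard(n):
    """Sylvester construction: H_1 = [[1]], H_2k = [[H, H], [H, -H]].
    n must be a power of 2."""
    H = [[1]]
    while len(H) < n:
        H = [r + r for r in H] + [r + [-v for v in r] for r in H]
    return H

def hadamard_data(n, i):
    """Use row i (0-indexed, i != 0) of H_n as labels Y, and the columns of
    the remaining (n-1) x n matrix as the n data points X in R^(n-1)."""
    H = hadamard(n)
    y = H[i]
    X = [[H[r][c] for r in range(n) if r != i] for c in range(n)]
    return X, y

X, y = hadamard_data(4, 2)
for a in range(4):
    for b in range(4):
        if a != b:
            dot = sum(p * q for p, q in zip(X[a], X[b]))
            print(a, b, y[a] * y[b] * dot)  # always -1
```

Every cross-class and within-class pair of distinct points has negatively weighted similarity, which is exactly the condition of Theorem 1 for perfect anti-learning.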
Results of experiments for a raft of different classifiers are given in
Both the Neural Network and Decision Trees performed close to random guessing. Winnow shows weak anti-learning tendencies; all other classifiers (Naive Bayes, SVM, Centroid, and Ridge Regression) are strongly anti-learning if the noise is not too high. The findings corroborate Theorem 1.
The invention is applicable in many areas, including:
Authentication from multi-dimensional data.
Fraud detection.
Document authorship verification.
Authentication from technological imperfections, such as random imperfections in manufacturing, natural or embedded.
Identification of a printer via multiple natural imperfections.
Money forgery detection.
Watermarking by embedding of slight noise in a document, especially images.
Medical diagnosis, for instance the prediction of response to chemotherapy for esophageal and other cancers and molecular diseases.
It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the invention as shown in the specific embodiments without departing from the spirit or scope of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.
[Bamber, 1975]; D. Bamber. The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. Journal of Mathematical Psychology, 12, 387-415, 1975.
[Craven, 2002], M. Craven, The Genomics of a Signaling Pathway: A KDD Cup Challenge Task, SIGKDD Explorations, 2002, 4(2).
[Duong et al., 2004]; Cuong Duong, Adam Kowalczyk, Robert Thomas, Rodney Hicks, Marianne Ciavarella, Robert Chen, Garvesh Raskutti, William Murray, Anne Thompson and Wayne Phillips, Predicting response to chemoradiotherapy in patients with oesophageal cancer, Global Challenges in Upper Gastrointestinal Cancer, Couran Cove, 2004.
[Kowalczyk Raskutti, 2002], Kowalczyk, A. and Raskutti, B., One Class SVM for Yeast Regulation Prediction, SIGKDD Explorations, 4(2), 2002.
[Raskutti Kowalczyk 2004], Raskutti, B. and Kowalczyk, A., Extreme re-balancing for SVMs: a case study, SIGKDD Explorations, 6 (1), 60-69, 2004.
[Pomeroy et al., 2002], Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., Kim, J., Goumnerova, L., Black, P., Lau, C., Allen, J., Zagzag, D., Olson, J., Curran, T., Wetmore, C., Biegel, J., Poggio, T., Mukherjee, S., Rifkin, R., Califano, A., Stolovitzky, G., Louis, D., Mesirov, J., Lander, E., & Golub, T. (2002). Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415, 436-442.
[van't Veer et al., 2002]: van't Veer, L. J., Dai, H., van de Vijver, M., He, Y., Hart, A., Mao, M., Peterse, H., van der Kooy, K., Marton, M., Witteveen, A., Schreiber, G., Kerkhoven, R., Roberts, C., Linsley, P., Bernards, R., & Friend, S. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, 530-536.
Number | Date | Country | Kind |
---|---|---|---|
2004903944 | Jul 2004 | AU | national |
The present application claims priority from Provisional Patent Application No. 2004903944 filed on 16 Jul. 2004, the content of which is incorporated herein by reference.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/AU05/01037 | 7/18/2005 | WO | 1/16/2007 |