This disclosure is directed to the development and validation of improved predictive models for survival and radiation-induced side-effects of non-small cell lung cancer (NSCLC) patients treated with (chemo)radiotherapy.
Radiotherapy, combined with chemotherapy, is the treatment of choice for a large group of lung cancer patients. Radiotherapy is not restricted to patients with mediastinal lymph node metastasis, but is also indicated for patients who are inoperable because of their physical condition. Improved radiotherapy treatment techniques have allowed an increase of the radiation dose delivered to the tumor, while keeping the occurrence of treatment-induced side-effects, like pneumonitis or esophagitis, to a minimum. Moreover, it has been found that combining radiotherapy with chemotherapy can improve outcomes even further. Several effective chemo radiation schemes are being applied. These developments have led to improved outcomes in terms of survival time while minimizing the occurrence of treatment-related side-effects.
Given the multitude of chemo radiation schemes, it would be desirable to choose the most effective one for each individual patient. The relationship between patient/tumor characteristics, the applied treatment regime and the outcome is not well understood in the medical field to date. For this reason, it is desirable to learn predictive models that can estimate the outcome, both survival and side-effects, for an individual patient under the various treatment options. Given sufficient prediction accuracy, these models can then be used to optimize the treatment of each individual patient, which is the goal of personalized medicine. These models can thus assist the medical doctor with decision support at the point of care.
Although research has been conducted on training such models from existing data, an estimate of the survival probability that is accurate enough to support treatment decisions for an individual patient is currently not available. This motivates a need for improved, robust predictive models that take into account the information available about the patient prior to (and then during) treatment: tumor characteristics, clinical and demographic data, and information on treatment alternatives.
A probabilistic classification task can be formulated as the task of assigning a label to a testing point given a set of labeled training points. Training data is a set of labeled data points D={(yi, xi)|1≦i≦n}, where yi∈{0, 1} is the label of data point xi, and each xi is a vector of feature values in a multidimensional real space. The goal is to predict the label y0 for a new testing point x0, assuming that the probability law that governs the relationship between label and feature values is the same for training and testing data points.
An essential structural assumption is that the probability law that governs the relationship between explanatory variables (coordinates of data points) and explained variable (label) is invariant for training and testing points. Moreover, this law is assumed to belong to a certain class of models. For example, in the popular Generalized Linear Model (GLM) method, one postulates that the actual labels are observed instances of a Bernoulli variable whose mean is a function of a linear combination of feature values,
Pr(Y=1|w,x)=f(x′·w)
where x′ is the transpose of feature vector x, w is a vector of coefficients and x′·w is the inner product. Each model is identified by the coefficient vector. In particular, if f is the logistic function f(z)=(1+exp(−z))^(−1), then one has a logistic regression. This structural assumption makes it possible to leverage the training data in order to predict the label of the testing point. Frequently the probability with which the label is predicted is of greater interest than the label itself.
Most available classification methods select a model w* by maximizing the log likelihood with some regularization term and then use the selected model to predict the probability of a label for feature vector x0. Under the Bernoulli assumption above, the log likelihood for a model w given data D is the following quantity:
l(w|D)=Σi [yi ln f(xi′·w)+(1−yi) ln(1−f(xi′·w))], where the sum runs over 1≦i≦n.
Most statistical packages have a built-in maximum likelihood estimation (MLE) implementation that uses no regularization.
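The log likelihood of the logistic GLM described above can be sketched in a few lines of Python. This is a minimal illustration, not the disclosed implementation: the function names are hypothetical, and the squared 2-norm penalty is just one common choice of regularization term, assumed here for concreteness.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def log_likelihood(w, data):
    """Unregularized Bernoulli log likelihood of coefficient vector w.

    data: list of (y, x) pairs with y in {0, 1} and x a list of feature values.
    Uses log f(z) = -softplus(-z) and log(1 - f(z)) = -softplus(z)
    for the logistic function f.
    """
    total = 0.0
    for y, x in data:
        z = sum(xj * wj for xj, wj in zip(x, w))  # inner product x'.w
        total += -softplus(-z) if y == 1 else -softplus(z)
    return total

def regularized_objective(w, data, penalty=0.0):
    """Log likelihood minus a squared 2-norm penalty (assumed form).

    penalty=0 recovers the plain maximum likelihood objective.
    """
    return log_likelihood(w, data) - penalty * sum(wj * wj for wj in w)
```

A model selection method would maximize `regularized_objective` over w and then report f(x0′·w*) as the label probability for the testing point.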
Clearly, the crux of the task is that the probability law that governs the relationship between explanatory variables and explained variable is unknown except that it belongs to an assumed class and each model has its own prediction of label probability. Consequently, the handling of the model uncertainty is an issue. Traditionally, there are two major approaches: model selection and model averaging.
The model selection approach tries to identify one model in the class and uses it as the proxy for the underlying true model to predict the label. Two concerns for a model are the accuracy/fitness to the training data and the generalizability/robustness on unseen data. The fitness can be measured by the probability with which the model predicts the training data. This quantity is also known as the likelihood of the model. Focusing exclusively on accuracy for training data can lead to overfitting because more complex models can fit better for training data but their performance for testing data can be poor. Various criteria for model selection are attempts to strike a right balance between fitness and complexity.
In a Bayesian model averaging approach the label probabilities from individual models are averaged with weights equal to the posterior probabilities of the models. The posteriors are obtained via Bayes' theorem from the likelihood given the training data and some prior probability function. When the prior, and consequently, the posterior probabilities of models are known, the Bayesian solution has been shown to be optimal. This optimality is a consequence of a fundamental result of the Bayesian decision theory: if the behavior of a decision maker satisfies a number of rational postulates then an action of uncertain outcomes must be evaluated by the average of utilities of the outcomes weighted with probabilities of their realization (expected utility). The link between statistical theory and decision theory has been long established. The difference is that in model selection, the robustness concern is addressed by controlling model complexity while in model averaging this concern is addressed by diversifying the basis of prediction.
A difficulty with the Bayesian approach is in the assumption of prior probability. Proponents of Bayesianism argue that this prior comes from either a decision maker's a priori personal knowledge or the summary of previous and similar studies.
The basic idea behind Bayesian decision theory and the difficulty with the prior assumption motivate a decision theory that directly uses likelihood as an uncertainty measure. The elementary object of this theory is the notion of a “likelihood gamble”. Each model is associated with two quantities: a likelihood awarded by the training data and a label probability for the testing point predicted by the model. A likelihood gamble is the set of such pairs, one for each model in the class. It turns out that, as is the case for Bayesian decision theory, if a decision maker's behavior in dealing with likelihood gambles satisfies a number of rational postulates, then it can be proved that likelihood gambles must be evaluated by a well-defined construct.
Exemplary embodiments of the invention as described herein generally include methods and systems for improved predictive models obtained by using likelihood gamble pricing (LGP), which is a decision-theoretic approach to statistical inference that combines the likelihood principle of statistics with the von Neumann-Morgenstern axiomatic approach to decision making. The regularization induced by the LGP approach produces better probabilistic predictions than both the unregularized and the regularized (by the standard 2-norm regularization) widely used logistic regression approaches. In several survival and side-effect models, an LGP approach according to an embodiment of the invention provides excellent generalization, yielding more accurate and more robust predictions than other standard machine learning approaches.
An LGP approach according to an embodiment of the invention provides a new way of making predictions without learning a specific model, by considering the task of predicting the probability of a binary label, such as the presence of a side-effect or survival at 2 years, of a new example in the light of the available training data. In an LGP approach according to an embodiment of the invention, the (labeled) training data and the (unlabeled) new example are combined in a principled manner to achieve a regularized prediction of the probability of the unknown label. The task of assigning a label to a data point is formulated as a “likelihood gamble” whose price is interpreted as the probability of the label. Informally, a method according to an embodiment of the invention leverages the information available from both the training set and a given testing point to improve predictions by adding extra information when the training set is small and by avoiding overfitting when the training set is large. This regularized prediction is achieved by solving two unregularized maximum likelihood tasks.
An LGP approach according to an embodiment of the invention can be applied to any generalized linear model. Embodiments of the invention are applied to the logistic regression model and the Gaussian regression model. An LGP prediction algorithm is investigated by contrasting it against a standard maximum likelihood method (MLE). Like MLE, LGP is asymptotically consistent, but LGP provides a remedy for a number of known shortcomings of MLE, such as instability in case of a flat likelihood function and underestimation of the probabilities of rare events.
In a further embodiment of the invention, an LGP decision-theoretic approach is examined from a statistical perspective, and several properties are derived: (1) combining an (unlabeled) example with labeled training data so as to predict its class label; (2) comparing LGP to Bayesian statistics and to transductive inference; and (3) providing an approximate solution to LGP that is computationally efficient and yields a regularized prediction of the class label by solving two standard (unregularized) maximum likelihood problems.
According to another embodiment of the invention, a classification algorithm uses hypothetical labels for data points that need to be classified, and computes the probability of a label from the difference of the log likelihoods computed for two hypothetical data sets, each obtained by augmenting the given labeled training data set with hypothetically labeled testing data. A prototype implementation shows superior accuracy compared with the traditional MLE method.
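The hypothetical-label idea can be illustrated with a small Python sketch. Several simplifications are assumptions of this sketch rather than statements of the disclosed method: the feature and the coefficient are scalars, the hypothetical point receives unit weight, a ternary search stands in for a real optimizer, and the log likelihood difference is mapped to a probability with the logistic function. All function names are hypothetical.

```python
import math

def log_lik(w, data):
    """Logistic log likelihood; data is a list of (y, x) with y in {0, 1}."""
    total = 0.0
    for y, x in data:
        z = x * w if y == 1 else -x * w
        # log Pr(y | x, w) = -log(1 + exp(-z)), computed stably
        total -= max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total

def max_log_lik(data, lo=-30.0, hi=30.0, iters=200):
    """Maximum of the concave log likelihood over a scalar w (ternary search)."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_lik(m1, data) < log_lik(m2, data):
            lo = m1
        else:
            hi = m2
    return log_lik(0.5 * (lo + hi), data)

def predict_with_hypothetical_labels(x0, data):
    """Probability that the label of x0 is 1, from two augmented data sets."""
    with_one = max_log_lik(data + [(1, x0)])   # x0 hypothetically labeled 1
    with_zero = max_log_lik(data + [(0, x0)])  # x0 hypothetically labeled 0
    diff = with_one - with_zero
    return 1.0 / (1.0 + math.exp(-diff))
```

For three positive and one negative training example at x=1, the plain maximum likelihood prediction at x0=1 is 0.75, whereas this sketch returns a value pulled toward 0.5, which illustrates the regularizing effect of the two hypothetical fits.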
According to another embodiment of the invention, a classification algorithm predicts discrete or continuous-valued properties, such as the probability of 2-year survival of a cancer patient, the probability of re-occurrence of cancer in a patient after a certain time period, or the risk of suffering from a treatment-related side-effect.
Empirical results on several publicly available datasets confirm that an LGP classification approach according to an embodiment of the invention can produce more accurate and robust predictions than a standard classification method. On simulated data, it is shown that an LGP approach according to an embodiment of the invention outperforms MLE with respect to the mean square error and KL divergence, especially in cases of small training data. This result validates both a theory constructed from basic rational postulates and an efficient algorithm in situations where the training data is scarce. Experiments on UCI data indicate that an LGP approach according to an embodiment of the invention generalizes easily, yielding more accurate and more robust predictions than standard machine learning approaches.
According to an aspect of the invention, there is provided a method for predicting survival rates of medical patients, the method including providing a set D of survival data for a plurality of medical patients having a same condition, providing a regression model, said model having an associated parameter vector β, providing an example x0 of a medical patient whose survival probability is to be classified, calculating a parameter vector β̂ that maximizes a log-likelihood function of β over the set of survival data, l(β|D), wherein the log likelihood l(β|D) is a strictly concave function of β and is a function of the scalar xβ, calculating a weight w0 for example x0, calculating an updated parameter vector β* defined as the parameter vector β that maximizes a function l(β|D∪{(y0,x0,w0)}) wherein data points (y0,x0,w0) augment said set D, calculating a fair log likelihood ratio λf from β̂ and β* using λf=λ(β*|x0)+sign(λ(β̂|x0)){l(β̂|D)−l(β*|D)}, and mapping the fair log likelihood ratio λf to a fair price y0f, wherein said fair price is a probability that class label y0 for example x0 has a value of 1.
According to a further aspect of the invention, the weight w0 is calculated from
wherein λ(β|x0) is a log-likelihood ratio of a likelihood that a class label y0 has a value of 1 over a likelihood that a class label y0 has a value of 0, wherein said log-likelihood-ratio is an affine function of the scalar xβ.
According to a further aspect of the invention, the regression model is a logistic regression model in which the probability of label y being 1 is
Pr(y=1|x,β)=(1+exp(−xβ))^(−1),
wherein λ(β|x)=xβ.
According to a further aspect of the invention, the log-likelihood of β is
According to a further aspect of the invention, the weight w0 is calculated from
According to a further aspect of the invention, the fair log likelihood ratio λf is λf=x0β*+sign(x0β̂)[l(β̂|D)−l(β*|D)].
According to a further aspect of the invention, the fair price is
According to a further aspect of the invention, the regression model is a Gaussian regression model with two clusters having a Gaussian distribution for either class, N(0,σ^2) and N(1,σ^2), wherein σ^2 is a variance.
According to a further aspect of the invention, the log-likelihood of β is
According to a further aspect of the invention, the weight w0 is calculated from
According to a further aspect of the invention, the fair log likelihood ratio λf is
According to a further aspect of the invention, the fair price is
According to a further aspect of the invention, the weight w0=2, and said updated parameter vector β* is determined by maximizing l(β|D∪{(1,x0,1),(0,x0,1)}), wherein (1,x0,1),(0,x0,1) are data points augmenting said set D.
According to another aspect of the invention, there is provided a program storage device readable by a computer, tangibly embodying a program of instructions executable by the computer to perform the method steps for predicting survival rates of medical patients.
(a)-(e) plot the KL divergence of the four methods (logistic regression, regularized logistic regression, the MAP odds approach, and a likelihood gamble pricing approach according to an embodiment of the invention) for the five publicly available UCI datasets.
Exemplary embodiments of the invention as described herein generally include systems and methods for the classification of cancer patient survival data based on a likelihood gamble pricing approach. Accordingly, while the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that there is no intent to limit the invention to the particular forms disclosed, but on the contrary, the invention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.
Likelihood Gamble Pricing
Likelihood gamble pricing is a combination of the likelihood principle of statistics with the von Neumann-Morgenstern axiomatic approach to decision making. It takes from von Neumann and Morgenstern the idea that decision theory under uncertainty should be constructed from basic rational axioms and fuses it with a fundamental idea from statistics: that a likelihood function represents all information from observed data.
Formally, consider an ambiguous situation described by a tuple (X, Y, Θ, A, D). X, Y are the variables describing a phenomenon of interest. X is a predictor variable whose values can be observed. Y is a response variable whose values determine the utility of actions. Θ denotes the set of probabilistic models over X, Y that encode knowledge about the phenomenon. A is the set of actions that are functions from the domain of utility variable Y to the unit interval [0, 1] representing risk-adjusted utility. Finally, the data gathered is denoted by D. Although the term “model” is often used in the literature to denote a family of probability functions sharing a common functional form, the more specific term “parametric model” is used herein for this concept and the term “model” or “complete model” is reserved for a fully specified probability function.
Given an action a∈A, each model θ∈Θ is associated with two quantities: the likelihood of the model Prθ(D) and the expected utility of the action conditional on the model Eθ[a]=∫y dy a(y) Prθ(y). Putting these quantities together gives a likelihood prospect, denoted as (Prθ(D):Eθ[a]). An action in the presence of multiple models Θ and observed data D is described by a collection of prospects. Since likelihood functions are determined up to a proportionality constant, a likelihood function can be normalized so that the maximum is 1:
Lθ=Prθ(D)/PrΘ(D) (1)
where PrΘ(D)≡maxθ∈Θ{Prθ(D)}.
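The normalization of EQ. (1) and the resulting set of prospects can be sketched for a finite set of models as follows. The dictionary representation and the function names are assumptions made for illustration; the disclosure itself allows an arbitrary (possibly continuous) model set Θ.

```python
def normalize_likelihoods(likelihoods):
    """Divide each model's likelihood Pr_theta(D) by the largest one, as in EQ. (1).

    likelihoods: dict mapping a model name to its likelihood given the data.
    """
    top = max(likelihoods.values())
    if top <= 0.0:
        raise ValueError("at least one model must have positive likelihood")
    return {model: value / top for model, value in likelihoods.items()}

def likelihood_gamble(likelihoods, expected_utilities):
    """Pair each normalized likelihood with the model's expected utility E_theta[a]."""
    normalized = normalize_likelihoods(likelihoods)
    return {model: (normalized[model], expected_utilities[model])
            for model in normalized}
```

For example, two models with likelihoods 0.02 and 0.08 receive normalized likelihoods 0.25 and 1, so the best-supported model always carries likelihood 1.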
Thus, maxθ∈Θ{Lθ}=1. A set of prospects with normalized likelihoods {(Lθ:Eθ[a])|θ∈Θ} is called a likelihood gamble. Formally,
Definition 1: The set of likelihood gambles G is defined recursively as follows.
(i) [0, 1]⊂G;
(ii) Let I be a set of indices, li∈[0, 1], maxi∈I li=1 and gi∈G for i∈I, then {(li:gi)|i∈I}∈G; and
(iii) Nothing else belongs to G.
This definition makes the set of likelihood gambles a mixture set. A number x∈[0, 1] is called a constant gamble. It represents an action a in ambiguity-free situations where only one model is possible, i.e., Θ={Prθ}, and x is the expected utility of a with respect to Prθ.
Consider a preference relation ≽ over likelihood gambles. The strict preference ≻ and indifference ˜ relations are defined as usual: g1≻g2 iff g1≽g2 and not g2≽g1, and g1˜g2 iff g1≽g2 and g2≽g1.
It is assumed that ≽ satisfies the following axioms:
(A1) Weak order: ≽ is complete and transitive.
(A2) Archimedean axiom: If f≻g≻h then there exist normalized likelihood vectors (ai) and (bi) such that {a1:f, a2:h}≻g and g≻{b1:f, b2:h}.
(A3) Independence: Suppose f≽g and let (ai) be a normalized likelihood vector, then for any h, {a1:f, a2:h}≽{a1:g, a2:h}.
(A4) Compound gamble: Suppose f1={b1i:f1i|iεI}, f2={b2j:f2j|jεJ}, then
(A5) Idempotence: {ai:f|iεI}˜f for any set I.
(A6) Numerical order: For f, h∈[0, 1], f≧h iff f≽h.
The measure of uncertainty for these axioms is the statistical likelihood.
Definition 2:
There is the following representation theorem.
Theorem 4 (Representation Theorem): If ≽ satisfies axioms A1-A6, then it is represented by a utility function that maps likelihood gambles to B such that
The range of this utility function is no longer the real line but a set of 2-dimensional vectors.
Definition 5: A number 0≦x≦1 is called the price or the ambiguity-free equivalence of gamble g if g˜x.
The representation theorem suggests a procedure for determining the price of a likelihood gamble. First, calculate the gamble's 2-dimensional utility according to EQ. (4) and then find a constant gamble having the same 2-dimensional utility. Indeed, suppose U(g)=(α,β)=U(v) where v is a number in the unit interval. From the representation theorem, it follows that g˜{α/1, β/0}˜v. By transitivity, g˜v. Therefore, v is the price of gamble g.
Gambles of the form {α/1, β/0} are called canonical gambles. {α/1, β/0} describes an action f in a simple situation where the class of models consists of just two elements Θ={θ1, θ2} with corresponding normalized likelihoods α, β. The risk-adjusted expected utility of f is 1 under θ1 and 0 under θ2. It can be shown that the indifference v˜{α/1, β/0} implies the following equation that relates the values v and α, β.
logit(v)=ln(α/β)+c, (5)
where logit(z)=ln(z/(1−z)) for z∈(0, 1) and c is a parameter representing the decision maker's degree of ambiguity aversion. Intuitively, c determines the price that the decision maker would pay for a “fair” canonical gamble {1/1, 1/0} where data equally supports each of two possible models.
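As a small worked illustration of EQ. (5), the price of a canonical gamble can be computed by inverting the logit, taken here as the usual log-odds ln(v/(1−v)). The function name and the handling of the boundary cases α=0 and β=0 are assumptions of this sketch.

```python
import math

def canonical_price(alpha, beta, c=0.0):
    """Price v of the canonical gamble {alpha/1, beta/0}.

    Solves logit(v) = ln(alpha/beta) + c for v, where alpha and beta are
    normalized likelihoods (max(alpha, beta) == 1) and c is the
    ambiguity-aversion parameter.
    """
    if not (0.0 <= alpha <= 1.0 and 0.0 <= beta <= 1.0) or max(alpha, beta) != 1.0:
        raise ValueError("likelihoods must lie in [0, 1] with maximum equal to 1")
    if beta == 0.0:   # only the utility-1 model is supported (assumed limit)
        return 1.0
    if alpha == 0.0:  # only the utility-0 model is supported (assumed limit)
        return 0.0
    z = math.log(alpha / beta) + c
    return 1.0 / (1.0 + math.exp(-z))
```

With c=0 the fair gamble {1/1, 1/0} is priced at 0.5, and a negative c lowers that price, which is one way to read c as an ambiguity-aversion setting.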
Given c, EQ. (5) can be solved for α, β taking into account max(α, β)=1:
Here it is assumed c=0. Then EQ. (6) can be rewritten as
A simple likelihood gamble pricing (LGP) algorithm according to an embodiment of the invention for finding the price of a gamble {lθ/vθ|θ∈Θ} is as follows.
1. Solve 2 maximization tasks:
where ln(αθ), ln(βθ) are given in EQ. (7). U(vθ)=(αθ,βθ).
2. Defining A=ln(α*)−ln(β*), the gamble's price is
Note that the two maximization tasks (8) and (9) are a consequence of EQ. (4) and EQ. (10) follows from EQ. (5).
The relationship between constant gambles and their utilities is formalized by a function t mapping the unit interval [0, 1] into B. The left and the right components of t(v) are denoted by tα(v) and tβ(v), i.e., t(v)=(tα(v), tβ(v)). The intuitive meaning of t is to convey the indifference relation. For instance, t(0.4)=(0.6, 1) means that the decision maker is indifferent between getting 0.4 and playing a likelihood gamble {0.6:1, 1:0} that delivers utility 1 with (relative) likelihood 0.6 and utility 0 with likelihood 1. One assumption about t is that the ratio tα(v)/tβ(v) is strictly increasing in v. This allows one to define the inverse t^(−1).
With the help of t, the pricing formula for likelihood gambles is obtained by using the definition of max in EQS. (2) and (4). Suppose v is the price of {Li:vi|iεI}, then
The pricing formula calls for solving two maximization tasks. One maximizes the product of the likelihood and the left component of t(v), the other maximizes the product of the likelihood and the right component of t(v). The obtained 2-dimensional utility is converted into a scalar utility by the inverse of t.
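The two-maximization procedure just described can be sketched for a gamble with finitely many prospects. The conversion function used below is an assumption of this sketch: it sets the likelihood ratio tα(v)/tβ(v) equal to the odds v/(1−v) and normalizes the larger component to 1, which corresponds to c=0. The function names are hypothetical.

```python
def t(v):
    """Assumed conversion of a constant gamble v into a 2-dimensional utility."""
    if v <= 0.0:
        return (0.0, 1.0)
    if v >= 1.0:
        return (1.0, 0.0)
    odds = v / (1.0 - v)
    return (min(1.0, odds), min(1.0, 1.0 / odds))

def t_inverse(alpha, beta):
    """Constant gamble whose odds equal alpha/beta (scale invariant)."""
    return alpha / (alpha + beta)

def price_gamble(prospects):
    """Price a gamble given as a list of (normalized likelihood, value) pairs."""
    # First task: maximize likelihood times the left component of t(v).
    alpha_star = max(L * t(v)[0] for L, v in prospects)
    # Second task: maximize likelihood times the right component of t(v).
    beta_star = max(L * t(v)[1] for L, v in prospects)
    # Convert the 2-dimensional utility back into a scalar price.
    return t_inverse(alpha_star, beta_star)
```

A gamble with a single prospect is priced at its own value, while `price_gamble([(1.0, 0.9), (0.5, 0.1)])` returns about 0.667: the less likely model that predicts 0.1 pulls the price away from 0.9 and toward 0.5.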
To summarize, there is a formula for prediction on the basis of the observed data. This formula, derived from rational postulates, offers a new prediction paradigm which is distinct from traditional approaches such as model selection and Bayesian model averaging. In the model selection approach, the observed data is used to select the best fitted model which is then used for prediction. In Bayesian model averaging, predictions by individual models are “averaged” with posterior model probability calculated from some prior model probability and the likelihood function computed from the data. As reviewed above, the prediction according to LGP involves neither a specific selected model nor a posterior model probability. To address the issue of what benefits are offered by an LGP approach according to an embodiment of the invention, the pricing formula (11) will be used to solve for a logistic regression model.
The following notation convention will be used: upper case letters denote random variables; lower case letters are used for their instantiations; a colon “:” is used to separate likelihoods from values in likelihood prospects and gambles; calligraphic letters are used for sets. L denotes (normalized) likelihoods, and l the log likelihoods. By default, the observed data D on which (log) likelihoods are based is omitted. w and its indexed versions denote models, or their coefficient vectors. xi denote points of training data; x (without index) denotes the point where prediction is needed. ĥ and g denote predictions by MLE and LGP, respectively.
Gamble View of the Logistic Model
The logistic regression model is a workhorse in statistics with applications in categorical data analysis and machine learning. The predictor X is a multi-dimensional random variable. Each component X(i) of X will be referred to as an (input) feature. It is often assumed for convenience that the first feature is a constant X(1)=1. The response (class) Y is a discrete random variable. The focus hereinafter will be on the binary case where Y can take values −1 or 1. An instance of X will be arranged as a row vector x. The data generation process is such that the logarithm of the odds of classes is a linear combination of feature values. Equivalently, the class probability conditional on x and coefficient vector w is given by
Pr(y|x,w)=(1+exp(−yxw))^(−1), (12)
where w is a column coefficient vector. Thus,
log(Pr(y=1|x,w)/Pr(y=−1|x,w))=xw.
Consider the following forecast problem. Given the data generation process of the logistic model and a random sample of labeled data D={(xi, yi)|1≦i≦n}, it is desired to predict the class of a given point x.
Normally, this forecast task is not solved directly, but rather is formulated as a by-product of the task of estimating the coefficient vector w. The standard approach is to estimate ŵ via maximum likelihood. The estimated probability for a test example x is then Pr(Y=1|x, ŵ)=(1+exp(−xŵ))^(−1).
From the likelihood gamble point of view, the forecast task for given x is represented by the following gamble:
{LD(w):Pr(y=1|x,w)|w∈R^m}, (13)
where LD(w) is the normalized likelihood of w given the data D. This will be referred to as the forecast gamble.
Some basic facts about the logistic model will now be reviewed. The log likelihood function is
It is also well known that l(w) is concave.
Before EQ. (11) can be used, the conversion function t needs to be determined. To do so, recall that the semantics of t(v) is such that the gamble {tα(v):1, tβ(v):0} is considered equivalent to v by the decision maker. In the case of the logistic model, v is the probability Pr(y=1|x,w). A natural condition that can be imposed is to equalize the likelihood ratio and the posterior odds, that is, tα(v)/tβ(v)=v/(1−v). Together with the normalization condition max(tα(v), tβ(v))=1, this leads to the solution
tα(v)=min(1, v/(1−v)), tβ(v)=min(1, (1−v)/v)
for 0<v<1. For the boundary cases, tα(1)=1, tβ(1)=0, tα(0)=0 and tβ(0)=1.
The inverse function t^(−1) is then
t^(−1)(α, β)=α/(α+β).
Gamble (13) can be priced by formula (11) when plugging in EQS. (12), (14), (15) and (16).
Next, some simplifications are outlined. Notice that, instead of maximizing the product, it is more convenient to maximize its logarithm:
log(LD(w)tα(Pr(y=1|x,w)))=l(w)+min(0,xw),
log(LD(w)tβ(Pr(y=1|x,w)))=l(w)+min(0,−xw),
where l(w)=log(LD(w)) because of EQ. (15), and
log(Pr(y=1|x,w)/Pr(y=−1|x,w))=xw,
log(Pr(y=−1|x,w)/Pr(y=1|x,w))=−xw.
Thus, the price for the forecast gamble can be found in two steps:
1. Solve the 2 maximization tasks:
a=maxw(l(w)+min(0,xw)) (17)
b=maxw(l(w)+min(0,−xw)) (18)
2. The gamble's price is then p=(1+exp(−a+b))^(−1).
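The two steps above can be sketched in Python for the simplest possible setting. This is an illustration under stated assumptions, not the disclosed implementation: the feature and the coefficient w are scalars (no intercept), labels are −1 or 1 as in EQ. (12), the unnormalized log likelihood is used (the normalizing constant cancels in a−b), and a ternary search over a bounded interval stands in for a proper concave optimizer. All function names are hypothetical.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def log_lik(w, data):
    """Log likelihood of scalar w; data is a list of (x, y) with y in {-1, 1}."""
    return -sum(softplus(-y * x * w) for x, y in data)

def maximize_concave(f, lo=-50.0, hi=50.0, iters=200):
    """Ternary search for the maximum of a concave function of one variable."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    w = 0.5 * (lo + hi)
    return w, f(w)

def lgp_price(x0, data):
    """Price of the forecast gamble at x0 via tasks (17) and (18)."""
    _, a = maximize_concave(lambda w: log_lik(w, data) + min(0.0, x0 * w))
    _, b = maximize_concave(lambda w: log_lik(w, data) + min(0.0, -x0 * w))
    return 1.0 / (1.0 + math.exp(-(a - b)))

def mle_price(x0, data):
    """Plain maximum likelihood prediction at x0, for comparison."""
    w_hat, _ = maximize_concave(lambda w: log_lik(w, data))
    return 1.0 / (1.0 + math.exp(-x0 * w_hat))
```

With three positive and one negative example at x=1, `mle_price(1.0, data)` is 0.75 while `lgp_price(1.0, data)` is 27/43, about 0.628, so the LGP price sits between 0.5 and the maximum likelihood prediction.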
Denote the value p found by an LGP algorithm according to an embodiment of the invention by gD(x). In the following, its properties are investigated mostly by contrasting it with the maximum likelihood solution ĥD(x).
Denote the solutions of (17) and (18) as well as the maximum likelihood solution as follows:
It is possible that these maximization problems may have more than one solution. In this case, ŵ denotes any point where the log likelihood is maximal.
Theorem 6: For a, b found by EQS. (17) and (18): either a=l(ŵ) or b=l(ŵ).
Proof: By definition of ŵ, ŵa and ŵb, for any w, there is
a≧l(w)+min(0,xw), (22)
b≧l(w)+min(0,−xw), (23)
l(ŵ)≧l(w). (24)
Due to EQ. (24), l(ŵ)≧l(ŵa) and l(ŵ)≧l(ŵb). For scalar xŵ, either xŵ≧0 or xŵ≦0. If xŵ≧0, then l(ŵ)+min(0,xŵ)=l(ŵ). Since min(0,xŵa)≦0, one has l(ŵ)≧l(ŵa)≧a≧l(ŵ)+min(0,xŵ)=l(ŵ), where the third inequality is EQ. (22) with w=ŵ. Thus, l(ŵ)=a. In case xŵ≦0, one can show l(ŵ)=b analogously.
From this theorem, a second version of LGP can be formulated that offers some savings when computing the prices of many forecast gambles on the same data set. If xŵ≧0, then a=l(ŵ) and
b=maxw(l(w)−xw) such that xw≧0. (25)
Otherwise if xŵ≦0, then b=l(ŵ) and
a=maxw(l(w)+xw) such that xw≦0. (26)
Theorem 7: Suppose gD(x) is the price of the forecast gamble and ĥD(x) is an MLE prediction, then
(gD(x)−0.5)(ĥD(x)−0.5)≧0,
(gD(x)−ĥD(x))(ĥD(x)−0.5)≦0.
Proof: It is necessary to show that (a) if ĥD(x)≧0.5, then ĥD(x)≧gD(x)≧0.5 and (b) if ĥD(x)≦0.5, then ĥD(x)≦gD(x)≦0.5. Only (a) is proved; (b) is analogous. Suppose ĥD(x)≧0.5 and ŵ is the vector of maximum likelihood coefficients. This means xŵ≧0. As shown in Theorem 6, for a, b found from EQS. (17) and (18) one has a=l(ŵ)≧b. Therefore gD(x)=(1+exp(−a+b))^(−1)≧0.5. To show ĥD(x)≧gD(x), let b′=l(ŵ)−xŵ. It is easy to see that b=maxw(l(w)+min(0,−xw))≧b′, because b′ is the value of the objective of EQ. (18) at w=ŵ. Because ĥD(x)=(1+exp(−a+b′))^(−1), gD(x)=(1+exp(−a+b))^(−1) and b≧b′, it is concluded that ĥD(x)≧gD(x).
Theorem 8: If ŵa=ŵb, then gD(x)=ĥD(x).
Proof: By Theorem 6, either l(ŵa)=l(ŵ) or l(ŵb)=l(ŵ). This means that either ŵa or ŵb is a maximum likelihood coefficient vector. Because ŵa=ŵb one can write ŵa=ŵb=ŵ. Now a=l(ŵ)+min(0,xŵ) and b=l(ŵ)+min(0,−xŵ). Therefore a−b=xŵ. Thus gD(x)=(1+exp(−xŵ))^(−1)=ĥD(x).
Note that when there is more than one model achieving the maximum likelihood, the ML method is silent on the question of which model should be used to determine the price of a gamble. In practice, to avoid such indeterminacy, a secondary and often ad hoc criterion, such as minimization of the coefficient vector's norm, is introduced to choose among maximum likelihood models. In contrast, LGP does not have this problem because the price depends on the values of the objective functions (likelihoods) rather than on the locations in the coefficient space where the objective functions reach their maxima.
It is well known that MLE is asymptotically consistent and it can be shown that gamble pricing is also consistent. For convenience, one makes the technical assumption that the marginal density of X is uniform.
Theorem 9: Under standard regularity conditions, such as iid, identifiability, differentiability, open parameter space, etc., the LGP price asymptotically converges to the true value.
Proof: A sketch of the proof is provided. Denote by Z the random value obtained by joining X and Y, i.e., Z=(X, Y). With the uniform marginal density assumption, one has the unconditional log likelihood log(Pr(z|w))=c+log(Pr(y|x, w)), where c is the log of the uniform density. Denote the true parameter by w0. Under the standard regularity conditions the average (unconditional) log likelihood,
where n is the sample size, converges to the expected log likelihood function
E[log Pr(Z|w)]=c+E[log Pr(Y|X,w)].
The expected log likelihood function is maximized at the true parameter w0, i.e., for any w,
E[log Pr(Y|X,w0)]≧E[log Pr(Y|X,w)].
Suppose xw0≧0. Because of ŵa=ŵ (Theorem 6) and the result that ŵ, the maximum likelihood estimate, converges to w0, it holds in the limit that ŵa=w0, i.e., w0 is the solution of the maximization task (17). Next one has to show that, in the limit, w0 is also the solution to the maximization task (18), i.e., ŵb=w0:
l(w0)−xw0≧l(w)+min(0,−xw)
for any w. Because min(0,−xw)≦0, it suffices to show l(w0)−l(w)≧xw0, or equivalently
This inequality holds in the limit: RHS converges to 0 and LHS is non-negative because
E[log Pr(Y|X,w0)]≧E[log Pr(Y|X,w)].
As ŵa=ŵb=w0, the price given by LGP equals the true probability g(x)=Pr(y=1|x,w0)=(1+exp(−xw0))^(−1). The case where xw0<0 is proved similarly.
Note that the properties of LGP listed in the theorems are not specific to a logistic model with a concave log likelihood function. With some minor modifications, these properties can be proven without the requirement of concavity.
Solving the Optimization Task Via Smoothing
Since optimization equations (17) and (18) have to be solved for each point for which a prediction is needed, it is useful to solve them efficiently. First, note that the optimization tasks (17) and (18) are strictly concave problems, which ensures the existence of unique maximizers. The concavity of these objective functions follows from the fact that the function f(y)=min{0, y} is concave and nondecreasing, that composing such a function with a linear function of w preserves concavity, and that the sum of concave functions is concave. On the other hand, min{0, y} is not differentiable at y=0, which makes it difficult to use well-behaved Newton-like methods that depend on derivatives for their fast convergence rates. To address this, a smoothing technique is used that results in an objective function with more suitable properties for optimization equations (17) and (18). Since the derivative with respect to x of the max function g(x)=max{0, x} is the step function, the differentiable function p(x, ξ), the integral of the sigmoid function (1+exp(−ξx))^{−1}, is used as a smooth approximation to g(x)=max{0, x}:
$$p(x,\xi) = x + \frac{1}{\xi}\log\bigl(1+\exp(-\xi x)\bigr).$$
Then, given ξ, one has:
$$m(w,x_0,\xi) = -p(-x_0 w,\xi) \approx \min\{0,\, x_0 w\}.$$
Hence instead of solving maximization EQS. (17) and (18), one solves the following smooth concave approximations:
ln(α*)=maxw{ln(l(w))+m(w,x0,ξ)}, (28)
ln(β*)=maxw{ln(l(w))+m(−w,x0,ξ)}. (29)
More specifically, for the logistic regression framework under consideration, the two optimization tasks to solve are:
which are both unconstrained concave maximization tasks for which both the gradient and the Hessian can be easily calculated. According to an embodiment of the invention, a Newton-Armijo method is used to solve EQS. (30) and (31), which provides quadratic convergence to the global solution of the approximated problem. It can be proved that the solution of this smooth approximation to the objective functions of EQS. (17) and (18), which depends on the parameter ξ, converges to the solution of the original equations as ξ approaches infinity, so that it is as close as desired for sufficiently large values of the smoothing parameter ξ.
Now suppose that ξ=1 and y0=1 in EQ. (30). It can be shown that
which is precisely the log likelihood of w given that x0 is a positive point in the training set. It is also trivial to show that EQ. (31) coincides with the log likelihood of w given that x0 is a negative point in the training set, when it is assumed ξ=1 and y0=0.
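The smoothing step can be tried numerically. The following is a minimal sketch, assuming that the smooth term m is obtained from p through the identity min{0, t}=−max{0, −t}; the function names smooth_plus and smooth_min0 are illustrative and do not appear in the text above.

```python
import numpy as np

def smooth_plus(x, xi):
    """p(x, xi) = x + log(1 + exp(-xi*x)) / xi, a smooth stand-in for max{0, x}."""
    # log(1 + exp(xi*x)) equals xi*x + log(1 + exp(-xi*x)), so dividing
    # logaddexp(0, xi*x) by xi gives p(x, xi) without overflow.
    return np.logaddexp(0.0, xi * np.asarray(x, dtype=float)) / xi

def smooth_min0(t, xi):
    """Smooth stand-in for min{0, t}, using min{0, t} = -max{0, -t} (assumed form of m)."""
    return -smooth_plus(-np.asarray(t, dtype=float), xi)

t = np.linspace(-3.0, 3.0, 13)
for xi in (1.0, 10.0, 100.0):
    gap = np.max(np.abs(smooth_min0(t, xi) - np.minimum(0.0, t)))
    print(f"xi={xi:6.1f}  max |smooth - min{{0,t}}| = {gap:.4f}")

# With xi = 1 the smooth term is the log of the logistic function,
# i.e. the log likelihood contribution of one positive point.
log_sigmoid = -np.logaddexp(0.0, -t)
print(np.allclose(smooth_min0(t, 1.0), log_sigmoid))
```

The largest gap sits at t=0 and shrinks in proportion to 1/ξ, which is the sense in which larger ξ brings the smooth task closer to the original one.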
Illustration
Asymptotic properties are nice but in practice they do not provide much assurance because of the limited data availability. In this section, an LGP according to an embodiment of the invention is illustrated by an application to the space shuttle O-ring data set, illustrated in the table of
A noticeable feature of the LGP curve is the “bump” found near the intersection with the MLE curve where LGP switches from being under the MLE curve to above the MLE curve (or vice versa). Intuitively, the bump is a reflection of the ignorance that the observed data bears on the prediction points. To illustrate that,
This example also illustrates another property of MLE: the instability which occurs when the likelihood function is flat around its maximum. Strictly speaking, the MLE prediction 31 in
Understanding LGP
This section discusses a possible understanding of an LGP according to an embodiment of the invention in terms of virtual evidence and regularization with Fisher information.
Virtual Evidence
Consider the question of how much one would be willing to bet on Y=1 at X=x given the data D={(xi, yi)|1≦i≦n}. As gamblers often find it convenient to work with odds rather than probability, the question can be rephrased as the prediction of the log odds in favor of Y=1 at point x. Notice that the only available formula for the log odds is conditional on a model: log(Pr(Y=1|x, w)/Pr(Y=−1|x, w))=xw. An obvious way to come up with an estimate that does not depend on the unknown model w is to take the expectation with respect to w. So the question reduces to the estimation of E[xw].
One starts by assuming that x=xi, that is, one wants to estimate the odds at a point with an observed label. Denote by D′ the “alternative” data obtained by flipping the label of point i, i.e.,
D′={(x1,y1) . . . (xi,−yi) . . . (xn,yn)}
The difference of log likelihoods computed on D and D′ is lD(w)−lD′(w)=yixiw. This equation implies that the log odds can be estimated through the difference of the log likelihoods of D and D′. For simplicity, assume yi=1. Viewing lD(w), lD′(w) and xiw as functions of w and applying the expectation operator with respect to the distribution of w on both sides, one has
E[xiw]=E[lD(w)]−E[lD′(w)] (33)
Because the mean can be approximated by the mode, E[lD(w)] and E[lD′(w)] are approximated by the modes mod [lD(w)] and mod [lD′(w)], respectively. The bound of this approximation can be quantified. Because the log likelihood function for the logistic model is concave, it is unimodal. Under a reasonable assumption, the probability density of w is unimodal too. In this case the distance between mean and mode is at most √3×standard deviation.
The mod [lD(w)] can be approximated by maxwlD(w). Because lD(w) is a (log) likelihood function, a higher value of lD(w) indicates a higher probability (density) Pr(w). The exact relationship can be seen via Bayes' theorem with the assumption of a uniform prior. A higher density Pr(w) in turn translates into a higher density Pr(lD(w)). By definition
mod [lD(w)]=lD(w*),
where w* is such that Pr(lD(w*))≧Pr(lD(w)) for all w.
By the above approximate reasoning, Pr(lD(w*))≧Pr(lD(w)) for all w holds if lD(w*)≧lD(w) for all w. This means lD(w*)=maxw lD(w). One can argue similarly for mod[lD′(w)], where D′ denotes the alternative data with the flipped label. To sum up, one has
E[lD(w)]≈mod[lD(w)]≈maxwlD(w)
E[lD′(w)]≈mod[lD′(w)]≈maxwlD′(w)=maxw(lD(w)−xw)
It is reassuring that this intuitive derivation arrives at equations that are identical with the ones of an LGP approach according to an embodiment of the invention, for the special case that the “test” example is already in the training data. While it is unclear how to generalize this intuitive approach to the case of a test example that is not part of the training data, an LGP approach according to an embodiment of the invention provides a solution for the general case.
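The identity relating the two log likelihoods at the start of this derivation can be verified in one line for the logistic model with labels in {1, −1}, where Pr(y|x, w)=(1+exp(−yxw))^{−1}. The following is a short worked check; the symbol D′ for the label-flipped data is a notation assumed here. All terms except the one for point i cancel, because the two data sets differ only in that label:

```latex
\begin{aligned}
l_D(w) - l_{D'}(w)
  &= \log \Pr(y_i \mid x_i, w) - \log \Pr(-y_i \mid x_i, w) \\
  &= \log \frac{1 + e^{\,y_i x_i w}}{1 + e^{-y_i x_i w}}
   = \log e^{\,y_i x_i w}
   = y_i x_i w .
\end{aligned}
```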
Leading-Order Approximation
This section provides another view of LGP by relating it to the typical regularization in frequentist statistics, which is based on a correction term to the maximum likelihood approach. The leading-order term that causes the regularization of the predicted price is derived. It turns out that the (observed) Fisher information plays a role. To simplify notation, assume that xŵ>0 (the other case xŵ≦0 is similar). Recall that the price is
P=(1+exp(−l(ŵ)+l(w*)−xw*))^{−1} (34)
where w* is a solution of EQ. (25) i.e.,
w*=arg maxw(l(w)−xw) (35)
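It is assumed that the value of w* is “close” to ŵ, so that the leading-order Taylor expansion of l is accurate. This assumption can be expected to hold for large data sets, as the absolute value of l(w) can be expected to grow linearly in n, while the value of xw is independent of the sample size. Hence the effect of the latter term becomes more and more negligible for large data sets, and w* converges asymptotically to ŵ. One can carry out a leading-order Taylor expansion of l(w*):
$$l(w^*) = l(\hat{w}) + \tfrac{1}{2}\,(w^*-\hat{w})^T H(\hat{w})\,(w^*-\hat{w}) + O\bigl((w^*-\hat{w})^3\bigr), \quad (36)$$
where the first derivative vanishes as ŵ is the ML solution, and H denotes the Hessian matrix of the log-likelihood evaluated at ŵ:
$$H(\hat{w}) = \left.\frac{\partial^2 l(w)}{\partial w\,\partial w^T}\right|_{w=\hat{w}}.$$
Specifically, for the logistic model
$$H(\hat{w}) = -\sum_{i=1}^{n} \hat{p}_i\,(1-\hat{p}_i)\,x_i^T x_i, \qquad \hat{p}_i = \bigl(1+\exp(-x_i\hat{w})\bigr)^{-1}. \quad (37)$$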
Note that −H(ŵ)/n equals the observed Fisher information matrix if no data is missing. This is discussed below. Using this leading-order approximation, one can now find the approximate solution of EQ. (35) by equating its first derivative with zero:
H(ŵ)(w*−ŵ)−x^T=0+O((w*−ŵ)^2). (38)
This yields the approximate solution for w*:
w*=ŵ+H^{−1}(ŵ)x^T+O((w*−ŵ)^2) (39)
Plugging EQS. (39), (36) into EQ. (34) yields
p≈(1+exp(−xwR))^{−1} (40)
where
wR=ŵ+H^{−1}(ŵ)x^T. (41)
From a computational view-point, note that the values of ŵ and H−1(ŵ) need be estimated only once (during training) in this approximation, while the prediction for a new point x can then be made without solving an optimization task involving this point (in contrast to the exact solution).
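The computational point can be made concrete with a small sketch. It fits a logistic model once, stores ŵ and the inverse Hessian, and then prices a new point by inserting the approximate w* of EQ. (39) into EQ. (34), without a per-point optimization. The synthetic data, the helper names fit_ml and approx_price, and the use of plain Newton iterations for the ML fit are assumptions made for illustration; labels are coded 1/0 instead of 1/−1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_lik(w, X, y):
    """Logistic log likelihood for labels y in {0, 1}."""
    z = X @ w
    return float(np.sum(y * z - np.logaddexp(0.0, z)))

def hessian(w, X):
    """Hessian of the logistic log likelihood (negative definite)."""
    p = sigmoid(X @ w)
    return -(X.T * (p * (1.0 - p))) @ X

def fit_ml(X, y, iters=25):
    """Plain Newton iterations for the ML estimate."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (y - sigmoid(X @ w))
        w = w - np.linalg.solve(hessian(w, X), grad)
    return w

# Hypothetical synthetic training data (intercept plus two features).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(300), rng.normal(size=(300, 2))])
y = (rng.random(300) < sigmoid(X @ np.array([0.5, 1.0, -1.0]))).astype(float)

# Training: done once.
w_hat = fit_ml(X, y)
H_inv = np.linalg.inv(hessian(w_hat, X))

def approx_price(x):
    """EQ. (34) evaluated at the approximate w* of EQ. (39); assumes x @ w_hat > 0."""
    w_star = w_hat + H_inv @ x
    expo = -log_lik(w_hat, X, y) + log_lik(w_star, X, y) - x @ w_star
    return 1.0 / (1.0 + np.exp(expo)), w_star

x_new = np.array([1.0, 1.0, -0.5])
price, w_star = approx_price(x_new)
ml_price = sigmoid(x_new @ w_hat)
print(f"ML price {ml_price:.4f}, approximate LGP price {price:.4f}")
```

A quick way to judge the approximation is the gradient of l(w)−xw: at ŵ its norm equals the norm of x, while at the approximate w* it should be much smaller.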
This approximation yields the leading-order correction term that causes the regularization of the predicted price/probability of Y=1 at x compared to the ML prediction. EQ. (41) shows that the regularized parameter estimate wR differs from the ML estimate ŵ in leading order due to the novel regularization term:
$$H^{-1}(\hat{w})\,x^T.$$
Several comments are in order.
First, note that the correction term is not a constant but depends on the data point x whose label is to be predicted. This is an interesting difference with respect to existing model selection criteria, such as the AIC or BIC. As a consequence, the prediction is not linear in x, so that the predicted price as a function of x lies outside of the specified (linear) model class (cf. the O-ring illustration). Second, as H^{−1}(ŵ)∝1/n in the asymptotic limit (n→∞), cf. EQ. (37), the regularization can be expected to decrease with growing n and to vanish in the limit, i.e., the regularized parameter estimate then coincides with the ML estimate, as desirable (cf. also Theorem 9). Third, note that the inverse Hessian H^{−1}(ŵ) is negative definite, as ŵ is the ML estimate. Hence the predicted price is closer to 0.5 than the price predicted by ML, as desirable from a regularization point of view (cf. also Theorem 7). Fourth, as the Hessian H(ŵ) describes the curvature of the log-likelihood at ŵ, its inverse, and hence the correction, is larger for flatter log-likelihood surfaces. This is desirable, as a flat log-likelihood surface indicates large model uncertainty, and consequently a large uncertainty in the prediction. This can be formalized using the fact that −H(ŵ)/n equals the observed Fisher information matrix I(ŵ) if no data is missing.
The negative Hessian matrix −H(ŵ) is commonly identified with n times the observed Fisher information, n·I(ŵ), where n is the number of data points. Since in the asymptotic limit (n→∞), the observed Fisher information converges to the expected Fisher information, the Cramér-Rao inequality for the unbiased estimator ŵ provides an asymptotic lower bound of the estimator's covariance Σ(ŵ) in terms of the inverse Fisher information matrix,
Σ(ŵ)≧I^{−1}(ŵ)=−n·H^{−1}(ŵ),
where ‘≧’ means that the difference is a positive semidefinite matrix. Using this asymptotic bound, and provided that ŵ is an efficient estimator, EQ. (41) suggests that
for sufficiently large n. This implies that the ML estimate ŵ is regularized by its own variance, to obtain the fair prediction for x. In the experiments reported herein, this may also explain the observed “bump” where the predicted price is close to ½.
Although this discussion has been presented in terms of the logistic model, these results hold as long as the log likelihood function l(w) is concave and the log odds log Λ, with Λ=Pr(Y=1|x, w)/Pr(Y=−1|x, w), is a linear function of xw, which is true for the family of the generalized linear models.
Simulation Study
A limitation of relying on real data to study the properties of a method is the lack of replicability. This issue can be addressed by a simulation study. Here, an LGP method according to an embodiment of the invention will be contrasted against MLE on data simulated by the logistic model. The basic step is as follows. First, generate training data D from a known coefficient vector w0 and then compare the predictions based on D for the testing data T by LGP and ML methods against the true response probabilities computed from w0 and T. A measure of accuracy that can be used is the mean square error, which is the average of squared distance between predictions and true response probabilities. Another accuracy measure is the Kullback-Leibler (KL) divergence that can be applied for distributions on T×{1,−1} assuming a uniform marginal on T.
1. Given (1) a coefficient vector w0, (2) a sample on the feature domain S={xi|1≦i≦n} for training and (3) another sample T={xj|1≦j≦m} for testing.
2. Compute true response probabilities for xi∈S:
pi=(1+exp(−xiw0))^{−1}.
Generate response labels yi by a Bernoulli generator with Pr(yi=1)=pi and Pr(yi=−1)=1−pi. Join xi and yi to form the training data D={(xi, yi)}.
3. Compute true response probabilities R for xj∈T: R=(1+exp(−Tw0))^{−1}. Compute predictions based on D for T by LGP: L=gD(T) and by MLE: M=ĥD(T).
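4. Compute mean square errors
$$\mathrm{mse}_L = \frac{1}{m}\sum_{j=1}^{m}(L_j-R_j)^2, \qquad \mathrm{mse}_M = \frac{1}{m}\sum_{j=1}^{m}(M_j-R_j)^2.$$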
5. Compute KL(PrR∥PrL) and KL(PrR∥PrM) where PrR, PrM, PrL are distributions on T×{1,−1} computed from conditional Pr(Y=1|w0, xj) and uniform marginal Pr(xj)=1/m. For example
PrR((xj, 1))=Rj/m; PrR((xj, −1))=(1−Rj)/m.
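The five steps can be sketched in a few lines. The sketch below covers the MLE side only; predictions from an LGP implementation would be passed to the same two accuracy functions in place of M. The coefficient vector, sample sizes, random seed and helper names are hypothetical, and labels are coded 1/0 rather than 1/−1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_ml(X, y, iters=25):
    """ML logistic fit by plain Newton iterations, labels y in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ w)
        H = -(X.T * (p * (1.0 - p))) @ X
        w = w - np.linalg.solve(H, X.T @ (y - p))
    return w

def mse(pred, truth):
    """Average squared distance between predictions and true probabilities."""
    return float(np.mean((pred - truth) ** 2))

def kl_uniform_marginal(truth, pred):
    """KL(Pr_truth || Pr_pred) on T x {1, -1} with uniform marginal 1/m on T."""
    terms = truth * np.log(truth / pred) + (1.0 - truth) * np.log((1.0 - truth) / (1.0 - pred))
    return float(np.sum(terms) / len(truth))

rng = np.random.default_rng(1)
d, n, m = 2, 200, 100
w0 = np.array([0.3, 1.0, -0.8])                              # step 1: known coefficients
S = np.column_stack([np.ones(n), rng.normal(size=(n, d))])   # training features
T = np.column_stack([np.ones(m), rng.normal(size=(m, d))])   # testing features

p_train = sigmoid(S @ w0)                                    # step 2: true probabilities
labels = (rng.random(n) < p_train).astype(float)             # Bernoulli labels, 1 or 0

R = sigmoid(T @ w0)                                          # step 3: truth on T
M = sigmoid(T @ fit_ml(S, labels))                           # MLE predictions on T

print("mse(MLE):    ", mse(M, R))                            # step 4
print("KL(R || MLE):", kl_uniform_marginal(R, M))            # step 5
```

Averaging these quantities over many coefficient vectors and training samples gives the kind of figures described for the experiments.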
For a configuration of dimensionality d of X (excluding the intercept) and training sample size n, one generates v coefficient vectors and for each of those generate s training samples and a testing sample of size m. Each mse value reported is the average of v×s×m predictions. The goal is to isolate the effect of prediction methods from any possible effect of coefficients, testing and training values by averaging the latter out.
In a first experiment, the dimensionality (d) varied from 1 to 10 and the ratio (r) of training data size to dimension varied from 20 to 100 with step 10. The sample size is n=r×d, with v=10, s=100 and m=100. Each number is the average of 10^5 squared errors.
In a second experiment, with fixed d=4, the training size was varied from 80 to 400 with step 40, with v=10, s=100 and m=100. But instead of mse, KL distances were compared.
Alternative Formulation of Likelihood Gamble Pricing for Binary Classification
An LGP approach according to an embodiment of the invention will now be described from a statistical perspective. The results presented above will now be briefly reviewed using a statistical notation rather than a decision theoretic notation. Given training data D={(yi, xi, wi)}, i=1, . . . , N with N examples, where xi are the input vectors, yi∈{0,1} is the label, and wi∈R is the weight of example i for training the classifier (omitting wi implies wi=1), an objective is, given a new example x0, to predict the probability y0f∈[0,1] that its label equals 1:
y0f=pf(y0=1|x0).
For convenience, the label y∈{0,1} and the probability p(y=1|x)∈[0,1] will be used interchangeably, as the latter may be considered the continuous version of the former.
In an LGP approach according to an embodiment of the invention, as outlined above, this prediction task is solved by determining a fair price of the gamble, which corresponds to the probability y0f in a statistical perspective.
In the LGP approach, one first chooses a model class to be used, with model parameter-vector β. Then, LGP directly yields the estimate y0f for a given example x0. It does not yield an estimate of the parameter-vector β. The estimate y0f follows directly from the ‘fair’ value of the likelihood ratio Λf. The link between y0f and Λf depends on the model class used. For example, for logistic regression, logit(y0f)=log Λf with the logit function as defined above. For simplicity, it can be assumed that the so-called ambiguity adverse degree is zero.
In an LGP approach, the ‘fair’ value of the likelihood-ratio is calculated as follows:
where the {tilde over (L)}'s denote normalized likelihoods, i.e., max {tilde over (L)}=1. Note that likelihoods that differ by a constant normalization factor are equivalent concerning the information about the data. The (technical) reason for this normalization in the LGP approach is that it ensures a one-to-one mapping between the space of likelihood ratios Λ and the pair of normalized likelihoods ({tilde over (L)}*1,{tilde over (L)}*0). In EQ. (44), the normalized likelihoods in light of example x0 are given by:
{tilde over (L)}(β|y0=1,x0)=min{1,Λ(β|x0)},
{tilde over (L)}(β|y0=0,x0)=min{1,1/Λ(β|x0)}, (45)
which is the inverse of the mapping Λ(β|x0)={tilde over (L)}(β|y0=1,x0)/{tilde over (L)}(β|y0=0,x0).
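The one-to-one correspondence between a likelihood ratio and the pair of normalized likelihoods in EQ. (45) can be checked with a few values. This is a small illustrative sketch; the function names are hypothetical.

```python
def normalized_likelihoods(ratio):
    """EQ. (45): map a likelihood ratio (> 0) to the pair (L1, L0) with max(L1, L0) = 1."""
    return min(1.0, ratio), min(1.0, 1.0 / ratio)

def likelihood_ratio(l1, l0):
    """Inverse mapping: the ratio of the two normalized likelihoods."""
    return l1 / l0

for ratio in (0.25, 1.0, 4.0):
    l1, l0 = normalized_likelihoods(ratio)
    print(ratio, (l1, l0), likelihood_ratio(l1, l0))
```

In each case one member of the pair equals 1, which is the normalization condition, and dividing the pair recovers the original ratio.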
In an LGP approach according to an embodiment of the invention, it is desired to solve the joint optimization task in EQ. (44) so as to predict y0f for the example x0 in light of the training data D.
Simplifying the Optimization Task
Solving the optimization tasks of EQ. (44) is computationally challenging as the min-function is not differentiable at zero. Next are derived several properties of the solution, which leads to an equivalent pair of equations that is computationally easier to solve.
For convenience, the log-space is used with the normalized log-likelihood {tilde over (l)}=log {tilde over (L)} and the log-likelihood ratio λ=log Λ. The optimization task then becomes, combining EQS. (44) and (45):
The following statements are assumed:
1. The log likelihood {tilde over (l)}(β|D) is a strictly concave function of β.
2. The log-likelihood {tilde over (l)} is a function of the scalar xβ, and the log-likelihood-ratio λ(β|x) is an affine function of the scalar xβ.
Note that these assumptions hold for several model classes, including generalized linear models.
The optimization task in EQ. (47) is equivalent to the following one, which does not involve the (non-differentiable) min-function and hence is more amenable to numerical optimization methods.
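Proposition: Under the assumptions (1) and (2), the optimization task in EQ. (47) is equivalent to the pair
$$\hat{\beta} = \arg\max_{\beta}\; l(\beta \mid D), \quad (48)$$
$$\beta^* = \arg\max_{\beta}\;\bigl\{\, l(\beta \mid D) - \mathrm{sign}\bigl(\lambda(\hat{\beta} \mid x_0)\bigr)\,\lambda(\beta \mid x_0) \bigr\}. \quad (49)$$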
The mapping between β*0,β*1 and {circumflex over (β)},β* is as follows:
if sign(λ({circumflex over (β)}|x0))>0 then β*1={circumflex over (β)} and β*0=β*
else β*1=β* and β*0={circumflex over (β)}.
Note that sign(λ({circumflex over (β)}|x0)) is a constant for fixed D in EQ. (49). This is because {circumflex over (β)} is determined by EQ. (48), and thus is a constant when optimizing EQ. (49). Moreover, note that one has switched from normalized log-likelihoods to the standard (unnormalized) log-likelihoods: in the normalization condition {tilde over (l)}=l−{circumflex over (l)}, the maximum likelihood {circumflex over (l)}=l({circumflex over (β)}|D) serves as the normalization constant; as {circumflex over (l)} is a constant when optimizing over β, it can be ignored.
Proof: As λ(β|x0) is affine (Assumption 2), the function min{min{0, λ(β|x0)},min{0,−λ(β|x0)}}=−|λ(β|x0)| is concave in β. As the log-likelihood is strictly concave (Assumption 1), the optimization task
$$\beta^* = \arg\max_{\beta}\;\bigl\{\, \tilde{l}(\beta \mid D) - \lvert \lambda(\beta \mid x_0) \rvert \,\bigr\} \quad (50)$$
is strictly concave. Hence it has exactly one solution β*.
Next one proves that sign(λ(β*1|x0))=sign(λ(β*0|x0)) by contradiction: if sign(λ(β*1|x0))≠sign(λ(β*0|x0)) in EQS. (47), then EQ. (50) would have two different solutions, contradicting its uniqueness. This implies that min{0,λ(β*1|x0)}=0 or min{0,−λ(β*0|x0)}=0, and hence the solution to one of the optimization tasks in EQ. (47) is the maximum likelihood estimate {circumflex over (β)}=arg maxβ{tilde over (l)}(β|D). Finally, this shows that sign(λ({circumflex over (β)}|x0))=sign(λ(β*1|x0))=sign(λ(β*0|x0)) so that, in EQ. (50), |λ(β*|x0)|=sign(λ({circumflex over (β)}|x0))λ(β*|x0), which completes this sketch of the proof.
Combining the Proposition with EQ. (43), one can now obtain for the ‘fair’ value of the log likelihood ratio λf=log Λf:
λf=λ(β*|x0)+sign(λ({circumflex over (β)}|x0)){l({circumflex over (β)}|D)−l(β*|D)}. (51)
This shows that, compared to the maximum likelihood solution λ({circumflex over (β)}|x0), the LGP approach achieves regularization of the predicted ‘fair’ log-likelihood-ratio and hence of the predicted probability y0f in a novel way: the degree of regularization depends not only on the training data D but also on the example x0 to be classified. Another interesting property is that regularization in the LGP approach does not have a free parameter that needs to be tuned. Finally, with Assumption 2, the optimization tasks (EQS. 48 and 49) and hence the predicted λf and y0f are invariant under affine transformations of the input space x. While scale-invariance of the regularization mechanism is a useful property, note that many popular approaches do not possess this property, e.g., the hinge loss used in SVMs.
Approximation by Likelihood
This section derives a simple leading-order approximation to the second equation derived in the previous section, EQ. (49), such that it becomes a standard maximum likelihood task as well. This approximation makes it possible to predict the fair price or probability of the label in an LGP approach according to an embodiment of the invention by solving two standard maximum likelihood tasks. Moreover, it turns out that this approximation also yields interesting insights into the underlying regularization mechanism.
In detail, assume that the second term in EQ. (49), which is based on the single example x0 to be classified, can be considered as a small disturbance to the first term, which is the log likelihood of the entire training data D, containing N>>1 examples. In other words, assume that β* is close to {circumflex over (β)}, which will be specified more rigorously in EQ. (54). One approximates the second term in EQ. (49) by the log likelihood l(β|(y0,x0,w0)) in light of the new example x0 to be classified. This results in the constraint
−sign(λ({circumflex over (β)}|x0))λ(β*|x0)=l(β*|(y0,x0,w0)) (52)
where the label y0 and the weight w0 of the example x0 are unknown for the moment. Determining their approximate values based on this constraint is the goal of the remainder of this section. One expands EQ. (52) about {circumflex over (β)} to obtain:
where c1({circumflex over (β)}) and c2({circumflex over (β)}) are the zeroth order terms of the Taylor expansion about {circumflex over (β)}. These terms are constant when optimizing over β, and thus can be ignored in the optimization task.
Given that l({circumflex over (β)}|(y0,x0,w0))=w0·l({circumflex over (β)}|(y0,x0,1)), EQ. (54) can now be solved for the weight w0. Retaining only the relevant terms for the optimization task, one obtains
This explicit equation enables one to calculate the weight w0 to be used when optimizing the approximation (see EQ. (56), below). Note that w0 depends on the (unknown) label y0 of the example x0 to be classified. This is expected, as both w0 and y0 were introduced into the approximation (see right-hand side of EQ. (52)), while the original task (see left-hand side of EQ. (52)) is independent of both variables. EQ. (52) implies that the weight w0 compensates for the unknown label y0 such that the final result (i.e. β* in EQ. (56)) is independent of y0 in this approximation. Note that the value of y0 should be chosen before solving the optimization task of EQ. (56). Hence there should be a mechanism, namely the weight w0, to compensate for a possibly suboptimal choice for y0.
Even though this shows that the choice of y0 does not have much impact on the result, remember that the weight w0 does not compensate for changes in y0 in second or higher orders of the Taylor expansion of l({circumflex over (β)}|(y0,x0,w0)). Even though these approximation errors are typically negligible, they may become noticeable when a ‘bad’ choice was made for y0, which may also entail numerical instabilities. To avoid such ‘bad’ choices for y0, one can either choose value y0 as an appropriate function of example x0, or simply stick to a default value of y0 independent of the example x0. It was found by experiment that the latter approach is sufficient to avoid numerical problems when one chooses y0=0.5 for any x0, which is neutral regarding the labels 0 and 1, and then calculates the weight w0 according to EQ. (55).
Having derived the leading-order approximation to the second term in EQ. (49), one can now obtain the following approximation, inserting EQ. (52) into EQ. (49):
This approximation is a standard maximum likelihood task, and hence can be solved with any standard software package. Interestingly, the example x0 to be classified, with label y0 and weight w0 as derived above, serves as an additional data point besides the training data D. Apart from that, the maximum likelihood prediction for y0f is obtained from EQ. (56) if one sets w0=0.
Thus, given a model class, training data D and an example x0 whose label-probability y0f is to be predicted, the derivation above can be summarized as illustrated in the flowchart of
1. At step 71, determine {circumflex over (β)} by solving standard maximum likelihood task of EQ. (48),
2. At step 72, for an example x0, calculate weight w0 from EQ. (55):
3. At step 73, determine β* by solving the maximum likelihood in EQ. (56):
4. Now that {circumflex over (β)} and β* are determined, the ‘fair’ log likelihood ratio λf can be calculated from EQ. (51) at step 74: λf=λ(β*|x0)+sign(λ({circumflex over (β)}|x0)){l({circumflex over (β)}|D)−l(β*|D)}.
5. Finally, at step 75, λf can be mapped to the ‘fair’ price y0f. This mapping depends on the model class used (see next section for examples).
Note that step 1 needs to be computed only once, while steps 2 through 5 are looped from step 76 for every new example x0 to be classified. Note that {circumflex over (β)} can serve as an initialization for optimizing β*.
If β* is not close to {circumflex over (β)}, then the accuracy of the linear approximation can be improved by iterating steps 2 and 3 until convergence. In this case, use in step 2 the value β* from the previous iteration in place of {circumflex over (β)}.
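The five steps can be sketched for the logistic regression model as follows. This is a sketch under stated assumptions, not the reference procedure: the weight in step 72 is taken to be w0=sign(x0β̂)/(σ(x0β̂)−y0), which is what first-order matching of the gradients on both sides of EQ. (52) gives for the logistic log likelihood, and which may differ in form from EQ. (55); the data, helper names and plain Newton solver are likewise hypothetical. The sketch is undefined when x0β̂ is exactly zero.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_lik(beta, X, y, w):
    """Weighted logistic log likelihood; labels y may be fractional in [0, 1]."""
    z = X @ beta
    return float(np.sum(w * (y * z - np.logaddexp(0.0, z))))

def fit_ml(X, y, w, beta0, iters=25):
    """Weighted ML fit by plain Newton iterations, started at beta0."""
    beta = beta0.copy()
    for _ in range(iters):
        p = sigmoid(X @ beta)
        H = -(X.T * (w * p * (1.0 - p))) @ X
        beta = beta - np.linalg.solve(H, X.T @ (w * (y - p)))
    return beta

def lgp_price(X, y, x0, beta_hat, y0=0.5):
    ones = np.ones(len(y))
    s = np.sign(x0 @ beta_hat)
    # Step 72: weight of the new example (ASSUMED first-order form, see lead-in).
    w0 = s / (sigmoid(x0 @ beta_hat) - y0)
    # Step 73: ML fit on D plus the weighted example (y0, x0, w0).
    X_aug = np.vstack([X, x0])
    y_aug = np.append(y, y0)
    w_aug = np.append(ones, w0)
    beta_star = fit_ml(X_aug, y_aug, w_aug, beta_hat)
    # Step 74: fair log likelihood ratio, EQ. (51) with lambda(beta|x) = x beta.
    gap = log_lik(beta_hat, X, y, ones) - log_lik(beta_star, X, y, ones)
    lam_f = x0 @ beta_star + s * gap
    # Step 75: map the log likelihood ratio to a price with the logistic function.
    return sigmoid(lam_f)

# Hypothetical synthetic data (intercept plus two features).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(300), rng.normal(size=(300, 2))])
y = (rng.random(300) < sigmoid(X @ np.array([0.5, 1.0, -1.0]))).astype(float)
x0 = np.array([1.0, 1.0, -0.5])

beta_hat = fit_ml(X, y, np.ones(300), np.zeros(3))   # step 71, computed once
price = lgp_price(X, y, x0, beta_hat)                # steps 72 to 75, per example
print(f"ML price {sigmoid(x0 @ beta_hat):.4f}, approximate fair price {price:.4f}")
```

Starting the second fit at beta_hat reflects the remark that the ML estimate can serve as an initialization for β*.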
There are variants of this approach. One variant according to an embodiment of the invention is obtained by choosing an approximation for the weight w0 that is different from the one in step 2. The simplest possible one is w0=2 for any example x0 in step 2 for the logistic regression model. Equivalently to choosing w0=2, one can also insert 2 data points, each with weight 1, at step 4. For example,
Alternatively, instead of inserting the two data points into the ‘augmented’ data, one can insert one data point into the one data set in step 1, and the other data point in to the other data set in step 3. This replaces the steps 71, 72, 73 and 74 of
Models
Various models can be used with an LGP approach according to an embodiment of the invention, including generalized linear models. Herein below are presented equations for two exemplary, non-limiting models, the logistic regression model and the Gaussian classifier model, which follow immediately from the general equations derived above.
Logistic Regression Model
For the logistic regression model, the probability y is linked to the log-likelihood ratio λ via the logistic function, which would be used in step 5 above:
$$y = \frac{1}{1+\exp\bigl(-\lambda(\beta \mid x)\bigr)},$$
where λ(β|x)=xβ is linear (cf. Assumption 2). In the light of data D, the log-likelihood of β, to be used in step 71, reads
Standard gradient-based methods can be used to determine the maximum likelihood solutions {circumflex over (β)} and β* in steps 1 and 3 above.
To determine {circumflex over (β)} and β*; the optimization task summarized above can be solved. The equation for β* at step 73 would use the same l(β|D) as that used for {circumflex over (β)} in step 71,
The interesting part is to calculate the weight w0 in EQ. (55), used in step 72 above, for the logistic regression model:
which can be easily evaluated. Interestingly, for the choice y0=0.5, one has
w0→2 as |x0{circumflex over (β)}|→∞,
i.e., in both extreme cases, where the predicted price of x0 is close to 0 or close to 1, the new example x0 is assigned the weight 2, interestingly not 1. Hence, this example is treated like two examples with label y0=0.5, or equivalently like one example with label 1 and a second one with label 0. In between these two extremes, the weight w0 increases continuously as |x0{circumflex over (β)}| decreases, and diverges at the origin:
w0→∞ as x0{circumflex over (β)}→0.
Thus, according to an embodiment of the invention, this behavior of the weight suggests an additional approximation: assign w0=2 for all values of x0{circumflex over (β)}, as suggested by the asymptotic limit above. This results in a particularly simple approach: β* is the maximum likelihood solution in light of the data D∪{(1,x0,1),(0,x0,1)}. As the weight w0 deviates from 2 for examples close to the decision boundary, one expects the approximation error of this additional simplification to be concentrated around y0f=0.5.
This is indeed observed experimentally.
Note that setting w0=2 is exemplary and non-limiting, and other approximations for w0 are within the scope of other embodiments of the invention.
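The behavior of the weight described above can be tabulated. The sketch below assumes the form w0=sign(z)/(σ(z)−y0) with z=x0β̂, obtained by first-order gradient matching in EQ. (52) for the logistic log likelihood; this form reproduces the stated limits (w0 approaching 2 for large |z|, divergence at z=0) but is an assumption, not a quotation of EQ. (55). The function name is hypothetical.

```python
import numpy as np

def weight_w0(z_hat, y0=0.5):
    """ASSUMED weight: sign(z_hat) / (sigmoid(z_hat) - y0), with z_hat = x0 @ beta_hat."""
    return np.sign(z_hat) / (1.0 / (1.0 + np.exp(-z_hat)) - y0)

for z_hat in (-20.0, -2.0, -0.01, 0.01, 2.0, 20.0):
    w0 = weight_w0(z_hat)
    print(f"x0*beta_hat = {z_hat:6.2f}   w0 = {w0:10.3f}   w0 - 2 = {w0 - 2.0:10.3f}")
```

The last column is the error made by the constant choice w0=2; it is negligible far from the decision boundary and grows as x0β̂ approaches zero, in line with the expectation that the approximation error concentrates around a fair price of 0.5.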
The fair log likelihood ratio for the logistic regression model, used in step 74 of
λf=x0β*+sign(x0{circumflex over (β)})└l({circumflex over (β)}|D)−l(β*|D)┘. (63)
Finally, the fair price for the logistic regression model is
$$y_0^f = \frac{1}{1+\exp(-\lambda^f)}.$$
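For comparison with the ML prediction, the fair price of the logistic model can also be computed without the likelihood approximation. The following sketch assumes that β* maximizes l(β|D)−sign(x0β̂)·x0β, which is how EQ. (35) reads when x0β̂>0, and then applies EQ. (63). Because the sign is fixed by β̂, this objective is smooth, so Newton steps with Armijo backtracking are applied to it directly here rather than to the smoothed tasks; the data and function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_lik(beta, X, y):
    """Logistic log likelihood for labels y in {0, 1}."""
    z = X @ beta
    return float(np.sum(y * z - np.logaddexp(0.0, z)))

def newton_armijo(X, y, shift, beta, tol=1e-10, max_iter=100):
    """Maximize log_lik(beta) - shift @ beta by Newton steps with Armijo backtracking."""
    def objective(b):
        return log_lik(b, X, y) - shift @ b
    for _ in range(max_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (y - p) - shift
        H = -(X.T * (p * (1.0 - p))) @ X
        step = -np.linalg.solve(H, grad)   # Newton ascent direction
        slope = grad @ step                # non-negative, H is negative definite
        if slope < tol:                    # Newton decrement small: converged
            break
        t, f0 = 1.0, objective(beta)
        # Halve the step until the Armijo sufficient-increase condition holds.
        while objective(beta + t * step) < f0 + 1e-4 * t * slope and t > 1e-8:
            t *= 0.5
        beta = beta + t * step
    return beta

def lgp_price(X, y, x0):
    d = X.shape[1]
    beta_hat = newton_armijo(X, y, np.zeros(d), np.zeros(d))   # ML estimate
    s = np.sign(x0 @ beta_hat)
    beta_star = newton_armijo(X, y, s * x0, beta_hat)          # second task
    lam_f = x0 @ beta_star + s * (log_lik(beta_hat, X, y) - log_lik(beta_star, X, y))
    return sigmoid(lam_f), sigmoid(x0 @ beta_hat)

# Hypothetical synthetic data (intercept plus two features).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(300), rng.normal(size=(300, 2))])
y = (rng.random(300) < sigmoid(X @ np.array([0.5, 1.0, -1.0]))).astype(float)
x0 = np.array([1.0, 1.0, -0.5])

lgp, ml = lgp_price(X, y, x0)
print(f"ML price {ml:.4f}, fair price {lgp:.4f}")
```

On data of this kind the fair price lies between 0.5 and the ML price, which is the regularizing effect described for LGP.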
Gaussian Classifier Model
Assume there are two clusters with a Gaussian distribution for either class, N(0,σ2) and N(1,σ2), the log-likelihood of the model in light of the data D, used in steps 71 and 73 for calculating {circumflex over (β)} and β*, respectively, is
It can be easily seen that σ cancels out for the prediction of the fair price, and hence may be omitted.
The log likelihood ratio concerning the new example, used in step 74, is
and the predicted label is given by y(x,β)=xβ=1+λ(β|x)σ2/2.
The value of the weight w0 of the new example, used in step 72, reads (see EQ. (55)):
Note that this weight is negative, diverges to minus infinity at x0{circumflex over (β)}=0.5, assuming y0=0.5 as before, and converges to zero in the limit |x0{circumflex over (β)}|→∞. The latter fact does not mean, however, that the example's effect on the regularized prediction becomes negligible due to the vanishing weight: it is offset by the contribution (x0 {circumflex over (β)}−y0)2 to the likelihood.
Finally, with the solutions of the two optimization tasks, the fair prediction of the label is (see EQS. (61) and (51)):
Experimental results for the Gaussian model show the same general properties as the results for the logistic regression model.
Comparison to Bayesian Statistics
If Bayesian hypothesis testing is modified in two ways, as suggested by an LGP approach according to an embodiment of the invention, then similar results concerning the prediction of the probability y0f of the unlabeled example x0 can be achieved.
In Bayesian statistics, hypothesis testing is typically done by means of the Bayes factor, which quantifies the evidence in favor of a hypothesis H1 vs. a competing hypothesis H0. The training data is a set of labeled data points D={(yi,xi)|1≦i≦n} where yiε{0, 1} are labels for data points xi which are vectors in a multidimensional real space called feature values. The Bayes factor is the ratio of the (marginal) likelihoods of the two hypotheses in light of the data D. If one wishes to incorporate the prior ratio of the two hypotheses, one arrives at the posterior ratio.
Prediction of the binary label y0 of a new example x0 may be considered as distinguishing between two alternative hypotheses, namely y0=1 (i.e., H1) and y0=0 (i.e., H0). A first difference to standard hypothesis testing is that these hypotheses are not concerned with a property of the data D as a whole (e.g., a summary statistics of D, like its mean), but with the label y0 of the new example x0 that is not part of the training data D. So one can observe one of the two data sets:
D+=D∪(1,x0),
D−=D∪(0,x0)
where D+ is the augmented data set that consists of the actually observed data D together with the hypothetical label y0 =1 for x0, and D− is the corresponding set for y0 =0.
For each of such augmented data sets one can compute the maximal log likelihood. Denote
The probability of label y0 =1 predicted for the testing feature vector x0 is
PrG(y0 =1|x0 )=(1+exp(l(w*+|D+)−l(w*−|D−)))−1.
The prediction is computed from the difference in log likelihoods for two augmented data sets that are obtained by joining actual data with one of two possibilities for the testing data.
A procedure according to an embodiment of the invention is as follows. First, for each of m runs: (1) randomly pick a model identified by a parameter vector wtrue of d dimensions; (2) randomly generate training data D of n d-dimensional feature vectors; (3) generate labels for the training feature vectors from the model parameter vector wtrue; (4) compute an MLE estimate ŵ of the parameter vector from the training data D; (5) generate k feature vectors for testing; (6) for each of the k testing cases xi compute: (i) the true probability of the label according to the true parameter vector wtrue: Prtrue; (ii) the probability of the label predicted by MLE: Prmle; (iii) the probability of the label predicted by a method according to an embodiment of the invention: PrG; (iv) the squared errors (Prmle−Prtrue)^2 and (PrG−Prtrue)^2. Then, (7) after the m runs, the mean square error (MSE) can be computed.
Experiments with a method according to an embodiment of the invention as described above were run for dimensions d from 5 to 50 in steps of 5. For each dimension, the number of runs is m=20. For each run, the number of training points is 20 times the dimension: n=20×d. The number of testing points is k=1000.
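The simulation procedure above can be sketched as follows. The text does not state how wtrue or the feature vectors are drawn, so standard normal draws are assumed here, and labels are sampled from the logistic model. The function names and the Newton solver are likewise assumptions, and the sign convention follows the likelihood-ratio reading in which PrG grows with l(D+)−l(D−).

```python
import numpy as np

def sigmoid(z):
    return 0.5 * (1.0 + np.tanh(0.5 * z))  # stable logistic function

def log_likelihood(w, X, y):
    z = X @ w
    return float(np.sum(y * z - np.logaddexp(0.0, z)))

def max_log_likelihood(X, y, n_iter=100, damping=1e-8, tol=1e-10):
    """Return (w*, l(w*|data)) using damped Newton steps with step halving."""
    d = X.shape[1]
    w = np.zeros(d)
    ll = log_likelihood(w, X, y)
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        hess = (X.T * (p * (1.0 - p))) @ X + damping * np.eye(d)
        step = np.linalg.solve(hess, X.T @ (y - p))
        t = 1.0
        w_new = w + step
        ll_new = log_likelihood(w_new, X, y)
        while ll_new < ll and t > 1e-8:
            t *= 0.5
            w_new = w + t * step
            ll_new = log_likelihood(w_new, X, y)
        if ll_new < ll:
            break
        gain = ll_new - ll
        w, ll = w_new, ll_new
        if gain < tol:
            break
    return w, ll

def simulate(d, m=20, k=1000, n_per_dim=20, seed=0):
    """Return (MSE of the MLE prediction, MSE of the augmented-data prediction)."""
    rng = np.random.default_rng(seed)
    n = n_per_dim * d
    err_mle, err_g = [], []
    for _ in range(m):
        w_true = rng.normal(size=d)                              # step (1), assumed N(0, I)
        X = rng.normal(size=(n, d))                              # step (2), assumed N(0, I)
        y = (rng.random(n) < sigmoid(X @ w_true)).astype(float)  # step (3)
        w_hat, _ = max_log_likelihood(X, y)                      # step (4)
        X_test = rng.normal(size=(k, d))                         # step (5)
        for x0 in X_test:                                        # step (6)
            pr_true = sigmoid(x0 @ w_true)
            pr_mle = sigmoid(x0 @ w_hat)
            X_aug = np.vstack([X, x0])
            _, ll_plus = max_log_likelihood(X_aug, np.append(y, 1.0))
            _, ll_minus = max_log_likelihood(X_aug, np.append(y, 0.0))
            pr_g = sigmoid(ll_plus - ll_minus)
            err_mle.append((pr_mle - pr_true) ** 2)
            err_g.append((pr_g - pr_true) ** 2)
    return float(np.mean(err_mle)), float(np.mean(err_g))        # step (7)
```

A call such as `simulate(d=5)` uses the settings quoted above (m=20, n=20×d, k=1000) and performs two fits per test point, so smaller values of m and k are advisable for a quick check.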
Experiments
Presented herein is a numerical comparison of the following four methods: (1) standard (maximum likelihood) logistic regression (LR), (2) regularized logistic regression (RLR) with standard 2-norm regularization, (3) an LGP approach according to an embodiment of the invention applied to the logistic regression model, and (4) a maximum-a-posteriori (MAP) odds approach suggested in analogy to an LGP approach according to an embodiment of the invention.
Experiments were performed on five publicly available datasets from the UCI (Irvine) Machine Learning Database Repository: (1) the Wisconsin Breast Cancer Prognosis—24 months (WPBC 24) dataset (155;32); (2) the Wisconsin Breast Cancer Prognosis—60 months (WPBC 60) dataset (110;32); (3) the ionosphere dataset (351;34); (4) the BUPA Liver Disorders dataset (345;6); and (5) the galaxy dim dataset (4192;14). The numbers in parentheses are the dimensions of the datasets: (number of points; number of features).
Panels (a)-(e) plot the KL divergence of the four methods, logistic regression 91, regularized logistic regression 92, MAP odds approach 93, and a likelihood gamble pricing approach 94 according to an embodiment of the invention, for the five publicly available UCI datasets listed above. The mean and standard deviation of the KL divergence between the observed and the predicted values are reported.
The various approaches were evaluated with respect to the continuous-valued probability predicted for the labels of (unseen) test-examples, rather than with respect to the binary prediction. Note that this probability is often more meaningful than the binary prediction, e.g., when predicting the probability that a cancer patient survives the next 5 years. The standard measures for evaluating the predicted probability in light of the true probability (or the empirically observed binary label in the data) of the test-examples are the cross-entropy and the Kullback-Leibler (KL) divergence. Note that the same probability was assigned to each test-example, but normalization issues were ignored, so that the values reported for the KL divergence in
For each data set, it was examined how the accuracy of the predicted label probability deteriorates as the size of the training data decreases, since the regularization of the prediction then becomes more important. For this reason, different sizes of the training data were used, ranging from 10% to 90% of the available data. Since the dim dataset is considerably larger than the other data sets, its training set size was chosen to range from 1% to 10%. The remaining data served as the test data. For each training set size, the experiment was repeated 10 times with random partitioning so as to obtain an estimate of the mean and standard deviation of the KL divergence, as shown in the accompanying figures.
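The evaluation protocol of repeated random partitioning can be sketched as follows. For observed binary labels, the KL divergence from the empirical label to the predicted probability equals the cross-entropy, which is what the sketch computes. The per-example averaging, the clipping constant `eps`, the function names, and the `base_rate_predictor` placeholder are assumptions for illustration; any of the four compared methods could be passed in its place.

```python
import numpy as np

def cross_entropy(y, p, eps=1e-12):
    """Mean cross-entropy between binary labels y and predicted probabilities p."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0); eps is an assumed safeguard
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def base_rate_predictor(X_train, y_train, X_test):
    # Hypothetical placeholder: predicts the training label frequency everywhere.
    return np.full(len(X_test), y_train.mean())

def evaluate(fit_predict, X, y, fractions, n_repeats=10, seed=0):
    """Map each training fraction to (mean, std) of the test cross-entropy."""
    rng = np.random.default_rng(seed)
    n = len(y)
    results = {}
    for frac in fractions:
        # Keep at least one training and one test example.
        n_train = min(n - 1, max(1, int(round(frac * n))))
        scores = []
        for _ in range(n_repeats):
            perm = rng.permutation(n)            # random partitioning
            train, test = perm[:n_train], perm[n_train:]
            p = fit_predict(X[train], y[train], X[test])
            scores.append(cross_entropy(y[test], p))
        results[frac] = (float(np.mean(scores)), float(np.std(scores)))
    return results
```

For example, `evaluate(base_rate_predictor, X, y, [0.1, 0.5, 0.9])` returns one (mean, standard deviation) pair per training set size, matching the quantities reported in the experiments.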
An LGP approach according to an embodiment of the invention, as well as the suggested MAP ratio approach, consistently makes predictions that are closer to the true labels of the test-examples and more robust (less variance) with respect to the empirical KL divergence than the LR and RLR approaches. As the training set size grows, the performances of the various approaches increasingly agree with each other, as expected. These experiments suggest that an LGP approach according to an embodiment of the invention accounts for the ambiguity/uncertainty in the training data, and makes accurate and robust predictions.
Application to Discrete or Continuous-Valued Properties
These models can also be used for predicting discrete or continuous-valued properties, such as the probability of 2-year survival of a cancer patient, the probability of recurrence of cancer in a patient after a certain time period, or the risk of suffering from a treatment-related side-effect. When predicting the probability of 2-year survival of a cancer patient, the training data would be a collection of patients for whom the outcome of 2-year survival is known, together with several predictors, such as demographics, imaging information, lab results, etc. The new data point is a new patient for whom the probability of 2-year survival has to be predicted.
A procedure according to an embodiment of the invention was applied to training data from cancer patients. The objective was to predict the probability of 2-year survival for a test set of cancer patients. The accuracy of prediction was assessed in terms of the Kullback-Leibler divergence, which is a standard measure for comparing the predicted probability to the observed outcome.
A method according to an embodiment of the invention as described above is more accurate in predicting the probability of labels than the commonly used MLE routine. The method achieves better results than competing methods, especially when the training set is small (i.e., the number of patients is small), as is typically the case in the medical domain. Since a routine to compute the maximum log likelihood is implemented in most statistical packages, a method according to an embodiment of the invention does not require any additional implementation and is easy to use.
System Implementations
It is to be understood that embodiments of the present invention can be implemented in various forms of hardware, software, firmware, special purpose processes, or a combination thereof. In one embodiment, the present invention can be implemented in software as an application program tangibly embodied on a computer readable program storage device. The application program can be uploaded to, and executed by, a machine comprising any suitable architecture.
The computer system 121 also includes an operating system and micro instruction code. The various processes and functions described herein can either be part of the micro instruction code or part of the application program (or combination thereof) which is executed via the operating system. In addition, various other peripheral devices can be connected to the computer platform such as an additional data storage device and a printing device.
It is to be further understood that, because some of the constituent system components and method steps depicted in the accompanying figures can be implemented in software, the actual connections between the systems components (or the process steps) may differ depending upon the manner in which the present invention is programmed. Given the teachings of the present invention provided herein, one of ordinary skill in the related art will be able to contemplate these and similar implementations or configurations of the present invention.
While the present invention has been described in detail with reference to a preferred embodiment, those skilled in the art will appreciate that various modifications and substitutions can be made thereto without departing from the spirit and scope of the invention as set forth in the appended claims.
This application claims priority from “Likelihood Gamble Pricing Approach to Classification”, U.S. Provisional Application No. 60/941,781 of Fung, et al., filed Jun. 4, 2007, “Classification Method that Uses Hypothetical Labels for Testing Data”, U.S. Provisional Application No. 60/973,320 of Giang, et al., filed Sep. 18, 2007, and “Improved Prediction of Outcome”, U.S. Provisional Application No. 61/029,695 of Giang, et al., filed Feb. 19, 2008, the contents of all of which are herein incorporated by reference in their entirety.
Number | Date | Country |
---|---|---|
20080301077 A1 | Dec 2008 | US |

Number | Date | Country |
---|---|---|
60941781 | Jun 2007 | US |
60973320 | Sep 2007 | US |
61029695 | Feb 2008 | US |