The present invention relates to a method of analysing data and in particular relates to the use of artificial neural networks (ANNs) to analyse data and identify relationships between input data and one or more conditions.
An artificial neural network (ANN), or “neural network”, is a mathematical or computational model comprising an interconnected group of artificial neurons which is capable of processing information so as to model relationships between inputs and outputs or to find patterns in data.
A neural network may therefore be considered as a non-linear statistical data modelling tool and generally is an adaptive system that is capable of changing its structure based on external or internal information that flows through the network in a training phase. The strength, or weights, of the connections in the network may be altered during training in order to produce a desired signal flow.
Various types of neural network can be constructed. For example, a feedforward neural network is one of the simplest types of ANN in which information moves only in one direction and recurrent networks are models with bi-directional data flow. Many other neural network types are available.
One particular variation of a feedforward network is the multilayer perceptron which uses three or more layers of neurons (nodes) with nonlinear activation functions, and is more powerful than a single layer perceptron model in that it can distinguish data that is not linearly separable.
The ability of neural networks to be trained in a learning phase enables the weighting function between the various nodes/neurons of the network to be altered such that the network can be used to process or classify input data. Various different learning models may be used to train a neural network such as “supervised learning” in which a set of example data that relates to one or more outcomes or conditions is used to train the network such that it can, for example, predict an outcome for any given input data. Supervised learning may therefore be considered as the inference of a mapping relationship between input data and one or more outcomes.
Training an artificial neural network may involve the comparison of the network output to a desired output and using the error between the two outputs to adjust the weighting between nodes of the network. In one learning model a cost function C may be defined and the training may comprise altering the node weightings until the function C can no longer be minimised further. In this way a relationship between the input data and an outcome or series of outcomes may be derived. An example of a cost function might be C=E[(f(x)−y)²] where (x, y) is a data pair taken from some distribution D.
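By way of illustration only, such a cost function may be evaluated over a set of data pairs as follows (a minimal sketch in which `f` stands in for the network output; the names are illustrative, not part of the invention):

```python
import numpy as np

def cost(f, xs, ys):
    """Mean squared error C = E[(f(x) - y)^2] over data pairs (x, y)."""
    preds = np.array([f(x) for x in xs])
    return np.mean((preds - np.asarray(ys)) ** 2)

# Toy example: a "network" that simply doubles its input.
f = lambda x: 2.0 * x
xs = np.array([1.0, 2.0, 3.0])
ys = np.array([2.0, 4.0, 7.0])   # only the last pair has an error (of 1.0)
print(cost(f, xs, ys))           # mean of [0, 0, 1]
```

Training then amounts to adjusting the parameters of `f` until this value can no longer be reduced.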
In one application, a neural network might be trained with gene expression data from tissues taken from patients who are healthy and from patients who have cancer. The training of the network in such an example may identify genes or gene sets that are biomarkers for cancer. The trained network may be used to predict the likelihood of a given person developing cancer based on the results of an analysis of a tissue sample.
Another field of technology in which an artificial neural network might be used is meteorology in which, for example, temperature or pressure data at a series of locations over time could be used to determine the likelihood of there being rainfall at a given location at a given time.
A known problem with artificial neural networks is overtraining, which arises in overcomplex or overspecified systems where the capacity of the network significantly exceeds the number of free parameters that are actually needed. Overtraining can lead a neural network to suggest that particular parameters are important when in reality they are not: in addition to a set of genuinely important parameters, further parameters are falsely detected. Such falsely detected parameters are likely to give lower performance when classifying unseen data/cases.
It is an object of the present invention to provide a method of analysing data using a neural network that overcomes or substantially mitigates the above mentioned problem.
According to a first aspect the present invention provides a method of determining a relationship between input data and one or more conditions comprising the steps of: receiving input data categorised into one or more predetermined classes of condition; training an artificial neural network with the input data, the artificial neural network comprising an input layer having one or more input nodes arranged to receive input data; a hidden layer comprising two or more hidden nodes, the nodes of the hidden layer being connected to the one or more nodes of the input layer by connections of adjustable weight; and, an output layer having an output node arranged to output data related to the one or more conditions, the output node being connected to the nodes of the hidden layer by connections of adjustable weight; determining relationships between the input data and the one or more conditions wherein the artificial neural network has a constrained architecture in which (i) the number of hidden nodes within the hidden layer is constrained; and, (ii) the initial weights of the connections between nodes are restricted.
The present invention provides a method of analysis that highlights those parameters in the input data that are particularly useful for predicting either whether a given outcome is likely, or the probability of time to a given event. In other words, compared to prior art systems the method of the present invention effectively increases the difference or “contrast” between the various input parameters so that the parameters most relevant from a predictive point of view are identified.
The present invention provides a method of determining a relationship between input data and one or more conditions using an artificial neural network (ANN). The present invention is also capable of determining a relationship between input data and time to a specified event that is dependent in part upon the input data using an ANN. The ANN used in the invention has a constrained architecture in which the number of nodes within the hidden layer of the ANN is constrained and in which the initial weights of the connections between nodes are restricted.
The method of the present invention therefore proposes an ANN architecture which runs contrary to the general teaching of the prior art. In prior art systems the size of the hidden layer is maximised within the constraints of the processing system being used, whereas in the present invention the architecture is deliberately constrained in order to increase the predictive effectiveness of the network and the contrast between markers of relevance and non-relevance within a highly dimensional system. In comparison to known systems, the present invention provides the advantage that the predictive performance for the markers that are identified is improved and those markers identified by the method according to the present invention are relevant to the underlying process within the system.
Preferably in order to maximise the predictive effectiveness of the present invention the number of hidden nodes is in the range two to five. More preferably the number of hidden nodes is set at two.
Preferably the initial weights of the connections between nodes have a standard deviation in the range 0.01 to 0.5. It is noted that lowering the standard deviation makes the artificial neural network less predictive. Raising the standard deviation reduces the constraints on the network. More preferably, the initial weights of connections between nodes have a standard deviation of 0.1.
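A minimal sketch of such a restricted weight initialisation (zero-mean Gaussian initial weights are assumed; the function name is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(n_in, n_out, sigma=0.1):
    """Draw initial connection weights from N(0, sigma^2).

    sigma = 0.1 is the preferred value stated in the text;
    0.01 to 0.5 is the stated working range.
    """
    return rng.normal(loc=0.0, scale=sigma, size=(n_in, n_out))

w = init_weights(100, 2)   # e.g. 100 input nodes -> 2 hidden nodes
print(w.std())             # close to 0.1 for a sample of this size
```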
Conveniently the input data comprises data pairs (e.g. gene and gene expression data) which are categorised into one or more conditions (e.g. cancerous or healthy). In the example of gene data then the gene may be regarded as a parameter and the expression data as the associated parameter value. Furthermore, input data may be grouped into a plurality of samples, each sample having an identical selection of data pairs (e.g. the gene and gene expression data may detail the condition—healthy/cancerous—of a plurality of individuals).
Training of the neural network may conveniently comprise selecting a particular parameter in each sample (i.e. the same parameter in each sample) and then training the network with the parameter value associated with the selected parameter. The performance of the network may be recorded for the selected parameter and then the process may be repeated for each parameter in the samples in turn.
The determining step of the first aspect of the invention may comprise ranking the recorded performance of each selected parameter against the known condition or time to an event and the best performing parameter may then be selected.
Once the best performing parameter from the plurality of samples has been determined then a further selecting step may comprise pairing that best performing parameter with one of the remaining parameters. The network may then be further trained with the parameter values associated with the pair of selected parameters and the network performance recorded. As before, the best performing parameter may then be paired with each of the remaining parameters in turn.
The selecting, training and recording steps may then be repeated, adding one parameter in turn to the known best performing parameters until no further substantial performance increase is gained.
Conveniently it is noted that the input data may be grouped into a plurality of samples, each sample having an identical selection of data pairs, each data pair being categorised into the one or more conditions and comprising a parameter and associated parameter value, and the training and determining steps of the first aspect of the invention may comprise: selecting a parameter within the input data, training the artificial neural network with corresponding parameter values and recording artificial neural network performance; repeating for each parameter within the input data; determining the best performing parameter in the input data; and, repeating the selecting, repeating and determining, each repetition adding one of the remaining parameters to the best performing combination of parameters, until artificial neural network performance is not improved.
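The selecting, training, recording and repeating steps above can be sketched as a greedy forward selection loop (a simplified outline; `train_and_score` is a hypothetical helper that trains the constrained network on the given parameter subset and returns its recorded performance):

```python
def stepwise_select(parameters, train_and_score, min_gain=1e-3):
    """Greedy forward selection: grow the input set one parameter at a
    time, keeping each addition only while performance still improves."""
    selected, best_score = [], float("-inf")
    while True:
        remaining = [p for p in parameters if p not in selected]
        if not remaining:
            break
        # Score each candidate added to the current best combination.
        scores = {p: train_and_score(selected + [p]) for p in remaining}
        best_p = max(scores, key=scores.get)
        if scores[best_p] - best_score < min_gain:
            break   # no further substantial performance increase
        selected.append(best_p)
        best_score = scores[best_p]
    return selected, best_score

# Hypothetical scorer: only parameters "A" and "B" carry signal.
demo_score = lambda subset: len(set(subset) & {"A", "B"}) - 0.01 * len(subset)
chosen, score = stepwise_select(["A", "B", "C"], demo_score)
print(chosen)   # "C" adds no substantial improvement, so it is not kept
```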
In one application of the method according to an embodiment of the present invention the parameters may represent genes and the parameter values may represent gene expression data. In a further application the parameter may represent proteins and the parameter values may represent activity function.
In other applications of the method according to an embodiment of the present invention the parameter may represent a meteorological parameter, e.g. temperature or rainfall at a given location and the parameter value may represent the associated temperature or rainfall value.
It is however noted that the method according to the present invention may be applied to any complex system where there are a large number of interacting factors occurring in different states over time. The method of the invention shows particular utility in analysis of apparently stochastic systems.
According to a second aspect of the present invention there is provided a method of determining a relationship between input data and one or more conditions comprising: receiving input data categorised into one or more predetermined classes of condition; determining relationships between the input data and the one or more conditions using a neural network, the artificial neural network comprising an input layer having one or more input nodes arranged to receive input data; a hidden layer comprising two or more hidden nodes, the nodes of the hidden layer being connected to the one or more nodes of the input layer by connections of adjustable weight; and, an output layer having an output node arranged to output data related to the one or more conditions, the output node being connected to the nodes of the hidden layer by connections of adjustable weight wherein the artificial neural network has a constrained architecture in which (i) the number of hidden nodes within the hidden layer is constrained; and, (ii) the initial weights of the connections between nodes are restricted.
According to a third aspect of the present invention there is provided an artificial neural network for determining a relationship between input data and one or more conditions comprising: an input layer having one or more input nodes arranged to receive input data categorised into one or more predetermined classes of condition; a hidden layer comprising two or more hidden nodes, the nodes of the hidden layer being connected to the one or more nodes of the input layer by connections of adjustable weight; and, an output layer having an output node arranged to output data related to the one or more conditions, the output node being connected to the nodes of the hidden layer by connections of adjustable weight; wherein the artificial neural network has a constrained architecture in which (i) the number of hidden nodes within the hidden layer is constrained; and, (ii) the initial weights of the connections between nodes are restricted. The output may be optionally either continuous or binary. In embodiments where the output is continuous, the method of the invention is able to predict the probability of time to the occurrence of a predetermined event based upon input data taken at one or more given time points before occurrence of the event.
The invention extends to a computer system for determining a relationship between input data and one or more conditions, or time to an event, comprising an artificial neural network according to the third aspect of the present invention.
It will be appreciated that preferred and/or optional features of the first aspect of the invention may be provided in the second and third aspects of the invention also, either alone or in appropriate combinations.
Accordingly, in one embodiment, the invention provides a computer-implemented method of determining a relationship between input data relating to a specified event and the probability of the time interval to the occurrence of the event in the future. The method includes the steps of receiving input data categorised into one or more predetermined classes; using a microprocessor, training an artificial neural network with the input data, the artificial neural network including an input layer having one or more input nodes arranged to receive input data; a hidden layer including two or more hidden nodes, the nodes of the hidden layer being connected to the one or more nodes of the input layer by connections of adjustable weight; and, an output layer having an output node arranged to continuously output data related to the specified event, the output node being connected to the nodes of the hidden layer by connections of adjustable weight; using a microprocessor, determining a relationship between the input data and the specified event so as to determine a probability value of the time to the occurrence of the event (time to event); wherein the artificial neural network has a constrained architecture in which (i) the number of hidden nodes within the hidden layer is constrained; and (ii) the initial weights of the connections between nodes are restricted.
In another embodiment the invention provides a computer readable medium containing program instructions for implementing an artificial neural network for determining a relationship between input data relating to a specified event and the probability of the time interval to the occurrence of the event in the future, wherein execution of the program instructions by one or more processors of a computer system causes the one or more processors to carry out the steps of: arranging one or more input nodes in an input layer to receive input data categorised into one or more predetermined classes; providing a hidden layer including two or more hidden nodes; connecting the nodes of the hidden layer to the one or more nodes of the input layer by connections of adjustable weight; providing an output layer having an output node arranged to continuously output data related to the event; and connecting the output node to the nodes of the hidden layer by connections of adjustable weight; wherein the artificial neural network has a constrained architecture in which (i) the number of hidden nodes within the hidden layer is constrained; and (ii) the initial weights of the connections between nodes are restricted.
In yet another embodiment, the invention provides a diagnostic system that predicts time to a specified clinical event for a given individual following analysis of biomarker expression levels in a biological sample obtained from said individual. The system includes a biomarker profiler for determining the levels of expression of one or more biomarkers within a sample, thereby generating biomarker expression data; a processor for analysing the biomarker expression data and determining from the data a predicted time to a specified clinical event; and a display that presents the predicted time to a specified clinical event to a user of the diagnostic system.
Other aspects of the invention will become apparent by consideration of the detailed description and accompanying drawings.
In order that the invention may be more readily understood, reference will now be made, by way of example, to the accompanying drawings in which:
Before any embodiments of the invention are explained in detail, it is to be understood that the invention is not limited in its application to the details of construction and the arrangement of components set forth in the following description or illustrated in the following drawings. The invention is capable of other embodiments and of being practiced or of being carried out in various ways. Also, it is to be understood that the phraseology and terminology used herein are for the purpose of description and should not be regarded as limiting. The use of “including,” “comprising,” or “having” and variations thereof herein is meant to encompass the items listed thereafter and equivalents thereof as well as additional items.
One drawback of traditional linear based ANN models is that they often cannot generalise well to problems and therefore may only be applicable to the dataset they are originally applied to. Simulation experiments have shown that stepwise logistic regression has limited power in selecting important variables in small data sets, and therefore risks overfitting (Steyerberg, E. W., Eijkemans, M. J. and Habbema, J. D. (1999) Stepwise selection in small data sets: a simulation study of bias in logistic regression analysis, J Clin Epidemiol, 52, 935-942). Additionally the automatic selection procedure is non-subjective and ignores logical constraints. The applied neural network stepwise approach of the present invention does not share the limitations of the prior art because the models have been shown to be applicable to separate datasets used for validation and are therefore capable of generalisation to new data; as such, overfitting has not been observed when using this approach.
In various embodiments, a neural network is implemented on a computer system 100.
It is noted that the number of hidden layers may be varied.
The various interconnections between the nodes are indicated in
The neural network is arranged such that input data is fed into the input layer 3 and is then multiplied by the interconnection weights as it is passed from the input layer 3 to the hidden layer 5. Within the hidden layer 5, the data is summed and then processed by a nonlinear function (for example a hyperbolic tangent function or a sigmoidal transfer function). As the processed data leaves the hidden layer for the output layer 7 it is again multiplied by interconnection weights, then summed and processed within the output layer to produce the neural network output.
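That forward pass may be sketched as follows (a hyperbolic tangent hidden layer and a sigmoidal output are assumed for illustration; the shapes and names are not part of the invention):

```python
import numpy as np

def forward(x, w_ih, w_ho):
    """Feed input through the network: multiply by input->hidden weights,
    apply a nonlinear function in the hidden layer, then multiply by
    hidden->output weights and squash in the output layer."""
    hidden = np.tanh(x @ w_ih)                        # hidden layer activations
    output = 1.0 / (1.0 + np.exp(-(hidden @ w_ho)))   # sigmoidal output node
    return hidden, output

x = np.array([0.5, -1.0])      # two input nodes
w_ih = np.zeros((2, 2))        # input -> hidden interconnection weights
w_ho = np.zeros((2, 1))        # hidden -> output interconnection weights
_, y = forward(x, w_ih, w_ho)
print(y)                       # with all-zero weights the output is 0.5
```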
One of the most popular training algorithms for multi-layer perceptron and many other neural networks is an algorithm called backpropagation. With backpropagation, the input data is repeatedly presented to the neural network. With each presentation the output of the neural network is compared to the desired output and an error is computed. This error is then fed back (backpropagated) to the neural network and used to adjust the weights such that the error decreases with each iteration and the neural model gets closer and closer to producing the desired output. This process is known as “training”.
When training a neural network the learning rate is a parameter found in many learning algorithms that alters the speed at which the network arrives at the minimum solution. If the rate is too high then the network can oscillate about the solution or diverge from the solution. If the rate is too low then the network may take too long to reach the solution.
A further parameter that may be varied during the training of an artificial neural network is the momentum parameter that is used to prevent the network from converging on a local minimum or saddle point. An overly high momentum parameter can risk overshooting the minimum. A momentum parameter that is too low can result in a network that cannot reliably avoid local minima.
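The roles of the learning rate and the momentum parameter can be seen in a generic gradient-descent weight update (a standard sketch, not necessarily the exact update rule used in any particular embodiment):

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.1, momentum=0.9):
    """One weight update: the learning rate scales the step taken down the
    error gradient; the momentum term carries over a fraction of the
    previous step, helping the network roll past local minima."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

w = np.array([1.0])
v = np.zeros(1)
grad = np.array([2.0])
w, v = momentum_step(w, grad, v)   # step 1: w = 1.0 - 0.1 * 2.0 = 0.8
w, v = momentum_step(w, grad, v)   # step 2: momentum enlarges the step
print(w)
```

If the learning rate is too high the iterates oscillate about or diverge from the minimum; too low and convergence is slow, exactly as described above.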
Having discussed the use and training of artificial neural networks, the application of a neural network in the context of embodiments of the present invention is discussed below. It is noted that while the example discussed below relates to bioinformatics, the invention described herein is applicable to other fields of technology, e.g. meteorological predictions, pollution prediction, environmental prediction etc.
As noted above a known problem with neural networks is the fact that they can be over-trained such that relationships can be derived between the input and output data for virtually all of the input data parameters.
In the artificial neural network in accordance with embodiments of the present invention the network is set up so as to improve the network's ability to identify the most relevant input parameters. To this end, the number of nodes within the hidden layer is restricted, preferably to below five nodes and particularly to two nodes. In addition, the standard deviation of the initial weights of the interconnections between nodes is also constrained. Preferably, the standard deviation, σ, of the initial weights of the interconnections is placed in the range 0.01 to 0.5 with an optimum value of 0.1.
In Step 40, the input and output variables to be used in the method of analysis are identified. In the example of the data set of
In Step 42, an input (i.e. a particular gene, for example gene C) is chosen as the input (input 1) to the ANN shown in
In Step 44, the ANN is trained using random sample cross validation. In other words a subset of the overall dataset is used to train the neural network, a “training subset”. In the context of the dataset of
In Step 46, the performance of the artificial neural network for input 1 is recorded and stored.
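The random sample cross validation of Step 44 might be sketched as follows (a 60/40 split of a ten-sample set is assumed for illustration, matching training on six of ten samples as in the example below; the proportions and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

def random_split(n_samples, train_fraction=0.6):
    """Randomly partition sample indices into a training subset and a
    held-out validation subset."""
    order = rng.permutation(n_samples)
    cut = int(round(train_fraction * n_samples))
    return order[:cut], order[cut:]

train_idx, valid_idx = random_split(10)
print(len(train_idx), len(valid_idx))   # 6 training, 4 validation samples
```

Network performance for the selected input is then recorded on the held-out validation subset rather than on the training subset.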
In Step 48, a further gene is chosen as the sole input to train the neural network and the system cycles round to Step 44 again so that the network is trained from its initial state again using this new data. For example, gene H might be the next input to be chosen and the gene expression data for gene H from samples 1-3 and 8-10 may then be used to train the network again.
Steps 44 and 46 are then repeated (indicated via arrow 50) for each input as sole input to the network (i.e. gene and its associated expression data in the example of
Once each input in the training subset has been used as input the system moves to Step 52 in which the various inputs are ranked according to the error from the true outcome and the best performing input is chosen.
In Step 54 the system moves onto train the network with a pair of inputs, one of which is the best performing input identified in Step 52 and the other is one of the remaining inputs from the training subset. The performance of the network with this pair of inputs is recorded.
The system then repeats this process with each of the remaining inputs from the training subset in turn (indicated via arrow 56), i.e. each of the remaining inputs is paired in turn with the best performing sole input identified in Step 48.
Once each of the remaining inputs has been used, the system identifies, in Step 58, the best performing pair of inputs.
The system then returns to Step 42 (indicated via arrow 60) and repeats the whole process, continually adding inputs until no further improvement in the performance of the artificial neural network is detected (Step 62). At this point, the artificial neural network has identified the inputs which are most closely related to the outcome. In the case of the gene/gene expression data example of
The accompanying drawings show the development of the artificial neural network 20 through the first few cycles of the flow chart described above.
The addition of further input nodes continues until no further improvement in network performance is identified.
The ANN of the invention shows significant technical utility in analysing complex datasets generated from diverse sources. In one example of the invention in use, clinical data from cancer patients is analysed in order to determine diagnostic and prognostic genetic indicators of cancer. In another example of the invention in use, meteorological measurements are analysed in order to provide predictions of future weather patterns. The invention shows further utility in the fields of ocean current measurements, financial data analysis, epidemiology, climate change prediction, analysis of socio-economic data, and vehicle traffic movements, to name just a few areas.
Cancer is the second leading cause of death in the United States. An estimated 10.1 million Americans are living with a previous diagnosis of cancer. In 2002, over one million people were newly diagnosed with cancer in the United States (information from Centres for Disease Control and Prevention, 2004 and 2005, and National Cancer Institute, 2005). According to Cancer Research UK, in 2005 over 150,000 people died in the United Kingdom as a result of cancer. Detecting cancer at an early stage in the development of the disease is a key factor in enabling the disease to be effectively treated and prolonging the life of the affected individual. Cancer screening is an attempt to detect (undiagnosed) cancers in the population, so as to enable early therapeutic intervention. Screens for detecting and/or predicting cancer are advantageously suitable for testing large numbers of subjects; are affordable; safe; non-invasive; and accurate (i.e. exhibiting a low rate of false positives).
At present there are no clinically validated markers for metastatic melanoma. Data has been obtained from mass spectrometry (MS) proteomic profiling of human serum samples from patients with melanoma at various stages of disease. Using the stepwise ANN approaches of the present invention, protein ions have been identified that distinguish stage IV melanoma patients from healthy controls with an accuracy of over 90%. Using the same approach to analyse the proteomic profiles of digested peptides, ions were identified which predicted validation subsets of samples to an accuracy of 100%. The groups of ions identified here distinguish stage IV metastatic melanoma from healthy controls with remarkably high sensitivity and specificity. This is of even greater significance when it is appreciated that conventional S-100 ELISA typically results in a reported 20% ‘false negative’ rate in patients with detectable metastases by routine clinical and radiographic studies.
Potential serum protein melanoma biomarker ions by mass spectrometry using SELDI chips have been reported previously (Mian et al (2005) Serum proteomic fingerprinting discriminates between clinical stages and predicts disease progression in melanoma patients, J Clin Oncol, 23, 5088-5093), where a mass region around 11,700 Da provided a highly statistically significant difference in intensity between stage I and stage IV melanoma samples. In an example of the invention, described in more detail below, a MALDI MS method was used to provide more rapid data analysis at higher resolution. These data were subsequently subjected to stepwise ANN analysis and nine ions were identified that discriminated between melanoma stage IV and healthy control sera. This analysis by ANNs of serum proteins resulted in a median accuracy of 92% (inter-quartile range 89.4-94.8%) in discriminating between sera from stage IV melanoma and control patients. The top ion at m/z 12000 was able to discriminate between classes with a median predictive accuracy of 64% (inter-quartile range 58.7-69.2%). This ion is similar in mass to the biomarker ion of m/z 11700 reported using the SELDI technology, also for stage IV metastatic cancer (Mian, et al., 2005). The difference may be attributed to the fact that that ion was found to be significant when discriminating between stage I and stage IV melanoma patients, whereas here the ion reported at m/z 12000 was identified when classifying between stage IV melanoma and unaffected healthy control individuals.
Further, in the manuscript by Mian and colleagues (Mian, et al., 2005) predictive performance was based primarily on spectra obtained from the Ciphergen SELDI chip platform, which is associated with inherently low-resolution read-outs from low-resolution MS equipment, whereas here protein biomarker detection was carried out using a higher resolution MALDI-MS analyzer, so the m/z value of 11700 may have some variation associated with it. Although both studies used ANNs the approaches applied were different; here novel stepwise analysis approaches were used which allow for the identification of individual mass ions with high predictive performance, whereas the SELDI analysis (Mian, et al., 2005) used larger mass ranges to identify regions of the profile which were important in discriminating between groups. It is therefore important to consider that different data mining techniques may elicit different markers of differing importance.
Bioinformatic sequence analysis of the six predictive peptides identified two peptide ions belonging to Alpha 1-acid glycoprotein (AGP) precursor 1/2 (AAG1/2) which when used together in a predictive model could account for 95% (47/50) of the metastatic melanoma patients. Additionally, another of the peptide ions was identified and confirmed to be associated with complement C3 component. Both proteins have been previously associated with metastatic disease in other types of cancers (Djukanovic, D et al (2000) Comparison of S100 protein and MIA protein as serum marker for malignant melanoma, Anticancer Res, 20, 2203-2207). This further confirms the value of the approach taken in the present invention. Other studies have also shown that increased levels of AGP are found in cancer (for example see Duche, J. C. et al (2000) Expression of the genetic variants of human alpha-1-acid glycoprotein in cancer, Clin Biochem, 33, 197-202). AGP, a highly heterogeneous glycoprotein, is an acute-phase protein produced mainly in the liver. However, its physiological significance is not yet fully understood, and as such AGP would not represent an expected melanoma biomarker.
To further assess whether the method of the invention could also be carried over to the analysis of gene expression data, as opposed to proteomic data, two publicly available datasets were analysed in accordance with the invention. Both of these datasets are associated with breast cancer. The first was a dataset published by van't Veer and co-workers (van't Veer et al (2002) Gene expression profiling predicts clinical outcome of breast cancer, Nature, 415, 530-536) and the aims here were to identify subsets of genes which could accurately discriminate between patients who developed distant metastases within five years and those who did not. The initial analysis by van't Veer and colleagues (van't Veer, et al., 2002) used a form of unsupervised clustering and supervised classification whereby genes were selected by the correlation coefficient of expression with disease outcome. This approach led to the identification of a 70 gene classifier which correctly predicted disease outcome to an accuracy of 83%. The ANN stepwise approach of the present invention resulted in the identification of twenty genes which accurately predicted patient prognosis to a median accuracy of 100% for blind data over a number of random sample cross validation resampling events. Some of the genes which constitute this expression signature have previously been associated with cancer outcome. For example the first gene identified by our model was Carbonic Anhydrase IX, which was capable of predicting 70% of the samples correctly by itself. Carbonic Anhydrase IX (CA IX) has been suggested to be functionally involved in pathogenesis due to its increased expression and abnormal localization in colorectal tumors (Saarnio, J., et al (1998) Immunohistochemical study of colorectal tumors for expression of a novel transmembrane carbonic anhydrase, MN/CA IX, with potential value as a marker of cell proliferation, Am J Pathol, 153, 279-285).
CA IX has also been suggested for use as a diagnostic biomarker due to its expression being related to cervical cell carcinomas (Liao, S. Y., et al. (1994) Identification of the MN antigen as a diagnostic biomarker of cervical intraepithelial squamous and glandular neoplasia and cervical carcinomas, Am J Pathol, 145, 598-609). Surprisingly, seven of the twenty genes identified as important by the ANN method of the invention represent expressed sequence tags (ESTs), and the associated genes are therefore of unknown function. However, given their new-found predictive capability with regards to survival, further clinical analysis is now justified.
A further dataset was published by West et al. (West, M., et al. (2001) Predicting the clinical status of human breast cancer by using gene expression profiles, Proc Natl Acad Sci USA, 98, 11462-11467), and the ANN stepwise approach of the invention was applied to this dataset in order to identify groups of genes which would accurately predict the estrogen receptor (ER) status and the lymph node (LN) status of the patient. The initial analysis by West and colleagues used regression models in order to calculate classification probabilities for the various outcomes. In their study, when analyzing ER status, a 100 gene classifier was identified which predicted 34 of the 38 samples used in the training set accurately and with confidence, and which performed well during cross-validation. Using the same approach, the authors identified a 100 gene classifier which could classify a training set of samples according to lymph node status. However, this approach was less successful in predicting LN status during cross-validation, where all of the LN+ cases had estimated probabilities at approximately 0.5, indicating that these predictions contained a great deal of uncertainty, possibly due to high levels of variation in the expression profiles of these samples. Using the stepwise methodology of the present invention, two gene expression signatures were identified. The first discriminated 100% of the cases correctly with regards to whether they were positive or negative for ER, and the second predicted whether the tumour had spread to the axillary lymph node, again to an accuracy of 100%. The accuracies reported here are from multiple separate validation data splits, with samples treated as blind data over 50 models with random sample cross validation.
Clearly the stepwise ANN approach of the present invention provides significant advantages over the techniques used previously, not only in identifying biomarkers with improved predictive capability, but also in identifying novel biomarkers for use in diagnostic and prognostic cancer prediction.
In a further embodiment of the present invention, by using the logistic function, the ANN may be trained to predict against a continuous output variable, which in specific scenarios can be more intuitive than the use of a step function to separate two classes. Here, a single-layer network would be identical to the logistic regression model. However, this approach has several disadvantages, including the requirement for large numbers of data points per predictor, sensitivity to inter-correlations amongst predictors and, perhaps most importantly, the requirement that the predictor variables be linearly related to the output measurement.
The use of the ANN of the present invention with one or more hidden layers allows for the estimation of non-linear functions. The universal approximation theorem states that every continuous function that maps intervals of real numbers to some output interval of real numbers can be approximated arbitrarily closely by a multi-layered perceptron ANN with a single hidden layer. This offers advantages over other machine learning classifiers (e.g. SVMs, Random Forest) where it may be difficult to approximate continuous output data.
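By way of a purely illustrative sketch (not the patented implementation), the following fragment trains a multi-layer perceptron with a single hidden layer of sigmoidal (tanh) units and a linear output, by plain batch gradient descent, to approximate a non-linear continuous function. All sizes, rates and iteration counts are arbitrary choices made for the demonstration:

```python
import numpy as np

# Illustrative only: a one-hidden-layer MLP approximating a continuous
# function, the setting covered by the universal approximation theorem.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
y = np.sin(2.0 * np.pi * x)              # target continuous function

n_hidden = 10
W1 = rng.normal(0.0, 1.0, (1, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 1.0, (n_hidden, 1)); b2 = np.zeros(1)

lr = 0.1
for _ in range(30000):
    h = np.tanh(x @ W1 + b1)             # hidden layer activations
    out = h @ W2 + b2                    # linear output for regression
    err = out - y                        # gradient of 0.5 * squared error
    gW2 = h.T @ err / len(x); gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1.0 - h ** 2)   # backpropagate through tanh
    gW1 = x.T @ dh / len(x); gb1 = dh.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

# mse ends up well below the ~0.5 achieved by always predicting the mean
mse = float(np.mean((np.tanh(x @ W1 + b1) @ W2 + b2 - y) ** 2))
```

A network with no hidden layer could only fit a linear trend through this target; the hidden layer is what supplies the non-linear capacity referred to above.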
This multi-layered perceptron ANN forms the basis of a novel algorithm utilising a stepwise modelling approach to identify the key components of a system in predicting against a continuous output variable, referred to hereafter as the “Risk Distiller” algorithm.
Potential uses for Risk Distiller in the medical arena include predicting actual time to progression, relapse, metastases or death in disease based scenarios, thus generating prognostic models with a view to tailoring therapies in a patient specific manner. This approach can be used on time to event data, and may also be adapted for predicting combined cohorts of censored and time to event data. Further uses include (but are not limited to) climate change prediction, prediction of weather patterns including ocean current measurements, and predicting the effect of stresses on the productivity of crops with a view to forecasting crop yield. Other potential uses include financial forecasting and time series predictions, risk management and credit evaluation.
As described in more detail below, Risk Distiller has been successfully shown to identify a novel gene signature with the ability to predict time to distant metastases over a large series of cases spanning four separate patient cohorts with robust cross-validation. Here, the biomarkers identified were shown to be independent prognosticators of time to metastases. Based on the continuous prediction of time to event, Risk Distiller placed patients into distinct prognostic groups that showed large, statistically significant differences in their actual time to metastases. For every year Risk Distiller predicted a patient would remain metastasis free, the risk of them succumbing to this event was two-fold lower.
The methods and systems of the present invention are not limited to biomarker data obtained solely from mass spectrometry analysis of biological samples. In alternative embodiments, labeled cDNA or cRNA targets derived from the mRNA of an experimental sample are hybridized to nucleic acid probes immobilized to a solid support. By monitoring the amount of label associated with each DNA location, it is possible to infer the abundance of each mRNA species represented. Such approaches are commonly referred to in the art as nucleic acid microarray, DNA microarray or simply gene-chip technologies. There are two standard types of DNA microarray technology in terms of the nature of the arrayed DNA sequence. In the first format, probe cDNA sequences (typically 500 to 5,000 bases long) are immobilized to a solid surface and exposed to a plurality of targets either separately or in a mixture. In the second format, oligonucleotides (typically 20-80-mer oligos) or peptide nucleic acid (PNA) probes are synthesized either in situ (i.e., directly on-chip) or by conventional synthesis followed by on-chip attachment, and then exposed to labeled samples of nucleic acids. The analysis of gene expression information can be performed using any of a variety of methods, means and variations thereof for carrying out array-based gene expression analysis. Array-based gene expression methods are known and have been described in the art (for example, U.S. Pat. Nos. 5,143,854; 5,445,934; 5,807,522; 5,837,832; 6,040,138; 6,045,996; 6,284,460; and 6,607,885).
Other biological sample analysis techniques may include protein/peptide microarrays (protein chips), quantitative polymerase chain reaction (PCR), multiplex PCR, and various well-known nucleic acid sequencing technologies.
The invention is further illustrated by the following non-limiting examples.
A computational approach was taken to analyze genomic data in order to identify genes, proteins or gene/protein signatures which correspond to prognostic outcome in patients with cancer. Genotypic, and subsequently phenotypic, traits determine cell behaviour and, in the case of cancer, govern the cells' susceptibility to treatment. Since tumour cells are genetically unstable, it was postulated that sub-populations of cells arise that assume a more aggressive phenotype, capable of satisfying the requirements necessary for invasion and metastasis. Biomarkers indicative of tumour aggression should therefore be detectable, and their identification would be of considerable value for early disease diagnosis, prognosis and prediction of response to therapy.
The present inventors have developed a novel method for determination of the optimal genomic/proteomic signature for predicting cancer within a clinically realistic time period and not requiring excessive processing power. The approach utilises ANNs and involves sequentially selecting and adding input neurons to a network to identify an optimum cancer biomarker subset based on predictive performance and error, in a form similar to stepwise logistic regression.
Three datasets were used to test and validate the method of the invention. The first comprised human serum samples from patients with varying stages of melanoma. The samples, collected by the German Cancer Research Centre (DKFZ, Heidelberg, Germany), were analysed by MALDI-TOF MS at Nottingham Trent University (Nottingham, United Kingdom). The remaining two datasets were publicly available datasets which both originated from gene expression data derived from breast cancer patients.
The first dataset was derived from MALDI MS analysis of melanoma serum samples. The aims here were firstly to compare healthy control patients with those suffering from melanoma at the four different clinical stages, I, II, III and IV, in order to identify biomarker ions indicative of stage. Secondly, adjacent stages were to be analysed comparatively with the aim of identifying potential biomarkers representative of disease progression. All developed models were then validated on a second set of sample profiles generated separately from the first. This dataset contained 24,000 variables per sample.
The second dataset, published by van't Veer et al. (van't Veer, et al., 2002), used microarray technology to analyse primary breast tumour tissue in relation to development of metastasis. The authors generated data by gene expression analysis in a cohort of 78 breast cancer patients, 34 of whom developed distant metastases within five years and 44 of whom remained disease free after at least five years. Each patient had 24,482 corresponding variables specifying the Log10 expression ratio of a single known gene or expressed sequence tag (EST).
The third dataset, published by West et al. (West, et al., 2001), used microarray technology firstly to analyse primary breast tumors in relation to estrogen receptor (ER) status and secondly to assess whether the tumor had spread to the axillary lymph node (LN), providing information regarding metastatic state. This dataset consisted of 13 ER+/LN+ tumors, 12 ER−/LN+ tumors, 12 ER+/LN− tumors, and 12 ER−/LN− tumors, each sample having 7,129 corresponding gene expression values. The approach described here was then validated using a second dataset (Huang, et al., 2003) which was made available by the same group as the first and contained a different population of patients, run on a different microarray chip.
The ANN modelling used a supervised learning approach and a multi-layer perceptron architecture with a sigmoidal transfer function, where weights were updated by a back propagation algorithm. Learning rate and momentum were set at 0.1 and 0.5 respectively. Prior to training, the data were scaled linearly between 0 and 1 using the minimum and maximum of each variable. This architecture utilized two hidden nodes in a single hidden layer, and initial weights were randomized between 0 and 1. This approach has been previously shown to be a successful method of highlighting the importance of key inputs within highly dimensional systems such as this, while producing generalized models with accurate predictions (Ball, et al., 2002).
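The two numeric settings stated above can be sketched as follows (hypothetical code, not the inventors' software): linear min-max scaling of each input to the range 0 to 1, and a back propagation weight update with learning rate 0.1 and momentum 0.5:

```python
import numpy as np

LEARNING_RATE = 0.1   # eta, as stated above
MOMENTUM = 0.5        # alpha, as stated above

def minmax_scale(X):
    """Scale each column linearly to [0, 1] using its minimum and maximum."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

def momentum_update(w, grad, velocity):
    """delta_w(t) = -eta * dE/dw + alpha * delta_w(t-1)."""
    velocity = -LEARNING_RATE * grad + MOMENTUM * velocity
    return w + velocity, velocity

# Toy data (made-up values) to show the scaling
X = np.array([[2.0, 10.0],
              [4.0, 30.0],
              [6.0, 20.0]])
Xs = minmax_scale(X)   # each column now spans exactly [0, 1]
```

The momentum term carries a fraction of the previous weight change into the current one, smoothing the descent across a noisy error surface.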
The same approach was applied across all datasets, with the only differences being the number of samples and input variables. Here, as an example, the methodology as applied to the van't Veer dataset will be described. Data from the microarray experiments were taken in their raw form. This consisted of 78 samples, each with 24,482 corresponding variables specifying the expression ratio of each single gene. Prior to training each model the data were randomly divided into three subsets: 60% for training, 20% for testing (to assess model performance during the training process) and 20% for validation (to independently validate the model on previously unseen data). This process is known as random sample cross validation and enables the generation of confidence intervals for the predictions on a separate blind data set, thus producing robust, generalized models.
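The random sample cross validation scheme described above, in which each resampling event randomly divides the samples 60/20/20 into training, test and validation subsets, may be sketched as follows (illustrative code; the function name and seed are assumptions):

```python
import numpy as np

def rscv_splits(n_samples, n_events=50, seed=0):
    """Generate random 60/20/20 train/test/validation index splits."""
    rng = np.random.default_rng(seed)
    n_train = int(round(0.6 * n_samples))
    n_test = int(round(0.2 * n_samples))
    splits = []
    for _ in range(n_events):
        idx = rng.permutation(n_samples)
        splits.append((idx[:n_train],                  # 60% training
                       idx[n_train:n_train + n_test],  # 20% testing
                       idx[n_train + n_test:]))        # 20% validation
    return splits

splits = rscv_splits(78)        # e.g. the 78 van't Veer samples
train, test, valid = splits[0]  # one resampling event
```

Training one model per resampling event yields a distribution of blind-data predictions from which confidence intervals can be derived, as described above.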
Initially, each gene from the microarray dataset was used as an individual input in a network, thus creating n (24,482) individual models. These n models were then trained over 50 randomly selected subsets and network predictions and mean squared error values for these predictions were calculated for each model with regards to the separate validation set. The inputs were ranked in ascending order based on the mean squared error values for blind data and the model which performed with the lowest error was selected for further training. Thus 1,224,100 models were trained and tested at each step of model development.
Next, each of the remaining inputs was sequentially added to the previous best input, creating n-1 models each containing two inputs. Training was repeated and performance evaluated. The model which showed the best capability to model the data was then selected and the process repeated, creating n-2 models each containing three inputs. This process was repeated until no significant improvement was gained from the addition of further inputs, resulting in a final model containing the gene expression signature which most accurately modelled the data.
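The stepwise selection loop of the preceding paragraphs can be sketched as follows, where `evaluate` is a hypothetical stand-in for training an ANN on a candidate input subset over the cross validation resamplings and returning its mean squared error on the validation (blind) data:

```python
def stepwise_select(all_inputs, evaluate, tolerance=1e-4):
    """Forward stepwise selection: add the single input that most
    reduces blind-data error, stopping when no significant gain remains."""
    selected, best_err = [], float("inf")
    remaining = list(all_inputs)
    while remaining:
        # One model per candidate: the current subset plus one new input.
        trial_errs = {c: evaluate(selected + [c]) for c in remaining}
        candidate = min(trial_errs, key=trial_errs.get)
        if best_err - trial_errs[candidate] <= tolerance:
            break  # no significant improvement: stop adding inputs
        selected.append(candidate)
        best_err = trial_errs[candidate]
        remaining.remove(candidate)
    return selected, best_err
```

At each step this trains one model per remaining input, which is how a dataset of n = 24,482 inputs over 50 resamplings gives rise to the model counts quoted above.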
This process requires the training and testing of potentially millions of models. To facilitate this, software to automate the procedure has been created using Microsoft Visual Basic. Here, the inputs are added automatically, selecting the best contender biomarkers at each step.
This whole process was repeated from step 3, with inputs continually added until no improvement was gained from the addition of further inputs.
Because there are no confirmatory blood markers for metastatic melanoma, we sought to develop a validated, robust and reproducible MALDI MS methodology using the same stepwise ANN approach to profile serum protein and tryptically digested peptides. This was applied to data derived from MALDI MS analysis representing (i) protein and (ii) digested peptide data from the control and diseased samples. Various analyses were carried out on these datasets in order to identify biomarker ions indicative of the classes shown in Table 1.
Biomarker patterns containing 9 ions from the protein data and 6 ions from the digested peptides were identified, which when used in combination correctly discriminated between control and Stage IV samples to a median accuracy of 92.3% (inter-quartile range 89.4-94.8%) and 100% (inter-quartile range 96.7-100%) respectively. Table 2a-b shows the performance of the models at each step of the analysis for the protein and peptide data. This shows that with the continual addition of key ions there was an overall improvement both in the error associated with the predictive capabilities of the model for blind data and in the median accuracy of samples correctly classified. Nine ions were determined to be the most effective subset of biomarker ions producing the best model performance for the protein data, as no significant improvement was seen in predictive performance with the addition of further ions. No further steps were conducted beyond step 6 for the peptide data because beyond this step no significant improvement in performance could be achieved. Therefore these models were considered to contain the subsets of ions, representing either the proteins or the digested peptides, which most accurately modelled the data.
Next, because the analysis of the peptide data provides the potential for subsequent protein identification, it was decided that these peptide MALDI MS profiles would be analysed in the search for differential biomarker ions which would be representative of firstly disease stage (by analysing the individual stages against control populations) and secondly disease progression (by generating predictive models classifying between adjacent disease stages). The analyses conducted in this part of the study are summarised in Table 3.
Initially, in order to identify ions which were representative of disease stage, the stepwise approach was applied to identify subsets of biomarker ions which could predict between disease stage and control samples. This would therefore provide valuable information concerning which peptide ions were showing differential intensities that were specific to the disease stage of interest. Table 4 shows the biomarker subsets identified in each model, and their median performance when predicting validation subsets of data over 50 random sample cross validation resampling events.
Considering that 3,500 individual ion models are trained and tested at each step of the analysis over 50 random sample cross validation resampling events, it seems unlikely that the consistent identification of particular ions as the most important at a given step would be a consequence of chance. This provides confidence that these ions represent proteins showing a true change in intensity in patients at differing stages of disease.
Once biomarker ions representative of individual disease stage had been determined, it was considered important to analyse adjacent stages of disease, in order to identify biomarker ions which respond differently as disease progresses and which would be predictive and indicative of disease stage. Table 5 shows the biomarker subsets identified in each model, and their median performance when predicting validation subsets of data over 50 random sample cross validation resampling events. It was interesting to find that subsets of ions could be identified which were able to predict between stages to extremely high accuracies: 98% for stage I v stage II and 100% for both stage II v stage III and stage III v stage IV. Furthermore, only two peptide biomarker ions were required to discriminate perfectly between stage II and stage III. One of these ions, 903, was also important in the classification of stage III v stage IV, suggesting that this ion is potentially of importance in disease progression to advanced stages; it appears to be downregulated as melanoma advances from stage II to stage IV, although this could only be confirmed by further studies.
(Table of biomarker ion subsets: 1299, 2309; 1251, 1283; 1299, 1968; 3432, 3443; 1251, 1285; 1754, 2624.)
The overall summaries for the stepwise analysis conducted here can be seen in
To study the question of stability of this procedure over multiple experiments and to assess batch to batch reproducibility of the mass spectrometry analysis, both the proteins and peptides were run by the group on two separate occasions and the results of the second experiment were used to validate the stepwise methodology. This dataset was obtained by a different operator and on a different date. The second sample set was then passed through the developed ANN models for blind class assignment. For the protein data, the model correctly classified 85% of these blind samples, with sensitivity and specificity values of 82% and 88% respectively, and an AUC value of 0.9 when evaluated with a ROC curve. For peptides, the model correctly classified 43/47 samples originating from control patients, and 43/43 samples from cancerous patients. This gave an overall model accuracy of 95.6%, with sensitivity and specificity values of 100% and 91.5% respectively, and an AUC value of 0.98. This suggests that the peptide data were more reproducible than the protein data for this second batch of mass spectrometry analysis. The predictive peptide ions were subsequently sequenced and identified by colleagues using a variety of mass spectrometric techniques, leading to the identification of two proteins: Alpha 1-acid glycoprotein (AGP) precursor 1/2 (AAG1/2) and complement C3 component.
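The headline peptide figures quoted above follow directly from the stated counts, as the following arithmetic check shows:

```python
# Worked check of the peptide-model figures: 43 of 47 control samples
# and 43 of 43 cancer samples were classified correctly.
tn, n_control = 43, 47    # controls correctly called negative
tp, n_cancer = 43, 43     # cancers correctly called positive

sensitivity = tp / n_cancer                    # 43/43 = 1.0   -> 100%
specificity = tn / n_control                   # 43/47         -> 91.5%
accuracy = (tp + tn) / (n_control + n_cancer)  # 86/90         -> 95.6%
```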
Analysis of van't Veer et al. Dataset
The aims of the analysis were to utilise the novel stepwise ANN modelling approach of the invention in order to identify a gene expression signature which would accurately predict whether a patient would develop distant metastases within a five year time period, thereby identifying potential markers and giving an insight into disease aetiology. Following the rule of parsimony, which suggests that the simplest model fitting the data should be used, an initial analysis was carried out using logistic regression (Subasi and Ercelebi (2005) Comput Methods Programs Biomed. 78(2):87-99). This method led to poor predictive performance, with a median accuracy of just 53% (inter-quartile range 47-61%). With logistic regression there is the potential disadvantage of auto-correlation between the large numbers of independent variables within the dataset, which possibly explains the poor predictive performance and suggests that this dataset is not linearly separable.
The application of this approach resulted in the identification of a gene expression signature consisting of twenty genes which predicted patient prognosis to a median accuracy of 100% (inter-quartile range 100-100%, mean squared error of 0.085), where samples were treated as blind data over 50 models with random sample cross validation. The overall screening process assessed over ten million individual models. When evaluated with a ROC curve the model had an AUC value of 0.971 with sensitivity and specificity values of 98% and 94% respectively.
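The AUC figure above can be understood through the standard rank interpretation of the ROC curve: it equals the probability that a randomly chosen positive case receives a higher model output than a randomly chosen negative case. A brief illustrative sketch (the scores below are made-up values, not data from the study):

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC via the Mann-Whitney pairwise-comparison formulation."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Count positive > negative pairs; ties count as half a win.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

y = [0, 0, 0, 1, 1, 1]
s = [0.1, 0.5, 0.35, 0.8, 0.7, 0.45]
auc = roc_auc(y, s)   # 8 of the 9 positive/negative pairs are ordered correctly
```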
Homo sapiens HSPC337
Homo sapiens cDNA:
Median accuracy, lower and upper inter-quartile ranges, gene names (where known) and descriptions are shown.
To further validate the model, an additional set of 19 samples was selected, as in the original manuscript (van't Veer, et al., 2002). This set consisted of 7 patients who remained metastasis free, and 12 who developed metastases within five years. The 20 gene expression signature that had been identified classified all 19 samples correctly, further emphasising the present model's predictive power.
The aims here were firstly to identify a gene expression signature which would accurately predict estrogen receptor (ER) status, and secondly to determine whether it was possible to generate a robust model containing genes which would discriminate between patients based upon lymph node (LN) status. As before, an initial analysis was carried out using logistic regression, which again led to poor predictive performance with a median accuracy of 78% (inter-quartile range 67-88%) for the ER data, and just 56% (inter-quartile range 44-67%) for the LN dataset, which is comparable to the predictions one would gain from a random classifier.
Here, using the stepwise methodology, two gene expression signatures were identified. The first discriminated 100% of the cases correctly with regards to whether they were positive or negative for ER, and the second predicted whether metastasis of the tumour to the axillary lymph node had occurred, to an accuracy of 100%. Again, the accuracies reported are from separate validation data splits, with samples treated as blind data over 50 models with random sample cross validation. The overall screening process assessed over five million individual models. When evaluated with a ROC curve the model had an area under the curve value of 1.0 with sensitivity and specificity values of 100% and 100% respectively for both ER and LN status.
The models developed using the gene subsets identified by the approach described were applied to 88 samples from Huang and colleagues (Huang, et al (2003) Lancet, 361, 1590-1596). These samples were then subjected to classification based upon ER and LN status as with the first dataset. 88.6% of the samples could be classified correctly based on ER status, with a sensitivity and specificity of 90.4% and 80% respectively. 83% of samples were correctly classified based upon their LN status, with a sensitivity of 86.7% and specificity of 80%. The ROC curve AUC values were 0.874 and 0.812 for the ER and LN gene subset models respectively. It was expected that the predictive accuracies would be reduced when the models were applied to this additional dataset, but the accuracies reported here remain extremely encouraging given the larger sample size and the differences in sample characteristics and microarray analysis described above. The ability to predict ER status at a higher rate than LN status suggests that there is a greater level of variation in the gene expression profiles with respect to LN status than with respect to ER status.
H. sapiens 5T4 gene for
Homo sapiens HPV16 E1
Homo sapiens U2
Median accuracy, lower and upper inter-quartile ranges, gene accession numbers, gene descriptions are shown.
Homo sapiens I-Rel
Homo sapiens
Median accuracy, lower and upper inter-quartile ranges, gene accession numbers, gene descriptions are shown.
The stepwise methodology described above facilitates the identification of subsets of biomarkers which can accurately model and predict sample class for a given complex dataset. In order to facilitate a more rapid biomarker subset analysis, the stepwise approach described adds only the best performing biomarker at each step of analysis. Although this appears to be an extremely robust method of biomarker identification, the question remains as to whether there are additional subsets of biomarkers existing within the dataset which are also capable of predicting class to high accuracies. If this is true, it would lead to a further understanding of the system being modelled; moreover, if multiple biomarkers were to appear in more than one model subset, this would further validate their identification and enhance the potential of their role in disease status, warranting further investigation.
To achieve these aims, the same West dataset was used as previously (West, et al., 2001). As can be seen from Table 8a-b, in addition to the number one ranked biomarker at step one (which was subsequently used as the basis for the gene biomarker signature described earlier), there are several other potential candidate biomarkers which by themselves are able to classify a significant proportion of the sample population into their respective classes. Therefore an individual stepwise analysis was conducted on each of the remaining top ten genes identified in step one of the analysis, for both ER and LN status.
a)-(b) shows the network performance at each step of analysis for all of these genes for (a) ER and (b) LN status. It is evident that all of these subsets have the ability to predict for blind subsets of samples to extremely high accuracies, with no significant differences between individual models. This suggests that there may be multiple genes acting in response to disease status, subsequently altering various pathways and the expression levels of many other genes. It is worthwhile to note that some of these genes were identified in many of the models (Table 9); for example, an EST appeared in seven out of ten models, further highlighting its potential importance in LN status. This shows that there is not necessarily just one set of biomarkers which are correlates of a particular disease status of interest; there may be many, and when one particular subset of biomarkers is affected in a way that is indicative of disease status, this may consequently have a cascade effect on many other biomarkers, altering their expression in a similar fashion.
To provide further evidence and confidence that the biomarker subsets identified in all of the above analyses by the stepwise approach were not random as a consequence of the high dimensionality of the datasets, two validation exercises were conducted. Firstly, ten inputs were randomly selected from the datasets and trained over 50 random sample cross validation events in an ANN model identically as for the stepwise method. This process was repeated 1,000 times, and the summary results are presented in Table 10.
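The random input subset control described above may be sketched as follows, where `score_subset` is a hypothetical stand-in for the full train-and-validate procedure on a given subset of inputs:

```python
import numpy as np

def random_baseline(n_inputs, score_subset, subset_size=10,
                    n_repeats=1000, seed=0):
    """Build a null distribution of blind-data accuracies by scoring
    models built from randomly chosen input subsets."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_repeats):
        subset = rng.choice(n_inputs, size=subset_size, replace=False)
        accs.append(score_subset(subset))
    accs = np.array(accs)
    return accs.mean(), accs.std()
```

A stepwise-selected subset is then judged against this empirical null distribution, rather than against 50% chance alone.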
It is clear from Table 10 that the variation amongst models generated with these random input subsets is small, suggesting that a randomly generated model is able to predict sample class to accuracies in the region of 64% for blind data. These models will very rarely predict significantly higher than this value, which is highlighted in
a)-(c) highlights the significance between the performance of the randomly generated models and those developed with the stepwise approach for the van't Veer and West gene expression datasets (van't Veer, et al., 2002; West, et al., 2001).
These results show that, as expected, a random classifier leads to classification accuracies close to chance, and therefore it can be said that the stepwise approach truly identifies subsets of inputs which predict well on unseen data.
Next, it was necessary to investigate whether this stepwise approach would identify the same inputs if the analysis was run on several different occasions, starting over each time with the same dataset. To achieve this, the stepwise analysis was run on the van't Veer dataset with samples randomly split into training, test and validation subsets 10, 20, 50 and 100 times and subsequently trained. This was then repeated five times to calculate how consistent the ranking of the individual inputs was with regards to model performance. This consistency was calculated for the top fifty most important inputs, and was the ratio of the actual ranking based upon the average error of the model to the average ranking over the multiple runs. These are summarised in Table 11.
There was a significant increase in consistency amongst the performance of inputs when increasing from 10 to 20 (p=0.000), and 20 to 50 RSCV datasplits (p=0.000), but not from 50 to 100 (p=0.2213). Interestingly, for all analyses, the same two inputs were ranked as first and second every time, with the majority of the variation in rankings appearing towards the bottom of the top 50 list, which accounts for the 14 and 12% variability in the 50 and 100 RSCV event models respectively. This showed step 1 to be extremely consistent in important input identification across multiple analyses.
The same procedure was then carried out for step 2, with the input identified as the most important across all the models in step 1 used to form the basis of this second step. Table 12 shows the average consistency ratios for step 2.
It is clear that consistency across multiple repeats of the analysis showed a dramatic decline, with only the 100 RSCV model retaining its consistency in input identification, and the improvement in consistent input performance was statistically significant (p=0.000) at each increment. The 50 and 100 RSCV models both identified the same input as number one ranked, and it therefore appears evident that a minimum of 50 RSCV datasplits is preferable to ensure that the same inputs are consistently identified as important multiple times in 80-90% of analyses.
The present example demonstrates one aspect of the novel stepwise ANN approaches of the invention as utilised in data mining of biomarker ions representative of disease status applied to different datasets. This ANN based stepwise approach to data mining offers the potential for identification of a defined subset of biomarkers with prognostic and diagnostic potential. These biomarkers are ordinal to each other within the data space and further markers may be identified by examination of the performance of models for biomarkers at each step of the development process. In order to assess the potential of this methodology in biomarker discovery, three datasets were analysed. These were all from different platforms which generate large amounts of data, namely mass spectrometry and gene expression microarray data.
The present technology is able to support clinical decision making in the medical arena, and to improve the care and management of patients on an individual basis (so called “personalised medicine”). It has also been shown that gene expression profiles can be used as a basis for determining the most significant genes capable of discriminating patients of different status in breast cancer. In agreement with van't Veer et al. (West, et al., 2001) it has been demonstrated that whilst single genes are capable of discriminating between different disease states, multiple genes in combination enhance the predictive power of these models. In addition to this, the results provide further evidence that ER+ and ER− tumours display gene expression patterns which are significantly different, and can even be discriminated between without the ER gene itself. This suggests that these phenotypes are not only explained by the ER gene, but a combination of other genes not necessarily primarily involved in the response of ER, but which may be interacting with, and modulating ER expression in some unknown fashion. Unlike some analysis methods, the present ANN stepwise approach takes each and every gene into account for analysis, and does not use various cut-off values to determine significant gene expression, which overcomes previous data analysis limitations. These models can then form a foundation for future studies using these genes to develop simpler prognostic tests, or as candidate therapeutic targets for the development of novel therapies, with a particular focus being the determination of the influence that these genes may have upon ER expression and development of lymph node metastasis. Given the relevance of the genes identified by this method and the applicability of these to a wider population this approach is a valid way of identifying subsets of gene markers associated with disease characteristics. 
Confidence in the identified genes is increased further still in that many of these genes have known associations with cancer.
To conclude, the present example demonstrates that by using novel ANN methodologies, it is possible to develop a powerful tool to identify subsets of biomarkers that predict disease status in a variety of analyses. The potential of this approach is apparent from the high predictive accuracies achieved using the biomarker subsets identified. These biomarker subsets were then shown to be capable of high classification accuracies when used to predict outcomes for additional validation datasets, and were even capable of being applied to predict the ER and LN status of a dataset very different in origin from the one used in the identification of the important gene subsets. This, in combination with the various validation exercises that have been conducted, suggests that these biomarkers have biological relevance and that their selection is not arbitrary or an artefact of the high dimensionality of the system, as they were shown to be robust enough to cope with sampling variability and reproducible across different sample studies.
Molecular diagnostics for the diagnosis of disease are becoming increasingly important in the early diagnosis and management of disease, the stratification of patients in clinical trials and the identification of patients who should receive certain therapies.
Before the advent of molecular diagnostics, clinicians categorized cancer cells according to their pathology, that is, according to their appearance under a microscope. Now, taking data from new disciplines such as genomics and proteomics, molecular diagnostics categorizes cancer using technology such as mass spectrometry and transcriptomic gene chips. Molecular diagnostics have been used most extensively in the field of cancer but increasingly are also being used in most clinical indications of disease.
Molecular diagnostics determines how genes and proteins are interacting in a cell. It focuses upon patterns of gene and protein activity in different types of cancerous or precancerous cells. Molecular diagnostics uncovers these sets of changes and captures this information as expression patterns. Also called “molecular signatures,” these expression patterns are improving the clinicians' ability to diagnose cancer. Molecular signatures include specific sets of genes whose expression patterns are correlated to a specific phenotypic output. Whilst the expression of each individual gene in isolation is not indicative of a defined phenotype it is the combination of all the genes within the panel that together provides a reliable and defined correlation to a pathological condition. Increasingly in bioinformatics and genomic analysis it has been recognised that a key step in recognising and predicting susceptibility to disease is through the identification of these molecular signatures in tissues taken from a patient. Whereas single gene target tests are crude and can often miss larger scale changes in cellular biology, detection and analysis of distinct molecular signatures can provide accurate prognosis of disease states within individuals earlier than was previously thought possible.
A diagnostic test known commercially as Mammaprint™ (Agendia, Amsterdam, Netherlands) for use in oncology is based on the original van't Veer dataset (Nature, 2002) in fresh frozen tissue. The Mammaprint™ test predicts low and high risk of distant metastasis (Ishitobi et al., Jpn J Clin Oncol, Jan. 27, 2010). This test is based on a 70 gene signature, which has a median sensitivity of 86% and currently markets at around US$3,000 per test, placing it out of the spending range of most health service providers. The stratification defines “low risk” patients as having a 10% chance of recurrence within 10 years whilst “high risk” patients have a 20% chance of recurrence within 10 years. Hence, the overall predictive accuracy is low. The diagnostic test can be used further to classify patients into oestrogen receptor (ER) and BRCA1 positive or negative as described in U.S. Pat. No. 7,514,209.
U.S. Pat. No. 7,081,340 describes a test which stratifies patients into broad categories of low, medium and high risk with a view to identifying patients who would most benefit from chemotherapy.
Other types of RNA expression analysis to diagnose breast cancer have focussed on combinations of genes identified by a variety of screening methods. Such methods include Veridex™ as set out in US patent publication no. 2009/0298052, which describes a breast cancer diagnostic for use intra-operatively to predict the presence of micrometastasis. The Ipsogen™ test, as set out in International Patent publication no. WO-2009/083780, describes a diagnostic segregating patients into basal or luminal breast cancer and further good or poor prognosis of the luminal breast cancer subtypes based upon the expression analysis of 16 different kinase genes. In US Patent Publication no. 2008/0206769 an analysis is made of 14 genes to derive a metastasis score, which is compared with a threshold comparator to give patients a risk of developing metastasis. The Diadexus™ test as set out in International Patent publication no. WO06121991 describes some 70 genes whose expression levels were used to provide for a differential diagnosis of good or poor prognostic outcome.
Currently there are no prognostic tests for breast cancer that are able to define the precise time to a given event in disease progression, such as projected time to death, progression of the disease to a later stage, and likelihood of recurrence or metastasis. The only tests currently available, as described above, predict broadly defined classes, such as poor or good prognostic group, without consideration of the individual's actual prognosis. One example of this is the well-established Nottingham Prognostic Index (Galea MH, et al. Breast Cancer Res Treat. 1992; 22(3):207-19).
In general, expression analysis has been used to provide a classification of good or poor prognosis in patients, or the classification of groups comprising individuals with similar risk of developing metastasis. The analysis of gene expression levels has not been used to provide a time to an event diagnosis that would guide clinical management of the disease or the timing of clinical intervention.
The present invention, for the first time, describes a method and apparatus that predicts a time to a given disease progression outcome, hereafter referred to as an “event”.
In the present example of the present invention in use, the inventors have analysed data from three public breast cancer gene expression microarray datasets with longitudinal follow up, to predict an event: the distant metastasis-free survival (DMFS) interval (n=530, ER+ cases). A gene signature has been incorporated into a decision support model comprising 31 genes that predict actual DMFS with high accuracy (Spearman's r=0.86). This signature has been validated on blind data from a fourth set, where it has shown good predictive results.
This novel test provides a more accurate diagnosis for the individual, moving away from the group based statistics or prognostic classes that are currently employed in the art. The 31 gene signature disclosed herein in Table 13 may be translated to a quantitative PCR test and used to diagnose the time to distant metastasis on fresh frozen [FF] material or formalin fixed paraffin embedded [FFPE] material, through an associated decision support tool. Alternatively the 31 gene signature can be translated to a gene microarray in the format of a small bespoke array specifically for the purpose of analysing and providing a time to an event diagnostic. Further refinement allows for the 31 gene signature to be incorporated into a next-generation sequencing format, such as using Solexa™ deep sequencing technology.
The potential advantage of the diagnostic described herein is that it provides a time to an event prognosis for each patient that enables clinicians and patients to plan appropriate therapies and thus subsequent patient management. For those patients with a shorter predicted time to an event, a clinical approach prescribing aggressive chemo- and radio-therapy followed with Tamoxifen, for instance, may be deemed appropriate. On the other hand, patients with a mid- to late time to an event could benefit from Tamoxifen for several years with regular check-ups. A significant part of the clinical validation exercise is to look very carefully at the mid- to late time to event groups to identify subgroups within this cohort that would further allow differential treatment strategies to be identified.
The inventions described herein through the use of the gene expression panel coupled with ANN data mining and interrogation and the novel application of a continuous output from the ANN provide for a diagnostic or prognostic that predicts the time to an event, in this specific embodiment the development of distant metastasis.
Artificial Neural Networks (ANNs) have been selected as they provide a non-linear basis for identification of genes associated with particular clinical questions. It is well known that this type of ANN is a powerful tool for the analysis of complex data (Wei et al, 1998; Ball et al, 2002; Khan et al, 2001). A number of studies have indicated the approach can produce generalised models with a greater accuracy than conventional statistical techniques in medical diagnostics (Tafeit and Reibnegger, 1999; Reckwitz et al, 1999) without relying on predetermined relationships as in other modelling techniques. The application of these approaches has been presented in Lancashire et al (2009). The approaches have been developed since early application by Ball et al (2002).
A number of other methods may be developed for the purposes of developing disease and clinical classifiers. These include various forms of genetic algorithm, support vector machine, decision trees (extending to random forests) and Bayesian methodologies. The vast majority of these are applied to data mining and classifier development in a recursive fashion, resulting in extremely large panels of markers. The ANN algorithm mentioned above has shown an improvement in performance over these methods and has identified much smaller panels of genes with higher classification performance.
One of the major hurdles to the analysis of the data types described above is the high dimensionality and complexity of the data. This has been termed “the curse of dimensionality” (Bellman, 1961; Bishop, 1995) and often leads to an input space with many irrelevant or noisy inputs, subsequently causing predictive algorithms to behave badly as a result of modelling extraneous portions of the space. Conventional statistical theory would indicate that for a valid representation of the population one should have at least twice as many replicates as the number of dimensions in the data. Clearly a data set requiring hundreds of thousands of samples is not feasible due to sample availability. It is estimated that to achieve power for transcriptomic microarray data based on conventional analyses (a t-test, for example) in the order of 10^5 replicates would be required (Ponder, pers. comm.). In practice the powering issues associated with high dimensional data sets can be overcome by modelling individual components of the parameters in the array and applying a robust cross validation approach (Michiels et al, 2007). Furthermore, to validate panels of markers a secondary data set is beneficial. This approach has been incorporated into the algorithms described previously herein.
Power analysis was conducted using a multivariate regression model (which has less power than a non-linear ANN based approach): for an r² of 0.81 at an alpha value of 0.05 with 31 regressor variables, 38 cases are required to give a power of 0.8. Analysis was conducted according to Lenth, R. V. (2006-9) Java Applets for Power and Sample Size [Computer software], retrieved from www.stat.uiowa.edu/~rlenth/Power.
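The quoted power calculation can be approximated from first principles using the noncentral F distribution. The sketch below is indicative only: the noncentrality convention (lambda = f² · n) is an assumption, so the result approximates rather than reproduces the output of Lenth's applet.

```python
# Hedged sketch of a power calculation for multiple regression with
# 31 regressors, r^2 = 0.81, alpha = 0.05 and n = 38, as quoted above.
from scipy.stats import f, ncf

r2, alpha, n_predictors, n_cases = 0.81, 0.05, 31, 38
f2 = r2 / (1.0 - r2)                    # Cohen's effect size f^2
df1 = n_predictors                      # numerator degrees of freedom
df2 = n_cases - n_predictors - 1        # residual degrees of freedom
nc = f2 * n_cases                       # noncentrality (assumed convention)

f_crit = f.ppf(1.0 - alpha, df1, df2)   # critical F under the null
power = 1.0 - ncf.cdf(f_crit, df1, df2, nc)
print(round(power, 2))
```

With only 6 residual degrees of freedom the calculation is sensitive to the assumed conventions, which is consistent with the text's point that conventional parametric powering is marginal for data of this dimensionality.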
This approach also addresses the question of the applicability and validity of the biomarker panel to predictions for a broader population. When analysing a particular data set one has to be careful to prevent over-fitting. The impact of over-fitting is that the signature or pattern identified may be applicable for the data set (group) being modelled, but as soon as the pattern is applied to a blind independent data set the signature ceases to be predictive. Any approach is particularly sensitive to this over-fitting when the population numbers are low. The problem with over-fitting can be overcome by analysing a large number of replicates to achieve statistical power.
Using the logistic function, ANNs may be trained to predict against a continuous output variable, which in specific scenarios can be more intuitive than the use of a step-function to separate two classes. Here, a single-layered network would be identical to the logistic regression model. However, this logistic regression approach has several disadvantages, including the requirement for large numbers of data points per predictor, sensitivity to inter-correlations amongst predictors and, perhaps most importantly, the requirement that the predictor variables be linearly related to the output measurement.
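The equivalence noted above between a single-layered network with a logistic activation and the logistic regression model can be sketched as follows; the weights shown are illustrative values, not fitted parameters.

```python
# Sketch: a single-layer network with a logistic (sigmoid) activation
# computes exactly the logistic regression mapping P(y=1|x).
import numpy as np

def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 2.0])          # one input vector (3 predictors)
w = np.array([0.8, 0.1, -0.4])          # connection weights (illustrative)
b = 0.2                                 # bias term

# Weighted sum passed through the logistic function: this is the whole
# "network", and it is term-for-term the logistic regression model.
p = sigmoid(w @ x + b)
print(0.0 < p < 1.0)                    # → True
```

Adding hidden layers between the input and this output unit is what breaks the equivalence and removes the linearity requirement discussed above.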
Introduction of ANNs with one or more hidden layers allows for the estimation of non-linear functions. The universal approximation theorem states that every continuous function that maps intervals of real numbers to some output interval of real numbers can be approximated arbitrarily closely by a multi-layered perceptron ANN with a single hidden layer. This offers advantages over other machine learning classifiers (e.g. SVMs, Random Forest) where it may be difficult to approximate continuous output data.
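As an illustration of this approximation property, the following sketch fits a single-hidden-layer multi-layered perceptron to a smooth non-linear function on a real interval. The network size, activation and training settings are assumptions chosen for the example, not parameters of the invention.

```python
# Sketch: a single hidden layer of tanh units approximating a smooth
# continuous function (sin(2x)) on the interval [0, 3].
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 3.0, size=(300, 1))
y = np.sin(2.0 * X).ravel()             # continuous target on a real interval

net = MLPRegressor(hidden_layer_sizes=(50,), activation="tanh",
                   solver="lbfgs", max_iter=5000, random_state=1)
net.fit(X, y)
r2 = net.score(X, y)                    # coefficient of determination
print(r2)
```

In practice a close fit (r² near 1) is reached on this interval, whereas a single-layer model of the previous kind could only capture a monotone trend.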
This multi-layered perceptron ANN forms the basis of the present example and is referred to as “Risk Distiller”, a novel algorithm utilising a stepwise modelling approach to identify the key components of a system in predicting against a continuous output variable.
Potential uses for Risk Distiller in the medical arena include predicting actual time to event including progression, relapse, metastases or death in disease based scenarios, thus generating prognostic models with a view to tailoring therapies in a patient specific manner. This approach can be used in event data, and also may be adopted for predicting combined cohorts of censored and time to event data. Other biological uses include (but are not limited to) climate change prediction, prediction of weather patterns including ocean current measurements, predicting the effect of stresses on the productivity of crops with a view to forecasting crop yield. Other potential uses include financial forecasting and time series predictions, risk management and credit evaluation.
One of the criticisms of previous studies deriving biomarker signatures relating to clinical characteristics has been that the training and validation sets have come from single studies carried out in individual centres. To address this the present inventors have used three publications Chin et al (Cancer Cell 2006), Miller et al (PNAS 2005) and Desmedt et al (Clin Canc Res 2007) to initially derive a gene signature which for the first time predicts time to an event, in this case the event is distant metastasis-free survival [DMFS], through a decision support model. Examples of other suitable events include, but are not limited to, disease occurrence or recurrence, drug therapy failure, or more broadly time to the development of any specific phenotype defined by gene expression, gene silencing or similar molecular events.
Furthermore, existing signatures tend to be based on broad categories or classes or groups of individuals in the population. For example the van't Veer study (Nature 2002) found correlates of good or poor prognostic outcome groups. These were defined using a cut-off of 5 years, with the good group developing metastasis after 5 years and the poor group developing metastasis before 5 years. Clearly the selection of such cut-offs is somewhat arbitrary, and an individual who develops metastasis at 4 years 11 months may have a very different profile from an individual who develops metastasis at 6 months. This definition of classes also introduces errors to the classification tool due to the within-class heterogeneity. For example, even in the Good Prognostic Group of the Nottingham Prognostic Index individuals may die at 6 months or 120 months. To date there has been little focus on the individual's prognosis or on non class based decision support models using a continuous output. A further aspect of this invention is the prediction of an event for an individual based on a molecular profile that is specific for the individual and not based on a class, such as good or poor prognosis.
The approach adopted in this example has progressed the characterisation of individual cases by utilisation of the aforementioned ANN based algorithm (see Example 1), but adapted to provide a continuous output (see
During the analysis of the primary data sets an internal Monte-Carlo cross validation approach was adopted to optimise the signature derived and prevent over-fitting of the decision support system. This approach mitigates the need for the vast numbers of cases that power analysis indicates would be required when conventional parametric statistics are employed, as the model is driven towards a global solution and prediction for unseen cases. To further validate the decision support model, the biomarker signature was tested on a fourth independent dataset (source Sotiriou et al (JNCI 2006)). The ER+ biomarker signature performs well on unseen data from the datasets used to develop the signature (n=127; r=0.86; p<0.0001), a separate cohort of patients from the fourth study (n=20; r=0.93; p<0.0001) and even for cases censored or lost to follow up (n=383; r=0.59; p=0.0001).
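The internal Monte-Carlo cross validation of a continuous-output model, scored by a rank correlation as in the validation figures above, can be sketched as follows on synthetic data. The dataset, network settings and number of resampling splits are illustrative assumptions.

```python
# Sketch: Monte-Carlo (repeated random resampling) cross validation of a
# continuous-output ANN, scored out-of-sample by Spearman's rank
# correlation, on synthetic time-to-event-like data.
import numpy as np
from scipy.stats import spearmanr
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
t = X @ rng.normal(size=10) + 0.1 * rng.normal(size=200)  # continuous target

rs = []
for seed in range(20):                   # repeated random resampling splits
    Xtr, Xte, ttr, tte = train_test_split(X, t, test_size=0.3,
                                          random_state=seed)
    net = MLPRegressor(hidden_layer_sizes=(8,), solver="lbfgs",
                       max_iter=2000, random_state=seed)
    net.fit(Xtr, ttr)
    r, _ = spearmanr(tte, net.predict(Xte))  # rank agreement on unseen cases
    rs.append(r)

print(round(float(np.mean(rs)), 2))      # mean out-of-sample Spearman r
```

Averaging the rank correlation over many random splits, rather than relying on a single split, is what guards against an over-fitted signature appearing predictive by chance.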
A comparison between the actual and the decision support model predicted Kaplan Meier curves was made by using Log rank tests. These produced a p value of 0.56 indicating equivalence of the model predictions with actual events (predicted median survival compared to actual median survival was 3.7 months versus 3.5 months respectively).
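The survival curves compared above are Kaplan-Meier product-limit estimates. A minimal sketch of the estimator follows, on invented toy data rather than the study data; constructing such curves from actual and model-predicted event times is the step that precedes the log rank comparison.

```python
# Minimal Kaplan-Meier product-limit estimator on toy survival data.
import numpy as np

def kaplan_meier(times, observed):
    """Return distinct event times and the KM survival estimate at each."""
    order = np.argsort(times)
    times = np.asarray(times)[order]
    observed = np.asarray(observed)[order]
    event_times = np.unique(times[observed == 1])
    surv, s = [], 1.0
    for t in event_times:
        n_at_risk = np.sum(times >= t)              # still under observation
        d = np.sum((times == t) & (observed == 1))  # events at time t
        s *= 1.0 - d / n_at_risk                    # product-limit step
        surv.append(s)
    return event_times, np.array(surv)

t_months = [2, 3, 3, 5, 8, 8, 12, 14]   # invented follow-up times (months)
event = [1, 1, 0, 1, 1, 1, 0, 1]        # 1 = event observed, 0 = censored
times, s = kaplan_meier(t_months, event)
print(s[0])                              # → 0.875  (7/8 survive past t=2)
```

The median survival figures quoted above are read off such curves as the first time at which the estimate falls to or below 0.5.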
The genes identified when combined in a panel correlate positively, negatively and in a highly curvilinear fashion with DMFS. This prevents the generation of a simple rule based solution to the prediction of DMFS and requires incorporation of the panel into a decision support model through the model algorithm developed herein. A separate analysis of all of the genes individually showed they were significantly related to the DMFS hazard based on Cox proportional hazard survival models. A specific aspect of this invention is therefore a decision support model, which specifies the positive, negative or cofactorial aspect of the genes within the panel.
Further, a subset analysis allows output time to event information on individuals to be split into groups of <5 years, >5 years DMFS which reveals a clear and distinct clustering of cases based upon the 31 gene signature (see
The invention provides a diagnostic panel, comprising thirty-one genes, which when incorporated into a decision support model such as Risk Distiller predicts time to an event. Conversely, the invention provides a decision support model that when combined with the unique gene signature predicts time to an event, in this case DMFS. This is the first time such a decision tool has been developed for an individual's prognosis.
A further embodiment of the invention is the depiction of predicted time of survival of a population based on the use of the diagnostic predicting time to an event. Another embodiment of the invention is the specific predicted Kaplan Meier curve derived from data mining of publications to generate a working model against which individuals' gene expression information may be used to predict time to distant metastasis. A further utility of this invention is the derivation and depiction of the predicted Kaplan Meier curve from use of the Risk Distiller algorithm.
A further embodiment of this invention is, therefore, a gene panel comprising 1 or more of the thirty-one gene signature that specifies a subset of patients with a time to an event [DMFS] of less than 5 years or more than 5 years. Another embodiment of this invention is a decision support model that works to provide a time to an event for a subset of patients with a time to an event [DMFS] of less than 2 years, or a time to an event of 2.5 to 5 years, 5-10 years or greater than 10 years.
A further embodiment of this invention is a gene signature predicting a time to an event comprising a gene panel of 31 genes listed in Table 13. Further refinement of the gene panel allows patients to be grouped into 2 groups with DMFS of less than 5 years or more than 5 years and specific gene panels defining these groups are within the remit of the present invention.
It will be understood that the embodiments described above are given by way of example only and are not intended to limit the invention, the scope of which is defined in the appended claims. It will also be understood that the embodiments described may be used individually or in combination.
The following are hereby incorporated by reference in their entirety herein.
This application claims priority to U.S. provisional application 61/382,099, filed Sep. 13, 2010, the content of which is incorporated herein by reference in its entirety.