The field of the invention is that of individual prediction methods for the screening, diagnosis, prognosis or therapeutic response of diseases and the side effects of medicaments in the case of complex and multifactorial diseases such as cancers and notably prostate cancer.
Nowadays, there are forms of cancer, and notably prostate cancer, that are widespread in humans in industrialized countries and whose incidence has substantially increased in recent years.
The diagnosis and the treatments proposed require the carrying out of invasive and expensive procedures. The current methods developed for determining populations at risk or the management strategies propose positive or negative predictive values (cancer/no cancer) according to tests (tumor markers, molecular signatures and the like) or results obtained from linear functions of the nomogram type, but their reliability is less than 80% and the results are rarely reproducible on an individual scale.
Currently, it has been proposed to evaluate a risk of prostate cancer by a blood test for the prostate specific antigen (PSA) which is the reference marker for deciding on an invasive procedure of the biopsy type for the histological confirmation of a prostate cancer, typically in the cases of detection of a measured level greater than 4 ng/ml, or even 2.5 ng/ml in some protocols.
Above 4 ng/ml of blood PSA level, the positive predictive value is of the order of 30%, which means that among the people who have a total PSA level greater than 4 ng/ml, only about 3 out of 10 have prostate cancer.
At the threshold of 4 ng/ml, the negative predictive value of the PSA test is of the order of 80%, which means that when the PSA level is below 4 ng/ml, prostate cancer is truly absent in 8 cases out of 10.
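The quantities involved in evaluating a threshold test such as PSA > 4 ng/ml are easy to confuse. As a minimal sketch, the four standard measures can be computed from a 2×2 table as follows; the counts are hypothetical and chosen for illustration only, they are not data from any study.

```python
def screening_metrics(tp, fp, fn, tn):
    """Return the four standard measures of a binary threshold test."""
    return {
        "sensitivity": tp / (tp + fn),  # diseased subjects who test positive
        "specificity": tn / (tn + fp),  # healthy subjects who test negative
        "ppv": tp / (tp + fp),          # positive tests that are true cases
        "npv": tn / (tn + fn),          # negative tests that are truly healthy
    }

# Hypothetical counts for a "PSA > 4 ng/ml" rule (illustration only).
metrics = screening_metrics(tp=30, fp=70, fn=20, tn=380)
```

With these invented counts, 30 of the 100 positive tests are true cases, so the positive predictive value is 0.3, while the sensitivity (30 of 50 cases detected) is 0.6.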
Tools for evaluating the nomogram-type risk incorporating several parameters have been developed in order to respond to individual questions and have in particular been described in the journal [S. F. Shariat, P. I. Karakiewicz, C. G. Roehrborn and M. W. Kattan, An updated catalog of prostate cancer predictive tools, Cancer (113), p. 3075-99, 2008].
Nomograms are statistical decision-making tools which contain information obtained from hundreds of concrete observations on proven cases of prostate cancer. These tools help patients and doctors during decision-making. They provide predictions calculated from a variety of clinical data obtained from previously treated prostate cancers. They are graphical calculating charts (abacuses) constructed on the basis of multivariate logistic regressions. These nomograms have a mean accuracy rate of 80%, which remains insufficient. Patients nevertheless derive undeniable advantages from them because they are free of the partiality and subjectivity found among clinicians and health care professionals. By way of example, 12 questions and associated predictive tools are proposed by the Fondation de Recherche Canadienne sur le Cancer de la Prostate [Canadian Foundation for Research on Prostate Cancer].
The existing solutions used in this type of predictive tool are most often based on the collection of clinical and evaluation data, using modeling methods that are linear with respect to the parameters. The methods developed are insufficiently reliable and do not make it possible to carry out, with sufficiently low error rates, hierarchical predictions such as: risk of cancer, risk of rapidly progressing cancer, risk of cancer resistant to a treatment.
Decision-making in a sound approach to personalized medicine should ideally take into account characteristics specific to the patient, for instance constitutional genetic data or family histories. These informative data on cancer susceptibility, appropriately modeled, would, in the case of prostate cancer, make it possible to assist patients and specialists in deciding on the appropriate age of entry into a screening process and in assessing the risk of a positive biopsy, and could even be decisive in terms of management of the patient diagnosed. This is because some genetic markers are correlated with the aggressiveness of prostate cancer [O. Cussenot, et al., Effect of genetic variability within 8q24 on aggressiveness patterns at diagnosis and familial status of prostate cancer, Clin Cancer Res (14) pp 5635-9; 2008] and can therefore assist in deciding on the relevance of a treatment, typically radical prostatectomy for localized forms of cancer. The notion of susceptibility to cancer to which the present invention refers can in fact be used in various clinical situations.
The search for relevant markers represents the challenge of predictive medicine. It is a technological challenge with respect to genomics, but also with respect to mathematics. The etiology relating to the causes and the progression of prostate cancers is complex and is the result of multiple stochastic interactions between constitutional genetic factors, acquired tissue factors and environmental factors. The conviction that genetic factors are important in the etiology of prostate cancer comes from the observation of clusters of cases in certain families [Carter B S Mendelian inheritance of familial prostate cancer, PNAS (89) 3367-71 (1992)]. It has been possible to demonstrate highly penetrant mutations, i.e. mutations the presence of which signifies a strong probability of becoming sick, such as those of the BRCA1 gene; see, for example [J. A Douglas et al., Common variation in the BRCA1 gene and prostate cancer risk Cancer Epidemiol Biomarkers Prev (16) pp 1510-6 (2007)].
Only 5% of prostate cancer cases appear to correspond to the simplest Mendelian inheritance model [G. Cancel-Tassin and O. Cussenot Prostate cancer genetics Minerva Urol Nefrol (4) p 289-300 (2005)]. The investigation of more complex interactions between alleles with low penetrance, i.e. in models where each allele contributes only slightly to the tumorigenesis process, has taken over from the search for a mutation in candidate genes. Thus, the search for genetic markers for thorough identification of the points in the genome that may be involved in susceptibility to prostate cancer has resulted in the implementation of association studies, such as “genome wide association studies” (GWAS), which produce genotyping data covering as much of the human genome as possible for DNA sequence polymorphisms. This genotyping, produced for control individuals and for individuals suffering from prostate cancer, should make it possible, by comparison, to identify polymorphisms statistically associated with the pathological condition of interest. For prostate cancer, three GWAS are currently the benchmark: Gudmundsson, J. et al., Genome-wide association study identifies a second prostate cancer susceptibility variant at 8q24 Nat Genet (39) p 631-7 (2007), Thomas G. et al., Multiple loci identified in a genome-wide association study of prostate cancer Nat Genet (40) p 310-5 (2008) and Eeles, R. A. Multiple newly identified loci associated with prostate cancer susceptibility Nat Genet (40) 316-21 (2008).
A second challenge for predictive medicine consists in modeling associations of variables [D. F. Easton Genome-wide association studies in cancer Hum Mol Genet (17) R109-15 (2008)], the complex analysis of combinations of variables being a particular field of algorithm research.
In this context, the present invention provides an individual prediction method for the screening or diagnosis or prognosis or therapeutic response of cancer and more particularly well suited to prostate cancer, based on the collection of very large amounts of genetic data to which clinical data can be attached and comprising the production of an advanced model which makes it possible to deliver a risk value which can be advantageously further subjected to a validation procedure.
More specifically, the subject of the present invention is an individual prediction method for the screening or diagnosis or therapeutic management or prognosis of prostate cancer comprising collecting individual input data (xi) and providing predictive information on the risk (y) linked to a type of disease, characterized in that:
the genetic input information comprising at least one variable or a combination of variables (all the nucleotide locations cited correspond to those defined by the “UCSC genome browser”, assembly of March 2006) among the following:
According to one variant of the invention, the input data correspond to the combination of a variable defining the genotype linked to the SNP rs2174183 and/or to one or more of its neighbors in the interval 127602673-128447913 of chromosome 4 and/or of a variable defining the genotype linked to the SNP rs7576160 and/or to one or more of its neighbors in the interval 37855761-38126567 of chromosome 2 and/or of a variable defining the genotype linked to the SNP rs2012385 and/or to one or more of its neighbors in the interval 241767109-242119399 of chromosome 2.
According to one variant of the invention, the input data correspond to the combination of a variable defining the genotype linked to the SNP rs2174183 and/or to one or more of its neighbors in the interval 127602673-128447913 of chromosome 4 and/or of a variable defining the genotype linked to the SNP rs2190453 and/or to one or more of its neighbors in the interval 17464539-17757162 of chromosome 11 and/or of a variable defining the genotype linked to the SNP rs888298 and/or to one or more of its neighbors in the interval 63815611-64165896 of chromosome 17.
According to one variant of the invention, the input data correspond to the combination of a variable defining the genotype linked to the SNP rs2174183 and/or to one or more of its neighbors in the interval 127602673-128447913 of chromosome 4 and/or of a variable defining the genotype linked to the SNP rs2788140 and/or to one or more of its neighbors in the interval 210157195-210446272 of chromosome 1 and/or of a variable defining the genotype linked to the SNP rs7934514 and/or to one or more of its neighbors in the interval 99092040-99333419 of chromosome 11.
According to one variant of the invention, the input data correspond to the combination of a variable defining the genotype linked to the SNP rs2174183 and/or to one or more of its neighbors in the interval 127602673-128447913 of chromosome 4 and/or of a variable defining the genotype linked to the SNP rs3828054 and/or to one or more of its neighbors in the interval 149382371-149874970 of chromosome 1 and/or of a variable defining the genotype linked to the SNP rs1499955 and/or to one or more of its neighbors in the interval 116302446-117011700 of chromosome 3.
According to one variant of the invention, the input data correspond to the combination of a variable defining the genotype linked to the SNP rs2174183 and/or to one or more of its neighbors in the interval 127602673-128447913 of chromosome 4 and of a variable defining the genotype linked to the SNP rs8110935 and/or to one or more of its neighbors in the interval 62026584-62294837 of chromosome 19.
According to one variant of the invention, the input data correspond to the combination of a variable defining the genotype linked to the SNP rs2174183 and/or to one or more of its neighbors in the interval 127602673-128447913 of chromosome 4 and of a variable defining the genotype linked to the SNP rs4855539 and/or to one or more of its neighbors in the interval 69049525-69153397 of chromosome 3 and/or of a variable defining the genotype linked to the SNP rs4242382 and/or to one or more of its neighbors in the interval 128539973-128619555 of chromosome 8.
According to one variant of the invention, the input data correspond to the combination of a variable defining the genotype linked to the SNP rs6492998 or to one of its neighbors in the interval 38991207-39584443 of chromosome 15 and of a variable defining the genotype linked to the SNP rs11526176 and/or to one or more of its neighbors in the interval 27414591-27808301 of chromosome 7 and of a variable defining the genotype linked to the SNP rs6681102 or to one of its neighbors in the interval 236815776-236998150 of chromosome 1.
According to one variant of the invention, the input data correspond to the combination of a variable defining the genotype linked to the SNP rs1511695 and/or to one or more of its neighbors in the interval 218280585-218521047 of chromosome 1 and of a variable defining the genotype linked to the SNP rs4669835 and/or to one or more of its neighbors in the interval 12111054-12324507 of chromosome 2 and of a variable defining the genotype linked to the SNP rs12605415 or to one of its neighbors in the interval 23907695-24187878 of chromosome 18.
According to one variant of the invention, the input data correspond to the combination of the four cancer history variables, of an age category variable, of a variable defining the genotype linked to the SNP rs4242384 and/or to one or more of its neighbors in the interval 128539973-128619555 of chromosome 8 and of a variable defining the genotype linked to the SNP rs9364048 and/or to one or more of its neighbors in the interval 70074721-70679396 of chromosome 6.
According to one variant of the invention, the input data correspond to the combination of a variable defining the genotype linked to the SNP rs749915 and/or to one or more of its neighbors in the interval 39097014-39163238 of chromosome 4 and of a variable defining the genotype linked to the SNP rs13226041 and/or to one or more of its neighbors in the interval 104002818-104863625 of chromosome 7 and of a variable defining the genotype linked to the SNP rs721429 and/or to one or more of its neighbors in the interval 61335448-62195826 of chromosome 17.
According to one variant of the invention, the input data correspond to the combination of a variable defining the genotype linked to the SNP rs2352946 and/or to one or more of its neighbors in the interval 84695541-84776802 of chromosome 16 and of a variable defining the genotype linked to the SNP rs6755695 and/or to one or more of its neighbors in the interval 79446556-79664842 of chromosome 2 and of a variable defining the genotype linked to the SNP rs1138253 and/or to one or more of its neighbors in the interval 4098195-4506560 of chromosome 19.
According to one variant of the invention, the input data correspond to the combination of a variable defining the genotype linked to the SNP rs13148138 and/or to one or more of its neighbors in the interval 127602673-128447913 of chromosome 4 and of a variable defining the genotype linked to the SNP rs1773842 and/or to one or more of its neighbors in the interval 29356293-29651117 of chromosome 10 and of a variable defining the genotype linked to the SNP rs10148742 and/or to one or more of its neighbors in the interval 43257771-43665346 of chromosome 14.
According to one variant of the invention, the input data correspond to the combination of a variable defining the genotype linked to the SNP rs2174183 and/or to one or more of its neighbors in the interval 127602673-128447913 of chromosome 4 and of a variable defining the genotype linked to the SNP rs11526176 and/or to one or more of its neighbors in the interval 27414591-27808301 of chromosome 7.
According to one variant of the invention, the input data correspond to the combination of a variable defining the genotype linked to the SNP rs2048873 and/or to one or more of its neighbors in the interval 113062733-113411386 of chromosome 2 and/or of a variable defining the genotype linked to the SNP rs6804627 and/or to one or more of its neighbors in the interval 60928379-60979489 of chromosome 3 and of a variable defining the genotype linked to the SNP rs10245886 and/or to one or more of its neighbors in the interval 47461234-47557773 of chromosome 7.
According to one variant of the invention, the individual prediction method relates to the screening, diagnosis, prognosis or therapeutic response of a prostate cancer, the data being of the clinical type such as individual data relating to the age of the patient, their weight, their height, the personal and family history of cancer, of the biological type with, for example, the PSA level, and of the genetic type such as the identification of genetic polymorphism markers considered to be linked to the development of the disease and selected from the abovementioned lists.
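As a purely illustrative sketch, one way of turning such heterogeneous data into a numeric input vector is shown below. The additive 0/1/2 allele-count encoding, the age cut-offs, the ordering of the fields and the alleles counted are all assumptions made for this example; the text does not specify an encoding.

```python
def encode_genotype(genotype, counted_allele):
    """Count copies of `counted_allele` in a two-letter genotype such as 'AG'."""
    if len(genotype) != 2:
        raise ValueError("expected a diploid genotype such as 'AG'")
    return sum(1 for allele in genotype.upper() if allele == counted_allele.upper())

def encode_age_category(age, bounds=(55, 65, 75)):
    """Map an age in years to a category index (cut-offs are hypothetical)."""
    return sum(1 for bound in bounds if age >= bound)

def build_input(age, history_flags, genotypes, counted_alleles):
    """Concatenate clinical and genetic variables into one input vector x."""
    x = [encode_age_category(age)]
    x.extend(int(flag) for flag in history_flags)  # e.g. four history variables
    for snp, genotype in sorted(genotypes.items()):
        x.append(encode_genotype(genotype, counted_alleles[snp]))
    return x

x = build_input(
    age=62,
    history_flags=[False, True, False, False],
    genotypes={"rs2174183": "AG", "rs4242382": "GG"},
    counted_alleles={"rs2174183": "A", "rs4242382": "A"},  # hypothetical alleles
)
```

Sorting the SNP identifiers keeps the position of each variable in the vector stable from one individual to the next, which any downstream model would rely on.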
According to one variant of the invention, the method of the invention comprises a “learning” process:
the constitution of a database of examples (Bex) consisting of input data (xmi) and of proven results (ym*);
the construction of at least one optimum model by statistical learning comprising the following steps:
the exploitation of the said optimum model from the said individual data (xi) so as to provide the said predictive information (y) on the risk linked to a disease.
According to one variant of the invention, the method comprises the construction, in parallel, of a set of optimum models, each model being produced from a family (Fk) of functions, the predictive information on the risk linked to a disease resulting from the exploitation of the set of optimum models.
According to one variant of the invention, the method comprises:
the creation of a learning base (BA) and a validation base (BV) from the examples base;
a process for validating the predictive result (y*) by comparison between the said predictive result obtained with a model constructed with the set of input data belonging to the learning base, and the proven result obtained from a set of similar input data belonging to the validation base.
According to one variant of the invention, the method comprises, for a given base comprising N data, the construction of the learning base carried out by random sampling (without replacement) of M data belonging to the examples base, N-M remaining data constituting the validation base.
According to one variant of the invention, the family of functions is of the MLP (Multi Layer Perceptron) type, a subset of the family of neural networks, or of the Support Vector Machines (SVM) type, or of the Relevance Vector Machines (RVM) type, or of the frequentist model type relating to the nearest neighbor method.
According to one variant of the invention, the estimation delivered by the model ym=fi (xmi, θj) is compared to the proven result ym* with a cost function of the cross-entropy score type in the case of the discrimination:
−[y*log(ƒ(x,θ))+(1−y*)log(1−ƒ(x,θ))]
or of the log likelihood criterion type noted
−log(P(y|x,θ))
and corresponding to the probability of obtaining y from the parameters x and θ or of the quadratic deviation type in the case of the regression:
(ƒ(x,θ)−y*)².
According to one variant of the invention, the comparison between the said predictive result obtained with a model constructed with the set of input data belonging to the learning base, and the proven result obtained from a set of input data belonging to the validation base is carried out with a cost function similar to that used in the comparison between the estimation delivered by the model and the proven result y*.
According to one variant of the invention, the final result of the modeling can be obtained by fusion of optimum models that can be constructed from two different sets of variables and obtained from different families of functions. In this fusion phase, it is useful to select the models to be fused and also the method of fusion to be implemented (model response means, product, majority vote, Choquet integral, Sugeno integral [Ludmila I. Kuncheva, James C. Bezdek, and Robert P. W. Duin. Decision templates for multiple classifier fusion: an experimental comparison. Pattern Recognition, 34:299-314, 2001]). This is because a strategy that will consist in fusing all of the optimum models constructed is not generally satisfactory. It is necessary to carry out a selection of an optimum subset of models from all the optimum models constructed, while having recourse to optimization methods, such as, for example, genetic algorithms.
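The simpler fusion rules listed above (mean of the model responses, product, majority vote) can be sketched as follows. This is a minimal illustration assuming each optimum model outputs a probability of belonging to the "patient" class; the function names, the normalization of the product rule and the 0.5 voting threshold are assumptions, and the Choquet and Sugeno integrals are not shown.

```python
def fuse_mean(probabilities):
    """Average of the individual model responses."""
    return sum(probabilities) / len(probabilities)

def fuse_product(probabilities):
    """Product rule for two classes, renormalized to give a probability."""
    pos, neg = 1.0, 1.0
    for p in probabilities:
        pos *= p
        neg *= 1.0 - p
    if pos + neg == 0.0:       # models contradict each other with certainty
        return 0.5
    return pos / (pos + neg)

def fuse_majority_vote(probabilities, threshold=0.5):
    """Return class 1 if more than half of the models vote for it, else 0."""
    votes = sum(1 for p in probabilities if p >= threshold)
    return 1 if 2 * votes > len(probabilities) else 0

outputs = [0.9, 0.6, 0.4]      # hypothetical responses of three optimum models
fused = (fuse_mean(outputs), fuse_product(outputs), fuse_majority_vote(outputs))
```

Selecting which models enter `outputs` is a separate optimization problem, as noted above; the sketch only covers the combination step.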
According to one variant of the invention, the individual clinical data correspond to the combination of four cancer history variables and of one age category variable, the said history variables relating respectively to the family history of breast cancer, the history of prostate cancer, the personal history of cancer and the family history of other cancers.
The subject of the invention is also an individual prediction device for the screening, diagnosis, prognosis or therapeutic response of a prostate cancer comprising first means for acquiring individual information data by a user, at least a first software interface on which the said first means operate, characterized in that it additionally comprises software implementing the method according to the invention and providing predictive information on the risk linked to prostate cancer.
According to one variant of the invention, the said predictive information on the risk is returned to the user via the said software interface.
According to one variant of the invention, the device additionally comprises means of communication between the first acquisition means and the software, allowing the transmission of the information data and that of the predictive information.
According to one variant of the invention, the device additionally comprises second individual information data acquisition means and a second software interface, the first acquisition means relating to the acquisition of information of the clinical type, and the second means relating to the acquisition of information derived from a sample from the individual.
The invention will be understood more clearly and other advantages will appear on reading the description which follows and which is given without limitation and by virtue of the accompanying figures among which:
FIGS. 3a to 3e illustrate respectively the performances of algorithms of the Multi-Layer Perceptron type in relation to discriminating between patients suffering from prostate cancer and controls with, as input variables, the age category and respectively the genotype associated with the SNP rs2969612, rs1167190, rs1314813, rs2174183 and rs1604724;
The benefit of the present invention lies in particular in making available to doctors a tool that helps in decision making for a personalized management of their patients. Its novelty lies in the combination of an exclusive database and multidimensional statistical analyses. The user can thus benefit from knowledge derived from multi-disciplinary research studies in medicine, biology, genetics and mathematics, and from objective results. The impact of this expert system is also economic because it allows practitioners to better detect the early and curable stages of the disease, and to reduce the costs and the side effects associated with invasive diagnostic and therapeutic methods. Finally, for the patient, the aim is to obtain an optimum management of their pathology, a reduction in the risk of overtreatment, an increase in their life expectancy and an improvement in their quality of life.
According to the invention, the prediction tool is produced by virtue of the upstream construction of statistical learning models. We are going to describe the principle of construction below.
A model, constructed in the context of the theory of statistical learning, is generally a parameterized mathematical function ƒ which contains adjustable parameters θ and belonging to a larger family of functions F.
This function makes it possible to deliver an estimation y as a function of a number of inputs x which are input variables of the problem.
In the case of the present invention:
This estimate y may be considered as being a function ƒ dependent on the inputs x and of the parameters θ.
The whole difficulty of creating a model lies in the adjustment of the parameters θ. These parameters θ are adjusted in a so-called learning phase which requires examples and the use of dedicated algorithms.
In general, all the models constructed by statistical learning require examples. Indeed, as a system capable of learning, these models use the principle of induction, that is to say learning by experience. The examples base consists of a set of N pairs (x, y*) representative of the process studied which it is desired to model.
The variable x is, as above, a value among a set of input values and y* is the real output associated with these inputs considered as the truth which it is desired to estimate (cancer/no cancer diagnosis delivered by a specialist for example). This database is represented in the form of a table of N lines, where each line represents an example (the input values for an individual and its associated class). The aim of the learning is to construct a model, from these N examples, in order to estimate in fine the response which the specialist would have given on a new case that has never been encountered. The expression “capacity for generalization” is used in this case. In the procedure for creating models, the one which will deliver the best capacity for generalization will be chosen.
The representativeness of the data is a very important notion since it determines the quality of the model constructed and since the information which can be learnt from the model is contained in the base through the N examples. The expression “representativeness” is understood to mean the exhaustive character of the cases contained in the base. That is to say that it should be ensured that the model has met a set of cases similar to those encountered in its future use as an estimator. The phase for constituting the learning base is therefore a key step and should be performed rigorously.
The following paragraph describes how the learning algorithm adjusts the parameters of the model according to the constituent elements of the learning base.
During the learning phase, the algorithm modifies the adjustable parameters θ of the model so that the estimation y is as close as possible to that of the proven result also called “supervisor” y*. The criterion which it is therefore desired to minimize by acting on the parameters θ is the deviation between the response of the model and the response of the supervisor on the cases available. This deviation can be obtained in various ways according to the problem treated and is called “cost function”:
Typically, the “cost function” which it is sought to minimize may be for example one of the following functions:
−[y*log(ƒ(x,θ))+(1−y*)log(1−ƒ(x,θ))];
−log(P(y|x,θ));
(ƒ(x,θ)−y*)².
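For illustration, the cross-entropy and quadratic cost functions can be written as below. This is a sketch, not the implementation used in the invention: the clipping constant `eps` and the averaging helper `mean_cost` are assumptions added so that the example runs safely on any input. For a two-class problem where ƒ(x,θ) is read as the probability of class 1, the cross-entropy coincides with the negative log likelihood.

```python
import math

def cross_entropy(y_true, p, eps=1e-12):
    """-[y* log(f) + (1 - y*) log(1 - f)] for one example, with f clipped."""
    p = min(max(p, eps), 1.0 - eps)   # avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def squared_error(y_true, y_pred):
    """(f - y*)^2 for one example (regression case)."""
    return (y_pred - y_true) ** 2

def mean_cost(cost, targets, outputs):
    """Average a per-example cost over a set of examples."""
    return sum(cost(t, o) for t, o in zip(targets, outputs)) / len(targets)

targets = [1, 0, 1]
outputs = [0.9, 0.2, 0.6]             # hypothetical model responses
score = mean_cost(cross_entropy, targets, outputs)
```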
The learning phase therefore consists in finding a set of parameters θ, for a function ƒi of the family F of functions which minimizes the cost function over all the examples, with the aid of the optimization algorithms.
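A minimal sketch of such a learning phase is given below, assuming a logistic model and plain gradient descent on the mean cross-entropy. Both choices are assumptions made for the example: the text only refers to "optimization algorithms", and the learning rate, number of epochs and toy data are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, theta):
    """Logistic model f(x, theta); theta[0] is a bias term."""
    return sigmoid(theta[0] + sum(t * xi for t, xi in zip(theta[1:], x)))

def fit(examples, n_inputs, rate=0.5, epochs=2000):
    """Adjust theta to minimize the mean cross-entropy over the examples."""
    theta = [0.0] * (n_inputs + 1)
    for _ in range(epochs):
        grad = [0.0] * len(theta)
        for x, y_true in examples:
            err = predict(x, theta) - y_true   # d(cost)/d(pre-activation)
            grad[0] += err
            for j, xi in enumerate(x):
                grad[j + 1] += err * xi
        theta = [t - rate * g / len(examples) for t, g in zip(theta, grad)]
    return theta

# Toy examples base: one input variable, class 1 for the larger values.
examples = [([0.0], 0), ([1.0], 0), ([2.0], 1), ([3.0], 1)]
theta = fit(examples, n_inputs=1)
```

After fitting, `predict([0.0], theta)` falls below 0.5 and `predict([3.0], theta)` above it, which is the behavior expected of a model that has reduced the cost on these examples.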
However, a model capable only of predicting information that is already known is of little benefit. It is necessary to ensure that it is capable of correctly predicting cases that are not present in the learning base but are represented by it, that is, cases which follow the same laws as those that served for the learning. That is why the examples base is generally split into a learning base BA, for adjusting the parameters of the model, and a validation base BV, for testing the model chosen and verifying its robustness.
The important thing is for the two sets to be as representative as possible of the total examples base on the one hand, and of the problem treated on the other hand. If the learning base is not representative, there is a risk of not correctly modeling the phenomenon under study. If the validation base is not representative, there is a risk of the validation scores giving a false idea of the performances of the models. Finally, if the examples base is not representative of the real cases, no practical application can be derived therefrom.
When sufficient data is available, the two sets (learning base and validation base) are constructed by randomly sampling the elements of the examples base. Thus, on the basis of N elements, a random selection is made of M which will be used for the training, and the remaining (N-M) will serve for the validation.
For the validation score not to be dependent on the particular sampling of a single partition of the total base into learning base and validation base, the procedure is repeated a number of times.
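The repeated random partition can be sketched as follows. The 1-nearest-neighbor classifier, the use of accuracy as the validation score, the number of repetitions and the toy data are assumptions chosen to keep the example short; any model and cost function could take their place.

```python
import random

def nearest_neighbor_predict(learning, x):
    """Return the class of the learning example closest to x (1-NN)."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(learning, key=lambda ex: dist2(ex[0], x))[1]

def repeated_validation(examples, m, repeats, seed=0):
    """Average validation accuracy over several random BA/BV partitions."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        shuffled = examples[:]
        rng.shuffle(shuffled)                    # sampling without replacement
        learning, validation = shuffled[:m], shuffled[m:]
        hits = sum(nearest_neighbor_predict(learning, x) == y
                   for x, y in validation)
        scores.append(hits / len(validation))
    return sum(scores) / len(scores), scores

# Toy examples base with N = 20; M = 14 examples go to the learning base.
examples = [([float(i)], int(i >= 10)) for i in range(20)]
mean_score, scores = repeated_validation(examples, m=14, repeats=25)
```

The spread of `scores` across partitions gives an idea of how much a single split could mislead.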
Accordingly, we are going to describe in greater detail the process proposed in the present invention.
In a first step, a family F of functions, the choice depending on the problem posed and the a priori knowledge thereof, is selected. Typically, in the context of the invention, the problem encountered falls in the category of problems of discrimination, that is to say that it is sought to classify new individuals into two groups: patients or controls.
In a second step, a type of function ƒi belonging to the family F is chosen.
In a third step, an optimum model ƒi(x,θ) is constructed by the learning procedure by adjusting the parameters θ.
This construction of a model is repeated with n−1 other functions so as to test a sufficient number of function types ƒ1, ƒ2, . . . , ƒn, and the respective qualities of their optimum models are compared.
In a fourth step, the function ƒi is selected which leads to the optimum model having the best validation score, thus determining the so-called function ƒi which “generalizes the best”.
In a fifth step, the parameters θ of the function selected in the preceding step are evaluated with all the examples of the learning base. The optimum model
ƒiop(x,θ)
is thus obtained which, from individual input data xi will be able to provide the predictive result y.
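The five steps can be sketched as a selection loop. In this illustration the family F is assumed to be nearest-neighbor models, the candidate function types differ by their number of neighbors k, and the validation score is a mean quadratic deviation averaged over repeated partitions. These choices, the helper names and the toy data are assumptions; for such a model, "adjusting the parameters" simply amounts to storing the examples.

```python
import random

def knn_predict(learning, x, k):
    """Fraction of class-1 examples among the k nearest learning examples."""
    ranked = sorted(
        learning,
        key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], x)),
    )
    return sum(y for _, y in ranked[:k]) / k

def validation_score(learning, validation, k):
    """Mean quadratic deviation on the validation base (lower is better)."""
    return sum((knn_predict(learning, x, k) - y) ** 2
               for x, y in validation) / len(validation)

def select_model(examples, candidate_ks, m, repeats=20, seed=0):
    """Pick the candidate with the best validation score, then refit it."""
    rng = random.Random(seed)
    totals = {k: 0.0 for k in candidate_ks}
    for _ in range(repeats):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        learning, validation = shuffled[:m], shuffled[m:]
        for k in candidate_ks:
            totals[k] += validation_score(learning, validation, k)
    best_k = min(candidate_ks, key=lambda k: totals[k])
    # Final step: the selected function type is rebuilt from all the examples.
    return best_k, (lambda x: knn_predict(examples, x, best_k))

examples = [([float(i)], int(i >= 15)) for i in range(30)]
best_k, model = select_model(examples, candidate_ks=[1, 3, 5, 7], m=20)
```

`model` then plays the role of the optimum model: given new input data it returns an estimate between 0 and 1.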
Among the numerous families of functions available, the following families may notably be mentioned:
MLPs (Multi Layer Perceptrons), a subset of the family of neural networks;
logistic regression (subset of the family of MLPs);
Support Vector Machines (SVMs);
Relevance Vector Machines (RVMs);
frequentist models related to the nearest-neighbor method.
Most of these types of function are notably described in the reference manual “Réseaux de Neurones, Méthodologie et Applications” by G. Dreyfus et al., Eyrolles Publishing or in “Pattern Recognition and Machine Learning” by C. M. Bishop, Springer 2006. The Relevance Vector Machines are described in “Sparse Bayesian learning and the relevance vector machine”, Tipping, M. E. (2001), Journal of Machine Learning Research 1, 211-244.
The main contribution of the models previously described, compared with the models already used to evaluate risks, lies in the non-linearity of the statistical learning models. Indeed, the models generally used are linear with respect to the parameters, which makes them easier to implement, generally at the cost of a lower predictive power. In the case of the models described above, which are non-linear with respect to the parameters, the implementation is more delicate but makes it possible:
to obtain, in general, better performances of the model;
to detect the synergies between input variables.
The possibility of exploiting the synergies between the input variables is an essential aspect of the inventive character of the subject of the present invention. It constitutes the main contribution of the collaboration of mathematicians in biological and medical discoveries in these studies. Indeed, the mathematical and statistical tools at the disposal of doctors and biologists generally do not make it possible to detect these synergies.
Furthermore, since these algorithms have high learning capacities, it is very important to be able to measure their performances in order to verify that they do not overfit the training examples (the expression learning “by heart” or “overlearning” is then used). The methodologies for statistical learning make it possible, notably by virtue of the use of the validation examples, to solve this problem and to ensure that the model obtained represents a general phenomenon and not a particular case of training examples. This makes it possible to model phenomena for which little or no a priori knowledge is available.
According to the present invention, a model is prepared that is capable, from the explanatory variables obtained, for example, from variable-selecting methodologies described in the present invention, of predicting a response interpreted as a probability of being a patient or a control.
The present problem falls in the category of problems of discrimination, that is to say that it is sought to classify new individuals into two groups: patients or controls.
Numerous families of functions are suited to the resolution of these problems. Some are very simple to carry out but do not make it possible to take into account the synergies between the variables. Now, it is not known a priori if such relationships exist or not. It is therefore advisable to choose a family of functions capable of taking account thereof if they exist.
A family that is simple to describe and generally effective is that of the Multi-Layer Perceptrons or MLPs. It is a type of neural network which is generally represented according to the scheme illustrated in
The mathematical formula is of the following form:
Where L is the “Logistic” function, Si are functions of the “Sigmoid” type (such as, for example, the “hyperbolic tangent” function), n is the number of hidden neurons, p is the number of input variables and θ denotes the parameter vector consisting of the components θi and θij for 1≦i≦n and 1≦j≦p. It should be noted that the mathematical object θ is different depending on whether it carries one or two indices. θij denotes the element ij of the matrix θ (matrix of the parameters between the inputs and the hidden neurons) and θi denotes the element i of the parameter vector between the hidden neurons and the output.
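The formula itself is not reproduced in this text. One form consistent with the definitions just given would be the following, written here without constant (bias) terms, which is an assumption since the text does not say whether such terms are included; the symbol x_j for the j-th input variable is likewise assumed:

```latex
f(x;\theta) \;=\; L\!\left(\sum_{i=1}^{n} \theta_i \, S_i\!\left(\sum_{j=1}^{p} \theta_{ij}\, x_j\right)\right),
\qquad
L(z) \;=\; \frac{1}{1+e^{-z}}
```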
Given that the number p of input variables is dictated by the problem treated, only the number n of hidden neurons may be chosen in the modeling phase. That is why the functions constituting the family of MLPs for the problem treated are differentiated solely by their number of “hidden neurons”, each of them representing in reality a sigmoid function. For example, the function representing the model obtained from a logistic regression, a modeling method that is well known in the medical field, belongs to this family. It is indeed a particular case of MLP having no hidden neuron. In this case, the model is linear with respect to the parameters and the construction of the model then uses learning techniques different from those used in the context of the MLPs.
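The structure described above can be illustrated by a minimal sketch, assuming hyperbolic tangents as the sigmoid functions and no constant (bias) terms. The names x, theta_hidden and theta_out and all numerical values are hypothetical and serve only as an illustration; the second function shows the particular case with no hidden neuron, where the inputs feed the logistic output directly.

```python
import math

def logistic(z):
    """Logistic function L, mapping a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def mlp_output(x, theta_hidden, theta_out):
    """One-hidden-layer MLP without bias terms (assumption).

    theta_hidden[i][j] links input j to hidden neuron i;
    theta_out[i] links hidden neuron i to the output.
    """
    hidden = [math.tanh(sum(w * xj for w, xj in zip(row, x)))
              for row in theta_hidden]
    return logistic(sum(t * h for t, h in zip(theta_out, hidden)))

def logistic_regression_output(x, theta):
    """Particular case with no hidden neuron: linear in the parameters."""
    return logistic(sum(t * xj for t, xj in zip(theta, x)))

# Illustrative values only: p = 3 inputs, n = 2 hidden neurons.
x = [1.0, 0.0, 2.0]
theta_hidden = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.4]]
theta_out = [1.2, -0.7]
risk = mlp_output(x, theta_hidden, theta_out)
```

The output always lies strictly between 0 and 1, which is what allows it to be read as a probability of being a patient or a control.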
The higher the number of hidden neurons an MLP possesses, the more it is capable of modeling complex phenomena. It has indeed been demonstrated that any continuous function on a bounded domain can be approximated to any desired accuracy by an MLP having a sufficient number of hidden neurons.
However, in the present case, only the modeling of “general” behaviors is taken into account, and not the specific characteristic of the individuals as present in the database. It is therefore advisable to find an MLP with an optimum number of hidden neurons in order to construct the model that is as general as possible. For that, it is possible to decide a priori to test 5 MLPs, each having from 1 to 5 hidden neurons, and to construct for each an optimum model which will be evaluated on validation data. The MLP having the best power for generalization is then selected.
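The selection step just described can be sketched as follows. The function validation_score is a hypothetical caller-supplied function that would train an MLP with a given number of hidden neurons and return its validation score; the scores in the example are made up for illustration and are not results from this work.

```python
def select_hidden_neurons(candidate_sizes, validation_score):
    """Return (best_size, scores) for the candidate with the highest score.

    Ties go to the first candidate listed, that is, the smallest network
    when the candidate sizes are given in ascending order.
    """
    scores = {n: validation_score(n) for n in candidate_sizes}
    best = max(scores, key=scores.get)
    return best, scores

# Illustrative scores only (hypothetical, not taken from the source).
made_up_scores = {1: 0.71, 2: 0.78, 3: 0.80, 4: 0.80, 5: 0.76}
best, scores = select_hidden_neurons(range(1, 6), made_up_scores.get)
```

Breaking ties in favor of the smaller network is a design choice of this sketch, in keeping with the aim of retaining the most general model.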
Taking into account the number of examples available, it is possible to carry out a simple random construction of the validation and training sets. However, as the data contain a lot of irrelevant information, it is not possible to be content with a single training/validation pair because there is a risk of constructing a model suited to a subproblem, and of validating it on something else. For this reason, the models are evaluated by a cross-validation procedure. The principle is the following:
By virtue of this procedure, all the data are used to calculate the validation score, which makes it possible to avoid focusing on these particular cases.
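The steps of the procedure are not reproduced in this text. A minimal sketch is given here, assuming a standard k-fold cross-validation in which each example serves exactly once for validation. The train and score arguments are hypothetical caller-supplied functions, not an interface defined by the source.

```python
import random

def k_fold_indices(n_examples, k, seed=0):
    """Shuffle the example indices and split them into k nearly equal folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validation_score(examples, k, train, score, seed=0):
    """Average validation score over k folds.

    train(training_examples) returns a model;
    score(model, validation_examples) returns a number.
    """
    fold_scores = []
    for held_out in k_fold_indices(len(examples), k, seed):
        held = set(held_out)
        training = [e for i, e in enumerate(examples) if i not in held]
        validation = [examples[i] for i in held_out]
        fold_scores.append(score(train(training), validation))
    return sum(fold_scores) / len(fold_scores)
```

Because the folds partition the example base, every example contributes to the validation score, which matches the property stated above.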
The cost function used for the training is partly dictated by the problem posed (discrimination) and the family of functions (MLP). In the present case, the cross-entropy may be advantageously used.
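For a two-class problem, the cross-entropy can be sketched as below, assuming targets coded 1 for patients and 0 for controls and model outputs interpreted as probabilities. The coding of the classes and the clipping constant eps are assumptions of this sketch.

```python
import math

def cross_entropy(targets, predictions, eps=1e-12):
    """Mean binary cross-entropy between 0/1 targets and predicted probabilities."""
    total = 0.0
    for y, p in zip(targets, predictions):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(targets)
```

The cost decreases as the predicted probabilities move toward the correct class, and an uninformative prediction of 0.5 for every individual gives a cost of log 2.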
The validation score is a measurement of the quality of the model. This score may correspond to its correct classification rate, that is the sum of the number of patients and of controls correctly identified, divided by the total number of individuals in the validation base. This score is simple to calculate and easy to interpret and use, although it masks the performances class by class (it may indeed happen that one of the classes is better identified than the other). This score may also be the AUC (Area Under Curve), that is to say the area under the ROC (Receiver Operating Characteristic) curve as illustrated in
These figures show how the discrimination performance evolves in the vicinity of the SNP rs2174183: an ROC curve has been established by replacing it in turn with the SNPs rs2969612, rs1167190, rs1314813 or rs1604724.
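The two validation scores mentioned above, the correct classification rate and the AUC, can be sketched as follows. The labels are assumed to be coded 1 for patients and 0 for controls, and the decision threshold of 0.5 is an assumption. The AUC is computed here through its pairwise interpretation: the proportion of patient/control pairs in which the patient receives the higher score, ties counting for one half.

```python
def classification_rate(labels, scores, threshold=0.5):
    """Fraction of individuals whose thresholded score matches their label."""
    correct = sum(1 for y, s in zip(labels, scores)
                  if (s >= threshold) == (y == 1))
    return correct / len(labels)

def auc(labels, scores):
    """Area under the ROC curve, computed over all patient/control pairs."""
    patients = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > c else 0.5 if p == c else 0.0
               for p in patients for c in controls)
    return wins / (len(patients) * len(controls))
```

Unlike the classification rate, the AUC does not depend on the choice of a threshold, which makes it convenient for comparing models or markers.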
Having made all the preceding choices, the procedure for selecting the “ideal” MLP function may be launched. The one which makes it possible to obtain the best validation score is selected in order to construct the final model.
For the so-called optimum final model, that is to say the one which is effectively used to calculate the risk, a training procedure is launched on the identified “ideal” function. The training set used is this time the entire example base because no validation is necessary any longer.
According to a more elaborate variant of the invention, it is also possible to produce an optimum model for each of various families of functions F, thus leading to a set of optimum models intended to process individual input data during use in order to provide a predictive result.
According to a more elaborate variant of the invention, it is also possible, for various families of functions F, to produce an optimum model resulting from a fusion of decision of other optimum models constructed from all or part of the input variables. This step, which leads to a more elaborate variant of the invention, falls within the scope of the seventh step described below.
The objective of the fusion of information is to improve decision making in terms of robustness and reliability from the combination, via a mathematical operator, of the decisions or of the scores provided by the family of functions [I. Bloch. Fusion d'informations numériques: panorama méthodologique. Dans Journées Nationales de la Recherche en Robotique, Guidel, Morbihan, Octobre 2005]. These operators should take advantage of the complementarities between the various functions at the start of the fusion but also take into consideration their irrelevance. The fusion operators are numerous [Ludmila I. Kuncheva, James C. Bezdek, and Robert P. W. Duin. Decision templates for multiple classifier fusion: an experimental comparison. Pattern Recognition, 34:299-314, 2001] and may be based on various mathematical formalisms such as the theory of probabilities, the theory of belief functions or fuzzy measurements [G. J. Klir and M. J. Wierman. Uncertainty-based information. Elements of generalized information theory, 2nd edition. Studies in fuzzyness and soft computing. Physica-Verlag, 1999].
Statistical or automated learning algorithms may moreover be used for a parametric fusion but they generally require more information a priori for the estimation of the fusion operator.
Regardless of the formalism used, the fusion operators may take the form of a table of rules of combination of the “logical AND/OR” type, of a product of scores with or without a priori which may be conditional or not as in the case of the fusion based on the generalized or non-generalized Bayes theorem [Ph. Smets. Beliefs functions: The Disjunctive Rule of Combination and the Generalized Bayesian Theorem. Int. Jour. of Approximate Reasoning, 9:1-35, 1993], of distances to models predefined by learning or expertise, of weighted sums with or without taking into account the interactions between the inputs of the fusion.
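A few of the operator types listed above can be sketched as follows. These are generic textbook forms given for illustration, not the operators retained by the invention. The product rule shown assumes a uniform prior and scores treated as independent, and the weights of the weighted sum are hypothetical.

```python
def fuse_and(decisions):
    """Logical AND: positive only if every model decides positive."""
    return all(decisions)

def fuse_or(decisions):
    """Logical OR: positive as soon as one model decides positive."""
    return any(decisions)

def fuse_product(scores):
    """Normalized product of probabilistic scores (uniform prior assumed)."""
    pos, neg = 1.0, 1.0
    for s in scores:
        pos *= s
        neg *= 1.0 - s
    return pos / (pos + neg)

def fuse_weighted_sum(scores, weights):
    """Weighted average of scores, without interaction terms."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
```

With the product rule, two concordant scores of 0.8 reinforce each other and give a fused score above 0.8, whereas a weighted sum never leaves the range of its inputs.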
The explanatory power and the interpretation of the results, which are important criteria for the medical and industrial applications, are generally a lot easier via the use of specific fusion operators instead of statistical or automated learning algorithms.
Accordingly, and according to the invention, once the prediction method has been constructed, it is possible to provide the user, typically the doctor or any other entity of the laboratory type, with a decision-support tool that is impartial and reliable and that allows personalized use at different stages of the patient's progress. A single tool thus makes it possible to perform hierarchical predictions from inputs of the clinical data or genetic data type, the said tool providing at the output an evaluation such as a risk or a degree of progression of the disease detected.
With such a tool, it becomes possible to perform an early and non-invasive identification of the risk of developing a prostate cancer with evaluation of the seriousness (including the risk of cancer as a function of occupational exposure to carcinogens, since genetic variants determine sensitivity to these agents to a greater or lesser degree).
It is also possible to evaluate the risk of recurrence of the cancers according to the treatment, including the validation of clinical trials for the pharmaceutical industry, in the form of an activity of a “data search” or biostatistical department.
It is also possible to evaluate the risks of complication of radiotherapy or brachytherapy (or of exposure to ionizing radiation in general), and the risks of other urological pathologies (benign prostatic hypertrophy, urinary incontinence).
Working on the genotype of patients makes it possible to access elements which may be decisive in the appearance of a pathology and which are easy to collect. Simply collecting a saliva sample indeed makes it possible to work easily on invariant constitutional DNA. The genetic material is informative because it is capable, by identification of the genetic profile, of determining the risk of developing the disease but also the risk of it being aggressive.
According to one example of use, the application is used by the practitioner, who enters the information available for a patient, such as, for example, the blood level of total PSA or of free PSA, the age, the weight, the height, the family and personal history, the results of examinations of the digital rectal examination type and the genotypes of interest. The practitioner selects the relevant questions and the application queries the statistical model or models at its disposal. The tool gives a personalized and hierarchical response with, for example, for prostate cancer, the risk of developing an aggressive cancer at a given age, the risk of developing metastases or a recurrence of the tumor after initial treatment (at a given age).
In this case, the information of the clinical type is sent by a patient or by a practitioner to the professional provider of results via communication networks which may be of the internet type.
In parallel, information obtained from samples of the blood and/or saliva type analyzed in a laboratory is also sent to the professional provider of results. All of the information is processed by the model(s) previously produced so as to give a predictive result, the said result being sent back to a health professional who is thus able to inform the patient thereof.
In parallel, a second user, which may be an analytical laboratory, sends another stream of information x2i, obtained from blood or saliva samples, which is acquired at a second interface 11 and likewise sent to the provider FRP via a remote link. After processing all the data received via an interface 12 set up by the provider FRP, the latter sends the result y to a third user U3 authorized to inform the patient in question. Typically, when the user U1 is the practitioner, there may only be two users U1 and U2. On the other hand, if the patient has the possibility of directly sending the information to the professional FRP, the result y cannot be directly sent to them by FRP.
The professional provider of results can at any time enrich their databases of examples by new cases treated so as to provide more efficient predictive results.
For submitting cases remotely, provision is made for protecting the personal data of each patient, compatible with the security and ethical rules in use.
Examples of combinations of input data or variables which are particularly suited to the calculation of the risk of onset of prostate cancer are described below.
A first variable is called “family history of prostate cancer”. The values for this variable make it possible to define the family context for the onset of prostate cancer of the patient. The values attributed to each individual depend on the age and/or the degree of relationship and/or the number of cases of onset of prostate cancer in their family.
A second variable is called “family history of breast cancer”. The values for this variable make it possible to define the family context for the onset of breast cancer of a patient. The values attributed to each individual depend on the age and/or the degree of relationship and/or the number of cases of onset of breast cancer in their family.
A third variable is called “personal history of cancer”. It makes it possible to identify the patients who have already had a cancer, regardless of its type.
A fourth variable is called “family history of other cancers”. The values for this variable define the family context for the onset of cancer (other than breast or prostate cancer) and depend on the age and/or the degree of relationship and/or the number of cases of onset of other forms of cancer for a given patient.
A fifth variable is the age, encoded in the form of age categories.
These variables can be used in combination or alone as input variables of relevant algorithms in order to obtain a calculation of the risk of onset of prostate cancer or to determine the predisposition to prostate cancer.
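As an illustration of how these five variables could be assembled into an input vector for a model, a minimal sketch follows. The text does not specify the coding of the values, so the integer codes, the age category limits in AGE_BOUNDS and all field names are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical age category limits in years (not given in the source).
AGE_BOUNDS = (50, 60, 70)

def age_category(age_years):
    """Index of the age bracket: 0 below the first limit, up to len(AGE_BOUNDS)."""
    return sum(1 for bound in AGE_BOUNDS if age_years >= bound)

@dataclass
class ClinicalInputs:
    family_history_prostate: int   # hypothetical integer code
    family_history_breast: int     # hypothetical integer code
    personal_history_cancer: int   # 1 if the patient has already had a cancer
    family_history_other: int      # hypothetical integer code
    age_years: float

    def as_vector(self):
        """Input vector in a fixed order, with the age replaced by its category."""
        return [self.family_history_prostate,
                self.family_history_breast,
                self.personal_history_cancer,
                self.family_history_other,
                age_category(self.age_years)]
```

For example, ClinicalInputs(2, 0, 0, 1, 63).as_vector() yields a five-component vector whose last component is the age category rather than the age itself.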
The predictive value of these variables is reinforced by their use in combination with markers of individual biological variability such as, for example, single nucleotide polymorphisms, also called SNPs (Single Nucleotide Polymorphisms). An essential property of genetic markers, to which SNPs belong, is their capacity to be transmitted in linkage disequilibrium with markers in their vicinity defined in terms of chromosomal location. The expression “genetic distance” between two markers or SNPs is used. Two markers are thus considered to be genetically linked when recombination between them is rare. The existence of these genetic linkages is responsible for the fact that the SNPs in the vicinity of an SNP of interest are capable of providing the same information or part of the information on a predisposition character. Since for each SNP the relevance of various SNPs present in its vicinity is available, it is possible to obtain for each SNP of great interest the list of neighboring SNPs which can provide information on the predisposition to prostate cancer. The definition of such an interval is of great interest from a practical point of view since it makes it possible to choose markers which provide relevant information among a list according to practical criteria such as the commercial availability of reagents and experimental criteria, for example.
The usual technique for choosing how to delimit intervals would be to calculate the linkage disequilibrium between an SNP and its neighbors, but it is not this notion that has been retained. These intervals have been delimited by correlation calculations actually based on the observation of an effect. The limit given is that beyond which an effect is no longer observed.
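The delimitation principle stated above can be sketched as follows, assuming that each neighboring marker has a position and a measured effect, and that the interval is the contiguous run of markers around the SNP of interest for which an effect is still observed. The data layout, the has_effect criterion and the numerical values are hypothetical; the text does not give the exact calculation.

```python
def delimit_interval(markers, index_of_interest, has_effect):
    """Return (start, end) positions of the interval around the SNP of interest.

    markers is a list of (position, effect) pairs sorted by position;
    has_effect(effect) says whether an effect is still observed.
    """
    low = high = index_of_interest
    while low > 0 and has_effect(markers[low - 1][1]):
        low -= 1
    while high < len(markers) - 1 and has_effect(markers[high + 1][1]):
        high += 1
    return markers[low][0], markers[high][0]

# Illustrative values only: positions and effect measures are made up.
markers = [(100, 0.0), (200, 0.3), (300, 0.9), (400, 0.4), (500, 0.05), (600, 0.5)]
interval = delimit_interval(markers, 2, lambda effect: effect >= 0.2)
```

In this example the marker at position 600 is excluded even though it shows an effect, because the marker at position 500 does not; whether such isolated markers should be kept is a choice the sketch makes, not one stated in the text.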
In the present application, mention is made of the use of an SNP of interest and/or of one or more of its neighbors. Indeed, each of the SNPs genetically linked to the SNP of interest is capable of providing all or part of the information provided by the SNP of interest. The genetic linkage depends on the physical distance between two genetic elements (in general expressed as nucleotides) and on the frequency of the recombinations between these two elements. The SNP of interest may itself be the causal agent of the predisposition which it is sought to predict, it may also simply be genetically linked to it. Through a transitivity effect, an SNP genetically linked to the SNP of interest will also be able to be genetically linked to the causal predisposition factor. This possibility explains the need to introduce a first “or”. The “and” is also derived from the property given by the genetic linkages. If the predisposition factor is positioned between two genetically linked SNPs, the fact that the alleles present for each SNP are recognized in an individual makes it possible to complete the information on the probability of presence of the causal agent of a predisposition. All these properties seemed to us to be best represented by the wording used in the claims.
Because the nucleotide position systems of reference are changeable, as much precision as possible has been given to the description of the SNPs of interest in the list which follows.
SNPs are currently the genetic markers most widely used, but it is obvious that each SNP can be replaced with a molecular biology marker of any nature so long as the physical or statistical link is obvious for those skilled in the art; the interchangeability of the variables is mathematically very simple to verify provided that there is information on the new variable for a sufficient number of individuals.
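Each entry of the list that follows associates an SNP of interest with a chromosome and an interval of positions. A minimal sketch of how such entries could be stored and queried is given here for three of them, with the interval limits copied from the entries below (UCSC genome browser, assembly of March 2006). The data structure and the function name are hypothetical.

```python
# (SNP of interest, chromosome, interval start, interval end)
INTERVALS = [
    ("rs2174183", "4", 127602673, 128447913),
    ("rs7576160", "2", 37855761, 38126567),
    ("rs4242382", "8", 128539973, 128619555),
]

def snps_of_interest_for(chromosome, position):
    """SNPs of interest whose interval contains the given marker position."""
    return [name for name, chrom, start, end in INTERVALS
            if chrom == chromosome and start <= position <= end]
```

Such a lookup indicates only that a candidate marker lies within a listed interval; whether it actually carries the information of the SNP of interest still has to be verified on data, as stated above.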
SNP rs2174183 located in 4q28.1 on chromosome 4 between the positions 127907634-127908134 according to the location determined by the UCSC genome browser, assembly of March 2006.
Genomic Sequence in the Vicinity of rs2174183: Polymorphic Nucleotide in Bold.
The SNPs in the vicinity of the SNP rs2174183 which can provide information on the predisposition to prostate cancer are defined in a database according to the following table and are positioned in the interval 127602673-128447913 of chromosome 4 or between the SNPs rs12651126 and rs13122922 on chromosome 4:
The relevance of the associated SNPs and of the SNP of interest for discriminating between patients suffering from prostate cancers and controls may be demonstrated by establishing ROC (“Receiver Operating Characteristic”) curves, which plot the sensitivity of a test against its false-positive rate, as illustrated in
SNP rs7576160 located in 2p22.2 on chromosome 2 between positions 37957978-37958478 according to the location determined by the UCSC genome browser, assembly of March 2006.
Genomic Sequence in the Vicinity of rs7576160: Polymorphic Nucleotide in Bold.
The SNPs in the vicinity of the SNP rs7576160 which can provide information on the predisposition to prostate cancer are defined in our database according to the following table and are positioned in the interval 37855761-38126567 of chromosome 2 or between the SNPs rs7562836 and rs17021897 of chromosome 2.
SNP rs2012385 located in 2q38.1 on chromosome 2 between positions 242070828 and 242071328 according to the location determined by the UCSC genome browser, assembly of March 2006.
Genomic Sequence in the Vicinity of rs2012385: Polymorphic Nucleotide in Bold.
The SNPs in the vicinity of the SNP rs2012385 which can provide information on the predisposition to prostate cancer are defined in our database according to the following table and are positioned in the interval 241767109-242119399 of chromosome 2 or between the SNPs rs1540528 and rs7567892 of chromosome 2.
SNP rs2190453 located in 11p15.1 on chromosome 11 between the positions 17489723-17490223 according to the location determined by the UCSC genome browser, assembly of March 2006.
Genomic Sequence in the Vicinity of rs2190453: Polymorphic Nucleotide in Bold
The SNPs in the vicinity of the SNP rs2190453 which can provide information on the predisposition to prostate cancer are defined in our database according to the following table and are positioned in the interval 17464539-17757162 of chromosome 11 or between the SNPs rs12278956 and rs1003921 of chromosome 11.
SNP rs888298 located in 17q24.2 on chromosome 17 between positions 63955680-63956180 according to the location determined by the UCSC genome browser, assembly of March 2006.
Genomic Sequence in the Vicinity of rs888298: Polymorphic Nucleotide in Bold.
The SNPs in the vicinity of the SNP rs888298 which can provide information on the predisposition to prostate cancer are defined in our database according to the following table and are positioned in the interval 63815611-64165896 of chromosome 17:
SNP rs8110935 located in 19q13.43 on chromosome 19 between positions 62239851-62240351 according to the location determined by the UCSC genome browser, assembly of March 2006.
Genomic Sequence in the Vicinity of rs8110935: Polymorphic Nucleotide in Bold.
The SNPs in the vicinity of the SNP rs8110935 which can provide information on the predisposition to prostate cancer are defined in our database according to the following table and are positioned in the interval 62026584-62294837 of chromosome 19 or between the SNPs rs1860565 and rs1565944 of chromosome 19.
SNP rs2788140 located in 1q32.3 on chromosome 1 between positions 210171227-210171727 according to the location determined by the UCSC genome browser, assembly of March 2006.
Genomic Sequence in the Vicinity of rs2788140: Polymorphic Nucleotide in Bold.
The SNPs in the vicinity of the SNP rs2788140 which can provide information on the predisposition to prostate cancer are defined in our database according to the following table and are positioned in the interval 210157195-210446272 of chromosome 1 or between the SNPs rs12135924 and rs7546833 of chromosome 1.
SNP rs7934514 located in 11q22.1 on chromosome 11 between positions 99214118-99214618 according to the location determined by the UCSC genome browser, assembly of March 2006.
Genomic Sequence in the Vicinity of rs7934514: Polymorphic Nucleotide in Bold.
The SNPs in the vicinity of the SNP rs7934514 which can provide information on the predisposition to prostate cancer are defined in our database according to the following table and are positioned in the interval 99092040-99333419 of chromosome 11 or between the SNPs rs605559 and rs12574821 of chromosome 11.
SNP rs3828054 located in 1q21.3 on chromosome 1 between positions 149779269-149779769 according to the location determined by the UCSC genome browser, assembly of March 2006.
Genomic Sequence in the Vicinity of rs3828054: Polymorphic Nucleotide in Bold.
The SNPs in the vicinity of the SNP rs3828054 which can provide information on the predisposition to prostate cancer are defined in our database according to the following table and are positioned in the interval 149382371-149874970 of chromosome 1 or between the SNPs rs11807526 and rs6702842 of chromosome 1.
SNP rs1499955 located in 3q13.31 on chromosome 3 between positions 116719413-116719913 according to the location determined by the UCSC genome browser, assembly of March 2006.
Genomic Sequence in the Vicinity of rs1499955: Polymorphic Nucleotide in Bold.
The SNPs in the vicinity of the SNP rs1499955 which can provide information on the predisposition to prostate cancer are defined in our database according to the following table and are positioned in the interval 116302446-117011700 of chromosome 3 or between the SNPs rs9289008 and rs2289271 of chromosome 3.
SNP rs4855539 located in 3p14.1 on chromosome 3 between positions 69108069-69108569 according to the location determined by the UCSC genome browser, assembly of March 2006.
Genomic Sequence in the Vicinity of rs4855539: Polymorphic Nucleotide in Bold.
The SNPs in the vicinity of the SNP rs4855539 which can provide information on the predisposition to prostate cancer are defined in our database according to the following table and are positioned in the interval 69049525-69153397 of chromosome 3:
SNP rs4242382 located in 8q24.21 on chromosome 8 between positions 128586505-128587005 according to the location determined by the UCSC genome browser, assembly of March 2006.
Genomic Sequence in the Vicinity of rs4242382: Polymorphic Nucleotide in Bold.
The SNPs in the vicinity of the SNP rs4242382 which can provide information on the predisposition to prostate cancer are defined in our database according to the following table and are positioned in the interval 128539973-128619555 of chromosome 8 or between the SNPs rs7830412 and rs4407842 of chromosome 8.
SNP rs11526176 located in 7p15.2 on chromosome 7 between positions 27546048-27546548 according to the location determined by the UCSC genome browser, assembly of March 2006.
Genomic Sequence in the Vicinity of rs11526176: Polymorphic Nucleotide in Bold.
The SNPs in the vicinity of the SNP rs11526176 which can provide information on the risk of onset of prostate cancer are defined in our database according to the following table and are positioned in the interval 27414591-27808301 of chromosome 7 or between the SNPs rs11761572 and rs2237344.
SNP rs6492998 located in 15q15.1 on chromosome 15 between positions 39333673-39334173 according to the location determined by the UCSC genome browser, assembly of March 2006.
Genomic Sequence in the Vicinity of rs6492998: Polymorphic Nucleotide in Bold.
The SNPs in the vicinity of the SNP rs6492998 which can provide information on the predisposition to prostate cancer are defined in our database according to the following table and are positioned in the interval 38991207-39584443 of chromosome 15:
SNP rs6681102 located in 1q43 on chromosome 1 between positions 236853987-236854487 according to the location determined by the UCSC genome browser, assembly of March 2006.
Genomic Sequence in the Vicinity of rs6681102: Polymorphic Nucleotide in Bold.
The SNPs in the vicinity of the SNP rs6681102 which can provide information on the predisposition to prostate cancer are defined in our database according to the following table and are positioned in the interval 236815776-236998150 of chromosome 1:
SNP rs2048873 located in 2q13 on chromosome 2 between positions 113139055-113139555 according to the UCSC genome browser numbering, assembly of March 2006.
Genomic Sequence in the Vicinity of rs2048873, Polymorphic Nucleotide in Bold
The SNPs in the vicinity of the SNP rs2048873 which can provide information on the predisposition to prostate cancer are defined in our database according to the following table and are positioned in the interval 113062733-113411386 of chromosome 2.
SNP rs6804627 located in 3p14.2 on chromosome 3 between positions 60963960-60964460 according to the UCSC genome browser numbering, assembly of March 2006.
Genomic Sequence in the Vicinity of rs6804627, Polymorphic Nucleotide in Bold
The SNPs in the vicinity of the SNP rs6804627 which can provide information on the predisposition to prostate cancer are defined in our database according to the following table and are positioned in the interval 60928379-60979489 of chromosome 3.
rs6804627
SNP rs10245886 located in 7p12.3 on chromosome 7 between positions 47546720-47547220 according to the UCSC genome browser numbering, assembly of March 2006.
Genomic Sequence in the Vicinity of rs10245886, Polymorphic Nucleotide in Bold
The SNPs in the vicinity of the SNP rs10245886 which can provide information on the predisposition to prostate cancer are defined in our database according to the following table and are positioned in the interval 47461234-47557773 of chromosome 7.
rs10245886
SNP rs1511695 located in 1q41 on chromosome 1 between positions 218514703-218515203 according to the UCSC genome browser numbering, assembly of March 2006.
Genomic Sequence in the Vicinity of rs1511695, Polymorphic Nucleotide in Bold
The SNPs in the vicinity of the SNP rs1511695 which can provide information on the predisposition to prostate cancer are defined in our database according to the following table and are positioned in the interval 218280585-218521047 of chromosome 1.
SNP rs4669835 located in 2p25.1 on chromosome 2 between positions 12289824-12290324 according to the UCSC genome browser numbering, assembly of March 2006.
Genomic Sequence in the Vicinity of rs4669835, Polymorphic Nucleotide in Bold
The SNPs in the vicinity of the SNP rs4669835 which can provide information on the predisposition to prostate cancer are defined in our database according to the following table and are positioned in the interval 12111054-12324507 of chromosome 2.
SNP rs12605415 located in 18q12.1 on chromosome 18 between positions 24135069-24135569 according to the UCSC genome browser numbering, assembly of March 2006.
Genomic Sequence in the Vicinity of rs12605415, Polymorphic Nucleotide in Bold
The SNPs in the vicinity of the SNP rs12605415 which can provide information on the predisposition to prostate cancer are defined in our database according to the following table and are positioned in the interval 23907695-24187878 of chromosome 18.
SNP rs749915 located in 4p14 on chromosome 4 between positions 39151013-39151513 according to the UCSC genome browser numbering, assembly of March 2006.
Genomic Sequence in the Vicinity of rs749915, Polymorphic Nucleotide in Bold
The SNPs in the vicinity of the SNP rs749915 which can provide information on the predisposition to prostate cancer are defined in our database according to the following table and are positioned in the interval 39097014-39163238 of chromosome 4.
SNP rs13226041 located in 7q22.2 on chromosome 7 between positions 104851579-104852079 according to the UCSC genome browser numbering, assembly of March 2006.
Genomic Sequence in the Vicinity of rs13226041, Polymorphic Nucleotide in Bold
The SNPs in the vicinity of the SNP rs13226041 which can provide information on the predisposition to prostate cancer are defined in our database according to the following table and are positioned in the interval 104002818-104863625 of chromosome 7.
rs13226041
SNP rs721429 located in 17q24.2 on chromosome 17 between positions 62122117-62122617 according to the UCSC genome browser numbering, assembly of March 2006.
Genomic Sequence in the Vicinity of rs721429, Polymorphic Nucleotide in Bold
The SNPs in the vicinity of the SNP rs721429 which can provide information on the predisposition to prostate cancer are defined in our database according to the following table and are positioned in the interval 61335448-62195826 of chromosome 17.
SNP rs9364048 located in 6q13 on chromosome 6 between positions 70455536-70456036 according to the UCSC genome browser numbering, assembly of March 2006.
Genomic Sequence in the Vicinity of rs9364048, Polymorphic Nucleotide in Bold
The SNPs in the vicinity of the SNP rs9364048 which can provide information on the predisposition to prostate cancer are defined in our database according to the following table and are positioned in the interval 70074721-70679396 of chromosome 6.
SNP rs4242384 located in 8q24.21 on chromosome 8 between positions 128586505-128587005 according to the UCSC genome browser numbering, assembly of March 2006.
Genomic Sequence in the Vicinity of rs4242384, Polymorphic Nucleotide in Bold
The SNPs in the vicinity of the SNP rs4242384 which can provide information on the predisposition to prostate cancer are defined in our database according to the following table and are positioned in the interval 128539973-128619555 of chromosome 8.
rs4242384
SNP rs2352946 located in 16q24.1 on chromosome 16 between positions 84758022-84758522 according to the UCSC genome browser numbering, assembly of March 2006.
Genomic Sequence in the Vicinity of rs2352946, Polymorphic Nucleotide in Bold
The SNPs in the vicinity of the SNP rs2352946 which can provide information on the predisposition to prostate cancer are defined in our database according to the following table and are positioned in the interval 84695541-84776802 of chromosome 16.
rs2352946
SNP rs6755695 located in 2p12 on chromosome 2 between positions 79511959-79512459 according to the UCSC genome browser numbering, assembly of March 2006.
Genomic Sequence in the Vicinity of rs6755695, Polymorphic Nucleotide in Bold
The SNPs in the vicinity of the SNP rs6755695 which can provide information on the predisposition to prostate cancer are defined in our database according to the following table and are positioned in the interval 79446556-79664842 of chromosome 2.
rs6755695
SNP rs1138253 located in 19p13.3 on chromosome 19 between positions 4276183-4276683 according to the UCSC genome browser numbering, assembly of March 2006.
Genomic Sequence in the Vicinity of rs1138253, Polymorphic Nucleotide in Bold
The SNPs in the vicinity of the SNP rs1138253 which can provide information on the predisposition to prostate cancer are defined in our database according to the following table and are positioned in the interval 4098195-4506560 of chromosome 19.
rs1138253
SNP rs10148742 located in 14q21.3 on chromosome 14 between positions 43356636-43357136 according to the UCSC genome browser numbering, assembly of March 2006.
Genomic Sequence in the Vicinity of rs10148742, Polymorphic Nucleotide in Bold
The SNPs in the vicinity of the SNP rs10148742 which can provide information on the predisposition to prostate cancer are defined in our database according to the following table and are positioned in the interval 43257771-43665346 of chromosome 14.
rs10148742
SNP rs1773842 located in 10p11.23 on chromosome 10 between positions 29389042-29389542 according to the UCSC genome browser numbering, assembly of March 2006.
Genomic Sequence in the Vicinity of rs1773842, Polymorphic Nucleotide in Bold
The SNPs in the vicinity of the SNP rs1773842 which can provide information on the predisposition to prostate cancer are defined in our database according to the following table and are positioned in the interval 29356293-29651117 of chromosome 10.
rs1773842
The so-called cancer history variables and the age category variable may be combined with the SNPs mentioned above as input variables of algorithms of the logistic regression, MLP, SVM or RVM type, or of another type of statistical learning algorithm. The classifiers thus obtained can be used as they are, but it is also possible to optimize the performance of the tool by producing meta-classifiers, which are developed by fusing the classifiers. This fusion operation is similar to that of variable selection, a step during which the optimization, with respect to a certain fusion criterion, comes from the search for complementarity between the classifiers. Classifiers or meta-classifiers can then be used for carrying out a calculation of risk of prostate cancer.
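As a rough illustration of the fusion step, the sketch below combines the risk scores of several base classifiers by a weighted average. The weighted-average rule, the name `fuse_scores` and the numerical values are assumptions made for illustration only; the description does not state which fusion criterion was actually used.

```python
from typing import List, Sequence


def fuse_scores(scores: Sequence[Sequence[float]],
                weights: Sequence[float]) -> List[float]:
    """Weighted average of per-classifier risk scores (hypothetical fusion rule).

    scores[k][i] is the score in [0, 1] that classifier k gives to individual i.
    """
    if len(scores) != len(weights):
        raise ValueError("one weight per classifier is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n = len(scores[0])
    if any(len(s) != n for s in scores):
        raise ValueError("all classifiers must score the same individuals")
    # For each individual, average the classifier scores using the weights.
    return [sum(w * s[i] for w, s in zip(weights, scores)) / total
            for i in range(n)]


# Illustrative values only: three base classifiers scoring two individuals.
base_scores = [[0.2, 0.9], [0.4, 0.7], [0.3, 0.8]]
fused = fuse_scores(base_scores, weights=[1.0, 1.0, 2.0])
```

In practice the weights would be chosen on validation data so that complementary classifiers reinforce one another; that choice is outside the scope of this sketch.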
It would be possible not to combine the family history or the age directly with the SNPs, and instead to constitute a meta-classifier using them in a second step; however, among all the possible combinations of input variables, in addition to the current biological and clinical data (such as the PSA), the following were selected as being particularly relevant (all the nucleotide locations cited correspond to those defined by the UCSC genome browser, assembly of March 2006):
the combination of the four cancer history variables, that is to say family history of prostate cancer, family history of breast cancer, personal history of cancer, family history of other cancers, and an age category variable;
the combination of the four cancer history variables, an age category variable and a variable defining the genotype linked to the SNP rs2174183 or to one of its neighbors in the interval 127602673-128447913 of chromosome 4;
the combination of the four cancer history variables, an age category variable, a variable defining the genotype linked to the SNP rs2174183 and/or to one or more of its neighbors in the interval 127602673-128447913 of chromosome 4 and/or a variable defining the genotype linked to the SNP rs7576160 and/or to one or more of its neighbors in the interval 37855761-38126567 of chromosome 2 and/or a variable defining the genotype linked to the SNP rs2012385 and/or to one or more of its neighbors in the interval 241767109-242119399 of chromosome 2;
the combination of the four cancer history variables, an age category variable, a variable defining the genotype linked to the SNP rs2174183 and/or to one or more of its neighbors in the interval 127602673-128447913 of chromosome 4 and/or a variable defining the genotype linked to the SNP rs2190453 and/or to one or more of its neighbors in the interval 17464539-17757162 of chromosome 11 and/or a variable defining the genotype linked to the SNP rs888298 and/or to one or more of its neighbors in the interval 63815611-64165896 of chromosome 17;
the combination of the four cancer history variables, an age category variable, a variable defining the genotype linked to the SNP rs2174183 and/or to one or more of its neighbors in the interval 127602673-128447913 of chromosome 4 and/or a variable defining the genotype linked to the SNP rs2788140 and/or to one or more of its neighbors in the interval 210157195-210446272 of chromosome 1 and/or a variable defining the genotype linked to the SNP rs7934514 and/or to one or more of its neighbors in the interval 99092040-99333419 of chromosome 11;
the combination of the four cancer history variables, an age category variable, a variable defining the genotype linked to the SNP rs2174183 and/or to one or more of its neighbors in the interval 127602673-128447913 of chromosome 4 and/or a variable defining the genotype linked to the SNP rs3828054 and/or to one or more of its neighbors in the interval 149382371-149874970 of chromosome 1 and/or a variable defining the genotype linked to the SNP rs1499955 and/or to one or more of its neighbors in the interval 116302446-117011700 of chromosome 3;
the combination of the four cancer history variables, an age category variable, a variable defining the genotype linked to the SNP rs2352946 and/or to one or more of its neighbors in the interval 84695541-84776802 of chromosome 16 and a variable defining the genotype linked to the SNP rs6755695 and/or to one or more of its neighbors in the interval 79446556-79664842 of chromosome 2 and a variable defining the genotype linked to the SNP rs1138253 and/or to one or more of its neighbors in the interval 4098195-4506560 of chromosome 19;
the combination of the four cancer history variables, an age category variable, a variable defining the genotype linked to the SNP rs2174183 and/or to one or more of its neighbors in the interval 127602673-128447913 of chromosome 4 and a variable defining the genotype linked to the SNP rs8110935 and/or to one or more of its neighbors in the interval 62026584-62294837 of chromosome 19;
the combination of the four cancer history variables, an age category variable, a variable defining the genotype linked to the SNP rs2174183 and/or to one or more of its neighbors in the interval 127602673-128447913 of chromosome 4 and a variable defining the genotype linked to the SNP rs4855539 and/or to one or more of its neighbors in the interval 69049525-69153397 of chromosome 3 and a variable defining the genotype linked to the SNP rs4242382 and/or to one or more of its neighbors in the interval 128539973-128619555 of chromosome 8;
the combination of the four cancer history variables, an age category variable, a variable defining the genotype linked to the SNP rs2174183 and/or to one or more of its neighbors in the interval 127602673-128447913 of chromosome 4 and a variable defining the genotype linked to the SNP rs11526176 and/or to one or more of its neighbors in the interval 27414591-27808301 of chromosome 7;
the combination of the four cancer history variables, an age category variable, a variable defining the genotype linked to the SNP rs6492998 and/or to one of its neighbors in the interval 38991207-39584443 of chromosome 15 and/or a variable defining the genotype linked to the SNP rs11526176 and/or to one or more of its neighbors in the interval 27414591-27808301 of chromosome 7 and/or a variable defining the genotype linked to the SNP rs6681102 and/or to one of its neighbors in the interval 236815776-236998150 of chromosome 1;
the combination of the four cancer history variables, an age category variable, a variable defining the genotype linked to the SNP rs2048873 and/or to one or more of its neighbors in the interval 113062733-113411386 of chromosome 2 and/or a variable defining the genotype linked to the SNP rs6804627 and/or to one or more of its neighbors in the interval 60928379-60979489 of chromosome 3 and a variable defining the genotype linked to the SNP rs10245886 and/or to one of its neighbors in the interval 47461234-47557773 of chromosome 7;
the combination of the four cancer history variables, an age category variable, a variable defining the genotype linked to the SNP rs1511695 and/or to one or more of its neighbors in the interval 218280585-218521047 of chromosome 1 and a variable defining the genotype linked to the SNP rs4669835 and/or to one or more of its neighbors in the interval 12111054-12324507 of chromosome 2 and/or a variable defining the genotype linked to the SNP rs12605415 and/or to one of its neighbors in the interval 23907695-24187878 of chromosome 18;
the combination of the four cancer history variables, an age category variable, a variable defining the genotype linked to the SNP rs749915 and/or to one or more of its neighbors in the interval 39097014-39163238 of chromosome 4 and/or a variable defining the genotype linked to the SNP rs13226041 and/or to one or more of its neighbors in the interval 104002818-104863625 of chromosome 7 and/or a variable defining the genotype linked to the SNP rs721429 and/or to one or more of its neighbors in the interval 61335448-62195826 of chromosome 17;
the combination of the four cancer history variables, an age category variable, a variable defining the genotype linked to the SNP rs4242384 and/or to one or more of its neighbors in the interval 128539973-128619555 of chromosome 8 and a variable defining the genotype linked to the SNP rs9364048 and/or to one of its neighbors in the interval 70074721-70679396 of chromosome 6;
the combination of the four cancer history variables, an age category variable, a variable defining the genotype linked to the SNP rs2352946 and/or to one or more of its neighbors in the interval 84695541-84776802 of chromosome 16 and a variable defining the genotype linked to the SNP rs6755695 and/or to one or more of its neighbors in the interval 79446556-79664842 of chromosome 2 and a variable defining the genotype linked to the SNP rs1138253 and/or to one of its neighbors in the interval 4098195-4506560 of chromosome 19;
the combination of the four cancer history variables, an age category variable, a variable defining the genotype linked to the SNP rs13148138 and/or to one or more of its neighbors in the interval 127602673-128447913 of chromosome 4 and/or a variable defining the genotype linked to the SNP rs1773842 and/or to one or more of its neighbors in the interval 29356293-29651117 of chromosome 10 and a variable defining the genotype linked to the SNP rs10148742 and/or to one or more of its neighbors in the interval 43257771-43665346 of chromosome 14.
On the basis of the SNP list presented, it is highly probable that relevant information on predisposition to breast cancer and to other forms of cancer can be obtained on the same principle as the invention. In order to verify this, it would be necessary to put together a database of examples of patients suffering from the form of cancer of interest and of controls, to compile their medical files, and either to reuse the combinations of input variables given above or to re-run a small process of variable selection in order to form small, more specific combinations. A process of statistical learning and of meta-modeling could then be re-initiated. Since the various forms of cancer share tumorigenesis mechanisms, it is probable that the relevant information can be obtained in this way.
Example of a Method According to the Invention Using Certain SNP Selections and Comparison with Prediction Methods of the Known Art:
According to one method example, the present invention was developed in two steps, one aimed at selecting the relevant genetic markers that constitute the core of the tool and a second step consisting in carrying out the mathematical modeling that can take them into consideration in order to establish a risk calculation.
The method of the present invention was developed on the basis of the following steps. Using data specific to the Centre de Recherche pour les Pathologies Prostatiques “CeRePP” [Prostate Disease Research Center], established by Professor Cussenot and collaborators thereof, 1315 individuals having given their consent were referenced. They belong to two separate categories: patients suffering from prostate cancer and controls. In order to limit the appearance of statistical biases, the two categories of individuals were paired in the best way possible, the most obvious example of a variable to be balanced being age.
Since the probability of developing prostate cancer varies with age, patients and controls should have age distributions as close as possible, otherwise the artifact linked to this statistical bias with respect to age may be unduly exploited by the statistical learning algorithms, as a discriminating variable, leading to incorrect modeling.
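The pairing of patients and controls on age can be sketched as follows. This is a minimal greedy matching written for illustration; the function name `match_by_age`, the tolerance `max_gap` and the ages are hypothetical, and the description does not say which pairing procedure was actually applied.

```python
def match_by_age(patient_ages, control_ages, max_gap=2):
    """Greedily pair each patient with the closest-aged unused control.

    Returns a list of (patient_index, control_index) pairs. A patient is
    left unpaired when no remaining control is within max_gap years.
    """
    available = sorted(enumerate(control_ages), key=lambda c: c[1])
    pairs = []
    for p_idx, p_age in sorted(enumerate(patient_ages), key=lambda p: p[1]):
        if not available:
            break
        # Position of the remaining control whose age is closest to the patient's.
        best = min(range(len(available)),
                   key=lambda j: abs(available[j][1] - p_age))
        c_idx, c_age = available[best]
        if abs(c_age - p_age) <= max_gap:
            pairs.append((p_idx, c_idx))
            available.pop(best)  # each control is used at most once
    return pairs


# Illustrative ages only.
pairs = match_by_age([61, 70, 55], [54, 72, 60, 80])
```

With matched age distributions, a learning algorithm cannot separate patients from controls simply by exploiting an age imbalance in the sample.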
The medical files of the patients contain the status with respect to prostate cancer, the family history of prostate cancer, the family history of breast cancer, the family history of other cancers, and the personal history of cancer.
The individuals considered were then genotyped sufficiently thoroughly to cover the entire genome. With regard to the analysis, the applicant was able to provide individual genotypes for 27188 SNPs distributed over the 24 chromosomes of the human genome.
The 27188 SNPs and also the other variables were then subjected to a process of variable selection using, for example, genetic algorithms or mutual information calculations, both of which are described below.
Genetic algorithms belong to the evolutionary algorithm family. Their name does not come from the possible applications in the field of genetics, but from an analogy between how they operate and the theories of evolution of the living world. They are generally used to solve optimization problems. The principle is to generate a population of potential solutions in the solution search space. Each potential solution is evaluated by a function, known as “fitness” function, adapted to the problem to be treated. At each iteration of the algorithm, new potential solutions are generated in the search space by selecting the best solutions of the preceding iteration and making use of two other functions, namely combinations and mutations. More specifically:
These operations are inspired by the theory of evolution in order to cause the solution population to gradually evolve toward the optimum solution. These genetic algorithms can therefore be used in the variable-selection phase, where each potential solution is a model constructed from a set of variables. Only the sets of variables which make it possible to obtain the best models are used.
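The principle of a genetic algorithm applied to variable selection can be sketched as below. Everything here is an illustrative assumption: the function name `genetic_selection`, the population size, the mutation rate, and above all the toy fitness function, which simply counts how many variables of a subset belong to a hidden "informative" set. In a real setting the fitness would be the measured performance of a model built from the candidate subset.

```python
import random


def genetic_selection(n_vars, fitness, subset_size=3, pop_size=20,
                      generations=30, mutation_rate=0.2, seed=0):
    """Search for a subset of variable indices that maximizes `fitness`."""
    rng = random.Random(seed)
    population = [frozenset(rng.sample(range(n_vars), subset_size))
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the better half of the current population.
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Combination: draw the child's variables from both parents.
            child = set(rng.sample(sorted(a | b), subset_size))
            # Mutation: occasionally swap one variable for a random one.
            if rng.random() < mutation_rate:
                child.remove(rng.choice(sorted(child)))
                outside = [v for v in range(n_vars) if v not in child]
                child.add(rng.choice(outside))
            children.append(frozenset(child))
        population = parents + children
    return max(population, key=fitness)


# Toy problem (hypothetical): variables 2, 5 and 7 are the informative ones.
informative = {2, 5, 7}


def toy_fitness(subset):
    return len(subset & informative)


best = genetic_selection(n_vars=10, fitness=toy_fitness)
```

Because the better half of each generation is carried over unchanged, the best fitness found never decreases from one iteration to the next.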
Mutual information is a measure derived from information theory which consists in quantifying the mutual dependence of two random variables (or groups of random variables).
More strictly, the mutual information of two continuous random variables X and Y is defined in the following way:
$$I(X;Y) = \int_{Y}\int_{X} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}\,dx\,dy$$
where p(x,y) is the joint probability of X and Y, and where p(x) and p(y) are, respectively, the marginal probabilities of X and of Y. In the context of discrete random variables, the integrals are replaced with sums in the following way:
$$I(X;Y) = \sum_{y \in Y}\sum_{x \in X} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}$$
The mutual information quantifies the mutual dependence of two random variables X, Y or two groups of variables X, Y, i.e., in which measure knowledge regarding X reduces the uncertainty regarding Y. This mutual information calculation can therefore be used in the context of a selection of variables using this measure to determine the mutual dependence of a variable, or a group of variables (in this case, the SNPs), with the output (the status).
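For discrete variables such as a SNP genotype and the patient/control status, the mutual information can be estimated from observed frequencies. The sketch below is one possible way to do so; the function name `mutual_information`, the choice of base-2 logarithm (result in bits) and the toy genotypes are assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Estimate I(X;Y) in bits from paired discrete observations."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with counts.
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


# Toy data (hypothetical): status 1 = patient, 0 = control.
status = [0, 0, 1, 1]
mi_dependent = mutual_information(["AA", "AA", "AG", "AG"], status)
mi_independent = mutual_information(["AA", "AG", "AA", "AG"], status)
```

In the first toy case the genotype determines the status exactly and the estimate is 1 bit; in the second the genotype carries no information about the status and the estimate is 0.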
The first step in the work carried out by the applicant therefore consisted of a variable selection or dimension reduction.
It was thus able to isolate SNPs in small groups. The originality of these groups lies in the complementarity or the synergy between the SNPs that the algorithm calculations made it possible to demonstrate.
Some of the SNPs selected in the present invention, such as rs2174183, are not directly located in a gene; the biological function to which they are linked is unknown and could be elucidated with knowledge of complex regulations, such as epigenetic regulations or microRNAs, which are entirely new and are emerging in the carcinogenesis field.
In addition to the SNPs discovered by virtue of implementing the methods described in the present invention, mention may be made of the example of the SNP rs4242382, which was already identified in the literature, in particular in the article by G. Thomas et al., Multiple loci identified in a genome-wide association study of prostate cancer, Nature Genetics, vol 40, num3, March 2008. In this article, the SNPs are selected on the basis of their p-value. The authors thus identified the SNP rs4242382, which the applicant also identified by means of its methods. In addition, said methods made it possible to identify a synergy between this SNP and two other SNPs among the 27188 SNPs available in the base. This group of 3 SNPs is identified as group B1. The applicant then compared the performances obtained by the models constructed from group B1 with the performances of the models constructed from the best 3 SNPs, in the sense of the p-values, of the Nature Genetics article. The results are presented in
These groups of SNPs discovered (each group contains a few SNPs) possibly in synergy with “history” and “age” variables, were then used as input data for the construction of models of patient/control discrimination by statistical learning.
At this stage, it is possible to establish the performance of the discrimination by means of a ROC curve. At the end of this modeling and validation phase, a statistical model is provided which has been constructed from input data of SNP and/or age and/or history type and which can be used on new data of the same types in order to estimate the status of an individual when the latter is unknown. The models therefore make it possible to recognize an individual who is at risk of prostate cancer according to certain performances illustrated by the ROC curves. It was thus possible to provide a series of models which themselves served as input data for establishing a meta-model by “fusion” techniques.
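The area under the ROC curve (AUC), used later in the description to compare models, can be computed directly from the scores assigned to patients and controls. The sketch below uses the pairwise (Mann-Whitney) formulation; the function name `roc_auc` and the example scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for patients, 0 for controls.
    scores: model output, higher meaning higher estimated risk.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # patient ranked above control
            elif p == q:
                wins += 0.5   # ties count for half
    return wins / (len(pos) * len(neg))


# Illustrative values only.
auc = roc_auc([1, 1, 0, 0, 1, 0], [0.9, 0.6, 0.7, 0.2, 0.8, 0.4])
```

An AUC of 0.5 corresponds to a model that discriminates no better than chance, and an AUC of 1 to perfect separation of patients and controls.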
The result is a method for the discrimination of individuals suffering or not suffering from prostate cancer, which is original by virtue of the variable-selection methods used, the SNPs and the combinations of which it is constituted, the modeling and then the meta-modeling, or fusion, carried out and also the extent of the performances obtained.
The age of the patients and the family history of cancer, carefully encoded, are represented in the input data. This is because interactions were found between these variables and the SNPs that were discovered. While it was known that the history contains information that is highly predictive with respect to the risk of prostate cancer (and, moreover, the risk of cancer in general), it is the interaction with the SNPs that were discovered that constitutes the added value of our work.
The invention can therefore be presented in the following way:
The particular feature of the invention is to make it possible to discriminate individuals suffering from prostate cancer and healthy individuals, i.e., when the individuals are of unknown status, it makes it possible to identify those having a healthy-individual or affected-individual profile, and the degree of predisposition of said individuals to prostate cancer. For practical use, the degree of predisposition to prostate cancer may be given, for example, by means of a calculation of risk at a given age, by means of a curve of risk variation as a function of age, the tool as a whole finally taking the form of a practical application.
The alleles at risk are unspecified for each SNP; this knowledge, which is advantageous for studying the biological mechanism involved, is not essential to the operation of the invention, since it is, in the end, a very complex combination of the values of the input variables that can be associated with a particular risk. Thus, in a group containing three different SNPs chosen as input variables, each SNP can be represented by two different alleles, which gives 3 genotypes per SNP and 27 different genetic profiles when the three are combined (3 SNP1 genotypes×3 SNP2 genotypes×3 SNP3 genotypes). The risk information with the best performance is linked to each particular combination among the 27. For about ten combinations of SNPs distributed over several groups, it would therefore be necessary to characterize 270 genetic profiles; this is not required for correct operation of the invention and was not required for its design, since it is precisely a question of automatic learning, and the algorithms used establish and use the relevant genotype-risk association rules.
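The count of 27 genetic profiles for a group of three biallelic SNPs can be checked by enumeration. In the sketch below the SNP names and alleles are placeholders chosen for illustration, not the alleles of the SNPs of the invention.

```python
from itertools import product


def genotypes(allele_a, allele_b):
    """The three unordered genotypes of a biallelic SNP."""
    return [allele_a + allele_a, allele_a + allele_b, allele_b + allele_b]


# Hypothetical alleles for a group of three SNPs.
snp_alleles = {"SNP1": ("A", "G"), "SNP2": ("C", "T"), "SNP3": ("A", "C")}

# Every combination of one genotype per SNP: 3 x 3 x 3 profiles.
profiles = list(product(*(genotypes(a, b) for a, b in snp_alleles.values())))

n_profiles = len(profiles)      # 27 profiles for one group of three SNPs
n_for_ten_groups = 10 * n_profiles  # 270 profiles for ten such groups
```

A learning algorithm associates a risk with each of these profiles without the at-risk allele of any single SNP having to be named.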
In order to use the invention, it is necessary to know the genetic profile of an individual and to have collected the biological data thereof. This can currently be carried out simply by those skilled in the art. For this, it is necessary to collect a sample of body fluid or tissues, to extract the DNA therefrom by means of a process well known to those skilled in the art of molecular biology, and to establish the genotype of each individual with respect to the SNPs of interest by means of a method to be chosen from the various technologically or commercially available solutions; simply, PCR TaqMan® (Applied Biosystems) genotyping techniques or conventional DNA sequencing techniques can be used.
The results obtained with the method of the invention are compared with those obtained and published by Zheng S L, Sun J, Wiklund F, et al., Cumulative association of five genetic variants with prostate cancer. N Engl J Med 2008; 358:910-9. The efficiency of the SNP selection carried out in the context of the invention is also compared with the efficiency of the selection carried out and published in the article G. Thomas et al., Multiple loci identified in a genome-wide association study of prostate cancer, Nature Genetics, vol 40, num3, March 2008.
In the remainder of the description, the following model names are agreed:
The first article relates to 5 SNPs having a link with prostate cancer. According to the authors, each SNP has a moderate link, but when the 5 SNPs are combined, the predictive capacity of the models is improved.
The following SNPs are involved: rs4430796, rs1859962, rs16901979, rs6983267 and rs1447295.
The authors use age, region, family history identified in terms of antecedents, called “Atcd”, and the five SNPs to construct their models (identified as model 3 in the article). They obtain an AUC for this model of 0.633 (the confidence interval at 95% being 0.617 to 0.65).
The aim of the comparison is to determine the information contributed by adding the SNPs described in the article and the information contributed by adding the SNPs obtained on the basis of the methods described in the present invention.
The comparison is carried out according to several steps:
The results are presented in
Finally, the applicant compared models constructed with the same SNP groups (NEJM and D2) without using the history variables in order to measure the contribution of the SNPs alone.
The results are presented in
It should also be noted that the performances of the model of the present invention are better with fewer SNPs. Specifically, the NEJM model contains 5 SNPs, whereas the D2 model of the invention contains only 3 SNPs. This comparison makes it possible to conclude that the SNP selection described in the present invention makes it possible to create models which obtain better AUCs and therefore have a greater capacity for discrimination.
The applicant also established comparisons with the results published in the study by G. Thomas et al., Multiple loci identified in a genome-wide association study of prostate cancer, Nature Genetics, vol 40, num3, March 2008.
The team which published this study is part of the CGEMS consortium, i.e. they use the same 27188 SNPs as those presented in the present invention, but on different populations. Their strategy for detecting the SNPs of interest is based on calculating the p-values (statistical test). The aim of the comparison is to determine the information contributed by adding the SNPs described in the article and the information contributed by adding the SNPs obtained using the methods described in the present invention.
The comparison is carried out according to several steps:
The results are presented in
A comparison with the same NG1 and D2 groups was carried out by the applicant without using the history variables. The results are presented in
Finally, the applicant carried out a comparison of the same type on the basis of the best 7 SNPs of the Nature Genetics article. The experimental procedure is identical:
The results are presented in
In conclusion, it appears that, in any event, the models of the present invention have better performance levels than those constructed from the SNPs of the known art.
Number | Date | Country | Kind |
---|---|---|---|
08 04414 | Aug 2008 | FR | national |
This application is a National Stage of International patent application PCT/EP2009/059930, filed on Jul. 31, 2009, which claims priority to foreign French patent application No. FR 08 04414, filed on Aug. 1, 2008, the disclosures of which are incorporated by reference in their entirety.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/EP2009/059930 | 7/31/2009 | WO | 00 | 3/29/2011 |