The invention provides a method for selecting at least one potential marker molecule indicating a user defined phenotype feature of an organic object.
To investigate pathways within an organism, contrast agents CA are used.
Marker molecules MM can be located on a surface of a cell, within a cell or can be any molecules involved in a biochemical pathway of an organism.
It is an object of the present invention to provide a method and a system for automatically selecting potential marker molecules MM indicating a user defined phenotype feature of an organic object, such as an organism.
The invention provides a method and a system for selecting at least one potential marker molecule indicating a user defined phenotype feature of an organic object. In an embodiment according to the present invention, genotype data of genes of a group of organic objects and phenotype data of said group of organic objects are provided. Then the genotype data and the phenotype data are categorized to generate categorized data of said group of organic objects. The phenotype feature is related statistically to the generated categorized data to extract genes or gene combinations having a strong statistical relationship with the phenotype feature. The extracted genes and proteins corresponding to the extracted genes are selected as potential marker molecules.
In an embodiment of the method according to the present invention, the genotype data includes different types of genotype data comprising allelic data of the genes as a first type of genotype data stored in a first data format, gene expression data as a second type of genotype data stored in a second data format, and proteomic data of proteins corresponding to the genes as a third type of genotype data stored in a third data format.
In one embodiment of the method according to the present invention, the phenotype data includes different types of phenotype data comprising imaging data as a first type of phenotype data stored in a first data format,
blood profile data as a second type of phenotype data stored in a second data format,
urine metabolic data as a third type of phenotype data stored in a third data format,
physical data as a fourth type of phenotype data stored in a fourth data format,
demographic data as a fifth type of phenotype data stored in a fifth data format, and
user defined phenotype feature data as a sixth type of phenotype data stored in a sixth data format.
In an embodiment of the method according to the present invention, the different types of genotype data and the different types of phenotype data are each categorized by performing the following steps: normalizing the data to generate normalized data, calculating a relevant indicative value on the basis of said normalized data, and comparing the calculated value to at least one user defined threshold value to generate the categorized data.
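The three categorization steps can be illustrated with a minimal Python sketch. It rests on assumptions that are not part of the disclosure: min-max scaling is used as the normalization, the mean of the normalized values of each organic object serves as the indicative value, and the function name `categorize`, the labels "high" and "low", the default threshold and the sample values are purely illustrative.

```python
def categorize(raw, threshold=0.5):
    """Map each object id to 'high' or 'low'.

    raw: dict mapping an object id to a list of raw measurements.
    threshold: user defined threshold on the indicative value (assumed in [0, 1]).
    """
    all_values = [v for values in raw.values() for v in values]
    lowest, highest = min(all_values), max(all_values)
    span = highest - lowest
    categories = {}
    for obj, values in raw.items():
        # Step 1: normalize (assumption: min-max scaling over the whole group).
        normalized = [(v - lowest) / span if span else 0.0 for v in values]
        # Step 2: indicative value (assumption: mean of the normalized values).
        indicative = sum(normalized) / len(normalized)
        # Step 3: compare with the user defined threshold.
        categories[obj] = "high" if indicative >= threshold else "low"
    return categories


# Hypothetical measurements for three organic objects.
raw = {
    "object_1": [10.0, 12.0],
    "object_2": [18.0, 20.0],
    "object_3": [13.0, 15.0],
}
categories = categorize(raw, threshold=0.5)
```

Because every data type is reduced to the same kind of categorical label, data stored in different formats can afterwards be handled by one and the same statistical step.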
In an embodiment of the method according to the present invention, the phenotype feature is related statistically with the generated categorized data by means of a machine learning algorithm.
In one embodiment of the method according to the present invention, the machine learning algorithm is a learning Bayesian network algorithm.
In one embodiment of the method according to the present invention, each categorized type of data forms a node of a network, wherein statistical relationships between said nodes are extracted by means of a machine learning algorithm.
In one embodiment of the method according to the present invention, each type of genotype data and each type of phenotype data is stored in a corresponding database.
In a preferred embodiment of the method according to the present invention, for each marker molecule a complementary contrast agent, which is attachable to the marker molecule, is selected.
The selected contrast agent is used for molecular imaging of an activation state of a pathway in which the marker molecule is involved.
In an embodiment of the method according to the present invention, imaging of said pathway is performed by means of X-rays, magnetic resonance, ultrasound or nuclear radiation sensing devices.
In a preferred embodiment of the method according to the present invention, the phenotype feature is related statistically to the generated categorized data by specifying statistical dependencies between said phenotype feature and the generated categorized data.
The investigated organic objects are formed either by cells, organic tissues, organs, organisms, human beings, plants or micro-organisms.
The invention further provides a system for selecting at least one marker molecule indicating a phenotype feature of an organic object comprising:
a first database for storing genotype data of genes of a group of organic objects,
a second database for storing phenotype data of said group of organic objects, and
a calculation unit connected to the first and the second database for categorizing the genotype data and the phenotype data to generate the categorized data of the group of organic objects,
wherein the calculation unit relates statistically the phenotype feature with the generated categorized data to extract genes having a strong statistical relationship with the phenotype feature,
wherein the extracted genes and proteins corresponding to the extracted genes are output by the calculation unit as marker molecules.
In a preferred embodiment, for each selected marker molecule a complementary contrast agent, which is selectively attachable to the marker molecule, is selected. The selected contrast agent can be used for molecular imaging of a pathway in which said marker molecule is involved.
In the following, preferred embodiments of the method and the system for selecting potential marker molecules indicating a user defined phenotype feature of an organic object are described with reference to the enclosed drawings and the detailed description below.
As can be seen from
The calculation unit 4 relates in step S3 a user defined phenotype feature of the investigated organic object with the generated categorized data to extract genes G having a strong statistical relationship with the phenotype feature. The extracted genes G and proteins P corresponding to the extracted genes are output by the calculation unit 4 after step S4 as potential marker molecules MM. For the selected potential marker molecules, corresponding complementary contrast agents CA, which are attachable to the respective marker molecules, are selected in step S5. The selected contrast agents CA can be used for molecular imaging of the pathway of said organism in which the marker molecule MM is involved.
In the embodiment shown in
The third database 3A stores mass spectroscopic data as a type of phenotype data in a corresponding data format. The fourth database 3B stores image data as a further type of phenotype data in another corresponding data format.
As can be seen from
On the basis of the different data sources storing genotype data and phenotype data, the computer system 1 according to the present invention separately categorizes the respective genotype data and the respective phenotype data of each database 2, 3 to extract categorized data to a generic feature layer. In this way, it is possible to handle heterogeneous data from different data sources. After the data has been categorized by means of user defined input, the computer system 1 subsequently relates statistically a user defined phenotype feature of the investigated organism with the categorized data to extract genes having a strong statistical relationship with this phenotype feature. In an embodiment of the computer system 1 according to the present invention, the statistical relation is performed by correlating the phenotype feature with the generated categorized data. The extracted genes and proteins corresponding to the extracted genes are selected by the computer system 1 as potential marker molecules MM for which complementary contrast agents CA can be found.
The correlation analysis is run at the meta-layer level so that it is independent of the structure of the data giving rise to the feature combination. In a preferred embodiment, the user defined phenotype feature is related statistically with the generated categorized data by means of a machine learning algorithm. This machine learning algorithm is in a preferred embodiment a learning Bayesian network algorithm.
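Correlating a phenotype feature with categorized data may be sketched as follows. The sketch assumes that the categories have been encoded as 0 and 1, that the Pearson coefficient is used as the measure of correlation, and that a gene counts as strongly related when the absolute coefficient reaches a cut-off of 0.8. The gene names, the cut-off and the function names are hypothetical and are not taken from the disclosure.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no information
    return cov / sqrt(var_x * var_y)


def select_candidates(categorized, phenotype, min_abs_r=0.8):
    """Return the names of genes whose category column tracks the phenotype."""
    return sorted(
        gene
        for gene, column in categorized.items()
        if abs(pearson(column, phenotype)) >= min_abs_r
    )


# Hypothetical categorized data for six organic objects (1 = high, 0 = low).
categorized = {
    "gene_a": [1, 1, 0, 0, 1, 0],
    "gene_b": [1, 0, 1, 0, 1, 0],
    "gene_c": [0, 0, 1, 1, 0, 1],
}
phenotype = [1, 1, 0, 0, 1, 0]  # e.g. 1 = poor prognosis, 0 = good prognosis
candidates = select_candidates(categorized, phenotype)
```

Since the function only sees category columns, it does not depend on the format of the data source from which a column was derived.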
The modularity of the computer system 1 according to the present invention allows flexible adaptation to user needs. Emphasis is put on data pre-processing and feature extraction used to generate the meta-layer categorized data as shown in
First, the user defines the phenotype feature for which he wishes to find a potential marker molecule MM. For instance, the user defines as the phenotype feature whether the patient has a good or poor prognosis. The selected phenotype feature is related statistically with the categorized data as shown in
A researcher might want to search for SNPs and genes that are likely to be involved in a disease mechanism and to find corresponding marker molecules. This is done by using the search function for BioChip databases and an SNP database. For the investigated patients, genetic testing is performed specifying the allele combinations, and the results are stored in the computer system 1. Transparent to the user, the computer system 1 initiates an upload of allele data to the SNP database and keeps the link to the experiment and patient. Subsequently, a number of gene expression experiments are carried out, possibly under different conditions, e.g. before and after treatment, in early disease, in progressed disease etc. The resulting expression data is also stored in the computer system 1. Transparent to the user, the computer system 1 initiates an upload to a BioChip database and keeps links to the patients and the experiment. Finally, the investigated patients are in parallel imaged and phenotyped in various other ways. The resulting data is stored in the computer system 1. On the basis of this data, the researcher extracts genotype/phenotype relationships, gene expression/phenotype relationships and possibly mutative molecular disease pathways.
Furthermore, the researcher might be primarily interested in studying the impact of certain SNPs upon signal transduction pathways which later may cause diseases. The researcher collects information about all genes which are known to participate in a certain signal transduction pathway. In the next step, the SNPs are identified which are in or close to one of the respective genes within a range defined by a certain threshold. The SNPs are then classified into coding and non-coding, wherein the latter are only accepted in case they are within a known enhancer-promoter region of a gene and part of an intronic sequence that could play a role in splicing or alternative splicing. The coding SNPs are subclassified into synonymous and non-synonymous, wherein the latter are used for subsequent analysis. The impact of the SNPs is analyzed, i.e. whether they might have an impact on the protein structure or not. Based on the SNP pattern which has been identified by said process, a representative patient population, i.e. a test group, is searched for in the database which contains individuals having one or more of these SNPs. The control group of individuals having none of these SNPs is collected as well.
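The SNP filtering and the formation of the test and control groups described above may be sketched as follows. All identifiers are hypothetical. The acceptance criterion for non-coding SNPs is collapsed into a single precomputed flag `in_regulatory_region`, and the distance threshold of 5000 base pairs is an arbitrary illustrative value; both are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    snp_id: str
    distance_to_gene: int               # base pairs; 0 means inside a pathway gene
    coding: bool
    synonymous: bool = False            # only meaningful for coding SNPs
    in_regulatory_region: bool = False  # only meaningful for non-coding SNPs


def keep_for_analysis(snp, max_distance=5000):
    """Apply the distance threshold and the coding / non-coding rules."""
    if snp.distance_to_gene > max_distance:
        return False
    if snp.coding:
        return not snp.synonymous       # only non-synonymous coding SNPs are kept
    return snp.in_regulatory_region     # non-coding SNPs need the regulatory flag


def split_groups(patients, selected_ids):
    """Test group: at least one selected SNP. Control group: none of them."""
    chosen = set(selected_ids)
    test = sorted(p for p, carried in patients.items() if chosen & set(carried))
    control = sorted(p for p, carried in patients.items() if not (chosen & set(carried)))
    return test, control


# Hypothetical SNP records and patient genotypes.
snps = [
    Snp("snp1", distance_to_gene=0, coding=True, synonymous=False),
    Snp("snp2", distance_to_gene=0, coding=True, synonymous=True),
    Snp("snp3", distance_to_gene=1200, coding=False, in_regulatory_region=True),
    Snp("snp4", distance_to_gene=900, coding=False),
    Snp("snp5", distance_to_gene=80000, coding=True),
]
selected = [s.snp_id for s in snps if keep_for_analysis(s)]

patients = {"p1": ["snp1"], "p2": ["snp2", "snp4"], "p3": ["snp3", "snp5"], "p4": []}
test_group, control_group = split_groups(patients, selected)
```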
For both above described scenarios, the user, i.e. the researcher, can identify molecules which are involved in pathways of the organism, e.g. tRNA, mRNA, proteins etc. These found marker molecules MM are then the primary target for contrast agent development, said contrast agents CA being selectively attachable to the target molecules. The found contrast agents CA are then used for image acquisition with X-rays, magnetic resonance, ultrasound or nuclear radiation sensing devices.
Some data stored in databases is already categorical, such as SNP data. In contrast, gene expression data requires the step of gene selection. In the computer system 1 according to the present invention, manual gene selection is supported as well as a number of data driven gene selection techniques.
The system 1 according to the present invention provides univariate tests, such as correlation and statistical dependency analysis, to check for differential expression with respect to the experimental conditions, e.g. time, pharmacological treatment, drug dose etc., as well as correlation and statistical tests to check for differential expression with respect to genotypic information, i.e. the behavior of one SNP or haplotype and the occurrence of a pattern of SNPs motivated by their location and potential impact on the expression of a certain gene or group of genes.
Both independent quantities, i.e. experimental conditions and SNP variance, are present in the feature meta-layer and are henceforth available for analysis. In an embodiment of the present invention, a T-test, ANOVA, a Chi-square dependency test, an entropy test, a Kolmogorov-Smirnov test, a Markov blanket and mutual information are provided to check correlations. In a preferred embodiment, in order to avoid that the tests yield many false positives, false discovery thresholding, a logic combination of tests and performance estimation by cross-validation are provided.
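The text does not specify how the false discovery thresholding is carried out. One widely used procedure is the Benjamini-Hochberg step-up rule, and the following sketch assumes that choice. The function name, the level alpha = 0.05 and the p-values are illustrative only.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per test: True if the test is kept as a discovery.

    Assumption: the Benjamini-Hochberg step-up rule is used as the
    false discovery threshold.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value satisfies p_(k) <= alpha * k / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    # Every test ranked at or below the cut-off is accepted.
    accepted = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            accepted[idx] = True
    return accepted


# Hypothetical p-values from four univariate tests, one per gene.
p_values = [0.02, 0.9, 0.035, 0.03]
accepted = benjamini_hochberg(p_values, alpha=0.05)
```

In this example the smallest p-value alone would not pass its own threshold; the step-up rule nevertheless accepts the three smallest p-values because the third one lies below its rank-dependent threshold.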
After the gene selection, discretization of the expression levels is performed. Depending on the type of distribution and the type of discretization, the discretization thresholds are determined according to the standard deviation of the expression levels or by minimizing the entropy, applying the minimum description length principle across patients.
Once the feature meta-layer is generated by the use of feature extraction components, genotype/phenotype relations are learned on the basis of a predictive model. For this purpose, association mining and collaborative filtering are deployed for an unsupervised screening of the data.
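The standard-deviation variant of the discretization may be sketched as follows. The sketch assumes three levels with thresholds at the mean plus and minus k standard deviations across patients; the labels, the factor k = 1 and the sample values are illustrative. The entropy-based variant using the minimum description length principle is not shown.

```python
from statistics import mean, pstdev


def discretize(levels, k=1.0):
    """Discretize the expression levels of one gene across patients.

    Assumption: values below mean - k*sd are 'under', values above
    mean + k*sd are 'over', everything else is 'baseline'.
    """
    mu = mean(levels)
    sigma = pstdev(levels)
    lower, upper = mu - k * sigma, mu + k * sigma
    labels = []
    for value in levels:
        if value < lower:
            labels.append("under")
        elif value > upper:
            labels.append("over")
        else:
            labels.append("baseline")
    return labels


# Hypothetical expression levels of one selected gene in five patients.
levels = [1.0, 5.0, 5.0, 5.0, 9.0]
labels = discretize(levels)
```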
In addition, robust learning Bayesian networks are applied with causal interpretation to extract, by a machine learning process, relationships between different entities of the feature level. Each feature type, e.g. SNP, gene expression and tumour size, is represented by a node of the network. The machine learning consists of finding statistical relationships between the nodes, which are graphically represented by edges and a set of probability values.
The stored genotype data is treated as unconditional causes. During the learning process, the genotype data are related to other features like gene expression levels and to phenotypic outcomes. The effect of experimental conditions is taken into account by including them in the network to be learned as well.
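Learning a full Bayesian network is beyond the scope of a short example. The following sketch shows only a simplified stand-in: it scores every pair of nodes by the mutual information of their categorized values, which indicates which edges carry a strong statistical relationship. It is not a Bayesian network learner and assigns neither edge directions nor conditional probability tables. The node names and values are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def mutual_information(x, y):
    """Mutual information (in bits) of two equally long categorical sequences."""
    n = len(x)
    count_x, count_y, count_xy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log2((c / n) / ((count_x[a] / n) * (count_y[b] / n)))
        for (a, b), c in count_xy.items()
    )


def rank_edges(nodes):
    """Score every pair of nodes and return the pairs, strongest first."""
    scores = {
        (u, v): mutual_information(nodes[u], nodes[v])
        for u, v in combinations(sorted(nodes), 2)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical categorized features for eight patients.
nodes = {
    "snp": ["AA", "AA", "AG", "AG", "GG", "GG", "AA", "GG"],
    "expression": ["low", "low", "high", "high", "high", "high", "low", "high"],
    "tumour_size": ["small", "large", "small", "large", "small", "large", "small", "large"],
}
ranked = rank_edges(nodes)
strongest_edge, strongest_score = ranked[0]
```

In this toy data set the expression category is fully determined by the SNP genotype, so the pair of these two nodes receives the highest score, while tumour size is only weakly related to either of them.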
After the machine learning, the following probabilistic knowledge can be extracted:
Feature selection: Many nodes in the network have no or only weak interaction with others. Some nodes, however, strongly interact with each other and/or with phenotypic features. These nodes are identified as key features on the basis of the data.
Causal pathways: Relationships and associations between different molecular or macroscopic entities are made explicit.
Predictive power: By taking into account many features simultaneously, a superior predictive power is achieved, i. e. SNP, gene, protein and/or metabolite combinations forming a biomarker.
Generative modeling: Once generated, the predictive model is used to play in-silico what-if scenarios, i.e. to conduct virtual experiments before these experiments are actually carried out in the wet lab.
Stratification: Patients are stratified into groups which may have similar molecular and phenotype feature patterns.
Personalization: The analysis allows revealing the differences in the patient population with respect to responses to drugs and other forms of treatment, consequently leading to a personalized treatment by avoiding potential risk factors.
Feedback for the experimentalist: The analysis allows a comparison of the diagnostic and predictive power of each modality. Therefore, it is possible to make improvement suggestions for both the sample preparation and the data acquisition.
In a preferred embodiment, the method for selecting at least one potential marker molecule indicating a user defined phenotype feature of an organic object is performed by a program stored on a data carrier.