Claims
- 1. A method of determining clinically relevant information from gene expression data, comprising:
conducting a statistical analysis of the gene expression data to identify trends and dependencies among the data; and deriving a probabilistic model from the gene expression data, the probabilistic model being indicative of a probable classification of the data into clinically relevant information.
- 2. The method of claim 1, wherein the probabilistic model is derived using a discrete Bayesian analysis.
- 3. The method of claim 1, further comprising generating an empirical probability distribution function (pdf) of stochastic variables in accordance with the gene expression data and extracting information regarding class membership of clinically relevant information with respect to a new set of data values.
- 4. The method of claim 1, wherein the gene expression data is processed by:
determining an estimate for one or more hypothesis-conditional probability density functions p(x|Hk) for a set X of the gene expression data conditioned on a set H of hypotheses relating to the gene expression data; determining a set of prior probability density functions p(Hk) for each hypothesis of the set H; and determining a set of posterior test-conditional probability density functions p(Hk|x) for the hypotheses conditioned on new data x; wherein the p(x|Hk) estimates include a global estimate produced in accordance with uncertainties in the statistical characteristics of the gene expression data relating to each hypothesis-conditional pdf p(x|Hk).
- 5. The method of claim 4, wherein the uncertainties in the statistical characteristics are specified as an ellipsoid about the gene expression data for each hypothesis and each ellipsoid is defined by an m-dimensional ellipsoid Eq,k for each hypothesis Hk and is specified by:
- 6. The method of claim 4, wherein the hypothesis-conditional p(x|Hk) estimates further include a local estimate produced in accordance with a discrete neighbor counting process for the gene expression data relative to the global estimate for the corresponding hypothesis-conditional pdf.
- 7. The method of claim 6, wherein the local estimate for the hypothesis is specified as a probability that an observed vector of tests x and an associated discrete neighbor counting pattern {Cl,k(x)}, l=1, . . . , Lk, k=1, . . . , N might actually be observed, wherein the neighbor counting pattern comprises counting neighbors in the distance layers for each class: {Cl,k}, l=1, . . . , Lk, wherein the integer Cl,k is the number of neighbors associated with the k-th hypothesis whose test values are distanced from a next test value within the l-th globally-transformed distance layer for the k-th class:
- 8. The method of claim 7, wherein the selected k-th class of the gene expression data corresponds to a selected training subset class of the gene expression data.
- 9. The method of claim 4, further including:
performing a training mode in which a training subset class of the gene expression data is used to produce the hypothesis-conditional probability density functions p(x|Hk); and performing a prediction mode in which a set of posterior probabilities is determined for the set H of hypotheses, wherein the hypothesis-conditional probability density functions p(x|Hk) are produced from the global estimates and from local estimates produced in accordance with a discrete neighbor counting process for the gene expression data relative to the global estimate for the corresponding hypothesis-conditional pdf.
- 10. The method of claim 9, wherein the local estimate for a hypothesis is specified as a probability that an observed vector of tests x and an associated discrete neighbor counting pattern {Cl,k(x)}, l=1, . . . , Lk, k=1, . . . , N might actually be observed, wherein the neighbor counting pattern comprises counting neighbors in the distance layers for each class: {Cl,k}, l=1, . . . , Lk, wherein the integer Cl,k is the number of test elements associated with the k-th hypothesis whose test values are distanced from a next test value within the l-th globally-transformed distance layer for the k-th class:
$$C_{l,k} = \sum_{i=1}^{n_k} \vartheta_{l,i,k}, \qquad \vartheta_{l,i,k} = \begin{cases} 1 & \text{if } d_{l-1,k} < d_{i,k} \le d_{l,k}, \quad d_{0,k} = 0 \\ 0 & \text{otherwise} \end{cases}$$
where nk is the total number of data records in a selected k-th class and the index i runs over all these data records.
- 11. The method of claim 10, wherein the selected k-th class of the gene expression data corresponds to the training subset class of the gene expression data.
- 12. The method of claim 1, wherein the posterior test-conditional probabilities provide the clinically relevant information.
- 13. The method of claim 1, wherein the clinically relevant information is compound toxicity; disease diagnosis; disease stage; disease outcome; drug efficacy; or survivability in clinical trials.
- 14. The method of claim 13, wherein the clinically relevant information is disease diagnosis.
- 15. The method of claim 14, wherein the diseases are selected from cancer; cardiovascular diseases; diabetes; HIV/AIDS; hepatitis; neurodegenerative diseases; ophthalmic diseases; blood diseases; respiratory diseases; endocrine diseases; bacterial, fungal, parasitic or viral infections; inflammatory diseases; reproductive diseases and any other disease or disorder for which gene expression data can be used to predict clinically relevant information.
- 16. The method of claim 15, wherein the disease is a cancer, selected from ovarian, lung, pancreatic, prostate, brain, breast, and colon cancer.
- 17. The method of claim 16, wherein the disease is breast cancer.
- 18. The method of claim 15, wherein the disease is a cardiovascular disease.
- 19. The method of claim 18, wherein the cardiovascular disease is selected from arteriosclerosis, angina, high blood pressure, high cholesterol, heart attack, stroke and arrhythmia.
- 20. The method of claim 15, wherein the disease is an inflammatory disease.
- 21. The method of claim 20, wherein the inflammatory disease is selected from asthma, chronic obstructive pulmonary disease, rheumatoid arthritis, inflammatory bowel disease, glomerulonephritis, neuroinflammatory disease, multiple sclerosis, and disorders of the immune system.
- 22. The method of claim 15, wherein the disease is a neurodegenerative disease.
- 23. The method of claim 19, wherein the neurodegenerative disease is selected from Alzheimer's disease (AD), Parkinson's disease, Huntington's disease, amyotrophic lateral sclerosis (ALS), and other brain disorders.
- 24. The method of claim 15, wherein the disease is a pulmonary disease.
- 25. The method of claim 1, wherein the gene expression data is obtained from the levels of genes expressed, particular genes expressed, post-translational modifications of genes, encoded proteins that are expressed, glycosylation or splice variants of genes.
- 26. The method of claim 1, further comprising an update step in which new data is convolved with the a priori probability of a discretized state vector of a hypothesis to generate the a posteriori probability of the hypothesis.
- 27. The method of claim 26, further comprising a prediction step wherein trends in the data are captured via Markov chain models of the discretized state.
- 28. The method of claim 1, further comprising compiling data into a database.
- 29. A method for generating an a posteriori tree of clinically relevant information for a subject, wherein the method comprises:
performing an analysis of gene expression data for a population of individuals, wherein the data comprises a matrix of discriminations between clinically relevant information selected from a predetermined list of clinically relevant information; performing a Bayesian statistical analysis to estimate a series of hypothesis-conditional probability density functions p(x|Hi) where a hypothesis Hi is one of a set H of the clinically relevant information; determining a prior probability density function p(Hi) for each of the hypotheses Hi; determining a posterior test-conditional probability density function p(Hi|x) for each of the hypotheses Hi test data records; and generating a posterior tree of possible clinically relevant information for a test subject in accordance with test results for the test subject.
- 30. The method of claim 29, wherein the matrix of discriminations is a pair-wise matrix.
- 31. A method of developing a test to screen for one or more inapparent diseases, comprising:
conducting a statistical analysis of data in order to identify trends and dependencies among the data, wherein the data comprises gene expression data from a subject; deriving a probabilistic model from the data, the probabilistic model being indicative of a probable disease diagnosis for a patient, wherein the probabilistic model is derived using a discrete Bayesian analysis; identifying from among the input data, the data that contributes to the diagnosis; and identifying the genes that generated the data that contributes to the diagnosis.
- 32. A method of diagnosing a disease condition of a patient, the method comprising:
receiving a set of gene expression data comprising gene expression data from a population X of individuals; estimating a hypothesis-conditional probability density function p(x|H1) where the hypothesis H1 relates to a diagnosis condition for a test patient x, and estimating a hypothesis-conditional probability density function p(x|H2) where the hypothesis H2 relates to a non-diagnosis condition for a test patient; determining a prior probability density function p(H) for each of the hypotheses H1 and H2; determining a posterior test-conditional probability density function p(H|x) for each of the hypotheses H1 and H2 on the test data x; and providing a diagnosis probability of a new patient for the H disease condition, based on the determined posterior test-conditional probability density function p(H1|x) as compared to the posterior test-conditional probability density function p(H2|x) and gene expression data of the new patient.
- 33. The method of claim 32, wherein the diseases are selected from cancer; cardiovascular diseases; diabetes; HIV/AIDS; hepatitis; neurodegenerative diseases; ophthalmic diseases; blood diseases; respiratory diseases; endocrine diseases; bacterial, fungal, parasitic or viral infections; inflammatory diseases; reproductive diseases and any other disease or disorder for which gene expression data can be used to predict clinically relevant information.
- 34. A method of diagnosing a disease from data, comprising:
conducting a statistical analysis of the data in order to identify trends and dependencies among the data, wherein the data comprises gene expression data from a subject; deriving a probabilistic model from the data, the probabilistic model being indicative of a probable disease diagnosis for a patient, wherein the disease is an inapparent disease.
- 35. The method of claim 34, wherein the probabilistic model is derived using a discrete Bayesian analysis.
- 36. The method of claim 34, further comprising compiling data into a database.
- 37. The method of claim 34, further comprising an update step in which new data is convolved with the a priori probability of a discretized state vector of a hypothesis to generate the a posteriori probability of the hypothesis.
- 38. The method of claim 36, further comprising a prediction step wherein trends in the data are captured via Markov chain models of the discretized state.
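By way of non-limiting illustration only, the computations recited in the claims above (the posterior p(Hk|x) obtained from p(x|Hk) and p(Hk), the neighbor count Cl,k over distance layers, and the Markov chain prediction step) might be sketched as follows. This is a minimal sketch under stated assumptions and is not the claimed implementation: the function names `posterior`, `layer_counts` and `predict` are hypothetical, the likelihood values are assumed to be supplied by some separate estimator, the layer boundaries are assumed to be given in increasing order, and the transition matrix is assumed to be row-stochastic.

```python
from bisect import bisect_left


def posterior(likelihoods, priors):
    """Bayes rule: p(Hk|x) = p(x|Hk) p(Hk) / sum_j p(x|Hj) p(Hj).

    likelihoods[k] is an assumed, externally supplied estimate of p(x|Hk).
    """
    joint = [lk * pk for lk, pk in zip(likelihoods, priors)]
    total = sum(joint)
    if total == 0:
        raise ValueError("x has zero probability under every hypothesis")
    return [j / total for j in joint]


def layer_counts(distances, layer_bounds):
    """Count records per distance layer for one class k.

    counts[l-1] is the number of records with d_{l-1} < d_i <= d_l,
    where d_0 = 0 and layer_bounds = [d_1, ..., d_L] is increasing.
    """
    counts = [0] * len(layer_bounds)
    for d in distances:
        if d <= 0:
            continue  # excluded by the strict inequality d_0 < d_i
        layer = bisect_left(layer_bounds, d)  # first index with d <= bound
        if layer < len(layer_bounds):  # beyond the last layer: not counted
            counts[layer] += 1
    return counts


def predict(state_probs, transition):
    """Markov chain prediction over discretized states.

    Returns p'(j) = sum_i p(i) * transition[i][j]; each row of
    transition is assumed to sum to 1.
    """
    n = len(state_probs)
    return [
        sum(state_probs[i] * transition[i][j] for i in range(n))
        for j in range(n)
    ]
```

In such a sketch, a prediction followed by an update could be composed as `posterior(new_likelihoods, predict(previous_probs, transition))`, so that the predicted distribution serves as the a priori probability for the next set of data; `new_likelihoods`, `previous_probs` and `transition` are placeholder names for inputs the caller would supply.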
RELATED APPLICATIONS
[0001] Benefit of priority is claimed to U.S. Provisional Patent Application Serial No. 60/366,441, filed Mar. 19, 2002 to Padilla et al. entitled “Discrete Bayesian Analysis Of Data”. This application is also related to International PCT application No. (attorney docket no. 24737-1918PC), filed Mar. 19, 2003.
[0002] The disclosures of the above-referenced provisional patent application and international PCT application are hereby incorporated herein by reference in their entirety.
Provisional Applications (1)
| Number | Date | Country |
| --- | --- | --- |
| 60366441 | Mar 2002 | US |