Claims
- 1. A method for selecting epigenetic features, comprising the steps of:
a) collecting and storing biological samples containing genomic DNA;
b) collecting and storing available phenotypic information about said biological samples, thereby defining a phenotypic data set;
c) defining at least one phenotypic parameter of interest;
d) using said defined phenotypic parameters of interest to divide said biological samples in at least two disjunct phenotypic classes of interest;
e) defining an initial set of epigenetic features of interest;
f) measuring and/or analysing said defined epigenetic features of interest of said biological samples, thereby generating an epigenetic feature data set;
g) selecting those epigenetic features of interest and/or combinations of epigenetic features of interest that are relevant for epigenetically based prediction of said phenotypic classes of interest; and
h) defining a new set of epigenetic features of interest based on the relevant epigenetic features of interest and/or combinations of epigenetic features of interest generated in step (g).
- 2. The method of claim 1 wherein steps (f) to (g) are repeated based on the new set of epigenetic features of interest defined in step (h).
- 3. The method of claim 1 or 2 wherein the biological samples comprise cells, cellular components which contain DNA, sources of DNA comprising, for example, cell lines, biopsies, blood, sputum, stool, urine, cerebrospinal fluid, tissue embedded in paraffin such as tissue from eyes, intestine, kidney, brain, heart, prostate, lung, breast or liver, histologic object slides, and all possible combinations thereof.
- 4. The method of any one of the claims 1 to 3 wherein the phenotypic information and/or phenotypic parameter of interest are selected from the group comprising kind of tissue, drug resistance, toxicology, organ type, age, life style, disease history, signalling chains, protein synthesis, behaviour, drug abuse, patient history, cellular parameters, treatment history and gene expression and combinations thereof.
- 5. The method of any one of the claims 1 to 4 wherein the epigenetic features of interest are cytosine methylation sites in DNA.
- 6. The method of any one of the claims 1 to 5 wherein the initial set of epigenetic features of interest is defined using preliminary knowledge data about their correlation with phenotypic parameters.
- 7. The method of any one of the claims 1 to 6 wherein an epigenetic feature or a combination of epigenetic features is relevant for epigenetically based prediction of said phenotypic classes of interest if the accuracy and/or the significance of the epigenetically based prediction of said phenotypic classes of interest is likely to decrease by exclusion of the corresponding epigenetic feature data.
- 8. The method of any one of the claims 1 to 7 wherein said phenotypic parameters of interest are used to divide said biological samples in two disjunct phenotypic classes of interest.
- 9. The method of claim 8 wherein said epigenetically based prediction of said two disjunct phenotypic classes of interest is done by a machine learning classifier.
- 10. The method of any one of the claims 1 to 7 wherein from said disjunct phenotypic classes of interest pairs of classes or pairs of unions of classes are selected, then subjecting each pair of classes or pair of unions of classes to the method of claim 9.
- 11. The method of claim 9 wherein said selecting step comprises:
a) defining a candidate set of epigenetic features of interest and/or combinations of epigenetic features of interest;
b) defining a feature selection criterion;
c) ranking the candidate set of epigenetic features of interest and/or combinations of epigenetic features of interest according to said feature selection criterion; and
d) selecting the highest ranking epigenetic features of interest and/or combinations of epigenetic features of interest.
- 12. The method of claim 11 wherein said candidate set of epigenetic features of interest is the set of all subsets of said defined epigenetic features of interest.
- 13. The method of claim 11 wherein said candidate set of epigenetic features of interest is the set of all subsets of a given cardinality of said defined epigenetic features of interest.
- 14. The method of claim 11 wherein said candidate set of epigenetic features of interest is the set of all subsets of cardinality 1 of said defined epigenetic features of interest.
- 15. The method of claim 11 wherein said epigenetic feature data set is subject to principal component analysis, the principal components defining said candidate set of epigenetic features of interest and/or combinations of epigenetic features of interest.
- 16. The method of claim 11 wherein said epigenetic feature data set is subject to multidimensional scaling, the calculated coordinate vectors defining said candidate set of epigenetic features of interest and/or combinations of epigenetic features of interest.
- 17. The method of claim 11 wherein said epigenetic feature data set is subject to isometric feature mapping, the calculated coordinate vectors defining said candidate set of epigenetic features of interest and/or combinations of epigenetic features of interest.
- 18. The method of claim 11 wherein said epigenetic feature data set is subject to cluster analysis, then combining the epigenetic features of interest belonging to the same cluster to define said candidate set of epigenetic features of interest and/or combinations of epigenetic features of interest.
- 19. The method of claim 18 wherein said cluster analysis is hierarchical clustering.
- 20. The method of claim 18 wherein said cluster analysis is k-means clustering.
- 21. The method of claim 11 wherein said epigenetic feature selection criterion is the training error of the machine learning classifier trained on the epigenetic feature data corresponding to said candidate set of epigenetic features of interest and/or combinations of epigenetic features of interest.
- 22. The method of claim 11 wherein said epigenetic feature selection criterion is the risk of the machine learning classifier trained on the epigenetic feature data corresponding to said candidate set of epigenetic features of interest and/or combinations of epigenetic features of interest.
- 23. The method of claim 11 wherein said epigenetic feature selection criterion is the bounds on the risk of the machine learning classifier trained on the epigenetic feature data corresponding to said candidate set of epigenetic features of interest and/or combinations of epigenetic features of interest.
- 24. The method of any one of the claims 14 to 20 wherein said epigenetic feature selection criterion is the use of test statistics for computing the significance of difference of said phenotypic classes of interest given the epigenetic feature data corresponding to said candidate set of epigenetic features of interest and/or combinations of epigenetic features of interest.
- 25. The method of claim 24 wherein said statistical test is a t-test.
- 26. The method of claim 24 wherein said statistical test is a rank test.
- 27. The method of claim 26 wherein said rank test is a Wilcoxon rank test.
- 28. The method of any one of the claims 14 to 20 wherein said epigenetic feature selection criterion is the computation of the Fisher criterion for said phenotypic classes of interest given the epigenetic feature data corresponding to said candidate set of epigenetic features of interest and/or combinations of epigenetic features of interest.
- 29. The method of any one of the claims 14 to 20 wherein said epigenetic feature selection criterion is the computation of the weights of a linear discriminant for said phenotypic classes of interest given the epigenetic feature data corresponding to said candidate set of epigenetic features of interest and/or combinations of epigenetic features of interest.
- 30. The method of claim 29 wherein said linear discriminant is the Fisher discriminant.
- 31. The method of claim 29 wherein said linear discriminant is the discriminant of a support vector machine classifier for said phenotypic classes of interest trained on the epigenetic feature data corresponding to said candidate set of epigenetic features of interest and/or combinations of epigenetic features of interest.
- 32. The method of claim 29 wherein said linear discriminant is the discriminant of a perceptron classifier for said phenotypic classes of interest trained on the epigenetic feature data corresponding to said candidate set of epigenetic features of interest and/or combinations of epigenetic features of interest.
- 33. The method of claim 29 wherein said linear discriminant is the discriminant of a Bayes Point Machine classifier for said phenotypic classes of interest trained on the epigenetic feature data corresponding to said candidate set of epigenetic features of interest and/or combinations of epigenetic features of interest.
- 34. The method of any one of the claims 14 to 20 wherein said epigenetic feature selection criterion is subjecting the epigenetic feature data corresponding to said candidate set of epigenetic features of interest and/or combinations of epigenetic features of interest to principal component analysis and calculating the weights of the first principal component.
- 35. The method of any one of the claims 14 to 20 wherein said epigenetic feature selection criterion is the mutual information between said phenotypic classes of interest and the classification achieved by an optimally selected threshold on the given epigenetic feature of interest.
- 36. The method of any one of the claims 14 to 20 wherein said epigenetic feature selection criterion is the number of correct classifications achieved by an optimally selected threshold on the given epigenetic feature of interest.
- 37. The method of claim 15 wherein said epigenetic feature selection criterion is the eigenvalues of the principal components.
- 38. The method of claim 11 wherein a defined number of highest ranking epigenetic features of interest and/or combinations of epigenetic features of interest is selected.
- 39. The method of claim 11 wherein all except a defined number of lowest ranking epigenetic features of interest and/or combinations of epigenetic features of interest are selected.
- 40. The method of claim 11 wherein the epigenetic features of interest and/or combinations of epigenetic features of interest with a feature selection criterion score greater than a defined threshold are selected.
- 41. The method of claim 11 wherein all except the epigenetic features of interest and/or combinations of epigenetic features of interest with a feature selection criterion score less than a defined threshold are selected.
- 42. The method of claim 2 wherein the steps (f) to (g) are repeated until a defined number of epigenetic features of interest and/or combinations of epigenetic features of interest are selected.
- 43. The method of claim 2 wherein the steps (f) to (g) are repeated until all epigenetic features of interest and/or combinations of epigenetic features of interest with a feature selection criterion score greater than a defined threshold are selected.
- 44. The method of any one of claims 38 to 43 wherein the optimal number of epigenetic features of interest and/or combinations of epigenetic features of interest and/or the optimal feature selection criterion score threshold is determined by crossvalidation of the classifier on test subsets of the epigenetic feature data.
- 45. The method of claim 1 or 2 wherein the feature data set corresponding to said defined new set of epigenetic features of interest is used to train a machine learning classifier.
- 46. A computer program product for selecting epigenetic features comprising
a) computer code that receives as input an epigenetic feature dataset for a plurality of epigenetic features of interest, the epigenetic feature dataset being grouped in disjunct classes of interest;
b) computer code that selects those epigenetic features of interest and/or combinations of epigenetic features of interest that are relevant for machine learning class prediction based on the corresponding epigenetic feature data set;
c) computer code that defines a new set of epigenetic features of interest based on the relevant epigenetic features of interest and/or combinations of epigenetic features of interest generated in step (b); and
d) a computer readable medium that stores the computer code.
- 47. The computer program product of claim 46 comprising computer code that repeats step (b) based on the new set of epigenetic features defined in step (c).
- 48. The computer program product of claim 46 or 47 wherein an epigenetic feature of interest and/or combination of epigenetic features of interest is relevant if the accuracy and/or the significance of the machine learning class prediction is likely to decrease by exclusion of the corresponding epigenetic feature data.
- 49. The computer program product of any one of the claims 46 to 48 wherein said computer code groups the epigenetic feature data set in disjunct pairs of classes and/or pairs of unions of classes of interest before applying the computer code of steps (b) and (c).
- 50. The computer program product of any one of the claims 46 to 49 wherein said computer code for selecting the relevant epigenetic features of interest and/or combinations of epigenetic features of interest comprises
a) computer code that defines candidate sets of epigenetic features of interest and/or combinations of epigenetic features of interest;
b) computer code that ranks said candidate sets of epigenetic features of interest and/or combinations of epigenetic features of interest according to a feature selection criterion; and
c) computer code that selects the highest ranking epigenetic features of interest and/or combinations of epigenetic features of interest.
- 51. The computer program product of claim 50 wherein said candidate set of epigenetic features of interest is the set of all subsets of said epigenetic features of interest.
- 52. The computer program product of claim 50 wherein said candidate set of epigenetic features of interest is the set of all subsets of a given cardinality of said epigenetic features of interest.
- 53. The computer program product of claim 50 wherein said candidate set of epigenetic features of interest is the set of all subsets of cardinality 1 of said epigenetic features of interest.
- 54. The computer program product of claim 50 wherein the computer code performs principal component analysis on said epigenetic feature data, the principal components defining said candidate set of epigenetic features of interest and/or combinations of epigenetic features of interest.
- 55. The computer program product of claim 50 wherein the computer code performs multidimensional scaling on said epigenetic feature data set, the calculated coordinate vectors defining said candidate set of epigenetic features of interest and/or combinations of epigenetic features of interest.
- 56. The computer program product of claim 50 wherein the computer code performs isometric feature mapping on said epigenetic feature data set, the calculated coordinate vectors defining said candidate set of epigenetic features of interest and/or combinations of epigenetic features of interest.
- 57. The computer program product of claim 50 wherein the computer code performs cluster analysis on said epigenetic feature data set, then combining the epigenetic features of interest belonging to the same cluster to define said candidate set of epigenetic features of interest and/or combinations of epigenetic features of interest.
- 58. The computer program product of claim 57 wherein said cluster analysis is hierarchical clustering.
- 59. The computer program product of claim 57 wherein said cluster analysis is k-means clustering.
- 60. The computer program product of claim 50 wherein said epigenetic feature selection criterion is the training error of the machine learning classifier trained on the epigenetic feature data corresponding to said candidate set of epigenetic features of interest and/or combinations of epigenetic features of interest.
- 61. The computer program product of claim 50 wherein said epigenetic feature selection criterion is the risk of the machine learning classifier trained on the epigenetic feature data corresponding to said candidate set of epigenetic features of interest and/or combinations of epigenetic features of interest.
- 62. The computer program product of claim 50 wherein said epigenetic feature selection criterion is the bounds on the risk of the machine learning classifier trained on the epigenetic feature data corresponding to said candidate set of epigenetic features of interest and/or combinations of epigenetic features of interest.
- 63. The computer program product of any one of the claims 53 to 59 wherein said epigenetic feature selection criterion is the use of test statistics for computing the significance of difference of said classes of interest given the epigenetic feature data corresponding to said candidate set of epigenetic features of interest and/or combinations of epigenetic features of interest.
- 64. The computer program product of claim 63 wherein said statistical test is a t-test.
- 65. The computer program product of claim 63 wherein said statistical test is a rank test.
- 66. The computer program product of claim 65 wherein said rank test is a Wilcoxon rank test.
- 67. The computer program product of any one of the claims 53 to 59 wherein said epigenetic feature selection criterion is the computation of the Fisher criterion for said classes of interest given the epigenetic feature data corresponding to said candidate set of epigenetic features of interest and/or combinations of epigenetic features of interest.
- 68. The computer program product of any one of the claims 53 to 59 wherein said epigenetic feature selection criterion is the computation of the weights of a linear discriminant for said classes of interest given the epigenetic feature data corresponding to said candidate set of epigenetic features of interest and/or combinations of epigenetic features of interest.
- 69. The computer program product of claim 68 wherein said linear discriminant is the Fisher discriminant.
- 70. The computer program product of claim 68 wherein said linear discriminant is the discriminant of a support vector machine classifier for said classes of interest trained on the epigenetic feature data corresponding to said candidate set of epigenetic features of interest and/or combinations of epigenetic features of interest.
- 71. The computer program product of claim 68 wherein said linear discriminant is the discriminant of a perceptron classifier for said classes of interest trained on the epigenetic feature data corresponding to said candidate set of epigenetic features of interest and/or combinations of epigenetic features of interest.
- 72. The computer program product of claim 68 wherein said linear discriminant is the discriminant of a Bayes Point Machine classifier for said classes of interest trained on the epigenetic feature data corresponding to said candidate set of epigenetic features of interest and/or combinations of epigenetic features of interest.
- 73. The computer program product of any one of the claims 53 to 59 wherein the computer code performs principal component analysis on said epigenetic feature data corresponding to said candidate set of epigenetic features of interest and/or combinations of epigenetic features of interest, and wherein said epigenetic feature selection criterion is the weights of the first principal component.
- 74. The computer program product of any one of the claims 53 to 59 wherein said epigenetic feature selection criterion is the mutual information between said classes of interest and the classification achieved by an optimally selected threshold on the given epigenetic feature of interest.
- 75. The computer program product of any one of the claims 53 to 59 wherein said epigenetic feature selection criterion is the number of correct classifications achieved by an optimally selected threshold on the given epigenetic feature of interest.
- 76. The computer program product of claim 54 wherein said epigenetic feature selection criterion is the eigenvalues of the principal components.
- 77. The computer program product of claim 50 wherein the computer code selects a defined number of highest ranking epigenetic features of interest and/or combinations of epigenetic features of interest.
- 78. The computer program product of claim 50 wherein the computer code selects all except a defined number of lowest ranking epigenetic features of interest and/or combinations of epigenetic features of interest.
- 79. The computer program product of claim 50 wherein the computer code selects the epigenetic features of interest and/or combinations of epigenetic features of interest with a feature selection criterion score greater than a defined threshold.
- 80. The computer program product of claim 50 wherein the computer code selects all except the epigenetic features of interest and/or combinations of epigenetic features of interest with a feature selection criterion score less than a defined threshold.
- 81. The computer program product of claim 47 wherein the steps (b) and (c) are repeated until a defined number of epigenetic features of interest and/or combinations of epigenetic features of interest are selected.
- 82. The computer program product of claim 47 wherein the computer code repeats the steps (b) and (c) until all epigenetic features of interest and/or combinations of epigenetic features of interest with a feature selection criterion score greater than a defined threshold are selected.
- 83. The computer program product of any one of claims 77 to 82 wherein the computer code calculates the optimal number of epigenetic features of interest and/or combinations of epigenetic features of interest and/or the optimal feature selection criterion score threshold by crossvalidation of the classifier on test subsets of said epigenetic feature data.
- 84. The computer program product of claim 46 comprising computer code that uses the epigenetic feature data set corresponding to said defined new set of epigenetic features of interest to train a machine learning classifier.
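By way of illustration only, the ranking-based selection recited in claims 11, 14, 28, 38 and 42 can be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: the function names (`fisher_score`, `rank_features`, `select_top_k`, `iterate_selection`), the toy methylation values and the choice of the population variance in the Fisher criterion are all hypothetical. Each candidate is a single feature (a subset of cardinality 1), the feature selection criterion is the Fisher criterion, and the highest ranking features are kept.

```python
from statistics import mean, pvariance


def fisher_score(values_a, values_b):
    """Fisher criterion for one feature: (mean_a - mean_b)^2 / (var_a + var_b)."""
    gap = (mean(values_a) - mean(values_b)) ** 2
    spread = pvariance(values_a) + pvariance(values_b)
    if spread == 0:
        # Degenerate case: no within-class variation at all.
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def rank_features(data, labels):
    """Score every single feature and return (indices by descending score, scores).

    data: list of samples, each a list of feature values (one per feature).
    labels: list of 0/1 class labels, one per sample (two disjunct classes).
    """
    n_features = len(data[0])
    scores = []
    for j in range(n_features):
        class_a = [row[j] for row, y in zip(data, labels) if y == 0]
        class_b = [row[j] for row, y in zip(data, labels) if y == 1]
        scores.append(fisher_score(class_a, class_b))
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores


def select_top_k(data, labels, k):
    """Keep a defined number k of highest ranking features."""
    order, _ = rank_features(data, labels)
    return sorted(order[:k])


def iterate_selection(data, labels, target):
    """Repeat rank-and-drop until `target` features remain.

    Assumption: the feature data are simply re-subset in each round;
    no new measurement is simulated here.
    """
    kept = list(range(len(data[0])))
    while len(kept) > target:
        subset = [[row[j] for j in kept] for row in data]
        order, _ = rank_features(subset, labels)
        # Drop the single lowest ranking feature of this round.
        kept = sorted(kept[j] for j in order[:-1])
    return kept


# Hypothetical methylation levels: 6 samples x 3 CpG sites, two classes.
data = [
    [0.90, 0.50, 0.10],
    [0.80, 0.40, 0.20],
    [0.85, 0.60, 0.15],
    [0.20, 0.50, 0.12],
    [0.10, 0.45, 0.18],
    [0.15, 0.55, 0.14],
]
labels = [0, 0, 0, 1, 1, 1]

print(select_top_k(data, labels, 1))
print(iterate_selection(data, labels, 1))
```

In this toy data only the first site separates the two classes, so both calls retain it. Other criteria named in the claims (a t-test, a rank test, or the weights of a linear discriminant) could be substituted for `fisher_score` without changing the ranking and selection steps.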
RELATED APPLICATION
[0001] This application claims the priority of U.S. Provisional Application, Serial No. 60/278,333 filed on Mar. 26, 2001. The 60/278,333 application is incorporated herein by reference for all purposes. All cited references are hereby incorporated in their entireties.
Provisional Applications (1)

| Number | Date | Country |
|---|---|---|
| 60278333 | Mar 2001 | US |