Claims
- 1. A binary tree-structured classification and prediction method for supervised learning on complex datasets where a plurality of predictors determine an outcome, the method comprising the steps of:
a) transforming the predictors if the predictors are not quantitative;
b) transforming the outcome with optimal scoring;
c) regressing the transformed outcome with optimal scoring on the transformed predictors according to a node specific variable ranking of each of the transformed predictors;
d) repeating step c) for subsets of the transformed predictors, from least significant to most, until only a single transformed predictor remains;
e) cross-validating nested families of the transformed predictors produced by step d);
f) selecting an optimal subset of the transformed predictors based on step e);
g) defining a binary splitting criterion using the optimal subset;
h) determining whether a significant association exists between the outcome and the transformed predictors in the optimal subset; and
i) repeating step c) if the significant association exists.
- 2. The method according to claim 1, further comprising the steps of:
coding the predictors into dummy indicator variables; and coding the outcome into a dummy indicator.
- 3. The method according to claim 2, in which the dummy indicator variables are vectors and the dummy indicator is a real number.
- 4. The method according to claim 1, in which the predictors are categorical.
- 5. The method according to claim 1, in which the predictors are not categorical.
- 6. The method according to claim 1, in which the predictors are selected from the group consisting of ethnicity, mutations at different loci, genotype at selected single nucleotide polymorphisms (SNPs), age, height, and body mass index.
- 7. The method according to claim 1, in which the outcome is characterized as categorical.
- 8. The method according to claim 1, in which the outcome represents disease status or credit worthiness.
- 9. The method according to claim 1, in which each of the subsets yields a reduction in generalized Gini index of diversity.
- 10. The method according to claim 1, further comprising the step of:
determining the node specific variable ranking using a statistic from bootstrap estimates.
- 11. The method according to claim 1, further comprising the step of:
imputing missing values in the predictors.
- 12. The method according to claim 1, in which the significant association exists when a permutation test statistic is larger than a pre-determined threshold.
- 13. A computer product for implementing the method according to claim 1, the computer product comprising a computer readable medium carrying computer-executable instructions implementing the steps of claim 1.
- 14. The computer product of claim 13, in which the computer-executable instructions are characterized as R code.
- 15. A classification and prediction system for supervised learning on complex datasets where a plurality of predictors determine an outcome, the system comprising:
the computer product of claim 13.
- 16. A method of using the system of claim 15, comprising the steps of:
imputing any missing values in the predictors;
growing a large tree using a training data set provided by the system;
using a cross-validation or customized test data set to determine an appropriate tree size;
pruning the large tree to the appropriate tree size;
calculating learning performance and the test data set performance;
plotting the pruned tree;
obtaining binary partition information for each given split in the pruned tree;
obtaining details of each intermediate and terminal node of the pruned tree; and
defining high risk groups based on the terminal node and the binary partition information.
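For illustration only, the quantities named in claims 9 and 12 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the claimed implementation: the function names `gini`, `gini_reduction`, and `permutation_p_value` are hypothetical, the split is represented as a simple boolean mask, and the reduction in the Gini index is used as the permutation test statistic, which the claims do not specify. Claim 12 compares a test statistic with a pre-determined threshold; the sketch instead reports a permutation p-value, a related and commonly used formulation.

```python
import random
from collections import Counter


def gini(labels):
    """Gini index of diversity for a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_reduction(labels, goes_left):
    """Decrease in the Gini index produced by a binary split.

    `goes_left` is a boolean mask (an assumption of this sketch) that sends
    each observation to the left or right child node.
    """
    n = len(labels)
    left = [y for y, g in zip(labels, goes_left) if g]
    right = [y for y, g in zip(labels, goes_left) if not g]
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted_children


def permutation_p_value(labels, goes_left, n_perm=999, seed=0):
    """Permutation p-value for the association between outcome and split.

    The outcome labels are shuffled while the split is held fixed; the
    p-value is the share of shuffles whose Gini reduction is at least as
    large as the observed one (with the usual +1 correction).
    """
    rng = random.Random(seed)
    observed = gini_reduction(labels, goes_left)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if gini_reduction(shuffled, goes_left) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical usage: a split that separates two outcome classes perfectly.
outcome = ["case"] * 10 + ["control"] * 10
split = [True] * 10 + [False] * 10
reduction = gini_reduction(outcome, split)
p_value = permutation_p_value(outcome, split)
```

In this hypothetical usage, a node would be split further only if the p-value fell below a chosen significance level; the level itself is a tuning choice that the claims leave open.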
CROSS-REFERENCE TO RELATED APPLICATION
[0001] This application claims the benefit of provisional patent application No. 60/381,556, filed May 17, 2002, the entire content and appendices of which are hereby incorporated herein by reference.
STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT
[0002] This invention was supported in part by the National Institutes of Health (NIH), grants No. HL54527 and No. CA59039. The U.S. Government may have certain rights in this invention.
Provisional Applications (1)
Number | Date | Country
60381556 | May 2002 | US