Claims
- 1. A method of forming a decision tree to simultaneously classify multiple properties of a training item set, said training item set containing training items each having a plurality of X descriptors and a plurality of original properties, each of said plurality of X descriptors and said plurality of original properties corresponding to a physical aspect of a training item, said method comprising:
converting each of said training items into one or more converted training items, said converted training items each having said plurality of X descriptors, a K descriptor and a property;
splitting said converted training items at a root node of an overgrown tree by a logic test on said K descriptor;
growing said overgrown tree by repeatedly splitting said converted training items at sub-root nodes of said overgrown tree by logic tests on X descriptors;
pruning said overgrown tree to produce a pure-specific tree;
growing a maximally generic tree, said maximally generic tree having a root node that splits said converted training items by a logic test on an X descriptor; and
permuting said maximally generic tree to produce said decision tree, said decision tree being less generic than said maximally generic tree and more generic than said pure-specific tree.
- 2. The method of claim 1, wherein said converting comprises converting a training item having a first original property of value Y1 and a second original property of value Y2 into a first converted training item and a second converted training item, said first converted training item having a K descriptor of value one and a property of value Y1, said second converted training item having a K descriptor of value two and a property of value Y2.
- 3. The method of claim 1, wherein said converting comprises converting a training item having a first original property of value Y1 and a second original property of unknown value into a converted training item, said converted training item having a K descriptor of value one and a property of value Y1.
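The conversion recited in claims 1-3 can be sketched as follows. This is an illustrative reading, not part of the claims: training items are assumed to be (descriptor-tuple, property-tuple) pairs, `None` marks an unknown property value, and the `convert` helper is hypothetical.

```python
# Hypothetical sketch of the conversion in claims 1-3: each training item
# with descriptors X and original properties Y1..Yk becomes one converted
# item per known property, tagged with a K descriptor value.
def convert(training_items):
    """Expand each item into (x_descriptors, k, property_value) rows.

    Properties with unknown values (None) yield no converted item,
    mirroring claim 3.
    """
    converted = []
    for x, properties in training_items:
        for k, y in enumerate(properties, start=1):
            if y is not None:  # claim 3: skip unknown original properties
                converted.append((x, k, y))
    return converted

# Two training items: the second has an unknown second property.
items = [((0.2, 1.7), ("active", "toxic")),
         ((0.9, 0.3), ("inactive", None))]
rows = convert(items)
# Three converted items: two from the first item, one from the second.
```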
- 4. The method of claim 1, wherein said repeatedly splitting said converted training items at sub-root nodes of said overgrown tree by logic tests on X descriptors comprises:
at each sub-root node of said overgrown tree, splitting said converted training items by a logic test on an X descriptor that would cause the largest drop in impurity for said overgrown tree; and
repeating said splitting until no splitting can reduce the impurity or until there is no more than one converted training item at each node of said overgrown tree.
- 5. The method of claim 1, wherein said pruning said overgrown tree comprises pruning said overgrown tree using minimal cost-complexity pruning.
- 6. The method of claim 5, wherein said using minimal cost-complexity pruning comprises:
using a formula Rα=R0+αNleaf, wherein Rα is a cost-complexity measure to be minimized, R0 is a misclassification cost on said converted training items, Nleaf is a number of leaf nodes, and α is a parameter; and
enabling a user to specify a value of α.
- 7. The method of claim 5, wherein said using minimal cost-complexity pruning comprises:
using a formula Rα=R0+αNleaf, wherein Rα is a cost-complexity measure to be minimized, R0 is a misclassification cost on said converted training items, Nleaf is a number of leaf nodes of said pure-specific tree, and α is a parameter; and
identifying a value of α by a cross-validation method.
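The minimal cost-complexity measure of claims 6 and 7 reduces to a one-line formula. The sketch below, with made-up (R0, Nleaf) candidate pairs, shows how a choice of α trades training error against tree size; the candidate list and helper are assumptions for illustration only.

```python
# Sketch of the cost-complexity measure in claims 6-7 (CART-style
# minimal cost-complexity pruning).
def cost_complexity(miscls_cost, n_leaves, alpha):
    """R_alpha = R0 + alpha * N_leaf."""
    return miscls_cost + alpha * n_leaves

# Hypothetical pruned-subtree candidates as (R0, N_leaf) pairs.
# A small alpha favours larger trees with lower training error;
# a large alpha favours smaller trees.
candidates = [(0.05, 17), (0.08, 9), (0.15, 4)]
best = min(candidates, key=lambda c: cost_complexity(c[0], c[1], alpha=0.01))
```

In claim 6 the user supplies α directly; in claim 7 α would instead be chosen by cross-validation over held-out folds.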
- 8. The method of claim 1, wherein said growing a maximally generic tree comprises:
marking said pure-specific tree as a generic tree;
marking a root node of said generic tree as a non-final K-node;
for each non-final K-node of said generic tree:
saving a sub-tree beginning with said non-final K-node as an alternate K-branch of said generic tree;
evaluating whether to split said non-final K-node by an X descriptor;
if said evaluation result is negative:
restoring said alternate K-branch to said generic tree; and
marking said non-final K-node as a final K-node; and
if said evaluation result is positive:
splitting said non-final K-node by an X descriptor;
splitting immediate descendant nodes of said non-final K-node by said K descriptor;
splitting and optionally pruning non-immediate descendant nodes of said non-final K-node; and
marking each of said immediate descendant nodes as a non-final K-node if said node may be further split; and
identifying said generic tree as said maximally generic tree.
- 9. The method of claim 8, wherein said splitting said non-final K-node by an X descriptor comprises splitting said non-final K-node by an X descriptor logic test that would reduce impurity for all of said plurality of original properties.
- 10. The method of claim 8, wherein said splitting said non-final K-node by an X descriptor comprises:
for each X descriptor logic test, evaluating a drop in impurity for each of said plurality of original properties, had said non-final K-node been split by said logic test;
assigning each X descriptor logic test a score that equals the least drop in impurity for one of said plurality of original properties; and
splitting said non-final K-node by an X descriptor logic test having the highest score.
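Claim 10's scoring rule is a maximin selection: each candidate test is credited only with its worst impurity drop across the original properties, so the winning test must help every property at once. A minimal sketch, with invented drop values:

```python
# Sketch of claim 10's split scoring: score each candidate X-descriptor
# logic test by its smallest impurity drop over all original properties,
# then pick the test with the highest such score.
def best_generic_split(impurity_drops):
    """impurity_drops: {test_name: [drop for each original property]}."""
    scores = {test: min(drops) for test, drops in impurity_drops.items()}
    return max(scores, key=scores.get)

# Hypothetical candidates with per-property impurity drops.
drops = {
    "x1 <= 0.5": [0.30, 0.02],  # great for property 1, poor for property 2
    "x2 <= 1.0": [0.12, 0.10],  # decent for both: highest worst-case drop
}
choice = best_generic_split(drops)
```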
- 11. The method of claim 1, wherein said permuting said maximally generic tree comprises:
using a formula Rαβ=R0+α(Nleaf−βNgeneric), wherein Rαβ is a cost-complexity measure to be minimized, R0 is a misclassification cost on said converted training items, Nleaf is a number of leaf nodes of said decision tree, α is a parameter, Ngeneric is a number of generic nodes of said decision tree, and β is another parameter;
enabling a user to specify a value of α; and
enabling a user to specify a value of β.
- 12. The method of claim 1, wherein said permuting said maximally generic tree comprises:
using a formula Rαβ=R0+α(Nleaf−βNgeneric), wherein Rαβ is a cost-complexity measure to be minimized, R0 is a misclassification cost on said converted training items, Nleaf is a number of leaf nodes of said decision tree, α is a parameter, Ngeneric is a number of generic nodes of said decision tree, and β is another parameter;
identifying a value of α by a cross-validation method; and
enabling a user to specify a value of β.
- 13. The method of claim 8, wherein said splitting and optionally pruning non-immediate descendant nodes of said non-final K-node comprises:
using a formula Rα=R0+αNleaf, wherein Rα is a cost-complexity measure to be minimized, R0 is a misclassification cost on said converted training items, Nleaf is a number of leaf nodes of a sub-tree starting at said non-final K-node, and α is a parameter; and
determining a value of α using a cross-validation method;
and wherein said permuting said maximally generic tree comprises:
using a formula Rαβ=R0+α(Nleaf−βNgeneric), wherein Rαβ is a cost-complexity measure to be minimized, R0 is a misclassification cost on said converted training items, Nleaf is a number of leaf nodes of said decision tree, α is a parameter determined by said cross-validation method, Ngeneric is a number of generic nodes of said decision tree, and β is another parameter; and
enabling a user to specify a value of β.
- 14. The method of claim 7, wherein said permuting said maximally generic tree comprises:
using a formula Rαβ=R0+α′(Nleaf−βNgeneric), wherein Rαβ is a cost-complexity measure to be minimized, R0 is a misclassification cost on said converted training items, Nleaf is a number of leaf nodes of said decision tree, α′ is a parameter, Ngeneric is a number of generic nodes of said decision tree, and β is another parameter;
identifying a value of α′ based on the identified value of α found by the cross-validation method; and
enabling a user to specify a value of β.
- 15. The method of claim 14, wherein identifying a value of α′ comprises identifying a value of α′ as the identified value of α found by the cross-validation method.
- 16. The method of claim 14, wherein identifying a value of α′ comprises identifying a value of α′ as the identified value of α found by the cross-validation method, reduced by a positive number.
- 17. The method of claim 14, wherein identifying a value of α′ comprises prompting the user to specify a value of α′ that is equal to or smaller than the identified value of α found by the cross-validation method.
- 18. The method of claim 1, further comprising finding all decision trees that are less generic than said maximally generic tree and more generic than said pure-specific tree.
- 19. The method of claim 18, wherein said finding all decision trees comprises:
providing a formula Rαβ=R0+α(Nleaf−βNgeneric), wherein Rαβ is a cost-complexity measure to be minimized, R0 is a misclassification cost on said converted training items, Nleaf is a number of leaf nodes of said decision tree, α is a parameter, Ngeneric is a number of generic nodes of said decision tree, and β is another parameter;
repeatedly decreasing a value of β1 until permuting said maximally generic tree using the formula Rαβ=R0+α(Nleaf−β1Ngeneric) produces said pure-specific tree;
repeatedly increasing a value of β2 until permuting said maximally generic tree using the formula Rαβ=R0+α(Nleaf−β2Ngeneric) produces said maximally generic tree; and
for a plurality of values of β that are greater than said value of β1 and smaller than said value of β2, permuting said maximally generic tree using the formula Rαβ=R0+α(Nleaf−βNgeneric) to produce said decision trees.
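The enumeration in claim 19 can be illustrated by sweeping β in Rαβ=R0+α(Nleaf−βNgeneric): at β near zero the pure-specific tree minimizes the measure, and as β grows the selection moves through intermediate trees toward the maximally generic tree. The candidate (R0, Nleaf, Ngeneric) triples below are hypothetical, and real permuting would re-derive each tree rather than score a fixed list.

```python
# Sketch of claim 19's beta sweep: a larger beta rewards generic nodes,
# pushing the minimizer of R_ab toward the maximally generic tree.
def r_ab(r0, n_leaf, n_generic, alpha, beta):
    return r0 + alpha * (n_leaf - beta * n_generic)

# Hypothetical candidate trees as (R0, N_leaf, N_generic) triples.
trees = {
    "pure_specific":     (0.04, 12, 0),
    "intermediate":      (0.09,  8, 3),
    "maximally_generic": (0.14,  5, 5),
}

def select(alpha, beta):
    """Return the candidate minimizing R_ab at this (alpha, beta)."""
    return min(trees, key=lambda t: r_ab(*trees[t], alpha, beta))

# Sweeping beta upward moves the selection from specific to generic.
picks = [select(alpha=0.01, beta=b) for b in (0.0, 0.5, 2.0)]
```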
- 20. The method of claim 18, further comprising displaying said decision trees as labeled portions of said maximally generic tree.
- 21. The method of claim 18, further comprising:
displaying said decision trees as labeled portions of said maximally generic tree; and
for each displayed decision tree, displaying a range of parameter values, wherein permuting said maximally generic tree using the formula Rαβ=R0+α(Nleaf−βNgeneric) with a β value within the displayed range of parameter values would produce said displayed decision tree.
- 22. The method of claim 1, further comprising using said produced decision tree to predict multiple unknown properties of a physical item, said physical item having said plurality of X descriptors.
- 23. A method of forming a decision tree to simultaneously classify multiple properties of a training item set, said training item set containing training items each having a plurality of X descriptors and a plurality of original properties, each of said plurality of X descriptors and said plurality of original properties corresponding to a physical aspect of a training item, said method comprising:
converting each of said training items into one or more converted training items, said converted training items each having said plurality of X descriptors, a K descriptor and a property;
splitting said converted training items at a root node of a pure-specific tree by a logic test on said K descriptor;
growing said pure-specific tree by repeatedly splitting said converted training items at sub-root nodes of said pure-specific tree by logic tests on X descriptors;
growing a maximally generic tree, said maximally generic tree having a root node that splits said converted training items by a logic test on an X descriptor; and
permuting said maximally generic tree to produce said decision tree, said decision tree being less generic than said maximally generic tree and more generic than said pure-specific tree.
- 24. The method of claim 23, further comprising using said produced decision tree to predict multiple unknown properties of a physical item, said physical item having said plurality of X descriptors.
- 25. A method of forming a decision tree to simultaneously classify multiple properties of a training item set, said training item set containing training items each having a plurality of X descriptors and a plurality of original properties, each of said plurality of X descriptors and said plurality of original properties corresponding to a physical aspect of a training item, said method comprising:
converting each of said training items into one or more converted training items, said converted training items each having said plurality of X descriptors, a K descriptor and a property;
producing a pure-specific tree, said pure-specific tree having a root node that splits said converted training items by a logic test on said K descriptor;
growing a maximally generic tree, said maximally generic tree having a root node that splits said converted training items by a logic test on an X descriptor; and
permuting said maximally generic tree to produce said decision tree, said decision tree being less generic than said maximally generic tree and more generic than said pure-specific tree.
- 26. The method of claim 25, further comprising using said produced decision tree to predict multiple unknown properties of a physical item, said physical item having said plurality of X descriptors.
- 27. The method of claim 25, wherein said producing a pure-specific tree comprises producing said pure-specific tree using a Bonferroni-modified t-test statistic splitting method.
- 28. A method of using a decision tree to simultaneously predict multiple unknown properties of a physical item having a plurality of X descriptors with known values, said decision tree produced by:
converting each of a set of training items into one or more converted training items, said training items each having said plurality of X descriptors with known values and at least one known property, said converted training items each having said plurality of X descriptors, a K descriptor and a property;
splitting said converted training items at a root node of an overgrown tree by a logic test on said K descriptor;
growing said overgrown tree by repeatedly splitting said converted training items at sub-root nodes of said overgrown tree by logic tests on X descriptors;
pruning said overgrown tree to produce a pure-specific tree;
growing a maximally generic tree, said maximally generic tree having a root node that splits said converted training items by a logic test on an X descriptor; and
permuting said maximally generic tree to produce said decision tree, said decision tree being less generic than said maximally generic tree and more generic than said pure-specific tree;
said method comprising:
starting from a root node of said decision tree, evaluating said physical item against one or more logic tests on X descriptors and following a tree path resulting from said evaluating;
evaluating said physical item against one or more logic tests on said K descriptor and following multiple sub-paths resulting from said evaluating, each of said multiple sub-paths corresponding to a respective one of said multiple unknown properties of said physical item;
from each of said sub-paths, continuing to evaluate said physical item until each of said continued sub-paths reaches a respective terminal node; and
for each of said unknown properties of said physical item, identifying a value associated with each of said respective terminal nodes as a predicted value of said unknown property.
- 29. A method of using a decision tree to simultaneously predict multiple unknown properties of a physical item having a plurality of X descriptors with known values, said decision tree produced by:
converting each of a set of training items into one or more converted training items, said training items each having said plurality of X descriptors with known values and at least one known property, said converted training items each having said plurality of X descriptors, a K descriptor and a property;
splitting said converted training items at a root node of a pure-specific tree by a logic test on said K descriptor;
growing said pure-specific tree by repeatedly splitting said converted training items at sub-root nodes of said pure-specific tree by one or more logic tests on X descriptors;
growing a maximally generic tree, said maximally generic tree having a root node that splits said converted training items by a logic test on an X descriptor; and
permuting said maximally generic tree to produce said decision tree, said decision tree being less generic than said maximally generic tree and more generic than said pure-specific tree;
said method comprising:
starting from a root node of said decision tree, evaluating said physical item against one or more logic tests on X descriptors and following a tree path resulting from said evaluating;
evaluating said physical item against one or more logic tests on said K descriptor and following multiple sub-paths resulting from said evaluating, each of said multiple sub-paths corresponding to a respective one of said multiple unknown properties of said physical item;
from each of said sub-paths, continuing to evaluate said physical item until each of said continued sub-paths reaches a respective terminal node; and
for each of said unknown properties of said physical item, identifying a value associated with each of said respective terminal nodes as a predicted value of said unknown property.
- 30. A method of using a decision tree to simultaneously predict multiple unknown properties of a physical item having a plurality of X descriptors with known values, said decision tree produced by:
converting each of a set of training items into one or more converted training items, said training items each having said plurality of X descriptors with known values and at least one known property, said converted training items each having said plurality of X descriptors, a K descriptor and a property;
producing a pure-specific tree, said pure-specific tree having a root node that splits said converted training items by a logic test on said K descriptor;
growing a maximally generic tree, said maximally generic tree having a root node that splits said converted training items by a logic test on an X descriptor; and
permuting said maximally generic tree to produce said decision tree, said decision tree being less generic than said maximally generic tree and more generic than said pure-specific tree;
said method comprising:
starting from a root node of said decision tree, evaluating said physical item against one or more logic tests on X descriptors and following a tree path resulting from said evaluating;
evaluating said physical item against one or more logic tests on said K descriptor and following multiple sub-paths resulting from said evaluating, each of said multiple sub-paths corresponding to a respective one of said multiple unknown properties of said physical item;
from each of said sub-paths, continuing to evaluate said physical item until each of said continued sub-paths reaches a respective terminal node; and
for each of said unknown properties of said physical item, identifying a value associated with each of said respective terminal nodes as a predicted value of said unknown property.
- 31. A method of using a decision tree to simultaneously predict multiple unknown properties of a physical item, said physical item having a plurality of X descriptors, said decision tree having a top layer of generic nodes that each split by a logic test on one of said plurality of X descriptors, a middle layer of K nodes that each split by a logic test on K, K representing property types of said multiple properties of said physical item, and a bottom layer of specific nodes that each split by a logic test on one of said plurality of X descriptors, said method comprising:
starting from a root node of said decision tree, evaluating said physical item against logic tests of said top layer of generic nodes and following a tree path resulting from said evaluating;
evaluating said physical item against logic tests of said middle layer of K nodes and following multiple sub-paths resulting from said evaluating, each of said multiple sub-paths corresponding to a respective one of said multiple unknown properties of said physical item;
from each of said sub-paths, continuing to evaluate said physical item until each of said continued sub-paths reaches a respective terminal node; and
for each of said unknown properties of said physical item, identifying a value associated with each of said respective terminal nodes as a predicted value of said unknown property.
- 32. A method of using a decision tree to predict multiple unknown properties of a physical item, said physical item having a plurality of X descriptors, said decision tree having a top layer of generic nodes that each split by a logic test on one of said plurality of X descriptors, a middle layer of K nodes that each split by a logic test on K, K representing property types of said multiple properties of said physical item, and a bottom layer of specific nodes that each split by a logic test on one of said plurality of X descriptors, said method comprising:
converting said physical item into one or more converted items, said converted items each having said plurality of X descriptors, a K descriptor and an unknown property, said unknown property being one of said multiple unknown properties, said converted items each having a different K descriptor value corresponding to a different unknown property;
for each of said converted items, evaluating said converted item against said decision tree until a respective terminal node is reached; and
for each of said unknown properties of said physical item, identifying a value associated with each of said respective terminal nodes as a predicted value of said unknown property.
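The prediction procedure of claim 32 can be sketched with a hand-built toy tree. The nested-dict tree encoding, the node tests, and the property labels are all assumptions for illustration; the point is that one item, copied once per K value, follows a different path per unknown property.

```python
# Sketch of claim 32: copy the item once per unknown property (one K
# value each) and route each copy down the tree to a terminal value.
def predict(tree, x, k):
    """tree: nested dicts with a 'test' callable; leaves are plain values."""
    node = tree
    while isinstance(node, dict):
        node = node["yes"] if node["test"](x, k) else node["no"]
    return node

# Toy three-layer tree: a generic X split on top, then a K split,
# with hypothetical property-value labels at the terminal nodes.
toy = {"test": lambda x, k: x[0] <= 0.5,
       "yes": {"test": lambda x, k: k == 1, "yes": "active", "no": "toxic"},
       "no":  {"test": lambda x, k: k == 1, "yes": "inactive", "no": "safe"}}

item = (0.3, 1.2)  # X descriptors with known values
predictions = {k: predict(toy, item, k) for k in (1, 2)}
```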
- 33. A system of forming a decision tree to simultaneously classify multiple properties of a training item set, said training item set containing training items each having a plurality of X descriptors and a plurality of original properties, each of said plurality of X descriptors and said plurality of original properties corresponding to a physical aspect of a training item, said system comprising:
a converting module configured to convert each of said training items into one or more converted training items, said converted training items each having said plurality of X descriptors, a K descriptor and a property;
an initial tree producing module configured to produce a pure-specific tree, said pure-specific tree having a root node that splits said converted training items by a logic test on said K descriptor;
a tree growing module configured to grow a maximally generic tree, said maximally generic tree having a root node that splits said converted training items by a logic test on an X descriptor; and
a tree permuting module configured to permute said maximally generic tree to produce said decision tree, said decision tree being less generic than said maximally generic tree and more generic than said pure-specific tree.
- 34. A system of using a decision tree to simultaneously predict multiple unknown properties of a physical item, said physical item having a plurality of X descriptors, said decision tree having a top layer of generic nodes that each split by a logic test on one of said plurality of X descriptors, a middle layer of K nodes that each split by a logic test on K, K representing property types of said multiple properties of said physical item, and a bottom layer of specific nodes that each split by a logic test on one of said plurality of X descriptors, said system comprising:
an evaluation module configured to:
starting from a root node of said decision tree, evaluate said physical item against logic tests of said top layer of generic nodes and follow a tree path resulting from said evaluating;
evaluate said physical item against logic tests of said middle layer of K nodes and follow multiple sub-paths resulting from said evaluating, each of said multiple sub-paths corresponding to a respective one of said multiple unknown properties of said physical item; and
from each of said sub-paths, continue evaluating said physical item until each of said continued sub-paths reaches a respective terminal node; and
an identification module configured to identify, for each of said unknown properties of said physical item, a value associated with each of said respective terminal nodes as a predicted value of said unknown property.
- 35. A system of using a decision tree to predict multiple unknown properties of a physical item, said physical item having a plurality of X descriptors, said decision tree having a top layer of generic nodes that each split by a logic test on one of said plurality of X descriptors, a middle layer of K nodes that each split by a logic test on K, K representing property types of said multiple properties of said physical item, and a bottom layer of specific nodes that each split by a logic test on one of said plurality of X descriptors, said system comprising:
a converting module configured to convert said physical item into one or more converted items, said converted items each having said plurality of X descriptors, a K descriptor and a property;
an evaluation module configured to evaluate, for each of said converted items, said converted item against said decision tree until a respective terminal node is reached; and
an identification module configured to identify, for each of said unknown properties of said physical item, a value associated with each of said respective terminal nodes as a predicted value of said unknown property.
- 36. A decision tree stored in a computer readable form for simultaneously classifying multiple properties of a set of physical items, each of said physical items having a plurality of X descriptors, said decision tree comprising:
a top layer of generic nodes, each of said generic nodes splitting by a logic test on one of said plurality of X descriptors;
a middle layer of K nodes, each of said K nodes splitting by a logic test on a K descriptor, said K descriptor representing property types of said multiple properties;
a bottom layer of specific nodes, each of said specific nodes splitting by a logic test on one of said plurality of X descriptors; and
terminal nodes each corresponding to a value of one of said multiple properties.
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority under 35 U.S.C. §119(e) to U.S. Provisional Patent Application No. 60/259,622, entitled "Method and System for Classifying Compounds," filed Jan. 3, 2001. The disclosure of that application is hereby incorporated by reference in its entirety.
Provisional Applications (1)
| Number | Date | Country |
| --- | --- | --- |
| 60259622 | Jan 2001 | US |