Claims
- 1. A method for automatically building and evaluating data mining models comprising the steps of:
receiving information specifying models to be built; generating a model settings array based on the information specifying models to be built, the model settings array including at least one model settings combination; preparing specified source data for training and testing each model to be built; for each model settings combination included in the model settings array, performing the steps of building a model based on the model settings combination, evaluating the built model, and outputting results of the model building and evaluation; selecting a model from among all the built models based on a selection criterion; and outputting the selected model.
- 2. The method of claim 1, wherein the information specifying models to be built comprises at least one of: types of models to be built, values of parameters used to build models, datasets to be used to build and test models, and parameters specifying evaluation criteria.
- 3. The method of claim 2, wherein the values of parameters used to build models comprise a series of values for the parameters or a range of values for the parameters.
- 4. The method of claim 1, wherein the step of generating a model settings array comprises the steps of:
generating model settings combinations by varying, in different combinations, parameters included in the information specifying models to be built; and generating an array to store the generated model settings combinations.
- 5. The method of claim 4, wherein the step of preparing specified source data for training and testing each model to be built comprises the steps of:
mapping source data column values to standardized forms to be used in building the models, and performing automatic assignment of groups and/or ranges of values to collections and/or bins of values, each represented by an automatically assigned integer.
- 6. The method of claim 5, wherein the standardized forms comprise sequentially assigned integers.
- 7. The method of claim 4, wherein the step of building a model based on the model settings combination comprises the steps of:
choosing an appropriate code method corresponding to a user-specified algorithm, if a user has specified an algorithm; choosing a default method, if the user has not specified an algorithm; invoking the chosen code method to build a representation of a model of a type appropriate to the specified algorithm; and training the model with the prepared training source data.
- 8. The method of claim 4, wherein the information specifying models to be built comprises at least one of: types of models to be built, values of parameters used to build models, datasets to be used to build and test models, and parameters specifying evaluation criteria.
- 9. The method of claim 8, wherein the values of parameters used to build models comprise a series of values for the parameters or a range of values for the parameters.
- 10. The method of claim 8, wherein the step of evaluating the built model comprises the step of:
evaluating the built model using an evaluation function.
- 11. The method of claim 10, wherein the evaluation function comprises at least one of:
cumulative targets at a percentage of the population, a test of model accuracy, lift at a percentage of the population, time to build the model, and time to score the model.
- 12. The method of claim 10, wherein the evaluation function comprises at least two of:
cumulative targets at a percentage of the population, a test of model accuracy, lift at a percentage of the population, time to build the model, and time to score the model.
- 13. The method of claim 1, wherein the selection criterion is to choose a model with a largest value of a weighted sum of positive and negative relative accuracy, where the weight is specified as a parameter or setting.
- 14. The method of claim 1, wherein the selection criterion comprises a figure of merit, FOM, wherein: FOM=
- 15. The method of claim 14, wherein W is determined as a ratio of:
a cost of predictions that are all correct except for a portion of negative predictions being false negatives, to a cost of predictions that are all correct except for a percentage of positive predictions being false positives.
- 16. The method of claim 1, wherein the step of selecting a model from among all the built models based on a selection criterion comprises the step of:
selecting a plurality of models from among all the built models based on the selection criterion.
- 17. A system for automatically building and evaluating data mining models comprising:
a processor operable to execute computer program instructions; a memory operable to store computer program instructions executable by the processor; and computer program instructions stored in the memory and executable to perform the steps of:
receiving information specifying models to be built; generating a model settings array based on the information specifying models to be built, the model settings array including at least one model settings combination; preparing specified source data for training and testing each model to be built; for each model settings combination included in the model settings array, performing the steps of building a model based on the model settings combination, evaluating the built model, and outputting results of the model building and evaluation; selecting a model from among all the built models based on a selection criterion; and outputting the selected model.
- 18. The system of claim 17, wherein the information specifying models to be built comprises at least one of: types of models to be built, values of parameters used to build models, datasets to be used to build and test models, and parameters specifying evaluation criteria.
- 19. The system of claim 18, wherein the values of parameters used to build models comprise a series of values for the parameters or a range of values for the parameters.
- 20. The system of claim 17, wherein the step of generating a model settings array comprises the steps of:
generating model settings combinations by varying, in different combinations, parameters included in the information specifying models to be built; and generating an array to store the generated model settings combinations.
- 21. The system of claim 20, wherein the step of preparing specified source data for training and testing each model to be built comprises the steps of:
mapping source data column values to standardized forms to be used in building the models, and performing automatic assignment of groups and/or ranges of values to collections and/or bins of values, each represented by an automatically assigned integer.
- 22. The system of claim 21, wherein the standardized forms comprise sequentially assigned integers.
- 23. The system of claim 20, wherein the step of building a model based on the model settings combination comprises the steps of:
choosing an appropriate code method corresponding to a user-specified algorithm, if a user has specified an algorithm; choosing a default method, if the user has not specified an algorithm; invoking the chosen code method to build a representation of a model of a type appropriate to the specified algorithm; and training the model with the prepared training source data.
- 24. The system of claim 20, wherein the information specifying models to be built comprises at least one of: types of models to be built, values of parameters used to build models, datasets to be used to build and test models, and parameters specifying evaluation criteria.
- 25. The system of claim 24, wherein the values of parameters used to build models comprise a series of values for the parameters or a range of values for the parameters.
- 26. The system of claim 24, wherein the step of evaluating the built model comprises the step of:
evaluating the built model using an evaluation function.
- 27. The system of claim 26, wherein the evaluation function comprises at least one of:
cumulative targets at a percentage of the population, a test of model accuracy, lift at a percentage of the population, time to build the model, and time to score the model.
- 28. The system of claim 26, wherein the evaluation function comprises at least two of:
cumulative targets at a percentage of the population, a test of model accuracy, lift at a percentage of the population, time to build the model, and time to score the model.
- 29. The system of claim 17, wherein the selection criterion is to choose a model with a largest value of a weighted sum of positive and negative relative accuracy, where the weight is specified as a parameter or setting.
- 30. The system of claim 17, wherein the selection criterion comprises a figure of merit, FOM, wherein: FOM=
- 31. The system of claim 30, wherein W is determined as a ratio of:
a cost of predictions that are all correct except for a portion of negative predictions being false negatives, to a cost of predictions that are all correct except for a percentage of positive predictions being false positives.
- 32. The system of claim 17, wherein the step of selecting a model from among all the built models based on a selection criterion comprises the step of:
selecting a plurality of models from among all the built models based on the selection criterion.
- 33. A computer program product for automatically building and evaluating data mining models, comprising:
a computer readable medium; and computer program instructions, recorded on the computer readable medium, executable by a processor, for performing the steps of:
receiving information specifying models to be built; generating a model settings array based on the information specifying models to be built, the model settings array including at least one model settings combination; preparing specified source data for training and testing each model to be built; for each model settings combination included in the model settings array, performing the steps of building a model based on the model settings combination, evaluating the built model, and outputting results of the model building and evaluation; selecting a model from among all the built models based on a selection criterion; and outputting the selected model.
- 34. The computer program product of claim 33, wherein the information specifying models to be built comprises at least one of: types of models to be built, values of parameters used to build models, datasets to be used to build and test models, and parameters specifying evaluation criteria.
- 35. The computer program product of claim 34, wherein the values of parameters used to build models comprise a series of values for the parameters or a range of values for the parameters.
- 36. The computer program product of claim 33, wherein the step of generating a model settings array comprises the steps of:
generating model settings combinations by varying, in different combinations, parameters included in the information specifying models to be built; and generating an array to store the generated model settings combinations.
- 37. The computer program product of claim 36, wherein the step of preparing specified source data for training and testing each model to be built comprises the steps of:
mapping source data column values to standardized forms to be used in building the models, and performing automatic assignment of groups and/or ranges of values to collections and/or bins of values, each represented by an automatically assigned integer.
- 38. The computer program product of claim 37, wherein the standardized forms comprise sequentially assigned integers.
- 39. The computer program product of claim 36, wherein the step of building a model based on the model settings combination comprises the steps of:
choosing an appropriate code method corresponding to a user-specified algorithm, if a user has specified an algorithm; choosing a default method, if the user has not specified an algorithm; invoking the chosen code method to build a representation of a model of a type appropriate to the specified algorithm; and training the model with the prepared training source data.
- 40. The computer program product of claim 36, wherein the information specifying models to be built comprises at least one of: types of models to be built, values of parameters used to build models, datasets to be used to build and test models, and parameters specifying evaluation criteria.
- 41. The computer program product of claim 37, wherein the values of parameters used to build models comprise a series of values for the parameters or a range of values for the parameters.
- 42. The computer program product of claim 37, wherein the step of evaluating the built model comprises the step of:
evaluating the built model using an evaluation function.
- 43. The computer program product of claim 42, wherein the evaluation function comprises at least one of:
cumulative targets at a percentage of the population, a test of model accuracy, lift at a percentage of the population, time to build the model, and time to score the model.
- 44. The computer program product of claim 42, wherein the evaluation function comprises at least two of:
cumulative targets at a percentage of the population, a test of model accuracy, lift at a percentage of the population, time to build the model, and time to score the model.
- 45. The computer program product of claim 33, wherein the selection criterion is to choose a model with a largest value of a weighted sum of positive and negative relative accuracy, where the weight is specified as a parameter or setting.
- 46. The computer program product of claim 33, wherein the selection criterion comprises a figure of merit, FOM, wherein: FOM=
- 47. The computer program product of claim 46, wherein W is determined as a ratio of:
a cost of predictions that are all correct except for a portion of negative predictions being false negatives, to a cost of predictions that are all correct except for a percentage of positive predictions being false positives.
- 48. The computer program product of claim 33, wherein the step of selecting a model from among all the built models based on a selection criterion comprises the step of:
selecting a plurality of models from among all the built models based on the selection criterion.
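By way of illustration only, and without limiting the claims, the steps recited in claims 1, 4 to 6, 13 and 16 may be sketched as follows. All function names, the toy per-bin "model", the `min_count` and `cutoff` parameters, and the definitions of positive and negative relative accuracy are assumptions introduced for this sketch. In particular, the score `weight * pos_acc + (1 - weight) * neg_acc` is one assumed form of a weighted sum and is not asserted to be the figure of merit of claim 14.

```python
from bisect import bisect_right
from itertools import product


def expand_settings(spec):
    """Build the model settings array: one dict per combination of parameter values."""
    names = sorted(spec)
    return [dict(zip(names, combo)) for combo in product(*(spec[n] for n in names))]


def bin_values(values, edges):
    """Map raw column values to sequentially assigned integer bins (0, 1, 2, ...)."""
    return [bisect_right(edges, v) for v in values]


def build_model(settings, bins, labels):
    """Toy training (assumed): a bin is 'positive' when it holds at least
    min_count training rows and its share of positive labels reaches cutoff."""
    totals, positives = {}, {}
    for b, y in zip(bins, labels):
        totals[b] = totals.get(b, 0) + 1
        positives[b] = positives.get(b, 0) + y
    positive_bins = {
        b for b in totals
        if totals[b] >= settings["min_count"]
        and positives[b] / totals[b] >= settings["cutoff"]
    }
    return lambda b: 1 if b in positive_bins else 0


def evaluate(model, bins, labels, weight):
    """Assumed score: weight * positive accuracy + (1 - weight) * negative accuracy."""
    preds = [model(b) for b in bins]
    pos = [p for p, y in zip(preds, labels) if y == 1]
    neg = [p for p, y in zip(preds, labels) if y == 0]
    pos_acc = sum(pos) / len(pos) if pos else 0.0
    neg_acc = sum(1 - p for p in neg) / len(neg) if neg else 0.0
    return weight * pos_acc + (1 - weight) * neg_acc


def run(spec, train, test, edges, weight=0.5, top_k=1):
    """Build and evaluate one model per settings combination, then return
    the top_k (score, settings, model) results, best first."""
    train_bins = bin_values(train[0], edges)
    test_bins = bin_values(test[0], edges)
    results = []
    for settings in expand_settings(spec):
        model = build_model(settings, train_bins, train[1])
        score = evaluate(model, test_bins, test[1], weight)
        results.append((score, settings, model))
    # Sort on the score only; ties keep the order of the settings array.
    results.sort(key=lambda r: r[0], reverse=True)
    return results[:top_k]
```

In this sketch `train` and `test` are each a pair of (column values, 0/1 labels), and `spec` maps each parameter name to the series of values to try, for example `{"cutoff": [0.3, 0.6], "min_count": [1, 3]}`, which yields four settings combinations. Setting `top_k` greater than one corresponds to selecting a plurality of models.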
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] The benefit under 35 U.S.C. §119(e) of provisional application No. 60/378,952, filed May 10, 2002, is hereby claimed.
Provisional Applications (1)
| Number | Date | Country |
|---|---|---|
| 60378952 | May 2002 | US |