1. Technical Field
The invention relates generally to evaluating a data mining algorithm, and more specifically, to a method, system and program product that allow the performance of one or more data mining algorithms to be quantified and/or compared.
2. Related Art
As businesses increasingly rely upon computer technology to perform essential functions, data mining is rapidly becoming vital to business success. Specifically, many businesses gather various types of data about the business and/or its customers so that operations can be gauged and optimized. Typically, a business will gather data into a database or the like and then utilize a data mining tool to mine the data.
Often, the data mining tool can use one of several data mining algorithms in order to mine the data. For example, the data mining algorithm can be selected based on the goals that a user is seeking to accomplish (e.g., classification, fraud detection, etc.). Making such a selection is relatively straightforward since each data mining algorithm is generally configured to fulfill specific goals. However, multiple data mining algorithms may be configured to fulfill the same goals. As a result, it is desirable to select the best-performing data mining algorithm for the particular data that is being mined.
Choosing the best-performing data mining algorithm from a set of potential data mining algorithms is currently a time-consuming and highly subjective process. In particular, a user typically runs each data mining algorithm against sample data, analyzes the results produced by each data mining algorithm, and compares the results to those produced by other data mining algorithms. To perform the analysis effectively, the user must have detailed knowledge of the goals and of how the results measure against those goals.
Additionally, each data mining algorithm may also be configurable by adjusting one or more tuning parameters. When such an adjustment is made, the data mining algorithm must be re-run against the sample data and the new results will need to be analyzed and compared to other results. Consequently, selecting a data mining algorithm may require several iterations of adjusting parameters for one or more data mining algorithms and analyzing and comparing the results that each run produces. Further, the user must have detailed knowledge about the way that parameter adjustments impact the performance of a data mining algorithm in order to make intelligent adjustment choices.
Due to the varying knowledge and subjectivity from user to user, selection of a data mining algorithm remains highly inefficient and inconsistent. Further, no quantifiable solution exists for evaluating the performance of a data mining algorithm that is currently in use.
As a result, a need exists for an improved solution for evaluating a data mining algorithm. In particular, a need exists for a method, system and program product for evaluating a data mining algorithm in which a performance value can be calculated for the data mining algorithm.
The invention provides an improved solution for evaluating one or more data mining algorithms. Specifically, under the present invention, a method, system and program product are provided that calculate a performance value for each data mining algorithm. In one embodiment, a set of goals is obtained for the set of data mining algorithms. Each goal can be assigned a weight by, for example, assigning a weight to each error case for the goal. Based on the rate of errors for each error case and the associated weights, the performance value can be calculated. The performance values for multiple data mining algorithms can be compared to determine the data mining algorithms that performed best. As a result, the invention allows the performance of the data mining algorithms to be quantified and consistently compared.
A first aspect of the invention provides a method of evaluating a data mining algorithm, the method comprising: obtaining a set of goals for the data mining algorithm; assigning a weight to each goal in the set of goals; applying the data mining algorithm to a dataset; and calculating a performance value for the data mining algorithm based on the set of weights and a set of results for the applying step.
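By way of illustration only, the four steps of the first aspect could be sketched as follows. The sketch assumes that applying a data mining algorithm yields one error rate per goal and that the performance value is a weighted sum of those rates, so that a lower value indicates better performance. Neither that formula nor the names `performance_value`, `evaluate` and `toy_algorithm` come from the disclosure; they are hypothetical.

```python
from typing import Callable, Dict, Sequence

# Hypothetical interface: an algorithm maps a dataset to an error rate per goal.
Algorithm = Callable[[Sequence[dict]], Dict[str, float]]


def performance_value(weights: Dict[str, float],
                      error_rates: Dict[str, float]) -> float:
    """Assumed scoring rule: weighted sum of per-goal error rates (lower is better)."""
    return sum(weights[goal] * error_rates[goal] for goal in weights)


def evaluate(algorithm: Algorithm,
             dataset: Sequence[dict],
             weights: Dict[str, float]) -> float:
    """Apply the algorithm to the dataset, then score the results against the weights."""
    error_rates = algorithm(dataset)
    return performance_value(weights, error_rates)


def toy_algorithm(dataset: Sequence[dict]) -> Dict[str, float]:
    """Stand-in for a real data mining algorithm; returns fixed, made-up error rates."""
    return {"classification": 0.1, "fraud_detection": 0.3}


# Steps 1 and 2: obtain the goals and assign a weight to each (illustrative values).
weights = {"classification": 2.0, "fraud_detection": 1.0}
# Steps 3 and 4: apply the algorithm and calculate its performance value.
score = evaluate(toy_algorithm, [], weights)
```

Because the weights are supplied separately from the algorithm, the same `evaluate` call can be repeated for any number of algorithms against a single set of weighted goals.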
A second aspect of the invention provides a method of evaluating a set of data mining algorithms, the method comprising: selecting the set of data mining algorithms; obtaining a set of goals for the set of data mining algorithms; assigning a weight to each goal in the set of goals; applying each data mining algorithm to a dataset; and calculating a performance value for each data mining algorithm based on the set of weights and a set of results for the applying step.
A third aspect of the invention provides a system for evaluating a set of data mining algorithms having a set of goals, the system comprising: an assignment system for assigning a weight to each goal in the set of goals; an application system for applying each data mining algorithm to a dataset; and a performance system for calculating a performance value for each data mining algorithm based on the weights assigned to the set of goals and a set of results for the applying step.
A fourth aspect of the invention provides a program product stored on a recordable medium for evaluating a set of data mining algorithms having a set of goals, which when executed comprises: program code for assigning a weight to each goal in the set of goals; program code for applying each data mining algorithm to a dataset; and program code for calculating a performance value for each data mining algorithm based on the weights assigned to the set of goals and a set of results for the applying step.
The illustrative aspects of the present invention are designed to solve the problems herein described and other problems not discussed, which are discoverable by a skilled artisan.
These and other features of this invention will be more readily understood from the following detailed description of the various aspects of the invention taken in conjunction with the accompanying drawings that depict various embodiments of the invention, in which:
It is noted that the drawings of the invention are not to scale. The drawings are intended to depict only typical aspects of the invention, and therefore should not be considered as limiting the scope of the invention. In the drawings, like numbering represents like elements between the drawings.
As indicated above, the invention provides an improved solution for evaluating one or more data mining algorithms. Specifically, under the present invention, a method, system and program product are provided that calculate a performance value for each data mining algorithm. In one embodiment, a set of goals is obtained for the set of data mining algorithms. Each goal can be assigned a weight by, for example, assigning a weight to each error case for the goal. Based on the rate of errors for each error case and the associated weights, the performance value can be calculated. The performance values for multiple data mining algorithms can be compared to determine the data mining algorithms that performed best. As a result, the invention allows the performance of the data mining algorithms to be quantified and consistently compared.
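The disclosure does not state a specific formula for the performance value. One form consistent with the description, offered only as an assumption, is a weighted sum over the error cases of every goal, where $G$ is the set of goals, $C_g$ is the set of error cases for goal $g$, $w_{g,c}$ is the weight assigned to error case $c$, and $r_{g,c}$ is the observed rate of errors for that error case:

```latex
P \;=\; \sum_{g \in G} \; \sum_{c \in C_g} w_{g,c} \, r_{g,c}
```

Under this assumed form, a smaller $P$ corresponds to a better-performing data mining algorithm, and two algorithms evaluated with the same weights can be compared directly by their values of $P$.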
It is understood that as used herein, “set” is used to denote “one or more” of an object. Further, it is understood that when a “set of data mining algorithms” is discussed, the set could comprise a single data mining algorithm configured by a single set of parameters. Alternatively, the set could include a data mining algorithm that is configured using two or more distinct sets of parameter values and/or parameters. In the latter case, this could be considered a plurality of data mining algorithms.
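The notion that one data mining algorithm configured with several distinct parameter sets can be treated as a plurality of data mining algorithms might be sketched as follows. The function name `expand_configurations`, the algorithm name and the parameter names are hypothetical and serve only to illustrate the idea.

```python
from itertools import product
from typing import Any, Dict, List, Sequence, Tuple


def expand_configurations(
    name: str, param_grid: Dict[str, Sequence[Any]]
) -> List[Tuple[str, Dict[str, Any]]]:
    """Return one (algorithm name, parameter set) pair per combination of values."""
    keys = sorted(param_grid)  # fixed key order keeps the output deterministic
    return [
        (name, dict(zip(keys, values)))
        for values in product(*(param_grid[key] for key in keys))
    ]


# One algorithm, two candidate depths, one leaf size: two configurations to evaluate.
configurations = expand_configurations(
    "decision_tree", {"max_depth": [3, 5], "min_leaf": [1]}
)
```

Each resulting pair can then be evaluated as if it were a separate member of the set of data mining algorithms.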
Turning to the drawings,
CPU 14 may comprise a single processing unit, or be distributed across one or more processing units in one or more locations, e.g., on a client and server. Memory 16 may comprise any known type of data storage and/or transmission media, including magnetic media, optical media, random access memory (RAM), read-only memory (ROM), a data cache, a data object, etc. Further, computer 12 may include a storage system 24 that can comprise any type of data storage for storing and retrieving information necessary to carry out the invention as described below. As such, storage system 24 may include one or more storage devices, such as a magnetic disk drive or an optical disk drive. Moreover, similar to CPU 14, memory 16 and/or storage system 24 may reside at a single physical location, comprising one or more types of data storage, or be distributed across a plurality of physical systems in various forms. Further, memory 16 and/or storage system 24 can include data distributed across, for example, a LAN, WAN or a storage area network (SAN) (not shown).
I/O interface 18 may comprise any system for exchanging information to/from external device(s). I/O devices 22 may comprise any known type of external device, including speakers, a CRT, LED screen, handheld device, keyboard, mouse, voice recognition system, speech output system, printer, monitor/display, facsimile, pager, etc. It is understood, however, that if computer 12 is a handheld device or the like, a display could be contained within computer 12, and not as an external I/O device 22 as shown. Bus 20 provides a communication link between each of the components in computer 12 and likewise may comprise any known type of transmission link, including electrical, optical, wireless, etc. In addition, although not shown, additional components, such as cache memory, communication systems, system software, etc., may be incorporated into computer 12.
Shown stored in memory 16 is an evaluation system 28 that evaluates a set of data mining algorithms 29. To this extent, evaluation system 28 is shown including a selection system 30 that can obtain the set of data mining algorithms 29. Evaluation system 28 can also include an assignment system 32 that assigns a weight to each goal in a set of goals for the data mining algorithm(s) 29, and an application system 34 that can apply the set of data mining algorithms 29 to a sample dataset to produce a set of results for each data mining algorithm 29. Additionally, a performance system 36 can calculate a performance value for each data mining algorithm 29 based on the set of results and the weights assigned to the set of goals. Evaluation system 28 can also include a ranking system 38 for ranking the set of data mining algorithms 29, and a summary system 40 that presents at least some of the data mining algorithms 29 (e.g., best performing) to a user for review. Still further, evaluation system 28 can include a generation system 42 to generate a data mining model based on a data mining algorithm 29 selected by the user. While the various systems are shown implemented as part of evaluation system 28, it is understood that some of the various systems can be implemented independently, combined, and/or stored in memory for one or more separate computers 12 that communicate over a network. Further, it is understood that some of the systems and/or functionality may not be implemented, or additional systems and/or functionality may be included as part of evaluation system 28.
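As a minimal sketch of the ranking and summary functions described above, the following assumes that a performance value has already been calculated for each data mining algorithm and that a lower value is better. Both assumptions, the function names `rank_algorithms` and `summarize`, and the sample values are illustrative and are not taken from the disclosure.

```python
from typing import Dict, List, Tuple


def rank_algorithms(values: Dict[str, float]) -> List[Tuple[str, float]]:
    """Order algorithms from best to worst, assuming lower values are better."""
    return sorted(values.items(), key=lambda item: item[1])


def summarize(ranked: List[Tuple[str, float]], top_n: int = 2) -> List[str]:
    """Format the top-ranked algorithms for presentation to a user."""
    return [
        f"{position}. {name}: {value:.3f}"
        for position, (name, value) in enumerate(ranked[:top_n], start=1)
    ]


# Made-up performance values for three hypothetical algorithms.
ranked = rank_algorithms(
    {"naive_bayes": 0.5, "decision_tree": 0.2, "neural_net": 0.9}
)
summary = summarize(ranked)
```

Keeping ranking separate from summarizing mirrors the description, in which either function may be omitted or implemented independently.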
As noted previously, selection system 30 obtains a set of data mining algorithms 29 to be evaluated. In one embodiment, user 26 and/or another system can provide the set of data mining algorithms 29 to selection system 30. Alternatively, selection system 30 can select the set of data mining algorithms 29 from, for example, a plurality of data mining algorithms 29 stored in storage system 24. To this extent, the set of data mining algorithms 29 can be selected based on a business problem selected by user 26. In this case, selection system 30 can present a series of choices that allow user 26 to narrow the problem and eventually select the particular business problem. For example, selection system 30 can present a series of windows that allow user 26 to make increasingly specific selections, thereby allowing user 26 to select the set of data mining algorithms 29 in a user-friendly manner.
Once user 26 selects a business problem 56, selection system 30 (
In still another embodiment, user 26 (
In any event, assignment system 32 (
In one embodiment, a goal can be given more/less weight based on the acceptability of an error in fulfilling the goal. For example, the goal could comprise predicting if a sample is diseased.
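For the disease-prediction goal, the weighting of error cases could be sketched as follows. The sketch assumes two error cases (a diseased sample predicted healthy, and a healthy sample predicted diseased), assumes that each rate is measured as a fraction of all samples, and assumes that the first case is weighted more heavily. The specific weights, sample data and function names are hypothetical.

```python
from typing import Dict, Sequence


def error_case_rates(actual: Sequence[bool],
                     predicted: Sequence[bool]) -> Dict[str, float]:
    """Rate of each error case, as a fraction of all samples (assumed definition)."""
    total = len(actual)
    false_negatives = sum(1 for a, p in zip(actual, predicted) if a and not p)
    false_positives = sum(1 for a, p in zip(actual, predicted) if not a and p)
    return {
        "false_negative": false_negatives / total,
        "false_positive": false_positives / total,
    }


def weighted_cost(rates: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine the error-case rates using their weights (lower is better)."""
    return sum(weights[case] * rates[case] for case in rates)


# True means "diseased". Missing a diseased sample is assumed to be five times
# less acceptable than flagging a healthy one.
actual = [True, True, False, False]
predicted = [True, False, True, False]
rates = error_case_rates(actual, predicted)
cost = weighted_cost(rates, {"false_negative": 5.0, "false_positive": 1.0})
```

With these illustrative weights, an algorithm that misses diseased samples is penalized more than one that raises the same number of false alarms.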
In order to evaluate each data mining algorithm 29 (
Alternatively, user 26 (
To evaluate the set of data mining algorithms 29 (
In any event, the application of each data mining algorithm 29 (
Performance system 36 (
In any event, ranking system 38 (
One or more data mining algorithms 29 (
To this extent, generation system 42 (
It is understood that the present invention can be realized in hardware, software, or a combination of hardware and software. Any kind of computer/server system(s), or other apparatus adapted for carrying out the methods described herein, is suited. A typical combination of hardware and software could be a general-purpose computer system with a computer program that, when loaded and executed, carries out the respective methods described herein. Alternatively, a specific-use computer (e.g., a finite state machine), containing specialized hardware for carrying out one or more of the functional tasks of the invention, could be utilized. The present invention can also be embedded in a computer program product, which comprises all the respective features enabling the implementation of the methods described herein, and which, when loaded in a computer system, is able to carry out these methods. Computer program, software program, program, or software, in the present context, means any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: (a) conversion to another language, code or notation; and/or (b) reproduction in a different material form.
The foregoing description of various embodiments of the invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed, and obviously, many modifications and variations are possible. Such modifications and variations that may be apparent to a person skilled in the art are intended to be included within the scope of the invention as defined by the accompanying claims.