Rare events are difficult to model using traditional techniques. Most traditional techniques require balanced datasets to produce an accurate model. In other words, the model construction technique requires approximately equal numbers of target events and non-target events. This is a problem when trying to predict rare events, where the target event does not occur as often as the non-target events. Additionally, traditional techniques can be complicated and unintuitive, making adjustment and experimentation difficult. Traditional techniques often have heavy "pre-processing" costs that slow experimentation and, due to those time costs, generally reduce the ability to produce an accurate model.
Example embodiments of the present invention relate to predicting rare event outcomes using classification trees. One example of a rare event that may be predicted by example embodiments of the present invention is a hospitalization event within a certain time period for a particular person. Hospitalization events are traumatic and expensive, making accurate predictions valuable to both the patient and the insurance companies that insure the patient. Example embodiments of the present invention may create classification trees that essentially comprise a set of rules related to predictor variables. This approach has several advantages over other approaches (e.g., neural networks, regression analysis, etc.). Since the classification trees are essentially a set of structured rules, they can be checked manually for consistency, can be readily and visually explained, and can be readily integrated with other rules. Other approaches create a "black box" situation, where data goes in and a prediction comes out. The logic inside the box is complicated and unintuitive, which does not make for a user-friendly modeling system.
The classification tree may include a root node representing all of the available data records. The data records may then be divided into child nodes that include subsets of the records associated with the parent node. The child nodes may be organized based on one or more attributes of the data records (e.g., age over 30, gender, height, etc.). The goal in the construction of the child nodes may be to increase the concentration of positive outcomes with respect to the relevant event (e.g., hospitalization events) in one child node, and increase the concentration of negative outcomes with respect to the relevant event (e.g., no hospitalization event) in the other child node. Once the tree has achieved a sufficient level of purity in the leaf nodes, the tree may be used to create a model capable of predicting the occurrence of a rare event and an associated confidence of prediction.
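The parent-to-child split described above can be sketched briefly in Python. The record fields, the age-over-30 criterion, and the function names are illustrative assumptions chosen for this sketch, not details taken from the description:

```python
# Sketch: dividing a parent node's records into two child nodes on a single
# attribute (here, age over 30), so that positive outcomes (hospitalization
# events) concentrate in one child and negative outcomes in the other.
records = [
    {"age": 25, "hospitalized": False},
    {"age": 45, "hospitalized": True},
    {"age": 52, "hospitalized": True},
    {"age": 30, "hospitalized": False},
]

def split_node(records, predicate):
    """Divide a node's records into two child nodes by a predicate."""
    left = [r for r in records if predicate(r)]
    right = [r for r in records if not predicate(r)]
    return left, right

def positive_rate(node):
    """Concentration of positive outcomes in a node (a purity measure)."""
    return sum(r["hospitalized"] for r in node) / len(node) if node else 0.0

over_30, at_most_30 = split_node(records, lambda r: r["age"] > 30)
print(positive_rate(over_30))    # 1.0 -- all positive outcomes in this child
print(positive_rate(at_most_30)) # 0.0 -- all negative outcomes in this child
```

In this toy data the single split already yields fully "pure" children; real data would typically require further splits down the tree before the leaf nodes reach a sufficient level of purity.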
Decision trees are easily understood, providing a graphical representation of the intuitive logic behind the set of rules those trees represent. In addition, decision trees are very flexible and can handle large datasets with minimal pre-processing of the data. Because of these two benefits, example embodiments of the present invention are easily manipulated to test different modeling situations. Fast, easy, and flexible model adjustments allow for a more accurate predictive model to be refined through adjustment and experimentation.
Data used in the predictor model may be pulled from a number of sources, and the types of data will depend on the event to be predicted. One example is hospitalization events; that is, based on data and the sequence of events occurring with respect to a specific person, predicting the likelihood that that person will require hospitalization in a given timeframe. In the example of predicting hospitalization events, relevant data may include personal data about the patient's background, health data about the patient's medical history, etc. Examples may include: date of birth, height (after a certain age), ethnicity, gender, family history, geography (e.g., place where the patient lives), family size including marital status, career field, education level, medical charts, medical records, medical device data, lab data, weight gain/loss, prescription claims, insurance claims, physical activity levels, climate changes of the patient's location, and any number of other medical or health related metrics, or any number of other pieces of data. Data may be pulled from any number of sources, including patient questionnaires, text records (e.g., text data mining of narrative records), data storage of medical devices (e.g., data collected by a heart monitor), health databases, insurance claim databases, etc.
Data that is useful to the model in a native format may be directly imported into a prediction event database. Other data may need to be transformed into a useful state. Still other data may be stored with unnecessary components (e.g., data contained in a text narrative). In this latter situation, a text mining procedure may need to be implemented. Text mining and data mining are known in the art, and several commercial products exist for this purpose. However, the use of text mining to populate databases for use in a subsequent data mining or analytical model is not widespread. Alternatively, a proprietary procedure may be used to mine text for relevant event data. Data may be pulled from a number of sources and stored in a central modeling database. The modeling database may consist of one data repository in one location, more than one data repository in one location, or more than one data repository in more than one location. One benefit of example embodiments of the present invention is flexibility with regard to input data. Compared with other techniques, the decision trees may not require much, if any, transformation of the data input into or imported into the model. However, example embodiments may need to have non-events characterized as events for the decision tree. For example, a single event may be a hospitalization event occurring one month ago. However, if no other hospitalization events occurred, then that too is a relevant event that needs to be addressed, i.e., "no hospitalization events in the past month". In this way, so-called "lag" variables may be accounted for, and the event at a specific time and the lack of an event over a specific period may both factor into the decision tree model.
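The characterization of non-events and "lag" variables described above can be sketched as follows. The function name, field names, and the 30-day window are illustrative assumptions:

```python
# Sketch: deriving "lag" variables from a patient's event history, including
# characterizing the absence of an event (e.g., "no hospitalization events
# in the past month") as a value in its own right.
from datetime import date

def lag_variables(hospitalization_dates, as_of, window_days=30):
    """Derive time-since-last-event and a no-event-in-window indicator."""
    past = [d for d in hospitalization_dates if d <= as_of]
    if not past:
        # No events at all: the lack of an event is itself recorded.
        return {"days_since_last": None, "no_event_in_window": True}
    days_since = (as_of - max(past)).days
    return {
        "days_since_last": days_since,
        "no_event_in_window": days_since > window_days,
    }

# A hospitalization 46 days ago: the "time since last hospitalization"
# variable is 46, and "no hospitalization in the past month" is True.
print(lag_variables([date(2011, 1, 5)], as_of=date(2011, 2, 20)))
```

Both outputs, the time since the event and the absence of an event over the window, can then be fed into the decision tree as ordinary predictor variables.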
Once the data is stored in the modeling database, different “views” may be created to facilitate different modeling approaches. A view may be created based on any number of characteristics, or combination of characteristics. One simple example may include the time frame of the predicted event. For example, the same set of data may have a modeling view set to predict the probability of a hospitalization event in the next week or the probability of a hospitalization event in the next month.
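The week-versus-month example above can be sketched as two "views" derived from the same stored records. The record layout and function name are illustrative assumptions:

```python
# Sketch: creating two modeling "views" over the same data, one labeling
# each record by whether a hospitalization occurs within the next 7 days,
# the other within the next 30 days.
from datetime import date, timedelta

records = [
    {"as_of": date(2011, 3, 1), "next_hospitalization": date(2011, 3, 5)},
    {"as_of": date(2011, 3, 1), "next_hospitalization": date(2011, 3, 25)},
    {"as_of": date(2011, 3, 1), "next_hospitalization": None},
]

def make_view(records, horizon_days):
    """Attach an outcome label for the chosen prediction window."""
    horizon = timedelta(days=horizon_days)
    return [
        {**r, "event_in_window": r["next_hospitalization"] is not None
              and r["next_hospitalization"] - r["as_of"] <= horizon}
        for r in records
    ]

week_view = make_view(records, 7)    # labels: True, False, False
month_view = make_view(records, 30)  # labels: True, True, False
```

The underlying records are unchanged; only the outcome labeling differs between the views, so each view can drive its own model.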
At 165, the records may be aggregated and imported into the modeling algorithm to create one or more models. At 170, outcome variables may be created. In this example embodiment, the outcome variable is a hospitalization event within a future timeframe (e.g., a month, week, etc.). Other embodiments of the outcome variable may include the probability of a patient being hospitalized, or a score for likelihood of hospitalization, which may be used to rank patients by risk of hospitalization. At 175, the example procedure may create a longitudinal data layout. This data can be used to create time-related variables for individual patient records. An example of this is a variable for "time since last hospitalization". At 180, the data may be partitioned to train, test, and validate one or more models. The data may be partitioned so that the data used to train the model is separate from the data used to test and validate the model. This ensures that the model does not simply learn the training data, and can provide good solutions for data it has not been trained on. Validation generally involves multiple models, to find one or more with a sufficient level of accuracy. At 185, the example procedure may apply the model to working datasets to predict the probability of the relevant event (e.g., a hospitalization), and/or save the model to a model database (e.g., 195) for future use.
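The partitioning at 180 can be sketched as a disjoint three-way split of the records. The 60/20/20 ratio and the fixed shuffle seed are illustrative assumptions, not values from the description:

```python
# Sketch: partitioning records into disjoint train/test/validation sets,
# so the model is evaluated on data it has not been trained on.
import random

def partition(records, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle and split records; no record appears in two partitions."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_test = int(n * test_frac)
    return (shuffled[:n_train],                    # train
            shuffled[n_train:n_train + n_test],    # test
            shuffled[n_train + n_test:])           # validate

train_set, test_set, validate_set = partition(list(range(100)))
print(len(train_set), len(test_set), len(validate_set))  # 60 20 20
```

Because the three slices are taken from one shuffled copy, the partitions are guaranteed disjoint, which is the property the validation step relies on.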
One example method of data partitioning, according to an example embodiment of the present invention, is to train, test, and validate one or more decision trees. Decision trees are formulated by breaking the record data down with the goal of outcome "purity". Outcome purity generally means that data is split based on a criterion such that the relevant outcome is maximized on one side of the split. In this way, the root of the decision tree may represent the entire data set. The children of the parent (e.g., the root) represent record sets split by a criterion (e.g., gender). The goal of this split is to favor leaf nodes (e.g., nodes with no children) with as "pure" an outcome for the relevant criterion as possible.
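Outcome purity can be scored numerically when choosing among candidate splits. The description does not mandate a particular measure; Gini impurity is used here as one common choice, where 0.0 means a perfectly "pure" node:

```python
# Sketch: scoring a candidate split by outcome purity, using Gini impurity
# as an assumed purity measure (lower is purer; 0.0 is a pure node).
def gini(labels):
    """Gini impurity of a node's binary outcome labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def split_impurity(left, right):
    """Size-weighted impurity of the two children produced by a split."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n

# A split that separates outcomes perfectly scores 0.0 ...
print(split_impurity([1, 1, 1], [0, 0, 0]))  # 0.0
# ... while a split that mixes the outcomes scores higher (worse).
print(split_impurity([1, 0, 1], [0, 1, 0]))
```

A tree construction procedure would evaluate each candidate criterion this way and keep the split with the lowest (purest) score.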
If an example partition were to create "pure" leaves, then the records associated with people over 6 feet tall would all fall in one leaf, and the records associated with people under 6 feet tall would all fall in the other leaf. However, though "pure" leaves might not always be possible, additional or alternative splitting may create a purer concentration. The purity of the leaf nodes may be balanced against the size of the decision tree. For example, it is possible to guarantee completely pure leaf nodes if each leaf node contains only one record. However, a tree may be built from thousands of records, and single-record leaf nodes would make such a tree unreasonably large and costly to process. Therefore, example embodiments of the present invention may balance greater purity against maintaining an efficient tree size.
Different decision tree algorithms may perform the node partitioning or splitting differently. Additionally, when a tree is constructed, branches that do not meet some minimum threshold of improved purity may be removed (e.g., "pruned" from the tree). Different decision tree algorithms may perform this "pruning" differently. Additionally, it may often be the case that records are missing one or more values. For example, the records associated with a patient may contain a large quantity of data, but be missing certain information, even basic information such as gender, age, etc. Different decision tree algorithms may handle these missing values differently as well. Some algorithms may insert one or more default values into the missing record, and others may treat the lack of a value as a value in itself (e.g., a binary attribute would have three values: the two known values and "unknown"). The algorithm used to construct the decision tree may depend on the relevant outcome (e.g., a hospitalization event). The Chi-squared Automatic Interaction Detector (CHAID) treats missing values as their own value, and is an advantageous algorithm for constructing the decision trees because it includes missing values as legitimate values in the tree splitting process.
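The missing-value-as-a-value treatment described above can be sketched as follows. This is a simplified illustration of the idea, not an implementation of the full CHAID algorithm, and the field names are assumptions:

```python
# Sketch: splitting records on an attribute while treating a missing value
# as its own legitimate value, so a record lacking (e.g.) gender is routed
# to an "unknown" branch rather than imputed or discarded.
def branch_key(record, attribute):
    """Return the branch for a record; a missing value is a value."""
    return record.get(attribute) or "unknown"

def split_with_missing(records, attribute):
    """Group records into one branch per observed value, plus 'unknown'."""
    branches = {}
    for r in records:
        branches.setdefault(branch_key(r, attribute), []).append(r)
    return branches

records = [{"gender": "F"}, {"gender": "M"}, {"gender": None}, {}]
branches = split_with_missing(records, "gender")
print(sorted(branches))  # ['F', 'M', 'unknown'] -- three branches, not two
```

The binary gender attribute thus yields three branches in the tree, matching the "two known values and 'unknown'" behavior described above.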
One additional problem with creating a model to predict rare events is that the dataset is inherently one-sided. Because the event is "rare," there will be far fewer occurrences of that event than not. However, as with most modeling techniques, a balanced dataset (e.g., one with approximately equal positive and negative relevant outcomes) may create a more accurate model. Data mining models generally need at least semi-balanced datasets to learn how to correctly categorize a positive outcome (e.g., a hospitalization event). Correcting for this disparity usually requires the replication of positive records or the elimination of negative records. However, example embodiments of the present invention may instead use weighted "misclassification costs," meaning a penalty may be assessed when the model incorrectly predicts an outcome. The penalty may then be set to achieve an optimized accuracy. For example, if a dataset has 1 positive outcome for every 20 negative outcomes, then the model construction algorithm may assign a 1 point penalty for incorrectly characterizing a negative outcome (e.g., identifying a record set that did not lead to a hospitalization as one that did), and a 20 point penalty for incorrectly characterizing a positive outcome. The misclassification cost does not have to be the exact inverse of the outcome proportion. The cost will likely be inversely proportional to the outcome proportion, but may have a greater or lesser ratio. The ideal ratio of misclassification costs may be determined by experimentation and adjustment.
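The weighted misclassification cost can be sketched as a simple scoring function. The 20:1 ratio is the illustrative figure from the example above and would in practice be tuned by experimentation:

```python
# Sketch: asymmetric misclassification costs for an imbalanced dataset
# (1 positive per 20 negatives). Missing a rare positive outcome is
# penalized 20x more heavily than a false alarm on a negative outcome.
COST_FALSE_POSITIVE = 1   # negative record misclassified as a hospitalization
COST_FALSE_NEGATIVE = 20  # hospitalization record misclassified as negative

def misclassification_cost(actual, predicted):
    """Total penalty for a set of predictions under asymmetric costs."""
    cost = 0
    for a, p in zip(actual, predicted):
        if a and not p:
            cost += COST_FALSE_NEGATIVE
        elif p and not a:
            cost += COST_FALSE_POSITIVE
    return cost

# One missed positive costs as much as 20 false alarms:
print(misclassification_cost([1, 0, 0], [0, 0, 0]))  # 20
print(misclassification_cost([0, 1, 0], [1, 1, 1]))  # 2
```

Minimizing this cost during tree construction discourages the degenerate model that simply predicts "no event" for every record, which would otherwise score well on a rare-event dataset.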
Once the data has been collected, pre-processed, and otherwise prepared for modeling, the variable data may be imported, transmitted, or otherwise made accessible to a data partitioning component 402. This component may be responsible for constructing decision trees for use in the modeling. The component may contain construction logic 440, which may contain a set of rules designed to facilitate the tree construction from the variable data. This component may generally be configured to implement a decision tree construction method, e.g., as illustrated in the figures.
A hospitalization event was used in this description as an example, but it is only one example of a rare event that may be predicted by models produced and run by example embodiments of the present invention. Any rare event and data associated with the rare event may be modeled and predicted using example embodiments of the present invention. For example, example embodiments may predict when a production factory goes offline. Events may include: downtime for each piece of equipment, error messages for each piece of equipment, production output, employee vacations, employee sick days, experience of employees, weather, time of year, power outages, or any number of other metrics related to factory production capacity. Factory data (e.g., records) may be collected, measured, and assimilated into a model. The model may be used to compare known data about events at a factory, and the outcome of that comparison may yield the probability that the factory goes offline. It may be appreciated that any rare event and set of related events may be used in conjunction with example embodiments of the present invention to predict the probability of that rare event occurring.
The various systems described herein may each include a computer-readable storage component for storing machine-readable instructions for performing the various processes as described and illustrated. The storage component may be any type of machine-readable medium (i.e., one capable of being read by a machine), such as hard drive memory, flash memory, floppy disk memory, optically-encoded memory (e.g., a compact disk, DVD-ROM, DVD±R, CD-ROM, CD±R, holographic disk), thermomechanical memory (e.g., scanning-probe-based data storage), or any other type of machine-readable (computer-readable) storage medium. Each computer system may also include addressable memory (e.g., random access memory, cache memory) to store data and/or sets of instructions that may be included within, or be generated by, the machine-readable instructions when they are executed by a processor on the respective platform. The methods and systems described herein may also be implemented as machine-readable instructions stored on or embodied in any of the above-described storage mechanisms. The various communications and operations described herein may be performed using any encrypted or unencrypted channel, and the storage mechanisms described herein may use any storage and/or encryption mechanism.
Although the present invention has been described with reference to particular examples and embodiments, it is understood that the present invention is not limited to those examples and embodiments. The present invention as claimed therefore includes variations from the specific examples and embodiments described herein, as will be apparent to one of skill in the art.