Predicting rare events, such as the hospitalization of a given patient suffering from a chronic disease, is difficult using traditional techniques. Most traditional data-mining methodologies, such as neural networks and logistic regression, do not account for longitudinal time effects for each patient. Additionally, these methodologies build correlations between the target variable and the original set of predictor variables and tend to treat the predictor variables as independent, whereas in reality many of the predictor variables are highly correlated.
Example embodiments of the present invention relate to predicting rare event outcomes using Principal Component Analysis (PCA) and Partial Least Squares (PLS). One example of a rare event that may be predicted by example embodiments of the present invention is a hospitalization event within a certain time period for a particular person. Hospitalization events are traumatic and expensive, so accurate predictions benefit the patient, the patient's doctor or caregiver, and the insurance companies that insure the patient.
PCA and PLS techniques capture correlations among various predictor variables. These methods also explain the variability of a system in terms of a few principal components (e.g., a composite variable created based on a linear combination of predictor variables). This re-parameterization is unique in the sense that it keeps the information intact for all the original variables. PCA techniques are powerful and efficient for building a reduced order model for categorical and continuous predictor variables. For example, a PCA model based on patient historical data can be used to create a decision flag indicating whether a patient requires hospitalization. PLS helps to explain the variability in a continuous target/response variable in terms of predictor variables. An example target variable may be the length of a hospital stay, the cost associated with a hospitalization, or the time to hospitalization.
Example embodiments of the present invention may generally comprise four steps. First, the example embodiment may collect historical data, including non-target events and target events. Second, the example embodiment may create a model based on this historical data. Third, the example embodiment may apply the model to an individual's data. Finally, the example embodiment may create a prediction based on the model applied to that particular data. The example embodiment may use PCA and PLS to create the predictive model.
Data used in the predictor model may be pulled from a number of sources, and the types of data will depend on the event to be predicted. One example is hospitalization events: that is, predicting, based on data and the sequence of events occurring with respect to a specific person, the likelihood that the person will require hospitalization in a given timeframe. In the example of predicting hospitalization events, relevant data may include personal data about the patient's background and health data about the patient's medical history. Examples may include: date of birth, height (after a certain age), ethnicity, gender, family history, geography (e.g., place where the patient lives), family size including marital status, career field, education level, medical charts, medical records, medical device data, lab data, weight gain/loss, prescription claims, insurance claims, physical activity levels, climate changes at the patient's location, and any number of other medical or health related metrics, or any number of other pieces of data. Data may be pulled from any number of sources, including patient questionnaires, text records (e.g., text data mining of narrative records), data storage of medical devices (e.g., data collected by a heart monitor), health databases, insurance claim databases, etc.
Data that is useful to the model in a native format may be directly imported into a prediction event database. Other data may need to be transformed into a useful state. Still other data may be stored with unnecessary components (e.g., data contained in a text narrative). In this latter situation, a text mining procedure may need to be implemented. Text mining and data mining are known in the art and several commercial products exist for this purpose. Alternatively, a proprietary procedure may be used to mine text for relevant event data. Data may be pulled from a number of sources and stored in a central modeling database. The modeling database may consist of one data repository in one location, more than one data repository in one location, or more than one data repository in more than one location.
Example embodiments of the present invention provide a powerful event modeler that is able to predict, for example, both when a hospitalization event will occur and how long it will last and/or how much it will cost. In example embodiments of the present invention, the time between regularly scheduled doctor visits or any other time stamps may be used to partition the patient history data into discrete time windows. The partitioning may be variable or uniform in length. In this respect, the modeling is similar to modeling a chemical plant failure. Chemical plants may be modeled based on “batches”, with certain events occurring during a batch, to predict a plant failure. In terms of hospitalization events, periods of time between doctor visits where no hospitalization event occurred may be considered a “good” batch, whereas periods of time between doctor visits where there was a hospitalization event may be considered a “bad” batch. Various other events and data may occur during the time intervals. Some events may be single events (e.g., experiencing an asthma attack), and other events may be continuous (e.g., weight or pacemaker data). An advantage of this example embodiment is that “lag” variables (e.g., no hospitalization event for some period of time) are inherently incorporated into the predictive model.
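By way of illustration only, the batch partitioning described above may be sketched as follows. This is a minimal sketch and not a definitive implementation: the function names, the half-open window convention (start inclusive, end exclusive), the particular lag variable, and all dates are assumptions introduced for this example.

```python
from datetime import date

def partition_into_batches(visit_dates, hospitalization_dates):
    """Split a patient history into windows between consecutive visits.

    A window is labeled "bad" if any hospitalization falls inside it
    (start inclusive, end exclusive; an assumed convention) and "good"
    otherwise. Windows may differ in length.
    """
    visits = sorted(visit_dates)
    batches = []
    for start, end in zip(visits, visits[1:]):
        hospitalized = any(start <= h < end for h in hospitalization_dates)
        batches.append({
            "start": start,
            "end": end,
            "length_days": (end - start).days,
            "label": "bad" if hospitalized else "good",
        })
    return batches

def good_windows_before(batches):
    """One possible lag variable: consecutive good windows preceding each window."""
    lags, run = [], 0
    for batch in batches:
        lags.append(run)
        run = 0 if batch["label"] == "bad" else run + 1
    return lags

# Hypothetical patient history (illustrative values only).
visits = [date(2020, 1, 10), date(2020, 4, 2), date(2020, 6, 15), date(2020, 10, 1)]
hospitalizations = [date(2020, 5, 20)]

batches = partition_into_batches(visits, hospitalizations)
lags = good_windows_before(batches)
```

In this sketch, four visits yield three windows of unequal length, and the single hospitalization marks only the window that contains it as a “bad” batch.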
The data may next undergo one or more “data preparation” phases. For example, at 140, data may be extracted from various raw text formats using data-mining techniques. At 145, the data may be formatted. This may include transforming the data to conform to some standard or otherwise tagging relevant parts of the data. For example, diagnosis data may be formatted according to a standard coding scheme, such as an International Classification of Diseases (ICD) notation (e.g., ICD-9). Next, at 150, the example procedure may align the data. This may include organizing the data according to time-stamps or some other indication of when the event occurred or when the data was initially collected. A temporal alignment may allow temporal patterns to be observed in the data-sets. Any variety of other data preparation is also possible.
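By way of illustration only, the formatting and alignment steps may be sketched as follows. The record layout, the field names, the two date formats, and the sample values are all assumptions introduced for this example and are not taken from the example procedure.

```python
from datetime import datetime

# Hypothetical raw records from two sources that use different date formats.
RAW_RECORDS = [
    {"source": "claims", "when": "2021-03-05", "value": "493.90"},
    {"source": "device", "when": "02/17/2021", "value": "weight_kg=81.2"},
    {"source": "claims", "when": "2021-01-22", "value": "250.00"},
]

# Assumed input formats, for illustration only.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")

def parse_timestamp(text):
    """Try each known format in turn; raise if none of them matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")

def align_records(records):
    """Return copies of the records with parsed time-stamps, in time order."""
    formatted = [{**r, "when": parse_timestamp(r["when"])} for r in records]
    return sorted(formatted, key=lambda r: r["when"])

aligned = align_records(RAW_RECORDS)
```

Once every record carries a comparable time-stamp, records from different sources can be interleaved in a single time-ordered sequence per patient.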
At this point in the example embodiment, the example procedure may construct two different models. These constructions may occur in parallel, as shown, or in any other order (e.g., a serial order). At 160, the example procedure may construct a PCA model. This may generally include constructing a matrix of the different responses, calculating a covariance matrix, and calculating the eigenvalue decomposition of the covariance matrix. Other PCA variations are possible, including those based on a singular-value decomposition. At 163, the example procedure may define a classification criterion. Examples related to hospitalizations may include the length of the hospital stay or the cost of the hospital stay. At 166, the example procedure may combine the matrix constructions and decompositions with the relevant event classification (e.g., cost of hospitalization) to construct a PCA prediction model. At this point, the example procedure may transition from “model building” based on historical records to “model application” based on an individual's present data. At 170, the example procedure may apply an individual's data to the constructed model to create a patient score, or otherwise evaluate the patient data with respect to the model. At 175, the example procedure may create a prediction based on the classification criteria.
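By way of illustration only, the covariance and eigenvalue decomposition steps, and the later scoring step, may be sketched with NumPy as follows. The function names and the synthetic data are assumptions introduced for this example, and the sketch omits the classification criterion defined at 163.

```python
import numpy as np

def fit_pca(X, n_components):
    """PCA via eigenvalue decomposition of the covariance matrix.

    X has one row per observation (e.g., a time window) and one column
    per predictor variable.
    """
    mean = X.mean(axis=0)
    centered = X - mean
    cov = np.cov(centered, rowvar=False)        # variables-by-variables covariance
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvecs[:, order], eigvals[order]

def score(X, mean, components):
    """Project observations onto the retained principal components."""
    return (X - mean) @ components

# Synthetic, highly correlated predictors (illustrative values only).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
X = np.hstack([latent, 2.0 * latent, -latent]) + 0.05 * rng.normal(size=(200, 3))

mean, components, variances = fit_pca(X, n_components=2)
scores = score(X, mean, components)

# Fraction of total variance carried by each retained component.
explained = variances / np.trace(np.cov(X, rowvar=False))
```

Because the three synthetic predictors are driven by one underlying factor, nearly all of the variance falls on the first component, which is the kind of reduced order representation the PCA model relies on.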
Concurrently with the PCA model, the example procedure may construct a PLS model at 180. At 183, the example procedure may define the time-to-event framework to be predicted. This may include, for example, assigning the event to be predicted (e.g., a hospitalization) and assigning the time frame for the event (e.g., within the next week or within the next month). At 186, the example procedure may construct a PLS model of the stored data to predict the relevant event outlined at 183. Similar to 170, at 190, the PLS model may be applied to a set of patient data to provide a score, or otherwise evaluate the data associated with a patient. At 195, the example procedure may produce a time-to-event prediction (e.g., the probability the patient will experience a hospitalization event in the next month). At the end of the example procedure, a final prediction may be produced, combining both discrete and continuous predictive results (e.g., the probability of an event and the probable length of the event).
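By way of illustration only, a PLS model for a single continuous target (for example, time to hospitalization in days) may be sketched with NumPy as follows. The sketch uses the NIPALS form of PLS1, which is one common way to fit such a model and is an assumption here, since the description does not name a specific algorithm. The function names and synthetic data are likewise assumptions, and converting the continuous output into a probability for a given time frame is not shown.

```python
import numpy as np

def fit_pls1(X, y, n_components):
    """PLS regression for one continuous response, using NIPALS deflation."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    weights, loadings, y_loadings = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)       # direction of maximal covariance with y
        t = Xk @ w                      # scores for this component
        tt = t @ t
        p = Xk.T @ t / tt               # predictor loadings
        q = yk @ t / tt                 # response loading
        Xk = Xk - np.outer(t, p)        # deflate predictors
        yk = yk - q * t                 # deflate response
        weights.append(w)
        loadings.append(p)
        y_loadings.append(q)
    W = np.array(weights).T             # shape: variables x components
    P = np.array(loadings).T
    q = np.array(y_loadings)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return x_mean, y_mean, coef

def predict_pls1(X, x_mean, y_mean, coef):
    """Apply the fitted model to new rows of predictor data."""
    return (X - x_mean) @ coef + y_mean

# Synthetic data (illustrative values only): target loosely read as days to event.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_coef = np.array([5.0, -3.0, 2.0])
y = 90.0 + X @ true_coef + 0.01 * rng.normal(size=200)

x_mean, y_mean, coef = fit_pls1(X, y, n_components=3)
predicted = predict_pls1(X[:5], x_mean, y_mean, coef)
```

With as many components as predictors, the fit coincides with ordinary least squares; in practice fewer components would typically be retained so that correlated predictors are summarized by a small number of composite variables.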
Data used in the PCA/PLS model may be best organized according to time, and partitioned into discrete chunks of time. In this way, during the data preparation phase of example embodiments, the data may be organized as illustrated by
As PCA and PLS deal with the decomposition and manipulation of matrix data, the example embodiments of the present invention may need to organize the data in matrix form.
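By way of illustration only, per-window variable data may be arranged in matrix form as sketched below, with one row per time window and one column per variable. The function name, the variable names, the sample values, and the use of a fill value for missing entries are assumptions introduced for this example; how missing data should be handled is not specified here.

```python
def build_matrix(windows, fill_value=0.0):
    """Arrange per-window variable dictionaries into a row-by-column matrix.

    Rows follow the (time-ordered) list of windows. Columns are the sorted
    union of variable names, so every row has the same layout. Missing
    entries receive `fill_value`, which is an illustrative choice only.
    """
    columns = sorted({name for window in windows for name in window})
    matrix = [
        [float(window.get(name, fill_value)) for name in columns]
        for window in windows
    ]
    return columns, matrix

# Hypothetical windows for one patient (illustrative values only).
windows = [
    {"weight_kg": 80.5, "asthma_attacks": 0},
    {"weight_kg": 82.0, "asthma_attacks": 2, "er_visits": 1},
    {"weight_kg": 81.1},
]

columns, X = build_matrix(windows)
```

Fixing the column order once, from the union of variable names, keeps historical data and an individual's later data in the same layout when the model is applied.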
Once the data has been collected, pre-processed, and otherwise prepared for modeling, the variable data may be imported, transmitted, or otherwise made accessible to a model building component 402. This component may be responsible for constructing the various matrices required for the PCA and/or PLS models. The component may contain construction logic 440 and 441, which may contain PCA and PLS logic respectively. There may be a classification selector 442 to select one or more criteria for the target event. There may be a framework definer 444, which may select the target event and/or define relevant parameters for the target event (e.g., a timeframe for the event to occur in). The scoring module 446 may receive a patient's data from the example system's user (e.g., data 471 from user input/output interface 470). This is only one example. Prediction data 471 may be a part of variable data 410, or stored anywhere else. The central prediction module 448 may combine the PCA and PLS predictions into a final probability. The outcome may be stored in a library (e.g., prediction library 450), and/or may be directly outputted to the user (e.g., 470). There may also be a user I/O interface 470 used to experiment with, adjust, and otherwise administer the example modeling system illustrated in
A hospitalization event was used in this description as an example, but is only one example of a rare event that may be predicted by models produced and run by example embodiments of the present invention. Any rare event and data associated with the rare event may be modeled and predicted using example embodiments of the present invention. For example, example embodiments may predict when a production factory goes offline. Events may include: downtime for each piece of equipment, error messages for each piece of equipment, production output, employee vacations, employee sick days, experience of employees, weather, time of year, power outages, or any number of other metrics related to factory production capacity. Factory data (e.g., records) may be proposed, measured, and assimilated into a model. The model may be used to compare known data about events at a factory. The outcome of that comparison may yield the probability that the factory goes offline. It may be appreciated that any rare event and set of related events may be used in conjunction with example embodiments of the present invention to predict the probability of that rare event occurring.
The various systems described herein may each include a computer-readable storage component for storing machine-readable instructions for performing the various processes as described and illustrated. The storage component may be any type of machine readable medium (i.e., one capable of being read by a machine) such as hard drive memory, flash memory, floppy disk memory, optically-encoded memory (e.g., a compact disk, DVD-ROM, DVD±R, CD-ROM, CD±R, holographic disk), a thermomechanical memory (e.g., scanning-probe-based data-storage), or any type of machine readable (computer readable) storing medium. Each computer system may also include addressable memory (e.g., random access memory, cache memory) to store data and/or sets of instructions that may be included within, or be generated by, the machine-readable instructions when they are executed by a processor on the respective platform. The methods and systems described herein may also be implemented as machine-readable instructions stored on or embodied in any of the above-described storage mechanisms. The various communications and operations described herein may be performed using any encrypted or unencrypted channel, and storage mechanisms described herein may use any storage and/or encryption mechanism.
Although the present invention has been described with reference to particular examples and embodiments, it is understood that the present invention is not limited to those examples and embodiments. The present invention as claimed therefore includes variations from the specific examples and embodiments described herein, as will be apparent to one of skill in the art.