1. Field of the Disclosure
The present disclosure relates generally to systems for predictive modeling using medical information. More specifically, the present disclosure relates to a system and method for grouping medical codes for clinical predictive analytics.
2. Related Art
With more and more health care data becoming available digitally, predictive analytics tools have become increasingly important for health care providers and payers. For health care providers, predictive analytics, in combination with expert knowledge, can be used to reduce medical costs and improve care, such as by assisting in the diagnosis of numerous diseases, creating personalized treatment plans, targeting patients at high risk of readmission for resource allocation, targeting potential patients with specific diseases, etc. Predictive analytics could help determine which patients need a more thorough follow-up appointment, and could help providers find errors in their claims (e.g., missing charges, upcoding, etc.). For payers, predictive analytics has been used for risk adjustment, primarily to determine health plan premiums and encounter capitation payments. Another main focus of predictive solutions is medical fraud prevention and detection, where losses are estimated to exceed $60 billion for the Medicare program alone. Other applications include medical necessity, claim qualification, overcharge, and medical abuse detection.
More specifically, it is important for hospitals to reduce readmission rates, because readmission to a hospital shortly after discharge is undesirable for both the patient and the hospital. Hospital readmissions can cause a significant decrease in the quality of life of the patient, and are often avoidable. There is also a high cost associated with readmission for health care facilities and insurance companies. Further, new U.S. federal health laws financially penalize hospitals with higher than expected readmission rates. It would be desirable to have a model that could predict the probability of readmission just before hospital discharge, so that extra care could be provided to the patient to avoid the need for readmission.
One of the key problems in health care data analysis relates to the numerous codes that are utilized in health care related data sets. Electronic Medical Records (EMRs) are computerized records relating to the medical history and care of patients. EMRs use several coding systems to record non-numerical values, such as the ICD-9 standard, which captures diagnoses and procedures. These codes need to be converted into numerical values to be used in predictive analytics. Due to the large number of different values of medical codes (e.g., the ICD-9 standard has approximately 13,000 diagnosis codes), the codes need to be grouped (e.g., into Diagnosis Related Groups (DRGs)). The majority of the existing groupings of medical codes are based on domain knowledge, as opposed to being data driven. Because these groupings are not tailored to a specific data set and do not consider the specific target variable of the problem, they are not necessarily a good fit for that data set and do not directly address the purpose of predictive analytics. In other words, the process of building these groupings is unsupervised. Accordingly, there is a need for better grouping of medical codes for predictive analytics purposes.
The present disclosure relates to a system and method for grouping medical codes for clinical predictive analytics. The system includes a computer system and an engine executed by the computer system. The system of the present disclosure generates data-driven groupings of codes (e.g., medical diagnosis codes) relative to the predicted target, to be used in healthcare predictive analytics. The system executes a Supervised Variable Grouping (SVG) process, which is a supervised and data-driven grouping process for medical codes for use in predictive analytics models. SVG groups medical codes with respect to their inter-relations and their relation to the target, thereby reducing the dimensionality of the data set.
The foregoing features of the disclosure will be apparent from the following Detailed Description, taken in connection with the accompanying drawings, in which:
The present disclosure relates to a system and method for grouping medical codes for clinical predictive analytics, as discussed in detail below in connection with
Dimensionality reduction lowers the number of variables that are considered during the machine learning process. Dimensionality reduction is crucial in machine learning because it helps avoid the inconvenient properties of high-dimensional data sets, including the curse of dimensionality. The importance of dimensionality reduction is discussed in C. M. Bishop, “Pattern Recognition and Machine Learning,” Springer (2006), and T. Hastie, et al., “The Elements of Statistical Learning: Data Mining, Inference, and Prediction,” First Edition, Springer (2001), the entire disclosures of which are incorporated herein by reference. A survey of dimensionality reduction techniques is presented in I. K. Fodor, “A survey of dimension reduction techniques,” Tech. rep., U.S. Department of Energy, Lawrence Livermore National Laboratory (2002), the entire disclosure of which is incorporated herein by reference. Feature extraction, as opposed to feature selection, is a dimensionality reduction technique in which a high-dimensional data set is mapped onto a lower-dimensional space through a transformation. During the transformation, the process tries to preserve as much predictive information as possible while reducing the dimensionality.
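For illustration only, the following is a minimal sketch of feature extraction using principal component analysis (PCA), a well-known transformation of this type; all values and dimensions are synthetic, and PCA, unlike SVG, is unsupervised, so this is a generic example of dimensionality reduction and not the SVG process of the present disclosure:

```python
# Illustrative feature extraction via PCA (not the SVG process): map a
# 50-dimensional data set onto its top 10 principal components while
# preserving as much variance as possible. All values are synthetic.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))   # 1,000 records, 50 numeric features

Xc = X - X.mean(axis=0)           # center the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 10                            # target dimensionality
X_reduced = Xc @ Vt[:k].T         # transformed 1,000 x 10 data set
```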
In step 16, a Supervised Variable Grouping (SVG) process is executed by the system and applied to the data set, and more specifically to the indicator variables of the data set. SVG is a general dimensionality reduction technique and could be used with any type of numerical variables (indicator or non-indicator) for classification or regression problems. Applying SVG to the variables of a data set could significantly reduce the number of variables (e.g., indicator variables), resulting in a data set with a manageable number of dimensions for computation. SVG provides a grouping of indicator variables, which is equivalent to grouping the categories with respect to the target (a grouping of categories could be tailored to each target, so that two different groupings of the same set of categories based on two different targets could be significantly different). Such a grouping could be used as a basis for a smaller set of indicator variables that indicate whether or not a category falls into a specific group.
To apply SVG to the indicator variables, in step 17, the target is defined, and thresholds for the closeness of the lengths of the vectors and of their distances to the target are defined (e.g., by a user via a computer interface). The terms “column” and “vector” are used interchangeably because each column of a data set can be viewed as a vector. For example, for categorical variables (e.g., clinical diagnosis codes for the primary diagnosis of patients, the ICD-9 standard, etc.) with a large number of different categories, introducing indicator variables adds a large number of columns to the data set. Initially, each variable forms its own group, so that the number of groups equals the number of variables of the data set and the cardinality of each group is 1. The vector associated with each group is the sum of the columns of the variables that are in the group. In step 18, the vector length and distance to the target are calculated for each group.
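A minimal sketch of steps 17 and 18, assuming a pandas data set with a hypothetical categorical column named diag_code and a binary target column named target (the column names, data values, and threshold values are illustrative, not prescribed by the disclosure), could look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: a categorical column and a binary target.
df = pd.DataFrame({
    "diag_code": ["410", "250", "410", "585", "250", "410"],
    "target":    [1,     0,     1,     0,     1,     0],
})

# Step 17 (illustrative): user-defined thresholds.
eps1 = 0.5   # max allowed difference in vector length
eps2 = 0.5   # max allowed difference in distance to the target

# Indicator variables: one 0/1 column per category.
indicators = pd.get_dummies(df["diag_code"], prefix="diag").astype(float)
target = df["target"].to_numpy(dtype=float)

# Initially each indicator variable forms its own group, so each group's
# vector is simply that indicator column.
groups = {name: indicators[name].to_numpy() for name in indicators.columns}

# Step 18: length of each group vector and its distance to the target.
lengths = {g: np.linalg.norm(v) for g, v in groups.items()}
dists = {g: np.linalg.norm(target - v) for g, v in groups.items()}
```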
In step 20, the system automatically searches for two groups that satisfy the threshold conditions (e.g., the maximum allowed threshold for the closeness of the lengths of two vectors and/or the threshold for the closeness of the distances of the two vectors to the target). Recursively, at each iteration SVG finds two groups (e.g., a pair of indicator variables) such that the lengths of their associated vectors are the closest, and the distances between those vectors and the target vector are the closest. In other words, the difference in their lengths and the difference in their distances to the target vector are less than the defined thresholds (e.g., ε1 and ε2). These variables are chosen in such a way that if linear regression were performed on the final output of SVG, the result would be similar to performing linear regression on the original data set with all of the original variables. In other words, the two variables are selected based on their individual features, as well as their interaction with the target variable.
If there are two such groups, then in step 22, the two groups (e.g., variables) are combined (e.g., added) to form a combined group, and in step 24, the two individual groups are removed from the data set and replaced by the combined group (e.g., the sum). In step 26, the length of the vector associated with the combined group is calculated. Then, in step 28, the distance of the combined group to the target vector is calculated. In step 30, the distances of the combined group to the vectors associated with the other remaining groups are calculated.
In step 32, the system determines whether a satisfactory number of groups has been created (i.e., whether the number of remaining variables is equal to a pre-specified number k*). If not, the process returns to step 20. In this way, the process continues iteratively until the satisfactory number of groups has been created (e.g., the number of columns of the data set has reached a pre-specified level), or until there are no more pairs of variables satisfying the threshold conditions (e.g., there are no two columns that approximately satisfy the conditions required for reducing the dimension of the data set any further).
If a determination is made in step 20 that there are not two groups that satisfy the threshold conditions, or if a determination is made in step 32 that there are a satisfactory number of groups, then the process proceeds to step 34, where the altered data set is used as input for one or more predictive models. In this way, SVG could be applied to indicator variables to build medical code groupings, which are then applied to a data set containing clinical claims records to effectively build data sets for medical prediction problems to be used in predictive models.
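Continuing the hypothetical names from the sketch above (groups, lengths, dists, eps1, eps2, target), the iterative portion of the process (steps 20 through 34) might be sketched as follows; this is a simplified illustration under the assumptions above, not an optimized implementation of the disclosure:

```python
k_star = 2   # pre-specified target number of groups (illustrative)

while len(groups) > k_star:
    # Step 20: search for two groups whose vector lengths and whose
    # distances to the target are within the thresholds.
    best_pair = None
    for g1 in groups:
        for g2 in groups:
            if g1 >= g2:
                continue
            if (abs(lengths[g1] - lengths[g2]) < eps1
                    and abs(dists[g1] - dists[g2]) < eps2):
                best_pair = (g1, g2)
                break
        if best_pair:
            break
    if best_pair is None:
        break  # no pair satisfies the threshold conditions

    # Steps 22-24: combine the two groups and replace them with the sum.
    g1, g2 = best_pair
    combined = groups.pop(g1) + groups.pop(g2)
    name = f"{g1}+{g2}"
    groups[name] = combined
    lengths.pop(g1); lengths.pop(g2)
    dists.pop(g1); dists.pop(g2)

    # Steps 26-28: length of the combined vector and distance to target.
    lengths[name] = np.linalg.norm(combined)
    dists[name] = np.linalg.norm(target - combined)
    # Step 30 (distances between the combined group and the remaining
    # groups) is omitted in this simplified sketch, which compares group
    # lengths and target distances directly during each search.

# Step 34: the reduced set of group vectors becomes the model input.
reduced = pd.DataFrame(groups)
reduced["target"] = target
```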
For two vectors V_1 and V_2 of equal length (|V_1|_2 = |V_2|_2), equality of their distances to the target T expands as:

$$|T - V_1|_2^2 = |T - V_2|_2^2 \;\Rightarrow\; |T|_2^2 - 2V_1'T + |V_1|_2^2 = |T|_2^2 - 2V_2'T + |V_2|_2^2 \;\Rightarrow\; V_1'T = V_2'T. \quad \text{(Equation 1)}$$
The direct implication is that, given these assumptions, using V_1 + V_2 to build a linear regression model for T produces the same result as using V_1 and V_2 separately. If all of the vectors and the target are linearly independent, the result of linear regression on the original data set is the same as the result of linear regression on the reduced data set. For linear regression, this property of the indicator vectors holds exactly. For generalized linear models, which involve a linear combination of the variables in one form or another, the intuition behind SVG still holds. For general learning algorithms, the gain of using SVG is fewer dimensions in the data set, and the intuition behind SVG for linear regression provides a basis for a reasonable transformation of the variables.
There are numerous distance measures that could be used for SVG besides the Euclidean distance. For example, the risk of each category can be used as the distance of its corresponding indicator column to the target vector (for classification problems only), so that each category is replaced by its risk value. The risk value of a category is the ratio of the number of instances of the category with positive targets to the total number of instances of the category. Equivalently, the risk of an indicator variable is the ratio of the number of positive targets when the indicator is 1 to the squared length of the indicator vector. As a result, in this approach the categorical values of V_c are replaced by numerical values, and the new column is Risk(V_c).
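Expressed as a formula (a restatement of the definition above, where V_c'T denotes the inner product of the indicator column V_c with the binary target T, and |V_c|_2^2 counts the instances of the category):

$$\mathrm{Risk}(V_c) \;=\; \frac{V_c'\,T}{|V_c|_2^2} \;=\; \frac{\text{number of instances of the category with a positive target}}{\text{total number of instances of the category}}.$$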
The Euclidean distance and risk are related in the case of indicator variables for binary targets, and both can be used as the measure of distance to the target. Consider the case where |V_1|_2 = |V_2|_2 and Risk(V_1) = Risk(V_2), and let V_N = V_1 + V_2, so that |V_N|_2^2 = |V_1|_2^2 + |V_2|_2^2 = 2|V_1|_2^2. Risk(V_N) could be calculated by:

$$\mathrm{Risk}(V_N) = \frac{V_N'T}{|V_N|_2^2} = \frac{\mathrm{Risk}(V_1)|V_1|_2^2 + \mathrm{Risk}(V_2)|V_2|_2^2}{2|V_1|_2^2} = \mathrm{Risk}(V_1).$$
This indicates that if the risk is considered the distance measure between the vectors and the target in SVG, each new vector has the same risk as its parent vectors.
At each iteration, the SVG algorithm takes two vectors with equal length and equal distance to the target vector, sums them into one vector, and replaces the two vectors in the data set with the result. Suppose that V_1 and V_2 are two vectors of D with the desired properties, namely |V_1|_2 = |V_2|_2 and |T − V_1|_2 = |T − V_2|_2. As a result, using V_1'T = V_2'T from Equation 1:

$$\mathrm{Risk}(V_1) = \frac{V_1'T}{|V_1|_2^2} = \frac{V_2'T}{|V_2|_2^2} = \mathrm{Risk}(V_2),$$
where Risk(V_1) and Risk(V_2) are the risks of the categories corresponding to V_1 and V_2, respectively. Therefore, if |V_1|_2 = |V_2|_2 and |T − V_1|_2 = |T − V_2|_2 for two categorical vectors V_1 and V_2 and the target vector T, meaning that V_1 and V_2 are of equal length and the Euclidean distance between V_1 and T is equal to the Euclidean distance between V_2 and T, then Risk(V_1) = Risk(V_2). This means that the variables that are summed together at each iteration of SVG have equal risk values. Conversely, if |V_1|_2 = |V_2|_2 and Risk(V_1) = Risk(V_2), then |T − V_1|_2 = |T − V_2|_2.
Assuming V_N = V_1 + V_2, then |V_N|_2^2 = |V_1|_2^2 + |V_2|_2^2 = 2|V_1|_2^2 (by the Pythagorean theorem, because V_N, V_1, and V_2 are indicator variables that only have 0 or 1 elements, and V_1 and V_2, as indicators of different categories, are never 1 in the same row). Also, the squared 2-norm of a binary vector is the number of 1s in the vector, so the number of 1s in V_N is equal to the sum of the number of 1s in V_1 and V_2. The distance of V_N (the new vector) to the target can then be calculated as follows:

$$|T - V_N|_2^2 = |T|_2^2 - 2V_N'T + |V_N|_2^2 = |T|_2^2 - 2(V_1'T + V_2'T) + 2|V_1|_2^2 = |T|_2^2 - 4V_1'T + 2|V_1|_2^2.$$
SVG using risk performs differently than SVG using the Euclidean distance. Using risk as the distance measure, the distance of the sum of two vectors to the target is the same as the distance of each of the two vectors to the target. Comparatively, under the Euclidean distance, the sum of the two vectors generally has a distance to the target that is different from the individual distances of the two vectors to the target vector. This affects the entire algorithm, since the vectors are added together in a recursive manner. Therefore, these two measures would provide different dimensionality reduction transformations of the indicator variables. Another distinction is that using risk for dimensionality reduction is only applicable for binary targets, whereas the Euclidean distance could be used for both binary and continuous targets. This makes the Euclidean distance measure a viable candidate for both classification and regression problems.
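This distinction can be checked numerically. In the following sketch, two hypothetical disjoint indicator vectors with equal length and equal risk are summed; the risk of the sum equals the risk of the parents, while the Euclidean distance of the sum to the target differs:

```python
import numpy as np

T  = np.array([1., 1., 1., 1., 0., 0., 1., 0.])  # hypothetical binary target
V1 = np.array([1., 1., 0., 0., 0., 0., 0., 0.])  # indicator of category 1
V2 = np.array([0., 0., 1., 1., 0., 0., 0., 0.])  # indicator of category 2 (disjoint)
VN = V1 + V2

risk = lambda V: float(V @ T) / float(V @ V)

# Risk is preserved under summation: all three print 1.0.
print(risk(V1), risk(V2), risk(VN))

# The Euclidean distance to the target is not preserved:
# prints approximately 1.732, 1.732, 1.0.
print(np.linalg.norm(T - V1), np.linalg.norm(T - V2), np.linalg.norm(T - VN))
```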
In the case of using risk as the distance measure, SVG at each iteration finds two columns V_i and V_j such that |V_i|_2 = |V_j|_2 and Risk(V_i) = Risk(V_j), and then replaces V_i and V_j with a new vector V_N = V_i + V_j, so that:

$$\mathrm{Risk}(V_N) = \frac{V_N'T}{|V_N|_2^2} = \frac{V_i'T + V_j'T}{2|V_i|_2^2} = \mathrm{Risk}(V_i) = \mathrm{Risk}(V_j),$$

which means that the projection of the target onto the new vector, Proj_{V_N}(T), can be rewritten as:

$$\mathrm{Proj}_{V_N}(T) = \frac{V_N'T}{|V_N|_2^2}\,V_N = \mathrm{Risk}(V_N)\,V_N.$$
Where the Euclidean distance measure is used for SVG, the above analysis is the same, because (given |V_1|_2 = |V_2|_2) Risk(V_1) = Risk(V_2) if and only if |T − V_1|_2 = |T − V_2|_2. Therefore, all of the above analysis holds for the case where, at each iteration of SVG, V_i and V_j are picked such that |V_i|_2 = |V_j|_2 and |T − V_i|_2 = |T − V_j|_2, which supports the use of the Euclidean distance to measure the distance between each vector and the target vector.
To predict hospital readmission, assume a data set whose records represent hospitalization claim records and whose columns represent information regarding each claim. Each claim corresponds to a hospital stay, and the columns present information regarding each claim (e.g., length of stay, attending physician, claim diagnosis codes, claim procedures, etc.). More specifically, the values of the diagnosis code columns are ICD-9 codes, which are categorical values by nature because each of the approximately 13,000 different ICD-9 codes represents a condition. Among all of the features there were 10 columns representing the diagnosis codes associated with each claim (although not all of the 10 columns were necessarily populated for every claim), where such columns could be named ICD9 DGNS CD1 (e.g., representing the code for the primary diagnosis of the claim), ICD9 DGNS CD2 (e.g., representing the secondary diagnosis of the claim), . . . ICD9 DGNS CD10. Consider the first 5 of these 10 diagnosis columns and the target, which is 1 if the claim is followed by a readmission, and 0 otherwise. Diagnosis-related codes are preferably used exclusively because diagnosis-related information is common to different clinical data sets. Comparatively, clinical data could come from a variety of sources, which could contain different information regarding the claims based on their origination. For instance, one data set might contain detailed information regarding the charges associated with each hospitalization, whereas another data set might have detailed information about the lab tests and medications associated with the hospitalizations. However, no matter where the data originated or the kind of information reflected therein, most clinical data sets have diagnosis-related information.
An advantage of using SVG is that it takes the target into consideration when building the groups of diagnosis codes. One alternative to SVG is to group the codes in the 5 diagnosis columns according to domain knowledge; however, such groupings are undesirable because they serve a general purpose and do not consider the specific target of the problem. Another alternative is to use risk tables. One-dimensional risk tables are easy to compute and use, but they do not consider the co-occurrence of codes. For instance, if a patient has both condition A and condition B, the patient might be more prone to readmission than a patient who has one condition but not the other. Risk tables of a higher order could be used, but such use would be difficult due to the noise in the data that comes from the scarcity of combinations of codes. Moreover, in such data sets, risk tables do not provide a viable solution if the history of the codes must be considered.
To assess the performance of SVG, the SVG grouping was compared to an existing benchmark grouping of ICD-9 codes, grouped based on mortality rates and the relative similarity of diseases, which was presented in Escobar, G. J., Greene, J. D., et al., “Risk-adjusting Hospital Inpatient Mortality Using Automated Inpatient, Outpatient, and Laboratory Databases,” Medical Care 46(3), 232-239 (2008), the entire disclosure of which is incorporated herein by reference. The benchmark grouping had 45 groups (e.g., acute myocardial infarction, chronic renal failure, gynecologic cancers, liver disorders, etc.). Another benchmark used was a data set that replaces the ICD-9 codes with their individual risk values for each of the 5 diagnosis columns.
The data set had about 1,000,000 claims (records), and there were about 4,500 to 5,000 different ICD-9 codes under each of the five diagnosis columns (e.g., ICD9 DGNS CD1). Indicator variables were then created for all of the codes that appear in these columns. As a result, the new data set had about 1,000,000 rows (the same number of rows as the original data set), about 50,000 columns, and one target column, which is the same as the target column in the original data set. The length of each column was calculated, as well as each column's distance to the target column.
The rows of each of these three data sets (e.g., the data set using the risk table, the data set using the benchmark grouping, and the data set using the indicator variables) were split randomly (with the same random seed) into a training set (e.g., 60% of the rows) and a validation set (e.g., 40% of the rows). SVG was implemented in the Python programming language and used to create 45 variables for each of the columns, which in effect forms groups of ICD-9 codes for each column. The Euclidean distance measure and the risk table were each used to measure the distance of each vector to the target. The groups were then built using the training set. While SVG was being applied to each diagnosis column, codes that appeared fewer than 10 times in that column across the entire training set were put in a separate group, to remove the noise introduced by such rarely occurring codes.
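A sketch of this preprocessing, with hypothetical column names following the ICD9 DGNS CD* convention described above and synthetic data standing in for the actual claims (the specific values are illustrative only), might be:

```python
import numpy as np
import pandas as pd

# Hypothetical claims data: five diagnosis columns and a binary target,
# with synthetic values standing in for real ICD-9 codes.
diag_cols = [f"ICD9_DGNS_CD{i}" for i in range(1, 6)]
rng = np.random.default_rng(42)
n = 10_000  # illustrative; the actual data set had ~1,000,000 claims

claims = pd.DataFrame({c: rng.choice([f"{k:03d}" for k in range(500)], size=n)
                       for c in diag_cols})
claims["target"] = rng.integers(0, 2, size=n)

# Random 60%/40% train/validation split with a fixed seed.
mask = rng.random(n) < 0.6
train, valid = claims[mask].copy(), claims[~mask].copy()

# Codes appearing fewer than 10 times in a column of the training set are
# pooled into a separate group ("RARE") to reduce the noise introduced by
# rarely occurring codes; unseen validation codes are pooled the same way.
for c in diag_cols:
    counts = train[c].value_counts()
    frequent = counts[counts >= 10].index
    train.loc[~train[c].isin(frequent), c] = "RARE"
    valid.loc[~valid[c].isin(frequent), c] = "RARE"
```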
Four data sets were created based on the primary condition variables (e.g., a data set based on the separate risk values of the diagnosis columns, a data set based on the ICD-9 benchmark grouping, a data set built on SVG with the Euclidean distance measure, and a data set based on SVG with risk as the distance measure). For each of these four data sets, a logistic regression model was trained on the training set, and the outcome was used to score the corresponding validation set. Logistic regression is merely an example of how these models could be built; the SVG process of the present disclosure could be applied to any healthcare predictive analytics problem with a target function. The area under the ROC curve (AUC) was calculated, and the results are shown below:
Each row represents the results for its respective range of diagnosis codes. As shown, SVG, with both risk and the Euclidean distance measure, created data sets on which logistic regression performs significantly better than on the data set created from the benchmark diagnosis grouping (based on domain knowledge) and on the data set based on risk tables (which could indicate that SVG, unlike the one-dimensional risk tables, can capture part of the correlation between the different diagnosis columns).
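For reference, the evaluation described above could be reproduced along the following lines; the use of scikit-learn here is an assumption for illustration, as the disclosure does not prescribe a particular library, and the data set names are hypothetical:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def evaluate(X_train, y_train, X_valid, y_valid):
    """Train logistic regression on one transformed data set and return
    the AUC of its scores on the corresponding validation set."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_valid)[:, 1]  # probability of readmission
    return roc_auc_score(y_valid, scores)

# Hypothetical usage, repeated for each of the four transformed data sets:
# auc_svg_euclidean = evaluate(X_train_svg, y_train, X_valid_svg, y_valid)
```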
The functionality provided by the present disclosure could be provided by an SVG program/engine 106, which could be embodied as computer-readable program code stored on the storage device 104 and executed by the CPU 112 using any suitable, high or low level computing language, such as Python, Java, C, C++, C#, .NET, MATLAB, etc. The network interface 108 could include an Ethernet network interface device, a wireless network interface device, or any other suitable device which permits the server 102 to communicate via the network. The CPU 112 could include any suitable single- or multiple-core microprocessor of any suitable architecture that is capable of implementing and running the SVG program/engine 106 (e.g., Intel processor). The random access memory 114 could include any suitable, high-speed, random access memory typical of most modern computers, such as dynamic RAM (DRAM), etc.
Having thus described the system and method in detail, it is to be understood that the foregoing description is not intended to limit the spirit or scope thereof. It will be understood that the embodiments of the present disclosure described herein are merely exemplary and that a person skilled in the art may make any variations and modification without departing from the spirit and scope of the disclosure. All such variations and modifications, including those discussed above, are intended to be included within the scope of the disclosure. What is desired to be protected is set forth in the following claims.
This application claims priority to U.S. Provisional Patent Application No. 61/777,246 filed on Mar. 12, 2013, which is incorporated herein by reference in its entirety.