1. Field of the Invention
The present invention relates generally to systems for detecting errors. More specifically, the present invention relates to a system and method for detecting billing errors using predictive modeling.
2. Related Art
In the healthcare field, billing and coding are complex processes that involve multiple “handoffs” between various medical departments/entities, etc., as well as human intervention. Typically, when a patient visits a hospital, the doctor diagnoses the patient's symptoms and orders services to cure his/her illness or to alleviate symptoms. After the patient is discharged from the hospital, professional coders manually code the services and procedures provided to patients by reading physician orders, nurse notes, laboratory records, and many other medical records to prepare claims. This inevitably leads to billing errors or missed charges due to various reasons (e.g., misreading handwritten notes, delayed laboratory records, different billing rules for hospitals or insurance plans, inexperienced coders, etc.). As a result, there are direct losses associated with missing charges since hospitals (or other types of businesses) will not get paid by insurance companies or other payers. Further, claims with billing errors are also denied by payers. It has been estimated that about 1% of hospital revenue is lost due to the missing charges.
In order to prevent revenue leakage, most hospitals rely on manual review, and/or rule-based software solutions for checking bills before they are issued. Manual and rule-based solutions have difficulty handling different practice patterns across large systems (e.g., a large hospital system), which results in many exceptions and false-positives that may lead to denied claims due to billing errors, wasted time and resources, increased costs, etc. For pre-billing checks that are manually conducted, internal and/or third party reviewers review charges for a sample (10-15%) of pre-bill visits. Due to the expense of this approach, it is often reserved for only the most expensive procedures (e.g., surgeries, transplants, and cardiac procedures) and the review quality depends on the ability of the auditors (e.g., experience, training, etc.), who need to be constantly trained and educated on changes in medical care or billing.
Rule-based software solutions are mainly used to check for billing errors, instead of missing charges, and are often implemented as rules requiring the co-occurrence of specific procedure codes to check the consistency of claims. These solutions are only as effective as the rules created by the client, and usually the rules are too simple to capture the complicated patterns that exist in hospital billing, while the billing system as a whole becomes too complicated to maintain. For example, rule-based systems typically, and impractically, recommend hundreds of possible missing codes.
The present invention relates to a system and method for detecting billing errors using predictive models. The system includes a computer system and a billing error detection engine capable of detecting billing errors using predictive modeling techniques. The system receives billing information (e.g., in the form of a daily file and alert report), and pre-processes the billing information. The system then applies one or more predictive models to the information to identify billing errors. The results could be optionally sent to, and reviewed by, third party auditors, whereby their feedback could be incorporated into the results. A final report is generated by the system which indicates billing errors that require correction, thereby allowing an entity (e.g., a hospital) to correct such errors and to prevent revenue leakage. The system could apply more than one predictive model to detect errors, and can also cascade multiple models for increased performance.
The foregoing features of the invention will be apparent from the following Detailed Description of the Invention, taken in connection with the accompanying drawings, in which:
The present invention relates to a system and method for detecting billing errors using predictive modeling, as discussed in detail below in connection with
The system 10 can communicate through a network 18 with one or more clients, or auditors, to obtain daily file(s), obtain alert report(s), and/or transmit results. Network communication could be over the Internet using standard TCP/IP communications protocols (e.g., hypertext transfer protocol (HTTP), secure HTTP (HTTPS), file transfer protocol (FTP), secure file transfer protocol (SFTP), electronic data interchange (EDI), etc.), through a private network connection (e.g., wide-area network (WAN) connection, e-mails, electronic data interchange (EDI) messages, extensible markup language (XML) messages, file transfer protocol (FTP) file transfers, etc.), or any other suitable wired or wireless electronic communications format.
The computer system 12 could be any suitable computer server (e.g., a server with an INTEL microprocessor, multiple processors, single processing core, multiple processing cores, etc.) running any suitable operating system (e.g., Windows by Microsoft, Linux, UNIX, etc.). The computer system 12 includes non-volatile storage which could include disk storage (e.g., hard disk), flash memory, read-only memory (ROM), erasable, programmable ROM (EPROM), electrically-erasable, programmable ROM (EEPROM), or any other type of non-volatile memory. The computer system 12 could further include random access memory (RAM). The engine 16, discussed in greater detail below, could be embodied as computer-readable instructions stored in computer-readable media (e.g., the non-volatile memory mentioned above), and programmed in any suitable programming language (e.g., C, C++, Java, MATLAB, Python, Fortran, etc.). The server could also include a display and one or more input devices (e.g., keyboard, mouse, etc.).
The system 10 could be web-based and could allow for remote access to the system 10 over the network 18 by one or more devices, such as a personal computer system 20, a smart cellular telephone 22, a tablet computer 24, or other devices. It is possible that the billing error detection engine 16 could execute locally on the personal computer 20, smart cellular telephone 22, and/or tablet computer 24. It is conceivable that, in such circumstances, the device could communicate with a remote billing database over a network 18. Further, as noted above, the billing history database 14 need not be stored on the server 12, and indeed, billing data could be provided from one or more remote data sources, such as from a medical billing system 25 (e.g., associated with a hospital or other entity).
In step 34, the backend system uses the daily file to update the information in the billing history database 14. Then, in step 36, the backend system applies one or more predictive models to the updated information to detect billing errors in the daily file, and generates results. In step 38, the user, client, or system 12 decides whether the results of step 36 require review by an auditor (e.g., third party auditor). If so, in step 40 the results of the predictive model are updated based on the feedback of the auditors. Otherwise, the process proceeds to step 42, where the results are made accessible to, and reviewed by, the client.
It is noted that the system 10 could be implemented as a file-based system (e.g., wherein billing files are periodically transmitted to the system 10 for processing), or as a database-based system (e.g., wherein billing information is stored in a database accessible to the system 10, such as the billing history database 14, and/or a database in the medical billing system 25 of
Optionally, the computer system 12 could upload the results to one or more third party auditors 54 which review the results and fill in, or correct, codes or information as needed. The reviewed results are then sent back to the server 48 and in turn to the backend system 50 which consolidates or integrates the reviewed results. In either case, the final results are then sent from the SFTP server 48 to the client 46 for review.
Referring to
Importantly, the system can use different statistical models for inpatient data and outpatient data to accommodate differences in payment methodologies. For example, major inpatients can be billed using the Prospective Payment System (PPS), where the reimbursement to hospitals is based on Diagnosis Related Groups (DRGs). Usually the primary diagnosis, surgical procedures, and/or complications and comorbidities are used to assign each discharged patient to a DRG. Hospitals are reimbursed a fixed amount for the same DRG no matter what charges were made during a patient's hospital stay. As a result, the inpatient models target two types of outliers: extremely low charges and extremely high charges for a certain DRG. Extremely low charges due to billing errors may not result in more reimbursement for the potential missing charge because reimbursement is a fixed amount, but those errors could lower the average charges for the DRG, which could eventually lower the payment set for that DRG. For extremely high charges, the patient could be classified into a different DRG, which could potentially have a higher reimbursement rate.
One methodology that could be applied to inpatient data is Principal Component Analysis (PCA) 124. Every patient visit has charges associated with it and each charge has a department code assigned to it. All the charge level data can be “rolled up” and cumulative charges for each department can be used as the input variables for the PCA 124. An example of cumulative charges is shown in Table 3 below.
For better performance, PCA 124 can optionally be applied not directly to the charge values, but to the logarithmic values of the charges. PCA 124 is not robust with extreme outliers, so to improve results, the number of visits for each DRG can be filtered before applying PCA 124, such that if μ is the mean and σ is the standard deviation of the distribution of log(Σn charges), only visits that have (μ−1.5 σ)<log(charges)<(μ+1.5 σ) are retained.
For each DRG, PCA 124 is applied to data over one year, and then eigenvalues and eigenvectors are computed. The eigenvalues are sorted in descending order and the bottom 20% of the eigenvalues are used to calculate the Mahalanobis distance $\sum_{i=n}^{l} p_i^2/\lambda_i$, where l is the total number of principal components, n is the index of the first eigenvalue after the top 80%, $p_i$ is the value of the ith principal component for the record, and $\lambda_i$ is the corresponding ith eigenvalue. The Mahalanobis distance represents the score of the visit (i.e., error term or relative error for a visit).
Each new visit is converted to the same format and scored using the set of eigenvectors obtained for the DRG to which it belongs. After scoring, the data for the new visits is reconstructed using the top 80% eigenvectors and the mean and standard deviation of the log values of the department level charge distributions. The original department-hospital level average and reconstructed values are compared and the department with the highest difference is ranked 1 (and, so on) for each visit. The first ranked entry is considered to be the charge value with highest priority review for that visit. This predicts charging errors at the department level, but not individual missing charges for inpatient scoring. However, department and revenue codes can be combined to give a more granular estimate of missing charges.
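By way of non-limiting illustration, the PCA-based inpatient scoring described above could be sketched as follows. The function names `fit_drg_pca` and `score_visit`, the use of `log1p` to tolerate zero department charges, and the comparison of standardized log charges against their reconstruction are assumptions made for this sketch and are not taken from the description.

```python
import numpy as np

def fit_drg_pca(charges, keep=0.8, k_sigma=1.5):
    """Fit PCA for one DRG on a (visits x departments) array of cumulative charges."""
    total = np.log(charges.sum(axis=1))            # log of total charges per visit
    mu, sigma = total.mean(), total.std()
    kept = charges[np.abs(total - mu) < k_sigma * sigma]   # drop extreme visits
    logs = np.log1p(kept)                          # log1p tolerates zero charges (assumption)
    mean, std = logs.mean(axis=0), logs.std(axis=0)
    std[std == 0] = 1.0
    z = (logs - mean) / std
    eigvals, eigvecs = np.linalg.eigh(np.cov(z, rowvar=False))
    order = np.argsort(eigvals)[::-1]              # sort eigenvalues in descending order
    n_top = max(1, int(round(keep * len(order))))  # components kept for reconstruction
    return {"mean": mean, "std": std, "eigvals": eigvals[order],
            "eigvecs": eigvecs[:, order], "n_top": n_top}

def score_visit(model, charges, eps=1e-12):
    """Return (score, department ranking) for one visit's department charges."""
    z = (np.log1p(charges) - model["mean"]) / model["std"]
    p = model["eigvecs"].T @ z                     # principal component values
    n = model["n_top"]
    lam = np.maximum(model["eigvals"][n:], eps)
    score = float(np.sum(p[n:] ** 2 / lam))        # distance over the bottom components
    recon = model["eigvecs"][:, :n] @ p[:n]        # reconstruction from the top components
    ranking = np.argsort(np.abs(z - recon))[::-1]  # largest gap is ranked first
    return score, ranking
```

In this sketch, one model would be fitted per DRG on historical visits, and the first entry of `ranking` would identify the department flagged for highest-priority review.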
Another methodology that could be applied to inpatient data is an auto-encoder 126, which is a nonlinear extension of PCA 124 and can explore the nonlinearity in the data and can also accept binary and categorical inputs. The auto-encoder 126 is preferably a multi-layer, artificial neural network with special structure. The neural network includes an input layer, a number of considerably smaller hidden layers which will form the encoding, and an output layer where each neuron (or, processing element) has the same meaning as in the input layer. Similar to PCA 124, the trained auto-encoder 126 is applied to the new patient visits to reconstruct the charge values at the department level (or combined department and revenue code level). If the difference between the actual value and the reconstructed value is above a certain threshold, the visit should be reviewed by auditors.
For outpatient data, hospital reimbursement is based on fees charged for service (the most traditional payment mechanism), which means that a service is billed using a procedure code (e.g., HCPCS, current procedural terminology (CPT), International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM), etc.). The payer has a fee schedule with a set reimbursement amount for each service. The provider receives the fee schedule amount less any deductible or co-insurance owed by the patient. The outpatient predictive models, or advanced statistical modeling techniques 130, directly detect the missing codes, resulting in more reimbursement for hospitals. Exemplary outpatient predictive models, or advanced statistical modeling techniques 130, include, but are not limited to, supervised learning models 132, joint density learning models 134, quantity model 136, and cascade models 140. For at least some of these models, L1-regularization could be used to reduce over-fitting of training data.
The supervised learning model 132 learns the relation between data and their labels (e.g., charge codes). For instance, assume D is the total number of codes, and the patient visit data is represented as a binary vector x=(x1, . . . , xD), such that xi=1 if code i is present and xi=0 otherwise (where code i could represent a charge code, diagnosis code, procedure code, or any other code). For any code i, the supervised learning model 132 learns the probability of the presence of that code p(xi|x−i), where x−i=(x1, . . . , xi−1, xi+1, . . . , xD) is the rest of the codes. Supervised learning models 132 that could be used include, but are not limited to, logistic regression models 142, decision tree models 144, and local Naive Bayes models 146.
For logistic regression (LR) 142 the model assumes:
Here, b is the prior bias and w is a vector of weights that correspond to how each individual feature in x−i influences the probability of having xi. As such, the LR model 142 is trained for each potentially missing charge code. Often, the ratio of positive to negative training examples is very small. The number of negative visits should be down-sampled to ensure that the logistic regression training can learn properly. The charge codes are chosen based on frequency in the data as well as dollar value. Preferably, codes are chosen that appear often enough to train an accurate model, and whose dollar value is high enough.
The number of LR models 142 built depends on the number of codes that need to be evaluated (e.g., six thousand models). Patient data is scored by each individual LR model 142, and the probability of missing codes is calculated according to the formula above, which could be one of the inputs of the ensemble model 154, discussed in more detail below.
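As a minimal sketch only, the per-code logistic regression scoring and the down-sampling of negative visits could look like the following. The standard logistic form, the helper names `lr_probability` and `downsample_negatives`, and the `ratio` parameter are assumptions for illustration; the description does not specify the sampling ratio.

```python
import math
import random

def lr_probability(x_rest, weights, bias):
    """p(x_i = 1 | x_-i) under a standard logistic model (assumed form)."""
    z = bias + sum(w * x for w, x in zip(weights, x_rest))
    return 1.0 / (1.0 + math.exp(-z))

def downsample_negatives(examples, ratio=1.0, seed=0):
    """Keep every positive example and at most ratio * (#positives) negatives.

    examples: list of (features, label) pairs with label 1 or 0.
    """
    positives = [e for e in examples if e[1] == 1]
    negatives = [e for e in examples if e[1] == 0]
    keep = min(len(negatives), int(ratio * len(positives)))
    rng = random.Random(seed)          # fixed seed keeps the sketch reproducible
    return positives + rng.sample(negatives, keep)
```

One such model would be trained and applied for each candidate charge code, with the resulting probability passed on as a score.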
Decision tree (DT) models 144 can capture the nonlinearity between data and their labels. Unlike the LR model 142, the DT model 144 can be constructed to take into account multiple hospitals (e.g., 32,000 decision tree models can be constructed). Here, the probability p(xi|x−i) is modeled as a decision tree, which consists of decision nodes and leaf nodes. Each decision node holds the feature used to split the node and links to other nodes based on the presence or absence of that feature in a given test case. Each leaf node holds the probability of the presence of the code.
The decision tree is constructed by minimizing entropy, which is defined as $-\sum_x p(x)\log p(x)$. At the root node, the feature that minimizes entropy of the label is selected. The samples are then split into two groups based on the value of the split feature and recursively subsequent nodes are created. The process stops when there are insufficient samples to proceed or the entropy reduction is not substantial. At every leaf node, the probability of the label is calculated as (number of positive labels)/(number of labels), and stored. During scoring, the decision tree is traversed according to the values of the decision features, and when a leaf node is reached, the label probability associated with that leaf node is returned.
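For illustration only, the entropy-based choice of a split feature at a single node could be sketched as below. The base-2 logarithm and the helper names `entropy` and `best_split` are assumptions; the description does not fix the logarithm base.

```python
import math

def entropy(labels):
    """Entropy of a list of binary labels (base 2 is an assumption)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0)

def best_split(rows, labels):
    """Return (feature index, entropy) of the binary feature whose split
    gives the lowest weighted label entropy, or (None, parent entropy)
    when no feature reduces the entropy."""
    best, best_h = None, entropy(labels)
    for f in range(len(rows[0])):
        absent = [y for r, y in zip(rows, labels) if r[f] == 0]
        present = [y for r, y in zip(rows, labels) if r[f] == 1]
        h = (len(absent) * entropy(absent)
             + len(present) * entropy(present)) / len(labels)
        if h < best_h:
            best, best_h = f, h
    return best, best_h
```

Applying `best_split` recursively to each resulting group, until too few samples remain or the entropy reduction becomes negligible, would yield the tree; each leaf would then store the fraction of positive labels.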
The Local Naïve Bayes Model 146 is another supervised learning model 132 that creates neighborhoods for each visit and applies the standard Naive Bayes Model on the neighborhoods to recommend the missing codes for that visit. Compared with LR models 142 and DT models 144, this method is dynamic but sacrifices some model performance.
In order to determine the neighborhood for each visit, the similarity between visits must be defined. Since each visit can be represented as the set of codes associated with it, the cosine distance can be used as the similarity. For any two sets A, B, the similarity between them is:
Different weights can be assigned to the diagnosis codes, procedure codes, and HCPCS codes when computing the similarity. For example, the similarity score between two visits (x, y) in one of the algorithms can be:
$$\mathrm{Sim}(x, y) = s(H(x), H(y)) + S_1 \cdot s(D(x), D(y)) + S_2 \cdot s(P(x), P(y)) \qquad \text{(Equation 3)}$$
where $S_1, S_2 > 0$ are arbitrary constants, and H(·), D(·), and P(·) are the HCPCS codes, diagnosis codes, and procedure codes of visits, respectively. Finally, the neighborhood of each visit is the first K neighbors with the highest scores.
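The neighborhood construction above can be sketched as follows. The set form of cosine similarity used here, |A∩B| / sqrt(|A|·|B|), is an assumed reading of the similarity between two code sets, and the dictionary keys `hcpcs`, `dx`, and `proc` are hypothetical names chosen for this example.

```python
import math

def set_cosine(a, b):
    """Cosine similarity between two code sets (assumed set form)."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def visit_similarity(x, y, s1=1.0, s2=1.0):
    """Weighted similarity over HCPCS, diagnosis, and procedure code sets."""
    return (set_cosine(x["hcpcs"], y["hcpcs"])
            + s1 * set_cosine(x["dx"], y["dx"])
            + s2 * set_cosine(x["proc"], y["proc"]))

def neighborhood(visit, history, k):
    """Return the k historical visits most similar to the given visit."""
    ranked = sorted(history, key=lambda v: visit_similarity(visit, v), reverse=True)
    return ranked[:k]
```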
The Naïve Bayes Model 146 is then used to estimate the probability p(xi|x−i):
The ratio of the two probabilities is then used to remain numerically stable:
Each term on the right side is calculated from the neighborhood using a Laplace smoothing. With this ratio, a threshold test is performed to determine how much more probable it is that the potentially missing code xi should be in visit x−i.
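As one possible sketch, the probability ratio could be computed in the log domain from the neighborhood with Laplace smoothing. The function name `nb_log_ratio`, the smoothing constant `alpha`, and the simplification that only codes present in the visit contribute to the product are assumptions made for this example.

```python
import math

def nb_log_ratio(candidate, visit_codes, neighbors, alpha=1.0):
    """Estimate log[p(x_i=1 | x_-i) / p(x_i=0 | x_-i)] for a candidate code.

    neighbors: list of code sets forming the neighborhood of the visit.
    Only codes present in the visit are used as evidence (simplification).
    """
    with_c = [n for n in neighbors if candidate in n]
    without = [n for n in neighbors if candidate not in n]
    # Laplace-smoothed prior odds of the candidate code
    log_ratio = math.log((len(with_c) + alpha) / (len(without) + alpha))
    for code in visit_codes:
        p1 = (sum(code in n for n in with_c) + alpha) / (len(with_c) + 2 * alpha)
        p0 = (sum(code in n for n in without) + alpha) / (len(without) + 2 * alpha)
        log_ratio += math.log(p1 / p0)
    return log_ratio
```

A positive value indicates that, within the neighborhood, the candidate code is more likely present than absent given the codes already on the visit; the threshold test would be applied to this value.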
As discussed above, a joint-density learning model 134 can be applied to outpatient data. Rather than receiving an explicit label for missing charges (as in the supervised learning models 132 discussed above), the joint-density learning model 134 tries to learn the complex interdependencies between charge codes, diagnosis codes, and other informative visit data without a predetermined notion of what is “right” or “wrong.” Here the binary vector x=(x1, . . . xD) is still used to represent the presence of charge codes, diagnoses codes, and procedure codes as well as any other patient visit data. Three exemplary joint-density learning models 134 are the Restricted Boltzmann Machine model 148, the Bernoulli Mixture Model 150, and the Gaussian Missing Data model 152.
The Restricted Boltzmann Machine model (RBM) 148 draws from statistical thermodynamics to compute whether or not a particular charge code should be present. The RBM 148 consists of two layers: the visible layer x=(x1, . . . , xD) whose units represent patient visit data, and the hidden layer h=(h1, . . ., hn) whose units are linked to the units of the visible layer. The model functions in two stages: (1) visible units trigger the state of the hidden units; and (2) the hidden units re-trigger the states of the visible units. The visible and hidden units are triggered stochastically. Each hidden unit is triggered according to the following probability distribution:
Here, bj is the bias of hidden unit j and Wj is the set of weights that represent the influence that the visible nodes x have on the behavior of hidden node hj. Visible nodes are triggered according to the distribution:
Similar to the notation for hidden node activation, ai is the bias for visible unit i and Wi is the set of weights that influence the activation of visible node i with respect to the hidden states h. The weights Wi, Wj are columns and rows, respectively, of the same weight matrix W.
Patient visit data is grouped first according to hospital, then according to primary diagnosis code. Thus, the RBMs 148 are trained on a very local level of data. The diagnosis groups are chosen such that each group has roughly the same number of training examples. Within each diagnosis group, the visits are converted into the binary vector x and are used as examples from which the RBM 148 can learn. For scoring, the appropriate RBM 148 model is selected according to the hospital and primary diagnosis. Then, the patient data is converted to binary form. This input is passed into the model, which undergoes the two stages described above. Any new re-triggered visible nodes indicate a high probability of missing charges.
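A minimal sketch of the two-stage RBM scoring pass described above is given below. It assumes the conventional sigmoid activation for both layers, stores the weight matrix as `W[i][j]` (visible unit i, hidden unit j), and flags absent codes whose re-triggered probability exceeds a hypothetical `threshold` rather than sampling the visible layer; all of these are illustrative choices.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_probs(x, W, b):
    """Stage 1: probability that each hidden unit is triggered by the visible units."""
    return [sigmoid(b[j] + sum(W[i][j] * x[i] for i in range(len(x))))
            for j in range(len(b))]

def visible_probs(h, W, a):
    """Stage 2: probability that each visible unit is re-triggered by the hidden units."""
    return [sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(a))]

def suggest_missing(x, W, a, b, rng, threshold=0.5):
    """Return indices of codes absent from x that the model re-triggers."""
    h = [1 if rng.random() < p else 0 for p in hidden_probs(x, W, b)]  # stochastic trigger
    p_visible = visible_probs(h, W, a)
    return [i for i, (xi, pi) in enumerate(zip(x, p_visible))
            if xi == 0 and pi > threshold]
```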
The Bernoulli Mixture Model (BMM) 150 is a special mixture model with the assumption that the binary data points for each component are generated by a Bernoulli distribution. Similar to the other methods, each patient visit is formulated as a binary vector x=(x1, . . . , xD). The hidden variable is a multinomial label z ∈ {1, 2, . . . , k} that can be viewed as assigning each visit vector to one of k clusters. The joint distribution of the BMM 150 is given by:
Here, the parameter πz=p(z|π) denotes the prior probability of the latent variable z, while the parameter μiz=p(xi=1|z, μ) denotes the conditional means of the observed variable xi.
It is noted that an expectation-maximization (EM) algorithm can be used to estimate parameters that maximize the likelihood Πn p(xn|π, μ) of the visits in the historical patient data. The number of clusters k is determined with Bayesian Information Criterion (BIC). Similar to RBM 148, the BMM 150 is built for the same diagnosis groups for each hospital.
The trained BMM 150 is then applied to detect the missing code for a new visit. Let e={xi
Here, m denotes the D−l remaining codes that do not exist in the visit. There is no efficient way to maximize the above equation over all $2^{D-l}$ possible ways to complete the visit. Therefore, the individual posterior probability p(xi=1|e=1) is calculated for each possible missing code i. Then all possible missing codes whose posterior probabilities exceed some threshold are recommended.
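The per-code posterior can be sketched as follows, assuming that the evidence consists of the codes present in the visit and that the posterior is obtained by weighting each cluster's conditional mean by the cluster responsibility given that evidence. The names `bmm_posteriors` and `recommend` and the default threshold are hypothetical.

```python
def bmm_posteriors(present, pi, mu):
    """Posterior p(x_i = 1 | present codes) for every absent code i.

    present: set of indices of codes present in the visit (the evidence)
    pi[z]:   prior probability of cluster z
    mu[z][i]: p(x_i = 1 | z)
    """
    weights = []
    for z, prior in enumerate(pi):
        w = prior
        for j in present:
            w *= mu[z][j]               # likelihood of the present codes under cluster z
        weights.append(w)
    total = sum(weights)
    resp = [w / total for w in weights]  # p(z | evidence)
    return {i: sum(r * mu[z][i] for z, r in enumerate(resp))
            for i in range(len(mu[0])) if i not in present}

def recommend(present, pi, mu, threshold=0.5):
    """Return the absent codes whose posterior exceeds the threshold."""
    post = bmm_posteriors(present, pi, mu)
    return sorted(i for i, p in post.items() if p > threshold)
```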
In the Gaussian Missing Data model (GMD) 152, each patient visit is treated as a binary set (only 0 or 1) corresponding to the charge codes, diagnoses, etc. that are observed. The model then tries to suggest other codes that should be present, as well. Let x=(x1, . . . xD) be the binary vector representing the presence of charge, diagnoses, and procedure codes as well as any other patient visit data. Under the GMD model, x is a Gaussian random vector with mean μ and covariance matrix R. The elements of x are split into two groups: indices where a code is present and indices where a code is not present. Denote the two index sets as S and T, respectively. RS is the submatrix of R whose rows and columns are in S. Similarly, μS, μT are the subvectors of μ whose indices are in S and T, respectively, and RTS is the submatrix of R whose rows are in T and whose columns are in S. Last, y is the vector of observed codes for a particular visit, specifically in this case, a vector of ones whose length is equal to the number of codes in the bill. An estimate of the probability of missing codes is given by:
$$\hat{x} = E\{x \mid y, \mu, R\} = R_{TS} R_S^{-1} (y - \mu_S) + \mu_T \qquad \text{(Equation 10)}$$
An EM technique is used to train an estimate for R and μ from historical data. Informally, the initial estimates for R and μ are the co-occurrence counts between codes and the relative frequency between codes, respectively. In fact, these first estimates produce good results in model scoring without need for further EM steps. Unlike RBM 148 and BMM 150, the GMD model 152 is built for each hospital due to its efficient implementation.
Each patient visit is converted to the binary vector form x. Then the sets S and T are determined in order to select the submatrices RTS, RS and subvectors μS, μT. The formula above is then evaluated and elements of {circumflex over (x)} whose values are close to 1 indicate a probable missing charge code.
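By way of example only, the conditional-mean estimate above could be evaluated with NumPy as sketched below. The function name `gmd_estimate` is hypothetical, and a linear solve is used in place of an explicit matrix inverse as an implementation choice.

```python
import numpy as np

def gmd_estimate(x, mu, R):
    """Estimate values for absent codes given the present codes of one visit.

    x:  binary vector of length D (1 = code present)
    mu: mean vector of length D
    R:  D x D covariance matrix
    Returns {index of absent code: estimated value}.
    """
    S = np.flatnonzero(x == 1)          # indices of present codes
    T = np.flatnonzero(x == 0)          # indices of absent codes
    y = np.ones(len(S))                 # observed codes are all ones
    R_S = R[np.ix_(S, S)]               # rows and columns in S
    R_TS = R[np.ix_(T, S)]              # rows in T, columns in S
    x_hat = R_TS @ np.linalg.solve(R_S, y - mu[S]) + mu[T]
    return dict(zip(T.tolist(), x_hat.tolist()))
```

Entries of the returned mapping with values close to 1 would correspond to probable missing charge codes.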
A quantity model 136 could be used to detect the partially missing charges for observation hours, surgery hours, anesthesia hours, recovery hours, etc. Although most of the charges need only binary recommendations (i.e. either present or absent), there are several other charges that require quantitative predictions. When a charge is present, but the charged quantity is less than expected, it is an undercharged quantity.
Since many of these quantity variables have multiple charge codes associated with them, a mapping from charge codes to the quantity variables could be created, such as shown in Table 4 below:
Extra fields could be calculated (e.g., stay duration) to better model quantities. The quantity modeling consists of two steps: variable selection and regression. In the variable selection step, the set of predictor variables is initialized to the empty set. Incrementally, variables from the pool are added to minimize the mean square residual of the target quantity. This step is repeated until the improvement in terms of residuals is smaller than a threshold. Once the predictor variables are set, a linear regression is used to construct a quantitative prediction model to predict quantities. For each model, the residual root mean square error is also noted.
For each quantitative variable, the predicted value is compared to the current value of the variable. If the difference is higher than a threshold (which is a product of mean square error of the model and a pre-decided constant) and the current value is lower than the predicted value, a recommendation is made to increase quantity of this variable.
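The two-step quantity modeling and the threshold test could be sketched as follows. The helper names, the `min_gain` stopping value, the constant `c`, and the use of the root mean square error as the residual scale in the threshold are assumptions for illustration.

```python
import numpy as np

def fit_ls(X, y):
    """Least-squares fit with intercept; returns (coefficients, mean square residual)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return coef, float(np.mean(resid ** 2))

def forward_select(X, y, min_gain=1e-2):
    """Greedily add the variable that most reduces the mean square residual."""
    selected, best_mse = [], float(np.var(y))
    while True:
        trials = [(fit_ls(X[:, selected + [j]], y)[1], j)
                  for j in range(X.shape[1]) if j not in selected]
        if not trials:
            break
        mse, j = min(trials)
        if best_mse - mse < min_gain:   # improvement too small: stop
            break
        selected.append(j)
        best_mse = mse
    coef, mse = fit_ls(X[:, selected], y)
    return selected, coef, mse ** 0.5   # also report the residual RMSE

def recommend_increase(x_row, current, selected, coef, rmse, c=2.0):
    """True when the charged quantity falls short of the prediction by more than c * rmse."""
    predicted = coef[0] + float(np.dot(coef[1:], x_row[selected]))
    return bool(predicted - current > c * rmse and current < predicted)
```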
A cascade model 140 could also be utilized by the system to capture the complicated relationship between codes and to improve prediction accuracy and performance. The first stage of the cascade model is an ensemble model 154 (itself a cascade model) that combines a number of individual models (e.g., supervised learning models 132, joint-density models 134, and/or quantity models 136), and the second stage is a feedback model 158 which learns from the feedback of professional coders. At least one of the individual models used in the ensemble model 154 could utilize a normalization model 156. Any individual model can be used in the ensemble model. Any other suitable model structures can be used as the outpatient model. The remaining features are based on information from the account receiving the code recommendation. Binary indicators are created for variables such as the patient's type, subtype, financial class, and day of week of discharge. A quantity model 136 could be used with, but separate from, the ensemble model 154.
The normalization model 156 obtains positive training examples by (1) removing one charge code from a patient visit, (2) scoring the altered visit using the appropriate LR 142, RBM 148, or DT 144 model, saving the (code, score) pair, (3) repeating steps 1-2 for each code in the patient visit, and (4) repeating steps 1-3 for each visit in historical data. Negative examples are created by (1) scoring an unaltered visit using the appropriate LR 142, RBM 148, or DT 144 model, (2) saving the top 100 (code, output) pairs, ordered by score, and (3) repeating 1-2 for each visit in historical data.
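The construction of training examples for the normalization model can be sketched as below. Here `score_fn(codes, code)` is a hypothetical stand-in for the appropriate LR, RBM, or DT scorer, and `candidates` is an assumed list of codes that the scorer can recommend.

```python
def positive_examples(visits, score_fn):
    """Remove each code in turn, score the altered visit, and keep (code, score)."""
    pairs = []
    for codes in visits:
        for code in codes:
            altered = codes - {code}                 # visit with one code removed
            pairs.append((code, score_fn(altered, code)))
    return pairs

def negative_examples(visits, score_fn, candidates, top_n=100):
    """Score each unaltered visit and keep its top_n (code, score) pairs."""
    pairs = []
    for codes in visits:
        scored = [(c, score_fn(codes, c)) for c in candidates if c not in codes]
        scored.sort(key=lambda cs: cs[1], reverse=True)
        pairs.extend(scored[:top_n])
    return pairs
```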
For normalizing the LR 142 and DT 144 models, the inputs into the normalization model 156 are the model score (i.e., LR score 172, DT score 174, and RBM score 176) and a binary indicator variable corresponding to the charge code (which is equivalent to the model used). For the RBM normalization, the inputs are the RBM score 176, binary indicator for charge code 180, and binary indicator for diagnosis group 182. The normalization models 156 use the L1-regularized logistic regression model described previously.
Then, normalized LR 184, RBM 188, and DT 186 models (e.g., processed outputs) are joined or combined with the GMD score 178 of the GMD model 152 to form the final ensemble model 154, which uses the L1-regularized logistic regression model described previously.
Positive and negative training examples are created in a similar way as for model normalization, except that the normalized scores are recorded. There are 9 inputs into the ensemble model 154, two per model and one overall bias term 192. The two inputs per model are: (1) normalized scores (i.e., normalized LR score 184, normalized DT score 186, normalized RBM score 188, GMD score 178); and (2) a binary indicator for presence of a score for each model (indicated as 194 in
In addition to the ensemble model 154, a second layer model (feedback model 158) is trained to target the feedback received from the client's auditors. The feedback model 158 learns from feedback to further refine the results. For example, if the electrocardiography (EKG) is always delayed for one hospital (which usually triggers the alarm of the ensemble model) the feedback model could learn to suppress it. Logistic regression is used in this implementation, but other classifiers are suitable.
The features used by the feedback model 158 come from either the ensemble model output or from information on the account itself. The predicted code itself is also used, along with several derivative features which aim to take advantage of the partially hierarchical structure of the coding systems. Thus, the model takes as input the predicted code 200, its ensemble score 196 (i.e., ensemble model output), and additional account-related information 202. The output is the probability that the client (or client's auditor) accepts the code, indicated by block 204. If the code predicted is a CPT or HCPCS code (5 characters), then four binary indicator features are activated: an indicator for the full code, plus three indicators for the first one, two, and three characters of the code, respectively. On the other hand, if the code predicted comes from a hospital chargemaster, then only two binary features are activated: an indicator for the full code (3-digit department code+5-digit charge code), plus another indicator for the 3-digit department code alone.
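The hierarchical indicator features described above could be generated as in the following sketch. The feature-name strings and the `source` labels `"cpt_hcpcs"` and `"chargemaster"` are hypothetical conventions chosen for this example.

```python
def code_indicator_features(code, source):
    """Return the names of the binary indicator features activated for a predicted code.

    source: "cpt_hcpcs" for 5-character CPT/HCPCS codes, or "chargemaster"
            for a 3-digit department code followed by a 5-digit charge code.
    """
    if source == "cpt_hcpcs":
        # full code plus its first one, two, and three characters
        return [f"code={code}"] + [f"prefix{n}={code[:n]}" for n in (1, 2, 3)]
    if source == "chargemaster":
        # full code plus the leading 3-digit department code
        return [f"code={code}", f"dept={code[:3]}"]
    raise ValueError(f"unknown code source: {source}")
```

The activated indicators would then be combined with the ensemble score and the account-related features as inputs to the feedback classifier.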
It is noted that the training set could be expanded by tracking the future appearance of a code on a visit as a proxy, which is usually caused by the manual review or the delay of hospital billing systems. That is, predictions are made given a snapshot of the visit data on a past date, and then the correctness of each prediction is judged by the appearance of the predicted code in later days. Also, the feedback model 158 could be biased on delayed codes. For these reasons, examples of real feedback are given higher weight in training than the proxy labels.
In addition to expanding the training set, L1 regularization could be used to prevent over-fitting to noise in the auditor feedback. A parameter search can be used to select the regularization strength and the learning rate of the logistic regression training. Holdout validation can be used to compare the effectiveness of the models, with the models trained on data collected continuously over two months, and then tested on data for the following two weeks. The metric for performance is the false positive rate at 95% recall of positive examples, since this is roughly the target point on the Receiver Operating Characteristic (ROC) curve, but other choices for operating points would also be valid.
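For illustration, the performance metric described above could be computed as sketched below. The function name `fpr_at_recall` is hypothetical, and the sketch assumes the holdout set contains at least one positive and one negative example.

```python
def fpr_at_recall(scores, labels, target_recall=0.95):
    """False positive rate at the highest score threshold reaching the target recall.

    scores: model outputs; labels: 1 for accepted codes, 0 for rejected codes.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # examples sharing the same score cross the threshold together
        while i < len(ranked) and ranked[i][0] == threshold:
            tp += ranked[i][1]
            fp += 1 - ranked[i][1]
            i += 1
        if tp / n_pos >= target_recall:
            return fp / n_neg
    return 1.0
```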
Having thus described the invention in detail, it is to be understood that the foregoing description is not intended to limit the spirit or scope thereof. It will be understood that the embodiments of the present invention described herein are merely exemplary and that a person skilled in the art may make any variations and modification without departing from the spirit and scope of the invention. All such variations and modifications, including those discussed above, are intended to be included within the scope of the invention. What is desired to be protected is set forth in the following claims.
This application claims priority to U.S. provisional Patent Application No. 61/659,175 filed on Jun. 13, 2012, which is incorporated herein in its entirety by reference and made a part hereof.
Number | Date | Country
---|---|---
61659175 | Jun 2012 | US