This application is a priority application.
The present inventions relate to machine learning and artificial intelligence and, more particularly, to a method and system for detecting machine learning instability.
Paymode-X is a cloud-based, invoice-to-pay service that optimizes the accounts payable process. An accounts payable department retains the invoice-to-pay service to handle payments. Each vendor signs up for the service, and invoices are received, processed, approved, and paid through the invoice-to-pay service. The invoice-to-pay service relies on the integrity of the vendor database, and preventing fraudulent payments is important. To prevent fraud, vendors must be vetted to screen out malicious actors. Vendors are vetted using machine learning algorithms.
In the long-running use of a machine learning algorithm, the distribution of various features used in the machine learning model can drift over time. For instance, a model that checked whether a vendor address was a residential address may not take working from home into account. The relevance of working from home changed dramatically in 2020 with the COVID pandemic, as vendors' accounts receivable clerks started working from home. The relevance of residential addresses changed in 2020, and their influence on the model needs to be reassessed. Current machine learning models do not check for changes in the relevance of features to the model. An improvement to machine learning models is needed to identify drifting features in a model and to alert users of the changes in the model. The present inventions provide this improvement.
In an alternate scenario, a monitoring tool in a medical facility that watches for improper access to medical records may flag an access to medical records from a residential IP address. In the past, the machine learning model determined that residential IP addresses were outside of the medical facility and likely improper. But with the rapid increase in telemedicine in 2020, the machine learning model needs to shift dramatically to account for doctors working from home. Similarly, the GPS location where a pharmaceutical prescription is written, and its relevance to a drug monitoring machine learning model, has changed in response to the COVID pandemic. The IP address (or GPS location) feature and its influence on the machine learning model need to be reassessed. Current machine learning models do not check for changes in the relevance of features to the model. An improvement to machine learning models is needed to identify drifting features in a model and to alert users of the changes in the model. The present inventions provide this improvement.
An improved machine learning method is described herein. The method comprises (1) creating a first machine learning model with training data, (2) periodically adjusting the first machine learning model with production data to create a second machine learning model, (3) creating a training dataset by processing the training data through the first machine learning model, (4) creating a prediction dataset by processing the production data through the second machine learning model, and (5) looping through each feature in the prediction dataset, (5a) determining a p-value by comparing the feature in the prediction dataset to the feature in the training dataset, and (5b) if the p-value is less than a constant (alpha) and a confidence interval for the training dataset does not overlap the confidence interval for the prediction dataset, creating an alert.
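By way of illustration, the following is a minimal sketch in Python of step (5) for numeric features, assuming a scipy Welch T-test as the statistical comparison, dictionary- or dataframe-style dataset containers, and a hypothetical confidence_interval helper; the alpha value and all names are illustrative assumptions rather than a required implementation.

```python
# Minimal sketch of steps (5), (5a), and (5b); names and alpha are illustrative.
import numpy as np
from scipy import stats

ALPHA = 0.05  # the constant (alpha); 0.05 is an assumed default

def confidence_interval(values, confidence=0.95):
    """Mean plus or minus the margin of error at the given confidence level."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    margin = stats.sem(values) * stats.t.ppf((1 + confidence) / 2, len(values) - 1)
    return mean - margin, mean + margin

def intervals_overlap(a, b):
    """True if the two (low, high) intervals overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

def feature_stability_alerts(training_dataset, prediction_dataset):
    """Loop over the features, test each one, and collect alerts."""
    alerts = []
    for feature in prediction_dataset:                                # step (5)
        train_values = np.asarray(training_dataset[feature], dtype=float)
        pred_values = np.asarray(prediction_dataset[feature], dtype=float)
        _, p_value = stats.ttest_ind(train_values, pred_values, equal_var=False)  # step (5a)
        if p_value < ALPHA and not intervals_overlap(                 # step (5b)
                confidence_interval(train_values), confidence_interval(pred_values)):
            alerts.append(f"feature '{feature}' drifted (p-value {p_value:.4g})")
    return alerts
```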
In some embodiments, the improved machine learning method further comprises performing a T-test to determine the p-value. In some embodiments, the improved machine learning method further comprises performing a binomial proportions test to determine the p-value. In some embodiments, the improved machine learning method further comprises automatically adjusting the first or the second machine learning model based on the alert. In some embodiments, the improved machine learning method further comprises creating a plot of the feature in the prediction dataset. The first machine learning model could be created using a Densicube algorithm, a K-means algorithm, or a Random Forest algorithm. The determination of the overlap in the confidence intervals could use a mean and a margin of error.
A method for creating machine learning model performance alerts is also described here. The method includes (1) creating a first machine learning model with training data, (2) adjusting the first machine learning model with production data to create a second machine learning model, (3) creating a training dataset by processing the training data through the first machine learning model, (4) creating a prediction dataset by processing the production data through the second machine learning model, and (5) looping through each feature in the prediction dataset, (5a) determining a p-value by comparing the feature in the prediction dataset to the feature in the training dataset, and (5b) if the p-value is less than a constant (alpha) and a confidence interval for the training dataset does not overlap the confidence interval for the prediction dataset, creating the machine learning model performance alert.
The present inventions are now described in detail with reference to the drawings. In the drawings, each element with a reference number is similar to other elements with the same reference number independent of any letter designation following the reference number. In the text, a reference number with a specific letter designation following the reference number refers to the specific element with the number and letter designation and a reference number without a specific letter designation refers to all elements with the same reference number independent of any letter designation following the reference number in the drawings.
It should be appreciated that many of the elements discussed in this specification may be implemented in a hardware circuit(s), a processor executing software code or instructions which are encoded within computer-readable media accessible to the processor or a combination of a hardware circuit(s) and a processor or control block of an integrated circuit executing machine-readable code encoded within a computer-readable media. As such, the term circuit, module, server, application, or other equivalent description of an element as used throughout this specification is, unless otherwise indicated, intended to encompass a hardware circuit (whether discrete elements or an integrated circuit block), a processor or control block executing code encoded in a computer-readable media, or a combination of a hardware circuit(s) and a processor and/or control block executing such code.
This document describes a framework that can be used to monitor model performance over time. Visualizations are created to show the distribution of model performance over time. Model performance is monitored by evaluating how well the model fits the test data. The test dataset is part of the train-validation-test split created when building the model; how well the model fits the test set over time can be evaluated using different performance metrics, and the distribution of performance over time can be seen in the visualizations.
The process starts 101 with the creation of the machine learning model 102. The machine learning model could be created 102 using any number of machine learning algorithms, such as Random Forest, K-Means, or Densicube (see U.S. Pat. No. 9,489,627 by Jerzy Bala and U.S. patent application Ser. No. 16/355,985 by Jerzy Bala and Paul Green, both incorporated herein in their entirety by reference), among others. The machine learning algorithm is trained using training data to create the machine learning model 102.
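For concreteness, one possible sketch of the model creation 102 follows, using scikit-learn's RandomForestClassifier to stand in for the algorithms named above; the split ratio and the function name are assumptions for illustration only.

```python
# Illustrative sketch of creating the machine learning model 102 from training data.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def create_model(features, labels):
    # Hold out a test set (part of a train-validation-test split); the test set
    # later supplies the outcome dataframe used for monitoring.
    X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2)
    model = RandomForestClassifier()
    model.fit(X_train, y_train)
    return model, X_test, y_test
```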
Periodically, the machine learning model is updated 103 using the new data saved from running the machine learning model 104. When the machine learning model is updated 103, an outcome dataframe entry is added 107 to the outcome dataframe (see Table 1). The updated machine learning model is then used to process production data through the machine learning model 104. After running the data through the model 104, the machine learning model is monitored 105 to see if the features have drifted over time from the model created with the training dataset. The details of this monitoring are described below.
If there are no alerts 106, the period is checked 121 to see if it is time to update the model 103. If so, the model is updated 103. If not, the next set of production data is processed through the machine learning model 104. The period could be a count of the number of transactions processed (every ten, hundred, or thousand transactions, for example) or it could be a set time (every day at midnight, every week, every month, etc.).
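The processing loop just described might be organized as in the following schematic sketch; the model, the batch source, and the monitor, update, and alert callables are passed in as parameters, and the transaction-count period is an assumed example value.

```python
# Schematic sketch of the loop 103, 104, 105, 106, 121; all callables are supplied
# by the caller, so nothing here is tied to a particular implementation.
def run_model(model, production_batches, monitor, update, send_alerts, update_every=1000):
    processed_since_update = 0
    for batch in production_batches:
        model.predict(batch)                        # process production data 104
        alerts = monitor(model, batch)              # monitor the machine learning model 105
        if alerts:                                  # alerts? 106
            send_alerts(alerts)
        processed_since_update += len(batch)
        if processed_since_update >= update_every:  # period check 121
            model = update(model, batch)            # update the model 103
            processed_since_update = 0
    return model
```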
Next, we look at the monitoring of the machine learning model 105 in more detail.
Specifically, the monitor machine learning model routine 105 starts by obtaining the outcome dataframes 201. An outcome dataframe is created 107 each time the model is trained and the performance on the test set is known. The outcome dataframe is the test set from the train-validation-test split made while training the model. The outcome dataframe has a unique identifier for each observation, the score generated by the model, the actual outcome, and the predicted outcome. With the outcome dataframes, the model is checked for performance alerts 202. The evaluation of the model for performance alerts is further described below.
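The outcome dataframe just described might be assembled as in the following sketch; the column names and the scikit-learn style predict()/predict_proba() interface are assumptions for illustration.

```python
# Sketch of creating an outcome dataframe entry 107 from the held-out test set.
import pandas as pd

def build_outcome_dataframe(model, X_test, y_test, observation_ids):
    return pd.DataFrame({
        "id": observation_ids,                       # unique identifier for each observation
        "score": model.predict_proba(X_test)[:, 1],  # score generated by the model
        "actual": y_test,                            # actual outcome
        "predicted": model.predict(X_test),          # predicted outcome
    })
```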
Next, the feature stability alerts are created 204. These feature stability alerts 204 are further described below.
Using the outcome dataframe, different performance metrics are calculated, such as precision, recall, and accuracy. These performance metrics have set thresholds in the monitor config file. The model monitoring framework then checks for model performance and feature stability alerts. It also creates performance and feature stability plots. The alerts and plots are then returned to the data scientist by email whenever the model is used to process data.
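A hypothetical example of the monitor config file mentioned above is shown below as a Python dictionary; the keys, values, and email address are illustrative assumptions rather than the actual configuration format.

```python
# Hypothetical monitor configuration; every key and value is an assumption.
MONITOR_CONFIG = {
    "alpha": 0.05,              # significance level for the feature stability tests
    "confidence_level": 0.95,   # confidence interval used in the plots and overlap check
    "n_parts": 4,               # number of random parts per outcome file
    "thresholds": {             # performance metric thresholds that trigger alerts
        "precision": 0.90,
        "recall": 0.85,
        "accuracy": 0.90,
    },
    "alert_email": "data-science-team@example.com",
}
```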
Each outcome file 301, 302 is divided randomly into four to five parts 301a, 301b, 301c, 301d, 302a, 302b, 302c, 302d to get a confidence interval estimate 303, which can be controlled using a parameter in the config file. The performance metrics, such as precision, recall, and accuracy, are calculated for each of the parts 301a-d, 302a-d. The performance metric values for all the parts 301a-d, 302a-d of the outcome dataframes are stored in the performance dataframe 304. The performance dataframe has seven columns: the date of the outcome file, the part number, the precision of class 0 for the part (precision_0), the precision of class 1 for the part (precision_1), the recall of class 0 for the part (recall_0), the recall of class 1 for the part (recall_1), and the accuracy for the part.
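One way to build the performance dataframe 304 is sketched below; the column names follow the description above, while the shuffling code, the default of four parts, and the outcome dataframe column names are assumptions carried over from the earlier sketches.

```python
# Sketch of building the performance dataframe 304 rows for one outcome file.
import math
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score

def performance_rows(outcome_df, outcome_date, n_parts=4):
    shuffled = outcome_df.sample(frac=1.0).reset_index(drop=True)  # random division into parts
    part_size = math.ceil(len(shuffled) / n_parts)
    rows = []
    for part_no in range(n_parts):
        part = shuffled.iloc[part_no * part_size:(part_no + 1) * part_size]
        y_true, y_pred = part["actual"], part["predicted"]
        rows.append({
            "date": outcome_date,
            "part": part_no,
            "precision_0": precision_score(y_true, y_pred, pos_label=0),
            "precision_1": precision_score(y_true, y_pred, pos_label=1),
            "recall_0": recall_score(y_true, y_pred, pos_label=0),
            "recall_1": recall_score(y_true, y_pred, pos_label=1),
            "accuracy": accuracy_score(y_true, y_pred),
        })
    return pd.DataFrame(rows)
```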
The performance dataframe is read and used as the data to create the plots 305. The distribution of precision, recall, and accuracy over time is visualized. A confidence interval of 95% is generally used in the plots, but the confidence interval is a configurable parameter that can be changed. The performance plots are then returned.
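The performance plots 305 could be produced along the lines of the following sketch, which plots the mean of one metric per outcome date with a 95% confidence interval; the errorbar style and the grouping by the date column are assumptions building on the performance dataframe sketched above.

```python
# Sketch of a performance plot 305 built from the performance dataframe.
import matplotlib.pyplot as plt
from scipy import stats

def plot_metric(performance_df, metric="accuracy", confidence=0.95):
    grouped = performance_df.groupby("date")[metric]
    means = grouped.mean()
    margins = grouped.sem() * stats.t.ppf((1 + confidence) / 2, grouped.count() - 1)
    plt.errorbar(means.index, means.values, yerr=margins.values, fmt="-o")
    plt.xlabel("outcome file date")
    plt.ylabel(metric)
    plt.title(f"{metric} over time ({int(confidence * 100)}% confidence interval)")
    plt.show()
```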
Looking next at the calculation of the performance metrics in more detail.
For each part of each dataframe, calculate the recall 403, the precision 404, and the accuracy 405. Accuracy is calculated 405 as:

Accuracy=(true positives+true negatives)/(true positives+true negatives+false positives+false negatives)

Precision is calculated 404 as:

Precision_1=true positives/(true positives+false positives)

Precision_1 is the precision for the positives and precision_0 is the precision for the negatives (i.e., use true negatives and false negatives in place of the positive values, so that Precision_0=true negatives/(true negatives+false negatives)).
Recall is calculated 403 as:

Recall_1=true positives/(true positives+false negatives)

Recall_1 is the recall for the positives and recall_0 is the recall for the negatives (i.e., use true negatives in place of the true positives and false positives in place of the false negatives, so that Recall_0=true negatives/(true negatives+false positives)).
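For clarity, the metric formulas above can be written directly from the confusion-matrix counts of a single part, as in this small sketch (the argument names are illustrative):

```python
# Worked sketch of the accuracy, precision, and recall formulas above.
def part_metrics(true_pos, true_neg, false_pos, false_neg):
    accuracy = (true_pos + true_neg) / (true_pos + true_neg + false_pos + false_neg)
    precision_1 = true_pos / (true_pos + false_pos)  # precision for the positives
    precision_0 = true_neg / (true_neg + false_neg)  # precision for the negatives
    recall_1 = true_pos / (true_pos + false_neg)     # recall for the positives
    recall_0 = true_neg / (true_neg + false_pos)     # recall for the negatives
    return accuracy, precision_1, precision_0, recall_1, recall_0
```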
For each feature 502, check if the raw feature is numeric (614, 615) or categorical (613, 617, 618, 619) 503. If the feature is numeric, then a T-test 511 is performed with the null hypothesis that the distribution of the feature in the training dataset is the same as the distribution of the feature in the prediction dataset. If the feature is categorical, then a binomial proportion test 521 is performed with the null hypothesis that the proportion of the feature in the training dataset is the same as the proportion of the feature in the prediction dataset. Both of these statistical tests return a p-value. The alpha (a constant representing the significance level), which is the probability of rejecting the null hypothesis when it is true (a false positive), is configurable for each model. If the p-value is less than or equal to the alpha 504, then we reject the null hypothesis, and we say the result is statistically significant. If the p-value is greater than the alpha 504, then we fail to reject the null hypothesis, and we say that the result is statistically nonsignificant. Because we have a large sample size, we cannot rely solely on the p-values. So, if the p-value is less than or equal to the alpha, we check whether there is an overlap in the confidence intervals 505, using the mean and the margin of error (the amount of random sampling error for a 95% confidence level) of each distribution for numeric features, and the expected probability of success and the margin of error for proportions for categorical features. If the confidence intervals overlap 505, then there is no need to create an alert; otherwise, an alert for that feature is created. The feature stability alerts generated 506 are returned 531 and automatically sent to the data scientists through email.
For each feature 612, 613, 614, 615, 616, 617, 618, the feature is checked to see if it is numeric 503. If the feature is numeric, a T-test is performed 511. The T-test is calculated by subtracting the mean of the test data set for the feature from the mean of the prediction data set for the feature and dividing by a function of the variances:

T=(mean_prediction−mean_test)/sqrt(var_test/n_test+var_prediction/n_prediction)

where n is the number of samples, var is the variance, and mean is the mean of each data set. The t-score is then used to determine the p-value.
If the feature is not numeric 503, then a binomial proportion test 521 is performed for the categorical features. A random sample of the training dataset is taken to match the number of observations in the prediction dataset. A binomial proportion test is performed with the null hypothesis that the proportion of the feature in the training dataset is the same as the proportion of the feature in the prediction dataset. The alternative hypothesis is that the proportions of the feature in the training and prediction datasets significantly differ from each other. The hypothesized probability of success is the proportion of 1 for the feature in the training dataset (probability in the formula below). The binomial proportion test returns the p-value. The alpha (the constant representing the significance level), which is the probability of rejecting the null hypothesis when it is true (a false positive), is configurable for each model. If the p-value is less than or equal to the alpha 504, then we reject the null hypothesis, and we say the result is statistically significant. If the p-value is greater than the alpha 504, then we fail to reject the null hypothesis, and we say that the result is statistically nonsignificant.
The binomial proportion test statistic is calculated as:

z=(observed proportion−probability)/sqrt(probability×(1−probability)/n)

where the observed proportion is the proportion of 1 for the feature in the prediction dataset, probability is the hypothesized probability of success from the training dataset, and n is the number of observations in the prediction dataset. The z-score is then used to determine the p-value.
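One way to carry out the binomial proportion test 521 is sketched below, using the one-sample proportions z-test from statsmodels as a stand-in; the sampling step follows the description above, and the pandas-style inputs and function name are assumptions.

```python
# Sketch of the binomial proportion test 521 for one categorical feature.
from statsmodels.stats.proportion import proportions_ztest

def categorical_feature_p_value(training_values, prediction_values):
    # Hypothesized probability of success: proportion of 1 for the feature in a
    # random sample of the training dataset matching the prediction dataset size.
    training_sample = training_values.sample(n=len(prediction_values))
    probability = training_sample.mean()
    successes = int(prediction_values.sum())  # count of 1s in the prediction dataset
    _, p_value = proportions_ztest(successes, len(prediction_values), value=probability)
    return p_value, probability
```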
Because we have a large sample size, we cannot rely solely on the p-values. So, if the p-value is less than or equal to the alpha 504, we check whether there is an overlap in the confidence intervals 505, using the mean and the margin of error (the amount of random sampling error for a 95% confidence level) of each distribution for numeric features, and the expected probability of success and the margin of error for proportions for categorical features. If the confidence intervals overlap 505, then there is no need to create an alert; otherwise, an alert for that feature is created 506. Then, the next feature is checked 502.
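The confidence interval overlap check 505 for proportions could look like the following sketch, using the normal-approximation proportion confidence interval from statsmodels; the function name and pandas-style inputs are assumptions.

```python
# Sketch of the confidence interval overlap check 505 for a categorical feature.
from statsmodels.stats.proportion import proportion_confint

def proportion_intervals_overlap(training_values, prediction_values, confidence=0.95):
    train_ci = proportion_confint(int(training_values.sum()), len(training_values),
                                  alpha=1 - confidence, method="normal")
    pred_ci = proportion_confint(int(prediction_values.sum()), len(prediction_values),
                                 alpha=1 - confidence, method="normal")
    # Overlap means no alert is needed; no overlap (together with a significant
    # p-value) triggers a feature stability alert 506.
    return train_ci[0] <= pred_ci[1] and pred_ci[0] <= train_ci[1]
```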
Although the inventions are shown and described with respect to certain exemplary embodiments, it is obvious that equivalents and modifications will occur to others skilled in the art upon reading and understanding the specification. It is envisioned that after reading and understanding the present inventions those skilled in the art may envision other processing states, events, and processing steps to further the objectives of the system of the present inventions. The present inventions include all such equivalents and modifications and are limited only by the scope of the following claims.