The present disclosure generally relates to systems and methods for monitoring and managing machine learning models and related data. Some examples described herein relate specifically to systems and methods for processing streams of data, identifying and monitoring drift in data over time, and taking corrective action in response to data drift and/or model inaccuracies.
Machine learning is being integrated into a wide range of use cases and industries. Unlike certain other applications, machine learning applications (including deep learning and advanced analytics) can have multiple independent running components that operate cohesively to deliver accurate and relevant results. This complexity can make it difficult to manage or monitor all the interdependent aspects of a machine learning system.
In some instances, for example, data for a machine learning model can be provided in a data stream of unknown size and/or having thousands or millions of numerical values per hour, and lasting for several hours, days, weeks, or longer. Failing to properly store, process, or aggregate such data streams can result in catastrophic failures in which data is lost or models are otherwise unable to make predictions. Additionally, such data can drift over time to be significantly different from data that was used to train the model, which can result in model performance issues.
In general, the present disclosure relates to systems and methods for monitoring and managing machine learning models and data used by such models. A stream of data used by the models can be aggregated using histogram structures (e.g., centroid histograms) that approximate traditional histograms and require far less data storage. The histogram structures can avoid catastrophic data processing failures associated with previous or traditional data stream aggregation processes, and can be used to calculate a wide variety of metrics, including, for example, medians and percentiles. Additionally or alternatively, the systems and methods described herein can be used to identify or monitor drift occurring in data and/or model predictions over time. When drift is identified in scoring data used to make model predictions, for example, alerts can be generated to inform users or system components about the drift. Additionally or alternatively, such alerts can be triggered when model inaccuracies are detected or when model predictions deviate from expectations (e.g., due to data drift). In response to the alerts, the systems and methods can be used to take corrective action, for example, by retraining or refreshing a model with updated training data, or by switching to a new model (e.g., a challenger model).
In general, one innovative aspect of the subject matter described in the present disclosure can be embodied in a computer-implemented method of processing a stream of data or building a histogram for the stream of data. The method includes: (a) providing a histogram for a stream of data including numerical values, the histogram including a centroid vector having elements for storing centroid values, and a count vector having elements for storing count values corresponding to the centroid values; (b) receiving a next numerical value for the stream of data; (c) identifying two adjacent elements in the centroid vector having centroid values less than and greater than the next numerical value; (d) inserting a first new element between the two adjacent elements in the centroid vector; (e) inserting a second new element between corresponding adjacent elements in the count vector; (f) storing the next numerical value in the first new element in the centroid vector; (g) setting a count value in the second new element in the count vector to be equal to one; (h) identifying two neighboring elements in the centroid vector having a smallest difference in centroid values; (i) merging the two neighboring elements in the centroid vector into a single element including a weighted average of the centroid values from the two neighboring elements; (j) merging two corresponding neighboring elements in the count vector into a single element including a sum of the count values from the two corresponding neighboring elements; and (k) repeating steps (b) through (j) for additional next numerical values for the stream of data.
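By way of illustration only, steps (b) through (j) can be sketched in Python, assuming the centroid vector and the count vector are kept as two parallel, sorted lists and assuming a maximum length of 50 elements; the function and parameter names below are illustrative and are not part of the disclosed method.

def update_histogram(centroids, counts, value, max_len=50):
    """Insert one streamed value (steps (b)-(g)), then, if the vectors have
    grown past max_len, merge the two closest centroids (steps (h)-(j))."""
    # Steps (c)-(g): insert the value in sorted position with a count of one.
    i = 0
    while i < len(centroids) and centroids[i] < value:
        i += 1
    centroids.insert(i, value)
    counts.insert(i, 1)

    if len(centroids) <= max_len:
        return

    # Steps (h)-(j): find the adjacent pair with the smallest centroid gap and
    # replace it with a weighted-average centroid and a summed count.
    gaps = [centroids[k + 1] - centroids[k] for k in range(len(centroids) - 1)]
    k = gaps.index(min(gaps))
    total = counts[k] + counts[k + 1]
    merged = (centroids[k] * counts[k] + centroids[k + 1] * counts[k + 1]) / total
    centroids[k:k + 2] = [merged]
    counts[k:k + 2] = [total]

In this sketch, step (k) corresponds to calling update_histogram once for each additional numerical value received from the stream.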
In certain examples, providing the histogram can include initializing the histogram, and initializing the histogram can include: providing the centroid vector and the count vector each having an initial length N; receiving a set of N initial numerical values for the stream of data; storing the N initial numerical values in numerical order in the centroid vector; and setting each value in the count vector to be equal to one. Providing the histogram can include initializing the histogram at periodic time intervals. A duration of each periodic time interval can be or include one hour, one day, one week, or one year. The next numerical value can fall between centroid values stored in the adjacent elements of the centroid vector.
In some implementations, identifying the two neighboring elements can include calculating a difference in centroid values between each set of adjacent elements in the centroid vector. Step (k) can include: repeating steps (b) through (j) until a specified time duration is reached; and storing the histogram for later reference. The method can include converting the histogram to a new histogram having a plurality of buckets, each bucket including a lower bound, an upper bound, and a count. The method can include calculating a cumulative count for each of the plurality of buckets. The method can include calculating at least one of a median or a percentile for the new histogram based on the cumulative counts.
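The conversion to a bucketed histogram and the cumulative-count median/percentile calculation described above can be sketched as follows. The midpoint-based bucket boundaries, the use of the outermost centroids as outer bounds, and the linear interpolation rule are assumptions made for illustration; in practice, separately stored overall minimum and maximum values could supply the outer bounds.

def to_buckets(centroids, counts):
    """Convert a centroid histogram to buckets with lower/upper bounds,
    counts, and cumulative counts, assuming midpoints between adjacent
    centroids as bucket boundaries."""
    bounds = [centroids[0]] + [
        (centroids[i] + centroids[i + 1]) / 2 for i in range(len(centroids) - 1)
    ] + [centroids[-1]]
    buckets, cumulative = [], 0
    for i, count in enumerate(counts):
        cumulative += count
        buckets.append({"lower": bounds[i], "upper": bounds[i + 1],
                        "count": count, "cumulative": cumulative})
    return buckets

def percentile(buckets, q):
    """Approximate the q-th percentile (0-100) by locating the bucket whose
    cumulative count crosses the target rank and interpolating within it."""
    total = buckets[-1]["cumulative"]
    rank = q / 100.0 * total
    previous = 0
    for b in buckets:
        if b["cumulative"] >= rank:
            fraction = (rank - previous) / b["count"] if b["count"] else 0.0
            return b["lower"] + fraction * (b["upper"] - b["lower"])
        previous = b["cumulative"]
    return buckets[-1]["upper"]

Under these assumptions, the median is simply percentile(to_buckets(centroids, counts), 50).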
In another aspect, the present disclosure relates to a system having one or more computer systems programmed to perform operations including: (a) providing a histogram for a stream of data including numerical values, the histogram including a centroid vector having elements for storing centroid values, and a count vector having elements for storing count values corresponding to the centroid values; (b) receiving a next numerical value for the stream of data; (c) identifying two adjacent elements in the centroid vector having centroid values less than and greater than the next numerical value; (d) inserting a first new element between the two adjacent elements in the centroid vector; (e) inserting a second new element between corresponding adjacent elements in the count vector; (f) storing the next numerical value in the first new element in the centroid vector; (g) setting a count value in the second new element in the count vector to be equal to one; (h) identifying two neighboring elements in the centroid vector having a smallest difference in centroid values; (i) merging the two neighboring elements in the centroid vector into a single element including a weighted average of the centroid values from the two neighboring elements; (j) merging two corresponding neighboring elements in the count vector into a single element including a sum of the count values from the two corresponding neighboring elements; and (k) repeating steps (b) through (j) for additional next numerical values for the stream of data.
In certain examples, providing the histogram can include initializing the histogram, and initializing the histogram can include: providing the centroid vector and the count vector each having an initial length N; receiving a set of N initial numerical values for the stream of data; storing the N initial numerical values in numerical order in the centroid vector; and setting each value in the count vector to be equal to one. Providing the histogram can include initializing the histogram at periodic time intervals. A duration of each periodic time interval can be or include one hour, one day, one week, or one year. The next numerical value can fall between centroid values stored in the adjacent elements of the centroid vector.
In some implementations, identifying the two neighboring elements can include calculating a difference in centroid values between each set of adjacent elements in the centroid vector. Step (k) can include: repeating steps (b) through (j) until a specified time duration is reached; and storing the histogram for later reference. The operations can include converting the histogram to a new histogram having a plurality of buckets, each bucket including a lower bound, an upper bound, and a count. The operations can include calculating a cumulative count for each of the plurality of buckets. The operations can include calculating at least one of a median or a percentile for the new histogram based on the cumulative counts.
In another aspect, the present disclosure relates to a non-transitory computer-readable medium having instructions stored thereon that, when executed by one or more computer processors, cause the one or more computer processors to perform operations including: (a) providing a histogram for a stream of data including numerical values, the histogram including a centroid vector having elements for storing centroid values, and a count vector having elements for storing count values corresponding to the centroid values; (b) receiving a next numerical value for the stream of data; (c) identifying two adjacent elements in the centroid vector having centroid values less than and greater than the next numerical value; (d) inserting a first new element between the two adjacent elements in the centroid vector; (e) inserting a second new element between corresponding adjacent elements in the count vector; (f) storing the next numerical value in the first new element in the centroid vector; (g) setting a count value in the second new element in the count vector to be equal to one; (h) identifying two neighboring elements in the centroid vector having a smallest difference in centroid values; (i) merging the two neighboring elements in the centroid vector into a single element including a weighted average of the centroid values from the two neighboring elements; (j) merging two corresponding neighboring elements in the count vector into a single element including a sum of the count values from the two corresponding neighboring elements; and (k) repeating steps (b) through (j) for additional next numerical values for the stream of data.
In another aspect, the present disclosure relates to a computer-implemented method including: providing a machine learning model configured to predict a preferred combination of a binning strategy and a drift metric for determining data drift; determining one or more data characteristics for at least one data set; providing the one or more characteristics as input to the machine learning model; receiving as output from the machine learning model an identification of the preferred combination of the binning strategy and the drift metric for the at least one data set; using the predicted combination to determine drift between a first data set and a second data set; and facilitating a corrective action in response to the determined drift.
In various examples, the first data set can include training data and the second data set can include scoring data. The first data set and the second data set can include data for a single feature of a predictive model. The one or more characteristics can include a length, a distribution, a minimum, a maximum, a mean, a skewness, a number of unique values, or any combination thereof. The at least one data set can include the first data set, the second data set, or both the first data set and the second data set. The at least one data set can include numerical data, and the binning strategy can include use of fixed width bins, quantiles, quartiles, deciles, ventiles, Freedman-Diaconis rule, Bayesian Blocks, or any combination thereof. The at least one data set can include categorical data, and the binning strategy can include use of (i) one bin per level in a training data sample plus one, (ii) one bin per level in a portion of the training data sample plus one, (iii) inverse binning, or (iv) any combination thereof.
In certain implementations, the at least one data set includes text data, and the binning strategy includes use of (i) inverse binning, (ii) one bin per quantile based on word use frequency, or (iii) any combination thereof. The drift metric can include use of population stability index, Kullback-Leibler divergence, relative entropy, Hellinger distance, Isolation Forest (e.g., ratio of training anomalies to scoring anomalies), modality drift, Kolmogorov-Smirnov test, Wasserstein distance, or any combination thereof. Facilitating the corrective action can include retraining a predictive model, switching to a new predictive model, collecting new data for the first data set, collecting new data for the second data set, or any combination thereof. The method can include: determining a percentage of anomalies in the first data set; determining a percentage of anomalies in the second data set; and calculating an anomaly drift based on the percentage of anomalies in the first data set and the percentage of anomalies in the second data set.
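As a concrete illustration of one drift metric from the list above, the population stability index between training and scoring histograms defined over the same bins, together with one reading of the anomaly-drift calculation, might be computed as in the following sketch; the epsilon smoothing constant and the ratio-based interpretation of anomaly drift are assumptions made for illustration.

import math

def population_stability_index(train_counts, score_counts, eps=1e-6):
    """PSI between two histograms that share the same bins:
    sum over bins of (p_i - q_i) * ln(p_i / q_i), with a small epsilon
    to avoid division by zero for empty bins."""
    p_total, q_total = sum(train_counts), sum(score_counts)
    psi = 0.0
    for p_count, q_count in zip(train_counts, score_counts):
        p = max(p_count / p_total, eps)
        q = max(q_count / q_total, eps)
        psi += (p - q) * math.log(p / q)
    return psi

def anomaly_drift_ratio(train_anomaly_pct, score_anomaly_pct):
    """One reading of anomaly drift: the ratio of the training anomaly
    percentage to the scoring anomaly percentage (values far from 1.0
    suggest drift between the two data sets)."""
    if score_anomaly_pct == 0:
        return float("inf") if train_anomaly_pct > 0 else 1.0
    return train_anomaly_pct / score_anomaly_pct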
In another aspect, the present disclosure relates to a system having one or more computer systems programmed to perform operations including: providing a machine learning model configured to predict a preferred combination of a binning strategy and a drift metric for determining data drift; determining one or more data characteristics for at least one data set; providing the one or more characteristics as input to the machine learning model; receiving as output from the machine learning model an identification of the preferred combination of the binning strategy and the drift metric for the at least one data set; using the predicted combination to determine drift between a first data set and a second data set; and facilitating a corrective action in response to the determined drift.
In various examples, the first data set can include training data and the second data set can include scoring data. The first data set and the second data set can include data for a single feature of a predictive model. The one or more characteristics can include a length, a distribution, a minimum, a maximum, a mean, a skewness, a number of unique values, or any combination thereof. The at least one data set can include the first data set, the second data set, or both the first data set and the second data set. The at least one data set can include numerical data, and the binning strategy can include use of fixed width bins, quantiles, quartiles, deciles, ventiles, Freedman-Diaconis rule, Bayesian Blocks, or any combination thereof. The at least one data set can include categorical data, and the binning strategy can include use of (i) one bin per level in a training data sample plus one, (ii) one bin per level in a portion of the training data sample plus one, (iii) inverse binning, or (iv) any combination thereof.
In certain implementations, the at least one data set includes text data, and the binning strategy includes use of (i) inverse binning, (ii) one bin per quantile based on word use frequency, or (iii) any combination thereof. The drift metric can include use of population stability index, Kullback-Leibler divergence, relative entropy, Hellinger distance, modality drift, Kolmogorov-Smirnov test, Wasserstein distance, or any combination thereof. Facilitating the corrective action can include retraining a predictive model, switching to a new predictive model, collecting new data for the first data set, collecting new data for the second data set, or any combination thereof. The operations can include: determining a percentage of anomalies in the first data set; determining a percentage of anomalies in the second data set; and calculating an anomaly drift based on the percentage of anomalies in the first data set and the percentage of anomalies in the second data set.
In another aspect, the present disclosure relates to a non-transitory computer-readable medium having instructions stored thereon that, when executed by one or more computer processors, cause the one or more computer processors to perform operations including: providing a machine learning model configured to predict a preferred combination of a binning strategy and a drift metric for determining data drift; determining one or more data characteristics for at least one data set; providing the one or more characteristics as input to the machine learning model; receiving as output from the machine learning model an identification of the preferred combination of the binning strategy and the drift metric for the at least one data set; using the predicted combination to determine drift between a first data set and a second data set; and facilitating a corrective action in response to the determined drift.
In another aspect, the present disclosure relates to a computer-implemented method including: obtaining training data including a plurality of features for a machine learning model; obtaining multiple sets of scoring data including the plurality of features for the machine learning model, each set of scoring data representing a respective period of time; for each feature from the plurality of features and for each set of scoring data, providing the training data and the scoring data as input to a classifier; determining, based on output from the classifier, that the sets of scoring data have drifted from the training data over time for at least one of the features; determining that the drift corresponds to a reduction in accuracy of the machine learning model; and facilitating a corrective action to improve the accuracy of the machine learning model.
In certain implementations, the machine learning model can be trained using the training data, and the machine learning model can be used to make predictions based on the scoring data. Each set of scoring data can represent a distinct period of time. The classifier can be or include a covariate shift classifier configured to detect statistically significant differences between two sets of data. Determining that the sets of scoring data have drifted from the training data can include detecting drift over multiple periods of time for the at least one of the features. Determining that the drift corresponds to a reduction in accuracy of the machine learning model can include identifying one or more features from the plurality of features that contributed to the reduction in accuracy.
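A covariate shift classifier of the kind described above can be sketched as follows: rows of training data are labeled 0, rows of a scoring set are labeled 1, and the classifier's cross-validated AUC measures how easily the two sets can be told apart, with values well above 0.5 suggesting a statistically meaningful difference. The choice of classifier, the 0.55 threshold, and the assumption of numeric feature matrices are illustrative, not requirements of the disclosure.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def covariate_shift_auc(train_features, scoring_features):
    """Cross-validated AUC of a classifier trained to tell training rows
    (label 0) apart from scoring rows (label 1)."""
    X = np.vstack([train_features, scoring_features])
    y = np.concatenate([np.zeros(len(train_features)),
                        np.ones(len(scoring_features))])
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

def has_drifted(train_features, scoring_features, threshold=0.55):
    """Flag drift when the two data sets are easier to separate than chance."""
    return covariate_shift_auc(train_features, scoring_features) > threshold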
In some instances, identifying the one or more features can include determining an impact that the one or more features had on the reduction in accuracy. Determining the impact can include displaying on a graphical user interface a chart including an indication of the impact that the one or more features had on the reduction in accuracy. The method can include: using the machine learning model to make predictions for each set of scoring data; and detecting anomalies in the predictions over time. Detecting anomalies in the predictions can include displaying on a graphical user interface a chart including an indication of a quantity of detected anomalies over time. The corrective action can include: sending an alert to a user of the machine learning model, refreshing the machine learning model, retraining the machine learning model, switching to a new machine learning model, or any combination thereof.
In another aspect, the present disclosure relates to a system having one or more computer systems programmed to perform operations including: obtaining training data including a plurality of features for a machine learning model; obtaining multiple sets of scoring data including the plurality of features for the machine learning model, each set of scoring data representing a respective period of time; for each feature from the plurality of features and for each set of scoring data, providing the training data and the scoring data as input to a classifier; determining, based on output from the classifier, that the sets of scoring data have drifted from the training data over time for at least one of the features; determining that the drift corresponds to a reduction in accuracy of the machine learning model; and facilitating a corrective action to improve the accuracy of the machine learning model.
In certain implementations, the machine learning model can be trained using the training data, and the machine learning model can be used to make predictions based on the scoring data. Each set of scoring data can represent a distinct period of time. The classifier can be or include a covariate shift classifier configured to detect statistically significant differences between two sets of data. Determining that the sets of scoring data have drifted from the training data can include detecting drift over multiple periods of time for the at least one of the features. Determining that the drift corresponds to a reduction in accuracy of the machine learning model can include identifying one or more features from the plurality of features that contributed to the reduction in accuracy.
In some instances, identifying the one or more features can include determining an impact that the one or more features had on the reduction in accuracy. Determining the impact can include displaying on a graphical user interface a chart including an indication of the impact that the one or more features had on the reduction in accuracy. The operations can include: using the machine learning model to make predictions for each set of scoring data; and detecting anomalies in the predictions over time. Detecting anomalies in the predictions can include displaying on a graphical user interface a chart including an indication of a quantity of detected anomalies over time. The corrective action can include: sending an alert to a user of the machine learning model, refreshing the machine learning model, retraining the machine learning model, switching to a new machine learning model, or any combination thereof.
In another aspect, the present disclosure relates to a non-transitory computer-readable medium having instructions stored thereon that, when executed by one or more computer processors, cause the one or more computer processors to perform operations including: obtaining training data including a plurality of features for a machine learning model; obtaining multiple sets of scoring data including the plurality of features for the machine learning model, each set of scoring data representing a respective period of time; for each feature from the plurality of features and for each set of scoring data, providing the training data and the scoring data as input to a classifier; determining, based on output from the classifier, that the sets of scoring data have drifted from the training data over time for at least one of the features; determining that the drift corresponds to a reduction in accuracy of the machine learning model; and facilitating a corrective action to improve the accuracy of the machine learning model.
In another aspect, the present disclosure relates to a computer-implemented method including: monitoring a performance of a machine learning model over time; detecting a degradation in the performance of the machine learning model; in response to the detected degradation in the performance, automatically triggering at least one of: switching from the machine learning model to a challenger machine learning model, or updating the machine learning model with new training data; and using at least one of the challenger machine learning model or the updated machine learning model to make predictions.
In certain examples, monitoring the performance of the machine learning model can include comparing model predictions with ground truth data over time. Monitoring the performance of the machine learning model can include detecting a drift in scoring data used to make model predictions. Monitoring a performance of the machine learning model can include displaying on a graphical user interface a chart including an indication of an accuracy of the machine learning model and an accuracy of the challenger machine learning model over time. The degradation can include a reduction in agreement between model predictions and ground truth data. The automatic triggering can be based on one or more characteristics including a size of a data set, a number of rows in the data set, a number of columns in the data set, a historical performance of the challenger machine learning model, a detected drift associated with the challenger machine learning model, a quantity of scoring data that can be matched up with ground truth data, or any combination thereof. The data set can include training data, scoring data, or a combination thereof.
In various instances, switching from the machine learning model to the challenger machine learning model can include selecting the challenger machine learning model from a plurality of challenger machine learning models based on a historical performance of the challenger machine learning model. Updating the machine learning model with new training data can include generating an updated set of training data by combining the new training data with previous training data, reducing an amount of previous training data to accommodate the new training data, replacing previous training data with the new training data, or any combination thereof. Updating the machine learning model with new training data can include reducing an amount of previous training data to accommodate the new training data, and reducing the amount of previous data can include removing a random portion of the previous training data, removing an outdated portion of the previous training data, removing an anomalous portion of the previous training data, or any combination thereof.
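One of the data-management strategies listed above, removing a random portion of the previous training data to accommodate the new training data while keeping the refreshed set within a size budget, might look like the following sketch; the row budget and the use of simple random sampling are assumed parameters chosen for illustration.

import random

def refresh_training_data(previous_rows, new_rows, max_rows=100_000, seed=0):
    """Combine new training data with a randomly reduced portion of the
    previous training data so the refreshed set stays within max_rows."""
    rng = random.Random(seed)
    keep_previous = max(max_rows - len(new_rows), 0)
    if keep_previous < len(previous_rows):
        previous_rows = rng.sample(previous_rows, keep_previous)
    return list(new_rows) + list(previous_rows)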
In another aspect, the present disclosure relates to a system having one or more computer systems programmed to perform operations including: monitoring a performance of a machine learning model over time; detecting a degradation in the performance of the machine learning model; in response to the detected degradation in the performance, automatically triggering at least one of: switching from the machine learning model to a challenger machine learning model, or updating the machine learning model with new training data; and using at least one of the challenger machine learning model or the updated machine learning model to make predictions.
In certain examples, monitoring the performance of the machine learning model can include comparing model predictions with ground truth data over time. Monitoring the performance of the machine learning model can include detecting a drift in scoring data used to make model predictions. Monitoring a performance of the machine learning model can include displaying on a graphical user interface a chart including an indication of an accuracy of the machine learning model and an accuracy of the challenger machine learning model over time. The degradation can include a reduction in agreement between model predictions and ground truth data. The automatic triggering can be based on one or more characteristics including a size of a data set, a number of rows in the data set, a number of columns in the data set, a historical performance of the challenger machine learning model, a detected drift associated with the challenger machine learning model, a quantity of scoring data that can be matched up with ground truth data, or any combination thereof. The data set can include training data, scoring data, or a combination thereof.
In various instances, switching from the machine learning model to the challenger machine learning model can include selecting the challenger machine learning model from a plurality of challenger machine learning models based on a historical performance of the challenger machine learning model. Updating the machine learning model with new training data can include generating an updated set of training data by combining the new training data with previous training data, reducing an amount of previous training data to accommodate the new training data, replacing previous training data with the new training data, or any combination thereof. Updating the machine learning model with new training data can include reducing an amount of previous training data to accommodate the new training data, and reducing the amount of previous data can include removing a random portion of the previous training data, removing an outdated portion of the previous training data, removing an anomalous portion of the previous training data, or any combination thereof.
In another aspect, the present disclosure relates to a non-transitory computer-readable medium having instructions stored thereon that, when executed by one or more computer processors, cause the one or more computer processors to perform operations including: monitoring a performance of a machine learning model over time; detecting a degradation in the performance of the machine learning model; in response to the detected degradation in the performance, automatically triggering at least one of: switching from the machine learning model to a challenger machine learning model, or updating the machine learning model with new training data; and using at least one of the challenger machine learning model or the updated machine learning model to make predictions.
In another aspect, the present disclosure relates to a computer-implemented method. The method includes: receiving model data from a plurality of prediction environments for a plurality of machine learning models deployed in the prediction environments, the model data including model predictions; providing the model data to a machine learning operations (MLOps) component configured to perform operations including at least one of: aggregating a stream of scoring data, identifying drift in scoring data or model predictions, generating alerts related to the drift, or generating requests related to model adjustment or replacement; receiving, from the MLOps component, a request to take an action for a machine learning model from the plurality of machine learning models, wherein the machine learning model is deployed in a respective prediction environment from the plurality of prediction environments; and implementing the action for the machine learning model in the respective prediction environment.
In certain examples, the model data can include scoring data. Receiving the model data can include aggregating the model data prior to providing the model data to the MLOps component. Each of the prediction environments can include a computing environment in which machine learning models are deployed for making predictions. Each of the prediction environments can include a web-based computing platform hosted by a third party. The MLOps component can include a data aggregation module for aggregating the stream of scoring data, a drift identification module for identifying the drift in scoring data or model predictions, a drift monitoring module for generating the alerts related to the drift, and/or a model management module for generating the requests related to model adjustment or replacement.
In some instances, the action can include refreshing the machine learning model and/or replacing the machine learning model with a different model. Implementing the action can include: selecting a plugin from a plurality of plugins associated with the plurality of prediction environments, wherein the selected plugin is associated with the respective prediction environment; and using the selected plugin to implement the action in the respective prediction environment. The method can include: retrieving a new model from a storage location; and using the selected plugin to deploy the new model in the respective prediction environment. Retrieving the new model from the storage location can include selecting a second plugin associated with the storage location, wherein the second plugin is selected from a plurality of plugins associated with a respective plurality of storage locations.
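The plugin-based action implementation described above might be organized as simple registries keyed by prediction environment and by storage location, as in the following sketch; the environment and storage identifiers and the plugin interfaces shown here are purely illustrative assumptions.

class PredictionEnvironmentPlugin:
    """Illustrative plugin for deploying or replacing models in one prediction environment."""
    def deploy_model(self, model_artifact):
        print(f"deploying {model_artifact} to this prediction environment")

class ModelStoragePlugin:
    """Illustrative plugin for retrieving model artifacts from one storage location."""
    def fetch_model(self, model_id):
        return f"model-artifact:{model_id}"

# Registries mapping identifiers to plugins (names are assumptions).
ENVIRONMENT_PLUGINS = {"environment-a": PredictionEnvironmentPlugin(),
                       "environment-b": PredictionEnvironmentPlugin()}
STORAGE_PLUGINS = {"model-registry": ModelStoragePlugin()}

def replace_model(environment_id, storage_id, model_id):
    """Select the plugin for the target prediction environment, retrieve the
    new model via a second plugin for its storage location, and deploy it."""
    environment_plugin = ENVIRONMENT_PLUGINS[environment_id]
    storage_plugin = STORAGE_PLUGINS[storage_id]
    new_model = storage_plugin.fetch_model(model_id)
    environment_plugin.deploy_model(new_model)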
In another aspect, the present disclosure relates to a system. The system includes one or more computer systems programmed to perform operations comprising: receiving model data from a plurality of prediction environments for a plurality of machine learning models deployed in the prediction environments, the model data including model predictions; providing the model data to a machine learning operations (MLOps) component configured to perform operations including at least one of: aggregating a stream of scoring data, identifying drift in scoring data or model predictions, generating alerts related to the drift, or generating requests related to model adjustment or replacement; receiving, from the MLOps component, a request to take an action for a machine learning model from the plurality of machine learning models, wherein the machine learning model is deployed in a respective prediction environment from the plurality of prediction environments; and implementing the action for the machine learning model in the respective prediction environment.
In certain examples, the model data can include scoring data. Receiving the model data can include aggregating the model data prior to providing the model data to the MLOps component. Each of the prediction environments can include a computing environment in which machine learning models are deployed for making predictions. Each of the prediction environments can include a web-based computing platform hosted by a third party. The MLOps component can include a data aggregation module for aggregating the stream of scoring data, a drift identification module for identifying the drift in scoring data or model predictions, a drift monitoring module for generating the alerts related to the drift, and/or a model management module for generating the requests related to model adjustment or replacement.
In some instances, the action can include refreshing the machine learning model and/or replacing the machine learning model with a different model. Implementing the action can include: selecting a plugin from a plurality of plugins associated with the plurality of prediction environments, wherein the selected plugin is associated with the respective prediction environment; and using the selected plugin to implement the action in the respective prediction environment. The operations can include: retrieving a new model from a storage location; and using the selected plugin to deploy the new model in the respective prediction environment. Retrieving the new model from the storage location can include selecting a second plugin associated with the storage location, wherein the second plugin is selected from a plurality of plugins associated with a respective plurality of storage locations.
In another aspect, the present disclosure relates to a non-transitory computer-readable medium having instructions stored thereon that, when executed by one or more computer processors, cause the one or more computer processors to perform operations comprising: receiving model data from a plurality of prediction environments for a plurality of machine learning models deployed in the prediction environments, the model data including model predictions; providing the model data to a machine learning operations (MLOps) component configured to perform operations including at least one of: aggregating a stream of scoring data, identifying drift in scoring data or model predictions, generating alerts related to the drift, or generating requests related to model adjustment or replacement; receiving, from the MLOps component, a request to take an action for a machine learning model from the plurality of machine learning models, wherein the machine learning model is deployed in a respective prediction environment from the plurality of prediction environments; and implementing the action for the machine learning model in the respective prediction environment.
The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.
The foregoing Summary, including the description of some embodiments, motivations therefor, and/or advantages thereof, is intended to assist the reader in understanding the present disclosure, and does not in any way limit the scope of any of the claims.
In the drawings, like reference characters generally refer to the same parts throughout the different views. Also, the drawings are not necessarily to scale, emphasis instead generally being placed upon illustrating the principles of the invention. In the following description, various embodiments of the present invention are described with reference to the following drawings, in which:
“Machine learning” generally refers to the application of certain techniques (e.g., pattern recognition and/or statistical inference techniques) by computer systems to perform specific tasks. Machine learning systems may build predictive models based on sample data (e.g., “training data”) and may validate the models using validation data (e.g., “testing data”). The sample and validation data may be organized as sets of records (e.g., “observations”), with each record indicating values of specified data fields (e.g., “dependent variables,” “outputs,” or “targets”) based on the values of other data fields (e.g., “independent variables,” “inputs,” “features,” or “predictors”). When presented with other data (e.g., “scoring data”) similar to or related to the sample data, the machine learning system may use such a predictive model to accurately predict the unknown values of the targets of the scoring data set.
A feature of a data sample may be a measurable property of an entity (e.g., person, thing, event, activity, etc.) represented by or associated with the data sample. For example, a feature can be the price of an apartment. As a further example, a feature can be a shape extracted from an image of the apartment. In some cases, a feature of a data sample is a description of (or other information regarding) an entity represented by or associated with the data sample. A value of a feature may be a measurement of the corresponding property of an entity or an instance of information regarding an entity. For instance, in the above example in which a feature is the price of an apartment, a value of the feature can be $1,000. As referred to herein, a value of a feature can also refer to a missing value (e.g., no value). For instance, in the above example in which a feature is the price of an apartment, the price of the apartment can be missing.
In various examples, an “entity” (alternatively referred to as a “segment”) can be a specific value for a feature. For example, the feature may be “Customer_business_area” and values for the feature may include “telecoms,” “electrical,” and the like. The entities in this example include “telecoms” and “electrical.” A segment can be a manually defined cluster that can be picked up by a machine learning algorithm. Clustering can be used to automatically “segment” data, and the resulting segment may or may not match a manual cluster or segment. Cluster and segment can be used interchangeably.
Features can also have data types. For instance, a feature can have an image data type, a numerical data type, a text data type (e.g., a structured text data type or an unstructured (“free”) text data type), a categorical data type, or any other kind of data type. In some cases, the feature values for one or more features corresponding to a set of observations may be organized in a table, in which case those feature(s) may be referred to herein as “tabular features.” Features of the numerical data type and/or categorical data type are often tabular features. In the above example, the feature of a shape from an image of the apartment can be of an image data type. In general, a feature's data type is categorical if the set of values that can be assigned to the feature is finite.
As used herein, “image data” may refer to a sequence of digital images (e.g., video), a set of digital images, a single digital image, and/or one or more portions of any of the foregoing. A digital image may include an organized set of picture elements (“pixels”) stored in a file. Any suitable format and type of digital image file may be used, including but not limited to raster formats (e.g., TIFF, JPEG, GIF, PNG, BMP, etc.), vector formats (e.g., CGM, SVG, etc.), compound formats (e.g., EPS, PDF, PostScript, etc.), and/or stereo formats (e.g., MPO, PNS, JPS).
As used herein, “non-image data” may refer to any type of data other than image data, including but not limited to structured textual data, unstructured textual data, categorical data, and/or numerical data.
As used herein, “natural language data” may refer to speech signals representing natural language, text (e.g., unstructured text) representing natural language, and/or data derived therefrom.
As used herein, “speech data” may refer to speech signals (e.g., audio signals) representing speech, text (e.g., unstructured text) representing speech, and/or data derived therefrom.
As used herein, “auditory data” may refer to audio signals representing sound and/or data derived therefrom.
As used herein, “time-series data” may refer to data collected at different points in time. For example, in a time-series data set, each data sample may include the values of one or more variables sampled at a particular time. In some embodiments, the times corresponding to the data samples are stored within the data samples (e.g., as variable values) or stored as metadata associated with the data set. In some embodiments, the data samples within a time-series data set are ordered chronologically. In some embodiments, the time intervals between successive data samples in a chronologically-ordered time-series data set are substantially uniform.
Time-series data may be useful for tracking and inferring changes in the data set over time. In some cases, a time-series data analytics model (or “time-series model”) may be trained and used to predict the values of a target Z at time t and optionally times t+1, . . . , t+i, given observations of Z at times before t and optionally observations of other predictor variables P at times before t. For time-series data analytics problems, the objective is generally to predict future values of the target(s) as a function of prior observations of all features, including the targets themselves.
In certain examples, “seasonality” can refer to variations in time series data that repeat at periodic intervals, such as each week, each month, each quarter, or each year. For example, a time series having a weekly seasonality may exhibit variations that repeat substantially each week, over time.
After a predictive problem is identified, the process of using machine learning to build a predictive model that accurately solves the prediction problem generally includes steps of data collection, data cleaning, feature engineering, model generation, and model deployment. “Automated machine learning” techniques may be used to automate steps of the machine learning process or portions thereof.
As referred to herein, the term “machine learning model” may refer to any suitable model artifact generated by the process of training a machine learning algorithm on a specific training data set. Machine learning models can be used to generate predictions.
As referred to herein, the term “machine learning system” may refer to any environment in which a machine learning model operates. A machine learning system may include various components, pipelines, data sets, other infrastructure, etc.
A machine-learning model can be an unsupervised machine learning model or a supervised machine learning model. Unsupervised and supervised machine learning models differ from one another based on their training datasets and algorithms. Specifically, a training dataset used to train an unsupervised machine learning model generally does not include target values for the individual training samples, while a training dataset used to train a supervised machine learning model generally does include target values for the individual training samples. The value of a target for a training sample may indicate a known classification of the training sample or a known value of an output variable of the training sample. For example, a target for a training sample used to train a supervised computer vision model to detect images containing a cat can be an indication of whether or not the training sample includes an image containing a cat.
Following training, a machine learning model is configured to generate predictions based on a scoring dataset. Targets are generally not known in advance for samples in a scoring dataset, and therefore a machine learning model generates predictions for the scoring dataset based on prior training. For example, following training, a computer vision model may be configured to distinguish data samples including images of cats from data samples that do not include images of cats.
As referred to herein, the term “development” with regard to a machine learning model may refer to construction of the machine learning model. Machine learning models may be constructed by computers using training data sets. Thus, “development” of a machine learning model may refer to training of the machine learning model using a training data set. In some cases (generally referred to as “supervised learning”), a training data set used to train a machine learning model can include known outcomes (e.g., labels). In alternative cases (generally referred to as “unsupervised learning”), a training data set does not include known outcomes.
In contrast to development of a machine learning model, as referred to herein, the term “deployment” with regard to a machine learning model may refer to use of a developed machine learning model to generate real-world predictions. A deployed machine learning model may have completed development (e.g., training). A model can be deployed in any system, including the system in which it was developed and/or a third-party system. A deployed machine learning model can make real-world predictions based on a scoring data set. Unlike certain embodiments of a training data set, a scoring data set generally does not include known outcomes. Rather, the deployed machine learning model is used to generate predictions of outcomes based on the scoring data set.
As used herein, “data analytics” may refer to the process of analyzing data (e.g., using machine learning models or techniques) to discover information, draw conclusions, and/or support decision-making. Species of data analytics can include descriptive analytics (e.g., processes for describing the information, trends, anomalies, etc. in a data set), diagnostic analytics (e.g., processes for inferring why specific trends, patterns, anomalies, etc. are present in a data set), predictive analytics (e.g., processes for predicting future events or outcomes), and prescriptive analytics (e.g., processes for determining or suggesting a course of action).
In general, the subject matter described herein relates to a complete and independent technological solution for machine learning operations (MLOps) that includes a platform-independent environment for the deployment, management, and control of statistical, rule-based, and predictive models. The subject matter includes computer-implemented modules or components for performing data aggregation for data streams, drift identification, drift monitoring, and model management and control. Each computer-implemented module or component can be or include a set of instructions executed by one or more computer processors.
For example, referring to
The model package 102 can be managed and controlled by an MLOps controller 120, which acts as an interface between a prediction environment (e.g., including the model package 102) and an internal or MLOps environment (e.g., including the data aggregation module 106, the drift identification module 108, the drift monitoring module 110, and the model management module 112) for the system 100. The controller 120 can include a monitoring agent 160 and a management agent 162. The monitoring agent 160 can enable monitoring of any model, in any prediction environment, without needing to know a structure of the model, such as model inputs and outputs or a schema for such inputs and outputs. The management agent 162 can enable management of any model in any prediction environment, including initial deployment, model replacement, and execution of prediction jobs.
As described herein, in various examples, the data aggregation module 106 receives a stream of scoring data 122 (e.g., via the controller 120) and aggregates (step 121) the stream of scoring data 122, in real time, to generate a series of histograms (e.g., one histogram per hour) representing the scoring data 122. The histograms can be stored in an aggregated data store and/or can be provided as input to the drift identification module 108, the drift monitoring module 110, the model management module 112, and/or other components of the system 100.
In certain implementations, the drift identification module 108 receives as input (e.g., from the controller 120) the training data 114, the scoring data 122 (or aggregated scoring data 122 from the data aggregation module 106), and/or model predictions 123 and provides as output an indication of (i) a degree to which the scoring data 122 deviates from the training data 114 and/or (ii) a degree to which predictions based on the scoring data 122 (“scoring predictions”) deviate from predictions based on the training data 114 (“training predictions”). The scoring predictions and the training predictions can be included within the model predictions 123, which includes predictions from the model 104. The training data 114 can be aggregated (step 124) and provided to an adaptive drift learner 126, along with the scoring data 122 (e.g., as aggregated by the data aggregation module 106), the training predictions, and/or the scoring predictions. The adaptive drift learner 126 can predict a suitable (e.g., optimal) binning strategy and drift metric to use for one or more features in the training data 114 and/or the scoring data 122. The binning strategy and drift metric can be used to identify drift (step 128) between the training data and the scoring data, and/or between the training predictions and the scoring predictions. A user 130 can accept or reject the determined amounts of drift. Such user feedback can be used to refine the capabilities or accuracy of the adaptive drift learner 126, over time, which can utilize artificial intelligence.
In some examples, the drift monitoring module 110 receives as input (e.g., from the controller 120) the training data 114, the scoring data 122, the model predictions 123 (e.g., including training predictions and/or scoring predictions), and/or ground truth data 132 (alternatively referred to as “actuals”) corresponding to the scoring predictions and generates alerts (e.g., using an alert management component 134) or facilitates other corrective action when feature drift and/or model inaccuracies are detected. Feature drift can be detected using a covariate drift classifier configured to monitor and detect differences between datasets (e.g., the training data and the scoring data), for one or more features. Anomaly detection can be performed and used to flag abnormal model predictions as they occur.
The model management module 112 can be used to refresh models with updated training data and/or to switch between two or more models, for example, in response to alerts received from the drift identification module 108 or the drift monitoring module 110. Refreshing a model (step 136) can involve the use of various data management techniques, for example, to replace old training data with new training data and/or maintain the training data at a reasonable size. Such techniques can be performed by a data management component 137, which can utilize artificial intelligence to determine a suitable (e.g., optimal) data management strategy and/or generate a new or updated set of training data. When model inaccuracies are detected (e.g., by the drift monitoring module 110), an adaptive drift controller 138 can be used to automatically switch (step 140) to a different, challenger model, for example, based on one or more user-defined heuristics, as described herein. Model refreshing and switching can be implemented via the controller 120.
In various examples, the data aggregation module 106 is configured to process a stream of data (e.g., of unknown size or duration) by aggregating (step 121) the data in a collection or series of histograms. The aggregated data can be stored in a data store for subsequent queries and/or can be used to calculate metrics of interest to users of the system 100 (e.g., MLOps service health engineers). Such metrics for a data set can include, for example, minimum, maximum, mean, median, any percentile (e.g., 10th percentile, 90th percentile, quartiles, etc.) and/or counts of values over or under a particular threshold.
Some metrics of interest can be relatively easy to compute without having access to an entire data set or stream. For instance, mean can be computed from sum and count values. Other metrics, such as medians, percentiles, and/or counts over or under thresholds, can be difficult or impossible to compute precisely without accessing or using the entire data set or stream. Advantageously, however, the data aggregation module 106 is able to approximate such metrics through the use of Ben-Haim/Tom-Tov histograms or other histograms (e.g., centroid histograms) that provide an accurate summary or approximation of an entire data set. The data aggregation module 106 is configured to select aggregate values for storage that maximize the number of different metrics that can be computed, while minimizing the storage space required for these metrics. In some examples, a Ben-Haim/Tom-Tov (BH-TT) decision tree algorithm can be adapted to efficiently aggregate data from a scoring engine used for machine learning models, at coarse-grained time windows, such as one-hour windows, one-day windows, or one-week windows. In some instances, for example, the data aggregation module 106 utilizes a data structure that is or includes an array of objects, with each object having two properties: centroid and count. The data structure can be used to collect and store data from a stream of data, and the stored data can be used to calculate various metrics (e.g., minimums, maximums, medians, percentiles, and thresholds) related to the data and/or relevant to service health for machine learning models. While the following example utilizes arrays of length 5, it is understood that the array length can be larger (e.g., to improve accuracy). For example, the array length can be 5, 10, 15, 20, 50, 100, 200, 500, 1000, or any integer N between or above these values. In one implementation, an array length of 50 works well for most data streams, from an accuracy and computational efficiency standpoint.
A traditional histogram defines how many values fall between the minimum and maximum bounds of each bin. This can provide a precise and accurate representation of the data; however, all of the data is generally needed to calculate such bounds. A centroid histogram (e.g., a BH-TT histogram), on the other hand, can be an approximation of the traditional histogram. The centroid histogram can define how many values are “near” or “around” each centroid. For example, Table 2 illustrates a centroid histogram having an array length of 5. In this case, there are 16 values near 0.4, 23 values near 1.8, 13 values near 2.2, etc. The centroid histogram can be imprecise because it may not indicate the absolute bounds of each bin; rather, it can provide an approximation of a distribution of values. In various examples, the centroid histogram can include a centroid vector containing centroid values, as indicated by the “Centroid” row in Table 2, and a count vector containing count values, as indicated by the “Count” row in Table 2. The centroid vector and/or the count vector can have corresponding offsets or indices, as indicated by the “Offset” row in Table 2.
By way of contrast, Table 3 illustrates an example of a corresponding traditional histogram having a length of 5. The traditional histogram is or includes an array of objects, each of which has three properties: minimum boundary, maximum boundary, and count.
The advantage of choosing an approximation-based histogram, such as the centroid histogram, is that it can be calculated or constructed “as you go along” (e.g., as data is received in a data stream) and can be available to query during the data streaming process. This is an advantage over traditional histograms because, when there is a stream of data of unknown size, the traditional histogram cannot be calculated until the stream has finished, if the stream ever does finish.
In some examples, the centroid histogram can be initialized using an initial set of values from the stream of data. For example, if the initial values in the stream of data are 0.2, 3.5, 1.6, 4.9, 4.1, 2.3, and 0.4, the first five of these values can be added to the histogram as shown in Table 4. In general, the number of initial values added to the histogram during this step is equal to a length of the array (e.g., an initial length N), which is 5 in this case.
As Table 4 indicates, the initial values can be stored in order by centroid. When a new initial value is added, the stored values can be rearranged, as needed, to keep the values in numerical order.
To add the next value from the stream (2.3 in this case), the value can be added to the array in order, as shown in Table 5. This can be done by, for example: (i) identifying two adjacent elements in the centroid row or vector having centroid values less than and greater than the next value, (ii) inserting a new element between the two adjacent elements in the centroid row, (iii) inserting a new element between corresponding adjacent elements in the count row, (iv) setting a value of the new centroid element (at offset 2) to be equal to the next value (2.3), and (v) setting a value of the new count element to be equal to one. This results in an array length of 6, which exceeds the initial or maximum length of 5, so the next step is to collapse the array back to a length of 5.
An example of a method for collapsing the array is as follows. First, the two adjacent or neighboring bins or buckets having the closest centroid values are identified and merged proportionally. In this example, the two buckets with the closest centroid values are the buckets at offsets 3 and 4. The difference between the centroids in these buckets is 0.6, which is less than the difference between any two other adjacent centroids. These buckets can be proportionally merged by summing the counts for the two buckets and computing a weighted average of the centroids for the two buckets as follows:
Merged centroid=(Centroid1*Count1+Centroid2*Count2)/(Count1+Count2)   (1)
Merged count=Count1+Count2   (2)
where Centroid1 and Count1 are the centroid and count values for one of the buckets, and Centroid2 and Count2 are the centroid and count values for the other bucket.
After collapsing, the centroid histogram can be as shown in Table 6. The histogram in this example stores an array with 5 objects but includes or encodes information for six values. Adding more values to the histogram can be done the same way, without increasing the length or size of the array.
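By way of a non-limiting illustration, the insert-and-collapse procedure described above can be sketched in Python as follows. The class name, method names, and use of plain lists are illustrative assumptions rather than a definition of the data aggregation module 106:

import bisect

class CentroidHistogram:
    """Approximate (BH-TT style) histogram with a bounded number of bins."""

    def __init__(self, max_bins=5):
        self.max_bins = max_bins   # maximum array length (e.g., 5, 50, etc.)
        self.centroids = []        # centroid vector, kept in ascending order
        self.counts = []           # count vector, aligned with the centroid vector

    def add(self, value):
        # Insert the new value as its own bin, keeping the centroids in order.
        i = bisect.bisect(self.centroids, value)
        self.centroids.insert(i, value)
        self.counts.insert(i, 1)
        # If the array now exceeds its maximum length, collapse it back down.
        if len(self.centroids) > self.max_bins:
            self._collapse()

    def _collapse(self):
        # Find the two neighboring bins whose centroids are closest together.
        j = min(range(len(self.centroids) - 1),
                key=lambda k: self.centroids[k + 1] - self.centroids[k])
        c1, n1 = self.centroids[j], self.counts[j]
        c2, n2 = self.centroids[j + 1], self.counts[j + 1]
        # Merge proportionally: weighted-average centroid and summed count.
        self.centroids[j] = (c1 * n1 + c2 * n2) / (n1 + n2)
        self.counts[j] = n1 + n2
        del self.centroids[j + 1]
        del self.counts[j + 1]

Applied to the example above, adding 0.2, 3.5, 1.6, 4.9, and 4.1 followed by 2.3 to CentroidHistogram(max_bins=5) leaves an array of length 5, with the 3.5 and 4.1 buckets merged into a single bucket having a centroid of 3.8 and a count of 2.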
Advantageously, these centroid histograms can be used to accurately approximate median and percentile values, as well as counts over or under a particular threshold. The techniques for approximating each of these values can be similar. An example of computing the median, beginning with the centroid histogram from Table 7 (same as Table 2), is as follows.
To begin, the total number of values represented by the histogram is calculated. In this case, the histogram encapsulates 16+23+13+5+8=65 total values. Since there are an equal number of values greater than and less than the median, the goal is to approximate the value having 32 larger values and 32 smaller values. The overall minimum and maximum values of the data stream can be stored (e.g., separately, outside of the histogram) over a specific time period. In one implementation, a time period of one hour is used, which means one histogram can be created per hour, for a total of 24 histograms per day. If the data stream includes data for more than one feature, additional histograms can be generated for each time period. For example, if there are 10 features represented by the data stream, 10 histograms can be generated each hour, or one histogram per hour for each feature.
Next, the centroid histogram is converted into a traditional histogram. This can be accomplished by considering that, by definition, half of the values in each bucket of the centroid histogram are greater than the centroid, and the other half of the values are below the centroid. Performing this conversion yields the intermediate structure shown in Table 8.
Assuming the overall minimum value was 0 and the maximum value was 5 for the time period, the traditional histogram shown in Table 9 can be generated. The count in each interior bin can be computed by summing the count greater than a lower centroid and the count less than an upper centroid. For instance, the count for the bin at offset 1 in this example is computed by summing 8 (the count greater than the lower centroid) and 11.5 (the count less than the upper centroid). Counts for bins on the ends of the array can be computed from the minimum value, maximum value, and total count.
To obtain the median, a cumulative count can be calculated for each bucket, as shown in Table 10. In this example, the median is somewhere between 1.8 and 2.2, because there are 27.5 values less than 1.8 and 45.5 values less than 2.2, and the median is the 33rd value.
The final step in the computation assumes that the actual values in the bucket are evenly spaced. While this is not precise, it is a good enough approximation when enough buckets are used (e.g., 50 or more). Finding the 33rd value is done by computing:
Value=LB+(UB−LB)/(CC−PCC)*(MC−PCC)   (3)
where LB is lower bound, UB is upper bound, CC is cumulative count, PCC is previous cumulative count, and MC is median count. In this case, the median value, which falls within the bucket at offset 2, is given by: Median value=1.8+(2.2−1.8)/(45.5−27.5)*(33−27.5)=1.92. This final step can include performing a linear interpolation, as shown in Equation (3). Medians and percentiles are one use of these histograms. Counts over or under a particular threshold can also be computed in a similar manner. In some examples, the histograms, values stored within the buckets of the histograms, and/or metrics calculated using the histograms (e.g., median) can be used by the systems and methods described herein as inputs to one or more machine learning models and/or to calculate or monitor various data characteristics, such as data drift. For example, the histograms, values, and/or metrics can be used by the drift identification module 108 and/or the drift monitoring module 110 to detect data drift, trigger one or more alerts, and/or take other corrective action, as described herein. Additionally or alternatively, the histograms, values, and/or metrics can be used by the model management module 112 to refresh a machine learning model, trigger use of a challenger model, and/or take other corrective action, as described herein.
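By way of further illustration, the median approximation described above can be sketched in Python as follows, assuming the histogram is represented by parallel centroid and count lists and that the overall minimum and maximum for the time period are stored separately; the function name and list-based representation are assumptions:

def approximate_median(centroids, counts, minimum, maximum):
    """Approximate the median encoded by a centroid histogram (see Equation (3))."""
    total = sum(counts)
    target = (total + 1) / 2.0            # e.g., the 33rd of 65 values

    # Convert to a traditional histogram: half of each bucket's count lies
    # below its centroid and half lies above it.
    edges = [minimum] + list(centroids) + [maximum]
    bin_counts = [counts[0] / 2.0]
    for i in range(len(centroids) - 1):
        bin_counts.append(counts[i] / 2.0 + counts[i + 1] / 2.0)
    bin_counts.append(counts[-1] / 2.0)

    # Walk the cumulative counts until the target rank is reached, then
    # interpolate linearly within that bin (values assumed evenly spaced).
    cumulative = 0.0
    for i, bin_count in enumerate(bin_counts):
        previous = cumulative
        cumulative += bin_count
        if cumulative >= target:
            lb, ub = edges[i], edges[i + 1]
            return lb + (ub - lb) / (cumulative - previous) * (target - previous)
    return maximum

Using the counts from Table 7 with a stored minimum of 0 and maximum of 5, this sketch returns approximately 1.92, matching the worked example above. Percentiles can be approximated in the same way by choosing a different target rank.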
Referring again to
In general, the drift identification module 108 can be used to compare the scoring data 122 and/or the scoring predictions with any other model input data and/or corresponding model predictions, which may or may not be the training data 114 and the training predictions. The scoring data 122 and the scoring predictions can be referred to herein as “compare data” and “compare predictions,” respectively, and the other model input data and the corresponding model predictions can be referred to herein as “reference data” and “reference predictions,” respectively.
In various implementations, the adaptive drift learner 126 can be a machine learning model created from a series of experiments and a manual assessment of experimental results. A manual set of univariate scenarios can be created to cover different types of drift in numerical, categorical, and/or text features. Tables 11-13 include examples of a few of the experiments created to test bucketing strategies for different two-sample scenarios. Sample 1 in these examples is a feature from training data and Sample 2 is a feature from scoring data. Each scenario is labeled with whether drift should be expected for that test.
For example, “Expected Drift” in these tables indicates whether Sample 1 (from training) and Sample 2 (from scoring) are expected to include or flag drift, with “Green” indicating little or no drift, “Amber” or “Yellow” indicating a moderate amount of drift, and “Red” indicating large amounts of drift. If the PSI metric is used, for example, then default color coding can be as follows: “Green” for less than 0.15; “Amber” or “Yellow” for between 0.15 and 0.25; and “Red” for above 0.25. These default values can be used for prototyping or training experiments. “Missing Drift” in Tables 11 and 13 refers to an extra test that was added to indicate whether the scoring data (in Sample 2) includes more missing data, compared to the training data (in Sample 1). Missing data generally refers to data (e.g., for a feature) that is not available (NA) and/or is not usable (e.g., because the data is in an improper format). Features in the scoring data that have a significant amount of missing data when compared to the training data may be indicative of a data quality problem. The adaptive drift learner 126 can be trained to detect or capture this kind of drift or data quality problem, over time.
For each scenario in the manually derived experiments, all binning strategies described herein can be applied, histograms can be created, and each metric can be applied (e.g., for each binning strategy and histogram). Labeling of the most appropriate binning strategy and metric for each drift scenario can be carried out manually. For example, a combination of binning strategy and drift metric can be assigned a label according to how well the combination reveals drift in the data. Combinations that reveal drift accurately can be labeled with a high score (e.g., 10), for example, and combinations that reveal drift inaccurately can be labeled with a low score (e.g., 0 or 1). Output of the tests, including the labels, can be used to create a dataset for predicting the best binning strategy and metric combination, for example, based on the nature or characteristics of the training data feature, such as length, distribution, minimum and maximum, mean, skewness, number of unique values, and other feature characteristics. For example, the adaptive drift learner 126 can be trained using the test output to predict a suitable (e.g., optimal) binning strategy and/or drift metric. Once trained, the adaptive drift learner 126 can receive as input one or more characteristics or features for a set of data (e.g., length, distribution, minimum, maximum, mean, skewness, number of unique values, or any combination thereof) and provide as output a recommended binning strategy and/or drift metric. Table 14 lists a set of example data characteristics for the adaptive drift learner 126. Additional characteristics can be added over time, for example, according to data that a user has optionally supplied (e.g., a use case of the data or a textual description of a data characteristic).
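For illustration, the data characteristics used as inputs to such a learner could be computed for a feature along the following lines; the exact set of characteristics and their names are assumptions for this sketch:

import numpy as np
from scipy import stats

def feature_characteristics(values):
    """Summarize one feature as an input row for the adaptive drift learner 126."""
    arr = np.asarray(values, dtype=float)
    arr = arr[~np.isnan(arr)]                 # drop missing values for these statistics
    return {
        "length": int(arr.size),
        "minimum": float(arr.min()),
        "maximum": float(arr.max()),
        "mean": float(arr.mean()),
        "skewness": float(stats.skew(arr)),
        "n_unique": int(np.unique(arr).size),
    }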
With regard to the drift metric, histogram-based metrics such as Population Stability Index (PSI) can be used to assess known populations; however, drift detection can require assessing future or unknown data. When binning the data, PSI can fail if one of the comparison sample bins has a frequency of 0. For purposes of drift detection, when a 0 is encountered in new data, a count of 1 can be added to both the new data bin and the corresponding training bin. This can be done for all histogram-based metrics that may require each bin to have a frequency greater than zero. Example pseudocode for calculating PSI with this zero bin correction technique is provided below.
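As one possible rendering of that pseudocode, a minimal Python sketch of the PSI calculation with the zero bin correction is shown below; the function name, and the assumption that the bins were derived from the training data (so that training counts are nonzero), are illustrative:

import math

def psi_with_zero_bin_correction(train_counts, score_counts):
    """Population Stability Index with the zero bin correction described above."""
    train = list(train_counts)
    score = list(score_counts)
    # Zero bin correction: when a scoring (new data) bin is empty, add a count
    # of 1 to both that bin and the corresponding training bin.
    for i, count in enumerate(score):
        if count == 0:
            score[i] += 1
            train[i] += 1
    train_total = sum(train)
    score_total = sum(score)
    psi = 0.0
    for t, s in zip(train, score):
        expected = t / train_total     # training (reference) proportion
        actual = s / score_total       # scoring (compare) proportion
        psi += (actual - expected) * math.log(actual / expected)
    return psi

Under the default color coding described above, a returned value below 0.15 would be “Green,” a value between 0.15 and 0.25 would be “Amber” or “Yellow,” and a value above 0.25 would be “Red.”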
In various examples, a second adjustment can be made to add a bin for tracking missing data. Missing data (NAs) is typically removed from data before statistical calculations are performed. For drift detection, however, it can be important to track such values as “missing,” which can be indicative of either drift or a data quality problem. In some implementations, the counts of the number of missing values for a feature in the training data and the scoring data can be stored, and an extra bin can be appended to the histogram, regardless of the binning strategy employed. If there is less missing data in the scoring data than in the training data for a feature (e.g., the data is of better quality in the scoring data), then missing drift (e.g., an increased amount of missing data) may not be flagged and may not be included in the overall drift metric (e.g., PSI). In general, when labeling test output, decisions on the “most appropriate” automated binning strategy and drift metric can be based on two main parameters or assessments: (1) is the histogram visually informative? and (2) did the metric correctly or incorrectly flag drift?
The adaptive drift learner 126 can use a wide variety of drift metrics. For numeric data, for example, the following metrics can be utilized: Population Stability Index, Kullback-Leibler divergence (relative entropy), Hellinger Distance, Modality Drift (e.g., which can identify bins drifting together), Kolmogorov-Smirnov test, and/or Wasserstein distance. For categorical and/or text data, the following metrics can be utilized: Population Stability Index, Kullback-Leibler divergence, Hellinger Distance, and/or Modality Drift. In general, the drift metric can be used to quantify a similarity or difference between a first distribution of data (e.g., scoring data) and a second distribution of data (e.g., training data). When the drift metric indicates that the two distributions are different, such differences can be indicative of drift.
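As a concrete instance of one such metric, the Hellinger distance between two binned distributions can be computed as in the following sketch; the normalization of the bin counts into proportions is an assumption about how the inputs are prepared:

import math

def hellinger_distance(reference_counts, compare_counts):
    """Hellinger distance between two histograms with aligned bins (0 means identical)."""
    ref_total = sum(reference_counts)
    cmp_total = sum(compare_counts)
    ref = [c / ref_total for c in reference_counts]
    cmp_props = [c / cmp_total for c in compare_counts]
    return math.sqrt(0.5 * sum((math.sqrt(p) - math.sqrt(q)) ** 2
                               for p, q in zip(ref, cmp_props)))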
Additionally or alternatively, the adaptive drift learner 126 can run anomaly detection (e.g., using an isolation forest blueprint or other technique) on the training data to quantify a percentage of anomalies in a training data sample. The anomaly detection model can then be used to predict a percentage of anomalies in a scoring data sample. The adaptive drift learner 126 can generate or output an anomaly drift score, based on a comparison of the percentage or quantity of anomalies in the training data sample and the percentage or quantity of anomalies in the scoring data sample. For example, the anomaly drift score can be the percentage of anomalies in the training data sample divided by the percentage of anomalies in the scoring data sample (e.g., for a specific feature or combination of features).
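For illustration, this anomaly drift score could be computed with an isolation forest roughly as follows; the use of scikit-learn's IsolationForest, the default contamination settings, and the function name are assumptions, not a statement of the actual implementation:

from sklearn.ensemble import IsolationForest

def anomaly_drift_score(train_sample, score_sample):
    """Ratio of the anomaly percentage in training data to that in scoring data."""
    detector = IsolationForest(random_state=0).fit(train_sample)
    train_pct = (detector.predict(train_sample) == -1).mean()   # fraction flagged anomalous
    score_pct = (detector.predict(score_sample) == -1).mean()
    return float(train_pct / score_pct) if score_pct > 0 else float("inf")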
The adaptive drift learner 126 can also use a wide variety of binning strategies. For numeric data, for example, the following binning strategies can be utilized: 10 fixed-width bins, quantiles (e.g., quartiles, deciles, or ventiles), Freedman-Diaconis, and/or Bayesian Blocks. For categorical data, the binning strategy can be or include, for example, any one or more of the following:
For text data, the binning strategy can involve viewing text as a high-cardinality problem. The addition of new words may not be as important as new levels in categorical data, for example, because the way people write can be subjective, cultural, and/or may have spelling mistakes. For drift in text fields, it is generally more important to identify a shift in the entirety of the language, rather than a shift in individual words. For this reason, binning strategies for high cardinality categoricals can be effective for identifying drift at a whole language level. Such binning strategies can be or include, for example:
Alternatively or additionally, the binning strategy for text data can involve giving each frequent word (or phrase) in the training data sample its own bin. The frequency for each bin can be compared directly with the frequency for a corresponding bin for the scoring data sample.
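A minimal sketch of this word-level binning, assuming whitespace tokenization and a fixed number of frequent-word bins, might look like the following:

from collections import Counter

def word_frequency_bins(train_texts, score_texts, top_n=50):
    """Give each frequent training word its own bin and collect aligned counts."""
    train_words = Counter(w for text in train_texts for w in text.lower().split())
    vocab = [w for w, _ in train_words.most_common(top_n)]
    score_words = Counter(w for text in score_texts for w in text.lower().split())
    train_counts = [train_words[w] for w in vocab]
    score_counts = [score_words[w] for w in vocab]   # zero bins can be corrected as above
    return vocab, train_counts, score_counts

The aligned counts can then be passed to a histogram-based metric, such as the PSI or Hellinger distance sketches above.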
In various examples, the adaptive drift learner 126 can use a revised or adjusted strategy for time series forecasting problems, such as demand forecasting. A distinguishing characteristic of time series forecasting problems is that some drift is inherent and/or expected to occur, for example, due to weekly, monthly, or other seasonal variations in one or more features. Thus, for the adaptive drift learner 126 to identify drift that is unexpected (e.g., due to measurement errors or actual variation), the adaptive drift learner 126 is configured to distinguish between expected drift and unexpected drift. When the unexpected drift becomes large or otherwise unacceptable, the adaptive drift learner 126 can provide warnings indicating that a model is unsuitable for use or may be inaccurate. In various examples, expected drift can be drift that exists in both a training dataset and a scoring dataset. Product offerings, for example, may change over time (e.g., in both the training dataset and the scoring dataset) and such changes may be due to expected drift. On the other hand, when a number of customers decreases for a store in the scoring dataset, but not in the training dataset, the change can be due to unexpected drift and investigation can be carried out to determine the reason(s) for the decrease.
Further, time series forecasting problems can involve segmentation strategies that divide or cluster similar entities (e.g., similar values for a feature or features that exhibit similar variations in time or similar frequency content or seasonality) in the time series into distinct segments (e.g., subgroups) and build models for each segment. The adaptive drift learner 126 and/or model management module 112 can monitor drift on the segments individually and trigger retraining pipelines for each segment. For example, when unexpected drift is large for a segment, the model management module 112 can retrain one or more models associated with the segment. The systems and methods described herein can further explore changes in the segmentation strategies, for example, to contrast finer granularities that may provide more accuracy against coarser granularities that may provide faster predictions or more simplicity. For example, store×SKU (e.g., a product number concatenated to a store identifier) can provide more granularity than just store or SKU individually. Further, new segmentation strategies can be tried, models can be developed for the new segmentation strategies, and the models can be evaluated for performance (e.g., accuracy and/or efficiency). In some examples, recommendations for new segmentation strategies can be sent to users for feedback or approval. Additionally or alternatively, the systems and methods may evaluate alternative means of assigning entities in the time series to segments or clusters, based on signals of drift and performance measured after deployment. For example, features that have similar expected and/or unexpected drift can be combined into a single segment.
Referring again to
The “Minimum Value” column in each of these tables contains the minimum value for each bin or bar on the corresponding histogram, with
Referring again to
As an example, if the user makes predictions with the model 104 every Friday, the drift monitoring module 110 can take each individual feature (or subset of features) in the training data and compare it to a corresponding feature in a new set of scoring data provided on a Friday, so that individual feature data drift can be assessed between two points in time (e.g., between a training data time period and the scoring data time period). If feature drift is identified for a feature on one Friday but then the drift disappears or goes back to normal at the next Friday, then the initial drift can be considered transient drift, for example, due to a national holiday or other event (e.g., Black Friday shopping). If feature drift continues over successive Fridays, however, then a significant change may be happening in the system and further investigation should be carried out. This is when the covariate shift classifier of the drift monitoring module 110 can be triggered to determine if drift is occurring in multiple features for those time periods.
In general, the covariate shift classifier can be used to distinguish between the training data and one or more sets of scoring data, for one or more features in the data. In certain examples, the original training data can be concatenated to the scoring data from specific periods of time where individual feature drift has been identified (e.g., from the drift identification module 108). This can result, for example, in a new dataset having the original training data, which can be labeled “Class 1,” and the scoring data from a time period T, which can be labeled “Class 0.” In various examples, any names or labels can be chosen for the target as long as the training data is allocated to one of the classes and the scoring data is allocated to the other class. The covariate shift classifier may not be used to make predictions on new data but instead may be used as an insight model, for example, to determine if and/or why the training and scoring datasets are different. The scoring data time period T can be a single time period (e.g., one day) or an amalgamation of smaller time periods. For example, if predictions have been made for three days in a row and a feature has drifted each day, the time period T for the covariate shift classifier can be three days. Next, the new dataset can be provided as input to the covariate shift classifier, which can classify the data as belonging to either the original training data or the new scoring data. If the datasets are similar and no systemic data drift has occurred, then the classifier may “fail” at discerning between the training data and the scoring data. If there is a substantial shift in the data (e.g., a score of about 0.80 AUC or area under the curve), however, the classifier can easily distinguish between the training data and the scoring data.
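A sketch of such a covariate shift classifier, assuming numeric features and using a scikit-learn gradient boosting estimator (the specific estimator and the five-fold cross-validation are assumptions), might look like the following:

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def covariate_shift_auc(training_df, scoring_df):
    """Label training rows 1 and scoring rows 0, then measure how separable they are."""
    data = pd.concat([training_df.assign(target=1),   # "Class 1": original training data
                      scoring_df.assign(target=0)],   # "Class 0": scoring data from period T
                     ignore_index=True)
    X = data.drop(columns=["target"])
    y = data["target"]
    classifier = GradientBoostingClassifier(random_state=0)
    # An AUC near 0.5 suggests the classifier "fails" to tell the datasets apart
    # (no systemic shift); an AUC of about 0.80 or higher suggests substantial shift.
    return cross_val_score(classifier, X, y, scoring="roc_auc", cv=5).mean()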
The covariate shift classifier can be run like other binary classification models and, in some instances, insights into multivariate data drift can be derived from feature importance or impact. For example, with this type of model, more important features can be the cause of drift between the training data and the scoring data, while less important features can be stable and/or have no drift between the training data and the scoring data. For example,
Referring again to
For example,
Advantageously, for time series models, the systems and methods described herein can be configured to automatically connect predictions and ground truth results, to ensure model accuracy can be monitored and unexpected drift can be identified. In some examples, the systems and methods can determine an association identifier (association ID) that is used to join predictions with correct actuals. The systems and methods can capture ground truth from time series forecasting requests, compute accuracy metrics, issue alerts (e.g., when model accuracy is poor or unexpected drift is detected), and replay data with one or more challenger models (e.g., to determine if a different model may be more accurate). In certain implementations, a user interface is provided that allows users to enable automatic actuals or ground truth feedback for time series models. The user interface can enable users to implement automatic tracking of attributes for segmented analysis of training data and predictions.
Referring to
The example in
When a forecasting request is observed by the system (e.g., in response to a user request), tuples (e.g., timestamp, forecasted_value) can be saved in a database system, for future reconciliation. When a subsequent request occurs, actual values for past predictions may be available as historical values, and corresponding tuples (e.g., timestamp, actual_value) can be extracted. Previously collected tuples for predictions (e.g., timestamp, forecasted_value) can be joined with tuples for actual values (e.g., timestamp, target) using timestamp (or other association ID) as a key. Such data can be used to compute prediction accuracy metrics, such as, for example, root mean square error (RMSE), mean absolute error (MAE), R2, etc.
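For illustration, the reconciliation and accuracy computation could be sketched as follows, assuming the saved tuples are held in pandas data frames with "timestamp," "forecasted_value," and "actual_value" columns (the column names and data frame layout are assumptions):

import numpy as np
import pandas as pd

def reconcile_and_score(predictions_df, actuals_df):
    """Join forecasts with actuals on timestamp and compute accuracy metrics."""
    joined = predictions_df.merge(actuals_df, on="timestamp", how="inner")
    errors = joined["actual_value"] - joined["forecasted_value"]
    rmse = float(np.sqrt((errors ** 2).mean()))
    mae = float(errors.abs().mean())
    ss_res = float((errors ** 2).sum())
    ss_tot = float(((joined["actual_value"] - joined["actual_value"].mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"rmse": rmse, "mae": mae, "r2": r2}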
Referring to
As predictions are made and actual values are received, the predictions and actual values can be stored in a database and/or analyzed to determine model accuracy. For example, referring to
For some use cases, ground truth data (e.g., an actual answer) for a prediction may be known soon after the prediction has been made, or may not be known until several hours, days, weeks, or months later. For example, whether a user will click on a link during a visit to a website can be determined quickly. Alternatively, whether or not a driver will be involved in a car accident under an insurance policy may not be known until the policy is terminated. Advantageously, the systems and methods described herein can allow users to upload ground truth data to the scoring data, so model accuracy can be tracked over time. For example,
Referring again to
Referring again to
In various implementations, a user of the system 100 can set up multiple models to serve as challenger models for the model 104, so that the user can switch from the model 104 to an alternative, challenger model at any time. Such models can be or include, for example, BESPOKE weather models for sports or sales models for holiday events. For example,
Various strategies may be available for the user when configuring challenger models, for example, to provide flexibility for the model risk management (MRM) standards of the user's organization. One such strategy is referred to as “shadowing” and can involve pairing a primary model that serves all predictions with one or more secondary monitored models that receive or serve the same predictions for validation/comparison. Another strategy is referred to as “A/B/n testing” and can involve testing the primary model and one or more secondary models by weighting prediction traffic to the primary model and the one or more secondary models (e.g., some predictions are assigned to the primary model and other predictions are assigned to secondary models). Another strategy is referred to as “tiered promotion” and can involve facilitating model validation in several lower tiered environments (e.g., development, staging/UAT) before models are promoted to production deployment.
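By way of illustration only, the A/B/n weighting of prediction traffic could be implemented roughly as follows; the dictionary-based routing and the example weights are assumptions rather than part of the described strategies:

import random

def route_prediction(models, weights, features):
    """Send one prediction request to a model chosen according to traffic weights."""
    # models:  {"champion": model, "challenger_a": model, ...}
    # weights: {"champion": 0.8, "challenger_a": 0.2}
    name = random.choices(list(weights), weights=list(weights.values()), k=1)[0]
    return name, models[name].predict(features)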
Referring again to
In various examples, when the accuracy of the model 104 has been flagged as degrading and there is a sufficient quantity of new ground truth data 132 available, then a new set of training data may be constructed by performing append, reduce, and/or replace operations on the training data. These operations can be performed using the data management component 137, which can choose a suitable (e.g., optimal) data operation based on one or more data characteristics (e.g., a size of the training data and/or the scoring data, an amount of drift in the training data and/or the scoring data, and/or a percentage of anomalies in the training data and/or the scoring data). For example, the data management component 137 can receive the data characteristics as input and provide as output a selected (e.g., optimal) data operation. Alternatively or additionally, the data management component 137 can implement or perform the selected (e.g., optimal) data operation automatically, based on the data characteristics. In some implementations, a user can specify the data operations that will be performed or can define a customized set of retraining requirements. Additionally or alternatively, the user can adjust or customize the data management component 137 to choose data operations preferred by the user.
In some instances, for example, new scoring data can be appended to the original training data to make a new training data set. The append operation may be preferable (and chosen by the data management component 137) when the original dataset is less than a threshold size (e.g., 50,000 rows, where one row can represent an observation or record). There may be a trade-off between dataset size and time or computational power required when using the append operation, given that appending scoring data each time can result in a very large dataset over time.
Additionally or alternatively, the reduce operation can be performed to reduce a size of the original training data while retaining the new scoring data. Reducing the original training data can be performed, for example, by selecting and removing a random sample of fixed length from the training data. For example, 20,000 rows, 20% of the rows, a user-defined number of rows, or some other portion of the training data can be randomly selected and removed, and all other training data can be retained. Additionally or alternatively, reducing the original training data set can involve removing all rows that are older than a specified age. For example, all rows corresponding to training data older than 3 months, 6 months, one year, a user-specified age, or other age can be selected and removed from the training data, and all other training data can be retained. Additionally or alternatively, an anomaly detection model (e.g., built on new scoring data) can be used to make anomaly predictions on the original set of training data. The most anomalous rows of the training data can then be identified and removed. In some instances, for example, the quantity of anomalous rows removed can be specified by the user and/or can be 10%, 20%, 50%, or some other portion of the training data. Non-anomalous training rows can then be appended to new scoring data to make the new set of training data.
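A minimal sketch of combining the append and reduce operations with pandas is shown below; the 50,000-row cutoff and 20% random reduction follow the examples above, while the function name and parameters are assumptions:

import pandas as pd

def build_new_training_data(training_df, scoring_df, max_append_rows=50_000,
                            reduce_fraction=0.2, random_state=0):
    """Append new scoring data, reducing the original training data first if it is large."""
    if len(training_df) >= max_append_rows:
        # Reduce: randomly remove a fixed fraction of the original training rows.
        dropped = training_df.sample(frac=reduce_fraction, random_state=random_state).index
        training_df = training_df.drop(index=dropped)
    # Append: concatenate the retained training rows with the new scoring data.
    return pd.concat([training_df, scoring_df], ignore_index=True)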
The model management module 112 can implement an approval policy framework to ensure that model deployment and/or replacement (e.g., driven by challenger models) is accomplished in a controlled and auditable manner. Referring to
Referring again to
In general, a “prediction environment” can be or include a computing environment in which a model is deployed and/or used to make predictions. The prediction environment can be or include, for example, a computing platform (e.g., a web-based or online platform hosted by a third party, such as a company, corporation, or other entity that does not provide or host the MLOps environment) that performs operations associated with deploying, running, or executing a predictive model (e.g., model 104). Such operations can include, for example, providing the model with input data (e.g., scoring data), using the model to make predictions (e.g., the predictions 123), and providing the predictions as output from the model.
In general, the monitoring agent 160 can allow users to monitor features, prediction results, and prediction accuracy for models running in any prediction environment in near-real time, and the monitoring can be performed without knowledge of the model structure (e.g., schema for model inputs and outputs). Referring to
In a typical example, the monitoring agent 160 can receive model predictions, model features, model performance data, and other model data from a prediction environment. The model data can be ingested and/or processed using the MLOps library 1706 (and associated APIs) and provided to the message buffer 1704. The message buffer 1704 can forward the processed model data to the monitoring agent service 1702 in real time, upon request, or at desired intervals. The monitoring agent service 1702 can aggregate the processed model data, as desired, and forward the processed model data to the MLOps components 1708, which can take action based on the data and/or can display the data for users.
The management agent 162 can provide users with automated and standardized management of models and model prediction environments. The automation can encompass a full model deployment lifecycle and can include capabilities for provisioning and maintaining an associated infrastructure responsible for serving a model (e.g., in a prediction environment). The management agent 162 can accomplish these tasks by translating user actions in other system components and applying the actions to both individual model deployments and related software infrastructure. Actions supported by the management agent 162 can include actions in modeling environments (e.g., where models are developed and trained) and prediction environments (e.g., where models are deployed and run). Such actions can include, for example: deploying models; stopping models; deleting models; replacing models; determining model health status (e.g., model accuracy); executing prediction jobs; determining prediction job status (e.g., job progress or time remaining for a job); determining prediction environment health status (e.g., identifying issues with data drift or prediction drift); starting a prediction environment; and stopping a prediction environment. The management agent 162 can respect but be decoupled from upstream replacement and approval policies implemented by the model management module 112. For example, the management agent 162 may take action only after approvals have been received in accordance with an organization's approval policy.
In various examples, the management agent 162 supports a plugin architecture that decouples a management framework from a mechanism that applies user actions in the prediction environment. This can provide flexibility of usage in any prediction environment, such as, for example, KUBERNETES, DOCKER, AWS LAMBDA, etc. The management agent 162 can utilize a stateless design and reconciliation methodology, which can enable fault tolerance while providing eventual consistency. With the stateless design and reconciliation methodology, for example, the management agent 162 itself may not store a state of either a deployment in an MLOps application environment or a deployment in the prediction environment. When the management agent 162 starts or recovers from an outage, the management agent 162 can inspect both environments and reconcile any changes that should be applied and/or may have occurred during the outage.
In an example involving model deployment, the model/environment event 1806 can require a model to be retrieved from one or more storage locations 1810, which can utilize or include storage available in the MLOps application 1804, remote storage, a cloud storage service, or a third party storage service or repository, such as, for example, AMAZON S3, GITHUB, or ARTIFACTORY. To enable communications between the management agent core service 1808 and a variety of storage locations 1810, the management agent 162 includes or utilizes one or more model repository plugins 1812. The plugins 1812 can provide flexibility by allowing the management agent core service 1808 to communicate and exchange data with the various storage locations 1810, which can each utilize or include a unique communication protocol and/or data or storage schema. Each of the plugins 1812 can be associated with a respective storage location 1810. The plugins 1812 can be used to retrieve a model 1814 and provide the model 1814 to the management agent core service 1808.
To take an action with respect to a model (e.g., the model 1814), the management agent 162 can include or utilize one or more prediction environment plugins 1816. The plugins 1816 can provide flexibility by allowing the management agent core service 1808 to communicate and exchange data with various prediction environments 1818. In some examples, the prediction environments 1818 can be or include one or more computing platforms (e.g., hosted by third parties) that perform operations associated with deploying, running, or executing predictive models. Examples of such computing platforms can include KUBERNETES (EKS), KUBERNETES (GKE), AWS LAMBDA, and DOCKER. Each of the plugins 1816 can be associated with a respective prediction environment 1818. In the depicted example, the plugins 1816 receive an event 1820 from the management agent core service 1808, which can generate the event 1820 in response to the model/environment event 1806. When the model/environment event 1806 includes a request to deploy a model, for example, the event 1820 can include or correspond to a model deployment request. One of the plugins 1816 can then forward the event 1820 to a respective prediction environment 1818, which can take an action 1822 in response to the event 1820. The action 1822 can be or include, for example, launching a model deployment, replacing a model with a different model, checking the status of the model, running a prediction job, or any other action performed in the prediction environment 1818 with respect to the model.
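The decoupling provided by the prediction environment plugins 1816 could be expressed as a small plugin interface along the following lines; the class and method names are illustrative assumptions rather than the actual plugin API:

from abc import ABC, abstractmethod

class PredictionEnvironmentPlugin(ABC):
    """Adapter between the management agent core service and one prediction environment."""

    @abstractmethod
    def deploy_model(self, model_artifact, deployment_id):
        """Launch a model deployment in the target environment (e.g., KUBERNETES or DOCKER)."""

    @abstractmethod
    def replace_model(self, deployment_id, new_model_artifact):
        """Replace a running model with a different model."""

    @abstractmethod
    def deployment_status(self, deployment_id):
        """Return health or status information for an existing deployment."""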
In various examples, the systems and methods described herein can be used to achieve centralized deployment, management, and/or control of an organization's statistical, rule-based, and predictive models, independent of the underlying modeling platform. The systems and methods can use a set of interrelated components (e.g., the data aggregation module 106, the drift identification module 108, the drift monitoring module 110, the model management module 112, and any portions thereof) that can be mixed and matched, depending on business requirements. For example, a company that updates models frequently may be more interested in model management than in data management, whereas a company whose models are regulated by external governance may be focused on data management and/or data drift identification. The modular nature of the systems and methods enables plug-and-play capabilities to support diverse business challenges associated with models. Techniques for real-time web analytics can be adapted to provide efficient metrics for monitoring model accuracy, model health (e.g., number of scoring rows being rejected), and data drift (changes in the data over time).
It is estimated that 61% of businesses implemented artificial intelligence (AI) in 2017, and 71% of executives surveyed said their company has an innovation strategy to push investments in new technologies, such as automated machine learning (AutoML). For late adopters of AI and AutoML, there are several technological options to investigate, but for early adopters there may be an innovation gap. Such companies can have predictive analytic models integrated into their current systems and may have teams of data scientists available, but the companies may want to isolate the deployment, management, and performance monitoring of their models. Companies that were early adopters may now understand issues involved with taking a machine learning model and translating the model's value into terms of dollars and/or customer metrics, such as booking cancellations.
Post-modeling can be considered part of operations rather than a responsibility of data scientists, which can free up the data scientists to focus on developing new models and projects. This split may be somewhat analogous to a difference between software development and IT operations, where software engineers are freed from the responsibility of system maintenance. Data science platforms have also recognized the difficulty in deploying machine learning models to production, as well as identifying the distinction between a data scientist and an operations software engineer.
In addition to a post-modeling innovation gap, there may also be a problem of infrastructure. For example, as differing parts of an organization adopted AI at different speeds, models were implemented using chosen tools of the data scientists or implemented in legacy software, such as SAS, because of licensing restrictions. Thus, centralizing post-modeling and making predictive analytics a part of everyday business operations requires a technological solution that can seamlessly integrate multiple models from disparate platforms and from multiple business divisions. Advantageously, the systems and methods described herein can provide this technological solution.
A machine learning model should be treated like any other organizational asset. The model can have a distinct product lifecycle and/or can degrade over time in response to environmental factors, such as economic conditions, competitors, and/or changes in customer behavior. A key aspect of model lifecycle management can be to monitor and manage both the machine learning model and the data the model uses to make predictions. The systems and methods described herein provide a technological solution capable of identifying any changes (drift) in the data, evaluating the impact this drift may have on the performance of the model, and taking appropriate action by adapting the model to this new environment. Data drift can erode data fidelity, operational reliability, and ultimately productivity, and it can increase costs and lead to poor decision-making by data scientists.
There are several business problems and challenges that the systems and methods described herein are able to solve. The diversity of these challenges can illustrate the innovation gap in both post-modeling operations and in technological solutions available to businesses. In one example involving information technology and operations, a large, multinational company may want to centralize its machine learning operations, including centralized cloud management and control. The company may need a technological innovation capable of deploying models, along with the company's containerized runtime environment, in a seamless way that allows data scientists to use tools of their choice while sharing the same underlying infrastructure that allows deployment of models at scale. Advantageously, the systems and methods described herein can be used by the company to provide automated monitoring of the performance of machine learning models from both a cloud usage and data science perspective. The company can have business models that may generate billions of predictions every day, resulting in a massive volume of data. The systems and methods can accurately record statistics about all of these predictions in a format that is both efficient to store and fast to query. Additionally or alternatively, the company may have an internal predictive model that predicts an amount of memory a job will take before being allocated cloud resources, such as containers. The actual memory used by the job may be available when the job has been completed. With the centralization of machine learning operations, the systems and methods can achieve a more diverse set of users, use cases, datasets, and models over time. The systems and methods can automatically adapt and refresh the company's job resource model in response to changing environments, without the need of a data scientist. Such information technology and operations use cases may be focused on or receive significant benefit from the MLOps controller 120, which can act as an interface between models, users, and the cloud.
In another example, related to sports and gaming, an online sports data company may have a technological need for predictive models that are integrated into the company's real-time sports data streaming and/or fantasy sports picks. The systems and methods described herein can provide an IT operations solution where multiple models from multiple sources can be deployed together and a post-modeling solution where the models can be updated and retrained when data drift has occurred. The systems and methods can provide a short-forecast solution that can make real-time in-play predictions from streaming data and adapt the model in-play, as needed. Additionally or alternatively, the systems and methods can include a long-forecast solution (e.g., for tournaments and leagues) when an automatic model refresh may be triggered after data drift has been identified. The systems and methods can run the short-forecast and long-forecast models in parallel (e.g., as champion models and challengers) and can predict on real-time streaming data. The systems and methods can allow the company to seamlessly switch between the models during sporting events, for example, when using BESPOKE models fine-tuned to weather conditions for each sporting event.
In general, the short-term model which refreshes regularly may rely on the data aggregation module 106 and/or the model management module 112. The long-term model may rely on the data aggregation module 106, the drift identification module 108, the drift monitoring module 110, and/or the model management module 112. The company's ability to switch models can be achieved using the model management module 112 at deployment time, with multiple models being run in parallel.
In another example, involving finance and banking, a financial institution may have several machine learning models in production. The models may range from low-risk, unregulated models, such as marketing models, to high-risk models that contain personal financial information and are heavily regulated by external governance bodies. In such instances, any changes in data may need to be identified early to ensure the model adheres to strict constraints. The systems and methods described herein can provide the institution with both (i) a deployed model alert system that notifies risk analysts of any fluctuations in scoring data and (ii) an A/B testing capability where the institution can run an old model and a replacement model together for a specified period of time. The financial institution may utilize the data aggregation module 106, the drift identification module 108, the drift monitoring module 110, and/or the model management module 112 to achieve such capabilities.
In another example, a leading manufacturer of farming equipment may have several suppliers of parts that make up the manufacturer's machinery, and each part may need its own warranty related to an overall parent product warranty. In such a case, the manufacturer may have been having problems with data quality where some suppliers used the wrong measurement units (imperial not metric) and others failed to supply all relevant information needed to predict an overall product warranty cost. Advantageously, the systems and methods described herein can be used by the manufacturer to identify parts associated with data quality issues and reject or revise such data before it reaches the warranty model. For example, the manufacturer can utilize the data aggregation module 106 and the drift identification module 108 to identify any missing or incorrect data and reject corresponding rows or observations.
In some implementations, use of the data aggregation module 106 can avoid catastrophic system failures caused by processing or storing data being delivered in a data stream, for example, at a rate of a million predictions per hour (or more) and continuing over long periods of time (e.g., one day, one week, one month, one year, or more). Advantageously, the systems and methods described herein can provide an innovative solution to achieve an efficient computation of metrics on a stream of numeric data of unknown size. Organizations that monitor model performance and data drift in real-time applications can have a need for such a capability.
In some examples, some or all of the processing described above can be carried out on a personal computing device, on one or more centralized computing devices, or via cloud-based processing by one or more servers. Some types of processing can occur on one device and other types of processing can occur on another device. Some or all of the data described above can be stored on a personal computing device, in data storage hosted on one or more centralized computing devices, and/or via cloud-based storage. Some data can be stored in one location and other data can be stored in another location. In some examples, quantum computing can be used and/or functional programming languages can be used. Electrical memory, such as flash-based memory, can be used.
The memory 2020 stores information within the system 2000. In some implementations, the memory 2020 is a non-transitory computer-readable medium. In some implementations, the memory 2020 is a volatile memory unit. In some implementations, the memory 2020 is a non-volatile memory unit.
The storage device 2030 is capable of providing mass storage for the system 2000. In some implementations, the storage device 2030 is a non-transitory computer-readable medium. In various different implementations, the storage device 2030 may include, for example, a hard disk device, an optical disk device, a solid-state drive, a flash drive, or some other large capacity storage device.
For example, the storage device may store long-term data (e.g., database data, file system data, etc.). The input/output device 2040 provides input/output operations for the system 2000. In some implementations, the input/output device 2040 may include one or more network interface devices, e.g., an Ethernet card, a serial communication device, e.g., an RS-232 port, and/or a wireless interface device, e.g., an 802.11 card, a 3G wireless modem, or a 4G wireless modem. In some implementations, the input/output device may include driver devices configured to receive input data and send output data to other input/output devices, e.g., keyboard, printer and display devices 2060. In some examples, mobile computing devices, mobile communication devices, and other devices may be used.
In some implementations, at least a portion of the approaches described above may be realized by instructions that upon execution cause one or more processing devices to carry out the processes and functions described above. Such instructions may include, for example, interpreted instructions such as script instructions, or executable code, or other instructions stored in a non-transitory computer readable medium. The storage device 2030 may be implemented in a distributed way over a network, such as a server farm or a set of widely distributed servers, or may be implemented in a single computing device.
Although an example processing system has been described in
The term “system” may encompass all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. A processing system may include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). A processing system may include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
A computer program (which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
Computers suitable for the execution of a computer program can include, by way of example, general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. A computer generally includes a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device (e.g., a universal serial bus (USB) flash drive), to name just a few.
Computer readable media suitable for storing computer program instructions and data include all forms of nonvolatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's user device in response to requests received from the web browser.
Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable sub-combination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a sub-combination or variation of a sub-combination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous. Other steps or stages may be provided, or steps or stages may be eliminated, from the described processes. Accordingly, other implementations are within the scope of the following claims.
The phraseology and terminology used herein are for the purpose of description and should not be regarded as limiting.
The term “approximately,” the phrase “approximately equal to,” and other similar phrases, as used in the specification and the claims (e.g., “X has a value of approximately Y” or “X is approximately equal to Y”), should be understood to mean that one value (X) is within a predetermined range of another value (Y). The predetermined range may be plus or minus 20%, 10%, 5%, 3%, 1%, 0.1%, or less than 0.1%, unless otherwise indicated.
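Purely as an illustration of this definition, and not as part of any claimed system or method, the following minimal sketch (in Python) shows one way such an “approximately equal to” check could be expressed; the function name, the default tolerance, and the choice to measure the predetermined range as a symmetric fraction of Y are assumptions made only for this example.

def approximately_equal(x, y, tolerance=0.20):
    # Illustrative only: treat "X is approximately equal to Y" as X lying
    # within plus or minus `tolerance` (a fraction of Y) of Y; for instance,
    # tolerance=0.20 corresponds to the "plus or minus 20%" range above.
    return abs(x - y) <= tolerance * abs(y)

# For example, 105 is within 10% of 100, while 125 is not:
print(approximately_equal(105, 100, tolerance=0.10))  # True
print(approximately_equal(125, 100, tolerance=0.10))  # False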
The indefinite articles “a” and “an,” as used in the specification and in the claims, unless clearly indicated to the contrary, should be understood to mean “at least one.” The phrase “and/or,” as used in the specification and in the claims, should be understood to mean “either or both” of the elements so conjoined, i.e., elements that are conjunctively present in some cases and disjunctively present in other cases. Multiple elements listed with “and/or” should be construed in the same fashion, i.e., “one or more” of the elements so conjoined. Other elements may optionally be present other than the elements specifically identified by the “and/or” clause, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, a reference to “A and/or B”, when used in conjunction with open-ended language such as “comprising” can refer, in one embodiment, to A only (optionally including elements other than B); in another embodiment, to B only (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.
As used in the specification and in the claims, “or” should be understood to have the same meaning as “and/or” as defined above. For example, when separating items in a list, “or” or “and/or” shall be interpreted as being inclusive, i.e., the inclusion of at least one, but also including more than one, of a number or list of elements, and, optionally, additional unlisted items. Only terms clearly indicated to the contrary, such as “only one of” or “exactly one of,” or, when used in the claims, “consisting of,” will refer to the inclusion of exactly one element of a number or list of elements. In general, the term “or” as used shall only be interpreted as indicating exclusive alternatives (i.e., “one or the other but not both”) when preceded by terms of exclusivity, such as “either,” “one of,” “only one of,” or “exactly one of.” “Consisting essentially of,” when used in the claims, shall have its ordinary meaning as used in the field of patent law.
As used in the specification and in the claims, the phrase “at least one,” in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements. This definition also allows that elements may optionally be present other than the elements specifically identified within the list of elements to which the phrase “at least one” refers, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, “at least one of A and B” (or, equivalently, “at least one of A or B,” or, equivalently “at least one of A and/or B”) can refer, in one embodiment, to at least one, optionally including more than one, A, with no B present (and optionally including elements other than B); in another embodiment, to at least one, optionally including more than one, B, with no A present (and optionally including elements other than A); in yet another embodiment, to at least one, optionally including more than one, A, and at least one, optionally including more than one, B (and optionally including other elements); etc.
The use of “including,” “comprising,” “having,” “containing,” “involving,” and variations thereof, is meant to encompass the items listed thereafter and additional items.
Use of ordinal terms such as “first,” “second,” “third,” etc., in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed. Ordinal terms are used merely as labels to distinguish one claim element having a certain name from another element having the same name (but for use of the ordinal term).
This application claims priority to and benefit of U.S. Provisional Application No. 63/037,894, titled “Systems and Methods for Managing Machine Learning Models” and filed under Attorney Docket No. DRB-016PR on Jun. 11, 2020, the entire disclosure of which is hereby incorporated by reference.
Provisional application data:

Number | Date | Country
---|---|---
63/037,894 | Jun. 2020 | US

Related parent/child application data:

Relation | Number | Date | Country
---|---|---|---
Parent | 17/344,252 | Jun. 2021 | US
Child | 18/582,380 | | US