An increasing number of technology areas are becoming driven by data and the analysis of such data to develop insights. One way to do this is with data science models (e.g., machine-learning models) that may be created based on historical data and then applied to new data to derive insights such as predictions of future outcomes.
In many cases, the use of a given data science model is accompanied by a desire to explain an output of the model, such that an appropriate action might be taken in view of the insight provided. However, many data science models are extremely complex and the manner by which they derive insights can be difficult to analyze. For example, it may not be apparent how the output of a data science model for a particular input data record was influenced by any given feature that the data science model uses as input. Therefore, it can be difficult to interpret which features had the greatest effect on the output generated by the model.
Disclosed herein is a new technique for rapidly, efficiently, and accurately quantifying the influence of specific features (e.g., determining contribution values) on the output of a trained data science model. In one aspect, the disclosed technology may take the form of a method to be carried out by a computing platform that involves (i) receiving a request to compute a score for an input data record, the input data record comprising a group of actual parameters that map to a set of features that a trained data science model is configured to receive as input; (ii) partitioning the set of features into a plurality of global feature groups based on dependencies between the features; (iii) inputting the group of actual parameters into the trained data science model, wherein the trained data science model comprises an ensemble of decision trees, and wherein: (a) each individual decision tree in the ensemble is symmetric, (b) each individual decision tree in the ensemble is configured to receive a respective subset of the features as input, and (c) within each individual decision tree, internal nodes that are positioned in a same level designate a same splitting criterion based on a same feature selected from the respective subset of features; (iv) for each individual decision tree in the ensemble: (a) assigning each individual feature in the subset of features for the individual decision tree to a corresponding local feature group that is a subset of a given global feature group that includes the individual feature, (b) identifying a respective leaf such that the actual parameters satisfy a series of splitting conditions for edges that connect nodes in a respective path from a root of the individual decision tree to the respective leaf, and (c) based on the corresponding local feature groups, determining a set of respective individual contribution values for the respective leaf, wherein each of the respective individual contribution values maps to a respective feature found in the respective subset of features of the individual decision tree; (v) for each individual feature in the set of features, computing a respective overall contribution value based on: (a) the respective global feature group, and (b) a sum of the respective individual contribution values that map to that individual feature; and (vi) computing, via the trained data science model, the score for the input data record.
In some examples, the method carried out by the computing platform further involves: for each individual decision tree in the ensemble, generating a first matrix of weights, wherein each first weight in the first matrix corresponds to a respective subset pair comprising a first subset of the local feature groups and a second subset of the local feature groups.
Further, in some examples, the method carried out by the computing platform involves: for each individual decision tree in the ensemble: generating a second matrix of weights and a third matrix of weights, wherein, for each of a plurality of mixed pairs comprising (i) a respective one of the local feature groups and (ii) a respective subset pair, there is a second corresponding weight in the second matrix and a third corresponding weight in the third matrix.
Still further, in some examples, the method carried out by the computing platform involves: (i) identifying at least one reason code for the score based on the respective overall contribution values for the individual features in the set of features; and (ii) transmitting the score and the at least one reason code in response to the request.
Still further, in some examples, the method carried out by the computing platform involves: prior to receiving the request, training the trained data science model against training data that comprises a plurality of training data records.
Still further, in some examples, determining the set of respective individual contribution values for the respective leaf comprises: (i) identifying each realizable path from the root of the individual decision tree to each realizable leaf in the individual decision tree, respectively; (ii) for each identified realizable path, computing a respective first probability by dividing a number of the training data records that were scored during the training based on the identified realizable path by a total number of training data records in the training data; (iii) for each identified realizable path, identifying a respective score to be assigned to input data records scored by the identified realizable path; (iv) for each level of the individual decision tree, identifying the same feature on which the same splitting criterion specified by the internal nodes at that level is based; (v) identifying subsets of the respective subset of features that the individual decision tree is configured to receive as input; (vi) for each identified subset of the respective subset of features, identifying a respective group of realizable paths such that, for each level of the individual decision tree in which the same splitting criterion for that level is based on a feature included in the identified subset, the respective path and the realizable paths in the respective group have a same path direction from that level to a next level of the individual decision tree; (vii) for each identified subset of the respective subset of features, computing a sum of the respective first probabilities for each realizable path in the identified subset; and (viii) for each identified subset of the respective subset of features, computing a marginal path expectation by multiplying the respective score for the respective path by the sum for the identified subset.
Still further, in some examples, the method carried out by the computing platform involves: for each individual decision tree in the ensemble: generating, based on the identified groups of realizable paths for the respective path and based on the computed marginal path expectations, a vector of sums of marginal path expectations.
Still further, in some examples, identifying each realizable path from the root of the individual decision tree to each realizable leaf in the individual decision tree, respectively, comprises: (i) identifying a selected path to be evaluated for realizability; (ii) detecting that a first splitting condition for a first edge in the selected path and a second splitting condition for a second edge in the path contradict each other; and (iii) excluding the selected path from a list of realizable paths.
Still further, in some examples, determining the set of respective individual contribution values for the respective leaf comprises: (i) receiving an identifier of a leaf selected from a decision tree in the ensemble; and (ii) based on the identifier of the leaf, determining a set of contribution values to which the identifier maps in a data structure, wherein the determined set of contribution values to which the identifier maps in the data structure is the set of respective individual contribution values.
Still further, in some examples, the method carried out by the computing platform involves: prior to receiving the request, generating a respective set of contribution values for each leaf in the ensemble of decision trees and populating the data structure with entries that map the leaves in the ensemble of decision trees to the respective sets of contribution values, wherein generating a respective set of contribution values comprises: (i) identifying each realizable path from the root of the individual decision tree to each realizable leaf in the individual decision tree, respectively; (ii) for each identified realizable path, computing a respective first probability by dividing a number of the training data records that were scored during the training based on the identified realizable path by a total number of training data records in the training data; (iii) for each identified realizable path, identifying a respective score to be assigned to input data records scored by the identified realizable path; (iv) for each level of the individual decision tree, identifying the same feature on which the same splitting criterion specified by the internal nodes at that level is based; (v) identifying subsets of the respective subset of features that the individual decision tree is configured to receive as input; (vi) for each identified subset of the respective subset of features, identifying a respective group of realizable paths such that, for each level of the individual decision tree in which the same splitting criterion for that level is based on a feature included in the identified subset, the respective path and the realizable paths in the respective group have a same path direction from that level to a next level of the individual decision tree; (vii) for each identified subset of the respective subset of features, computing a sum of the respective first probabilities for each realizable path in the identified subset; and (viii) for each identified subset of the respective subset of features, computing a marginal path expectation by multiplying the respective score for the respective path by the sum for the identified subset.
Still further, in some examples, the at least one reason code comprises a model reason code (MRC) or an adverse action reason code (AARC).
In yet another aspect, disclosed herein is a computing platform that includes a network interface for communicating over at least one data network, at least one processor, at least one non-transitory computer-readable medium, and program instructions stored on the at least one non-transitory computer-readable medium that are executable by the at least one processor to cause the computing platform to carry out the functions disclosed herein, including but not limited to the functions of the foregoing method.
In still another aspect, disclosed herein is a non-transitory computer-readable medium provisioned with program instructions that, when executed by at least one processor, cause a computing platform to carry out the functions disclosed herein, including but not limited to the functions of the foregoing method.
One of ordinary skill in the art will appreciate these as well as numerous other aspects in reading the following disclosure.
Entities in various industries have begun to utilize data science models to derive insights that may enable those entities, and the goods and/or services they provide, to operate more effectively and/or efficiently. The types of insights that may be derived in this regard may take numerous different forms, depending on the entity utilizing the data science model and the type of insight that is desired. As one example, an entity may utilize a data science model to predict the likelihood that an industrial asset will fail within a given time horizon based on operational data for the industrial asset (e.g., sensor data, actuator data, etc.). As another example, data science models may be used in a medical context to predict the likelihood of a disease or other medical condition for an individual, and/or the result of a medical treatment for the individual.
As yet another example, many entities (e.g., companies or corporations) have begun to utilize data science models to help make certain operational decisions with respect to prospective or existing customers of those entities. For instance, as one possibility, an entity may utilize a data science model to help make decisions regarding whether to extend a service provided by that entity to a particular individual. One example may be an entity that provides services such as loans, credit card accounts, bank accounts, or the like, which may utilize a data science model to help make decisions regarding whether to extend one of these services to a particular individual (e.g., by estimating a risk level for the individual and using the estimated risk level as a basis for deciding whether to approve or deny an application submitted by the individual). As another possibility, an entity may utilize a data science model to help make decisions regarding whether to target a particular individual when engaging in marketing of a good and/or service that is provided by the entity (e.g., by estimating a similarity of the individual to other individuals who previously purchased the good and/or service). As yet another possibility, an entity may utilize a data science model to help make decisions regarding what terms to offer a particular individual for a service provided by the entity, such as what interest rate level to offer a particular individual for a new loan or a new credit card account. Many other examples are possible as well.
One illustrative example of a computing environment 100 in which an example data science model such as this may be utilized is shown in
For instance, as shown in
Further, as shown in
Further yet, as shown in
Still further, as shown in
Referring again to
For instance, as one possibility, the data output subsystem 102e may be configured to output certain data to client devices that are running software applications for accessing and interacting with the example computing platform 102, such as the two representative client devices 106a and 106b shown in
In order to facilitate this functionality for outputting data to the consumer systems 106, the data output subsystem 102e may comprise one or more Application Programming Interfaces (APIs) that can be used to interact with and output certain data to the consumer systems 106 over a data network, and perhaps also an application service subsystem that is configured to drive the software applications running on the client devices 106a-c, among other possibilities.
The data output subsystem 102e may be configured to output data to other types of consumer systems 106 as well.
Referring once more to
The example computing platform 102 may comprise various other functional subsystems and take various other forms as well.
In practice, the example computing platform 102 may generally comprise some set of physical computing resources (e.g., processors, data storage, communication interfaces, etc.) that are utilized to implement the functional subsystems discussed herein. This set of physical computing resources may take any of various forms. As one possibility, the computing platform 102 may comprise cloud computing resources that are supplied by a third-party provider of “on demand” cloud computing resources, such as Amazon Web Services (AWS), Amazon Lambda, Google Cloud Platform (GCP), Microsoft Azure, or the like. As another possibility, the example computing platform 102 may comprise “on-premises” computing resources of the entity that operates the example computing platform 102 (e.g., entity-owned servers). As yet another possibility, the example computing platform 102 may comprise a combination of cloud computing resources and on-premises computing resources. Other implementations of the example computing platform 102 are possible as well.
Further, in practice, the functional subsystems of the example computing platform 102 may be implemented using any of various software architecture styles, examples of which may include a microservices architecture, a service-oriented architecture, and/or a serverless architecture, among other possibilities, as well as any of various deployment patterns, examples of which may include a container-based deployment pattern, a virtual-machine-based deployment pattern, and/or a Lambda-function-based deployment pattern, among other possibilities.
It should be understood that computing environment 100 is one example of a computing environment in which a data science model may be utilized, and that numerous other examples of computing environments are possible as well.
Most data science models today comprise a trained model object (sometimes called a trained “regressor”) that is configured to (i) receive input data (e.g., actual parameters) for some set of input variables (e.g., formal parameters), (ii) evaluate the input data, and (iii) based on the evaluation, output a “score” (e.g., a likelihood value). For at least some data science models, the score is then used by the data science model to make a classification decision, typically by comparing the score to a specified score threshold (if the score is quantitative as opposed to categorical), depending on the application of the data science model in question.
These types of trained model objects are generally created by applying a machine-learning process to a training dataset that is relevant to the particular type of classification decision to be rendered by the data science model (e.g., a set of historical data records that are each labeled with an indicator of a classification decision based on the historical data record, wherein each training instance in the training dataset includes a label for an individual historical data record and the actual parameters specified in that individual historical data record). In this respect, the machine learning process may comprise any of various machine learning techniques, examples of which may include regression techniques, decision-tree techniques, support vector machine (SVM) techniques, Bayesian techniques, ensemble techniques, gradient descent techniques (e.g., including gradient boosting), and/or neural network techniques, among various other possibilities.
The type of classification decision that is made by the data science model 208 shown in
As shown in
In some implementations, the data science model 208 may initially receive source data (e.g., from one or more of the data sources 104 shown in
Once the input data record 212 including the actual parameters (x1, x2, . . . , xn) is received by the trained model object 204 as input, the trained model object 204 may evaluate the input data record 212 based on the actual parameters. Based on the evaluation, the trained model object 204 may determine and output a score 214 that represents a likelihood that the given individual will fulfill one or more requirements associated with the service. For example, the output score 214 may represent a likelihood (e.g., a value between 0 and 1) that the given individual will default on a loan if the loan is extended to the given individual. As further shown in
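For purposes of illustration only, the following minimal sketch shows this scoring-and-thresholding flow; the feature ordering, the 0.2 threshold, and the sklearn-style predict_proba interface are assumptions made for the example rather than details of the trained model object 204.

```python
# Minimal sketch of scoring an input data record and applying a decision
# threshold. The feature names, the threshold value, and the model interface
# are illustrative assumptions, not details of the trained model object 204.
from typing import Mapping, Sequence

def decide(model, feature_names: Sequence[str], record: Mapping[str, float],
           threshold: float = 0.2) -> str:
    """Map an input data record's actual parameters to a score and a decision."""
    actual_parameters = [record[name] for name in feature_names]   # x1, x2, ..., xn
    score = model.predict_proba([actual_parameters])[0][1]         # e.g., likelihood of default
    return "deny" if score >= threshold else "approve"
```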
There are various advantages to using a data science model comprising a trained model object (e.g., a machine-learning model) over other forms of data analytics that may be available. As compared to human analysis, data science models can drastically reduce the time it takes to make decisions. In addition, data science models can evaluate much larger datasets (e.g., with far more parameters) while simultaneously expanding the scope and depth of the information that can be practically evaluated when making decisions, which leads to better-informed decisions. Another advantage of data science models over human analysis is the ability of data science models to reach decisions in a more objective, reliable, and repeatable way, which may include avoiding any bias that could otherwise be introduced (whether intentionally or subconsciously) by humans that are involved in the decision-making process, among other possibilities.
Data science models may also provide certain advantages over alternate forms of machine-implemented data analytics like rule-based models (e.g., models based on user-defined rules). For instance, unlike most rule-based models, data science models are created through a data-driven process that involves analyzing and learning from the historical data, and as a result, data science models are capable of deriving certain types of insights from data that are simply not possible with rule-based models—including insights that are based on data-driven predictions of outcomes, behaviors, trends, or the like, as well as other insights that could not be revealed without a deep understanding of complex interrelationships between multiple different data variables. Further, unlike most rule-based models, data science models are capable of being updated and improved over time through a data-driven process that re-evaluates model performance based on newly available data and then adjusts the data science models accordingly. Further yet, data science models may be capable of deriving certain types of insights (e.g., complex insights) in a quicker and/or more efficient manner than other forms of data analytics such as rule-based models. Depending on the nature of the available data and the types of insights that are desired, data science models may provide other advantages over alternate forms of data analytics as well.
When using a data science model comprising a trained model object (e.g., a machine-learning model), it may be desirable to quantify or otherwise evaluate the extent to which different parameters influence or contribute to the model object's output. This type of analysis of the contribution (sometimes also referred to as attribution) of the parameters to a model's output may take various forms.
For instance, it may be desirable in some situations to determine which parameters contribute most heavily to a decision made based on a model object's output on a prediction-by-prediction basis. Additionally, or alternatively, it may be desirable in some situations to determine which parameters contribute most heavily, on average, to the decisions made based on a model object's output over some representative timeframe.
As one example, and referring to the discussion of
As another example, an entity that manages industrial assets may want to identify the parameters that contributed most to a failure prediction for a given asset. For instance, if a contribution value for a parameter corresponding to particular sensor data or actuator data gathered from the industrial asset is greater than the contribution values of other parameters, a reason for the predicted failure might be readily inferred. This information, in turn, may then help guide the remedial action that may be taken to avoid or fix the problem before the failure occurs in the given asset and/or in other similarly situated assets. If a temperature reading (e.g., an actual parameter that maps to a formal parameter used by the trained model object to represent temperature) from a temperature sensor attached to a polyvinyl chloride (PVC) pipe has a contribution value that greatly exceeds the contribution values of other parameters used by a trained model object, technicians might readily conclude that the predicted failure of the PVC pipe is due to an ambient temperature that approaches or exceeds an upper-bound operating temperature for PVC (e.g., 140 degrees Fahrenheit).
As yet another example, a medical entity that uses data science models to predict the likelihood of disease or other medical conditions for individuals may want to identify the parameters that contributed most to the model's output score for a given individual. This information may then be used to make judgments about the treatments for the individual that may be effective to reduce the likelihood of the disease or medical condition.
Another situation where it may be desirable to analyze the contribution of the parameters used by a model object to the model's output is to determine which parameters contribute most heavily to a bias exhibited by the model object. At a high level, this may generally involve (i) using the model object to score input datasets for two different subpopulations of people (e.g., majority vs. minority subpopulations), (ii) quantifying (e.g., averaging) the contributions of the input variables to the scores for the two different subpopulations, and (iii) using the contribution values for the two different subpopulations to quantify the bias contribution of the variables.
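As a rough illustration of this three-step flow (and not the method of the application incorporated by reference below), the following sketch assumes a hypothetical explainer object whose contributions(record) method returns a vector of per-feature contribution values:

```python
import numpy as np

def bias_contributions(explainer, records_subpop_a, records_subpop_b):
    """Illustrative sketch: average the per-feature contribution values over two
    subpopulations and compare them. The explainer.contributions(record)
    interface is a hypothetical placeholder for any per-record attribution method."""
    mean_a = np.mean([explainer.contributions(r) for r in records_subpop_a], axis=0)
    mean_b = np.mean([explainer.contributions(r) for r in records_subpop_b], axis=0)
    # Features with the largest gaps between the subpopulation averages are the
    # candidates for contributing most heavily to the bias exhibited by the model.
    return mean_a - mean_b
```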
Further details regarding these and other techniques for determining which input variable(s) contribute most heavily to a bias exhibited by a model object can be found in U.S. patent application Ser. No. 17/900,753, which was filed on Aug. 31, 2022, is entitled “COMPUTING SYSTEM AND METHOD FOR CREATING A DATA SCIENCE MODEL HAVING REDUCED BIAS,” and is incorporated herein by reference in its entirety.
Note that this type of analysis may not be trivial. Depending on the complexity or structure of the model object, the contribution or influence of a formal parameter might not be constant across different values of actual parameters that map to that same formal parameter. For example, suppose that a first input data record includes “30,000” as an actual parameter that maps to a formal parameter representing annual salary and “815” as an actual parameter that maps to a formal parameter representing credit rating. Also suppose that a second input data record includes “200,000” as an actual parameter that maps to the formal parameter representing annual salary and “430” as an actual parameter that maps to the formal parameter representing credit rating. Also suppose that the model object outputs scores for both the first input data record and the second input data record that do not satisfy a threshold condition for loan approval. The score for the first input data record may have been influenced primarily by the annual salary parameter, while the score for the second input data record may have been influenced primarily by the credit rating parameter. Thus, the influence of a particular formal parameter on a score may vary based both on the corresponding actual parameter and on the actual parameters that correspond to other formal parameters. As the number of formal parameters the model object uses increases, the complexity of determining the contributions of individual parameters may increase exponentially.
Several techniques have been developed for quantifying the contribution of a trained model object's parameters. These techniques, which are sometimes referred to as "interpretability" techniques or "explainer" techniques, may take various forms. As one example, Local Interpretable Model-agnostic Explanations (LIME) fits a surrogate linear function in a simplified space and uses that linear function to explain the output. Another example technique is Partial Dependence Plots (PDP), which utilizes the model object directly to generate plots that show the impact of a subset of the parameters in the overall input data record (also referred to as the "predictor vector") on the output of the model object. PDP is similar to another technique known as Individual Conditional Expectation (ICE) plots, except that an ICE plot is generated by varying the value of a single actual parameter in a specific input data record while holding the values of the other actual parameters constant, whereas a PDP plot is generated by varying a subset of the parameters after the complementary set of parameters has been averaged out. Another technique, known as Accumulated Local Effects (ALE), takes PDP a step further by partitioning the predictor vector space and then averaging the changes in the predictions within each region rather than over the individual parameters.
Yet another explainer technique is based on the game-theoretic concept of the Shapley value described in Shapley, "A Value for n-Person Games," in Kuhn and Tucker, CONTRIBUTIONS TO THE THEORY OF GAMES II, Princeton University Press, Princeton, 307-317 (1953), available at https://doi.org/10.1515/9781400881970-018, which is incorporated by reference herein in its entirety. Given a cooperative game with n players and a set function v that acts on subsets of N := {1, 2, . . . , n} and satisfies v(Ø) = 0, the Shapley value assigns to each player i∈N a contribution ϕi[v] to the total payoff v(N), given by

$$\phi_i[v] = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n - |S| - 1)!}{n!}\,\bigl(v(S \cup \{i\}) - v(S)\bigr),$$

which is obtained by considering the possible combinations of a player i and the rest of the players.
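For concreteness, the following brute-force sketch evaluates this formula directly for a small cooperative game; it enumerates every coalition and is intended only to illustrate the definition, not the techniques disclosed herein.

```python
from itertools import combinations
from math import factorial

def shapley_value(v, n, i):
    """Exact Shapley value of player i for a game v defined on subsets of
    N = {0, ..., n-1}, computed directly from the formula above (O(2^n))."""
    players = [j for j in range(n) if j != i]
    total = 0.0
    for size in range(n):
        weight = factorial(size) * factorial(n - size - 1) / factorial(n)
        for S in combinations(players, size):
            total += weight * (v(frozenset(S) | {i}) - v(frozenset(S)))
    return total

# Example: a 3-player game in which players 0 and 1 are jointly worth 1 only
# when both are present; player 2 contributes nothing.
v = lambda S: 1.0 if {0, 1} <= set(S) else 0.0
print([round(shapley_value(v, 3, i), 3) for i in range(3)])  # [0.5, 0.5, 0.0]
```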
In the machine learning setting, the features (e.g., formal parameters) X = (X1, X2, . . . , Xn) are viewed as n players with an appropriately designed game v(S; x, X, f), where x is an observation (e.g., an actual parameter; a predictor sample from the training dataset of features DX), X is a random vector of features, f corresponds to the model object, and S⊆N. The choice of the game is crucial for a game-theoretic explainer (see Miroshnikov et al. (2024), which is cited below); it determines the meaning of the attribution (explanation) value. Two notable games in the ML literature are the conditional and marginal games given by

$$v^{CE}(S; x, X, f) = \mathbb{E}\bigl[f(X) \mid X_S = x_S\bigr], \qquad v^{ME}(S; x, X, f) = \mathbb{E}\bigl[f(x_S, X_{S^c})\bigr],$$

introduced in Lundberg and Lee (2017). Shapley values of the conditional game (i.e., conditional Shapley values) explain predictions f(X) viewed as a random variable, while Shapley values of the marginal game (i.e., marginal Shapley values) explain the (mechanistic) transformations occurring in the model f(x).
In practice, conditional or marginal games are typically replaced with their empirical analogs that utilize data samples. Computing conditional game values is generally infeasible when the predictor dimension (i.e., the number of formal parameters) is large; this might be considered the curse of dimensionality. The marginal game, however, is often approximated with the empirical marginal game

$$\hat{v}^{ME}(S; x, \bar{D}_X, f) = \frac{1}{|\bar{D}_X|} \sum_{\tilde{x} \in \bar{D}_X} f(x_S, \tilde{x}_{S^c}),$$

where D̄X is a background dataset of samples of the features X (e.g., drawn from the training data).
The marginal Shapley value ϕi[vME] of the feature indexed by the subscript i at x, that is, the Shapley value for the game vME(S; x, X, f), takes into account the set of possible combinations between a feature of interest (e.g., the parameter whose contribution is to be determined) and the rest of the features in the input vector and produces a score (e.g., a scalar value) that represents the contribution of that feature to the deviation of the model prediction for the specific instance of the input vector (e.g., the actual parameters x1, x2, . . . , xn) from the model's average prediction. The empirical marginal Shapley value ϕi[v̂ME] is the statistical approximant of ϕi[vME], and computing it directly has a complexity of the order O(2^n·|D̄X|).
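To make the empirical marginal game concrete, the sketch below builds v̂ME(S; x, D̄X, f) as a callable that averages the model over hybrid records; the array-based model interface is an assumption for the example. Passing this game to the brute-force Shapley sketch above, once per feature, is precisely what incurs the exponential cost noted in the preceding paragraph.

```python
import numpy as np

def empirical_marginal_game(f, x, background):
    """Return the empirical marginal game v̂ME(S; x, D̄X, f) as a callable.
    `f` is assumed to accept a 2-D array of records and return one score per
    row; `background` plays the role of the background dataset D̄X."""
    x = np.asarray(x, dtype=float)
    background = np.asarray(background, dtype=float)

    def v(S):
        hybrids = background.copy()
        idx = list(S)
        hybrids[:, idx] = x[idx]            # keep the coalition S at the observed values
        return float(np.mean(f(hybrids)))   # average the model over the background samples
    return v
```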
In the remaining parts of the document, the term "Shapley values" (or "marginal Shapley values") refers to the Shapley values ϕi[vME], i = 1, 2, . . . , n, of the marginal game. The Shapley values are denoted by ϕiME or ϕiME(x), where the information on the model f and the random variable X is suppressed.
Marginal Shapley values, as discussed herein, generate individual contributions of predictor values. It will be appreciated that the marginal Shapley value often cannot be computed exactly because it presupposes knowledge of the distribution of X. While the evaluation of the empirical marginal game v̂ME(S; x, D̄X, f) for a given subset S is straightforward, computing the empirical marginal Shapley values from it remains computationally expensive because the game must be evaluated over an exponential number of feature subsets.
One practical implementation of using Shapley values to quantify variable contributions is an algorithm referred to as KernelSHAP, described in S. M. Lundberg and S.-I. Lee, "A unified approach to interpreting model predictions," 31st Conference on Neural Information Processing Systems (2017), which is incorporated by reference herein in its entirety. KernelSHAP is utilized to compute the marginal Shapley value for each input variable. The KernelSHAP method approximates Shapley values for the marginal game (in view of the assumption of feature independence made by the authors) via a weighted least-squares problem, and it is still very expensive computationally when the number of predictors is large.
Another algorithm, called TreeSHAP, described in Lundberg et al., "Consistent individualized feature attribution for tree ensembles," ArXiv, arxiv:1802.03888 (2019), which is incorporated by reference herein in its entirety, is utilized to compute the Shapley value of a specially designed tree-based game which mimics the conditioning of the model by utilizing the tree-based model structure. The (path-dependent) TreeSHAP algorithm is a fast method in which the training data does not have to be retained to determine contribution values, but in general it produces neither marginal nor conditional Shapley values (nor their approximants) when dependencies between predictors exist. Furthermore, the contribution values it produces can vary based on implementation details. In terms of complexity, the path-dependent algorithm runs in O(T·L·log²(L)) time, where T is the number of trees comprising the model and L is the upper-bound number of leaves. To obtain marginal Shapley values, an adaptation of the TreeSHAP algorithm called Interventional TreeSHAP was proposed, described in Lundberg et al., "From local explanations to global understanding with explainable AI for trees," Nature Machine Intelligence 2, 56-67 (2020), which is incorporated herein by reference in its entirety. It is not as fast as the path-dependent version of the algorithm because it averages over a background dataset.
KernelSHAP (which is model agnostic) is relatively slow due to computational complexity, so it is limited in its application when the number of features is large. Furthermore, KernelSHAP assumes independence between features. On the other hand, TreeSHAP is limited because its path-dependent version produces attributions (e.g., contribution values) that may not be conditional Shapley values and its interventional version requires a background dataset to be used.
In general, a marginal Shapley value may represent, for a given input data record x that was scored by a trained model object ƒ(x), a value (e.g., an "explanation" value or a contribution value) for each parameter that indicates the parameter's contribution to the model's output score for the given input data record. For example, if a trained model object's output is a regressor score (i.e., a probability value between 0 and 1), a marginal Shapley value may be expressed as a number between −1 and 1, with a positive value indicating a positive contribution to the output and a negative value indicating a negative contribution to the output. Further, the magnitude of the marginal Shapley value may indicate the relative strength of its contribution.
In this regard, it will be understood that a marginal Shapley value for a given parameter should be interpreted in view of how the data science model defines its output. Returning to the example discussed in
One of the drawbacks of the explainer techniques discussed above is that they fail to account for dependencies between input variables (this is relevant to both KernelSHAP and TreeSHAP). KernelSHAP generally treats input variables as independent from each other (which is often not the case in practice). TreeSHAP relies on the structure of the regression trees that make up the model and its path-dependent version only partially respects dependencies.
To address these and other shortcomings with the techniques discussed above, disclosed herein is a new approach that facilitates rapid computation and retrieval of contribution values for features used by model objects that satisfy several strategic constraints. Specifically, this approach exploits advantages that can be gained by creating an ensemble of decision trees whose structures satisfy specific structural constraints that are described herein.
When the decision trees in the ensemble satisfy these structural constraints (e.g., the decision trees are oblivious), the formula to determine marginal Shapley values for features used by a decision tree can be simplified to obtain a formula of lower computational complexity. When this simplified formula is leveraged in the context of a computing system, the computational efficiency of that system is increased such that the amount of computing resources (e.g., processor cores or memory) used to accomplish a task in a target amount of time can be greatly reduced. For example, suppose an ensemble of decision trees is used to classify a given input data record.
Further suppose Shapley values are desired for the features on which decision trees in the ensemble split, so that the reasons why the ensemble assigned a particular output class to the input data record will be more apparent. If no precomputations (which will be described in greater detail below) have been performed beforehand, methods described herein can be used to compute the Shapley values for the features with a computational complexity of log(L)·L^1.6 (for a fixed observation), where L denotes the number of leaves in the ensemble of decision trees included in a data science model. While a computational complexity of L^1.6 constitutes an advantage over the techniques mentioned above for computing Shapley values, even greater advantages can be gained by performing precomputations as described below.
Regarding these precomputations, as will be explained in the examples below, the set of contribution values (e.g., marginal Shapley values, Owen values, etc.) for the features used by a decision tree that satisfies the aforementioned structural constraints is constant across input data records that land in the same leaf. As a result, leaves can be mapped to sets of contribution values (rather than mapping individual input data records to contribution values on a case-by-case basis) such that the set of Shapley values for an input data record can be inferred directly from the leaf in which the input data record falls. Since leaves can be mapped to contribution values, the set of contribution values to which a leaf maps can be determined via precomputation beforehand and stored in a data structure (e.g., a lookup table) that maps leaves to sets of contribution values for the features on which a decision tree splits. The method of computational complexity L^1.6 mentioned above can therefore be used to determine the contribution values to which each leaf in each decision tree in an ensemble maps before any input records are classified. The complexity of precomputing the contribution values across each leaf in the ensemble is the number of leaves L multiplied by the complexity L^1.6 of determining the contribution values for a single leaf. Therefore, the complexity of precomputing the contribution values across each leaf in the ensemble is L·L^1.6 = L^2.6. In practice, for a single tree, the precomputation of the contribution values for the leaves in the tree can be completed in less than one second. Collectively, for multiple trees included in an ensemble, if the depth of the trees in the ensemble does not exceed fifteen, the number of trees in the ensemble does not exceed one thousand, and sufficient processors and memory are engaged, the collective precomputation of the contribution values for the leaves in the ensemble can be completed in a matter of minutes. If the depth of the trees is less than fifteen (e.g., nine) and the number of trees in the ensemble is less than one thousand (e.g., six hundred fifty), the collective precomputation of the contribution values for the leaves in the ensemble can be completed in a few minutes (e.g., 182 seconds without threading or 45 seconds with thirty-two threads).
Once the precomputation has been completed and the results have been stored in a data structure such as a lookup table, the set of contribution values for the features which an ensemble uses to classify an input data record can be determined with logarithmic complexity rather than exponential complexity. This is because identifying the leaves of the trees in the ensemble into which the input data record lands is an operation of logarithmic complexity. Specifically, for each respective decision tree in the ensemble, identifying the leaf into which the input data record lands amounts to traversing a path through the respective decision tree from the root to a leaf. The respective decision tree is binary, so finding the leaf into which the input data record lands for the respective tree is O(log(L)) (where L is the number of leaves in the respective tree). There are T decision trees in the ensemble and the input data record will land in a respective leaf in each of those trees, so identifying the leaves in the ensemble into which the input data record falls is O(T·log(L)). Once the leaves in which the input data record lands are known, the contribution values to which those leaves map can be retrieved from the data structure (lookup table) via an O(1) lookup operation for each tree in the ensemble. Given the additive property of certain types of contribution values (e.g., marginal Shapley values), the contribution values for the ensemble as a whole can be readily computed by summing the contribution values for the individual decision trees. In practice, this results in a system that greatly reduces the latency involved in determining contribution values. Specifically, the time of computation for the contribution values for the ensemble as a whole (e.g., for an instance defined by an input data record that represents an individual) is about 0.0001 seconds. Thus, sets of contribution values for ten thousand individuals can be determined in one second.
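A minimal sketch of this deployment-time flow is shown below; the per-tree leaf_index(record) traversal and the (tree index, leaf index) key format of the lookup table are assumptions made for the example.

```python
import numpy as np

def ensemble_contributions(trees, lookup_table, record):
    """Sketch of retrieving precomputed contribution values at scoring time.
    Each tree is assumed to expose a leaf_index(record) method (an O(log L)
    root-to-leaf descent), and lookup_table maps (tree index, leaf index) to a
    precomputed vector of per-feature contribution values."""
    total = None
    for t, tree in enumerate(trees):
        leaf = tree.leaf_index(record)                    # O(log L) descent per tree
        contribs = np.asarray(lookup_table[(t, leaf)])    # O(1) lookup of precomputed values
        total = contribs if total is None else total + contribs
    # Additivity of marginal Shapley values allows the per-tree vectors to be
    # summed to obtain the contribution values for the ensemble as a whole.
    return total
```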
Furthermore, the data needed to perform the methods described herein is contained in the decision trees themselves. As a result, the contribution values can be computed without access to the training dataset that was used to train the ensemble. This provides another advantage over existing approaches (e.g., Interventional TreeSHAP) that involve accessing training data to calculate game values because memory usage is greatly reduced in cases where the training dataset is large (a common occurrence in many industries, since larger training datasets tend to yield better machine-learning models). The processes and systems described herein can therefore be deployed in computing environments that might lack sufficient memory to store a complete training dataset. The processes and systems described herein thus empower such computing environments to perform tasks that those computing environments would not be able to perform if previous approaches were to be used.
The Categorical Boosting (CatBoost) algorithm (which is familiar to those of ordinary skill in the art) uses gradient boosting to produce an ensemble of decision trees that meet the constraints discussed above. CatBoost can be used without modification in conjunction with the processes disclosed herein. The ensembles produced by CatBoost achieve levels of prediction accuracy comparable to those of other types of machine-learning models (e.g., neural networks) that, although capable of achieving high levels of prediction accuracy, do not lend themselves to having those predictions explained in terms of how much each feature influenced any particular prediction. In addition, the running time for CatBoost is generally less than the running time for other machine-learning algorithms (e.g., XGBoost) that can achieve comparable levels of prediction accuracy. There are some types of machine-learning models (e.g., explainable boosting machines and explainable neural networks) that do lend themselves to having their predictions explained, but those models typically fail to achieve the levels of prediction accuracy of their non-explainable counterparts. When implemented as part of the systems and processes described herein, CatBoost can offer the best of both worlds by achieving high prediction accuracy while also providing the option to obtain explanations for individual predictions via the simplified formula and the other techniques described herein.
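By way of example only, the following sketch trains a small CatBoost ensemble of symmetric (oblivious) trees on synthetic data; the depth, iteration count, and synthetic dataset are illustrative choices rather than recommended settings.

```python
import numpy as np
from catboost import CatBoostClassifier

# Synthetic stand-in data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

model = CatBoostClassifier(
    depth=6,                      # symmetric trees of modest depth
    iterations=200,               # number of trees in the ensemble
    grow_policy="SymmetricTree",  # CatBoost's default oblivious-tree growth
    verbose=False,
)
model.fit(X, y)
scores = model.predict_proba(X[:5])[:, 1]  # regressor-style scores between 0 and 1
```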
Turning to
Prior to commencement of the example process 301, a model object for a data science model that is to be deployed by an entity for use in making a particular type of decision may be trained. In general, this model object may comprise any model object that is configured to (i) receive an input data record comprising a set of actual parameters that are related to a respective individual (e.g., person) and map to a particular set of formal parameters (which may also be referred to as the model object's "features" or the model object's "predictors"), (ii) evaluate the received input data record, and (iii) based on the evaluation, output a score that is then used to make the given type of decision with respect to the respective individual. Further, the model object that is trained may take any of various forms, which may depend on the particular data science model that is to be deployed.
For instance, as one possibility, the model object may comprise a model object for a data science model to be utilized by an entity to decide whether or not to extend a particular type of service (e.g., a loan, a credit card account, a bank account, or the like) to a respective individual within a population. In this respect, the set of formal parameters for the model object may comprise data variables that are predictive of whether or not the entity should extend the particular type of service to a respective individual (e.g., variables that provide information related to credit score, credit history, loan history, work history, income, debt, assets, etc.), and the score may indicate a likelihood that the entity should extend the particular type of service to the respective individual, which may then be compared to a threshold value in order to reach a decision of whether or not to extend the particular type of service to the respective individual.
The function of training the model object may also take any of various forms, and in at least some implementations, may involve applying a machine-learning process to a training dataset that is relevant to the particular type of decision to be rendered by the data science model (e.g., a set of historical data records for individuals that are each labeled with an indicator of whether or not a favorable decision should be rendered based on the historical data record). In this respect, the machine-learning process may comprise any of various machine learning techniques, examples of which may include regression techniques, decision-tree techniques, support vector machine (SVM) techniques, Bayesian techniques, ensemble techniques, gradient descent techniques, and/or neural-network techniques, among various other possibilities.
As shown in
As shown in block 322, the example process 301 further includes selecting a realizable leaf in the currently selected decision tree.
As shown in block 324, the example process 301 further includes selecting a feature on which the currently selected decision tree splits.
As shown in block 326, the example process 301 further includes determining a contribution value for the currently selected feature. The contribution value may be determined, for example, using the approach described below with respect to
As shown in block 328, the example process 301 may further include adding the contribution value to a current set of contribution values for the currently selected realizable leaf. If contribution values for each feature on which the currently selected decision tree splits have been determined, the flow of the example process 301 moves to block 330. Otherwise, the flow of the example process 301 moves back to block 324 for the next feature on which the currently selected decision tree splits to be selected.
As shown in block 330, if contribution values for each feature on which the currently selected decision tree splits have been determined, an entry that maps the currently selected realizable leaf to the current set of contribution values is created. If there are entries in the data structure that map each realizable leaf in the currently selected decision tree to a respective set of contribution values, the flow of the example process 301 moves to block 332. Otherwise, the flow of the example process 301 moves to block 322 for the next realizable leaf to be selected.
As shown in block 332, if the realizable leaves in each decision tree in the ensemble have been mapped to contribution values, the example process 301 terminates after storing the contribution values (e.g., in a computer-readable storage medium for future retrieval). Otherwise, the flow of the example process 301 moves back to block 320 so that the next decision tree in the ensemble can be selected. In this manner, the data structure that maps realizable leaves in the ensemble to sets of contribution values can be populated.
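The loop structure of the example process 301 can be summarized with the following sketch; the tree accessors, the contribution_value(tree, leaf, feature) callable, and the dictionary keyed by (tree index, leaf index) are illustrative placeholders for the computations and the data structure described with respect to blocks 320 through 332.

```python
def precompute_contribution_table(ensemble, contribution_value):
    """Sketch of the precomputation loop of example process 301 (blocks 320-332).
    The realizable_leaves() and split_features() accessors, and the
    contribution_value callable, are illustrative placeholders."""
    table = {}
    for t, tree in enumerate(ensemble):                    # block 320: select a decision tree
        for leaf in tree.realizable_leaves():              # block 322: select a realizable leaf
            contributions = {}
            for feature in tree.split_features():          # block 324: select a split feature
                # blocks 326/328: determine the feature's contribution value and
                # add it to the current set for this leaf
                contributions[feature] = contribution_value(tree, leaf, feature)
            table[(t, leaf)] = contributions               # block 330: map leaf to its set
    return table                                           # block 332: store for future retrieval
```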
Turning to
Prior to commencement of the example process 300, a model object for a data science model that is to be deployed by an entity for use in making a particular type of decision may be trained. In general, this model object may comprise any model object that is configured to (i) receive an input data record comprising a set of actual parameters that are related to a respective individual (e.g., person) and map to a particular set of formal parameters (which may also be referred to as the model object's "features" or the model object's "predictors"), (ii) evaluate the received input data record, and (iii) based on the evaluation, output a score that is then used to make the given type of decision with respect to the respective individual. Further, the model object that is trained may take any of various forms, which may depend on the particular data science model that is to be deployed.
For instance, as one possibility, the model object may comprise a model object for a data science model to be utilized by an entity to decide whether or not to extend a particular type of service (e.g., a loan, a credit card account, a bank account, or the like) to a respective individual within a population. In this respect, the set of formal parameters for the model object may comprise data variables that are predictive of whether or not the entity should extend the particular type of service to a respective individual (e.g., variables that provide information related to credit score, credit history, loan history, work history, income, debt, assets, etc.), and the score may indicate a likelihood that the entity should extend the particular type of service to the respective individual, which may then be compared to a threshold value in order to reach a decision of whether or not to extend the particular type of service to the respective individual.
The function of training the model object may also take any of various forms, and in at least some implementations, may involve applying a machine-learning process to a training dataset that is relevant to the particular type of decision to be rendered by the data science model (e.g., a set of historical data records for individuals that are each labeled with an indicator of whether or not a favorable decision should be rendered based on the historical data record). In this respect, the machine-learning process may comprise any of various machine learning techniques, examples of which may include regression techniques, decision-tree techniques, support vector machine (SVM) techniques, Bayesian techniques, ensemble techniques, gradient descent techniques, and/or neural-network techniques, among various other possibilities.
As shown in
As shown in block 304, the example process 300 further includes inputting the group of actual parameters into the trained data science model. The trained data science model comprises an ensemble of decision trees wherein each individual decision tree in the ensemble is symmetric, each individual decision tree in the ensemble is configured to receive a respective subset of the features as input, and, within each individual decision tree, internal nodes that are positioned in a same level designate a same splitting criterion based on a same feature selected from the respective subset of features. The trained data science model may be, for example, a categorical boosting (CatBoost) model.
As shown in block 306, the example process 300 further includes, for each individual decision tree in the ensemble, identifying a respective leaf such that the actual parameters satisfy a series of splitting conditions for edges that connect nodes in a respective path from a root of the individual decision tree to the respective leaf, and accessing a set of respective individual contribution values (e.g., via retrieval from a storage location in a computer-readable medium) for the respective leaf. (In this example, the set of respective individual contribution values was precomputed and stored beforehand via a process such as the example process 301 shown in
In one example, determining the set of respective individual contribution values for the respective leaf comprises a number of actions, such as: identifying each realizable path from the root of the individual decision tree to each realizable leaf in the individual decision tree, respectively; for each identified realizable path, computing a respective first probability by dividing a number of the training data records that were scored during the training based on the identified realizable path by a total number of training data records in the training data; for each identified realizable path, identifying a respective score to be assigned to input data records scored by the identified realizable path; for each level of the individual decision tree, identifying the same feature on which the same splitting criterion specified by the internal nodes at that level is based; identifying subsets of the respective subset of features that the individual decision tree is configured to receive as input; for each identified subset of the respective subset of features, identifying a respective group of realizable paths such that, for each level of the individual decision tree in which the same splitting criterion for that level is based on a feature included in the identified subset, the respective path and the realizable paths in the respective group have a same path direction from that level to a next level of the individual decision tree; for each identified subset of the respective subset of features, computing a sum of the respective first probabilities for each realizable path in the identified subset; and for each identified subset of the respective subset of features, computing a marginal path expectation by multiplying the respective score for the respective path by the sum for the identified subset. This same set of actions can be applied to each leaf in the ensemble. The sets of contribution values generated thereby may be used to populate a data structure with entries that map the leaves in the ensemble of decision trees to the respective sets of contribution values.
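For orientation, the brute-force sketch below computes, for a single symmetric decision tree and a single subset S of its features, the empirical marginal expectation directly from per-leaf training frequencies and leaf scores. It illustrates the quantities that the actions above combine, but it does not reproduce the optimized grouping of realizable paths, and the dictionary-based tree representation is an assumption made for the example.

```python
from itertools import product

def marginal_expectation(tree, x, S):
    """Brute-force sketch for one symmetric (oblivious) tree: estimate
    E[f(x_S, X_Sc)] from per-leaf training frequencies. The tree is assumed to
    be a dict with "level_features", "level_thresholds", "leaf_scores", and
    "leaf_probs", where leaves are keyed by tuples of per-level directions."""
    depth = len(tree["level_features"])
    # Direction the observation x takes at each level (True means "<= threshold").
    x_dirs = [x[f] <= t for f, t in zip(tree["level_features"], tree["level_thresholds"])]
    s_levels = {k for k, f in enumerate(tree["level_features"]) if f in S}

    expectation = 0.0
    for leaf in product([True, False], repeat=depth):        # every leaf is a direction pattern
        if any(leaf[k] != x_dirs[k] for k in s_levels):
            continue                                          # disagrees with x on a feature in S
        # Probability (estimated from training frequencies) that a random record
        # matches this leaf's directions on the levels whose feature is NOT in S.
        prob = sum(p for other, p in tree["leaf_probs"].items()
                   if all(other[k] == leaf[k] for k in range(depth) if k not in s_levels))
        expectation += tree["leaf_scores"].get(leaf, 0.0) * prob
    return expectation
```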
The action of identifying each realizable path from the root of the individual decision tree to each realizable leaf in the individual decision tree, respectively, may involve identifying a selected path to be evaluated for realizability; detecting that a first splitting condition for a first edge in the selected path and a second splitting condition for a second edge in the path contradict each other; and excluding the selected path from a list of realizable paths.
In some examples, the set of respective individual contribution values for the respective leaf may have been computed beforehand and stored in a data structure that maps leaves to respective sets of contribution values. In such examples, determining the set of respective individual contribution values for the respective leaf may involve: receiving an identifier of a leaf selected from a decision tree in the ensemble; and, based on the identifier of the leaf, determining a set of contribution values to which the identifier maps in the data structure. (The determined set of contribution values to which the identifier maps in the data structure is the set of respective individual contribution values.)
As shown in block 308, the example process 300 further includes, for each individual feature in the set of features, computing a respective overall contribution value based on a sum of the respective individual contribution values that map to that individual feature. This may be achieved, for example, by summing the local contribution values for each tree in the ensemble for the individual feature.
As shown in block 310, the example process 300 further includes computing, via the trained data science model, the score for the input data record based on the respective leaves identified.
The example process may further include identifying at least one reason code for the score based on the respective overall contribution values for the individual features in the set of features. Still further, the example process 300 may include transmitting the score and the at least one reason code in response to the request.
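By way of a hedged illustration, the following Python sketch shows one possible way the overall contribution values and reason codes described above might be assembled; the function names, the top-k selection rule, and the example feature names are hypothetical.

```python
def overall_contributions(per_tree_contribs):
    """Sum per-tree contribution values into overall per-feature values.

    `per_tree_contribs` is a list (one entry per tree) of dicts mapping a
    global feature name to that tree's contribution value for the leaf in
    which the input data record landed.
    """
    totals = {}
    for tree_contribs in per_tree_contribs:
        for feature, value in tree_contribs.items():
            totals[feature] = totals.get(feature, 0.0) + value
    return totals

def reason_codes(overall, top_k=3):
    """Return the top_k features ranked by absolute overall contribution."""
    ranked = sorted(overall.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return [feature for feature, _ in ranked[:top_k]]

per_tree = [{"income": 0.4, "age": -0.1}, {"income": 0.2, "utilization": 0.3}]
totals = overall_contributions(per_tree)
print(reason_codes(totals, top_k=2))  # ['income', 'utilization']
```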
Turning to
As will be recognized by persons of ordinary skill in the art, formal parameters refer to variables that act as placeholders within the definition of a function, a subroutine, a procedure (e.g., in procedural programming languages), or any other module of code that (i) has its own local variable scope and (ii) can receive values through a parameter list supplied when the module (e.g., function) is called. The values supplied in the parameter list (e.g., actual parameters, which are sometimes called “arguments”) are used in place of the placeholder variables (e.g., formal parameters) declared in the module definition during execution of the module with the supplied parameter list.
A decision tree is one example of a function in that a decision tree (1) receives values, (2) compares those values to a series of splitting conditions for edges (e.g., arcs or directed edges) that connect nodes in the tree to identify a path from the root node of the tree to a leaf of the tree such that those values satisfy the splitting conditions for edges that connect nodes in a path from the root to a leaf, and (3) returns a label (e.g., a score) associated with the leaf.
As will be recognized by persons of ordinary skill in the art, a decision tree can be represented by a connected acyclic graph in which each node (i.e., vertex) other than the root is the head or target (i.e., terminal vertex) of a single directed edge and each internal node (i.e., a node that is not a leaf node) is the tail (i.e., initial vertex) of at least one directed edge. (In the case of a binary tree, each internal node is the initial vertex of at least one directed edge and no more than two directed edges.) Each directed edge connects a node from an nth level of the tree to a node in the (n+1)th level in the tree, where n is a non-negative integer. (For reference, in accordance with nomenclature conventions known to those of skill in the art, the root of a decision tree is considered to be positioned in the first level of that decision tree.) The root of a decision tree is a source (i.e., a node with an in-degree of zero); each leaf in a decision tree is a sink (i.e., a node with an out-degree of zero).
With regard to nomenclature for binary trees that will be familiar to those of skill in the art, the decision tree 400 is a “full” binary tree because each node in the decision tree 400 is an initial vertex of zero or two edges. As will be recognized by those of skill in the art, the “depth” of a given node is the number of edges in the path from the root node to the given node (thus, the depth of a root node is zero). The height of a binary tree is the depth of the leaf in the binary tree that is farthest from the root node. The decision tree 400 is not “balanced” because the height of the left subtree of the root 401 differs from the height of the right subtree of the root 401 by more than one level. Furthermore, the decision tree 400 is not “complete” because some levels of the decision tree 400 other than the last level (which is the fifth level in this example) are not filled. Also, the decision tree 400 is not a “perfect” binary tree. A “perfect” binary tree is a special type of binary tree in which each leaf is at the same level (i.e., depth), and each internal node has two children. However, as shown in
For the purposes of
Since the decision tree 400 is configured to receive two formal parameters as input, the decision tree 400 is a function of two variables. The domain (i.e., the set of possible input values for which the function is defined) of the decision tree 400 can, therefore, be represented intuitively in two dimensions by the grid 450. The range (i.e., set of possible output values that the function can output) of the decision tree 400 is indicated by the regions 451a-f into which the grid 450 is divided.
The vertical axis 452a depicts a set of potential values ranging from zero to three that the actual parameter x2 may specify for the formal parameter X2. Similarly, the horizontal axis 452b depicts a set of potential values from zero to four that the actual parameter x1 may specify for the formal parameter X1. Note, however, that these sets of potential values have not been selected for this example to imply that any upper bounds or lower bounds exist on the possible values that may be specified for the formal parameters (X1, X2); the output for the decision tree 400 is still defined for (i) values of x1 that are less than zero or greater than four and for (ii) values of x2 that are less than zero or greater than three. Rather, these sets of potential values have been selected for illustrative purposes so that the portion of the domain of the decision tree 400 depicted by the grid 450 is large enough to include a region of the tree that maps to each of the leaves 430a-f, respectively. Each of the regions 451a-f maps to a respective one of the leaves 430a-f (as indicated by the respectively matching fill patterns of 451a-f and 430a-f) for reasons that will be explained in greater detail below.
Consider, for example, the region 451a. The region 451a represents cases in which x1 is a value between zero and one, inclusive, and x2 is also a value between zero and one, inclusive. If the decision tree 400 is evaluated against a set of actual parameters (x1, x2) that satisfy these constraints, the decision tree 400 will return the score that is associated with the leaf 430a. This can be verified in this example by beginning at the root 401 of the decision tree 400 and comparing the actual parameters (x1, x2) to the splitting criterion for the root 401. The splitting criterion for the root 401 is expressed by the splitting conditions for the edges 420a-b because these are the two edges for which the root 401 is the initial vertex. In this example, the splitting criterion for the root 401 designates a threshold (the number one, in this case).
As shown, the splitting conditions for the edges 420a-b are mutually antithetical. In other words, if the splitting condition for the edge 420a (i.e., X1≤1) is satisfied, the splitting condition for the edge 420b (i.e., X1>1) is not satisfied. Conversely, if the splitting condition for the edge 420b is satisfied, the splitting condition for the edge 420a is not satisfied. Stated more generally, in this example, the splitting condition for the edge 420a is that X1 does not exceed the threshold designated by the splitting criterion for the root 401 and the splitting condition for the edge 420b is that X1 exceeds the threshold. In this example, since the actual parameter x1 (which maps to the formal parameter X1) is a value selected from the region 451a, x1 is less than or equal to one. The path through the decision tree 400 therefore proceeds from the root 401 (which is positioned in the first level of the decision tree 400) to the internal node 403a (which is positioned in the second level of the decision tree 400) via the edge 420a.
Next, the actual parameters (x1, x2) are compared to the splitting criterion for the internal node 403a. The splitting criterion for the internal node 403a is expressed by the splitting conditions for the edges 420c-d because these are the two edges for which the internal node 403a is the initial vertex. Since the actual parameter x2 (which maps to the formal parameter X2) is a value selected from the region 451a, x2 is less than or equal to one. Therefore, the splitting condition for the edge 420c (i.e., X2≤1) is satisfied and the splitting condition for the edge 420d (i.e., X2>1) is not satisfied. As a result, the path through the decision tree 400 proceeds from the internal node 403a (which is positioned at the second level of the decision tree 400) to the leaf 430a (which is positioned in the third level of the decision tree 400) via the edge 420c. The score associated with leaf 430a will therefore be returned when the decision tree 400 is evaluated against a set of actual parameters selected from the region 451a. For this reason, the region 451a is said to map to the leaf 430a. In other words, when the decision tree 400 is evaluated against a set of actual parameters selected from the region 451a, an input data record that comprises this set of actual parameters will “land in” the leaf 430a.
A similar walkthrough can be done for sets of actual parameters selected from each of the regions 451b-f to verify that the region 451b maps to the leaf 430b, the region 451c maps to the leaf 430c, the region 451d maps to the leaf 430d, the region 451e maps to the leaf 430e, and the region 451f maps to the leaf 430f.
The relationship between the grid 450 and the leaves 430a-f as described above has at least two implications. First, two input data records whose actual parameters are selected from a same region in the grid 450 will “land in” the same leaf—namely, the leaf to which that region maps—and will both be assigned the score associated with that leaf. Second, each threshold designated by a splitting criterion for a node in the decision tree 400 will mark a border between at least two regions in the grid 450 along the dimension (e.g., formal parameter) to which the threshold applies. For example, the splitting criterion for the root 401 designates the number one as a threshold for X1. As shown in the grid 450, the number one along the horizontal axis (which represents the set of potential values for X1) marks a solid vertical line that separates the region 451a from the region 451c, the region 451b from the region 451c, and the region 451b from the region 451d. This vertical border, which is established by a splitting criterion that applies to X1, extends across the full height of the grid 450. In other words, regardless of the value selected for X2, the line x1=1 marks a border between regions. Thus, the status of the solid vertical line x1=1 as a border is independent of the value selected for X2. For similar reasons, the solid vertical line x1=3 marks a vertical border across the full height of the grid 450 regardless of the value selected for X2.
By contrast, the splitting criterion for the internal node 403a designates the number one as a threshold for X2. As shown in the grid 450, the number one along the vertical axis (which represents the set of potential values for X2) marks a horizontal border that separates the region 451a from the region 451b. However, unlike the solid vertical line x1=1, the solid portion of the horizontal line at x2=1 does not extend across the full width of the grid 450. Specifically, for values of X1 greater than one, the dashed portion of the horizontal line x2=1 does not mark a border between regions. Thus, the status of the horizontal line x2=1 as a border (i.e., whether it is a solid line or a dashed line) is not independent of the value selected for X1. Similarly, the horizontal line x2=2 and the vertical line x1=2 mark borders that do not fully traverse the grid 450.
This dependence relationship between (i) the status of a threshold designated by a splitting criterion found in the decision tree 400 as a border along the dimension to which the threshold applies and (ii) the value selected for a formal parameter to which the threshold does not apply results from certain structural characteristics of the decision tree 400. First, the leaves 430a-f are distributed across more than one level of the decision tree 400. For example, the leaf 430a, the leaf 430b, and the leaf 430f are positioned in the third level, while the leaf 430c is positioned in the fourth level and the leaves 430d-e are positioned in the fifth level of the decision tree 400. Second, although the internal node 403a and the internal node 403b are both positioned in the second level of the decision tree 400, the splitting criterion for the internal node 403a and the splitting criterion for the internal node 403b apply to different formal parameters (X2 and X1, respectively). Third, the splitting criterion for the internal node 403a and the splitting criterion for the internal node 403b designate different thresholds (one and three, respectively).
If the decision tree 400 is intended to be used to compute scores alone, the structural characteristics of the decision tree 400 that result in the dependence mentioned above might be of little concern. However, if contribution values for the parameters used by the decision tree 400 are desired in addition to the score that the decision tree 400 computes for an input data record, these structural characteristics pose a problem.
To illustrate this problem, consider the following example. Suppose a first input data record includes actual parameters selected from the region 451c shown in the grid 450. Specifically, suppose that the actual parameter x1 is greater than one, but less than or equal to two. Also suppose that the actual parameter x2 is greater than one, but less than or equal to two. Since the region 451c maps to the leaf 430c, the decision tree 400 will return the score associated with the leaf 430c for the first input data record.
Further suppose that a second input data record also includes actual parameters selected from the region 451c. However, for the second input data record, suppose that the actual parameter x1 is greater than two, but less than three. In addition, for the second input data record, suppose that x2 is greater than or equal to zero, but less than one. Again, since the region 451c maps to the leaf 430c, the decision tree 400 will return the score associated with the leaf 430c for the second input data record.
Although the first input data record and the second input data record both land in the leaf 430c, they map to subregions of the region 451c (e.g., as shown by the dashed lines that cross the region 451c) that would have been divided by a solid vertical border (marked by the line x1=1) and by a horizontal border (marked by the line x2=1) but for the dependence relationship explained above. In cases where two input data records (i) land in the same leaf of a decision tree, yet (ii) map to different subregions of a grid region that maps to the leaf, as discussed above, the contribution values (e.g., game values such as Shapley values and Owen values) for the formal parameters used by the tree will generally not be equal for the two input data records. In other words, although the two input data records land in the same leaf and will be assigned the same score by the decision tree, the two input data records will not have the same contribution values for their respective features. A formal proof of this principle has been provided in Filom et al., “On marginal feature attributions of tree-based models,” digital object identifier (doi): 10.3934/fods.2024021, which is hereby incorporated by reference in its entirety.
Thus, the structural characteristics of the decision tree 400 that result in the dependence relationship explained above render the decision tree 400 insufficient for determining contribution values without additional extrinsic data (e.g., training data) that is not incorporated into the decision tree 400 itself. The methods available for determining contribution values for the decision tree 400 are computationally intensive and have certain drawbacks for some applications that involve determining contribution values for large numbers of input data records.
Filom et al. (cited above) have demonstrated that the type of problematic dependence relationship described above can be eliminated if several specific constraints, discussed in further detail below, on the structural characteristics of a decision tree are satisfied. Filom et al. (cited above) have further demonstrated that the contribution values will be equivalent for each input data record that lands in the same leaf of a decision tree that satisfies these constraints.
Thus, each leaf in a decision tree that satisfies these constraints (e.g., the decision tree is symmetric) maps to a single respective set of contribution values for the formal parameters (e.g., features) the decision tree is configured to receive as input. As a result, sets of contribution values for features can be determined on a leaf-by-leaf basis rather than on an input-data-record-by-input-data-record basis. Effectively, once the set of contribution values for the features for a single input data record that lands in a leaf is known, the set of contribution values for the features for each other input data record that lands in that leaf is also known. This unexpected principle can be leveraged by storing each computed set of contribution values into a data structure that maps leaves to sets of contribution values (e.g., a lookup table or a hash table). Once the set of contribution values to which a leaf maps has been computed and stored in the data structure, the set of contribution values for an input data record that subsequently lands in the leaf can be retrieved via a rapid lookup operation rather than through an arduous series of calculations.
The speed at which a set of contribution values can be retrieved subsequent to computation is not the only way efficiency can be increased, however. Filom et al. (cited above) have also demonstrated that when the problematic dependence relationships described above with respect to
The increases in efficiency at the computation stage are such that, in many cases, the sets of contribution values to which the leaves of a decision tree map can be exhaustively calculated before the decision tree is deployed for use so that both scores and contribution values can be returned rapidly for input data records immediately upon deployment of the decision tree.
Nevertheless, if an exhaustive determination of the sets of contribution values to which the leaves map is prohibitively costly (e.g., in terms of memory, processor capacity, or other computing resources) or otherwise not desirable prior to deployment, the data structure for retrieval can be populated piecemeal over time (e.g., each time an input data record lands in a leaf in which no previous input data record has landed, the set of contribution values can be computed and an entry that maps the leaf to the set of contribution values can be added to the data structure).
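As a non-limiting illustration of such piecemeal population, the following Python sketch uses a simple memoizing cache keyed by (tree identifier, leaf identifier); the class and function names are hypothetical, and compute_fn stands in for whatever routine computes a leaf's contribution values.

```python
class ContributionCache:
    """Piecemeal cache mapping (tree_id, leaf_id) to a set of contribution values.

    `compute_fn` is a hypothetical stand-in for the routine that computes the
    per-leaf contribution values; it is invoked only the first time an input
    data record lands in a given leaf, and the result is reused thereafter.
    """

    def __init__(self, compute_fn):
        self._compute_fn = compute_fn
        self._table = {}

    def get(self, tree_id, leaf_id):
        key = (tree_id, leaf_id)
        if key not in self._table:              # first record to land in this leaf
            self._table[key] = self._compute_fn(tree_id, leaf_id)
        return self._table[key]                 # subsequent records: fast lookup

# Usage with a stand-in computation:
cache = ContributionCache(lambda t, l: {"X1": 0.1 * l, "X2": -0.05 * l})
print(cache.get(0, 3))   # computed and stored
print(cache.get(0, 3))   # retrieved without recomputation
```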
In light of the advantages described above, it will be illustrative to provide an example in which the specific constraints on the structural characteristics of a decision tree are satisfied such that these advantages can be obtained.
Turning to
With regard to the nomenclature for binary trees that is familiar to those of skill in the art, the decision tree 500 is a “full” binary tree because each node in the decision tree 500 is an initial vertex of zero or two edges. The decision tree 500 is also “balanced” because the height of the left and right subtrees of the root 501 (and the respective left and right subtrees of each of the internal nodes 503a-f) are equivalent. Furthermore, the decision tree 500 is also “complete” because each level of the decision tree 500 is filled. Ultimately, the decision tree 500 is a “perfect” binary tree because the leaves 530a-h are positioned in the same level and each of the internal nodes 503a-f is an initial vertex of two directed edges.
For the purposes of
Like the decision tree 400 shown in
The vertical axis 552a depicts a set of potential values ranging from zero to two that the actual parameter x2 may specify for the formal parameter X2. Similarly, the horizontal axis 552b depicts a set of potential values from zero to three that the actual parameter x1 may specify for the formal parameter X1. Note that these sets of potential values do not imply that any upper bounds or lower bounds exist on the possible values that may be specified for the formal parameters (X1, X2).
The structural characteristics of the decision tree 500 satisfy the constraints mentioned above such that the advantages mentioned above can be achieved. These constraints will be described in turn. First, within any given level of the decision tree 500, each internal node in the given level specifies the same splitting criterion (e.g., designates the same threshold and applies to the same feature) as the other internal nodes in the given level. For example, in the second level of the decision tree 500, the internal node 503a and the internal node 503b both specify the splitting criterion X2≤1. In the third level of the decision tree 500, the internal node 503c, the internal node 503d, the internal node 503e, and the internal node 503f each specify the splitting criterion X1≤2. The fourth level is the last level of the decision tree 500 and contains the leaves 530a-h; there are no internal nodes in the fourth level of the decision tree 500, so there are no criteria to be compared for the fourth level. Of course, there is only one internal node in the first level of the decision tree 500—namely, the root 501—so there are no other nodes in the first level whose criteria can be compared to the criterion specified by the root 501. Since the respective splitting criterion used at each level of the decision tree 500 applies to a single feature, the number of features that the decision tree 500 is configured to receive as input is no greater than the number of levels in the tree. This upper bound on the number of features that may be used by a decision tree of a given depth is helpful for reducing computational complexity. Second, the decision tree 500 is a “perfect” binary tree (i.e., each internal node in the decision tree 500 is an initial vertex of two edges and each leaf in the decision tree 500 is at the same level).
Decision trees that satisfy these two constraints are said to be symmetric (i.e., oblivious). Hence, the decision tree 500 is symmetric. Symmetric decision trees provide the potential for an additional advantage that can be leveraged to increase computational speed in combination with the other advantages discussed herein, as discussed below.
As explained above, the splitting criterion specified in each level of a symmetric decision tree is the same for each node in that level. As a result, each level of the symmetric tree (except the last level, which does not include internal nodes) can be mapped to a single respective threshold and a single respective feature to which that threshold applies.
A first vector of the thresholds to which the levels of the symmetric decision tree map can be generated. The numerical position (e.g., index) of a threshold in the first vector indicates the level of the symmetric decision tree to which that threshold applies. A second vector that identifies the formal parameters to which the thresholds in the first vector apply can also be generated. For example, each entry in the second vector can match the subscript of the formal parameter to which the threshold in the corresponding numerical position in the first vector applies.
When an input data record to be scored by the symmetric decision tree is provided, a third vector can be generated. Each entry in the third vector is the actual parameter (selected from the input data record) that maps to the formal parameter in the corresponding numerical position in the second vector. Once the third vector is generated, a fourth vector that represents the path through the symmetric decision tree from the root to a leaf for the input data record can be generated. The entry for each numerical position in the fourth vector may be a binary value that is determined by comparing the entry at that numerical position in the third vector (which is an actual parameter) to the entry at that numerical position in the first vector (which is a threshold). If the entry in the third vector exceeds the entry in the first vector, the entry in the fourth vector is set to one to signify that the path proceeds through a right edge that proceeds out of a node positioned in the level of the symmetric decision tree that matches the numerical position of the entry. Otherwise, the entry is set to zero to signify that the path proceeds through a left edge that proceeds out of the node positioned in the level of the symmetric decision tree that matches the numerical position of the entry.
Since the splitting criterion for a given level of a symmetric decision tree is the same for each node in that level, the threshold to which a comparison is to be made at any given level is independent of the route of the path through the symmetric decision tree in previous levels. Furthermore, the actual parameter to be compared to the threshold is also independent of the route of the path through the symmetric decision tree in previous levels because the formal parameter to which the threshold applies (and to which the actual parameter maps) is independent of the route of the path through the symmetric decision tree in previous levels. As a result of this independence between the respective splitting criterion for each level and the route of the path through previous levels of the symmetric decision tree, the entries for the fourth vector (which represents the path through the symmetric decision tree for the input data record) can be computed in parallel rather than in series. As a result, the speed to compute the leaf in which the input data record lands can be increased.
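For illustration, the following Python sketch (assuming NumPy is available) shows how the four vectors described above might be used to evaluate a symmetric decision tree with one vectorized comparison per input data record; the variable names and the convention of reading the direction bits as a binary leaf index are assumptions of this sketch rather than requirements of the disclosed technology.

```python
import numpy as np

# Per-level description of a symmetric (oblivious) tree of depth 3:
thresholds = np.array([1.0, 1.0, 2.0])   # first vector: threshold for each level
feature_idx = np.array([0, 1, 0])        # second vector: feature index used at each level

def leaf_index(record, thresholds, feature_idx):
    """Evaluate a symmetric decision tree for one input record.

    `record` is a 1-D array of actual parameters. Because every node in a
    level shares the same splitting criterion, all level comparisons are
    independent and can be evaluated in parallel as one vectorized step.
    """
    actuals = record[feature_idx]                       # third vector: actual parameter per level
    directions = (actuals > thresholds).astype(int)     # fourth vector: 1 = right edge, 0 = left edge
    # Interpreting the direction bits as a binary number identifies the leaf.
    return int(directions @ (2 ** np.arange(len(directions))[::-1]))

print(leaf_index(np.array([0.5, 0.5]), thresholds, feature_idx))  # leaf 0 (all left edges)
print(leaf_index(np.array([2.5, 1.5]), thresholds, feature_idx))  # leaf 7 (all right edges)
```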
Returning to the specific example of the decision tree 500, the relationship between the decision tree 500 and the grid 550 is similar to the relationship between the decision tree 400 of
Note that there are eight leaves (i.e., the leaves 530a-h) in the decision tree 500, but there are six regions in the grid 550. This is because no possible input data record will land in the leaf 530b or in the leaf 530d. The path from the root 501 to the leaf 530b includes both an edge with the splitting condition X1≤1 and an edge with the splitting condition X1>2; there is no possible value for X1 that can satisfy both of these splitting conditions concurrently. Similarly, the path from the root 501 to the leaf 530d includes these contradictory splitting conditions. For this reason, the leaf 530b and the leaf 530d are said to be non-realizable. By contrast, the leaves 530a, c, e-h are said to be realizable because there are combinations of possible values of X1 and X2 that can satisfy the splitting conditions in the respective paths from the root to the leaves 530a, c, e-h. The grid 550 includes a region that maps to each realizable leaf, but does not include any regions that map to non-realizable leaves.
Each threshold designated by a splitting criterion for a node in the decision tree 500 (which is also the splitting criterion for the level in which that node is positioned) marks a border between at least two regions in the grid 550 along the dimension (e.g., formal parameter) to which the threshold applies. For example, the splitting criterion for the root 501 designates the number one as a threshold for X1. As shown in the grid 550, the number one along the horizontal axis (which represents the set of potential values for X1) marks a solid vertical line that separates the region 551a from the region 551e and the region 551c from the region 551g. This vertical border, which is established by a splitting criterion that applies to X1, extends across the full height of the grid 550. In other words, regardless of the value selected for X2, the line x1=1 marks a border between regions. Thus, the status of the solid vertical line x1=1 as a border is independent of the value selected for X2. For similar reasons, the solid vertical line x1=2 marks a vertical border across the full height of the grid 550 regardless of the value selected for X2.
Similarly, the splitting criterion for the internal node 503a designates the number one as a threshold for X2. As shown in the grid 550, the number one along the vertical axis (which represents the set of potential values for X2) marks a solid horizontal line that separates the region 551a from the region 551c. Unlike the example shown in
Thus, in the example shown in
With the examples shown in
Turning to
Suppose the ensemble 600 is a CatBoost model that has been trained against a training dataset. Also suppose that there are a total of M trees in the ensemble 600, where M is a positive integer. Let T1(X), T2(X), . . . , TM(X) denote the trees in the ensemble, where X represents the set of formal parameters (e.g., features, which are stored in a vector in this example) that the ensemble 600 is configured to receive as input, and the subscripts represent indices to identify the individual decision trees within the ensemble 600.
The decision tree 601 is shown as an example of an individual tree. The operations below will be described with respect to the decision tree 601 for the sake of simplicity, but those same operations will be performed for each decision tree in the ensemble 600 during the process of computing contribution values for the features. Persons of skill in the art will understand that at least some of the operations and other actions described below may be performed in orders other than the order provided in this example.
The process may commence by identifying the realizable paths through the decision tree 601 and storing a collective representation of those paths in a matrix. A single path through the tree may be represented by a vector of binary values. In one example, suppose there are n levels in the decision tree 601, where the root 602 is in the first level and the leaves of the decision tree 601 are in the nth level. In this example, the numerical position (e.g., index) of an entry in the vector may be defined as n minus the level of the decision tree 601 to which the entry maps. An entry with a binary value of one at an index j in the vector signifies that the path represented by the vector includes a right edge that points to a node positioned in the (n−j)th level of the decision tree 601. In contrast, an entry with a binary value of zero at the index j in the vector signifies that the path represented by the vector includes a left edge that points to the node positioned in the (n−j)th level of the decision tree 601. Since other vectors described below will also include binary values, a vector that represents a path will be called a path vector. (For example, given a path a, the example equation a=(1,0,0,1,0) would indicate that the path vector (1,0,0,1,0) represents the path a through a binary tree of depth 5.) Each path vector for a realizable path through the decision tree 601 is stored as a row of a matrix of paths that will be called the path matrix.
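As a non-limiting illustration, the following Python sketch enumerates candidate binary path vectors for a symmetric tree and keeps only the realizable ones as rows of a path matrix; the per-level representation (one threshold and one feature index per level) and the bit ordering are simplifying assumptions of this sketch.

```python
import itertools
import numpy as np

def build_path_matrix(thresholds, feature_idx):
    """Enumerate candidate root-to-leaf paths of a symmetric tree as binary
    vectors (1 = right edge, 0 = left edge, one entry per level) and keep
    only the realizable ones, returned as rows of the path matrix.
    """
    depth = len(thresholds)
    realizable = []
    for bits in itertools.product([0, 1], repeat=depth):
        lower, upper = {}, {}
        ok = True
        for level, bit in enumerate(bits):
            f, t = feature_idx[level], thresholds[level]
            if bit:                                    # right edge: X_f > t
                lower[f] = max(lower.get(f, float("-inf")), t)
            else:                                      # left edge: X_f <= t
                upper[f] = min(upper.get(f, float("inf")), t)
            if f in lower and f in upper and not lower[f] < upper[f]:
                ok = False                             # contradictory splitting conditions
                break
        if ok:
            realizable.append(bits)
    return np.array(realizable, dtype=int)

# For a structure like the decision tree 500 (levels split on X1<=1, X2<=1, X1<=2):
print(build_path_matrix([1.0, 1.0, 2.0], [0, 1, 0]))  # six realizable paths
```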
Next, a probability estimate is determined for each realizable leaf in the decision tree 601. Let Ra denote the realizable leaf that is connected to the root 602 of the decision tree 601 by the path a. The probability for the realizable leaf Ra (and therefore the probability assigned to the path a) can be estimated (the estimate is represented by pa) by dividing the number of training instances (e.g., input data records used for training) in the training dataset that landed in the realizable leaf during training of the decision tree 601 by the total number of training instances in the training dataset, as indicated by the equation below:
pa = (number of training instances that landed in Ra during training) / (total number of training instances in the training dataset),
where X∈Ra denotes the proposition that a set of actual parameters that map to the features in the vector X lands in the realizable leaf Ra, and pa is the estimated probability of that proposition.
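For illustration only, the following Python sketch estimates the per-leaf probabilities in the manner described above; assign_leaf is a hypothetical stand-in for whatever routine identifies the leaf in which a training record lands.

```python
from collections import Counter

def leaf_probabilities(training_records, assign_leaf):
    """Estimate, for each realizable leaf, the fraction of training records
    that land in that leaf: count(records in R_a) / count(all records)."""
    counts = Counter(assign_leaf(record) for record in training_records)
    total = len(training_records)
    return {leaf: count / total for leaf, count in counts.items()}

# Toy usage with a stand-in leaf assignment based on one threshold:
records = [(0.5, 0.2), (1.5, 0.7), (2.5, 1.8), (0.3, 1.2)]
probs = leaf_probabilities(records, lambda r: 0 if r[0] <= 1 else 1)
print(probs)  # {0: 0.5, 1: 0.5}
```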
Given that the ensemble 600 is a CatBoost model in this example, one characteristic of the decision tree 601 and the other member trees of the ensemble 600 is that each member tree is configured to use a (usually small) subset of the features that the ensemble 600 is configured to receive as input. Suppose there are n features that the ensemble 600 is configured to receive as input, where n is a positive integer. Also suppose that N denotes the set of the features that the ensemble 600 is configured to receive as input. In other words, N is the set of global features for the ensemble 600. The cardinality (i.e., number of elements in a set) of N is denoted by |N| and is equal to n. Further suppose that K denotes the set of features on which the decision tree 601 splits and that k denotes the number of features in K (which can also be represented by |K|, which is the cardinality of K). Because the features on which the decision tree 601 splits are selected from the set of the features that the ensemble 600 is configured to receive as input, the features in K represent features that are also found in N. In that sense, the set of features represented in K is a subset of the set of features represented in N. Note, however, that depending on the forms in which N and K are implemented in programming code, there might not be a self-evident way to identify to which feature in N a given feature in K corresponds. For instance, consider an example in which N is implemented as a vector (or an array or some other data structure that stores multiple values, assigns a respective numeric index to each given value, and returns the given value in response to receiving, as an argument (e.g., an actual parameter), the respective numeric index assigned to the given value). In this example, if K is also represented as a vector and K does not include most of the features included in N, the index assigned to a particular feature in the vector that implements K might not match the index assigned to that particular feature in the vector that implements N. To facilitate implementation of the techniques described further below in programming code, it may be advantageous to identify the feature in N to which a feature in K corresponds. Accordingly, an injective function that maps features in K to features in N (e.g., such as the local-to-global mapping that will be described further below) may be defined for this purpose.
The number k is a positive integer that is less than or equal to n. K constitutes the set of local features for the decision tree 601. The case k=n would rarely be implemented in practice because it would be likely to cause overfitting. (Note that k is not allowed to exceed the depth of the tree; in practice, it may be preferable to constrain the depth of the tree to no more than fifteen.) For that reason, suppose that k<n (i.e., K is a proper subset of N) for the purposes of this example.
The features in K were selected (e.g., randomly or by an optimization mechanism applied during training) from N. As a result, the indices that map to the features in a vector that stores the elements of K (i.e., the local features for the decision tree 601) typically will not match the indices of those same features in a vector that stores the elements of N (i.e., the global features for the ensemble 600). As will be shown further below, it is useful to create local-to-global mapping that maps the indices of local features in the vector that stores K to the indices of those same features in the vector that stores N. The local-to-global mapping can be stored in a data structure such as a lookup table.
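As a hedged illustration, the following Python sketch builds such a local-to-global lookup table; the use of feature names as the join key is an assumption made for this sketch.

```python
def local_to_global_mapping(local_features, global_features):
    """Build a lookup table that maps the index of each local feature (the
    features on which a given tree splits) to the index of the same feature
    in the global feature vector for the ensemble.
    """
    global_index = {name: idx for idx, name in enumerate(global_features)}
    return {local_idx: global_index[name]
            for local_idx, name in enumerate(local_features)}

global_features = ["age", "income", "utilization", "tenure"]
local_features = ["utilization", "age"]          # subset K used by one tree
print(local_to_global_mapping(local_features, global_features))  # {0: 2, 1: 0}
```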
Next, for each feature i in K, the set of the levels of the decision tree 601 for which i is the feature to which the splitting criterion for the level applies is identified. In other words, if the splitting criterion for a level of the decision tree 601 applies to i, that level is included in the set of levels for i. The set of levels for i is denoted by L(i). The set of levels L(i) may be stored as a vector that contains the indices of the elements of L(i) (e.g., the depths of the levels in L(i)) in the decision tree 601. The set of the sets L(i) for each feature i in K is denoted by p. For reference, Filom et al. (cited above) refer to sets of levels as partitions of levels (i.e., level partitions) and also use L(i) and p to represent the set of levels for i and the set of sets of levels for i, respectively. For disambiguation purposes, note that level partitions (which refer to sets of levels) differ from the partitions of features (e.g., global partitions and tree-local partitions) that will be discussed further below with respect to determining Owen values.
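For illustration, the following Python sketch derives the level sets L(i) for a symmetric tree from a per-level list of feature indices; the representation is a simplifying assumption of this sketch.

```python
def level_partition(feature_idx):
    """Group the levels of a symmetric tree by the feature on which each
    level splits: the result maps each local feature i to the list of level
    indices whose splitting criterion applies to i.
    """
    levels = {}
    for level, feature in enumerate(feature_idx):
        levels.setdefault(feature, []).append(level)
    return levels

# Levels that split on X1, X2, X1 in turn (as in the decision tree 500):
print(level_partition([0, 1, 0]))  # {0: [0, 2], 1: [1]}
```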
In this example, suppose the contribution values to be determined are Shapley values. The generalized formula for computing Shapley values is given by
ϕi[vME, N] = Σ over S⊆N\{i} of w(s, n)·(vME(S∪{i}) − vME(S)),
where ϕi[vME, N] represents the Shapley value for the feature i, S represents a proper subset of N that does not include the feature i, s represents the number of elements in S (i.e., the cardinality of S), w(s,n) represents a known weight value (for Shapley values, w(s,n)=s!(n−s−1)!/n!), {i} represents the set of features containing i alone and no other elements, and vME(S∪{i}), where dependence on parameters (x, X, f) is suppressed as indicated above, represents a game based on marginal expected values of the decision tree 601. In this context, the term “game” refers to a game as defined in game theory, as will be recognized by persons of skill in the art. In the game vME, the features in N are considered to be the players (as defined in game theory); the payoffs and rules (as defined in game theory) are established by the structure of the decision tree 601.
In this example, it will be useful to provide notations for some additional quantities that will be computed during the process of determining Shapley values for the leaves in the decision tree 601. Let b denote a path. As noted above, a also denotes a path. For the pair of path a and path b, which is denoted by (a, b), it will be helpful to identify a subset of the set of features K that highlights similarities between how the feature i influences path a and how the feature i influences path b. Specifically, it will be helpful to know at which levels path a and path b have matching path directions. In this context, there are two scenarios in which path a and path b are considered to have a matching path direction at a given level of the decision tree 601. In the first scenario, (i) path a proceeds to the next level in the decision tree 601 through a left edge of the node through which path a passes in the given level and (ii) path b proceeds to the next level in the decision tree 601 through a left edge of the node through which path b passes in the given level. In the second scenario, (i) path a proceeds to the next level in the decision tree 601 through a right edge of the node through which path a passes in the given level and (ii) path b proceeds to the next level in the decision tree 601 through a right edge of the node through which path b passes in the given level.
In other words, in the first scenario, both path a and path b proceed to a left subtree of a node in the given level. Path a and path b may or may not pass through the same node of the given level to the same subtree, but path a and path b are considered to have a matching path direction in either case as long as they both proceed via a left edge for which a node in the current level is the initial vertex. Similarly, in the second scenario, both path a and path b proceed to a right subtree of a node in the given level. Path a and path b may or may not pass through the same node of the given level to the same subtree, but path a and path b are considered to have a matching path direction in either case as long as they both proceed via a right edge for which a node in the current level is the initial vertex.
With the meaning of the phrase “matching path directions” thus explained, a subset of the features in K that reflects commonalities between how the features in K influence two paths is defined in the equation below:
ε(a, b) = {j∈K : a|L(j) = b|L(j)},
where j denotes a feature in K, b|L(j) denotes the splitting directions of the path b at the levels of the decision tree 601 that map to respective splitting criteria that apply to the feature j, a|L(j) denotes the splitting directions of the path a at the levels of the decision tree 601 that map to respective splitting criteria that apply to the feature j, and ε(a, b) denotes the set of features j in K for which path a and path b have matching path directions at each level that maps to a splitting criterion that applies to the feature j. Note that ε(a, b) will be the empty set if there is no feature j in K for which path a and path b have matching path directions. Also note that ε(a, b) will be equivalent to K if path a equals path b. Of course, depending on which paths are selected as path a and path b, the number of features in ε(a, b) can also be greater than zero and less than the number of features in K.
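As a non-limiting illustration, the following Python sketch computes the set ε(a, b) of matching features for a pair of binary path vectors, given the level sets L(j) described above; the data representations are assumptions of this sketch.

```python
def matching_features(path_a, path_b, level_partition):
    """Return the set of local features j for which path_a and path_b take
    the same direction (left/right) at every level whose splitting
    criterion applies to j.

    `path_a` and `path_b` are binary path vectors (one direction bit per
    level); `level_partition` maps each feature to its list of levels.
    """
    return {
        feature
        for feature, levels in level_partition.items()
        if all(path_a[lvl] == path_b[lvl] for lvl in levels)
    }

partition = {0: [0, 2], 1: [1]}   # feature 0 splits levels 0 and 2, feature 1 splits level 1
print(matching_features((0, 1, 0), (0, 0, 0), partition))  # {0}: agreement on feature 0's levels only
print(matching_features((0, 1, 0), (0, 1, 0), partition))  # {0, 1}: identical paths give all of K
```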
It will also be helpful to define an additional set of pairs of paths according to the following equation:
C(a, Z, W) = {(b, u) : ε(a, b) = W and ε(b, u) = {tilde over (Z)}},
where W denotes a subset of K (i.e., the set of local features for the decision tree 601), Z denotes a subset of W, {tilde over (Z)} denotes the set of features that are in K but are not in Z, u denotes a path, (b, u) denotes a pair of paths, and C(a, Z, W) denotes the set of pairs of paths that conform to the definition established by the equation above (which specifies that (i) the set of features ε(a, b) is W; and (ii) the set of features ε(b, u) is {tilde over (Z)}).
Given the equations and definitions provided above, and as explained in greater detail by Filom et al. (cited above), the generalized formula for computing a Shapley value can be reduced to a formula designed specifically to compute the Shapley value for a feature i for a leaf a in the decision tree 601, as shown in the equation below:
where w+(w,z) denotes a weight that is a functional of the weight w(s,n) (defined above) and is known when w(s,n) is known, w−(w,z) also denotes a weight that is a functional of the weight w(s,n) (defined above) and is known when w(s,n) is known, z denotes the number of features in Z (i.e., the cardinality of Z), cb denotes the value associated with the leaf Rb in the decision tree 601 (i.e., the value the decision tree 601 will assign to an input data record that lands in the leaf Rb), pu denotes the probability estimate for Ru (i.e., the estimated probability that X∈Ru), and ϕi(a) denotes the Shapley value for the feature i for the leaf a in the decision tree 601.
The formula ϕi(a) reduces the computational complexity of determining a Shapley value for a feature i for a leaf a in the decision tree 601 to such an extent that it may be practical and desirable to compute the set of Shapley values for the features N of the ensemble 600 for each leaf that is found in the member trees of the ensemble 600. One advantage that results from computing the Shapley values beforehand in this manner is that the Shapley values can be stored in a data structure that maps leaves to their corresponding sets of Shapley values. Once the data structure is populated, the sets of Shapley values for an input data record can be retrieved rapidly from the data structure based on the leaves in which the input data record lands in the decision trees found in the ensemble 600. The overall Shapley value for a feature for the ensemble 600 can be computed by summing the Shapley values for that feature for the decision trees found in the ensemble 600.
To evaluate the formula for ϕi(a) for a given path a (and the leaf indicated thereby) and a given feature i, it will be useful to identify a set of paths referred to herein as a path preimage for the path a. The path preimage for the path a and a subset of K is defined in the equation below:
ε−1(a,W) = {b : ε(a, b) = W},
where a denotes the path, b denotes a path such that the condition ε(a, b)=W is satisfied, W denotes a subset of K (i.e., the set of local features for the decision tree 601), and ε(a, b) denotes a subset of features as explained above. The path preimages for the path a and each possible value of W are computed and stored (e.g., in a matrix of path preimages for the path a). If sets of contribution values for features are to be precomputed for storage in a data structure for subsequent lookup, the path preimages for each path from the root to a leaf of the decision tree 601 paired with each possible value of W (i.e., each possible combination of a and W) can be computed and stored.
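For illustration, the following Python sketch groups the realizable paths by their matching-feature set with a fixed path a, which yields the path preimages ε−1(a, W) for every W at once; the helper matching_features is re-declared here (returning a frozenset so it can serve as a dictionary key), and the representations are assumptions of this sketch.

```python
def matching_features(path_a, path_b, level_partition):
    """Features whose levels have the same direction in both paths."""
    return frozenset(f for f, lvls in level_partition.items()
                     if all(path_a[l] == path_b[l] for l in lvls))

def path_preimages(path_a, realizable_paths, level_partition):
    """Group all realizable paths b by their matching-feature set with path_a;
    the group keyed by a subset W is the path preimage of (a, W)."""
    groups = {}
    for b in realizable_paths:
        key = matching_features(path_a, b, level_partition)
        groups.setdefault(key, []).append(b)
    return groups

partition = {0: [0, 2], 1: [1]}
paths = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0), (1, 0, 1), (1, 1, 1)]  # realizable paths
print(path_preimages((0, 1, 0), paths, partition))
```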
Notably, the number of elements in the path preimage ε−1(a,W) for the path a is independent of a. Rather, the number of elements in the path preimage ε−1(a,W) is dependent only on W. Moreover, for every fixed realizable path a, the collection of path preimages {ε−1(a,W)}W⊆K partitions the set of all realizable paths into disjoint parts. Thus, for every fixed realizable path a, the sum of the cardinalities |ε−1(a,W)| over the subsets W⊆K equals the number of realizable paths in the decision tree 601. Accordingly, the path preimages for the possible values of W and every path a can be stored together in a matrix whose number of rows and number of columns each equal the number of realizable paths.
Note that the priority application U.S. patent application Ser. No. 18/499,974 used the symbol P(a,W) to refer to the path preimage for the path a and a subset of K. The current application has changed the symbol used to refer to the path preimage from P(a,W) to ε−1(a,W) so that the symbol P may be used to refer to partitions in the discussion of Owen values further below without causing confusion. In addition, the priority application used the word “preimage” alone to refer to the path preimage. The current application has added the adjective “path” to “preimage” to distinguish the path preimage from other types of preimages which will be discussed further below with respect to Owen values.
Once the path preimages have been computed, it will be useful to compute probabilities for the path preimages (i.e., the path preimage probabilities). The probability of a path preimage is defined by the equation below:
pPre(a,W) = Σ over b∈ε−1(a,W) of pb,
where pPre(a,W) denotes the probability of the path preimage ε−1(a,W), pb denotes the probability estimate (as defined above) for Rb, Rb denotes the realizable leaf that is connected to the root 602 of the decision tree 601 via the path b, and the sum runs over the paths b in the path preimage ε−1(a,W) (so that pPre(a,W) is the estimated probability that an input data record lands in the union ∪b∈ε−1(a,W) Rb of the corresponding realizable leaves).
Once the path preimage probabilities have been computed, marginal path expectations can be computed. The marginal path expectation for the path a and the set of features W is defined by the equation below:
mp(a,W) = ca·(pPre(a,W))T,
where mp(a,W) denotes a marginal path expectation, ca denotes the score associated with the leaf a (i.e., the score that the decision tree 601 will assign to an input data record that lands in the leaf Ra), and the use of T in superscript denotes transposing the operand that immediately precedes T (which presumes that the path preimage probabilities pPre(a,W) are stored as a vector).
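As a hedged illustration, the following Python sketch computes a path preimage probability as the sum of the per-leaf probability estimates over the preimage and then forms the corresponding marginal path expectation; the function names and the scalar (rather than vectorized) form are simplifications assumed for this sketch.

```python
def preimage_probability(preimage_paths, leaf_probability):
    """Probability of a path preimage: the sum of the probability estimates
    p_b over the realizable paths b collected in the preimage."""
    return sum(leaf_probability[b] for b in preimage_paths)

def marginal_path_expectation(score_a, preimage_prob):
    """Marginal path expectation for (a, W): the score associated with the
    leaf R_a multiplied by the probability of the path preimage for (a, W)."""
    return score_a * preimage_prob

leaf_probability = {(0, 0, 0): 0.25, (0, 1, 0): 0.10}
p_pre = preimage_probability([(0, 0, 0), (0, 1, 0)], leaf_probability)
print(marginal_path_expectation(score_a=2.4, preimage_prob=p_pre))  # 2.4 * 0.35 = 0.84
```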
A marginal path expectation can be interpreted as an updated expected value for the leaf Ra that is computed by using the probability of the path preimage in place of the probability estimate for the leaf Ra. Functionally, the process of computing a marginal path expectation can be described as identifying the hyperplanes in the multidimensional space of the domain that bound the region of the domain that maps to the leaf Ra.
With the marginal path expectations thus defined, for a given feature i and a given path a, the simplified formula for computing Shapley values can be rewritten as shown in the equation below:
where w+(w,z) denotes a weight that is a functional of the weight w(s,n) (defined above) and is known when w(s,n) is known, w−(w,z) also denotes a weight that is a functional of the weight w(s,n) (defined above) and is known when w(s,n) is known, z denotes the number of features in Z, W denotes a subset of K (i.e., the set of local features for the decision tree 601), Z denotes a subset of W, and {tilde over (Z)} denotes the set of features that are in K but are not in Z.
With the marginal path expectations computed and the weights known, the formula ϕi(a) can be evaluated for each feature i for the leaf a into which an input data record falls in the decision tree 601. The formula ϕi(a) can be similarly evaluated for each feature i for each leaf into which the input data record falls in the other decision trees of the ensemble 600. The Shapley values for a feature i across the leaves in the ensemble 600 into which the input data record lands can then be summed to determine the overall Shapley value for that feature for the ensemble 600.
The formula ϕi(a) can also be evaluated for each of the features and each of the leaves to determine the respective set of Shapley values to which each leaf in the ensemble 600 maps. The determined Shapley values can then be stored in a data structure that maps leaves to sets of Shapley values to facilitate rapid retrieval and to obviate repeating any calculations when Shapley values are requested for input data instances provided thereafter.
As explained above, the simplified formula for computing Shapley values can be evaluated with greatly increased computational efficiency using path preimages and marginal path expectations in the manner described above. In cases where the features used in an ensemble of decision trees (e.g., a CatBoost model) are statistically independent of each other, Shapley values may serve as valuable indicators of the extent to which each of the corresponding values for those features (e.g., the actual parameters that map to those features) influenced the classification of a given input data record via the ensemble. In other cases, some of the features that an ensemble of decision trees is configured to receive as input might not be statistically independent of each other.
In such cases where the premise of statistical independence between features is not satisfied, there are advantages to using Owen values instead of Shapley values because Owen values are determined in a manner that accounts for the dependencies that exist between features.
The formula for the Owen value Owi[v] is similar to the Shapley value formula (see [0055]), but also incorporates a partition, as shown below:
Owi[v] = Σ over R⊆M\{j} Σ over T⊆Sj\{i} of (r!(m−r−1)!/m!)·(t!(sj−t−1)!/sj!)·(v(Q∪T∪{i}) − v(Q∪T)), where Q = ∪q∈R Sq,
in which Sj denotes the feature group of the partition that contains the feature i, M denotes the set of indices of the feature groups, and r, t, m, and sj denote the cardinalities of R, T, M, and Sj, respectively.
However, the formula for determining Owen values includes additional terms to account for dependencies between features. To evaluate these additional terms, additional computations generally have to be performed. If these additional computations are coded in a brute-force manner that naïvely translates the summation terms in a generalized formula for determining Owen values into computer code, inefficiency results because many of the additional computations will be repeated. Such inefficiency may render it impractical to determine Owen values in a deployed system for classifying input data records.
As will be explained in greater detail below, techniques described herein can be applied to determine Owen values in a manner that is computationally efficient enough to be used in a deployed system for classifying input data records via an ensemble of symmetric decision trees. These techniques involve storing the results of intermediate computations in additional data structures that will be described below. This obviates the inefficiency that would result from naïve approaches that would involve repeating those intermediate computations by allowing the stored results to be retrieved rather than recomputed throughout the process of determining the Owen values for the features.
To elucidate the techniques described herein for determining Owen values, it will be helpful to begin by defining the symbols and notations that will be used to facilitate the explanation below. First, let (X1, . . . , Xn) represent a vector of features that an ensemble of symmetric decision trees (e.g., a CatBoost model) is configured to receive as input. The features in the vector (X1, . . . Xn) and their respective indices will be referred to hereafter as the global features and the global indices, respectively. This will help to distinguish the global features and global indices from local features and local indices that will be described with respect to individual decision trees further below.
To facilitate implementation of the techniques described below in programming code, it will be helpful to use the global indices of the global features in the vector (X1, . . . Xn) to represent the global features themselves because the global indices may be stored as integers (e.g., because many programming languages can store integers via a native data type). For this reason, let N represent the set of global indices {1, . . . , n} of the global features, where each global index represents the respective global feature to which it corresponds in the vector (X1, . . . Xn).
Next, let the global features represented by the global indices in N be partitioned into a set of m global feature groups, where m is a positive integer that is less than or equal to n (note that, in a case where m=n, the Owen values and the Shapley values for the global features will be equivalent). Each global feature group may, for example, comprise features that are not statistically independent of each other (e.g., as determined by applying a clustering technique or some other type of statistical technique). Let P represent the set (i.e., global partition) of global feature groups {S1, . . . , Sm}, where m is a positive integer that represents the number of global feature groups in the global partition. Again, to facilitate implementation of the techniques described below in programming code, it will be helpful to use the global indices of the global feature groups to represent the global feature groups themselves because the global indices may be stored as integers. For this reason, let M represent the set of global indices {1, . . . , m} of the global feature groups, where each global index represents the respective global feature group to which it corresponds in the set {S1, . . . , Sm}. For reference,
In general, a given decision tree in the ensemble will not use the entire set of global features that the ensemble is configured to receive as input. Instead, the given decision tree will use a subset of the global features as a basis for splitting criteria included in the levels of the given decision tree. For any particular global feature that is not used in the given decision tree, the Owen value for the particular global feature for any leaf in the given decision tree will be zero—this can be demonstrated a priori without performing any computations for the particular global feature for the given decision tree. For this reason, for the given decision tree, it is advantageous not to perform any computations for (i) global features not used in the given decision tree and (ii) global feature groups that do not include any global features used in the given decision tree (e.g., to avoid inefficient use of memory, processing cycles, etc.).
To ensure that such unnecessary computations are not performed, it will be helpful to define a partition {tilde over (P)} that is local to the given decision tree. Hereafter, {tilde over (P)} will be referred to as the tree-local partition. In general throughout the discussion that follows, the tilde diacritic mark will be used to distinguish tree-local elements from global elements (e.g., the tree-local partition {tilde over (P)} from the global partition P). Each tree-local feature group in the tree-local partition {tilde over (P)} is a subset of a corresponding global feature group that includes at least one global feature used by the given decision tree. However, each given tree-local feature group in the tree-local partition {tilde over (P)} includes the global feature(s) in the corresponding global feature group that are used in the given decision tree, but excludes any global features in the corresponding global feature group that are not used by the given decision tree. Furthermore, the tree-local partition {tilde over (P)} does not include any tree-local feature groups that correspond to any global feature groups that do not include any features used by the given decision tree. As a result, the tree-local partition {tilde over (P)} effectively distills the global partition P down to the feature groups and features that are pertinent to the given tree.
Since the given tree might not include corresponding tree-local feature groups for some of the global feature groups, if the tree-local partition {tilde over (P)} were to use the global indices of the corresponding global feature groups to indicate the tree-local feature groups, the global indices included in the tree-local partition {tilde over (P)} might not be consecutive.
To facilitate implementation of the techniques described further below (e.g., in programming code), it may be helpful to represent the tree-local partition by using consecutive tree-local indices to indicate the tree-local feature groups (e.g., so that a loop control variable in programming code can readily iterate through the indices of the tree-local feature groups). Therefore, the tree-local partition {tilde over (P)} may be represented as {{tilde over (S)}1, . . . , {tilde over (S)}{tilde over (m)}}, where {tilde over (m)} is a positive integer that represents the number of tree-local feature groups included in the tree-local partition {tilde over (P)}. Note that {tilde over (m)}≤m because each tree-local feature group corresponds to a respective global feature group.
Furthermore, if the tree-local feature groups were to use the global indices of the features to indicate which features are included in each tree-local feature group, the global indices used to indicate the features in a given tree-local feature group might not be consecutive.
Again, to facilitate implementation of the techniques described further below (e.g., in programming code), it may be helpful to assign additional tree-local indices to the features that are collectively found in the tree-local feature groups (e.g., so that a loop control variable in programming code can readily iterate through the features used in the given tree). The tilde diacritic used for the tree-local feature groups {tilde over (S)}1, . . . , {tilde over (S)}{tilde over (m)} in the tree-local partition may, in addition to signifying that the tree-local groups have been assigned tree-local indices, also signify that the features collectively included in the tree-local feature groups {tilde over (S)}1, . . . , {tilde over (S)}{tilde over (m)} have been assigned tree-local indices such that, for the given tree, the tree-local index for a given feature identifies the given feature. The set of tree-local indices for the features used in the given tree may be represented by K={1, . . . , k}, where each tree-local index represents a respective feature used in the given tree.
In addition to assigning tree-local indices to tree-local groups and to features used in the given tree, it may be helpful to assign group-local indices to the features within each tree-local feature group. This may facilitate implementing the techniques described further below (e.g., in programming code) because it may allow a loop control variable to begin at a default starting value (e.g., one or zero) for the group-local index of a first feature in a tree-local feature group and iterate through the features in that tree-local group. By contrast, if the tree-local indices for the features in the tree-local group were to be used, additional coding may have to be done to ensure that a loop control variable would begin at the value of the correct tree-local index (which would be different for each tree-local group).
Given the advantages described above that result from using tree-local indices and group-local indices, let {circumflex over (P)} represent the tree-local partition by using tree-local indices for tree-local feature groups and group-local indices for the features found within each tree-local feature group. In general throughout the discussion that follows, the circumflex diacritic (e.g., as included in {circumflex over (P)}) will be used to distinguish representations that use group-local indices from representations that do not. For example, the tree-local partition {circumflex over (P)} may also be written as {Ŝ1, Ŝ2, . . . , Ŝ{tilde over (m)}}, where the circumflex diacritic signifies that each of the tree-local groups Ŝ1, Ŝ2, . . . , Ŝ{tilde over (m)} uses group-local indices to identify the features included therein.
The tree-local indices for the tree-local feature groups, the tree-local indices for the features used in the given tree, and/or the group-local indices for the features used in the given tree may facilitate rapid computation of the Owen values for leaves of the given tree for the features used in the given tree. However, since Owen values for leaves of the given tree will also be summed with Owen values for leaves of other trees in the ensemble, it will be helpful to define a mapping that maps the tree-local indices and/or group-local indices used in the tree-local partition to the global indices for the corresponding global feature groups and/or features. Let Iloc:glob represent such a mapping for the given decision tree. The mapping Iloc:glob will allow a computing platform that implements the techniques described herein to transform the tree-local and/or group-local indices used for the tree-local partition of the given decision tree (and the tree-local and/or group-local indices used by other trees) into their corresponding global indices so that the summation of Owen values across the leaves into which a given input instance lands in the ensemble can be performed rapidly and accurately.
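To make the indexing scheme above concrete, the following is a minimal Python sketch. The function name, parameter names, and example values are illustrative assumptions rather than material taken from this disclosure; the sketch simply restricts a global partition to the features used by one tree, assigns consecutive tree-local feature indices, and records a tree-local-to-global mapping in the spirit of Iloc:glob.

```python
# Hypothetical sketch; build_tree_local_partition and its inputs are illustrative.
def build_tree_local_partition(global_partition, tree_features):
    """global_partition: list of sets of global feature indices (S_1, ..., S_m).
    tree_features: set of global feature indices used by the given tree.
    Returns (tree_local_groups, group_local, loc_to_glob), where tree_local_groups
    lists the tree-local feature indices in each tree-local feature group, group_local
    lists the group-local feature indices within each group, and loc_to_glob maps
    each tree-local feature index (1-based) to its global feature index."""
    tree_local_groups = []
    loc_to_glob = []  # position = tree-local index - 1, value = global index
    for S in global_partition:
        used = sorted(S & tree_features)
        if not used:                        # skip global groups with no feature in the tree
            continue
        group = []
        for g in used:
            loc_to_glob.append(g)           # assign the next consecutive tree-local index
            group.append(len(loc_to_glob))  # record that tree-local index in the group
        tree_local_groups.append(group)
    # group-local indices simply run 1..|S~_j| within each tree-local group
    group_local = [list(range(1, len(g) + 1)) for g in tree_local_groups]
    return tree_local_groups, group_local, loc_to_glob

# Example: 4 global groups over features 1..8; the tree uses features {2, 4, 5, 6, 7, 8}
P = [{1, 2}, {3, 4, 5}, {6, 7}, {8}]
groups, group_local, loc_to_glob = build_tree_local_partition(P, {2, 4, 5, 6, 7, 8})
print(groups)       # [[1], [2, 3], [4, 5], [6]]  (tree-local feature indices per group)
print(group_local)  # [[1], [1, 2], [1, 2], [1]]  (group-local feature indices per group)
print(loc_to_glob)  # [2, 4, 5, 6, 7, 8]          (tree-local index -> global index)
```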
With the symbols and notations above thus defined, the Owen value h for a path a (and the leaf indicated thereby) for a feature i (where i represents the respective global index of a given global feature) that is included in a tree-local feature group Ŝj of a tree-local partition for the given decision tree may be represented by the expression hi∈Ŝj. (Note that the feature i is considered to be included in the tree-local feature group Ŝj if a group-local index included in the tree-local feature group Ŝj maps to i in the mapping Iloc:glob for the given tree.) The inventors have shown, in Filom et al. (cited above), that the following formula can be used to compute the Owen-like value hi∈Ŝj:
Where Δi is the difference between a minuend δ1 and a subtrahend δ2 (i.e., Δi=δ1−δ2). The minuend is defined as:
The subtrahend is defined as:
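The displayed formula and the displayed definitions of the minuend δ1 and the subtrahend δ2 are not reproduced in this text. Purely as a reading aid, and based only on the surrounding description of the iterator variables, weights, and preimages (so the exact notation, the fourth argument of the preimages, and the precise placement of the conditions i∈Z* and i∉W* are assumptions rather than a verbatim reproduction of the formula from Filom et al.), they plausibly take a form along the following lines:

$$
h_{i\in\hat{S}_j}(a) \;=\; \sum_{W \subseteq \tilde{M}\setminus\{j\}} \; \sum_{Z \subseteq W} \alpha_1\!\left(|Z|,\; |\tilde{M}|+|Z|-|W|\right)\,\Delta_i(a; Z, W),
\qquad \Delta_i = \delta_1 - \delta_2,
$$

$$
\delta_1 \;=\; \sum_{\hat{W}^* \subseteq \hat{S}_j}\;\sum_{\substack{\hat{Z}^* \subseteq \hat{W}^* \\ i \in Z^*}} \alpha_2\!\left(|\hat{Z}^*|-1,\; |\hat{S}_j|+|\hat{Z}^*|-|\hat{W}^*|\right)
\sum_{W' \in \mathrm{Inc}(W,\, W^*,\, j,\, \tilde{P})}\;\sum_{Z' \in \mathrm{Int}(Z,\, Z^*,\, j,\, \tilde{P})}\;\sum_{b \in \varepsilon^{-1}(a,\, W')} mp(b, -Z'),
$$

$$
\delta_2 \;=\; \sum_{\hat{W}^* \subseteq \hat{S}_j}\;\sum_{\substack{\hat{Z}^* \subseteq \hat{W}^* \\ i \notin W^*}} \alpha_2\!\left(|\hat{Z}^*|,\; |\hat{S}_j|+|\hat{Z}^*|-|\hat{W}^*|\right)
\sum_{W' \in \mathrm{Inc}(W,\, W^*,\, j,\, \tilde{P})}\;\sum_{Z' \in \mathrm{Int}(Z,\, Z^*,\, j,\, \tilde{P})}\;\sum_{b \in \varepsilon^{-1}(a,\, W')} mp(b, -Z').
$$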
The formula shown above for computing the Owen-like value hi∈Ŝj includes several elements that will now be explained in further detail. As indicated by the bounds of the outer sum, W represents a subset of {tilde over (M)} (recall that {tilde over (M)} represents the set of tree-local indices {1, . . . , {tilde over (m)}} of the tree-local feature groups for the given tree, where each tree-local index represents the respective tree-local feature group to which it corresponds in the local partition {circumflex over (P)}={Ŝ1, Ŝ2, . . . , Ŝ{tilde over (m)}}). Specifically, W is a subset of {tilde over (M)}\{j}. In this context, the expression {tilde over (M)}\{j} indicates the subset of {tilde over (M)} that includes each element (e.g., tree-local index and/or tree-local group represented thereby) of {tilde over (M)} except j. Recall that j represents the tree-local index for the tree-local feature group Ŝj in the tree-local partition {circumflex over (P)}, so W represents a subset of {tilde over (M)} that does not include j.
Further, Z represents a subset of W. Thus, both Z and W are sets of tree-local indices that do not include the tree-local index j, where each tree-local index represents a respective corresponding tree-local feature group for the given tree. The cardinality of Z (i.e., the number of tree-local indices corresponding to tree-local feature groups in Z) is indicated by the expression |Z|. Similarly, the cardinality of W (i.e., the number of tree-local indices corresponding to tree-local feature groups in W) is indicated by the expression |W| and the cardinality of {tilde over (M)} (i.e., the number of tree-local indices corresponding to tree-local feature groups in {tilde over (M)}) is indicated by the expression |{tilde over (M)}|. Since Z is a subset of W, W is a subset of {tilde over (M)}, and neither Z nor W includes j, the inequality |Z|≤|W|<|{tilde over (M)}| holds.
In the outer sum, both W and Z are presented as iterator variables. Specifically, the first summation is computed across the possible values of W and, for each possible value of W, across the corresponding possible values of Z (recall that Z represents a subset of W).
With respect to the functions α1 and α2, the Owen-like value hi∈Ŝj is computed with both α1 and α2 set equal to the Shapley value weight function α described below (i.e., in the case of Owen values, α1 and α2 are both equal to α).
The Shapley value weight function α (e.g., similar to the known weight value recited with respect to the generalized formula for computing Shapley values) receives two parameters as input. For example, as shown in the multiplicand of the outer sum above, the function α receives the expression |Z| (i.e., the cardinality of Z) as the first parameter and the expression |{tilde over (M)}|+|Z|−|W| as the second parameter. To evaluate the function α in this example, let the first parameter be represented by γ and let the second parameter be represented by θ. In terms of these letters, the function α(γ,θ) equals the value that results from evaluating the known Shapley value weight expression recited above with respect to the generalized formula for computing Shapley values.
Thus, in the multiplicand of the outer sum, α1(|Z|, |{tilde over (M)}|+|Z|−|W|) equals the value that results when |Z| is inserted in the place of γ and |{tilde over (M)}|+|Z|−|W| is inserted in the place of θ in the expression for α(γ,θ).
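As a reading aid, and assuming that α is the standard Shapley value weight α(γ,θ) = γ!(θ−γ−1)!/θ! (the description above characterizes it only as the known Shapley value weight), the evaluation of α1 in the multiplicand can be sketched as follows; the function name is illustrative.

```python
from math import factorial

def alpha(gamma, theta):
    # assumed standard Shapley value weight: gamma! * (theta - gamma - 1)! / theta!
    return factorial(gamma) * factorial(theta - gamma - 1) / factorial(theta)

# e.g., alpha1(|Z|, |M~| + |Z| - |W|) with |Z| = 1, |M~| = 4, |W| = 2:
print(alpha(1, 4 + 1 - 2))  # 1! * 1! / 3! = 0.1666...
```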
Turning to the difference Δi, Ŵ* is a subset of the tree-local feature group Ŝj. (Recall that the tree-local feature group Ŝj includes group-local indices for the features used in the given tree and that the mapping Iloc:glob maps those group-local indices to respective global indices for those features.) Further, {circumflex over (Z)}* is a subset of Ŵ*. W* represents the set of global indices to which the group-local indices included in Ŵ* map in the mapping Iloc:glob. Similarly, Z* represents the set of global indices to which the group-local indices included in {circumflex over (Z)}* map in the mapping Iloc:glob. |{circumflex over (Z)}*| represents the cardinality of {circumflex over (Z)}*, |Ŵ*| represents the cardinality of Ŵ*, and |Ŝj| represents the cardinality of Ŝj. These aspects of the iterator variables Ŵ*, W*, {circumflex over (Z)}*, and Z* are consistent across the minuend δ1 and the subtrahend δ2. However, the minuend δ1 and the subtrahend δ2 specify different types of relationships between the iterator variables Ŵ*, W*, {circumflex over (Z)}*, and Z* and the global index i (which corresponds to the global feature for which an Owen value is to be calculated), as explained below.
For the minuend δ1, the expression i∈Z* indicates that the global index i is included in Z*, thereby indicating that the feature corresponding to the global index i is included in {circumflex over (Z)}* (and in Ŵ* and Ŝj, since {circumflex over (Z)}* is a subset of Ŵ* and Ŵ* is a subset of Ŝj). (Recall that the feature i is considered to be included in the tree-local feature group Ŝj if a group-local index included in the tree-local feature group Ŝj maps to i in the mapping Iloc:glob for the given tree.)
By contrast, for the subtrahend δ2, the expression i∉W* indicates that the global index i is not included in W*, thereby indicating that the feature corresponding to the global index i is not included in Ŵ* (nor in {circumflex over (Z)}*, since {circumflex over (Z)}* is a subset of Ŵ*).
As explained above, the function α2 (like the function α1 used in the multiplicand of the outer sum) refers to the Shapley value weight function that receives a first parameter and a second parameter as input. Thus, in the minuend, α2(|{circumflex over (Z)}*|−1, |Ŝj|+|{circumflex over (Z)}*|−|Ŵ*|) equals the value that results when |{circumflex over (Z)}*|−1 is inserted in the place of γ and |Ŝj|+|{circumflex over (Z)}*|−|Ŵ*| is inserted in the place of θ in the expression for α(γ,θ). Further, in the subtrahend, α2(|{circumflex over (Z)}*|, |Ŝj|+|{circumflex over (Z)}*|−|Ŵ*|) equals the value that results when |{circumflex over (Z)}*| is inserted in the place of γ and |Ŝj|+|{circumflex over (Z)}*|−|Ŵ*| is inserted in the place of θ in the expression for α(γ,θ).
In the difference Δi, both Ŵ* and {circumflex over (Z)}* are presented as iterator variables. Specifically, for the minuend, a summation with the general term α2(|{circumflex over (Z)}*|−1, |Ŝj|+|{circumflex over (Z)}*|−|Ŵ*|) is computed across the possible values of Ŵ* and, for each possible value of Ŵ*, across the corresponding possible values of {circumflex over (Z)}* (recall that {circumflex over (Z)}* represents a subset of Ŵ*). Similarly, for the subtrahend, a summation with the general term α2(|{circumflex over (Z)}*|, |Ŝj|+|{circumflex over (Z)}*|−|Ŵ*|) is computed across the possible values of Ŵ* and, for each possible value of Ŵ*, across the corresponding possible values of {circumflex over (Z)}*.
In addition, both the minuend δ1 and the subtrahend δ2 include a summation with the general term mp(b, −Z). In this context, the expression mp(b, −Z) represents the marginal path expectation (e.g., as explained with respect to the discussion for Shapley values above) for the path b and the set of features −Z. The path b is a path that is included in the preimage ε−1(a,W) for the path a in accordance with the definition of preimages provided above in the discussion of Shapley values.
In both the minuend δ1 and the subtrahend δ2, Z represents a set of features that are represented by their corresponding tree-local indices (i.e., each element in Z is a tree-local index that corresponds to a respective feature that is included in Z). Z itself is included in a set of sets of features which will hereafter be called the intersection preimage. The intersection preimage for the tree-local group indicated by the tree-local index j is represented by the expression Int(Z, {circumflex over (Z)}*, j, {tilde over (P)}) and will be explained in greater detail in the following paragraphs. For both the minuend δ1 and the subtrahend δ2, respectively, Z* represents the same set of features as Z (as noted above, though, recall that the relationship between Z* and the global index i differs between the minuend δ1 and the subtrahend δ2). However, Z* represents those features by their corresponding global indices instead of their corresponding tree-local indices (i.e., each element in Z* is a global index that corresponds to a respective feature that is included in Z, although Z uses a tree-local index to represent the respective feature). Also note that −Z represents the complement of Z (i.e., features that are included in K that are not included in Z are included in −Z).
Like Z, W also represents a set of features that are represented by their corresponding tree-local indices (i.e., each element in W is a tree-local index that corresponds to a respective feature that is included in W) in both the minuend δ1 and the subtrahend δ2. The set W itself is included in a set of sets of features which will hereafter be called the inclusion preimage for the tree-local group indicated by the tree-local index j. The inclusion preimage is represented by the expression Inc(W, W*, j, {tilde over (P)}) and will be explained in greater detail in the following paragraphs. For both the minuend δ1 and the subtrahend δ2, respectively, W* represents the same set of features as W (as noted above, though, recall that the relationship between W* and the global index i differs between the minuend δ1 and the subtrahend δ2). However, W* represents those features by their corresponding global indices instead of their corresponding tree-local indices (i.e., each element in W* is a global index that corresponds to a respective feature that is included in W*).
In terms of the difference Δi, both Ŵ* and {circumflex over (Z)}* serve as iterator variables. Specifically, in the minuend δ1, a summation with the general term α2(|{circumflex over (Z)}*|−1, |Ŝj|+|{circumflex over (Z)}*|−|Ŵ*|) is computed across the possible values of Ŵ* and, for each possible value of Ŵ*, across the corresponding possible values of {circumflex over (Z)}* (recall that {circumflex over (Z)}* represents a subset of Ŵ*). Similarly, in the subtrahend δ2, a summation with the general term α2(|{circumflex over (Z)}*|, |Ŝj|+|{circumflex over (Z)}*|−|Ŵ*|) is computed across the possible values of Ŵ* and, for each possible value of Ŵ*, across the corresponding possible values of {circumflex over (Z)}*. Furthermore, in both the minuend δ1 and the subtrahend δ2, respectively, b, Z, and W each serve as iterator variables. Specifically, both the minuend δ1 and the subtrahend δ2 include summations that are computed across each possible combination of the possible values of b, Z, and W.
As noted above, the inclusion preimage and the intersection preimage are represented by the expression Inc(W, W*, j, {tilde over (P)}) and the expression Int(Z, Z*, j, {tilde over (P)}), respectively. The inclusion preimage and the intersection preimage refer to structures that the inventors created to allow Owen values to be determined with increased computational efficiency (e.g., by obviating multiple redundant computations that other approaches for determining Owen values would entail). The inclusion preimages represented by the expression Inc(W, W*, j, {tilde over (P)}) conform to the formal definition below:
This formal definition establishes that the inclusion preimage for the tree-local group indicated by the tree-local index j includes each set Q of features that satisfies three conditions: (i) Q is a subset of K, where K indicates the set of features that are used in the given decision tree; (ii) the intersection set of Q and the set of local features {tilde over (S)}j equals the provided subset W*; and (iii) the set of local indices other than j for local groups that are subsets of Q is equal to W.
In addition, the intersection preimage represented by the expression Int(Z, Z*, j, {tilde over (P)}) conforms to the formal definition below:
This formal definition establishes that the intersection preimage for the tree-local group indicated by the tree-local index j includes each set Q of features that satisfies three conditions: (i) Q is a subset of K, where K indicates the set of features that are used in the given decision tree; (ii) the intersection set of Q and the set of local features {tilde over (S)}j equals the provided subset {circumflex over (Z)}*; and (iii) the set of local indices other than j for local groups whose respective intersection sets with Q are not empty is equal to Z.
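The two formal definitions can be transcribed almost directly into code. In the sketch below, all sets are expressed in tree-local feature indices for simplicity (the source expresses W* and {circumflex over (Z)}* in global or group-local indices), and the example values of K and the tree-local feature groups are illustrative assumptions.

```python
from itertools import chain, combinations

def powerset(iterable):
    s = list(iterable)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(s, r) for r in range(len(s) + 1))]

# Illustrative tree-local data: K is the set of tree-local feature indices,
# and S_tilde maps each tree-local group index to its tree-local feature indices.
K = {1, 2, 3, 4, 5, 6}
S_tilde = {1: {1}, 2: {2, 3}, 3: {4, 5}, 4: {6}}

def inclusion_preimage(W_groups, W_star, j):
    """All Q subset of K with Q ∩ S_tilde[j] == W_star and
    {l != j : S_tilde[l] subset of Q} == W_groups (conditions (i)-(iii) above)."""
    result = []
    for Q in powerset(K):
        if Q & frozenset(S_tilde[j]) != frozenset(W_star):
            continue
        contained = {l for l, S in S_tilde.items() if l != j and S <= set(Q)}
        if contained == set(W_groups):
            result.append(Q)
    return result

def intersection_preimage(Z_groups, Z_star, j):
    """All Q subset of K with Q ∩ S_tilde[j] == Z_star and
    {l != j : S_tilde[l] ∩ Q is non-empty} == Z_groups."""
    result = []
    for Q in powerset(K):
        if Q & frozenset(S_tilde[j]) != frozenset(Z_star):
            continue
        touched = {l for l, S in S_tilde.items() if l != j and S & set(Q)}
        if touched == set(Z_groups):
            result.append(Q)
    return result

print(inclusion_preimage({3}, {2}, 2))     # [frozenset({2, 4, 5})]
print(intersection_preimage({3}, {2}, 2))  # [frozenset({2, 4}), frozenset({2, 5}), frozenset({2, 4, 5})]
```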
Turning now to an explanation of how the various elements described above (e.g., the inclusion preimages, the intersection preimages, the path preimages, the marginal path expectations, the tree-local indices, the group-local indices, etc.) may be utilized in combination to compute Owen values with increased computational efficiency, it may be helpful to rewrite the formula provided above for determining the Owen value hi∈Ŝj.
Rewriting the formula for determining Owen-like values in this manner facilitates the design of a fast implementation of this formula via a number of techniques which shall be presented below.
In this rewritten version of the formula for determining Owen-like values, several elements have been renamed and/or reformatted for illustrative purposes. For example, the inclusion preimage Inc(W, W*, j, {tilde over (P)}) has been replaced by the expression Inc(W, Ŵ*, j, {circumflex over (P)}). The tilde diacritic and the circumflex diacritic have been added to signify that the features indicated by the use of global indices in W* are indicated by group-local indices in Ŵ*. The set of sets of features represented by the inclusion preimage, though, is unchanged. Similarly, Int(Z, {circumflex over (Z)}*, j, {circumflex over (P)}) replaces −Int(Z, Z*, j, {tilde over (P)}), the complement of Int(Z, Z*, j, {tilde over (P)}), to signify that the features indicated by the use of global indices in Z* are indicated by group-local indices in {circumflex over (Z)}*. The set of sets of features represented by the intersection preimage, though, is unchanged.
In addition, the Shapley value weight α1(|Z|, |{tilde over (M)}|+|Z|−|W|) (recall that the function α1 equals the function α in the case of Owen values, as described above) has been replaced by the expression wouter[|W|, |Z|;j]·1{j∉W} (the dependence on {tilde over (M)} is omitted here because it is fixed in this case). These two expressions refer to the same quantity, but the expression that includes wouter and the indicator function 1{j∉W} presents the parameters more concisely and clarifies that this Shapley value weight is zero when W includes j (i.e., because the indicator function 1{j∉W} equals one when W does not include j and zero otherwise).
Similarly, the Shapley value weight α2(|{circumflex over (Z)}*|−1, |Ŝj|+|{circumflex over (Z)}*|−|Ŵ*|) (recall that the function α2 equals the function α in the case of Owen values, as described above) has been replaced by the expression winner+[Ŵ*,{circumflex over (Z)}*;j]·1{i∈Z*}, where the indicator function 1{i∈Z*} equals one when the global index i is included in Z* and zero otherwise.
Similarly, the Shapley value weight α2(|{circumflex over (Z)}*|, |Ŝj|+|{circumflex over (Z)}*|−|Ŵ*|) (again recall that the function α2 equals the function α in the case of Owen values, as described above) has been replaced by the expression winner−[Ŵ*,{circumflex over (Z)}*;j]·1{i∉W*}, where the indicator function 1{i∉W*} equals one when the global index i is not included in W* and zero otherwise.
Turning now to computationally efficient approaches for determining Owen values, the rewritten version of the formula for determining Owen-like values allows computations that would be performed redundantly in the naïve implementation to be performed just once such that the results can be stored and retrieved for use during subsequent computations rather than recomputed for use during those subsequent computations. Approaches for determining Owen values may leverage this unexpected advantageous property to increase computational efficiency.
One such approach for determining Owen values with increased computational efficiency may commence by performing a number of preliminary computations (i.e., precomputations) beforehand and storing the results of these computations in a data structure for later retrieval when Owen values are in the process of being determined. For a given decision tree, the path preimages ε−1(a,W) are computed for each leaf a and each possible subset W of the features indicated by the tree-local indices for the features K used in the given tree. The path preimages ε−1(a,W) may be stored, for example, as a matrix in any type of data structure that is suitable for storing matrices.
Furthermore, the marginal path expectations mp(a,W) are computed for each leaf a and each possible subset W of the features indicated by the tree-local indices for the features K used in the given tree. The marginal path expectations mp(a,W) may be stored, for example, as a matrix in any type of data structure that is suitable for storing matrices.
Further still, the Shapley value weights indicated by wouter, winner+, and winner− are precomputed for each tree-local feature group j represented by the tree-local indices in {tilde over (M)}. The Shapley value weights indicated by wouter may be stored as a matrix of size K×K in any data structure suitable for storing matrices. For each tree-local group j, the Shapley value weights indicated by winner+ may be stored as a matrix of size |{tilde over (S)}j|×|{tilde over (S)}j|. Similarly, for each tree-local group j, the Shapley value weights indicated by winner− may also be stored as a matrix of size |{tilde over (S)}j|×|{tilde over (S)}j|.
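Because these weights depend only on cardinalities (with |{tilde over (M)}| and |Ŝj| fixed for the given tree and group), they can be tabulated once per tree. The sketch below, which assumes the same standard Shapley weight as in the earlier sketch, tabulates the outer weights; the table shape, indexing, and names are illustrative, and the winner+ and winner− tables could be built analogously per group.

```python
import numpy as np
from math import factorial

def precompute_w_outer(m_tilde):
    """table[w, z] holds alpha(z, m_tilde + z - w) for 0 <= z <= w < m_tilde,
    assuming alpha(gamma, theta) = gamma! * (theta - gamma - 1)! / theta!."""
    alpha = lambda g, t: factorial(g) * factorial(t - g - 1) / factorial(t)
    table = np.zeros((m_tilde, m_tilde))
    for w in range(m_tilde):           # w plays the role of |W|
        for z in range(w + 1):         # z plays the role of |Z| (Z is a subset of W)
            table[w, z] = alpha(z, m_tilde + z - w)
    return table

print(precompute_w_outer(3))
```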
Further still, a set of possible pairs of the possible values of Z and W is computed. (Recall that Z is a subset of W and W is a subset of {tilde over (M)}.) In addition, for each possible value of j, a set of possible pairs of the possible values of {circumflex over (Z)}* and Ŵ* is computed. (Recall that {circumflex over (Z)}* is a subset of Ŵ* and Ŵ* is a subset of Ŝj.) Persons of skill in the art will recognize that sets of possible pairs may be stored in many different types of data structures.
Further still, the intersection preimages and the inclusion preimages are computed for each tree-local feature group j represented by the tree-local indices in {tilde over (M)}. The intersection preimage for a single tree-local feature group is, in and of itself, a set of sets of features. Therefore, the intersection preimages for multiple tree-local feature groups collectively constitute a set of sets of sets. Similarly, the inclusion preimages for multiple tree-local feature groups collectively constitute a set of sets of sets. Persons of skill in the art will recognize that sets of sets of sets may be stored in many different types of data structures (e.g., lists of lists of lists in Python, arrays (or hashes) of arrays (or hashes) of arrays (or hashes) in Perl, etc.).
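The pair enumeration and one possible nested storage layout can be sketched as follows; the group definitions and names are illustrative, and the printed counts simply confirm that each pair (Z, W) with Z ⊆ W ⊆ {tilde over (M)}\{j} and each pair ({circumflex over (Z)}*, Ŵ*) with {circumflex over (Z)}* ⊆ Ŵ* ⊆ Ŝj is enumerated exactly once per group index j.

```python
from itertools import chain, combinations

def powerset(iterable):
    s = list(iterable)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(s, r) for r in range(len(s) + 1))]

M_tilde = {1, 2, 3, 4}                              # illustrative tree-local group indices
S_hat = {1: {1}, 2: {1, 2}, 3: {1, 2}, 4: {1}}      # group-local feature indices per group

# (Z, W) pairs with Z ⊆ W ⊆ M~ \ {j}, enumerated once per group index j
outer_pairs = {j: [(Z, W) for W in powerset(M_tilde - {j}) for Z in powerset(W)]
               for j in M_tilde}
# (Z*^, W*^) pairs with Z*^ ⊆ W*^ ⊆ S^_j, enumerated once per group index j
inner_pairs = {j: [(Zs, Ws) for Ws in powerset(S_hat[j]) for Zs in powerset(Ws)]
               for j in M_tilde}

# One possible nested layout for the precomputed preimages: j -> {(args) -> list of sets}
inc_store = {j: {} for j in M_tilde}
int_store = {j: {} for j in M_tilde}

print(len(outer_pairs[2]), len(inner_pairs[2]))     # 27 9 for this example
```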
Once the preliminary computations listed above are complete, the Owen values for the features used by a given decision tree can be computed for each leaf in the given decision tree via a number of nested loops as explained below. Note that the nested loops implement an approach in which iterating through the possible values of (i) the leaf a and (ii) the feature i is performed within a fourth loop that is nested within three loops that iterate over other types of values. By contrast, a naïve implementation of the rewritten version of the formula for determining Owen-like values described above would involve iterating through the possible values of the leaf a and the possible values of the feature i in loops that are not nested as deeply, such that many redundant calculations would be made. By iterating through the possible values of the leaf a and the possible values of the feature i within the fourth loop in conjunction with utilizing the results of the precomputations, this approach helps to achieve greater computational efficiency than the naïve implementation because redundant calculations involved in the naïve implementation are avoided.
In this approach, a vector of respective tree-local Owen values for the features used in the given decision tree for each leaf in the given decision tree may be initialized with each respective tree-local Owen value initially set to zero. Next, a first loop may be configured to iterate over each possible value of W (recall that W represents a subset of the set of tree-local indices {tilde over (M)} of the tree-local feature groups for the given tree, where each tree-local index represents the respective tree-local feature group to which it corresponds in the local partition {circumflex over (P)}={Ŝ1, Ŝ2, . . . , Ŝ{tilde over (m)}}).
A second loop may be nested within the first loop and may be configured to iterate over each possible value of Z (recall that Z represents a subset of W).
A third loop may be nested within the second loop and may be configured to iterate over each possible value of j that is not included in W (recall that j represents a tree-local index of a tree-local feature group).
A fourth loop may be nested within the third loop and may be configured to iterate over each possible value of Ŵ* (recall that Ŵ* is a subset of the tree-local feature group Ŝj, where group-local indices are used to refer to features). Within the fourth loop, the inclusion preimage for the current values of W, Ŵ*, and j (i.e., the preimage Inc(W, Ŵ*, j, {circumflex over (P)})) may be retrieved from a data structure in which the inclusion preimages were stored during the preliminary computations. The retrieved inclusion preimage may be stored in the form of a variable that is defined at least within the scope of the fourth loop.
A fifth loop may be nested within the fourth loop and may be configured to iterate over each possible value of {circumflex over (Z)}* (recall that {circumflex over (Z)}* is a subset of Ŵ*). Within the fifth loop, the intersection preimage for the current values of Z, {circumflex over (Z)}*, and j (i.e., the preimage Int(Z, {circumflex over (Z)}*, j, {circumflex over (P)})) may be retrieved from a data structure in which the intersection preimages were stored during the preliminary computations. The retrieved intersection preimage may be stored in the form of a variable that is defined at least within the scope of the fourth loop.
Returning to the fourth loop, the Shapley value weight wouter for the current value of W and the current value of Z may be retrieved from a data structure in which the wouter Shapley value weights were stored during the preliminary computations. The retrieved Shapley value weight wouter may be stored in the form of a variable that is defined at least within the scope of the fourth loop. In addition, the Shapley value weight winner+ for the current values of Ŵ*, {circumflex over (Z)}*, and j may be retrieved from a data structure in which the winner+ Shapley value weights were stored during the preliminary computations. The retrieved Shapley value weight winner+ may be stored in the form of a variable that is defined at least within the scope of the fourth loop. Furthermore, the Shapley value weight winner− for the current values of Ŵ*, {circumflex over (Z)}*, and j may be retrieved from a data structure in which the winner− Shapley value weights were stored during the preliminary computations. The retrieved Shapley value weight winner− may be stored in the form of a variable that is defined at least within the scope of the fourth loop.
Furthermore, within the fourth loop, a plurality of vectors in the matrix of marginal path expectations that correspond to the intersection preimage for the current values of Z and {circumflex over (Z)}* are added together to form an intermediate vector. The intermediate vector may be stored in the form of a variable that is defined at least within the scope of the fourth loop. In addition, a sum-tracker vector configured to store, for each leaf in the given tree, a running sum of values derived from the intermediate vector may be initialized with each sum to be stored therein set to zero. The sum-tracker vector may be in scope at least throughout the fourth loop.
To illustrate how the intermediate vector relates to the rewritten version of the formula for determining Owen-like values described above, let σ(a; Z, {circumflex over (Z)}*, W, Ŵ*, j) represent the triple sum that appears in the rewritten version, and let the intermediate vector be represented by v(⋅; Z, {circumflex over (Z)}*, j, {circumflex over (P)}). In the expression v(⋅; Z, {circumflex over (Z)}*, j, {circumflex over (P)}), “⋅” represents a placeholder for a leaf, and Z and {circumflex over (Z)}* represent the current values of Z and {circumflex over (Z)}*, respectively. Each entry in the intermediate vector v(⋅; Z, {circumflex over (Z)}*, j, {circumflex over (P)}) corresponds to a respective leaf in the given tree. The intermediate vector v(⋅; Z, {circumflex over (Z)}*, j, {circumflex over (P)}) can be used to evaluate the triple sum σ(a; Z, {circumflex over (Z)}*, W, Ŵ*, j) with increased efficiency if the triple sum is rewritten in terms of v, as illustrated below.
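The displayed definitions of the triple sum σ and the intermediate vector v are not reproduced in this text. Based only on the surrounding description (and with the same caveats as before regarding notation, including the final partition argument), a plausible reconstruction of the rewriting is:

$$
\sigma(a; Z, \hat{Z}^*, W, \hat{W}^*, j) \;=\; \sum_{W' \in \mathrm{Inc}(W,\, \hat{W}^*,\, j,\, \hat{P})} \;\; \sum_{Z' \in \mathrm{Int}(Z,\, \hat{Z}^*,\, j,\, \hat{P})} \;\; \sum_{b \in \varepsilon^{-1}(a,\, W')} mp(b, -Z'),
$$

$$
v(b; Z, \hat{Z}^*, j, \hat{P}) \;=\; \sum_{Z' \in \mathrm{Int}(Z,\, \hat{Z}^*,\, j,\, \hat{P})} mp(b, -Z'),
\qquad
\sigma(a; Z, \hat{Z}^*, W, \hat{W}^*, j) \;=\; \sum_{W' \in \mathrm{Inc}(W,\, \hat{W}^*,\, j,\, \hat{P})} \;\; \sum_{b \in \varepsilon^{-1}(a,\, W')} v(b; Z, \hat{Z}^*, j, \hat{P}).
$$

Because the innermost sum over Z′ does not depend on W′ or b, it can be factored out and evaluated once as the intermediate vector v, which is precisely the saving described in the next paragraph.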
Utilizing the intermediate vector v to evaluate the triple sum (e.g., within the sixth loop, as described below) obviates the recalculation of identical sums that would occur in a naïve implementation.
A sixth loop may be nested within the fourth loop and may be configured to iterate through each possible value of W (recall that, in this context, W represents a set of features that are represented by their corresponding tree-local indices) that is included in the inclusion preimage retrieved in the fourth loop (recall that the inclusion preimage is a set of sets of features). Within the sixth loop, the path preimages ε−1(a,W) for each leaf a in the given tree for the current value of W may be retrieved from a data structure in which the path preimages were stored during the preliminary computations. The retrieved path preimages may be stored in the form of a variable that is defined at least within the scope of the sixth loop. Furthermore, for each leaf a, the sum of the elements in the intermediate vector that correspond to the retrieved path preimage ε−1(a,W) for the current value of W may be added to the sum for that leaf in the sum-tracker vector. The entries (e.g., sums) in the sum-tracker vector are complete when the sixth loop has completed iterating through the possible values of W that are included in the retrieved inclusion preimage. Each entry in the sum-tracker vector corresponds to a respective leaf in the given tree. Specifically, the sum-tracker vector is the vector of triple sums σ(⋅; Z, {circumflex over (Z)}*, W, Ŵ*, j), where ⋅ is a placeholder for a leaf.
Returning to the fourth loop, a first condition specifying that the feature represented by the global feature index i for the current feature (i.e., the feature for which an Owen value is currently being determined) is included in {circumflex over (Z)}* may be evaluated. If the first condition is satisfied, each tree-local Owen value for the current feature for each realizable leaf may be increased by an amount equal to the corresponding sum for the realizable leaf in the sum-tracker vector multiplied by the winner+ for the current values of Ŵ*, {circumflex over (Z)}*, and j. Furthermore, a second condition specifying that the feature represented by the global feature index i for the current feature is included in the complement of Ŵ* may be evaluated. If the second condition is satisfied, each tree-local Owen value for the current feature for the realizable leaf may be increased by an amount equal to the corresponding sum for the realizable leaf in the sum-tracker vector multiplied by the winner− for the current values of Ŵ*, {circumflex over (Z)}*, and j. The respective tree-local Owen values may be stored in the vector of respective tree-local Owen values mentioned above.
The respective operations (e.g., increasing one or more tree-local Owen values) that are performed in response to determining the first condition and/or the second condition are satisfied allow the respective tree-local Owen values for a feature to be updated for multiple leaves during a single iteration of the fourth loop. This increases computational efficiency because the actions of determining whether first and second conditions are satisfied and executing the blocks of programming code that are contingent on the first and second conditions do not have to be repeated for each of the features.
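The nested-loop procedure above can be summarized in code. The following is a minimal structural sketch only: it assumes that the precomputed stores described above (path preimages, marginal path expectations, preimages, and weight tables) are available as ordinary Python dictionaries keyed by frozensets, it uses one consistent nesting (the sixth loop inside the fifth), all names are illustrative rather than taken from the source, and the outer weight is applied in the update step with the subtrahend term subtracted explicitly, which may differ from how the signs are folded into the stored weights in an actual implementation.

```python
from itertools import chain, combinations

def subsets(s):
    """All subsets of s as frozensets."""
    s = list(s)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(s, r) for r in range(len(s) + 1))]

def tree_local_owen_values(M_tilde, S_hat, group_loc_to_glob, leaves,
                           mp, path_preimage, inc, intersect,
                           w_outer, w_inner_pos, w_inner_neg, global_features):
    """Structural sketch of the nested loops described above.

    M_tilde: set of tree-local group indices {1, ..., m~}.
    S_hat: dict j -> set of group-local feature indices in S^_j.
    group_loc_to_glob: dict j -> {group-local index -> global feature index}.
    leaves: list of leaf identifiers for the given tree.
    mp: dict (leaf b, frozenset Z') -> marginal path expectation mp(b, -Z').
    path_preimage: dict (leaf a, frozenset W') -> iterable of leaves b in eps^-1(a, W').
    inc / intersect: dicts (outer set, starred set, j) -> list of feature sets
        (the precomputed inclusion / intersection preimages).
    w_outer: dict (|W|, |Z|, j) -> outer Shapley weight.
    w_inner_pos / w_inner_neg: dicts (W*^, Z*^, j) -> inner Shapley weights.
    """
    owen = {a: {i: 0.0 for i in global_features} for a in leaves}
    for W in subsets(M_tilde):                                # first loop
        for Z in subsets(W):                                  # second loop
            for j in M_tilde - W:                             # third loop
                for W_hat in subsets(S_hat[j]):               # fourth loop
                    wo = w_outer[(len(W), len(Z), j)]
                    W_star = {group_loc_to_glob[j][k] for k in W_hat}
                    for Z_hat in subsets(W_hat):              # fifth loop
                        Z_star = {group_loc_to_glob[j][k] for k in Z_hat}
                        # intermediate vector v(.; Z, Z*^, j): one entry per leaf
                        v = {b: sum(mp[(b, Zp)] for Zp in intersect[(Z, Z_hat, j)])
                             for b in leaves}
                        # sum-tracker: the triple sum, one entry per leaf
                        sums = {a: 0.0 for a in leaves}
                        for Wp in inc[(W, W_hat, j)]:         # sixth loop
                            for a in leaves:
                                sums[a] += sum(v[b] for b in path_preimage[(a, Wp)])
                        # update Owen values for every leaf and feature at once;
                        # the outer weight is applied here and the subtrahend term
                        # is subtracted explicitly (a simplifying assumption)
                        for i in global_features:
                            if i in Z_star:
                                for a in leaves:
                                    owen[a][i] += wo * w_inner_pos[(W_hat, Z_hat, j)] * sums[a]
                            if i not in W_star:
                                for a in leaves:
                                    owen[a][i] -= wo * w_inner_neg[(W_hat, Z_hat, j)] * sums[a]
    return owen
```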
Turning to the example process 700 shown in the accompanying figure, example operations for computing a score and corresponding contribution values for an input data record will now be described.
Prior to commencement of the example process 700, a model object for a data science model that is to be deployed by an entity for use in making a particular type of decision may be trained. In general, this model object may comprise any model object that is configured to (i) receive an input data record comprising a set of actual parameters that are related to a respective individual (e.g., person) and map to a particular set of formal parameters (which may also be referred to as the model object's “features” or the model object's “predictors”), (ii) evaluate the received input data record, and (iii) based on the evaluation, output a score that is then used to make the given type of decision with respect to the respective individual. Further, the model object that is trained may take any of various forms, which may depend on the particular data science model that is to be deployed.
For instance, as one possibility, the model object may comprise a model object for a data science model to be utilized by an entity to decide whether or not to extend a particular type of service (e.g., a loan, a credit card account, a bank account, or the like) to a respective individual within a population. In this respect, the set of formal parameters for the model object may comprise data variables that are predictive of whether or not the entity should extend the particular type of service to a respective individual (e.g., variables that provide information related to credit score, credit history, loan history, work history, income, debt, assets, etc.), and the score may indicate a likelihood that the entity should extend the particular type of service to the respective individual, which may then be compared to a threshold value in order to reach a decision of whether or not to extend the particular type of service to the respective individual.
The function of training the model object may also take any of various forms, and in at least some implementations, may involve applying a machine-learning process to a training dataset that is relevant to the particular type of decision to be rendered by the data science model (e.g., a set of historical data records for individuals that are each labeled with an indicator of whether or not a favorable decision should be rendered based on the historical data record). In this respect, the machine-learning process may comprise any of various machine learning techniques, examples of which may include decision-tree techniques, among various other possibilities.
As shown in the first block of the example process 700, the example process 700 includes receiving a request to compute a score for an input data record, where the input data record comprises a group of actual parameters that map to the set of features that the trained data science model is configured to receive as input.
As shown in block 704, the example process 700 further includes partitioning the set of features into a plurality of global feature groups based on dependencies between the features.
The features may be partitioned into two or more global feature groups based on dependencies (e.g., based on the mutual information shared between the features), where each global feature group comprises at least one feature. In this respect, any technique now known or later developed to partition the features into groups may be used based on such dependencies, including, but not limited to, any of various possible clustering techniques.
For instance, as one possibility, the features may be partitioned into global feature groups utilizing a clustering technique that is based on Maximal Information Coefficient (MIC) values, which are a regularized (e.g., normalized) version of mutual information that provide measures of the dependency strengths between different pairs of features. For example, an MIC value for a pair of features that is near or at 0 indicates that there is little or no dependency between the features in the pair (e.g., the features in the pair are independent of one another), whereas an MIC value for a pair of features that is at or near 1 indicates that there is a strong dependency between the features in the pair (e.g., the features are dependent on one another).
In order to cluster based on these MIC values, a computing platform may begin by determining a respective MIC value for each possible pair of features based on an analysis of a training dataset (e.g., the training dataset that was used to train the data science model). Next, the computing platform may (i) translate the MIC values into dissimilarity values (e.g., by taking the respective complements of the MIC values, where the respective complement of a given MIC value may be determined by subtracting the given MIC value from one) and then (ii) input those dissimilarity values into an agglomerative clustering algorithm that functions to cluster the features in a “bottom up” manner by initially treating each feature as a single-feature cluster, and then during each iteration of the algorithm, merging a selected pair of clusters (e.g., the pair of clusters having the smallest intergroup dissimilarity) into a combined cluster. Such an algorithm may continue to iterate until all of the features have been merged into one combined cluster, and the result is a dendrogram (also referred to as a partition tree) that encodes the strength of the dependencies between the features in terms of a hierarchical tree of clusters, where the height of the line that connects two lower-level clusters represents the dissimilarity between the lower-level clusters. After the dendrogram has been produced, the computing platform may apply a threshold dissimilarity value to the dendrogram in order to cut the tree at a given height and thereby define a particular set of feature clusters that satisfy the threshold dissimilarity value, which may then be utilized as the global feature groups that are defined based on dependencies.
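As an illustration of the kind of MIC-based grouping described above, the following sketch assumes that a symmetric matrix of pairwise MIC values has already been estimated elsewhere from the training dataset; it converts the MIC values to dissimilarities, applies agglomerative clustering, and cuts the resulting dendrogram at a threshold. Function and parameter names are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def mic_based_groups(mic_matrix, dissimilarity_threshold=0.5):
    """mic_matrix: symmetric (n_features x n_features) array of pairwise MIC values.
    Returns a dict mapping a group label to a list of feature indices."""
    dissim = 1.0 - np.asarray(mic_matrix)      # complement of MIC = dissimilarity
    np.fill_diagonal(dissim, 0.0)              # a feature is identical to itself
    condensed = squareform(dissim, checks=False)
    Zlink = linkage(condensed, method='average')   # agglomerative ("bottom up") clustering
    labels = fcluster(Zlink, t=dissimilarity_threshold, criterion='distance')  # cut the tree
    groups = {}
    for feature_idx, label in enumerate(labels):
        groups.setdefault(label, []).append(feature_idx)
    return groups

# Toy example with 4 features: features 0 and 1 strongly dependent,
# features 2 and 3 strongly dependent, weak dependence across the pairs.
mic = np.array([[1.0, 0.9, 0.1, 0.1],
                [0.9, 1.0, 0.1, 0.1],
                [0.1, 0.1, 1.0, 0.8],
                [0.1, 0.1, 0.8, 1.0]])
print(mic_based_groups(mic))   # e.g., {1: [0, 1], 2: [2, 3]} (labels may differ)
```

The threshold plays the role of the height at which the dendrogram is cut; raising it yields fewer, larger global feature groups.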
As another possibility, the computing platform may group the model object's features together utilizing a clustering technique that is based on principal component analysis (PCA) (e.g., the PROC VARCLUS clustering technique developed by SAS®). According to one such PCA-based clustering technique, the computing platform may begin by assigning each of the model object's features to a single cluster, generating a covariance matrix for the model object's features based on an analysis of a training dataset (e.g., the training dataset that was used to train the model object), and then utilizing the generated covariance matrix to split the single cluster of features into two clusters of features. The computing platform may then iteratively repeat this process in a “top down” manner for each resulting cluster until all clusters include only a single feature, which forms a tree structure representing the relationships between the features. In turn, the computing platform may then combine clusters of features within the tree structure together into a group if the correlation between the features in the clusters is above a threshold. However, it should be understood that this is just one possible example of a PCA-based clustering technique, and that other PCA-based clustering techniques could also be utilized to group the model object's features together based on dependencies.
The computing platform could also utilize other clustering techniques to group the model object's features together based on their dependencies.
Further details regarding these and other techniques for grouping a model object's features together based on dependencies can be found in (i) U.S. patent application Ser. No. 16/868,019, which was filed on May 6, 2020 and is entitled “SYSTEM AND METHOD FOR UTILIZING GROUPED PARTIAL DEPENDENCE PLOTS AND SHAPLEY ADDITIVE EXPLANATIONS IN THE GENERATION OF ADVERSE ACTION REASON CODES,” and (ii) U.S. patent application Ser. No. 17/322,828, which was filed on May 17, 2021 and is entitled “SYSTEM AND METHOD FOR UTILIZING GROUPED PARTIAL DEPENDENCE PLOTS AND GAME-THEORETIC CONCEPTS AND THEIR EXTENSIONS IN THE GENERATION OF ADVERSE ACTION REASON CODES,” each of which is incorporated herein by reference in its entirety, and (iii) the paper entitled “Stability theory of game-theoretic group feature explanations for machine learning models” by Miroshnikov et al. (2024), which can be found at https://arxiv.org/abs/2102.10878v5 and is incorporated herein by reference in its entirety.
As shown in block 706, the example process 700 further includes inputting the group of actual parameters into the trained data science model. The trained data science model comprises an ensemble of decision trees wherein each individual decision tree in the ensemble is symmetric, each individual decision tree in the ensemble is configured to receive a respective subset of the features as input, and, within each individual decision tree, internal nodes that are positioned in a same level designate a same splitting criterion based on a same feature selected from the respective subset of features. The trained data science model may be, for example, a CatBoost model.
As shown in block 708, the example process 700 further includes, for each individual decision tree in the ensemble, (i) assigning each individual feature in the subset of features for the individual decision tree to a corresponding local feature group that is a subset of a given global feature group that includes the individual feature, (ii) identifying a respective leaf such that the actual parameters satisfy a series of splitting conditions for edges that connect nodes in a respective path from a root of the individual decision tree to the respective leaf, and (iii) based on the corresponding local feature groups, determining a set of respective individual contribution values for the respective leaf. Each of the respective individual contribution values maps to a respective feature found in the respective subset of features of the individual decision tree. The respective individual contribution values and the respective overall contribution values may be, for example, Owen values or Banzhaf-Owen values.
As shown in block 710, the example process 700 further includes, for each individual feature in the set of features, computing a respective overall contribution value based on the respective global feature group and based on a sum of the respective individual contribution values that map to that individual feature. This may be achieved, for example, by summing the local contribution values for the individual feature across the trees in the ensemble.
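This aggregation step can be illustrated with a short sketch. The function below simply sums, for each global feature index, the per-tree contribution values produced for the leaves into which the input data record fell; the names and numeric values are illustrative.

```python
# Hypothetical sketch: aggregating per-tree contribution values (keyed by
# global feature index) into overall contribution values for the ensemble.
def overall_contributions(per_tree_contributions, global_features):
    """per_tree_contributions: list (one entry per tree) of dicts mapping a global
    feature index to that tree's contribution value for the selected leaf."""
    totals = {i: 0.0 for i in global_features}
    for tree_contrib in per_tree_contributions:
        for i, value in tree_contrib.items():
            totals[i] += value
    return totals

print(overall_contributions([{2: 0.10, 4: -0.05}, {2: 0.03, 7: 0.20}],
                            global_features=[2, 4, 7]))
# {2: 0.13, 4: -0.05, 7: 0.2}
```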
As shown in block 712, the example process 700 further includes computing, via the trained data science model, the score for the input data record. For example, the score for the input data record may be determined based on respective labels (e.g., scores) associated with the leaves in the ensemble into which the input data record falls.
Further, as shown in block 714, the example process may further include identifying at least one reason code for the score based on the respective overall contribution values for the individual features in the set of features.
Still further, as shown in block 716, the example process 700 may include transmitting the score and the at least one reason code in response to the request.
Turning to the example illustrated in the accompanying figures, the notations described above may be further explained with reference to a given decision tree in the ensemble and an example global partition.
To facilitate rapid translation of tree-local feature indices to global feature indices, let a mapping (depicted by Iloc:glob) be defined as shown in the selection 811. The mapping may be stored, for example,
as an array (e.g., as shown in the selection 811) such that a position of a given value stored in the array specifies a tree-local feature index, while the given value at the position specifies a corresponding global feature index. For instance, as shown in the selection 811, the value “2” at the first position in the array indicates that a tree-local feature index of one maps to a global feature index of two. Similarly, the value “4” at the second position in the array indicates that a tree-local feature index of two maps to a global feature index of four. This pattern holds for the other values shown in the array.
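A minimal sketch of this array representation is shown below. Only the first two values (2 and 4) come from the text; the remaining entries are illustrative placeholders.

```python
# Hypothetical sketch of the array form of the mapping described above:
# position (1-based tree-local index) -> stored value (global feature index).
loc_to_glob = [2, 4, 5, 7, 8, 9]   # first two values from the text; the rest are placeholders

def to_global(tree_local_index):
    return loc_to_glob[tree_local_index - 1]   # Python lists are 0-based

print(to_global(1))  # 2: tree-local feature index one maps to global feature index two
print(to_global(2))  # 4: tree-local feature index two maps to global feature index four
```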
Collectively, as shown in the selection 812, the tree-local features {tilde over (X)}1, {tilde over (X)}2, {tilde over (X)}3, {tilde over (X)}4, {tilde over (X)}5, {tilde over (X)}6 may be depicted by the vector {tilde over (X)}. However, to further facilitate efficient computation for the operations discussed below, let the tree-local features be represented by their respective tree-local feature indices (depicted by the set K), as shown in the selection 813.
Turning again to the global feature groups S1, S2, S3, S4 in the partition P, note that some of the global feature groups include global features that the given decision tree is not configured to receive as input. In the selection 814, the global feature indices of the global features that the given decision tree is not configured to receive as input are shown in strikethrough. To facilitate efficiency, the global feature indices of these unused global features may be excluded from further analysis for the purpose of determining Owen values in the context of the given tree, thereby forming subsets of the global feature groups (e.g., as shown in the selection 814). Furthermore, to further facilitate efficient computation for the operations discussed below, the global feature indices included in the subsets of the global feature groups may be converted (e.g., via the mapping depicted by Iloc:glob) to their corresponding tree-local feature indices. The tree-local feature groups shown in the selection 815 are the subsets of the global feature groups {tilde over (S)}1, {tilde over (S)}2, {tilde over (S)}3, {tilde over (S)}4 shown in the selection 814 after the global feature indices have been converted to their corresponding tree-local indices.
The tree-local partition (depicted by {tilde over (P)}) of tree-local features into tree-local feature groups can be transcribed in the forms shown in the selection 821.
To facilitate rapid translation of tree-local feature groups to global feature groups, let a mapping (depicted by I:glob) be defined as shown in the selection 822. The mapping may be stored, for example, as an array (e.g., as shown in the selection 822) such that a position of a given value stored in the array specifies a tree-local group index, while the given value at the position specifies a corresponding global group index. For instance, as shown in the selection 822, the value “1” at the first position in the array indicates that a tree-local group index of one maps to a global group index of one. Similarly, the value “2” at the second position in the array indicates a tree-local group index of two maps to a global group index of two. This pattern holds for the other values shown in the array. Note that, although the tree-local group indices in this example match their corresponding global group indices, tree-local group indices and global group indices might not match each other in other examples (e.g., examples in which a given decision tree in the ensemble does not include at least one global feature from a global feature group, such that the global feature group is not represented in the given decision tree).
In addition, to further facilitate efficient computation for the operations discussed below, let the tree-local features in each given tree-local feature group be represented in the form of corresponding group-local features that are local to the given tree-local feature group. For instance, as shown in the selection 823, let the tree-local feature {tilde over (X)}1 included in the tree-local feature group {tilde over (S)}1 be depicted by the group-local feature {circumflex over (X)}1. Similarly, as shown in the selection 824, let the tree-local feature {tilde over (X)}2 and the tree-local feature {tilde over (X)}3 included in the tree-local feature group {tilde over (S)}2 be depicted by the group-local features {circumflex over (X)}1, {circumflex over (X)}2. In addition, as shown in the selection 825, let the tree-local feature {tilde over (X)}4 and the tree-local feature {tilde over (X)}5 included in the tree-local feature group {tilde over (S)}3 be depicted by the group-local features {circumflex over (X)}1, {circumflex over (X)}2. Also, as shown in the selection 826, let the tree-local feature {tilde over (X)}6 included in the tree-local feature group {tilde over (S)}4 be depicted by the group-local feature {circumflex over (X)}1.
The tree-local partition, as represented using group-local feature indices (depicted by {circumflex over (P)}), can be transcribed in the forms shown in the selection 838.
With the notations for the global feature indices, the global group indices, the tree-local feature indices, the tree-local group indices, and the group-local feature indices for this example explained above, the precomputation of the possible values of the iterator variables used in the rewritten formula may now be illustrated.
Turning to the enumeration of the outer iterator variables, the possible subsets of the set {tilde over (M)} of tree-local group indices are the possible values of W. The possible subsets for each given value of W are the possible values of Z for the given W.
Turning to the enumeration of the inner iterator variables for an example tree-local feature group Ŝ2, the selections 851a-c depict the possible subsets of Ŝ2 using the group-local feature indices specific to the group Ŝ2. The selections 852a-c depict the possible subsets of Ŝ2 after the group-local feature indices have been converted to tree-local feature indices. The selections 853a-c depict the possible subsets of Ŝ2 after the tree-local feature indices have been converted to global feature indices. With the feature indices thus converted, the possible subsets are the possible values of Ŵ*. The selections 854a-e depict the possible subsets of each respective value of Ŵ* (which are the possible values of {circumflex over (Z)}*).
Note that the example above represents precomputation that can be advantageously performed before any input data record is evaluated. This, in turn, provides for the efficient evaluation of Owen values for each feature across each leaf in each tree of a model. Similar to the manner described above with regard to Shapley values, Owen values can be computed on a leaf-by-leaf basis rather than on an input-data-record-by-input-data-record basis. In particular, the formula for hi∈Ŝj depends on the input data record only through the leaf a into which the input data record falls, so the Owen values for each leaf of each tree can be computed and stored in advance and then simply retrieved and summed when a given input data record is scored.
Many advantages result from applying the techniques, structures, and systems described above for computing Owen values. For instance, certain structures described herein (e.g., path preimages, intersection preimages, and inclusion preimages) serve as new data structures that, when applied as described herein, facilitate increased computational efficiency that cannot be achieved by previous methods for determining Owen values. This increased computational efficiency allows Owen values to be computed rapidly enough for practical use in industry applications, where an explanation of a model's classification decision is needed with very little delay. By contrast, the inefficiency of previous approaches for determining Owen values would result in unacceptable levels of lag time for even large cloud computing systems, thereby rendering such approaches impractical for industry use for ensembles that have acceptable levels of classification accuracy (e.g., because ensembles that achieve sufficient levels of classification accuracy typically use hundreds or even thousands of trees and many input features). In addition, the techniques, structures, and systems described herein allow the computational load for determining Owen values to be distributed over a period of time before deployment of the ensemble. Distribution of the computational load of determining Owen values over such a period of time before deployment is not possible with previous methods for determining Owen values. Thus, the techniques, structures, and systems described herein both reduce the computational load and facilitate distribution of the computational load over a time period that is unavailable for computation under previous approaches.
Turning now to the example computing platform 900 shown in the accompanying figure, the computing platform 900 may include one or more processors 902, data storage 904, and one or more communication interfaces 906, among other possible components.
For instance, the one or more processors 902 may comprise one or more processor components, such as one or more central processing units (CPUs), graphics processing units (GPUs), application-specific integrated circuits (ASICs), digital signal processors (DSPs), and/or programmable logic devices such as field programmable gate arrays (FPGAs), among other possible types of processing components. In line with the discussion above, it should also be understood that the one or more processors 902 could comprise processing components that are distributed across a plurality of physical computing devices connected via a network, such as a computing cluster of a public, private, or hybrid cloud.
In turn, data storage 904 may comprise one or more non-transitory computer-readable storage mediums, examples of which may include volatile storage mediums such as random-access memory, registers, cache, etc. and non-volatile storage mediums such as read-only memory, a hard-disk drive, a solid-state drive, flash memory, an optical-storage device, etc. In line with the discussion above, it should also be understood that data storage 904 may comprise computer-readable storage mediums that are distributed across a plurality of physical computing devices connected via a network, such as a storage cluster of a public, private, or hybrid cloud that operates according to technologies such as AWS Elastic Compute Cloud, Simple Storage Service, etc.
As shown in the figure, the data storage 904 may also be provisioned with software components (e.g., program instructions that are executable by the one or more processors 902) that enable the computing platform 900 to carry out the functions disclosed herein.
The one or more communication interfaces 906 may comprise one or more interfaces that facilitate communication between computing platform 900 and other systems or devices, where each such interface may be wired and/or wireless and may communicate according to any of various communication protocols, examples of which may include Ethernet, Wi-Fi, serial bus (e.g., Universal Serial Bus (USB) or Firewire), cellular network, and/or short-range wireless protocols, among other possibilities.
Although not shown, the computing platform 900 may additionally include or have an interface for connecting to one or more user-interface components that facilitate user interaction with the computing platform 900, such as a keyboard, a mouse, a trackpad, a display screen, a touch-sensitive interface, a stylus, a virtual-reality headset, and/or one or more speaker components, among other possibilities.
It should be understood that computing platform 900 is one example of a computing platform that may be used with the examples described herein. Numerous other arrangements are possible and contemplated herein. For instance, other computing systems may include additional components not pictured and/or more or less of the pictured components.
This disclosure makes reference to the accompanying figures and several examples. One of ordinary skill in the art should understand that such references are for the purpose of explanation only and are therefore not meant to be limiting. Part or all of the disclosed systems, devices, and methods may be rearranged, combined, added to, and/or removed in a variety of manners without departing from the true scope and spirit of the present invention, which will be defined by the claims.
Further, to the extent that examples described herein involve operations performed or initiated by actors, such as “humans,” “curators,” “users” or other entities, this is for purposes of example and explanation only. The claims should not be construed as requiring action by such actors unless explicitly recited in the claim language.
The current application is a continuation in part of U.S. patent application Ser. No. 18/499,974, filed on Nov. 1, 2023, which is hereby incorporated by reference in its entirety for all purposes.
| | Number | Date | Country |
|---|---|---|---|
| Parent | 18499974 | Nov 2023 | US |
| Child | 18815524 | | US |