The exemplary embodiment relates to automated classification based on a set of features and finds particular application in connection with a system and method for modifying the set of features to change the output class, without modifying the classifier.
Increasingly, processes governing human lives are based, at least in part, on an automatic decision step. A feature vector is generated which encodes attributes of a given person, and a binary classifier is used to output a decision which influences the final outcome. In some cases, these decisions do not have a significant impact on the person, such as a decision on which movies or books to recommend to a person. However, the consequences can be more severe in some cases. For example, decision algorithms are used to determine whether to provide a mortgage to a person, based on age, financial details, etc.; whether to grant bail to an accused person, based on factors such as the risk of flight, the severity of the alleged crime, the likelihood the person could pose a danger to others, and so forth; or to determine the length of a sentence. In some cases, these systems are beneficial. For example, they can reduce the number of those incarcerated based on their inability to post bail. In other cases, however, they may have more severe consequences. In some cases, the decision is computed with a proprietary model and the way in which risk factors are taken into account is not clearly understood. For example, a longer sentence could be given to a person because a similar population is predicted to have a greater risk of reoffending. This puts people into situations where they can neither know why a decision was reached nor modify the outcome.
Current solutions like open-sourcing the decision model, or adding justification of the decisions can provide a level of transparency, but do not change the decision.
Given a binary classifier trained to classify a feature vector as being in a given class or not, it is desirable to provide a mechanism by which the feature vector can be modified, in a small way, in order to obtain a modified feature vector that is classified by the classifier as being in the opposite class. This can be achieved by minimizing a distance between the original feature vector and the modified feature vector. Methods for solving this have been proposed, for different classifiers, with the aim of active learning or adversarial learning. One approach considers the case of the Naïve Bayes classifier, and proposes a learning strategy that takes into account the presence of an adversary using tools from game theory. See Dalvi, et al., “Adversarial classification,” Proc. 10th ACM SIGKDD Int'l Conf. on Knowledge discovery and data mining, pp. 99-108, 2004. The problem has also been solved for linear models, multi-layer perceptrons, and various kernels of SVM.
Various projected gradient descent methods have also been proposed to obtain realistic samples (which lie in a dense space with respect to the training data). See Biggio, et al., “Evasion attacks against machine learning at test time,” Joint European Conf. on Machine Learning and Knowledge Discovery in Databases, pp. 387-402, 2013 (hereinafter, Biggio 2013). In Kantchelian, et al., “Evasion and Hardening of Tree Ensemble Classifiers,” Int'l Conf. on Machine Learning (ICML), 2016, the problem of evading a decision tree is addressed. Two solutions are proposed: an exact one, relying on integer linear programming techniques, and a heuristics-based one using iterative coordinate descent. Both solutions are given for the case where the distance to be minimized is an lp norm. This precludes cases where the features are meaningful attributes (instead of, say, pixels), some of which cannot be changed, and where the cost may vary greatly.
A system and method are provided which enable a user to influence the decision process by defining cost functions for individual features which reflect the user's circumstances and preferences.
In accordance with one aspect of the exemplary embodiment, a method for guiding users in an automated decision-making environment includes receiving an initial feature vector which is classified with a classifier model, the classification being a second of a plurality of classes. Provision is made for a user to define costs for independently modifying feature values for at least some features in the initial feature vector. Subspaces are identified in a feature space in which the classifier model classifies an input feature vector in a first of the plurality of classes. With a cost function which takes into account the user-defined costs, a modified feature vector is identified in one of the identified subspaces which optimizes the cost function. The modified feature vector or information based thereon is output.
At least one of the steps of the method may be performed with a processor.
In accordance with another aspect of the exemplary embodiment, a system for guiding users in an automated decision-making environment includes a classifier component which classifies a feature vector with a classifier model and outputs a classification for an input feature vector. A graphical user interface generator provides for a user to define costs for modifying feature values for at least some of the features in a feature vector for which the classification is a second of a set of classes. A mapping component identifies subspaces in a feature space in which the classifier model classifies an input feature vector in a first of the set of classes. A modification component identifies a modified feature vector in one of the identified subspaces which optimizes a cost function with a subset of the user-defined costs. An output component outputs the modified feature vector or information based thereon. A processor implements the classifier component, graphical user interface generator, mapping component, modification component, and output component.
In accordance with another aspect of the exemplary embodiment, a method for guiding users in an automated decision-making environment includes identifying leaves of decision trees of a random forest classifier model which are associated with a first of a plurality of classes. A graph is generated in which nodes represent the identified leaves. The graph generation includes connecting pairs of nodes, which represent leaves that are not inconsistent, with edges. Cliques in the graph of size at least ⌊k/2⌋+1 nodes are identified, where k is the number of decision trees. Each clique corresponds to a subspace in which a feature vector is classified by the classifier model in the first of the plurality of classes. Provision is made for a user to define costs for modifying feature values of at least some of the features in an initial feature vector which is classified by the classifier model in a second of the plurality of classes. With a cost function which takes into account the user-defined costs, a modified feature vector is identified in one of the identified subspaces which optimizes the cost function. The modified feature vector or information based thereon is output.
At least one of the identifying leaves, identifying cliques and identifying a modified feature vector may be performed with a processor.
A system and method are now described which assume the existence of a binary classifier which has been trained to output a decision based on an input vector of attributes. A user for whom a decision is being made is provided with alternatives which would make the decision algorithm decide differently. The method includes enumerating subspaces within the multidimensional space defined by the ranges of possible feature values where the classifier provides the desired output. As an example, the specific case of classifiers based on decision forests (ensemble methods based on decision trees) is considered by mapping the problem to an iterative version of enumerating k-cliques.
The system and method provide the user with a set of steps to perform in order to achieve the desired outcome. The system and method do not require disclosing details about the model which makes the decision. Rather, the user is asked to weight, with a non-negative value, the relative cost of changing the features. The weights can vary depending on the feature, and may be linear, infinite (e.g., for changing height, or getting younger), quadratic (e.g., losing weight) or any other function, that is not necessarily differentiable or symmetric. Based on the user inputs, the system recommends an alternative set of feature values that optimizes (e.g., minimizes) the modification cost but ensures that the output decision would change. The system may be in the form of a tool which can be used either independently by the end-user, or it could be part of a solution provided to an intermediate human agent whose interest is to provide a positive solution but without raising red flags in his institutional system, for example, an agent processing credit requests.
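The kinds of user-assigned cost functions described above (linear, infinite, quadratic, asymmetric) can be sketched as follows. This is an illustrative sketch only; the feature names and numeric values are hypothetical assumptions, not part of the exemplary implementation.

```python
import math

# Hypothetical per-feature cost functions of the kinds described above.
# The feature names and numeric values are illustrative assumptions.

def cost_height(old, new):
    # Height cannot be changed: any modification gets an infinite cost.
    return 0.0 if new == old else math.inf

def cost_weight(old, new):
    # Changing weight gets harder the larger the change: quadratic cost.
    return float(new - old) ** 2

def cost_loan_amount(old, new):
    # Asymmetric linear cost: lowering the amount is cheaper than raising it.
    delta = new - old
    return 0.5 * abs(delta) if delta < 0 else 2.0 * delta

print(cost_height(180, 185))     # inf
print(cost_weight(90, 85))       # 25.0
print(cost_loan_amount(15, 12))  # 1.5
```

Note that none of these functions need be differentiable or symmetric; the optimization only ever evaluates them pointwise.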
By enumerating all subspaces where the classifier would provide the desired decision, and returning those that are close enough to the original feature vector, with respect to the cost function, the method can be very flexible and user-specific.
In the following, the terms “optimization,” “minimization,” and similar phraseology are to be broadly construed as one of ordinary skill in the art would understand these terms. For example, these terms are not to be construed as being limited to the absolute global optimum value, absolute global minimum, and so forth. For example, minimization of a function may employ an iterative minimization algorithm that terminates at a stopping criterion before an absolute minimum is reached. It is also contemplated for the optimum or minimum value to be a local optimum or local minimum value.
With reference to
The computer system 10 may include one or more computing devices 40, such as a PC, such as a desktop, a laptop, or palmtop computer, portable digital assistant (PDA), server computer, cellular telephone, tablet computer, pager, combination thereof, or other computing device capable of executing instructions for performing the exemplary method.
The memory 20 may represent any type of non-transitory computer readable medium such as random access memory (RAM), read only memory (ROM), magnetic disk or tape, optical disk, flash memory, or holographic memory. In one embodiment, the memory 20 comprises a combination of random access memory and read only memory. In some embodiments, the processor 24 and memory 20 may be combined in a single chip. Memory 20 stores instructions for performing the exemplary method as well as the processed data 14, 18, etc. The classifier model 12 may be resident on the computing device 40 or accessed on a remote computing device, such that the parameters of the model are not known to the system.
The interface 28, 30 allows the computer to communicate with other devices via a wired or wireless link 42, e.g., a computer network, such as a local area network (LAN) or wide area network (WAN), or the Internet, and may comprise a modulator/demodulator (MODEM), a router, a cable, and/or an Ethernet port.
The digital processor device 24 can be variously embodied, such as by a single-core processor, a dual-core processor (or more generally by a multiple-core processor), a digital processor and cooperating math coprocessor, a digital controller, or the like. The digital processor 24, in addition to executing instructions 22, may also control the operation of the computer 40.
The term “software,” as used herein, is intended to encompass any collection or set of instructions executable by a computer or other digital system so as to configure the computer or other digital system to perform the task that is the intent of the software. The term “software” as used herein is intended to encompass such instructions stored in storage medium such as RAM, a hard disk, optical disk, or the like, and is also intended to encompass so-called “firmware” that is software stored on a ROM or the like. Such software may be organized in various ways, and may include software components organized as libraries, Internet-based programs stored on a remote server or so forth, source code, interpretive code, object code, directly executable code, and so forth. It is contemplated that the software may invoke system-level code or calls to other software residing on a server or other location to perform certain functions.
The illustrated instructions include a mapping component 48, a classifier component 50, a Graphical User Interface (GUI) generator 52, a modification component 54, and an output component 56. Briefly, the mapping component 48 identifies one or more subspaces in the feature space where the feature vectors are classified by the classifier in the desired (first) class. In the case of decision trees as the classifier model, this may include generating a graph 58 which connects leaves of the trees representing the desired class by edges whenever the associated leaves are not mutually exclusive. Once the mapping has been performed, the mapping component is no longer needed and can be omitted from the system. The classifier component 50 uses the classifier model 12 to generate a classification for the feature vector 16. The GUI generator 52 generates a GUI 60 for display to the user on the display device 32 which enables the user to assign costs 36 to modifications to the feature values when the classification is a second (undesired) class. The modification component 54 identifies a modified feature vector 18, based on the assigned costs, which optimizes (e.g., minimizes, when costs are positive values) the cost of changing the initial feature vector so that it is in a subspace identified by the mapping component where the modified feature vector is classified in the first, desired class. The output component outputs information 62, such as the modified feature vector 18, the decision 14, the result of a process performed based thereon, information based thereon, or a combination thereof.
The exemplary decision system 10 can provide help to the user, in the form of concrete actions that the user can take in order to obtain a desired output. The user is able to specify a user-specific cost function, which can be of any form, allowing comparison between different changes of feature values. By enumerating all (or at least a significant quantity) of the subspaces where the classifier would provide the desired decision, a suitable modification to the input vector can be identified. In the case of forests of decision trees, enumerating these subspaces can be mapped to the problem of enumerating k-cliques, for which an efficient implementation exists.
At S102, the multidimensional feature space corresponding to the possible values of a set of features is mapped, by the mapping component 48, to identify regions in the space where the classifier model 12 assigns a feature vector to a desired (first) class. This may include enumerating all (or at least some) subspaces where the classifier outputs a desired decision. This step may alternatively be performed later in the process, for example, before or during step S114.
At S104 an initial feature vector 16 is provided, e.g., input by a user.
At S106, the initial feature vector 16 is input to the trained classifier model 12, which outputs an initial decision 14, such as a class from a plurality of classes. In the exemplary embodiment, there are only two classes, although it is contemplated that the method could be extended to more than two classes.
If at S108, the decision is the first class of the plurality of classes, which is the outcome desired, the method proceeds to S110, where the decision is output and/or used to implement a process, such as confirming a credit application, or the like. Otherwise, the method proceeds to S112.
At S112 a mechanism is provided for the user to assign costs to feature modifications, e.g., through a graphical user interface 60 generated by the GUI generator 52 and displayed to the user on the display device 32. The user interacts with the GUI to specify which of the features in the feature vector can be modified and a cost (or cost function for computing the cost) for each of a set of modifications. Features which cannot be modified or values that are not possible or acceptable to the user under any circumstances are given an infinite cost.
At S114, a modified feature vector 18 (classified by the model 12 in the desired first class) is identified by the modification component 54, using a subset of the user-defined modifications whose distance from the input feature vector results in the minimal total cost. The method then proceeds to S110. If no such feature vector can be found, the method may return to S112 to allow the user to modify the costs differently.
The method ends at S116.
The method illustrated in
Alternatively, the method may be implemented in transitory media, such as a transmittable carrier wave in which the control program is embodied as a data signal using transmission media, such as acoustic or light waves, such as those generated during radio wave and infrared data communications, and the like.
The exemplary method may be implemented on one or more general purpose computers, special purpose computer(s), a programmed microprocessor or microcontroller and peripheral integrated circuit elements, an ASIC or other integrated circuit, a digital signal processor, a hardwired electronic or logic circuit such as a discrete element circuit, a programmable logic device such as a PLD, PLA, FPGA, graphics processing unit (GPU), or PAL, or the like. In general, any device capable of implementing a finite state machine that is in turn capable of implementing the flowchart shown in
Further details of the system and method will now be described.
In the case of binary classifiers 12, it can be assumed that the starting point is a trained classifier f and an initial feature vector v (v∈R|v|), which is classified as class c or c′ (0 or 1): f(v)→{c, c′}, where c is considered a bad (second) class and c′ is a good (first) one. In some embodiments, |v|, the number of features in v, may be at least five or at least eight, or at least ten. At least two, or at least three, or all of the features may be modifiable by a user.
The aim is to modify v as little as possible in order to obtain v′ that is classified by the classifier as the opposite class c′=1−c. v′ can thus be described as the vector which minimizes a non-negative cost function d for changing from v to v′:

v′=argmin{v′: f(v′)=c′} d(v,v′)  (1)

where the cost function d: R|v|×R|v|→R+ assigns a non-negative cost to each pair of feature vectors.
In contrast to other approaches, the distance/cost function d is not restricted to being a norm. Rather, the aim is for it to be general, and metric assumptions may be relaxed. To achieve this, d is defined component-wise, as the sum of the cost of changing from one feature value to another:

d(v,v′)=Σi di(vi,v′i)  (2)
Each di(vi,v′i) thus represents the cost of changing a respective initial feature value vi in the initial feature vector to a modified feature value v′i in the modified feature vector, when the modified feature value differs from the initial feature value. This component-wise cost di is user-specific, and can be independently defined, as different users may value one attribute more than another. The costs di(vi,v′i) may be defined through linear functions, or non-linear functions, e.g., exponential (e.g., quadratic), or stepwise functions, or combinations thereof, although the method is not limited to any specific type of function. In a linear function, the cost increases the same amount for each equal incremental change in the value (up or down). Different functions may be used for increasing values and decreasing values.
Each feature may have a set or range of possible feature values, which, in combination, define the multidimensional feature space of the classifier. For each, or at least some, of the features, the user is permitted to assign a cost di for changing from the initial feature value to one or more of the other possible feature values, or a function by which the cost for each changed feature value is computed. The cost for changing from a given feature value to another may be in a range of 0 to infinity. Changes that are assigned an infinite cost will not be made in v′. All of the costs for the features are on the same scale, so that lower cost changes are more likely to be used in the minimal total cost function than higher cost changes, assuming that they result in a positive outcome.
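The component-wise cost of equation (2) can be sketched as a sum of user-supplied per-feature cost functions. The function shapes below are illustrative assumptions.

```python
import math

# Sketch of the total cost d(v, v') as the sum of per-feature costs
# d_i(v_i, v'_i), per equation (2). The cost functions are illustrative.
def total_cost(v, v_prime, costs):
    return sum(costs[i](v[i], v_prime[i]) for i in range(len(v)))

costs = {
    0: lambda a, b: math.inf if a != b else 0.0,  # immutable feature
    1: lambda a, b: abs(b - a),                   # linear cost
    2: lambda a, b: (b - a) ** 2,                 # quadratic cost
}

print(total_cost([1, 10, 5], [1, 12, 7], costs))  # 2 + 4 = 6.0
print(total_cost([1, 10, 5], [0, 10, 5], costs))  # inf: immutable feature changed
```

Because a change assigned infinite cost drives the whole sum to infinity, such a modification can never appear in a minimal-cost solution.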
Qualitative features may be encoded with one-hot vectors. For example, in the case of gender, this may be a binary encoding with 1 corresponding to male and 0 corresponding to female. In the case of numerical features, such as income, a set of non-overlapping intervals may be defined as feature values, such as under 10, >10 to 20, >20 to 30, >30 to 40, and >40 thousand, which could be encoded as {1, 2, 3, 4, 5}, or any other sort of alphanumeric encoding, which is associated in memory with the feature value it represents.
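The encodings just described can be sketched as follows; the interval boundaries follow the income example above, and the helper name is an illustrative assumption.

```python
# Binary encoding of a qualitative feature (1 = male, 0 = female).
GENDER = {"male": 1, "female": 0}

def encode_income(thousands):
    """Map an income in 1000's to one of five non-overlapping intervals:
    under 10, >10 to 20, >20 to 30, >30 to 40, and >40 thousand."""
    bounds = [10, 20, 30, 40]  # upper bounds of the first four intervals
    for code, bound in enumerate(bounds, start=1):
        if thousands <= bound:
            return code
    return 5  # the >40 thousand interval

print(encode_income(25))  # falls in >20 to 30, encoded as 3
```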
In one embodiment, the user may be provided with a table of possible modifications and asked to assign a cost to each. This is particularly useful where there is a small set of possible values for the feature.
As an example, consider the case of a user applying for a car loan. The feature vector includes values for the following features (attributes): Age in years, Income in 1000's of dollars, Marital Status (1=married, 0=unmarried), Amount of loan in 1000's of dollars, and Repayment Period, in years. The user supplies the information for creating an initial feature vector: [29, 25, 0, 15, 8]. The classifier returns the decision that the loan application is rejected. As illustrated in
In one embodiment, the cost functions may be generated based on user answers to questions provided by the system, such as “how willing would you be to change the loan amount from 15,000 to 10,000?” (answered on a scale of 1 to 10, where 1 is very willing and 10 is not willing). Or the user may be free to select a different cost for each of a set of possible values of the feature. For example, if there are two cars available priced at 15 and 9 thousand, she could assign a cost of 0.1 to a change in the loan amount from 15 to 14, and the same from 14 to 13, a cost of 10 for a change from 15 to 12, a cost of 2 for a change from 15 to 9, and so forth, depending on the value to her of the different loan amounts. In one embodiment, the GUI may identify the range of values for a given feature which are known to occur in one of the subspaces that are in the positive, first class, given the feature values which the user has already assigned. For example, if the user specifies that the income cannot be changed, the GUI may show the user that a loan of over $12,000 cannot be achieved without modifications to other features.
The modification component 54 then generates a modified feature vector by minimizing the total cost while still resulting in a favorable loan decision. For example, it could generate a vector 18 [29, 25, 0, 11.5, 6] and present the information in the GUI in textual form as shown, for example, at 62.
A method for identifying subspaces in which the vector 18 is assigned to the desired, first class is now described for classifiers learned with Tree-Ensembles (random forests). Random forest classifiers are very efficient non-linear classifiers which are widely used in a variety of applications. They operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random forests correct for the decision trees' habit of overfitting to their training set.
It is assumed that the random forest is composed of k binary decision trees, where k is at least 2, each tree being composed of a set of nodes. At each node n, a single-feature threshold decision is made, dividing the remaining data-points into two sets, depending on whether feature x(n) is a) smaller than or equal to, or b) larger than, a threshold τn. Each leaf is associated with an outcome class(n), and each tree classifies an entry according to the leaf reached by the sequence of decisions along the path from the root. The ensemble method uses simple voting to determine the final prediction, each tree having one vote.
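The tree structure and voting rule described above can be sketched as follows. The dict-based tree layout and the example thresholds are illustrative assumptions, not the format of any particular library.

```python
# Each internal node tests one feature against a threshold; each leaf holds a
# class. The forest predicts by simple majority vote, one vote per tree.
def classify(tree, x):
    node = tree
    while "leaf" not in node:
        f, thr = node["feature"], node["threshold"]
        node = node["left"] if x[f] <= thr else node["right"]
    return node["leaf"]

def forest_predict(trees, x):
    votes = [classify(t, x) for t in trees]
    return max(set(votes), key=votes.count)  # mode of the tree votes

# Three toy single-split trees (k = 3); thresholds chosen arbitrarily.
t1 = {"feature": 0, "threshold": 30, "left": {"leaf": 0}, "right": {"leaf": 1}}
t2 = {"feature": 1, "threshold": 20, "left": {"leaf": 0}, "right": {"leaf": 1}}
t3 = {"feature": 0, "threshold": 50, "left": {"leaf": 1}, "right": {"leaf": 0}}

print(forest_predict([t1, t2, t3], [40, 25]))  # all three trees vote 1
```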
For example,
Construction of the Graph
To identify suitable subspaces, the leaf nodes of the first class c′ (leaves 104, 108, 114, 116, 122, and 124 in the
⌊k/2⌋ represents the floor of k/2, i.e., the largest integer less than or equal to k/2.
A clique is a subset of vertices of the undirected graph 58 such that its induced subgraph is complete, i.e., every vertex of that subgraph is connected to all other vertices of that subgraph.
An undirected graph G=(V,E), is constructed, with V denoting the set of vertices (nodes) and E the set of pairs of nodes denoting edges. Each leaf node i of class c′ of decision tree j will correspond to a node (vertex) in the graph G, where:
V={ni(j) | 1≤j≤k, class(ni(j))=c′, ni(j) is a leaf node of tj}  (3)
Then a pair of leaf nodes (ni1(j1), ni2(j2)) is connected by an edge if both of the following conditions hold:
1. The intersection of their corresponding intervals is non-empty (which, in particular, implies that j1≠j2, i.e., the nodes are not from the same tree). As an example, in the classifier of
2. They denote a consistent solution: A consistent solution refers to potential global constraints due to the representation of qualitative attributes in the feature space. For example, a person's gender may be encoded as a one-hot binary vector, but an interval which forces both components to be 0 is not consistent (given a one-hot encoding of length m, of the m interval restrictions at least one has to admit a 1 and m−1 have to admit a 0).
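The edge conditions above can be sketched as follows. Each leaf is represented by the box of feature intervals leading to it; the sketch checks only the interval-intersection condition (condition 1), leaving the one-hot consistency check of condition 2 as an additional filter. All names are illustrative assumptions.

```python
import itertools
import math

def boxes_intersect(a, b):
    # a, b map feature index -> (lo, hi) interval; missing features are
    # unbounded. The intersection is empty if any per-feature overlap is.
    for f in set(a) | set(b):
        lo1, hi1 = a.get(f, (-math.inf, math.inf))
        lo2, hi2 = b.get(f, (-math.inf, math.inf))
        if max(lo1, lo2) >= min(hi1, hi2):
            return False
    return True

def build_graph(leaves):
    # leaves: list of (tree_id, box). Connect two leaves iff they come from
    # different trees and their boxes intersect (condition 1).
    edges = set()
    for (i, (ti, bi)), (j, (tj, bj)) in itertools.combinations(
            enumerate(leaves), 2):
        if ti != tj and boxes_intersect(bi, bj):
            edges.add((i, j))
    return edges

leaves = [(0, {0: (-math.inf, 30)}),  # tree 0: feature 0 <= 30
          (1, {0: (20, math.inf)}),   # tree 1: feature 0 > 20
          (1, {0: (-math.inf, 10)})]  # tree 1: feature 0 <= 10
print(sorted(build_graph(leaves)))    # [(0, 1), (0, 2)]
```

Note that leaves 1 and 2 come from the same tree, so no edge is created between them even though an interval test is not even needed.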
With this graph, any clique of size at least ⌊k/2⌋+1 now corresponds to a (possibly empty) subspace where the random forest would predict class c′ as the outcome. The method includes enumerating those cliques, filtering out empty and inconsistent ones, and, in step S114, measuring their distance d to the original feature vector v, using equations (1) and (2).
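For a small graph, the clique step can be sketched by brute force, as below; a practical implementation would use a dedicated enumerator such as PMC. The node and edge values are illustrative assumptions.

```python
import itertools

def cliques_at_least(nodes, edges, min_size):
    # Every subset of nodes of size >= min_size whose pairs are all connected
    # is a clique, i.e., a candidate subspace where the forest outputs c'.
    edge_set = {frozenset(p) for p in edges}
    found = []
    for r in range(min_size, len(nodes) + 1):
        for sub in itertools.combinations(nodes, r):
            if all(frozenset(p) in edge_set
                   for p in itertools.combinations(sub, 2)):
                found.append(sub)
    return found

k = 3                  # number of trees in the forest
min_size = k // 2 + 1  # floor(k/2) + 1 = 2
nodes = [0, 1, 2, 3]
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
print(cliques_at_least(nodes, edges, min_size))
# [(0, 1), (0, 2), (1, 2), (2, 3), (0, 1, 2)]
```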
As an example,
Thus, the minimum size of a clique is 2. Two different cliques 136, 138 of size 3 are illustrated by way of example, although other and larger sized cliques can be observed in the graph.
Most problems involving cliques are NP-hard, and thus rapidly increase in complexity; finding k-cliques is no exception, and enumerating them even more so (see Garey, et al., “Computers and intractability: A Guide to the Theory of NP-Completeness,” W.H. Freeman and Company, New York, 1979). Enumerating cliques is known to be polynomial in the output (which can be exponential), with a time delay (the time between two consecutive outputs) of O(|E||V|) (Tsukiyama, et al., “A new algorithm for generating all the maximal independent sets,” SIAM Journal on Computing, 6(3), 505-517, 1977). However, efficient algorithms exist that are fast enough to provide enough samples of cliques to be of reasonable use in practice. See, for example, Ryan Rossi, “Parallel Maximum Clique Library,” 2013, available at https://github.com/ryanrossi/pmc. While not necessarily enumerating all possible solutions, such methods are shown in the examples below to provide beneficial results in terms of modifying the input feature vector to generate the desired outcome.
For each clique, the intersection of all its corresponding vertices is found. For example, for clique 138, this corresponds to the shaded area 140 on
For linear, SVM or neural network-based classifiers, existing solutions, such as those described in Biggio 2013, can be applied.
The application of Eq. (1) in a system where the goal is to help the user (or the intermediate human agent) find a better solution differs from existing methods. Adversarial learning has different objectives, which are more often than not reflected in the choice of methods used. The data is assumed to be non-stationary, so that the adversary can cherry-pick data-points, which leads to an arms race with malicious intent (spam, malware, network intrusion, etc.). Here, the component is applied in a system where the goal is beneficial to the user and the provider, in order to help guide the user to find the least expensive way to a positive outcome. Previous approaches for finding adversarial examples (such as those of Biggio 2013) can also be adapted to this new scenario.
The reduction to a clique problem is useful to address the case of a random forest where the cost function can be arbitrary. The algorithm is optimal, in the sense that it will find the exact solution if it keeps running. However, the incremental nature of the algorithm permits partial solutions to be shown as soon as they are found, and in the examples below, partial solutions were quickly found to be good enough. In particular, the cliques may be identified and/or evaluated one by one, computing the distance (cost) from the subspace defined by the clique to the input vector 16. Each time a new clique is identified that has a lower distance (cost), it is stored in memory. The system may output the lowest cost solution (smallest distance), or may output a set of solutions which are below a given threshold.
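The incremental best-so-far search just described can be sketched as follows, assuming linear per-unit costs and boxes (per-feature intervals) obtained from the enumerated cliques. All names and values are illustrative assumptions.

```python
import math

def project(v, box):
    # Move each coordinate to the nearest point inside its interval.
    out = list(v)
    for f, (lo, hi) in box.items():
        out[f] = min(max(v[f], lo), hi)
    return out

def best_modification(v, boxes, unit_costs):
    best, best_cost = None, math.inf
    for box in boxes:  # e.g., one box per enumerated clique
        v2 = project(v, box)
        cost = sum(unit_costs[i] * abs(v2[i] - v[i])
                   for i in range(len(v)) if v2[i] != v[i])
        if cost < best_cost:  # keep the cheapest solution found so far
            best, best_cost = v2, cost
    return best, best_cost

# Two candidate subspaces constraining the loan amount (feature 3) and the
# repayment period (feature 4); infinite unit cost marks immutable features.
boxes = [{3: (8, 12)}, {3: (5, 9), 4: (2, 6)}]
v = [29, 25, 0, 15, 8]
unit_costs = [math.inf, 1.0, math.inf, 2.0, 0.5]
print(best_modification(v, boxes, unit_costs))  # ([29, 25, 0, 12, 8], 6.0)
```

Since each candidate is scored independently, the loop can be stopped at any time and still report the cheapest solution seen so far, matching the incremental behavior described above.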
The system and method find application in a variety of situations, including insurance (e.g., binary classes corresponding to increased cost and decreased cost), loan or credit applications (e.g., loan granted and loan not granted), and college acceptance (accepted and not accepted), among others.
Without intending to limit the scope of the exemplary embodiment, the following examples illustrate the application of the method.
The algorithm described above was implemented and tested on the German Credit Data from the UCI Machine Learning Repository (see Lichman, “UCI Machine Learning Repository,” http://archive.ics.uci.edu/ml, Irvine, Calif.: University of California, School of Information and Computer Science, 2013). Each qualitative attribute (13 out of 20) was encoded as a one-hot vector, while the other 7 numerical attributes were used in their original form. These features include gender, credit history, savings, employment status, and others. This is a binary classification problem, where each feature vector is labeled as good or bad. A random forest classifier 12 (using 10 decision trees) was employed. The classifier achieves an accuracy of 74.6% on 3-fold cross-validation, which is in line with what is reported in the literature (Ratanamahatana, et al., “Scaling up the Naive Bayesian Classifier: Using Decision Trees for Feature Selection,” Citeseer, pp. 1-10, 2002; O'Dea, et al., “Combining feature selection and neural networks for solving classification problems,” Proc. 12th Irish Conf. Artificial Intell. Cognitive Sci., pp. 157-166, 2001). It had similar performance to the other classifiers investigated (outperforming nearest-neighbor, Naive Bayes with various priors, and SVM with various kernels, while logistic regression obtained better performance).
For the clique finder, the Parallel Maximum Clique (PMC) Library (Rossi, et al., “A fast parallel maximum clique algorithm for large sparse graphs and temporal strong components,” ArXiv:1302.6256, 2013; https://github.com/ryanrossi/pmc) was used, and proved to be very fast.
User-specific weights: The user can specify any possible weights (costs). For this data-set, a parser was generated which allows the user to specify, for numerical attributes, how much a single-unit modification costs (differentiating up from down), allowing for a linear weight. For qualitative attributes, the user can give the cost of changing from any attribute value to any other.
The random forest classifier is created (and checked to be sure that it provides reasonable accuracy) with the following code:
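The original listing is not reproduced here. As a hedged sketch, the classifier-creation step might look as follows using scikit-learn (the library choice is an assumption, and a synthetic dataset stands in for the German Credit data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the encoded credit data (an assumption).
X, y = make_classification(n_samples=1000, n_features=24, random_state=0)

# A random forest with 10 decision trees, as in the experiment above.
clf = RandomForestClassifier(n_estimators=10, random_state=0)

# Sanity-check that the forest provides reasonable accuracy with 3-fold
# cross-validation before fitting on the full data.
scores = cross_val_score(clf, X, y, cv=3)
print(round(scores.mean(), 3))
clf.fit(X, y)
```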
The graph is created with the following code, creating a node for each leaf of class 1 (the desired). This can take some time, since it is quadratic in the number of leaves:
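The original listing is likewise not shown. Assuming a scikit-learn forest (an assumption; no library is named), node creation might be sketched as below: the traversal collects, for each tree, every class-1 leaf together with the per-feature intervals leading to it (the helper name is ours). Pairwise edge creation over these nodes, quadratic in the number of leaves, would follow as described earlier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def class1_leaf_boxes(estimator, n_features):
    # Collect (lower_bounds, upper_bounds) for every class-1 leaf of one
    # fitted sklearn decision tree, by walking its threshold tests.
    t = estimator.tree_
    found = []
    def walk(node, lo, hi):
        if t.children_left[node] == -1:             # node is a leaf
            if int(np.argmax(t.value[node])) == 1:  # leaf of the desired class
                found.append((lo.copy(), hi.copy()))
            return
        f, thr = t.feature[node], t.threshold[node]
        hi2 = hi.copy()
        hi2[f] = min(hi[f], thr)
        walk(t.children_left[node], lo, hi2)        # branch with x[f] <= thr
        lo2 = lo.copy()
        lo2[f] = max(lo[f], thr)
        walk(t.children_right[node], lo2, hi)       # branch with x[f] > thr
    walk(0, np.full(n_features, -np.inf), np.full(n_features, np.inf))
    return found

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = RandomForestClassifier(n_estimators=3, random_state=0).fit(X, y)
nodes = [(j, box) for j, est in enumerate(clf.estimators_)
         for box in class1_leaf_boxes(est, 5)]
print(len(nodes))  # number of graph nodes (class-1 leaves across all trees)
```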
The data-structure creation step has to be done only once, as the target class should generally be the same.
The attributes considered are shown in TABLE 1.
A new attribute vector for somebody applying for credit was generated to test the system. The following is a credit application for 100,000 DM (Deutsche Mark), for a duration of 72 months, to buy a new car, submitted by a 64-year-old, male, single, unskilled resident.
This example is constructed such that the rating would be bad (class 2):
Table 2 provides an example of some of the feature weights (costs), which can be varied by the user. Inf. indicates an infinite weight. For qualitative features, the user enters a cost to change from one feature value to another.
For each clique, the intersection of all its corresponding vertices is found and the shortest path between the target (input) vector and that interval is computed. Every time a better solution (cheaper) is found, the changes to be done are enumerated. The implementing code, followed by a sequence of modifications, is shown in TABLE 3.
As can be seen from this example, the enumeration of cliques results in a low cost solution. The user may be presented with the lowest cost solution identified, or a set of the lowest cost solutions.
It will be appreciated that variants of the above-disclosed and other features and functions, or alternatives thereof, may be combined into many other different systems or applications. Various presently unforeseen or unanticipated alternatives, modifications, variations or improvements therein may be subsequently made by those skilled in the art which are also intended to be encompassed by the following claims.