Embodiments generally relate to providing an explanation for machine learning algorithms, and more particularly to providing a pairwise feature attribution for information retrieval machine learning algorithms.
Governments and businesses are relying more and more on predictions from artificial intelligence models and machine learning algorithms. Many of these machine learning algorithms are a black box, making it difficult to determine which variables are most responsible for the predictions. To enhance end-user trust and help in the analysis of possible prediction errors, these predictions need to be accompanied by additional information which at least partially explains why a machine learning algorithm makes a certain prediction.
One important class of machine learning problems is the area of information retrieval. Information retrieval problems include semantic search, image retrieval, and entity matching. Information retrieval problems often have specific interactions between features which may impact predictions. These feature pairs can impact predictions more than either feature on its own. This issue is also present in classification problems with strong feature interactions, for example when the feature set splits into two distinct groups, as in multi-modal classification of image and text.
Accordingly, what is needed are methods, systems, and media for providing an explanation of which features are significant for information retrieval machine learning algorithms involving interactions between features.
Disclosed embodiments of the present technology solve the above-mentioned problems by providing systems, methods, and computer-readable media for determining which features and feature pairs are significant for machine learning algorithms. By examining the interactions between pairs of features, additional explanations may be provided which would not be knowable by examining individual features alone. Such explanations are particularly useful for explaining information retrieval algorithms, where the interactions between features may be especially important. These solutions are also model agnostic, allowing them to be used with any type of machine learning model, and they do not require feature pruning to be efficient. Further, an improved sampling scheme increases computational efficiency by sampling based on a normalized probability distribution, allowing the feature weights to be determined from fewer samples and thereby improving runtime.
In some aspects, the techniques described herein relate to one or more non-transitory computer-readable media storing computer-executable instructions that, when executed by at least one processor, perform a method for feature attribution in a machine learning model, the method including: receiving, from a user, a machine learning model and input data; generating a prediction using the machine learning model and the input data; generating a plurality of samples for the machine learning model by eliminating features from the input data and the prediction; calculating a weight for at least one feature and at least one feature pair of the input data and the prediction using the plurality of samples; and transmitting the weight for the at least one feature and the at least one feature pair to the user.
In some aspects, the techniques described herein relate to a non-transitory computer-readable media, wherein the machine learning model is an information retrieval machine learning model.
In some aspects, the techniques described herein relate to a non-transitory computer-readable media, wherein the plurality of samples for the machine learning model are generated based on a normalized probability distribution.
In some aspects, the techniques described herein relate to a non-transitory computer-readable media, wherein calculating the weight for one or more features and one or more feature pairs involves a local interpretable model-agnostic explanation method.
In some aspects, the techniques described herein relate to a non-transitory computer-readable media, wherein generating a plurality of samples for the machine learning model uses a Hamming distance to determine the plurality of samples.
In some aspects, the techniques described herein relate to a non-transitory computer-readable media, wherein the input data is an image, and the plurality of samples are generated by graying out regions of superpixels from the image.
In some aspects, the techniques described herein relate to a non-transitory computer-readable media, wherein calculating the weight for the at least one feature and the at least one feature pair is done using a ridge regression.
In some aspects, the techniques described herein relate to a method for feature attribution in a machine learning model, the method including: receiving, from a user, a machine learning model and input data; generating a prediction using the machine learning model and the input data; generating a plurality of samples for the machine learning model by eliminating features from the input data and the prediction; calculating a weight for at least one feature and at least one feature pair of the input data and the prediction using the plurality of samples; and transmitting the weight for the at least one feature and the at least one feature pair to the user.
In some aspects, the techniques described herein relate to a method, wherein the machine learning model is an information retrieval machine learning model.
In some aspects, the techniques described herein relate to a method, wherein the plurality of samples for the machine learning model are generated based on a normalized probability distribution.
In some aspects, the techniques described herein relate to a method, wherein calculating the weight for one or more features and one or more feature pairs involves a local interpretable model-agnostic explanation method.
In some aspects, the techniques described herein relate to a method, wherein generating a plurality of samples for the machine learning model uses a Hamming distance to determine the plurality of samples.
In some aspects, the techniques described herein relate to a method, wherein the input data is an image, and the plurality of samples are generated by graying out regions of superpixels from the image.
In some aspects, the techniques described herein relate to a method, wherein calculating the weight for the at least one feature and the at least one feature pair is done using a ridge regression.
In some aspects, the techniques described herein relate to a system for feature attribution in a machine learning model, the system including: at least one processor; and at least one non-transitory memory storing computer executable instructions that when executed by the at least one processor cause the system to carry out actions including: receiving, from a user, a machine learning model and input data; generating a prediction using the machine learning model and the input data; generating a plurality of samples for the machine learning model by eliminating features from the input data and the prediction; calculating a weight for at least one feature and at least one feature pair of the input data and the prediction using the plurality of samples; and transmitting the weight for the at least one feature and the at least one feature pair to the user.
In some aspects, the techniques described herein relate to a system, wherein the machine learning model is an information retrieval machine learning model.
In some aspects, the techniques described herein relate to a system, wherein the plurality of samples for the machine learning model are generated based on a normalized probability distribution.
In some aspects, the techniques described herein relate to a system, wherein calculating the weight for one or more features and one or more feature pairs involves a local interpretable model-agnostic explanation method.
In some aspects, the techniques described herein relate to a system, wherein generating a plurality of samples for the machine learning model uses a Hamming distance to determine the plurality of samples.
In some aspects, the techniques described herein relate to a system, wherein the input data is an image, and the plurality of samples are generated by graying out regions of superpixels from the image.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Other aspects and advantages of the present teachings will be apparent from the following detailed description of the embodiments and the accompanying drawing figures.
Embodiments are described in detail below with reference to the attached drawing figures, wherein:
The drawing figures do not limit the present teachings to the specific embodiments disclosed and described herein. The drawings are not necessarily to scale, emphasis instead being placed upon clearly illustrating the principles of the disclosure.
The following detailed description of embodiments references the accompanying drawings that illustrate specific embodiments in which the present teachings can be practiced. The described embodiments are intended to illustrate aspects of the present teachings in sufficient detail to enable those skilled in the art to practice the present teachings. Other embodiments can be utilized, and changes can be made without departing from the claims. The following detailed description is, therefore, not to be taken in a limiting sense. The scope of embodiments is defined only by the appended claims, along with the full scope of equivalents to which such claims are entitled.
In this description, references to “one embodiment,” “an embodiment,” or “embodiments” mean that the feature or features being referred to are included in at least one embodiment of the technology. Separate references to “one embodiment,” “an embodiment,” or “embodiments” in this description do not necessarily refer to the same embodiment and are also not mutually exclusive unless so stated and/or except as will be readily apparent to those skilled in the art from the description. For example, a feature, structure, or act described in one embodiment may also be included in other embodiments but is not necessarily included. Thus, the technology can include a variety of combinations and/or integrations of the embodiments described herein.
Embodiments are contemplated which permit a user to determine which features, and which pairwise features, of a machine learning model are significant. Machine learning models are often black boxes, which limits a user's ability to understand how the model actually works. By understanding which features are significant, a user can gain a better understanding of the underlying model and ensure the model is functioning properly. In some embodiments, an existing machine learning model may be supplied. The machine learning model may be an information retrieval machine learning model, or any other type of machine learning model. Samples may be generated for the machine learning model which allow the weights of the features, as well as the weights of feature pairs, to be determined. In some embodiments, the samples may be generated using a normalized probability distribution. Using the samples, a weight is determined for every feature and every feature pair. The weight for each feature and feature pair is a measure of how significant that feature or feature pair is for the model's predictions: a higher weight means the feature or feature pair is more important to the model's prediction, whereas a lower weight indicates that it is less significant. The weights of the features can be used both to ensure that the machine learning model is functioning properly and to permit troubleshooting of any issues. Problems with the training set may be detected if features which should not be significant have a large weight. For example, a user may intend for a machine learning model to classify pictures of animals based on what the animal in the image looks like. However, a data set with pictures of dogs on grass and cats on snow may lead a machine learning model to classify an animal based on whether the background of a picture is grass or snow, not on what the animal looks like. By showing that the background of images in such a classification algorithm has a high weight, the issue with the training data may be detected and addressed. As another example, the weights of a machine learning model can help determine if the model is improperly relying on certain features of a data set, such as gender, which may be contrary to laws in certain regions.
Bank statement 102 is depicted along with invoice 110. Bank statement 102 may comprise columns for amount 104, business partner name 106, and note to payee 108, among other columns. Invoice 110 may comprise columns for amount 112, organization 114, and document number 116, among other columns. Columns from bank statement 102 may correspond to columns from invoice 110, indicating that bank statement 102 corresponds to invoice 110. In some embodiments, columns may match when both the column name and value are the same. For example, both amount 104 and amount 112 have the same name, amount, and the same value, 990. In further embodiments, columns may match when at least the value is the same. For example, both note to payee 108 and document number 116 have the same value, 1000789. In some embodiments, the value from a first column may be present in a matching column within other text, such as if note to payee 108 included additional notes in addition to 1000789. In still further embodiments, columns may match when there is a fuzzy or incomplete match, or the values between the columns are similar enough. For example, business partner name 106 has a value of ABCD CORP which may be a fuzzy match to organization 114, which has a value of ABCD Corporation. A machine learning model may determine that amount 104, business partner name 106, and note to payee 108 were significant in determining that bank statement 102 corresponds to invoice 110. However, a user would prefer to know, for example, that the pair of amount 104 and amount 112 is significant. It is the interaction between the features of bank statement 102 and invoice 110 which is actually determinative of the match. Disclosed embodiments capture this information by determining the weights of pairwise features.
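As a minimal illustration of the kind of fuzzy column matching described above (not part of the disclosed embodiments themselves), the following Python sketch compares two column values with a similarity ratio; the lower-casing and the 0.7 threshold are illustrative assumptions:

```python
from difflib import SequenceMatcher

def fuzzy_match(value_a: str, value_b: str, threshold: float = 0.7) -> bool:
    """Return True if two column values are similar enough to count as a match."""
    ratio = SequenceMatcher(None, value_a.lower(), value_b.lower()).ratio()
    return ratio >= threshold

# Example: the business partner name and organization columns from above.
# "abcd corp" vs. "abcd corporation" yields a ratio of 2*9/(9+16) = 0.72.
print(fuzzy_match("ABCD CORP", "ABCD Corporation"))  # True with the 0.7 threshold
```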
As one example of a specific scenario of how the representation of relevant features and interactions could work in an embedding model, an embedding for image 202 (entity a) could result in an embedding vector that may be linearly decomposed as:
{right arrow over (g)}a=z1a{right arrow over (g)}Dog+z2a{right arrow over (g)}Cat.
Likewise, image 210 (entity b) would result in an embedding vector that may be linearly decomposed as
{right arrow over (g)}b=z1b{right arrow over (g)}Cat+z2b{right arrow over (g)}Giraffe.
In both instances, it may be assumed that the background is mapped to the zero vector, as it is irrelevant for the current task of finding an image containing a matching animal. Assuming that the embedding vectors for individual animals are roughly orthogonal, the mixed product terms containing, for example, {right arrow over (g)}Dog·{right arrow over (g)}Cat are approximately zero. The score function for the inner product of the embeddings would therefore be
f({right arrow over (a)}, {right arrow over (b)})={right arrow over (g)}({right arrow over (a)})·{right arrow over (g)}({right arrow over (b)})≈z2az1b{right arrow over (g)}Cat·{right arrow over (g)}Cat.
In other words, in this example the pairwise interaction between the two cat features is significant, while the individual features do not contribute on their own.
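A minimal numerical sketch of this toy example (purely illustrative; the coefficient values and embedding dimensions are assumptions, not taken from the disclosure) shows that the inner-product score reduces to the cat–cat pair term when the per-animal embedding vectors are orthogonal:

```python
import numpy as np

# Assumed orthogonal unit embeddings for the individual animals (illustrative only).
g_dog, g_cat, g_giraffe = np.eye(3)

# Assumed mixing coefficients for the two images.
z1_a, z2_a = 0.6, 0.8   # image 202: dog and cat contributions
z1_b, z2_b = 0.5, 0.7   # image 210: cat and giraffe contributions

g_a = z1_a * g_dog + z2_a * g_cat        # embedding of image 202 (entity a)
g_b = z1_b * g_cat + z2_b * g_giraffe    # embedding of image 210 (entity b)

score = g_a @ g_b                        # inner-product score function
print(score, z2_a * z1_b)                # both print 0.4: only the cat-cat pair contributes
```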
Machine learning model system may comprise training process 302. In some embodiments, training process 302 comprises training data 304 and initial model 306. Training data 304 may be labeled or unlabeled depending on the specific machine learning application. In some embodiments, training data 304 may exist in multiple different locations. Initial model 306 may be any initial machine learning model which is to be trained using training data 304. In some embodiments, training initial model 306 involves iteratively training initial model 306 using training data 304. In some embodiments, a portion of training data 304 may be reserved to evaluate the accuracy of intermediate versions of initial model 306. Training data 304 may be selected depending on the type of initial model 306. In some embodiments, training process 302 may involve multiple machine learning models training in an adversarial environment. Training process 302 may be used to train any type of machine learning model, including models trained using supervised learning, unsupervised learning, or reinforcement learning. In some embodiments, a portion of training data 304 may be reserved until the end of training process 302 to provide data for testing initial model 306 during or after training.
In some embodiments, training process 302 results in trained machine learning model 308. Input data 310 can be input into trained machine learning model 308 to produce predictions 312. For example, trained machine learning model 308 may receive as input data 310 an input image and an image database, and be required to find an image in the image database which corresponds to the input image. In such a case, given input data 310, trained machine learning model 308 may produce a numeric score for each image in the image database and select as the prediction the image with the highest score. In some embodiments, trained machine learning model 308 may continue to be trained and refined even after training process 302. Predictions 312 may be stored in a database or transmitted to a user.
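As a hedged sketch of this retrieval step (the embedding function and the dot-product score are placeholders assumed for illustration, not details of the disclosure), the model might score each candidate image and return the highest-scoring one:

```python
import numpy as np

def predict_match(embed, query_image, image_database):
    """Score every candidate against the query and return the best match.

    `embed` is a placeholder for the trained model's embedding function;
    the dot product is one possible score function.
    """
    query_vec = embed(query_image)
    scores = [float(embed(candidate) @ query_vec) for candidate in image_database]
    best_index = int(np.argmax(scores))
    return image_database[best_index], scores
```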
Once trained machine learning model 308 generates predictions 312 based on input data 310, sample data 402 is generated to determine the weights of the features of input data 310 and predictions 312 which caused machine learning model 308 to generate predictions 312. Sample data 402 may be used to evaluate the output score of the machine learning model with some features of input data 310 and predictions 312 displaced or turned off, such as by replacing a subset of features with a neutral or background version of itself. The details of replacing a feature with a neutral value may vary based on the specific feature domain and the machine learning application. For example, in some embodiments, sample data 402 may be generated by removing text tokens or sentences, graying out parts of an image, replacing numerical features with random values that follow the distribution from a training set, or replacing numerical features with a fixed value, such as the median or mean of the training set for a particular feature. For example, a series of sample data 402 may include images wherein each superpixel of an image is grayed out in one instance of sample data 402. In further embodiments, features can be binary values that represent whether original features of input data 310 or predictions 312 are preserved or displaced or turned off, and sample data 402 can consist of these features being turned on or off, or the features being absent or present. For example, a feature that is dropped could be represented by a 0, and a feature that is present could be represented by a 1. In some embodiments, sample data 402 may be generated using a normalized probability distribution to minimize the amount of sample data 402 required.
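A minimal sketch of this masking scheme, assuming text-token features, a 50% keep probability, and an empty string as the neutral value (all assumptions for illustration; none of these names come from the disclosure), replaces dropped features and records the binary on/off vector for each sample:

```python
import numpy as np

def perturb(features, rng, keep_prob=0.5, neutral=""):
    """Create one perturbed sample: a binary mask plus the masked feature list.

    Each feature is kept with probability `keep_prob`; dropped features are
    replaced by the `neutral` value (an empty string here, for text tokens).
    """
    mask = (rng.random(len(features)) < keep_prob).astype(int)
    masked = [f if keep else neutral for f, keep in zip(features, mask)]
    return mask, masked

rng = np.random.default_rng(0)
tokens = ["invoice", "1000789", "ABCD", "CORP", "990"]
mask, masked_tokens = perturb(tokens, rng)
# `mask` is the binary feature vector used later in the regression;
# `masked_tokens` is what gets fed to the model to obtain a score.
```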
In some embodiments, sample data 402 may be determined in part by specifying a distance function and a kernel function to determine the sample neighborhood. In further embodiments, the Hamming distance may be used as the distance function. For example, the distance function may be represented as the number of features that are dropped or absent in a sample.
In further still embodiments, an exponential kernel function may be used. For example, the exponential kernel function may be represented as K(d)=Ae−λd, where A and λ are positive real numbers representing hyperparameters that may be selected based on heuristics. In some embodiments, an exponential kernel function may be used such that the sample data is based on a normalized probability distribution, thus reducing the amount of required sample data 402. For example, the kernel values may be normalized over the possible distances to yield a probability distribution P(d) over the distance d.
Such a normalized distribution allows for a loss function in which the sampling is done according to the probability distribution P and S is the number of samples, so that the kernel weighting is absorbed into the sampling itself. In some embodiments, the kernel function may be a cubic function. The samples may be produced by randomly picking a distance by sampling from the discrete distribution of the probability of each distance, and then randomly removing the features associated with that distance. In some embodiments, sample data 402 may be determined using a uniform random distribution, such as turning off each feature with a 50% probability.
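A minimal sketch of this distance-based sampling, assuming an exponential kernel and a simple normalization over the possible distances (an assumption for illustration; the disclosure does not fix the exact normalization), could look like:

```python
import numpy as np

def sample_mask(num_features, rng, A=1.0, lam=0.5):
    """Draw one perturbation mask by first sampling a Hamming distance.

    The probability of dropping d features is proportional to the
    exponential kernel K(d) = A * exp(-lam * d), normalized over d = 1..M.
    """
    distances = np.arange(1, num_features + 1)
    kernel = A * np.exp(-lam * distances)
    probs = kernel / kernel.sum()              # normalized probability distribution over distances
    d = rng.choice(distances, p=probs)         # pick how many features to drop
    mask = np.ones(num_features, dtype=int)
    dropped = rng.choice(num_features, size=d, replace=False)
    mask[dropped] = 0                          # drop the chosen features
    return mask

rng = np.random.default_rng(0)
masks = np.array([sample_mask(8, rng) for _ in range(200)])  # S = 200 samples
```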
Using sample data 402, feature relevancy determination 404 is used to determine weight 406 for each of the features. Feature relevancy determination 404 may minimize a loss function to find the weights associated with each feature. For example, a squared-error loss between the model's output score on each sample and the linear model's prediction may be minimized, wherein the optimal set of weights {right arrow over (w)}*=argmin{right arrow over (w)}L({right arrow over (w)}) of the linear model may be readily interpretable to provide the feature attribution and feature importance, and wherein s({right arrow over (z)}′) represents the machine learning model's output score given the input data 310 and a particular instance {right arrow over (z)}′ of sample data 402. In some embodiments, feature relevancy determination 404 may use a modified local interpretable model-agnostic explanation (LIME) approach. In further embodiments, a linear model such as K-LASSO or ridge regression may be used. In some embodiments, weight 406 is also determined for all pairwise features. For example, the binary feature set may be extended by concatenating a set of engineered pairwise binary features. This allows the feature interactions to be uncovered and the weights for relevant pairwise features to be determined. Thus, not only are all the individual features for each entity assigned a weight, but the pairwise features between multiple entities are also assigned a weight. For example, the loss function including the pairwise binary features may take the same form,
with the difference being that the extended binary feature vector {right arrow over (z)}′pair is used in the ridge regression, where {right arrow over (z)}′pair=({right arrow over (z)}a, {right arrow over (z)}b, {right arrow over (z)}a ×{right arrow over (z)}b). Embodiments are also contemplated for n-tuples of features for an arbitrary n, such as 3, 4, 5, or any other value up to and including the number of features in the data set.
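A minimal sketch of this pairwise extension and ridge fit, assuming scikit-learn's Ridge estimator and precomputed binary masks and model scores (these names and the regularization strength are assumptions for illustration, not part of the disclosure), follows:

```python
import numpy as np
from sklearn.linear_model import Ridge

def pairwise_extend(z_a, z_b):
    """Concatenate the two entities' binary features with their pairwise products."""
    return np.concatenate([z_a, z_b, np.outer(z_a, z_b).ravel()])

def fit_feature_weights(masks_a, masks_b, scores, alpha=1.0):
    """Fit a ridge regression from extended binary masks to the model's output scores.

    masks_a, masks_b : arrays of shape (S, n_a) and (S, n_b) with 0/1 entries
    scores           : array of shape (S,) with the model's score for each sample
    Returns the learned weights for individual features and feature pairs.
    """
    X = np.array([pairwise_extend(za, zb) for za, zb in zip(masks_a, masks_b)])
    reg = Ridge(alpha=alpha).fit(X, scores)
    return reg.coef_
```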
Weight 406 for each feature and pairwise feature indicates the significance of that feature or pairwise feature. For example, in this embodiment the weight of the pair of features of animal 208 and animal 214 may be high, indicating that it is the interaction between these two features, rather than either feature individually, which drives the prediction that the two images match.
At step 502, a machine learning model and input data are received and used to generate a prediction. The input data may be used with the machine learning model such that the machine learning model generates a prediction. In some embodiments, the machine learning model may be an information retrieval machine learning model. In some embodiments, the machine learning model may be received from a user. In other embodiments, the machine learning model may be generated based on training data. A machine learning model may take data as an input and predict an output. The input data may comprise a set of features relevant to the prediction. For example, the machine learning model may be trained to find a matching image for an input image, and the input data may be an image and an image database to search. The image database to search may be located at a separate location and may be given as an identification of the location of the image database. In some embodiments, the machine learning model may be received after being trained on training data.
At step 504, samples are generated. The samples are generated using the input data and the prediction such that they may be used to determine the significance of each feature and feature pair. In some embodiments, the samples may be generated from a normalized probability distribution. Samples may be generated by creating a perturbed sample by replacing a subset of features with a neutral feature. For example, in a text search embodiment individual words may be replaced with an empty or null string. As another example, in an image search embodiment, portions of the images may be grayed out. Other types of input data may have other techniques for generating sample data appropriately, as discussed above. The sample data can be used to determine how the absence of particular features affects the predictions of the machine learning algorithm.
At step 506, a binary feature set may be extended by concatenating a set of engineered pairwise binary features. For example, a normal binary feature set for two images would include the features from a first image and the features from a second image. In some embodiments, a set of engineered pairwise binary features would include the Cartesian product of the features from the first image and the features from the second image. The resulting binary feature set would include the features from the first image, the features from the second image, and the products of all pairs of features from the first image and the second image. Adding the pairwise binary features allows weights to be determined for the pairwise binary features as well as the individual binary features, thus enabling a better understanding of a machine learning model which has feature interactions.
At step 508, the weights are calculated for each binary and pairwise binary feature. In some embodiments, the weights may be a number between zero and one. The weights may be an indication of the significance of a particular feature or feature pair to a prediction from the machine learning model. For example, the weights may simulate a linear regression model for the indicated feature. In some embodiments, the weights are calculated using the generated sample data. In further embodiments, the weights are calculated by minimizing a loss function to measure the impact of each feature and feature pair on the prediction.
At step 510, the weights are transmitted. In some embodiments, the weights may be transmitted to a user in response to the user transmitting a machine learning model. In further embodiments, a subset of the weights may be transmitted. For example, only the highest weight may be transmitted. As another example, the top five weights may be transmitted. In some embodiments, there may be a threshold and only weights above the threshold may be transmitted. In further embodiments, instead of the weights themselves, an ordered list ranked by weight may be transmitted to indicate the order of feature or feature pair significance.
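A minimal sketch of this reporting step (the top-k value of 5 and the optional threshold are arbitrary assumptions for illustration) might select which weights to transmit as follows:

```python
def select_weights(weights, feature_names, top_k=5, threshold=None):
    """Return the most significant (name, weight) pairs, ranked by absolute weight."""
    ranked = sorted(zip(feature_names, weights), key=lambda nw: abs(nw[1]), reverse=True)
    if threshold is not None:
        ranked = [(name, w) for name, w in ranked if abs(w) >= threshold]
    return ranked[:top_k]
```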
Thus, non-transitory, computer-readable media include both volatile and nonvolatile media, removable and nonremovable media, and contemplate media readable by a database. For example, computer-readable media include (but are not limited to) RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile discs (DVD), holographic media or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage, and other magnetic storage devices. These technologies can store data temporarily or permanently. However, unless explicitly specified otherwise, the term “computer-readable media” should not be construed to include physical, but transitory, forms of signal transmission such as radio broadcasts, electrical signals through a wire, or light pulses through a fiber-optic cable. Examples of stored information include computer-useable instructions, data structures, program modules, and other data representations.
Finally, network interface card (NIC) 624 is also attached to system bus 604 and allows computer 602 to communicate over a network such as network 626. NIC 624 can be any form of network interface known in the art, such as Ethernet, ATM, fiber, Bluetooth, or Wi-Fi (i.e., the Institute of Electrical and Electronics Engineers (IEEE) 802.11 family of standards). NIC 624 connects computer 602 to local network 626, which may also include one or more other computers, such as computer 628, and network storage, such as data store 630. Generally, a data store such as data store 630 may be any repository from which information can be stored and retrieved as needed. Examples of data stores include relational or object-oriented databases, spreadsheets, file systems, flat files, directory services such as LDAP and Active Directory, or email storage systems. A data store may be accessible via a complex API (such as, for example, Structured Query Language), a simple API providing only read, write and seek operations, or any level of complexity in between. Some data stores may additionally provide management functions for data sets stored therein such as backup or versioning. Data stores can be local to a single computer such as computer 628, accessible on a local network such as local network 626, or remotely accessible over public Internet 632. Local network 626 is in turn connected to public Internet 632, which connects many networks such as local network 626, remote network 634 or directly attached computers such as computer 636. In some embodiments, computer 602 can itself be directly connected to public Internet 632.
One or more aspects or features of the subject matter described herein can be realized in digital electronic circuitry, integrated circuitry, specially designed application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), computer hardware, firmware, software, and/or combinations thereof. These various aspects or features can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which can be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device. The programmable system or computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
These computer programs, which can also be referred to as programs, software, software applications, applications, components, or code, include machine instructions for a programmable processor, and can be implemented in a high-level procedural language, an object-oriented programming language, a functional programming language, a logical programming language, and/or in assembly/machine language. As used herein, the term “computer-readable medium” refers to any computer program product, apparatus and/or device, such as for example magnetic discs, optical disks, memory, and Programmable Logic Devices (PLDs), used to provide machine instructions and/or data to a programmable processor, including a computer-readable medium that receives machine instructions as a computer-readable signal. The term “computer-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor. The computer-readable medium can store such machine instructions in a non-transitory manner, such as for example as would a non-transient solid-state memory or a magnetic hard drive or any equivalent storage medium. The computer-readable medium can alternatively or additionally store such machine instructions in a transient manner, for example as would a processor cache or other random-access memory associated with one or more physical processor cores.
Many different arrangements of the various components depicted, as well as components not shown, are possible without departing from the scope of the claims below. Embodiments have been described with the intent to be illustrative rather than restrictive. Alternative embodiments will become apparent to readers of this disclosure after and because of reading it. Alternative means of implementing the aforementioned can be completed without departing from the scope of the claims below. Certain features and sub-combinations are of utility and may be employed without reference to other features and sub-combinations and are contemplated within the scope of the claims. Although the present teachings have been described with reference to the embodiments illustrated in the attached drawing figures, it is noted that equivalents may be employed and substitutions made herein without departing from the scope of the present teachings as recited in the claims.
Having thus described various embodiments, what is claimed as new and desired to be protected by Letters Patent includes the following: