Incorporated herein in its entirety is related U.S. patent application Ser. No. 17/232,671 titled DATASET-FREE, APPROXIMATE MARGINAL PERTURBATION-BASED FEATURE ATTRIBUTIONS filed on Apr. 16, 2021 by Zahra Zohrevand et al.
The present invention relates to machine learning (ML) explainability (MLX). Herein are local explanation techniques for black box ML models based on feature importance established by perturbation of a dataset.
Machine learning (ML) and deep learning are becoming ubiquitous for two main reasons: their ability to solve complex problems in a variety of different domains and growth in performance and efficiency of modern computing resources. However, as the complexity of problems continues to increase, so too does the complexity of the ML models applied to these problems.
Deep learning is a prime example of this trend. Other ML algorithms, such as neural networks, may only contain a few layers of densely connected neurons, whereas deep learning algorithms, such as convolutional neural networks, may contain tens to hundreds of layers of neurons performing very different operations. Increasing the depth of the neural model and heterogeneity of layers provides many benefits. For example, going deeper can increase the capacity of the model, improve the generalization of the model, and provide opportunities for the model to filter out unimportant features, while including layers that perform different operations can greatly improve the performance of the model. However, these optimizations come at the cost of increased complexity and reduced human interpretability of model operation.
Explaining and interpreting the results from complex deep learning models is a challenging task compared to many other ML models. For example, a decision tree may perform binary classification based on N input features. During training, the features that have the largest impact on the class predictions are inserted near the root of the tree, while the features that have less impact on class predictions fall near the leaves of the tree. Feature importance can be directly determined by measuring the distance of a decision node to the root of the decision tree.
Such models are often referred to as being inherently interpretable. However, as the complexity of the model increases (e.g., the number of features or the depth of the decision tree increases), it becomes increasingly challenging to interpret an explanation for a model inference. Similarly, even relatively simple neural networks with a few layers can be challenging to interpret, as multiple layers combine the effects of features and increase the number of operations between the model inputs and outputs. Consequently, there is a requirement for alternative techniques to aid with the interpretation of complex ML and deep learning models.
ML explainability (MLX) is the process of explaining and interpreting ML and deep learning models. MLX can be broadly categorized into local and global explainability:
An ML model accepts as input an instance such as a feature vector that is based on many features of various datatypes that respectively have many or infinitely many possible values. Each feature provides a dimension in a vast multidimensional problem space in which a given multi-featured input is only one point. Even though a global explanation may be based on many input instances, most of the multidimensional problem space is missed by those instances, and the instances are separated from each other by huge spatial gaps. Thus, for explaining a particular inference by an ML model for a particular input that almost always falls within such a spatial gap of unknown behavior of the ML model, a global explanation may have low accuracy. An approach such as Shapley for local explaining requires a number of input instances and output inferences that grows exponentially with the number of features because, by design, Shapley explores relations between features, which is combinatorially intractable. In other words, best-of-breed local explainers are not scalable and may be computationally overwhelmed by a wide feature vector.
In the drawings:
In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present invention.
Herein are explanation techniques that extract local feature importance for a trained machine learning (ML) or deep learning (DL) model, referred to as a black box model. To locally explain the behavior of the ML model, perturbation-based ML explanation (MLX) techniques evaluate how the predictions of the ML model change on permuted versions of an instance to be explained. A feature that, when permuted, has a much larger effect on the ML model's predictions is considered to be more important than a permuted feature that results in little-to-no change in the ML model's predictions. Herein is the first highly-stable, linear-time, perturbation-based, model-agnostic, local feature attribution approach for MLX.
Techniques herein sample a feature value from the empirical marginal distribution of a reference dataset for increased realism that avoids assessing the importance of a feature completely outside the domain of realistic values (e.g., a negative value for a weight feature). By using the underlying data distributions for permutation, overall quality (i.e. accuracy) of the explanations may quantitatively improve because the generated data instances can explore parts of the ML model's multidimensional latent space that may be encountered by future realistic instances that were not observed in the reference dataset. Likewise, this approach may decrease the number of instances that must be generated to obtain an explanation of equal quality, thereby decreasing consumption of time and space. As explained below, this approach counterintuitively can both decrease instances and increase explanation accuracy, which is more or less impossible with Shapley based techniques where instances and accuracy are naturally positively correlated.
Data distribution is crucial for realism. Perturbing an original instance to generate a new instance may lead to out-of-distribution samples. Unrealistic data is problematic because it may confuse an ML model, which decreases accuracy of inferencing such as classification. In other words, unrealistic instances occur in regions of a multidimensional problem space where the ML model is unreliable or even unstable such as prone to unpredictable discontinuities in the prediction solution space that prevent an instance from being modified or used as-is in the real world as predicted. Thus, unrealistic instances have little explanatory value and may undermine confidence in MLX.
Important local MLX use cases are interactive and do not tolerate latency well. Customer experience (CX) may be at stake. For example, local MLX may be used during a phone conversation such as with a support or sales agent. A localized neighborhood of perturbed instances should be quickly generated. Optimizing the above concerns and criteria is expensive with high dimensional datasets having many constituent datatypes.
Unlike Shapley, this approach only requires a linear number of inferences and perturbed tuples. That provides acceleration that may be used to increase the density and/or radius of a sampled neighborhood that surrounds the instance to be explained to increase the accuracy of a local explanation. Increased accuracy makes an explaining computer itself more reliable.
As discussed in the Background, Shapley succumbs to intractable feature combinatorics. Because techniques herein do not assess the impact of interactions between features, complexity of computing local perturbation-based feature attributions is reduced from exponential to linear time, which is a substantial acceleration previously thought possible only by sacrificing accuracy and stability such as with global explaining that is prone to unstable discontinuities of accuracy due to sparsity of multidimensional spatial exploration. The approach herein provides increased neighborhood density that increases explanatory accuracy and stability while, counterintuitively, also providing acceleration.
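The difference between linear and exponential growth can be illustrated with simple arithmetic. The following sketch is illustrative only: the function names and the choice of 100 perturbed tuples per feature are assumptions, and the exponential count reflects the general fact that exact Shapley values enumerate every subset of features.

```python
def perturbation_inference_count(n_features, samples_per_feature):
    # One batch of perturbed tuples per feature: grows linearly with feature count.
    return n_features * samples_per_feature


def exact_shapley_coalition_count(n_features):
    # Exact Shapley values consider every subset (coalition) of features.
    return 2 ** n_features


# Hypothetical workload: 30 features, 100 perturbed tuples per feature.
linear = perturbation_inference_count(30, 100)
exponential = exact_shapley_coalition_count(30)
print(linear, exponential)
```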
In an embodiment, a computer hosts a machine learning (ML) model that infers a particular inference for a particular tuple that is based on many features. For each feature, and for each of many original tuples, the computer: a) randomly selects many perturbed values from original values of the feature in the original tuples, b) generates perturbed tuples that are based on the particular tuple and a respective perturbed value, c) causes the ML model to infer a respective perturbed inference for each perturbed tuple, and d) measures a respective difference between each perturbed inference of the perturbed tuples and the inference of the particular tuple. For each feature, a respective importance of the feature is calculated based on the differences measured for the feature. Feature importances may be used to rank features by influence and/or generate a local ML explainability (MLX) explanation.
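The embodiment above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not a definitive implementation: the model is assumed to be a callable that returns one number, tuples are plain sequences, the absolute difference and the mean are one of several permitted choices, and every name (feature_importances, samples_per_feature, toy_model) is hypothetical.

```python
import random
from statistics import mean


def feature_importances(model, original_tuples, tuple_to_explain,
                        samples_per_feature=3, seed=0):
    """Return one importance score per feature (hypothetical helper)."""
    rng = random.Random(seed)
    baseline = model(tuple_to_explain)  # inference for the tuple to explain
    importances = []
    for f in range(len(tuple_to_explain)):
        differences = []
        for _ in range(samples_per_feature):
            # a) sample a perturbed value from the feature's original values
            value = rng.choice(original_tuples)[f]
            # b) copy the tuple to explain, replacing only feature f
            perturbed = list(tuple_to_explain)
            perturbed[f] = value
            # c) infer for the perturbed tuple and d) measure the difference
            differences.append(abs(model(perturbed) - baseline))
        # aggregate the differences of this feature into its importance
        importances.append(mean(differences))
    return importances


def toy_model(t):
    # Assumed stand-in for a black box model: depends only on feature 0.
    return 2.0 * t[0]


corpus = [(1.0, 5.0), (2.0, 6.0), (3.0, 7.0), (4.0, 8.0)]
scores = feature_importances(toy_model, corpus, (10.0, 0.0))
print(scores)
```

In this toy case the second feature never changes the inference, so its importance is zero, while the first feature receives a positive score.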
In various embodiments, hosted in memory of computer 100 is already-trained ML model 160 that may operate for classification, regression, prediction, anomaly detection, clustering, or other ML purpose. In operation, ML model 160 is applied to a tuple such as tuple 150 to generate an inference such as inference 170 that may be a class or a value of a regression or prediction. In an embodiment, inference 170 contains one or more numeric scores or probabilities such as a respective probability for each of multiple classes. In an embodiment, inference 170 is numeric and compared to a threshold to detect whether or not tuple 150 is anomalous. Tuples are explained later herein.
ML model 160 may be a black-box model that has an unknown, opaque, or confusing architecture that more or less precludes direct inspection and interpretation of the internal operation of ML model 160. In an embodiment not shown, ML model 160 is hosted in a different computer that is not computer 100, and computer 100 applies techniques herein by remotely using ML model 160. For example, computer 100 may send tuple 150 to ML model 160 over a communication network and responsively receive inference 170 over the communication network. For example, computer 100 and ML model 160 may be owned by different parties and/or hosted in different data centers. In various embodiments that host ML model 160 in computer 100, techniques herein may or may not share an address space and/or operating system process with ML model 160. For example, inter-process communication (IPC) may or may not be needed to invoke ML model 160.
Approaches herein generate local explanations of ML model 160. As explained later herein, a local explanation explains inference I2 by ML model 160 for tuple to explain 151 that may be known or new. As explained below, corpus 110 and/or ML model 160 participate in a sequence of phases that include: training of ML model 160 and MLX invocation that performs exploration 140 based on tuple to explain 151 before generating a local explanation.
In various scenarios, tuple to explain 151 and its inference I2, and/or ML model 160 are reviewed for various reasons. MLX herein can provide combinations of any of the following functionalities:
For example, the explanation may be needed for regulatory compliance. Likewise, the explanation may reveal an edge case that causes ML model 160 to malfunction for which retraining with different data or a different hyperparameters configuration is needed.
Training of ML model 160 entails a training corpus that contains training tuples. In various embodiments, the training corpus is or is not corpus 110. In an embodiment, ML model 160 is supervised, which means that training of ML model 160 is supervised and the tuples of the training corpus are each labeled with a respective known correct inference. In various embodiments explained later herein, training ML model 160: a) is or is not supervised, and b) occurs on computer 100 or a different computer. In any case, ML model 160 is already trained in
Corpus 110 may or may not be used in any of training, validation, and testing of ML model 160. Essentially, original tuples 121 are a small portion of a multidimensional problem space, with each of features 131 providing a respective dimension, that ML model 160 could map to inferences that would provide an additional dimension to a multidimensional solution space.
Corpus 110 includes metadata and data 120 that computer 100 stores or has access to. In an embodiment, metadata is stored or cached in volatile memory, and data 120 is stored in nonvolatile storage that is local or remote. Data 120 defines a portion of the multidimensional problem space and includes original tuples 121. Original tuples 121 are respective points in the multidimensional problem space. Original tuples 121 include individual original tuples T1-T4 that collectively contain original values 122 that include individual values V1-V9. Each of original tuples 121 contains a respective value for each of features 131 that include individual features F1-F4. For example as shown, the value of feature F1 in original tuples T1-T2 is value V1.
Metadata generalizes or otherwise describes data 120. Metadata includes features 131 that can describe tuple 150 that is shown with a dashed outline to demonstrate that tuple 150 may be any individual tuple of tuples 121, 141, or 151.
Tuple 150 contains a respective value for each of features 131. In an embodiment, tuple 150 is, or is used to generate, a feature vector that ML model 160 accepts and that contains more or less densely encoded respective values for features 131. Each of features 131 has a respective datatype. For example, features F1 and F3 may have a same datatype. A datatype may variously be: a) a number that is an integer or real, b) a primitive type such as a Boolean or text character that can be readily encoded as a number, c) a sequence of discrete values such as text literals that have a semantic ordering such as months that can be readily encoded into respective numbers that preserve the original ordering, or d) a category that enumerates distinct categorical values that are semantically unordered.
Categories are prone to discontinuities that may or may not seemingly destabilize ML model 160 such that different categorical values for a same feature may or may not cause ML model 160 to generate very different inferences. One categorical feature may be hash encoded into one number in a feature vector or n-hot or 1-hot encoded into multiple numbers. For example, 1-hot encoding generates a one for a categorical value that actually occurs in a tuple and also generates a zero for each possible categorical value that did not occur in the tuple.
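As an illustration of one-hot encoding of a single categorical feature, consider the following sketch. The function name and the example category list are hypothetical.

```python
def one_hot(value, categories):
    # One for the categorical value that occurs, zero for every other possible value.
    return [1 if value == category else 0 for category in categories]


colors = ["red", "green", "blue"]  # hypothetical categorical feature
encoded = one_hot("green", colors)
print(encoded)
```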
Tuple 150 may represent various objects in various embodiments. For example, tuple 150 may be or represent a network packet, a record such as a database table row, or a log entry such as a line of text in a console output logfile. Likewise, features 131 may be respective data fields, attributes, or columns that can occur in each object instance. For example, inference 170 may be a binary classification or an anomaly score that indicates whether or not tuple 150 is anomalous such as based on a threshold. When ML model 160 detects an anomaly in a production environment, an alert may be generated to provoke a human or automated security reaction such as terminating a session or network connection, rejecting tuple 150 from further processing, and/or recording, diverting, and/or alerting tuple 150 for more intensive manual or automatic inspection and analysis.
ML model 160 generates or previously generated inference I2 for original tuple to explain 151. In an embodiment that classifies tuple 150 into one of four mutually exclusive classes, inference 170 may be any of inferences I1-I4. However, ML model 160 may have imperfect accuracy that sometimes causes inference 170 to be wrong and not match a label of tuple 150 that is an actually known correct class of tuple 150.
Each of original tuples 121 may be a point in a multidimensional problem space defined by features 131. Although there may be hundreds of thousands of original tuples 121 that each may be a distinct combination of values of features 131 that is a distinct point in the multidimensional problem space, most or nearly all possible points in the multidimensional problem space do not occur in original tuples 121. Thus, inference 170 is unknown for most or nearly all possible points in the multidimensional problem space. Thus, a global explanation based on original tuples 121 would likely have limited accuracy, especially because known points in the multidimensional problem space are usually separated by regions with many possible tuples whose inference 170 is unknown.
Computer 100 generates a local explanation that is more accurate than a global explanation as follows. Inference 170 depends on the values of features 131 in tuple 150. By concentrating the generation of an explanation on the neighborhood of possible points that surround tuple to explain 151 in the multidimensional problem space, the accuracy of the local explanation is increased. Exploration 140 uses sampling of original values 122 to explore the neighborhood around tuple to explain 151 as follows.
Exploration 140 generates perturbed tuples 141 that are probabilistic variations of tuple to explain 151 based on original values 122. As shown, perturbed tuples 141 contain multiple (e.g. three as shown) perturbed variations of tuple to explain 151 for each of features 131. Each perturbed tuple is almost a perfect copy of tuple to explain 151, except that the value of one of features 131 is perturbed (i.e. not a copy). Perturbed 144 demonstratively indicates that feature F2 is perturbed in the shown perturbed tuples 141. However, perturbed tuples 141 also contain an equal number of unshown tuples that respectively perturb each of other features F1 and F3-F4. In other words, perturbed tuples 141 may have some count of tuples that is a multiple of a count of features 131.
Thus, most of values 142 are copies of original values 122 as follows. As shown, perturbed tuples P1-P2 and P4 are copies of tuple to explain 151 but with respective perturbed values V2, V5, and V3 for feature F2 that instead had original value V8 in tuple to explain 151. Perturbed values are randomly sampled from values of feature F2 in original values 122. For example as shown for feature F2 in values 142, perturbed tuples P1-P2 and P4 have respective values from original tuples T1-T2 and T4. Thus, the value distribution of feature F2 in perturbed tuples P1-P2 and P4 is bounded by the same value range as the original values of the feature and should have more or less a same probability distribution of value frequencies. Because sampling is random, some statistical distortions may occur. For example, some of original tuples 121 might not be sampled for some or all of features 131. For example for feature F2, original tuple T3 is not sampled.
In an embodiment, random selection entails generating real numbers that are inclusively or exclusively between zero and one, and such a real number can be scaled to fit into an integer range that is limited by a count of original tuples 121. For example, the random number may be scaled to be in a range of 0-3 for original tuples T1-T4 respectively. In various embodiments, a perturbed tuple should not match: a) tuple to explain 151 nor b) any other perturbed tuple. For example if a randomly generated perturbed value causes such a match, then another perturbed value may be randomly and repeatedly generated until a unique perturbed tuple is generated.
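The scaling of a random real number into a tuple index, and the retry that enforces uniqueness, may be sketched as follows. This is a sketch under assumptions: the function name, the max_attempts guard against an endless loop, and the example data are all hypothetical and are not taken from the description above.

```python
import random


def sample_unique_perturbations(original_tuples, tuple_to_explain, feature,
                                count, rng, max_attempts=1000):
    seen = {tuple(tuple_to_explain)}  # a perturbed tuple must not match this
    result = []
    attempts = 0
    # max_attempts is an assumed safeguard, not part of the described technique
    while len(result) < count and attempts < max_attempts:
        attempts += 1
        r = rng.random()                       # real number in [0, 1)
        index = int(r * len(original_tuples))  # scaled to 0 .. n-1
        candidate = list(tuple_to_explain)
        candidate[feature] = original_tuples[index][feature]
        key = tuple(candidate)
        if key not in seen:                    # otherwise sample again
            seen.add(key)
            result.append(key)
    return result


corpus = [("a", 1), ("b", 2), ("c", 3), ("d", 8)]  # hypothetical original tuples
explain = ("z", 8)                                 # hypothetical tuple to explain
perturbed = sample_unique_perturbations(corpus, explain, 1, 3, random.Random(0))
print(perturbed)
```

Because the last original tuple shares its value for the perturbed feature with the tuple to explain, sampling it would reproduce the tuple to explain, so that candidate is rejected and sampling repeats.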
For example, exploration 140 may be designed to only generate perturbed tuples having distinct combinations of features values. Likewise, tuple to explain 151 (and thus values 142) may contain a value for an unperturbed feature that does not occur for that feature and/or any other feature in original values 122. For example as shown, value V2 occurs for feature F3 in values 142 but not for feature F3 in original values 122. Likewise as shown, value V0 occurs for feature F4 in tuple to explain 151 but not for any of features 131 in original values 122.
Because perturbed tuples P1-P2 and P4 are imperfect copies of tuple to explain 151, ML model 160 may generate same or, as shown, different inferences I1-I3 for almost identical tuples P1-P2, P4, and 151. Inference 170 is shown with a dashed outline to demonstrate that inference 170 may be any individual inference of perturbed inferences 143 or the inference of tuple to explain 151. In various embodiments discussed below, differences 145 are measured differences between perturbed inferences 143 and either a known correct label of tuple to explain 151 or inference I2 of tuple to explain 151.
If inferences are numbers such as scores, probabilities, counts, or amounts, then subtraction may measure the difference between an inference by ML model 160 and a known correct inference (e.g. label). In a supervised embodiment: a) ML model 160 may or may not be supervised, b) tuple to explain 151 is labeled with a known correct inference, and c) loss of any inference as compared to a label may be quantified such as by subtractive difference if inferences are numeric or binary difference (i.e. zero means identical, one means different) if inferences are not numeric. Inferences that are naturally ordered such as months may be differenced by subtraction based on ordinal integers. For example, depending on whether ordinals are zero or one based, March may be represented as two or three. As explained earlier herein, inference 170 may be inaccurate and thus have nonzero loss. In an embodiment, a label of tuple to explain 151 is interactively entered along with an identifier of tuple to explain 151 at the start of an interactive MLX invocation.
Unless a perturbed tuple is an exact duplicate of one of original tuples 121 that is labeled, perturbed tuples 141 are unlabeled. Loss for a perturbed tuple is measured with the perturbed inference and the label of tuple to explain 151, not inference I2 for tuple to explain 151. Thus, perturbed loss and loss for tuple to explain 151 may or may not be equal because the perturbed inference and the inference for tuple to explain 151 may or may not be identical. In that embodiment, the difference between inference I2 for tuple to explain 151 and one of perturbed inferences 143 is an arithmetic (e.g. subtraction) difference between original loss and perturbed loss, and squaring or absolute value may be used to ensure a non-negative difference.
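The supervised difference measurement may be sketched as follows. This is an illustrative sketch: the function names are hypothetical, absolute value is chosen here (squaring is equally permitted), and the numeric inputs are invented for demonstration.

```python
def loss(inference, label):
    # Subtractive loss for numeric inferences, binary loss otherwise.
    if isinstance(inference, (int, float)) and isinstance(label, (int, float)):
        return abs(inference - label)
    return 0 if inference == label else 1


def loss_difference(perturbed_inference, original_inference, label):
    # Both losses are measured against the label of the tuple to explain.
    perturbed_loss = loss(perturbed_inference, label)
    original_loss = loss(original_inference, label)
    return abs(perturbed_loss - original_loss)  # non-negative difference


numeric = loss_difference(0.9, 0.6, 1.0)        # hypothetical probabilities
categorical = loss_difference("I1", "I2", "I2")  # hypothetical class inferences
print(numeric, categorical)
```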
In various other embodiments, difference measurement instead is unsupervised and does not use or does not have known correct labels for tuples 121, 141, and 151. In an embodiment, ML model 160 is reconstructive (and usually unsupervised), which means that the output of ML model 160 includes, in addition to inference 170, a reconstruction that is a more or less exact copy of tuple 150. For example, ML model 160 may be an autoencoder as discussed later herein. Loss, although used as discussed above for the preferred embodiment, is instead reconstruction loss that is quantified by comparing input tuple 150 to the output reconstruction that, in some embodiments, is based on aggregating losses of features 131 individually. Reconstruction loss compares either an original tuple to its reconstruction or a perturbed tuple to its reconstruction. If ML model 160 internally generates a reconstruction but outputs neither the reconstruction nor its loss, then computer 100 can access the internal reconstruction to measure loss in an embodiment where ML model 160 is not a black box.
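Reconstruction loss that aggregates per-feature losses may be sketched as follows. Mean squared error is an assumption made for this sketch; the description above does not prescribe a particular aggregation, and the function name and numeric features are hypothetical.

```python
def reconstruction_loss(input_tuple, reconstruction):
    # Per-feature squared error, aggregated by averaging (an assumed choice).
    per_feature = [(x - y) ** 2 for x, y in zip(input_tuple, reconstruction)]
    return sum(per_feature) / len(per_feature)


example_loss = reconstruction_loss([1.0, 2.0], [1.0, 4.0])
print(example_loss)
```

The same function applies whether the input is an original tuple or a perturbed tuple, because each is compared only to its own reconstruction.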
When neither labels, loss, nor reconstruction is available, unsupervised difference measurement may instead occur by measuring the difference between inference I2 of tuple to explain 151 and one of perturbed inferences 143 as follows. When inference I2 of tuple to explain 151 and the perturbed inference are identical, the difference is zero. If the inferences are not identical for a categorical feature, a constant nonzero value such as one is used as a difference as shown. Otherwise, a difference may be measured by arithmetic subtraction, in which case squaring or absolute value may be used to ensure non-negative values.
For example if inferences I1 and I2 are numbers such as scores, probabilities, counts, or amounts, then subtraction may measure their difference. Subtraction may also measure differences for values of a sequential range such as months. For example, inference I1 may be two that represents February, inference I2 may be five that represents May, and their difference may be 5-2=3. In an embodiment, a range such as months may be cyclic. For example, a difference between December of the previous year and January of the next year is only one, not eleven.
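A cyclic difference can be computed by taking the shorter of the two ways around the cycle. The following sketch assumes one-based month ordinals; the function name and the period parameter are hypothetical.

```python
def cyclic_difference(a, b, period=12):
    # Distance going one way around the cycle, or the other way, whichever is shorter.
    d = abs(a - b) % period
    return min(d, period - d)


print(cyclic_difference(2, 5))   # February to May
print(cyclic_difference(12, 1))  # December to January
```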
In an embodiment, differences are statistical and instead measured in units such as standard deviations based on the statistical distribution of inferences in many tuples such as original tuples 121 for the feature. For example, inference 170 may be a threat level in a logarithmic range from 0-5, in which case the statistical difference between zero and one may be one standard deviation that is less than the statistical difference between four and five that may instead be multiple standard deviations.
As explained above, perturbed tuples 141 only shows perturbations of feature F2 but also contains unshown perturbed tuples for other features. Thus, each of features 131 has an equal count of measurements in differences 145 even though only measurements for feature F2 are shown. Thus, each of features 131 has its own set of differences from which a same aggregate statistic may be derived such as mean, mode, or maximum. That aggregate statistic may be used as a respective importance score for each of features 131. Features 131 may be ranked (i.e. sorted) by importance score to establish a relative ordering of influence of features 131 on the inferential operation of ML model 160.
For example based on shown measurements, the average difference for feature F2 is 0.67. Likewise, the average difference for feature F1 may be 0.2. In that case, feature F2 is more influential than feature F1 on the operation of ML model 160. Thus, feature F2 should have more explanatory power for MLX than does feature F1. Thus, a local explanation of ML model 160 would emphasize feature F2 over feature F1.
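The aggregation and ranking may be sketched as follows. The individual difference values are hypothetical and were chosen only so that their means reproduce the averages 0.67 and 0.2 stated above; the mean is one of several aggregate statistics that may be used.

```python
from statistics import mean

# Hypothetical per-feature differences chosen to match the stated averages.
differences = {
    "F1": [0.2, 0.2, 0.2],
    "F2": [1, 0, 1],
}

# Importance score of each feature is the mean of its differences.
importance = {feature: mean(values) for feature, values in differences.items()}

# Rank features from most to least influential.
ranking = sorted(importance, key=importance.get, reverse=True)
print(importance, ranking)
```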
Within memory of computer 100, a local explanation may be a data structure that is based on or contains a ranking of features 131 by importance score and/or exclude a threshold count of least influential features or features whose importance score falls below a threshold. For example a local explanation may be limited to a top two most influential features or a variable count of features having an importance score of at least 0.4. Explanation generation is discussed later herein.
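Both ways of limiting a local explanation may be sketched as follows. The function name, its parameters, and the importance scores for features F3-F4 are hypothetical.

```python
def build_explanation(importance, top_k=None, min_score=None):
    # Rank features by importance score, most influential first.
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    if min_score is not None:
        # Exclude features whose importance score falls below the threshold.
        ranked = [(f, s) for f, s in ranked if s >= min_score]
    if top_k is not None:
        # Keep only a fixed count of most influential features.
        ranked = ranked[:top_k]
    return ranked


scores = {"F1": 0.2, "F2": 0.67, "F3": 0.5, "F4": 0.1}  # hypothetical scores
top_two = build_explanation(scores, top_k=2)
at_least = build_explanation(scores, min_score=0.4)
print(top_two, at_least)
```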
As shown, data flows from left to right from inputs on the left to internal and intermediate data structures in the middle to output on the right. Inputs are as follows.
As shown by arrows on the left, the inputs are injected into a nested and dashed rectangle in which internal and intermediate data are shown that are as follows.
The output is feature importances I for all features 131 that may be sorted to rank features 131 by influence.
MLX may occur in three sequential phases that are: a) unexplained inferencing, b) MLX invocation, and c) MLX explanation. The process of
Much time (e.g. weeks) may separate unexplained inferencing by step 302 and MLX invocation that causes the remaining steps of
Steps 303A-D are repeated for each of features 131. Steps 303A-B may be combined to populate a subset of rows shown in perturbed tuples 141 and values 142 that correspond to feature F2. That subset of rows is a fixed count of perturbed tuples that step 303B generates. Each of those perturbed tuples is based on tuple to explain 151 and, as provided by step 303A, a respective perturbed value for the feature. Step 303A generates perturbed values for the feature based on randomly sampling values of the feature in original values 122. For example as shown, step 303B may generate perturbed tuples P1-P2 and P4 as copies of tuple to explain 151 and, in perturbed tuples P1-P2 and P4, step 303A may provide perturbed values for feature F2 as sampled from original values 122.
Repetition of steps 303A-B for each of features 131 fully populates perturbed tuples 141 and values 142. Repetition of step 303C fully populates perturbed inferences 143. Step 303C applies ML model 160 to each perturbed tuple in the subset of rows of perturbed tuples 141 to cause ML model 160 to infer respective perturbed inferences that may or may not match inference I2 of tuple to explain 151.
Repetition of step 303D fully populates differences 145. Step 303D measures respective differences between each perturbed inference of the subset of rows of perturbed inferences 143 and, depending on the embodiment, either the known correct label or inference I2 of tuple to explain 151 as explained earlier herein. For horizontal scaling, steps 303A-D may be concurrently performed by a separate execution context respectively for each of features 131 or batched subsets of features.
An execution context may be based on a lightweight thread, an operating system process, a hyper thread, a processing core of a central processing unit (CPU), a CPU, a coprocessor, and/or a separate computer. For pipeline parallelism additionally or instead, a first pipeline stage may perform steps 303A-B, a second may perform step 303C, and a third may perform step 303D. For example, the third stage may measure inference differences for perturbed tuples for feature F1 while the second stage generates perturbed inferences for perturbed tuples P1-P2 and P4 for feature F2 while the first stage generates perturbed tuples for feature F3.
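Horizontal scaling by feature may be sketched with a thread pool in which each task performs the sampling, perturbation, inference, and difference measurement for one feature. This is a sketch under assumptions: threads are only one of the execution contexts listed above, the model is assumed to be a thread-safe callable that returns a number, and all names are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def explore_feature(model, original_tuples, tuple_to_explain, feature, count, seed):
    rng = random.Random(seed + feature)  # independent generator per context
    baseline = model(tuple_to_explain)
    differences = []
    for _ in range(count):
        perturbed = list(tuple_to_explain)
        # sample a perturbed value and build the perturbed tuple
        perturbed[feature] = rng.choice(original_tuples)[feature]
        # infer for the perturbed tuple and measure the difference
        differences.append(abs(model(perturbed) - baseline))
    return mean(differences)


def parallel_importances(model, original_tuples, tuple_to_explain, count=3, seed=0):
    features = range(len(tuple_to_explain))
    with ThreadPoolExecutor() as pool:
        # One task per feature; map preserves feature order in the result.
        return list(pool.map(
            lambda f: explore_feature(model, original_tuples, tuple_to_explain,
                                      f, count, seed),
            features))


def toy_model(t):
    # Assumed stand-in for a black box model: depends only on feature 0.
    return 2.0 * t[0]


corpus = [(1.0, 5.0), (2.0, 6.0), (3.0, 7.0), (4.0, 8.0)]
parallel_scores = parallel_importances(toy_model, corpus, (10.0, 0.0))
print(parallel_scores)
```

Because features are explored independently of each other, no synchronization is needed between tasks other than collecting their results.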
For each of features 131, step 304 calculates a respective importance based on inference differences measured for the feature as explained earlier herein. After step 304, computer 100 may: a) rank features 131 and/or retain their respective importance scores, b) discard exploration 140, and/or c) generate a local MLX explanation for tuple to explain 151 as discussed elsewhere herein.
Respective importance scores for features 131 are the output of the pseudocode and may be used to rank features 131 and/or generate a local explanation as discussed earlier herein.
One or more of the following examples may be implemented in an embodiment of computer 100. These examples are discussed with reference to
As explained earlier herein, a local explanation of ML model 160 may be generated based on tuple to explain 151.
As explained earlier herein, a local explanation of ML model 160 may be based on the importance of at least one of features 131.
As explained earlier herein, a local explanation may contain a ranking of at least two of features 131 based on importances of the at least two features.
As explained earlier herein, original tuples 121 need not contain tuple to explain 151.
As explained earlier herein, original values 122 of one of features 131 need not contain the value of that feature in tuple to explain 151.
As explained earlier herein, original values 122 need not contain every value in tuple to explain 151.
As explained earlier herein, measuring differences 145 may entail measuring a respective difference between a respective loss of each of perturbed tuples 141 and a loss of inference I2 for tuple to explain 151.
As explained earlier herein, ML model 160 may be unsupervised, which means trained without supervision.
As explained earlier herein, original tuples 121 may be unlabeled, which means without known correct inferences.
As explained earlier herein, two execution contexts may concurrently generate two respective tuples of perturbed tuples 141.
Additional complementary local and global explanation techniques are presented in related U.S. patent application Ser. No. 17/232,671.
According to one embodiment, the techniques described herein are implemented by one or more special-purpose computing devices. The special-purpose computing devices may be hard-wired to perform the techniques, or may include digital electronic devices such as one or more application-specific integrated circuits (ASICs) or field programmable gate arrays (FPGAs) that are persistently programmed to perform the techniques, or may include one or more general purpose hardware processors programmed to perform the techniques pursuant to program instructions in firmware, memory, other storage, or a combination. Such special-purpose computing devices may also combine custom hard-wired logic, ASICs, or FPGAs with custom programming to accomplish the techniques. The special-purpose computing devices may be desktop computer systems, portable computer systems, handheld devices, networking devices or any other device that incorporates hard-wired and/or program logic to implement the techniques.
For example,
Computer system 500 also includes a main memory 506, such as a random access memory (RAM) or other dynamic storage device, coupled to bus 502 for storing information and instructions to be executed by processor 504. Main memory 506 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 504. Such instructions, when stored in non-transitory storage media accessible to processor 504, render computer system 500 into a special-purpose machine that is customized to perform the operations specified in the instructions.
Computer system 500 further includes a read only memory (ROM) 508 or other static storage device coupled to bus 502 for storing static information and instructions for processor 504. A storage device 510, such as a magnetic disk, optical disk, or solid-state drive is provided and coupled to bus 502 for storing information and instructions.
Computer system 500 may be coupled via bus 502 to a display 512, such as a cathode ray tube (CRT), for displaying information to a computer user. An input device 514, including alphanumeric and other keys, is coupled to bus 502 for communicating information and command selections to processor 504. Another type of user input device is cursor control 516, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 504 and for controlling cursor movement on display 512. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane.
Computer system 500 may implement the techniques described herein using customized hard-wired logic, one or more ASICs or FPGAs, firmware and/or program logic which in combination with the computer system causes or programs computer system 500 to be a special-purpose machine. According to one embodiment, the techniques herein are performed by computer system 500 in response to processor 504 executing one or more sequences of one or more instructions contained in main memory 506. Such instructions may be read into main memory 506 from another storage medium, such as storage device 510. Execution of the sequences of instructions contained in main memory 506 causes processor 504 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions.
The term “storage media” as used herein refers to any non-transitory media that store data and/or instructions that cause a machine to operate in a specific fashion. Such storage media may comprise non-volatile media and/or volatile media. Non-volatile media includes, for example, optical disks, magnetic disks, or solid-state drives, such as storage device 510. Volatile media includes dynamic memory, such as main memory 506. Common forms of storage media include, for example, a floppy disk, a flexible disk, hard disk, solid-state drive, magnetic tape, or any other magnetic data storage medium, a CD-ROM, any other optical data storage medium, any physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, NVRAM, any other memory chip or cartridge.
Storage media is distinct from but may be used in conjunction with transmission media. Transmission media participates in transferring information between storage media. For example, transmission media includes coaxial cables, copper wire and fiber optics, including the wires that comprise bus 502. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infra-red data communications.
Various forms of media may be involved in carrying one or more sequences of one or more instructions to processor 504 for execution. For example, the instructions may initially be carried on a magnetic disk or solid-state drive of a remote computer. The remote computer can load the instructions into its dynamic memory and send the instructions over a telephone line using a modem. A modem local to computer system 500 can receive the data on the telephone line and use an infra-red transmitter to convert the data to an infra-red signal. An infra-red detector can receive the data carried in the infra-red signal and appropriate circuitry can place the data on bus 502. Bus 502 carries the data to main memory 506, from which processor 504 retrieves and executes the instructions. The instructions received by main memory 506 may optionally be stored on storage device 510 either before or after execution by processor 504.
Computer system 500 also includes a communication interface 518 coupled to bus 502. Communication interface 518 provides a two-way data communication coupling to a network link 520 that is connected to a local network 522. For example, communication interface 518 may be an integrated services digital network (ISDN) card, cable modem, satellite modem, or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 518 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 518 sends and receives electrical, electromagnetic or optical signals that carry digital data streams representing various types of information.
Network link 520 typically provides data communication through one or more networks to other data devices. For example, network link 520 may provide a connection through local network 522 to a host computer 524 or to data equipment operated by an Internet Service Provider (ISP) 526. ISP 526 in turn provides data communication services through the world wide packet data communication network now commonly referred to as the “Internet” 528. Local network 522 and Internet 528 both use electrical, electromagnetic or optical signals that carry digital data streams. The signals through the various networks and the signals on network link 520 and through communication interface 518, which carry the digital data to and from computer system 500, are example forms of transmission media.
Computer system 500 can send messages and receive data, including program code, through the network(s), network link 520 and communication interface 518. In the Internet example, a server 530 might transmit a requested code for an application program through Internet 528, ISP 526, local network 522 and communication interface 518.
The received code may be executed by processor 504 as it is received, and/or stored in storage device 510, or other non-volatile storage for later execution.
Software system 600 is provided for directing the operation of computer system 500. Software system 600, which may be stored in system memory (RAM) 506 and on fixed storage (e.g., hard disk or flash memory) 510, includes a kernel or operating system (OS) 610.
The OS 610 manages low-level aspects of computer operation, including managing execution of processes, memory allocation, file input and output (I/O), and device I/O. One or more application programs, represented as 602A, 602B, 602C . . . 602N, may be “loaded” (e.g., transferred from fixed storage 510 into memory 506) for execution by the system 600. The applications or other software intended for use on computer system 500 may also be stored as a set of downloadable computer-executable instructions, for example, for downloading and installation from an Internet location (e.g., a Web server, an app store, or other online service).
Software system 600 includes a graphical user interface (GUI) 615, for receiving user commands and data in a graphical (e.g., “point-and-click” or “touch gesture”) fashion. These inputs, in turn, may be acted upon by the system 600 in accordance with instructions from operating system 610 and/or application(s) 602. The GUI 615 also serves to display the results of operation from the OS 610 and application(s) 602, whereupon the user may supply additional inputs or terminate the session (e.g., log off).
OS 610 can execute directly on the bare hardware 620 (e.g., processor(s) 504) of computer system 500. Alternatively, a hypervisor or virtual machine monitor (VMM) 630 may be interposed between the bare hardware 620 and the OS 610. In this configuration, VMM 630 acts as a software “cushion” or virtualization layer between the OS 610 and the bare hardware 620 of the computer system 500.
VMM 630 instantiates and runs one or more virtual machine instances (“guest machines”). Each guest machine comprises a “guest” operating system, such as OS 610, and one or more applications, such as application(s) 602, designed to execute on the guest operating system. The VMM 630 presents the guest operating systems with a virtual operating platform and manages the execution of the guest operating systems.
In some instances, the VMM 630 may allow a guest operating system to run as if it is running on the bare hardware 620 of computer system 500 directly. In these instances, the same version of the guest operating system configured to execute on the bare hardware 620 directly may also execute on VMM 630 without modification or reconfiguration. In other words, VMM 630 may provide full hardware and CPU virtualization to a guest operating system in some instances.
In other instances, a guest operating system may be specially designed or configured to execute on VMM 630 for efficiency. In these instances, the guest operating system is “aware” that it executes on a virtual machine monitor. In other words, VMM 630 may provide para-virtualization to a guest operating system in some instances.
A computer system process comprises an allotment of hardware processor time, and an allotment of memory (physical and/or virtual), the allotment of memory being for storing instructions executed by the hardware processor, for storing data generated by the hardware processor executing the instructions, and/or for storing the hardware processor state (e.g. content of registers) between allotments of the hardware processor time when the computer system process is not running. Computer system processes run under the control of an operating system, and may run under the control of other programs being executed on the computer system.
The term “cloud computing” is generally used herein to describe a computing model which enables on-demand access to a shared pool of computing resources, such as computer networks, servers, software applications, and services, and which allows for rapid provisioning and release of resources with minimal management effort or service provider interaction.
A cloud computing environment (sometimes referred to as a cloud environment, or a cloud) can be implemented in a variety of different ways to best suit different requirements. For example, in a public cloud environment, the underlying computing infrastructure is owned by an organization that makes its cloud services available to other organizations or to the general public. In contrast, a private cloud environment is generally intended solely for use by, or within, a single organization. A community cloud is intended to be shared by several organizations within a community, while a hybrid cloud comprises two or more types of cloud (e.g., private, community, or public) that are bound together by data and application portability.
Generally, a cloud computing model enables some of those responsibilities which previously may have been provided by an organization's own information technology department, to instead be delivered as service layers within a cloud environment, for use by consumers (either within or external to the organization, according to the cloud's public/private nature). Depending on the particular implementation, the precise definition of components or features provided by or within each cloud service layer can vary, but common examples include: Software as a Service (SaaS), in which consumers use software applications that are running upon a cloud infrastructure, while a SaaS provider manages or controls the underlying cloud infrastructure and applications; Platform as a Service (PaaS), in which consumers can use software programming languages and development tools supported by a PaaS provider to develop, deploy, and otherwise control their own applications, while the PaaS provider manages or controls other aspects of the cloud environment (i.e., everything below the run-time execution environment); Infrastructure as a Service (IaaS), in which consumers can deploy and run arbitrary software applications, and/or provision processing, storage, networks, and other fundamental computing resources, while an IaaS provider manages or controls the underlying physical cloud infrastructure (i.e., everything below the operating system layer); and Database as a Service (DBaaS), in which consumers use a database server or Database Management System that is running upon a cloud infrastructure, while a DBaaS provider manages or controls the underlying cloud infrastructure and applications.
The above-described basic computer hardware and software and cloud computing environment are presented for the purpose of illustrating the basic underlying computer components that may be employed for implementing the example embodiment(s). The example embodiment(s), however, are not necessarily limited to any particular computing environment or computing device configuration. Instead, the example embodiment(s) may be implemented in any type of system architecture or processing environment that one skilled in the art, in light of this disclosure, would understand as capable of supporting the features and functions of the example embodiment(s) presented herein.
A machine learning model is trained using a particular machine learning algorithm. Once trained, input is applied to the machine learning model to make a prediction, which may also be referred to herein as a predicted output or output. Attributes of the input may be referred to as features and the values of the features may be referred to herein as feature values.
A machine learning model includes a model data representation or model artifact. A model artifact comprises parameter values, which may be referred to herein as theta values, and which are applied by a machine learning algorithm to the input to generate a predicted output. Training a machine learning model entails determining the theta values of the model artifact. The structure and organization of the theta values depend on the machine learning algorithm.
In supervised training, training data is used by a supervised training algorithm to train a machine learning model. The training data includes input and a “known” output. In an embodiment, the supervised training algorithm is an iterative procedure. In each iteration, the machine learning algorithm applies the model artifact and the input to generate a predicted output. An error or variance between the predicted output and the known output is calculated using an objective function. In effect, the output of the objective function indicates the accuracy of the machine learning model based on the particular state of the model artifact in the iteration. By applying an optimization algorithm based on the objective function, the theta values of the model artifact are adjusted. An example of an optimization algorithm is gradient descent. The iterations may be repeated until a desired accuracy is achieved or some other criterion is met.
In a software implementation, when a machine learning model is referred to as receiving an input, being executed, and/or generating an output or prediction, a computer system process executing a machine learning algorithm applies the model artifact against the input to generate a predicted output. A computer system process executes a machine learning algorithm by executing software configured to cause execution of the algorithm. When a machine learning model is referred to as performing an action, a computer system process executes a machine learning algorithm by executing software configured to cause performance of the action.
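The iterative supervised training procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the model is a one-input linear model whose theta values are a weight and a bias, the objective function is mean squared error, and the names `train`, `learning_rate`, `tolerance`, and `max_iterations` are hypothetical.

```python
def train(inputs, known_outputs, learning_rate=0.05,
          tolerance=1e-10, max_iterations=10_000):
    """Fit theta = [weight, bias] by gradient descent on mean squared error."""
    theta = [0.0, 0.0]  # model artifact (assumed linear model)
    n = len(inputs)
    loss = float("inf")
    for _ in range(max_iterations):
        # Apply the model artifact to the input to generate predicted outputs.
        predicted = [theta[0] * x + theta[1] for x in inputs]
        errors = [p - y for p, y in zip(predicted, known_outputs)]
        # Objective function: mean squared error against the known outputs.
        loss = sum(e * e for e in errors) / n
        if loss < tolerance:  # stop once the desired accuracy is achieved
            break
        # Gradient of the objective with respect to each theta value.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, inputs)) / n
        grad_b = 2.0 * sum(errors) / n
        # Gradient descent step adjusts the theta values.
        theta[0] -= learning_rate * grad_w
        theta[1] -= learning_rate * grad_b
    return theta, loss

# Training data generated from y = 2x + 1.
theta, loss = train([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

On this toy data the theta values approach a weight of 2 and a bias of 1, at which point the loop stops because the loss has fallen beneath the tolerance.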
Inferencing entails a computer applying the machine learning model to an input such as a feature vector to generate an inference by processing the input and content of the machine learning model in an integrated way. Inferencing is data driven according to data, such as learned coefficients, that the machine learning model contains. Herein, this is referred to as inferencing by the machine learning model that, in practice, is execution by a computer of a machine learning algorithm that processes the machine learning model.
Classes of problems that machine learning (ML) excels at include clustering, classification, regression, anomaly detection, prediction, and dimensionality reduction (i.e. simplification). Examples of machine learning algorithms include decision trees, support vector machines (SVM), Bayesian networks, stochastic algorithms such as genetic algorithms (GA), and connectionist topologies such as artificial neural networks (ANN). Implementations of machine learning may rely on matrices, symbolic models, and hierarchical and/or associative data structures. Parameterized (i.e. configurable) implementations of best of breed machine learning algorithms may be found in open source libraries such as Google's TensorFlow for Python and C++ or Georgia Institute of Technology's MLPack for C++. Shogun is an open source C++ ML library with adapters for several programming languages including C#, Ruby, Lua, Java, MatLab, R, and Python.
An artificial neural network (ANN) is a machine learning model that at a high level models a system of neurons interconnected by directed edges. An overview of neural networks is described within the context of a layered feedforward neural network. Other types of neural networks share characteristics of neural networks described below.
In a layered feedforward network, such as a multilayer perceptron (MLP), each layer comprises a group of neurons. A layered neural network comprises an input layer, an output layer, and one or more intermediate layers referred to as hidden layers.
Neurons in the input layer and output layer are referred to as input neurons and output neurons, respectively. A neuron in a hidden layer or output layer may be referred to herein as an activation neuron. An activation neuron is associated with an activation function. The input layer does not contain any activation neuron.
From each neuron in the input layer and a hidden layer, there may be one or more directed edges to an activation neuron in the subsequent hidden layer or output layer. Each edge is associated with a weight. An edge from a neuron to an activation neuron represents input from the neuron to the activation neuron, as adjusted by the weight.
For a given input to a neural network, each neuron in the neural network has an activation value. For an input neuron, the activation value is simply an input value for the input. For an activation neuron, the activation value is the output of the respective activation function of the activation neuron.
Each edge from a particular neuron to an activation neuron represents that the activation value of the particular neuron is an input to the activation neuron, that is, an input to the activation function of the activation neuron, as adjusted by the weight of the edge. Thus, an activation neuron in the subsequent layer represents that the particular neuron's activation value is an input to the activation neuron's activation function, as adjusted by the weight of the edge. An activation neuron can have multiple edges directed to the activation neuron, each edge representing that the activation value from the originating neuron, as adjusted by the weight of the edge, is an input to the activation function of the activation neuron.
Each activation neuron is associated with a bias. To generate the activation value of an activation neuron, the activation function of the neuron is applied to the weighted activation values and the bias.
The artifact of a neural network may comprise matrices of weights and biases. Training a neural network may iteratively adjust the matrices of weights and biases.
For a layered feedforward network, as well as other types of neural networks, the artifact may comprise one or more matrices of edges W. A matrix W represents edges from a layer L−1 to a layer L. Given that the numbers of neurons in layers L−1 and L are N[L−1] and N[L], respectively, the dimensions of matrix W are N[L−1] columns and N[L] rows.
Biases for a particular layer L may also be stored in matrix B having one column with N[L] rows.
The matrices W and B may be stored as a vector or an array in RAM, or as a comma-separated set of values in memory. When an artifact is persisted in persistent storage, the matrices W and B may be stored as comma-separated values, in compressed and/or serialized form, or other suitable persistent form.
A particular input applied to a neural network comprises a value for each input neuron. The particular input may be stored as a vector. Training data comprises multiple inputs, each being referred to as a sample in a set of samples. Each sample includes a value for each input neuron. A sample may be stored as a vector of input values, while multiple samples may be stored as a matrix, each row in the matrix being a sample.
When an input is applied to a neural network, activation values are generated for the hidden layers and output layer. For each layer, the activation values may be stored in one column of a matrix A having a row for every neuron in the layer. In a vectorized approach for training, activation values may be stored in a matrix, having a column for every sample in the training data.
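The matrix shapes described above can be illustrated for a single layer with the following sketch. It assumes NumPy, a sigmoid activation function, and randomly initialized values; the names `n_prev`, `n_curr`, `n_samples`, and `A_prev` are hypothetical and chosen only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_prev, n_curr, n_samples = 4, 3, 5  # N[L-1], N[L], and number of samples

W = rng.normal(size=(n_curr, n_prev))          # N[L] rows, N[L-1] columns
B = rng.normal(size=(n_curr, 1))               # one column with N[L] rows
A_prev = rng.normal(size=(n_prev, n_samples))  # one column per sample

def sigmoid(z):
    # Assumed activation function; any other activation could be used.
    return 1.0 / (1.0 + np.exp(-z))

# Weighted activation values plus bias; B broadcasts across sample columns.
Z = W @ A_prev + B
# Activation values for layer L: a row per neuron, a column per sample.
A = sigmoid(Z)
```

Each column of `A` equals what would be computed by applying the layer to the corresponding sample alone, which is what makes the vectorized form equivalent to sample-by-sample evaluation.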
Training a neural network requires storing and processing additional matrices. Optimization algorithms generate matrices of derivative values which are used to adjust matrices of weights W and biases B. Generating derivative values may use and require storing matrices of intermediate values generated when computing activation values for each layer.
The number of neurons and/or edges determines the size of matrices needed to implement a neural network. The smaller the number of neurons and edges in a neural network, the smaller the matrices and the amount of memory needed to store the matrices. In addition, a smaller number of neurons and edges reduces the amount of computation needed to apply or train a neural network. Fewer neurons means fewer activation values need be computed, and/or fewer derivative values need be computed during training.
Properties of matrices used to implement a neural network correspond to neurons and edges. A cell in a matrix W represents a particular edge from a neuron in layer L−1 to L. An activation neuron represents an activation function for the layer that includes the activation function. An activation neuron in layer L corresponds to a row of weights in a matrix W for the edges between layer L and L−1 and a column of weights in matrix W for edges between layer L and L+1. During execution of a neural network, a neuron also corresponds to one or more activation values stored in matrix A for the layer and generated by an activation function.
An ANN is amenable to vectorization for data parallelism, which may exploit vector hardware such as single instruction multiple data (SIMD), such as with a graphics processing unit (GPU). Matrix partitioning may achieve horizontal scaling such as with symmetric multiprocessing (SMP) such as with a multicore central processing unit (CPU) and/or multiple coprocessors such as GPUs. Feed forward computation within an ANN may occur with one step per neural layer. Activation values in one layer are calculated based on weighted propagations of activation values of the previous layer, such that values are calculated for each subsequent layer in sequence, such as with respective iterations of a for loop. Layering imposes sequencing of calculations that is not parallelizable. Thus, network depth (i.e. number of layers) may cause computational latency. Deep learning entails endowing a multilayer perceptron (MLP) with many layers. Each layer achieves data abstraction, with complicated (i.e. multidimensional as with several inputs) abstractions needing multiple layers that achieve cascaded processing. Reusable matrix based implementations of an ANN and matrix operations for feed forward processing are readily available and parallelizable in neural network libraries such as Google's TensorFlow for Python and C++, OpenNN for C++, and University of Copenhagen's fast artificial neural network (FANN). These libraries also provide model training algorithms such as backpropagation.
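The layer-by-layer sequencing described above can be sketched as a for loop in which each iteration depends on the previous one, while the work inside an iteration is a matrix operation over all samples at once. The sketch assumes NumPy and a ReLU activation in every layer; `feed_forward`, `layer_sizes`, and `batch` are hypothetical names.

```python
import numpy as np

def relu(z):
    # Assumed activation function for every layer.
    return np.maximum(z, 0.0)

def feed_forward(x, weights, biases, activation=relu):
    """Propagate a batch (one column per sample) through each layer in turn."""
    a = x
    for W, b in zip(weights, biases):  # one sequential step per layer
        a = activation(W @ a + b)      # vectorized across neurons and samples
    return a

layer_sizes = [3, 4, 4, 2]  # input layer, two hidden layers, output layer
rng = np.random.default_rng(1)
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros((n_out, 1)) for n_out in layer_sizes[1:]]

batch = rng.normal(size=(3, 8))  # eight samples of three input values
output = feed_forward(batch, weights, biases)
```

The loop body can be parallelized across neurons and samples, but the loop itself cannot be reordered, which is why depth contributes to latency.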
An ANN's output may be more or less correct. For example, an ANN that recognizes letters may mistake an I for an L because those letters have similar features. Correct output may have particular value(s), while actual output may have somewhat different values. The arithmetic or geometric difference between correct and actual outputs may be measured as error according to a loss function, such that zero represents error free (i.e. completely accurate) behavior. For any edge in any layer, the difference between correct and actual outputs is a delta value.
Backpropagation entails distributing the error backward through the layers of the ANN in varying amounts to all of the connection edges within the ANN. Propagation of error causes adjustments to edge weights, which depend on the gradient of the error at each edge. Gradient of an edge is calculated by multiplying the edge's error delta times the activation value of the upstream neuron. When the gradient is negative, the greater the magnitude of error contributed to the network by an edge, the more the edge's weight should be reduced, which is negative reinforcement. When the gradient is positive, then positive reinforcement entails increasing the weight of an edge whose activation reduced the error. An edge weight is adjusted according to a percentage of the edge's gradient. The steeper the gradient, the bigger the adjustment. Not all edge weights are adjusted by a same amount. As model training continues with additional input samples, the error of the ANN should decline. Training may cease when the error stabilizes (i.e. ceases to reduce) or vanishes beneath a threshold (i.e. approaches zero). Example mathematical formulae and techniques for feedforward multilayer perceptron (MLP), including matrix operations and backpropagation, are taught in related reference “EXACT CALCULATION OF THE HESSIAN MATRIX FOR THE MULTI-LAYER PERCEPTRON,” by Christopher M. Bishop.
Model training may be supervised or unsupervised. For supervised training, the desired (i.e. correct) output is already known for each example in a training set. The training set is configured in advance by, e.g., a human expert assigning a categorization label to each example. For example, the training set for optical character recognition may have blurry photographs of individual letters, and an expert may label each photo in advance according to which letter is shown. Error calculation and backpropagation occur as explained above.
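The per-edge gradient and weight adjustment described above can be sketched for a single layer as follows. The sketch makes several assumptions that are not taken from this disclosure: NumPy is used, the output neurons have a linear (identity) activation, the loss is half the sum of squared deltas, delta is defined as actual minus correct output, and the adjustment is the conventional gradient descent step. The names `upstream`, `correct`, and `learning_rate` are hypothetical.

```python
import numpy as np

upstream = np.array([[0.5], [-1.0], [2.0]])  # activation values of upstream neurons
correct = np.array([[1.0], [-1.0]])          # desired output of two output neurons
W = np.zeros((2, 3))                         # one weight per edge
learning_rate = 0.1                          # the "percentage" of the gradient applied

errors = []
for _ in range(20):
    actual = W @ upstream                    # linear activation (assumption)
    delta = actual - correct                 # error delta per output neuron
    errors.append(float(0.5 * np.sum(delta ** 2)))
    # Gradient of each edge: its error delta times the upstream activation value.
    gradient = delta @ upstream.T
    # Steeper gradients yield bigger adjustments; edges differ in their amounts.
    W -= learning_rate * gradient
```

With these values the recorded error shrinks at every iteration, mirroring the expectation that error declines as training continues.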
Unsupervised model training is more involved because desired outputs need to be discovered during training. Unsupervised training may be easier to adopt because a human expert is not needed to label training examples in advance. Thus, unsupervised training saves human labor. A natural way to achieve unsupervised training is with an autoencoder, which is a kind of ANN. An autoencoder functions as an encoder/decoder (codec) that has two sets of layers. The first set of layers encodes an input example into a condensed code that needs to be learned during model training. The second set of layers decodes the condensed code to regenerate the original input example. Both sets of layers are trained together as one combined ANN. Error is defined as the difference between the original input and the regenerated input as decoded. After sufficient training, the decoder outputs more or less exactly whatever is the original input.
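The encoder/decoder arrangement described above can be sketched with a deliberately tiny autoencoder. The sketch assumes NumPy, a single linear layer for each of the encoder and decoder (rather than sets of layers), two-dimensional inputs that lie on a line, a one-value condensed code, and plain gradient descent; the names `E`, `D`, and `learning_rate` are hypothetical.

```python
import numpy as np

t = np.linspace(-1.0, 1.0, 20)
X = np.vstack([t, 2.0 * t])   # 20 unlabeled samples, one per column
n = X.shape[1]

E = np.array([[0.1, 0.1]])    # encoder: 2 inputs -> 1 condensed code value
D = np.array([[0.1], [0.1]])  # decoder: 1 code value -> 2 regenerated values
learning_rate = 0.05

losses = []
for _ in range(500):
    code = E @ X              # condensed code, learned rather than given
    regenerated = D @ code    # decoded reconstruction of the input
    err = regenerated - X     # error: regenerated input minus original input
    losses.append(float(np.mean(np.sum(err ** 2, axis=0))))
    # Both sets of weights are trained together from the same error.
    grad_D = (2.0 / n) * err @ code.T
    grad_E = (2.0 / n) * D.T @ err @ X.T
    D -= learning_rate * grad_D
    E -= learning_rate * grad_E
```

No labels appear anywhere in the loop: the input itself serves as the desired output, and after training the decoder regenerates the original samples closely.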
An autoencoder relies on the condensed code as an intermediate format for each input example. It may be counter-intuitive that the intermediate condensed codes do not initially exist and instead emerge only through model training. Unsupervised training may achieve a vocabulary of intermediate encodings based on features and distinctions of unexpected relevance. For example, which examples and which labels are used during supervised training may depend on somewhat unscientific (e.g. anecdotal) or otherwise incomplete understanding of a problem space by a human expert. By contrast, unsupervised training discovers an apt intermediate vocabulary based more or less entirely on statistical tendencies that reliably converge upon optimality with sufficient training due to the internal feedback by regenerated decodings. Techniques for unsupervised training of an autoencoder for anomaly detection based on reconstruction error are taught in non-patent literature (NPL) “VARIATIONAL AUTOENCODER BASED ANOMALY DETECTION USING RECONSTRUCTION PROBABILITY”, Special Lecture on IE. 2015 Dec. 27; 2(1):1-18 by Jinwon An et al.
Principal component analysis (PCA) provides dimensionality reduction by leveraging and organizing mathematical correlation techniques such as normalization, covariance, eigenvectors, and eigenvalues. PCA incorporates aspects of feature selection by eliminating redundant features. PCA can be used for prediction. PCA can be used in conjunction with other ML algorithms.
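The combination of normalization, covariance, eigenvectors, and eigenvalues can be sketched as follows. The sketch assumes NumPy, treats normalization as mean centering only, and uses synthetic data in which the second feature is nearly a multiple of the first; the function name `pca` and its return values are hypothetical.

```python
import numpy as np

def pca(samples, n_components):
    """Project samples (one row per sample) onto the top principal components."""
    centered = samples - samples.mean(axis=0)   # normalization (centering only)
    covariance = np.cov(centered, rowvar=False)
    # eigh suits symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    order = np.argsort(eigenvalues)[::-1]       # largest variance first
    components = eigenvectors[:, order[:n_components]]
    explained = eigenvalues[order[:n_components]] / eigenvalues.sum()
    return centered @ components, components, explained

rng = np.random.default_rng(3)
t = rng.normal(size=200)
samples = np.column_stack([
    t,                                       # feature 0
    2.0 * t + 0.01 * rng.normal(size=200),   # feature 1: redundant with feature 0
    0.1 * rng.normal(size=200),              # feature 2: small independent noise
])
reduced, components, explained = pca(samples, n_components=1)
```

Here three features collapse to one component that retains nearly all of the variance, which illustrates how redundant features are eliminated.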
A random forest or random decision forest is an ensemble learning approach that constructs a collection of randomly generated nodes and decision trees during a training phase. Different decision trees of a forest are constructed to be each randomly restricted to only particular subsets of feature dimensions of the data set, such as with feature bootstrap aggregating (bagging). Therefore, the decision trees gain accuracy as the decision trees grow without being forced to overfit training data as would happen if the decision trees were forced to learn all feature dimensions of the data set. A prediction may be calculated based on a mean (or other integration such as soft max) of the predictions from the different decision trees.
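The ensemble behavior described above can be sketched with a toy forest. To stay short, the sketch makes simplifying assumptions: each tree is a one-level decision stump, each stump sees a random subset of features and a bootstrap sample of rows, labels are binary, and the forest prediction is the mean of the stump predictions. The names `fit_stump`, `fit_forest`, and `predict` are hypothetical.

```python
import random
from statistics import mean

def fit_stump(rows, labels, features):
    """Pick the single-feature threshold split with the fewest mistakes."""
    best = None
    for f in features:
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            left_pred = round(mean(left)) if left else 0
            right_pred = round(mean(right)) if right else 0
            mistakes = (sum(y != left_pred for y in left)
                        + sum(y != right_pred for y in right))
            if best is None or mistakes < best[0]:
                best = (mistakes, f, threshold, left_pred, right_pred)
    return best[1:]  # (feature, threshold, left prediction, right prediction)

def fit_forest(rows, labels, n_trees, max_features, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Restrict each tree to a random subset of feature dimensions.
        features = rng.sample(range(len(rows[0])), max_features)
        # Bootstrap: sample rows with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], features))
    return forest

def predict(forest, row):
    """Mean of the predictions from the different trees."""
    return mean(left if row[f] <= threshold else right
                for f, threshold, left, right in forest)

# Features 0 and 1 are informative; feature 2 alternates and is noise.
rows = [[i / 11, 1 - i / 11, i % 2] for i in range(12)]
labels = [1 if i >= 6 else 0 for i in range(12)]
forest = fit_forest(rows, labels, n_trees=25, max_features=2)
positive_score = predict(forest, [1.0, 0.0, 0])
negative_score = predict(forest, [0.0, 1.0, 1])
```

A mean above 0.5 may be read as a prediction of the positive class, and a mean below 0.5 as a prediction of the negative class.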
Random forest hyper-parameters may include: number-of-trees-in-the-forest, maximum-number-of-features-considered-for-splitting-a-node, number-of-levels-in-each-decision-tree, minimum-number-of-data-points-on-a-leaf-node, method-for-sampling-data-points, etc.
In the foregoing specification, embodiments of the invention have been described with reference to numerous specific details that may vary from implementation to implementation. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. The sole and exclusive indicator of the scope of the invention, and what is intended by the applicants to be the scope of the invention, is the literal and equivalent scope of the set of claims that issue from this application, in the specific form in which such claims issue, including any subsequent correction.