The present invention relates to cyber security and more particularly to graph-based systems of element classification using a tool such as VirusTotal.
For the purpose of this document we make the following definitions:
The prior art tool VirusTotal (VT; http://www.virustotal.com/) can be regarded as a vast data-base containing security-related information. As this resource is constantly being updated by many parties across the globe, it is considered a quality source for security data, regarding both malicious and benign samples. It contains four main kinds of elements (files, domains, IP addresses and URL web addresses) as well as the connections between them.
The Cybereason security platform (http://www.cybereason.com/) collects relevant data from the user's end-point computer and analyzes them on a central server. As part of this analysis, it queries VT for e.g. the image file of every process it encounters. This lets the system know whether the file is already known to the community, and if so, what its reputation is. If the file is unknown to VT, other methods must be utilized in order to assess whether it is malicious. In this work, we discuss one such method.
When an element is unknown to VT, we lack direct community reputation for it. However, in certain situations, we might still be able to obtain indirect information relating to it from VT, and use that in order to estimate its maliciousness level. This can be done when the Cybereason platform detects connections between the element and other elements, its neighbors. If some of the neighbors are known in VT, we can use their VT records to indirectly classify the element itself.
The VT data-base contains information on mainly four element kinds. Each element has various fields, from which features can be extracted. In addition, elements can be connected to other elements. The four VT element kinds are:
We can see that these elements connect with each other via several types of relations. Each element kind has its own possible relations to other kinds. In addition, there are examples where a pair of element kinds is connected by several different relations. As detailed below, such complex relations can be best represented using an extension of a graph.
The Cybereason sensors collect system-level data that is useful in the detection of a cyber attack. Types of data collected include, for example:
As we can see, the elements are similar to those of VT, and again can be represented by an extension of a graph. This should not come as a surprise, as the two systems try to describe similar content domains.
This allows us to translate the Cybereason representation to that of VT: Each process has features of its own and features describing the file associated with it. In addition, using the information collected on the connections opened by the process, we can connect neighbors to each process (e.g. resolved domains, IP connections using TCP or UDP, or even the execution parent file), in much the same way as in VT.
Within this context, two important aspects of graph-based inference need to be mentioned. First, graph relations are inherently unordered, e.g. the resolved IP addresses of a domain have no intrinsic ordering. This proves to be a challenge for most algorithms, and in particular ones based on neural networks, which typically require a clear and structured order for processing input.
Secondly, the common practice for propagating information along the graph involves making iterative inferences about intermediate elements. This, however, has an inherent significant drawback: Since the structure of the graph can be quite complex, any single element can participate in the inference of more than a single target element. Moreover, when used for classification of different target elements, each intermediate element can play a different role or appear in a different context. Committing to a class for each intermediate element then using that same class for classification in different contexts can cause errors, which then propagate as the network is traversed. This is especially important for cyber-security purposes, where an element is often not inherently “good” or “bad”, but instead should be considered in the context of a bigger picture.
For example, consider a case of two file elements undergoing classification and a certain web domain. One file was downloaded from the domain, while the other was observed communicating with it. The same domain element therefore serves in a different role relative to each file element. Similarly, an element can be a close neighbor to one target element and a far neighbor to another, can have few or many graph paths leading to the target element, etc.
One way to mitigate the problem is to classify a single element at a time while reclassifying intermediate elements each time, in each context. This, however, is not very efficient and does not take full advantage of the complexity of a true graph structure. A more efficient approach is to avoid classifying intermediate elements altogether and instead use them in a flexible manner which is context-aware. This allows the classification of several elements simultaneously, a much more versatile and efficient approach, while also avoiding the errors that such ‘hard’ intermediate classifications can cause.
Overcoming these two challenges is crucial for an effective cyber-security graph-based learning system.
The present invention uses a directed hypergraph to classify an element based on information aggregated from its close neighbors in the graph. A crawling procedure is used starting from elements needing classification and collecting a set of their neighbors forming a neighborhood. An example use of the present invention in the Cybereason platform is when the system encounters an element known or unknown to VT, and also detects connections from it to other elements which are found in VT. These neighbors are then used to classify the element. This classifier is able to obtain as input an entire neighborhood. This input is much richer than prior art feature vector approaches. Here, the input includes several feature vectors, one for each element in the neighborhood. In addition, a mapping of interconnections can be provided for each group of elements.
It is an object of the present invention to utilize network (graph) analysis to determine the maliciousness level of elements for the purpose of cybersecurity.
It is a further object of the present invention to use observed properties of elements to indirectly deduce information about them from a cyber-security data-base (such as VT), even when the element in question is not present in the data-base.
Finally, it is an object of the present invention to provide incrimination of an element based on its connections to neighbors without a classification of these neighbors.
We saw that, for our purposes, we can think of both VT and the Cybereason platform as containing information regarding the same kinds of elements, having the same connections (albeit maybe having different features). It is only natural to represent these elements as a directed graph. However, in our case an element can connect to multiple neighbors using the same connection. Therefore, a more suitable structure in this case is in fact a directed hypergraph.
We follow [1] when defining our directed hypergraph and broaden the definition to also generalize a multidigraph, or quiver. We define a directed multi-hypergraph as an ordered tuple $G=(\mathcal{V},\mathcal{E},s,t,w)$, where $\mathcal{V}$ is a set of vertices (or nodes, or elements) and $\mathcal{E}$ a set of directed hyperedges (or simply edges). The function $s:\mathcal{E}\to\mathcal{V}$ assigns to each edge a source node, and the function $t:\mathcal{E}\to\mathcal{P}(\mathcal{V})\setminus\{\emptyset\}$ assigns to each edge a non-empty set of target nodes, where $\mathcal{P}(\mathcal{V})$ is the power set of $\mathcal{V}$ and $\emptyset$ is the empty set; $w$ is a family of weight functions, providing each edge $e\in\mathcal{E}$ with a weight function $w_e:t(e)\to\mathbb{R}_{>0}$, where $\mathbb{R}_{>0}$ is the set of positive real numbers. A few remarks on our definitions:
In addition to having the structure of a directed hypergraph, our data are also typed and relational, meaning there are different node types (or kinds), each having different features, and relations specifying the allowed connections between them (e.g. hostname is a relation connecting URL elements to Domain elements). To formalize this notion, we add a typing structure to our graph definition. We define a typed and relational version of the hypergraph $G$ as the ordered pair $G_c=(G,M)$. The typing structure $M=(\mathcal{K},\mathcal{R},\kappa_s,\kappa_t)$ is a quiver, which we call the meta-graph of $G$. Its node and edge sets, $\mathcal{K}$ and $\mathcal{R}$, are partitions of $\mathcal{V}$ and $\mathcal{E}$, respectively, representing the different node kinds and relations. Similarly to $s$ and $t$, the functions $\kappa_s,\kappa_t:\mathcal{R}\to\mathcal{K}$ assign to each relation its source and target node kinds, respectively. In order for these definitions to be meaningful, we must also impose some consistency constraints and, for every relation $r\in\mathcal{R}$ and every edge $e\in r$, require that $s(e)\in\kappa_s(r)$ and $t(e)\subseteq\kappa_t(r)$. As before, if the hypergraph is undirected, for every relation $r\in\mathcal{R}$ we need to include in $\mathcal{R}$ its reverse relation $r'$ satisfying $\kappa_s(r')=\kappa_t(r)$ and $\kappa_t(r')=\kappa_s(r)$.
Armed with a meta-graph, we can declutter the graph somewhat by unifying edges of the same relation having the same source node. Since the hypergraph allows for several target nodes for each edge, there is no benefit in maintaining several edges having the same “role”, i.e. belonging to the same relation. Formally, for every $r\in\mathcal{R}$ and $v\in\kappa_s(r)$ define the plurality set $P_r(v)=\{e\in r\mid s(e)=v\}$. If $|P_r(v)|>1$, we remove all the edges in $P_r(v)$ from $\mathcal{E}$ and replace them with a single edge $\bar{e}$ satisfying
Finally, the unified edge ē is included in the relation r.
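The unification step can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the method as specified: the tuple layout of an edge, the relation name `resolutions` and the rule of summing the weights of a target that appears in several merged edges are all hypothetical choices made for this sketch.

```python
from collections import defaultdict

def unify_edges(edges):
    """Merge all edges sharing (relation, source) into a single hyperedge.

    `edges` is a list of (relation, source, targets), where `targets` maps
    each target node to a positive weight. Summing the weights of a target
    that occurs in several merged edges is an ASSUMPTION of this sketch.
    """
    merged = defaultdict(dict)
    for relation, source, targets in edges:
        bucket = merged[(relation, source)]
        for node, weight in targets.items():
            bucket[node] = bucket.get(node, 0.0) + weight
    return [(rel, src, tgt) for (rel, src), tgt in merged.items()]

# Two edges of the same (hypothetical) relation leave the same source node.
edges = [
    ("resolutions", "example.com", {"1.2.3.4": 1.0}),
    ("resolutions", "example.com", {"5.6.7.8": 1.0, "1.2.3.4": 2.0}),
    ("hostname", "http://example.com/a", {"example.com": 1.0}),
]
unified = unify_edges(edges)
```

After unification, each (relation, source) pair owns exactly one edge, so the plurality set of every node has at most one member.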
Our goal is to classify an element based on information aggregated from its close neighbors in the graph. Here we describe what we mean by “close neighbors” and how we acquire the necessary data for classification.
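Again following [1], we define the forward star of a node $v\in\mathcal{V}$ and the relational forward star of a node kind $k\in\mathcal{K}$ as
$$F^\star(v)=\{e\in\mathcal{E}\mid s(e)=v\}$$
$$F^\star(k)=\{r\in\mathcal{R}\mid\kappa_s(r)=k\} \tag{2}$$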
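respectively. The neighborhood of a node is then
$$N(v)=\bigcup_{e\in F^\star(v)}t(e). \tag{3}$$
We can now recursively define the series
$$N_0(v)=\{v\},\qquad N_\ell(v)=\bigcup_{u\in N_{\ell-1}(v)}N(u), \tag{4}$$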
each reaching further along the graph than its predecessor. We call $N_\ell(v)$ the set of $\ell$-neighbors of node $v$. We also define the $\ell$-neighborhood of $v$ as $\mathcal{N}_\ell(v)=\bigcup_{i=0}^{\ell}N_i(v)$.
We set an integer parameter $L>0$, determining the furthest neighbor participating in the prediction for each element. We will see that this parameter corresponds to the number of layers used in the neural-network classifier. Given $L$, we require $\mathcal{N}_L(v)$ for each element $v$ we want to classify.
We acquire the $L$-neighborhoods using a crawling procedure. We start from the elements in need of classification, which we call the seeds, and collect their $\ell$-neighbors sequentially: for each seed $v$, we first construct $N_0(v)$ and then for $\ell=1,\dots,L$ use $N_{\ell-1}(v)$ to find $N_\ell(v)$.
As discussed in [2], it is advised to limit the size of the neighborhoods due to performance considerations. This can be achieved by e.g. uniformly sampling the target nodes of each edge when crawling the graph. To do so, we set a size limit Smax=20 and, whenever using Eq. (3), for every edge e satisfying |t(e)|>Smax we use a uniform sample (without replacement) of only Smax nodes from t(e). Other sampling schemes can of course be used, and the edge weights w can also be taken into consideration when sampling.
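The crawling procedure with size-limited sampling can be sketched as follows. This is a minimal illustration, not the production crawler: the `forward_star` dictionary layout is hypothetical, and the sketch assumes that the zeroth level of a seed is the seed itself and that each further level consists of the (sampled) targets of all edges leaving the previous level.

```python
import random

def crawl(forward_star, seeds, depth, s_max=20, rng=None):
    """Collect the depth-neighborhood of every seed.

    `forward_star` maps a node to a list of target lists, one list per
    outgoing hyperedge (a hypothetical in-memory layout for this sketch).
    """
    rng = rng or random.Random(0)
    neighborhoods = {}
    for seed in seeds:
        level, seen = {seed}, {seed}
        for _ in range(depth):
            next_level = set()
            for node in level:
                for targets in forward_star.get(node, []):
                    if len(targets) > s_max:
                        # Uniform sample without replacement of s_max targets.
                        targets = rng.sample(targets, s_max)
                    next_level.update(targets)
            seen |= next_level
            level = next_level
        neighborhoods[seed] = seen
    return neighborhoods

# Toy graph: a file contacts two domains; one domain resolves to 50 addresses.
graph = {
    "f": [["d1", "d2"]],
    "d1": [[f"ip{i}" for i in range(50)]],
    "d2": [["ip0"]],
}
one_hop = crawl(graph, ["f"], depth=1)["f"]
two_hop = crawl(graph, ["f"], depth=2)["f"]
```

With `s_max=20`, the large edge leaving `d1` contributes only 20 of its 50 targets, which bounds the growth of the neighborhood.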
One example use of this classifier in the Cybereason platform is in cases in which the system encounters a file unknown to VT, but also detects connections from it to other elements which are found in VT. These neighbors are then used to classify the unknown file. This means that in this example, we actually have in our graph two different kinds of File elements: files found in VT, encountered as neighbors, and unknown files, encountered only in the Cybereason platform, acting as seeds requiring classification. We call the former node kind File and the latter FileSeed. As FileSeeds are unknown, their features are a small subset of File features. This means we are able to take File elements and, by removing some features, generate mock FileSeed elements from them, to use for training. (Similar procedures can be applied for other uses as well, such as classification of other element types, or classification of known elements.)
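Generating a mock FileSeed from a File element amounts to dropping the features that would not be available for an unknown file. The sketch below only illustrates the idea: the feature names and the choice of which features survive are hypothetical and are not taken from VT or the Cybereason platform.

```python
# Hypothetical feature names; the actual File and FileSeed feature sets
# are not listed in this document.
SEED_FEATURES = {"size", "is_signed"}

def to_file_seed(file_features):
    """Keep only the features that an unknown file would also have."""
    return {name: value for name, value in file_features.items()
            if name in SEED_FEATURES}

vt_file = {"size": 52224, "is_signed": False, "positives": 37, "first_seen_days": 412}
mock_seed = to_file_seed(vt_file)
```

The label of the mock seed can still be derived from the full File record, since that record remains available at training time.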
Therefore, in this example, we apply our crawling procedure for acquiring the data differently, depending on whether we collect data for training the classifier, or at production time. During training, after deciding on a set of files suitable for training that will act as seeds, all known to VT, we crawl for their L-neighborhoods and convert the original seed elements to FileSeed kind. During production time, we have FileSeeds encountered by Cybereason, unknown to VT, and one or more neighbors of each. We then only crawl VT for the (L−1)-neighborhood of each neighbor to acquire all the data we need.
When training our classifier, we also require labels for the seed elements. Since, in our example, at training time our seeds are made from complete File elements known to VT, we can use all the available information to produce the labels. This can be done either manually by domain experts, or automatically using some logic operating on each VT File report. In the example given here we choose to classify to two classes (‘malicious’ or ‘benign’), but the method described below is generic and works for multi-class classification as well.
We stress that while classification of unknown files is the main example given in this work, our method can be used to classify other element kinds as well, even simultaneously with File elements. We can have seeds of several node kinds with no alteration to the formalism.
We have described how, for each seed $v$, we crawl for its neighbors $\mathcal{N}_L(v)$ required for its classification. Our classifier then has to be able to get as input an entire neighborhood, or in other words, the subgraph induced by $\mathcal{N}_L(v)$. This input is much richer than the traditional “feature vector” approach: The input consists of several feature vectors, one for each element in the neighborhood. The number of neighbors is not constant between samples, and they can be of different kinds (each element kind has its own set of features). We also have to provide a mapping of their connections, i.e. which element is related to which. The architecture suited for this task is the Graph Convolution Network.
As our classifier, we use the neural-network architecture known as Graph Convolution Network (GCN). It is suitable for our task since it operates on subgraphs as input. While other methods require an entire graph in order to produce classification to some of its nodes, using GCN we learn a set of weights used to aggregate information from neighborhoods. The same weights can then be used for new neighborhoods to classify new, yet unseen elements.
The following architecture is based on the GraphSAGE algorithm described in [2]. However, we have generalized it, as [2, 3] and others deal only with undirected graphs. In addition, approaches from [4] are incorporated in order to generalize the method to our typed and relational graph.
Our typed graph of elements in which different node kinds have different features can be represented as follows: For each kind $k\in\mathcal{K}$, we arbitrarily assign indices to its elements as $k=\{v_1^k,\dots,v_{|k|}^k\}$. We then build a feature matrix $X_k$ of dimensions $|k|\times f_k$, where $f_k$ is the number of features of kind $k$. In this matrix, row $i$ holds the features of $v_i^k$.
In order to represent the connections between elements, we build for each relation $r\in\mathcal{R}$ an adjacency matrix $A_r$ of dimensions $|\kappa_s(r)|\times|\kappa_t(r)|$. This is a sparse matrix in which the entry in row $i$ and column $j$ is non-zero if there is an edge $e\in r$ such that $v_i^{\kappa_s(r)}=s(e)$ and $v_j^{\kappa_t(r)}\in t(e)$.
Thus, with the feature matrices $\{X_k\}$ and the $|\mathcal{R}|$ adjacency matrices $\{A_r\}$, we can fully represent the graph. The only remaining piece of information we require is which of the elements in the graph function as seeds, i.e. which elements we are actually trying to classify. At training time, they are the only elements which have labels, and at inference time, they are the ones we need to classify. As mentioned above, in general several of the elements can function as seeds. There can even be seeds of several different node kinds.
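The matrix representation can be sketched in plain Python as below. The sketch makes several assumptions that are not stated in the text: matrices are stored as dense nested lists instead of a sparse format, each non-zero adjacency entry holds the corresponding edge weight, and the kind and relation names (`FileSeed`, `Domain`, `contacted`) are hypothetical.

```python
def build_matrices(nodes_by_kind, features, relations, edges):
    """Build per-kind feature matrices and per-relation adjacency matrices.

    nodes_by_kind: kind -> ordered list of node ids (fixes the row indices)
    features:      node id -> feature list
    relations:     relation -> (source kind, target kind)
    edges:         list of (relation, source, {target: weight})
    """
    index = {k: {v: i for i, v in enumerate(vs)} for k, vs in nodes_by_kind.items()}
    X = {k: [features[v] for v in vs] for k, vs in nodes_by_kind.items()}
    A = {
        r: [[0.0] * len(nodes_by_kind[tgt]) for _ in nodes_by_kind[src]]
        for r, (src, tgt) in relations.items()
    }
    for r, source, targets in edges:
        src, tgt = relations[r]
        i = index[src][source]
        for node, weight in targets.items():
            # Storing the edge weight as the entry is an assumption of this sketch.
            A[r][i][index[tgt][node]] = weight
    return X, A

nodes_by_kind = {"FileSeed": ["f1"], "Domain": ["d1", "d2"]}
features = {"f1": [1.0, 0.0], "d1": [0.3], "d2": [0.9]}
relations = {"contacted": ("FileSeed", "Domain")}
edges = [("contacted", "f1", {"d2": 1.0})]
X, A = build_matrices(nodes_by_kind, features, relations, edges)
```

Here `A["contacted"]` has one row (the single FileSeed) and two columns (the two Domain nodes), matching the stated dimensions.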
Much like a conventional neural network, the GCN is constructed in successive layers. Having chosen $L$ and collected $\mathcal{N}_L(v)$ for each seed $v$, we require $L$ layers in total. As mentioned above, this includes the furthermost neighbors that participate in the prediction for each seed. For example, if $L=2$, each seed receives information from, at most, its neighbors' neighbors.
Each layer of the network consists of two processing steps:
In this way, at each iteration (layer), information flows along the edges of the graph a distance of one edge, in the direction opposite to that of the edge.
We note that each element's own features are always used when calculating its activations for the next layer. We should therefore never explicitly consider an element to be its own neighbor, in order not to give extra, unfair, weight to its own features. To avoid that, for each relation r connecting a node kind to itself, i.e. κs(r)=κt(r), we set the diagonal of the corresponding adjacency matrix Ar to zero.
We denote the activation matrix for node kind $k$ in layer $\ell$ by $Z_k^{(\ell)}$. Its dimensions are $|k|\times f_k^{(\ell)}$, where $f_k^{(\ell)}$ is the chosen number of units for kind $k$ in this layer. We initially set $Z_k^{(0)}=X_k$ and $f_k^{(0)}=f_k$ for each $k\in\mathcal{K}$. The final number of units, $f_k^{(L)}$, is the number of output classes if there are seeds of kind $k$. Otherwise, $Z_k^{(L)}$ is never calculated (see below).
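For each layer $\ell$ and each relation $r\in\mathcal{R}$, we need to choose an aggregation function, $\alpha_r^{(\ell)}$. This function takes the features of the target nodes (i.e. neighbors) of relation $r$ and aggregates them together for each source node, according to the adjacency matrix. The result is a feature matrix for neighbors,
$$Y_r^{(\ell)}=\alpha_r^{(\ell)}\left(Z_{\kappa_t(r)}^{(\ell-1)},A_r\right), \tag{5}$$
in which row $i$ holds the aggregated features from the neighborhood of $v_i^{\kappa_s(r)}$.
We now use the original features in addition to the aggregated features and feed them all into a fully-connected neural layer. To do that, we define for each node kind $k$ a kernel matrix $W_k^{(\ell)}$ of dimensions $f_k^{(\ell-1)}\times f_k^{(\ell)}$ and a bias vector $b_k^{(\ell)}$ of length $f_k^{(\ell)}$. We also define for each relation $r$ a kernel matrix $W_r^{(\ell)}$ of dimensions $g_r^{(\ell)}\times f_{\kappa_s(r)}^{(\ell)}$, where $g_r^{(\ell)}$ is the number of aggregated features, i.e. the number of columns of $Y_r^{(\ell)}$. The layer's output is then calculated as
$$Z_k^{(\ell)}=\sigma\Big(Z_k^{(\ell-1)}W_k^{(\ell)}+\sum_{r\in F^\star(k)}Y_r^{(\ell)}W_r^{(\ell)}+b_k^{(\ell)}\Big), \tag{6}$$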
where $\sigma$ is an activation function. Various activation functions can be used, for example softmax for the last layer and ReLU for all other layers. (The softmax function operates on vectors or, in our case, rows of a matrix, and is defined as $\mathrm{softmax}(x)_i=e^{x_i}/\sum_j e^{x_j}$; ReLU is applied element-wise as $\mathrm{ReLU}(x)=\max(0,x)$.)
Finally, from the output matrices {Zk(L)}, we take only the rows corresponding to the seeds. This is the output of the network. We note that, actually, only output matrices for node kinds that can have seeds should ever be calculated. Other kinds are only used as neighbors. Therefore, for these node kinds, the calculation of the final Zk(L) can be skipped.
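A single layer of the network can be sketched in plain Python as follows. This is an illustrative sketch and not the trained classifier: it uses the mean aggregator for every relation, adds the contribution of each relation through its own kernel (an assumption of this sketch, equivalent to concatenating the inputs and applying one larger kernel), and uses hypothetical kind and relation names with hand-picked weights.

```python
def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def mean_aggregate(adj, feats):
    """Weighted mean of neighbor features for every source node."""
    out = []
    for row in adj:
        total = sum(row)
        summed = matmul([row], feats)[0]
        # Rows without neighbors aggregate to zeros (an assumption of this sketch).
        out.append([x / total if total else 0.0 for x in summed])
    return out

def relu(matrix):
    return [[max(0.0, x) for x in row] for row in matrix]

def gcn_layer(Z, A, relations, W_self, W_rel, bias, activation=relu):
    """Compute next-layer activations for every kind listed in W_self."""
    out = {}
    for kind in W_self:
        pre = matmul(Z[kind], W_self[kind])           # the element's own features
        for r, (src, tgt) in relations.items():
            if src != kind:
                continue                              # only relations leaving this kind
            contrib = matmul(mean_aggregate(A[r], Z[tgt]), W_rel[r])
            pre = [[p + c for p, c in zip(pr, cr)] for pr, cr in zip(pre, contrib)]
        out[kind] = activation([[p + b for p, b in zip(row, bias[kind])] for row in pre])
    return out

# One File with two features, contacting two Domains with one feature each.
Z = {"File": [[1.0, 2.0]], "Domain": [[4.0], [2.0]]}
A = {"contacted": [[1.0, 1.0]]}
relations = {"contacted": ("File", "Domain")}
result = gcn_layer(
    Z, A, relations,
    W_self={"File": [[1.0], [1.0]]},
    W_rel={"contacted": [[1.0]]},
    bias={"File": [-1.0]},
)
```

Only the File kind appears in `W_self`, so only its activations are computed, which mirrors the remark that output matrices for kinds without seeds can be skipped in the final layer.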
As discussed in [2], any function used to aggregate neighbors' features should, when viewed as operating on the neighbors of a single node, have the following properties:
Here we consider as examples two aggregation functions: mean and max pooling. The former is an example of a very simple aggregator and the latter a more complex one that is more expressive.
This simple aggregation function has no trainable weights. It calculates the weighted mean of each feature over all neighbors. As such, the number of features remains unchanged, i.e. the number of aggregated features equals $f_{\kappa_t(r)}^{(\ell-1)}$.
We use the adjacency matrix $A_r$ to build its out-degree matrix $\acute{D}_r$, a diagonal matrix of dimensions $|\kappa_s(r)|\times|\kappa_s(r)|$ satisfying $(\acute{D}_r)_{ij}=\delta_{ij}\sum_k(A_r)_{ik}$, where $\delta_{ij}$ is the Kronecker delta. The aggregated feature matrix $Y_r^{(\ell)}$ of Eq. (5) is then given by
$$Y_r^{(\ell)}=\acute{D}_r^{-1}A_rZ_{\kappa_t(r)}^{(\ell-1)}. \tag{7}$$
For efficiency, the matrix $\acute{D}_r^{-1}A_r$ can of course be calculated only once, in advance.
Other normalizations can of course be considered. For example, motivated by the symmetric normalized Laplacian operator, a variation of mean aggregation is considered by [3]. In it, the in-degree matrix $\grave{D}_r$, a diagonal matrix of dimensions $|\kappa_t(r)|\times|\kappa_t(r)|$ satisfying $(\grave{D}_r)_{ij}=\delta_{ij}\sum_k(A_r)_{kj}$, is also utilized and the aggregated feature matrix is calculated as
$$Y_r^{(\ell)}=\acute{D}_r^{-1/2}A_r\grave{D}_r^{-1/2}Z_{\kappa_t(r)}^{(\ell-1)}. \tag{8}$$
While similar, the two aggregators weigh the features differently.
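The difference between the two normalizations can be seen on a small example. The sketch below works on dense nested lists and assumes that rows without neighbors are left as zeros, a convention chosen for this illustration.

```python
import math

def row_normalize(adj):
    """Out-degree normalization: divide every row by its sum."""
    out = []
    for row in adj:
        degree = sum(row)
        out.append([a / degree if degree else 0.0 for a in row])
    return out

def symmetric_normalize(adj):
    """Divide entry (i, j) by sqrt(out-degree of i times in-degree of j)."""
    out_deg = [sum(row) for row in adj]
    in_deg = [sum(col) for col in zip(*adj)]
    return [
        [a / math.sqrt(out_deg[i] * in_deg[j]) if a else 0.0
         for j, a in enumerate(row)]
        for i, row in enumerate(adj)
    ]

adj = [[1.0, 1.0],
       [0.0, 1.0]]
mean_weights = row_normalize(adj)
symmetric_weights = symmetric_normalize(adj)
```

Under row normalization, both neighbors of the first node receive weight 0.5. Under symmetric normalization, the second neighbor is down-weighted relative to the first because it is also a neighbor of another node.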
The main advantage of the mean aggregator is its simplicity. There are no weights to train and the logic is straightforward. It is, however, not very expressive: for each feature, all neighbors contribute according to their edge weights. Many neighbors must exhibit extreme values before the effect is evident in the aggregated feature.
A more sophisticated aggregation function uses max pooling. In a sense, it picks out the most extreme evidence from all features, over all neighbors. The neighbors' features are first fed into a fully-connected neural layer, possibly changing the number of features in the process. For each output feature, the maximum over all neighbors is then selected.
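In the most general form of this aggregator, we select a pooling dimension $p_r^{(\ell)}$ for each relation $r$, and define a pooling kernel matrix $\Omega_r^{(\ell)}$ of dimensions $f_{\kappa_t(r)}^{(\ell-1)}\times p_r^{(\ell)}$ and a pooling bias vector $\beta_r^{(\ell)}$ of length $p_r^{(\ell)}$. These would both be trained with the rest of the network weights. The aggregated feature matrix $Y_r^{(\ell)}$ of Eq. (5) is then given by
$$Y_r^{(\ell)}=A_r\odot\sigma\left(Z_{\kappa_t(r)}^{(\ell-1)}\Omega_r^{(\ell)}+\beta_r^{(\ell)}\right), \tag{9}$$
where $\sigma$ is an activation function such as ReLU and we define the operator $\odot$ as
$$(A\odot B)_{ij}=\max_k A_{ik}B_{kj},$$
i.e. similar to the regular dot product, but where one takes the maximum instead of summing. The resulting number of aggregated features is then the pooling dimension $p_r^{(\ell)}$.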
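The max pooling aggregator can be sketched in plain Python as follows. The function names are hypothetical, the pooling kernel and bias are hand-picked rather than trained, and the sketch assumes ReLU as the pooling activation and at least one target node per relation.

```python
def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def max_product(adj, feats):
    """Like a matrix product, but taking the maximum over k instead of the sum."""
    return [[max(a * f for a, f in zip(row, col)) for col in zip(*feats)]
            for row in adj]

def max_pool_aggregate(adj, Z, kernel, bias):
    """Feed neighbor features through a dense ReLU layer, then max-pool per source."""
    hidden = [[max(0.0, h + b) for h, b in zip(row, bias)]
              for row in matmul(Z, kernel)]
    return max_product(adj, hidden)

# One source node connected to the first and third of three target nodes.
adj = [[1.0, 0.0, 1.0]]
Z = [[1.0, -2.0],
     [5.0, 5.0],
     [0.0, 3.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
pooled = max_pool_aggregate(adj, Z, identity, bias=[0.0, 0.0])
```

The second target node has the largest feature values but is not a neighbor of the source, so it does not contribute: the pooled row takes 1.0 from the first neighbor and 3.0 from the third.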
In practice, we have found it preferable to share the pooling weights between all relations having the same target node kind. The motivation for sharing weights is to reduce the complexity of the aggregator, and thus reduce overfitting. Moreover, it makes sense that the means to aggregate features of a certain kind should not depend strongly on the use of these aggregated features later on. In fact, this argument can be applied to any aggregation function which uses trained weights. While the general formalism allows for different weights for each relation, it is often advantageous to share weights in this manner.
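In the version of max pooling incorporating shared weights, we only have a pooling dimension $p_k^{(\ell)}$ defined for each kind $k\in\mathcal{K}$, and similarly a kernel matrix $\Omega_k^{(\ell)}$ and a bias vector $\beta_k^{(\ell)}$. The aggregated matrix is now
$$Y_r^{(\ell)}=A_r\odot\sigma\left(Z_{\kappa_t(r)}^{(\ell-1)}\Omega_{\kappa_t(r)}^{(\ell)}+\beta_{\kappa_t(r)}^{(\ell)}\right),$$
and the number of aggregated features is $p_{\kappa_t(r)}^{(\ell)}$.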
We can see that the max pooling aggregator is not as simple as the mean aggregator and contains trainable weights. However, it is much more expressive. It is designed to be sensitive to extreme neighbors by taking the maximum over them. The fully-connected neural layer also allows for great flexibility, as combinations of features can also be considered, different features can be given different relative weights and, by flipping the signs, the maximum function can effectively be changed to a minimum. While other, more complicated functions can also be considered, we have found the max pooling aggregator to perform very well and strike a good balance between simplicity and expressiveness.
Having provided a label for each seed, we can train the GCN using standard stochastic gradient descent. We can use any standard loss function such as categorical cross-entropy, and employ standard practices like dropout and regularization.
Notably, the way we have described the GCN architecture allows for mini-batch training without any special adaptation. After selecting a mini-batch of seeds $B\subset\mathcal{V}$ for training, we crawl for all their $L$-neighborhoods,
$$\mathcal{N}_L(B)=\bigcup_{v\in B}\mathcal{N}_L(v),$$
and build the subgraph induced by $\mathcal{N}_L(B)$. Effectively, this means selecting only the rows of the feature matrices $\{X_k\}$ and rows and columns of the adjacency matrices $\{A_r\}$ corresponding to nodes in $\mathcal{N}_L(B)$. These reduced matrices are then fed into the network in the same manner described above.
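Selecting the induced subgraph for a mini-batch reduces to index selection on the matrices. The sketch below assumes dense nested-list matrices and a hypothetical `keep` mapping from each node kind to the row indices of the nodes that belong to the batch neighborhood.

```python
def induced_subgraph(X, A, relations, keep):
    """Restrict feature and adjacency matrices to the kept nodes.

    keep: kind -> list of row indices of the nodes in the batch neighborhood.
    """
    X_sub = {kind: [X[kind][i] for i in keep[kind]] for kind in X}
    A_sub = {
        r: [[A[r][i][j] for j in keep[tgt]] for i in keep[src]]
        for r, (src, tgt) in relations.items()
    }
    return X_sub, A_sub

X = {"FileSeed": [[1.0], [2.0]], "Domain": [[0.1], [0.2], [0.3]]}
A = {"contacted": [[1.0, 0.0, 1.0],
                   [0.0, 1.0, 0.0]]}
relations = {"contacted": ("FileSeed", "Domain")}

# The batch holds the second seed, whose only neighbor is the second domain.
X_sub, A_sub = induced_subgraph(X, A, relations, {"FileSeed": [1], "Domain": [1]})
```

Rows of an adjacency matrix are selected by the kept nodes of the source kind and columns by the kept nodes of the target kind.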
The outputs of the GCN are the matrices Zk(L), one for each kind k which has seed elements. In our example, we are interested in classifying File elements, so we take as output the matrix corresponding to the File kind. Furthermore, we take only its rows corresponding to our seed elements, the ones we are interested in classifying. Its number of columns, fk(L), is the number of possible output classes of our classifier. If a trait to be inferred is continuous, it is represented by a single “class”. If discrete, the simplest case is of a binary classifier, having two output classes, e.g. benign and malicious.
The procedure for determining the class of each classified element is standard, as for most classifiers based on a neural network. In case of a continuous regressor, depending on the choice of activation function, the output value can simply be the inferred maliciousness level. In case of a discrete classifier, assuming the activation function $\sigma$ used for the last layer was chosen to be softmax, the values in each row are non-negative, and their sum is 1. We can therefore interpret them as the probabilities for the sample to belong to the various classes. The network is trained to provide this meaning to the values: As is standard practice, the training labels are encoded prior to training using “one-hot” encoding (see e.g. https://en.wikipedia.org/wiki/One-hot), i.e. the class $C_m$ is encoded as the vector $x_i=\delta_{im}$, assigning 100% probability to the class $C_m$ and zero probability to all others.
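The relation between softmax outputs, one-hot labels and the cross-entropy loss can be sketched as follows. The helper names are hypothetical, and the softmax subtracts the row maximum before exponentiating, a common numerical-stability step that does not change the result.

```python
import math

def softmax(row):
    """Turn a row of raw scores into non-negative values that sum to 1."""
    shift = max(row)
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(class_index, num_classes):
    """Encode class m as the vector x_i = delta_im."""
    return [1.0 if i == class_index else 0.0 for i in range(num_classes)]

def cross_entropy(target, probs):
    """Categorical cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t)

probs = softmax([2.0, 0.0])     # two classes, e.g. index 0 = malicious, 1 = benign
label = one_hot(0, 2)
loss = cross_entropy(label, probs)
```

The loss is small when the probability assigned to the labeled class is close to 1, which is what drives the outputs toward a probabilistic interpretation.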
Having the class probabilities for our newly classified samples, we can simply choose for each sample the class with the highest probability. Alternatively, a more sophisticated scheme can be implemented, in which each class has a threshold of minimal probability, and a class can be chosen only if its probability is above its threshold. In this case, we must also assign a default class, reported when no class has a probability above its threshold. These thresholds can be calibrated on a test set, e.g. by requiring a certain value of precision, recall or some other metric.
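The thresholded selection scheme can be sketched as follows. The function name, the threshold values and the choice of reporting the default class together with its own probability are assumptions made for this illustration.

```python
def choose_class(probs, thresholds, default):
    """Pick the most probable class among those above their threshold.

    probs:      class -> probability
    thresholds: class -> minimal probability (classes not listed use 0.0)
    default:    class reported when no class passes its threshold
    """
    eligible = {c: p for c, p in probs.items() if p > thresholds.get(c, 0.0)}
    if not eligible:
        return default, probs.get(default, 0.0)
    best = max(eligible, key=eligible.get)
    return best, eligible[best]

thresholds = {"malicious": 0.9, "benign": 0.5}   # hypothetical calibrated values
confident = choose_class({"malicious": 0.95, "benign": 0.05}, thresholds, "benign")
uncertain = choose_class({"malicious": 0.70, "benign": 0.30}, thresholds, "benign")
```

In the second call neither class passes its threshold, so the default class is reported even though "malicious" has the higher probability.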
In any case, once a class has been determined for a sample, its probability can be considered the class “score”, or confidence level, and reported along with the chosen class. Any system using the results of the classifier can then use this reported level to determine the severity of the classification. For example, in a binary file classifier capable of reporting either malicious or benign for each file, this score (i.e., probability) can determine whether an alert is raised, some automatic remediation performed, or some other action taken.
The Cybereason platform acts as a complex graph of elements and events, in which any new piece of information triggers a decision-making process in every element affected by it, directly or indirectly.
The system manages a wide variety of information types for every element. One prominent type of information is reputation indications gathered from external sources. This type of information is extremely impactful for decision making processes, since external information typically has the potential to give a broader context than what can be observed in the context of some specific detection scenario. As a result, many important techniques for providing security value rely on external reputation, such as detecting malicious elements or identifying properties of observed threats. This property is common to most advanced cybersecurity solutions.
One major drawback of using external reputation sources is that they require the relevant elements to have been observed beforehand. This is especially true in the common case where the element being analyzed cannot be exported for examination, due to timing constraints, privacy issues or other reasons. In these cases, the element is considered “unknown”, which can, in and of itself, be a valid indication, albeit a considerably less useful one than, say, “malicious” or “benign”.
The graph-based classifier described here provides an additional, novel, source of external reputation for various elements. For example, in the important case of files, it allows the classification of unknown files (i.e., never before seen in an external source such as VirusTotal), for which relations have been observed in the Cybereason platform to other elements which are known to the external source. Using this new classifier, we now have indirect reputation for these files, in the form of the output of the classifier—effectively making many “unknown” cases into “known”. This reputation can include a classification such as “malicious” or “benign”, an indication of the kind of threat, and so on, together with a confidence score that can further be used to make higher-level security decisions. This indirect reputation is now added as metadata to the element, in a similar way as is done with direct sources of external reputation for “known” elements. Notably, the classifier can provide reputation for elements for which we otherwise would not have any.
Furthermore, the same process can be used even on known elements, to provide an additional level of external information, one that combines the externally observed reputation of individual elements with their locally observed relations. An example would be an element whose external reputation provides only a marginal indication of maliciousness, not enough to convict it as a threat. However, observing it communicate with another element with marginal reputation, the graph classifier can potentially provide an aggregated, indirect, indication of maliciousness which is now enough to convict the sample.
Finally, the reputation indications provided by the graph-based classifier join all other collected data in the decision-making process constantly taking place within the Cybereason platform. More specifically, based on the result, alerts can be raised, customers notified, or any other action taken. Consequently, this novel classifier enriches the Cybereason platform and significantly expands its capabilities in handling different kinds of cybersecurity scenarios.
Thus, the Cybereason platform, acting as a profiler, determines a maliciousness level profile for the element based on aggregation of nodes and edges in the hypergraph. It then links information generated relating to the element and the maliciousness level profile for the element to various databases, including VT, and to the network. For example, for an incriminated file one or more actions can be taken, such as isolating a machine that received the file, killing processes started by the file, removing persistence of the file on the network or affected computer, cleaning infected samples, modifying risk assessment for computer or network, generating a report, collecting additional artifacts, triggering a search for related elements, blocking a user from taking actions and sending information to other IT or security systems. For other element types, some of the above actions are applicable as well. In addition, there are also other actions specific to particular element types, e.g. blocking an IP address or a web domain from network access, restricting user authorization, blocking access to an external device, shutting down computers, erasing memory devices, filtering e-mail messages, and many more (see http://www.cybereason.com/).
This application is related to, and claims priority from U.S. Provisional Patent Application No. 63/005,621 filed Apr. 6, 2020. Application 63/005,621 is hereby incorporated by reference in its entirety.
Number | Date | Country
---|---|---
63005621 | Apr 2020 | US