This application relates in general to computer security, and more particularly though not exclusively to a system and method for providing micro-clustering of objects.
Modern computing ecosystems often include “always on” broadband internet connections. These connections leave computing devices exposed to the internet, and the devices may be vulnerable to attack.
The present disclosure is best understood from the following detailed description when read with the accompanying FIGURES. It is emphasized that, in accordance with the standard practice in the industry, various features are not necessarily drawn to scale, and are used for illustration purposes only. Where a scale is shown, explicitly or implicitly, it provides only one illustrative example. In other embodiments, the dimensions of the various features may be arbitrarily increased or reduced for clarity of discussion. Furthermore, the various block diagrams illustrated herein disclose only one illustrative arrangement of logical elements. Those elements may be rearranged in different configurations, and elements shown in one block may, in appropriate circumstances, be moved to a different block or configuration.
A computer-implemented system and method of clustering a universe of featurized objects into micro-clusters includes selecting a vantage point having a feature vector; computing, for the featurized objects in the universe, respective distances from the vantage point, and sorting the featurized objects into a sorted container based on their distances from the vantage point; clustering adjacent objects into a plurality of micro-clusters based on determining that objects have a distance from a next adjacent object less than a maximum distance; and storing the micro-clusters onto a tangible computer-readable medium to modify operation of a computing apparatus based on objects in the micro-clusters.
The following disclosure provides many different embodiments, or examples, for implementing different features of the present disclosure. Specific examples of components and arrangements are described below to simplify the present disclosure. These are, of course, merely examples and are not intended to be limiting. Further, the present disclosure may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed. Different embodiments may have different advantages, and no particular advantage is necessarily required of any embodiment.
Clustering is a machine learning (ML) technique useful for identifying similarities between sample objects. In general terms, a clustering system may start by building a feature vector for each object, wherein various attributes of the object are quantified. The feature vector thus provides an array of numerical values (commonly real, floating-point values), with each value representing an attribute of the object. The system may then compute the “distance” between two objects by calculating a scalar distance between their feature vectors. Two objects that have a short distance are similar, while two objects that have a long distance are dissimilar. This may provide a more flexible comparison of objects than, for example, hashing, which is best at detecting objects that are identical to one another.
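The feature vector distance described above may be sketched as follows. This is an illustrative fragment only: Euclidean distance is one of many suitable metrics (an LSH difference score may be used instead), and the sample vectors here are hypothetical values invented for demonstration.

```python
import math

def euclidean_distance(a, b):
    """Scalar distance between two feature vectors (arrays of floats)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Similar objects have a short distance; dissimilar objects a long one.
sample_a = [0.9, 0.1, 0.5]
sample_b = [0.8, 0.2, 0.5]  # close to sample_a
sample_c = [0.0, 0.9, 0.1]  # far from sample_a

assert euclidean_distance(sample_a, sample_b) < euclidean_distance(sample_a, sample_c)
```

In practice, the metric need only satisfy the properties required by the chosen comparison scheme; the micro-clustering method itself is agnostic to which distance function is used.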
Once objects are featurized, the clustering system may compute the distances between objects, for example using a locality-sensitive hashing (LSH) algorithm such as MinHash or TLSH. By computing a scalar distance between objects, the system can determine which objects are most similar to one another. Those objects can then be “clustered” based on their proximity to one another. In some cases, clustering is based on selecting all samples within a minimum distance (MIN_DISTANCE, sometimes also ε) of a sample, and then performing the same operation on the selected samples. This can form large data structures of clusters, with hundreds or thousands of objects in a cluster, for example. One limitation of clustering is that, given such a data structure, while objects may have much in common with their immediate neighbors, objects on one extreme end of the cluster may have less in common with objects on another extreme end of the cluster. This can make it difficult to craft a signature that is broad enough to capture all points in the cluster, but narrow enough to avoid large numbers of false positives.
Clustering has many applications in many different branches of the data sciences. In this specification, detection of malware and other cybersecurity operations is used as an illustrative and nonlimiting example of the clustering method taught herein. However, the method is also applicable to other clustering applications, and those applications are not intended to be excluded from the present specification and the appended claims.
Sample clustering has multiple applications in cybersecurity and other industries. One common use case is clustering samples for the purpose of discovering a common pattern. This common pattern can then be used to achieve different objectives. One of these objectives in cybersecurity is to create signatures that can be used to detect all samples within a cluster. For example, a cluster can include several variants of ransomware in a family, even though those variants are not identical and would not be detected by a single hash. With a good cluster, researchers can identify common patterns among samples and author a signature that can capture the variants in the cluster, as well as similar variants that have not yet been encountered.
One issue with existing clustering systems is that they often form very large clusters of objects, containing a large variety of related samples. Yet samples on opposite edges of such a large cluster may be only tangentially related to one another, and it can be difficult to craft a signature that captures all those samples, while avoiding false positives. These large clusters may be considered “impure,” in the sense that they contain a large group of only vaguely related samples. Existing algorithms can also require constant fine-tuning of the parameters (such as how many clusters to derive), and may struggle to scale with big data. These factors create a challenging environment for security researchers trying to derive signatures from these clusters.
The present specification provides a system and method that will ordinarily yield “micro-clusters” that are smaller than those found in existing clustering algorithms. These micro-clusters may maintain higher purity and precision, and may require significantly less parameter tuning. This provides an effective method for deriving signatures to proactively detect samples from the cluster and beyond.
The micro-clustering begins with an arbitrary vantage point called “Moon,” which is expected to be far from all samples in the sample universe (e.g., just as every person on the earth is relatively far from the center of mass of the moon). Samples are sorted into a “sorted container,” based on their distance from the Moon vantage. Samples are then compared only to their adjacent neighbors, and the samples cluster together so long as they do not exceed a distance threshold. Once an adjacent sample exceeds the distance threshold, the micro-cluster is closed out, and other samples in the set may or may not form more micro-clusters.
On a next pass, the Moon vantage is not used (because it would yield the same results). Instead, a new vantage is selected, and a new sorted container is built based on distance from the new vantage. In an illustrative case, the new vantage point is not a distant feature vector like Moon, but rather is selected from the set of samples that did not cluster on the first pass. In a specific example, the median remaining sample may be selected. Micro-clustering passes may continue in this manner until a pass results in no new clusters, or until a MAX_PASSES threshold is reached. After all passes are complete, any samples that remain are considered unclustered samples.
While operating on sorted containers, the system may use a “laser cutting” strategy that obtains micro-clusters based on adjacent distance measurements, and cuts off a cluster once an adjacent sample is found above a distance threshold. Micro-clustered samples are then removed from the sorted container, and the process repeats with a new vantage point.
The system and method disclosed herein may realize advantages over existing clustering algorithms. For example, some algorithms, such as K-Means, have poor precision. Because of this, authoring signatures from unstable clusters yields poor coverage, or may not even be possible where no pattern can be extrapolated from a group of vaguely related samples.
Other algorithms are more precise, but tend to form very large clusters. Creating a signature to capture all samples in such a cluster can yield a signature so broad that it also captures many false positives. The model can be fine-tuned to find “just right” sizes, but this is time consuming, highly data dependent, and may be undesirable for maintenance purposes.
Algorithms like DBSCAN may struggle to form reliable clusters when there is no obvious drop in density between clusters. In other words, if the input data have many clusters that may overlap (for example, malware data), DBSCAN may group multiple groups into a single cluster (thus, obtaining a single big blob cluster), which reduces the reliability of the solution for signature authoring.
Existing clustering algorithms may also be slow, and may face some scalability issues when dealing with big data. For example, a DBSCAN on over 3 million samples can require up to 128 GB of RAM to compute. This can be problematic in modern anti-malware systems that have collected malware samples over the course of decades, and where the number of samples may be in the billions.
Advantageously, the method disclosed herein is highly memory efficient, and may not require significant parameter tuning/maintenance. Furthermore, it yields micro-clusters that are small enough to provide high precision and purity and reduce vague relationships. This enables better signature authoring.
The system and method of the present specification work well where the universe of samples can be represented by a Locality Sensitive Hashing (LSH) scheme that supports a distance metric between two arbitrary samples, in compliance with the triangle inequality. In other words, the LSH scheme should be able to measure the distance between any two samples, regardless of how distant they are. One example of a known LSH scheme that supports this is TLSH (the “T” has no specific meaning), but in general, any comparison algorithm that can calculate a distance between two samples is suitable. Within this specification, a value that can compare two samples, regardless of distance, is referred to as an “LSH-compliant” value.
The method disclosed can be iterated until the produced micro-clusters are satisfactory for the use case. The specific number of times to run the method may depend on the input data. It may be beneficial to define a MAX_PASSES limit, to help the system finish clustering within the desired time. For example, setting MAX_PASSES to 15 would run a maximum of 15 passes, and then stop clustering (possibly leaving some unclustered samples behind). Empirical evaluation has found that for many data sets, after 6 passes, the return on investment diminishes significantly. Thus, even though additional passes will still yield extra micro-clusters, these may be less relevant than the ones obtained during the initial passes.
In an illustrative example, an initial vantage point is selected. A vantage point is defined as any arbitrary (real or virtual) LSH-compliant value that may be used as a reference to measure the distance of the universe samples against. For example, a vantage point may be selected with the intent that it is not expected to be very close to any real samples. This may include, for example, creating a fake feature vector with all characters being the same character, such as hexadecimal ‘F,’ the last hexadecimal character. Any other value could be used, such as ‘7’ (the median hexadecimal character), or ‘0’ or ‘1’ (low characters). The hash could also be generated randomly, or based on an alternating pattern (e.g., “017F017F . . . ”). Any of these are statistically unlikely to be close to actual samples. Thus, example “Moon” vantage hashes may include:
Or any other selected value.
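The vantage hash constructions described above may be sketched as follows. This is an illustrative fragment only: the digest length of 70 hexadecimal characters is an assumption (common for TLSH digests, but the correct length depends on the LSH scheme in use), and the helper names are hypothetical.

```python
import random

HASH_LEN = 70  # assumed digest length; match the LSH scheme in use (e.g., TLSH)

def uniform_vantage(ch):
    """Vantage hash with every character the same, e.g. 'f' or '7' or '0'."""
    return ch * HASH_LEN

def pattern_vantage(pattern="017F"):
    """Vantage hash built from a repeating pattern, e.g. '017F017F...'."""
    return (pattern * (HASH_LEN // len(pattern) + 1))[:HASH_LEN]

def random_vantage(seed=None):
    """Randomly generated vantage hash."""
    rng = random.Random(seed)
    return "".join(rng.choice("0123456789abcdef") for _ in range(HASH_LEN))
```

Any of these constructions is statistically unlikely to lie close to a real sample, which is the only property the initial vantage point needs.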
Using the defined vantage point, the system creates a sorted container, including the full universe of samples. A sorted container may be a simple data structure like a list or a dictionary, which is natively sorted using a distance/comparison criterion. In this case, the system uses the LSH-compliant value distance, measured between each sample and the vantage point. Thus, when this sorted container is created, samples closer to the vantage point are at the beginning of the container, while samples distant from the vantage point will be placed towards the end of the container.
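The sorted container construction described above may be sketched as follows. This is an illustrative fragment only: the 1-D integer “samples” and absolute-difference metric stand in for real LSH digests and their distance function, and the distant vantage value is hypothetical.

```python
def build_sorted_container(samples, vantage, distance):
    """Sort samples by ascending distance from the vantage point.

    `distance` is any LSH-compliant metric supplied by the caller
    (e.g., a TLSH difference score)."""
    return sorted(samples, key=lambda s: distance(s, vantage))

# Toy demo: integers as "samples," absolute difference as the metric,
# and -1000 as a distant "Moon" vantage far from every sample.
container = build_sorted_container([52, 3, 200, 51, 1], vantage=-1000,
                                   distance=lambda a, b: abs(a - b))
# Samples closest to the vantage now sit at the beginning of the container.
```

With a distant vantage point, this sorting places mutually similar samples near one another in the container, which is what makes the adjacent-neighbor comparisons of the next step effective.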
The system then iterates through the sorted container, computing the distance between the current sample (N) and the previous sample (N−1). If this distance does not exceed MAX_DIST (which can be derived empirically depending on the input data), then sample (N) is included in a temporary micro-cluster. If this starts a new micro-cluster, the system also adds sample (N−1) to the micro-cluster, as the first member of the adjacent pair. Subsequently, the system adds sample (N) only if its distance from (N−1) is satisfactory. This identifies “distance valleys,” which are used to form micro-clusters. This effect is depicted in the accompanying FIGURES.
While iterating through a sorted container, if the distance between (N) and (N−1) exceeds MAX_DIST, then the temporary micro-cluster is closed. If the temporary micro-cluster has more than MIN_SAMPLES (which can be defined by the user), then the formed micro-cluster is saved and all the samples belonging to it are marked for removal from the sorted container. If the temporary micro-cluster does not have enough samples, then the temporary micro-cluster is discarded and reset. The iteration of the sorted container continues until the end of the container.
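One pass of the “laser cutting” iteration described above may be sketched as follows. This is an illustrative fragment only: the integer samples and absolute-difference metric stand in for real LSH digests and their distance function, and the function name is hypothetical.

```python
def laser_cut(sorted_container, distance, max_dist, min_samples):
    """One micro-clustering pass over a container already sorted by
    distance from a vantage point.

    Adjacent samples whose pairwise distance stays within max_dist
    accumulate into a temporary micro-cluster; a larger gap closes the
    cluster out. Clusters with fewer than min_samples are discarded,
    and their samples are kept for later passes."""
    clusters, remaining, temp = [], [], []
    for sample in sorted_container:
        if not temp or distance(sample, temp[-1]) <= max_dist:
            temp.append(sample)
        else:
            # Distance gap found: close out the temporary micro-cluster.
            if len(temp) >= min_samples:
                clusters.append(temp)
            else:
                remaining.extend(temp)  # too few samples; discard cluster
            temp = [sample]
    if len(temp) >= min_samples:  # close out the final cluster
        clusters.append(temp)
    else:
        remaining.extend(temp)
    return clusters, remaining

# Toy demo: two "distance valleys" and one isolated sample.
clusters, remaining = laser_cut([1, 2, 3, 50, 51, 200],
                                distance=lambda a, b: abs(a - b),
                                max_dist=5, min_samples=2)
# clusters -> [[1, 2, 3], [50, 51]]; remaining -> [200]
```

Returning the unclustered samples separately corresponds to marking clustered samples for removal: the `remaining` list is the sorted container as it stands after removal, ready for the next pass.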
Once the sorted container has been fully iterated, samples marked for removal (e.g., because they were sorted into micro-clusters) are removed from the sorted container.
After removing clustered samples from the sorted container, the system identifies a new sample to use as the vantage point for the next pass. This may be, for example, the sample in the “middle” or median of the remaining samples in the sorted container. This may be obtained, for example, by dividing the length of the remaining container by 2, and selecting the sample at that index as the new vantage point.
The system then iterates again, sorting the remaining samples based on their LSH-compliant distance from the new vantage point, and then clustering is repeated.
Once the system has executed MAX_PASSES, or an execution pass is unable to form any new micro-clusters, the formed micro-clusters are collected as the output of the method, and the method terminates.
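The full iteration described above — sort against a vantage, laser-cut, remove clustered samples, select the median remaining sample as the next vantage, and stop at MAX_PASSES or when a pass forms no new clusters — may be sketched end to end as follows. This is an illustrative fragment only, again using hypothetical 1-D integer samples in place of real LSH digests.

```python
def micro_cluster(universe, distance, initial_vantage,
                  max_dist, min_samples, max_passes=15):
    """Cluster a universe of samples into micro-clusters."""
    all_clusters = []
    remaining = list(universe)
    vantage = initial_vantage  # distant "Moon" vantage on the first pass
    for _ in range(max_passes):
        # Build the sorted container for this pass.
        container = sorted(remaining, key=lambda s: distance(s, vantage))
        clusters, leftover, temp = [], [], []
        for sample in container:
            if not temp or distance(sample, temp[-1]) <= max_dist:
                temp.append(sample)
            else:
                if len(temp) >= min_samples:
                    clusters.append(temp)
                else:
                    leftover.extend(temp)
                temp = [sample]
        if len(temp) >= min_samples:
            clusters.append(temp)
        else:
            leftover.extend(temp)
        if not clusters:
            break  # a pass forming no new micro-clusters terminates the method
        all_clusters.extend(clusters)
        remaining = leftover
        if not remaining:
            break
        # Next vantage: the median remaining sample in the sorted container.
        vantage = remaining[len(remaining) // 2]
    return all_clusters, remaining  # remaining = unclustered samples

clusters, unclustered = micro_cluster(
    [1, 2, 3, 50, 51, 52, 200],
    distance=lambda a, b: abs(a - b),
    initial_vantage=-1000, max_dist=5, min_samples=2)
```

In this toy run, the first pass forms micro-clusters around the two dense valleys, and the lone distant sample is left unclustered when no further pass can form a new cluster.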
In this manner, the resulting micro-clusters are “opportunistically” discovered and formed. This is a consequence of sorting the universe against a vantage point. This sorting ensures that very similar samples are near one another, and can be discovered by the laser cutting algorithm that forms micro-clusters.
One rationale behind micro-clustering is that the samples within a micro-cluster are similar enough for discovering patterns (e.g., strings or functionality that the samples have in common) so that a signature can be authored with confidence. A larger and less dense cluster may introduce the risk of not being able to find good commonality between the samples of the cluster.
The foregoing can be used to build or embody several example implementations, according to the teachings of the present specification. Some example implementations are included here as nonlimiting illustrations of these teachings.
There is disclosed in one example, one or more tangible, nontransitory computer-readable storage media having stored thereon executable instructions to cluster a universe of featurized objects into micro-clusters, the instructions to: receive a selected vantage point having a feature vector; compute, for the featurized objects in the universe, respective distances from the selected vantage point, and sort the featurized objects into a sorted container based on their distances from the selected vantage point; cluster adjacent objects into a plurality of micro-clusters based on determining that objects have a distance from a next adjacent object less than a maximum distance; and store the micro-clusters onto a tangible computer-readable medium to modify operation of a computing apparatus based on objects in the micro-clusters.
There is disclosed another example, wherein computing respective distances comprises using a locality-sensitive hashing (LSH) algorithm.
There is disclosed another example, wherein the LSH algorithm is TLSH.
There is disclosed another example, wherein the instructions are further to remove, from the sorted container, objects that were clustered into micro-clusters, selecting a new vantage point, building a new sorted container, and repeating clustering adjacent objects.
There is disclosed another example, wherein the new vantage point is a median object in the sorted container after removing the objects that were clustered into micro-clusters.
There is disclosed another example, wherein the instructions are further to iterate removing objects that were clustered into micro-clusters, selecting a new vantage point, building a new sorted container, and repeating clustering adjacent objects, until an iteration forms no new clusters.
There is disclosed another example, wherein the instructions are further to iterate removing objects that were clustered into micro-clusters, selecting a new vantage point, building a new sorted container, and repeating clustering adjacent objects, up to a positive integer value MAX_PASSES.
There is disclosed another example, wherein the instructions are further to compute the maximum distance based on the universe of featurized objects.
There is disclosed another example, wherein the instructions are further to reject a micro-cluster if it has fewer than a positive integer MIN_SAMPLES of samples.
There is disclosed another example, wherein the instructions are further to close out a micro-cluster after determining that a next adjacent object has a distance greater than the maximum distance.
There is disclosed another example, wherein the selected vantage point comprises a feature vector with all characters being a common character.
There is disclosed another example, wherein the selected vantage point comprises a feature vector with all characters being hexadecimal ‘f’.
There is disclosed another example, wherein the selected vantage point comprises a feature vector with all characters being hexadecimal ‘7’.
There is disclosed another example, wherein the selected vantage point comprises a feature vector with all characters being hexadecimal ‘1’ or ‘0’.
There is disclosed another example, wherein the selected vantage point comprises a feature vector with characters comprising a repeating pattern.
There is disclosed another example, wherein the selected vantage point comprises a randomly generated feature vector.
There is disclosed another example, wherein the instructions are further to find, for a micro-cluster, an object signature that reads on all objects in the micro-cluster.
There is disclosed another example, wherein the instructions are further to use the object signature to detect and remediate computer malware.
There is disclosed another example of a computer-implemented method of clustering a universe of featurized objects into micro-clusters, comprising selecting a vantage point having a feature vector; computing, for the featurized objects in the universe, respective distances from the vantage point, and sorting the featurized objects into a sorted container based on their distances from the vantage point; clustering adjacent objects into a plurality of micro-clusters based on determining that objects have a distance from a next adjacent object less than a maximum distance; and storing the micro-clusters onto a tangible computer-readable medium to modify operation of a computing apparatus based on objects in the micro-clusters.
There is disclosed another example, wherein computing respective distances comprises using a locality-sensitive hashing (LSH) algorithm.
There is disclosed another example, wherein the LSH algorithm is TLSH.
There is disclosed another example, further comprising removing, from the sorted container, objects that were clustered into micro-clusters, selecting a new vantage point, building a new sorted container, and repeating clustering adjacent objects.
There is disclosed another example, wherein the new vantage point is a median object in the sorted container after removing the objects that were clustered into micro-clusters.
There is disclosed another example, further comprising iterating removing objects that were clustered into micro-clusters, selecting a new vantage point, building a new sorted container, and repeating clustering adjacent objects, until an iteration forms no new clusters.
There is disclosed another example, further comprising iterating removing objects that were clustered into micro-clusters, selecting a new vantage point, building a new sorted container, and repeating clustering adjacent objects, up to a positive integer value MAX_PASSES.
There is disclosed another example, further comprising computing the maximum distance based on the universe of featurized objects.
There is disclosed another example, further comprising rejecting a micro-cluster if it has fewer than a positive integer MIN_SAMPLES of samples.
There is disclosed another example, further comprising closing out a micro-cluster after determining that the next adjacent object has a distance greater than the maximum distance.
There is disclosed another example, wherein selecting the vantage point comprises selecting a feature vector with all characters being a common character.
There is disclosed another example, wherein selecting the vantage point comprises selecting a feature vector with all characters being hexadecimal ‘f’.
There is disclosed another example, wherein selecting the vantage point comprises selecting a feature vector with all characters being hexadecimal ‘7’.
There is disclosed another example, wherein selecting the vantage point comprises selecting a feature vector with all characters being hexadecimal ‘1’ or ‘0’.
There is disclosed another example, wherein selecting the vantage point comprises selecting a feature vector with characters comprising a repeating pattern.
There is disclosed another example, wherein selecting the vantage point comprises randomly generating a feature vector.
There is disclosed another example, further comprising finding, for a micro-cluster, an object signature that reads on all objects in the micro-cluster.
There is disclosed another example, further comprising using the object signature to detect and remediate computer malware.
There is disclosed another example of an apparatus comprising means for performing the method.
There is disclosed another example, wherein the means for performing the method comprise a processor and a memory.
There is disclosed another example, wherein the memory comprises machine-readable instructions that, when executed, cause the apparatus to perform the method.
There is disclosed another example, wherein the apparatus is a computing system.
There is disclosed another example of at least one computer readable medium comprising instructions that, when executed, implement a method or realize an apparatus as described.
There is disclosed another example of a computing platform, comprising: at least one hardware platform comprising a processor circuit and one or more memories; and instructions encoded with the one or more memories to instruct the processor circuit to cluster a universe of featurized objects into micro-clusters, the instructions to: receive a selected vantage point having a feature vector; compute, for the featurized objects in the universe, respective distances from the selected vantage point, and sort the featurized objects into a sorted container based on their distances from the selected vantage point; cluster adjacent objects into a plurality of micro-clusters based on determining that objects have a distance from a next adjacent object less than a maximum distance; and store the micro-clusters onto a tangible computer-readable medium to modify operation of a computing apparatus based on objects in the micro-clusters.
There is disclosed another example, wherein computing respective distances comprises using a locality-sensitive hashing (LSH) algorithm.
There is disclosed another example, wherein the LSH algorithm is TLSH.
There is disclosed another example, wherein the instructions are further to remove, from the sorted container, objects that were clustered into micro-clusters, selecting a new vantage point, building a new sorted container, and repeating clustering adjacent objects.
There is disclosed another example, wherein the new vantage point is a median object in the sorted container after removing the objects that were clustered into micro-clusters.
There is disclosed another example, wherein the instructions are further to iterate removing objects that were clustered into micro-clusters, selecting a new vantage point, building a new sorted container, and repeating clustering adjacent objects, until an iteration forms no new clusters.
There is disclosed another example, wherein the instructions are further to iterate removing objects that were clustered into micro-clusters, selecting a new vantage point, building a new sorted container, and repeating clustering adjacent objects, up to a positive integer value MAX_PASSES.
There is disclosed another example, wherein the instructions are further to compute the maximum distance based on the universe of featurized objects.
There is disclosed another example, wherein the instructions are further to reject a micro-cluster if it has fewer than a positive integer MIN_SAMPLES of samples.
There is disclosed another example, wherein the instructions are further to close out a micro-cluster after determining that a next adjacent object has a distance greater than the maximum distance.
There is disclosed another example, wherein the selected vantage point comprises a feature vector with all characters being a common character.
There is disclosed another example, wherein the selected vantage point comprises a feature vector with all characters being hexadecimal ‘f’.
There is disclosed another example, wherein the selected vantage point comprises a feature vector with all characters being hexadecimal ‘7’.
There is disclosed another example, wherein the selected vantage point comprises a feature vector with all characters being hexadecimal ‘1’ or ‘0’.
There is disclosed another example, wherein the selected vantage point comprises a feature vector with characters comprising a repeating pattern.
There is disclosed another example, wherein the selected vantage point comprises a randomly generated feature vector.
There is disclosed another example, wherein the instructions are further to find, for a micro-cluster, an object signature that reads on all objects in the micro-cluster.
There is disclosed another example, wherein the instructions are further to use the object signature to detect and remediate computer malware.
A system and method for micro-clustering will now be described with more particular reference to the attached FIGURES. It should be noted that throughout the FIGURES, certain reference numerals may be repeated to indicate that a particular device or block is referenced multiple times across several FIGURES. In other cases, similar elements may be given new numbers in different FIGURES. Neither of these practices is intended to require a particular relationship between the various embodiments disclosed. In certain examples, a genus or class of elements may be referred to by a reference numeral (“widget 10”), while individual species or examples of the element may be referred to by a hyphenated numeral (“first specific widget 10-1” and “second specific widget 10-2”).
Because one concern with security ecosystem 100 is to identify malware objects, the system may benefit from strong signature matching, which may be facilitated by the micro-clustering method of the present specification.
Security ecosystem 100 may include one or more protected enterprises 102. A single protected enterprise 102 is illustrated here for simplicity, and could be a business enterprise, a government entity, a family, a nonprofit organization, a church, or any other organization that may subscribe to security services provided, for example, by security services provider 190.
Within security ecosystem 100, one or more users 120 operate one or more client devices 110. A single user 120 and single client device 110 are illustrated here for simplicity, but a home or enterprise may have multiple users, each of which may have multiple devices, such as desktop computers, laptop computers, smart phones, tablets, hybrids, or similar.
Client devices 110 may be communicatively coupled to one another and to other network resources via local network 170. Local network 170 may be any suitable network or combination of one or more networks operating on one or more suitable networking protocols, including a local area network, a home network, an intranet, a virtual network, a wide area network, a wireless network, a cellular network, or the internet (optionally accessed via a proxy, virtual machine, or other similar security mechanism) by way of nonlimiting example. Local network 170 may also include one or more servers, firewalls, routers, switches, security appliances, antivirus servers, or other network devices, which may be single-purpose appliances, virtual machines, containers, or functions. Some functions may be provided on client devices 110.
In this illustration, local network 170 is shown as a single network for simplicity, but in some embodiments, local network 170 may include any number of networks, such as one or more intranets connected to the internet. Local network 170 may also provide access to an external network, such as the internet, via external network 172. External network 172 may similarly be any suitable type of network.
Local network 170 may connect to the internet via gateway 108, which may be responsible, among other things, for providing a logical boundary between local network 170 and external network 172. Local network 170 may also provide services such as dynamic host configuration protocol (DHCP), gateway services, router services, and switching services, and may act as a security portal across local boundary 104.
In some embodiments, gateway 108 could be a simple home router, or could be a sophisticated enterprise infrastructure including routers, gateways, firewalls, security services, deep packet inspection, web servers, or other services.
In further embodiments, gateway 108 may be a standalone internet appliance. Such embodiments are popular in cases in which ecosystem 100 includes a home or small business. In other cases, gateway 108 may run as a virtual machine or in another virtualized manner. In larger enterprises that feature service function chaining (SFC) or network functions virtualization (NFV), gateway 108 may include one or more service functions and/or virtualized network functions.
Local network 170 may also include a number of discrete IoT devices. For example, local network 170 may include IoT functionality to control lighting 132, thermostats or other environmental controls 134, a security system 136, and any number of other devices 140. Other devices 140 may include, as illustrative and nonlimiting examples, network attached storage (NAS), computers, printers, smart televisions, smart refrigerators, smart vacuum cleaners and other appliances, and network connected vehicles.
Local network 170 may communicate across local boundary 104 with external network 172. Local boundary 104 may represent a physical, logical, or other boundary. External network 172 may include, for example, websites, servers, network protocols, and other network-based services. In one example, an attacker 180 (or other similar malicious or negligent actor) also connects to external network 172. A security services provider 190 may provide services to local network 170, such as security software, security updates, network appliances, or similar. For example, MCAFEE, LLC provides a comprehensive suite of security services that may be used to protect local network 170 and the various devices connected to it.
It may be a goal of users 120 to successfully operate devices on local network 170 without interference from attacker 180. In one example, attacker 180 is a malware author whose goal or purpose is to cause malicious harm or mischief, for example, by injecting malicious object 182 into client device 110. Once malicious object 182 gains access to client device 110, it may try to perform work such as social engineering of user 120, a hardware-based attack on client device 110, modifying storage 150 (or volatile memory), modifying client application 112 (which may be running in memory), or gaining access to local resources. Furthermore, attacks may be directed at IoT objects. IoT objects can introduce new security challenges, as they may be highly heterogeneous, and in some cases may be designed with minimal or no security considerations. To the extent that these devices have security, it may be added on as an afterthought. Thus, IoT devices may in some cases represent new attack vectors for attacker 180 to leverage against local network 170.
Malicious harm or mischief may take the form of installing rootkits or other malware on client devices 110 to tamper with the system, installing spyware or adware to collect personal and commercial data, defacing websites, operating a botnet such as a spam server, or simply annoying and harassing users 120. Thus, one aim of attacker 180 may be to install his malware on one or more client devices 110 or any of the IoT devices described. As used throughout this specification, malicious software (“malware”) includes any object configured to provide unwanted results or do unwanted work. In many cases, malware objects will be executable objects, including, by way of nonlimiting examples, viruses, Trojans, zombies, rootkits, backdoors, worms, spyware, adware, ransomware, dialers, payloads, malicious browser helper objects, tracking cookies, loggers, or similar objects designed to take a potentially-unwanted action, including, by way of nonlimiting example, data destruction, data denial, covert data collection, browser hijacking, network proxy or redirection, covert tracking, data logging, keylogging, excessive or deliberate barriers to removal, contact harvesting, and unauthorized self-propagation. In some cases, malware could also include negligently-developed software that causes such results even without specific intent.
In enterprise contexts, attacker 180 may also want to commit industrial or other espionage, such as stealing classified or proprietary data, stealing identities, or gaining unauthorized access to enterprise resources. Thus, attacker 180's strategy may also include trying to gain physical access to one or more client devices 110 and operating them without authorization, so that an effective security policy may also include provisions for preventing such access.
In another example, a software developer may not explicitly have malicious intent, but may develop software that poses a security risk. For example, a well-known and often-exploited security flaw is the so-called buffer overrun, in which a malicious user is able to enter an overlong string into an input form and thus gain the ability to execute arbitrary instructions or operate with elevated privileges on a computing device. Buffer overruns may be the result, for example, of poor input validation or use of insecure libraries, and in many cases arise in nonobvious contexts. Thus, although not malicious, a developer contributing software to an application repository or programming an IoT device may inadvertently provide attack vectors for attacker 180. Poorly-written applications may also cause inherent problems, such as crashes, data loss, or other undesirable behavior. Because such software may be desirable itself, it may be beneficial for developers to occasionally provide updates or patches that repair vulnerabilities as they become known. However, from a security perspective, these updates and patches are essentially new objects that must themselves be validated.
Protected enterprise 102 may contract with or subscribe to a security services provider 190, which may provide security services, updates, antivirus definitions, patches, products, and services. MCAFEE, LLC is a nonlimiting example of such a security services provider that offers comprehensive security and antivirus solutions. In some cases, security services provider 190 may include a threat intelligence capability such as the global threat intelligence (GTI™) database provided by MCAFEE, LLC, or similar competing products. Security services provider 190 may update its threat intelligence database by analyzing new candidate malicious objects as they appear on client networks and characterizing them as malicious or benign.
Other security considerations within security ecosystem 100 may include parents' or employers' desire to protect children or employees from undesirable content, such as pornography, adware, spyware, age-inappropriate content, advocacy for certain political, religious, or social movements, or forums for discussing illegal or dangerous activities, by way of nonlimiting example.
The present specification teaches a novel clustering algorithm. Illustrative examples of known clustering algorithms include, without limitation, DBSCAN, K-means, Binary Tree, Fuzzy Clustering, Affinity Propagation, Normal Distribution, Mean Shift, Hierarchical Clustering, Spectral Clustering, and Mean Clustering.
When a system or an enterprise encounters a new, unknown object, the object may be featurized and mapped into a cluster space. If the object clusters strongly with other objects with known reputations, then at least as an initial classification, the system may assume that the object has the same classification as other objects in the cluster. For example, if all objects are known safe, the new object may be treated as safe. If all are known malicious, the object may be treated as malicious. If there are different classifications within the cluster, then the classification for a majority or supermajority of the known objects may be used for the new object.
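The majority/supermajority rule described above can be sketched as follows. This is a minimal illustration, not the specification's implementation; the function name and the 0.5 default threshold are illustrative assumptions.

```python
from collections import Counter

def cluster_reputation(known_reputations, threshold=0.5):
    """Assign a provisional reputation from a cluster's known members.

    known_reputations: labels (e.g., "safe", "malicious") of cluster
    members that already have reputations.
    threshold: fraction of known members that must share a label
    (0.5 = majority, 0.67 = supermajority, and so on).
    Returns the consensus label, or None if no label clears the threshold.
    """
    if not known_reputations:
        return None
    label, count = Counter(known_reputations).most_common(1)[0]
    if count / len(known_reputations) > threshold:
        return label
    return None

# A new, unknown object clustering with mostly-safe known objects:
print(cluster_reputation(["safe", "safe", "safe", "malicious"]))  # safe
```

When no label clears the threshold, the caller can fall back to another strategy, such as consulting a small number of nearest neighbors.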
In the example of
In one illustrative example, mapping the objects may include extracting features from the object into a feature vector. DBSCAN is used here as an illustration of the principles of clustering. Other clustering approaches may use different algorithms, although the concept of similarity based on proximity may, at some level, be preserved.
Nonlimiting and illustrative examples of features that may be used in a feature vector include:
The foregoing list is illustrative only and non-exhaustive.
For a particular embodiment, the system designer selects a number n of features for the system, and extracts those n features from each sample. In this case, n may be any integer where n≥1, although as the number of features increases, so does the complexity of the system. Thus, a system designer may trade off between feature granularity and system performance, depending on the needs of an embodiment and the available compute resources.
In an illustrative clustering algorithm, each sample is mapped into an n-dimensional space, and the system computes the scalar distance between each point and one or more nearest neighbors. The designer may select a distance ε, and any objects within distance ε of one another cluster together.
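As a minimal sketch of this proximity test, the following assumes Euclidean distance over feature vectors represented as tuples; the function name and sample coordinates are illustrative only.

```python
import math

def within_epsilon(points, eps):
    """Return index pairs of points that lie within distance eps of one
    another in n-dimensional feature space (Euclidean distance)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    pairs = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= eps:
                pairs.append((i, j))
    return pairs

samples = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
print(within_epsilon(samples, eps=0.5))  # [(0, 1)]
```

In a real embodiment the pairwise loop would be replaced by an indexed neighbor search, since the naive version is quadratic in the number of samples.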
In this example, the samples have clustered into a plurality of clusters, namely cluster 204, cluster 208, cluster 212, cluster 216, cluster 220, and cluster 224. A small number of points are illustrated here to simplify the illustration, although in a real-world use case, the number of points may be in the hundreds, thousands, millions, or billions. The clusters and distances are not necessarily shown to scale, each point may represent some greater number of points, and each connection/proximity line may represent one or more proximity connections (e.g., each proximity line represents a connection to a point within distance ε).
One issue with large clusters is that it can be difficult to craft a meaningful signature of features that is broad enough to capture all points in the cluster, and narrow enough to be meaningful. Thus, one advantage of the present specification is that the system and method disclosed can form micro-clusters, which are generally expected to be smaller and more focused than the large clusters from an algorithm such as DBSCAN.
The clusters illustrated here may include a number of “core points,” which are points proximate to at least minPTS points. For example, if minPTS=4, then a sample must be proximate to at least three other points (counting itself as the fourth proximate point) to be considered a core point. Core points are important to DBSCAN and some other clustering methods because core points consistently map to the same cluster across different runs, even if the data are presented in a different order. A noncore point 236 is also illustrated. This point is within distance ε of at least one other point, but not of enough points to be considered a core point. Thus, depending on the ordering of the data, the point may cluster with either cluster 204 or cluster 208. Noncore point 236 appears as a “bridge” between clusters 204 and 208.
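A core-point check of this kind might be sketched as follows, assuming Euclidean distance; the function name and the example values for eps and min_pts are illustrative, not prescribed by the specification.

```python
def core_points(points, eps, min_pts):
    """Identify DBSCAN-style core points: points whose eps-neighborhood
    (counting the point itself) holds at least min_pts points."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    cores = []
    for i, p in enumerate(points):
        # Neighbor count includes the point itself, per the minPTS convention.
        neighbors = sum(1 for q in points if dist(p, q) <= eps)
        if neighbors >= min_pts:
            cores.append(i)
    return cores

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (9, 9)]
print(core_points(pts, eps=1.5, min_pts=4))  # [0, 1, 2, 3]
```

The isolated point at (9, 9) has only itself in its neighborhood and so is never a core point, mirroring the outlier behavior discussed below.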
Clusters 212 and 216 may be two separate clusters, or one single cluster, depending on the designer's selection of minPTS. A chain of points with few connections forms a bridge between the two clusters. This illustrates the principle that the selection of minPTS may influence the number of clusters that form. A higher value for minPTS may form smaller clusters with greater similarity. A smaller value may form larger clusters with rougher similarity. The selection will depend on the needs of a particular embodiment.
Clusters 208 and 212 are joined by a bridge 232 of noncore points. This bridge illustrates that some points may share some similarity, but as the cluster drifts, the overall similarity of the cluster may decrease, and at some point its predictive value may be compromised. Indeed, clusters 204, 208, 212, and 216 could form one large supercluster if minPTS is selected to be sufficiently small. Whether this supercluster would be sufficiently predictive of the properties of members of the cluster may depend on the specific use case.
In contrast, clusters 220 and 224 have no bridge to any other clusters, so those clusters may remain the same, regardless of the value of minPTS. However, the addition of new data to the dataset may influence later runs of the algorithm and may form bridges.
One advantage of the present clustering method is that, in at least some examples, it is unnecessary to differentiate core points from noncore points. Because micro-clustering does not rely on such definitions, all points in a cluster can be considered as effectively equal.
Also illustrated here is an outlier point 228. Outlier 228 is not similar enough to any other sample to cluster with it, regardless of the value of minPTS. Thus, clusters are not predictive of the properties of outlier 228.
In an illustrative use case, a security services vendor receives new samples that have been found in the wild, such as a batch of new PEs. These PEs have not yet been characterized, and so they have not yet been assigned reputations. Detailed static analysis, dynamic analysis, and/or sandbox analysis may yield a high-confidence prediction of whether the new objects are safe (green), suspicious (yellow), or malicious (red). However, performing analysis of all the new objects takes time and compute resources, sometimes significant amounts (e.g., the analysis may take hours or days, depending on the nature and number of the samples). To provide a medium-confidence intermediate reputation for the new objects, the security services vendor may cluster them in an object space as illustrated. If a plurality, majority, supermajority, or other proportion of samples in the clusters have a common reputation, then the new samples may be assigned that reputation, at least preliminarily (or permanently, as required). If a particular cluster does not have a satisfactory proportion of reputations (e.g., if the results are too mixed to be useful), then as an alternative, the sample may get its reputation from a selected number of nearest neighbors.
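The nearest-neighbor fallback mentioned at the end of this passage might look like the following sketch; the function name, the choice of Euclidean distance, and the default k=5 are all illustrative assumptions.

```python
from collections import Counter

def nearest_neighbor_reputation(sample, labeled, k=5):
    """Fallback when a cluster's reputations are too mixed: take the
    reputation held by most of the sample's k nearest labeled neighbors.

    sample: feature vector (tuple of floats)
    labeled: list of (feature_vector, label) pairs with known reputations
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(labeled, key=lambda pair: dist(sample, pair[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

known = [((0, 0), "safe"), ((0, 1), "safe"), ((5, 5), "malicious")]
print(nearest_neighbor_reputation((0, 0.5), known, k=2))  # safe
```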
Starting in
In
Turning to
To provide this Moon vantage, a fictional or arbitrary feature vector may be constructed, with the expectation that the feature vector for the Moon vantage will not be similar to any of the samples in the sample universe. In one illustrative example, the feature vector hash for the Moon vantage sample is selected as a string of identical characters, such as hexadecimal F (the last hexadecimal digit), hexadecimal 7 (the median hexadecimal digit), or hexadecimal 0 or 1 (low hexadecimal digits). In practice, it is extraordinarily unlikely for any feature vector hash to be a string of identical digits. Other methods of selecting a Moon vantage are also available, such as using an alternating series of digits, or generating a feature vector hash using a random or pseudorandom number generator. If the feature vector hash is sufficiently large, then even selecting a random hash is likely to yield a feature vector that is not close to any of the samples in unsorted sample universe 302.
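A Moon vantage hash of the kinds described might be constructed as in this sketch; the hash length, the style names, and the function name are illustrative assumptions, and the random variant simply draws hex digits from a cryptographic generator.

```python
import secrets

def moon_vantage(hash_len=64, style="identical"):
    """Construct a 'Moon' vantage feature-vector hash that is very
    unlikely to resemble any real sample's hash.

    hash_len: number of hexadecimal digits in the feature vector hash
    style: 'identical' (all 'f'), 'alternating', or 'random'
    (these names are illustrative choices, not prescribed terms).
    """
    if style == "identical":
        return "f" * hash_len            # e.g., 'ffff...f'
    if style == "alternating":
        return ("a5" * hash_len)[:hash_len]  # alternating digit pattern
    return secrets.token_hex(hash_len // 2)  # random hex string

print(moon_vantage(8))                   # ffffffff
print(moon_vantage(8, "alternating"))    # a5a5a5a5
```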
To begin the method, the system may compute the distance from vantage sample 304 to each individual sample. This yields a scalar value that can be easily compared. The samples may then be sorted according to their distance to the Moon vantage sample 304, forming first pass sorted container 308-1. For example, sample S1 is the closest sample to vantage sample 304, while sample S12 is the furthest from vantage sample 304. The other samples are ordered according to their distance from the Moon vantage sample 304.
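The first-pass sort can be sketched as follows, assuming (purely for illustration) that each sample's feature vector is summarized as a hex hash and that bitwise Hamming distance stands in for an LSH-compliant distance; the sample hashes shown are hypothetical.

```python
def hamming(a, b):
    """Bitwise Hamming distance between two equal-length hex hashes
    (a simple stand-in for an LSH-compliant distance)."""
    return bin(int(a, 16) ^ int(b, 16)).count("1")

def sorted_container(samples, vantage):
    """Sort samples (name -> hash) by distance to the vantage hash,
    nearest first, forming the first-pass sorted container."""
    return sorted(samples, key=lambda name: hamming(samples[name], vantage))

samples = {"S1": "0f", "S2": "ff", "S3": "00"}
print(sorted_container(samples, "ff"))  # ['S2', 'S1', 'S3']
```

Because the distance to the vantage is a scalar, adjacent entries in the sorted container are natural candidates for the pairwise comparisons described below.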
Turning to
The system then continues scanning the sorted container to identify additional micro-clusters such as micro-cluster 316-2.
Once micro-clusters have been found, their member samples may be excluded from the sorted container, and a new sorted container may be formed that includes only those samples that did not cluster in the first pass. However, performing a second pass with the same vantage sample 304 would yield the same results because the distances would not change. Thus, a new vantage sample is selected. In an illustrative case, the new vantage sample is a median sample 320-1, or in other words, the sample in the middle of the new sorted container once the micro-clustered samples have been removed. In this case, sample 8 will be the median sample once the samples of micro-clusters 316-1 and 316-2 are removed.
Turning to
Turning to
Turning to
While more complex algorithms such as DBSCAN attempt to cluster every sample into a cluster, the method of the present specification need not be concerned with unclustered samples. One purpose of the method is to define smaller micro-clusters that can be used to create meaningful signatures for identifying objects that would cluster with these micro-clusters. By way of illustration, a DBSCAN cluster that has 100 objects may be reduced to a set of three micro-clusters that have 30 objects each. In this case, some outliers will be left unclustered. But from the three micro-clusters that form, it may be more practical to craft a signature that reads on every sample in that micro-cluster. This new signature may be broad enough to capture other similar objects, while being narrow enough to avoid too many false positives. Thus, the micro-clustering algorithm may yield more clusters, and thus provide more signatures, while the signatures that are provided may be more useful and practical. Furthermore, the exclusion of certain points that do not cluster may not be problematic, because the signature of clusters that do form is still a useful tool for identifying malicious objects in the wild.
This method may not form perfect clusters, and may not be intended to, in the sense that it may form clusters that are smaller than strictly necessary. For example, micro-cluster 316-1 might theoretically have incorporated sample 12 under a different clustering algorithm. However, sample 12 never had a chance to be sorted into cluster 316-1, because cluster 316-1 closed out immediately in the first pass, which removed the theoretical possibility of including sample 12 in a later pass. However, the intent of this method is not necessarily to produce the largest possible cluster, but to produce micro-clusters that are precise enough to be used for signature authoring, or for another purpose for which two smaller clusters may serve just as well as (or better than) one large cluster.
In block 504, the system selects a “Moon” vantage point. The Moon vantage point is intended to be a point that is dissimilar from all points in the unsorted set of objects in a sample universe.
Meta-block 510 represents operations that are repeated until MAX_PASSES has been reached, or until micro-clusters no longer form in a pass. In one illustrative example, MAX_PASSES may be on the order of 6 to 10 passes. In some cases, beyond that, diminishing returns are experienced.
In block 508, the system creates a sorted container including all objects in the sample universe. To create this sorted container, the system sorts each sample by an LSH-compliant distance from the vantage point. In the case of the first pass, the vantage point is the previously selected “Moon” vantage point. In subsequent passes, the vantage point may be selected from the objects remaining in the sorted container after a pass.
Meta-block 515 represents a process that is iterated for each sorted container. Starting in block 516, the system selects a sample (N) and a sample (N−1), representing an adjacent sample. The system measures the distance between (N) and (N−1), such as according to an LSH-compliant distance measurement.
In decision block 520, the system determines whether the distance between the two samples is less than the threshold or maximum distance.
If the distance is less than the threshold, the two points are to be clustered together. In that case, in block 532, sample (N−1) is added to the micro-cluster that (N) belongs to. The system then increments N (N++) and control returns to block 516, where the next sample is checked.
Returning to decision block 520, if the distance between the two samples is greater than MAX_DIST, then in block 524 the system determines whether the current micro-cluster is greater than a threshold of minimum samples for a micro-cluster. This is to ensure that micro-clusters are not formed with too few samples (e.g., with two or only a few samples, depending on the design constraints). If the micro-cluster is too small, then in block 512 the system resets the micro-cluster, meaning that the objects that were clustered into that micro-cluster remain unclustered for this pass. N is then incremented (N++), and control returns to block 516.
Returning to decision block 524, if the number of samples in the micro-cluster is greater than the minimum micro-cluster size, then in block 528, the system saves the formed micro-cluster and marks the clustered samples for removal.
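Blocks 516 through 532 can be sketched as a single linear scan over the sorted container; the function name and parameters are illustrative, and dist stands in for whatever LSH-compliant measurement an embodiment uses.

```python
def scan_pass(ordered, max_dist, min_size, dist):
    """One pass over a sorted container: walk adjacent samples and
    gather runs whose neighbor-to-neighbor distance stays below
    max_dist; keep runs of at least min_size as micro-clusters.

    ordered: samples sorted by distance to the current vantage point
    dist: distance function between two samples (e.g., LSH-based)
    """
    if not ordered:
        return []
    clusters, current = [], [ordered[0]]
    for prev, cur in zip(ordered, ordered[1:]):
        if dist(prev, cur) < max_dist:
            current.append(cur)              # block 532: join current cluster
        else:
            if len(current) >= min_size:
                clusters.append(current)     # block 528: save micro-cluster
            current = [cur]                  # block 512: reset, start anew
    if len(current) >= min_size:
        clusters.append(current)             # close out the final run
    return clusters

# Integers with absolute difference as a toy distance:
print(scan_pass([1, 2, 3, 10, 11, 12, 30], 2, 3, lambda a, b: abs(a - b)))
# [[1, 2, 3], [10, 11, 12]]
```

The lone sample 30 is left unclustered, matching the method's tolerance for outliers discussed below.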
After clusters have been formed in meta-block 515, in block 536 the system removes from the sorted container all samples that were marked for removal (i.e., samples that were clustered in that pass).
In block 540, the system selects a new vantage point for the next pass through meta-block 510. In an illustrative example, this may be the median sample in the remaining sample set after clustered samples have been removed.
If this is not the last pass through the method, then control returns to block 508, and another pass is performed on the newly sorted container.
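Putting meta-block 510 together, a hedged sketch of the multi-pass method (sort by vantage distance, scan adjacent samples, remove clustered samples, re-vantage on the median) might read as follows; all names and default values are illustrative, and samples are assumed hashable.

```python
def micro_cluster(samples, vantage, dist, max_dist=2, min_size=3, max_passes=6):
    """Multi-pass micro-clustering sketch (blocks 504-540).

    dist is any suitable (e.g., LSH-compliant) distance function; the
    defaults for max_dist, min_size, and max_passes are illustrative.
    Returns (micro_clusters, unclustered_samples).
    """
    remaining = list(samples)
    clusters = []
    for _ in range(max_passes):
        if len(remaining) < min_size:
            break
        # Block 508: sorted container, nearest to the vantage first.
        ordered = sorted(remaining, key=lambda s: dist(s, vantage))
        found, current = [], [ordered[0]]
        for prev, cur in zip(ordered, ordered[1:]):
            if dist(prev, cur) < max_dist:
                current.append(cur)
            else:
                if len(current) >= min_size:
                    found.append(current)
                current = [cur]
        if len(current) >= min_size:
            found.append(current)
        if not found:
            break                      # no new micro-clusters: stop early
        clusters.extend(found)
        # Block 536: remove clustered samples from the container.
        clustered = {s for c in found for s in c}
        remaining = [s for s in remaining if s not in clustered]
        if not remaining:
            break
        # Block 540: median of the remainder becomes the next vantage.
        vantage = remaining[len(remaining) // 2]
    return clusters, remaining

cl, rem = micro_cluster([1, 2, 3, 10, 11, 12, 50], 1000, lambda a, b: abs(a - b))
print(cl, rem)  # two micro-clusters form; 50 stays unclustered
```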
After all passes are complete, in block 544 the system may act on the micro-clusters. For example, an automated system or a human user may examine the samples and craft a signature that reads on all samples in a micro-cluster. One signature may be crafted per micro-cluster, and these signatures can then be provided to antivirus or other detection software on client machines. These signatures can be used to identify malicious software that may be encountered in the wild. Advantageously, the signatures may be broader than a simple hash of an object, which will catch only identical objects. Signatures are useful because, without signatures, malware authors can simply change a few bytes within a file to pass malware scans. But with strong signature matching, it is much more difficult for malware authors to change their products without losing functionality. However, if the signatures are too broad, then they will catch many false positives. Thus, by crafting signatures that read on micro-clusters (instead of attempting to craft signatures that read on large clusters from algorithms such as DBSCAN), the system may provide more beneficial and targeted malware detection.
Hardware platform 600 is configured to provide a computing device. In various embodiments, a “computing device” may be or comprise, by way of nonlimiting example, a computer, workstation, server, mainframe, virtual machine (whether emulated or on a “bare metal” hypervisor), network appliance, container, IoT device, high performance computing (HPC) environment, a data center, a communications service provider infrastructure (e.g., one or more portions of an Evolved Packet Core), an in-memory computing environment, a computing system of a vehicle (e.g., an automobile or airplane), an industrial control system, embedded computer, embedded controller, embedded sensor, personal digital assistant, laptop computer, cellular telephone, internet protocol (IP) telephone, smart phone, tablet computer, convertible tablet computer, computing appliance, receiver, wearable computer, handheld calculator, or any other electronic, microelectronic, or microelectromechanical device for processing and communicating data. At least some of the methods and systems disclosed in this specification may be embodied by or carried out on a computing device.
In the illustrated example, hardware platform 600 is arranged in a point-to-point (PtP) configuration. This PtP configuration is popular for personal computer (PC) and server-type devices, although it is not so limited, and any other bus type may be used.
Hardware platform 600 is an example of a platform that may be used to implement embodiments of the teachings of this specification. For example, instructions could be stored in storage 650. Instructions could also be transmitted to the hardware platform in an ethereal form, such as via a network interface, or retrieved from another source via any suitable interconnect. Once received (from any source), the instructions may be loaded into memory 604, and may then be executed by one or more processors 602 to provide elements such as an operating system 606, operational agents 608, or data 612.
Hardware platform 600 may include several processors 602. For simplicity and clarity, only processors PROC0 602-1 and PROC1 602-2 are shown. Additional processors (such as 2, 4, 8, 16, 24, 32, 64, or 128 processors) may be provided as necessary, while in other embodiments, only one processor may be provided. Processors may have any number of cores, such as 1, 2, 4, 8, 16, 24, 32, 64, or 128 cores.
Processors 602 may be any type of processor and may communicatively couple to chipset 616 via, for example, PtP interfaces. Chipset 616 may also exchange data with other elements, such as a high performance graphics adapter 622. In alternative embodiments, any or all of the PtP links illustrated in
Two memories, 604-1 and 604-2, are shown, connected to PROC0 602-1 and PROC1 602-2, respectively. As an example, each processor is shown connected to its memory in a direct memory access (DMA) configuration, though other memory architectures are possible, including ones in which memory 604 communicates with a processor 602 via a bus. For example, some memories may be connected via a system bus, or in a data center, memory may be accessible in a remote DMA (RDMA) configuration.
Memory 604 may include any form of volatile or nonvolatile memory including, without limitation, magnetic media (e.g., one or more tape drives), optical media, flash, random access memory (RAM), double data rate RAM (DDR RAM), nonvolatile RAM (NVRAM), static RAM (SRAM), dynamic RAM (DRAM), persistent RAM (PRAM), data-centric (DC) persistent memory (e.g., Intel Optane/3D-crosspoint), cache, Layer 1 (L1) or Layer 2 (L2) memory, on-chip memory, registers, virtual memory region, read-only memory (ROM), flash memory, removable media, tape drive, cloud storage, or any other suitable local or remote memory component or components. Memory 604 may be used for short, medium, and/or long-term storage. Memory 604 may store any suitable data or information utilized by platform logic. In some embodiments, memory 604 may also comprise storage for instructions that may be executed by the cores of processors 602 or other processing elements (e.g., logic resident on chipsets 616) to provide functionality.
In certain embodiments, memory 604 may comprise a relatively low-latency volatile main memory, while storage 650 may comprise a relatively higher-latency nonvolatile memory. However, memory 604 and storage 650 need not be physically separate devices, and in some examples may represent simply a logical separation of function (if there is any separation at all). It should also be noted that although DMA is disclosed by way of nonlimiting example, DMA is not the only protocol consistent with this specification, and that other memory architectures are available.
Certain computing devices provide main memory 604 and storage 650, for example, in a single physical memory device, and in other cases, memory 604 and/or storage 650 are functionally distributed across many physical devices. In the case of virtual machines or hypervisors, all or part of a function may be provided in the form of software or firmware running over a virtualization layer to provide the logical function, and resources such as memory, storage, and accelerators may be disaggregated (i.e., located in different physical locations across a data center). In other examples, a device such as a network interface may provide only the minimum hardware interfaces necessary to perform its logical operation, and may rely on a software driver to provide additional necessary logic. Thus, each logical block disclosed herein is broadly intended to include one or more logic elements configured and operable for providing the disclosed logical operation of that block. As used throughout this specification, “logic elements” may include hardware, external hardware (digital, analog, or mixed-signal), software, reciprocating software, services, drivers, interfaces, components, modules, algorithms, sensors, components, firmware, hardware instructions, microcode, programmable logic, or objects that can coordinate to achieve a logical operation.
Graphics adapter 622 may be configured to provide a human-readable visual output, such as a command-line interface (CLI) or graphical desktop such as Microsoft Windows, Apple OSX desktop, or a Unix/Linux X Window System-based desktop. Graphics adapter 622 may provide output in any suitable format, such as a coaxial output, composite video, component video, video graphics array (VGA), or digital outputs such as digital visual interface (DVI), FPDLink, DisplayPort, or high definition multimedia interface (HDMI), by way of nonlimiting example. In some examples, graphics adapter 622 may include a hardware graphics card, which may have its own memory and its own graphics processing unit (GPU).
Chipset 616 may be in communication with a bus 628 via an interface circuit. Bus 628 may have one or more devices that communicate over it, such as a bus bridge 632, I/O devices 635, accelerators 646, communication devices 640, and a keyboard and/or mouse 638, by way of nonlimiting example. In general terms, the elements of hardware platform 600 may be coupled together in any suitable manner. For example, a bus may couple any of the components together. A bus may include any known interconnect, such as a multi-drop bus, a mesh interconnect, a fabric, a ring interconnect, a round-robin protocol, a PtP interconnect, a serial interconnect, a parallel bus, a coherent (e.g., cache coherent) bus, a layered protocol architecture, a differential bus, or a Gunning transceiver logic (GTL) bus, by way of illustrative and nonlimiting example.
Communication devices 640 can broadly include any communication not covered by a network interface and the various I/O devices described herein. This may include, for example, various universal serial bus (USB), FireWire, Lightning, or other serial or parallel devices that provide communications.
I/O Devices 635 may be configured to interface with any auxiliary device that connects to hardware platform 600 but that is not necessarily a part of the core architecture of hardware platform 600. A peripheral may be operable to provide extended functionality to hardware platform 600, and may or may not be wholly dependent on hardware platform 600. In some cases, a peripheral may be a computing device in its own right. Peripherals may include input and output devices such as displays, terminals, printers, keyboards, mice, modems, data ports (e.g., serial, parallel, USB, Firewire, or similar), network controllers, optical media, external storage, sensors, transducers, actuators, controllers, data acquisition buses, cameras, microphones, speakers, or external storage, by way of nonlimiting example.
In one example, audio I/O 642 may provide an interface for audible sounds, and may include in some examples a hardware sound card. Sound output may be provided in analog (such as a 3.5 mm stereo jack), component (“RCA”) stereo, or in a digital audio format such as S/PDIF, AES3, AES47, HDMI, USB, Bluetooth, or Wi-Fi audio, by way of nonlimiting example. Audio input may also be provided via similar interfaces, in an analog or digital form.
Bus bridge 632 may be in communication with other devices such as a keyboard/mouse 638 (or other input devices such as a touch screen, trackball, etc.), communication devices 640 (such as modems, network interface devices, peripheral interfaces such as PCI or PCIe, or other types of communication devices that may communicate through a network), audio I/O 642, a data storage device 644, and/or accelerators 646. In alternative embodiments, any portions of the bus architectures could be implemented with one or more PtP links.
Operating system 606 may be, for example, Microsoft Windows, Linux, UNIX, Mac OS X, iOS, MS-DOS, or an embedded or real-time operating system (including embedded or real-time flavors of the foregoing). In some embodiments, a hardware platform 600 may function as a host platform for one or more guest systems that invoke applications (e.g., operational agents 608).
Operational agents 608 may include one or more computing engines that may include one or more nontransitory computer-readable mediums having stored thereon executable instructions operable to instruct a processor to provide operational functions. At an appropriate time, such as upon booting hardware platform 600 or upon a command from operating system 606 or a user or security administrator, a processor 602 may retrieve a copy of the operational agent (or software portions thereof) from storage 650 and load it into memory 604. Processor 602 may then iteratively execute the instructions of operational agents 608 to provide the desired methods or functions.
As used throughout this specification, an “engine” includes any combination of one or more logic elements, of similar or dissimilar species, operable for and configured to perform one or more methods provided by the engine. In some cases, the engine may be or include a special integrated circuit designed to carry out a method or a part thereof, a field-programmable gate array (FPGA) programmed to provide a function, a special hardware or microcode instruction, other programmable logic, and/or software instructions operable to instruct a processor to perform the method. In some cases, the engine may run as a “daemon” process, background process, terminate-and-stay-resident program, a service, system extension, control panel, bootup procedure, basic input/output system (BIOS) subroutine, or any similar program that operates with or without direct user interaction. In certain embodiments, some engines may run with elevated privileges in a “driver space” associated with ring 0, 1, or 2 in a protection ring architecture. The engine may also include other hardware, software, and/or data, including configuration files, registry entries, application programming interfaces (APIs), and interactive or user-mode software by way of nonlimiting example.
In some cases, the function of an engine is described in terms of a “circuit” or “circuitry to” perform a particular function. The terms “circuit” and “circuitry” should be understood to include both the physical circuit, and in the case of a programmable circuit, any instructions or data used to program or configure the circuit.
Where elements of an engine are embodied in software, computer program instructions may be implemented in programming languages, such as an object code, an assembly language, or a high-level language such as OpenCL, FORTRAN, C, C++, JAVA, or HTML. These may be used with any compatible operating systems or operating environments. Hardware elements may be designed manually, or with a hardware description language such as Spice, Verilog, and VHDL. The source code may define and use various data structures and communication messages. The source code may be in a computer executable form (e.g., via an interpreter), or the source code may be converted (e.g., via a translator, assembler, or compiler) into a computer executable form, or converted to an intermediate form such as byte code. Where appropriate, any of the foregoing may be used to build or describe appropriate discrete or integrated circuits, whether sequential, combinatorial, state machines, or otherwise.
A network interface may be provided to communicatively couple hardware platform 600 to a wired or wireless network or fabric. A “network,” as used throughout this specification, may include any communicative platform operable to exchange data or information within or between computing devices, including, by way of nonlimiting example, a local network, a switching fabric, an ad-hoc local network, Ethernet (e.g., as defined by the IEEE 802.3 standard), Fibre Channel, InfiniBand, Wi-Fi, or another suitable standard. Other nonlimiting examples include Intel Omni-Path Architecture (OPA), TrueScale, Ultra Path Interconnect (UPI) (formerly called QuickPath Interconnect, QPI, or KTI), Fibre Channel over Ethernet (FCoE), PCI, PCIe, fiber optics, millimeter wave guide, an internet architecture, a packet data network (PDN) offering a communications interface or exchange between any two nodes in a system, a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), a wireless local area network (WLAN), a virtual private network (VPN), an intranet, a plain old telephone system (POTS), or any other appropriate architecture or system that facilitates communications in a network or telephonic environment, either with or without human interaction or intervention. A network interface may include one or more physical ports that may couple to a cable (e.g., an Ethernet cable, other cable, or waveguide).
In some cases, some or all of the components of hardware platform 600 may be virtualized, in particular the processor(s) and memory. For example, a virtualized environment may run on OS 606, or OS 606 could be replaced with a hypervisor or virtual machine manager. In this configuration, a virtual machine running on hardware platform 600 may virtualize workloads. A virtual machine in this configuration may perform essentially all of the functions of a physical hardware platform.
In a general sense, any suitably-configured processor can execute any type of instructions associated with the data to achieve the operations illustrated in this specification. Any of the processors or cores disclosed herein could transform an element or an article (for example, data) from one state or thing to another state or thing. In another example, some activities outlined herein may be implemented with fixed logic or programmable logic (for example, software and/or computer instructions executed by a processor).
Various components of the system depicted in
Network functions virtualization (NFV) is generally considered distinct from software defined networking (SDN), but they can interoperate together, and the teachings of this specification should also be understood to apply to SDN in appropriate circumstances. For example, virtual network functions (VNFs) may operate within the data plane of an SDN deployment. NFV was originally envisioned as a method for providing reduced capital expenditure (Capex) and operating expenses (Opex) for telecommunication services. One feature of NFV is replacing proprietary, special-purpose hardware appliances with virtual appliances running on commercial off-the-shelf (COTS) hardware within a virtualized environment. In addition to Capex and Opex savings, NFV provides a more agile and adaptable network. As network loads change, VNFs can be provisioned (“spun up”) or removed (“spun down”) to meet network demands. For example, in times of high load, more load balancing VNFs may be spun up to distribute traffic to more workload servers (which may themselves be VMs). In times when more suspicious traffic is experienced, additional firewalls or deep packet inspection (DPI) appliances may be spun up as needed.
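The demand-driven spin-up and spin-down behavior described above can be sketched as a simple control loop. The following is an illustrative sketch only; the function name, thresholds, and scaling policy are assumptions for exposition and are not part of any NFV standard or orchestrator API.

```python
# Illustrative sketch of demand-driven VNF scaling. All names and
# thresholds here are hypothetical; a real NFV orchestrator would
# apply its own policies through its own management interfaces.

def scale_vnfs(active_vnfs, load_per_vnf, high_water=0.8, low_water=0.3):
    """Return the number of VNF instances to run for the observed load.

    active_vnfs  -- number of currently provisioned ("spun up") instances
    load_per_vnf -- average utilization of each instance, in [0, 1]
    """
    total_load = active_vnfs * load_per_vnf
    if load_per_vnf > high_water:
        # Spin up enough instances to bring per-instance load under target.
        return max(active_vnfs + 1, int(total_load / high_water) + 1)
    if load_per_vnf < low_water and active_vnfs > 1:
        # Spin down one instance, but always keep at least one running.
        return active_vnfs - 1
    return active_vnfs
```

For instance, four instances each running at 90% utilization would trigger a spin-up, while four instances at 20% would trigger a spin-down to three.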
Because NFV started out as a telecommunications feature, many NFV instances are focused on telecommunications. However, NFV is not limited to telecommunication services. In a broad sense, NFV includes one or more VNFs running within a network function virtualization infrastructure (NFVI), such as NFVI 700. Often, the VNFs are inline service functions that are separate from workload servers or other nodes. These VNFs can be chained together into a service chain, which may be defined by a virtual subnetwork, and which may include a serial string of network services that provide behind-the-scenes work, such as security, logging, billing, and similar.
In the example of
Note that NFV orchestrator 701 itself may be virtualized (rather than a special-purpose hardware appliance). NFV orchestrator 701 may be integrated within an existing SDN system, wherein an operations support system (OSS) manages the SDN. This may interact with cloud resource management systems (e.g., OpenStack) to provide NFV orchestration. An NFVI 700 may include the hardware, software, and other infrastructure to enable VNFs to run. This may include a hardware platform 702 on which one or more VMs 704 may run. For example, hardware platform 702-1 in this example runs VMs 704-1 and 704-2. Hardware platform 702-2 runs VMs 704-3 and 704-4. Each hardware platform 702 may include a respective hypervisor 720, virtual machine manager (VMM), or similar function, which may include and run on a native (bare metal) operating system, which may be minimal so as to consume very few resources. For example, hardware platform 702-1 has hypervisor 720-1, and hardware platform 702-2 has hypervisor 720-2.
Hardware platforms 702 may be or comprise a rack or several racks of blade or slot servers (including, e.g., processors, memory, and storage), one or more data centers, other hardware resources distributed across one or more geographic locations, hardware switches, or network interfaces. An NFVI 700 may also include the software architecture that enables hypervisors to run and be managed by NFV orchestrator 701.
Running on NFVI 700 are VMs 704, each of which in this example is a VNF providing a virtual service appliance. Each VM 704 in this example includes an instance of the Data Plane Development Kit (DPDK) 716, a virtual operating system 708, and an application providing the VNF 712. For example, VM 704-1 has virtual OS 708-1, DPDK 716-1, and VNF 712-1. VM 704-2 has virtual OS 708-2, DPDK 716-2, and VNF 712-2. VM 704-3 has virtual OS 708-3, DPDK 716-3, and VNF 712-3. VM 704-4 has virtual OS 708-4, DPDK 716-4, and VNF 712-4.
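The NFVI topology enumerated above (hardware platforms hosting hypervisors, which host VMs, each packaging a guest OS, a DPDK instance, and a VNF) can be modeled with plain data classes. This is a minimal, hypothetical sketch for illustration; the class names mirror the reference numerals above and do not reflect any real orchestration software.

```python
# Hypothetical model of the NFVI topology described above. Each
# hardware platform runs a hypervisor and hosts VMs; each VM packages
# a guest OS, a DPDK instance, and the application providing the VNF.
from dataclasses import dataclass, field

@dataclass
class Vm:
    name: str
    guest_os: str
    dpdk: str
    vnf: str

@dataclass
class HardwarePlatform:
    name: str
    hypervisor: str
    vms: list = field(default_factory=list)

@dataclass
class NfvOrchestrator:
    platforms: list = field(default_factory=list)

    def all_vnfs(self):
        """Enumerate every VNF running anywhere in the infrastructure."""
        return [vm.vnf for p in self.platforms for vm in p.vms]

# The example topology from the text: two platforms, two VMs each.
nfvi = NfvOrchestrator(platforms=[
    HardwarePlatform("702-1", "720-1",
                     vms=[Vm("704-1", "708-1", "716-1", "712-1"),
                          Vm("704-2", "708-2", "716-2", "712-2")]),
    HardwarePlatform("702-2", "720-2",
                     vms=[Vm("704-3", "708-3", "716-3", "712-3"),
                          Vm("704-4", "708-4", "716-4", "712-4")]),
])
```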
Virtualized network functions could include, as nonlimiting and illustrative examples, firewalls, intrusion detection systems, load balancers, routers, session border controllers, DPI services, network address translation (NAT) modules, or call security association.
The illustration of
The illustrated DPDK instances 716 provide a set of highly-optimized libraries for communicating across a virtual switch (vSwitch) 722. Like VMs 704, vSwitch 722 is provisioned and allocated by a hypervisor 720. The hypervisor uses a network interface to connect the hardware platform to the data center fabric (e.g., a host fabric interface (HFI)). This HFI may be shared by all VMs 704 running on a hardware platform 702. Thus, a vSwitch may be allocated to switch traffic between VMs 704. The vSwitch may be a pure software vSwitch (e.g., a shared memory vSwitch), which may be optimized so that data are not moved between memory locations, but rather, the data may stay in one place, and pointers may be passed between VMs 704 to simulate data moving between ingress and egress ports of the vSwitch. The vSwitch may also include a hardware driver (e.g., a hardware network interface IP block that switches traffic, but that connects to virtual ports rather than physical ports). In this illustration, a distributed vSwitch 722 is illustrated, wherein vSwitch 722 is shared between two or more physical hardware platforms 702.
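The zero-copy behavior of a shared memory vSwitch can be sketched as follows. This is a simplified illustration of the pointer-passing idea only, not DPDK's actual API: packet buffers stay in one place, and only references move between the per-VM queues that stand in for ingress and egress ports.

```python
# Simplified sketch of a shared-memory vSwitch: packet data is stored
# once and never copied; "switching" a packet means passing its buffer
# index between per-VM queues.
from collections import deque

class SharedMemoryVSwitch:
    def __init__(self):
        self.buffers = []   # packet data, never copied or moved
        self.queues = {}    # VM name -> deque of buffer indices

    def attach(self, vm):
        """Give a VM a virtual port (a queue of buffer references)."""
        self.queues[vm] = deque()

    def send(self, src_vm, dst_vm, payload):
        """'Transmit' a packet by storing it once and queuing its index."""
        self.buffers.append(payload)
        self.queues[dst_vm].append(len(self.buffers) - 1)

    def receive(self, vm):
        """Dequeue a buffer index and resolve it to the packet data."""
        idx = self.queues[vm].popleft()
        return self.buffers[idx]
```

A receiving VM gets back the very same buffer object the sender stored, which is what "data stays in one place" means in practice.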
Containerization infrastructure 800 runs on a hardware platform such as containerized server 804. Containerized server 804 may provide processors, memory, one or more network interfaces, accelerators, and/or other hardware resources.
Running on containerized server 804 is a shared kernel 808. One distinction between containerization and virtualization is that containers run on a common kernel with the main operating system and with each other. In contrast, in virtualization, the processor and other hardware resources are abstracted or virtualized, and each virtual machine provides its own kernel on the virtualized hardware.
Running on shared kernel 808 is main operating system 812. Commonly, main operating system 812 is a Unix or Linux-based operating system, although containerization infrastructure is also available for other types of systems, including Microsoft Windows systems and Macintosh systems. Running on top of main operating system 812 is a containerization layer 816. For example, Docker is a popular containerization layer that runs on a number of operating systems, and relies on the Docker daemon. Newer operating systems (including Fedora Linux 32 and later) that use version 2 of the kernel control groups service (cgroups v2) may be incompatible with the Docker daemon. Thus, these systems may instead run an alternative known as Podman, which provides a containerization layer without a daemon.
Various factions debate the advantages and/or disadvantages of using a daemon-based containerization layer (e.g., Docker) versus one without a daemon (e.g., Podman). Such debates are outside the scope of the present specification, and when the present specification speaks of containerization, it is intended to include any containerization layer, whether it requires the use of a daemon or not.
Main operating system 812 may also provide services 818, which provide services and interprocess communication to userspace applications 820.
Services 818 and userspace applications 820 in this illustration are independent of any container.
As discussed above, a difference between containerization and virtualization is that containerization relies on a shared kernel. However, to maintain virtualization-like segregation, containers do not share interprocess communications, services, or many other resources. Some sharing of resources between containers can be approximated by permitting containers to map their internal file systems to a common mount point on the external file system. Because containers share a kernel with main operating system 812, they inherit the same file and resource access permissions as those provided by shared kernel 808. For example, one popular application for containers is to run a plurality of web servers on the same physical hardware. The Docker daemon provides a shared socket, docker.sock, that is accessible by containers running under the same Docker daemon. Thus, one container can be configured to provide only a reverse proxy for mapping hypertext transfer protocol (HTTP) and hypertext transfer protocol secure (HTTPS) requests to various containers. This reverse proxy container can listen on docker.sock for newly spun up containers. When a container spins up that meets certain criteria, such as by specifying a listening port and/or virtual host, the reverse proxy can map HTTP or HTTPS requests for the specified virtual host to the designated virtual port. Thus, only the reverse proxy host may listen on ports 80 and 443, and any request to subdomain1.example.com may be directed to a virtual port on a first container, while requests to subdomain2.example.com may be directed to a virtual port on a second container.
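The host-based routing described above reduces to a small routing table. The following is a hypothetical sketch of that table only; real deployments would populate it by watching container events on docker.sock (via a tool such as a reverse-proxy container) rather than by calling a `register()` method directly, and the class, hostnames, and ports here are illustrative assumptions.

```python
# Hypothetical sketch of the routing table a reverse-proxy container
# might maintain: each container that "spins up" claims a virtual host
# and a virtual port, and incoming requests are dispatched by their
# HTTP Host header. Registration here is manual for illustration; a
# real proxy would learn these mappings from docker.sock events.

class ReverseProxy:
    def __init__(self):
        self.routes = {}  # virtual host -> (container, virtual port)

    def register(self, virtual_host, container, port):
        """Record a newly spun-up container that meets the criteria."""
        self.routes[virtual_host] = (container, port)

    def dispatch(self, host_header):
        """Map a request's Host header to its backend container/port."""
        if host_header not in self.routes:
            return None  # no container claims this virtual host
        return self.routes[host_header]

proxy = ReverseProxy()
proxy.register("subdomain1.example.com", "container-1", 8080)
proxy.register("subdomain2.example.com", "container-2", 8081)
```

Only the proxy listens on ports 80 and 443; every backend is reached through its registered virtual port.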
Other than this limited sharing of files or resources, which generally is explicitly configured by an administrator of containerized server 804, the containers themselves are completely isolated from one another. However, because they share the same kernel, it is comparatively easy to dynamically allocate compute resources such as CPU time and memory to the various containers. Furthermore, it is common practice to provide only a minimum set of services on a specific container, and the container does not need to include a full bootstrap loader because it shares the kernel with a containerization host (i.e., containerized server 804).
Thus, “spinning up” a container is often relatively faster than spinning up a new virtual machine that provides a similar service. Furthermore, a containerization host does not need to virtualize hardware resources, so containers access those resources natively and directly. While this provides some theoretical advantages over virtualization, modern hypervisors (especially type 1, or “bare metal,” hypervisors) provide such near-native performance that this advantage may not always be realized.
In this example, containerized server 804 hosts two containers, namely container 830 and container 840.
Container 830 may include a minimal operating system 832 that runs on top of shared kernel 808. Note that a minimal operating system is provided as an illustrative example, and is not mandatory. In fact, container 830 may provide as full an operating system as is necessary or desirable. Minimal operating system 832 is used here as an example simply to illustrate that in common practice, the minimal operating system necessary to support the function of the container (which in common practice, is a single or monolithic function) is provided.
On top of minimal operating system 832, container 830 may provide one or more services 834. Finally, on top of services 834, container 830 may also provide userspace applications 836, as necessary.
Container 840 may include a minimal operating system 842 that runs on top of shared kernel 808. Note that a minimal operating system is provided as an illustrative example, and is not mandatory. In fact, container 840 may provide as full an operating system as is necessary or desirable. Minimal operating system 842 is used here as an example simply to illustrate that in common practice, the minimal operating system necessary to support the function of the container (which in common practice, is a single or monolithic function) is provided.
On top of minimal operating system 842, container 840 may provide one or more services 844. Finally, on top of services 844, container 840 may also provide userspace applications 846, as necessary.
Using containerization layer 816, containerized server 804 may run discrete containers, each one providing the minimal operating system and/or services necessary to provide a particular function. For example, containerized server 804 could include a mail server, a web server, a secure shell server, a file server, a weblog, cron services, a database server, and many other types of services. In theory, these could all be provided in a single container, but security and modularity advantages are realized by providing each of these discrete functions in a discrete container with its own minimal operating system necessary to provide those services.
The foregoing outlines features of several embodiments so that those skilled in the art may better understand various aspects of the present disclosure. The foregoing detailed description sets forth examples of apparatuses, methods, and systems relating to micro-clustering according to one or more embodiments of the present disclosure. Features such as structure(s), function(s), and/or characteristic(s), for example, are described with reference to one embodiment as a matter of convenience; various embodiments may be implemented with any suitable one or more of the described features.
As used throughout this specification, the phrase “an embodiment” is intended to refer to one or more embodiments. Furthermore, different uses of the phrase “an embodiment” may refer to different embodiments. The phrases “in another embodiment” or “in a different embodiment” refer to an embodiment different from the one previously described, or the same embodiment with additional features. For example, “in an embodiment, features may be present. In another embodiment, additional features may be present.” The foregoing example could first refer to an embodiment with features A, B, and C, while the second could refer to an embodiment with features A, B, C, and D; with features A, B, and D; with features D, E, and F; or any other variation.
In the foregoing description, various aspects of the illustrative implementations may be described using terms commonly employed by those skilled in the art to convey the substance of their work to others skilled in the art. It will be apparent to those skilled in the art that the embodiments disclosed herein may be practiced with only some of the described aspects. For purposes of explanation, specific numbers, materials, and configurations are set forth to provide a thorough understanding of the illustrative implementations. In some cases, the embodiments disclosed may be practiced without specific details. In other instances, well-known features are omitted or simplified so as not to obscure the illustrated embodiments.
For the purposes of the present disclosure and the appended claims, the article “a” refers to one or more of an item. The phrase “A or B” is intended to encompass the “inclusive or,” e.g., A, B, or (A and B). “A and/or B” means A, B, or (A and B). For the purposes of the present disclosure, the phrase “A, B, and/or C” means A, B, C, (A and B), (A and C), (B and C), or (A, B, and C).
The embodiments disclosed can readily be used as the basis for designing or modifying other processes and structures to carry out the teachings of the present specification. Any equivalent constructions to those disclosed do not depart from the spirit and scope of the present disclosure. Design considerations may result in substitute arrangements, design choices, device possibilities, hardware configurations, software implementations, and equipment options.
As used throughout this specification, a “memory” is expressly intended to include both a volatile memory and a nonvolatile memory. Thus, for example, an “engine” as described above could include instructions encoded within a volatile or nonvolatile memory that, when executed, instruct a processor to perform the operations of any of the methods or procedures disclosed herein. It is expressly intended that this configuration reads on a computing apparatus “sitting on a shelf” in a non-operational state. For example, in this example, the “memory” could include one or more tangible, nontransitory computer-readable storage media that contain stored instructions. These instructions, in conjunction with the hardware platform (including a processor) on which they are stored may constitute a computing apparatus.
In other embodiments, a computing apparatus may also read on an operating device. For example, in this configuration, the “memory” could include a volatile or run-time memory (e.g., RAM), where instructions have already been loaded. These instructions, when fetched by the processor and executed, may provide methods or procedures as described herein.
In yet another embodiment, there may be one or more tangible, nontransitory computer-readable storage media having stored thereon executable instructions that, when executed, cause a hardware platform or other computing system, to carry out a method or procedure. For example, the instructions could be executable object code, including software instructions executable by a processor. The one or more tangible, nontransitory computer-readable storage media could include, by way of illustrative and nonlimiting example, a magnetic media (e.g., hard drive), a flash memory, a ROM, optical media (e.g., CD, DVD, Blu-Ray), nonvolatile random-access memory (NVRAM), nonvolatile memory (NVM) (e.g., Intel 3D Xpoint), or other nontransitory memory.
There are also provided herein certain methods, illustrated for example in flow charts and/or signal flow diagrams. The order of operations disclosed in these methods is one illustrative ordering that may be used in some embodiments, but this ordering is not intended to be restrictive, unless expressly stated otherwise. In other embodiments, the operations may be carried out in other logical orders. In general, one operation should be deemed to necessarily precede another only if the first operation provides a result required for the second operation to execute. Furthermore, the sequence of operations itself should be understood to be a nonlimiting example. In appropriate embodiments, some operations may be omitted as unnecessary or undesirable. In the same or in different embodiments, other operations not shown may be included in the method to provide additional results.
In certain embodiments, some of the components illustrated herein may be omitted or consolidated. In a general sense, the arrangements depicted in the FIGURES may be more logical in their representations, whereas a physical architecture may include various permutations, combinations, and/or hybrids of these elements.
With the numerous examples provided herein, interaction may be described in terms of two, three, four, or more electrical components. These descriptions are provided for purposes of clarity and example only. Any of the illustrated components, modules, and elements of the FIGURES may be combined in various configurations, all of which fall within the scope of this specification.
In certain cases, it may be easier to describe one or more functionalities by disclosing only selected elements. Such elements are selected to illustrate specific information to facilitate the description. The inclusion of an element in the FIGURES is not intended to imply that the element must appear in the disclosure, as claimed, and the exclusion of certain elements from the FIGURES is not intended to imply that the element is to be excluded from the disclosure as claimed. Similarly, any methods or flows illustrated herein are provided by way of illustration only. Inclusion or exclusion of operations in such methods or flows should be understood in the same way as the inclusion or exclusion of other elements as described in this paragraph. Where operations are illustrated in a particular order, the order is a nonlimiting example only. Unless expressly specified, the order of operations may be altered to suit a particular embodiment.
Other changes, substitutions, variations, alterations, and modifications will be apparent to those skilled in the art. All such changes, substitutions, variations, alterations, and modifications fall within the scope of this specification.
To aid the United States Patent and Trademark Office (USPTO) and any readers of any patent or publication flowing from this specification, the Applicant: (a) does not intend any of the appended claims to invoke paragraph (f) of 35 U.S.C. section 112, or its equivalent, as it exists on the date of the filing hereof unless the words “means for” or “steps for” are specifically used in the particular claims; and (b) does not intend, by any statement in the specification, to limit this disclosure in any way that is not otherwise expressly reflected in the appended claims, as originally presented or as amended.
This application claims priority to U.S. Provisional Application No. 63/452,738, titled “Targeted Real-Time Clusters,” filed Mar. 17, 2023, which is incorporated herein by reference.
Number | Date | Country
---|---|---
63452738 | Mar 2023 | US