Technical Field
This disclosure relates to malware and more specifically to identifying malware variants by processing structured flow control.
Related Art
Malicious software, or malware, has become a serious threat to computer systems and the Internet. The creation of new malware instances has become more common with the emergence of automatic malware creation toolkits. By simply modifying existing malware instances, malware writers create a significant number of complex and obfuscated malware variants that mutate and elude antivirus scanners. Typically, antivirus companies process new malware instances manually to determine their maliciousness and identify their signatures. But with the overwhelming number of new malware instances that are now created automatically, manual analysis is ineffective and slow to respond to newly emerging threats.
The fully automated malware clustering system (and process) disclosed below addresses this threat. It eliminates the need for manual malware inspection and speeds up malware classification by clustering variants of malware instances. Because this fully automated turnkey system (and process) identifies the invariant features of malware families, malware variants are classified more quickly and efficiently.
The patent or application file contains at least one drawing executed in color. The Office upon request and payment of the necessary fee will provide copies of this patent or publication with color drawing(s).
Appendix 1-8 system (and process) clustering code.
The disclosed systems (and processes) generate a computed structured control flow (SCF) of a targeted program to identify malware. A static analysis of the compiled targeted software, which may be generated by a function extraction (FX) system, such as Oak Ridge National Laboratory's (ORNL) “Hyperion” system, generates the SCF of the targeted software or program. The computed SCF is a graph, string, or tree representation of all the flow paths that can be traversed through the targeted program during the program's execution. Some computed structures consist of if-then-else, while-do, case and sequence programming constructs, for example. The computed structures are represented as symbolic and/or alphanumeric strings, trees, or graphs, and the systems (and processes) reduce them so that multiple sequence nodes collapse into a single node, with the exception of external procedure calls. The external procedure calls, also known as external calls, are preserved, and both branching and looping constructs are captured without the details of the external procedure calls. This results in expressions that capture the programming structure, but not its content. The symbolic and alphanumeric strings are then compared using a designated metric such as string edit distance or graph edit distance. The computed values are processed with a metric-based clustering process or quality threshold clustering (QTC) to discover similar programs and/or program fragments including malware variants. Malware clustering is the process of grouping malware variants that share invariant features.
In one implementation, an FX tool generates the SCF, an invariant feature shared by malware variants of the same family. FX, like a static module, is an automatic static analysis and function extraction tool that assists in the analysis of software. The FX tool extracts a structured form of the control flow of malware instances while removing dead code blocks (non-functional code) and eliminating “spaghetti-logic” (incoherent structure in a program that incorporates frequent execution jumps). ORNL maintains the FX tool “Hyperion.” Unlike dynamic analysis, the disclosed static analysis allows for the analysis of the full malware behavior and eliminates the possibility of an emulated environment being detected by malware. The FX tool analyzes the full targeted program and computes the end-to-end behavior of the targeted program. An SCF is generated by the FX tool by a process that: (1) extracts the digital code; (2) disassembles the malware and non-malware executable; (3) scans its instructions; (4) generates the unstructured malware control flow; and (5) structures the unstructured control flow generated in (4). The resulting SCF of the targeted program maps the targeted program's different execution paths through the program with the program's arbitrary jumps eliminated.
The assembly instructions generated by the FX tool are augmented with corresponding functional semantics that account for the effect that each instruction has on the state of the hardware executing the targeted program's instructions. The targeted program's true control flow is generated and transformed into the SCF by applying a structure theorem to it.
Due to the large size of the structured flow-controls generated by the FX tool, the systems (and processes) transform the SCFs into strings referred to as regular expressions or regex. Regex is a string notation that consists of a pattern of characters used to represent an abstract form of the disassembled targeted program.
The regex control flow strings abstract the full SCF for each local function in the malware program. The regex control flow strings keep information about if-then-else, loops, and call functions, but abstract away information about the specific assembly instructions used by the targeted-program.
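The abstraction described above can be illustrated with a minimal sketch. The node classes and the output alphabet (`s` for a collapsed run of ordinary instructions, `c` for a preserved external call, `(a|b)` for if-then-else, `(a)*` for a loop) are assumptions chosen for illustration; the disclosure does not specify the notation that the FX tool emits.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Op:            # an ordinary assembly instruction (content is discarded)
    text: str


@dataclass
class Call:          # an external procedure call (preserved as its own symbol)
    target: str


@dataclass
class Seq:           # a sequence of nodes
    items: List["Node"]


@dataclass
class If:            # if-then-else construct
    then: "Node"
    orelse: "Node"


@dataclass
class While:         # while-do construct
    body: "Node"


Node = Union[Op, Call, Seq, If, While]


def to_regex(node: Node) -> str:
    """Serialize a structured control flow tree into an abstract string."""
    if isinstance(node, Op):
        return "s"
    if isinstance(node, Call):
        return "c"
    if isinstance(node, If):
        return "(" + to_regex(node.then) + "|" + to_regex(node.orelse) + ")"
    if isinstance(node, While):
        return "(" + to_regex(node.body) + ")*"
    if isinstance(node, Seq):
        result = ""
        for item in node.items:
            piece = to_regex(item)
            # Collapse adjacent ordinary instructions into a single "s".
            if result.endswith("s") and piece.startswith("s"):
                piece = piece[1:]
            result += piece
        return result
    raise TypeError(f"unknown node type: {type(node).__name__}")


program = Seq([
    Op("mov"), Op("add"),
    If(Seq([Op("push"), Call("CreateFileA")]), Op("xor")),
    While(Seq([Op("inc"), Op("cmp")])),
    Call("ExitProcess"),
])
print(to_regex(program))
```

The resulting string keeps the branching, looping, and call structure while dropping the specific instructions, which is what makes it short enough to compare by edit distance.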
To determine the similarity between the SCF regex strings the systems (and processes) generate a metric. A metric value expresses the similarity of the SCF regex strings. A string edit distance metric such as a Levenshtein edit distance or a Sift3 string edit distance, for example, measures the similarity between the SCF regex strings. The similarity between two or more strings is expressed as a numerical value in the interval from zero (no similarity between the strings) to one (the strings are the same).
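As a minimal sketch of such a metric, the following computes the Levenshtein edit distance and converts it to a similarity value between zero and one. Normalizing by the length of the longer string is an assumption made here for illustration; the disclosure does not state which normalization is used.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity from 0.0 (nothing in common) to 1.0 (identical).
    Assumption: the distance is normalized by the longer string's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("s(sc|s)c", "s(sc|s)(s)*c"))
print(similarity("s(sc|s)c", "s(sc|s)c"))
```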
Based on a calculated metric, the systems (and processes) via a clustering module calculate a threshold value for the QTC algorithm, which groups similar SCF regex strings (or alternatively tree structures) into clusters. The systems (and processes) determine the threshold value by calculating the edit distance between malware SCF regex strings. Some implementations calculate a threshold value by dividing the smallest length of the SCF regex strings by a multiple of five (n*5), with a divisor of 20 being one of the most effective in some applications. Empirical evaluations establish that the smallest length of the SCF regex strings divided by an overestimated value is also an effective threshold value for the QTC algorithm.
In some systems (and processes) the threshold value processed by the QTC algorithm defines the maximum edit distance value between the SCF regex string at the center of any cluster and the rest of the SCF strings in that same cluster. Therefore, when the systems (and processes) use an overestimated threshold value, the systems (and processes) allow for a higher edit distance value between the SCF strings and cluster SCF strings with less similarity. Conversely, a tighter threshold value allows only a lower edit distance value between the SCF regex strings in a cluster, so that only highly similar SCF regex strings are clustered together.
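The threshold arithmetic can be sketched as follows. The function name `qt_threshold` is hypothetical, and returning a fractional value rather than rounding is an assumption; the disclosure does not say how the quotient is rounded.

```python
def qt_threshold(regex_strings, divisor=20):
    """Threshold for the clustering step: the length of the shortest
    SCF regex string divided by a divisor such as 5, 20, or 50."""
    if not regex_strings:
        raise ValueError("at least one string is required")
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    shortest = min(len(s) for s in regex_strings)
    return shortest / divisor


group = ["s" * 100, "c" * 140, "s" * 260]
print(qt_threshold(group, divisor=5))   # looser: larger edit distance allowed
print(qt_threshold(group, divisor=20))  # tighter: smaller edit distance allowed
```

For a group whose shortest string has 100 characters, a divisor of 5 allows an edit distance of up to 20, while a divisor of 20 allows only 5.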
A clustering may be implemented with a modified QTC algorithm as follows. First, the process initializes the threshold distance that is allowed between the data points of the clusters. The algorithm then builds a candidate cluster for each data point by determining which data point has the greatest similarity with the chosen point. Next, the closest points are added to the cluster without surpassing the diameter threshold value. Then a candidate cluster with the most points is stored in memory as a cluster and its points are excluded from further processing. The process then repeats itself with the reduced data set (the original dataset without the excluded points) until no more clusters are formed. Data points that are not related to other data points are designated as outliers of the clusters and are not grouped.
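The steps above can be sketched as follows. This is an illustrative simplification, not the code of the appendices: each candidate cluster is taken to be every remaining point within the threshold distance of its center point, following the description of the threshold as the maximum distance between a cluster's center and its members. The function name `qt_cluster` and the treatment of single-point candidates as outliers are assumptions.

```python
def qt_cluster(items, distance, threshold):
    """Greedy quality-threshold style clustering.

    items:     list of data points (for example, SCF regex strings)
    distance:  callable returning the distance between two points
    threshold: maximum allowed distance between a center and its members
    Returns (clusters, outliers).
    """
    remaining = list(range(len(items)))
    clusters = []
    outliers = []
    while remaining:
        best = []
        # Build one candidate cluster around every remaining point.
        for center in remaining:
            candidate = [
                i for i in remaining
                if distance(items[center], items[i]) <= threshold
            ]
            if len(candidate) > len(best):
                best = candidate
        # No point has a neighbor left: everything remaining is an outlier.
        if len(best) < 2:
            outliers.extend(items[i] for i in remaining)
            break
        # Keep the largest candidate and exclude its points from later rounds.
        clusters.append([items[i] for i in best])
        chosen = set(best)
        remaining = [i for i in remaining if i not in chosen]
    return clusters, outliers


points = [1, 2, 3, 10, 11, 50]
clusters, outliers = qt_cluster(points, lambda a, b: abs(a - b), threshold=1)
print(clusters, outliers)
```

The usage example clusters integers by absolute difference to keep the sketch short; for SCF regex strings the `distance` argument would be a string edit distance.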
Some systems (and processes) are based on the automatic static analysis generated by the FX tool. The systems (and processes) target the SCF of the malware instances because that SCF reflects the actual functionality of a malware variant, and can be used as a unique feature that identifies malware families.
The systems (and processes) generate both the SCF of a program and the computed behavior. The program's computed behavior is a unique feature that identifies a malware family and can also be used in the clustering process. However, generating the computed behavior of a targeted-program is a complex and time consuming process in comparison to processing the SCF.
The systems (and processes) render various outputs of the SCF of the analyzed malware instance, such as graphs, human readable assembly instructions, and regex strings. To facilitate the comparison of the SCF of different malware instances, some systems (and processes) output the SCF as regex strings or in a structure of successive branching and subdivisions known as a tree structure. A tree structure is defined (locally) as a collection of nodes that initiate at a root node, where each node is a data structure consisting of a value, together with a list of references to nodes (the “children”), with the constraints that no reference is duplicated, and the structure contains no cycles. Regex is a sequence of characters that define a search pattern, mainly for use in pattern matching with strings, or string matching. Regex strings abstract out all of the individual assembly instructions, resulting in a much shorter string that is faster to cluster.
To test the systems (and processes), SCF regex strings for the individual functions of 303 Windows PE malware instances were merged into a text file. The systems (and processes) calculated the edit distance between all of the generated strings and applied the QTC clustering algorithm. The SCF regex strings were classified based on their sizes, and the fully automated systems (and processes) passed them to the clustering function in groups of fifty strings. Breaking the initial regex strings into groups decreased the time associated with processing the edit distance and the clustering.
To estimate the threshold value used by the QTC algorithm, the systems (and processes) calculated the smallest regex string length of each group and divided it by a programmable value. After estimating the suitable threshold value, the calculated edit distances between regex strings in the groups were processed by the QTC algorithm to determine the similarity between the SCF regex strings of local functions of the malware instances. Local functions that share very similar SCF regex strings are clustered together by the systems (and processes).
After obtaining the clusters of malware local functions that had very similar SCFs, the systems (and processes) mapped each local function back to the malware instance containing the local function. The result of this mapping shows clusters of malware instances that share at least one local function based on the SCF of the local functions.
To identify malware instances that should be clustered together and identified as one malware family, the systems (and processes) calculate the percentage of shared local functions between each pair of malware instances. The systems (and processes) then execute clustering by processing different percentage values of the shared functions. The systems (and processes) added edges to the output graph that represent the relationship between the malware instances, based on the percentage used. In other words, an edge between a pair of malware instances was added when the percentage of shared functions exceeded about fifty percent, seventy percent, or ninety-five percent, depending on the run.
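The size-based grouping can be sketched as follows, assuming that "classified based on their sizes" means sorting the strings by length before splitting them into consecutive groups. The function name `group_by_length` is hypothetical.

```python
def group_by_length(regex_strings, group_size=50):
    """Sort strings by length and split them into consecutive groups so
    that each group holds strings of comparable size."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    ordered = sorted(regex_strings, key=len)
    return [
        ordered[start:start + group_size]
        for start in range(0, len(ordered), group_size)
    ]


groups = group_by_length(["sss", "s", "ss", "ssss", "sssss"], group_size=2)
print(groups)
```

Because edit distances are only computed within a group, the number of pairwise comparisons grows with the group size rather than with the size of the whole data set.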
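A minimal sketch of this mapping follows. It assumes each clustered function is identified by a pair of (malware instance identifier, function name); that representation and the function name `label_functions_by_cluster` are assumptions for illustration.

```python
from collections import defaultdict


def label_functions_by_cluster(function_clusters):
    """Map each malware instance to the set of function-cluster ids in
    which at least one of its local functions appears.

    function_clusters: list of clusters, each a list of
                       (instance_id, function_name) pairs.
    """
    labels = defaultdict(set)
    for cluster_id, cluster in enumerate(function_clusters):
        for instance_id, _function_name in cluster:
            labels[instance_id].add(cluster_id)
    return dict(labels)


function_clusters = [
    [("a.exe", "sub_401000"), ("b.exe", "sub_402000")],
    [("a.exe", "sub_401100"), ("c.exe", "sub_403000"), ("b.exe", "sub_402100")],
]
labels = label_functions_by_cluster(function_clusters)
print(labels)
```

Two instances share at least one local function, in the sense used above, when their label sets intersect.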
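The edge-building step can be sketched as follows. The sketch assumes that each malware instance is described by a set of labels, one per local function, where functions clustered together share a label and unclustered functions carry a unique label. It also assumes the percentage is measured against the first instance's own function count, which makes the edges directed. Both are assumptions, and the names are hypothetical.

```python
from itertools import permutations


def shared_fraction(functions_a, functions_b):
    """Fraction of instance A's local functions that also occur in B."""
    if not functions_a:
        return 0.0
    return len(functions_a & functions_b) / len(functions_a)


def build_edges(labels, min_fraction):
    """Return directed edges (a, b) for every ordered pair of instances in
    which at least min_fraction of a's functions are shared with b."""
    edges = []
    for a, b in permutations(sorted(labels), 2):
        if shared_fraction(labels[a], labels[b]) >= min_fraction:
            edges.append((a, b))
    return edges


labels = {
    "a.exe": {0, 1, 2, 3},
    "b.exe": {0, 1},
    "c.exe": {9},
}
print(build_edges(labels, 0.50))
print(build_edges(labels, 0.70))
```

Raising the percentage from fifty to seventy drops the edge from `a.exe` to `b.exe`, because only half of `a.exe`'s functions are shared, while all of `b.exe`'s functions are.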
To graphically visualize the relationship between the malware instances, the systems (and processes) accessed a graph visualization package such as the JGraphT Java library to generate a Graph Modeling Language (GML) file that presents the relationships between the malware instances in the form of nodes and edges of a directed graph. To display the content of the generated GML file and visualize the clustering results, the systems (and processes) executed Gephi, an open source tool for visualizing and analyzing graphs that is written in Java using the NetBeans platform. The result of this analysis was a directed graph that consisted of 303 nodes, representing the Windows PE malware instances, and a number of edges that changed when different metrics were processed. The edges connect the nodes and present the strength of the relationship between them. Further, the systems (and processes) calculate the modularity of the graph, which describes how the graph is compartmentalized into sub-graphs based on the strength between the clusters' nodes. The modularity was then used to partition the graph into sub-graphs. The systems (and processes) rendered a graph that groups malware instances with a high similarity into clusters and presents them using unique colors.
In an exemplary use, the systems (and processes) applied the smallest regex string length in each group, and added edges in the full clustering application when two malware instances shared fifty percent, seventy percent, or ninety-five percent of the total local functions.
To gain a better understanding of the effect of the percentage of shared functions metric on the results, the systems (and processes) were run for a third time with a threshold value of the smallest regex string length divided by five, but adding edges to the graph of clusters only when the number of shared local functions between the malware instances in a cluster was ninety-five percent or more. The generated graph consisted of 303 nodes, 11735 edges, and 122 clusters.
To reduce the number of unrelated malware instances clustered together, the systems (and processes) tightened up the threshold value applied by the QTC algorithm. In these use cases, the calculated threshold value was determined by dividing the length of the smallest regex string of each group by twenty rather than five, thereby requiring a higher level of similarity between the malware instances in a cluster. As in the prior use cases, the systems (and processes) added edges to the cluster graph if a pair of malware instances shared fifty percent, seventy percent, or ninety-five percent of the total local functions of each malware instance.
The use cases have shown the performance of the systems (and processes) using different metrics. In the first use case, where a threshold value of the smallest regex string length divided by five was used, the results reflected an accuracy that ranged between 61% and 85%. In the second use case, the threshold value was tightened up by using the smallest regex string length divided by twenty. The results established accuracy between 84% and 92%. In the third use case, the systems (and processes) took the smallest regex string length and divided it by fifty. The results showed accuracy between 92% and 94%. Table 10 summarizes the results.
The results established that tightening up the threshold value increases the accuracy of the output of the systems (and processes). The threshold value determines the acceptable variance between the malware instances' regex strings. The results further established that the systems (and processes) effectively detect variants of malware families based on constructed SCFs. The SCFs were generated by first analyzing targeted software through an automatic static analysis tool like the FX tool to render the SCF regex strings of their individual local functions. The SCF regex strings were processed by the systems (and processes) through their automated clustering application. After generating initial clusters, the local functions were mapped back to their malware instances, and the number of shared functions that had similar SCFs was processed between malware instances and a sample set. Based on the percentage of shared functions, the systems (and processes) clustered the malware instances into probable malware families.
The results of the systems (and processes) can be used to build a catalog of malware families based on candidate malware variants that represent each family. When an unknown program is detected, a representative malware family from a catalog of malware families can be processed by the systems (and processes) to check the SCF of the targeted program against the SCF of the candidate malware variants. Unknown malware can be either classified as a variant of a malware family that is present in the catalog, or added as a new unclassified program.
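As an illustration of the file that is handed to the visualization tool, the following sketch writes a directed graph in GML using plain string formatting. It stands in for the JGraphT export named above and is not that library's API; the function name `to_gml` is hypothetical, and labels are assumed to contain no double quotes.

```python
def to_gml(nodes, edges):
    """Render a directed graph as Graph Modeling Language (GML) text.

    nodes: list of node labels (for example, malware instance names)
    edges: list of (source_label, target_label) pairs
    """
    ids = {label: index for index, label in enumerate(nodes)}
    lines = ["graph [", "  directed 1"]
    for label, index in ids.items():
        lines.append(f'  node [ id {index} label "{label}" ]')
    for source, target in edges:
        lines.append(f"  edge [ source {ids[source]} target {ids[target]} ]")
    lines.append("]")
    return "\n".join(lines) + "\n"


gml_text = to_gml(["a.exe", "b.exe"], [("b.exe", "a.exe")])
print(gml_text)
```

The returned text can be saved with a `.gml` extension and opened in a graph viewer for layout, modularity analysis, and coloring.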
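A catalog lookup of this kind might be sketched as follows. The catalog layout (family name mapped to the SCF regex strings of a representative variant), the `is_similar` predicate, and the fifty percent cutoff are all assumptions for illustration; in practice the predicate would be an edit distance test against a threshold.

```python
def classify(unknown_functions, catalog, is_similar, min_fraction=0.5):
    """Return the catalog family whose representative shares the largest
    fraction of the unknown program's local functions, or None when no
    family reaches min_fraction.

    unknown_functions: SCF regex strings of the unknown program
    catalog:           dict mapping family name -> list of SCF regex strings
    is_similar:        callable deciding whether two strings match
    """
    best_family, best_score = None, 0.0
    for family, representative in catalog.items():
        matched = sum(
            1 for u in unknown_functions
            if any(is_similar(u, r) for r in representative)
        )
        score = matched / len(unknown_functions) if unknown_functions else 0.0
        if score > best_score:
            best_family, best_score = family, score
    return best_family if best_score >= min_fraction else None


catalog = {
    "family_a": ["s(sc|s)c", "(s)*c"],
    "family_b": ["scsc"],
}
exact = lambda a, b: a == b  # placeholder for an edit distance test
print(classify(["s(sc|s)c", "(s)*c", "ss"], catalog, exact))
print(classify(["(c)*"], catalog, exact))
```

A result of `None` corresponds to adding the program to the catalog as a new unclassified entry.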
The systems (and processes) and logic described above are implemented in many different ways in many different combinations of hardware, software or both hardware and software. For example, all or parts of the system may diagnose software or circuitry in one or more controllers, one or more microprocessors (CPUs), one or more signal processors (SPUs), one or more servers connected to a network or cloud service (i.e., a server is defined as one or more computers or devices connected to a distributed network via one or more network connections, with each computer or device having one or more applications that generate structured flow control such as a static module; one or more applications that transform the structured flow control into an artifact in which distance may be measured; a clustering application; a server database application(s); and server network application(s)). All or parts of the system may diagnose software through one or more graphics processors (GPUs), one or more application specific integrated circuits (ASICs), one or more programmable media or any and all combinations of such hardware. All or part of the logic, specialized processes, and systems described may be implemented as instructions for execution by multi-core processors (e.g., CPUs, SPUs, and/or GPUs), controller, or other processing device including exascale computers and computer clusters, and may be displayed through a display driver in communication with a remote or local display, or stored in a tangible or non-transitory machine-readable or computer-readable medium such as flash memory, random access memory (RAM) or read only memory (ROM), erasable programmable read only memory (EPROM) or other machine-readable medium such as a compact disc read only memory (CDROM), or magnetic or optical disk.
Thus, a product, such as a computer program product, may include a storage medium and computer readable instructions stored on the medium, which when executed in an endpoint, computer system, or other device, cause the device to perform operations according to any of the description above.
The systems (and processes) evaluate software and data structures through processors (e.g., CPUs, SPUs, GPUs, etc.), memory, and interconnect shared and/or distributed among multiple system components, such as among multiple processors and memories, including multiple distributed processing systems. Parameters, databases, software and data structures used to evaluate and analyze these systems or logic may be separately stored and managed, may be incorporated into a single memory or database, may be logically and/or physically organized in many different ways, and may be implemented in many ways, including data structures such as linked lists, programming libraries, or implicit storage mechanisms. Programs may be parts (e.g., subroutines) of a single program, separate programs, application program or programs distributed across several memories and processor cores and/or processing nodes, or implemented in many different ways, such as in a library, such as a shared library. The library may store behavior abstractions that perform the behavior analysis functionality described herein. While various embodiments have been described, it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible.
The term “coupled” disclosed in this description may encompass both direct and indirect coupling. Thus, first and second parts are said to be coupled together when they directly contact one another, as well as when the first part couples to an intermediate part which couples either directly or via one or more additional intermediate parts to the second part. The term “substantially” or “about” may encompass a range that is largely, but not necessarily wholly, that which is specified. It encompasses all but a significant amount, such as a variance within five or ten percent. When devices are responsive to commands, events, and/or requests, the actions and/or steps of the devices, such as the operations that devices are performing, necessarily occur as a direct or indirect result of the preceding commands, events, actions, and/or requests. In other words, the operations occur as a result of the preceding operations. A device that is responsive to another requires more than an action (i.e., the device's response to) merely follow another action.
While various embodiments of the invention have been described, it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible within the scope of the invention. Accordingly, the invention is not to be restricted except in light of the attached claims and their equivalents.
This application claims priority to U.S. Provisional Patent Application No. 62/170,758, filed Jun. 4, 2015, titled “Automatic Clustering of Malware Variants Based on Structured Control Flow,” which is herein incorporated by reference.
This invention was made with United States government support under Contract No. DE-AC05-00OR22725 awarded by the United States Department of Energy. The United States government has certain rights in the invention.
Number | Date | Country
---|---|---
62170758 | Jun 2015 | US