The present invention relates to computer programs and, in particular, to a computer program for detecting malicious computer programs (malware) such as computer viruses and the like.
As computers become more interconnected, malicious computer programs have become an increasing problem. Such malicious programs include “viruses”, “worms”, “Trojan horses”, “backdoors”, “spyware”, and the like. Viruses are generally programs attached to other programs or documents to activate themselves within a host computer to self-replicate and attach to other programs or documents for further dissemination. Worms are programs that self-replicate to transmit themselves across a network. Trojan horses are programs that masquerade as useful programs but contain portions to attack the host computer or leak data. Backdoors are programs that open a computer system to external entities by subverting local security measures intended to prevent remote access or control over a network. Spyware are programs that transmit private user data to an external entity. These and similar programs will henceforth be termed “malware”.
A common technique for detecting malware is to scan suspected programs for sequences of instructions or data that match “signature” sequences extracted from known malware types. When a match is found, the user is signaled that a malware program has been detected so that the malware may be disabled or removed.
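As a minimal sketch of the signature-matching idea described above, the following Python fragment searches a program image for known byte sequences. The signature names and byte patterns are invented for illustration and do not correspond to any real malware or to any particular detector.

```python
from typing import Dict, List

def find_signatures(program: bytes, signatures: Dict[str, bytes]) -> List[str]:
    """Return the names of all signatures whose byte sequence occurs in the program."""
    return [name for name, pattern in signatures.items() if pattern in program]

# Hypothetical signature database (names and patterns are made up).
SIGNATURES = {
    "demo-virus-a": bytes.fromhex("deadbeef"),
    "demo-worm-b": bytes.fromhex("cafebabe00"),
}

# A toy "program" that embeds the first pattern between other bytes.
sample = b"\x90\x90" + bytes.fromhex("deadbeef") + b"\xc3"
matches = find_signatures(sample, SIGNATURES)
```

Because the match is on literal byte sequences, any transformation that alters those bytes without altering behavior would escape this kind of check.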
Many signature detection systems may be defeated by relatively simple code obfuscation techniques that change the signature of the malware without changing the essential function of the malware code. Such techniques may include changing the static ordering of the instructions by using jump instructions (code transposition), substituting instructions of the signature with different synonym instructions providing the same function (synonym insertion), and the introduction of nonfunctional code (“dead code”) that does not modify the functionality of the malware.
Co-pending U.S. patent application entitled: “Method And Apparatus To Detect Malicious Software”, assigned to the same assignee as the present invention, and hereby incorporated by reference, describes a preprocessor that can reverse some types of malware obfuscation by converting the malware program instructions into a standard form. A search of the de-obfuscated malware for malware signatures is then used to detect malicious code. Such a system employs three processes: a control flow graph (CFG) builder that reorders the instructions according to their control flow, a synonym dictionary that replaces functionally identical sets of instructions with standard equivalents and a dead code remover that removes irrelevant instructions (e.g. “nop” instructions). Irrelevant jump instructions, being unconditional jump instructions that simply jump to the next instruction in the control flow, may also be eliminated.
Malware may be encrypted or compressed (packed), and may execute a decryption or unpacking program once the malware arrives in a host, to unpack or decrypt critical elements of the malware. The encryption or compression serves to hide features of the malware that might be detected by a malware signature detector, until the malware is being executed. A common and normally benign compression program may be used so that signature detection of the unpacking program or decryption program is impractically prone to false positive alerts.
One approach for detecting packed or encrypted programs is to run the signature checker continuously to attempt to find the unpacked program in memory in an unpacked state. This can be impractical for systems where many programs must be monitored.
The present invention provides a malware normalizer that may be part of a malware detection system that permits practical detection of encrypted and/or compressed malware programs. The detection of compressed or encrypted malware relies on an insight that a packed or encrypted program can be inferred by detection of a suspect program's execution of data previously written by the suspect program.
The invention also provides for improved de-obfuscation of code reordering and dead code insertion. Improved code reordering is obtained by examining the control flow graph for nodes which have: (1) at least one preceding edge which is an unconditional jump and (2) no “fall-through” edge, as will be defined below. Improved removal of dead code eliminates or supplements a standard “synonym dictionary” with a piecewise analysis of code “hammocks” that produce no net change of external variables.
Specifically then, the present invention may provide a malware normalization program that monitors memory locations written to during execution of a suspect program. Execution by the suspect program of the “written to” memory locations is used to trigger an analysis of the suspect program against malware signatures based on an assumption that any encrypted or compressed code is now decrypted or uncompressed.
Thus it is one feature of at least one embodiment of the invention to provide a reliable and automatic method of signature detection for encrypted or compressed malware.
The signature analysis may be limited to memory locations written to by the suspect program and within a loaded image of the suspect program.
It is another feature of at least one embodiment to simplify the task of signature matching by minimizing the code that must be examined.
The execution of the suspect program may be performed by a computer emulator limiting access by the suspect program to computer resources.
It is another feature of at least one embodiment of the invention to prevent suspect programs from affecting the host computer prior to their analysis.
The monitoring of execution of previously “written to” data may be repeated iteratively.
It is another feature of at least one embodiment of the invention to provide a system that may automatically work with nested levels of packing or encryption.
The invention may include a step of prescreening suspect programs according to an “entropy” of the loaded image of the suspect program, high entropy generally suggesting compression or encryption of a program.
It is therefore a feature of at least one embodiment of the invention to provide a method of reducing the need for full analysis of all suspect programs.
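One way such an entropy prescreen could be sketched is shown below. It assumes byte-level Shannon entropy as the measure (the text does not specify one); under that particular measure, compressed or encrypted content tends to score near the 8 bits-per-byte maximum, so the sketch flags images at or above a threshold. The threshold value and the function names are hypothetical.

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())

# Hypothetical cut-off; a usable value would have to be tuned on real samples.
PACKED_THRESHOLD = 7.0

def looks_packed(image: bytes, threshold: float = PACKED_THRESHOLD) -> bool:
    """Flag an image for full unpacking analysis when its entropy is high."""
    return byte_entropy(image) >= threshold

# Repetitive text stands in for ordinary code; a uniform byte distribution
# stands in for compressed or encrypted content.
plain = b"mov eax, ebx; add eax, 1; ret; " * 200
uniform = bytes(range(256)) * 16
```

Images that fall below the threshold could then bypass the unpacking stage, which is the saving the prescreen is meant to provide.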
Alternatively or in addition, the invention may include the step of prescreening suspect programs through a static execution of the suspect program detecting an execution of previously “written to” addresses.
It is thus a feature of at least one embodiment of the invention to allow the invention to be used to prescreen programs for possible self-generation.
The invention may further provide a deobfuscation of the decrypted or uncompressed program to correct for instruction reordering before analyzing the program for malware signatures.
It is thus another feature of at least one embodiment of the invention to provide a system that may work with deobfuscation techniques that address code reordering.
The deobfuscation of code reordering may examine the execution order of the instructions and, when a given instruction has no fall-through edge and at least one preceding instruction that is an effective unconditional jump, replace the one effective unconditional jump with the given instruction.
It is thus another feature of at least one embodiment of the invention to provide an improved method of correcting for code reordering obfuscation that may work with complex control flow graphs where multiple branches lead to a single instruction.
The invention may further remove non-functional instructions before checking for malware signatures. In a preferred embodiment, the non-functional instructions are identified by finding “hammocks” of instructions within the execution order of the instructions, monitoring data written to during execution of the hammocks, and removing the instructions of a hammock as non-functional instructions when execution of the hammock does not change external data.
It is another feature of at least one embodiment of the invention to provide a method of semantic “dead code” removal that unlike synonym techniques may work with novel obfuscation patterns that may not be in a synonym dictionary.
These particular features and advantages may apply to only some embodiments falling within the claims and thus do not define the scope of the invention.
a and 6b are examples of control flow graphs of the program of
Referring now to
The received executable files 12 may be received by a scanner program 18 incorporating a malware normalizer 20 of the present invention which normalizes the code of the executable files 12 and then provides it to a signature detector program 22 that compares the normalized executable files 12 to a set of standard, previously prepared, malware signatures 24.
Referring now to
Depending on the determination by the prescreening block 26 the executable file may be passed along to an unpacking program 28 or bypassed, as indicated by bypass path 30, without unpacking to the reordering program 31.
At the unpacking program 28, as will be described further below, executable file 12 is allowed to unpack (decompress) or decrypt itself (if the executable file 12 is packed or encrypted). As used henceforth the terms “pack” and “unpacking” shall be considered to refer also to “encrypt” and “decrypt” and similar functions performed by self-generating code, for example, including optimization, that generally alter the signature of the executable file 12. The unpacking process of unpacking program 28 may be repeated iteratively, as indicated by path 32, so as to unpack executable files 12 that have been packed multiple times. The unpacking program 28 may produce a detection signal 33 when the detection of self-generating code is desired (as opposed to the detection of malware).
At the moment the unpacking or decryption is complete, the unpacked executable file 12 is forwarded to a reordering program 31. If the executable file 12 does not have packing it is passed directly to the reordering program 31 without modification.
The reordering program 31 reorders the instructions of the executable file 12, as received from the unpacking program 28, into a standard form, as will be described, and then passes the reordered executable file 12 to the dead code remover program 34. The dead code remover program 34 removes “semantic nops”, being nonfunctional code (not necessarily limited to nop instructions), to provide as an output a normalized executable file 12 that is passed to the signature detector program 22 for comparison to normalized malware signatures 24.
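The order of the stages described above (prescreen, unpack or bypass, reorder, remove dead code, then signature detection) can be illustrated with a toy pipeline. Everything here is a stand-in chosen for brevity: the `PK!` marker, the XOR "packing", the identity reordering step, and the stripping of 0x90 bytes are assumptions for the example, not the mechanisms of the invention, which lets the program unpack itself under monitoring.

```python
from typing import Dict, List

MARKER = b"PK!"   # hypothetical marker used only by this toy prescreen
KEY = 0x5A        # hypothetical XOR key used only by this toy packer

def toy_pack(code: bytes) -> bytes:
    return MARKER + bytes(b ^ KEY for b in code)

def looks_packed(image: bytes) -> bool:
    return image.startswith(MARKER)

def unpack(image: bytes) -> bytes:
    return bytes(b ^ KEY for b in image[len(MARKER):])

def reorder(code: bytes) -> bytes:
    return code  # placeholder: no reordering in this toy

def remove_dead_code(code: bytes) -> bytes:
    return code.replace(b"\x90", b"")  # toy rule: treat every 0x90 byte as a nop

def detect(code: bytes, signatures: Dict[str, bytes]) -> List[str]:
    return [name for name, sig in signatures.items() if sig in code]

def scan(executable: bytes, signatures: Dict[str, bytes]) -> List[str]:
    """Prescreen, unpack or bypass, reorder, strip dead code, then match."""
    code = unpack(executable) if looks_packed(executable) else executable
    normalized = remove_dead_code(reorder(code))
    return detect(normalized, signatures)

SIGNATURES = {"demo-sig": bytes.fromhex("deadbeef")}   # invented signature
payload = b"\xde\x90\xad\xbe\x90\xef"                  # signature split by nops
```

In this toy, scanning either `payload` or `toy_pack(payload)` reports `demo-sig`, while running `detect` directly on the packed bytes does not.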
Referring still to
Referring now to
As shown in
At some point in the execution of the executable file 12, if the executable file 12 is packed, an unpacker program 54 in the executable file 12 will be invoked, performing writes 50 to internal memory addresses of code that is being unpacked. These memory addresses are also tracked per process block 58 of the unpacking program 28 to further define the unpack area 56, which will grow, logically bounded by a first instruction 60 and a last instruction 62, although the unpack area 56 need not be absolutely continuous within that range.
At decision block 64 of the unpacking program 28, occurring during the execution of each instruction of the executable file 12, the unpacking program 28 checks to see if there has been a jump in the control flow 48 to the unpack area 56, indicating that previously written data is now being executed as instructions. This jump is assumed to signal the conclusion of the unpacking process and the beginning of execution of the malware. At this time, a signal 33 is produced indicating that compression was detected.
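The write tracking and the jump check described above can be sketched as follows. The sketch assumes an emulator that reports a trace of `("write", address)` and `("exec", address)` events; that event format, the class name, and the addresses are hypothetical and are used only to make the idea concrete.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional, Tuple

@dataclass
class UnpackMonitor:
    """Tracks addresses written by a suspect program and flags their execution."""
    image_start: int
    image_end: int                      # exclusive upper bound of the loaded image
    written: set = field(default_factory=set)

    def on_write(self, address: int) -> None:
        # Only writes inside the loaded image extend the unpack area.
        if self.image_start <= address < self.image_end:
            self.written.add(address)

    def on_execute(self, address: int) -> bool:
        # True when control flow reaches previously written data.
        return address in self.written

    def unpack_area(self) -> Optional[Tuple[int, int]]:
        # Logical bounds of the area; it need not be fully populated in between.
        if not self.written:
            return None
        return min(self.written), max(self.written)

def first_execution_of_written_data(
    monitor: UnpackMonitor, trace: Iterable[Tuple[str, int]]
) -> Optional[int]:
    """Return the first executed address that the program itself wrote earlier."""
    for kind, address in trace:
        if kind == "write":
            monitor.on_write(address)
        elif kind == "exec" and monitor.on_execute(address):
            return address
    return None

trace = [
    ("exec", 0x1000), ("write", 0x1800), ("write", 0x1801),
    ("write", 0x9000),                  # outside the image, so ignored
    ("exec", 0x1004), ("write", 0x1802),
    ("exec", 0x1800),                   # jump into the written area
]
monitor = UnpackMonitor(image_start=0x1000, image_end=0x2000)
hit = first_execution_of_written_data(monitor, trace)
```

Here `hit` is the address at which the detection signal would be raised, and `monitor.unpack_area()` gives the bounds of the region to hand to later stages.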
At iteration block 66, the unpacking program 28 checks to see if the executable file 12 has concluded execution, such as may be detected by movement of the control flow 48 out of the loaded image 42 or by a steady state looping such as may be detected, for example, by analyzing a fixed number of executed instructions. So long as the executable file 12 appears to be continuing execution, the iteration block 66 repeats process blocks 36, 58, and 64, creating a new unpack area 56 within the loaded image and monitoring the control flow 48 for a jump into the new unpack area 56. This process is continued to accommodate possible multiple packing operations.
At the conclusion of all the iterations, as indicated by process block 68 of the unpacking program 28, the unpacked code, being for example the unpack area 56 of the final iteration or the union of all unpack areas 56 of all iterations, is sent to the reordering program 31.
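The iteration over nested packing layers might be sketched as below, again assuming a hypothetical emulator trace of `("write", address)` and `("exec", address)` events. A fixed instruction budget stands in for the steady-state test; the budget value is an assumption for this example.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

def unpack_layers(
    trace: Iterable[Tuple[str, int]], max_instructions: int = 10_000
) -> List[FrozenSet[int]]:
    """Return one unpack area (set of written addresses) per detected layer."""
    layers: List[FrozenSet[int]] = []
    written: Set[int] = set()
    executed = 0
    for kind, address in trace:
        if kind == "write":
            written.add(address)
            continue
        executed += 1
        if executed > max_instructions:
            break                       # assumed budget: treat as steady state
        if address in written:
            # Jump into written data: close this layer and start a new area.
            layers.append(frozenset(written))
            written = set()
    return layers

# Two nested layers: a stub writes and jumps to 0x2000, and the code there
# writes and jumps to 0x3000.
trace = [
    ("exec", 0x1000),
    ("write", 0x2000), ("write", 0x2001), ("write", 0x2002),
    ("exec", 0x2000), ("exec", 0x2001),
    ("write", 0x3000), ("write", 0x3001),
    ("exec", 0x3000), ("exec", 0x3001),
]
layers = unpack_layers(trace)
final_area = layers[-1] if layers else frozenset()
union_area = frozenset().union(*layers)
```

Either `final_area` or `union_area` could be forwarded for reordering, matching the two alternatives named in the text.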
Referring now to
Referring specifically to
As shown in
The edge 82 connecting instructions 72 and 76 is a conditional jump instruction and the edge 84 connecting instructions 75 and 76 is an unconditional jump instruction.
Per
In this case, and as shown by process block 94, the executable file 12 is edited by the reordering program 31 to remove the unconditional jump instruction 75 and replace it with its target 76 as shown in
When there is more than one unconditional jump predecessor for a given node (and that node has no fall-through edges) an arbitrary unconditional jump instruction may be eliminated. In a preferred embodiment, the unconditional jump instruction that is eliminated is the last unconditional jump predecessor in the order of the control flow graph. In a more sophisticated embodiment, conditional jump instructions that always jump are detected and treated as unconditional jump instructions.
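A simplified sketch of this reordering rule follows. It works on basic blocks rather than single instructions, treats a block as having a fall-through edge when the block laid out immediately before it ends in `fall` or `jcc`, keeps the entry block in place, and picks the last unconditional-jump predecessor in layout order. The `Block` type and these conventions are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Block:
    name: str
    body: List[str]
    kind: str                     # "jmp", "jcc", "fall", or "ret"
    target: Optional[str] = None  # jump target label, if any

def chain_end(layout: List[Block], i: int) -> int:
    """Exclusive end index of block i plus the blocks it falls through into."""
    j = i
    while layout[j].kind in ("fall", "jcc") and j + 1 < len(layout):
        j += 1
    return j + 1

def reorder(blocks: List[Block]) -> List[Block]:
    layout = [Block(b.name, list(b.body), b.kind, b.target) for b in blocks]
    changed = True
    while changed:
        changed = False
        for i in range(1, len(layout)):          # index 0 is the entry block
            if layout[i - 1].kind in ("fall", "jcc"):
                continue                         # block has a fall-through edge
            end = chain_end(layout, i)
            preds = [k for k, b in enumerate(layout)
                     if b.kind == "jmp" and b.target == layout[i].name
                     and not (i <= k < end)]
            if not preds:
                continue
            p = preds[-1]                        # last unconditional-jump predecessor
            layout[p].kind, layout[p].target = "fall", None
            chain = layout[i:end]
            del layout[i:end]
            at = p + 1 if p < i else p - (end - i) + 1
            layout[at:at] = chain                # target now follows the old jump
            changed = True
            break
    return layout

# Transposed program: execution order is A, C, B although memory order is A, B, C.
obfuscated = [
    Block("A", ["mov eax, 1"], "jmp", "C"),
    Block("B", ["add eax, 2"], "ret"),
    Block("C", ["inc eax"], "jmp", "B"),
]
normalized = reorder(obfuscated)
```

Each rewrite turns one unconditional jump into a fall-through, so the loop ends after at most as many passes as there are such jumps in the input.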
Referring now to
Referring to
Generally hammocks will occur with structured “if”, “while”, and “repeat” statements but may also occur in other contexts.
Per process block 104 of the dead code remover program 34, the execution of the instructions within the hammock 98 (for example using the emulator or sandbox described above) is monitored, keeping track of each write 106 performed by an instruction in the hammock 98, for example, by enrolling those written values and their addresses in a buffer table 108 to be refreshed at each hammock 98. If a given address is written more than once, the last written value is the one held in the table 108. The table 108 also preserves the original values 112 for each of the written values 110.
This population of the table 108 may also be performed by a static analysis of the instructions of the hammock 98.
At the conclusion of the execution of the hammock 98, that is, when the hammock 98 is exited at node 102, per process block 107, the original values 112 and written values 110 are compared. If they are identical, then the hammock represents nonfunctional or dead code insofar as there has been no net change in any variable.
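The comparison of original and last-written values can be sketched with a toy interpreter. The three-operation instruction set (`mov`, `add`, `sub`) and the dictionary used as machine state are assumptions for the example; the table here maps each written location to its original value and its last written value.

```python
from typing import Dict, List, Tuple

Instruction = Tuple[str, str, int]   # (operation, location, immediate operand)

def is_dead_hammock(hammock: List[Instruction], state: Dict[str, int]) -> bool:
    """Run the hammock on a copy of the state; True if no location changes overall."""
    scratch = dict(state)
    table: Dict[str, List[int]] = {}  # location -> [original value, last written value]
    for op, loc, operand in hammock:
        old = scratch.get(loc, 0)
        if op == "mov":
            new = operand
        elif op == "add":
            new = old + operand
        elif op == "sub":
            new = old - operand
        else:
            raise ValueError(f"unsupported operation: {op}")
        if loc not in table:
            table[loc] = [old, new]   # first write: remember the original value
        else:
            table[loc][1] = new       # later writes: keep only the last value
        scratch[loc] = new
    return all(original == last for original, last in table.values())

state = {"eax": 10, "ebx": 0}
dead = [("add", "eax", 5), ("sub", "eax", 5)]   # net effect on eax is zero
live = [("add", "eax", 5), ("mov", "ebx", 7)]   # changes both locations
```

Since this check runs the hammock on a single concrete state, a hammock that happens to restore values only for that state would also be reported as dead, so a practical tool would likely need a stronger test or a static analysis.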
Referring again to
It is specifically intended that the present invention not be limited to the embodiments and illustrations contained herein, but include modified forms of those embodiments including portions of the embodiments and combinations of elements of different embodiments as come within the scope of the following claims.
This application claims the benefit of U.S. provisional application 60/915,253 filed May 1, 2007 hereby incorporated by reference.
This invention was made with United States government support awarded by the following agencies: NAVY/ONR N00014-01-1-0708 and ARMY/SMDC W911NF-05-C-0102. The United States has certain rights in this invention.
Number | Date | Country
---|---|---
60915253 | May 2007 | US