Protecting computers from security threats, such as malware, is a concern for modern computing environments. Malware includes unwanted software that attempts to harm a computer or a user. Different types of malware include trojans, keyloggers, viruses, backdoors and spyware. Malware authors may be motivated by a desire to gather personal information, such as social security, credit card, and bank account numbers. Thus, there is a financial incentive motivating malware authors to develop more sophisticated methods for evading detection. In addition, various techniques, such as packing, polymorphism, or metamorphism can create a large number of variants of a malicious or unwanted program. Thus, it is difficult for security analysts to identify and investigate each new instance of malware.
The present disclosure describes malware detection using multiple classifiers including static and dynamic classifiers. A static classifier applies a set of metadata classifier weights to static metadata of a file. Examples of dynamic classifiers include an emulation classifier and a behavioral classifier. The classifiers can be executed at a client to automatically identify the file as potential malware and to potentially take various actions. For example, the actions may include preventing the client from running the malware, alerting a user to the possible presence of malware, querying a web service for additional information on the file, performing more extensive automated tests at the client to determine whether the file is indeed malware, or recommending that the user submit the file for further analysis. Classifiers can also be executed at a backend service to evaluate a sample of the file, to prioritize new files for human analysts to investigate, or to perform more extensive analysis on particular files. Further, based on further analysis, a recommendation may be provided to the client to block particular files.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a particular embodiment, a method of identifying a malware file using multiple classifiers is disclosed. The method includes receiving a file at a client computer. The file includes static metadata. A set of metadata classifier weights are applied to the static metadata to generate a first classifier output. A dynamic classifier is initiated to evaluate the file and to generate a second classifier output. The method includes automatically identifying the file as potential malware based on at least the first classifier output and the second classifier output.
In another particular embodiment, a method of classifying a file is disclosed. The method includes receiving a file at a client computer. The method also includes initiating a static type of classification analysis on the file, initiating an emulation type of classification analysis on the file, and initiating a behavioral type of classification analysis on the file. The method includes taking an action with respect to the file based on a result of at least one of the static type of classification analysis, the emulation type of classification analysis, and the behavioral type of classification analysis.
In another particular embodiment, a system to classify a file is disclosed. The system includes a classifier report evaluation component and a hierarchical classifier component. The classifier report evaluation component receives and evaluates a plurality of classifier reports from a set of client computers. The hierarchical classifier component includes a metadata classifier to evaluate metadata of a file sampled by at least one of the client computers to generate a first classifier output. The hierarchical classifier component also includes a dynamic classifier to generate a second classifier output. The hierarchical classifier component also includes a classifier results output to provide an aggregated output related to predicted malware content of at least one file associated with at least one of the plurality of classifier reports.
Referring to
In the embodiment illustrated in
In operation, the client computer 102 receives a file 112 including static metadata. The static metadata classifier 104 applies a set of metadata classifier weights 114 to the static metadata of the file 112 to generate a first classifier output 116. In a particular embodiment, the set of metadata classifier weights 114 are stored locally at the client computer 102. Alternatively, the set of metadata classifier weights 114 may be stored at another location (e.g., a network location). One or more dynamic classifiers 106 are then initiated to evaluate the file 112 and to generate a second classifier output 118. Based on at least the first classifier output 116 and the second classifier output 118, the anti-malware engine 120 automatically determines whether the file 112 includes potential malware. When the file 112 includes potential malware, a user interface 138 may provide an indication of potential malware 140 to a user.
The static metadata classifier 104 applies the set of metadata classifier weights 114 to generate the first classifier output 116. The static metadata classifier 104 analyzes attributes of the file 112 to construct features. Examples of static metadata features at the client computer 102 include a checkpointID feature and a locality sensitive hash feature. The checkpointID feature identifies the behavior that caused the report to be generated. The locality sensitive hash feature is a hash for which a small change in the executable binary of a file leads to a small change in the hash value. Weights 114 for the static metadata classifier 104 are trained on a backend system (e.g., the backend service 124) using metadata reports from many clients and the associated analyst labels (e.g., malware, benign). Training a two-class (malware, benign software) classifier using logistic regression may provide very accurate results.
The trained classifier weights may then be downloaded to the client computer 102 and stored as the set of metadata classifier weights 114. Attributes are extracted from the file 112 and converted to static metadata features. The static metadata features are evaluated by the static metadata classifier 104. The first classifier output 116 from the static metadata classifier 104 indicates a measure related to how likely the file 112 is to be malware.
Thus, the set of metadata classifier weights 114 may be used to produce a statistical likelihood that particular metadata is associated with malware. This statistical likelihood is output from the static metadata classifier 104 as the first classifier output 116. In a particular embodiment, the static metadata is represented as a feature vector. The first classifier output 116 may be determined based at least in part on a dot product of the set of metadata classifier weights 114 and the feature vector.
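By way of illustration only, and not limitation, the dot-product scoring described above may be sketched as follows. The sketch assumes that the dot product is passed through a logistic function, consistent with the logistic regression training mentioned earlier. The function name, the optional bias term, and the example values are hypothetical and are not taken from the disclosure.

```python
import math
from typing import Sequence


def static_metadata_score(
    weights: Sequence[float],
    features: Sequence[float],
    bias: float = 0.0,  # assumed intercept term; not named in the disclosure
) -> float:
    """Map a weighted sum of metadata features to a value between 0 and 1."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    # Dot product of the classifier weights and the feature vector.
    z = bias + sum(w * x for w, x in zip(weights, features))
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

A higher returned value would correspond to a higher estimated likelihood that the file is malware. How that value is thresholded is left open here.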
Another type of static classifier that predicts a likelihood that an unknown file is malware is a static string classifier that evaluates strings found in an unknown file, such as the file 112. One type of static string classifier uses a bag of strings model where important strings discriminate benign files and malware files. These strings can be identified in a number of different ways using feature selection techniques based on different principles such as contingency tables, mutual information, or other metrics. Once the most informative strings have been identified, a classifier can then be trained based on the presence or absence of the strings from known examples of the desired classes. When an unknown file is encountered, the anti-malware engine 120 extracts all strings from the unknown file. The anti-malware engine 120 compares each of the feature selected strings to the strings extracted from the unknown file. If the classifier feature string occurs in the unknown file, this feature is set to TRUE. Otherwise, this feature is set to FALSE. Alternatively, the number of times the particular string occurs in the unknown file may also be used as a feature instead of or in addition to the absence or presence of the string. The static string classifier then produces an output related to the likelihood that the unknown file is malware.
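As a non-limiting sketch of the bag of strings feature construction described above, the following example extracts strings from a file and records, for each feature-selected string, both its presence and its occurrence count. The extraction rule (runs of at least four printable ASCII bytes), the function names, and the dictionary layout are assumptions made for illustration.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List

# Assumed extraction rule: runs of four or more printable ASCII bytes.
_PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def extract_strings(data: bytes) -> List[str]:
    """Return every printable run found in the raw bytes of a file."""
    return [run.decode("ascii") for run in _PRINTABLE_RUN.findall(data)]


def string_features(data: bytes, selected: Iterable[str]) -> Dict[str, Dict[str, object]]:
    """Build presence and count features for the feature-selected strings."""
    counts = Counter(extract_strings(data))
    return {
        s: {"present": counts[s] > 0, "count": counts[s]}
        for s in selected
    }
```

In this sketch a feature-selected string matches only when it equals an entire extracted string. Substring matching would be an equally plausible design choice.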
Another type of static classifier that predicts a likelihood that an unknown file, such as the file 112, is malware is a static code classifier. For example, the static code classifier may be based on blocks of code used by the file 112.
As shown in
In a particular embodiment, the emulation classifier 108 simulates execution of the file 112 in an emulation environment. The emulation environment protects the client computer 102 from being infected while the file 112 is tested in the emulation environment. In the emulation environment, the anti-malware engine 120 observes the behavior exhibited by the tested file 112 as it “runs” in the emulation environment. The behavior the file 112 exhibits will be very similar to the behavior it would exhibit if the file 112 were to run in the real system (e.g., the client computer 102). If the file 112 is found to be malware, this technique allows the anti-malware engine 120 to block the file before the file is allowed to execute. In a particular embodiment, the first classifier output 116 from the static metadata classifier 104 may be used to determine the length of time that the emulation classifier 108 is run.
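One possible way to let the first classifier output govern the emulation time is sketched below. The linear mapping, the function name, and the minimum and maximum durations are assumptions chosen for illustration. The disclosure does not specify how the length of time is derived.

```python
def emulation_budget_seconds(
    static_score: float,
    min_seconds: float = 1.0,   # assumed lower bound
    max_seconds: float = 10.0,  # assumed upper bound
) -> float:
    """Give files with a more suspicious static score a longer emulation run."""
    if not 0.0 <= static_score <= 1.0:
        raise ValueError("static_score must be between 0 and 1")
    # Linear interpolation between the two bounds.
    return min_seconds + static_score * (max_seconds - min_seconds)
```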
The anti-malware engine 120 can observe which system APIs are invoked by the malware and what parameters are passed to these APIs. For example, the emulation classifier 108 may determine a set of application programming interfaces (APIs) invoked at the emulation environment. In a particular embodiment, features used by the emulation classifier 108 include API and parameter combinations, unpacked strings, and n-grams of API sequence calls. At least one of the APIs may be associated with malware. If the emulation classifier 108 predicts that the file 112 is malware, the installation and execution of the file 112 may be blocked.
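For illustration only, the API-based features named above (API and parameter combinations, and n-grams of API call sequences) may be built from an emulation trace roughly as follows. The trace representation, the function names, and the example API names in the usage note are hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def api_ngrams(calls: Sequence[str], n: int = 3) -> Counter:
    """Count each run of n consecutive API names in the observed call order."""
    if n <= 0:
        raise ValueError("n must be positive")
    return Counter(tuple(calls[i:i + n]) for i in range(len(calls) - n + 1))


def api_param_features(calls: Sequence[Tuple[str, Tuple[str, ...]]]) -> Counter:
    """Count each combination of an API name and the parameters passed to it."""
    return Counter(f"{api}({', '.join(params)})" for api, params in calls)
```

For example, the trace VirtualAlloc, WriteProcessMemory, CreateRemoteThread, VirtualAlloc, WriteProcessMemory contains the 2-gram (VirtualAlloc, WriteProcessMemory) twice.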
The behavioral classifier 110 may be composed of one or more classifiers that analyze an unknown file, such as file 112, during installation and execution. In a particular embodiment, the behavioral classifier 110 analyzes the file 112 during installation to identify one or more installation behavioral features associated with malware. When there is a request to install an unknown file (e.g., the file 112) on the client computer 102, the behavioral classifier 110 predicts whether the file 112 is malware or benign based on behavior exhibited by the file 112 during installation. If the behavioral classifier 110 predicts that the file 112 is malware before the installation process has completed, the behavioral classifier 110 may be able to alert the operating system in time to prevent the malware from being installed, thereby preventing infection of the client computer 102.
In another particular embodiment, the behavioral classifier 110 analyzes the file 112 during run-time to identify one or more run-time behavioral features associated with malware. After the file 112 has been installed, the behavioral classifier 110 can attempt to predict if the file 112 is malware based on its normal behavior. If the behavioral classifier 110 predicts that the file 112 is malware, the execution of the file 112 can be halted.
The behavioral classifier 110 can also be used to predict whether the file 112 is malware based on other types of behavior. For example, the behavioral classifier 110 may monitor an operating system firewall or a corporate network firewall and prohibit the execution of the file 112 based on external network behavior.
Based on at least the first classifier output 116 and the second classifier output 118, the anti-malware engine 120 may take an action with respect to the file. For example, the action may include providing an indication of potential malware 140 to a user via the user interface 138. Alternatively, the action may include blocking execution of the file 112 or blocking installation of the file 112. In another embodiment, the action may include querying a web service for additional information about the file 112. For example, the anti-malware engine 120 may submit client predicted malware content 122 to the backend service 124. The client predicted malware content 122 may include classifier information and metadata related to the file 112. The backend service 124 may perform additional emulation type classification analysis to determine whether the file 112 includes malware. In the embodiment shown, the backend service 124 includes a hierarchical classification component 128, including a backend metadata classifier component 130, one or more backend dynamic classifiers 132, and a classifier results output component 134. Based on an analysis by at least one of the components 130 and 132, the backend service 124 may provide server predicted malware content 136 to the client computer 102. For example, the server predicted malware content 136 may indicate that the file 112 contains malware. Alternatively, the server predicted malware content 136 may indicate that the file 112 does not contain malware.
In a particular embodiment, there are two backend static metadata classifiers: Zero-Day Backend Static Metadata Classifier (ZDBSMC) and Aggregated Backend Static Metadata Classifier (ABSMC). The ZDBSMC is designed to detect a new malware entry the first time it is encountered. Examples of ZDBSMC and ABSMC features include a checkpointID feature, a locality sensitive hash feature, a packed feature, and a signer feature, among other alternatives. The checkpointID feature identifies the behavior that caused the report to be generated. The locality sensitive hash feature is a hash for which a small change in the executable binary of a file leads to a small change in the hash value.
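The disclosure does not name a particular locality sensitive hash. As one hedged illustration, a SimHash-style fingerprint over overlapping byte windows has the stated property that a small change in the binary changes only a few bits of the hash. The function names, the window size, and the choice of BLAKE2b as the per-window hash are assumptions.

```python
import hashlib


def simhash64(data: bytes, shingle: int = 4) -> int:
    """Compute a 64-bit SimHash-style fingerprint over overlapping byte windows."""
    totals = [0] * 64
    for i in range(max(len(data) - shingle + 1, 1)):
        digest = hashlib.blake2b(data[i:i + shingle], digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        # Each window votes +1 or -1 on every bit position.
        for bit in range(64):
            totals[bit] += 1 if (h >> bit) & 1 else -1
    # A fingerprint bit is set when the positive votes win.
    return sum(1 << bit for bit in range(64) if totals[bit] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions at which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two binaries that differ in only a few bytes would be expected to yield fingerprints separated by a small Hamming distance, whereas unrelated binaries would typically differ in roughly half of the 64 bits.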
An anti-malware system can be executed on many client machines at various locations. These anti-malware engines can generate classifier reports that describe either static attributes, dynamic behavioral (both emulated and real system) attributes, or a combination of both static and dynamic behavioral attributes. These reports can optionally be transmitted to a backend service implemented on one or more backend servers. The backend service can determine whether or not to store the classifier reports from the anti-malware engines.
Backend anti-malware services attempt to identify new forms of malware and request samples of new malware that are encountered by client computers. However, many forms of malware are polymorphic or metamorphic, meaning that these files sometimes mutate so that each instance (i.e., variant) of the malware is unique. If the backend anti-malware service waits until the metadata reports have been post-processed before collecting a sample of polymorphic or metamorphic malware, variants may be detected from the metadata reports, but the unique samples may never be seen again on another computer.
If the static, emulation and/or behavioral classifiers predict that the unknown file is malware, the classification output probability from the classifier(s) on the client can be sent to the backend service 124 along with the other metadata. If the unknown file is predicted to be malware by the client and the backend service 124 has either never received a particular report for the unknown file or has not received the desired number of reports related to the particular file, then the backend service 124 can automatically request that the sample be collected from the client computer, such as the client computer 102. The client computer 102 may also use the classification output probability to decide whether or not to automatically push a sample of the file 112 to the backend service 124.
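The sample collection decision described above may be sketched, purely as an illustration, as follows. The probability threshold, the desired number of reports, and the function name are assumed values that are not specified in the disclosure.

```python
def should_request_sample(
    client_probability: float,
    reports_seen: int,
    malware_threshold: float = 0.5,  # assumed cut-off for "predicted to be malware"
    desired_reports: int = 3,        # assumed desired number of reports per file
) -> bool:
    """Decide whether the backend should ask the client for a sample of the file."""
    predicted_malware = client_probability >= malware_threshold
    # Covers both "never received a report" and "fewer reports than desired".
    needs_more_reports = reports_seen < desired_reports
    return predicted_malware and needs_more_reports
```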
Referring to
The metadata classifier 256 evaluates metadata sampled by at least one of the client computers to generate a first classifier output. For example, the metadata may include static metadata or other metadata (e.g., dynamic metadata). As an example, behavioral metadata and emulation metadata may be transferred to the backend service 206. If a sample file has been previously collected, a more extensive metadata classifier 256 may be run (e.g., static metadata, code, or string classifiers). The dynamic classifier 258 generates a second classifier output. In a particular embodiment, the dynamic classifier 258 is run if a sample has been previously collected. The classifier results output 260 provides an aggregated output 262 related to predicted malware content of at least one file associated with at least one of the plurality of classifier reports (e.g., the first classifier report 228 and the second classifier report 250). In a particular embodiment, each of the classifier reports may include at least one of a filename, an organization, and a version.
The classifiers 256 and 258 at the backend service 206 may be similar to the classifiers that are executable at client computers (e.g., the first client computer 202 and the second client computer 204). For example, the metadata classifier 256 of the backend service 206 can classify new reports that are collected from the anti-malware engines running on the client (e.g., anti-malware engine 224 on the first client computer 202 and anti-malware engine 246 on the second client computer 204).
In operation, the backend service 206 receives classifier reports from one or more client computers. In the embodiment illustrated, the client computers include the first client computer 202 and the second client computer 204. The first client computer 202 includes a static metadata classifier 208, one or more dynamic classifiers 210, and an anti-malware engine 224. The dynamic classifiers 210 include an emulation classifier 212 and a behavioral classifier 214.
The first client computer 202 receives a file 218 including at least static metadata (e.g., the file 218 may also contain dynamic metadata). The static metadata classifier 208 applies a set of metadata classifier weights 216 to the static metadata from the file 218 to generate a first classifier output 220. The dynamic classifiers 210 are then initiated to evaluate the file 218 and to generate a second classifier output 222. Based on at least the first classifier output 220 and the second classifier output 222, the anti-malware engine 224 automatically determines whether the file 218 includes potential malware.
The second client computer 204 operates substantially similarly to the first client computer 202. The second client computer 204 includes a static metadata classifier 230, one or more dynamic classifiers 232, and an anti-malware engine 246. The dynamic classifiers 232 include an emulation classifier 234 and a behavioral classifier 236. The second client computer 204 receives a file 240 including static metadata. The static metadata classifier 230 applies a set of metadata classifier weights 238 to the static metadata from the file 240 to generate a first classifier output 242.
In a particular embodiment, the set of metadata classifier weights 238 is stored locally at the second client computer 204. Alternatively, the set of metadata classifier weights 238 may be stored at another location. For example, the set of metadata classifier weights 238 may be stored at a network location and shared by the first client computer 202 and the second client computer 204.
The dynamic classifiers 232 are initiated to evaluate the file 240 and to generate a second classifier output 244. Based on at least the first classifier output 242 and the second classifier output 244, the anti-malware engine 246 automatically determines whether the file 240 includes potential malware.
Based on at least the classifier outputs 220, 222, 242 and 244, the anti-malware engines 224 and 246 submit client predicted malware content 226, 248 to the backend service 206. The client predicted malware content 226 from the first client computer 202 may be included in the first classifier report 228. Similarly, the client predicted malware content 248 from the second client computer 204 may be included in the second classifier report 250.
Backend static malware classification may have some advantages over the client classifiers. For example, the backend metadata classifier 256 can aggregate the metadata from multiple reports. Additional aggregated features may include the number of different filenames, organizations, and versions, among other alternatives. For example, the same malware binary may use a different filename, organization, or version. An additional feature is the entropy (randomness) of the different filenames. If the filename is completely random for the same executable binary, which can be identified by a hash of the binary version of the file, such as files 218 or 240, this is often an indication of malware. Furthermore, if the checkpointID and dynamic metadata are completely random, this may be an indication of malware. As another example, additional computational processing can be used on the backend. Very fast dedicated computers can be used to analyze an unknown file on the backend server. This may allow for additional analysis of the unknown file.
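As a non-limiting sketch of the aggregated features described above, the following example groups classifier reports by a hash of the binary and computes the number of distinct filenames and the entropy of the filenames. The sketch assumes that the entropy is the Shannon entropy of the distribution of filenames across reports. The report field names ("sha256", "filename") and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Mapping


def filename_entropy(filenames: Iterable[str]) -> float:
    """Shannon entropy, in bits, of the filenames reported for one binary."""
    counts = Counter(filenames)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def aggregate_by_hash(reports: Iterable[Mapping[str, str]]) -> Dict[str, Dict[str, float]]:
    """Compute per-binary aggregated features from many client reports."""
    names_by_hash: Dict[str, List[str]] = defaultdict(list)
    for report in reports:
        names_by_hash[report["sha256"]].append(report["filename"])
    return {
        file_hash: {
            "distinct_filenames": len(set(names)),
            "filename_entropy": filename_entropy(names),
        }
        for file_hash, names in names_by_hash.items()
    }
```

Under this assumption, a binary that is always reported under one filename has an entropy of zero, while a binary reported under a different filename in every report has the maximum entropy for that number of reports.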
Once the backend service 206 has analyzed the classifier reports (and, optionally, the unknown file) one or more of the classifier output probabilities can be returned to the client computer so that the client computer can decide whether or not to continue the installation or execution of the unknown file. In addition, when a classifier report is submitted to the backend service 206, one or more of the backend classifier output values can be used to automatically request that the file be collected immediately from the client computer or collected in the future when the file is again observed.
For an enterprise, information technology (IT) managers may desire the ability to enable full logging of files exhibiting “suspicious” static, emulation, and behavioral events. IT managers log host computer events, firewall events for monitoring network activity, etc. to investigate potential malware on their clients. An anti-malware engine can maintain a history of the behavior for the unknown files, i.e. files that are not signed by companies on a cleanlist. The anti-malware engine can provide the ability to log the behavior of clean files so that the IT managers can learn to identify clean behavior. The option to log behavior events to a SQL database may be desirable. Another feature would be to add a new set of security events to handle the behavioral events so that a backend security service could manage these events.
For a home or a small business environment, users could enable full behavior logging for “suspicious” behavioral events. Users could submit plain text versions of the logs to anti-malware forums for feedback. If suspicious behavior is detected on the client, the user could also have the option of submitting the full behavior logs to the anti-malware engine manufacturer in real time, with the logs obfuscated to remove personal information and then compressed, encrypted, etc. The backend service 206 could provide a type of enhanced, behavioral reputation service similar to a diagnosis provided after a crash. The backend service could offer an enhanced diagnostic security service based on these logs which might not be available on the client in real time. In addition to the home users, the enterprise users could also use this backend service for enhanced security. These logs would then be the basis for training future versions of behavior-based signatures and classifiers.
In both of these scenarios, the end user would have control over submitting the logs and would gain better security through improved diagnostics. Thus, the initial detection of suspicious behavior on the client based on signatures would provide the first level of detection. The backend could potentially offer more robust behavioral analysis and detection.
Another way to collect training data is to reconstruct the overall behavior event sequence for any file given partial telemetry monitoring logs. This may involve sampling and returning random, contiguous blocks of behavioral events. The backend would receive these small blocks of contiguous events from multiple clients and reconstruct the overall behavioral event patterns from these small contiguous blocks of events. This may enable a better understanding of the overall behavior of the files in the near term and enable design of better signatures and classifiers.
Referring to
The method includes initiating a dynamic classifier to evaluate the file 304 and to generate a second classifier output 314, at 312. For example, the dynamic classifier may include the emulation classifier 108 of
The method also includes automatically identifying the file 304 as a potential malware file based on at least the first classifier output 310 and the second classifier output 314, as shown at 316. It should be noted that the classifiers may be run in sequence or in parallel. For example, a static classifier and an emulation classifier may be run in parallel. In a particular embodiment, the classifiers may be run in parallel using different central processing unit (CPU) cores. The method ends at 314.
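Running the classifiers in parallel may be sketched, for illustration only, as follows. The sketch uses a thread pool from the Python standard library. The function name and the representation of each classifier as a callable that returns a score are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Mapping


def run_classifiers_in_parallel(
    file_bytes: bytes,
    classifiers: Mapping[str, Callable[[bytes], float]],
) -> Dict[str, float]:
    """Submit every classifier at once and collect one output per classifier."""
    if not classifiers:
        return {}
    with ThreadPoolExecutor(max_workers=len(classifiers)) as pool:
        futures = {name: pool.submit(fn, file_bytes) for name, fn in classifiers.items()}
        # result() blocks until the corresponding classifier has finished.
        return {name: future.result() for name, future in futures.items()}
```

In CPython, threads generally do not execute CPU-bound Python code on several cores at the same time, so an implementation that is intended to use different CPU cores might substitute a process pool (for example, ProcessPoolExecutor from the same module) with classifiers defined as module-level functions.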
Referring to
The method includes initiating an emulation classifier to evaluate the file 404 and to generate a second classifier output 414, as shown at 412. For example, the emulation classifier may include the emulation classifier 108 of
The method includes initiating a behavioral classifier to evaluate the file 404 and to generate a third classifier output 422, as shown at 420. For example, the behavioral classifier may include the behavioral classifier 110 of
The method also includes automatically identifying the file 404 as potential malware based on at least the first classifier output 410, the second classifier output 414, and the third classifier output 422, as shown at 424. For example, the file 404 may be identified as malware using the anti-malware engine 120 of
Referring to
The method includes receiving a file 504 (e.g., an unknown file) at a client computer, at 502. Alternatively, a plurality of files may be received. For example, the file 504 may include the file 112 of
For example, the action 514 may include blocking execution of the file 504, at 516, or blocking installation of the file 504, as shown at 518. As another example, the action 514 may include providing an indication that the file 504 includes potential malware via a user interface, at 520. For example, the indication may include the indication of potential malware 140 provided to a user via the user interface 138 of the client computer 102 illustrated in
As an additional example, the action 514 may include querying a web service for additional information about the file 504, at 522. For example, the client computer 102 of
Referring to
When the file is not identified as malware, the method proceeds to a static malware classification system, at 616. If the static malware classification system predicts that the file is malware, at 618, then the installation and execution of the file is blocked, at 620. Otherwise, the method proceeds to the emulation malware classification system, at 622.
If the emulation malware classification system predicts that the file is malware, at 624, then the installation and execution of the file is blocked, at 626. Otherwise, the method proceeds to the behavioral malware classification system, at 628. The classifier features from the static malware classification system are provided to the emulation malware classification system, and the classifier features from the emulation malware classification system are provided to the behavioral malware classification system. Thus, one or more features from a previous classifier are passed to the next classifier. For example, static metadata features from the static malware classification system (e.g., checkpointID, file name) may be passed to the emulation malware classification system. Further, one or more statistical outputs from the static malware classification system may be passed to the emulation malware classification system. In addition, one or more features and the classifier outputs from the static malware classification system and the emulation malware classification system are provided to the behavioral malware classification system.
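The staged flow described above, in which each classification system receives the features and outputs of the preceding systems and a malware prediction blocks the file, may be sketched as follows. The stage signature, the blocking threshold, and all names are assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A stage receives the file and everything carried forward from earlier stages,
# and returns its own score together with the features it wants to pass on.
Stage = Callable[[bytes, Dict[str, float]], Tuple[float, Dict[str, float]]]


def run_cascade(
    file_bytes: bytes,
    stages: List[Tuple[str, Stage]],
    block_threshold: float = 0.9,  # assumed cut-off for a malware prediction
) -> Tuple[str, Optional[str], Dict[str, float]]:
    """Run the stages in order, carrying features and scores forward."""
    carried: Dict[str, float] = {}
    for name, stage in stages:
        score, features = stage(file_bytes, dict(carried))
        carried.update(features)
        carried[name + "_score"] = score
        if score >= block_threshold:
            # Stop early: installation and execution would be blocked here.
            return "block", name, carried
    return "allow", None, carried
```

A caller might pass the stages in the order static, emulation, behavioral, so that the behavioral stage sees the features and the scores of both earlier stages.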
Referring to
In the embodiment illustrated, the file may also be analyzed using other static classifiers, at 722. The outputs from the static malware classification system, the static string classifier, and the static code classifier are provided to a hierarchical malware classification system, at 724. The hierarchical malware classification system determines an overall static classification output 726.
Referring to
The computing device 1510 typically includes at least one processing unit 1520 and system memory 1530. Depending on the exact configuration and type of computing device, the system memory 1530 may be volatile (such as random access memory or “RAM”), non-volatile (such as read-only memory or “ROM,” flash memory, and similar memory devices that maintain the data they store even when power is not provided to them) or some combination of the two. The system memory 1530 typically includes an operating system 1532, one or more application platforms 1534, one or more applications 1536 (e.g., the classifier applications described above with reference to
The computing device 1510 may also have additional features or functionality. For example, the computing device 1510 may also include removable and/or non-removable additional data storage devices, such as magnetic disks, optical disks, tape, and standard-sized or miniature flash memory cards. Such additional storage is illustrated in
The computing device 1510 also contains one or more communication connections 1580 that allow the computing device 1510 to communicate with other computing devices 1590, such as one or more client computing systems or other servers, over a wired or a wireless network. The one or more communication connections 1580 are an example of communication media. By way of example, and not limitation, communication media may include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. It will be appreciated, however, that not all of the components or devices illustrated in
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software component executed by a processor, or in a combination of the two. A software component may reside in random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, hard disk, a removable disk, a compact disc read-only memory (CD-ROM), or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an integrated component of a computing device or a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a computing device or user terminal.
Those of skill would further appreciate that the various illustrative logical blocks, configurations, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, configurations, modules, circuits, or steps have been described generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
A software module may reside in computer readable media, such as random access memory (RAM), flash memory, read only memory (ROM), registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium.
Although specific embodiments have been illustrated and described herein, it should be appreciated that any subsequent arrangement designed to achieve the same or similar purpose may be substituted for the specific embodiments shown. This disclosure is intended to cover any and all subsequent adaptations or variations of various embodiments.
The Abstract of the Disclosure is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. In addition, in the foregoing Detailed Description, various features may be grouped together or described in a single embodiment for the purpose of streamlining the disclosure. This disclosure is not to be interpreted as reflecting an intention that the claimed embodiments require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter may be directed to less than all of the features of any of the disclosed embodiments.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the disclosed embodiments. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope possible consistent with the principles and novel features as defined by the following claims.