Attribute-based detection of malicious software and code packers

Information

  • Patent Application
  • 20240338446
  • Publication Number
    20240338446
  • Date Filed
    June 17, 2024
  • Date Published
    October 10, 2024
Abstract
A system and method for detecting malware using hierarchical clustering analysis. Unknown files are classified by clustering in view of known malicious and known safe files. A search is made for similar files using the probabilistic MinHash LSH algorithm applying a Jaccard measure. Machine learning models and detection rules are used to enhance classification accuracy.
Description
TECHNICAL FIELD

The invention pertains to computer systems and the detection of malware and unauthorized activity within those systems.


BACKGROUND

Malicious software penetrates and harms computer systems without the knowledge or consent of the owners. Malware is an ongoing problem in computer security. Attackers have the advantage over defenders because only one vulnerability needs to be found to compromise a system. It is widely understood that keeping system software updated greatly improves security, yet even fully updated systems are vulnerable to zero-day attacks. Despite best efforts to detect exploits quickly and update client systems, cybersecurity defenders always lag behind attackers to some degree. The longer the lag, the more vulnerable client networks become, even when best practices for cybersecurity have been implemented. Estimates suggest that financial losses by companies subjected to attacks amount to billions of dollars each year.


One of the traditional approaches to detecting malicious programs is to compare the signatures of the files under investigation. When antivirus companies detect a new sample, they analyze it and create signatures that are released as an update to clients. At the same time, analysis and updating usually take a long time, during which malware continues to pose a threat.


In addition, modern malware has several polymorphic layers, uses code packaging, obfuscation and other methods to complicate detection and analysis. For a given sample, there can be hundreds or thousands of automatically generated variants.


Faster and more efficient methods of detecting and classifying new samples are needed to compensate for the ever-increasing number of malware variants.


SUMMARY

A system is disclosed for detecting malware and the use of code packing algorithms based on static attributes using partial learning and real-time learning. One problem with clustering is scalability. Classical clustering algorithms usually have a complexity higher than linear; for example, DBSCAN and hierarchical clustering algorithms have time complexity O(n^2) and O(n^3), respectively, where n is the number of objects.


This problem is solved by using a probabilistic algorithm for finding similar objects, MinHash Locality Sensitive Hashing (LSH), which has sublinear complexity. MinHash, a min-wise independent-permutations locality-sensitive hashing scheme, is a technique for quickly estimating how similar two sets are. It is used to search for groups of objects with a distance below the specified threshold t_LSH, which is set during initialization. Next, the preselected exact clustering algorithm is run on the obtained subsets of objects, referred to as “rough clusters.” Costly computations on all data are not needed. Instead, accurate clustering is performed only on a small sample at the cost of an insignificant loss of accuracy. If the sample size is limited, complexity that is linear in the number of features is achieved.


Rough clustering does not require specifying the number of clusters, which may be unknown at the outset. Instead, the method relies on a threshold for distance between objects and agreement on the distance thresholds when performing coarse and fine clustering.


In an embodiment, a method for malware detection in a computing environment is implemented by a microprocessor, a malware collection, and a safe collection. The implementation includes loading test files comprising known safe and known malicious files and performing static analysis of the test files without unpacking them to generate a non-vectorized set of strings and opcodes. Attributes of the test files are filtered based on attribute statistics of the test files. Test files are clustered using a probabilistic algorithm based on similarities calculated with a Jaccard measure. An unknown file is obtained for analysis and similar files are searched for from among the test files using a probabilistic MinHash LSH algorithm that applies the Jaccard measure. The unknown file is entered into an existing cluster, or a new cluster is formed using at least one clustering model derived from the test files. The unknown file is classified based on the results of the clustering and determining whether the classification indicates the use of a packer. Dynamic analysis is performed on the unknown file only if it is classified as packed.


In an alternative method, an unknown file is obtained for analysis. A search is made for similar files from among test files using a probabilistic MinHash LSH algorithm that applies a Jaccard measure. The unknown file is entered into an existing cluster, or a new cluster is formed using at least one clustering model derived from the test files. The unknown file is classified based on the results of the clustering. A determination is made whether the classification indicates the use of a packer. Dynamic analysis is performed on the unknown file only if it is classified as packed. The test files have been analyzed with a static analyzer without unpacking them to generate a non-vectorized set of strings and opcodes and attributes of the test files have been filtered based on attribute statistics of the test files. The test files have been clustered using a probabilistic algorithm based on similarities calculated with the Jaccard measure.


In another embodiment, a system for malware detection for an unknown file in a computing environment includes a microprocessor, an unknown file, a malware file collection, and a safe file collection. The system includes a static analyzer and a first file attributes filter, under program control by a microprocessor. The static analyzer is configured to receive as input the unknown file, the malware collection, or the safe file collection.


Other components of the system include a dynamic analyzer and a second file attributes filter and an n-gram builder under program control by the microprocessor, the dynamic analyzer configured to receive as input the unknown file, the malware collection, or the safe collection. The microprocessor is further configured for program control of a file attributes weight analysis unit comprising an attributes weights assessment unit.


The system further includes a machine-learning clustering unit comprising a clustering model based on a Jaccard measure, in communication with a file attributes analysis unit. The machine learning clustering unit is further configured for applying a file similarity assessment based on probabilistic Min Hash LSH algorithm that applies the Jaccard measure. Other components include a machine learning classifier configured for receiving the results of the machine learning clustering unit and a library, in communication with the classifier, comprising a plurality of machine learning or detection rules. The unknown file can be a packed file and the classifier identifies the unknown file as packed or not packed. The dynamic analyzer operates only on files identified as packed files.


In alternative implementations, labels are assigned to clusters only if all files belonging to the cluster have a label of the same class. In an embodiment, if the files belonging to a cluster do not all have a label of the same class, then the file cluster is not used for classifying the unknown file. In some embodiments, the step of filtering attributes comprises using a frequency filter, which can comprise a frequency in safe files, in malicious files, in the entire sample, and if present in a certain number of objects of both classes. In some embodiments, when a cluster is formed without a label, the cluster's members are not classified. In an embodiment, an unknown file is classified as malicious if the class of the unknown file does not indicate the use of the packer.





SUMMARY OF FIGURES


FIG. 1 shows the components and interactions of a system embodying the invention.



FIG. 2 shows exemplary method steps for implementing the invention.



FIG. 3 shows exemplary results obtained by way of an embodiment of the invention.



FIG. 4 shows exemplary results obtained from an embodiment of the invention used to identify packing algorithms.



FIG. 5 shows the components and interactions of an alternative system embodying the invention.



FIG. 6 shows exemplary steps for an alternative implementation of the invention.





DETAILED DESCRIPTION

A system for implementing a scalable partial learning algorithm is applied to computer files and clusters for automatic attribute-based detection of malicious software and code packers.


Malicious software can be detected with high accuracy and code families can be automatically highlighted based on strings extracted from files. N-grams of opcodes are used to identify the code packing algorithm that was used. This system can be used for automatic marking of new software samples and identification of families, which greatly simplifies and speeds up the analysis.


The system employs machine learning malware classification based on static and dynamic file analysis; clustering of files into groups by statistical similarity of file attributes; filtering of attributes by their contribution to the classification; and detection of packer type using the code structure of malicious files.


In an embodiment, agglomerative (hierarchical) clustering is used as the clustering algorithm. The algorithm operates as follows:

    • 1. Each object is initially considered to be its own cluster.
    • 2. The pair of clusters with the minimum distance between them is found.
    • 3. The found pair is combined into one cluster.
    • 4. Steps 2 and 3 are repeated.


The process is repeated until the number of clusters after step 3 reaches the minimum threshold, or until the minimum distance between clusters found in step 2 exceeds the distance threshold.
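The four steps and the two stopping conditions above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function names, the `min_clusters` parameter, and the brute-force pair search are assumptions made for readability, and the distance between objects is the Jaccard distance that the description defines next.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |A & B| / |A | B|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_distance(c1, c2, items, linkage="average"):
    """Distance between two clusters given as lists of indices into items."""
    dists = [jaccard_distance(items[i], items[j]) for i in c1 for j in c2]
    if linkage == "single":
        return min(dists)            # closest pair of members
    return sum(dists) / len(dists)   # average over all cross pairs


def agglomerative(items, distance_threshold=0.5, min_clusters=1, linkage="average"):
    # Step 1: every object starts as its own cluster.
    clusters = [[i] for i in range(len(items))]
    while len(clusters) > max(min_clusters, 1):
        # Step 2: find the pair of clusters with the minimum distance.
        best = None
        for x, y in combinations(range(len(clusters)), 2):
            d = cluster_distance(clusters[x], clusters[y], items, linkage)
            if best is None or d < best[0]:
                best = (d, x, y)
        d, x, y = best
        # Stop once even the closest pair is farther apart than the threshold.
        if d > distance_threshold:
            break
        # Step 3: merge the pair (x < y, so deleting y keeps x valid).
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters
```

The brute-force search in step 2 is what makes the exact algorithm expensive on large inputs, which is the scalability problem the description addresses with MinHash LSH.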


Since objects are sets, the Jaccard measure is used as a measure of similarity. It is calculated by the following formula:

$$\mathrm{sim}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

where A and B are sets, and at least one is not empty; if both are empty, sim(A,B)=1. Accordingly, the distance is

$$\mathrm{dist}(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}$$

A MinHash algorithm is used to create a compact representation of sets and for quick approximate calculation of distances. The main problem with clustering is scalability. Classical clustering algorithms usually have a complexity higher than linear; for example, DBSCAN and hierarchical clustering algorithms have time complexity O(n^2) and O(n^3), respectively, where n is the number of objects.
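To make the relationship between the exact Jaccard measure and its MinHash approximation concrete, the following sketch computes both. It is an illustration under assumptions: the seeded BLAKE2b hash family, the `num_perm` parameter, and all function names are choices made here, not details taken from the disclosure.

```python
import hashlib


def jaccard_similarity(a, b):
    """Exact Jaccard similarity; defined as 1 when both sets are empty."""
    union = a | b
    return 1.0 if not union else len(a & b) / len(union)


def _hash(seed, item):
    """One member of a seeded hash family (assumed here: 64-bit BLAKE2b)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_perm=128):
    """Compact fixed-length representation: the minimum hash per seed."""
    return tuple(
        min((_hash(seed, x) for x in items), default=2**64)
        for seed in range(num_perm)
    )


def estimate_similarity(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    a = {f"s{i}" for i in range(80)}
    b = {f"s{i}" for i in range(40, 120)}
    print("exact:   ", jaccard_similarity(a, b))
    print("estimate:", estimate_similarity(minhash_signature(a), minhash_signature(b)))
```

The signature length is fixed regardless of set size, so comparing two files costs the same whether they contain ten strings or ten thousand. Longer signatures reduce the estimation error at the cost of more hashing.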


This problem is solved by using a probabilistic algorithm for finding similar objects, MinHash LSH, which has sublinear complexity.


Significant acceleration occurs due to the use of LSH (locality-sensitive hashing) as a coarse clustering algorithm. LSH performs sublinear searches for objects with Jaccard similarity above the set threshold. With high probability, this algorithm hashes similar input objects into the same “buckets.” Objects that have similarity above the set threshold are considered to be similar. This method requires preliminary calculations that are linear in time, while the search for similar objects is constant in time. The algorithm runs relatively quickly. For example, expected times for preprocessing are about 17 seconds, while obtaining rough clusters takes about 12 seconds.


LSH is used to search for groups of objects with a distance below the specified threshold t_LSH, which is set during initialization. Next, a preselected exact clustering algorithm is run on the obtained subsets of objects, which will be referred to as “rough clusters.” This reduces the need to perform costly computations on all available data because accurate clustering can be performed on a small sample with an insignificant loss of accuracy. If the sample size is limited, complexity that is linear in the number of features can be achieved.
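One common way to realize this coarse step is the banding technique: each MinHash signature is cut into bands, and two objects become candidates when any band matches exactly. The sketch below assumes that technique, a seeded BLAKE2b hash family, and a union-find pass that chains candidate pairs into rough clusters; none of these specifics are stated in the disclosure. With b bands of r rows, the similarity at which objects start to collide is approximately (1/b)^(1/r), so 16 bands of 4 rows give a threshold near 0.5.

```python
import hashlib
from collections import defaultdict


def minhash_signature(items, num_perm=64):
    """Minimum of a seeded 64-bit hash per position (assumed hash family)."""
    def h(seed, item):
        data = f"{seed}:{item}".encode("utf-8")
        return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")
    return tuple(min((h(s, x) for x in items), default=2**64) for s in range(num_perm))


def rough_clusters(signatures, bands=16):
    """Group names whose signatures agree on at least one whole band."""
    rows = len(next(iter(signatures.values()))) // bands
    parent = {name: name for name in signatures}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Preprocessing: one pass over all signatures fills the buckets.
    buckets = defaultdict(list)
    for name, sig in signatures.items():
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(name)

    # Objects sharing a bucket are candidates; chain them together.
    for members in buckets.values():
        for other in members[1:]:
            parent[find(other)] = find(members[0])

    groups = defaultdict(set)
    for name in signatures:
        groups[find(name)].add(name)
    return [g for g in groups.values() if len(g) > 1]
```

Each rough cluster would then be passed to the exact clustering algorithm, so the expensive pairwise distance computation only runs inside small groups rather than across the whole collection.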


Rough clustering does not require specifying the number of clusters, which is initially unknown. Instead, only a threshold for distance between objects is required. Agreement on the distance thresholds allows optimal results when performing coarse and fine clustering.


In an embodiment, partial training is used. The peculiarity of partial training is that not all objects have a target label. An obtained data set and the labeled objects are used to define labels for unknown objects. To do this in a rule-based manner, a label is assigned to the cluster. For example, if most of the labeled objects in a cluster belong to one class, then all unlabeled objects in the cluster are assigned to that class.


Usually, when using partial training, the proportion of tagged objects is small. Alternatively, however, a large, tagged sample may be used to classify a relatively small test set. In some instances, the formation of a cluster without a label is possible. In such a case, its members are not classified.


In an embodiment, real-time learning is implemented. MinHash LSH is also used to quickly determine which known objects a new one looks like. This, as with clustering, avoids calculating distances for all known features by working with a small set.


When a new object arrives, it can be added to an existing cluster, form a new one, or remain unclustered. By remaining unclustered, the object does not belong to existing clusters and does not form new ones. In the first two cases, if the labels of the resulting clusters are known, the object can be immediately classified.
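The three outcomes for a new object can be illustrated with a small decision function. This is a sketch under assumptions: `candidates` stands for the identifiers returned by the MinHash LSH lookup, the rule of joining the cluster of the nearest clustered neighbor is one plausible choice, and the data structures and names are hypothetical.

```python
def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|; two empty sets are treated as identical."""
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)


def assign(new_features, candidates, features, cluster_of, threshold=0.5):
    """Decide what happens to a new object.

    candidates: ids returned by the similarity search (assumed precomputed)
    features:   id -> set of attributes of known objects
    cluster_of: id -> cluster id, or None for unclustered objects
    """
    near = sorted(
        (jaccard_distance(new_features, features[c]), c) for c in candidates
    )
    near = [(d, c) for d, c in near if d <= threshold]

    # Case 1: the closest neighbor that already has a cluster decides.
    for _, c in near:
        if cluster_of.get(c) is not None:
            return ("existing", cluster_of[c])
    # Case 2: close neighbors exist but none is clustered, so form a new cluster.
    if near:
        return ("new", [c for _, c in near])
    # Case 3: nothing is close enough.
    return ("unclustered", None)
```

Only the candidates are compared exactly, so the cost per new object depends on the size of the candidate set rather than on the size of the whole collection.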


Thus, the proposed system makes it possible to automatically maintain up-to-date information on a large number of families of executable files, to single out new families, and to classify at least some of the new samples.


In an embodiment, a number of simple frequency filters are applied to each attribute: by its frequency in safe files, in malicious files, and in the entire sample, and by whether it is present in a certain number of objects of both classes.
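The source names the four frequency criteria but not their directions or cutoffs, so the following sketch fills those in as explicit assumptions: rare attributes are dropped by a minimum total count, optional per-class maxima can be set, and attributes common to both classes are dropped as non-discriminating. All parameter names and default values are hypothetical.

```python
from collections import Counter


def attribute_counts(files, labels):
    """Count in how many safe and malicious files each attribute occurs."""
    safe, malicious = Counter(), Counter()
    for name, attrs in files.items():
        target = safe if labels[name] == "safe" else malicious
        target.update(set(attrs))
    return safe, malicious


def filter_attributes(files, labels, min_total=2, max_safe=None,
                      max_malicious=None, both_limit=None):
    """Return files with only the attributes that pass every filter."""
    safe, malicious = attribute_counts(files, labels)
    keep = set()
    for attr in set(safe) | set(malicious):
        s, m = safe[attr], malicious[attr]
        if s + m < min_total:                      # too rare in the entire sample
            continue
        if max_safe is not None and s > max_safe:  # too frequent in safe files
            continue
        if max_malicious is not None and m > max_malicious:
            continue                               # too frequent in malicious files
        if both_limit is not None and s >= both_limit and m >= both_limit:
            continue                               # common to both classes
        keep.add(attr)
    return {name: set(attrs) & keep for name, attrs in files.items()}
```

In this sketch an attribute such as a ubiquitous system library name would be removed by `both_limit`, since it says little about whether a file is safe or malicious.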


Malware detection according to the invention exploits the properties of strings. For example, strings extracted from a file can be used to define a family. The strings.exe application from the Sysinternals set of utilities is used to retrieve the strings. This utility scans the file passed to it for embedded UNICODE (or ASCII) strings, the length of which is 3 or more characters by default.


For example, distances between files of the same family are calculated for three known test file families, “7-zip,” “GIMP,” and “CPU-Z,” and for different values of the minimum number of characters. This determines the optimal minimum number of characters per string, for example, five printable characters. A lower value extracts more noise strings, and the distance between files of the same family grows large. With a larger value, fewer strings are retrieved and some of the information is lost, which is also undesirable because sets are at issue and their elements are checked for matches.
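The effect of the minimum string length can be explored with a simplified extractor. The sketch below is not the Sysinternals tool and makes no claim to match its output exactly; it assumes printable ASCII characters in the range 0x20 to 0x7E and handles only ASCII and UTF-16LE encodings. The function name and default are illustrative.

```python
import re


def extract_strings(data: bytes, min_length: int = 5) -> set:
    """Return the set of printable strings of at least min_length characters."""
    # Runs of printable single-byte ASCII characters.
    ascii_pat = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    # Runs of printable characters encoded as UTF-16LE (char followed by 0x00).
    utf16_pat = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_length)

    found = {m.group().decode("ascii") for m in ascii_pat.finditer(data)}
    found |= {m.group().decode("utf-16-le") for m in utf16_pat.finditer(data)}
    return found
```

Returning a set rather than a list matches the way the description treats files as sets of attributes compared with the Jaccard measure. Lowering `min_length` to 3 admits short fragments that are more likely to be noise.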


As the algorithm for exact clustering in this part of the work, hierarchical clustering is used with the following parameters: the maximum distance between clusters for merging, distance_threshold=0.5, and the method for calculating the distances between clusters, linkage=average, which is the average value of the distances between objects from two clusters. Hierarchical clustering describes an algorithm that groups similar objects into groups, or “clusters.” The algorithm's endpoint is a set of clusters, where each cluster is distinct from the other clusters and the objects within each cluster are similar to each other. For MinHash LSH, the threshold was set according to the fine clustering threshold: t_LSH=distance_threshold=0.5.


To classify files without a target label, their cluster needs to be determined, along with whether the given family is safe or malicious. Labels are assigned to clusters only if all files belonging to the cluster have a label of the same class. In an embodiment, clusters that do not meet this criterion are excluded from consideration and not used for classification of new objects.
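The unanimity rule can be sketched as follows. The data layout (a mapping from cluster identifier to member identifiers, and a mapping of labels for labeled files only) and the function names are assumptions for illustration. Unlabeled members are ignored when checking agreement, which is one reading of the rule.

```python
def label_clusters(clusters, labels):
    """Assign a label to a cluster only if all labeled members agree.

    clusters: cluster id -> list of member file ids
    labels:   file id -> "safe" or "malicious" (labeled files only)
    """
    result = {}
    for cid, members in clusters.items():
        seen = {labels[m] for m in members if m in labels}
        # Exactly one distinct label: adopt it; otherwise leave unlabeled.
        result[cid] = seen.pop() if len(seen) == 1 else None
    return result


def classify(file_id, cluster_of, cluster_labels):
    """Return the label of the file's cluster, or None if it cannot be classified."""
    return cluster_labels.get(cluster_of.get(file_id))
```

A cluster with mixed labels or with no labeled members yields `None`, so its unlabeled members stay unclassified instead of receiving a possibly wrong label.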



FIG. 1 shows an implementation of system 100 for malware detection according to an embodiment of the invention using clustering. Unknown file 102 is analyzed in view of malware collection 104 and safe applications collection 106. Static analyzer of packed samples 108 takes as input the unknown file 102, malware collection 104, and safe applications collection 106. A file attributes filter 110 is applied and the output is delivered for machine learning clustering based on static attributes 112. Clustering model 114 is applied to the samples as well as attributes weights 116. The output of clustering 112 is communicated to machine learning classifier 120, which classifies the results by comparison with machine learning models and detection rules 122. In parallel, dynamic analyzer 124 also receives unknown file 102, malware collection 104, and safe applications collection 106. File attributes filter 126 is applied and the output is delivered to n-gram builder 128. The output of n-gram builder 128 is communicated for machine learning clustering based on dynamic attributes 130. Dynamic clustering comprises application of clustering model 132 and attributes weights 134. The output of clustering 130 is communicated to machine learning classifier 120, which classifies the results by comparison with machine learning models and detection rules 122. In an embodiment, for both static and dynamic analysis, attribute weights 116, 134 are communicated to their respective file attribute filters 110, 126.



FIG. 2 shows the steps 200 of a method carried out in accordance with an embodiment of the invention. At step 202, files are loaded from safe and malicious file collections. At step 204, static analysis is performed on files in the collections without unpacking, forming a non-vectorized set of strings and opcodes for each file. Then at step 206 attributes are filtered based on static file attributes. A search is conducted at step 208 for groups of similar objects with a distance below a specified threshold using a probabilistic algorithm for finding similar objects, MinHash LSH, which has sublinear complexity. Objects are then clustered at step 210 with a focus on identified groups using a machine learning model based on a hierarchical clustering algorithm. With this training completed, an unknown file is selected for analysis at step 212. Static analysis of the unknown file is performed at step 214. A search for files similar to the unknown file is performed using a probabilistic algorithm at step 216. From the results of this search, a determination is made whether to enter the unknown file into an existing cluster or to form a new cluster, and the unknown file is classified at step 218. After classification, a decision is made at step 220 based on whether the class of the unknown file indicates the use of file packing. If “yes,” new file attributes are extracted using a dynamic analyzer at step 222. Alternatively, step 222 comprises unpacking the file with a corresponding algorithm. In either case, the new file attributes are returned to step 216 for searching for similar files. If step 220 results in a “no” answer, the unknown file is marked as malicious according to the classification results at step 224 and its performance of malicious functions is blocked.


In an embodiment, the size of the training sample is 74,180 non-empty files after filtering: 33,826 malicious and 40,354 safe. The size of the test sample is 19,953: 9,969 malicious and 9,984 safe. Thus, the share of files that could be classified is 53.4%. Results 300 are shown in FIG. 3. The statistics obtained for clusters in this exemplary training sample are shown in table 302, where the results are grouped by totals in row 310, “safe” in row 312, “malicious” in row 314, and both classes in row 316. The share of classified files in clusters containing both types of files, which can be described as false positives, is only 0.18%.


The results of using the resulting structure for the classification of new objects are shown in table 304, showing statistics on files. Here the numbers of total files, safe files, and malicious files are shown correlated to rows for existing clusters 320, new clusters 322, and total classified 324. Table 306 of FIG. 3 shows an error matrix where row 330 shows the number of safe files classified as safe and malicious, while row 332 shows the number of malicious files classified as safe and malicious.


The disclosed method classified 11.5% of the files. At the same time, the classification accuracy is 97.0%, and the proportion of false positives is 7.6%.


Statistical differences can be explained by the fact that for the training sample, a single workstation was used as a source of safe files, and a labeled virus collection was used as a source of malicious files. The test sample consists entirely of random files.


In an embodiment, the system and method are used for detection of executable file packers. Executable file packing refers to compression of executable code. Packing allows code to be modified without changing the underlying function of the file. In other words, packing changes what executable code looks like without changing anything about a file's function. The clustering results for strings and n-grams of the sequence of opcodes differ significantly. N-grams of the sequence of opcodes are alternatively referred to as “n-grams of opcodes” or simply “opcodes.” This difference in clustering results is expressed both in the number of files that could be clustered, and in the difference between the clusters obtained by strings and by opcodes.


The results obtained are explained by the fact that when extracting n-grams of opcodes, unpacking of the packed code was not performed. A more detailed analysis showed that with this approach the clustering of packed files occurs not according to the software family, but according to the packing algorithm used. If code has not been packed, then clustering occurs by family, as in the case of using strings.
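An n-gram set over an opcode sequence can be built with a sliding window, as sketched below. The value n=3, the function name, and the mnemonic spellings in the usage note are assumptions; the disclosure does not state which n is used or how opcodes are represented.

```python
def opcode_ngrams(opcodes, n=3):
    """Return the set of n-grams (as tuples) from a sequence of opcodes."""
    # A sequence shorter than n yields an empty set.
    return {tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)}
```

For example, `opcode_ngrams(["push", "mov", "call", "pop", "ret"])` yields three 3-grams. Two files packed with the same packer would typically share the packer's unpacking stub, and therefore many of the same n-grams, even when the payloads differ. This is consistent with the observation that packed files cluster by packing algorithm and not by software family.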


Code wrapping is a common malware technique used to avoid detection and complicate analysis. Malware “wrapped” with a legitimate file is configured so that upon execution, it extracts or installs the legitimate file along with the malware. Conventional programs also use packaging for size reduction and protection. This fact makes it difficult to detect malware directly; however, the task of determining the packaging algorithm used is in itself useful.


For clustering in this embodiment, hierarchical clustering is used with the parameters t_LSH=distance_threshold=0.7 and linkage=‘single’, which is the minimum distance between objects from two clusters. A cluster tag was assigned if more than half of the cluster's members belong to the same class.


In this exemplary embodiment, 43182 malicious files comprise a training sample. The results obtained for the 10 most common packers in the available data are presented in FIG. 4, which shows table 400 presenting results from using clustering for identifying packing algorithms 402. Column 404 shows the packers analyzed, column 406 shows the total number of files, and column 408 shows the number of clusters per packer. For each packer, table 400 further shows the total number of files in the received clusters 410 and the number of files using the packer in clusters 412. Classification accuracy for each packer is stated in column 414, while the correct portion of clustered files is given in column 416.


Out of 43182 files, 38989 were in clusters. Of these, 38001 (97.5%) are classified correctly. The results obtained demonstrate the possibility of determining the packing algorithm with high accuracy by the proposed method.
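The stated percentage follows directly from the counts above. The clustered share is not given in the text and is derived here from the same figures:

$$\frac{38001}{38989} \approx 0.975, \qquad \frac{38989}{43182} \approx 0.903$$

That is, about 90.3% of the files fell into clusters, and 988 of the clustered files (38989 − 38001) were assigned to the wrong packer.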



FIG. 5 shows system 500 for malware detection according to an embodiment of the invention using clustering. Unknown file 502 is analyzed in view of malware collection 504 and safe applications collection 506. Static and dynamic analyzer 508 takes as input the unknown file 502, malware collection 504, and safe applications collection 506. File attributes analysis unit 510 comprising attribute weights assessment unit 512 is applied and the output is delivered to machine learning clustering unit 518. Clustering model based on Jaccard measure 516 is applied to the samples as well as file similarity assessment based on probabilistic MinHash LSH algorithm applying the Jaccard measure 514.


Attributes Database 513 is a database that serves as a repository for the attributes and their associated weights. Attributes database 513 enables quick retrieval and updating of attribute data, supporting the dynamic nature of the system as it learns and evolves with new data.


The output of machine learning clustering unit 518 is communicated to machine learning classifier 520, which classifies the results by comparison with machine learning models and detection rules 522.


The components shown in FIG. 5 (and FIG. 1) comprise software modules under microprocessor control, and are built in a suitable programming language, such as Python, adapted for file processing and applying machine-learning tools.


File attributes analysis unit 510, for example, is configured to analyze file attributes. This file attributes analysis unit 510 evaluates both static file attributes derived from static analysis of files and dynamic file attributes derived from dynamic analysis. Attributes weights assessment unit 512 is a subunit of file attributes analysis unit 510. Subunit 512 conducts an assessment and assigns weights to a plurality of file attributes. The weights assigned reflect the significance of each attribute in distinguishing between malware and benign applications. The weights also distinguish between packed and unpacked files. This is done by, for example, evaluating the frequency and uniqueness of attributes in relevant datasets.


Machine learning clustering unit 518 receives similarity assessments and carries out the actual clustering of files. Clustering unit 518 uses machine learning techniques to dynamically learn from and adapt to the characteristics of malware and benign files as indicated by the clustering model. Subunits of clustering unit 518 include unit 514, which uses a probabilistic MinHash Locality Sensitive Hashing (LSH) algorithm applying the Jaccard measure. Another subunit of clustering unit 518 is clustering model based on Jaccard measure 516. The integration of the Jaccard measure into the clustering model allows for a mathematically robust approach to grouping files. By considering the weighted attributes, this model clusters files based on their overall similarity.


Machine learning classifier 520 uses the results from clustering unit 518 to classify unknown files. Classifier 520 compares these results against known machine-learning models and detection rules stored within database 522 to determine if an unknown file is likely to be malware.



FIG. 6 shows an implementation 600 of the system shown in FIG. 5, using the same or similar software and hardware tools. At 602, test files comprising known safe and known malicious files are loaded for analysis. Static analysis of file collections is performed without unpacking the files at 604. Also at 604, a non-vectorized set of strings and opcodes is generated. Then at 606, attributes of the test files are filtered based on attribute statistics of those test files. The test files are clustered at 608 using a probabilistic algorithm based on similarities calculated with a Jaccard measure.


The implementation continues at 610 when an unknown file is obtained for analysis. A search is made at 612 for files similar to the unknown file using the probabilistic MinHash LSH algorithm applying a Jaccard measure. The unknown file is entered into an existing cluster using a clustering model derived from the test files at 616. At 618, a determination is made whether the class of the unknown file indicates the use of a file packer. If a packer is indicated, at 620 new file attributes are extracted using a dynamic analyzer. Alternatively, at 620 the file is unpacked with an unpacking algorithm. If no packer is indicated at 618, the unknown file is marked as malicious according to results of classification at 622. At 622, the unknown file can also be blocked from execution.

Claims
  • 1. A method for malware detection in a computing environment, implemented by at least one microprocessor, a malware collection, and a safe collection, the method comprising: loading test files comprising known safe and known malicious files; performing static analysis of the test files without unpacking them to generate a non-vectorized set of strings and opcodes; filtering attributes of the test files based on attribute statistics of the test files; clustering the test files using a probabilistic algorithm based on similarities calculated with a Jaccard measure; obtaining an unknown file for analysis and searching for similar files from among the test files using a probabilistic MinHash LSH algorithm that applies the Jaccard measure; entering the unknown file into an existing cluster or forming a new cluster using at least one clustering model derived from the test files; and classifying the unknown file based on the results of the clustering and determining whether the classification indicates the use of a packer; wherein dynamic analysis is performed on the unknown file only if it is classified as packed.
  • 2. The method of claim 1, wherein labels are assigned to clusters only if all files belonging to the cluster have a label of the same class.
  • 3. The method of claim 1, wherein if all files belonging to a cluster do not have a label of the same class, then the file cluster is not used for classifying the unknown file.
  • 4. The method of claim 1, wherein the step of filtering attributes comprises using a frequency filter.
  • 5. The method of claim 4, wherein the frequency filter comprises a frequency in safe files, in malicious files, in the entire sample, and if present in a certain number of objects of both classes.
  • 6. The method of claim 1, wherein when a cluster is formed without a label, the cluster's members are not classified.
  • 7. The method of claim 1, wherein the unknown file is classified as malicious if the class of the unknown file does not indicate the use of the packer.
  • 8. A system for malware detection for an unknown file in a computing environment with at least one microprocessor, an unknown file, a malware file collection, and a safe file collection, the system comprising: a static analyzer and a first file attributes filter, under program control by the at least one microprocessor, the static analyzer configured to receive as input the unknown file, the malware collection, or the safe file collection; a dynamic analyzer and a second file attributes filter and an n-gram builder under program control by the at least one microprocessor, the dynamic analyzer configured to receive as input the unknown file, the malware collection, or the safe collection; wherein the at least one microprocessor is further configured for program control of a file attributes weight analysis unit comprising an attributes weights assessment unit: a machine-learning clustering unit comprising a clustering model based on a Jaccard measure, in communication with file attributes analysis unit; wherein the machine learning clustering unit further configured for applying a file similarity assessment based on probabilistic Min Hash LSH algorithm that applies the Jaccard measure; a machine learning classifier configured for receiving the results of the machine learning clustering unit; and a library, in communication with the classifier, comprising a plurality of machine learning or detection rules; wherein the unknown file is a packed file and the classifier identifies the unknown file as packed or not packed; and wherein the dynamic analyzer operates only on files identified as packed files.
  • 9. The system of claim 8, wherein the dynamic analyzer under program control by the at least one microprocessor is configured to extract new file attributes from the packed file.
  • 10. The system of claim 8, wherein the file attributes analysis unit is coupled to an attributes database.
  • 11. The system of claim 10, wherein the clustering model based on the Jaccard measure is coupled to the attributes database.
  • 12. The system of claim 11, wherein the file attributes analysis unit is configured to access the attributes database to update attribute data.
  • 13. The system of claim 8, wherein the machine learning classifier is configured to classify the unknown file as malicious if the class of the unknown file does not indicate the use of the packer.
  • 14. A method for malware detection in a computing environment, implemented by at least one microprocessor, the method comprising: obtaining an unknown file for analysis and searching for similar files from among test files using a probabilistic MinHash LSH algorithm that applies a Jaccard measure; entering the unknown file into an existing cluster or forming a new cluster using at least one clustering model derived from the test files; and classifying the unknown file based on the results of the clustering and determining whether the classification indicates the use of a packer; wherein dynamic analysis is performed on the unknown file only if it is classified as packed; wherein the test files have been analyzed with a static analyzer without unpacking them to generate a non-vectorized set of strings and opcodes and attributes of the test files have been filtered based on attribute statistics of the test files; and wherein the test files have been clustered using a probabilistic algorithm based on similarities calculated with the Jaccard measure.
  • 15. The method of claim 14, wherein labels are assigned to a cluster only if all files belonging to the cluster have a label of the same class.
  • 16. The method of claim 14, wherein if all files belonging to the cluster do not have a label of the same class, then the cluster is not used for classifying the unknown file.
  • 17. The method of claim 14, wherein the step of filtering attributes comprises using a frequency filter.
  • 18. The method of claim 17, wherein the frequency filter comprises a frequency in safe files, in malicious files, in the entire sample, and if present in a certain number of objects of both classes.
  • 19. The method of claim 14, wherein when a cluster is formed without a label, the cluster's members are not classified.
  • 20. The method of claim 14, wherein the unknown file is classified as malicious if the class of the unknown file does not indicate the use of the packer.
Continuation in Parts (1)
Relation  Number    Date      Country
Parent    17449608  Sep 2021  US
Child     18744788            US