This application claims priority to Republic of Korea Patent Application No. 10-2018-0106470, filed on 6 September 2018 in the Korean Intellectual Property Office, the entire contents of which are hereby incorporated by reference.
The present invention relates to verification of a malicious code machine learning classification model, and more particularly, to an apparatus and a method for verifying a malicious code machine learning classification model, which can secure the reliability of a machine learning classification model by deriving predictive information for a file suspected of maliciousness through various machine learning models such as CNN and DNN, and by verifying the derived predictive information through multi-layer cyclic verification that performs single or multiple similarity discriminations based on the results of static and dynamic analysis of the malicious suspicious file.
The quantity of new or variant malicious codes increases day by day, and there are limits, in terms of manpower, time, and other resources, to analyzing this growing quantity manually. Various modeling and analysis methods using machine learning have therefore been proposed. However, there remains the problem of securing the reliability of the predictive information discriminated by machine learning.
Accordingly, a variety of studies are needed to verify the reliability of a machine learning model for classifying malicious codes and to secure the reliability of its prediction results.
The present invention has been made in an effort to provide an apparatus for verifying a malicious code machine learning classification model, which verifies a machine learning model that classifies malicious codes through inter-file multi-layer cyclic verification and ensures the reliability of a prediction result of the machine learning model.
The present invention has also been made in an effort to provide a method for verifying a malicious code machine learning classification model for verifying a machine learning model that classifies malicious codes through inter-file multi-layer cyclic verification and ensuring reliability for a prediction result of the machine learning model.
An exemplary embodiment of the present invention provides an apparatus for verifying a malicious code machine learning classification model, which includes: a main feature processing subsystem performing feature extraction and processing functions on an input file; and a multi-layer cyclic verification subsystem performing multi-layer verification in order to determine whether the file is normal or malicious based on the extracted and processed features.
In the apparatus for verifying a malicious code machine learning classification model according to an exemplary embodiment of the present invention, the main feature processing subsystem may include a feature extraction module extracting features related to static analysis information which may be obtained without execution of the file and features related to dynamic analysis information which may be obtained through execution of the file, and a main feature processing module selecting and categorizing main features which may be used at the time of performing a malicious action among the extracted features related to the static analysis information and features related to the dynamic analysis information.
In the apparatus for verifying a malicious code machine learning classification model according to an exemplary embodiment of the present invention, the multi-layer cyclic verification subsystem may include a main feature relative comparison module comparing the selected main features with the main features of the normal files and the main features of the malicious files, respectively to calculate the normal similarity rate and the malicious similarity rate, an operation sequence based comparison modeling module comparing the features related to the operation sequence among the selected main features with the features related to the operation sequence of the normal files and the features related to the operation sequence of the malicious files, respectively to calculate the normal similarity rate and the malicious similarity rate, a function sequence based comparison modeling module comparing the features related to the function sequence among the selected main features with the features related to the function sequence of the normal files and the features related to the function sequence of the malicious files, respectively to calculate the normal similarity rate and the malicious similarity rate, and a determination unit determining whether the malicious suspicious file is normal or malicious by computing the final normal similarity rate and the final malicious similarity rate based on the normal similarity rate and the malicious similarity rate calculated by the main feature relative comparison module, the normal similarity rate and the malicious similarity rate calculated by the operation sequence based comparison modeling module, and the normal similarity rate and the malicious similarity rate calculated by the function sequence based comparison modeling module and by comparing the final normal similarity rate and the final malicious similarity rate.
In the apparatus for verifying a malicious code machine learning classification model according to an exemplary embodiment of the present invention, the main feature relative comparison module may perform an operation of acquiring the number of categories whose contents match each other by comparing contents of the main features classified for each selected category with the contents of the main features of the normal files and the contents of the main features of the malicious files, respectively, an operation of generating feature vectors by setting the categories whose contents match each other to 1 and setting the categories whose contents do not match each other to 0 based on the comparison result, an operation of computing a similarity rate for each feature by comparing the features of the categories whose contents match each other with the main features of the normal files and the main features of the malicious files, respectively in unit of block based on the number of categories whose contents match each other, and an operation of calculating the normal similarity rate for the normal file and the malicious similarity rate for the malicious file based on the feature vectors and the similarity rate for each feature.
In the apparatus for verifying a malicious code machine learning classification model according to an exemplary embodiment of the present invention, the operation sequence based comparison modeling module may perform an operation of converting the features related to the operation sequence among the selected main features into N-gram, an operation of generating an action vector through feature hashing for the features related to the operation sequence converted into the N-gram, and an operation of comparing the generated action vectors with action vectors related to the operation sequence of the normal files and action vectors related to the operation sequence of the malicious files in unit of block and calculating the normal similarity rate and the malicious similarity rate.
In the apparatus for verifying a malicious code machine learning classification model according to an exemplary embodiment of the present invention, the function sequence based comparison modeling module may perform an operation of preprocessing the features related to the function sequence among the selected main features, an operation of converting the preprocessed features related to the function sequence into N-gram, and an operation of comparing the features related to the function sequence converted into the N-gram with the features related to the function sequence of the normal files converted into the N-gram and the features related to the function sequence of the malicious files, respectively to calculate the normal similarity rate and the malicious similarity rate.
The apparatus for verifying a malicious code machine learning classification model according to an exemplary embodiment of the present invention may further include a machine learning model verification unit verifying the reliability of the machine learning modeling module by comparing a result of predicting whether the file is normal or malicious, which is predicted through the machine learning modeling module, with a result of determining whether the file is normal or malicious, which is output from the multi-layer cyclic verification subsystem.
Another exemplary embodiment of the present invention provides a method for verifying a malicious code machine learning classification model, which includes: (a) performing feature extraction and processing functions on an input file; and (b) performing multi-layer verification in order to determine whether the file is normal or malicious based on the extracted and processed features.
In the method for verifying a malicious code machine learning classification model according to an exemplary embodiment of the present invention, step (a) may include (a-1) extracting features related to static analysis information which may be obtained without execution of the file and features related to dynamic analysis information which may be obtained through execution of the file, and (a-2) selecting and categorizing main features which may be used at the time of performing a malicious action among the extracted features related to the static analysis information and features related to the dynamic analysis information.
In the method for verifying a malicious code machine learning classification model according to an exemplary embodiment of the present invention, step (b) may include (b-1) comparing the selected main features with the main features of the normal files and the main features of the malicious files, respectively to calculate the normal similarity rate and the malicious similarity rate, (b-2) comparing the features related to the operation sequence among the selected main features with the features related to the operation sequence of the normal files and the features related to the operation sequence of the malicious files, respectively to calculate the normal similarity rate and the malicious similarity rate, (b-3) comparing the features related to the function sequence among the selected main features with the features related to the function sequence of the normal files and the features related to the function sequence of the malicious files, respectively to calculate the normal similarity rate and the malicious similarity rate, and (b-4) computing the final normal similarity rate and the final malicious similarity rate based on the normal similarity rates and the malicious similarity rates calculated in steps (b-1) to (b-3) and determining whether the malicious suspicious file is normal or malicious by comparing the final normal similarity rate and the final malicious similarity rate.
In the method for verifying a malicious code machine learning classification model according to an exemplary embodiment of the present invention, step (b-1) may include acquiring the number of categories whose contents match each other by comparing contents of the main features classified for each selected category with the contents of the main features of the normal files and the contents of the main features of the malicious files, respectively, generating feature vectors by setting the categories whose contents match each other to 1 and setting the categories whose contents do not match each other to 0 based on the comparison result, computing a similarity rate for each feature by comparing the features of the categories whose contents match each other with the main features of the normal files and the main features of the malicious files, respectively in unit of block based on the number of categories whose contents match each other, and calculating the normal similarity rate for the normal file and the malicious similarity rate for the malicious file based on the feature vectors and the similarity rate for each feature.
In the method for verifying a malicious code machine learning classification model according to an exemplary embodiment of the present invention, step (b-2) may include converting the features related to the operation sequence among the selected main features into N-gram, generating an action vector through feature hashing for the features related to the operation sequence converted into the N-gram, and comparing the generated action vectors with action vectors related to the operation sequence of the normal files and action vectors related to the operation sequence of the malicious files in unit of block and calculating the normal similarity rate and the malicious similarity rate.
In the method for verifying a malicious code machine learning classification model according to an exemplary embodiment of the present invention, step (b-3) may include preprocessing the features related to the function sequence among the selected main features, converting the preprocessed features related to the function sequence into N-gram, and comparing the features related to the function sequence converted into the N-gram with the features related to the function sequence of the normal files converted into the N-gram and the features related to the function sequence of the malicious files, respectively to calculate the normal similarity rate and the malicious similarity rate.
The method for verifying a malicious code machine learning classification model according to an exemplary embodiment of the present invention may further include, after step (b), verifying the reliability of the machine learning modeling module by comparing a result of predicting whether the file is normal or malicious, which is predicted through the machine learning modeling module with the result determined in step (b).
According to an exemplary embodiment of the present invention, an apparatus and a method for verifying a malicious code machine learning classification model can verify a machine learning model that classifies malicious codes, thereby ensuring reliability for a prediction result of the machine learning model.
The objects, specific advantages, and new features of the present invention will be more clearly understood from the following detailed description and the exemplary embodiments taken in conjunction with the accompanying drawings.
Terms or words used in the present specification and claims should not be interpreted as being limited to typical or dictionary meanings, but should be interpreted as having meanings and concepts which comply with the technical spirit of the present disclosure, based on the principle that an inventor can appropriately define the concept of the term to describe his/her own invention in the best manner.
In the present specification, when reference numerals refer to components of each drawing, it is to be noted that the same components are denoted by the same reference numerals as much as possible, even though they are illustrated in different drawings.
The terms “first”, “second”, “one surface”, “other surface”, etc. are used to distinguish one component from another component and the components are not limited by the terms.
Hereinafter, in describing the present invention, a detailed description of related known art which may make the gist of the present invention unnecessarily ambiguous will be omitted.
Hereinafter, an exemplary embodiment of the present invention will be described in detail with reference to the accompanying drawings.
An apparatus 100 for verifying a malicious code machine learning classification model according to an exemplary embodiment of the present invention is illustrated in the accompanying drawings.
The machine learning modeling module 108 derives predictive information for the file suspected of maliciousness, that is, predicts whether the file is a normal file or a malicious file, based on various machine learning models including a Convolutional Neural Network (CNN), a Deep Neural Network (DNN), and the like.
Referring to the accompanying drawings, the apparatus 100 includes a main feature processing subsystem 102 performing feature extraction and processing functions on an input file, and a multi-layer cyclic verification subsystem 104 performing multi-layer verification in order to determine whether the file is normal or malicious based on the extracted and processed features. The main feature processing subsystem 102 includes a feature extraction module 200 and a main feature processing module 202.
The multi-layer cyclic verification subsystem 104 includes a main feature relative comparison module 204 performing multiple analysis using main meta information, an operation sequence based comparison modeling module 206 performing comparison based on features related to an operation sequence of files, a function sequence based comparison modeling module 208 performing comparison based on features related to a function sequence of the files, and a determination unit 210 determining whether the malicious suspicious file is normal or malicious by computing a final normal similarity rate and a final malicious similarity rate based on a normal similarity rate and a malicious similarity rate calculated by the main feature relative comparison module 204, the normal similarity rate and the malicious similarity rate calculated by the operation sequence based comparison modeling module 206, and the normal similarity rate and the malicious similarity rate calculated by the function sequence based comparison modeling module 208 and comparing the final normal similarity rate and the final malicious similarity rate.
Referring to the accompanying drawings, the overall operation of the apparatus 100 proceeds as follows (an illustrative code sketch of this flow is given after the list).
1) The machine learning modeling module 108 outputs the prediction result by predicting whether the malicious suspicious file is a normal file or a malicious file through various machine learning algorithms such as DNN/CNN.
2) The main feature processing subsystem 102 extracts static and dynamic features from the malicious suspicious file and selects main features among the extracted static and dynamic features in order to verify the prediction result of the machine learning modeling module 108.
3) The multi-layer cyclic verification subsystem 104 performs multi-layer cyclic verification using the selected main features. The multi-layer cyclic verification subsystem 104 outputs a determination result and a similarity rate indicating whether the malicious suspicious file is the normal file or the malicious file.
4) The machine learning model verification unit 106 verifies reliability for the prediction result of the machine learning modeling module 108 by checking a similarity between a value obtained through the multi-layer verification by the multi-layer cyclic verification subsystem 104 and the determination result output by the machine learning modeling module 108.
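For illustration only, the four-step flow above may be sketched in Python as follows; all class, function, and method names here are hypothetical and are not specified in the present disclosure.

```python
# Illustrative sketch of the four-step flow above. All class and method
# names are hypothetical; the disclosure does not specify an implementation.

def verify_classification(suspicious_file, ml_model, feature_subsystem, cyclic_verifier):
    # 1) The machine learning modeling module predicts normal or malicious.
    prediction = ml_model.predict(suspicious_file)            # "normal" / "malicious"

    # 2) The main feature processing subsystem extracts static and dynamic
    #    features and selects the main features.
    features = feature_subsystem.extract(suspicious_file)
    main_features = feature_subsystem.select_main(features)

    # 3) The multi-layer cyclic verification subsystem outputs its own
    #    determination and a similarity rate.
    determination, similarity_rate = cyclic_verifier.verify(main_features)

    # 4) The model verification unit checks whether both results agree.
    reliable = (prediction == determination)
    return prediction, determination, similarity_rate, reliable
```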
Referring to the accompanying drawings, the operation of the malicious code machine learning classification model verification apparatus 100 according to an exemplary embodiment of the present invention will be described below in detail.
First, the machine learning modeling module 108 performs modeling through algorithms such as CNN and DNN and predicts and outputs a normal or abnormal (malicious) result for a malicious suspicious file requested for analysis.
As illustrated in the accompanying drawings, the feature extraction module 200 extracts the features related to the static analysis information that may be obtained without execution of the malicious suspicious file and the features related to the dynamic analysis information that may be obtained through execution of the malicious suspicious file.
As illustrated in the accompanying drawings, the main feature processing module 202 selects and categorizes the main features which may be used at the time of performing the malicious action among the extracted features related to the static analysis information and the features related to the dynamic analysis information.
Detailed items of the main features are shown in Table 1.
Referring to the accompanying drawings, the multi-layer cyclic verification subsystem 104 performs multi-layer cyclic verification using the selected main features in order to determine whether the malicious suspicious file is normal or malicious.
In detail, the multi-layer cyclic verification subsystem 104 performs a total of three similarity comparison operations: main feature relative comparison by the main feature relative comparison module 204, operation sequence based comparison by the operation sequence based comparison modeling module 206, and function sequence based comparison by the function sequence based comparison modeling module 208. The determination unit 210 computes the final normal similarity rate and the final malicious similarity rate by applying specific weights to the respective results. For example, the determination unit 210 acquires the final normal similarity rate and the final malicious similarity rate by applying a weight of 20% to the result of the main feature relative comparison, a weight of 40% to the result of the operation sequence based comparison, and a weight of 40% to the result of the function sequence based comparison.
According to the present invention, since whether the corresponding file is normal or malicious is determined by assigning a higher weight to action based comparison, such as the operation sequence and the function sequence, than to relative comparison of the features, a reliable result may be derived. In addition, the determination unit 210 compares the final normal similarity rate with the final malicious similarity rate and determines the malicious suspicious file to be a normal file or a malicious file based on the larger similarity rate.
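A minimal Python sketch of this weighted combination, assuming the example weights above, is as follows; function and variable names are illustrative and not part of the original disclosure.

```python
# Minimal sketch of the determination unit's weighted combination, using
# the example weights from the text (20% / 40% / 40%). Names are illustrative.

WEIGHTS = {"relative": 0.2, "operation_seq": 0.4, "function_seq": 0.4}

def final_rates(rates):
    # rates maps each comparison to a (normal_rate, malicious_rate) pair,
    # e.g., {"relative": (62.0, 80.0), "operation_seq": (55.0, 91.0), ...}
    final_normal = sum(w * rates[k][0] for k, w in WEIGHTS.items())
    final_malicious = sum(w * rates[k][1] for k, w in WEIGHTS.items())
    return final_normal, final_malicious

def determine(rates):
    normal, malicious = final_rates(rates)
    # The file is classified according to the larger final similarity rate.
    return "malicious" if malicious > normal else "normal"
```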
The operation of the multi-layer cyclic verification subsystem 104 will be described below in detail.
Referring to the accompanying drawings, the main feature relative comparison module 204 first compares the contents of the main features classified for each selected category with the contents of the main features of the normal files and the contents of the main features of the malicious files, respectively, to acquire the number of categories whose contents match each other (operation S500).
Next, the main feature relative comparison module 204 sets each category whose contents exactly match to 1 and each category whose contents do not exactly match to 0 based on the comparison result in operation S500, to generate a feature vector according to the categories (operation S502). For example, if feature 2, feature 6, and feature 8 exactly match as the result of comparing the selected main features (the target file features in the drawings) with the main features of a reference file, the feature vector has a value of 1 at those features and a value of 0 at the remaining features.
Next, the main feature relative comparison module 204 performs classification according to the similarity for each category (operation S504), and computes the similarity rate for each feature by comparing the features of the categories whose contents match with the main features of the normal files and the main features of the malicious files, respectively, in units of blocks through fuzzy hash comparison according to the number of categories whose contents match (operation S506). For example, when the number of matching categories is 6, in order to enhance accuracy, the features of the matching categories are compared, in units of blocks, with the main features of the normal files and the malicious files in which the number of matching categories is also 6, to compute the similarity rate for each feature.
Next, the main feature relative comparison module 204 calculates the similarity rate for the normal file based on the feature vectors and the similarity rate for each feature (operation S508) and calculates the similarity rate for the malicious file (operation S510).
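A minimal sketch of operations S500 to S506 might look as follows in Python. The use of the ssdeep library is an assumption (the disclosure specifies only "fuzzy hash" comparison), and the category layout is hypothetical.

```python
# Sketch of operations S500-S506: build a 0/1 feature vector from
# category-wise exact matches, then compute a per-feature similarity rate
# with a fuzzy hash. ssdeep is one possible fuzzy-hash library; the
# disclosure specifies only "fuzzy hash comparison".
import ssdeep  # third-party: pip install ssdeep

def feature_vector(target, reference):
    # target/reference: {category_name: content_bytes}
    # 1 where the contents match exactly, 0 otherwise (operation S502).
    return [1 if target.get(c) == reference.get(c) else 0 for c in sorted(target)]

def per_feature_similarity(target, reference):
    # Block-unit fuzzy comparison of each shared category (operation S506);
    # ssdeep.compare returns a 0-100 similarity rate.
    return {c: ssdeep.compare(ssdeep.hash(target[c]), ssdeep.hash(reference[c]))
            for c in target if c in reference}
```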
In order to calculate the normal similarity rate and the malicious similarity rate as illustrated in the accompanying drawings, a feature based similarity score (reference numerals 604 and 606) is computed for each feature as follows.
When the features exactly match each other, one point is assigned; when the features do not exactly match each other, no score is assigned. Further, when features that are regarded as main indicators of normality or maliciousness match each other, the score is doubled (×2).
Even when the features do not exactly match each other, partial credit is given to features that are important in discriminating whether the file is normal or malicious. Accordingly, for such important features, the fuzzy hash similarity rate, i.e., the feature based similarity rate (e.g., reference numeral 600), is reflected in the score.
As illustrated in the accompanying drawings, the normal similarity rate 608 and the malicious similarity rate 610 are then computed as follows.
A normal similarity rate 608 is computed as (the sum 605 of the feature based similarity scores 604 / the maximum score value obtainable from the normal file) × 100.

A malicious similarity rate 610 is computed as (the sum 607 of the feature based similarity scores 606 / the maximum score value obtainable from the malicious file) × 100.

The maximum score value obtainable from the normal file is (10 features other than the main features among the normal file features × 1) + (5 main features among the normal file features × 2) = 20.

The maximum score value obtainable from the malicious file is (3 features other than the main features among the malicious file features × 1) + (12 main features among the malicious file features × 2) = 27.
Accordingly, in the illustrated case, the normal similarity rate 608 and the malicious similarity rate 610 are computed against maximum score values of 20 and 27, respectively.
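The scoring rule above can be illustrated with a short Python sketch; the per-feature inputs and the example total score are hypothetical, while the maximum score values follow the example in the text.

```python
# Worked sketch of the scoring rule: one point per exact match, doubled
# for main features, and partial credit from the fuzzy-hash rate for main
# features that do not match exactly. Inputs below are hypothetical.

def feature_score(matched, is_main, fuzzy_rate=0.0):
    if matched:
        return 2.0 if is_main else 1.0
    # Important (main) features earn partial credit from the fuzzy-hash
    # similarity rate even without an exact match.
    return 2.0 * (fuzzy_rate / 100.0) if is_main else 0.0

# Maximum obtainable scores from the example in the text:
max_normal = 10 * 1 + 5 * 2      # = 20
max_malicious = 3 * 1 + 12 * 2   # = 27

def similarity_rate(total_score, max_score):
    return total_score / max_score * 100.0

# e.g., a hypothetical total score of 13 against the normal file:
print(similarity_rate(13, max_normal))   # 65.0
```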
Referring to the accompanying drawings, the operation sequence based comparison modeling module 206 first converts the features related to the operation sequence among the selected main features into N-grams in order to easily determine the sequence (operation S700).
Next, the operation sequence based comparison modeling module 206 generates a hash table having a size of 4096 bytes through feature hashing of the N-gram-converted features related to the operation sequence. Since a value in the hash table may become excessively large or small due to frequently called operations, the module generates an action vector by normalizing the values to −1, 0, and 1 (operation S702).
Next, the operation sequence based comparison modeling module 206 compares the generated action vector with the action vectors related to the operation sequences of the normal files and the action vectors related to the operation sequences of the malicious files in units of blocks, and calculates the normal similarity rate and the malicious similarity rate (operation S704).
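A minimal Python sketch of operations S700 to S704 follows. The N-gram size, the hash function, and the mean-based normalization threshold are assumptions; the disclosure specifies only a 4096-sized hash table generated by feature hashing and normalization of values to −1, 0, and 1.

```python
# Sketch of operations S700-S704: N-gram conversion, feature hashing into
# a fixed 4096-slot table, normalization to {-1, 0, 1}, and block-unit
# vector comparison. The N-gram size, hash function, and mean-based
# normalization threshold are assumptions.
import hashlib

TABLE_SIZE = 4096

def action_vector(operations, n=3):
    table = [0] * TABLE_SIZE
    for i in range(len(operations) - n + 1):
        gram = "|".join(operations[i:i + n])
        slot = int(hashlib.md5(gram.encode()).hexdigest(), 16) % TABLE_SIZE
        table[slot] += 1
    mean = sum(table) / TABLE_SIZE
    # Frequently called operations would dominate the raw counts, so each
    # slot is normalized to -1, 0, or 1.
    return [0 if c == 0 else (1 if c > mean else -1) for c in table]

def vector_similarity(v1, v2):
    # Fraction of matching slots, expressed as a percentage.
    return sum(1 for a, b in zip(v1, v2) if a == b) / TABLE_SIZE * 100.0
```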
Referring to the accompanying drawings, the function sequence based comparison modeling module 208 first preprocesses the features related to the function sequence among the selected main features (operation S800).
Next, the function sequence based comparison modeling module 208 converts the preprocessed features related to the function sequence into N-grams in order to easily determine the sequence (operation S802), and compares the features related to the function sequence converted into the N-grams with the features related to the function sequences of the normal files converted into N-grams and those of the malicious files, respectively, by using a cosine similarity technique to calculate the normal similarity rate and the malicious similarity rate (operation S804).
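A minimal Python sketch of operations S800 to S804 follows; the preprocessing steps and the N-gram size are assumptions, while the cosine similarity comparison follows the text.

```python
# Sketch of operations S800-S804: preprocess the function-call sequence,
# convert it to N-gram frequency vectors, and compare with cosine
# similarity. The preprocessing and N-gram size are assumptions.
import math
from collections import Counter

def function_ngrams(calls, n=3):
    calls = [c.strip().lower() for c in calls]   # assumed preprocessing
    return Counter(tuple(calls[i:i + n]) for i in range(len(calls) - n + 1))

def cosine_similarity(a, b):
    dot = sum(a[g] * b[g] for g in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) * 100.0 if norm_a and norm_b else 0.0
```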
In an exemplary embodiment of the present invention, assuming that the similarity rates are calculated as illustrated in the accompanying drawings, the determination unit 210 computes the final normal similarity rate and the final malicious similarity rate from them by applying the weights described above.
For example, when the machine learning modeling module 108 predicts that the malicious suspicious file is malicious with a model determination accuracy of 94%, the probability of misidentification is 6%, and the malicious code machine learning classification model verification apparatus 100 according to an exemplary embodiment of the present invention performs verification to account for this.
In an exemplary embodiment of the present invention, the multi-layer cyclic verification subsystem 104 determines that the malicious suspicious file is malicious and computes the malicious similarity rate as 90.1%, and the machine learning modeling module 108 also predicts that the malicious suspicious file is malicious; since both results are malicious, the malicious suspicious file is finally determined to be malicious.
The machine learning model verification unit 106 outputs a verification result that the prediction result of the machine learning modeling module 108 is reliable when it is the same as the result determined by the multi-layer cyclic verification subsystem 104, and outputs a verification result that the prediction result is not reliable when the two results differ.
In an exemplary embodiment of the present invention, since the prediction result of the machine learning modeling module 108 and the result determined in the multi-layer cyclic verification subsystem 104 are the same as each other as being malicious, the machine learning model verification unit 106 outputs the verification result that the prediction result of the machine learning modeling module 108 is reliable.
Meanwhile, the method for verifying a malicious code machine learning classification model according to an exemplary embodiment of the present invention will be described in detail below with reference to the accompanying drawings.
In step S900, the feature extraction module 200 extracts features related to the static analysis information that may be obtained without execution of the malicious suspicious file and features related to the dynamic analysis information that may be obtained through execution of the malicious suspicious file.
In step S902, the main feature processing module 202 selects and categorizes main features which may be used at the time of performing the malicious action among the extracted features related to the static analysis information and features related to the dynamic analysis information.
In step S904, the main feature relative comparison module 204 compares the selected main features with the main features of the normal files and the main features of the malicious files, respectively to calculate the normal similarity rate and the malicious similarity rate.
In step S906, the operation sequence based comparison modeling module 206 compares the features related to the operation sequence among the selected main features with the features related to the operation sequence of the normal files and the features related to the operation sequence of the malicious files, respectively to calculate the normal similarity rate and the malicious similarity rate.
In step S908, the function sequence based comparison modeling module 208 compares the features related to the function sequence among the selected main features with the features related to the function sequence of the normal files and the features related to the function sequence of the malicious files, respectively to calculate the normal similarity rate and the malicious similarity rate.
In step S910, the determination unit 210 computes the final normal similarity rate and the final malicious similarity rate based on the normal similarity rates and the malicious similarity rates calculated in steps S904, S906, and S908, and determines whether the malicious suspicious file is normal or malicious by comparing the final normal similarity rate and the final malicious similarity rate.
In step S912, the machine learning modeling module 108 predicts whether the malicious suspicious file is normal or malicious based on the machine learning model.
In step S914, the machine learning model verification unit 106 compares the result predicted by the machine learning modeling module 108 in step S912 with the result determined in step S910 to verify the reliability of the machine learning modeling module 108.
Meanwhile, step S904 includes comparing the contents of the main features classified for each selected category with the contents of the main features of the normal files and the contents of the main features of the malicious files, respectively, to obtain the number of categories whose contents match each other (operation S500), generating the feature vectors by setting the categories whose contents match to 1 and the categories whose contents do not match to 0 (operation S502), computing the similarity rate for each feature through block-unit fuzzy hash comparison according to the number of matching categories (operations S504 and S506), and calculating the normal similarity rate and the malicious similarity rate based on the feature vectors and the similarity rates for the respective features (operations S508 and S510).
Step S906 includes converting the features related to the operation sequence among the selected main features into N-grams (operation S700), generating an action vector through feature hashing and normalization of the N-gram-converted features (operation S702), and comparing the generated action vector with the action vectors of the normal files and the malicious files in units of blocks to calculate the normal similarity rate and the malicious similarity rate (operation S704).
Step S908 includes preprocessing the features related to the function sequence among the selected main features (operation S800), converting the preprocessed features into N-grams (operation S802), and comparing the N-gram-converted features with the N-gram-converted features of the normal files and of the malicious files, respectively, using the cosine similarity technique to calculate the normal similarity rate and the malicious similarity rate (operation S804).
While the present invention has been particularly described with reference to detailed exemplary embodiments thereof, these embodiments are intended to describe the present invention specifically, the present invention is not limited thereto, and it will be apparent that modifications and improvements can be made by those skilled in the art within the technical spirit of the present invention.
Simple modifications and changes of the present invention all fall within the scope of the present invention, and the specific scope of protection of the present invention will be made clear by the appended claims.
Number | Date | Country | Kind
---|---|---|---
10-2018-0106470 | Sep. 6, 2018 | KR | national