The present disclosure relates to a learning apparatus, a determination system, a learning method, and a non-transitory computer readable medium.
In recent years, machine learning, as represented by deep learning, has been actively studied and applied to various fields. For example, machine learning is being used to detect malware, which continues to increase on the Internet every year.
As related art, for example, Patent Literature 1 and 2 are known. Patent Literature 1 discloses a technique for learning a communication feature amount of malware in order to detect malware. In addition, Patent Literature 2 discloses a technique for creating a normal model by unsupervised machine learning in order to detect an abnormality of a facility.
As disclosed in Patent Literature 1, a related technique uses machine learning to learn a large number of features of malware in order to detect the malware. However, the related technique has a problem in that it is sometimes difficult to create a learning model capable of accurately determining whether a file is malware.
In view of such a problem, an object of the present disclosure is to provide a learning apparatus, a determination system, a learning method, and a non-transitory computer readable medium capable of creating a learning model that can improve the accuracy of determining whether a file is malware.
A learning apparatus according to the present disclosure includes: pseudo learning means for creating a pseudo learning model based on pseudo feature data indicating a pseudo feature of goodware; and determination learning means for creating a determination learning model for determining whether a file is malware based on the created pseudo learning model and feature data indicating a feature of the malware.
A determination system according to the present disclosure includes: pseudo learning means for creating a pseudo learning model based on pseudo feature data indicating a pseudo feature of goodware; determination learning means for creating a determination learning model for determining whether a file is malware based on the created pseudo learning model and feature data indicating a feature of the malware; and determination means for determining whether or not an input file is the malware based on the created determination learning model.
A learning method according to the present disclosure includes: creating a pseudo learning model based on pseudo feature data indicating a pseudo feature of goodware; and creating a determination learning model for determining whether a file is malware based on the created pseudo learning model and feature data indicating a feature of the malware.
A non-transitory computer readable medium storing a learning program according to the present disclosure causes a computer to execute: creating a pseudo learning model based on pseudo feature data indicating a pseudo feature of goodware; and creating a determination learning model for determining whether a file is malware based on the created pseudo learning model and feature data indicating a feature of the malware.
According to the present disclosure, it is possible to provide a learning apparatus, a determination system, a learning method, and a non-transitory computer readable medium capable of creating a learning model that can improve the accuracy of determining whether a file is malware.
Example embodiments will be described below with reference to the drawings. In the following descriptions and drawings, omissions and simplifications are made as appropriate for clarity. In each of the drawings, the same elements are denoted by the same reference signs, and repeated descriptions are omitted as necessary.
As a related technique, a method for determining whether a file is malware by means of a learning model (a mathematical model) based on deep learning is examined first. In this method, a large amount of feature data (numerical data) indicating features of malware and of normal files is prepared, and a learning model is created from it. By learning a large amount of feature data of malware and normal files as supervised data, “features” common to malware can be found and unknown malware can be identified. Note that malware is software or data that performs unauthorized (malicious) operations on a computer or a network, such as a computer virus or a worm. A normal file (goodware) is a file other than malware, that is, software or data that operates normally on a computer or a network without performing an unauthorized (malicious) operation.
The “feature data” indicating the features of malware is data obtained by digitizing, for example, the number of occurrences of a string pattern that appears in common in many kinds of malware and whether or not the malware matches a certain rule (e.g., “a certain file on the computer is operated on”). The list of string patterns and the selection of rules needed to create the feature data must be prepared manually in advance.
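The digitization described above can be illustrated with a minimal sketch. The pattern list, the rule name `touches_hosts_file`, and the function `extract_features` are hypothetical and chosen only for illustration; the disclosure does not specify any concrete patterns or rules.

```python
from typing import Callable, Dict, List

# Hypothetical, manually prepared string patterns (assumption, not from the disclosure).
STRING_PATTERNS: List[bytes] = [b"CreateRemoteThread", b"cmd.exe", b"http://"]

# Hypothetical rules; each maps the raw file content to True/False.
RULES: Dict[str, Callable[[bytes], bool]] = {
    "touches_hosts_file": lambda data: b"\\drivers\\etc\\hosts" in data,
}


def extract_features(data: bytes) -> List[int]:
    """Return pattern occurrence counts followed by 0/1 rule flags."""
    counts = [data.count(pattern) for pattern in STRING_PATTERNS]
    flags = [int(rule(data)) for rule in RULES.values()]
    return counts + flags


sample = b"cmd.exe /c start http://a http://b"
features = extract_features(sample)
print(features)
```

Each position in the resulting vector corresponds to one feature data element, so every file is mapped to a vector of the same length regardless of its size.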
The inventor has found that a learning model obtained by such a related learning method cannot accurately determine whether a file is malware. That is, when an unknown sample is evaluated using a learning model obtained by the related learning method, it is almost always determined to be “malware”. This is due to the lack of normal file samples compared to malware samples, and the resulting inability to effectively learn the features of normal files. For example, against about 2.5 million malware samples, only about 500,000 normal file samples, about ⅕ of the number of malware samples, can be prepared. A certain number of malware samples can be collected from existing malware databases and from information provided on the Internet. However, it is difficult to collect a large number of normal files, because hardly any such databases or Internet sources exist for files that are operating normally.
The above problem is also caused by algorithmic features of deep learning. Specifically, when there is a difference between the number of samples of malware and that of normal files, a file is more likely to be determined to belong to whichever class has the greater number of samples. Therefore, the learning model tends to determine a file to be “malware”, which is the class with the greater number of samples. For example, when learning is performed using the feature data of malware only, a learning model that always determines a file to be “malware” is obtained. Therefore, in the related learning method, feature data of normal files is essential in order to accurately determine whether a file is malware or a normal file.
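As a rough worked example of this imbalance, the following sketch computes how a degenerate model that always answers “malware” would score on a data set with the sample counts mentioned above. The function name `always_malware_metrics` is hypothetical, and the calculation is only an illustration of the effect, not a measurement reported in the disclosure.

```python
from typing import Tuple


def always_malware_metrics(n_malware: int, n_normal: int) -> Tuple[float, float]:
    """Accuracy and normal-file recall of a model that always outputs 'malware'."""
    total = n_malware + n_normal
    accuracy = n_malware / total  # every malware sample is counted as correct
    normal_recall = 0.0           # no normal file is ever recognized as normal
    return accuracy, normal_recall


accuracy, normal_recall = always_malware_metrics(2_500_000, 500_000)
print(f"accuracy={accuracy:.3f}, normal recall={normal_recall:.1f}")
```

With these counts the degenerate model is correct on about five out of six samples while misclassifying every normal file, which suggests why training on such data can drift toward always answering “malware”.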
Furthermore, the above problem is caused by the difficulty in acquiring the features of the “normal files”. That is, malware has common features such as “access to a specific file” and “call a specific Application Programming Interface (API)”. However, the normal files do not have such rules and do not have common features. It is therefore difficult to determine a normal file with the learning model created using the related learning method.
Thus, if a learning model created by the related learning method is used, it is not possible to accurately determine whether a file is malware. In order to address this issue, in the following example embodiments, even when the number of samples of normal files is small and it is difficult to acquire the features of the normal files, it is possible to accurately determine whether a file is malware.
The pseudo learning unit 11 creates a pseudo learning model (a first learning model) based on pseudo feature data indicating a pseudo feature of a normal file (goodware). For example, the pseudo feature data is data that covers possible values of feature data within a possible range. The determination learning unit 12 creates a determination learning model (a second learning model) for determining whether a file is malware based on the pseudo learning model created by the pseudo learning unit 11 and the feature data indicating a feature of the malware.
As shown in
Thus, in the example embodiments, the learning model is created in two stages: one stage in which a pseudo learning model is created based on the pseudo feature data of the normal file; and another stage in which the determination learning model is created based on the feature data of the malware. Thus, it is not necessary to learn the features of the normal files which are difficult to acquire, and a learning model capable of improving the accuracy of determining whether a file is malware can be created.
A first example embodiment will be described below with reference to the drawings.
As shown in
The malware memory apparatus 300 is a database apparatus for storing a large amount of malware as samples for learning. The malware memory apparatus 300 may store previously collected malware or may store information provided on the Internet. The determination learning model memory apparatus 400 stores determination learning models (or simply called learning models) for determining whether a file is malware. The determination learning model memory apparatus 400 stores the determination learning models created by the learning apparatus 100, and the determination apparatus 200 refers to the stored determination learning models for determining whether a file is malware.
The learning apparatus 100 is an apparatus for creating the determination learning model trained with the feature of malware as a sample. The learning apparatus 100 includes a control unit 110 and a memory unit 120. The learning apparatus 100 may also include an input unit, an output unit, etc. as a communication unit to communicate with the determination apparatus 200, the Internet, or the like, or as an interface with a user, an operator, or the like, if necessary.
The memory unit 120 stores information necessary for the operation of the learning apparatus 100. The memory unit 120 is a non-volatile memory unit (storage unit), and is, for example, a non-volatile memory such as a flash memory or a hard disk. The memory unit 120 includes a feature setting memory unit 121 for storing feature setting information necessary for creating feature data and pseudo feature data, a pseudo feature data memory unit 122 for storing the pseudo feature data, a pseudo learning model memory unit 123 for storing pseudo learning models, and a feature data memory unit 124 for storing the feature data. The memory unit 120 further stores a program or the like necessary for creating the learning model by machine learning.
The control unit 110 controls the operations of each unit of the learning apparatus 100, and is a program execution unit such as a CPU (Central Processing Unit). The control unit 110 reads the program stored in the memory unit 120 and executes the read program to implement each function (processing). As these functions, the control unit 110 includes, for example, a pseudo feature creation unit 111, a pseudo learning unit 112, a learning preparation unit 113, a feature creation unit 114, and a determination learning unit 115.
The pseudo feature creation unit 111 creates pseudo feature data indicating the pseudo feature of a normal file. The pseudo feature creation unit 111 creates the pseudo feature data of the normal files by referring to the feature setting information in the feature setting memory unit 121, and stores the created pseudo feature data in the pseudo feature data memory unit 122. The pseudo feature creation unit 111 creates the pseudo feature data so as to cover possible values of the feature data based on the feature setting information such as a feature creation rule. Note that the pseudo feature creation unit 111 may instead acquire pseudo feature data that has already been created.
The pseudo learning unit 112 performs pseudo learning as initial learning performed in advance of the learning of the malware. The pseudo learning unit 112 creates the pseudo learning model based on the pseudo feature data of the normal files stored in the pseudo feature data memory unit 122, and stores the created pseudo learning model in the pseudo learning model memory unit 123. The pseudo learning unit 112 creates the pseudo learning model by training a machine learner using a Neural Network (NN) with the pseudo feature data of the normal files as pseudo supervised data.
The learning preparation unit 113 performs preparation necessary for learning the determination learning model. The learning preparation unit 113 refers to the malware memory apparatus 300 to prepare samples of malware and selects the samples of the malware for learning. The learning preparation unit 113 may prepare and select the samples based on a predetermined standard, or may prepare and select the samples according to an input operation of the user or the like.
The feature creation unit 114 creates feature data indicating the features of the malware. The feature creation unit 114 refers to the feature setting information of the feature setting memory unit 121, creates the feature data of the selected malware, and stores the created feature data in the feature data memory unit 124. The feature creation unit 114 extracts the feature data of the selected malware based on the feature setting information such as the feature creation rule.
The determination learning unit 115 learns the feature data of the malware as final learning after the initial learning. The determination learning unit 115 creates the determination learning model based on the pseudo learning model stored in the pseudo learning model memory unit 123 and the feature data of the malware stored in the feature data memory unit 124, and stores the created determination learning model in the determination learning model memory apparatus 400. The determination learning unit 115 creates the determination learning model by training a neural-network machine learner so that the feature data of the malware is added, as supervised data, to the pseudo learning model.
The determination apparatus 200 determines whether or not a file provided by the user is malware. The determination apparatus 200 includes an input unit 210, a determination unit 220, and an output unit 230. The determination apparatus 200 may also include a communication unit to communicate with the learning apparatus 100, the Internet, or the like, if necessary.
The input unit 210 acquires a file input from the user. The input unit 210 receives the uploaded file via a network such as the Internet.
The determination unit 220 determines whether the input file is malware or a normal file based on the determination learning model created by the learning apparatus 100. The determination unit 220 refers to the determination learning model stored in the determination learning model memory apparatus 400 and determines whether features of the input file are close to the features of the malware or the features of the normal files.
The output unit 230 outputs to the user the result, obtained by the determination unit 220, of determining whether the input file is malware. The output unit 230 outputs the result via a network such as the Internet, in a manner similar to the input unit 210.
The pseudo feature data is data within the predetermined range (scale) of values that the feature data can take in each feature data element. For example, a minimum value and a maximum value indicating the range of each feature data element are defined by the feature setting information in the feature setting memory unit 121.
The pseudo feature data is data plotted at predetermined intervals as possible values of the feature data in the feature data element.
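The range and interval described above can be sketched as follows. The element names, the `(min, max, interval)` layout of `FEATURE_SETTINGS`, and the choice to combine the elements as a full grid are assumptions made for illustration; the disclosure only states that the values lie within a defined range and are plotted at predetermined intervals.

```python
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical feature setting information: (minimum, maximum, interval)
# for each feature data element. Names and values are illustrative only.
FEATURE_SETTINGS: Dict[str, Tuple[int, int, int]] = {
    "pattern_count": (0, 10, 5),
    "api_calls": (0, 4, 2),
}


def axis_values(lo: int, hi: int, step: int) -> List[int]:
    """Values of one feature data element from lo to hi (inclusive) at the given interval."""
    return list(range(lo, hi + 1, step))


def make_pseudo_feature_data(settings: Dict[str, Tuple[int, int, int]]) -> List[List[int]]:
    """Combine the per-element values into a grid of pseudo feature vectors."""
    axes = [axis_values(*setting) for setting in settings.values()]
    return [list(point) for point in product(*axes)]


pseudo = make_pseudo_feature_data(FEATURE_SETTINGS)
print(len(pseudo), pseudo[0], pseudo[-1])
```

A full grid grows exponentially with the number of feature data elements, so a practical implementation would likely need coarser intervals or sampling; that trade-off is not addressed in the text above.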
As shown in
Next, as shown in
Next, the learning apparatus 100 creates feature data of malware (S205). That is, the feature creation unit 114 extracts the feature amount of the malware to be learned as a sample and creates the feature data of the malware. Next, the learning apparatus 100 creates the determination learning model (S206). That is, the determination learning unit 115 additionally trains the pseudo learning model with the feature data of the malware to create the determination learning model.
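The two-stage flow (pseudo learning followed by additional training with malware feature data) can be sketched in a simplified form. Note that this sketch swaps in a nearest-neighbour lookup for the neural network named in the description, purely to keep the example short and deterministic; all function names and data values are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, ...]
Model = Dict[Point, str]


def train_pseudo_model(pseudo_data: List[Point]) -> Model:
    """Stage 1: label every pseudo feature point as 'normal'."""
    return {point: "normal" for point in pseudo_data}


def train_determination_model(pseudo_model: Model, malware_data: List[Point]) -> Model:
    """Stage 2: copy the pseudo model, then add or overwrite points with 'malware'."""
    model = dict(pseudo_model)
    for point in malware_data:
        model[point] = "malware"
    return model


def classify(model: Model, x: Point) -> str:
    """Return the label of the stored point nearest to x (squared Euclidean distance)."""
    nearest = min(model, key=lambda p: sum((a - b) ** 2 for a, b in zip(p, x)))
    return model[nearest]


# Pseudo feature data on a coarse grid (assumed range 0..10, interval 5).
pseudo_data = [(i, j) for i in range(0, 11, 5) for j in range(0, 11, 5)]
pseudo_model = train_pseudo_model(pseudo_data)
det_model = train_determination_model(pseudo_model, [(10, 10), (9, 8)])

before = classify(pseudo_model, (9, 9))  # pseudo model: everything is normal
after = classify(det_model, (9, 9))      # near the learned malware features
far = classify(det_model, (1, 1))        # far from any malware features
print(before, after, far)
```

In this simplified stand-in, regions not covered by malware feature data keep the “normal” label inherited from the pseudo learning model, which mirrors the intent of training without any real normal file samples.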
As shown in
As shown in
Next, the determination apparatus 200 refers to the determination learning model (S302) and determines the file based on the determination learning model (S303). The determination unit 220 refers to the determination learning model created as shown in
Next, the determination apparatus 200 outputs the result of determining whether a file is malware or a normal file (S304). For example, the output unit 230 displays the result of determining whether a file is malware or a normal file to the user via the web interface, as in S301. For example, “File is malware” or “File is a normal file” is displayed. In addition, the possibility (probability) that the file is malware or a normal file, obtained from the distance between the feature amount of the file and the feature data of the determination learning model, may be displayed.
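One possible way to turn such distances into a displayed score is sketched below. The formula (the ratio of the nearest-normal distance to the sum of both nearest distances) and the function name `malware_score` are assumptions for illustration; the description does not specify how the probability is computed.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def malware_score(x: Point, malware_points: Sequence[Point], normal_points: Sequence[Point]) -> float:
    """Hypothetical score in [0, 1]; closer to 1 means closer to malware features."""
    d_malware = min(math.dist(x, p) for p in malware_points)
    d_normal = min(math.dist(x, p) for p in normal_points)
    if d_malware + d_normal == 0:
        return 0.5  # x coincides with both a malware and a normal point
    return d_normal / (d_malware + d_normal)


malware_points = [(9.0, 8.0)]
normal_points = [(5.0, 5.0)]
score = malware_score((9.0, 9.0), malware_points, normal_points)
label = "File is malware" if score >= 0.5 else "File is a normal file"
print(f"{label} (score {score:.2f})")
```

The 0.5 threshold used to choose the displayed message is likewise an illustrative choice.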
As described above, in this example embodiment, the learning is performed in two stages: a stage of “creation of a pseudo learning model by learning pseudo feature data” and a stage of “creation of a determination learning model by feature data of actual malware”. In particular, the determination learning model is created without using samples or feature data of normal files. Data covering the range of values (integer values) that the feature data can take is used as the “pseudo feature data of a normal file”, and the pseudo learning model is created only from this pseudo feature data. This yields a pseudo learning model that determines all files to be “normal files”. The pseudo learning model is then additionally trained with the feature data of the malware to create the “determination learning model”; that is, the features of the malware are learned by overwriting the pseudo learning model. In this manner, malware can be accurately determined using the determination learning model.
Next, a second example embodiment will be described. In this example embodiment, another configuration example of the learning apparatus according to the first example embodiment will be described. That is, as shown in
For example, the learning apparatus 100a includes the pseudo feature creation unit 111 and the pseudo learning unit 112 in a control unit 110a, and includes a feature setting memory unit 121a and a pseudo feature data memory unit 122 in a memory unit 120a. The learning apparatus 100a creates a pseudo learning model, and stores the created pseudo learning model in a pseudo learning model memory apparatus 410 in a manner similar to that in the first example embodiment.
The learning apparatus 100b includes the learning preparation unit 113, the feature creation unit 114, and the determination learning unit 115 in the control unit 110b, and includes a feature setting memory unit 121b and a feature data memory unit 124 in a memory unit 120b. The learning apparatus 100b creates a determination learning model using a pseudo learning model or the like of the pseudo learning model memory apparatus 410 in a manner similar to that in the first example embodiment.
With such a configuration, a pseudo learning model can be created in advance, and then a determination learning model can be created using the pseudo learning model at the timing of learning malware. The pseudo learning model can be reused as a common model to create the determination learning model.
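The reuse of a pseudo learning model created in advance can be sketched as a save and load cycle. The use of `pickle`, the file name `pseudo_model.pkl`, and the dictionary-based model are assumptions for illustration and stand in for the pseudo learning model memory apparatus 410 and the actual model format, which the description does not specify.

```python
import os
import pickle
import tempfile
from typing import Dict, Tuple

Model = Dict[Tuple[int, int], str]


def save_model(model: Model, path: str) -> None:
    """Store a model, standing in for writing to the pseudo learning model memory apparatus."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path: str) -> Model:
    """Load an independent copy of the stored model."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Role of learning apparatus 100a: create the pseudo learning model once.
pseudo_model: Model = {(i, j): "normal" for i in (0, 5, 10) for j in (0, 5, 10)}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pseudo_model.pkl")
    save_model(pseudo_model, path)

    # Role of learning apparatus 100b: build separate determination models later.
    model_a = load_model(path)
    model_a[(10, 10)] = "malware"
    model_b = load_model(path)
    model_b[(0, 5)] = "malware"

print(model_a[(0, 5)], model_b[(10, 10)], pseudo_model[(10, 10)])
```

Because each determination model starts from its own loaded copy, training one of them leaves the stored pseudo learning model and the other determination models unchanged.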
Note that the present disclosure is not limited to the example embodiments described above, and may be changed as necessary without departing from the scope thereof. For example, the system may be used not only to determine a file provided by a user but also to determine an automatically collected file. Furthermore, the system may be used not only for determining whether a file is malware or a normal file but also for determining whether a file is another type of abnormal file or a normal file.
Each configuration in the above example embodiments may be composed of hardware or software, or both of them, or may be composed of one piece of hardware or software, or may be composed of a plurality of pieces of hardware or software. The function (processing) of each apparatus may be implemented by a computer including a CPU, a memory, or the like. For example, a program for performing the method (the learning method or determination method) in the example embodiments may be stored in the memory apparatus, and each function may be implemented by the CPU executing the program stored in the memory apparatus.
These programs can be stored and provided to a computer using any type of non-transitory computer readable media. Non-transitory computer readable media include any type of tangible storage media. Examples of non-transitory computer readable media include magnetic storage media (such as floppy disks, magnetic tapes, hard disk drives, etc.), magneto-optical storage media (e.g., magneto-optical disks), CD-ROM (compact disc read only memory), CD-R (compact disc recordable), CD-R/W (compact disc rewritable), and semiconductor memories (such as mask ROM, PROM (programmable ROM), EPROM (erasable PROM), flash ROM, RAM (random access memory), etc.). The program may also be provided to a computer using any type of transitory computer readable media. Examples of transitory computer readable media include electric signals, optical signals, and electromagnetic waves. Transitory computer readable media can provide the program to a computer via a wired communication line (e.g., electric wires and optical fibers) or a wireless communication line.
Although the present disclosure has been described with reference to the above example embodiments, the present disclosure is not limited to the above example embodiments. Various changes can be made to the configurations and details of this disclosure that can be understood by those skilled in the art within the scope of this disclosure.
The whole or part of the example embodiments disclosed above can be described as, but not limited to, the following supplementary notes.
A learning apparatus comprising:
pseudo learning means for creating a pseudo learning model based on pseudo feature data indicating a pseudo feature of goodware; and
determination learning means for creating a determination learning model for determining whether a file is malware based on the created pseudo learning model and feature data indicating a feature of the malware.
The learning apparatus according to Supplementary note 1, wherein
the pseudo feature data is data of a feature data element that the feature data can have.
The learning apparatus according to Supplementary note 2, wherein
the pseudo feature data is data within a range in which the feature data can fall in the feature data element.
The learning apparatus according to Supplementary note 2 or 3, wherein
the pseudo feature data is data plotted at predetermined intervals in the feature data element.
The learning apparatus according to any one of Supplementary notes 2 to 4, wherein
the feature data element includes the number of occurrences of a predetermined string pattern.
The learning apparatus according to any one of Supplementary notes 2 to 5, wherein
the feature data element includes the number of accesses to a predetermined file.
The learning apparatus according to any one of Supplementary notes 2 to 6, wherein
the feature data element includes the number of calls of a predetermined application interface.
The learning apparatus according to any one of Supplementary notes 1 to 7, wherein
the determination learning means creates the determination learning model by adding the feature data to the pseudo learning model.
The learning apparatus according to Supplementary note 8, wherein
the determination learning means creates the determination learning model by overwriting the pseudo feature data with the feature data in the pseudo learning model.
A determination system comprising:
pseudo learning means for creating a pseudo learning model based on pseudo feature data indicating a pseudo feature of goodware;
determination learning means for creating a determination learning model for determining whether an input file is malware based on the created pseudo learning model and feature data indicating a feature of the malware; and
determination means for determining whether or not the input file is the malware based on the created determination learning model.
The determination system according to Supplementary note 10, wherein
the determination means makes the determination based on the feature of the file and the feature data in the determination learning model.
A learning method comprising:
creating a pseudo learning model based on pseudo feature data indicating a pseudo feature of goodware; and
creating a determination learning model for determining whether a file is malware based on the created pseudo learning model and feature data indicating a feature of the malware.
The learning method according to Supplementary note 12, wherein
the pseudo feature data is data of a feature data element that the feature data can have.
A learning program for causing a computer to execute:
creating a pseudo learning model based on pseudo feature data indicating a pseudo feature of goodware; and
creating a determination learning model for determining whether a file is malware based on the created pseudo learning model and feature data indicating a feature of the malware.
The learning program according to Supplementary note 14, wherein
the pseudo feature data is data of a feature data element that the feature data can have.
This application is based upon and claims the benefit of priority from Japanese patent application No. 2019-175847, filed on Sep. 26, 2019, the disclosure of which is incorporated herein in its entirety by reference.
Number | Date | Country | Kind |
---|---|---|---|
2019-175847 | Sep 2019 | JP | national |

Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/JP2020/031781 | 8/24/2020 | WO | |