The present invention relates to the field of resource usage optimization. More particularly, the present invention relates to a system and method for automated policy implementation that optimizes multi-objective tasks with contradicting constraints, using reinforcement learning.
Many technological fields, such as medical testing and diagnostics, detection of various illnesses or medical conditions, and maintenance facilities for vehicles, ships, drones and planes, require testing and analysis in order to optimize the usage of resources and achieve multi-objective tasks with contradicting constraints. However, attempting to satisfy all constraints is very difficult, and heuristic solutions result in sub-optimal policies. For example, medical examination and diagnostics may require several tests, some of which may be very expensive (e.g., MRI), without real need. The same applies when there is a need to decide which maintenance measures should be taken by garages in order to keep vehicles properly maintained and prevent failures.
Supervised learning solutions (e.g., classification algorithms) are not ideal for these scenarios, because the partial information available halfway through the process is not suitable for training the models. Reinforcement learning is far more suitable for such scenarios, but to date, no solution exists for solving multi-objective problems in which some of the objectives are contradictory (e.g., accuracy vs. resource usage).
Other systems that require such optimization are computerized systems and data networks. For example, malware detection is a lasting problem for organizations, often with significant consequences [2]. Portable Executable (PE) files (the PE format is a file format for executables, object code, DLLs and others used in 32-bit and 64-bit versions of Windows operating systems) are one of the most significant platforms for malware to spread. PEs are common in the Windows operating systems and are used by executables and Dynamic Link Libraries (DLLs), among others. The PE format is essentially a data structure which holds all the necessary information for the Windows loader to execute the wrapped code.
Malware constantly evolves as attackers try to evade detection solutions, the most common of which is the anti-virus. Anti-virus solutions mostly perform static analysis of the software's binary code to detect pre-defined signatures, a trait that renders them ineffective in recognizing new malware even if similar functionality has been recorded. Obfuscation techniques such as polymorphism and metamorphism [33] further exacerbate the problem.
In recent years, the need to deal with the continuously evolving threats led to significant developments in the malware detection field. Instead of searching for pre-defined signatures within the executable file, new methods attempt to analyze the behavior of the portable executable (PE) file. These methods often rely on statistical analysis and Machine Learning (ML) techniques as their decision making mechanism, and can generally belong to one of two families: static analysis and dynamic analysis [18].
Static analysis techniques [29] employ an in-depth look at the file, without performing any execution. Solutions implementing static analysis can be either signature-based or statistics-based. Signature-based detection is the more widely used approach [6] because of its simplicity, relative speed and its effectiveness against known malware. However, signature-based detection has three major drawbacks:
Statistics-based detection mainly involves the extraction of features from the executable, followed by the training of a machine learning classifier. The extracted features vary and may include executable file format descriptions [19], code descriptions [23], binary data statistics [17], text strings [5] and information extracted using code emulation or similar methods [33]. This method is considered more effective than its signature-based counterpart in detecting previously unknown malware, mostly due to the use of machine learning (ML) [3, 5, 7, 10, 23], but tends to be less accurate overall [20]. For this reason, organizations often deploy an ensemble of multiple behavioral and statistics-based detectors, and then combine their scores to produce a final classification. This classification process can be achieved through simple heuristics (e.g., averaging) or by more advanced ML algorithms [12].
However, the ensemble approach has two significant drawbacks. First, using an ensemble requires that organizations run all participating detection tools prior to classifying a file, both to make scoring consistent and because most ML algorithms (like those often used to reach the final ensemble decision) require a fixed-size feature set. Running all detectors is time- and resource-consuming and is often unnecessary for clear-cut cases, so computing resources are wasted. Moreover, the introduction or removal of a detector often requires retraining the entire ML model. This limits flexibility and the organization's ability to respond to new threats.
The second drawback of the ensemble approach is the difficulty of implementing the organizational security policy. When using ML-based solutions for malware detection, the only “tool” available for organizations to set their policy is the final confidence score: files above a certain score are blocked, while the rest are allowed to enter. Under this setting, it is difficult to define the cost of a false-negative compared to that of a false-positive, or to quantify the cost of running additional detectors. In addition to being hard to define, such security policies are also hard to refine: minor changes in the confidence score threshold may result in large fluctuations in performance (e.g., significantly raising the number of false alarms).
Deep Reinforcement Learning (DRL)
Reinforcement Learning (RL) is an area of machine learning that addresses decision making in complex scenarios, possibly when only partial information is available. The ability of RL algorithms to explore the large solution spaces and devise highly efficient policies to address them (especially when coupled with deep learning) was shown to be highly effective in areas such as robotics and control problems [21], genetic algorithms [26], and achieving super-human performance in complex games [25].
RL tasks normally consist of both an agent and an environment. The agent interacts with the environment E in a sequence of actions and rewards. At each time-step t, the agent selects an action a_t from A={a_1, a_2, . . . , a_k} that both modifies the state of the environment and incurs a reward r_t, which is either positive or negative (the term “cost” is used to describe negative rewards). The goal of the agent is to interact with the environment in a way that maximizes the future reward R_t=Σ_{t'=t..T} r_t' over the time-span {t . . . T}, where T is the index of the final action (i.e., the classification decision). A frequent approach for selecting the action to be taken at each state is the action-value function Q(s,a) [27]. This function approximates the expected return should one take action a at state s. While the methods vary, RL algorithms which use Q-functions aim to discover (or closely approximate) the optimal action-value function Q*(s,a)=max_π E[R_t | s_t=s, a_t=a, π], where π is the policy mapping states to actions [27].
Since estimating Q for every possible state-action combination is highly impractical [14], it is common to use an approximator Q(s,a;θ)≈Q*(s,a), where θ represents the parameters of the approximator. Deep Reinforcement Learning (DRL) algorithms perform this approximation using neural networks, with θ being the parameters of the network.
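As an illustration only (the architecture, framework and hyper-parameters below are assumptions, not the specific approximator used by the present invention), a neural approximator of Q(s,a;θ) may be sketched as follows:

```python
# Illustrative sketch only: a small neural network that approximates Q(s, a; theta),
# mapping a state vector to one estimated Q-value per available action.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),  # one estimated Q-value per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Example: K = 4 detectors give a 4-dimensional state and K + 2 = 6 actions
# (run one of the detectors, or classify the file as "malicious"/"benign").
q_net = QNetwork(state_dim=4, num_actions=6)
q_values = q_net(torch.full((1, 4), -1.0))  # the initial state is all -1 values
```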
While RL algorithms strive to maximize the reward based on their current knowledge (i.e., exploitation), it is important to also encourage the exploration of additional states. Many methods for maintaining this exploration/exploitation balance have been proposed, including importance sampling [22], ε-greedy sampling [30] and Monte-Carlo Tree Search [24]. The method of the present invention uses ε-greedy sampling.
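The ε-greedy strategy can be sketched as follows (a minimal illustration; the exploration rate and its decay schedule are assumptions, not values taken from the described experiments):

```python
# Illustrative epsilon-greedy action selection: with probability epsilon a
# random action is explored, otherwise the action with the highest estimated
# Q-value is exploited.
import random
import torch

def epsilon_greedy_action(q_net, state: torch.Tensor, num_actions: int,
                          epsilon: float = 0.1) -> int:
    if random.random() < epsilon:
        return random.randrange(num_actions)        # explore
    with torch.no_grad():
        q_values = q_net(state.unsqueeze(0))        # shape: (1, num_actions)
    return int(q_values.argmax(dim=1).item())       # exploit
```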
Actor-Critic Algorithms for Reinforcement Learning
Two common problems arise in the application of DRL algorithms: (1) the long time they need to converge, due to high variance (i.e., fluctuations) in gradient values, and (2) the need to deal with action sequences whose cumulative reward is zero (zero reward equals zero gradients, hence no parameter updates). These challenges can be addressed by using actor-critic methods, consisting of a critic neural net that estimates the Q-function and an actor neural net that updates the policy according to the critic.
Using two separate networks has been shown to reduce variance and accelerate model convergence during training. In the experiments performed, the Actor-Critic with Experience Replay (ACER) algorithm [32] was used. Experience replay [13] is a method for re-introducing the model to previously seen samples in order to prevent catastrophic forgetting (i.e., forgetting previously learned scenarios while tackling new scenarios).
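A generic experience-replay buffer can be sketched as follows (a simplified, uniformly sampled buffer given for illustration only; ACER itself uses a more elaborate off-policy replay scheme):

```python
# Simplified experience-replay buffer: stores transitions and re-introduces
# previously seen samples to the learner to mitigate catastrophic forgetting.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```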
The evolving threat of malware creates an incentive for organizations to diversify their detection capabilities. As a result, organizations often install multiple solutions [11] and run them all for every incoming file. This approach is both costly—in computing resources, processing time, and even the cost of electricity—and often unnecessary since most files can be easily classified.
A logical solution to this problem could be to use a small number of detectors for clear-cut cases and a larger ensemble for difficult-to-analyze files. However, this solution is hard to implement for two reasons. The first challenge is assigning the right set of detectors to each file. Ideally, one would like this set to be sufficiently large to be accurate, but also as small as possible so that it is computationally efficient. Striking this balance is a complex task, especially when a large number of detectors is available. The second challenge is the fact that different organizations have different preferences when facing the need to balance detection accuracy, error tolerance, and the cost of computing resources. Using these preferences to guide detector selection is difficult.
The conventional existing ensemble solutions require running all detectors prior to classification. This requirement is a result of the supervised learning algorithms (e.g., SVM, Random Forest) often used for this purpose. As a result, conventional solutions are unable to address the first challenge and are extremely constrained in addressing the second challenge.
Even without considering the issue of computational cost (which is moot due to the use of all detectors for each file), obtaining the right balance between different types of classification errors (false positive (FP) and false negative (FN)) remains a challenge. Usually, the only “tool” available for managing this trade-off is the confidence threshold, a value in the range of [0, 1] designating the classifier's level of certainty that the file is malicious. However, small changes in this value can cause large fluctuations in detection performance. Also, recent studies [8] suggest that the confidence score is not a sufficiently reliable indicator.
Other methods use many classifiers in order to increase the detection rate. However, these methods are time- and hardware-consuming.
It is therefore an object of the present invention to provide an efficient reinforcement learning-based framework for automated policy implementation, while optimizing multi-objective tasks with contradicting constraints.
It is another object of the present invention to provide an efficient reinforcement learning-based framework for formulating the contradicting constraints as a problem with an efficient solution.
It is a further object of the present invention to provide an efficient reinforcement learning-based framework for automatically learning a security policy that best fits organizational requirements.
It is still another object of the present invention to provide a reinforcement learning-based framework for managing a malware detection platform consisting of multiple malware detection tools.
It is yet another object of the present invention to provide a reinforcement learning-based framework for automatically learning a security policy that best fits organizational requirements.
Other objects and advantages of the invention will become apparent as the description proceeds.
An automatic computer implemented method for making classification decisions to provide a desired policy that optimizes multi-objective tasks with contradicting constraints, comprising the steps of:
Various processing modules may be sequentially queried for indications, while after each sequential step, deciding whether or not to perform further analysis by other processing modules.
The reinforcement learning algorithms may be designed to operate, based on partial data, without running all processing modules in advance.
A single processing module may be interactively selected, while during each iteration, the performance of the selected detector is evaluated, to determine whether the benefit of using additional processing modules is likely to be worth the cost of using the additional processing modules.
The selection of processing modules may be dynamic, while using different modules combinations for different scenarios.
The time required to run a processing module may represent the approximated cost of its activation.
The computational cost of using a processing module may be calculated as a function of its level of confidence.
The security policy may be managed using different cost/reward combinations.
The detector combinations may be not chosen in advance but iteratively, with the confidence level of the already-applied detectors used to guide the next step chosen by the policy.
An agent trained in a first environment may have transferability feature to function in a second environment, based on training in the first environment.
An automatic computer implemented method for making classification decisions to provide a desired policy reflecting organizational priorities, that optimizes multi-objective tasks with contradicting constraints, comprising the steps of:
Various detectors may be sequentially queried for each file, while after each sequential step, deciding whether or not to further analyze the file or to produce final classification.
The reinforcement learning algorithms may be designed to operate, based on partial knowledge, without running all detectors in advance.
A single detector may be interactively selected, while during each iteration, the performance of the selected detector is evaluated, to determine whether the benefit of using additional detectors is likely to be worth the computational cost of the additional detectors.
The selection of detectors may be dynamic, while using different detector combinations for different scenarios.
The states that characterize the environment may consist of all possible score combinations by the participating detectors.
The initial state for each incoming file may be a vector entirely consisting of −1 values and after various detectors are chosen to analyze the files, entries in the vector are populated with the confidence scores they provide.
The rewards, which reflect the organizational security policy, may be the tolerance for errors in the detection process and the cost of using computing resources.
The time required to run a detector may represent the approximated cost of its activation.
The cost function of the computing time may be defined as
The cost to be considered may be adapted to include one or more of the following additional resources:
memory usage;
CPU runtime;
cloud computing costs;
electricity consumption.
The detectors may be selected from the group of pefile, byte3g, opcode2g, and manalyze.
The computational cost of using a detector may be calculated as a function of correct/incorrect file classification.
The computational costs of the detectors may be defined, based on the average execution time of the files that were used for training.
The reward for correct classification and the cost of incorrect classification may be set to be equal to the cost of the running time.
The security policy may be managed using different cost/reward combinations.
In one aspect, the detector combinations are not chosen in advance but iteratively, with the confidence score of the already-applied detectors used to guide the next step chosen by the policy.
The computational environment may include malware detection in data files.
The computational environment may include medical data files.
The reward for correct classification and the penalty for incorrect classification may be time dependent.
The reward for correct classification may be fixed and the penalty for incorrect classification may be time dependent.
An agent trained in a first environment may have transferability feature to function in a second environment, based on training in the first environment.
The environment may include one of the following:
A computerized system for making classification decisions to provide a desired policy that optimizes multi-objective tasks with contradicting constraints, comprising:
The above and other characteristics and advantages of the invention will be better understood through the following illustrative and non-limitative detailed description of preferred embodiments thereof, with reference to the appended drawings, wherein:
The present invention may be implemented for malware detection and proposes a method for automated security policy implementation, using reinforcement learning.
The present invention provides a reinforcement learning-based framework that manages a malware detection platform consisting of multiple malware detection tools. For each file, the proposed method sequentially queries various detectors, and after each step, decides whether to further analyze the file or to produce a final classification. The decision-making process of the proposed automated security policy implementation is governed by a pre-defined reward function that awards points for correct classifications and applies penalties for misclassification and for heavy consumption of computing resources.
The use of reinforcement learning offers a solution to both problems. Firstly, this type of algorithm enables practitioners to assign clear numeric values to each classification outcome, as well as to quantify the cost of computing resources. These values reflect the priorities of the organization, and can be easily adapted and refined as required.
Secondly, once these values have been set, the reinforcement learning algorithm automatically attempts to define a policy (i.e., strategy) that maximizes them. This policy is likely to reflect organizational priorities much more closely than the use of a confidence threshold. Finally, since reinforcement learning algorithms are designed to operate, based on partial knowledge, there is no need to run all detectors in advance. Instead, the proposed algorithm interactively selects a single detector, evaluates its performance and then determines whether the benefit of using additional detectors is likely to be worth their computational cost. Also, the selection of detectors is dynamic, with different detector combinations used for different scenarios.
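The iterative selection process described above can be sketched as follows (the helper names, the string-valued actions and the agent interface are hypothetical assumptions that only illustrate the flow, not the actual implementation):

```python
# Hypothetical sketch of the iterative detector-selection loop described above.
# `detectors` maps a detector name to a callable returning a confidence score in
# [0, 1]; `agent.choose_action(state)` returns either a detector name or a
# final verdict ("malicious"/"benign").
MALICIOUS, BENIGN = "malicious", "benign"

def classify_file(file_path: str, detectors: dict, agent) -> str:
    names = list(detectors)
    state = [-1.0] * len(names)                    # -1 marks a detector not yet applied
    while True:
        action = agent.choose_action(state)
        if action in (MALICIOUS, BENIGN):          # terminal action: stop and classify
            return action
        idx = names.index(action)
        state[idx] = detectors[action](file_path)  # store the normalized confidence score
```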
The proposed method has two advantages over existing ensemble-based solutions. First, it is highly efficient, since easy-to-classify files are likely to only require the use of less-powerful (i.e. efficient) classifiers. One can therefore maintain near-optimal performance at a fraction of the computing cost. Secondly, organizations can clearly and deliberately define and refine their security policy. This goal is achieved by enabling practitioners to explicitly define the costs of each element of the detection process, i.e., correct or incorrect classification and the associated resource consumption.
The proposed method is able to achieve near-optimal accuracy of 96.21% (compared to an optimum of 96.86%) at approximately 20% of the running time of the optimal baseline.
In addition, it allows conducting an extensive analysis of multiple security policies, designed to simulate the needs and goals of different organizational types. The proposed method has been found to be robust, and the analysis examines the effect of various policy preferences on detection accuracy and resource consumption.
Moreover, the proposed method enables releasing the dataset used in the evaluation for general use. In addition to the files themselves, the confidence scores and meta-data produced by each of the malware detectors used are released for each file.
The main goal of the present invention is to automatically learn a security policy that best fits organizational requirements. Specifically, a deep neural network is trained to dynamically determine when there is sufficient information to classify a given file, and when more analysis is needed. The policy produced by the present invention is shaped based on the values (i.e., rewards and costs) assigned to correct and incorrect file classifications, as well as to the use of computing resources. An RL framework explores the efficacy of various detector combinations and continuously performs cost-benefit analysis, so as to select optimal detector combinations.
The main challenge in selecting detector combinations can be modelled as an exploration/exploitation problem. While the cost (i.e., computing resources consumption) of using a detector can be very closely approximated in advance, its benefit (i.e., the usefulness of the analysis) can only be known in retrospect. RL algorithms perform well in scenarios with high uncertainty where only partial information is available, a fact that makes them highly suitable for the task at hand.
The states that characterize the environment consist of all possible score combinations by the participating detectors. More specifically, for a malware detection environment consisting of K detectors, each possible state will be represented by a vector:
V={v1, v2, . . . , vK}, with the value of vx being set to the confidence score produced by detector x if that detector has already been applied to the file, and to −1 otherwise.
Therefore, the initial state for each incoming file is a vector entirely consisting of −1 values. As various detectors are chosen to analyze the files, entries in the vector are populated with the confidence scores they provide. All scores are normalized to a [0,1] range, where a confidence value of 1 indicates full certainty of the file being a malware and 0 indicates full certainty in its being benign. An example of a possible state vector can be seen at
The number of possible actions directly corresponds to the number of available detectors in the environment. For an environment consisting of K detectors, the number of actions will be K+2: one action for the activation of each detector, and two additional actions called “malicious” and “benign”. Each of the two additional actions produces a classification decision for the analyzed file, while also terminating the analysis process.
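For concreteness, the state and action spaces defined above can be written down as follows (the detector names are those of the dataset described later; their ordering and the example scores are illustrative only):

```python
# Concrete illustration of the state and action spaces for K = 4 detectors.
DETECTORS = ["manalyze", "pefile", "byte3g", "opcode2g"]
K = len(DETECTORS)
ACTIONS = DETECTORS + ["malicious", "benign"]   # K + 2 actions in total

initial_state = [-1.0] * K                      # no detector has been applied yet
# After running "pefile" (score 0.93) and "manalyze" (score 0.40), the state becomes:
example_state = [0.40, 0.93, -1.0, -1.0]
```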
The rewards should be designed so that they reflect the organizational security policy, namely the tolerance for errors in the detection process and the cost of using computing resources:
Two types of detection errors should be considered: false-positives (FP), which is the flagging of a benign file as malicious (i.e., a “false alarm”), and false-negative (FN), which is the flagging of a malicious file as benign. In addition to the negative rewards incurred by misclassification, it is also possible to provide positive reward for cases where the algorithm was correct.
Computing resources: The time required to run a detector has been chosen as the approximated cost of its activation. In addition to being a close approximator of other types of resources use (e.g., CPU, memory), the run time is a clear indicator of an organization's ability to process large volumes of incoming files. Hence, reducing the average time required to process a file allows organizations to process more files with less hardware.
When designing the reward function for the analysis runtime, it is required to address the large difference in this measure between various detectors.
As shown in Table 2 below, average running times can vary by orders of magnitude (from 0.7 s to 44.29 s, depending on the detector). In order to mitigate these differences and encourage the use of the more “expensive” (but also more accurate) detectors, the cost function of the computing time is defined as follows:
While only running time is considered here as the computing resource whose cost needs to be accounted for, the method proposed by the present invention can be easily adapted to include additional resources, such as memory usage, CPU runtime, cloud computing costs and electricity consumption. The proposed method allows organizations to easily and automatically integrate all relevant costs into their decision making process.
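The exact cost function referred to above is not reproduced here; the following sketch only assumes one plausible formulation (a logarithmic damping of the runtime differences and a weighted sum over the additional resource types) to illustrate how such a cost term could be composed:

```python
# Assumed, illustrative cost composition (not the invention's actual formula).
import math

def time_cost(runtime_seconds: float, scale: float = 1.0) -> float:
    # Log scaling keeps a 44 s detector from costing ~60x more than a 0.7 s one.
    return scale * math.log(1.0 + runtime_seconds)

def total_resource_cost(runtime_s: float, memory_mb: float = 0.0,
                        cloud_usd: float = 0.0, energy_kwh: float = 0.0,
                        weights=(1.0, 0.0, 0.0, 0.0)) -> float:
    # Illustrative weighted sum combining the resource types listed above.
    w_time, w_mem, w_cloud, w_energy = weights
    return (w_time * time_cost(runtime_s) + w_mem * memory_mb
            + w_cloud * cloud_usd + w_energy * energy_kwh)
```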
Dataset for the Malware Detection Analysis
The dataset used by the present invention consists of 24,737 PE files, equally divided between malicious and benign files. All files were collected from the repositories of the network security department of a large organization in October 2018; it was impossible to determine the creation time of each file. Each file was analyzed using four different malware detectors.
The selection of detectors was guided by three objectives:
Following the above-mentioned objectives, four detectors were selected to be included in the present invention dataset: pefile, byte3g, opcode2g, and manalyze.
Pefile detector: This detector uses seven features extracted from the PE header, presented in [19]: DebugSize, ImageVersion, IatRVA, ExportSize, ResourceSize, VirtualSize2, and NumberOfSections. Using those features, a Decision Tree classifier was trained to produce the classification.
byte3g: This detector uses features extracted from the raw binaries of the PE file [17]. Firstly, it constructs trigrams (3-grams) of bytes. Secondly, it computes the trigrams term-frequencies (TF), which are the raw counts of each trigram in the entire file. Thirdly, the document-frequencies (DF) are calculated, which represent the rarity of a trigram in the entire dataset. Lastly, since the amount of features can be substantial (up to 2563), the top 300 DF-valued features were used for classification. Using the selected features, a Random Forest classifier with 100 trees was trained.
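A simplified sketch of such a byte 3-gram pipeline (trigram term frequencies per file, selection of the top 300 trigrams by document frequency, Random Forest with 100 trees) is shown below; the feature-selection and normalization details are assumptions and do not reproduce [17] exactly:

```python
# Loose sketch of a byte 3-gram feature pipeline and classifier.
from collections import Counter
from sklearn.ensemble import RandomForestClassifier

def byte_trigram_tf(raw_bytes: bytes) -> Counter:
    # Term frequencies: raw counts of every 3-byte sequence in the file.
    return Counter(raw_bytes[i:i + 3] for i in range(len(raw_bytes) - 2))

def top_df_trigrams(per_file_tf, k: int = 300) -> list:
    # Document frequency: in how many files each trigram appears; keep the top k.
    df = Counter()
    for tf in per_file_tf:
        df.update(tf.keys())
    return [trigram for trigram, _ in df.most_common(k)]

def to_vector(tf: Counter, vocabulary: list) -> list:
    return [tf.get(trigram, 0) for trigram in vocabulary]

# per_file_tf = [byte_trigram_tf(open(path, "rb").read()) for path in training_paths]
# vocabulary = top_df_trigrams(per_file_tf)
# X = [to_vector(tf, vocabulary) for tf in per_file_tf]
# clf = RandomForestClassifier(n_estimators=100).fit(X, labels)
```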
opcode2g: This detector uses features based on the disassembly of the PE file [16]. First, it disassembles the file and extracts the opcode of each instruction. Secondly, it generates bigrams (2-grams) representation of the opcodes. Thirdly, both the TF and DF values are computed for each bigram. Lastly, once again it selects the 300 features with the highest DF values. Using the selected features, a Random Forest classifier with 100 trees was trained.
manalyze: This detector is based on the open-source heuristic scanning tool Manalyze. It offers multiple types of static analysis capabilities for PE files, each implemented in a separate “plugin”. In the version used by the present invention, the following capabilities were included: packed executables detection, ClamAV and YARA signatures, detection of suspicious import combinations, detection of cryptographic algorithms, and the verification of Authenticode signatures. Each plugin returns one of three values: benign, possibly malicious, and malicious. Since Manalyze does not offer an out-of-the-box method for combining the plugin scores, a Decision Tree classifier with the plugins' scores as features was trained.
Detectors Performance Analysis
The performance of the various detectors was analyzed and compared. The effectiveness of various detector combinations was explored.
Overall Detector Performance
At the beginning, an analysis of the upper bound on the detection capability of the four detectors was performed. Table 1 below presents a breakdown of all files in the present invention's dataset as a function of the number of times they were incorrectly classified by the various detectors. All detectors were trained and tested using 10-fold cross-validation. Incorrect classification is defined as a confidence score above 0.5 for a benign file or one that is equal to or smaller than 0.5 for a malicious file.
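This misclassification rule can be written as a small helper (illustrative only):

```python
# A benign file is misclassified when its confidence score exceeds 0.5;
# a malicious file when its score is equal to or smaller than 0.5.
def is_misclassified(confidence_score: float, is_malicious: bool) -> bool:
    return confidence_score <= 0.5 if is_malicious else confidence_score > 0.5
```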
The results presented in Table 1 show that approximately 73% of all files are classified correctly by all detectors, while only 0.65% (160 files) are not detectable by any method.
This analysis leads to two conclusions: a) approximately 26.5% of the files in the dataset potentially require the use of multiple detectors to achieve correct classification; b) only a small percentage of files (1.6%) is correctly classified by a single classifier, which means that applying all four detectors to a given file is hardly ever required. These conclusions support the hypothesis that a cost-effective approach can use only a subset of the possible detectors.
Absolute and relative detector performance: The goal of this analysis is first to present the performance (i.e., detection rate) of each detector, and then to determine whether any classifier is dominated by another (thus making it redundant, unless it is more computationally efficient). The analysis begins by presenting the absolute performance of each detector. As can be seen in Table 2 above, the accuracy of the detectors ranges between 82.88% and 95.5%, with the more computationally-expensive detectors generally achieving the better performance.
Next, it was determined whether any detector is dominated by another. For each detector, the files it misclassified were analyzed, in order to determine whether they would be correctly classified by another detector. The results of this analysis, presented in Table 4 below, show that no detector is dominated. Moreover, the large variance in the detection rates of the other detectors on misclassified files further indicates that an intelligent selection of detector subsets (where the detectors complement each other) can yield high detection accuracy.
At the next stage, the confidence score distribution of the various detectors was analyzed. The goal of this analysis is to determine whether the detectors are capable of nuanced analysis; it has been hypothesized that detectors which produce multiple values on the [0,1] scale (rather than only “0”s and “1”s) might enable the DRL approach to devise more nuanced policies for selecting detector combinations. The results of the analysis are presented in
At the next stage, a comprehensive analysis of the performance and time consumption of all possible detector combinations was performed, and is presented in Table 3 above. To evaluate the performance of each combination, the confidence scores were aggregated using three different methods, presented in [12]. The first method, “or”, classifies a file as malicious if any of the participating detectors classifies it as such (yields a score of 0.5 and above). This method mostly improves the sensitivity, but at the cost of a higher percentage of false-positive indications, which leads to more benign files being classified as malware. The second method, “majority”, classifies according to the majority of the detectors' classifications. This means that if most of the detectors classify a file as malware, it will be classified as malware, and vice versa. The third method, “stacking”, learns to combine the classification confidence scores by training an ML model using these scores as its features. In the evaluation, two stacked models were used, Decision Tree (DT) and Random Forest (RF), which were evaluated using 10-fold cross-validation.
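The three aggregation methods can be sketched as follows (the stacking classifier shown is illustrative; the evaluation used both DT and RF stacked models):

```python
# Sketch of the "or", "majority" and "stacking" aggregation methods, operating
# on per-detector confidence scores (a score of 0.5 or above means "malicious").
from sklearn.tree import DecisionTreeClassifier

def aggregate_or(scores) -> bool:
    # Malicious if any participating detector flags the file.
    return any(score >= 0.5 for score in scores)

def aggregate_majority(scores) -> bool:
    # Malicious if most of the participating detectors flag the file.
    votes = sum(score >= 0.5 for score in scores)
    return votes > len(scores) / 2

def train_stacking(score_vectors, labels):
    # Learns to combine the confidence scores, using them as features.
    return DecisionTreeClassifier().fit(score_vectors, labels)
```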
The analysis shows that in the case of majority, the optimal performance is not achieved by combining all classifiers, but rather, only three of them. Furthermore, some detector combinations (manalyze, pefile, byte3g) outperform other detector sets, while also being more computationally efficient. The results further support the assumption that an intelligent selection of detector combinations is highly important.
For each file, the times were measured in an isolated computer process on a dedicated machine to prevent other processes interruptions. In addition, the machines executing the detectors were identical and utilized the same hardware and firmware specifications.
The method proposed by the present invention represents an attempt to draft a security policy by performing a cost-benefit analysis that takes into account the resources required to use various detectors. The performance of the proposed method was evaluated in several scenarios and its effectiveness was demonstrated. Moreover, it was shown that simple adjustments to the present invention algorithm's reward function (which reflects the organization's priorities) leads to significant changes in the detection strategy. This method is more effective (and intuitive) than conventional existing methods.
Three VMware ESXi servers were used, each containing two processing units (CPUs). Each server had a total of 32 cores, 512 GB of RAM and 100 TB of SSD disk space. Two servers were used to run the environment and its detectors, while the remaining server housed the DRL agent. In the experiments, two detectors were deployed in each server. This deployment setting can easily be extended to include additional detectors or replicated to increase the throughput of existing ones. The main goal in setting up the environment was to demonstrate a large scale implementation which is both scalable and flexible, thus ensuring its relevance to real-world scenarios.
Setup
The following settings were used in all the experiments:
Experimental Results
The proposed method has two major advantages:
a) it can produce near-optimal performance at reduced computational cost;
b) using rewards allows to easily define and tune the security policies by assigning a “personalized” set of detectors for each file.
To test the robustness of the proposed method, as well as its ability to generalize, five use-cases with varying emphasis on correct/incorrect file classifications and computational cost were defined. The rewards composition of each use-case is presented in Table 5 below, along with its overall accuracy and mean running time.
The computational cost of using a detector is not calculated independently, but rather as a function of correct/incorrect file classification. Additionally, the computational costs of the malware detectors were defined, based on the average execution time of the files that were used for training. This allows the algorithm to converge faster. The experiments show that this type of setting outperforms other approaches for considering computational cost, as it strongly ties the invested resources to the classification outcome.
Experiment 1: In this experiment, both the reward for correct classification and the cost of incorrect classification were set to be equal to the cost of the running time. On one hand, this setting “encourages” the proposed system to invest more time analyzing incoming files and also provides higher rewards for the correct classification of more challenging files. On the other hand, the agent is discouraged from selecting detector configurations that are likely to reduce its accuracy for a given file. Additionally, the method proposed by the present invention will not be inclined to consume additional resources for difficult-to-classify cases, where the investment of more time is unlikely to provide additional information.
Experiment 2: The setting of this experiment is similar to that of experiment 1, except for the fact that the cost of incorrect classifications is 10× higher than the reward for correct classifications. It has been assumed that this setting will cause the algorithm to be more risk-averse and invest additional resources for the classification of challenging files.
Experiments 1 and 2 were not designed to assign high priority to resource efficiency, but instead, they focus on accuracy. The remaining experimental settings were designed to assign higher preference to the efficient use of resources.
Experiments 3-5: In this set of experiments, policies were examined in which the reward assigned to correct classification was fixed, while the cost of incorrect classification depends on the amount of computing resources spent to reach the classification decision. Three variants of this method were explored, where the cost of incorrect classification remains the same, but the rewards for correct classifications are larger by one and two orders of magnitude (1, 10, and 100).
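The structure of these reward settings can be sketched as follows (illustrative only; the exact values and the time-cost formula of Table 5 are not reproduced here):

```python
# Illustrative structure of the reward settings: experiments 1-2 tie both the
# reward for a correct classification and the cost of a mistake to the elapsed
# analysis time, while experiments 3-5 fix the correct-classification reward
# (1, 10, 100) and keep only the misclassification cost time-dependent.
def make_reward_fn(correct_reward, error_cost):
    # correct_reward / error_cost are callables mapping elapsed time to a value.
    def reward(correct: bool, elapsed_time: float) -> float:
        return correct_reward(elapsed_time) if correct else -error_cost(elapsed_time)
    return reward

exp1 = make_reward_fn(lambda t: t, lambda t: t)        # both terms follow running time
exp2 = make_reward_fn(lambda t: t, lambda t: 10 * t)   # mistakes cost 10x the reward
exp3 = make_reward_fn(lambda t: 1.0, lambda t: t)      # fixed reward, time-dependent cost
exp4 = make_reward_fn(lambda t: 10.0, lambda t: t)
exp5 = make_reward_fn(lambda t: 100.0, lambda t: t)
```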
This set of experiments has two main goals. First, since only the cost of an incorrect classification is time-dependent, experiments 3-5 are expected to be more efficiency-oriented; the aim is to determine the size of this improvement and its effect on the accuracy of the proposed method. Secondly, there was an interest in exploring the effect of varying reward/cost ratios on the policy generated by the proposed method. Since scenarios in which the reward for correct classifications is either significantly smaller or larger than the cost of incorrect classifications have been explored, the expectation was to obtain a better understanding of the proposed decision mechanism.
A summary of the results is presented in Table 5, while a detailed comparison of the results obtained by the various experiments is shown in Tables 7-10. In addition, a detailed breakdown of the detector combinations used by each of the generated DRL policies is shown in Table 6.
Generally, the results show that the proposed method is capable of generating highly effective detection policies. The policies generated in experiments 1-2 outperformed all the methods presented in the baseline, except for the top-performing policy, which is a combination of all classifiers and the Random Forest algorithm. While this baseline method marginally outperforms the proposed method (96.86% compared with 96.81% and 96.79% for experiments 1 and 2, respectively), it is also slightly more computationally expensive (49.74 seconds on average, compared with 49.63 and 49.58 seconds for experiments 1 and 2, respectively). These results are as expected, since the policies defined for experiments 1 and 2 were geared towards accuracy, rather than efficiency.
Each of the policies generated by experiments 3-5 achieves a different accuracy/efficiency balance. Moreover, each of the three policies was able to reach accuracy results that are equal to, or better than, those of the corresponding baselines at a much lower cost.
The policy generated by experiment 3 reached an accuracy of 96.21% with a mean time of 10.5 seconds, compared with its closest baseline “neighbor”, which achieved an accuracy of 96.3% in a mean time of 48.28 seconds (almost five times longer). Similarly, the policy produced by experiment 4 achieved the same accuracy as its baseline counterpart (pefile, opcode2g) while requiring only 3.68 seconds on average, compared to the baseline's 45 seconds (a 92% improvement). The policy generated by experiment 5 requires 0.728 seconds per file on average, which is comparable to the time required by the baseline method “pefile”. However, the method proposed by the present invention achieves higher accuracy (91.22% vs. 90.6%).
The experiments clearly show that security policy can be very effectively managed using different cost/reward combinations. Moreover, it is clear that using DRL offers much greater flexibility in the shaping of the security policy, than the simple tweaking of the confidence threshold (the only currently available method for most ML-based detection algorithms).
When analyzing the behavior of the policies (i.e., the detector selection strategy), it was found that they behaved just as could be expected. The policies generated by experiments 1 and 2 explicitly favored performance over efficiency, as the reward for correct classification was also time-dependent. As a result, they achieve very high accuracy but only a marginal improvement in efficiency.
For experiments 3-5, the varying fixed reward that was assigned to correct classifications played a deciding role in shaping the policy. In experiment 3, the relative cost of a mistake was often much larger than the reward for a correct classification. Therefore, the generated policy is cautious, while achieving relatively high accuracy (with high efficiency). In experiment 5, the cost of an incorrect classification is relatively marginal, a fact that motivates the generated policy to prioritize speed over accuracy. The policy generated by experiment 4 offers the middle ground, reaching a slightly reduced accuracy compared with experiment 3, but managing to do so in about 33% of the running time.
The main advantage of the proposed method is therefore the ability to craft a “personalized” set of detectors for each file.
Malware Detection Techniques
The vast variety of ways for representing a PE file allows using different features for malware detection classification models.
The most common and simple way of representing a PE file is by calculating its hash value [9]. Hash values are generated using special functions, namely hash functions, that map data of arbitrary size onto data of a fixed size (commonly represented by numbers and letters). This method is frequently used by anti-virus engines to “mark” and identify malware, as computing hashes is considered fast and efficient.
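For example, a file's hash value can be computed with a standard hash function (SHA-256 is shown here merely as a common choice):

```python
# Computing a file's hash value using the standard library.
import hashlib

def file_hash(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()
```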
Additionally, a PE file can be represented using its actual binary data. For example, using byte n-grams (an n-gram is a data structure, originating in computational linguistics, represented by a contiguous sequence of n items usually drawn from a text or speech) to classify malware was suggested by [17]. Thus, instead of generating n-grams out of words or characters, [17] suggested generating n-grams out of bytes, examining different sizes of n-grams ranging from 3 to 6, as well as three feature selection methods. They conducted numerous experiments with four types of models: artificial neural network (ANN), Decision Tree (DT), naïve Bayes (NB) and Support Vector Machine (SVM). DT was able to achieve the best accuracy of 94.3%, with less than 4% false-positives.
Another type of feature is generated using the disassembly of a PE file [16]. A disassembler is a computer program that translates code from machine language to the assembly programming language. The translated code includes, among other things, operation codes (opcodes), which are computer instructions that define the operations to be performed and often include one or more operands which the instructions operate upon. The use of opcode n-grams to classify malware was suggested by [16]. They examined different sizes of n-grams ranging from 3 to 6, as well as three feature selection methods. To classify the files, they used several models, such as ANN, DT, Boosted DT, NB and Boosted NB. The best results were achieved by the DT and the Boosted DT models, with more than 93% accuracy, less than 4% false-positives and less than 17% false-negatives.
Lastly, the PE format (i.e., metadata) can be used to represent the PE file [1, 5, 19]. The format of PE files has a well-defined structure, which includes information necessary to the execution process, as well as some additional data (such as versioning info and creation date). For example, seven features extracted from the PE headers were used by [19] to classify malicious files: DebugSize, ImageVersion, IatRVA, ExportSize, ResourceSize, VirtualSize2, and NumberOfSections. To evaluate performance, various classification models were used, including IBK, Random Forest, J48, J48 Graft, Ridor and PART.
Their results showed similar performance for all classifiers, reaching an accuracy of up to 98.56% and a false-positive rate as low as 5.68%.
Reinforcement Learning in Security Domains
Reinforcement learning is used in various security domains, such as adversarial learning and malware detection. For evading malware detection through adversarial learning, the system of [1] used RL to attack static PE anti-malware engines by equipping the agent with a set of functionality-preserving operations on malicious files. The agent learns through a series of games played against the anti-malware engine. In the malware detection domain, the system of [3] showed a proof of concept for adaptive rule-based malware detection employing learning classifier systems, which combine a rule-based expert system with a learning component. They used VirusTotal as a PE file malware classifier and different static PE file features, using an RL algorithm to decide whether a PE file is malicious or not.
The system of [15] used RL for classifying the type of malware, using features employed by anti-viruses. Another example in the malware detection domain is the system of [31], which used RL for optimizing the detection of malicious mobile application behavior on mobile devices by controlling the offloading rate of application traces to the security server. They proposed an offloading strategy based on the deep Q-network technique with a deep convolutional neural network, in order to improve the detection speed.
The RL-based method proposed by the present invention for malware detection dynamically and iteratively assigns various detectors to each file, while constantly performing cost-benefit analysis to determine whether the use of a given detector is “worth” the expected reduction in classification uncertainty. The entire process is governed by the organizational policy, which sets the rewards/costs of correct and incorrect classifications and also defines the cost of computational resources.
When compared to existing ensemble-based solutions, the proposed method has two main advantages. Firstly, it is highly efficient, since easy-to-classify files are likely to require the use of less powerful classifiers, which allows maintaining near-optimal performance at a fraction of the computing cost. As a result, it is possible to analyze a much larger number of files without increasing hardware capacity. Secondly, organizations can clearly and easily define and refine their security policy by explicitly defining the costs of each element of the detection process: correct/incorrect classification and resource usage. Since the value of each outcome is clearly quantified, organizations can easily try different values and fine-tune the performance of their models to comply with the desired outcome.
Although the above examples were directed to malware detection, the proposed method can be implemented in many technological fields, such as medical tests, detection of various illnesses or medical conditions, diagnostic tests, and maintenance facilities for vehicles, ships, drones and planes, all of which require testing and analysis in order to optimize the usage of resources and achieve multi-objective tasks with contradicting constraints.
The proposed method may be used to perform several initial tests of modules or components of a system, in order to detect the possibility of an existing problem, and to decide whether or not to conduct additional (and usually more expensive) tests, as required to meet predetermined needs or organizational policy. This can be applied to any diagnostic process that requires decision making and optimization between contradicting constraints, in order to fulfill a desired policy.
For example, the field of medical diagnostics requires performing medical tests in order to decide which treatment should be given to a patient. However, some tests are cheaper and less accurate, while other tests are more expensive and accurate. In this case, the doctor should decide which tests are essential for obtaining a good indication regarding the patient's condition. The method proposed by the present invention allows doctors to automatically obtain a minimal set of optimal tests to be performed, in order to obtain a fast and accurate diagnostic indication regarding a patient's condition, while eliminating unnecessary expensive tests.
In fact, the method proposed by the present invention may be applied to almost any diagnostic field. For example, which tests and inspections should be made to obtain an accurate assessment regarding the mechanical condition of a vehicle, an airplane or a ship, and which maintenance operations should be taken in order to keep them safe and operative. According to the present invention, a garage can perform some initial tests to detect the possibility of a problem, and then conduct additional (and more expensive) tests only if required.
Also, several sensors may be activated for diagnostic purposes. For example, it is possible to activate various sensors in a mobile device (such as GPS, angle, speed, temperature, etc.) to obtain a desired indication. However, each sensor provides some data but consumes a different amount of battery power. Applying the method proposed by the present invention allows obtaining the desired indication with sufficient accuracy, while striking the right balance to save battery power. The same applies to unmanned drones that need to detect errors during flight; the proposed method helps to optimally allocate their limited resources.
The proposed method can be used for other possible applications, such as detection of malicious websites, fraud detection, evaluating credit risks, routine inspections, optimizing the operation of distributed micro-power grids and predicting power demands along with their timing, traffic and transportation control for optimizing traffic volume and in any environment that requires multi-objective optimization.
According to another embodiment, the proposed system allows transferring learning from agent to agent. Accordingly, an agent trained in a first environment has a transferability feature enabling it to function in a second environment, based on its training in the first environment. This saves the need to train the agent again and allows using the agent's training to operate in the other environment with minimal adaptation.