The present invention relates to detecting peer-to-peer (P2P) botnets; more particularly, to an unsupervised algorithm that finds large numbers of flows having similar behaviors in order to mark out known or unknown botnets.
Existing prior arts for finding botnets mostly rely on pre-defined rules. A warning is issued only if the rules are met, so unknown malware is neither marked out nor filtered. For example, a prior art provides a method of identifying P2P botnets by using a statistical analysis of small flows. This prior art analyzes the Netflow log to classify network flows into in-flow sets and out-flow sets. A sliding window is used as a base to determine similar behaviors of botnets. However, thresholds for determining botnet activity must be pre-defined, and the appropriate threshold may vary for each botnet. Furthermore, a technical process of combining sessions for determining similarity is not revealed. U.S. Pat. No. 8,762,298 B1, ‘Machine learning based botnet detection using real-time connectivity graph based traffic features’, mainly detects command and control (C&C) botnets. In a graph-based way, it determines whether any IP communicates with C&C servers. However, this prior art requires historical information to accurately determine whether any malicious behavior occurs. U.S. Patent 20170251005 A1, ‘Techniques for botnet detection and member identification’, is a method for determining whether a host communicates with a botnet member. Botnet members are recorded in a historical data table. If a host communicates with more than one botnet member, it is suspected of malicious behavior. Another prior art provides a credibility-based method of detecting malicious behaviors for a network having high-volume flows. This prior art is an online method. Netflow features are directly used to calculate the p-value against a known malicious behavior matrix. If the p-value lies within a certain range, the host most likely behaves maliciously. Another prior art provides a method of detecting botnets based on the Netflow log and the DNS log. 
Through a monitoring technology of abnormal flows, collected Netflow data are quickly processed through correlational analysis. Yet, this prior art has the disadvantage of requiring the DNS log in addition to the Netflow log. Another prior art provides a method of detecting abnormal flows. A fixed sliding window is used for online detection, and abnormal flows are detected under a certain trigger condition. Yet, this prior art has the disadvantage of defining the detection condition in advance instead of finding the flows having similar behaviors, which matters because a large number of behavior patterns of the same kind are most likely caused by botnet activities. Another prior art provides a method, a device and a processor for detecting botnets. An average total of packet bytes and an average total of bytes per second are calculated as communication features. Grouping rules are preset for clustering. Yet, this prior art has the disadvantages of not using the features retrieved from the Netflow log, the behavior features of botnet viruses, or the setting of grouping thresholds for detecting botnets.
From the above prior arts, it is known that current methods for botnet detection mostly use features of flows directly for finding similarity without combining flows into sessions in advance. In addition, current research is all based on experimental data sets such as ISCX and CTU13. There are few related studies on P2P botnet analysis with actual mass flows. Another prior art provides a method of cooperative botnet detection based on FedMR. However, its step of Ranking and Association is hard to practice in a cooperative way, and it does not provide complete processes. Hence, the prior arts do not fulfill all users' requests on actual use.
The main purpose of the present invention is to provide a method of building session information to analyze botnet behaviors for detecting P2P botnets on Netflow.
Another purpose of the present invention is to use big data for development and to be implemented on a MapReduce platform, where the present invention is verified with real data to withstand a Netflow log of up to 1 terabyte.
Another purpose of the present invention is to provide a complete two-month log of actual network flows of a university for testing along with a real blacklist for validation, where the present invention proves that its reliability is higher than 95% for effectively strengthening the protection of national information security.
To achieve the above purposes, the present invention is a method of detecting P2P botnet based on Netflow sessions, comprising steps of session extraction, filtering, grouping, and reverse lookup, where a Netflow log is inputted; each record in the log is a unidirectional flow; data inputted from said log comprises a timestamp, a source IP (Src IP, IP=Internet Protocol address), a destination IP (Dst IP), a port number and a packet total; a time-interval threshold is used to be a standard to combine the unidirectional flows into bidirectional sessions; a flow and another flow followed adjacently in a communication between two IPs are defined as in the same period and combined into a session when a time interval between the two flows does not exceed the time-interval threshold; features of the two flows of the session are combined and computed to obtain a plurality of the features highlighting communication behaviors; feature ranking is processed with the features of the session to obtain outstanding ones of the features through information gain to obtain a feature vector (FV) of the session to process subsequent detection; the filtering comprises two sub-steps, including whitelist filtering and flow loss-response filtering; a whitelist and a loss rate are used to be standards to filter out normal flows and non-P2P communication-behavior flows; the grouping comprises three levels of grouping, including a first level of SuperSession grouping, a second level of SessionGroup grouping and a third level of BehaviorGroup grouping; a group of IPs are defined as carrying suspicious virus of P2P botnet according to virus behaviors of P2P botnet along with a distance threshold and a group total threshold; and a blacklist is used to directly and indirectly process verification to obtain a suspicious IP list through reverse lookup. Accordingly, a novel method of detecting P2P botnet on Netflow is obtained.
The present invention will be better understood from the following detailed description of the preferred embodiment according to the present invention, taken in conjunction with the accompanying drawings, in which
The following description of the preferred embodiment is provided to understand the features and the structures of the present invention.
Please refer to
(a) Session extraction [11]: Unidirectional Netflow data are combined into bidirectional data according to source IP (Src IP, IP=internet protocol address), destination IP (Dst IP), port number and time-interval threshold for highlighting communication features between IPs.
(b) Filtering [12]: Two sub-steps, whitelist filtering [121] and flow loss-response (FLR) filtering [122], are included. A whitelist and a loss rate are used as standards for filtering out normal flows and flows of non-P2P communication behaviors.
(c) Grouping [13]: The grouping [13] comprises three levels of grouping, including a first level of SuperSession grouping [131], a second level of SessionGroup grouping [132] and a third level of BehaviorGroup grouping [133]. A group of IPs are defined as IPs carrying suspicious virus of P2P botnet based on virus behaviors of P2P botnet, a distance threshold and a group total threshold.
(d) Reverse lookup [14]: A blacklist is used to directly and indirectly process verification for obtaining a suspicious IP list through reverse lookup.
Thus, a novel method of detecting P2P botnet based on Netflow sessions is obtained.
The above steps are processed step by step for detecting botnet. The following are details and data formats.
In step (a), the Netflow log is inputted, where each record in the log is a unidirectional flow; and data inputted from the log comprises a timestamp, a Src IP, a Dst IP, a port number and a packet total. However, the unidirectional flows do not highlight communication features. Therefore, in step (a) Session extraction [11], a time-interval threshold is used as a standard for combining the unidirectional flows into bidirectional sessions. The time-interval threshold comprises a Transmission Control Protocol (TCP) sub-threshold of 22 seconds (sec) and a User Datagram Protocol (UDP) sub-threshold of 21 sec. When a time interval between a flow and the next adjacent flow in a communication between two IPs does not exceed the time-interval threshold, the two flows are defined as being in the same period and are combined into a session. Features of the flows of the session are combined and computed to obtain the features highlighting communication behaviors of the session. The features of the session are ranked by information gain to obtain outstanding features of the session. The following Table 1 shows a feature vector (FV). The present invention ranks 20 features, of which 14 features (*) are selected to form the FV of the session for subsequent detection. The number of features selected is flexible, and any combination of features may be used for the subsequent detection.
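The session extraction described above may be sketched as follows. This is a minimal illustration and not the implementation of the present invention: the `Flow` and `Session` types, the `proto` field, and the choice of the first flow's sender as the session initiator are assumptions made for the example; only the two sub-thresholds of 22 sec and 21 sec come from the description.

```python
from collections import defaultdict
from dataclasses import dataclass

# Sub-thresholds stated in the description, in seconds.
GAP_THRESHOLD = {"TCP": 22.0, "UDP": 21.0}

@dataclass
class Flow:                 # one unidirectional Netflow record (hypothetical layout)
    ts: float               # timestamp in seconds
    src: str                # Src IP
    dst: str                # Dst IP
    port: int               # port number
    packets: int            # packet total
    proto: str = "TCP"      # assumed field, needed to pick the sub-threshold

@dataclass
class Session:              # one bidirectional session (hypothetical layout)
    initiator: str
    peer: str
    port: int
    proto: str
    start: float
    end: float
    out_flows: int = 0      # flows sent by the initiator
    in_flows: int = 0       # flows sent back by the peer
    packets: int = 0

def extract_sessions(flows):
    """Combine unidirectional flows between two IPs into bidirectional sessions."""
    buckets = defaultdict(list)
    for f in flows:
        # Both directions of a conversation share one unordered key.
        key = (min(f.src, f.dst), max(f.src, f.dst), f.port, f.proto)
        buckets[key].append(f)

    sessions = []
    for (_, _, port, proto), group in buckets.items():
        group.sort(key=lambda f: f.ts)
        current = None
        for f in group:
            # A gap that exceeds the threshold starts a new session.
            if current is None or f.ts - current.end > GAP_THRESHOLD[proto]:
                current = Session(f.src, f.dst, port, proto, f.ts, f.ts)
                sessions.append(current)
            if f.src == current.initiator:
                current.out_flows += 1
            else:
                current.in_flows += 1
            current.packets += f.packets
            current.end = f.ts
    return sessions
```

The per-session counters kept here (out-flows, in-flows, packet total) stand in for the combined features; the actual features of Table 1 would be computed at the same point.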
Therein, the present invention calculates the totals of in-flows and out-flows to define an FLR rate of the sessions for determining P2P communication behaviors. In step (b) Filtering [12], two sub-steps are processed. At first, the sub-step of whitelist filtering [121] uses a whitelist to delete the sessions of known benign IPs, such as domain name system (DNS) servers or well-known web sites. Then, the sub-step of FLR filtering [122] filters out the sessions whose communication behaviors do not have P2P features. A pseudo code of the two sub-steps for MapReduce platform is shown in
The pseudo code of the sub-step of whitelist filtering [121] is shown in
A first part of the pseudo code of the sub-step of FLR filtering [122] is shown in
A second part of the pseudo code of the sub-step of FLR filtering [122] is shown in
A third part of the pseudo code of the sub-step of FLR filtering [122] is shown in
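The two filtering sub-steps may be illustrated by the following single-machine sketch, which is not the MapReduce pseudo code referred to above. The description does not give the exact FLR formula, so the definition used here (the share of outgoing flows that received no answering flow) is an assumption, as are the dictionary keys of a session. The threshold of 0.225 is the value reported for the experiment.

```python
def flow_loss_response_rate(out_flows, in_flows):
    """Assumed FLR definition: fraction of outgoing flows left unanswered."""
    if out_flows == 0:
        return 0.0
    return max(out_flows - in_flows, 0) / out_flows

def filter_sessions(sessions, whitelist, flr_threshold=0.225):
    """Apply whitelist filtering, then FLR filtering, to a list of sessions.

    Each session is a dict with the hypothetical keys
    'src', 'dst', 'out_flows' and 'in_flows'.
    """
    kept = []
    for s in sessions:
        # Sub-step 1: drop sessions that involve a known benign IP.
        if s["src"] in whitelist or s["dst"] in whitelist:
            continue
        # Sub-step 2: keep only sessions whose FLR is higher than the threshold.
        if flow_loss_response_rate(s["out_flows"], s["in_flows"]) > flr_threshold:
            kept.append(s)
    return kept

# Usage example with made-up addresses.
sessions = [
    {"src": "10.0.0.1", "dst": "8.8.8.8", "out_flows": 4, "in_flows": 0},
    {"src": "10.0.0.1", "dst": "192.0.2.7", "out_flows": 4, "in_flows": 4},
    {"src": "10.0.0.1", "dst": "192.0.2.9", "out_flows": 4, "in_flows": 1},
]
remaining = filter_sessions(sessions, whitelist={"8.8.8.8"})
```

In this example only the session toward 192.0.2.9 remains: the first session is removed by the whitelist and the second has every outgoing flow answered.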
The present invention processes the three levels of grouping in step (c) Grouping [13] by using the following features of P2P botnet: (1) the repeating connections with peers; (2) the connections with other peers; and (3) similar communication behaviors between P2P botnets. To obtain similar communication behaviors, the Euclidean distance is used to calculate a distance between the FVs of two of the sessions. In fact, any distance measure between two feature vectors may be used instead. The three levels of grouping are processed based on a total of the sessions having similar communication behaviors with the distances exceeding a distance threshold (which is 3 by default).
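For illustration, the Euclidean distance between two FVs can be computed as in the sketch below. The function name and the two-element vectors in the usage lines are made up for the example; a real FV would carry the selected features of the session.

```python
import math

def euclidean_distance(fv_a, fv_b):
    """Euclidean distance between two feature vectors of equal length."""
    if len(fv_a) != len(fv_b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(fv_a, fv_b)))

# Usage example with two toy feature vectors.
d = euclidean_distance([0.0, 0.0], [3.0, 4.0])
```

Because features such as byte counts and durations can differ by orders of magnitude, some scaling of the features before the distance is taken may be needed in practice; the description does not state whether any is applied.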
As described above, in the first level of SuperSession grouping [131] in step (c) Grouping [13], the repeating communications with peers as a feature of P2P botnet is used for grouping. In
The pseudo code of the first level of grouping of step (c) Grouping [13] is shown in
In the second level of SessionGroup grouping [132] in step (c) Grouping [13], the communications with other peers as a feature of P2P botnet is used for grouping. In
The pseudo code of the second level of grouping of step (c) Grouping [13] is shown in
At last, in the third level of BehaviorGroup grouping [133] in step (c) Grouping [13], the feature of similar communication behaviors between P2P botnets is used for grouping. In
The pseudo code of the third level of grouping of step (c) Grouping [13] is shown in
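Since the pseudo code of the three levels is given only in the drawings, the following is a rough sketch of how such a three-level grouping could be arranged, under several assumptions that are not taken from the description: a greedy clustering is used, two FVs are treated as similar when their distance stays within the distance threshold, the representative FV of a group is the running mean of its members, and a group is passed to the next level only when it holds more than `min_total` items. The default values mirror the parameters reported for the experiment.

```python
import math
from collections import defaultdict

def cluster(items, dist_threshold):
    """Greedy clustering of (fv, members) pairs around representative FVs."""
    groups = []
    for fv, members in items:
        for g in groups:
            if math.dist(g["fv"], fv) <= dist_threshold:  # assumed similarity test
                n = g["n"]
                # Update the representative FV as a running mean.
                g["fv"] = [(r * n + x) / (n + 1) for r, x in zip(g["fv"], fv)]
                g["n"] = n + 1
                g["members"] |= members
                break
        else:
            groups.append({"fv": list(fv), "n": 1, "members": set(members)})
    return groups

def three_level_grouping(sessions, dist_threshold=2.0, min_total=3):
    """sessions: iterable of (src_ip, dst_ip, fv); returns sets of Src IPs."""
    def keep(groups):
        # Only groups with more than min_total items are carried forward.
        return [(g["fv"], g["members"]) for g in groups if g["n"] > min_total]

    # Level 1 (SuperSession): repeated sessions between the same pair of IPs.
    by_pair = defaultdict(list)
    for src, dst, fv in sessions:
        by_pair[(src, dst)].append((fv, {src}))
    by_src = defaultdict(list)
    for (src, _dst), items in by_pair.items():
        by_src[src].extend(keep(cluster(items, dist_threshold)))

    # Level 2 (SessionGroup): one Src IP behaving alike toward several peers.
    candidates = []
    for items in by_src.values():
        candidates.extend(keep(cluster(items, dist_threshold)))

    # Level 3 (BehaviorGroup): different Src IPs sharing similar behavior.
    return [members for _fv, members in keep(cluster(candidates, dist_threshold))]
```

Each returned set holds the Src IPs of one BehaviorGroup, which is the input expected by the subsequent reverse lookup.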
The mode of operation according to the present invention is described above. The following is an experiment on the feasibility of the present invention using an actual Netflow log. The present invention processes verification in coordination with the VirusTotal service to directly and indirectly determine whether the IPs selected are suspicious IPs. The present invention uses a 61-day Netflow log of a university (a total of 242 gigabytes (GB) for 930,915 IPs), inputted in units of per-week records for detection. The FLR has to be higher than 0.225 and the distance threshold is set to 2. The grouping [13] clusters and updates representative FVs only when the total of items in a clustered group is more than 3. The Netflow log and the detection parameters are shown in Table 2 as follows:
For verification, the BehaviorGroups generated after the third level of grouping are directly verified with their Src IPs by using the blacklist (from VirusTotal, but not limited thereto). If more than five of the Src IPs in a BehaviorGroup exist in VirusTotal, all IPs in the entire BehaviorGroup are regarded as suspicious IPs behaving maliciously. After the three levels of grouping, the clustered groups have similar FVs. This means that, although some IPs are not included in the VirusTotal blacklist, these IPs behave the same as malicious IPs. Therefore, they are still regarded as IPs behaving maliciously. The data set obtained after the above processes of filtering and grouping is verified directly and indirectly; and the result, including per-week data size, IP total, etc., is shown in Table 3. Detected IP Total is the total of IPs in all the BehaviorGroups after removing the repeated ones; Directed IP Total is the total of IPs directly existing in VirusTotal; and Verified IP Total is the total of IPs in all the BehaviorGroups determined as behaving maliciously after removing the repeated ones. As seen in the result, the precisions are all above 90 percent, which supports the effectiveness of detection according to the present invention.
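The direct and indirect verification can be sketched as follows. The function and key names are hypothetical, the blacklist is represented as a plain set of IPs rather than a VirusTotal query, and taking precision as Verified IP Total divided by Detected IP Total is an assumption, since the description does not state the formula.

```python
def verify_groups(behavior_groups, blacklist, min_hits=5):
    """Verify BehaviorGroups (iterables of Src IPs) against a blacklist set."""
    detected, direct, verified = set(), set(), set()
    for group in behavior_groups:
        ips = set(group)
        hits = ips & blacklist          # direct verification
        detected |= ips
        direct |= hits
        # Indirect verification: more than min_hits blacklisted members
        # mark every IP of the group as behaving maliciously.
        if len(hits) > min_hits:
            verified |= ips
    precision = len(verified) / len(detected) if detected else 0.0
    return {
        "detected_ip_total": len(detected),
        "direct_ip_total": len(direct),
        "verified_ip_total": len(verified),
        "precision": precision,         # assumed: verified / detected
        "suspicious_ips": verified,
    }

# Usage example with made-up addresses.
group_a = [f"10.0.0.{i}" for i in range(1, 9)]    # 8 IPs, 6 of them blacklisted
group_b = ["10.0.1.1", "10.0.1.2"]                # 2 IPs, 1 of them blacklisted
blacklist = {f"10.0.0.{i}" for i in range(1, 7)} | {"10.0.1.1"}
report = verify_groups([group_a, group_b], blacklist)
```

In this example the first group is marked as a whole because six of its members are blacklisted, while the second group has only one hit and is left unverified.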
Currently, every nation regards information security as an important national security issue. The present invention provides a method for detecting P2P botnets on Netflow with an unsupervised algorithm based on Netflow. Session information is built by analyzing botnet behaviors to find large numbers of flows having similar behaviors. Thus, known or unknown botnets can be marked out. The present invention uses big data for development and is implemented on a MapReduce platform. The whole process is more complete than existing prior arts. A complete two-month log of actual flows of a university is provided for experiment along with a real blacklist for validation. By the result, the present invention is verified to withstand a Netflow log of up to 1 terabyte. Accordingly, the present invention proves that its reliability (more than 95%) is higher than the other prior arts for effectively strengthening the protection of national information security.
To sum up, the present invention is a method of detecting P2P botnet based on Netflow sessions, where an unsupervised algorithm based on Netflow is used to build session information by analyzing botnet behaviors for finding large numbers of flows having similar behaviors; known or unknown botnets can be marked out; and the present invention proves that its reliability (more than 95%) is higher than the other prior arts for effectively strengthening the protection of national information security.
The preferred embodiment herein disclosed is not intended to unnecessarily limit the scope of the invention. Therefore, simple modifications or variations belonging to the equivalent of the scope of the claims and the instructions disclosed herein for a patent are all within the scope of the present invention.