A portion of the disclosure of this patent document and its attachments contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyrights whatsoever.
Exemplary embodiments generally relate to electrical computers and, more particularly, to heuristic prediction.
Heuristics may be used to solve difficult problems. Computer science uses heuristic algorithms to produce acceptable solutions to challenging problems. Heuristics, for example, could be used to improve computer networks.
The features, aspects, and advantages of the exemplary embodiments are better understood when the following Detailed Description is read with reference to the accompanying drawings, wherein:
The exemplary embodiments will now be described more fully hereinafter with reference to the accompanying drawings. The exemplary embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. These embodiments are provided so that this disclosure will be thorough and complete and will fully convey the exemplary embodiments to those of ordinary skill in the art. Moreover, all statements herein reciting embodiments, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future (i.e., any elements developed that perform the same function, regardless of structure).
Thus, for example, it will be appreciated by those of ordinary skill in the art that the diagrams, schematics, illustrations, and the like represent conceptual views or processes illustrating the exemplary embodiments. The functions of the various elements shown in the figures may be provided through the use of dedicated hardware as well as hardware capable of executing associated software. Those of ordinary skill in the art further understand that the exemplary hardware, software, processes, methods, and/or operating systems described herein are for illustrative purposes and, thus, are not intended to be limited to any particular named manufacturer.
As used herein, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless expressly stated otherwise. It will be further understood that the terms “includes,” “comprises,” “including,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being “connected” or “coupled” to another element, it can be directly connected or coupled to the other element or intervening elements may be present. Furthermore, “connected” or “coupled” as used herein may include wirelessly connected or coupled. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.
It will also be understood that, although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first device could be termed a second device, and, similarly, a second device could be termed a first device without departing from the teachings of the disclosure.
The software algorithm 30 analyzes the set 34 of heuristic rules. The software algorithm 30 evaluates the accuracy of each heuristic rule using ground truth. For example, the software algorithm 30 may determine that heuristic H1 is correct 80% of the time. Given the set 34 of heuristic rules, the software algorithm 30 may combine the heuristic rules to arrive at an answer 36 to the question 32 within a confidence level 38. The software algorithm 30, in other words, produces the answer 36 within a probability.
Some examples help explain the use of heuristics. Suppose the server 20 receives the input 26 as NetFlow information from a router. As those of ordinary skill in the art understand, NetFlow is a network protocol for collecting Internet Protocol traffic information. Other inputs could include firewall logs, packet traces, and sensor readings. The input 26, in fact, may be any stream of data. Regardless, the question 32 might be “what services are running in the communications network 24?” The question 32 may also be “which machines (if any) are infected by a virus?” or “which machines are part of a botnet?” Other questions may include “did a human move through a field instrumented by motion sensors?” One problem, of course, is that it is hard or even impossible to get the right kind of information to produce the answer 36 to the question 32. If all the software algorithm 30 receives is the input 26, then the software algorithm 30 must develop the answer 36 using the given input 26. Another problem is that for any observation, there is a chance that the observation is normal behavior (e.g., a user who sends a lot of email messages versus a machine infected by a spam bot, or a deer in the field instead of a human). The software algorithm 30 may thus assume that each heuristic returns a true or false (or a specific yes/no, such as “is machine X infected by a virus?”). The answer 36 may thus indicate that heuristic H1 is correct 80% of the time (in other words, if machine X is indeed infected, then heuristic H1 will detect that with 80% probability). By combining all the heuristics, the software algorithm 30 provides a combined estimate for the question 32. Continuing with the virus example, based on the combined heuristics, the software algorithm 30 determines with 99% probability that server X is infected by a virus.
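The way several individually unreliable yes/no heuristics can yield a high combined probability may be illustrated with a minimal sketch. The sketch assumes the heuristics are conditionally independent; the function name `combine_heuristics`, the false-alarm rate of 10%, and the prior of 0.5 are hypothetical values chosen for illustration and do not come from the disclosure.

```python
from math import prod


def combine_heuristics(prior, votes):
    """Return P(hypothesis | votes), assuming conditionally independent heuristics.

    votes: list of (fired, p_fire_if_true, p_fire_if_false) tuples, where
    p_fire_if_true is the chance the heuristic fires when the hypothesis holds
    and p_fire_if_false is the chance it fires when the hypothesis does not hold.
    """
    like_true = prod(p_t if fired else 1.0 - p_t for fired, p_t, _ in votes)
    like_false = prod(p_f if fired else 1.0 - p_f for fired, _, p_f in votes)
    evidence = like_true * prior + like_false * (1.0 - prior)
    return like_true * prior / evidence


# Hypothetical numbers: three heuristics that each fire 80% of the time on an
# infected machine and 10% of the time on a clean one, and all three fired.
votes = [(True, 0.8, 0.1)] * 3
posterior = combine_heuristics(prior=0.5, votes=votes)
print(round(posterior, 4))
```

Under these assumed numbers, three agreeing 80%-accurate heuristics already push the combined estimate above 99%, which is the kind of effect the virus example describes.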
The server 20 receives the input 26 from the source device 22. The software algorithm 30 analyzes the input 26 (e.g., any stream 60 of data) to determine the answer 36 to the posed question 32. Because the question 32 has no straightforward or readily apparent answer, the software algorithm 30 retrieves the set 34 of heuristic rules to answer the question 32. The set 34 of heuristic rules is illustrated as being locally stored in the memory 52, but the set 34 of heuristic rules may be remotely accessed and maintained at any location in the communications network (illustrated as reference numeral 26 in
The above analysis may be used to passively detect any services running, for example, in an internal (e.g., corporate) network. The approach scales to large networks, has minimal overhead to the network being monitored, and by providing continuous monitoring, will detect any service that ever communicates with the external networks. The software algorithm 30 uses novel techniques to combine information from heuristics, each of which by itself is unreliable. By comparing the list of detected services against the list of legitimate known services, the network operators can quickly detect any new rogue services.
Knowledge of what services are running in their networks is critical for network and security administrators. For example, an infected computer (e.g., a “bot”) often runs services that listen for commands from their controllers. Furthermore, the security policy of the organization may disapprove certain services (e.g., known vulnerable services, P2P, etc.). Exemplary embodiments may thus be used to identify rogue servers and compromised computers in an internal network based on a combination of passive service discovery and historical comparison. Exemplary embodiments provide a simple yet effective method to continuously and accurately detect the entire population of servers in a given network. The network and security administrators may then validate the legitimate services and be alerted when suspicious services appear.
Exemplary embodiments may use NetFlow as the input 26. NetFlow is a known network protocol for collecting Internet Protocol traffic information. NetFlow is implemented in most routers and collects summarized traffic information using packet headers. More precisely, a network flow is defined as a unidirectional sequence of packets that share source and destination IP addresses; source and destination port numbers (for TCP or UDP, 0 for other protocols); and the IP protocol (e.g., TCP or UDP). A NetFlow record carries a wide variety of network-related information including: timestamp of the first packets received, duration, total number of packets and bytes, input and output interfaces, IP address of the next hop, source and destination IP masks and cumulative TCP flags in the case of TCP flows.
Some terms may be helpful. The server 20 may be a network application that provides a service by receiving request messages from clients and generating response messages. The server 20 may be hosted on a computer identified by its IP address and accepts requests sent to a specific port. Exemplary embodiments may utilize any servers, such as those using the UDP and TCP protocols, both temporary and permanent. Exemplary embodiments may include peer-to-peer transactions, even if the server 20 may be handling client requests for only a few minutes and for only specific clients. An end point is defined as a tuple {IP address, IP protocol (TCP or UDP), Port number} and may represent any client or any server. A network session may be a valid communication between one client end point and one server end point. A network transaction may be any set of flows between two end points during a time window smaller than the maximum age limit of a flow (such as 15 minutes). There may be two types of network transactions: unidirectional and bidirectional. Exemplary embodiments may assume that bidirectional transactions are always between a client and a server and that bidirectional transactions are always initiated by a client. Exemplary embodiments, however, need not make this assumption, so that bidirectional transactions may be between any two devices and bidirectional transactions may be initiated by any device.
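The flow and end point definitions above map naturally onto small record types. The following is a minimal sketch; the class and field names (`EndPoint`, `FlowKey`, `FlowRecord`, `byte_count`, and so on) are hypothetical and only mirror the attributes named in the text, not any actual NetFlow export format.

```python
from dataclasses import dataclass
from typing import NamedTuple


class EndPoint(NamedTuple):
    """The tuple {IP address, IP protocol, port number} from the text."""
    ip: str
    protocol: str  # "TCP" or "UDP"
    port: int


class FlowKey(NamedTuple):
    """Fields shared by every packet of one unidirectional flow."""
    src_ip: str
    dst_ip: str
    src_port: int  # 0 for protocols other than TCP or UDP
    dst_port: int
    protocol: str


@dataclass
class FlowRecord:
    key: FlowKey
    first_seen: float  # timestamp of the first packet, in seconds
    duration: float
    packets: int
    byte_count: int
    tcp_flags: int = 0  # cumulative OR of the TCP flags, 0 for non-TCP flows

    def end_points(self):
        """Return the (source, destination) end points of this flow."""
        k = self.key
        return (EndPoint(k.src_ip, k.protocol, k.src_port),
                EndPoint(k.dst_ip, k.protocol, k.dst_port))


record = FlowRecord(FlowKey("10.0.0.5", "192.0.2.10", 51000, 80, "TCP"),
                    first_seen=0.0, duration=1.2, packets=6, byte_count=840,
                    tcp_flags=0x1B)
src, dst = record.end_points()
```

Because end points are plain tuples, they can be used directly as dictionary keys when flows are later grouped or paired.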
The task of accurately detecting servers based solely on NetFlow is challenging. NetFlow may not keep track of the logic of network sessions between clients and servers. Specifically, exemplary embodiments may address the following challenges: 1) NetFlow may break up the logical request and reply flows into multiple separate flows; 2) NetFlow is made of unidirectional flows, and therefore exemplary embodiments may need to identify the matching unidirectional flows to make up bidirectional flows and identify valid network sessions; and 3) identifying the server end point in a network session is not always easy.
Exemplary embodiments may solve the first and second challenges by matching and merging the network flows as follows. First, for each collection period (usually 5 minutes), exemplary embodiments merge all network flows that have the same source and destination end points to eliminate any artificial breaking of unidirectional flows. Then, to address the issue of combining unidirectional flows into network sessions, exemplary embodiments may first generate bidirectional flows by merging all flows collected during a given time window that have opposite source and destination end points. Exemplary embodiments may then separate valid from invalid bidirectional flows as follows. All UDP flows are considered to be valid. TCP flows are valid only if both the request and reply flows carry at least two packets and the TCP acknowledgement flag. So, for example, if a server refuses a TCP connection handshake by sending a reset flag to the source end point, then the bidirectional flow recorded for this transaction will be seen as invalid.
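The merge, pair, and validate steps described above can be sketched as follows. This is an illustrative sketch only: the dictionary layout of a flow, the function names, and the sample addresses are hypothetical, and a single collection period is assumed to have been selected already.

```python
ACK = 0x10  # TCP acknowledgement flag bit
SYN = 0x02
RST = 0x04


def merge_unidirectional(flows):
    """Merge flows that share the same (source, destination) end points."""
    merged = {}
    for f in flows:
        key = (f["src"], f["dst"])
        if key in merged:
            merged[key]["packets"] += f["packets"]
            merged[key]["flags"] |= f["flags"]
        else:
            merged[key] = dict(f)  # copy so the input is not modified
    return merged


def pair_bidirectional(merged):
    """Pair each merged flow with the flow going the opposite way, if any."""
    pairs, seen = [], set()
    for (src, dst), flow in merged.items():
        if (dst, src) in merged and (dst, src) not in seen:
            pairs.append((flow, merged[(dst, src)]))
            seen.add((src, dst))
    return pairs


def is_valid(pair):
    """UDP pairs are always valid; TCP needs >= 2 packets and ACK both ways."""
    protocol = pair[0]["src"][1]
    if protocol == "UDP":
        return True
    return all(f["packets"] >= 2 and bool(f["flags"] & ACK) for f in pair)


# End points are (IP address, protocol, port) tuples; the values are made up.
c1, c2 = ("10.0.0.5", "TCP", 51000), ("10.0.0.6", "TCP", 51001)
s1 = ("192.0.2.10", "TCP", 80)
flows = [
    {"src": c1, "dst": s1, "packets": 1, "flags": SYN},
    {"src": c1, "dst": s1, "packets": 2, "flags": ACK},        # same flow, split
    {"src": s1, "dst": c1, "packets": 3, "flags": SYN | ACK},
    {"src": c2, "dst": s1, "packets": 1, "flags": SYN},
    {"src": s1, "dst": c2, "packets": 1, "flags": RST | ACK},  # refused handshake
]
pairs = pair_bidirectional(merge_unidirectional(flows))
valid = [p for p in pairs if is_valid(p)]
```

In this sample, the split request flow is merged back into one flow and paired with its reply, while the refused handshake forms a pair that fails the validity check.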
The last step may be to identify client and server end points for every valid bidirectional flow. This task is challenging because the TCP flags in the request and reply flows are typically identical for valid bidirectional flows. Furthermore, the flow timestamps have proven to be sometimes unreliable and more often, the request and reply flows have identical time stamps due to the granularity of the time stamps.
To achieve this task, exemplary embodiments may develop a set of heuristics that determine if an end point is a server (or not). Each heuristic uses one or more characteristics of each bidirectional flow to make its decision. These heuristics were developed to cover a variety of intuitions gathered from network experts. Exemplary embodiments may then combine the outputs from the different heuristics using a Bayesian inference framework. Bayesian inference provides the advantage of keeping track of previous detection evidence by updating the accuracy of the server identification process over time. The heuristics may include:
Exemplary embodiments may combine the evidence provided by the heuristics to get the best estimate of which end points are servers and which ones are clients by using basic Bayesian inference. Exemplary embodiments may consider each end point that is present in at least one bidirectional flow. For each end point X, two hypotheses are possible:
The different heuristics are used to identify evidence E in the bidirectional flows. We use training data with known ground truth to determine P(E|H), that is, the probability of evidence E being present in a flow or set of flows given that hypothesis H is true. Finally, let P(H) be the prior probability of hypothesis H (e.g., based on prior evidence). Then, the probability of a hypothesis H given the evidence E, P(H|E), can be updated using the basic formulation of Bayesian inference:
P(H|E) = P(E|H)*P(H) / P(E),
where P(E) = Σ P(E|Hi)*P(Hi), summed over all the possible hypotheses Hi.
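A short worked sketch of this update may help. It assumes the two hypotheses are labeled HS (the end point is a server) and HC (the end point is a client), as the notation P(E|HS) and P(E|HC) used in this description suggests; the likelihood values 0.9 and 0.3 and the uniform prior are hypothetical.

```python
def bayes_update(priors, likelihoods):
    """Return P(H|E) for every hypothesis H.

    priors: {hypothesis: P(H)}; likelihoods: {hypothesis: P(E|H)}.
    """
    # P(E) is the sum of P(E|Hi) * P(Hi) over all hypotheses.
    evidence = sum(likelihoods[h] * priors[h] for h in priors)
    return {h: likelihoods[h] * priors[h] / evidence for h in priors}


priors = {"HS": 0.5, "HC": 0.5}
likelihoods = {"HS": 0.9, "HC": 0.3}  # hypothetical P(E|HS) and P(E|HC)

first = bayes_update(priors, likelihoods)   # P(HS|E) is about 0.75
second = bayes_update(first, likelihoods)   # same evidence seen again: about 0.9
```

Feeding the posterior back in as the next prior is one simple way to keep track of previous detection evidence over time.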
In order to combine the evidence provided by the different heuristics, exemplary embodiments determine the accuracy of each heuristic, that is, the probabilities P(E|HS) and P(E|HC) for each type of evidence E. These conditional probabilities can be determined either by using expert knowledge or by learning them from labeled data where the server identities are known. Exemplary embodiments use data labeled by Argus (http://www.qosient.com/argus/, 2009) as the ground truth. This dataset consists of 34.8 million NetFlow flows collected at the border of the University of Maryland over a 30-minute period. This dataset is explained in later paragraphs.
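Learning these conditional probabilities from labeled data amounts to counting how often a heuristic fires under each known label. The sketch below assumes a simple input layout of (label, fired) pairs; the function name and the toy samples are hypothetical and are not taken from the Argus-labeled dataset.

```python
from collections import Counter


def estimate_likelihoods(samples):
    """Estimate P(E|H) for one heuristic from labeled observations.

    samples: iterable of (label, fired) pairs, where label is "HS" or "HC"
    and fired says whether the heuristic produced evidence E for that end point.
    """
    fired_counts, totals = Counter(), Counter()
    for label, fired in samples:
        totals[label] += 1
        if fired:
            fired_counts[label] += 1
    return {label: fired_counts[label] / totals[label] for label in totals}


# Toy labeled data: the heuristic fires on 3 of 4 servers and 1 of 4 clients.
samples = [("HS", True), ("HS", True), ("HS", True), ("HS", False),
           ("HC", True), ("HC", False), ("HC", False), ("HC", False)]
likelihood = estimate_likelihoods(samples)
```

With real data, a smoothing term would usually be added so that a heuristic never observed under one label does not produce a probability of exactly zero; that refinement is omitted here.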
To fully implement the results from
The accuracy of the exemplary embodiments may be evaluated by addressing two related issues:
1. generating correctly oriented bidirectional flows, and
2. accurately identifying server end points.
The first issue was evaluated by comparing the bidirectional flows generated from the same dataset using the software application 30 and using Argus, which is a packet-based bidirectional flow generator discussed above. The second issue was evaluated by comparing the list of network services discovered by the software application 30 and by the Passive Asset Detection Service (or “PADS”) from the same dataset. As those of ordinary skill in the art understand, PADS is a packet-based passive service discovery tool. The inventors assumed PADS and Argus to be more accurate than the software application 30 and able to produce a baseline dataset for evaluation, since they both work from detailed packet data instead of high-level flow data. The goal of the paragraphs below is to measure exactly how much accuracy is lost by working only with flows.
First, though, the dataset is discussed. The dataset used for this evaluation includes raw packet data captured at the border of the University of Maryland network during a 30-minute interval. A total of 154.9 million packets were collected, and these packets were exchanged between 56,977 internal hosts and 1.57 million external hosts. The raw packet dataset was divided into 17 files prior to being processed by each tool, in order to reproduce the conditions of a production environment where flows are processed in batches of a few minutes. Before being processed by the software application 30, the packet data had to be translated to unidirectional NetFlow data in order to replicate the behavior of a router. The Softflowd product (www.mindrot.org/projects/softflowd/) and Nfcapd from the Nfdump package (P. Haag: Watch Your Flows with NfSen and NFDUMP, http://www.ripe.net/ripe/meetings/ripe-50/presentations/ripe50-plenary-tue-nfsennfdump.pdf) were used to complete this task. A total of 34.8 million unidirectional NetFlow flows were generated. The flows can be partitioned into:
First note the discrepancies between the number of unidirectional and bidirectional flows produced by Argus and the software application 30. These differences come from the distinct cut off and aggregation rules that the two programs apply. In order to evaluate how accurately the software application 30 can decide on the orientation of flows, all bidirectional flows were aggregated regardless of their orientation using a key based on the hash value computed from the source/destination IP addresses, protocol and source/destination ports. The inventors then compared for each key if Argus and the software application 30 agreed or disagreed on the orientation of the bidirectional flow represented. This comparison led to three cases: 1) the software application 30 and Argus agreed, 2) the software application 30 and Argus disagreed, and 3) Argus output multiple orientations for a given bidirectional flow key, so the agreement was mixed. On the 5.94 unique bidirectional flow keys evaluated, the software application 30 and Argus agreed 79.60% of the time, disagreed 12.70% of the time and results were mixed for the remaining 7.70%.
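The three-way comparison described above can be sketched as follows. The sketch uses an orientation-independent key built by sorting the two end points, which stands in for the hash-based key mentioned in the text; the function names and the sample end points are hypothetical.

```python
from collections import defaultdict


def flow_key(a, b):
    """Key that is identical for both orientations of a bidirectional flow."""
    return tuple(sorted((a, b)))


def compare_orientations(ours, reference):
    """Count agreement between two lists of (client, server) orientations.

    A key for which the reference lists more than one orientation is 'mixed'.
    Keys that are missing from the reference are skipped.
    """
    ref = defaultdict(set)
    for client, server in reference:
        ref[flow_key(client, server)].add((client, server))

    counts = {"agree": 0, "disagree": 0, "mixed": 0}
    for client, server in ours:
        orientations = ref.get(flow_key(client, server))
        if not orientations:
            continue
        if len(orientations) > 1:
            counts["mixed"] += 1
        elif (client, server) in orientations:
            counts["agree"] += 1
        else:
            counts["disagree"] += 1
    return counts


# End points are (IP address, protocol, port) tuples; the values are made up.
a, b = ("10.0.0.5", "TCP", 51000), ("192.0.2.10", "TCP", 80)
c, d = ("10.0.0.6", "TCP", 51001), ("192.0.2.11", "TCP", 22)
e, f = ("10.0.0.7", "UDP", 40000), ("192.0.2.12", "UDP", 53)

ours = [(a, b), (c, d), (e, f)]
reference = [(a, b), (d, c), (e, f), (f, e)]
counts = compare_orientations(ours, reference)
```

Dividing each count by the number of compared keys gives percentages of the kind reported for the agreed, disagreed, and mixed cases.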
To analyze these results further, the inventors first broke them down according to the output of the Bayesian inference that the software application 30 calculated to decide on the orientation of each bidirectional flow. The Bayesian inference output can be seen as a confidence value, where 0.5 means that the software application 30 could not decide on the orientation of the bidirectional flow, and 1 means the software application 30 was 100% sure of the orientation to apply. The percentage of agreement between Argus and the software application 30 shows that the accuracy increases with the probability provided by the Bayesian inference output. This empirical result shows that by using flows instead of packet data, the software application 30 has reduced accuracy but can still give a network administrator an indication of the confidence of the results.
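One way to produce such a breakdown is to group the compared flows by confidence value and compute the agreement rate per group. The sketch below assumes each result is a (confidence, agreed) pair with the confidence in the range 0.5 to 1; the bin edges and the sample values are hypothetical and are not the measured results.

```python
from bisect import bisect_right


def agreement_by_confidence(results, edges=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Return {bin lower edge: fraction of flows that agreed} per confidence bin.

    results: iterable of (confidence, agreed) pairs, 0.5 <= confidence <= 1.0.
    The last bin covers its lower edge up to and including 1.0.
    """
    hits = [0] * len(edges)
    totals = [0] * len(edges)
    for confidence, agreed in results:
        if not 0.5 <= confidence <= 1.0:
            raise ValueError("confidence must lie between 0.5 and 1.0")
        i = bisect_right(edges, confidence) - 1
        totals[i] += 1
        hits[i] += bool(agreed)
    return {edges[i]: hits[i] / totals[i]
            for i in range(len(edges)) if totals[i]}


sample = [(0.55, True), (0.55, False), (0.95, True), (1.0, True)]
rates = agreement_by_confidence(sample)
```

A table of these per-bin rates is the kind of view that lets an operator judge how much to trust a result at a given confidence value.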
On the subset of flows where Argus could unequivocally decide on flow direction, the overall accuracy of the software application 30 in generating correctly oriented bidirectional flows is 86.24%, and 87.32% when the software application 30 is certain. The next step in the evaluation is to understand precisely how much combining heuristics through Bayesian inference helped to reach this result. The accuracy of each heuristic is individually measured, and the heuristic output is compared against, respectively, Argus and the end decision taken by the Bayesian inference of the software application 30. The results indicate that heuristics H.1 to H.6 have an accuracy ranging from 74.37% to 78.28%, all below the accuracy of 79.60% of the Bayesian inference. H.0 provides very good accuracy (96.61% on the subset of flows where it could decide), but unfortunately only 14.38% of bidirectional flows were identified from non-identical timestamps. The inventors believe that this result strongly depends on the latency of the network where the data is recorded, since high latency can create enough delay between request and reply flows for H.0 to be conclusive. The inventors conclude first that, besides H.0, the heuristics strongly agree with the Bayesian inference decision, which indirectly means that heuristics rarely contradict each other. A second conclusion is that one main advantage of using Bayesian inference is the ability to bridge the indecision gap between heuristics. Indeed, the inventors calculated that the Bayesian inference output was inconclusive (i.e., it output a probability of 0.5) in only 0.01% of the cases, which is below the rates of all heuristics taken individually.
The last part of the evaluation measured the accuracy of the service detection capability of the software application 30 by using PADS as the baseline. Since PADS detects only TCP services, the focus was only on TCP traffic. The numbers of unique TCP services detected are:
To investigate further the activity related to port 3050, the dataset provided by the software application 30 was queried for a weekly report of internal and external client, server and scanner activities. Clients and servers are extracted from valid bidirectional end points, while scanners are known from source end points producing a large number of unanswered request flows. If the inventors only had access to unidirectional flows, differentiating clients from scanners would not have been possible.
Identifying server end points also allows client and scanner end points to be identified. Specifically, a scanner is a client that tries to contact more than a given number of nonexistent servers. A threshold of 5 nonexistent servers in 5 minutes was used. The software application 30 was able to quickly spot compromised machines that attempted to infect their neighbors. For example,
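The scanner rule stated in this paragraph can be sketched in a few lines. The sketch makes two assumptions that the text does not settle: the 5-minute window is treated as a fixed, non-overlapping bucket, and "more than" is read strictly, so a client needs at least 6 distinct nonexistent targets in one bucket to be flagged. The function name and the sample addresses are hypothetical.

```python
from collections import defaultdict


def find_scanners(unanswered, threshold=5, window=300):
    """Return clients that contacted more than `threshold` nonexistent servers.

    unanswered: iterable of (timestamp_seconds, client_ip, target_end_point)
    for request flows that never received a reply. Requests are grouped into
    fixed buckets of `window` seconds (an assumption of this sketch).
    """
    targets = defaultdict(set)
    for timestamp, client, target in unanswered:
        bucket = int(timestamp // window)
        targets[(client, bucket)].add(target)
    return {client for (client, _), seen in targets.items()
            if len(seen) > threshold}


# Client 10.0.0.9 probes six distinct targets within one 5-minute bucket;
# client 10.0.0.8 probes only five.
unanswered = (
    [(10.0 * i, "10.0.0.9", ("192.0.2.%d" % i, "TCP", 3050)) for i in range(6)]
    + [(10.0 * i, "10.0.0.8", ("192.0.2.%d" % i, "TCP", 3050)) for i in range(5)]
)
scanners = find_scanners(unanswered)
```

A sliding window would catch probes that straddle a bucket boundary, at the cost of keeping per-client timestamps; the fixed bucket is used here only to keep the sketch short.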
The software application 30 offers network operators and security administrators access to bidirectional flows without the issue of having to instrument the network with new costly sensors. The above paragraphs showed how important the data produced by the software application 30 could be to immediately gain visibility over the organization's network. Moreover, running on top of NetFlow offers the important advantages of not being affected by encrypted traffic or by privacy issues related to deep packet inspection. Here, though, future work is discussed. As previously explained, the results discussed above were related to non-sampled flows. Results from other evaluations of passive detection techniques indicate that sampling has a limited impact on the overall accuracy. For example, Bartlett, et al. report that capturing only 16% of the data results only in an 11% drop in discovered servers. See G. Bartlett, J. Heidemann, and C. Papadopoulos, Understanding Passive and Active Service Discovery, Proc. 7th ACM SIGCOMM Conference on Internet Measurement, 2007, pp. 57-70. The inventors believe, however, that random flow sampling will likely break the correct detection of bidirectional activity. Future work, then, may precisely assess the effect of sampling on the detection accuracy of the different heuristics. Furthermore, the results discussed above were related to asymmetric routing. It was assumed in this study that NetFlow collectors covered the pathways for both requests and replies. In some organizational networks, replies and requests can sometimes take different routes for which there is no NetFlow collector deployed. Such an architecture would again break the pairing of unidirectional flows into bidirectional flows. Finally, in these results, the software application 30 worked at the network layer and therefore heavily relied on port numbers.
As a consequence, it can be difficult or impossible for a network operator to identify the application behind a service detected by the software application 30. This issue arises from the fact that some applications use random ports or hide behind well-known ports. For example, SKYPE® is famous for using port 80 or port 443, normally reserved for web traffic, in order to evade firewall protection (SKYPE® is a registered trademark of Skype Limited). Related work by Erman, et al. on flow-based traffic classification proved that it is possible to accurately identify applications using only NetFlow. See J. Erman, A. Mahanti, M. Arlitt, and C. Williamson, Identifying and Discriminating Between Web and Peer-to-Peer Traffic in the Network Core, Proc. of the 16th International Conference on World Wide Web, 2007, p. 892. Future work may involve developing additional heuristics for the software application 30 to be able to precisely classify traffic regardless of port number. These heuristics can work on 1) relationships between flow characteristics, such as the ratio between number of packets and number of bytes or the time distribution of flows, and 2) relationships between hosts. The inventors believe that discovering communication patterns between hosts would be useful not only to identify applications but also to identify large communication structures such as those used by P2P networks or botnets.
Exemplary embodiments, in conclusion, describe a novel approach to combine server detection heuristics using Bayesian inference. Exemplary embodiments include a passive server discovery architecture and application that requires only NetFlow to run. The evaluation of the software application 30 in an academic network of 40,000 computers reveals that the Bayesian inference succeeds in improving the accuracy of the different heuristics and provides network operators with a meaningful confidence value for each server discovered. When this confidence is at its highest value, the software application 30 detects the correct orientation of 87% of the bidirectional flows processed and identifies 93% of the servers. Finally, the different case studies show how the network visibility offered by the software application 30 provides a simple and efficient solution for network operators and security analysts to detect security compromises, to find undocumented and potentially vulnerable servers, and to perform forensic analysis of security issues.
Exemplary embodiments may be physically embodied on or in a computer-readable storage medium. This computer-readable medium may include CD-ROM, DVD, tape, cassette, floppy disk, memory card, and large-capacity disks. This computer-readable medium, or media, could be distributed to end-subscribers, licensees, and assignees. These types of computer-readable media, and other types not mentioned here, are considered within the scope of the exemplary embodiments. A computer program product comprises processor-executable instructions for using heuristics to answer difficult questions, as explained above.
While the exemplary embodiments have been described with respect to various features, aspects, and embodiments, those skilled and unskilled in the art will recognize the exemplary embodiments are not so limited. Other variations, modifications, and alternative embodiments may be made without departing from the spirit and scope of the exemplary embodiments.
Number | Date | Country | |
---|---|---|---|
20110153537 A1 | Jun 2011 | US |