1. Technical Field
The present invention relates to anomaly detection on packet switched communication systems. Particularly, the present invention is related to statistical methods for detecting network traffic anomalies due to network attacks or to communication system failures.
2. Description of the Related Art
Several types of attacks are known, such as: (distributed) denial of service ((D)DoS) attacks, scanning attacks, SPAM or SPIT attacks, and malicious software attacks.
Denial-of-Service (DoS) attacks and, in particular, distributed DoS (DDoS) attacks are commonly regarded as a major threat to the Internet. A DoS attack is an attack on a computer system or network that causes a loss of service or network connectivity to legitimate users, that is, unavailability of services. Most common DoS attacks aim at exhausting the computational resources, such as connection bandwidth, memory space, or CPU time, for example, by flooding a target network node by valid or invalid requests and/or messages. They can also cause disruption of network components or disruption of configuration information, such as routing information, or can aim at disabling an application making it unusable. In particular, the network components (e.g., servers, proxies, gateways, routers, switches, hubs, etc.) may be disrupted by malicious software attacks, for example, by exploiting buffer overflows or vulnerabilities of the underlying operating system or firmware.
A DDoS attack is a DoS attack that, instead of using a single computer as a base of attack, uses multiple compromised computers simultaneously, possibly a large or a very large number of them (e.g., millions), thus amplifying the effect. Altogether, they flood the network with an overwhelming number of packets which exhaust the network or application resources. In particular, the packets may be targeting one particular network node causing it to crash, reboot, or exhaust the computational resources. The compromised computers, which are called zombies, are typically infected by malicious software (worm, virus, or Trojan) in a preliminary stage of the attack, which involves scanning a large number of computers searching for those vulnerable. The attack itself is then launched at a later time, either automatically or by a direct action of the attacker.
(D)DoS attacks are especially dangerous for Voice over IP (VoIP) applications, e.g., those based on the Session Initiation Protocol (SIP). In particular, the underlying SIP network, which deals only with SIP signaling packets, is potentially vulnerable to request or message flooding attacks, spoofed SIP messages, malformed SIP messages, and reflection DDoS attacks. Reflection DDoS attacks work, for example, by generating fake SIP requests with a spoofed (i.e., simulated) source IP address, which falsely identifies a victim node as the sender, and by sending or multicasting said SIP requests to a large number of SIP network nodes, which all respond to the victim node, and repeatedly so if they do not get a reply, hence achieving an amplification effect.
SPAM attacks consist in sending unsolicited electronic messages (e.g., through E-mail over the Internet), with commercial or other content, to numerous indiscriminate recipients. Analogously, SPIT (SPam over Internet Telephony) attacks consist in sending SPAM voice messages in VoIP networks. Malicious software attacks consist in sending malicious software, such as viruses, worms, Trojans, or spyware, to numerous indiscriminate recipients, frequently in a covert manner. Scanning or probing attacks over the Internet consist in sending request messages in large quantities to numerous indiscriminate recipients and collecting the information from the provoked response messages, particularly in order to detect vulnerabilities to be used in subsequent attacks. For example, in port scanning attacks, the collected information consists of the port numbers used by the recipients.
Attack detection techniques are known which utilize a description (signature) of a particular attack (e.g., a virus, worm, or other malicious software) and decide if the observed traffic data is consistent with this description or not; the attack is declared in the case of detected consistency.
Furthermore, anomaly detection techniques are known which utilize a description (profile) of normal/standard traffic, rather than anomalous attack traffic, and decide if the observed traffic data is consistent with this description or not; an attack or anomalous traffic is declared in the case of detected inconsistency.
Unlike attack detection techniques, anomaly detection techniques do not require prior knowledge of particular attacks and as such are in principle capable of detecting previously unknown attacks. However, they typically have non-zero false-negative rates, in the sense that they can fail to declare an existing attack. They also typically have higher false-positive rates, in the sense that they can declare anomalous traffic in the absence of attacks.
Anomaly detection techniques can essentially be classified into two categories: rule-based techniques and statistic-based or statistical techniques. Rule-based techniques describe the normal behavior in terms of certain static rules or certain logic and can essentially be stateless or stateful. In particular, such rules can be derived from protocol specifications.
On the other hand, statistical anomaly detection techniques describe the normal behavior in terms of the probability distributions of certain variables, called statistics, depending on the chosen data features or parameters.
Paper “Characteristics of network traffic flow anomalies,” P. Barford and D. Plonka, Proceedings of the 1st ACM SIGCOMM Workshop on Internet Measurement, San Francisco, Calif., November 2001, pp. 69-73, suggests that packet rate, byte rate, and flow rate (i.e., the number of packets, bytes, and flows per second) curves in time can be useful for detecting and classifying traffic anomalies, possibly through the wavelet transform techniques.
US-A-2003/0200441 describes a method for detecting (D)DoS attacks based on randomly spoofed (i.e., simulated) IP addresses. To reduce the number of IP addresses, they are first hashed by a hash function. The method consists of counting the relative number of different values of hashed IP addresses among a number of packets, which are inspected successively in time, and of comparing this number with a predetermined threshold. A (D)DoS attack is declared if the threshold is exceeded. The number of inspected packets is iteratively increased if a (D)DoS attack is not detected.
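By way of illustration only, the counting scheme summarized above can be sketched as follows. This is a minimal sketch and not the implementation of the cited document: the function names, the use of SHA-256 as the hash function, the number of hash values, and the threshold are all assumptions made for the example, and the iterative increase of the number of inspected packets is omitted.

```python
import hashlib

def hashed_value(ip, n_values=65536):
    """Map an IP address string to one of n_values hashed values.

    SHA-256 is an assumption made for this sketch; the summary above
    does not prescribe a particular hash function.
    """
    digest = hashlib.sha256(ip.encode("ascii")).digest()
    return int.from_bytes(digest[:4], "big") % n_values

def distinct_ratio(ips, n_values=65536):
    """Relative number of different hashed values among the inspected packets."""
    ips = list(ips)
    if not ips:
        return 0.0
    return len({hashed_value(ip, n_values) for ip in ips}) / len(ips)

def attack_declared(ips, threshold=0.5, n_values=65536):
    """Declare an attack if the relative number of different values exceeds the threshold."""
    return distinct_ratio(ips, n_values) > threshold
```

With randomly spoofed source addresses almost every inspected packet carries a different address, so the ratio approaches 1, whereas repeated legitimate sources keep it low.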
Article “Proactively detecting distributed denial of service attacks using source IP address monitoring”, T. Peng, C. Leckie, and K. Ramamohanarao, Proceedings of Networking 2004, Lecture Notes in Computer Science, vol. 3042, pp. 771-782, 2004, discloses a method according to which DDoS attacks can be (proactively) detected even near the sources of the attack by checking for an increase of new source IP addresses appearing, provided that the source IP addresses of the attack traffic are randomly spoofed. It should be noticed that according to this article the IP addresses are monitored in non-overlapping time intervals and the increase is measured with respect to a database of legitimate IP addresses collected during off-line training.
Paper “Mining anomalies using traffic feature distributions”, A. Lakhina, M. Crovella, and C. Diot, Proceedings of SIGCOMM '05, Philadelphia, Pa., Aug. 22-26, 2005, pp. 217-228, discloses a method comprising the steps of computing the “sample entropy” of discrete packet features such as IP addresses and port numbers in non-overlapping, relatively short time intervals (e.g., 5 min), statistically modeling the multidimensional entropy data collected on multiple links in a communications network by using principal component analysis, and then verifying whether the current data is inconsistent with the model so determined by checking whether the squared prediction error resulting from the principal component analysis exceeds a threshold. The Applicant observes that the sample entropy used is based on the well-known Shannon entropy. It is expected that the frequency distribution of the IP addresses or port numbers reflected in the sample entropy should change in the case of attack traffic. The Applicant observes that the same method is later proposed in WO-A-2007/002838.
Article “Entropy based worm and anomaly detection in fast IP networks”, A. Wagner and B. Plattner, Proc. 14. IEEE International Workshops on Enabling Technologies Infrastructure for Collaborative Enterprises, Linköping, Sweden, June 2005, pp. 172-177, discloses a method that considers discrete packet features such as IP addresses in relatively short time intervals (e.g., 5 min) and compresses a concatenation of all the IP addresses occurring in the interval by a lossless data compression algorithm, such as the Lempel-Ziv coding algorithm. It is expected that the compression ratio should be lower if there is attack traffic in the interval, due to randomization of destination IP addresses.
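For orientation, a sample entropy based on the Shannon entropy can be computed from the relative frequencies of the observed feature values, as in the following minimal sketch. The function name and the use of base-2 logarithms are assumptions made for the example; the principal component analysis step of the cited paper is not shown.

```python
import math
from collections import Counter

def sample_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of the values."""
    counts = Counter(values)
    n = sum(counts.values())
    if n == 0:
        return 0.0
    # Sum of -p * log2(p) over the relative frequencies p of the observed values.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

The entropy is zero when all observed values are identical and grows as the values spread over more distinct symbols, so it behaves as a dispersion measure.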
Thesis “DDoS attack detection based on Netflow logs”, E. Haraldsson, Student thesis SA-2003.35, Swiss Federal Institute of Technology, Zurich, 2003, and thesis “Plug-ins for DDoS attack detection in realtime,” A. Weisskopf, Semester thesis SA-2004.19, Swiss Federal Institute of Technology, Zurich, 2004, disclose a number of packet statistics for the detection of DDoS attacks. The statistics examined by Haraldsson include the number of open or half-open (obtained from TCP flags) connections, the number of transmitted or received bytes per (grouped) IP address, the number of open ports per (grouped) IP address, and the histogram of the average packet sizes, while the statistics examined by Weisskopf include the histogram of the flow sizes in bytes over a time period and the activity of (grouped) IP addresses.
The Applicant has observed that the known solutions are not satisfactory with respect to the achieved false-negative and false-positive rates and the computational complexity and memory requirements. This could be due to the fact that it is difficult for the normal traffic in communications networks to be described by stable probability distributions. Moreover, it is difficult to define statistical models of communication systems that would give rise to sufficiently low false-positive and false-negative rates. It should be also noticed that the complexity of the proposed statistical methods may be unacceptably high for high-speed and high-volume communications networks.
The Applicant has noticed that there is a need in the field for achieving an anomaly detection method providing increased reliability and, preferably, reduced computational complexity and memory requirements. In accordance with a particular embodiment, the Applicant has observed that advantages can be obtained by monitoring the statistical behavior of symbolic packet features associated with two packet flow portions lying in corresponding time windows that are moving in time.
A symbolic packet feature is a discrete data item that can be extracted from network packets and belongs to a symbolic data set. In a symbolic data set, the distance or closeness between two data values cannot be defined or is not considered to be relevant or meaningful. In particular, the employable symbolic features can be, e.g., the source and destination IP addresses, the source and destination port numbers, the transport protocol used, the source and destination email addresses, the source and destination SIP URIs (Uniform Resource Identifiers), or the HTTP (Hypertext Transfer Protocol used on the World Wide Web) URIs.
An object of the present invention is a method of detecting anomalies as defined by the appended independent claim 1. Preferred embodiments of this method are defined in the dependent claims 2-21. According to another aspect, the present invention also relates to an apparatus for detecting anomalies in a packet switched communication system, such as defined in claim 22 and a preferred embodiment thereof defined in the dependent claim 23. A further object of the present invention is a packet switched communication system as defined by claim 24. In accordance with another aspect, the invention relates to a computer program product as defined by claim 25.
The characteristics and the advantages of the present invention will be better understood from the following detailed description of embodiments thereof, which is given by way of illustrative and non-limiting example with reference to the annexed drawings, in which:
Hereinafter, a communication system and several embodiments of a statistical anomaly detection method will be described. In particular, the anomalous traffic to be detected can be due to (D)DoS attacks, SPAM and/or SPIT attacks, scanning attacks, as well as malicious software attacks. It should be noticed that the teachings of the present invention can also be applied to detect anomalous traffic due to failures in hardware apparatuses or in software modules operating in the communication system.
The particular communication system 100 illustrated in
As known, the Open Systems Interconnection Basic Reference Model (OSI Reference Model or OSI Model for short) is a layered, abstract description for communications and computer network protocol design. It is also called the OSI seven layer model since it defines the following layers: application (7), presentation (6), session (5), transport (4), network (3), data link (2), and physical (1).
Layers 3 and 4 (the network and transport layers, respectively) include the following information of an IP packet: source IP address, TCP/UDP (Transmission Control Protocol/User Datagram Protocol) source port number, destination IP address, TCP/UDP destination port number, and transport protocol used (e.g., TCP or UDP). A series of packets having in common the above listed information is defined as a (network) “flow”.
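The definition of a flow given above can be illustrated by grouping packets on the five listed header fields. The following is a minimal sketch; the `Packet` record and the function name are hypothetical and introduced only for the example.

```python
from collections import defaultdict
from typing import NamedTuple

class Packet(NamedTuple):
    """Hypothetical packet record holding the layer 3 and 4 fields listed above."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    size: int  # packet size in bytes

def group_into_flows(packets):
    """Group packets sharing the same five header fields into flows."""
    flows = defaultdict(list)
    for p in packets:
        key = (p.src_ip, p.src_port, p.dst_ip, p.dst_port, p.protocol)
        flows[key].append(p)
    return dict(flows)
```

Each dictionary key identifies one flow, and the associated list holds the packets of that flow in arrival order.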
Example of an Anomaly Detection Method
Subsequently, in an extracting step 203 (EXTRACT), samples (xi)1 of a symbolic packet feature x associated with the first packet flow portion PFP1 are extracted. Samples (xi)2 of the symbolic feature x of the second packet flow portion PFP2 are also extracted. A symbolic packet feature is a discrete data item that can be extracted from network packets and belongs to a symbolic data set. In a symbolic data set, the distance or closeness between two data values cannot be defined or is not considered to be relevant or meaningful. Symbolic data are specified only by discrete values and not by the metric between the values. In particular, even if the data are expressed in terms of rational or integer numbers, but the underlying Euclidean metric is considered to be irrelevant, they can be treated as symbolic data.
In particular, the employable symbolic features can be, e.g., the source and destination IP addresses, the source and destination port numbers, the transport protocol used, the source and destination email addresses, the source and destination SIP URIs or the HTTP URIs.
According to a particular embodiment, the symbolic packet feature under consideration has a two-dimensional nature, i.e., its discrete values can be indexed by two indices. In particular, a symbolic packet feature can be a two-dimensional vector consisting of two discrete coordinates. For example, the symbolic packet feature may be a pair of source and destination IP addresses, port numbers, email addresses, or SIP URIs, respectively.
In accordance with another particular embodiment, the samples of the symbolic feature x can be obtained by quantizing the samples of a “numerical packet feature” into a set of discrete values (e.g., smaller than the set of the numerical features) and are then treated as symbolic features. A numerical packet feature is any quantity extracted from network packets that can be expressed as numerical data by a real, rational, or integer number. According to this definition, it is meaningful to measure the distance or closeness between numerical feature values by the Euclidean metric. Particularly, but not necessarily, the packet numerical feature may relate to and provide an indication about the traffic volume, i.e., the data amount transported by a packet or a packet flow portion. The definitions of some specific numerical packet features which can be extracted from packets are the following:
It is observed that the length ΔT essentially specifies the time resolution with which the traffic is monitored and analyzed and can be static or dynamic. The starting and ending times of the first and the last packet in a flow, respectively, as well as the total number of monitored flows can also be extracted in step 203. The basic numerical features described above, which are based on the information contained in layers 3 and 4 of packet headers are already available in commercial products used in IP networks such as routers and switches (e.g., the well-known Netflow data).
According to another embodiment, to reduce the number of samples of the symbolic packet features, the values of the symbolic packet feature can be grouped or hashed into a smaller set of values. For example, only some bits of a 32-bit IP address can be chosen. More generally, only some linear functions of bits of a 32-bit IP address can be chosen. As known to the skilled person, a (conventional) hash function is a reproducible method of turning original data, belonging to a symbolic data set, into hashed data belonging to a reduced symbolic data set, which contains a reduced number of discrete values. The hashed data is also called a digital “fingerprint” or digest of the original data. According to an example of the invention, conventional hash functions can be used. Alternatively, cryptographic hash functions can be used even if cryptographic properties are not required by the described example of the detection method.
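The two reductions mentioned above, namely choosing only some bits of a 32-bit IP address and hashing symbolic values into a smaller set, can be sketched as follows. The function names, the choice of the most significant bits, and the use of CRC-32 as a conventional (non-cryptographic) hash are assumptions made for the example.

```python
import ipaddress
import zlib

def ip_prefix_bits(ip, n_bits=16):
    """Keep only the n_bits most significant bits of a 32-bit IPv4 address."""
    if not 1 <= n_bits <= 32:
        raise ValueError("n_bits must be between 1 and 32")
    return int(ipaddress.IPv4Address(ip)) >> (32 - n_bits)

def hash_to_reduced_set(symbol, n_values=256):
    """Map a symbolic value (e.g., a SIP URI) to one of n_values reduced values.

    CRC-32 is used here only as an example of a conventional hash function.
    """
    return zlib.crc32(symbol.encode("utf-8")) % n_values
```

For instance, `ip_prefix_bits("192.168.1.7", 16)` and `ip_prefix_bits("192.168.200.9", 16)` return the same value, so both addresses fall into the same group.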
Furthermore, it is noticed that the extracting step 203 can be performed by a hardware and/or software extractor module 106 included in each or only in some of the routers R1-R4 or in other network nodes that are arranged to extract and transmit the extracted symbolic or numerical features to the data collection module 102 of the detection apparatus 101 (
In a computing step 204 (CONCENTRATION), a first statistical concentration quantity or measure Cq1 of the symbolic feature x associated with the first packet flow portion PFP1 is computed on the basis of the corresponding symbolic feature samples (xi)1. Moreover, a second statistical concentration quantity or measure Cq2 of the symbolic feature x associated with the second packet flow portion PFP2 is computed on the basis of the corresponding symbolic feature samples (xi)2. The first and second concentration quantities are associated with and describe the traffic status of the corresponding first and second packet flow portions.
A statistical concentration quantity of a set of data is a measure of how the observed symbolic values are concentrated in a given symbolic data set. Particularly, a statistical concentration quantity is a real number that achieves its maximum value if all the data values are identical, and generally decreases as the data values become dispersed among a larger subset of values. According to a particular embodiment, the first statistical concentration quantity Cq1 and the second statistical concentration quantity Cq2 are a first concentration measure C1 of the sample probability distribution of the symbolic data feature and a second concentration measure C2 of the sample probability distribution of the symbolic data feature, respectively. The computing 204 of the first and second statistical concentration quantities can be performed, according to the example, by the statistical analysis module 104. Several methods to compute the statistical concentration quantity will be described with reference to further embodiments of the example of
In a further computing step 205 (VARIATION), a variation quantity Δ is computed from the first statistical concentration quantity Cq1 and the second statistical concentration quantity Cq2. The variation quantity Δ measures a statistical variation or change between the first statistical concentration quantity Cq1 associated with the first packet flow portion PFP1 and the second statistical concentration quantity Cq2 associated with the second packet flow portion PFP2. Preferably, the expected value of the variation quantity Δ should be relatively small if the first packet flow portion PFP1 and the second packet flow portion PFP2 are both drawn from a same probability distribution.
Particularly, the variation quantity Δ can be related to a difference between the first statistical concentration quantity Cq1 and the second statistical concentration quantity Cq2. Preferably, the variation quantity Δ is obtained from said first C1 and second C2 concentration measures of the probability distribution. The computation of the variation quantity Δ can also be carried out by the statistical analysis module 104.
The variation quantity Δ is compared, in a comparison step 206 (COMPARE), with a comparison value, such as a threshold Thr. According to said comparison step 206, if the threshold value Thr is exceeded, then an anomaly is detected (branch Yes) and an alarm signal ALARM is generated in an alarm issuing step 207. If the threshold value Thr is not exceeded, then an anomaly is not detected (branch No) and an alarm signal ALARM is not generated. Particularly, the comparison and the alarm generation step 207 can be performed by the above mentioned alarm generation module 105. The threshold can be static or dynamic and can be determined on the basis of historical data. In particular, a dynamic threshold can change adaptively.
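The sequence of steps 204 to 206 can be sketched as follows. This is a minimal sketch under stated assumptions: the variation quantity is taken as the absolute difference of the two concentration quantities, the threshold is static, and `share_of_most_common` is a hypothetical placeholder concentration quantity (it is maximal when all values are identical) used only to make the example runnable.

```python
from collections import Counter

def share_of_most_common(samples):
    """Placeholder concentration quantity: relative frequency of the most frequent value."""
    if not samples:
        raise ValueError("at least one sample is required")
    counts = Counter(samples)
    return counts.most_common(1)[0][1] / len(samples)

def detect_anomaly(samples_1, samples_2, concentration, threshold):
    """Compute Cq1 and Cq2, derive the variation, and compare it with the threshold."""
    c_q1 = concentration(samples_1)   # step 204, first packet flow portion
    c_q2 = concentration(samples_2)   # step 204, second packet flow portion
    delta = abs(c_q2 - c_q1)          # step 205, assumed absolute difference
    return delta > threshold, delta   # step 206, alarm if the threshold is exceeded
```

Passing the concentration quantity as a parameter keeps the comparison logic independent of the particular measure chosen in the embodiments.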
Following a positive (Yes) or negative (No) anomaly detection, the detection method 200 can be repeated in connection with further packet flow portions. Particularly, the further packet flow portions can lie in time intervals whose end points are delayed with respect to the ones in which the first (PFP1) and second (PFP2) packet flow portions were included. Even more particularly, the further packet flow portions can lie in time intervals whose both start and end points are delayed with respect to the ones in which the first (PFP1) and second (PFP2) packet flow portions were included.
It should be noticed that for each monitored flow portion, not only a single packet numerical feature but also a plurality of numerical packet features can be extracted and stored and subsequently converted into symbolic features, as explained above. For example, the following features can be considered: Rpacket, Rbyte, and Nsize. It is observed that any two of the numerical features Rpacket, Rbyte, and Nsize are mutually independent.
Any such feature can be used to detect a respective anomaly. In particular, the average packet size Nsize is preferable for detecting anomalous traffic comprising repeated transmission of essentially the same or similar packets (e.g., packets with the same payload), because in this case Nsize changes its probability distribution over time with respect to normal traffic, e.g., its concentration measure over time may increase. For example, if a message flooding (D)DoS attack is in progress on a SIP network, then it may be likely that a particular type of SIP messages/packets (e.g., INVITE, RE-INVITE, BYE, or REGISTER) is (much) more frequent than the others.
Moreover, in addition to the average packet size Nsize, the packet rate Rpacket is also monitored and used in the anomaly detection method 200. For most anomalous traffic, such as request or message flooding and reflection DDoS traffic, the traffic volume is increased, and this is reflected in an increased value of Rpacket. However, an increased Rpacket can also be caused by normal traffic such as flash crowds. Also, in the case of DDoS attacks, the traffic volume is high near the target but may be low near the distributed sources of the attack. Therefore, it is preferable to employ both Nsize and Rpacket for statistical anomaly detection.
The features Nsize and Rpacket can be traced in time at the chosen network node (i.e. a router) or a set of nodes, for each individual flow or for certain selected flows, e.g., according to the packet rate. Alternatively, in accordance with a particular example, the numerical feature values for individual flows can be aggregated in groups according to selected packet parameters such as the source or destination IP addresses or the source or destination port numbers. For example, the flows can be grouped for the same source IP address or the same destination IP address. In the former case, the flow statistics correspond to the outbound traffic from a particular network node, and in the latter, they correspond to the inbound traffic to a particular network node. The number of simultaneously monitored flows with the same IP address as the source/destination address indicates the activity of a node with that IP address as the source/destination node in the observed time interval, respectively. The detection method 200 can be applied to any group of aggregated packet numerical features values converted into symbolic features. The features grouping step can be performed by the flow aggregation module 103 (
Alternatively, the features for all the flows monitored can be grouped together, in particular, by distinguishing the direction of flows, regardless of the particular source/destination IP addresses. This type of grouping is interesting for a high level analysis which does not pay attention to particular nodes or users, but rather to the network traffic as a whole. Instead of the IP addresses, the features grouping can be made according to the port numbers, which are indicative of the applications of the packets transmitted.
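The grouping of flow features according to a selected packet parameter, such as the source IP address, the destination IP address, or a port number, can be sketched as follows. The record layout (dictionaries with `src_ip`, `dst_ip`, `packets`, and `bytes` keys) and the function name are assumptions made for the example.

```python
from collections import defaultdict

def aggregate_flows(flow_records, key_field):
    """Aggregate flow counts, packets, and bytes over flows sharing the same key value."""
    groups = defaultdict(lambda: {"flows": 0, "packets": 0, "bytes": 0})
    for record in flow_records:
        group = groups[record[key_field]]
        group["flows"] += 1                  # activity of the node in the interval
        group["packets"] += record["packets"]
        group["bytes"] += record["bytes"]
    return dict(groups)
```

Grouping on `src_ip` yields statistics of the outbound traffic of each node, while grouping on `dst_ip` yields statistics of its inbound traffic.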
With reference to the selection of the basic symbolic features, it should be noticed that other symbolic packet features of interest include the source and destination IP addresses and the source and destination port numbers. They are especially interesting for detecting DDoS attacks using randomly spoofed source IP addresses or port scanning attacks using randomly generated destination port numbers. They are also useful for detecting other attacks such as massive malicious software attacks targeting random destination IP addresses. In particular, two-dimensional symbolic data features and the average conditional concentration measure of destination IP addresses can be considered in order to detect SPAM or SPIT attacks or massive malicious software attacks and the average conditional concentration measure of source IP addresses can be considered in order to detect DDoS attacks.
With reference to the selection of other symbolic features, it is also possible to extract and use information contained in other layers such as the application layer (layer 7), in addition to the basic symbolic packet features described above. For example, for SIP packets, the type of packet being transmitted or a source or destination SIP URI can be extracted. Then, this information along with the basic flow features related to layers 3 and 4 can be used either directly or for aggregating the packet data Rpacket, Rbyte, and Nsize. For example, one may use the application layer information directly, for detecting DDoS attacks using randomly spoofed source email addresses or SIP URIs and for detecting SPAM or SPIT attacks using random destination email addresses or SIP URIs, respectively. In particular, one may consider two-dimensional symbolic data features and the average conditional concentration measure of destination email addresses or SIP URIs in order to detect SPAM or SPIT attacks, respectively, or the average conditional concentration measure of source email addresses or SIP URIs in order to detect DDoS attacks.
A first embodiment 300 of the detection method 200, is described herein below with reference to
As regards step 202 of
Accordingly, two successive windows of (approximately) the same length T are shifted τ units of time from each other and hence overlap over T−τ units of time. In this embodiment, at any given time, the packet flow portion PFP1 then corresponds to a sliding window at this time and the packet flow portion PFP2 corresponds to the next sliding window, delayed by τ. It should be noted that samples of the numerical features x can be taken irregularly in time, i.e., in time intervals of possibly variable length ΔT. In this case, the number of samples per sliding window may vary in time, and so do the numbers of overlapping and non-overlapping samples in two successive sliding windows.
Alternatively, when the samples of x are taken irregularly in time, sliding windows containing the same number of samples and mutually shifted with respect to each other by a fixed number of samples instead of a fixed number of units of time can be chosen. In this case, the windows are defined in terms of samples instead of the corresponding time, which may be variable.
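Sliding windows defined in terms of samples, as described above, can be sketched as follows. The function and parameter names are assumptions made for the example; `window_size` plays the role of the window length and `shift` the role of the mutual shift, both counted in samples.

```python
def sliding_windows(samples, window_size, shift):
    """Yield successive windows of window_size samples, each shifted by shift samples."""
    if window_size <= 0 or shift <= 0:
        raise ValueError("window_size and shift must be positive")
    for start in range(0, len(samples) - window_size + 1, shift):
        yield samples[start:start + window_size]
```

Two successive windows produced in this way overlap over `window_size - shift` samples, which mirrors the overlap of T−τ units of time in the time-based definition.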
As shown by means of functional blocks in
(x)i=m
corresponding to a jth sliding window of length T (e.g., window W1 of
The number of achievable discrete values of the feature x is denoted by m, and the set of all m achievable values is denoted by $A=\{a_k : 1 \le k \le m\}$.
In a given jth sliding window, Fk,j denotes the number of times a value ak is achieved, i.e., the absolute frequency of a value ak. In a step 304 of
$$f_{k,j} = F_{k,j} / n_j \qquad (2)$$
wherein fk,j is the relative number of times a value ak is achieved, i.e., the relative frequency of a value ak. In general, a relative frequency of a particular discrete value in a finite sample of values is defined as the number of occurrences of this value in the sample divided by the total number of values in the sample. Accordingly, the estimated or sample probability distribution is
$$P_j = (f_{k,j})_{k=1}^{m} \qquad (3)$$
Therefore, the estimated or sample probability distribution is an ordered set of relative frequencies.
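The computation of the relative frequencies of expression (2) for one sliding window can be sketched as follows. The function name is an assumption made for the example; only the values actually observed in the window are stored, all other values of the set A implicitly having relative frequency 0.

```python
from collections import Counter

def relative_frequencies(window_samples):
    """Return the relative frequency of every value observed in the window."""
    n_j = len(window_samples)            # number of samples in the window
    if n_j == 0:
        raise ValueError("the window must contain at least one sample")
    counts = Counter(window_samples)     # absolute frequencies F_{k,j}
    return {value: count / n_j for value, count in counts.items()}
```

For example, the window `["tcp", "tcp", "udp", "tcp"]` yields the relative frequencies 0.75 for `"tcp"` and 0.25 for `"udp"`.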
According to the first embodiment, the statistical concentration quantity considered is a concentration measure associated with the estimated probability distribution $P_j = (f_{k,j})_{k=1}^{m}$. Then, a first concentration measure $C_j$ is computed as a quadratic concentration measure for the computed relative frequencies:
The summation of formula (4) is computed for m addends.
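A quadratic concentration measure of this kind can be sketched as follows, under the assumption that it is the sum of the squared relative frequencies; this form is consistent with the bounds between 1/m and 1 stated for the measure, but it is an assumption of the example rather than a reproduction of expression (4). The function name is likewise hypothetical.

```python
def quadratic_concentration(rel_freqs):
    """Sum of squared relative frequencies (assumed form of the quadratic measure)."""
    return sum(f * f for f in rel_freqs)
```

Under this assumption, relative frequencies equal to zero contribute nothing to the sum, so restricting the summation to the non-zero relative frequencies gives the same value. A uniform distribution over four values gives 0.25, and a distribution concentrated on a single value gives 1.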
According to another embodiment of step 304, the concentration measure is computed by applying the following formula:
where mjeff, which is smaller than or equal to m, denotes the total number of values ak achieved in the jth sliding window, i.e., the total number of non-zero relative frequencies. The summation of formula (5) is computed for mjeff addends. Expression (5) represents a quadratic concentration measure for the computed non-zero relative frequencies. If the number of samples nj is not sufficiently large to cover the whole range of m values (e.g., if nj<m or nj<<m), then mjeff<m.
Alternatively, the concentration can be computed by the following expression:
which may be numerically more convenient than expression (4) if m is large and the involved relative frequencies are very small. It shall be noticed that a statistical dispersion measure corresponding to the quadratic concentration measure (4) is a quadratic entropy defined as:
It follows that $1/m \le C_j \le 1$. The maximum value $C_j = 1$ is achieved if and only if the probability distribution is maximally concentrated, i.e., if there exists exactly one relative frequency equal to 1 and all the others are equal to 0. The minimum value $C_j = 1/m$ is achieved if and only if the probability distribution is uniform, i.e., $f_{k,j} = 1/m$ for all $1 \le k \le m$. In particular, the quadratic concentration measure defined by any of the expressions (4)-(6) is particularly interesting if the number of samples $n_j$ is relatively large with respect to m, e.g., if $n_j \ge m$.
Alternatively to the expressions (4), (5) and (6), the following formula can be used:
where $\tilde{m} \le m$ and
are the normalized $\tilde{m}$ highest relative frequencies, which sum up to 1. These frequencies correspond to a probability distribution $P'_j = (f'_{k,j})_{k=1}^{m}$, which represents an ordered estimated probability distribution with the relative frequencies indexed in order of decreasing values, i.e., $f'_{1,j} \ge f'_{2,j} \ge \dots \ge f'_{m-1,j} \ge f'_{m,j}$. The summation of formula (8) is computed for $\tilde{m}$ addends. Expression (8) defines a quadratic concentration measure for the number $\tilde{m}$ of highest relative frequencies.
It follows that $1/\tilde{m} \le C_j \le 1$. With reference to expression (8), the maximum value Cj=1 is achieved if and only if the probability distribution is maximally concentrated, i.e., if there exists exactly one relative frequency equal to 1 and all the others are equal to 0. The minimum value $C_j = 1/\tilde{m}$ is achieved if and only if the probability distribution of the $\tilde{m}$ highest relative frequencies is uniform, i.e., $\tilde{f}_{k,j} = 1/\tilde{m}$ for all $1 \le k \le \tilde{m}$. This concentration measure is interesting if the number of samples nj is relatively small with respect to m, i.e., if nj<m. The number $\tilde{m}$ should be determined so that $m_j^{\mathrm{eff}} > \tilde{m}$ is satisfied with a high probability.
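As a hedged illustration of the measure restricted to the highest relative frequencies, the following Python sketch keeps the m_tilde most frequent values of a window, renormalizes their frequencies so that they sum to 1, and sums the squares. This reading of expression (8) is an assumption inferred from the description above, and the function name and parameter are hypothetical.

```python
from collections import Counter


def top_concentration(samples, m_tilde):
    """Quadratic concentration of the m_tilde most frequent values, renormalized."""
    if m_tilde < 1 or not samples:
        raise ValueError("need m_tilde >= 1 and a non-empty window")
    # most_common returns at most m_tilde (value, count) pairs, highest counts first.
    top_counts = [count for _, count in Counter(samples).most_common(m_tilde)]
    total = sum(top_counts)
    return sum((count / total) ** 2 for count in top_counts)
```

If fewer than m_tilde distinct values occur in the window, the sketch simply uses all of them, which is why the text recommends choosing the number so that the window almost always contains more distinct values than that.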
In accordance with an alternative embodiment, the concentration measure is computed as the number of repetitions among all nj samples:
$C_j = n_j - m_j^{\mathrm{eff}}$. (10)
It follows that 0≦Cj≦nj−1. The maximum value Cj=nj−1 is achieved if and only if the probability distribution is maximally concentrated, i.e., if there exists exactly one relative frequency equal to 1 and all the others are equal to 0, i.e., if mjeff=1. The minimum value Cj=0 is achieved if and only if there are no repetitions, i.e., if all the values generated are different, i.e., if mjeff=nj, in which case nj≦m has to be satisfied. This concentration measure is particularly interesting if the number of samples nj is much smaller than m, i.e., if nj<<m, in which case most of the relative frequencies are relatively small. Note that if the samples are generated randomly according to the uniform probability distribution and if $n_j \approx \sqrt{m}$, then the expected number of repetitions is approximately $n_j^2/(2m)$. More precisely, the expected number of repetitions is approximately $n_j - m(1 - e^{-n_j/m})$.
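The repetition count and its expected value under uniformly random samples may be illustrated by the following Python sketch. The function names are hypothetical. The exact expectation used for comparison follows from the standard result that n uniform draws from m values produce m(1−(1−1/m)^n) distinct values on average.

```python
import math


def repetition_count(samples):
    """Number of samples minus number of distinct values observed in the window."""
    return len(samples) - len(set(samples))


def expected_repetitions_uniform(n, m):
    """Expected repetitions for n uniform draws from m values.

    Returns (exact, exponential approximation, small-n approximation).
    """
    exact = n - m * (1.0 - (1.0 - 1.0 / m) ** n)
    exp_approx = n - m * (1.0 - math.exp(-n / m))
    small_n_approx = n * n / (2.0 * m)
    return exact, exp_approx, small_n_approx


# Hypothetical case: 16-bit port numbers (m = 65536) and n = sqrt(m) = 256 samples.
exact, exp_approx, small_n_approx = expected_repetitions_uniform(256, 65536)
```

For these values all three figures are close to one half of a repetition per window, so even a handful of repeated values in such a window is noteworthy.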
With reference to the case in which the symbolic packet feature under consideration has a two-dimensional nature and so can be indexed by two indexes, the concentration measure can be based on a conditional probability, i.e., the probability of a value of one of the two indexes given a value of the other index. Particularly, the concentration measure can be computed as an average conditional quadratic concentration measure in accordance with the following expression:
With reference to the quantities indicated in expression (11), in a given jth window, Fk
Furthermore, the average conditional quadratic concentration measure conditioned on the index k2 is computed in accordance with the following expression, which is analogous to formula (11):
It should be noticed that in expressions (11) and (12), fk
With reference again to a two-dimensional symbolic variable, alternatively to expressions (11) and (12), the following “unconditional” concentration measures can also be computed:
Formula (13) defines an unconditional concentration measure pertaining to both the indices k1 and k2, formula (14) defines an unconditional concentration measure pertaining to the first index k1, and formula (15) defines an unconditional concentration measure pertaining to the second index k2.
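For a two-dimensional symbolic feature, the average conditional quadratic concentration measure conditioned on the first index admits, under one plausible reading, the following Python sketch. Since expressions (11) and (12) are not reproduced in this text, the form used here is an assumption: the quadratic concentration of the second index given each value of the first index, averaged with weights equal to the relative frequency of that first-index value. The function name and the sample pairs are hypothetical.

```python
from collections import Counter, defaultdict


def average_conditional_concentration(pairs):
    """Average over k1, weighted by f_k1, of the quadratic concentration of k2 given k1."""
    if not pairs:
        raise ValueError("window must contain at least one sample")
    n = len(pairs)
    by_first = defaultdict(Counter)
    for k1, k2 in pairs:
        by_first[k1][k2] += 1
    total = 0.0
    for counter in by_first.values():
        n1 = sum(counter.values())                       # occurrences of this k1
        conditional = sum((c / n1) ** 2 for c in counter.values())
        total += (n1 / n) * conditional
    return total


# Hypothetical (source address, destination address) pairs in one window.
pairs = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "z")]
c_cond = average_conditional_concentration(pairs)
```

Swapping the two components of each pair gives the corresponding measure conditioned on the second index.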
Analogously to the first segment of expression (1), a second segment of symbolic feature samples
(xi)i=m
corresponding to a (j+1)th sliding window of length T (e.g., second window W2 of
In a computing step 305, the variation quantity Δ is computed from said first Cj and second Cj+1 concentration measures. According to an example, the variation quantity is an absolute squared difference Δj+1 of the concentration measures for the two successive segments (1) and (16) and can be computed by the following formula:
$\Delta_{j+1} = (C_{j+1} - C_j)^2$. (17)
Alternatively, the variation quantity is computed as the relative squared difference of the concentration measures for two successive segments by one of the following expressions:
In a comparison step 306, the difference Δj+1 or δj+1 is then compared with a fixed or dynamic threshold θj+1, where generally the threshold increases as T/τ decreases. If the threshold is exceeded once or repeatedly a specified number of times in successive computations, where this number may increase as the ratio T/τ decreases, then an alarm ALARM for anomalous traffic is generated in an alarm step 307.
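The computing, comparison and alarm steps may be sketched in Python as follows. This is a minimal illustration under stated assumptions: the sliding windows are defined by a fixed number of samples rather than a fixed duration, the concentration measure is the sum of squared relative frequencies, the variation quantity is the absolute squared difference of expression (17), and the threshold is fixed. The function and parameter names are hypothetical.

```python
from collections import Counter


def quadratic_concentration(samples):
    """Sum of squared relative frequencies in one window."""
    n = len(samples)
    return sum((c / n) ** 2 for c in Counter(samples).values())


def detect(samples, window, shift, threshold, required_hits=1):
    """Return the start indices of the windows at which an alarm is raised."""
    alarms = []
    hits = 0            # consecutive threshold crossings so far
    previous = None     # concentration measure of the preceding window
    for start in range(0, len(samples) - window + 1, shift):
        current = quadratic_concentration(samples[start:start + window])
        if previous is not None:
            delta = (current - previous) ** 2          # expression (17)
            hits = hits + 1 if delta > threshold else 0
            if hits >= required_hits:
                alarms.append(start)
        previous = current
    return alarms
```

Raising required_hits trades detection delay for fewer false alarms, in line with the remark that the required number of successive crossings may increase as the ratio T/τ decreases.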
With reference to the threshold definition and according to an example, the threshold θ may be a fixed value. In accordance with another example, and to account for changes of the concentration measure in normal traffic, the threshold θ may be determined from historical data for normal traffic at the considered network node, in order to keep the false-positive rate reasonably low. In particular, the threshold θ may depend on the time of the day. The particular relative squared difference to be used, among those defined above, can be chosen so as to minimize its variation on historical data for normal traffic. Particularly, the threshold can be chosen irrespective of any statistical model estimating the traffic behavior.
Given an appropriate value of the threshold θ, it is then expected that the probability that the threshold is exceeded, i.e., the false-positive rate, is low for normal traffic, whereas at times when there is a change from normal traffic to anomalous traffic, it is expected that the probability that the threshold is not exceeded, i.e., the false-negative rate, is also low.
It is noticed that the method 300 is robust as the changes are related to concentration measures of probability distributions, and not to probability distributions, which may change rapidly for the normal traffic conditions. The concentration measure of formula (8) is more robust than the one of formulas (4) and (5), because it relates to a subset of the highest relative frequencies only. The concentration measure of formula (10) is more robust than the one of formulas (4) and (5) or (8), because it depends only on the total number of values achieved and, as such, is less sensitive to the probability distribution itself.
However, unlike the concentration measures (4), (5) and (8), the concentration measure according to expression (10) is sensitive to changes in the number of samples nj from one sliding window to another. Accordingly, the sliding windows for the concentration measure (10) can be defined in terms of a fixed number of samples instead of a fixed time duration and mutually shifted with respect to each other by a fixed number of samples instead of a fixed number of units of time. This also simplifies the computation of the concentration measures (4), (5) and (8). Alternatively, if the number of samples is expected to vary considerably from one window to another, one can perform a normalization of expression (10) by dividing by an appropriate normalization factor, e.g., by $n_j^2/(2m)$.
It should be noticed that the value of the delay or shift τ determines the resolution of the above proposed statistical anomaly detection method 300, because it takes τ units of time, or a small multiple of τ units of time, in order to detect a change from normal to anomalous traffic. Preferably, the value of T should be large enough in order to obtain relatively stable estimates of the chosen concentration measure so that for normal traffic the relative changes of the concentration measure are not too large. On the other hand, the ratio T/τ should not be too large so that the change of traffic from normal to anomalous does not require a very small threshold θ to be detected. For example, the ratio T/τ may be chosen so as to satisfy the following expression:
1≦T/τ≦10. (21)
According to a second embodiment of the detection method 200, the two successive windows are defined differently from the first embodiment.
According to this second embodiment, in step 202 at time j+1, the following first and second sample segments corresponding to packet flow portions PFP1 and PFP2, respectively, are considered:
(xi)i=m
(xi)i=m
where the first segment (22) is the initial part of the second segment, without the ending part (xi)i=m
As indicated in step 206, the squared difference Δj+1 or δj+1 is then compared with a threshold. This threshold may be somewhat reduced in comparison with the threshold of the first embodiment 300, because, for normal traffic, the concentration measures for the two segments are then expected to be less mutually different.
The method of the second embodiment may be more suitable than the one of the first embodiment 300 for detecting anomalous traffic of duration shorter than the window size T. This is due to the fact that in the first embodiment, a considerable change in concentration measure, due to the beginning of anomalous traffic, would be detected not only by the ending point of the sliding window (such as the window W2 in
In a third embodiment of the detection method 200, a moving window of increasing length is defined. Such moving window extends from a chosen initial time up to the current time, and each time, the ending point of the moving window advances τ units of time, where τ determines the resolution in time for detecting the anomalous changes in traffic.
At each time, the packet flow portions PFP1 and PFP2 correspond to two successive moving windows. Accordingly, for a generic window index j, the packet flow portion PFP1 is defined by the segment
$(x_i)_{i=1}^{m_j}$
which is associated with the jth moving window containing mj samples, and the packet flow portion PFP2 is defined by the segment
$(x_i)_{i=1}^{m_{j+1}}$
which is associated with the (j+1)th moving window containing mj+1 samples.
According to the third embodiment, the concentration measure is based on the relative frequencies of individual discrete values, which are computed by selected exponentially weighted sums so that the influence of past data on the concentration measure decreases with the age of the data, in order to ensure sensitivity to anomalous behavior of the current data. As an example, the third embodiment is based on a novel Exponentially Weighted Moving Average (EWMA) technique applied to relative frequencies of discrete values of symbolic variables. This method is described hereinafter in greater detail and in terms of mathematical equations.
In the computing step 204, a sequence $(\vec{\lambda}_i)_{i=1}^{\infty}$ of value-indicator vectors associated with the sequence of samples $(x_i)_{i=1}^{\infty}$ is defined. A vector $\vec{\lambda}_i$, at time i, is an m-dimensional binary vector associated with the sample xi whose coordinates correspond to different discrete values, with only one coordinate equal to 1, namely, the coordinate corresponding to the discrete value assumed by the sample xi, and all the remaining coordinates equal to zero. Moreover, another vector $\vec{f}_t$ is defined; this vector is an m-dimensional vector of estimated relative frequencies on the segment $(x_i)_{i=1}^{t}$ of t initial data samples.
In accordance with the specific EWMA technique defined, a computation of the vector of estimated relative frequencies $\vec{f}_t$ is performed by an iterative-recursive method. Particularly, the computation of $\vec{f}_{t+1}$ for every new data sample considered, for t=1,2, . . . , is performed in accordance with the following expression:
$\vec{f}_{t+1} = a\vec{\lambda}_{t+1} + (1-a)\vec{f}_t$ (26)
with the initial value $\vec{f}_1 = \vec{\lambda}_1$, where 0<a≦1. The meaning of the recursion (26) can be seen from its explicit solution:
which represents an exponentially weighted average applied to the value-indicator vectors. It should be noticed that the vectorial recursion (26) is equivalent to the set of m scalar recursions, corresponding to m individual discrete values, for computing individual relative frequencies, for k=1,2, . . . , m, by
$f_{k,t+1} = a\lambda_{k,t+1} + (1-a)f_{k,t}$ (28)
with an explicit solution
where λk,i=1 if xi=ak and λk,i=0 otherwise.
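For reference, unrolling the scalar recursion (28) from the initial value $f_{k,1} = \lambda_{k,1}$ gives the closed form below. It is derived here directly from the recursion and may differ in presentation from the explicit solution printed in the original document.

```latex
f_{k,t} \;=\; (1-a)^{t-1}\,\lambda_{k,1} \;+\; a\sum_{i=2}^{t} (1-a)^{t-i}\,\lambda_{k,i},
\qquad
(1-a)^{t-1} + a\sum_{i=2}^{t}(1-a)^{t-i} \;=\; 1 .
```

The weights are non-negative and sum to 1, so each $f_{k,t}$ lies between 0 and 1 and the m values $f_{1,t}, \dots, f_{m,t}$ sum to 1; the weight attached to a sample decays geometrically with its age t−i.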
At time j+1, the concentration measure associated with the second segment (xi)i=1m
The concentration measures can be computed by applying any of the corresponding expressions (4), (5), (8), (11), or (12) to the iteratively computed relative frequencies. It is observed that the concentration measure defined by expression (10) is excluded from the third embodiment, because it is not specifically related to relative frequencies.
More precisely, the chosen concentration measure Cj is then computed for each segment (xi)i+1m
The value of the constant a determines the effective number of past samples influencing the current relative frequency vector and the resulting concentration measure estimates. More precisely, this number increases as the constant decreases. In particular, smaller values of a are preferred in order to obtain relatively stable relative frequency estimates. Preferably, a should be chosen in accordance with the statistical properties of the normal traffic. In general, the faster the variations of the concentration measure expected in normal traffic, the larger the constant a should be.
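The exponentially weighted update of the relative frequencies may be sketched in Python as follows. The sketch implements the recursion with the initial value equal to the first value-indicator vector and then applies a quadratic concentration measure, assumed here to be the sum of squared relative frequencies. The function names and the explicit list of admissible values are hypothetical.

```python
def ewma_frequencies(samples, values, a):
    """Yield the EWMA relative-frequency vector after each sample.

    values: the m admissible discrete values, in a fixed order.
    a:      smoothing constant, 0 < a <= 1.
    """
    index = {value: k for k, value in enumerate(values)}
    f = None
    for x in samples:
        indicator = [0.0] * len(values)    # binary value-indicator vector
        indicator[index[x]] = 1.0
        if f is None:
            f = indicator                  # initial value f_1 = lambda_1
        else:
            f = [a * lam + (1.0 - a) * fk for lam, fk in zip(indicator, f)]
        yield list(f)


def quadratic_concentration(freqs):
    """Sum of squared relative frequencies."""
    return sum(fk * fk for fk in freqs)


history = list(ewma_frequencies(["a", "b", "b"], ["a", "b"], 0.5))
concentrations = [quadratic_concentration(f) for f in history]
```

With a=1 the frequency vector simply equals the latest value-indicator vector, whereas small values of a average over many past samples (very roughly 1/a of them, as a rule of thumb rather than a statement of the original text).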
In a fourth embodiment 400 of the present invention, schematically shown in
Moreover, for each considered symbolic packet feature, a relative squared difference of concentration measures is computed, for j=1, . . . , N. In a further step 401 (Σ), the relative squared differences of concentration measures C1, . . . , CN are combined to obtain a total variation quantity Δtot. According to an example, the combination step is a summation of said relative squared differences of concentration measures C1, . . . , CN. The total variation quantity Δtot is then compared (step 402) with a threshold value Thr in order to detect an anomalous condition which can cause the generation of an alarm signal or message ALARM (branch Yes). The comparison of the total variation quantity Δtot with the threshold value Thr could detect a normal traffic condition (branch No) and, in this case, no alarm signals or messages are generated.
It should be noticed that the combination step 401 may be performed by different types of combination algorithms. According to another particular example, prior to performing the summation, the relative squared differences of concentration measures C1, . . . , CN are multiplied with different weights that may be associated with individual symbolic packet features.
Moreover, different decision criteria may be employed. According to an example, a total variation quantity Δtot is not computed and comparisons of each of the relative squared differences of concentration measures C1, . . . , CN with a respective threshold are performed. An alarm signal is then generated if at least a specified number of said comparisons detect an anomalous traffic condition. According to another example, in addition to the variation quantity criterion, aiming at detecting sudden changes of concentration measure, one may also take into account other criteria, e.g., for message flooding (D)DoS attacks, one may require that there is also a significant change of the packet rate Rpacket or the byte rate Rbyte. For Nsize, apart from looking for a considerable (relative) change in concentration measure, one may also specifically require that the concentration measure increases.
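The combination and decision criteria described for the fourth embodiment may be illustrated by the following Python sketch. It covers the plain summation, the weighted summation, and the criterion requiring a minimum number of individual threshold crossings. All function names, weights and threshold values are hypothetical.

```python
def weighted_total(deltas, weights=None):
    """Total variation quantity: (optionally weighted) sum of per-feature variations."""
    if weights is None:
        weights = [1.0] * len(deltas)
    return sum(w * d for w, d in zip(weights, deltas))


def alarm_by_total(deltas, threshold, weights=None):
    """Alarm if the total variation quantity exceeds a single threshold."""
    return weighted_total(deltas, weights) > threshold


def alarm_by_votes(deltas, thresholds, min_votes):
    """Alarm if at least min_votes features exceed their own thresholds."""
    votes = sum(1 for d, t in zip(deltas, thresholds) if d > t)
    return votes >= min_votes


# Hypothetical variation quantities for N = 3 symbolic packet features.
deltas = [0.02, 0.30, 0.01]
```

With these figures the unweighted sum crosses a threshold of 0.25, whereas halving the weight of the second feature keeps the total below it.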
In addition to symbolic packet features, numerical packet features (as defined above) can also be taken into account. In this case, the detection of anomalies in the communication system 100 can be performed by a combination of the above indicated criteria based on symbolic packet features and other criteria based on monitoring the statistical behavior of numerical packet features associated with the first and second packet flow portions. As an example, the criteria described in the pending PCT application filed in the name of the same applicant as the present patent application and entitled “Method of detecting anomalies in a communication system using numerical packet features” can be employed in combination with the ones herein described. This further PCT application, which refers to a method of monitoring a statistical dispersion quantity, is incorporated herein by reference. Particularly, the result of the detection method 200 of the present invention based on symbolic packet features can be combined with the result of the detection method based on numerical packet features in accordance with a combination algorithm comprising, as an example, logical operations, e.g., OR and AND.
According to another example of the present invention, relating to all four embodiments described above, each elementary time interval ΔT can contain a number of aggregated discrete values of the symbolic feature chosen, instead of only one discrete value. As indicated above, the data aggregation can be performed by the aggregation module 103. It is observed that, in this case, the timings of individual discrete values within ΔT can be regarded as irrelevant. Since all the above described concentration measures depend only on the numbers of occurrences of discrete values in the moving windows considered, and not on their order, the proposed methods of the first and second embodiments, using sliding windows, can be easily applied to aggregated data, whereas the method of the third embodiment, using moving windows and exponentially weighted averages, can be adapted to deal with the aggregated data. More precisely, instead of the binary value-indicator vectors $(\vec{\lambda}_i)_{i=1}^{\infty}$ corresponding to individual discrete values at given times, at each time corresponding to an elementary time interval ΔT, the average value-indicator vector is computed as the arithmetic mean of all the binary value-indicator vectors in this interval, and is then further processed in the same way as above. This average value-indicator vector equivalently represents a relative frequency vector corresponding to the elementary time interval considered, as in the sliding window methods described with reference to the first and second embodiments.
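The adaptation of the exponentially weighted method to aggregated data may be sketched in Python as follows. For each elementary time interval, the average value-indicator vector is computed as the relative frequency vector of the values aggregated in that interval and is then fed to the same recursion as before. The function names are hypothetical, and the sketch assumes that every interval contains at least one value.

```python
from collections import Counter


def average_indicator(interval_samples, values):
    """Arithmetic mean of the binary value-indicator vectors in one interval."""
    counts = Counter(interval_samples)
    n = len(interval_samples)              # assumed non-zero
    return [counts[value] / n for value in values]


def ewma_update(f, indicator, a):
    """One step of the recursion f <- a * indicator + (1 - a) * f."""
    return [a * lam + (1.0 - a) * fk for lam, fk in zip(indicator, f)]


values = ["a", "b", "c"]
f = average_indicator(["a", "a", "a"], values)          # first interval
f = ewma_update(f, average_indicator(["b"], values), 0.25)
```

Because the average value-indicator vector sums to 1, the updated vector also sums to 1 and can be passed unchanged to any of the frequency-based concentration measures.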
A fifth embodiment refers to the same definition of sliding windows as described in relation to the first embodiment 300 (
According to the fifth embodiment, instead of recomputing the chosen concentration measure for each new sliding segment, the relative frequencies fk,j and the concentration quantities Cj already computed for the preceding segment are updated, so as to save computation. As regards the data memory requirements, the fifth embodiment also requires storing all the data samples belonging to the preceding sliding window for which the concentration measure was previously computed.
As described with reference to the first embodiment 300 described above, a first sliding segment (xi)i=m
In accordance with the fifth detection method, for each 1≦k≦m, the absolute frequency Fk,j is updated into Fk,j+1 by inspecting only the two non-overlapping parts of the two segments (xi)i=m
$F_{k,j+1} = F_{k,j} + F_{k,j+1}^{\mathrm{new}} - F_{k,j}^{\mathrm{old}}$. (30)
Instead of recomputing $F_{k,j}^{\mathrm{old}}$, a previously computed and stored value $F_{k,j'}^{\mathrm{new}}$ can be used, provided that T is an integer multiple of τ. Then, the updated relative frequencies are computed as $f_{k,j+1} = F_{k,j+1}/n_{j+1}$. Similarly, $m_j^{\mathrm{eff}}$ can be updated into $m_{j+1}^{\mathrm{eff}}$. Finally, the updated concentration measures can be computed by using any of the expressions (4), (5), (8), (10), (11), or (12).
Alternatively, the quadratic concentration measures (4), (5), (8), or (10) can be directly updated. Namely, expression (10) can be updated by the following formula:
$C_{j+1} = C_j + (n_{j+1} - n_j) - (m_{j+1}^{\mathrm{eff}} - m_j^{\mathrm{eff}})$. (31)
The expressions (4) and (5) can be updated by:
which, if the numbers of samples are equal, i.e., nj+1=nj, reduces to
whereas expression (8) can be updated analogously.
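An incremental update of this kind may be sketched in Python as follows. The sketch is a sample-by-sample variant under stated assumptions: the window holds a fixed number of samples, the absolute frequencies are updated by adding the entering sample and removing the leaving one, and the sum of squared absolute frequencies is maintained alongside, so that the quadratic measure (assumed to be the sum of squared relative frequencies) and the repetition count of expression (10) are available without rescanning the window. The class and method names are hypothetical.

```python
from collections import Counter, deque


class SlidingConcentration:
    """Maintains counts F_k and sum(F_k^2) for a window of a fixed number of samples."""

    def __init__(self, window):
        self.window = window
        self.samples = deque()      # stored samples of the current window
        self.counts = Counter()     # absolute frequencies F_k
        self.sum_sq = 0             # integer sum of F_k^2, so no rounding drift

    def _change(self, value, step):
        old = self.counts[value]
        new = old + step
        self.sum_sq += new * new - old * old
        if new:
            self.counts[value] = new
        else:
            del self.counts[value]  # keep only values actually present

    def push(self, value):
        self.samples.append(value)
        self._change(value, +1)                      # "new" contribution
        if len(self.samples) > self.window:
            self._change(self.samples.popleft(), -1)  # "old" contribution

    def quadratic(self):
        n = len(self.samples)
        return self.sum_sq / (n * n)

    def repetitions(self):
        return len(self.samples) - len(self.counts)
```

Each update costs a constant amount of work per sample, at the price of storing the samples of the current window, which matches the memory requirement noted for the fifth embodiment.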
A sixth embodiment refers to sliding windows of the type described for the second embodiment (
According to this sixth embodiment, the concentration measures Ĉj for the shortened sliding segments (xi)i=m
Then, for each j=1,2,3, . . . , the concentration measure Cj+1 for the segment (xi)i=m
together with $F_{k,j+1} = \hat{F}_{k,j} + F_{k,j+1}^{\mathrm{new}}$ and $f_{k,j+1} = F_{k,j+1}/n_{j+1}$, as in this case $\hat{F}_{k,j}^{\mathrm{old}} = 0$.
It is observed that for all the above described embodiments, the underlying expectation is that the change of the concentration measure from one moving window to another may be much smaller for normal traffic than when there is a change from normal into anomalous traffic. With reference to different types of attacks and the concentration measure behavior, the following exemplary situations may occur. A first situation is a DDoS attack with randomly spoofed source IP addresses, when the concentration measure of these addresses tends to decrease considerably. A second situation is a DDoS attack that utilizes randomly spoofed source email addresses or SIP URIs. A third situation is a port scanning attack, when the concentration measure of the destination port numbers may decrease significantly. A fourth situation is a SPAM or SPIT attack using random destination email addresses or SIP URIs, respectively. A fifth situation is a massive malicious software attack targeting random destination IP addresses. A sixth situation is a message or request flooding DDoS attack, when the concentration measure of (possibly quantized) packet sizes tends to increase considerably, due to repeatedly sending essentially the same or similar packets. In particular, one may consider two-dimensional symbolic data features and the average conditional concentration measure of destination email addresses or SIP URIs in order to detect SPAM or SPIT attacks, respectively, or the average conditional concentration measure of source email addresses or SIP URIs in order to detect DDoS attacks.
The findings and teachings of the present invention show many advantages. Theoretical considerations of the Applicant have shown that the proposed various concentration measures and the corresponding variation quantities result in a powerful and general method for reliable detection of anomalous changes in network traffic data.
Moreover, it should be observed that the example of
Some prior art techniques propose the usage of certain entropy measures related to certain symbolic packet features such as the IP addresses (e.g., the Shannon entropy or the compression ratio of data compression), but fall short of providing a simple, general, and reliable method for detecting anomalous changes in these entropy measures (for example, see the above mentioned articles “Mining anomalies using traffic feature distributions”, A. Lakhina, M. Crovella, and C. Diot, and “Entropy based worm and anomaly detection in fast IP networks”, A. Wagner and B. Plattner). In this regard, the above described two overlapping sliding window techniques and the exponentially weighted moving average technique appear to be particularly advantageous.
The methods described above are mathematically relatively simple, sufficiently robust to changes inherent to normal traffic, and yet capable of detecting anomalous traffic due to attacks such as (D)DoS attacks, SPAM and SPIT attacks, and scanning attacks, as well as massive malicious software attacks. For example, the quadratic and other concentration measures together with the corresponding variation quantities according to the present invention do not require complex computations in contrast with the articles mentioned above. As such, the proposed detection method appears to be very suitable for high-speed and high-volume communications networks.
Furthermore, the proposed average conditional concentration measures for two-dimensional symbolic features offer particular advantages for detecting anomalous traffic due to various network attacks.
In general, in the following claims, the terms used should not be construed to limit the claims to the specific embodiments disclosed in the specification and the claims, but should be construed to include all possible embodiments along with the full scope of equivalents to which such claims are entitled. Accordingly, the claims are not limited by the disclosure.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/EP2007/011474 | 12/31/2007 | WO | 00 | 6/29/2010 |