This application claims priority under 35 U.S.C. §119(a) to European Patent Application No. 08425077.8, filed Feb. 11, 2008, the contents of which are hereby incorporated by reference in their entirety.
The present invention refers to a method to determine if an encrypted flow of packets belongs to a predefined class of flows.
The majority of local area networks today enforce security policies to control the traffic that crosses their boundaries. Security policies are usually implemented by combining two types of devices, firewalls and Application Level Gateways (ALGs). A very common setup is to have a firewall that allows only traffic that crosses the ALG and leaves the task of traffic control to the ALG. The ALG verifies, through a thorough classification-based analysis, that the traffic that crosses the network boundaries obeys the policies.
In recent years, however, the safety that can be guaranteed by this kind of device has been diminishing dramatically. Several factors are contributing to this trend: one example is the emergence of masquerading techniques that tunnel forbidden application protocols inside those that are allowed by the policies.
A solution for detecting when the HTTP protocol is used to tunnel other application protocols on top of it is reported in the paper by M. Crotti, M. Dusi, F. Gringoli, and L. Salgarelli titled “Detecting HTTP Tunnels with Statistical Mechanism” and published in the “Proceedings of the 42th IEEE International Conference on Communications” (ICC 2007), Glasgow, Scotland, June 2007.
Although effective for the HTTP protocol, the solution described in this paper is totally ineffective against tunneling mechanisms that use encryption, such as those that can be set up between any pair of Secure Shell (SSH) client and server peers. These tunnels can use cryptographic techniques to protect any traffic stream flowing between an SSH client, the tunnel entry point, and an SSH server, the tunnel exit point: the resulting stream is not distinguishable from normal SSH traffic by the classifiers used within ALGs.
The SSH protocol is typically used to exchange traffic between a pair of peers on a secure connection when the network is not secure.
While in the case of HTTP tunnels advanced ALG devices could analyze what is actually carried on top of the HTTP protocol, the same analysis cannot be accomplished if the tunneling protocol encrypts the exchanged information.
It follows from the above that there is an increasing need for a method that can determine whether or not an encrypted flow of packets belongs to a predefined class of flows, identifying for example non-legitimate activities such as tunneling over SSH, so as to avoid blocking flows that belong to this predefined class and to possibly block encrypted flows that do not belong to it.
The object of the present invention is to provide a method to determine if an encrypted flow of packets belongs to a predefined class of flows that meets the requirements described above and overcomes the drawbacks of the prior art.
This object is achieved by a method to determine if an encrypted flow of packets belongs to a predefined class of flows.
According to a further aspect, this object is achieved by an apparatus to determine if an encrypted flow of packets belongs to a predefined class of flows.
Thanks to the present invention, it is possible to determine with very high accuracy if an analyzed encrypted flow of packets belongs to a predefined class of flows and with the same accuracy if the encrypted flow of packets is used for tunneling activities given that the usage context characterized by the predefined class of flows is a tunneling context.
Other features and benefits offered by the method to determine the class membership of an encrypted flow of packets according to the present invention will be reported in the following together with a preferred embodiment of this invention, given by way of non-limiting example, referring to the enclosed figures.
In the remainder of this description, the concept of probability density function will be used within the context of statistical pattern recognition methodologies. To provide a framework for the presented technique, the basics of pattern recognition theory are introduced here.
Three main definitions are the basis of statistical pattern recognition: pattern, feature and class. The pattern is an r-dimensional vector of measurements x⃗ = (x1, x2, . . . , xr), whose components xi measure the features of an object, that is, the values of one or more directly quantifiable variables associated with the object. The concept of class is used in discrimination: assuming a plurality of C classes named ω1, . . . , ωC, each pattern x⃗ will be associated with a variable that denotes its class membership.
In the case of the present invention, classes are gathered from an a priori training set composed of flows belonging to one or more chosen classes. Given a set of data and a set of features, the goal of a pattern recognition technique is to represent each element as a pattern, and to assign the pattern to the class that best describes it, according to chosen criteria.
The following describes a method to determine if an encrypted flow of packets F belongs to a predefined flow class ωt.
According to one embodiment, the predefined flow class ωt identifies legitimate activities and hence represents the class of flows that must be accepted. It follows that it is possible to define a complementary class ωr that identifies illegitimate activities, such as those which involve tunneling traffic, and hence represents the class of flows that must not be accepted or, equivalently, must be blocked; all of the above requires that C=ωr∪ωt, and ωr∩ωt=Ø.
In the following, the term packet flow F represents an ordered sequence of N+1 packets Pkti composed of packets used to accomplish the authentication step and of packets used to exchange information. Here the index i varies in i=0 . . . N, where the i-th packet Pkti represents a packet generated by the client toward the server (or vice versa), and N+1 is the number of packets that compose the packet flow F. Please note that in the case of transport protocols such as the Transmission Control Protocol (TCP) that rely on signaling mechanisms, the method should not be applied to those packets that contain only signaling information.
Each packet Pkti can be represented by at least two measurable variables, i.e. the pair of measurable variables (s,Δt), so that the flow can be defined by the ordered sequence of pairs si,Δti given by the at least two measurable variables si,Δti.
According to the embodiment described in the following, si represents the length of packet Pkti and Δti is the inter-arrival time between the reception of two consecutive packets Pkti-1 and Pkti.
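The representation of a flow as an ordered sequence of pairs (si,Δti) can be illustrated with a minimal sketch. It assumes that the packets of one direction are already available as (arrival time, length) tuples with signaling-only packets removed; the function name `flow_to_pairs` and the sample values are hypothetical and serve only as illustration.

```python
from typing import List, Tuple

# (arrival time in seconds, packet length in bytes); hypothetical input layout
Packet = Tuple[float, int]
# (s_i, delta_t_i): packet length and inter-arrival time
Pair = Tuple[int, float]


def flow_to_pairs(packets: List[Packet]) -> List[Pair]:
    """Convert an ordered packet list into (s_i, delta_t_i) pairs.

    The first packet yields no pair, because an inter-arrival time
    can only be measured once the second packet has been received.
    """
    pairs: List[Pair] = []
    for previous, current in zip(packets, packets[1:]):
        delta_t = current[0] - previous[0]
        pairs.append((current[1], delta_t))
    return pairs


# Three packets of a hypothetical flow give two pairs.
example_packets = [(0.000, 48), (0.012, 64), (0.250, 1448)]
example_pairs = flow_to_pairs(example_packets)
```

A flow of N+1 packets therefore yields N pairs, one per packet position from Pkt1 onward.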
The method to determine if the encrypted flow of packets F belongs to the predefined flow class ωt comprises a first step a) of providing, for each i-th packet position in an ordered sequence of packets Pkti, a probability density function p(xi|ωt) of the values of the two measurable variables in a plurality of encrypted flows of packets Fj belonging to the predefined flow class ωt, where xi=(si,Δti).
Class ωt can represent flows generated by legitimate activities. It should be noted that the flow class ωt may be associated with legitimate activities that are to be distinguished from illegitimate activities, which can be associated with the complementary class ωr.
In particular, step a) comprises the following steps:
According to one embodiment, step a3) comprises, for each i-th packet position, the steps of:
The adopted kernel k has to be chosen to optimize the results that can be obtained by the subsequent classification of the packet flow F. Examples of functions that can be used as kernels are the Gaussian function and the Hyperbolic Secant function respectively defined as:
It should be noted also that the normalization of the probability density functions pf(xi|ωt) obtained by applying the kernel k has to be performed keeping unaltered the range of values which the observable variables may assume.
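As an illustration of kernel smoothing followed by normalization, the sketch below applies a discrete two-dimensional Gaussian kernel to a grid of counts over the (s,Δt) domain. It is a sketch under stated assumptions, not the kernel definition of the embodiment: the grid layout, the `radius` and `sigma` parameters and the function names are hypothetical. Kernel mass that would fall outside the grid is discarded before renormalizing, so the range of values covered by the grid stays unaltered.

```python
import math
from typing import List

Grid = List[List[float]]


def gaussian_kernel(radius: int, sigma: float) -> Grid:
    """Discrete 2-D Gaussian kernel of size (2*radius+1)^2, summing to 1."""
    kernel = [
        [math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
         for dx in range(-radius, radius + 1)]
        for dy in range(-radius, radius + 1)
    ]
    total = sum(map(sum, kernel))
    return [[value / total for value in row] for row in kernel]


def smooth_and_normalize(hist: Grid, kernel: Grid) -> Grid:
    """Spread each cell count over its neighbours, then rescale to sum 1.

    The output grid has the same shape as the input grid, so the
    range of observable values is unchanged.
    """
    rows, cols = len(hist), len(hist[0])
    radius = len(kernel) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if hist[r][c] == 0:
                continue
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        out[rr][cc] += hist[r][c] * kernel[dr + radius][dc + radius]
    total = sum(map(sum, out))
    if total == 0:
        return out
    return [[value / total for value in row] for row in out]


# Hypothetical 5x5 grid with four observations in the central cell.
counts = [[0.0] * 5 for _ in range(5)]
counts[2][2] = 4.0
smoothed = smooth_and_normalize(counts, gaussian_kernel(radius=1, sigma=1.0))
```

A Hyperbolic Secant kernel could be substituted by changing only the expression inside `gaussian_kernel`.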
Afterwards, the method involves the following steps:
It should be noted that steps b) and c) must be applied to a predefined number of consecutive packets, all following a predetermined packet n0 chosen in such a way that steps b) and c) are applied exclusively to those packets of the encrypted flow of packets that carry data, hence excluding both signaling packets and authentication packets.
According to one embodiment of this invention, when the reference value S(x⃗|ωt) is lower than the threshold value T, the encrypted flow of packets F is determined as belonging to the predefined flow class ωt; if, instead, the reference value S(x⃗|ωt) is greater than the threshold value T, the encrypted flow of packets F is determined to belong to the complementary class ωr.
The method described in the present invention can be especially deployed to discover illegitimate activities, when the predefined flow class ωt describes legitimate flows and the complementary class ωr characterizes illegitimate flows.
In this case, the method makes it possible to establish if flow F can be assigned to the predefined flow class ωt. Since the predefined flow class ωt defines an acceptance region that is complementary to the rejection region defined by the complementary flow class ωr, every flow F that does not belong to the predefined flow class ωt is classified as a flow that belongs to the complementary flow class ωr, which defines the rejection region.
In the remainder of the present description, we will refer, in a non-limiting way, to a flow F over the SSH protocol, defined by a sequence of N pairs Pi=(si,Δti) of the two measurable variables s,Δt, with 1≤i≤N, where si represents the size of packet Pkti and Δti represents the time interval between receiving two consecutive packets Pkti-1 and Pkti.
A flow is hence represented by a pair of patterns x⃗, one for each of the two directions. In the following example,
where r is the number of packets composing the flow excluding the first, as the first value Δt can be measured only after having analyzed the second packet Pkt1.
To determine if an SSH flow is carrying tunneling activity, it should be determined if the flow F under analysis has been generated by illegitimate activities, such as tunneling ones, or if the SSH protocol is being used for legitimate purposes, such as remote access or secure file copy. In this case, the method supplies the characterization of only one class, in the example the class of legitimate SSH flows corresponding to the predefined flow class ωt.
The generation of the probability density functions p(xi|ωt) is based on the analysis of a plurality of Z probability density functions generated by Z SSH flows that do not carry tunnel activities, that is Z SSH flows that belong to class ωt.
The i-th function p(xi|ωt) is generated from the i-th pairs si,Δti belonging to the flows having at least i+1 packets. Hence each function p(xi|ωt) describes the behavior of the i-th packets in the domain of variables s,Δt.
The value of a pair si,Δti of an i-th packet of an SSH flow F applied to the probability density function p(xi|ωt) gives the correlation between the unknown flow F and the probability density function. The larger the value of the probability density function at the point si,Δti, the greater the probability that the unknown flow F has been generated by the SSH protocol.
To take into account the noise that the network can introduce on the measured variables, a kernel, e.g. the Gaussian kernel mentioned above, is applied to the Z functions pf(xi|ωt), and these functions are then normalized to obtain the corresponding Z functions Mt(si,Δti).
In practice, given a plurality Z of flows Fj generated by the SSH protocol, that means flows that belong to class ωt, these flows are converted into their equivalent pattern representation.
According to one embodiment of the present invention, since the samples are available, the histogram method can be used. This method partitions the r-dimensional space of class ωt into a number of equally-sized cells and estimates the probability density p(x⃗) at a point x⃗ as follows:
where
To reduce the complexity of the resulting matrix, one can suppose that consecutive pairs s,Δt are independent; the complexity is hence reduced to rN cells.
In this case, the probability density functions are defined as follows:
where p(xk|ωt) is the one-dimensional probability density function of the component xk of x⃗ with respect to class ωt.
The length of the pattern can then be limited to L elements: it follows that each SSH flow contributes to the estimation of a number of probability density functions given by min(r,L), where r is the length of the sequence composed of all the pairs s,Δt.
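The histogram estimate under the independence assumption, with the pattern length limited to L elements, can be sketched as follows. This is a minimal illustration and not the estimator of the embodiment: the bin widths `size_bin` and `dt_bin`, the function names and the training values are hypothetical. Each cell value is a relative frequency; dividing by the cell volume, which is constant for equally-sized cells, is omitted.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[int, float]   # (s_i, delta_t_i)
Cell = Tuple[int, int]     # (size bin index, time bin index)


def cell_of(pair: Pair, size_bin: int = 64, dt_bin: float = 0.01) -> Cell:
    """Map a pair to its histogram cell (hypothetical bin widths)."""
    size, delta_t = pair
    return (int(size // size_bin), int(delta_t // dt_bin))


def estimate_pdfs(flows: List[List[Pair]], max_len: int) -> List[Dict[Cell, float]]:
    """Estimate one histogram per packet position from training flows.

    A flow with r pairs contributes to min(r, max_len) positions.
    """
    counts = [Counter() for _ in range(max_len)]
    for flow in flows:
        for position, pair in enumerate(flow[:max_len]):
            counts[position][cell_of(pair)] += 1
    pdfs: List[Dict[Cell, float]] = []
    for counter in counts:
        total = sum(counter.values())
        pdfs.append({cell: n / total for cell, n in counter.items()} if total else {})
    return pdfs


# Three hypothetical training flows of class omega_t, limited to L = 2 positions.
training_flows = [
    [(64, 0.012), (1448, 0.255)],
    [(70, 0.015)],
    [(200, 0.012), (1448, 0.255)],
]
position_pdfs = estimate_pdfs(training_flows, max_len=2)
```

Keeping one small histogram per position, rather than a single histogram over the full r-dimensional space, is what the independence assumption buys in terms of complexity.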
With the availability of the probability density functions p(xi|ωt) of the values of the two measurable variables s,Δt for each packet position i, the values si,Δti of the two measurable variables for each packet of the encrypted flow of packets F can be measured.
As noted above, the measurement starts from a predetermined packet n0.
It should be taken into account, in fact, that an SSH flow is made of packets that carry authentication information and packets that carry application data: it follows that the measurement of the values of the measurable variables and their application to the probability density functions must not be performed on authentication packets but only on data packets. Similarly, all packets that carry only transport layer signaling should be excluded from the measurement.
To this end, as anticipated above, steps b) and c) of the method are applied to a predefined number of consecutive packets that follow the predetermined packet n0. Packet n0 is chosen to avoid the application of the method to the authentication packets of the flow F.
Afterwards the measured values si,Δti are applied to the probability density function p(xi|ωt) of the respective packet position i to generate a sequence of values of probability density, and this sequence of values of probability density is processed to generate the reference value S(x⃗|ωt).
In particular, the reference value S(x⃗|ωt) is calculated using the following equation:
where p(xi|ωt) represents the probability that the i-th element of flow F belongs to class ωt, that is the value given by Mt(si,Δti). It should be noted that, if the value of function p(xi|ωt) is equal to 0, this null value is replaced with the smallest number that can be represented with the chosen arithmetic precision: e.g. the value 0 can be replaced by the value 10⁻³⁰⁰, so that the base-10 logarithm of this small value is equal to −300.
Furthermore, taking into account the fact that the method is applied starting from packet n0, the reference value S(x⃗|ωt) is obtained as follows:
where n0 is the index of the first pair considered si,Δti and represents the first packet that carries data in each SSH session.
In practice, the presence of tunneling activities over SSH is detected when the reference value S(x⃗|ωt) is greater than the threshold value T. In this case, flow F is not assigned to the predefined flow class ωt, which defines the acceptance region; it is instead assigned to the complementary class ωr, which defines the rejection region.
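The computation of the reference value and the comparison with the threshold can be sketched as follows. The exact form of the score is an assumption made here for illustration: the negated mean of the base-10 logarithms of the per-position density values, which is consistent with the text in that lower values indicate membership of class ωt. The function name `reference_value`, the zero-based indexing of n0, the constant density functions and the threshold value are all hypothetical.

```python
import math
from typing import Callable, List, Tuple

Pair = Tuple[int, float]            # (s_i, delta_t_i)
Density = Callable[[Pair], float]   # p(x_i | omega_t) for one packet position

FLOOR = 1e-300  # replaces a zero density so that log10 is defined


def reference_value(pairs: List[Pair], pdfs: List[Density],
                    n0: int, count: int) -> float:
    """Assumed score: negated mean log10 density over `count` pairs from index n0.

    n0 is zero-based here; positions before n0 (e.g. authentication
    packets) are skipped.
    """
    window = list(zip(pairs, pdfs))[n0:n0 + count]
    if not window:
        raise ValueError("no data packets available from index n0")
    total = 0.0
    for pair, pdf in window:
        p = pdf(pair)
        total += math.log10(p if p > 0.0 else FLOOR)
    return -total / len(window)


# Hypothetical per-position densities: constants, for illustration only.
example_pdfs: List[Density] = [
    lambda pair: 0.1,
    lambda pair: 0.1,
    lambda pair: 0.0,   # unseen value: density zero, floored to 1e-300
    lambda pair: 0.01,
]
example_flow: List[Pair] = [(64, 0.012), (80, 0.020), (1448, 0.255), (96, 0.030)]

score = reference_value(example_flow, example_pdfs, n0=1, count=2)
threshold = 5.0                      # hypothetical threshold T
is_tunnel = score > threshold        # greater than T: assigned to omega_r
```

Averaging over the window, rather than summing, keeps scores comparable between flows of different length; this too is a design choice of the sketch.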
To compute the optimal values of the threshold T and the number n0 of the first packet to consider, the method described above is applied to encrypted flows of packets F that: use the SSH protocol; do not carry tunneling activities; and have not been used to generate the probability density functions.
In particular, a maximum percentage of false negatives, i.e. legitimate (non-tunneling) flows that are erroneously assigned to class ωr, can be defined, e.g. 1%: this means that at least 99% of legitimate flows are correctly classified.
According to one embodiment of the present invention, the set of Z encrypted flows belonging to the predefined flow class ωt can be divided into two subsets of flows; the first subset is used to generate the probability density functions p(xi|ωt) and the second subset is used to calculate the value of the threshold T. The value of T is chosen to correctly classify the flows in the second subset as belonging to the predefined flow class ωt with a probability, for example, of 99%.
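Choosing T from the second subset can be sketched as picking the smallest threshold for which the desired fraction of held-out legitimate flows scores strictly below T. This is one possible selection rule, assumed here for illustration; the function name `choose_threshold` and the sample scores are hypothetical, and `math.nextafter` requires Python 3.9 or later.

```python
import math
from typing import List


def choose_threshold(scores: List[float], accept_fraction: float = 0.99) -> float:
    """Return a threshold T such that at least `accept_fraction` of the
    given scores are strictly lower than T.

    `scores` are reference values of held-out flows of class omega_t
    that were not used to build the probability density functions.
    """
    if not scores:
        raise ValueError("at least one held-out score is required")
    ordered = sorted(scores)
    k = math.ceil(accept_fraction * len(ordered))
    k = max(1, min(k, len(ordered)))
    # Smallest float above the k-th lowest score, so that score < T holds for it.
    return math.nextafter(ordered[k - 1], math.inf)


# One hundred hypothetical held-out scores: 1.0, 2.0, ..., 100.0.
held_out_scores = [float(v) for v in range(1, 101)]
T = choose_threshold(held_out_scores, accept_fraction=0.99)
accepted = sum(1 for s in held_out_scores if s < T)
```

The same procedure can be repeated for several candidate values of n0, keeping the combination that best separates the held-out flows.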
It should be noted that the probability density functions p(xi|ωt) are obtained from flows that belong to only one class, that is the predefined flow class ωt that represents legitimate activities. According to one embodiment of the present invention, the predefined flow class ωt can comprise a plurality of flow classes ωi that represent both legitimate and illegitimate activities. This makes it possible to reduce the uncertainty on the knowledge of the complementary class ωr that defines the rejection region.
In this case, according to the method, the probability density functions p(xi|ωt) are generated for each class ωi belonging to class ωt. The reference value S(x⃗|ωt) is computed using the following equation:
where index i corresponds to the predefined classes that define the acceptance class ωt. The comparison step e) will hence establish if the encrypted flow of packets F conforms to at least one of the known classes ωi that define the flow class ωt or to none of them; in this last case the encrypted flow of packets F is classified as belonging to the complementary class ωr and hence it should be rejected.
To carry out the comparison the class ωm that is used is the one that verifies the following condition:
and flow F is assigned to this class ωm if the following condition holds:
ωm==ωt
S(x⃗|ωm)<T
where T is the threshold computed to correctly classify a predefined percentage of the flows belonging to flow class ωt, for example 99% of them. This threshold T is computed as described above.
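The comparison against several known classes ωi can be sketched as follows. The selection rule for ωm is an assumption made for illustration: the class with the lowest reference value is taken as the candidate, and the flow is accepted only if that value is below T. The function name `classify` and the class labels are hypothetical.

```python
from typing import Dict, Optional


def classify(scores_by_class: Dict[str, float], threshold: float) -> Optional[str]:
    """Return the best-matching known class, or None for the rejection region.

    `scores_by_class` maps each known class omega_i to the reference
    value of the flow under that class. The candidate omega_m is
    assumed to be the class with the lowest score.
    """
    if not scores_by_class:
        raise ValueError("at least one known class is required")
    best = min(scores_by_class, key=scores_by_class.get)
    return best if scores_by_class[best] < threshold else None


# Hypothetical scores of two flows against two known classes.
accepted_class = classify({"remote_access": 2.1, "file_copy": 3.4}, threshold=2.5)
rejected_class = classify({"remote_access": 4.0, "file_copy": 3.4}, threshold=2.5)
```

A return value of `None` corresponds to assigning the flow to the complementary class ωr.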
The present invention also relates to a computer program product loadable in the memory of a numerical processing device, comprising portions of program code which can implement the method described above when run on this processing device. The computer program for implementing the method may be stored on a computer-readable medium.
According to a further aspect, the present invention relates to an apparatus 10 for determining if the encrypted flow of packets F belongs to a predefined flow class ωt.
With reference to
The apparatus 10 comprises the measurement means 12 able to receive as input the encrypted flow of packets F to measure the values si,Δti of the pairs of measurable variables, for each packet Pkti of a plurality of packets of the encrypted flow of packets F.
The apparatus 10 comprises also processing means 13 coupled with the storage means 11 and the measurement means 12 and able to:
apply the measured i-th value pair si,Δti as arguments to the probability density function p(xi|ωt) that corresponds to the i-th packet; generate a sequence of values of the probability density functions p(x⃗|ωt), one for each packet Pkti; and process the sequence of values of the probability density functions p(x⃗|ωt) to compute a reference value S(x⃗|ωt).
In particular, the measurement means 12 and the processing means 13 are able to measure and process the data packets of the encrypted flow of packets F.
The apparatus 10 also comprises comparison means 14 coupled with the processing means 13 and adapted to compare the reference value S(x⃗|ωt) with the threshold value T to determine if the encrypted flow of packets F belongs to the predefined flow class ωt.
As can be seen from the above, the method and the apparatus of the present invention satisfy the requirements and overcome the drawbacks described at the beginning of this description with reference to the prior art.
In particular, the method of the present invention makes it possible to determine with very high accuracy whether the analyzed flow is not carrying tunneling activities and, with the same very high accuracy, whether the analyzed flow is carrying tunneling activities, given that the flow is not classified as a non-tunneling flow.
Obviously, a skilled technician, with the intent of satisfying any more specific requirement, could apply a number of adjustments and revisions to the method presented in this invention, which at any rate would be contained within the scope of protection defined by the following claims.
Number | Date | Country | Kind |
---|---|---|---|
08425077 | Feb 2008 | EP | regional |

Number | Name | Date | Kind |
---|---|---|---|
7448084 | Apap et al. | Nov 2008 | B1 |
20050249125 | Yoon et al. | Nov 2005 | A1 |

Number | Date | Country |
---|---|---|
20090207740 A1 | Aug 2009 | US |