The present invention relates to a method and a system for detecting a single data flow in an aggregate packet data flow and for identifying the application generating the single data flow.
In the prior art, there is known the problem of detecting a single data flow in a packet data flow and of identifying the application generating that flow, for example identifying a single voice flow, and the application that generated it, in an aggregate traffic or flow on an IP network.
In particular, such a problem is known with reference to VoIP telephony in which a voice communication is set up over an IP network between two users using unknown and encrypted protocols. A typical example of software that generates voice data flow over an IP network is Skype.
The protocols and algorithms that Skype, and most voice programs, use to generate voice data flow over an IP network are undisclosed, and the content they carry is typically encrypted.
For this reason it is very difficult to detect the presence of a single data flow generated by a particular application, such as for example Skype, in an aggregate data flow comprising flows generated by various types of applications, whether voice, data transport, video communications, etc.
From the above, there emerges the need to detect the presence of a single data flow in an aggregate packet data flow, and to identify the application generating it, without knowledge of the protocols and algorithms that the application uses to generate the single data flow and to insert it into the aggregate packet data flow.
In view of the prior art described, the aim of the present invention is to implement a method and a system for detecting a single data flow in an aggregate packet data flow and identifying the application generating the single data flow, capable of overcoming the drawbacks present in the prior art.
According to the present invention, such an aim is achieved by a method for detecting a single data flow in an aggregate packet data flow and identifying the application generating the single data flow, according to claim 1.
By virtue of the present invention, it is possible to obtain a method for detecting a single data flow in an aggregate packet data flow and identifying the application generating the single data flow over an IP network using a simple technique.
According to a further aspect of the present invention, such an aim is achieved by a system for detecting a single data flow in an aggregate packet data flow and identifying the application generating the single data flow, according to claim 6.
Other features and advantages of the method and system for detecting a single data flow in an aggregate packet data flow and identifying the application generating the single data flow, according to the present invention, will become clear from the following description of a preferred example embodiment, given by way of indication and in a non-limiting manner, with reference to the appended drawings, in which:
Hereafter in the present description statistical functions for measuring the frequency deviation will be used, in particular the Pearson chi-square function. The Pearson chi-square statistical function is illustrated below.
The Pearson chi-square function provides for checking whether the behaviour of an object, observed for a finite number of times, follows an expected behaviour.
This is carried out by calculating the deviation of the measured values of the object with respect to the expected distribution of values of the object.
It is assumed for example that an object is observed for a number of times NTOT and that the object under observation can take N possible outputs or values for each observation.
If the expected distribution of values is such that the value i, where i = 1, ..., N, recurs with a probability $p_i$, then the expected number of events, or frequency, of i is given by the relationship $E_i = N_{TOT}\,p_i$. With $O_i$ representing the number of events, or frequency, of i actually observed during the observation, the value
$$\chi^2 = \sum_{i=1}^{N} \frac{(O_i - E_i)^2}{E_i}$$
represents a measurement of the deviation of the observed behaviour with respect to the expected behaviour, i.e. of the observed frequency with respect to the expected frequency.
If the observed object behaves as expected, then the value of χ2 is distributed according to a chi-square distribution with N−1 degrees of freedom.
The chi-square function can be used even for a single observation. In particular, it is assumed that the value of the observed object is distributed with probabilities pi.
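As an illustration of the chi-square function described above, the following minimal Python sketch computes the deviation for a list of observed frequencies. The function name pearson_chi_square and the six-valued sample data are hypothetical and are used only for illustration.

```python
def pearson_chi_square(observed, probabilities):
    """Pearson chi-square of observed counts O_i against expected
    probabilities p_i, with E_i = N_TOT * p_i."""
    if len(observed) != len(probabilities):
        raise ValueError("one probability is needed per possible value")
    n_tot = sum(observed)  # N_TOT: total number of observations
    chi2 = 0.0
    for o_i, p_i in zip(observed, probabilities):
        e_i = n_tot * p_i  # expected frequency of value i
        chi2 += (o_i - e_i) ** 2 / e_i
    return chi2


# Hypothetical object with N = 6 equally likely values, observed 60 times.
counts = [8, 12, 9, 11, 10, 10]
print(pearson_chi_square(counts, [1 / 6] * 6))
```

A small result indicates that the observed frequencies are close to the expected ones; the result is to be read against a chi-square distribution with N−1 degrees of freedom.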
In the case in point of an aggregate packet data flow, the packet data flow is generated by a specific generating application and is divided into messages, each message comprising a plurality of blocks g.
Each block g of the plurality of blocks has n bits for identifying $2^n$ block values i, for example i = 0, 1, 2, ..., $2^n - 1$.
With reference to the appended drawings, the method for detecting a single data flow in an aggregate packet data flow and identifying the application generating the single data flow comprises the steps of:
a) providing, for each block value i, an expected frequency value Ei,
b) measuring, for a predefined number G of blocks g of the plurality of blocks, i.e. for Gn bits, the values Oig of frequency with which each block g assumes each block value i so as to obtain a plurality of measured frequency values Oig,
c) processing, for each block g, the measured frequency values Oig and the expected frequency values Ei in order to generate a frequency deviation value χg2 representative of the deviation of the plurality of measured frequency values Oig with respect to the expected frequency values Ei,
d) processing the frequency deviation values χg2 generated for each block g with at least one frequency deviation threshold value χth in order to detect the presence of a single data flow in said aggregate packet data flow and identify the application generating the single data flow.
The single data flow can be both a voice flow and a peer-to-peer (P2P) flow.
In particular, as will be described in detail below, step d) enables the source generating the single data flow, i.e. the application used to generate the detected single data flow, to be determined.
According to one embodiment, step d) comprises the steps of:
d1) processing the frequency deviation values χg2 generated for each block g in order to generate at least one reference frequency deviation value χref for said predefined number of blocks G, and
d2) comparing these generated reference frequency deviation values χref with the frequency deviation threshold value χth in order to determine the source generating the single data flow.
According to one embodiment, step c) comprises the step of applying the plurality of measured frequency values Oig and the expected frequency values Ei to a function of statistical measurement of the frequency deviation.
In particular, the function of statistical measurement of the frequency deviation can be chosen from one of the functions of entropy, mean, variance, chi-square and similar.
In this case, the chi-square function is chosen, expressed by the following formula:
$$\chi_g^2 = \sum_{i=0}^{2^n-1} \frac{(O_i^g - E_i)^2}{E_i}$$
where
$\chi_g^2$ corresponds to the frequency deviation value χg2,
$O_i^g$ corresponds to the plurality of measured frequency values Oig, and
$E_i$ corresponds to the expected frequency values Ei.
The expected frequency values Ei can be obtained as a function of the application which is desired to be identified, or, in the absence of such information a priori, can be distributed uniformly.
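Steps a) to c) can be sketched in Python as follows. This is only a sketch under stated assumptions: the function names block_values and frequency_deviations are hypothetical, blocks are assumed to be taken from the start of each message with the most significant bits first, and a uniform expected distribution is used when none is supplied.

```python
def block_values(message: bytes, n_bits: int, num_blocks: int):
    """Return the first num_blocks blocks of n_bits each (assumption:
    blocks start at the first byte, most significant bits first)."""
    total_bits = n_bits * num_blocks
    n_bytes = (total_bits + 7) // 8
    if len(message) < n_bytes:
        raise ValueError("message shorter than G*n bits")
    as_int = int.from_bytes(message[:n_bytes], "big")
    as_int >>= n_bytes * 8 - total_bits  # drop unused trailing bits
    mask = (1 << n_bits) - 1
    return [(as_int >> (n_bits * (num_blocks - 1 - g))) & mask
            for g in range(num_blocks)]


def frequency_deviations(messages, n_bits=4, num_blocks=16, expected=None):
    """Return one chi-square deviation per block g over all messages."""
    n_values = 1 << n_bits
    # counts[g][i] plays the role of the measured frequency O_i^g
    counts = [[0] * n_values for _ in range(num_blocks)]
    n_tot = 0
    for msg in messages:
        for g, value in enumerate(block_values(msg, n_bits, num_blocks)):
            counts[g][value] += 1
        n_tot += 1
    if n_tot == 0:
        raise ValueError("at least one message is required")
    if expected is None:
        # No a priori knowledge: uniform expected frequencies E_i
        expected = [n_tot / n_values] * n_values
    return [sum((o - e) ** 2 / e for o, e in zip(counts[g], expected))
            for g in range(num_blocks)]


# Hypothetical flow of 32 identical two-byte messages, 4 blocks of 4 bits.
flow = [bytes([0xAB, 0xCD])] * 32
print(frequency_deviations(flow, n_bits=4, num_blocks=4))
```

Because every message in this hypothetical flow is identical, each block is deterministic and yields a large deviation from the uniform expectation.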
With reference to the appended drawings, there is described hereafter the application of the method according to the invention for detecting a single data flow generated by a Voice over IP application, Skype, in an aggregate packet data flow and identifying such an application generating the single data flow.
Since Skype is a closed and proprietary program which uses encryption algorithms, it is not possible to identify a data flow generated by Skype using conventional techniques for analyzing the contents of packets.
However, there is an important difference in the messages introduced into a network, depending on the underlying transport protocol used.
For example, the TCP protocol implements a connection-oriented transmission protocol and therefore guarantees that all the segments of data are received in the same sequence as when they are introduced into the network, possibly with a delay.
However, the connectionless service provided by the UDP protocol guarantees neither the delivery of all the data nor that the data are delivered in the same sequence in which they were introduced.
Consequently, a Skype encoder cannot encrypt the whole message but must allow the Skype receiver to extract from the application layer header some additional information for detecting and managing any messages that are lost or delivered out of sequence to the receiver.
This information cannot be protected by encryption but can only be obscured in such a way that it is easily identified upon reception. This portion of the message is called the Start of Message (SoM).
For example, when a message is transported over the TCP protocol, the entire content of the Skype message is encrypted and therefore the bytes of the message take apparently random values. On the other hand, in the case of transport over UDP, only a part of the message is distributed randomly while other parts, for example the SoM, exhibit statistical properties typical of deterministic data.
The method described above provides for differentiating therefore the single data flow generated by Skype applications from data flows generated by other applications for generating a data or voice flow over IP, since such applications use different header formats resulting in different distributions of the bytes of the messages.
It is therefore necessary to check whether the frequency deviation values χg2 satisfy the expected assumption. To this end, the content features of the message are used, which are summarized in the table below for messages of type End-to-End (E2E) over UDP, End-to-Out (E2O) over UDP and End-to-End or End-to-Out over TCP. End-to-End represents traffic generated between two host terminals, each of which uses a Skype client, while End-to-Out represents traffic generated between a host terminal and a conventional PSTN terminal.
For example, the E2E over UDP flow has bytes 1, 2 and 4 encrypted, i.e. random, while byte 3 contains some random bits and some constant bits (mixed in the table), and the start of message bytes of the E2O over UDP flow take deterministic values.
To determine whether a block has a random, deterministic or mixed distribution, the distribution of uniformly distributed bits is considered to be the expected distribution. In that case the expected frequency value $E_i$ is equal to $N_{TOT}/2^n$ for all the block values i, where NTOT is the number of messages analyzed belonging to the flow.
The generated frequency deviation values χg2 are therefore compared with one or more thresholds derived from the chi-square distribution with $2^n - 1$ degrees of freedom. These thresholds are indicated by χRnd2, χMix2 and χDet2 for random, mixed and deterministic blocks respectively.
The values of G, the predefined number of blocks, and of n, the number of bits per block, can be fixed, for example at n=4 bits and G=16. In that case, this gives the reference chi-square distribution having $2^n - 1 = 15$ degrees of freedom and $E_i = N_{TOT}/16$ for all the block values i=0, ..., 15.
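The description does not state how the thresholds are derived from the chi-square distribution. One plausible reading, assumed here, is that a threshold is chosen so that a truly random block exceeds it only with a small probability. The following Python sketch evaluates that tail probability for an odd number of degrees of freedom (such as 15) using the standard closed-form series; the function name chi2_sf_odd is hypothetical.

```python
import math


def chi2_sf_odd(x: float, dof: int) -> float:
    """P(chi-square > x) for an odd number of degrees of freedom,
    using the finite series 2*Q(s) + 2*phi(s)*sum(s^(2r-1)/(1*3*...*(2r-1)))
    with s = sqrt(x), Q the normal upper tail and phi the normal density."""
    if dof < 1 or dof % 2 == 0:
        raise ValueError("this sketch only handles odd degrees of freedom")
    if x <= 0:
        return 1.0
    s = math.sqrt(x)
    normal_tail = math.erfc(s / math.sqrt(2.0))        # equals 2*Q(s)
    phi = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)
    term = s                                           # r = 1 term: s / 1
    series = 0.0
    for r in range(1, (dof - 1) // 2 + 1):
        series += term
        term *= x / (2 * r + 1)                        # next odd factor
    return normal_tail + 2.0 * phi * series


# Probability that a random block (15 degrees of freedom) exceeds
# a few candidate thresholds (the candidates are illustrative only).
for threshold in (25.0, 37.7, 150.0):
    print(threshold, chi2_sf_odd(threshold, 15))
```

Under this reading, a larger threshold lowers the chance of misjudging a random block as non-random, at the cost of requiring a larger deviation before a block is judged mixed or deterministic.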
The generated frequency deviation values χg2 and the frequency deviation threshold values χRnd2, χMix2 and χDet2 are compared for example as follows:
$$\max_{g \in G'} \chi_g^2 < \chi_{Rnd}^2 \quad \text{and} \quad \min_{g \in \{5,6\}} \chi_g^2 > \chi_{Mix}^2$$
where:
$G' = \{g \mid 1 \le g \le G,\ g \ne 5, 6\}$ are the blocks g corresponding to the random part of the E2E message,
$\max_{g \in G'} \chi_g^2$ is a first generated reference frequency deviation value,
$\min_{g \in \{5,6\}} \chi_g^2$ is a second generated reference frequency deviation value, and
χRnd2 and χMix2 are two frequency deviation threshold values.
In essence, it is expected that the blocks g with random distribution have uniform distribution and therefore the generated frequency deviation values χg2 must be relatively low and therefore less than the frequency deviation threshold value χRnd2, and that the blocks g with mixed distribution containing some deterministic blocks have high generated frequency deviation values χg2 and therefore greater than the frequency deviation threshold value χMix2.
In the case of E2O over UDP, it is expected that the start of message SoM, i.e. the first 4 bytes, i.e. g=8 blocks of n=4 bits, is deterministic and that the remaining part is random, since the remainder of the message is encrypted.
In the case of E2E or E2O over TCP, it is expected that all the blocks of bits have random distributions.
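The three expectations described above (E2E over UDP, E2O over UDP, and transport over TCP) can be combined into a decision rule, sketched below in Python. Several points are assumptions made for this sketch: the function name classify_flow, the use of the maximum and minimum over a group of blocks as reference values, the order in which the cases are tested, and the default threshold of 150, which is the example value given in this description.

```python
def classify_flow(chi2, thr_rnd=150.0, thr_mix=150.0, thr_det=150.0):
    """Classify a flow from its per-block deviations.

    chi2[0] holds the deviation of block g=1, chi2[1] of block g=2, etc.
    Blocks are assumed to be n=4 bits wide, so byte 3 is blocks g=5,6
    and the 4-byte start of message (SoM) is blocks g=1..8.
    """
    if len(chi2) < 9:
        raise ValueError("at least 9 blocks are needed for this sketch")
    mixed = (4, 5)      # 0-based indices of blocks g=5,6
    som = range(8)      # 0-based indices of blocks g=1..8
    random_part = [x for idx, x in enumerate(chi2) if idx not in mixed]

    # E2E over UDP: random blocks low, mixed blocks high (assumed max/min)
    if max(random_part) < thr_rnd and min(chi2[i] for i in mixed) > thr_mix:
        return "E2E over UDP"
    # E2O over UDP: SoM blocks deterministic, remaining blocks random
    if min(chi2[i] for i in som) > thr_det and max(chi2[8:]) < thr_rnd:
        return "E2O over UDP"
    # TCP: every block looks random
    if max(chi2) < thr_rnd:
        return "E2E or E2O over TCP"
    return "unknown"


# Hypothetical deviations for G=16 blocks: only blocks g=5,6 are high.
example = [12.0] * 16
example[4] = example[5] = 900.0
print(classify_flow(example))
```

A flow that matches none of the three patterns is reported as "unknown" in this sketch, which corresponds to traffic not attributed to the application being sought.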
Advantageously, the number of messages belonging to the flow NTOT is large. For example, the number NTOT is such that the expected frequency value Ei ≥ 5 for all the block values i. In the example stated here, this amounts to saying that
$$\frac{N_{TOT}}{2^n} \ge 5$$
i.e. NTOT ≥ 80 with n=4 bits.
It is also worthwhile noting that the difference between the generated frequency deviation values χg2 for a deterministic or random block g increases as a function of the value of the number of messages belonging to the flow NTOT.
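For a deterministic block g, a single block value is observed NTOT times and the other $2^n - 1$ block values are never observed, so that, with $E_i = N_{TOT}/2^n$:
$$\chi_g^2 = \frac{(N_{TOT} - E_i)^2}{E_i} + (2^n - 1)\,E_i = N_{TOT}\,(2^n - 1)$$
Therefore χg2 increases substantially linearly with NTOT: the greater the length of the flow, the greater NTOT and the greater the expectation that a deterministic block g exceeds the reference threshold value χDet2.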
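In the case of a mixed block g, if one bit is fixed and the others have random distributions, Oi=0 for half of the possible block values i, and Oi>0 for the remaining block values i. Since the possible values of i are $2^n$, the generated frequency deviation value χg2 is:
$$\chi_g^2 = \sum_{i:\,O_i = 0} E_i + \sum_{i:\,O_i > 0} \frac{(O_i - E_i)^2}{E_i} = N_{TOT} + 2\,\tilde{\chi}^2$$
where $\tilde{\chi}^2$ is the chi-square value of the remaining random bits, computed over the $2^{n-1}$ block values that can occur with respect to their expected frequency $2E_i$.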
This means that in the case of a block g with a deterministic bit, χg2 still increases linearly with NTOT.
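The growth of the deviation with NTOT can be checked numerically. The Python sketch below is illustrative only: the function name chi_square_uniform, the fixed seed and the particular block values are assumptions, and the blocks are synthetic rather than taken from real traffic. It compares a deterministic block, a block with one fixed bit, and a fully random block, for n=4 bits and a uniform expected distribution.

```python
import random


def chi_square_uniform(values, n_bits):
    """Chi-square of observed n-bit block values against a uniform
    expected distribution (E_i = N_TOT / 2^n)."""
    n_values = 1 << n_bits
    expected = len(values) / n_values
    counts = [0] * n_values
    for v in values:
        counts[v] += 1
    return sum((o - expected) ** 2 / expected for o in counts)


rng = random.Random(0)  # fixed seed so that runs are repeatable
n_bits = 4
for n_tot in (80, 800, 8000):
    deterministic = [0x7] * n_tot                          # same value always
    mixed = [0x8 | rng.getrandbits(3) for _ in range(n_tot)]  # top bit fixed
    fully_random = [rng.getrandbits(4) for _ in range(n_tot)]
    print(n_tot,
          chi_square_uniform(deterministic, n_bits),
          chi_square_uniform(mixed, n_bits),
          chi_square_uniform(fully_random, n_bits))
```

In such a run the deterministic and mixed deviations scale with NTOT, whereas the deviation of the fully random block is expected to stay of the order of the number of degrees of freedom, which is what allows a single threshold to separate the cases for sufficiently long flows.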
In the example, in order to reduce the number of parameters, one can set χRnd2=χMix2=χDet2=150.
The present invention also relates to a system for detecting a single data flow in an aggregate packet data flow and identifying the application generating the single data flow. The system comprises storage means for storing, for each block value i, an expected frequency value Ei, and for storing a frequency deviation threshold value χth, and measurement means for measuring, for a predefined number G of blocks g of the plurality of blocks, the values Oig of the frequency with which each block g assumes each block value i, so as to generate a plurality of measured frequency values Oig.
The system also comprises processing means in signal communication with the measurement means and with the storage means. The processing means process, for each block g, the plurality of measured frequency values Oig and the expected frequency values Ei in order to generate a frequency deviation value χg2 representative of the deviation of the plurality of measured frequency values Oig with respect to the expected frequency values Ei. The processing means also process the frequency deviation values χg2 generated for each block g with the frequency deviation threshold value χth in order to generate a signal representative of the presence of the single data flow in the aggregate packet data flow and representative of the application generating the single data flow.
Advantageously, the method and the system of the present invention can be used in combination with the method and the system for detecting voice data flow in a packet data flow described in Italian patent application no. MI 2006 A 002417, incorporated herein by reference.
In summary, the method and the system of Italian patent application MI 2006 A 002417 provide for the packet data flow to be able to be characterized by at least two measurable variables X,Y and provide, for each measurable variable X,Y, a distribution function P{x|C},P{y|C} for the values of each variable X,Y in a voice data flow. Next, the values x,y of each variable X,Y are measured to obtain a sequence of measured values x(k), y(k) on a number K of blocks and each measured value x(k), y(k) is applied to the respective distribution function P{x|C}, P{y|C} in order to generate a sequence of values of likelihood Bx(k), By(k) from which respective average likelihood values E[Bx], E[By] are generated. Lastly, these average values are processed to generate a reference likelihood value B which, compared with a threshold likelihood value Bmin, provides for detecting the presence of voice data flow in the packet data flow.
From experiments that have been performed, it has emerged that the combined use of the method and system described in Italian patent application MI 2006 A 002417 and the method and system of the present invention is extremely effective in detecting and classifying any voice over IP traffic and in detecting and classifying voice traffic generated by a Skype application and transported either over UDP or over TCP. It was also demonstrated that both methods and both systems mentioned above exhibit a high level of robustness.
As can be appreciated from that which has been described above, the method and system according to the present invention provide for meeting the requirements and overcoming the drawbacks referred to in the introductory part of the present description with reference to the prior art.
In particular, the method and system according to the invention provide for detecting the presence of any type of voice flow, even an encrypted one.
Clearly, in order to satisfy the contingent and specific requirements, a person skilled in the art may introduce many modifications and variants to the method and system according to the invention described above, all however contained within the scope of protection of the invention, which scope of protection is defined by the following claims.
Number | Date | Country | Kind |
---|---|---|---|
MI2007A1141 | Jun 2007 | IT | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/IB2008/001425 | 6/3/2008 | WO | 00 | 12/3/2009 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2008/149203 | 12/11/2008 | WO | A |
Number | Name | Date | Kind |
---|---|---|---|
20050220023 | Kodialam et al. | Oct 2005 | A1 |
20070076611 | Magnaghi et al. | Apr 2007 | A1 |
20100214933 | Mellia et al. | Aug 2010 | A1 |
Number | Date | Country |
---|---|---|
1 764 951 | Mar 2007 | EP |
2119105 | Dec 2011 | EP |
Number | Date | Country | |
---|---|---|---|
20100177652 A1 | Jul 2010 | US |