The present invention relates generally to message processing techniques, and more particularly to techniques for determining the validity of SIP messages.
The Internet has become a primary communications network. All manner of Internet sessions, such as voice sessions, instant message sessions and gaming sessions, occur on the Internet tens of millions of times a day. The Session Initiation Protocol (SIP) is an important signaling protocol that allows heterogeneous sessions to be established. See, e.g., J. Rosenberg et al., “SIP: Session Initiation Protocol,” RFC 3261 (Proposed Standard) (2002), incorporated herein by reference. See also, Updated SIP RFCs 3265, 3853, 4320, 4916, 5393, 5621, 5626, 5630, 5922, 5954 and 6026, each incorporated herein by reference. As SIP becomes pervasive on the Internet, securing SIP becomes paramount. SIP ecosystems are especially prone to denial of service (DoS) or distributed denial of service (DDoS) attacks. The threat of such attacks, targeted either at the SIP layer or at the supporting infrastructure that SIP needs to operate, is well known.
SIP is a text-based protocol defined by a context-sensitive grammar. It is difficult to build a parser generator for the protocol since the grammar is not LL(1). Generally, an LL(1) parser looks only at the next token to make parsing decisions. Furthermore, the grammar is permissive and allows various combinations for representing a valid SIP message. The grammar allows (i) multiple legal representations of headers in a SIP message (e.g., short form header name and long form header name); (ii) multiple headers of the same name (e.g., Via, Route) to occur either in a block, or separated by other headers; and (iii) some headers to be separated by commas or a carriage-return line-feed digraph. Due to vagaries of SIP parsing, most parsers are either hand-crafted or a hybrid between hand-crafting and LL(*) parsing.
For transport-layer protocols, on the other hand, such as Transmission Control Protocol (TCP) and User Datagram Protocol (UDP), or even application-layer protocols like Real-Time Transport Protocol (RTP), the protocol data unit (PDU) is also parsed, but the grammars of these protocols allow little flexibility in representation. The headers of these protocols are fixed-length and the byte sequences are well defined, such that deviating from the fixed format invalidates a PDU immediately. This is not true of SIP, as there are many valid ways to represent a given SIP message. This is of concern because minor perturbations in a message can make it unusable, forcing the recipient to spend resources parsing the message in its entirety before reaching the conclusion that the message is invalid.
Furthermore, the SIP grammar incorporates rules from other constructs such as electronic mail Uniform Resource Identifiers (URIs), Internet host names, and various Multipurpose Internet Mail Extensions (MIME) types that define the session being set up. The resultant composite grammar is complex and prone to individual interpretation by implementers. Thus, SIP parsing is considered an easy vector for mounting an attack, as it forces the recipient to spend resources parsing the message to determine its validity. It has been estimated that a SIP server spends 25-40% of its processing time on parsing. The SIP server will spend resources trying to parse each received message, and if thousands of malformed messages arrive simultaneously, a DoS attack can be effectively mounted.
A number of techniques have been proposed or suggested for detecting invalid SIP messages. For example, Euclidean-distance classifiers have been employed where a SIP message is reconstituted as a series of n-grams. For example, if n is equal to 4, then the SIP message is broken down into a series of 4-byte words. The counts of occurrences of each such word are represented as a feature vector that characterizes the SIP message and that can be processed by a statistical classifier. At run time, when a new SIP message arrives, it can be converted to a similar feature vector by counting the occurrences of n-grams derived from the headers and payload that comprise the new SIP message. The incoming feature vector is compared to the training set using Euclidean distance as a metric, and the message is considered normal if its distances to the normal training vectors fall within a threshold, and anomalous otherwise. The Euclidean-distance classifier thus determines which incoming SIP message is anomalous based on its previous training data. Euclidean-distance classifiers, however, require that there be a substantive difference (in the number of bytes) between a normal SIP message and an anomalous one.
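The limitation noted above can be illustrated with a minimal sketch of a Euclidean-distance check over 4-gram counts. The function names, the nearest-training-vector rule, and the threshold value are assumptions made for illustration only; they are not taken from any particular prior-art implementation.

```python
import math
from collections import Counter


def ngram_vector(message: str, n: int = 4) -> Counter:
    """Count every length-n substring (n-gram) of the raw message text."""
    return Counter(message[i:i + n] for i in range(len(message) - n + 1))


def euclidean(u: Counter, v: Counter) -> float:
    """Euclidean distance between two sparse n-gram count vectors."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u[k] - v[k]) ** 2 for k in keys))


def is_normal(message: str, training: list, threshold: float) -> bool:
    """Assumed rule: normal if the nearest training vector is within threshold."""
    vec = ngram_vector(message)
    return min(euclidean(vec, t) for t in training) <= threshold


normal = "From: sip:a@example.com\r\n"
perturbed = "Fr om: sip:a@example.com\r\n"  # one inserted space invalidates the header
training = [ngram_vector(normal)]

distance = euclidean(ngram_vector(perturbed), training[0])
print(round(distance, 3))
print(is_normal(perturbed, training, threshold=3.0))
```

In this sketch the single inserted space changes only five 4-gram counts by one each, so the distance is small and the perturbed header is accepted under the assumed threshold of 3.0. A tighter threshold would reject it, but would also risk rejecting legitimate variations.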
A need therefore exists for improved techniques for detecting invalid SIP messages.
Generally, methods and apparatus are provided for determining the validity of SIP messages, such as self-similar messages, without parsing the message. According to one aspect of the invention, a SIP message is processed by creating a feature vector matrix of the SIP message; processing the feature vector matrix using a plurality of classifiers; combining results generated by the plurality of classifiers to obtain a combined result; and processing the SIP message based on the combined result. The plurality of classifiers can be trained on a training data set. The SIP message can optionally be classified, for example, as a normal message or an anomalous message based on the combined result. In addition, the SIP message can optionally be processed or rejected based on the combined result.
The results generated by the plurality of classifiers are combined to obtain the combined result using a combination function such as a voting rule or a logistic regression. A logistic regression, for example, employs a linear combination of individual decisions of the plurality of classifiers to predict a logarithm of the ratio of the probability that the SIP message belongs to a first class over the probability of the SIP message belonging to a second class.
A more complete understanding of the present invention, as well as further features and advantages of the present invention, will be obtained by reference to the following detailed description and drawings.
The present invention provides improved techniques for detecting invalid SIP messages, including self-similar SIP messages having minimal perturbations relative to a valid SIP message. These perturbations are so small that distinguishing a perturbed (and hence invalid) version of a SIP message from its regular counterpart is non-trivial, especially for automatic machine-operated algorithms or intrusion detection systems (IDS). For example, the “From:” header in a SIP message can be minimally perturbed by inserting a space between the letters ‘r’ and ‘o’ to invalidate the message, as successfully parsing the From header is crucial to SIP. It is noted that an IDS does not effectively distinguish anomalous messages from normal messages when the difference between them is a few negligible bytes. Due to the expansive SIP grammar, it rapidly becomes inefficient to write thousands of IDS rules to catch all possible permutations of illegal SIP messages while allowing the legal combinations of SIP messages to pass through unaffected.
A SIP adversary who has access to the signaling channel can craft malformed SIP messages that are virtually indistinguishable from the real messages gathered from an eavesdrop of the channel. Even if the SIP adversary does not have access to the signaling channel, they can craft malicious packets that appear to look like real ones to bypass an IDS or a firewall filter, which will typically not perform in-depth analysis on the message.
According to one aspect of the invention, the validity of a message is determined (e.g., a SIP message is classified as a normal or anomalous message) without parsing the message. According to a further aspect of the invention, a multiple classifier system is employed to classify SIP messages. See, for example, Tin Kam Ho et al., “Decision Combination in Multiple Classifier Systems,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(1):66-75 (1994), incorporated by reference herein.
The exemplary multiple classifier system employs a plurality of high-performing classifiers with complementary strengths, and a decision combination function. The present invention recognizes that multiple classifier systems do not rely on a decision from a single classifier, but rather, combine individual decisions from multiple classifiers to reach a consensus decision. A combination function is used to take advantage of the strengths of individual classifiers, avoid their weaknesses, and improve classification accuracy. In one exemplary implementation, a combination function is used that leverages the classifier correlation using logistic regression (LR).
According to yet another aspect of the invention, each classifier in the disclosed multiple classifier system is selected based on its individual and relative classification performance. In various embodiments, the decisions of the chosen classifiers are combined using one or more of simple voting and a weighted linear combination based on logistic regression (LR).
In one exemplary implementation, the multiple classifier system processes a SIP message by creating a feature vector matrix of the SIP message; processing the feature vector matrix using a plurality of classifiers; combining results generated by the plurality of classifiers to obtain a combined result; and processing the SIP message based on the combined result. The following discussion is organized in a similar manner.
Feature Vector Extraction and Reduction
Machine learning algorithms usually operate on vector data, so a technique is employed to embed a SIP message in a high-dimensional vector space. The feature extraction process is defined using the nomenclature established by K. Rieck et al., “A Self-Learning System for Detection of Anomalous SIP Messages,” Principles, Systems and Applications of IP Telecommunications. Services and Security for Next Generation Networks: Second International Conference, IPTComm 2008, Heidelberg, Germany, Jul. 1-2, 2008. Revised Selected Papers, 90-106 (2008). A set of feature strings Q is used to model the contents of a SIP message. Given a feature string q∈Q and a SIP message x, the number of occurrences of q in x is determined and a frequency value f(x,q) is obtained. The frequency value of q in x acts as a measure of importance; the higher the frequency of q, the more it contributes to x.
An embedding function Φ maps all SIP messages X to a |Q|-dimensional vector space by considering the frequencies of feature strings in Q:
$$\Phi : X \to \mathbb{R}^{|Q|} \quad \text{with} \quad \Phi(x) \mapsto \bigl(f(x,q)\bigr)_{q \in Q}, \tag{1}$$
where R is a set or a vector that stores the frequencies of the n-grams found in X.
Once the embedding function Φ has mapped the SIP message to a vector, it can be analyzed further using standard classifiers.
An important consideration is how to specify Φ. Clearly, tokenizing x using SIP ABNF delimiters is not an option because that implies parsing x according to the SIP ABNF. As discussed above, parsing a SIP message is computationally expensive and therefore should not be the first option. Rieck et al. use the n-grams technique to arrive at a vector. Instead of parsing a SIP message, feature strings are extracted by moving a sliding window of length n over x (n=4). At each position, a substring of length n is considered and the frequency of its occurrence is counted.
For example, consider the feature string extraction across the following SIP message fragment (the <cr><lf> symbols below represent the carriage-return line-feed digraph that terminates each SIP header line):
Φ(“BYE sip:a@example.com SIP/2.0<cr><lf>
To: sip:b@example.com”)
Applying the n-gram technique with n=4 produces 49 n-grams, including ampl (2 occurrences); .com (2 occurrences); e.co (2 occurrences); @exa (2 occurrences) and exam (2 occurrences).
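The sliding-window extraction can be sketched in a few lines. This is a minimal illustration, not the embodiment's implementation: the function name `ngram_counts` is hypothetical, and the exact spacing of the fragment is an assumption (a 52-character fragment yields 52 - 4 + 1 = 49 window positions, consistent with the count stated above).

```python
from collections import Counter


def ngram_counts(message: str, n: int = 4) -> Counter:
    """Slide a window of length n over the raw text and count each substring."""
    return Counter(message[i:i + n] for i in range(len(message) - n + 1))


# Assumed spacing of the fragment; "\r\n" stands for the <cr><lf> digraph.
fragment = "BYE sip:a@example.com SIP/2.0\r\nTo: sip:b@example.com"

counts = ngram_counts(fragment)
print(sum(counts.values()))  # total number of window positions
print(counts["ampl"], counts[".com"], counts["e.co"], counts["@exa"], counts["exam"])
```

No tokenizer or SIP grammar is consulted at any point; the message is treated as an opaque character sequence, which is what allows the embedding to be computed without parsing.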
Generally, the vector space induced by the n-grams has a very high number of dimensions. To reduce the number of dimensions, the frequency distribution of these n-grams is observed and a number of exemplary n-grams are selected that have significantly high frequency counts. These exemplary n-grams serve as the features that are used to train and test the classifiers. The features and the SIP messages combine to produce a matrix that is used as input to train and test the classifiers in the multiple classifier system. As discussed further below in conjunction with
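The reduction step and the resulting matrix can be sketched as follows. The selection criterion used here (the `top_k` n-grams with the highest total count across all messages) and all function names are assumptions chosen for illustration; the text above only requires that the selected n-grams have significantly high frequency counts.

```python
from collections import Counter


def ngram_counts(message, n=4):
    """Count every length-n substring of the raw message text."""
    return Counter(message[i:i + n] for i in range(len(message) - n + 1))


def select_features(messages, top_k, n=4):
    """Pick the top_k n-grams by total frequency over all messages (assumed criterion)."""
    total = Counter()
    for m in messages:
        total.update(ngram_counts(m, n))
    # Sort by descending count, then alphabetically so the result is deterministic.
    ranked = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def feature_matrix(messages, features, n=4):
    """One row per message, one column per selected n-gram, entries are counts."""
    rows = []
    for m in messages:
        c = ngram_counts(m, n)
        rows.append([c[g] for g in features])
    return rows


messages = [
    "BYE sip:a@example.com SIP/2.0\r\n",
    "BYE sip:b@example.com SIP/2.0\r\n",
    "BY E sip:a@example.com SIP/2.0\r\n",  # perturbed request line
]
features = select_features(messages, top_k=5)
matrix = feature_matrix(messages, features)
print(features)
for row in matrix:
    print(row)
```

Each row of `matrix` is the reduced feature vector of one message, so the whole structure can be handed to any classifier that accepts fixed-length numeric input.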
Classifiers in Multiple Classifier System
The multiple classifiers that are employed in a given implementation can be selected, for example, by evaluating classification performance from a comprehensive list of classifiers. For example, standard classifiers can be employed from Weka. See, for example, Ian H. Witten and Eibe Frank, “Data Mining: Practical Machine Learning Tools and Techniques,” Elsevier Inc. (2005), incorporated herein by reference. Generally, Weka is an open-source machine learning software tool from the University of Waikato, New Zealand. Weka contains a comprehensive suite of commonly used classifiers. Default parameters supplied by Weka can optionally be employed to evaluate a dataset with each of the classifiers. It is recognized that further improvements can be obtained by expanding the training set to include a larger population of messages.
Linear regression can be performed on the pre-selected classifiers to determine the statistical similarity of these classifiers. For those classifiers that are statistically similar, a single classifier can be selected as the representative classifier from this statistically similar group.
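One way to read the similarity step is sketched below, under stated assumptions: similarity between two classifiers is measured as the coefficient of determination (R²) of a simple least-squares regression of one classifier's binary decisions on the other's, classifiers whose R² meets an assumed threshold are grouped, and the most accurate member of each group is kept as its representative. The function names, the threshold of 0.9, and the greedy grouping are illustrative choices, not details given in the text.

```python
def r_squared(x, y):
    """Coefficient of determination of the least-squares line y ~ a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        # Constant outputs: treat as similar only if the decisions are identical.
        return 1.0 if list(x) == list(y) else 0.0
    return (sxy * sxy) / (sxx * syy)


def pick_representatives(decisions, accuracy, threshold=0.9):
    """Greedy grouping: keep the most accurate classifier of each similar group."""
    kept = []
    for name in sorted(decisions, key=lambda c: -accuracy[c]):
        if all(r_squared(decisions[name], decisions[k]) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical decisions (1 = anomalous) of three classifiers on six messages.
decisions = {
    "A": [1, 0, 1, 1, 0, 0],
    "B": [1, 0, 1, 1, 0, 0],  # behaves exactly like A
    "C": [1, 1, 0, 1, 0, 1],  # behaves differently
}
accuracy = {"A": 0.95, "B": 0.93, "C": 0.90}
print(pick_representatives(decisions, accuracy))
```

Here classifier B duplicates A and is dropped, while C is retained because its decisions carry information that A's do not.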
For a discussion of classifier training, see, for example, Trevor Hastie et al., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, (Springer Science+Business Media, New York, 2nd edition, 2009), incorporated by reference herein.
Multiple Classifier Systems
As previously indicated, the disclosed multiple classifier system does not rely on a decision from a single classifier, but rather, combines individual decisions from multiple classifiers to reach a consensus decision. For a general discussion of multiple classifier systems, see, for example, Tin Kam Ho et al., “Decision Combination in Multiple Classifier Systems,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(1):66-75 (1994); or J. Kittler et al., “On Combining Classifiers,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):226-239 (1998), each incorporated by reference herein.
An appropriate combination function can take advantage of the strengths of individual classifiers, avoid their weaknesses, and improve classification accuracy. A voting rule combination function, for example, incorporates a counting rule over the decisions of the classifiers in the multiple classifier system. A logistic regression (LR) combination function is employed by the exemplary embodiment to analyze classifier correlation by a statistical model based on logistic regression. Any combination function can be employed to combine individual decisions from multiple classifiers to reach a consensus decision, as would be apparent to a person of ordinary skill in the art.
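A voting rule of the kind mentioned above can be sketched as a simple count over binary decisions. The encoding (1 for anomalous, 0 for normal), the tie-breaking behavior, and the function names are assumptions made for this illustration.

```python
def count_vote(decisions, min_votes):
    """Counting rule: return 1 if at least min_votes classifiers output 1."""
    return 1 if sum(decisions) >= min_votes else 0


def majority_vote(decisions):
    """Simple majority: more than half of the classifiers must output 1."""
    return count_vote(decisions, len(decisions) // 2 + 1)


print(majority_vote([1, 0, 1]))     # two of three classifiers vote anomalous
print(majority_vote([1, 0, 0, 1]))  # a tie does not reach a strict majority
print(count_vote([1, 0, 0], 1))     # a single vote suffices when min_votes is 1
```

Voting weighs every classifier equally, which is why a weighted scheme such as logistic regression can do better when some classifiers are known to be more reliable than others.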
Logistic Regression Combination Function
Logistic regression provides a suitable mechanism for combining the decisions of multiple classifiers. In this approach, the log-odds ratio between two classes is modeled as a regression function of the decisions given by the individual classifiers. A widely used regression model in this context is simple linear regression, where a linear combination of the individual decisions of the classifiers Sjk for message mi is used to predict the logarithm of the ratio of the probability that the message belongs to one class over the probability of it belonging to the other class. From the predicted log-odds ratio, one can derive the posterior probability of the message belonging to the target class, and assignment to that class is made if the estimated posterior probability is greater than 0.5.
Logistic regression is advantageous as it exploits the standard tools of regression to estimate the weights according to each classifier's past contributions to correct decisions on the training data. A higher weight can be given to a classifier within a subset Sjk that has better classification accuracy than other classifiers in the same subset. This has a chance of increasing the overall classification accuracy beyond simple voting.
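The weighting behavior described above can be sketched with a small logistic regression fitted by batch gradient descent. Everything specific here is an assumption for illustration: the training data are invented, the function names are hypothetical, and plain gradient descent stands in for whatever estimation routine an actual embodiment would use.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(decisions, labels, epochs=2000, lr=0.5):
    """Fit an intercept and one weight per classifier by batch gradient descent."""
    k = len(decisions[0])
    n = len(decisions)
    beta = [0.0] * (k + 1)  # beta[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for row, y in zip(decisions, labels):
            p = sigmoid(beta[0] + sum(b * d for b, d in zip(beta[1:], row)))
            err = p - y
            grad[0] += err
            for a, d in enumerate(row, start=1):
                grad[a] += err * d
        beta = [b - lr * g / n for b, g in zip(beta, grad)]
    return beta


def combine(beta, row):
    """Return (decision, posterior); assign the target class if posterior >= 0.5."""
    log_odds = beta[0] + sum(b * d for b, d in zip(beta[1:], row))
    posterior = sigmoid(log_odds)
    return (1 if posterior >= 0.5 else 0), posterior


# Hypothetical training decisions of three classifiers (1 = anomalous).
# Classifier 1 always agrees with the label; classifiers 2 and 3 are uninformative.
train = [[1, 1, 0], [1, 1, 1], [1, 0, 1], [1, 0, 0],
         [0, 0, 1], [0, 1, 0], [0, 0, 0], [0, 1, 1]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

beta = fit_logistic(train, labels)
print(combine(beta, [0, 1, 1])[0])  # classifiers 2 and 3 outvoted by the weights
print(combine(beta, [1, 0, 0])[0])  # the reliable classifier alone carries the decision
```

With these data a majority vote would label the decision pattern [0, 1, 1] as anomalous, whereas the fitted weights favor the one classifier that has been reliable on the training set.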
The procedure for decision combination with logistic regression can be formulated mathematically, as follows. Let Sjk be the jth system of classifiers of length k, i.e.,
where pjk is the classification performance of Sjk; M is the set of all SIP messages; |M| is the number of SIP messages in the set; and
where p̂i,jk and ĉi,jk are estimated values, which can be either 0 or 1.
Here,
where βa are the estimated coefficients of ĉi,jk for the simple linear regression model. Equation 4 is the combination function for LR, returning a 1 when the LR result is ≧0.50 (since the decision of each classifier is binary). Equation 4 is the evaluation function that evaluates the classification performance of Sjk.
The selection function is expressed as:
$$\sigma(S) = S_{j^*k^*}. \tag{5}$$
The constraints on the selection function are expressed as follows:
$$k^* = \min\{\,k \mid p_{jk} \text{ is maximum for some } j\,\}$$
and
$$j^* = \min\{\,j \mid p_{jk^*} \text{ is maximized}\,\}.$$
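The selection function can be sketched as follows, reading the constraints as: first take the smallest length k at which the best observed performance is reached, then take the smallest index j that reaches it at that length. This reading, the function name, and the performance values are assumptions made for illustration.

```python
def select_system(perf):
    """perf maps (j, k) to the classification performance p_jk; returns (j*, k*)."""
    best = max(perf.values())
    # Smallest length k at which some system attains the best performance.
    k_star = min(k for (j, k), p in perf.items() if p == best)
    # Smallest index j that attains the best performance at length k*.
    j_star = min(j for (j, k), p in perf.items() if k == k_star and p == best)
    return j_star, k_star


# Hypothetical performance values for systems of length 2 and 3.
perf = {
    (1, 2): 0.95,
    (2, 2): 0.97,
    (3, 2): 0.97,
    (1, 3): 0.97,
    (2, 3): 0.97,
}
print(select_system(perf))
```

Preferring the smallest k keeps the deployed system as small as possible, since adding classifiers that do not improve performance only adds run-time cost.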
Multiple Classifier Process
The feature vector matrix is applied to the multiple classifier system 100 (
One or more embodiments can make use of software running on a general purpose computer or workstation.
The term “processor” as used herein is intended to include any processing device, such as, for example, one that includes a CPU (central processing unit) and/or other forms of processing circuitry. Further, the term “processor” may refer to more than one individual processor. The term “memory” is intended to include memory associated with a processor or CPU, such as, for example, RAM (random access memory), ROM (read only memory), a fixed memory device (for example, hard drive), a removable memory device (for example, diskette), a flash memory and the like.
In addition, the phrase “input/output interface” as used herein, is intended to include, for example, one or more mechanisms for inputting data to the processing unit (for example, mouse), and one or more mechanisms for providing results associated with the processing unit (for example, printer). The processor 402, memory 404, and input/output interface such as display 406 and keyboard 408 can be interconnected, for example, via bus 410 as part of a data processing unit 412. Suitable interconnections, for example via bus 410, can also be provided to a network interface 414, such as a network card, which can be provided to interface with a computer network, and to a media interface 416, such as a diskette or CD-ROM drive, which can be provided to interface with media 418.
Analog-to-digital converter(s) 420 may be provided to receive analog input, such as analog video feed, and to digitize same. Such converter(s) may be interconnected with system bus 410.
Accordingly, computer software including instructions or code for performing the methodologies of the invention, as described herein, may be stored in one or more of the associated memory devices and, when ready to be utilized, loaded in part or in whole and implemented by a CPU. A data processing system suitable for storing and/or executing program code will include at least one processor 402 coupled directly or indirectly to memory elements 404 through a system bus 410. The memory elements can include local memory employed during actual implementation of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during implementation.
Input/output or I/O devices (including but not limited to keyboards 408, displays 406, pointing devices, and the like) can be coupled to the system either directly (such as via bus 410) or through intervening I/O controllers (omitted for clarity).
Network adapters such as network interface 414 may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.
As previously indicated, the arrangements of multiple classifier systems, as described herein, provide a number of advantages relative to conventional arrangements. As indicated above, the disclosed techniques allow the validity of a SIP message to be determined without parsing the message. The exemplary multiple classifier system employs a plurality of high-performing classifiers with complementary strengths, and a decision combination function. In this manner, the disclosed multiple classifier system does not rely on a decision from a single classifier, but rather, combines individual decisions from multiple classifiers to reach a consensus decision. A combination function can be used to take advantage of the strengths of individual classifiers, avoid their weaknesses, and improve classification accuracy.
Also, the disclosed techniques for determining the validity of SIP messages can be used to determine the validity of self-similar messages.
It is emphasized that the above-described embodiments of the invention are intended to be illustrative only. In general, the exemplary multiple classifier systems can be modified, as would be apparent to a person of ordinary skill in the art, to incorporate alternative classifiers and/or combination functions. In addition, the disclosed techniques for determining the validity of SIP messages can be employed in any working system. For example, the present invention can be deployed in a SIP proxy server that accepts SIP messages and proxies them downstream. Such a system could also consist of other SIP entities such as back-to-back user agents, user agent server, user agent client, registrar, redirect server, a SIP firewall element or a session border controller. Generally, any system that accepts SIP messages and acts upon them can benefit from this invention.
In addition, while the exemplary embodiments have contemplated a combination of classifier decisions that are made in parallel, a multiple classifier system in accordance with the present invention can be more flexible in the organization of the classifiers. For example, some classifiers can run in parallel while other classifiers can run sequentially, and the combination of decisions can happen in different stages, as would be apparent to a person of ordinary skill in the art.
While exemplary embodiments of the present invention have been described with respect to digital logic blocks, as would be apparent to one skilled in the art, various functions may be implemented in the digital domain as processing steps in a software program, in hardware by circuit elements or state machines, or in a combination of both software and hardware. Such software may be employed in, for example, a digital signal processor, application specific integrated circuit, micro-controller, or general-purpose computer. Such hardware and software may be embodied within circuits implemented within an integrated circuit.
Thus, the functions of the present invention can be embodied in the form of methods and apparatuses for practicing those methods. One or more aspects of the present invention can be embodied in the form of program code, for example, whether stored in a storage medium, loaded into and/or executed by a machine, or transmitted over some transmission medium, wherein, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention. When implemented on a general-purpose processor, the program code segments combine with the processor to provide a device that operates analogously to specific logic circuits. The invention can also be implemented in one or more of an integrated circuit, a digital signal processor, a microprocessor, and a micro-controller.
It is to be understood that the embodiments and variations shown and described herein are merely illustrative of the principles of this invention and that various modifications may be implemented by those skilled in the art without departing from the scope and spirit of the invention.