I.A. Field
The present disclosure relates generally to communication devices, and specifically to classifying information received by communications devices. More particularly, this disclosure relates to the classification of textual information transmitted over a digital computer network.
I.B. Background
1. References
The following U.S. patents and papers provide useful background information, for which they are incorporated herein by reference in their entirety.
Aho, Sethi, and Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1985.
2. Introduction
In communication networks, it is essential to provide fast classification of data passing through communication devices such as routers, switches, or gateways. Commonly, in digital communication networks, data is arranged in packets, cells, frames, etc. Packets contain data and classification information. Addressing and routing information as well as protocol-related information are kinds of classification information that are required for transmission of data from a source to a destination in a digital communication network.
The process of identification and classification of network traffic transactions requires parsing of traffic streams flowing through the network. Parsing is relatively easy when all protocol headers, for the protocol used in the communication, have well known fixed offsets from the beginning of a packet that is part of the communication. Conversely, classification is significantly more complex in protocols where offsets are not fixed and/or sets of values are to be determined. Clearly it is desirable to reduce the complexity of such a parsing.
Regular expressions are used for pattern matching in text-based Internet protocols. A regular expression denotes a language that is built according to a given set of rules. Such regular expressions are well known in the art. State machines are used to determine whether a given word, i.e., a sequence of one or more characters, is valid within the language. Therefore, state machines can be used to determine whether an arbitrary pattern appears in a stream flow. A state machine is defined based on a given regular expression. A pattern is a word in the language defined by the given regular expression.
Reference is now made to
“r=a(bpm|c(d(fk|gj|hi))+n)”
One advantage of using a state machine for pattern matching is that the checks are done in parallel.
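By way of non-limiting illustration only, the word-membership test described above can be sketched in software with Python's standard `re` module. This is not the state-machine implementation discussed in this disclosure; the helper name `in_language` and the sample words are assumptions introduced for the example.

```python
import re

# The example regular expression from the text, written in Python syntax.
R = re.compile(r"a(bpm|c(d(fk|gj|hi))+n)")


def in_language(word: str) -> bool:
    """Return True if the whole word belongs to the language defined by R."""
    return R.fullmatch(word) is not None


if __name__ == "__main__":
    # Hypothetical sample words: some are in the language, some are not.
    for word in ["abpm", "acdfkn", "acdgjdhin", "acn", "abp"]:
        print(word, in_language(word))
```

The word "acn" is rejected because the repeated group must occur at least once between "c" and "n".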
There are several implementations of state machines for pattern matching in text-based Internet protocols. An implementation based on hardware is the most efficient in terms of processing time. A conventional hardware implementation is based on a micro controller that includes a processor and random access memory (RAM). The RAM is used for storage of the incoming characters. The processor retrieves data from the RAM and uses the data to perform an operation that determines the next state. The processor then switches the state machine into this next state. Each state in the state machine is a thread executed by the processor. This implementation offloads the task of transaction detection from the CPU and leaves the CPU to handle the related actions. However, such conventional implementations do not provide data extraction, and cannot provide detection of fragments of a traffic stream.
Therefore it would be advantageous to implement a system that can provide an identification and classification of traffic transactions using state machines that will support data extraction and would provide transaction detection in fragments of a traffic stream. It would be further advantageous if the system could support multiple searches.
To realize the advantages discussed above, the disclosed teachings provide a search engine for matching textual patterns in a traffic stream. The search engine comprises a traffic control unit, a micro-code memory, a comparator and a report memory. The traffic control unit is capable of managing the traffic stream. The micro-code memory is capable of storing and retrieving micro-code instructions. The comparator is capable of executing said micro-code instructions to match the textual patterns. The report memory is capable of storing and retrieving reports generated by said comparator.
Specifically, the search engine is further capable of performing a search in fragments of said traffic stream.
Specifically, the textual patterns are regular expressions.
Specifically, the traffic control unit is provided control information for said managing.
Specifically, the traffic control unit is capable of tracing the traffic stream using a traffic pointer.
Specifically, the traffic pointer points to a current byte in the traffic stream.
More specifically, the control information comprises a length of the traffic stream, a first micro-code instruction to be executed and a length of the textual patterns to be matched.
Specifically, the micro-code memory is one of a random access memory (RAM), a flash memory and a cache memory.
Specifically, the report memory is implemented as first in first out (FIFO) memory.
Specifically, the report memory is one of a RAM memory, a flash memory and a cache memory.
Specifically, the micro-code instruction comprises fields for op-code, search-mode, case sensitivity, traffic pointer flag (TPF), report, next instruction, and token.
More specifically, the op-code field includes an op-code that indicates a type of search to be performed by the search engine.
More specifically, the type of search includes at least one of a charset search, string search, multi-search, range search, and no-operation (NOP).
More specifically, the charset search op-code is used for matching a single byte from the traffic stream to contents of the token field.
More specifically, the string search op-code is used for matching a set of consecutive bytes from the traffic stream to contents of the token field.
More specifically, the range search op-code is used for determining if contents of the incoming data field is within a defined range of characters.
More specifically, the multi search op-code is used for matching at least a single byte from the traffic stream to at least two tokens.
More specifically, NOP op-code is used for generating reports.
More specifically, the search mode field includes at least a search mode that indicates a type of search to be performed.
More specifically, the search mode is at least one of a normal search, a skip-until search, and a skip-over search.
More specifically, the normal search is used for scanning the traffic stream sequentially.
More specifically, the skip-until search is used for skipping until a match to the contents of the token field is found.
More specifically, the skip-over search is used for skipping over the contents of the token field.
More specifically, the case-sensitive field is used to distinguish between lowercase and uppercase characters.
More specifically, the TPF is used to determine whether to move the traffic pointer forward.
More specifically, the report field is used to determine whether to generate a report.
More specifically, the next instruction field comprises an index to the next instruction that is to be executed.
More specifically, the next instruction field includes at least a sub-field each for a next instruction in a case of a match, and a next instruction in a case of a mismatch.
More specifically, the token field includes a sequence of alphanumeric characters to be matched.
More specifically, the micro-code instructions include instructions for analyzing the op-code field, the search-mode field, and the case sensitive field, instructions for comparing between at least one byte from the traffic stream to contents of the token field, instructions for analyzing the TPF and the next instruction field; instructions for determining whether to generate an instruction report, and instructions for sending the instruction report to said report memory if required.
More specifically, the instruction report is generated when said comparator completes execution of the micro-code instructions.
More specifically, the instruction report includes information on at least one of a pointer to data in the traffic streams, reported instruction number and a report trigger.
More specifically, the report trigger is one of a match trigger and a mismatch trigger.
More specifically, the TPF is analyzed for determining the number of bytes to advance the traffic pointer.
Specifically, the reports are at least one of an instruction report, a terminate report and a NOP report.
More specifically, the instruction report is generated when said comparator completes execution of said micro-code instructions.
More specifically, the instruction report includes information on at least one of a pointer to data in said traffic, reported instruction number and report trigger.
More specifically, the report trigger is one of a match trigger and a mismatch trigger.
More specifically, the NOP report is generated when the op-code field has a NOP op-code.
More specifically, the NOP report comprises information entered by said comparator and a report trigger.
More specifically, the report trigger is a NOP trigger.
More specifically, the terminate report is generated when said comparator completes matching.
More specifically, the terminate report comprises information on at least one of the reported instruction and a report trigger.
More specifically, the report trigger is one of a match trigger, a mismatch trigger and an inconclusive trigger.
More specifically, the inconclusive match is a trigger that indicates that the traffic stream has ended before it was possible to determine whether there was a pattern match or a pattern mismatch.
Specifically, the search engine is capable of performing a search by generating a terminate report with an inconclusive trigger, if the traffic stream has ended before it was possible to determine whether there was a pattern match or mismatch; uploading the terminate report from said report memory, if said comparator receives a packet which is a continuation of the traffic stream which caused the generation of said terminate report with an inconclusive trigger; and continuing the search according to the designated instruction's parameters provided in said terminate report.
Another aspect of the disclosed teachings is a method for matching textual patterns in a traffic stream using a search engine comprising at least a traffic control unit, a micro-code memory, a comparator, and a report memory. The method comprises loading data from the traffic stream into the comparator using the traffic control unit. The micro-code instruction to be executed next is fetched from the micro-code memory and executed using the comparator. A terminate report is then generated.
Specifically, the micro-code instruction is executed using a sub-process comprising analyzing the op-code field, the search-mode field, and the case sensitive field. At least one byte from the traffic stream is compared to contents of the token field. The TPF and the next instruction field are analyzed. It is determined whether to generate an instruction report. If required, the instruction report is sent to the report memory.
More specifically, the instruction report is generated when said comparator completes execution of said micro-code instruction.
More specifically, the instruction report includes information on at least one of a pointer to data in the traffic streams, reported instruction number and a report trigger.
More specifically, the report trigger is one of a match trigger and a mismatch trigger.
More specifically, the terminate report is generated when said engine completes matching.
More specifically, the terminate report comprises information on at least one of the reported instruction and a report trigger.
More specifically, the report trigger is one of a match trigger, a mismatch trigger and an inconclusive trigger.
More specifically, the inconclusive match is a trigger that indicates that the traffic stream has ended before it was possible to determine whether there was a pattern match or a pattern mismatch.
More specifically, the search in fragments of traffic stream comprises generating a terminate report with an inconclusive trigger, if said traffic stream has ended before it was possible to determine whether there was a pattern match or mismatch. The terminate report is uploaded from said report memory, if said comparator receives a packet which is a continuation of the traffic stream which caused the generation of said terminate report with an inconclusive trigger. The search is continued according to the designated instruction's parameters provided in said terminate report.
Another aspect of the disclosed teachings is a micro-code instruction for matching textual patterns in a traffic stream using a search engine, the micro-code instruction comprising fields for op-code, search-mode, case sensitivity, traffic pointer flag (TPF), report, next instruction, and token.
The above objectives and advantages of the disclosed teachings will become more apparent by describing in detail preferred embodiments thereof with reference to the attached drawings in which:
FIG. 1 is an exemplary state machine.
FIG. 2 is a schematic block diagram of a search engine in accordance with the disclosed teachings.
FIG. 3 is a non-limiting example of a micro-code instruction structure.
FIGS. 4(A)-(D) show a non-limiting subset of op-codes.
FIG. 5 is an exemplary flow chart describing the process of pattern matching.
FIG. 6 is an exemplary flow chart describing the process of executing a micro-code instruction.
FIGS. 7(A)-(D) depict exemplary diagrams showing the use of the state machine according to the disclosed teachings.
This disclosure teaches a system and method enabling identification and classification of text-based traffic in a digital computer network. The disclosed techniques are realized through a regular expression search engine. The disclosed techniques provide for fast processing of pattern matching. Additionally, the search engine extracts data out of the traffic stream according to demand.
Reference is now made to
Upon request by comparator 230, traffic control unit 210 sends bytes of data to comparator 230. Comparator 230 determines the number of bytes to be sent from traffic control unit 210. Traffic control unit 210 traces the traffic stream by using a traffic pointer, which points to the last read byte in the traffic stream. Normally a single character will reside in a single byte of data. Traffic control unit 210 changes the traffic pointer according to commands initiated by comparator 230.
Micro-code memory 220 includes the micro-code instructions that are executed by comparator 230. The set of instructions required for performing the match are loaded into micro-code memory 220 through control lines 225.
Each set of instructions is used for matching a single regular expression. Comparator 230 may have more than one instruction set. Comparator 230 manages all the activities related to matching of defined patterns to data streams as well as reporting match results. Comparator 230 reads a data segment of the traffic stream from traffic control unit 210, and fetches the next instruction from micro-code memory 220. Reports from comparator 230 are stored in report memory 240.
Report memory 240 is implemented as a first-in-first-out (FIFO) memory and includes the reported instruction number, traffic pointer value, and length of the report. Comparator 230 also provides terminate messages 235. These messages are described in detail below.
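By way of non-limiting illustration only, a first-in-first-out report memory holding the three items named above can be sketched in Python as follows. The class names `Report` and `ReportMemory`, and the use of `collections.deque` as backing store, are assumptions introduced for the example and are not part of the disclosed hardware.

```python
from collections import deque
from typing import Deque, NamedTuple


class Report(NamedTuple):
    """One entry of the report memory (field names are hypothetical)."""
    instruction_number: int
    traffic_pointer: int
    length: int


class ReportMemory:
    """A software stand-in for a FIFO report memory."""

    def __init__(self) -> None:
        self._fifo: Deque[Report] = deque()

    def push(self, report: Report) -> None:
        # New reports are appended at the tail of the queue.
        self._fifo.append(report)

    def pop(self) -> Report:
        # The oldest report is always read first.
        return self._fifo.popleft()

    def __len__(self) -> int:
        return len(self._fifo)
```

Reports are read back in the order in which the comparator produced them, which is the property a FIFO organization provides.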
Reference is now made to
Op-code field 310 defines the type of search operation to be used. An exemplary non-limiting list of op-codes is shown in
The “range-search” op-code is used to determine whether incoming data is within a defined range of characters. For example, it enables the determination of whether the incoming data is a digit, by having search token field 360 set to “0-9”, a lowercase letter, by setting search token field 360 to “a-z”, or an uppercase letter, by setting search token field 360 to “A-Z”. Other ranges could be easily set to identify other operations. In order to define the type of a search performed under an op-code, the search mode field 315 is used. Search mode field 315 defines the mode of searches to be used, and includes, but is not limited to, normal search, skip over, and skip until, further shown in
Case sensitive field 320 is used when it is necessary to distinguish between uppercase and lowercase characters. When ‘case sensitive’ is activated, comparator 230 finds only those instances in which the character case matches that of the token in token field 360.
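By way of non-limiting illustration only, the range test and the case-sensitive comparison described above can be sketched in Python as follows. The function names `in_range` and `tokens_equal`, and the assumption that a range token is exactly three characters of the form "low-high", are introduced for the example.

```python
def in_range(byte: int, range_token: str) -> bool:
    """Check one incoming byte against a range token such as "0-9".

    Assumes the token has the three-character form "<low>-<high>".
    """
    if len(range_token) != 3 or range_token[1] != "-":
        raise ValueError("expected a token of the form 'a-z'")
    low, high = ord(range_token[0]), ord(range_token[2])
    return low <= byte <= high


def tokens_equal(data: bytes, token: bytes, case_sensitive: bool) -> bool:
    """Compare traffic bytes with a token, honoring the case-sensitive flag."""
    if case_sensitive:
        return data == token
    # With the flag cleared, uppercase and lowercase letters are equivalent.
    return data.lower() == token.lower()
```

For instance, `in_range(ord("7"), "0-9")` holds, while `tokens_equal(b"get", b"GET", True)` does not.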
TPF field 330 determines whether to move the traffic pointer forward in traffic control unit 210 in case of a match, or a mismatch. The options used in TPF 330 are further shown in
Report field 340 determines whether comparator 230 should generate a report. Report field 340 consists of two sub-fields: report in a case of match and report in a case of mismatch. The content of report field 340 may be, for example, “00” for no report, “01” for report on mismatch, “10” for report on match. The “next instruction” field 350 includes indexes of the next instructions to be performed in the cases of a match or a mismatch. Each sub-field for a match and mismatch may include the next instruction number, or the offset to the next instruction. In the multi-match instruction the “next instruction” 350 appears more than once as described in detail below. It is further possible for one of the fields to point back to itself hence allowing a repetitive sequence until the other condition, either a match or a mismatch occurs.
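By way of non-limiting illustration only, the two-bit report field described above can be decoded as in the following Python sketch. The function name `should_report` is hypothetical, and the treatment of the value “11” as reporting in both cases is an assumption, since only the values “00”, “01” and “10” are given as examples.

```python
def should_report(report_field: str, matched: bool) -> bool:
    """Decide whether to generate a report for the current outcome.

    report_field is a two-character bit string: the first bit is the
    "report on match" sub-field, the second is "report on mismatch".
    """
    if len(report_field) != 2 or set(report_field) - {"0", "1"}:
        raise ValueError("report field must be two bits, e.g. '10'")
    report_on_match = report_field[0] == "1"
    report_on_mismatch = report_field[1] == "1"
    return report_on_match if matched else report_on_mismatch
```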
Token field 360 includes a sequence of alphanumeric characters to be matched, or other types of information, which may be required by the instruction. In one embodiment of the invention the token may include, at most, a predefined number of characters, for example, four characters. The multi-match instruction includes more than one token as described in further detail below. It should be noted that the micro-code instruction may include additional fields, depending on the type of the instruction. Such fields are described in more detail below.
Extension field 370 is used for additional information that is useful in implementing the various micro-code instructions. Extension field 370 may consist of several different fields each containing various pieces of information. Examples for such fields are mentioned below.
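By way of non-limiting illustration only, the fields of the micro-code instruction described above can be gathered into a single record, as in the following Python sketch. All identifier names are hypothetical. Representing the TPF and report fields as pairs of booleans and the extension field as a dictionary are assumptions made for readability, not the disclosed bit layout.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict


class OpCode(Enum):
    CHARSET = "charset search"
    STRING = "string search"
    MULTI = "multi search"
    RANGE = "range search"
    NOP = "no-operation"


class SearchMode(Enum):
    NORMAL = "normal"
    SKIP_UNTIL = "skip until"
    SKIP_OVER = "skip over"


@dataclass
class MicroInstruction:
    op_code: OpCode                   # field 310
    search_mode: SearchMode           # field 315
    case_sensitive: bool              # field 320
    advance_on_match: bool            # TPF field 330 (assumed encoding)
    advance_on_mismatch: bool         # TPF field 330 (assumed encoding)
    report_on_match: bool             # report field 340
    report_on_mismatch: bool          # report field 340
    next_on_match: int                # next instruction field 350
    next_on_mismatch: int             # next instruction field 350
    token: bytes                      # token field 360
    extension: Dict[str, Any] = field(default_factory=dict)  # field 370
```

A string-match instruction for the token “GET” could then be written as `MicroInstruction(OpCode.STRING, SearchMode.NORMAL, True, True, False, True, False, 1, 0, b"GET", {"string_length": 3})`, where the index values are arbitrary.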
When using the “charset match” instruction, comparator 230 may return one of two messages: match or mismatch, as the case may be. Similarly, a “string match” instruction is defined by the “string search” op-code. When this instruction is provided to comparator 230, it enables the comparison of consecutive bytes of traffic stream 212 to a string defined in token field 360. For example, if token 360 includes four characters and its content is “XY5Z”, and traffic 212 is “XY5Z”, comparator 230 will return a match message. In a case where traffic 212 is “cZYX” or “XbYc”, comparator 230 will return a mismatch message.
The string match instruction format includes an additional field “string length”, which determines the length of the string to be matched. The string length field is part of extension field 370 defined in the micro-code instruction format. The string length content determines the number of bytes of traffic 212 to be matched. In order to perform the comparison, the number of bytes from traffic 212 must be equal to the number of bytes in the token field.
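By way of non-limiting illustration only, the string match comparison described above can be sketched in Python as follows. The function name `string_match` and its parameters are hypothetical. Returning a plain mismatch when fewer than `string_length` bytes are available is a simplification made for the example.

```python
def string_match(traffic: bytes, pointer: int, token: bytes,
                 string_length: int) -> bool:
    """Compare string_length consecutive traffic bytes with the token."""
    window = traffic[pointer:pointer + string_length]
    # The comparison is only meaningful when the number of traffic bytes
    # equals the number of token bytes being matched.
    if len(window) != string_length:
        return False
    return window == token[:string_length]
```

With the token “XY5Z”, the traffic “XY5Z” yields a match, while “cZYX” and “XbYc” yield a mismatch, in line with the example given in the text.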
The “multi-match” instruction is defined by the “multi match” op-code, and provides the capability of comparison of bytes from the traffic to different tokens. The “multi match” instruction format includes, in addition to the fields described in
The additional N “report flag” fields provide the system the ability to generate a report, in a case of match or mismatch, for each compared token. In one embodiment of the disclosed techniques, the number of tokens to be matched (i.e., “N”) is limited to a maximum number, for example, at most six tokens. When executing the “multi-match” instruction, comparator 230 reports a mismatch if none of the possibilities were matched. In all of the instructions described above, comparator 230 advances the traffic pointer and fetches the next instruction according to the content of the TPF 330 and next instruction 350 fields. Moreover, comparator 230 generates reports based on the content of the report field 340.
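By way of non-limiting illustration only, a multi-match comparison with a separate next instruction per token can be sketched in Python as follows. The function name `multi_match`, the returned tuple, and the choice to advance the pointer by the length of the matched token are assumptions introduced for the example. In the disclosed engine the pointer movement is governed by the TPF field.

```python
from typing import List, Tuple


def multi_match(traffic: bytes, pointer: int, tokens: List[bytes],
                next_instructions: List[int],
                next_on_mismatch: int) -> Tuple[bool, int, int]:
    """Try each token in turn at the current traffic pointer.

    Returns (matched, next_instruction_index, new_pointer).
    """
    for token, next_index in zip(tokens, next_instructions):
        if traffic.startswith(token, pointer):
            # The first matching token selects its own next instruction.
            return True, next_index, pointer + len(token)
    # None of the possibilities matched: report a mismatch.
    return False, next_on_mismatch, pointer
```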
A report is used to extract data from traffic stream 212. The extraction of data is done by pointing to the position of the data in traffic stream 212. Such a report includes the traffic pointer value at the beginning of the data, the pointer value at the end of the data, the instruction number, and the trigger for the report. A report trigger may be a match or a mismatch. It should be noted that despite the fact that each one of the match micro-code instructions compares a token with a limited number of bytes to traffic 212, comparator 230 may compare an unlimited number of bytes to traffic 212. In order to perform such a comparison, search engine 200 provides the ability to link an unlimited number of micro-code instructions.
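By way of non-limiting illustration only, the following Python sketch shows how a report that carries begin and end pointer values can be used to extract data, and how a long pattern can be split into short tokens suitable for a chain of linked instructions. The names `InstructionReport`, `extract` and `split_into_tokens` are hypothetical, as is the convention that the end pointer is exclusive. The four-byte limit follows the example token size mentioned earlier.

```python
from typing import List, NamedTuple


class InstructionReport(NamedTuple):
    start_pointer: int       # pointer value at the beginning of the data
    end_pointer: int         # pointer value at the end of the data (exclusive)
    instruction_number: int
    trigger: str             # "match" or "mismatch"


def extract(traffic: bytes, report: InstructionReport) -> bytes:
    """Return the data that the report points to in the traffic stream."""
    return traffic[report.start_pointer:report.end_pointer]


def split_into_tokens(pattern: bytes, max_len: int = 4) -> List[bytes]:
    """Split a long pattern into tokens, one per linked instruction."""
    return [pattern[i:i + max_len] for i in range(0, len(pattern), max_len)]
```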
NOP instruction is used for generating special reports. Such special reports are created by placing information defined by the user in the micro-code instruction. On demand, this information is copied to report memory 240. The information is placed at a user-defined field, which is part of extension field 370. The special report includes the report instruction, the required information, and the trigger for that report. In that case the trigger should be a NOP trigger.
It should be noted that a person skilled in the art could easily add new micro-code instructions by adding new op-codes, search modes, or any other relevant parameter. Furthermore, a person skilled in the art could easily change the micro-code instruction format, by adding new fields to the instruction or by changing the length of each field.
Reference is now made to
Step 530 is further detailed in
An “inconclusive match” message indicates that the traffic stream ended before it was possible to determine whether there was a pattern match or mismatch. The inconclusive match provides the ability to match tokens to fragments of a traffic stream. In a case of an inconclusive match, comparator 230 returns the “inconclusive match” message. Additionally, comparator 230 stores the current instruction number and the traffic pointer value in report memory 240. When traffic control unit 210 receives an additional packet, or packets, belonging to the designated traffic stream 212, comparator 230 uploads the match parameters from report memory 240 and continues the matching process.
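By way of non-limiting illustration only, resuming a match across fragments of a traffic stream can be sketched in Python as follows. The sketch handles a single string token that is assumed to begin at the start of the stream, so that the stored traffic pointer equals the number of token bytes already matched. The names `TerminateReport` and `match_fragment` are hypothetical.

```python
from typing import NamedTuple, Optional


class TerminateReport(NamedTuple):
    trigger: str             # "match", "mismatch" or "inconclusive"
    instruction_number: int
    traffic_pointer: int     # bytes of the stream consumed so far


def match_fragment(fragment: bytes, token: bytes,
                   resume: Optional[TerminateReport] = None) -> TerminateReport:
    """Match a token against one fragment, resuming from a stored report."""
    consumed = resume.traffic_pointer if resume else 0
    remaining = token[consumed:]
    available = fragment[:len(remaining)]
    if available != remaining[:len(available)]:
        return TerminateReport("mismatch", 0, consumed)
    if len(available) < len(remaining):
        # The fragment ended before the outcome could be decided.
        return TerminateReport("inconclusive", 0, consumed + len(available))
    return TerminateReport("match", 0, consumed + len(available))
```

For instance, matching the token `b"GET "` against the fragment `b"GE"` gives an inconclusive report, and passing that report together with the next fragment `b"T /"` completes the match.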
Reference is now made to
It should be noted that the next instruction may be the currently executing instruction. For example, in the case where the instruction includes the “skip over” search mode, comparator 230 does not fetch a new instruction but rather repeats the same instruction until the match is found or no traffic is available. Steps 630 and 640 are executed in parallel.
In step 650, comparator 230 determines, based on report field 340, whether to generate a report. If a report should be generated, then in step 660 comparator 230 passes the instruction number, the traffic pointer value, and the report trigger to report memory 240.
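By way of non-limiting illustration only, the instruction execution described above can be sketched as a small software interpreter in Python. The names `Instr` and `run`, the sample program and the sample traffic are all hypothetical. The sketch simplifies the disclosed engine in several ways: the repetition of a “skip over” instruction is folded into an inner loop, any mismatch terminates the run instead of branching to a mismatch instruction, and reports are generated on match only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Instr:
    token: bytes                    # assumed non-empty
    mode: str                       # "normal", "skip_until" or "skip_over"
    next_on_match: Optional[int]    # None ends the program
    report_on_match: bool = False


def run(program: List[Instr], traffic: bytes) -> Tuple[str, List[tuple]]:
    """Execute the program and return (terminate message, reports)."""
    reports: List[tuple] = []
    pointer = 0
    index: Optional[int] = 0
    while index is not None:
        instr = program[index]
        n = len(instr.token)
        if instr.mode == "skip_over":
            # Repeat the same instruction while the token keeps matching.
            while n > 0 and traffic.startswith(instr.token, pointer):
                pointer += n
            index = instr.next_on_match
            continue
        if instr.mode == "skip_until":
            found = traffic.find(instr.token, pointer)
            if found < 0:
                return "inconclusive", reports
            pointer = found
        if len(traffic) - pointer < n:
            # The traffic ended before the outcome could be decided.
            return "inconclusive", reports
        if traffic[pointer:pointer + n] != instr.token:
            return "mismatch", reports
        pointer += n
        if instr.report_on_match:
            # Pass the instruction number, pointer value and trigger.
            reports.append((index, pointer, "match"))
        index = instr.next_on_match
    return "match", reports


program = [
    Instr(b"GET", "normal", 1, report_on_match=True),
    Instr(b" ", "skip_over", 2),
    Instr(b"\r\n", "skip_until", 3),
    Instr(b"Host:", "normal", None, report_on_match=True),
]
```

Calling `run(program, b"GET  /index.html HTTP/1.1\r\nHost: example.com\r\n")` walks through all four instructions and collects one report for each instruction whose report flag is set.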
Reference is now made to
GET % *\r\n(Host: % \r\n|User: % \r\n).
The pattern is matched to the traffic stream shown in
In the example, state machine 700 consists of nodes 710-1 through 710-8, each representing a micro-code instruction. Each edge between the nodes represents a token to be matched. In state 710-1 the “GET” token is matched. The instruction used in state 710-1 is the “string match” instruction. In state 710-2, comparator 230 attempts to match blank or “space” characters; the search mode defined in the instruction is the “skip over” mode. Therefore, comparator 230 does not step forward from state 710-2 until matching a character different from the space character. This is done in order to skip over one or more blank characters that may appear in the traffic stream. In states 710-3 and 710-4, comparator 230 identifies the characters for a new line (e.g., “\r\n”). First, comparator 230 in state 710-3 attempts to match the character “\r”; in the case of a mismatch, comparator 230 stays at state 710-3 until matching “\r”. The next character coming after “\r” must be “\n”; therefore, the instruction at state 710-4 uses the “normal search mode”. The instruction used in state 710-5 is the “multi-match” instruction; therefore, state 710-5 includes two tokens to be matched, “Host:” and “User:”. In case of a match, comparator 230 branches to state 710-6; otherwise the process is terminated. The matching process ends at state 710-8, which represents a NOP instruction. State 710-8 generates the terminate message; in this example the terminate message would be “match”. States 710-1 and 710-5 should report in a case of match; therefore, comparator 230 generates a report in each state. The report includes the position within the traffic length of the data that have been extracted. The reports can be seen in
Other modifications and variations to the invention will be apparent to those skilled in the art from the foregoing disclosure and teachings. Thus, while only certain embodiments of the invention have been specifically described herein, it will be apparent that numerous modifications may be made thereto without departing from the spirit and scope of the invention.
Number | Date | Country | |
---|---|---|---|
20030204584 A1 | Oct 2003 | US |