Regular expressions are patterns of characters used for matching sequences of characters in text. For example, a regular expression can be used to test whether a sequence of characters has an allowed pattern corresponding to a credit card number or a Social Security number. Regular expressions (abbreviated as regexp or regex) are used by many text editors and utilities to search and manipulate bodies of text based on certain patterns. Many programming languages support regular expressions for string manipulation. For example, Perl has a regular expression engine built directly into its syntax. The set of utilities provided by Unix was the first to popularize the concept of regular expressions.
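As a brief illustration (not part of the original disclosure), the following Python fragment tests whether a string has the allowed pattern of a U.S. Social Security number, ddd-dd-dddd:

    import re

    # Three digits, a hyphen, two digits, a hyphen, four digits.
    ssn = re.compile(r"^\d{3}-\d{2}-\d{4}$")

    print(bool(ssn.match("123-45-6789")))   # True
    print(bool(ssn.match("123-456-789")))   # False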
A regular expression defining a regular language is compiled into a recognizer by constructing a generalized transition diagram called a finite automaton. The finite automaton is a method of algorithmically recognizing the patterns specified by the regular expression. A finite automaton can be deterministic or nondeterministic, where “nondeterministic” means that more than one transition out of a state may be possible on the same input symbol.
Both Deterministic Finite Automata (DFA) and Nondeterministic Finite Automata (NDFA) are capable of recognizing precisely the regular sets. Thus finite automata can recognize exactly what a regular expression denotes. However, there is a time-space tradeoff: while deterministic finite automata lead to faster recognizers than nondeterministic automata, a deterministic finite automaton can be much more complex than an equivalent nondeterministic automaton. Some classes of regular expressions can only be recognized by deterministic automata that grow exponentially in size, while the corresponding regular expression grows only linearly.
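A standard textbook illustration of this exponential gap (offered here as background, not taken from the source) is the language of strings over {a, b} whose nth symbol from the end is an “a”, written (a|b)*a(a|b)^(n-1). Its nondeterministic automaton needs only n+1 states, but the subset construction below shows that the equivalent deterministic automaton has 2^n reachable states:

    # Count the reachable DFA states produced by subset construction for
    # the NFA of (a|b)*a(a|b)^(n-1): state 0 loops on both symbols and
    # moves to state 1 on 'a'; states 1..n-1 advance on either symbol;
    # state n is accepting.
    def dfa_size(n):
        start = frozenset([0])

        def step(states, sym):
            out = set()
            for s in states:
                if s == 0:
                    out.add(0)
                    if sym == 'a':
                        out.add(1)
                elif s < n:
                    out.add(s + 1)
            return frozenset(out)

        seen, frontier = {start}, [start]
        while frontier:
            st = frontier.pop()
            for sym in 'ab':
                nxt = step(st, sym)
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
        return len(seen)

    for n in range(1, 6):
        print(n, dfa_size(n))   # prints 2, 4, 8, 16, 32 -- i.e., 2^n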
Thus, current computer architectures have only a limited ability to execute DFAs. This is primarily due to the large number of states that have to be maintained. For each state, the computer has to execute more instructions and manage more state variables and data located either in registers or in main memory. Further, the highly complex interrelationships between the different states often make it difficult to modify an existing DFA algorithm with new search criteria.
If two back-to-back W characters are detected, the DFA 12 moves to state S2. The processor implementing DFA 12 moves into state S3 when three contiguous W characters are detected, and moves to state S4 when the three contiguous W's are immediately followed by a period “.” character.
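A minimal table-driven sketch of states S0 through S4 is shown below; the transition table is reconstructed from the description above as an assumption, not the actual contents of DFA 12:

    # S0 = start, S1 = "W", S2 = "WW", S3 = "WWW", S4 = "WWW." detected.
    delta = {
        ('S0', 'W'): 'S1',
        ('S1', 'W'): 'S2',
        ('S2', 'W'): 'S3',
        ('S3', 'W'): 'S3',   # assumption: extra W's keep the last three live
        ('S3', '.'): 'S4',
    }

    def step(state, ch):
        # Undefined transitions fall back to the start, except that a 'W'
        # always begins a new potential "WWW." match.
        if (state, ch) in delta:
            return delta[(state, ch)]
        return 'S1' if ch == 'W' else 'S0'

    state = 'S0'
    for ch in "XXWWW.":
        state = step(state, ch)
    print(state)   # 'S4'; the full DFA 12 branches onward from here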
Notice that in this example, a branch occurs at state S4. When the character string “WWW.” is detected, the processor, in states S9, S10, S11, and S12, searches for the second piece of the URL containing the extension “.ORG”. However, the processor might also need to determine whether another “WWW.” string occurs while searching for “.ORG”. For example, the first detected “WWW.” character string may have been used in text that is not associated with the URL “WWW.XXX.ORG”. Therefore, a separate set of states S5, S6, and S7 have to be maintained in the DFA 12 for the possibility that the input data 14 may contain a character sequence such as: “WWW.XXXXXXWWW.XXX.ORG”.
The Problems With Deterministic and Non-Deterministic Finite Automaton Algorithms

Additional character string matches, longer character string matches, and branch operations all substantially increase the number of states that have to be maintained in DFA engine 30. For example, the number of input characters 18 fed into PLD 26 may be J bits wide and the state vector 24 output by the PLD 26 may be K bits wide. While different algorithms are used to minimize the complexity of state table 22, the size of the logic array used in PLD 26 may still be: state table size = 2^(J+K).
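For concreteness (the widths here are hypothetical, not taken from the source), a J = 8 bit input character combined with a K = 5 bit state vector already requires the logic array to cover 2^(8+5) = 8,192 input/state combinations:

    # Illustrative widths only: J input bits plus K state-vector bits.
    J, K = 8, 5
    print(2 ** (J + K))   # 8192 combinations the PLD logic array must cover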
The physical size limitation of PLD 26 restricts the DFA engine 30 to relatively low-complexity character string searches. The PLD 26 is predictable as long as the state table 22 does not exceed the capacity of PLD 26. However, the number of DFA states in the DFA engine 30 continues to increase for each additional character added to the search string. Thus, adding just one additional search string, or search character, to the DFA algorithm can possibly exceed the capacity of PLD 26.
For example, the character string “WWWW.XXX.ORG” might need to be searched instead of the search string “WWW.XXX.ORG” previously shown in
It is also difficult to reconfigure the DFA engine 30 for new character searches. Even if additional characters are not added, changing just one character in a search string may require reconfiguration of the entire DFA state table 22. For example, changing the desired search string from “WWW.XXX.ORG” to “WOW.XXX.ORG” may change many of the state transitions in state table 22. This is further complicated by any state optimizations or minimizations that are performed to reduce the overall size of DFA state table 22. As a result, the size and operation of the DFA engine 30 can be unpredictable.
Current search techniques, including the regular expression implementation in the Linux® operating system, are based on DFA algorithms. The DFA algorithm may be simulated in software where the entire state table 22 is stored in memory. Other systems implement the DFA state table 22 using a programmable hardware device, such as the PLD 26 shown in
The present invention addresses this and other problems associated with the prior art.
A computer architecture uses a PushDown Automaton (PDA) and a Context Free Grammar (CFG) to process data. A PDA engine maintains semantic states that correspond to semantic elements in an input data set. The PDA engine does not have to maintain a new state for each new character in a target search string and typically only transitions to a new state when the entire semantic element is detected. The PDA engine can therefore use a smaller and more predictable state table than DFA algorithms. Transitions between the semantic states are managed using a stack that allows multiple semantic states to be represented by a single nested non-terminal symbol.
The foregoing and other objects, features and advantages of the invention will become more readily apparent from the following detailed description of a preferred embodiment of the invention which proceeds with reference to the accompanying drawings.
An index 54 is output by semantic table 42 that corresponds to an entry 46, 44 that matches the combined symbol 62 and input data segment 60. A semantic state map 48 identifies a next non-terminal symbol 54 that represents a next semantic state for the PDA engine 40. The next non-terminal symbol 54 is pushed onto a stack 52 and then popped from the stack 52 for combining with a next segment 60 of the input data 14. The PDA engine 40 continues parsing through the input data 14 until the target search string 16 is detected.
The PDA engine 40 shown in
Further, referring to
This is different from DFA algorithms, which maintain states for each indiscriminate bit or byte that makes up a piece of the semantic element. For example, referring back to
Conversely, the PDA engine 40 in
Conversely, the DFA state table 22 in
The PDA engine 40 can also reduce or eliminate state branching. For example, as described above in
The PDA engine 40 eliminates these additional branching states by nesting the possibility of a second “WWW.” string into the same semantic state 72 that searches for the “.ORG” semantic element. This is represented by path 75 in
Another aspect of the PDA engine 40 is that additional search strings can be added without substantially impacting or adding to the complexity of the semantic table 42. Referring to
Thus, the PDA architecture in
Example Implementation
It should also be noted that the PDA engine 40 can be implemented in software so that the semantic table 42, semantic state map 48, and stack 52 are all locations in a memory accessed by a Central Processing Unit (CPU). The general-purpose CPU then implements the operations described below. Another implementation uses a Reconfigurable Semantic Processor (RSP) that is described in more detail below in
In this example, a Content Addressable Memory (CAM) is used to implement the semantic table 42. Alternative embodiments may use a Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The semantic table 42 is divided up into semantic state sections 46 that, as described above, may contain a corresponding non-terminal (NT) symbol. In this example, the semantic table 42 contains only two semantic states. A first semantic state in section 46A is identified by non-terminal NT1 and associated with the semantic element “WWW.”. A second semantic state in section 46B is identified by non-terminal NT2 and associated with the semantic element “.ORG”.
A second section 44 of semantic table 42 contains different semantic entries corresponding to semantic elements in input data 14. The same semantic entry can exist multiple times in the same semantic state section 46. For example, the semantic entry “WWW.” can be located in different positions in section 46A to identify different locations where the semantic element “WWW.” may appear in the input data 14. This is only one example, and is used to further optimize the operation of the PDA engine 40. In an alternative embodiment, a particular semantic entry may be used only once and the input data 14 sequentially shifted into input buffer 61 to check each different data position.
The second semantic state section 46B in semantic table 42 effectively includes two semantic entries. A “.ORG” entry is used to detect the “.ORG” string in the input data 14 and a “WWW.” entry is used to detect a possible second “WWW.” string in the input data 14. Again, multiple different “.ORG” and “WWW.” entries are optionally loaded into section 46B of semantic table 42 for parsing optimization. It is equally possible to use one “WWW.” entry and one “.ORG” entry, or fewer entries than shown in
The semantic state map 48, in this example, contains three different sections. However, fewer sections may also be used. A next state section 80 maps a matching semantic entry in semantic table 42 to a next semantic state used by the PDA engine 40. A Semantic Entry Point (SEP) section 78 is used to launch microinstructions for a Semantic Processing Unit (SPU) that will be described in more detail below. This section is optional and PDA engine 40 may alternatively use the non-terminal symbol identified in next state section 80 to determine other operations to perform next on the input data 14.
For example, when the non-terminal symbol NT3 is output from map 48, a corresponding processor (not shown) knows that the URL string “WWW.XXX.ORG” has been detected in input data 14. The processor may then conduct whatever subsequent processing is required on the input data 14 after PDA engine 40 identifies the URL. Thus, the SEP section 78 is just one optimization in the PDA engine 40 that may or may not be included.
A skip bytes section 76 identifies the number of bytes from input data 14 to shift into input buffer 61 in a next operation cycle. A Match All Parser entries Table (MAPT) 82 is used when there is no match in semantic table 42.
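To make these structures concrete, the sketch below models semantic table 42, semantic state map 48, and the MAPT 82 defaults as Python data structures. The field layout and entry values are assumptions for illustration; the actual CAM and map formats are implementation specific:

    # Semantic table 42: (state NT, semantic entry) -> index 54 into map 48.
    SEMANTIC_TABLE = {
        ("NT1", "WWW."): 0,
        ("NT2", ".ORG"): 1,
        ("NT2", "WWW."): 2,   # nested restart entry (path 75)
    }

    # State map 48: (next NT from section 80, SEP from section 78 or None,
    # skip-byte count from section 76).
    STATE_MAP = [
        ("NT2", None,   4),
        ("NT3", "SEP1", 4),
        ("NT2", None,   4),
    ]

    # MAPT 82 defaults used on a miss: keep the state and shift one byte.
    MAPT_DEFAULT = {"NT1": ("NT1", None, 1), "NT2": ("NT2", None, 1)}

    print(STATE_MAP[SEMANTIC_TABLE[("NT1", "WWW.")]])   # ('NT2', None, 4)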
Execution
A special end of operation symbol “$” is first pushed onto stack 52 along with the initial non-terminal symbol NT1 representing a first semantic state associated with searching for the URL. The NT1 symbol and a first segment 60 of the input data 14 are loaded into input buffer 61 and applied to CAM 90. In this example, the contents in input buffer 61 do not match any entries in CAM 90. Accordingly, the pointer 54 generated by CAM 90 points to a default NT1 entry in MAPT table 82. The default NT1 entry directs the PDA engine 40 to shift one additional byte of input data 14 into input buffer 61. The PDA engine 40 then pushes another non-terminal NT1 symbol onto stack 52.
Map entry 48B also identifies the number of bytes that the PDA engine 40 needs to shift the input data 14 for the next parsing cycle. In this example, since the “WWW.” string was detected in the first four bytes of the input buffer 61, the skip bytes value in entry 48B directs the PDA engine 40 to shift another 8 bytes into the input buffer 61. The skip value is hardware dependent, and can vary according to the size of the semantic table 42. Of course other hardware implementations can also be used that have larger or smaller semantic table widths.
Note that during the last two PDA cycles there was no change in the semantic state represented by non-terminal NT2. There was no state transition even though the first three characters “.OR” in the second semantic element “.ORG” were received by the PDA engine 40. This is contrary to the DFA engine 30 shown in
Map entry 48D also includes a pointer SEP1 that optionally launches microinstructions that are executed by a Semantic Processing Unit (SPU) (see
Concurrently with the launching of the SEP micro-instructions for the SPU, the map entry 48D may also direct the PDA engine 40 to push the new semantic state represented by non-terminal NT3 onto stack 52. This may cause the PDA engine 40 to start conducting a different search for other semantic elements in the input data 14 following the detected URL 16. For example, as shown in
Thus, the PDA engine 40 identifies the URL with substantially fewer states than the DFA engine 30 shown in
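The following self-contained sketch ties these pieces together. It is a software approximation written under the assumptions noted in the comments, not the actual PDA engine 40; it folds the semantic table and state map into a single dictionary and uses a fixed eight-byte input window:

    # Simplified model of the parse cycle described above. Table contents,
    # window width, and skip counts are illustrative assumptions.
    SEM_TABLE = {
        # (current state NT, semantic element) -> (next NT, bytes to skip)
        ("NT1", "WWW."): ("NT2", 4),
        ("NT2", "WWW."): ("NT2", 4),   # nested restart (path 75)
        ("NT2", ".ORG"): ("NT3", 4),
    }

    def parse(data, accept="NT3"):
        stack = ["$", "NT1"]           # "$" is the end-of-operation symbol
        pos = 0
        while pos < len(data):
            nt = stack.pop()
            if nt == "$":
                return -1              # stack exhausted before acceptance
            window = data[pos:pos + 8]
            for (state, element), (nxt, skip) in SEM_TABLE.items():
                if state == nt and window.startswith(element):
                    if nxt == accept:
                        return pos + skip   # end of the URL found
                    stack.append(nxt)
                    pos += skip
                    break
            else:
                stack.append(nt)       # MAPT default: keep the same state,
                pos += 1               # shift one byte, and retry
        return -1

    print(parse("WWW.XXXXXXWWW.XXX.ORG"))   # 21: URL detected at end of input

Note how the single ("NT2", "WWW.") entry realizes the nesting described above: a second “WWW.” merely re-anchors the search within the same semantic state, with no duplicated states.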
As also previously mentioned above in FIGS. 4-6, the semantic states in the PDA engine 40 are substantially independent of search string length. For example, a longer search string “WWWW.” can be searched instead of “WWW.” simply by replacing the semantic entries “WWW.” in semantic table 42 with the longer semantic entry “WWWW.” and then accordingly adjusting the skip byte values in map 48.
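Under the same assumed table layout as the earlier sketches, this kind of reconfiguration is a pure data change:

    # Hypothetical entries; lengthening the search string swaps one table
    # entry and adjusts its skip count, rather than rebuilding a state table.
    SEM_TABLE = {("NT1", "WWW."): ("NT2", 4), ("NT2", ".ORG"): ("NT3", 4)}

    del SEM_TABLE[("NT1", "WWW.")]
    SEM_TABLE[("NT1", "WWWW.")] = ("NT2", 5)   # new entry, adjusted skip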
Conversely, the DFA engine 30 in
Reconfigurable Semantic Processor (RSP)
A Direct Execution Parser (DXP) 180 implements the PDA engine 40 and controls the processing of packets or frames received at the input buffer 140 (e.g., the input “stream”), output to the output buffer 150 (e.g., the output “stream”), and re-circulated in a recirculation buffer 160 (e.g., the recirculation “stream”). The input buffer 140, output buffer 150, and recirculation buffer 160 are preferably first-in-first-out (FIFO) buffers.
The DXP 180 also controls the processing of packets by a Semantic Processing Unit (SPU) 200 that handles the transfer of data between buffers 140, 150 and 160 and a memory subsystem 215. The memory subsystem 215 stores the packets received from the input port 120 and may also store an Access Control List (ACL) in CAM 220 used for Unified Policy Management (UPM), firewall, virus detection, and any other operations described in co-pending patent applications: NETWORK INTERFACE AND FIREWALL DEVICE, Ser. No. 11/187,049, filed Jul. 21, 2005; and INTRUSION DETECTION SYSTEM, Ser. No. 11/125,956, filed May 9, 2005, which have both already been incorporated by reference.
The RSP 100 uses at least three tables to implement a given PDA algorithm. Codes 178 for retrieving production rules 176 are stored in a Parser Table (PT) 170. The parser table 170 in one embodiment contains the semantic table 42 shown in
Codes 178 in parser table 170 are stored, e.g., in a row-column format or a content-addressable format. In a row-column format, the rows of the parser table 170 are indexed by a non-terminal code NT 172 provided by an internal parser stack 185. The parser stack 185 in one embodiment is the stack 52 shown in
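A toy illustration of this lookup chain follows; the table contents and symbol names are invented for the sketch and are not the RSP's actual codes or production rules:

    # Parser table 170: (non-terminal, input data segment) -> code 178.
    PARSER_TABLE = {("NT_URL", "WWW."): "code_0"}
    # Production rule table 190: code 178 -> production rule 176.
    PRODUCTION_RULES = {"code_0": ["NT_SUFFIX"]}

    stack = ["$", "NT_URL"]            # parser stack 185
    window = "WWW.XXX."                # current input data segment
    nt = stack.pop()
    code = PARSER_TABLE.get((nt, window[:4]))
    if code is not None:
        for symbol in reversed(PRODUCTION_RULES[code]):
            stack.append(symbol)       # push the rule's right-hand side
    print(stack)                       # ['$', 'NT_SUFFIX']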
The semantic code table 210 is also indexed according to the codes 178 generated by parser table 170, and/or according to the production rules 176 generated by production rule table 190. Generally, parsing results allow DXP 180 to detect whether, for a given production rule 176, a Semantic Entry Point (SEP) routine 212 from semantic code table 210 should be loaded and executed by SPU 200.
The SPU 200 has several access paths to memory subsystem 215 which provide a structured memory interface that is addressable by contextual symbols. Memory subsystem 215, parser table 170, production rule table 190, and semantic code table 210 may use on-chip memory, external memory devices such as Synchronous Dynamic Random Access Memories (DRAMs) and Content Addressable Memories (CAMs), or a combination of such resources. Each table or context may merely provide a contextual interface to a shared physical memory space with one or more of the other tables or contexts.
A Maintenance Central Processing Unit (MCPU) 56 is coupled between the SPU 200 and memory subsystem 215. MCPU 56 performs any desired functions for RSP 100 that can reasonably be accomplished with traditional software and hardware. These functions are usually infrequent, non-time-critical functions that do not warrant inclusion in SCT 210 due to complexity. Preferably, MCPU 56 also has the capability to request the SPU 200 to perform tasks on the MCPU's behalf.
The memory subsystem 215 contains an Array Machine-Context Data Memory (AMCD) 230 for accessing data in DRAM 280 through a hashing function or Content-Addressable Memory (CAM) lookup. A cryptography block 240 encrypts, decrypts, or authenticates data and a context control block cache 250 caches context control blocks to and from DRAM 280. A general cache 260 caches data used in basic operations and a streaming cache 270 caches data streams as they are being written to and read from DRAM 280. The context control block cache 250 is preferably a software-controlled cache, i.e., the SPU 200 determines when a cache line is used and freed. Each of the circuits 240, 250, 260 and 270 is coupled between the DRAM 280 and the SPU 200. A TCAM 220 is coupled between the AMCD 230 and the MCPU 56 and contains an Access Control List (ACL) table and other parameters that may be used for conducting firewall, unified policy management, or other intrusion detection operations.
Detailed design optimizations for the functional blocks of RSP 100 are described in co-pending application Ser. No. 10/351,030, entitled: A Reconfigurable Semantic Processor, filed Jan. 24, 2003, which is herein incorporated by reference.
Parser Table
As described above in
Since the TCAM employs the “Don't Care” capability and there can be multiple TCAM entries for a single NT, the TCAM can find multiple matching TCAM entries for a given NT code and DI[n] match value. The TCAM prioritizes these matches through its hardware and only outputs the match of the highest priority. Further, when an NT code and a DI[n] match value are submitted to the TCAM, the TCAM attempts to match every TCAM entry with the received NT code and DI[n] match code in parallel. Thus, the TCAM has the ability to determine whether a match was found in parser table 170 in a single clock cycle of semantic processor 100.
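The following sketch emulates that behavior in software; the entries, masks, and priority ordering are illustrative assumptions, and the loop only simulates what a real TCAM does in parallel in one clock:

    # Each entry: (NT code, match value, mask). Bytes where the mask is
    # 0xFF must match exactly; 0x00 bytes are "Don't Care". A lower index
    # models a higher hardware priority.
    ENTRIES = [
        ("NT1", b"WWW.\x00\x00\x00\x00", b"\xff\xff\xff\xff\x00\x00\x00\x00"),
        ("NT1", b"\x00WWW.\x00\x00\x00", b"\x00\xff\xff\xff\xff\x00\x00\x00"),
    ]

    def tcam_lookup(nt, di):
        for idx, (entry_nt, value, mask) in enumerate(ENTRIES):
            if entry_nt != nt:
                continue
            if all((d & m) == (v & m) for d, v, m in zip(di, value, mask)):
                return idx             # highest-priority match wins
        return None                    # miss: fall back to a default entry

    print(tcam_lookup("NT1", b"WWW.XXXX"))   # 0
    print(tcam_lookup("NT1", b"XWWW.XXX"))   # 1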
Another way of viewing this architecture is as a “variable look-ahead” parser. Although a fixed data input segment, such as eight bytes, is applied to the TCAM, the TCAM coding allows a next production rule (or semantic entry as described in
The TCAM implementation of the production rule table 170 is described in further detail in co-pending patent application entitled: PARSER TABLE/PRODUCTION RULE TABLE CONFIGURATION USING CAM AND SRAM, Ser. No. 11/181,527, filed Jul. 14, 2005, which is herein incorporated by reference.
The preceding embodiments are exemplary. Although the specification may refer to “an”, “one”, “another” or “some” embodiment(s) in several locations, this does not necessarily mean that each such reference is to the same embodiment(s), or that the feature only applies to a single embodiment.
The system described above can use dedicated processor systems, microcontrollers, programmable logic devices, or microprocessors that perform some or all of the operations. Some of the operations described above may be implemented in software and other operations may be implemented in hardware.
For the sake of convenience, the operations are described as various interconnected functional blocks or distinct software modules. This is not necessary, however, and there may be cases where these functional blocks or modules are equivalently aggregated into a single logic device, program or operation with unclear boundaries. In any event, the functional blocks and software modules or features of the flexible interface can be implemented by themselves, or in combination with other operations in either hardware or software.
Having described and illustrated the principles of the invention in a preferred embodiment thereof, it should be apparent that the invention may be modified in arrangement and detail without departing from such principles. Claim is made to all modifications and variations coming within the spirit and scope of the following claims.
This application claims priority to U.S. Provisional Patent Application No. 60/701,748, filed Jul. 22, 2005; and is a continuation-in-part of copending, commonly-assigned U.S. patent application Ser. No. 10/351,030, filed on Jan. 24, 2003, which is herein incorporated by reference in its entirety.
Number | Date | Country
60/701,748 | Jul 2005 | US

Relation | Number | Date | Country
Parent | 10/351,030 | Jan 2003 | US
Child | 11/458,544 | Jul 2006 | US