The need to perform sophisticated, high performance searching of data is driven by the desire for high performance quality-of-service (QoS) and signature-based security systems. Such security systems include intrusion detection, virus scanning, content classification, network surveillance, spam filtering, etc. The sophisticated requirements of these searching domains make the use of simple literal textual searching inadequate. A common paradigm that is used in these search domains is that of regular expression searching.
Regular expressions are patterns built up by combining literal text with special operators. These operators are textual characters that have been deemed to convey special meaning. Minimal regular expression syntax comprises literal text combined with the following operators, as shown in Table I below:
This minimal expression syntax is frequently extended with the following standard operators shown in Table II below:
Regular expressions are patterns against which an input stream may succeed or fail to match. Thus, they may be used as the basis of sophisticated searching systems. The conventional regular expression syntax can be extended to include the concept of action tags. Action tags are a postfix notation used to associate a number with a place in a regular expression. The semantics of actions is that when the regular expression, implemented in a suitable pattern matching architecture, matches up to the tagged point, the action tag is generated as an event. The following regular expression:
generates the event 1 when “dog” is matched, the event 2 when “cat” is matched and the events 2 and 3 on the input string “catfish”. Regular expressions using this extended syntax are referred to as “action tagged” regular expressions.
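The event behavior described above can be sketched in Python. This is a minimal illustration only: the table `TAGGED_PREFIXES` and the function `emit_events` are hypothetical names, and the table is an assumed stand-in chosen to reproduce the described events rather than the tagged expression's own notation.

```python
# Hypothetical stand-in for an action tagged expression: each entry pairs a
# literal prefix with the tag emitted once that prefix has been matched.
TAGGED_PREFIXES = [("dog", 1), ("cat", 2), ("catfish", 3)]


def emit_events(text):
    """Return the tags generated while matching `text` from its first character."""
    events = []
    for end in range(1, len(text) + 1):
        for prefix, tag in TAGGED_PREFIXES:
            # A tag fires when the input consumed so far equals the tagged
            # prefix, i.e. the match has reached the tagged point.
            if text[:end] == prefix:
                events.append(tag)
    return events
```

Under these assumptions, `emit_events("catfish")` yields the event 2 after "cat" and the event 3 after "catfish".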
High throughput searching systems that use regular expressions rely on a high speed implementation of regular expression matching. The most common method for implementing high speed regular expression matching is the use of a Finite State Automaton representation.
A Finite State Automaton (FSA) [see
Finite state automata come in two forms: Deterministic and Non-deterministic. If the transition function gives a single new state for any given current state and current symbol, the automaton is said to be a Deterministic Finite State Automaton (a DFA) [see
A finite state automaton with a transition function that generates more than one “next state” for some current state, current symbol combination, is said to be a Non-deterministic Finite State Automaton (an NFA) [see
A regular expression may be converted into a DFA or an NFA through the use of an appropriate algorithm (see
The heavy processing requirements of high performance searching systems demand specialized hardware or software solutions. General software solutions, run on conventional hardware using a general purpose operating system, are unable to maintain the high throughput and constancy of throughput that is required of solutions in such domains.
In order to satisfy the constant throughput requirements of high performance searching it is necessary to build a system with a worst case performance that exceeds the required throughput or to build a system based on constant throughput algorithms and data structures. As the amount of data over which searches must be performed is growing faster than the rate of increase in processing power [ref], the provision of reasonable cost systems that can guarantee sufficiently high worst case performance is impractical. Development of practical high speed constant throughput devices is thus dependent on the use of constant throughput algorithms and data structures. The use of searching algorithms based on DFAs provides one solution.
Deterministic Finite State Automata use large amounts of memory to represent the required action for every possible situation that can arise during data searching. This is conventionally represented by a transition table giving, for each state, the appropriate next state for each possible input symbol. By explicitly representing the required action for every possible situation it is possible to keep the processing time to decide each such action to a constant. However, the large memory requirements of DFA based searching systems make their use prohibitively expensive in many searching domains. In particular, for certain regular expressions, such as those of the form:
It is known to those skilled in the art that a DFA representation will require a number of states that is exponential in the length of the expression. This implies that simply increasing the available memory will never be a sufficient solution. What is required is a system that preserves as much as possible of the constant throughput benefit of DFA based searching while reducing the overhead of the associated large memory requirements.
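The exponential growth can be illustrated with a standard textbook family of languages, which is not necessarily the form referred to above: strings over the symbols a and b whose n-th symbol from the end is a. An NFA for this language needs n+1 states, while the subset construction reaches 2^n DFA states, and the minimal DFA is known to need the same number. The following Python sketch counts the reachable subsets; the function names are illustrative assumptions.

```python
def nfa_step(states, symbol, n):
    """Advance the NFA for 'the n-th symbol from the end is a' by one symbol."""
    nxt = set()
    for s in states:
        if s == 0:
            nxt.add(0)          # state 0 loops on every symbol
            if symbol == "a":
                nxt.add(1)      # guess that this 'a' is n-th from the end
        elif s < n:
            nxt.add(s + 1)      # count off the remaining symbols
    return frozenset(nxt)


def count_dfa_states(n):
    """Count the DFA states reachable through the subset construction."""
    start = frozenset({0})
    seen = {start}
    frontier = [start]
    while frontier:
        current = frontier.pop()
        for symbol in "ab":
            nxt = nfa_step(current, symbol, n)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return len(seen)
```

Each additional position in the expression doubles the state count in this family, which is why adding memory alone cannot keep pace.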
In accordance with the present invention, an apparatus and a method are provided to produce, from a regular expression, a configuration of a multi-level system while significantly reducing the overall memory requirements and, in particular, reducing the memory requirements of the lowest level DFA based layer of the generated multi-level system.
This invention relates to the automated transformation of regular expressions. In accordance with the present invention, a plurality of regular expressions is transformed into a second form—that includes a representation and collection of segments—whereby the language embodied by the second form is an approximation of the language embodied in the original plurality of expressions.
The method of the invention derives the second form, mentioned above, by deriving a first form, dividing this first form into a first collection of segments and producing a first representation that embodies relationships between the segments in the first collection. This first form is then transformed into the abovementioned second form.
Another object of the invention is to extract, from a plurality of regular expressions, features that can be efficiently represented as a Finite State Automaton (FSA) or a set of FSAs while maintaining a representation of higher order features of the regular expression to facilitate use in a multi-level pattern matching system.
Yet another object of the current invention is to facilitate the distribution of an implementation of a pattern matching system for regular expressions over multiple levels of a system; for example, a system comprising a software program supported by accelerated pattern matching hardware.
A further object of the invention is to generate a second form from a plurality of regular expressions, as described above, such that the collection of segments can be implemented in resource limited environments such as pattern matching acceleration hardware.
Another object of the invention is to translate a plurality of regular expressions into a second form, the overall space requirements of an implementation of which are less than those of a simple single level implementation of said regular expressions as a Finite State Automaton (FSA). Various other objects of the present invention are apparent in view of the description provided below.
The invention described below is a method of transforming a plurality of regular expressions into a second form suitable for configuring one of a number of searching systems. Details of the invention are presented, describing its operation in generating configuration information for each of a variety of different searching apparatuses. In each such description the searching apparatus that is to be configured is referred to as the “destination apparatus” and is described along with the particular aspects of the method of the invention that are relevant to such an apparatus.
The second representation [108] generated by the method of the invention embodies some or all of the higher level semantic structure of the input regular expressions [102] that is lost in the segmentation process. This second form is derived in a form for use with a choice of high-level pattern matching architectures [107], or a hierarchy of progressively higher level pattern matching apparatuses.
The collection and representation comprising the second form derived by the method of the invention is used to configure a hierarchy of pattern matching architectures comprising at least two levels. The lowest level is any conventional single level pattern matching architecture of a style familiar to those skilled in the art. The higher levels comprise apparatuses selected from a variety of different types.
In the destination apparatus depicted in
Through the above described operational procedure the destination apparatus is able to perform pattern matching that is almost identical to matching using the original regular expression while requiring significantly less storage than would be required by a single level system that represented the input regular expression as a single DFA. The differences between the matching behavior of the presented apparatus, as configured by the method of the invention, and the matching behavior of an implementation of the input regular expression as a single DFA are recognizable by those skilled in the art as being insignificant in almost all domains in which such searching is performed. The breaking up of the original regular expression into a collection of segments reduces the possibility of exponential space requirements, well understood by those skilled in the art, that are typical of DFA representations of complex regular expressions.
In another embodiment of the invention (see
The split candidates are used in the segmentation process to produce a collection of sub-expressions of the first regular expression, e.g. the expression would be split at any occurrence of the sub-expression “.*” or “[\n]*”, either discarding or retaining the identified split candidates. The segmentation is performed by producing a canonical representation of any sub-expressions (sub-trees of the parse tree) resulting from the splitting process. These sub-expressions resulting from segmentation are each assigned a unique “tag”, then recombined disjunctively (using the “|” operator) to form a second regular expression.
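The segmentation and recombination steps above can be sketched as follows. This is a simplified illustration under stated assumptions: it works on the expression text rather than on a parse tree, discards the split candidates, ignores escaping (an escaped "\.*" would be split incorrectly), and uses hypothetical names (`SPLIT_CANDIDATES`, `segment`, `recombine`) and small integers as tags.

```python
# Table of candidate literals; only ".*" is used in this sketch.
SPLIT_CANDIDATES = [".*"]


def segment(expression):
    """Split `expression` at each split candidate and tag the distinct pieces."""
    pieces = [expression]
    for candidate in SPLIT_CANDIDATES:
        pieces = [part for piece in pieces for part in piece.split(candidate)]
    tags = {}
    for piece in pieces:
        # Skip empty pieces; give each distinct segment the next unused tag.
        if piece and piece not in tags:
            tags[piece] = len(tags) + 1
    return tags


def recombine(tags):
    """Join the tagged segments disjunctively into a second regular expression."""
    return "|".join(tags)
```

For the assumed input `"abc.*def.*abc"`, `segment` returns `{"abc": 1, "def": 2}` and `recombine` produces `"abc|def"`.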
A second representation of the original regular expression is produced by replacing the sub-expression parse sub-trees (corresponding to the elements of the collection produced in abovementioned segmentation) with proxy nodes representing the unique tags previously assigned. The parse tree thus generated, is translated into a finite state automaton through conventional algorithms known to those skilled in the art, (see 605,
The second regular expression generated in the recombination step is compiled, using extended algorithms, to a form for use on a hardware pattern matching device. This device generates the unique tags assigned in the recombination step in response to matching any of the sub-expressions in the collection generated in the segmentation process (see 604,
Matches identified by the secondary state machine correspond to matches of the semantic requirements embodied in the second representation. The semantics embodied by the secondary state machine define a formal language that is a superset of the formal language specified by the original regular expression. It is understood by those skilled in the art that division of the matching process into a two level system loses a small amount of information embodied in the original regular expression, consequently loosening the semantic requirements for matching and thus increasing the size of the formal language.
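The superset relationship can be seen in a small Python sketch. The expression `ab.*bc`, its segments, the tag values, and all function names are assumptions chosen for illustration; the low level is modeled with plain substring search rather than a compiled DFA.

```python
import re

SEGMENTS = {"ab": 1, "bc": 2}   # low level segments and their unique tags
TAG_SEQUENCE = [1, 2]           # high level requirement: tag 1, later tag 2


def low_level_events(text):
    """Emit (end_position, tag) for every segment occurrence, in input order."""
    events = []
    for literal, tag in SEGMENTS.items():
        start = text.find(literal)
        while start != -1:
            events.append((start + len(literal), tag))
            start = text.find(literal, start + 1)
    return sorted(events)


def two_level_match(text):
    """Secondary state machine: advance one state per expected tag."""
    state = 0
    for _, tag in low_level_events(text):
        if state < len(TAG_SEQUENCE) and tag == TAG_SEQUENCE[state]:
            state += 1
    return state == len(TAG_SEQUENCE)


def single_level_match(text):
    """Reference behavior: match the original expression directly."""
    return re.search(r"ab.*bc", text) is not None
```

On the input `"abc"` the two segments overlap on the letter b, so the two-level sketch reports a match while the original expression does not. Every input matched by the original expression is still matched by the two-level version.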
Another embodiment of the invention generates configuration information for the multi-level pattern matching apparatus depicted in
The configuration information generated by this embodiment of the method of the invention for the destination apparatus of
In two further embodiments of the invention the configuration information generated by the method of the invention is for an apparatus in which the generated second collection of segments is matched by a set of DFAs. The output of these DFAs is then used as input to a single DFA or a set of DFAs that embody the second representation generated, by the method of the invention, from the original plurality of regular expressions.
A further embodiment of the method of the invention generates configuration information for a hierarchical pattern matching apparatus in which the second representation is a set of pattern matching objects.
The individual objects that comprise the object set [304] each have at least two message handling predicates with the following semantics. The input predicate is used to receive match notifications generated by the low level matching architecture [302] and dispatched to the object via the demultiplexer [303]. It is through this predicate that the object implements the semantics of the second representation that it is designed to match. The second requisite predicate is the query predicate, match, that is used to find the current state of the object, in particular with respect to whether the embodied representation has been matched, although the embodiment of partial matches, counted matches and other similar semantic constructs is within the scope of this invention. In most embodiments the invention will generate a second representation that configures a collection of objects that keep a record of where in the input stream matches have occurred, to allow the overall apparatus to report useful information regarding match location. This facility relies on the low level architecture [302] to report the input location when generating events.
The operation of the low level and high level pattern matching architectures is coordinated by the controller component [301]. This component receives an input stream from an external source, passes this input stream on to the low level component [302] and at an appropriate time, determined by the implementation semantics of the controller component, queries the constituent objects of the high level architecture [304] to identify the occurrence of any matches. After performing said object queries the controller [301] reports match notifications to any interested external system.
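The interaction between the controller, the demultiplexer and the pattern matching objects can be sketched in Python as below. All class and function names are hypothetical, the low level architecture is modeled as a simple substring scanner, and the object shown only handles a fixed sequence of tags; it is an illustration of the two predicates, not a definitive implementation.

```python
def literal_events(text, segments):
    """Stand-in low level matcher: (end_position, tag) events in input order."""
    events = []
    for literal, tag in segments.items():
        start = text.find(literal)
        while start != -1:
            events.append((start + len(literal), tag))
            start = text.find(literal, start + 1)
    return sorted(events)


class SequenceObject:
    """Hypothetical pattern matching object expecting tags in a fixed order."""

    def __init__(self, expected_tags):
        self.expected_tags = expected_tags
        self.progress = 0
        self.match_end = None   # input location at which the match completed

    def input(self, tag, position):
        """Input predicate: receive a match notification from the low level."""
        if self.progress < len(self.expected_tags) and tag == self.expected_tags[self.progress]:
            self.progress += 1
            if self.progress == len(self.expected_tags):
                self.match_end = position

    def match(self):
        """Query predicate: report whether the representation has matched."""
        return self.match_end is not None


class Controller:
    def __init__(self, low_level, objects_by_tag):
        self.low_level = low_level            # callable: text -> [(position, tag)]
        self.objects_by_tag = objects_by_tag  # demultiplexer table: tag -> objects

    def run(self, text):
        """Feed the input through both levels and report match locations."""
        for position, tag in self.low_level(text):
            for obj in self.objects_by_tag.get(tag, []):
                obj.input(tag, position)
        matched = []
        for objects in self.objects_by_tag.values():
            for obj in objects:
                if obj.match() and obj not in matched:
                    matched.append(obj)
        return [obj.match_end for obj in matched]
```

In this sketch the controller queries the objects once, after the whole input has been consumed; other query schedules are equally possible.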
The abovementioned embodiments of the method of the invention are each extended to accommodate input regular expressions that include action tags. Action tagged regular expressions have numerical event identifiers associated with specific locations in the regular expression. The method of the invention is extended so that the generated second representation includes details of the action tags present in the input regular expressions. This allows the same action identifiers to be generated as output from the high level pattern matching architecture as would be generated from a single level implementation of the action tagged input regular expressions in an extension of a conventional pattern matching architecture.
All of the abovementioned embodiments of the method of the invention are extended with a number of variations of the method for producing the second collection of segments. Several variants retain the concept of splitting the original regular expressions through the removal of substrings known as split candidates. These split candidates are identified by a number of means. The simplest means, as used in the above-described embodiments, is the matching of substrings to a table of candidate literals. Such candidates include the example “.*” used above.
In further embodiments the identification of split candidates is performed using a pattern matching architecture configured with a set of candidate patterns.
All of the abovementioned embodiments of the invention can be extended to include recursive application of the basic method of the invention. In the simplest embodiments, as taught above, the input expression is divided in a single pass. More complex embodiments of the method of the invention apply the procedure recursively, the resultant segment collection of one application of the process being subjected to a further application of the process and so on. The recursive application of the process can lead to representations embodying the high level semantics of the input regular expression that necessitate the use of the finite state machine model, or the pattern matching object model for the high level pattern matching architecture.
Still further embodiments of the invention use worst case analysis of the number of states required in a DFA representation of the second generated collection of segments. In these embodiments, a heuristic is used to estimate the number of states required to represent segments of the input expression. When a segment is estimated to exceed some predefined threshold the segment is divided into disjoint component segments. This method proceeds by applying the heuristic analysis recursively to the generated collection of segments until no further division is required. The accompanying second representation implied by this division method requires that the high level matching apparatus be implemented as a finite state machine or collection of objects.
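The recursive, threshold-driven division can be sketched as follows. The heuristic is not specified in the text above, so `estimate_states` is a purely hypothetical placeholder (each wildcard atom is assumed to double the state count), and dividing a segment at its midpoint is likewise an assumption. A segment is modeled as a list of atoms, each either a literal character or the wildcard ".".

```python
def estimate_states(atoms):
    """Hypothetical worst case estimate: each wildcard atom '.' may double the
    number of DFA states; literal atoms contribute one state each."""
    wildcards = sum(1 for atom in atoms if atom == ".")
    return (len(atoms) - wildcards + 1) * 2 ** wildcards


def divide(atoms, threshold):
    """Recursively halve a segment until every part fits under `threshold`."""
    if len(atoms) <= 1 or estimate_states(atoms) <= threshold:
        return [atoms]
    middle = len(atoms) // 2
    return divide(atoms[:middle], threshold) + divide(atoms[middle:], threshold)
```

Any estimator with the same signature could be substituted, for example one based on the memory used by a compressed state table.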
In various embodiments of the invention the above described worst case analysis can be performed with a restriction on the total number of states required for any individual DFA representation of a generated segment or, in alternative embodiments, with a restriction on the total number of states required by the combination of all such generated DFAs or the total number of states required by a combined DFA matching all generated segments. In addition, in further embodiments of the invention the worst case analysis relies on the amount of memory used for a proprietary representation of the DFA, for example a compressed state table representation as described in published U.S. application No. US2005/0028114 A1, entitled “Efficient Representation of State Transition Tables”, and published U.S. application No. US2005/0035784 A1, entitled “Apparatus and Method for Large Hardware Finite State Machine with Embedded Equivalence Classes”, both commonly owned, the contents of both of which are incorporated herein by reference in their entirety. As is known to those skilled in the art, the concept of “top level” expression requires parsing of regular expressions and refers to whole expressions separated by use of the disjunctive operator “|” that do not occur within parenthesized sub-expressions.
is converted into two DFAs. The invention produces a second collection of segments by dividing the first generated form of the input regular expression at the occurrence of features of little significance—in this case the occurrences of the idiom “.*”—and thus identifies the following second collection of segments:
The method of the invention converts these segments into a form suitable for the destination apparatus; in this case a single combined DFA [604]. It is understood that for simplicity DFA [602] is depicted in a simplified form that only includes significant transitions. It is further understood that other DFAs generated by the invention include more back transitions taken in the event of failed partial matches. This DFA has unique identifying tags associated with its terminal states. These tags are generated as output from the low level pattern matching apparatus [602] in the event of the DFA reaching one of these terminal states, i.e. when the DFA matches a low level feature in its input stream.
The method of the invention also generates a second representation in the form of DFA [605], this DFA being the configuration for the high level pattern matching architecture component of the destination apparatus [603]. This DFA takes as input the output of the low level DFA [602], i.e., the action tags assigned to the identified low level features. The high level DFA [605] has its terminal states labeled with appropriate action tags from the input regular expression. These action tags are generated as output from the high level pattern matching architecture [603] in the event of the DFA reaching one of these terminal states, i.e., when the overall apparatus matches a sequence of segments that corresponds to a match of the input regular expression. The output of the high level pattern matching architecture [603] is presented as the output of the whole apparatus and constitutes the pattern matching result.
The above embodiments of the present invention are illustrative and not limiting. Various alternatives and equivalents are possible. Other additions, subtractions or modifications are obvious in view of the present disclosure and are intended to fall within the scope of the appended claims.
The present application claims benefit under 35 USC 119(e) of U.S. provisional application No. 60/604983, filed on Aug. 26, 2004, entitled “Method For Transformation Of Regular Expressions” the content of which is incorporated herein by reference in its entirety.
| Number | Date | Country |
|---|---|---|
| 60604983 | Aug 2004 | US |