Regular expressions provide a concise and formal way of describing a set of strings over an alphabet. Given a regular expression and a string, the regular expression matches the string if the string belongs to the set described by the regular expression. Regular expression matching may be used, for example, by command shells, programming languages, text editors, and search engines to search for text within a document. Known techniques for regular expression matching can have long worst-case matching times.
Features of the present disclosure are illustrated by way of example and not limited in the following figure(s), in which like numerals indicate like elements, in which:
For simplicity and illustrative purposes, the present disclosure is described by referring mainly to examples. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be readily apparent however, that the present disclosure may be practiced without limitation to these specific details. In other instances, some methods and structures have not been described in detail so as not to unnecessarily obscure the present disclosure.
Throughout the present disclosure, the terms “a” and “an” are intended to denote at least one of a particular element. As used herein, the term “includes” means includes but not limited to, and the term “including” means including but not limited to. The term “based on” means based at least in part on.
Regular expressions are a formal way to describe a set of strings over an alphabet. Regular expression matching is the process of determining whether a given string (for example, a string of text in a document) matches a given regular expression, that is, whether the given string is in the set of strings that the regular expression describes. Given a string that matches a regular expression, submatch extraction is the process of extracting the substrings that correspond to specified subexpressions known as capturing groups. This feature allows regular expressions to be used as parsers, where the submatches correspond to parsed substrings of interest. For example, the regular expression (.*)=(.*) may be used to parse key-value pairs, where the parentheses indicate the capturing groups.
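As a small illustration of the key-value example above, the following sketch uses Python's standard re module. The helper name parse_pair and the sample strings are assumptions made for this illustration, and the re module uses a backtracking matcher rather than the OBDD-based technique described herein; the sketch only shows what the two capturing groups return.

```python
import re

# The two pairs of parentheses are the capturing groups from the example.
PAIR = re.compile(r"(.*)=(.*)")

def parse_pair(line):
    """Return (key, value) if the whole line matches, otherwise None.

    parse_pair is a hypothetical helper name used only in this sketch.
    """
    m = PAIR.fullmatch(line)
    return (m.group(1), m.group(2)) if m else None

print(parse_pair("device=fw01"))
print(parse_pair("no separator"))
```

Because the first group is greedy in this engine, a line with several "=" characters is split at the last one; other engines or policies may pick a different, equally valid split.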
A submatch extraction system and method for extracting submatches from a string that matches a regular expression are described herein. As described in detail below, the system and method use Boolean functions to represent transitions of non-deterministic finite automata (NFAs), and manipulate the Boolean functions using ordered binary decision diagrams (OBDDs). An automaton is defined as an abstract machine that can be in one of a finite number of states and includes rules for traversing the states. As described in further detail below, an OBDD is defined as a data structure that is used to represent a Boolean function. The automaton and the OBDD may be stored in the submatch extraction system as machine readable instructions, or as data other than instructions. The system and method use a two-pass technique. A forward pass scans an input string and decides whether it is accepted by an automaton. If so, a backward pass is used to extract submatches described by the capturing groups of a regular expression. In this manner, the regular expression may be compiled once by an automaton creation module (described below), and then matched to many different input strings to extract submatches. The implementation of the submatch extraction system and method provides both time and space efficiency. The system and method provide reduced submatch extraction times in the worst case.
In an example, the submatch extraction system generally includes a memory storing a module comprising machine readable instructions to receive an input string, receive a regular expression, and convert the regular expression with capturing groups into OBDDs to extract submatches. The system may include a processor to implement the module.
In an example, the method for submatch extraction includes receiving an input string, receiving a regular expression, and converting the regular expression with capturing groups into OBDDs to extract submatches.
In an example, a non-transitory computer readable medium having stored thereon machine readable instructions for submatch extraction is also described. The machine readable instructions when executed may cause a computer system to receive an input string, receive a regular expression, and convert the regular expression with capturing groups into OBDDs to extract submatches.
As discussed above, an OBDD is a data structure that may be used to represent a Boolean function and can be considered as a compressed representation of sets or relations. OBDDs can be used to transform Boolean function manipulations into efficient graph representations. An OBDD represents a Boolean function f (x1, x2, . . . , xn) as a rooted, directed acyclic graph (DAG). The DAG includes two terminal nodes, which are labeled 0 and 1, and have no outgoing edges. Each remaining non-terminal node is associated with a label from the set {x1, x2, . . . , xn}, and includes two outgoing edges labeled 0 and 1. An OBDD may be ordered such that node labels are associated with a total order <. Node labels along all paths in the OBDD from the root to the terminal nodes follow this total order. Evaluation of a Boolean function denoted by an OBDD may be performed by traversing appropriately labeled edges from the root to the terminal nodes of the DAG. For example,
OBDDs provide for efficient manipulation of Boolean functions. With OBDDs, checking whether a Boolean function is satisfiable or unsatisfiable is a constant time operation. This is because it is sufficient to check whether the terminal node labeled 1 (or 0) is present in the OBDD. With regard to OBDDs, APPLY and RESTRICT operations allow OBDDs to be combined and modified with a number of Boolean operators.
The APPLY operation allows binary Boolean operators, such as AND (∧) and OR (∨), to be applied to a pair of OBDDs. Two input OBDDs, OBDD(f) and OBDD(g), have the same variable ordering. APPLY (OP, OBDD(f), OBDD(g)) computes OBDD(f OP g), which has the same variable ordering as the input OBDDs.
The RESTRICT operation is unary, and produces as output an OBDD in which the values of some of the variables of the input OBDD have been fixed to a certain value. That is, RESTRICT (OBDD(f), x←k)=OBDD(f|(x←k)), where f|(x←k) denotes that x is assigned the value k in f. In this case, the output OBDD does not have any nodes with the label x.
The APPLY and RESTRICT operations are implemented as a series of graph transformations and reductions on the input OBDDs, and have efficient implementations. For example, the time complexity of the APPLY and RESTRICT operations is polynomial in the size of the input OBDDs.
The APPLY and RESTRICT operations may be used to implement existential quantification, which is used in the operation of the submatch extraction system and method. In particular, ∃xi.f(x1, . . . , xn)=f(x1, . . . , xn)|(xi←0) ∨ f(x1, . . . , xn)|(xi←1). Expressed by OBDD, this results in OBDD(∃xi.f (x1, . . . , xn))=APPLY (∨, RESTRICT (OBDD(f), xi←1), RESTRICT(OBDD(f), xi←0)). Further, OBDD(∃xi. f (x1, . . . , xn)) will have no node labeled xi.
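The APPLY, RESTRICT, and existential quantification operations described above can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the implementation used by the system: nodes are plain nested tuples (variable index, low branch, high branch), variables are ordered by index, there is no node sharing or memoization (so it does not have the polynomial running time of a production OBDD package), and all function names (mk, var, apply_op, restrict, exists, evaluate) are hypothetical.

```python
# Terminals are the ints 0 and 1; a non-terminal is (var_index, low, high).
AND = lambda a, b: a & b
OR = lambda a, b: a | b

def mk(i, low, high):
    """Reduced node: drop the test on x_i when both branches are identical."""
    return low if low == high else (i, low, high)

def var(i):
    """OBDD of the single variable x_i."""
    return mk(i, 0, 1)

def top(u):
    """Index of the root variable; terminals sort after every variable."""
    return u[0] if isinstance(u, tuple) else float("inf")

def cofactor(u, i, k):
    """Branch of u for x_i = k, assuming i is not below u's root variable."""
    if isinstance(u, tuple) and u[0] == i:
        return u[2] if k else u[1]
    return u

def apply_op(op, f, g):
    """APPLY(op, f, g): combine two OBDDs that share the variable order."""
    if not isinstance(f, tuple) and not isinstance(g, tuple):
        return op(f, g)
    i = min(top(f), top(g))
    return mk(i,
              apply_op(op, cofactor(f, i, 0), cofactor(g, i, 0)),
              apply_op(op, cofactor(f, i, 1), cofactor(g, i, 1)))

def restrict(f, i, k):
    """RESTRICT(f, x_i <- k): the result has no node labeled x_i."""
    if not isinstance(f, tuple) or f[0] > i:
        return f
    if f[0] == i:
        return f[2] if k else f[1]
    return mk(f[0], restrict(f[1], i, k), restrict(f[2], i, k))

def exists(f, i):
    """Existential quantification of x_i via APPLY(OR, RESTRICT, RESTRICT)."""
    return apply_op(OR, restrict(f, i, 1), restrict(f, i, 0))

def evaluate(f, assignment):
    """Follow the edges selected by the assignment until a terminal is hit."""
    while isinstance(f, tuple):
        f = f[2] if assignment[f[0]] else f[1]
    return f

# f = (x0 AND x1) OR x2; quantifying x1 away leaves x0 OR x2.
f = apply_op(OR, apply_op(AND, var(0), var(1)), var(2))
print(exists(f, 1) == apply_op(OR, var(0), var(2)))
```

Because the reduced, ordered form is canonical, two equivalent functions produce equal tuples in this sketch, and a function is unsatisfiable exactly when its diagram is the terminal 0.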
OBDDs may be used to obtain concise representations of relations over finite domains. For example, if R is an n-ary relation over the domain {0, 1}, then its characteristic function fR may be defined as follows: fR(x1, . . . , xn)=1 if and only if R(x1, . . . , xn). For example, the characteristic function of the 3-ary relation R={(1, 0, 1), (1, 1, 0)} is fR(x1, x2, x3)=(x1 ∧ ¬x2 ∧ x3) ∨ (x1 ∧ x2 ∧ ¬x3).
A set of elements over an arbitrary domain D can also be expressed as an OBDD. If S is a set of elements over a domain D, then a relation RS may be defined such that RS(s)=1 if and only if s ∈ S. Operations on sets may then be expressed as Boolean operations and performed on the OBDDs representing these sets. For example,
By using OBDDs for submatch extraction, the time efficiency of submatch extraction is increased. Using OBDDs for submatch extraction also provides for retention of space-efficiency. Further, using OBDDs provides for handling of all classes of regular expressions. The system and method described herein also provide for submatch extraction for complex regular expressions and for submatch extraction on a set of regular expressions combined together.
With regard to parsing, parsing using regular expressions may be used as a building block for security applications such as security information and event management (SIEM) systems. SIEM systems perform real-time analysis of event logs and security alerts generated by hardware and software systems in an enterprise network. Since each source generates its logs in a different format, a SIEM system may use submatch extraction to parse common fields, such as, for example, device name and attack source from the logs. In such a setting, a relatively small number of regular expressions, which are known in advance, may be matched to a large number of input strings in real time. In this regard, the submatch extraction system and method provide for efficient submatch extraction when matching a string to a regular expression, where the expression may be compiled in advance into a form that will facilitate matching and submatching. The submatch extraction system and method may therefore be implemented, for example, in a parser, in a SIEM system, and in an intrusion detection system (IDS).
The modules 101-105, and other components of the system 100 may comprise machine readable instructions stored on a computer readable medium. In addition, or alternatively, the modules 101-105, and other components of the system 100 may comprise hardware or a combination of machine readable instructions and hardware.
The components of the system 100 are described in further detail with reference to
Referring to
With regard to regular expressions, the language described by a regular expression may also be defined by a finite automaton. For example, the language described by regular expression a*aa consists of strings starting with zero or more a's, and ending with aa. For a regular expression R, the automaton creation module 102 may construct an ε-NFA that defines the same language as R. The automaton creation module 102 may further reduce the ε-NFA to an ε-free NFA. An ε-free NFA may be described by a tuple A=(Q, Σ, δ, S, F), where Q is a finite set of states, Σ is a finite set of input symbols, S is a set of start states, F is a set of accept states, and δ is the transition function which takes a state in Q and an input in Σ as arguments and returns the next set of states, which is a subset of Q.
For the regular expression R, a capturing group may be a part of a regular expression, called a subexpression, which is wrapped within a pair of parentheses. As discussed above, the capturing group may be used to specify a substring of a string described by a regular expression, with the substring being denoted a submatch. For example, a submatch of string aaab specified by regular expression (a*b*)b* is aaa. A submatch may not be unique for some regular expressions. For example, aaab is another valid submatch of aaab for regular expression (a*b*)b*.
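The non-uniqueness of submatches noted above can be checked by brute force for this particular expression. The sketch below is an illustration only: the function name group_candidates is hypothetical, and the approach relies on the capturing group of (a*b*)b* being a prefix of the string, so it does not generalize to arbitrary expressions.

```python
import re

def group_candidates(s):
    """List every substring the group in (a*b*)b* may capture for string s.

    The group is a prefix of s here, so each split point is tried: the head
    must match a*b* and the remaining tail must match b*.
    """
    found = []
    for k in range(len(s) + 1):
        head, tail = s[:k], s[k:]
        if re.fullmatch(r"a*b*", head) and re.fullmatch(r"b*", tail):
            found.append(head)
    return found

print(group_candidates("aaab"))
```

For the string aaab this yields the two submatches named in the text, aaa and aaab.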
The automaton creation module 102 converts a regular expression with capturing groups to a tagged NFA (note: references to NFA refer to a tagged NFA), which is described by a tuple A=(Q, Σ, T, δ, Γ, S, F), where:
In order to construct the automaton A, the automaton creation module 102 may use an inductive NFA construction approach. Referring to
The automaton creation module 102 performs ε-elimination to convert an ε-NFA to an ε-free NFA described by a tuple A1=(Q1, Σ, T, δ1, Γ1, S1, F1), where:
Q1 is a finite set of states;
Σ is a finite set of input symbols;
δ1 is the transition function, δ1: Q1×Σ→2^Q1;
S1 ⊆ Q1 is a set of start states;
F1 ⊆ Q1 is a set of accept states;
T is a finite set of tags;
Γ1 is a tag output function, Γ1: Q1×Σ×Q1→2^T.
Q1 includes final states in A and any state that has an outgoing transition for a symbol in Σ. Transition function δ1 and tag output function Γ1 may be defined as follows: δ1(q, a)=p and Γ1(q, a, p)=t if and only if there exists a state p′ such that δ(q, a)=p′, Γ(q, a, p′)=t, and p ∈ ε-closure(p′), where δ and Γ are the transition function and the tag output function of A. The set of final states F1 is defined by F1={q | f ∈ ε-closure(q); q ∈ Q1; f ∈ F}, where F is the set of final states of A.
The transition function δ1 and the tag output function Γ1 may be represented by an extended transition table Tr=Tr (x, i, y, t), where x denotes a current state, i denotes an input symbol, y denotes a next state for x with input i, and t is a set of output tags associated with transition (x, i, y). Each row of Tr may include a transition along with a set of output tags.
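As an illustration of the extended transition table, the sketch below stores each row as a tuple (x, i, y, t) and recovers δ1 and Γ1 from it. The three rows are an assumed table for the expression (a*)aa, chosen to be consistent with the start state 1, the accept state 3, and the frontiers mentioned in the description; they are not copied from the figures, and the tag name t1 and the function names are hypothetical.

```python
# Assumed extended transition table for (a*)aa: rows are (x, i, y, t).
# Tag "t1" marks symbols consumed inside the capturing group (a*).
TR = [
    (1, "a", 1, frozenset({"t1"})),
    (1, "a", 2, frozenset()),
    (2, "a", 3, frozenset()),
]

def delta1(q, a):
    """Set of next states for state q on input symbol a."""
    return {y for (x, i, y, _) in TR if x == q and i == a}

def gamma1(q, a, p):
    """Set of output tags attached to the transition (q, a, p)."""
    tags = set()
    for (x, i, y, t) in TR:
        if (x, i, y) == (q, a, p):
            tags |= t
    return tags

print(delta1(1, "a"), gamma1(1, "a", 1))
```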
In order to determine whether an input string is in a language described by the regular expression, the ε-free NFA created by the automaton creation module 102 may be used by the comparison module 104 to perform a match test. In order to perform the match test, for a given input string a0a1 . . . aI ∈ Σ*, the NFA may be in a set of states (called the frontier) at any instant during its operation. For frontier derivation, starting from the start states S1, with the first input symbol a0, the next set of states may be obtained by looking up transitions such that the values of the first two columns satisfy x ∈ S1 and i=a0. If these transitions are denoted by Tr1, then Tr1={(x, i, y, t)|x ∈ S1; i=a0, (x, i, y, t) ∈ Tr}. The new frontier (i.e., next set of states) Y1 is defined by Y1={y|(x, i, y, t) ∈ Tr1}. Renaming Y1 to X1 and using it as a current set of states, with the second input symbol a1, the new frontier of the NFA may be derived. The intermediate transitions may be denoted by Trj, and the frontiers may be denoted by Xj, where j ∈ 1 . . . I.
For the example regular expression (a*)aa, the start states S1={1}, and the accept states F1={3}. If the kth row of transition table Tr (shown in
When the final input symbol is reached, the comparison module 104 may determine if any state in XI belongs to F1. If so, the input string a0a1 . . . aI is accepted by A1, which means the input string matches the regular expression defined by A1. In the above example, X4={1; 2; 3} contains an accept state 3, thus, aaaa is accepted by (a*)aa.
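The table look-up form of the match test can be sketched as follows. This is an illustration under assumptions: the transition rows are an assumed table for (a*)aa (start state 1, accept state 3) rather than the table from the figures, and the function name forward is hypothetical. The function returns the acceptance decision together with the frontiers X1, X2, and so on.

```python
# Assumed extended transition table for (a*)aa: rows are (x, i, y, t).
TR = [
    (1, "a", 1, frozenset({"t1"})),
    (1, "a", 2, frozenset()),
    (2, "a", 3, frozenset()),
]

def forward(tr, starts, accepts, s):
    """Derive one frontier per input symbol and test the last for acceptance."""
    frontier = set(starts)
    frontiers = []
    for symbol in s:
        # Intermediate transitions: rows leaving the frontier on this symbol.
        step = [row for row in tr if row[0] in frontier and row[1] == symbol]
        # New frontier: the target states of those rows.
        frontier = {row[2] for row in step}
        frontiers.append(frontier)
    return bool(frontier & set(accepts)), frontiers

accepted, frontiers = forward(TR, {1}, {3}, "aaaa")
print(accepted, frontiers)
```

With the assumed table, the first frontier is {1, 2} and the fourth is {1, 2, 3}, which contains the accept state 3.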
In case an input string a0a1 . . . aI is accepted by A1 and the regular expression has capturing groups, the extraction module 105 may extract substrings specified by the capturing groups. To do so, a backward path is determined from an accept state to a start state of the automaton. Any backward path from an accept state to a start state generates a valid set of submatch instances.
In order to determine a backward path, the extraction module 105 may find a previous state given a current state with an input symbol. Starting from the last input symbol aI and one of the accept states qf ∈ F1 (such that the input string is accepted at qf), a previous state qI which led the automaton to state qf with input symbol aI can be obtained by checking the intermediate transitions TrI generated during the match test. In particular, qI is any state in QI, where QI={x|(x, aI, qf, t) ∈ TrI}. If (qI, aI, qf, t) is a transition in TrI that led state qI to qf with input symbol aI, then the submatch tags associated with symbol aI are t. It is noted that t can be empty if there is no capturing group associated with a transition. Using qI as a current state, with input symbol aI-1, a previous state which led the automaton to qI with input symbol aI-1 can be obtained by looking up the intermediate transitions in TrI-1. The submatch tags associated with aI-1 can be found using the same approach as for aI. This backward process may be continued to find previous states and submatch tags until a0 is reached. Finally, a backward path from qf to a start state and the submatch tags associated with each input symbol may be obtained.
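The two passes can be combined in a short table look-up sketch. It is an illustration under assumptions: the transition rows are an assumed table for (a*)aa, the tag name t1 and the function name extract_tags are hypothetical, and the backward pass simply takes the first suitable row at each step, since any backward path yields a valid set of submatches.

```python
# Assumed extended transition table for (a*)aa: rows are (x, i, y, t).
TR = [
    (1, "a", 1, frozenset({"t1"})),
    (1, "a", 2, frozenset()),
    (2, "a", 3, frozenset()),
]

def extract_tags(tr, starts, accepts, s):
    """Return the tag set chosen for each input symbol, or None if no match."""
    # Forward pass: keep the intermediate transitions of every step.
    frontier, steps = set(starts), []
    for symbol in s:
        step = [r for r in tr if r[0] in frontier and r[1] == symbol]
        steps.append(step)
        frontier = {r[2] for r in step}
    finals = frontier & set(accepts)
    if not finals:
        return None
    # Backward pass: walk from an accept state back to a start state.
    state = min(finals)
    tags = [None] * len(s)
    for j in range(len(s) - 1, -1, -1):
        x, _, _, t = next(r for r in steps[j] if r[2] == state)
        tags[j] = t
        state = x
    return tags

s = "aaaa"
tags = extract_tags(TR, {1}, {3}, s)
submatch = "".join(ch for ch, t in zip(s, tags) if "t1" in t)
print(submatch)
```

Every row kept in the first step leaves a start state, so the walk necessarily ends in a start state.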
In the above example, a backward path from an accept state 3 to a start state 1 with input string aaaa is shown in
The above match test and backward path finding can be accomplished using OBDDs, which can have higher time efficiency than a table look-up approach. It is noted that the OBDDs may be constructed prior to determination of the backward path.
In order to construct the OBDDs by the OBDD creation module 103, generally an NFA may be represented and operated symbolically using Boolean functions. Boolean functions may be used to update the frontiers during the match test discussed above. OBDDs may be used to represent and manipulate the Boolean functions such that tags corresponding to the capturing groups are properly set.
In order to represent and operate NFA A1=(Q1, Σ, T, δ1, Γ1, S1, F1) with Boolean functions, the OBDD creation module 103 uses NFAs in which ε-transitions have been eliminated. The Boolean functions of an NFA use four vectors of Boolean variables, x, y, i, and t. Vectors x and y are used to denote states in Q1, and they contain ⌈log|Q1|⌉ Boolean variables each. Vector i denotes symbols in Σ, and thus contains ⌈log|Σ|⌉ Boolean variables. Vector t denotes the submatch tags and contains ⌈log|T|⌉ Boolean variables. The following Boolean functions for NFA A1=(Q1, Σ, T, δ1, Γ1, S1, F1) may be constructed.
Δ(x, i, y, t) denotes the transition table of A1. It is a disjunction of all transition relations (x, i, y, t) of an NFA. The transition table of the example regular expression of
Iσ(i) stands for the Boolean representation of symbols in Σ. In the foregoing example, Ia=i. A symbol different from a is represented by the complement of Ia.
F(x) is a Boolean representation of a set of frontier states. For example, after consuming the first input symbol of aaaa, the frontier of the NFA is {1,2}, and the Boolean representation is F(x)=
ΔF(x, i, y, t) denotes the Boolean representation of a set of intermediate transitions of an NFA, from which the new frontier states and the output tags are derived.
A(x) is used to define the Boolean representation of a set of accept states of an NFA. In the example NFA shown in the table of
The Boolean functions described above can be computed automatically from the extended transition table of
For the submatch extraction performed by the extraction module 105, suppose that the frontier of an NFA is F(x) at some instant of operation, and the next symbol in the input string is σ, then the new frontier states can be computed by the following Boolean operations:
g(y)=∃x∃i∃t[ΔF(x, i, y, t)] Equation (1)
For Equation (1), ΔF(x, i, y, t)=F(x) ∧ Iσ(i) ∧ Δ(x, i, y, t).
In order to understand why g(y) represents the new frontier states, consider the truth table of Boolean function Δ(x, i, y, t). By construction, this function evaluates to 1 only for x, i, y, and t for which (x, i, y, t) is a transition of the NFA. Function F(x) evaluates to 1 only for values of x that denote the states in the current frontier of the NFA. Thus, the conjunction of Δ(x, i, y, t) with F(x) and Iσ(i) selects only the rows of the truth table of Δ(x, i, y, t) that correspond to transitions from states in the frontier labeled with symbol σ, resulting in intermediate transitions ΔF(x, i, y, t), which will be used in submatch extraction. ΔF is a function of x, i, y, and t. The new frontier states are associated only with y (i.e., the target states of transitions) for which the conjunction has a satisfying assignment. To find the new frontier states, x, i, and t may be existentially quantified, resulting in g(y). To express the new frontier states in terms of x, variables in y may be renamed to the corresponding ones in x.
To check whether the automaton is in an accept state, the satisfiability of conjunction between the frontier F(x) and the accept set of states A(x) is checked.
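The frontier update of Equation (1) and the accept check can be mirrored with Boolean functions written as Python predicates. This is a sketch under stated assumptions and not an OBDD implementation: conjunction is the Python "and", existential quantification is a brute-force search over small finite domains, the rows are an assumed table for (a*)aa, and every name in the code (ROWS, frontier_fn, next_frontier, is_accepting, run) is hypothetical.

```python
from itertools import product

STATES = (1, 2, 3)
SYMBOLS = ("a", "b")
TAG_SETS = (frozenset(), frozenset({"t1"}))
# Assumed transitions for (a*)aa; Delta is 1 exactly on these rows.
ROWS = {
    (1, "a", 1, frozenset({"t1"})),
    (1, "a", 2, frozenset()),
    (2, "a", 3, frozenset()),
}

def delta(x, i, y, t):
    return (x, i, y, t) in ROWS

def frontier_fn(states):
    """Boolean function that is true exactly on the given set of states."""
    members = frozenset(states)
    return lambda x: x in members

def next_frontier(F, sigma):
    """Return (Delta_F, g) for frontier function F and input symbol sigma."""
    def delta_F(x, i, y, t):
        # Conjunction F(x) AND I_sigma(i) AND Delta(x, i, y, t).
        return F(x) and i == sigma and delta(x, i, y, t)
    def g(y):
        # Existential quantification over x, i, and t (Equation (1)).
        return any(delta_F(x, i, y, t)
                   for x, i, t in product(STATES, SYMBOLS, TAG_SETS))
    return delta_F, g

def is_accepting(F, A):
    """Satisfiability of the conjunction of frontier F and accept set A."""
    return any(F(x) and A(x) for x in STATES)

def run(s, starts=(1,), accepts=(3,)):
    F = frontier_fn(starts)
    for sigma in s:
        _, g = next_frontier(F, sigma)
        # Renaming y to x: the states satisfying g become the new frontier.
        F = frontier_fn(y for y in STATES if g(y))
    return is_accepting(F, frontier_fn(accepts))

print(run("aaaa"), run("a"))
```

An OBDD package would replace the predicates with diagrams and the brute-force search with APPLY and RESTRICT operations; the sequence of steps stays the same.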
In case an input string is accepted by the automaton based on a determination by the comparison module 104 and the regular expression has capturing groups, as discussed above, the extraction module 105 may use OBDDs to find a backward path from an accept state to a start state of the automaton. The backward path finding may start from an accept state and the last symbol of the input string. The current state of a backward path may be called the reverse frontier, which contains only one state, since only one path is needed. Suppose that, at some instant of the path finding, the reverse frontier is Fr(y) and the previous input symbol is σ. The previous state that led the automaton to state Fr(y) with input symbol σ can be determined by the following Boolean function:
Δr(x, i, y, t)=Fr(y) ∧ Iσ(i) ∧ ΔF(x, i, y, t) Equation (2)
For Equation (2), ΔF(x, i, y, t) are the intermediate transitions corresponding to input symbol σ during the match test process. The conjunctions in Equation (2) select transitions (labeled by σ) from ΔF(x, i, y, t) where the target state is Fr(y). The previous states are associated with x in Δr(x, i, y, t). Since only one path is needed, one row may be chosen in the truth table of Δr(x, i, y, t) to find one previous state of Fr(y). If the chosen row is denoted as PICKONE(Δr(x, i, y, t)), a previous state g(x) of Fr(y) can be derived by:
g(x)=∃y∃i∃t[H(x, i, y, t)] Equation (3)
H(x, i, y, t)=PICKONE(Δr(x, i, y, t)) Equation (4)
To obtain submatch tags τ(t) associated with σ, x, i, and y are existentially quantified on H (x, i, y, t) as follows:
τ(t)=∃x·∃i·∃y·H(x, i, y, t) Equation (5)
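One backward step of Equations (2) through (5) can be sketched by treating each Boolean function as the set of its satisfying rows. This is an assumption-laden illustration rather than an OBDD implementation: the name backward_step is hypothetical, the sample rows are an assumed set of intermediate transitions for (a*)aa, and PICKONE is modeled by taking the smallest row by (x, i, y), although any row would do.

```python
def backward_step(delta_f, fr_state, sigma):
    """Return (previous state, tags) for one step of the backward path.

    delta_f holds the intermediate transitions (x, i, y, t) recorded for
    this input position during the match test.
    """
    # Equation (2): keep rows labeled sigma whose target is the reverse frontier.
    delta_r = {(x, i, y, t) for (x, i, y, t) in delta_f
               if y == fr_state and i == sigma}
    # Equation (4): PICKONE, modeled here as the smallest row by (x, i, y).
    # delta_r is non-empty whenever fr_state was reached in the forward pass.
    h = min(delta_r, key=lambda row: row[:3])
    # Equation (3): project onto x.  Equation (5): project onto t.
    return h[0], h[3]

# Assumed intermediate transitions for one position of the input aaaa.
DELTA_F = {
    (1, "a", 1, frozenset({"t1"})),
    (1, "a", 2, frozenset()),
    (2, "a", 3, frozenset()),
}
print(backward_step(DELTA_F, 3, "a"))
```

Repeating the step from the last input symbol to the first, with the returned state as the next reverse frontier, yields the tags for every symbol.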
Based on the discussion above, the OBDD creation module 103 represents NFAs by Boolean functions and manipulates the Boolean functions using OBDDs. An OBDD for an NFA A1=(Q1, Σ, T, δ1, Γ1, S1, F1) is a 5-tuple [{OBDD(Δ(x, i, y, t))}, {OBDD(Iσ|∀σ ∈ Σ)}, {OBDD(τt|t ∈ T)}, {OBDD(FS
As discussed above, by using OBDDs for submatch extraction, the time efficiency of submatch extraction is improved. For example, considering the operation of an NFA, to derive a new set of frontier states, the transition table is retrieved for each state in the current set of frontier states, leading to O(|δ|×|F|) operations per input symbol. With OBDDs, a new frontier set can be derived by two conjunctions, followed by existential quantifications. The time efficiency is determined mainly by the conjunctions, which take O(sizeof (OBDD(Δ))×sizeof (OBDD(Iσ))×sizeof(OBDD(F))). Because OBDDs are compact representations of frontier F and transition table Δ, the time efficiency of submatch extraction is increased. The benefit of NFA-based OBDDs is most pronounced when the transition table of an NFA is sparse and the frontier set is large, in which case OBDDs can effectively remove the redundancy of transition relations δ1 and frontier set F of the automaton.
Using OBDDs for submatch extraction also provides for retention of space-efficiency. For example, the space cost of an OBDD based on an NFA is based on Δ(x, i, y, t), which uses a total of 2×⌈log|Q1|⌉+⌈log|Σ|⌉+⌈log|T|⌉ Boolean variables. The space consumption of this OBDD is O(|Q1|^2×|Σ|×|T|), which is similar to the space consumption of the transition table of an NFA.
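The variable count above can be worked through with a short sketch. It assumes that the logarithm is taken base 2 and that each state, symbol, or tag is encoded as a fixed-width bit vector of its zero-based index; the function names and the sizes used (3 states, 2 symbols, 2 tag values) are hypothetical and chosen only for illustration.

```python
def bits_needed(n):
    """Exact ceil(log2(n)) for n >= 1, using integer arithmetic."""
    return (n - 1).bit_length()

def encode(index, width):
    """Fixed-width bit tuple (most significant bit first) for an index."""
    return tuple((index >> b) & 1 for b in reversed(range(width)))

def delta_variable_count(n_states, n_symbols, n_tags):
    """Boolean variables used by Delta(x, i, y, t): x and y, then i, then t."""
    return (2 * bits_needed(n_states)
            + bits_needed(n_symbols)
            + bits_needed(n_tags))

# Hypothetical sizes: 3 states, 2 symbols, 2 tag values.
print(delta_variable_count(3, 2, 2), encode(2, bits_needed(3)))
```

For these sizes, x and y take two variables each and i and t take one each, for six variables in total.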
Referring to
At block 202, the regular expression with capturing groups is converted into an automaton to extract submatches. More specifically, the regular expression is converted into a finite automaton with tags representing the capturing groups. For example, referring to
At block 203, the finite automaton is converted into OBDDs. For example, referring to
At block 204, an input string is received. For example, referring to
At block 205, the constructed OBDDs are used to determine whether the input string is in a language described by the regular expression. For example, referring to
At block 206, if an input string matches the regular expression, the OBDDs created by the OBDD creation module 103 may be used to extract submatches. Specifically, the tags and the input string are processed in reverse order by the OBDDs, one character at a time, to extract the submatches. For example, referring to
The computer system 300 includes a processor 302 that may implement or execute machine readable instructions performing some or all of the methods, functions and other processes described herein. Commands and data from the processor 302 are communicated over a communication bus 304. The computer system 300 also includes a main memory 306, such as a random access memory (RAM), where the machine readable instructions and data for the processor 302 may reside during runtime, and a secondary data storage 308, which may be non-volatile and stores machine readable instructions and data. The memory and data storage are examples of computer readable mediums. The memory 306 may include modules 320 including machine readable instructions residing in the memory 306 during runtime and executed by the processor 302. The modules 320 may include the modules 101-105 of the system 100 shown in
The computer system 300 may include an I/O device 310, such as a keyboard, a mouse, a display, etc. The computer system 300 may include a network interface 312 for connecting to a network. Other known electronic components may be added or substituted in the computer system 300.
What has been described and illustrated herein is an example along with some of its variations. The terms, descriptions and figures used herein are set forth by way of illustration only and are not meant as limitations. Many variations are possible within the spirit and scope of the subject matter, which is intended to be defined by the following claims—and their equivalents—in which all terms are meant in their broadest reasonable sense unless otherwise indicated.