Regular expressions provide a concise and formal way of describing a set of strings over an alphabet. Given a regular expression and a string, the regular expression matches the string if the string belongs to the set described by the regular expression. Regular expression matching may be used, for example, by command shells, programming languages, text editors, and search engines to search for text within a document.
Features of the present disclosure are illustrated by way of example and not limited in the following figure(s), in which like numerals indicate like elements, in which:
For simplicity and illustrative purposes, the present disclosure is described by referring mainly to examples thereof. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be readily apparent however, that the present disclosure may be practiced without limitation to these specific details. In other instances, some methods and structures have not been described in detail so as not to unnecessarily obscure the present disclosure.
Throughout the present disclosure, the terms “a” and “an” are intended to denote at least one of a particular element. As used herein, the term “includes” means includes but not limited to, the term “including” means including but not limited to. The term “based on” means based at least in part on.
Regular expressions are a formal way to describe a set of strings over an alphabet. Regular expression matching is the process of determining whether a given string (for example, a string of text in a document) matches a given regular expression, that is, whether the given string is in the set of strings that the regular expression describes. Given a string that matches a regular expression, submatch extraction is a process of extracting substrings corresponding to specified subexpressions known as capturing groups. This feature provides for regular expressions to be used as parsers, where the submatches correspond to parsed substrings of interest. For example, the regular expression (.*)=(.*) may be used to parse key-value pairs, where the parentheses are used to indicate the capturing groups.
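As an illustration of the key-value example above, the following minimal sketch uses Python's standard `re` module. Python's matcher is a backtracking engine rather than the automaton-based system of this disclosure, and the helper name `parse_key_value` is an assumption introduced only for this sketch.

```python
import re

# The pattern from the text; the parentheses mark capturing groups 1 and 2.
KEY_VALUE = re.compile(r"(.*)=(.*)")


def parse_key_value(line):
    """Return (key, value) if the whole line matches the pattern, else None."""
    match = KEY_VALUE.fullmatch(line)
    return match.groups() if match else None


if __name__ == "__main__":
    print(parse_key_value("colour=blue"))
    # With the greedy closure in the first group, the split falls on the last "=".
    print(parse_key_value("a=b=c"))
```

The second call illustrates why the choice among several valid submatch assignments matters: "a=b=c" could be split at either "=" sign.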
Finding the submatches of an input string to a regular expression that contains capturing groups may be implemented by using automata. While certain implementations may use a plurality of automata and thus a plurality of passes of the input string to determine the correct submatches, in certain cases, finding the submatches of an input string to a regular expression may be implemented by using a single (i.e., one) pass. According to an example, a one pass submatch extraction system and a method for one pass submatch extraction are disclosed. The system and method disclosed herein may be used to determine at compile time whether a regular expression being considered belongs to the set of regular expressions that may be implemented by using a single pass, and if so, a single automaton may be used at runtime. By using a single-pass operation, the system and method disclosed herein provide improved efficiency by approximately a factor of two for the matching and submatching at runtime for the regular expressions in these sets compared to using a multiple-pass (e.g., two-pass) operation.
According to an example, the one pass submatch extraction system may include an input module to receive a regular expression. An automaton generation module may generate an automaton M for the received regular expression. An automaton M is defined as an abstract machine that can be in one of a finite number of states and includes rules for traversing the states. The automaton M may be stored in the system as machine readable instructions. An automaton evaluation module may determine whether the regular expression being considered belongs to the set of regular expressions that may be implemented by using a single pass, and if so, the single automaton M may be used at runtime. If the regular expression being considered does not belong to the set of regular expressions that are implemented by using a single pass, finding submatches of an input string to the regular expression may be implemented, for example, as described in detail in commonly owned and co-pending application Ser. No. 13/460,419 titled “Submatch Extraction”, Ser. No. 13/556,684 titled “Matching Regular Expressions including Word Boundary Symbols,” and PCT/US12/28916 titled “Submatch Extraction”. Further, the systems and methods described in co-pending application Ser. Nos. 13/460,419, 13/556,684, and PCT/US12/28916 may implement finding submatches of an input string to a regular expression either when the regular expression belongs to the set of regular expressions for which matching and submatch extraction can be implemented by using a single pass as described herein, or when the regular expression does not belong to this set.
In order for the automaton evaluation module to determine whether the regular expression being considered belongs to the set of regular expressions for which matching and submatch extraction may be implemented by using a single pass, the automaton evaluation module may determine whether the automaton M′ is deterministic (as described in further detail below), where M′=rev(close(M)) and M is the automaton corresponding to the regular expression built in the manner described below. If M′=rev(close(M)) is deterministic, then M′ is a one pass reverse automaton, and the one pass reverse automaton M′ (i.e., M′=rev(close(M))) may be used to process a string in reverse order. Further, the automaton evaluation module may determine whether the automaton M″ is deterministic, where M″=rev(close(rev(M))) and M is the automaton corresponding to the regular expression built in the manner described below. If M″=rev(close(rev(M))) is deterministic, then M″ is a one pass forward automaton, and the one pass forward automaton M″ (i.e., M″=rev(close(rev(M)))) may be used to process a string in forward order.
The system and method disclosed herein may further include a comparison module to receive input strings, and match the input strings to the regular expression (i.e., if the regular expression being considered belongs to the set of regular expressions for which matching and submatch extraction may be implemented by using a single pass) by processing a string in a reverse or forward order respectively based on whether M′=rev(close(M)) is deterministic or M″=rev(close(rev(M))) is deterministic. In extracting submatches for an input string to the regular expression, the comparison module thus determines if the input string is in a language described by the regular expression, that is, whether it matches the regular expression. If an input string does not match the regular expression, submatches are not extracted. However, if an input string matches the regular expression, the output from the processing of the input string (i.e., the input string as processed by the comparison module) may be used to extract submatches by an extraction module. In this manner, the regular expression may be matched to many different input strings and submatches may be extracted from those input strings that match the regular expression.
According to an example, the one pass submatch extraction system may include a memory storing machine readable instructions to receive an input string, receive a regular expression with capturing groups, and convert the regular expression with capturing groups into a finite automaton M to extract submatches. The finite automaton M may be evaluated to determine whether the regular expression belongs to a set of regular expressions for which submatch extraction is implemented by using one pass by determining whether the automaton M′=rev(close(M)) is deterministic, and determining whether the automaton M″=rev(close(rev(M))) is deterministic. The input string may be matched to the regular expression if the regular expression belongs to the set of regular expressions for which submatch extraction is implemented by using one pass. The one pass submatch extraction system may include a processor to implement the machine readable instructions.
According to an example, the method for one pass submatch extraction may include receiving an input string, receiving a regular expression with capturing groups, and converting the regular expression with capturing groups into a finite automaton M to extract submatches. The finite automaton M may be evaluated to determine whether the regular expression belongs to a set of regular expressions for which submatch extraction is implemented by using one pass by determining whether the automaton M′=rev(close(M)) is deterministic. The input string may be matched to the regular expression if the regular expression belongs to the set of regular expressions for which submatch extraction is implemented by using one pass.
For the example of the one pass submatch extraction system whose construction is described in detail herein, the syntax of regular expressions with capturing groups and reluctant closure on a fixed finite alphabet Σ, for example the standard ASCII set of characters, is:
E := ε | a | EE | E|E | E* | E*? | (E)t
For the syntax of regular expressions with capturing groups and reluctant closure on a fixed finite alphabet Σ, a stands for an element of the alphabet Σ, ε is the empty string, and the parentheses ( )t indicate the tth capturing group. The one pass submatch extraction system may use this syntax. Other examples of the one pass submatch extraction system may perform one pass submatch extraction for regular expressions written in a syntax that uses different notation to denote one or more of the operators introduced in the foregoing syntax of regular expressions with capturing groups and reluctant closure on a fixed finite alphabet Σ; or that does not include either or both of the operators * or *? in the foregoing syntax of regular expressions with capturing groups and reluctant closure on a fixed finite alphabet Σ; or that includes additional operators, such as, for example, special character codes, character classes, boundary matchers, quotation, etc.
Indices may be used to distinguish the capturing groups within a regular expression. Given a regular expression E containing c capturing groups marked by parentheses, indices 1, 2, . . . c may be assigned to each capturing group in the order of their left parentheses as E is read from left to right. The notation idx(E) may be used to refer to the resulting indexed regular expression. For example, if E=((a)*|b)(ab|b) then idx(E)=((a)2*|b)1(ab|b)3.
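The indexing rule above can be sketched as follows. This is a minimal illustration that assumes every parenthesis in the input marks a capturing group (no escaped or non-capturing parentheses); the function name `idx` mirrors the notation of the text, and the index is written directly after each closing parenthesis, as in the example above.

```python
def idx(expr):
    """Annotate each capturing group with its index.

    Indices are assigned in the order of the left parentheses, read from
    left to right, and emitted right after the matching right parenthesis.
    Assumes every '(' and ')' in expr delimits a capturing group.
    """
    out = []
    open_groups = []   # indices of groups whose ')' has not been seen yet
    counter = 0
    for ch in expr:
        out.append(ch)
        if ch == "(":
            counter += 1
            open_groups.append(counter)
        elif ch == ")":
            out.append(str(open_groups.pop()))
    return "".join(out)


if __name__ == "__main__":
    print(idx("((a)*|b)(ab|b)"))
```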
If X, Y are sets of strings, XY is used to denote {xy: x∈X, y∈Y}, and X|Y to denote X∪Y. If β is a string and B a set of symbols, β|B denotes the string in B* obtained by deleting from β all elements that are not in B. A set of symbols T={St, Et: 1≤t≤c} is introduced, and its elements may be referred to as tags. The tags may be used to encode the start and end of capturing groups. The language L(F) for an indexed regular expression F=idx(E), where E is a regular expression written in the foregoing syntax of regular expressions with capturing groups and reluctant closure on a fixed finite alphabet Σ, is a subset of (Σ∪T)*, defined by L(ε)={ε}, L(a)={a}, L(F1F2)=L(F1)L(F2), L(F1|F2)=L(F1)∪L(F2), L(F*)=L(F*?)=L(F)*, L([F])=L(F), and L((F)t)={StαEt: α∈L(F)}, where ( )t denotes a capturing group with index t. There are standard ways to generalize this definition to other commonly-used regular expression operators, so that it can be applied to cases where the regular expression E is written in a commonly-used regular expression syntax different from the foregoing syntax of regular expressions with capturing groups and reluctant closure on a fixed finite alphabet Σ.
A valid assignment of submatches for regular expression E with capturing groups indexed by {1, 2, . . . c} and input string α is a map sub: {1, 2, . . . c}→Σ*∪{NULL} such that there exists β∈L(E) satisfying the following three conditions:
(i) β|Σ=α;
(ii) if St occurs in β then sub(t)=βt|Σ where βt is the substring of β between the last occurrence of St and the last occurrence of Et; and
(iii) if St does not occur in β then sub(t)=NULL.
If α∈Σ*, α matches E if and only if α=β|Σ for some β∈L(E). For a regular expression without capturing groups, this coincides with the standard definition of the set of strings matching the expression. By definition, if there is a valid assignment of submatches for E and α, then α matches E. It may be proved by structural induction on E that the converse is also true, that is, whenever α matches E, there is at least one valid assignment of submatches for E and α. The one pass submatch extraction system may take as input a regular expression and an input string, and output a valid assignment of submatches to the capturing groups of the regular expression if there is a valid assignment, or report that the string does not match if there is no valid assignment.
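Conditions (i) to (iii) can be read as a small procedure on a tagged string β. The sketch below is an illustration only: it assumes β is given as a Python list of symbols in which tags are spelled "S1", "E1", and so on, and the names `restrict` and `submatches` are introduced here for the example.

```python
def restrict(beta, keep):
    """Return beta|keep: the symbols of beta that belong to keep, in order."""
    return [s for s in beta if s in keep]


def last_index(beta, symbol):
    """Index of the last occurrence of symbol in beta."""
    return len(beta) - 1 - beta[::-1].index(symbol)


def submatches(beta, sigma, c):
    """Map each group index t in 1..c to sub(t), with None standing for NULL."""
    sub = {}
    for t in range(1, c + 1):
        start, end = f"S{t}", f"E{t}"
        if start not in beta:
            sub[t] = None                      # condition (iii)
            continue
        between = beta[last_index(beta, start) + 1:last_index(beta, end)]
        sub[t] = "".join(restrict(between, sigma))   # condition (ii)
    return sub


if __name__ == "__main__":
    sigma = {"a", "b"}
    # Two tagged strings in L(idx(E)) for E=((a)*|b)(ab|b), both with beta|sigma = "aab".
    beta1 = ["S1", "S2", "a", "E2", "S2", "a", "E2", "E1", "S3", "b", "E3"]
    beta2 = ["S1", "S2", "a", "E2", "E1", "S3", "a", "b", "E3"]
    for beta in (beta1, beta2):
        print("".join(restrict(beta, sigma)), submatches(beta, sigma, 3))
```

Both tagged strings restrict to the same input "aab" (condition (i)) yet yield different maps, which shows that a matching string may have more than one valid assignment of submatches.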
The difference between the operators * and *? is not apparent in the set of valid assignments of submatches, but is apparent in which of these valid assignments is reported.
The modules 101-107, and other components of the system 100 that perform various other functions in the system 100, may include machine readable instructions stored on a non-transitory computer readable medium. In addition, or alternatively, the modules 101-107, and other components of the system 100 may include hardware or a combination of machine readable instructions and hardware.
The components of the system 100 are described in further detail with reference to
The automaton M may be considered as a directed graph. If x is any directed path in M, ls(x) denotes its label sequence. Let π: Q1×Q1→T* be a mapping from a pair of states to a sequence of tags defined as follows. For any two states q, p∈Q1, consider a depth-first search of the graph of M, beginning at q and searching for p, using only transitions with labels from T∪{+, −}, and such that at any state with outgoing transitions labeled ‘+’ and ‘−’, the search explores all states reachable via the transition labeled ‘+’ before following the transition labeled ‘−’. If this search succeeds in finding successful search path λ(q, p), then π(q, p)=ls(λ(q, p))|T is the sequence of tags along this path. If the search fails, then π(q, p) is undefined. π(p, p) is defined to be the empty string. It can be shown that this description of the search uniquely specifies λ(q, p), if it exists.
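The prioritized depth-first search that defines π can be sketched as below. The graph encoding (a dictionary from a state to a list of (label, next state) pairs), the spelling of the tags, and the small example graph are all assumptions made for illustration; the sketch returns None where π(q, p) is undefined.

```python
def pi(transitions, tags, q, p):
    """Return the tag sequence pi(q, p) as a list, or None if undefined."""
    if q == p:
        return []                      # pi(p, p) is the empty string
    visited = set()

    def dfs(state, labels):
        if state == p:
            return labels
        visited.add(state)
        # Only tag transitions and the priority labels '+' and '-' may be used.
        edges = [e for e in transitions.get(state, [])
                 if e[0] in tags or e[0] in ("+", "-")]
        # Stable sort that moves '-' last, so '+' is fully explored first.
        edges.sort(key=lambda e: e[0] == "-")
        for label, nxt in edges:
            if nxt in visited:
                continue
            found = dfs(nxt, labels + [label])
            if found is not None:
                return found
        return None

    path_labels = dfs(q, [])
    if path_labels is None:
        return None
    return [label for label in path_labels if label in tags]   # ls(path)|T


if __name__ == "__main__":
    # Hypothetical fragment: state 0 branches on '+' and '-', both reach state 4.
    graph = {0: [("-", 2), ("+", 1)], 1: [("S1", 4)], 2: [("E1", 4)], 4: [("a", 5)]}
    tags = {"S1", "E1"}
    print(pi(graph, tags, 0, 4))   # the '+' branch is found first
    print(pi(graph, tags, 0, 5))   # 'a' is not a usable label, so undefined
```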
In order for the automaton generation module 102 to generate the automaton M, as described above, the syntax of regular expressions with capturing groups and reluctant closure on a fixed finite alphabet Σ, for example the standard ASCII set of characters, is:
E := ε | a | EE | E|E | E* | E*? | (E)t
The automaton generation module 102 may use the rules of
(Σ,Q,Δ,S,F),
where
Σ=A∪E∪{St,Et: t∈T},
E={+,−}, and the set T indexes the capturing groups of the regular expression.
The rev and close operations are defined as follows.
With respect to the rev operation, the notation reverse(α) may be used for the reverse of a string α, such that if α=a1.a2 . . . an, then reverse(α)=an.an−1 . . . a1. The automaton M may be specified by the tuple:
(Σ,Q,Δ,S,F),
where Σ is the input alphabet, Q is the set of states, Δ is the set of transitions, S is the set of initial states, and F is the set of final states, and either Δ⊂Q×Σ×Q (so that the automaton has no outputs) or Δ⊂Q×Σ×Q×C* for some alphabet C of output characters (so that the outputs of the automaton M are strings over C). For the rev operation, rev(M) is an automaton that matches a string α if and only if M matches reverse(α). For the rev operation, rev(M) is specified by the tuple:
(Σ,Q,r(Δ),F,S),
where r(Δ)={(p,σ,q): (q,σ,p)∈Δ} if Δ⊂Q×Σ×Q, and
r(Δ)={(p,σ,q,reverse(γ)): (q,σ,p,γ)∈Δ} if Δ⊂Q×Σ×Q×C*.
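The rev operation translates almost directly into code. The following sketch assumes an automaton is stored as a frozen dataclass mirroring the tuple (Σ, Q, Δ, S, F), with each transition a 3-tuple (no output) or a 4-tuple whose last component is a tuple of output symbols; the class and field names are illustrative choices, not part of the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Automaton:
    alphabet: frozenset      # Sigma
    states: frozenset        # Q
    transitions: frozenset   # Delta: (p, sigma, q) or (p, sigma, q, gamma)
    initial: frozenset       # S
    final: frozenset         # F


def rev(m):
    """Return rev(M): flip every transition and swap initial and final states."""
    flipped = set()
    for transition in m.transitions:
        if len(transition) == 3:
            q, sigma, p = transition
            flipped.add((p, sigma, q))
        else:
            q, sigma, p, gamma = transition
            flipped.add((p, sigma, q, tuple(reversed(gamma))))
    return Automaton(m.alphabet, m.states, frozenset(flipped), m.final, m.initial)


if __name__ == "__main__":
    m = Automaton(
        alphabet=frozenset({"a", "b"}),
        states=frozenset({0, 1, 2}),
        transitions=frozenset({(0, "a", 1, ("a", "E1")), (1, "b", 2, ("b",))}),
        initial=frozenset({0}),
        final=frozenset({2}),
    )
    print(sorted(rev(m).transitions))
```

Applying `rev` twice returns the original automaton, which is a convenient sanity check on the definition.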
With respect to the close operation, the automaton M may be specified by the tuple:
(Σ,Q,Δ,S,F),
where Σ is the input alphabet, Q is the set of states, Δ⊂Q×Σ×Q is the set of transitions, S is the set of initial states, and F is the set of final states. For the close operation, close(M) is an automaton for which transitions in close(M) correspond to paths in the automaton M. The definition of close(M) is relative to two particular subsets A, E of Σ, and uses a new label I not in Σ and a new state q0 not in Q. For the close operation, A, E, I and q0 are fixed. For p, q∈Q and γ∈Σ*, p →γ q may be written to mean that there is a path of transitions in Δ leading from p to q whose labels, read in order, form the string γ. The automaton close(M) is then specified by the tuple:
(A∪{I},Q′,Δ′,{q0},F),
where Q′={q0}∪{p∈Q: (p,σ,q)∈Δ for some σ∈A, q∈Q}∪F, and Δ′⊂Q′×(A∪{I})×Q′×(Σ∪{I})* is the set:
{(q0, I, q, I.γ): q∈Q′, γ∈(Σ/A)*, ∃ p∈S such that p →γ q}
∪{(p, σ, q, σ.γ): p, q∈Q′, σ∈A, γ∈(Σ/A)*, p →σ.γ q}
With respect to whether an automaton is deterministic, if M=(Σ, Q, Δ, S, F) is an automaton such that Δ⊂Q×Σ×Q×C* and |S|=1, then the automaton M is deterministic if the start state and input of a transition uniquely determine the end state and output. Specifically, the automaton M is deterministic if and only if
(p, σ, q1, γ1), (p, σ, q2, γ2)∈Δ implies q1=q2 and γ1=γ2.
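The determinism condition can be checked in a single scan over the transitions. In this sketch a transition is assumed to be a 4-tuple (p, σ, q, γ) with hashable components, and the function name is illustrative. Returning False when there is not exactly one initial state is a choice made for this sketch, since the definition above only covers the case |S|=1.

```python
def is_deterministic(transitions, initial_states):
    """Check that (start state, input) determines (end state, output)."""
    if len(initial_states) != 1:
        return False          # assumption: outside the scope of the definition
    chosen = {}
    for p, sigma, q, gamma in transitions:
        key = (p, sigma)
        if key in chosen and chosen[key] != (q, gamma):
            return False      # two different continuations for the same (p, sigma)
        chosen[key] = (q, gamma)
    return True


if __name__ == "__main__":
    ok = {(0, "a", 1, ("S1", "a")), (1, "b", 2, ("E1", "b"))}
    clash = ok | {(0, "a", 1, ("a",))}   # same (p, sigma, q), different output
    print(is_deterministic(ok, {0}), is_deterministic(clash, {0}))
```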
Based on the foregoing definitions related to the rev and close operations, and based on the foregoing definition of whether an automaton is deterministic, the one pass reverse automaton determination module 106 may determine whether for the automaton M generated by the automaton generation module 102, the automaton M′=rev(close(M)) is deterministic. Further, the one pass forward automaton determination module 107 may determine whether for the automaton M generated by the automaton generation module 102, the automaton M″=rev(close(rev(M))) is deterministic. Thus the one pass reverse automaton determination module 106 and the one pass forward automaton determination module 107 may respectively generate the automata M′=rev(close(M)) and M″=rev(close(rev(M))), and check whether these automata are deterministic.
With respect to the close operation, the close operation introduces a new label I. If the one pass reverse automaton determination module 106 confirms that the automaton M′=rev(close(M)) is deterministic, then in order for the comparison module 104 to determine whether the string α matches the regular expression, the comparison module 104 processes the string reverse(α).I by the automaton M′=rev(close(M)). The processing will terminate with success if and only if the string α matches the regular expression. If the processing terminates with success, then there will be n+1 processing steps, where n is the length of string α. For 1≤i≤n+1, the comparison module 104 writes γi for the string output by step i, and sets γ=reverse(γ1.γ2 . . . γn+1). In order to obtain the submatch of the string α to the tth capturing group of the regular expression, the extraction module 105 finds the substring of γ lying between the last occurrence of St and the last occurrence of Et in γ, and deletes all characters from this substring that are not in A.
If the one pass forward automaton determination module 107 confirms that the automaton M″=rev(close(rev(M))) is deterministic, then in order for the comparison module 104 to determine whether the string α matches the regular expression, the comparison module 104 processes the string α.I by the automaton M″=rev(close(rev(M))). The processing will terminate with success if and only if the string α matches the regular expression. If the processing terminates with success, then there will be n+1 processing steps, where n is the length of string α. For 1≤i≤n+1, the comparison module 104 writes γi for the string output by step i, and sets γ=γ1.γ2 . . . γn+1. In order to obtain the submatch of the string α to the tth capturing group of the regular expression, the extraction module 105 finds the substring of γ lying between the last occurrence of St and the last occurrence of Et in γ, and deletes all characters from this substring that are not in A.
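The two processing orders can be illustrated with a small table-driven runner. The transition tables below are written by hand for the regular expression (a)b and are hypothetical: they were not produced by the rev and close constructions, and the encoding (a dictionary from (state, symbol) to (next state, output list)) is an assumption of this sketch.

```python
END = "I"   # the extra label introduced by the close operation


def run(table, start, finals, symbols):
    """Feed symbols to a deterministic automaton; return the concatenated
    outputs gamma1.gamma2..., or None if processing does not succeed."""
    state, outputs = start, []
    for symbol in symbols:
        step = table.get((state, symbol))
        if step is None:
            return None
        state, emitted = step
        outputs.extend(emitted)
    return outputs if state in finals else None


# Hypothetical one pass forward and reverse tables for (a)b.
FORWARD = {(0, "a"): (1, ["S1", "a"]), (1, "b"): (2, ["E1", "b"]), (2, END): (3, [END])}
REVERSE = {(0, "b"): (1, ["b"]), (1, "a"): (2, ["E1", "a"]), (2, END): (3, ["S1", END])}


def match_forward(alpha):
    """Process alpha.I in forward order; gamma is the outputs as emitted."""
    return run(FORWARD, 0, {3}, list(alpha) + [END])


def match_reverse(alpha):
    """Process reverse(alpha).I; gamma is the reverse of the emitted outputs."""
    outputs = run(REVERSE, 0, {3}, list(reversed(alpha)) + [END])
    return None if outputs is None else list(reversed(outputs))


if __name__ == "__main__":
    print(match_forward("ab"))
    print(match_reverse("ab"))
    print(match_forward("ba"))
```

In both results the symbols between the last S1 and the last E1 are just "a", which is the submatch for the first capturing group; the two values of γ differ only in where the label I appears.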
According to another example, the comparison module 104 may process a string a1 a2 . . . al in reverse order with a one pass reverse automaton (i.e., M′=rev(close(M))). The submatch boundaries are determined by the tags Si and Ei. If a tag occurs on a transition corresponding to aj, the boundary is defined to be between positions j and j+1. For example, when processing the string abc=x, the tag E1 occurs while processing the character c. Since c is the 3rd character, the tag E1 indicates that the submatch ends between the 3rd and 4th characters.
Submatch extraction for a variety of regular expressions may be implemented by a one-pass reverse automaton (i.e., the one pass reverse automaton determination module 106 confirms that the automaton M′=rev(close(M)) is deterministic) which contain no closure operations, or contain exactly one closure operation at the end of the regular expression. Examples of such regular expressions that may be used in a practical application are as follows:
(\S+?) peers exist on IIDB (\S+?)\.
State machine return code: (\S+?), (\S+?)
Submatch extraction for the foregoing regular expressions may be implemented by a one-pass reverse automaton (i.e., M′=rev(close(M))).
According to another example, the comparison module 104 may process a string a1a2 . . . al in forward order with a one pass forward automaton (i.e., M″=rev(close(rev(M)))). If a tag occurs on a transition corresponding to aj, then the boundary is defined to be between positions j−1 and j. For example, when processing the string x=def, the tag S1 occurs while processing the character d. Since d is the 3rd character, the tag S1 indicates that the submatch starts between the 2nd and 3rd characters.
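The two boundary conventions (between positions j and j+1 for reverse processing, between positions j−1 and j for forward processing) can be restated in terms of string slicing. In this sketch, positions are 1-based as in the text, a boundary between positions k and k+1 corresponds to the Python slice index k, and the function name `boundary_index` is an assumption introduced for the example.

```python
def boundary_index(j, order):
    """Slice index of the boundary for a tag seen while processing the
    j-th character (1-based) in the given processing order."""
    if order == "reverse":
        return j          # boundary between positions j and j+1
    if order == "forward":
        return j - 1      # boundary between positions j-1 and j
    raise ValueError("order must be 'reverse' or 'forward'")


if __name__ == "__main__":
    # Reverse order: E1 seen on 'c', the 3rd character of "abc=x".
    end = boundary_index(3, "reverse")
    print("abc=x"[:end])
    # Forward order: S1 seen on 'd', the 3rd character of "x=def".
    start = boundary_index(3, "forward")
    print("x=def"[start:])
```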
Submatch extraction for a variety of regular expressions may be implemented by a one-pass forward automaton (i.e., the one pass forward automaton determination module 107 confirms that the automaton M″=rev(close(rev(M))) is deterministic) which contain no closure operations, or contain exactly one closure operation at the end of the regular expression. Examples of such regular expressions that may be used in a practical application are as follows:
Interface (\S+?) is down\.?
Unexpected event (\S+?) (\S+?)
Submatch extraction for the foregoing regular expressions may be implemented by a one-pass forward automaton (i.e., M″=rev(close(rev(M)))).
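To show what the example patterns extract, the sketch below runs the four regular expressions listed above with Python's standard `re` module. Python's engine is a backtracking matcher, so this illustrates only the expected submatches, not the one-pass automata themselves; the log lines and the helper name `extract_fields` are invented for the example.

```python
import re

# The four example patterns from the text, unchanged.
PATTERNS = [
    r"(\S+?) peers exist on IIDB (\S+?)\.",
    r"State machine return code: (\S+?), (\S+?)",
    r"Interface (\S+?) is down\.?",
    r"Unexpected event (\S+?) (\S+?)",
]


def extract_fields(line):
    """Return the submatches of the first pattern matching the whole line."""
    for pattern in PATTERNS:
        match = re.fullmatch(pattern, line)
        if match:
            return match.groups()
    return None


if __name__ == "__main__":
    # Hypothetical log lines, made up for this illustration.
    for line in ("12 peers exist on IIDB vrf-blue.",
                 "State machine return code: 5, ok",
                 "Interface eth0 is down.",
                 "Unexpected event link flap"):
        print(extract_fields(line))
```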
At block 202, the example method includes receiving a regular expression.
At block 203, the example method includes converting the regular expression with capturing groups into a finite automaton M to extract submatches. In this example method, the construction of the finite automaton M is described above.
At block 204, the example method includes evaluating the finite automaton M to determine whether the regular expression belongs to a set of regular expressions for which submatch extraction is implemented by using one pass by determining whether the automaton M′=rev(close(M)) is deterministic.
At block 205, the example method includes matching the input string to the regular expression if the regular expression belongs to the set of regular expressions for which submatch extraction is implemented by using one pass.
At block 302, the example method includes receiving a regular expression.
At block 303, the example method includes converting the regular expression with capturing groups into a finite automaton M to extract submatches. In this example method, the construction of the finite automaton M is described above.
At block 304, the example method includes evaluating the finite automaton M to determine whether the regular expression belongs to a set of regular expressions for which submatch extraction is implemented by using one pass. Evaluating the finite automaton M to determine whether the regular expression belongs to a set of regular expressions for which submatch extraction is implemented by using one pass further includes determining whether the automaton M′=rev(close(M)) is deterministic, and determining whether the automaton M″=rev(close(rev(M))) is deterministic.
At block 305, the example method includes matching the input string to the regular expression if the regular expression belongs to the set of regular expressions for which submatch extraction is implemented by using one pass. Matching the input string to the regular expression further includes using the automaton M′=rev(close(M)) to process the input string in a reverse order if M′=rev(close(M)) is deterministic, or using the automaton M″=rev(close(rev(M))) to process the input string in a forward order if M″=rev(close(rev(M))) is deterministic.
At block 306, the example method includes using an output of the processing of the input string to extract submatches if the input string matches the regular expression.
The computer system 400 includes a processor 402 that may implement or execute machine readable instructions performing some or all of the methods, functions and other processes described herein. Commands and data from the processor 402 are communicated over a communication bus 404. The computer system also includes a main memory 406, such as a random access memory (RAM), where the machine readable instructions and data for the processor 402 may reside during runtime, and a secondary data storage 408, which may be non-volatile and stores machine readable instructions and data. The memory and data storage are examples of computer readable mediums. The memory 406 may include a one pass submatch extraction module 420 including machine readable instructions residing in the memory 406 during runtime and executed by the processor 402. The one pass submatch extraction module 420 may include the modules 101-107 of the system shown in
The computer system 400 may include an I/O device 410, such as a keyboard, a mouse, a display, etc. The computer system may include a network interface 412 for connecting to a network. Other known electronic components may be added or substituted in the computer system.
What has been described and illustrated herein is an example along with some of its variations. The terms, descriptions and figures used herein are set forth by way of illustration only and are not meant as limitations. Many variations are possible within the spirit and scope of the subject matter, which is intended to be defined by the following claims—and their equivalents—in which all terms are meant in their broadest reasonable sense unless otherwise indicated.