A regular expression, or regex, is a mechanism used to describe a text pattern. Regular expressions may be used for text searching, for example, to check whether a given text string satisfies a pattern represented by the regular expression. Each character in a regular expression may be, for example, a regular character with a literal meaning, or a metacharacter, wildcard or the like with a special meaning. Together, these can be used to identify textual material in the text string.
The following detailed description references the drawings, wherein:
As mentioned above, a regular expression, or regex, may be used to check whether a given text string satisfies a pattern represented by the regular expression. To perform such a check, the text string may be analyzed with reference to the regular expression. Many regex matching algorithms analyze the text string one character at a time (e.g., from left to right).
To assist in determining whether a given text string satisfies a regular expression, the regular expression may be processed to form a state machine or automaton that represents the regular expression and allows an algorithm to interpret the regular expression in an orderly manner. The automaton may be a graph, for example, where the states of the graph correspond to partial or complete matches to the regular expression. Any regex may be processed to form what is called a non-deterministic finite automaton (NFA). NFAs occupy a small space, only as much space as the regular expression itself. Additionally, NFAs are efficient to compute, e.g., O(m) computing time, where m is the size of the regular expression.
NFAs may be efficient to compute, but they are slow to use to match a regular expression to a text string, for example, because a sub-match between the regex and the string may appear in multiple states of the NFA. In a NFA, a single state may proceed to multiple other states of the NFA, even with the same input. Thus, some matching algorithms may follow all possible sub-matches. Other algorithms may follow a particular sub-match until that branch of the NFA cannot be matched any further. Then, these algorithms may backtrack and pursue the next sub-match. Both kinds of algorithm (following all possible sub-matches, backtracking) are slow to use. For example, the backtracking approach may be exponentially slow in the size of the text string.
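As a non-limiting illustration of following all possible sub-matches, the sketch below simulates a small NFA by tracking the set of states that are active after each character. The transition table, the pattern (a|b)*ab and all names are hypothetical and chosen only for illustration; they do not come from the drawings.

```python
from typing import Dict, Set, Tuple

# Hypothetical NFA for the pattern (a|b)*ab.
# State 0 loops on 'a' and 'b'; on 'a' it may ALSO move to state 1,
# which is the non-determinism discussed above. State 2 is accepting.
TRANSITIONS: Dict[Tuple[int, str], Set[int]] = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
}
START, ACCEPT = 0, 2


def nfa_accepts(text: str) -> bool:
    """Follow every possible sub-match at once by keeping a set of active states."""
    active = {START}
    for ch in text:
        following: Set[int] = set()
        for state in active:
            following |= TRANSITIONS.get((state, ch), set())
        active = following
        if not active:  # no sub-match survives this character
            return False
    return ACCEPT in active


print(nfa_accepts("aab"))  # True
print(nfa_accepts("abb"))  # False
```

Each character of the text may touch several NFA states, which is one reason this style of matching is slower than a deterministic walk.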
Instead of directly using a NFA to match a regex to a text string, some algorithms first create a deterministic finite automaton (DFA), e.g., based on the NFA. The DFA may be similar to the NFA, except in a DFA, a single state proceeds deterministically to a different state given a particular input. Any state on such a DFA corresponds to exactly one sub-match; thus, DFAs are fast to run. However, a large amount of up-front processing may be required to create the DFA, and DFAs may be very large. For example, their size may be exponential in the size of the NFA.
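One well-known way to create a DFA from a NFA is the subset construction, in which each DFA state is a set of NFA states. The sketch below is illustrative only: the NFA, the pattern (a|b)*ab and the function name are assumptions, and the disclosure does not state which construction a given algorithm uses.

```python
from collections import deque
from typing import Dict, FrozenSet, Set, Tuple

# Hypothetical NFA for (a|b)*ab, used only as an example input.
TRANSITIONS: Dict[Tuple[int, str], Set[int]] = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
}


def subset_construction(transitions, start: int, alphabet: str):
    """Build the full DFA up front; every DFA state is a frozenset of NFA states."""
    start_state: FrozenSet[int] = frozenset({start})
    dfa: Dict[Tuple[FrozenSet[int], str], FrozenSet[int]] = {}
    seen = {start_state}
    queue = deque([start_state])
    while queue:
        current = queue.popleft()
        for ch in alphabet:
            target = frozenset(
                t for s in current for t in transitions.get((s, ch), ())
            )
            dfa[(current, ch)] = target  # exactly one target per (state, input)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return dfa, seen


dfa, states = subset_construction(TRANSITIONS, 0, "ab")
print(len(states))  # 3
```

For this small NFA only three DFA states are reachable, but in the worst case the number of subsets grows exponentially with the number of NFA states, which is the size concern noted above.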
Instead of using a NFA only or a DFA, some algorithms may use a hybrid NFA/DFA approach. Such approaches tend to start by processing the NFA, and then may compute DFA type data structures (e.g., representing a portion of the NFA) during the matching routine. In some algorithms, DFA parts may be constructed and/or discarded on the fly (during the matching routine), e.g., depending on the input (i.e., characters of the text string). Some hybrid regex matching algorithms are very inefficient, for example, with regard to the information they compute and/or store at each state of the NFA. For example, some algorithms save a whole DFA suffix tree at each node of the NFA. This makes such algorithms prohibitively expensive both from a memory standpoint and a performance standpoint. In real-world use, the amount of memory (e.g., in an L2 cache) that a data structure occupies to perform regex matching is a real concern. In theory, some algorithms may work from a functional standpoint. A goal of this disclosure is to describe an approach that is efficient in real-world use. One example real-world use for which an approach such as the one described herein may be used is hardware accelerators. For hardware accelerators, the efficiency of the regex matching approach is very important.
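The on-the-fly computation of DFA-type data described in this paragraph can be illustrated with a generic lazy construction that only computes, and then caches, the DFA transitions that the input actually exercises. This is a sketch of the general hybrid idea, not of any particular prior algorithm or of the segments DFA; the class name, the NFA and the pattern (a|b)*ab are hypothetical.

```python
from typing import Dict, FrozenSet, Set, Tuple

# Hypothetical NFA for (a|b)*ab, used only as an example input.
TRANSITIONS: Dict[Tuple[int, str], Set[int]] = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
}


class LazyDFA:
    """Compute DFA transitions during matching and keep them for reuse."""

    def __init__(self, transitions, start: int, accept: int) -> None:
        self.transitions = transitions
        self.start: FrozenSet[int] = frozenset({start})
        self.accept = accept
        # (dfa_state, character) -> dfa_state, filled in on demand
        self.cache: Dict[Tuple[FrozenSet[int], str], FrozenSet[int]] = {}

    def step(self, state: FrozenSet[int], ch: str) -> FrozenSet[int]:
        key = (state, ch)
        if key not in self.cache:  # build this DFA part only when needed
            self.cache[key] = frozenset(
                t for s in state for t in self.transitions.get((s, ch), ())
            )
        return self.cache[key]

    def matches(self, text: str) -> bool:
        state = self.start
        for ch in text:
            state = self.step(state, ch)
        return self.accept in state


matcher = LazyDFA(TRANSITIONS, start=0, accept=2)
print(matcher.matches("aab"), len(matcher.cache))  # True 3
```

The memory cost of such a scheme depends on what is stored per cached state, which is the efficiency concern raised above.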
The present disclosure describes regular expression matching, for example, between a text string and a regex. In particular, the present disclosure describes an approach that allows multiple characters of a text string to be skipped during the regex matching routine in various circumstances. This character skipping vastly accelerates the matching routine compared to algorithms that analyze a text string one character at a time. The amount of acceleration may be equivalent to the average number of characters that are skipped. The approach of the present disclosure is also faster and more efficient than algorithms that directly use a NFA and algorithms that use a full DFA. The approach of the present disclosure may be considered to be a hybrid NFA/DFA approach; however, this approach is more efficient than other hybrid NFA/DFA algorithms (e.g., those that compute far too much DFA information at each state of the NFA).
In the present disclosure, a data structure referred to as a “segments DFA” may be generated based on a non-deterministic finite automaton (NFA) that represents a regular expression. The data structure may include a set of segments where each segment may indicate a segment starting state of the NFA. Furthermore, each segment may represent zero or more consecutive states of the NFA starting at the segment starting state. Each segment may represent a partial match of the regular expression to the string. Then, while the string is analyzed in relation to the NFA, the data structure may be modified. Such modification may include attempting to expand at least one of the segments in the set to represent additional states of the NFA. Thus, instead of starting the analysis of the NFA from the beginning and advancing to the end (like various other regex matching approaches), an approach of the present disclosure may start at multiple nodes of the NFA (even all nodes in some circumstances) and attempt to extend segments that match the string.
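A minimal sketch of a set of segments is given below. It assumes, for illustration only, a linear NFA for a literal pattern in which NFA state i consumes pattern[i]; a segment is a half-open pair of state indices, and one zero-length segment is started at every state. The names Segment, initial_segments and expand_right are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class Segment:
    """Half-open run of consecutive NFA states: start is included, end is not."""
    start: int
    end: int

    def __str__(self) -> str:
        return f"[{self.start},{self.end})"


def initial_segments(num_states: int) -> Set[Segment]:
    """Start one zero-length segment at every NFA state."""
    return {Segment(i, i) for i in range(num_states)}


def expand_right(segments: Set[Segment], pattern: str, ch: str) -> Set[Segment]:
    """Try to grow each segment by one state; keep only segments that match ch."""
    grown = set()
    for seg in segments:
        if seg.end < len(pattern) and pattern[seg.end] == ch:
            grown.add(Segment(seg.start, seg.end + 1))
    return grown


pattern = "bca"
segments = initial_segments(len(pattern))   # [0,0), [1,1), [2,2)
segments = expand_right(segments, pattern, "c")
print(sorted(str(s) for s in segments))     # ['[1,2)']
```

Because every state starts its own segment, a character in the middle of the pattern can begin a partial match without the earlier characters having been examined first.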
Regex 102 may be stored (e.g., temporarily) on system 100, for example, in a repository of system 100 as described above. Regex 102 may have been received by system 100, for example, in response to input from a user or other system. Alternatively, regex 102 may have been generated by system 100, e.g., in response to some signal or stimulus. Text string 104 may be stored (e.g., temporarily) on system 100, for example, in a repository of system 100 as described above. Text string 104 may have been received by system 100, for example, in response to input from a user or other system. Alternatively, text string 104 may have been generated by system 100, e.g., in response to some signal or stimulus. In some examples, text string 104 may be part of a larger set of information (e.g., a text document or the like), which may be stored on system 100 or external to system 100.
NFA 106 may be stored (e.g., temporarily) on system 100, for example, in a repository of system 100 as described above. NFA 106 may be updated at various times. NFA 106 may result from processing (e.g., by regex matcher 110 or some other component of system 100) regex 102. As described above, any regex may be processed to form a non-deterministic finite automaton (NFA). In some examples, NFA 106 may have been received by system 100 in a post-processed format, e.g., from another system that processed regex 102. In such examples, regex 102 may be stored on that other system and may not be stored on system 100. NFA 106 may include NFA auxiliary information 108, which may be generated by regex matcher 110, e.g., after analyzing NFA 106. In some examples, NFA auxiliary information 108 may be stored alongside NFA 106 instead of being stored as part of NFA 106.
NFA auxiliary information 108 may be determined (e.g., by regex matcher 110) up front or prior to the regex matching routine. NFA auxiliary information 108 may be determined by processing (e.g., by regex matcher 110) NFA 106. NFA auxiliary information 108 may include information that is stored for each state of NFA 106, and this information may be used by regex matcher 110 during the regex matching routine, e.g., to progress through the NFA during the regex matching routine and to determine jumps efficiently.
In some examples, NFA auxiliary information 108 includes, for each NFA state, information (e.g., a table) regarding the closest .* states of the NFA. As mentioned above, regular expressions may include metacharacters, wildcards and the like, which have special meanings. One such metacharacter/wildcard combination is .*, as can be seen in the example of
Returning to NFA auxiliary information 108, maintaining, for each NFA state, information regarding the closest .* states of the NFA may allow for determining (during regex matching) how long of a jump can be made on the string. As will become clear from the descriptions and examples below, a “jump” may refer to a situation during the regex matching routine where multiple characters of the text string may be skipped. A goal of the present approach is to jump as far as possible on the text string without hitting a .* state. Determining the closest .* states up front prevents these determinations from having to be computed on the fly during regex matching. NFA auxiliary information 108 may also include, for each NFA state, information about which other NFA states are reachable (i.e., via a series of consecutive NFA states, otherwise known as a path) from the current state. This closest .* state information and path information can be determined based on the NFA 106 without having to start (i.e., prior to) the regex matching routine. This information may be computed and stored (per state) up front with the idea that it may be used if the regex matching routine is currently at the particular NFA state and no potential matches exist (e.g., between the text string and at least one segment of the segments DFA 112), as will be described in more detail with the examples that follow below. NFA auxiliary information 108 may also include, for each node, information about the minimal length of the NFA portion used for matching. This may allow for efficient jumping, as the amount of jump may be equivalent to this minimal length (e.g., minus some characters if, for example, the currently analyzed character of the text string matches a character in the middle of the portion of the NFA used for matching).
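One possible way to precompute such per-state tables is sketched below. It assumes, purely for illustration, that the regex has already been split into a flat list of tokens in which ".*" is a single token and every other token is one literal character, so that each token corresponds to one NFA state. The table layout and the name auxiliary_tables are hypothetical; the disclosure does not prescribe a format.

```python
from typing import List, Optional, Tuple

DOT_STAR = ".*"


def auxiliary_tables(
    tokens: List[str],
) -> Tuple[List[Optional[int]], List[Optional[int]], List[int]]:
    """For each state, return the closest .* at or to the left, the closest .*
    at or to the right, and the length of the literal portion containing it."""
    n = len(tokens)
    left: List[Optional[int]] = [None] * n
    right: List[Optional[int]] = [None] * n

    last: Optional[int] = None
    for i, tok in enumerate(tokens):           # scan left to right
        if tok == DOT_STAR:
            last = i
        left[i] = last

    nxt: Optional[int] = None
    for i in range(n - 1, -1, -1):             # scan right to left
        if tokens[i] == DOT_STAR:
            nxt = i
        right[i] = nxt

    min_len: List[int] = []
    for i, tok in enumerate(tokens):
        if tok == DOT_STAR:
            min_len.append(0)                  # a .* state has no fixed length
        else:
            lo = -1 if left[i] is None else left[i]
            hi = n if right[i] is None else right[i]
            min_len.append(hi - lo - 1)        # literal states between the .* states
    return left, right, min_len


left, right, min_len = auxiliary_tables([".*", "b", "c", "a", ".*", "d"])
print(min_len)  # [0, 3, 3, 3, 0, 1]
```

With tables of this kind available before matching starts, the jump amount at a given state can be read directly rather than recomputed for each character.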
NFA auxiliary information 108 may include various other pieces of auxiliary information, and the examples of auxiliary information described herein should not be construed as limiting.
Segments DFA 112 may be stored (e.g., temporarily) on system 100, for example, in a repository of system 100 as described above. Segments DFA 112 may be updated at various times, for example, during the regex matching routine. More particularly, segments DFA 112 may be updated as various characters of the text string (e.g., 104) are analyzed. Segments DFA 112 may be initially generated (e.g., by regex matcher 110) based on NFA 106, and then may be updated (e.g., by adding states and/or updating the DFA auxiliary information of a state) as regex matcher 110 progresses through the regex matching routine.
Segments DFA 112 may be a data structure of sorts that maintains information about NFA 106. The term “segments DFA” includes the acronym DFA because this data structure serves a similar purpose to a full DFA in that it allows for deterministic progression through a NFA. However, segments DFA 112 maintains a minimal amount of information in a compact manner such that a large amount of memory is not required to perform the regex matching routine, unlike routines that use a full DFA or other hybrid NFA/DFA approaches. Segments DFA 112 may include a number of DFA states (e.g., DFA state 114, etc.) and edges 120.
Segments DFA 112 may include a number of DFA states, for example, DFA state 114. For example, initially, segments DFA 112 may include a single state (e.g., state 114), and then regex matcher 110 may create more states during the regex matching routine. In this respect, an entire DFA is not generated up front. In fact, minimal DFA information is generated until such information is needed during the regex matching routine. Previously created DFA states may be saved after they are created in case they are needed again during the regex matching routine, which may save computational effort.
Each DFA state (e.g., 114) may include a set of segments (e.g., 116) and DFA auxiliary information (e.g., 118). A “segment” may indicate one or two states on NFA 106 and may represent all consecutive states (i.e., a path) on the NFA between any two indicated states. It may be said that a segment includes a “pair” of states on the NFA 106; however, in some situations, both states of the pair may be the same state. In these situations, the length of the segment is zero. Thus, each segment may represent zero or more consecutive states of the NFA starting at a segment starting state of the NFA. The segment starting state may be represented by the first state (e.g., X) listed in a pair of states (e.g., using the “[X,Y)” notation) for a segment. Each segment may indicate its own segment starting state on the NFA, which may be the same as or different from the segment starting states indicated by other segments. A set of segments associated with a particular DFA state then represents a current “location” on the NFA or matching portions of the NFA. In other words, each segment represents a partial match of the regular expression (e.g., 102) to the string 104; and a set of segments for a particular state represents all the partial matches up to the current point in the regex matching routine. Several example segments will be shown and described in the examples provided below. Various example segments are shown in
Each DFA state (e.g., 114) may include DFA auxiliary information (e.g., 118). DFA auxiliary information 118 may be determined (e.g., by regex matcher 110) based on the current state of the regex matching routine and the current location on the NFA. DFA auxiliary information 118 may be used by regex matcher 110 during the regex matching routine, for example, to efficiently process jumps (e.g., when to jump and/or how far). DFA auxiliary information 118 may be updated (e.g., by regex matcher 110) at various times during the regex matching routine, for example, when different characters of the text string are analyzed. Additionally, when regex matcher 110 determines that a new state should be created for the segments DFA 112, a new set of DFA auxiliary information 118 may be generated. This new set of DFA auxiliary information 118 may be based on the DFA auxiliary information of the previous state, and the DFA auxiliary information of the previous state may be stored in case it is needed later during the regex matching routine.
For a particular state (e.g., state 114), DFA auxiliary information 118 may include the current character of the text string that is being analyzed, which may also be referred to as the current “location” on the text string. DFA auxiliary information 118 may also include the identity of any reoccurring wildcard (e.g., * or .*) states that have been passed on the NFA up to the current point in the regex matching routine. An example of this type of auxiliary information is shown in
Edges 120 may be components of the segments DFA 112 data structure that allow for progression from one state of segments DFA 112 to another state. As described above, initially, segments DFA 112 may include only a single state (e.g., state 114). Thus, initially, segments DFA 112 may not include any edges. Then, when regex matcher 110 creates more states during the regex matching routine, regex matcher 110 may also create edges that link the states. Previously created edges may be saved after they are created in case they are needed again during the regex matching routine, which may save computational effort. Example edges are shown in
Edges 120 may include or allow for the creation of various types of edges. In some examples, three types of edges are allowed. A first type of edge (“left” or “left extension”) may indicate that at least one of the segments of the current state of the segments DFA may be extended (i.e., because it matches) to the left on the text string from the current position/character of the text string. A second type of edge (“right” or “right extension”) may indicate that at least one of the segments of the current state of the segments DFA may be extended to the right on the text string from the current position/character of the text string. A third type of edge (“jump”) may indicate that the current position on the text string will be moved right from the rightmost character analyzed thus far. The amount of jump may be one or more characters. The amount of jump may be determined based on auxiliary information (e.g., NFA auxiliary information 108 and/or DFA auxiliary information of the current state of segments DFA 112). For example, as may be described in more detail below, NFA auxiliary information 108 may include the minimal length of the NFA (or NFA portion) based on the current position on the NFA. Thus, if the regex matching routine is at that particular NFA position when there are no matches, the jump amount may be that minimal length.
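The three edge types can be illustrated with a simplified decision function. The sketch assumes a single segment over a literal pattern (NFA state i consumes pattern[i]) and tries the left extension before the right extension; these simplifications, and the names Edge and choose_edge, are assumptions made only for this example.

```python
from enum import Enum
from typing import Optional


class Edge(Enum):
    LEFT = "left extension"
    RIGHT = "right extension"
    JUMP = "jump"


def choose_edge(
    pattern: str, text: str, seg_start: int, seg_end: int, text_lo: int
) -> Optional[Edge]:
    """Segment [seg_start, seg_end) is assumed to be matched against
    text[text_lo : text_lo + (seg_end - seg_start)].
    Returns the next edge to follow, or None when the pattern is fully matched."""
    text_hi = text_lo + (seg_end - seg_start)
    if seg_start > 0:                          # left side not yet at the pattern start
        if text_lo > 0 and text[text_lo - 1] == pattern[seg_start - 1]:
            return Edge.LEFT
        return Edge.JUMP                       # left extension failed: no match here
    if seg_end < len(pattern):                 # right side not yet at the pattern end
        if text_hi < len(text) and text[text_hi] == pattern[seg_end]:
            return Edge.RIGHT
        return Edge.JUMP                       # right extension failed: no match here
    return None                                # segment covers the whole pattern


# 'c' at text index 3 is matched as segment [1,2) of "bca".
print(choose_edge("bca", "xxbcax", 1, 2, 3))   # Edge.LEFT
```

In the full approach the amount of a jump would come from the auxiliary information discussed above; this sketch only classifies which kind of edge applies.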
Jumps may be performed in various situations during the regex matching routine. In one example situation where a jump may occur, all of the segments of the current segments DFA state reach a point where they cannot be extended (i.e., matched) further (right or left) and yet the segments have not been extended to terminal points in the regex/NFA (e.g., beginning, end, .* etc.). This situation may generally be referred to as “not matching” or “no match.” In another example situation where a jump may occur, the left sides of all the segments are “matched” or “glued” to a terminal point. Such a terminal point may be the start of the regex/NFA, one state to the right of a previously matched portion of the NFA or a .* (or other reoccurring wildcard) state. Various other details regarding jumps will become clear with reference to the various examples described below.
Regex matcher 110 may handle various aspects of the regex matching routine. The term “regex matching routine” may generally refer to the routine of determining whether a regular expression (e.g., 102, perhaps represented by NFA 106) matches a text string (e.g., 104). Regex matcher 110 may handle various aspects of preparation before the regex matching routine as well, for example, generating NFA 106 based on regex 102, and generating an initial segments DFA 112 based on NFA 106. Regex matcher 110 may include electronic circuitry (i.e., hardware) that implements the functionality of regex matcher 110 as described herein. Alternatively or in addition, regex matcher 110 may include instructions (e.g., stored on a machine-readable storage medium of system 100) that, when executed (e.g., by a processor of system 100), implement the functionality of regex matcher 110 as described herein. Regex matcher 110 may communicate with at least one repository or data store of system 100 (described above), that may store digital information representing at least one of regex 102, text string 104, NFA 106 and segments DFA 112. Regex matcher 110 may read such digital information and/or may modify such digital information (e.g., during the regex matching routine).
With regard to segments [1,1) and [2,2) of state 204, the ‘[’ notation refers to a NFA state that is “included” in the segment and the ‘)’ notation refers to a NFA state that is not included (although the range of the segment extends up to that state). When both states indicated by a segment are the same (e.g., [1,1)), the “[X,X)” notation indicates that the ‘X’ node is not included, but that the segment will attempt to match/extend starting at that node.
Referring again to
Referring again to initial state 204, this state includes additional DFA auxiliary information indicated by the “skip={ }” notation. This information indicates the identity of any reoccurring wildcard (e.g., * or .*) states that have been passed on the NFA up to the current point in the regex matching routine. As mentioned above, node 1 of NFA 202 is a .* node, and because this first .* node is satisfied immediately in the regex matching routine, state 204 shows that the .* node (node 1) has been passed, by the “skip={1}” notation. Keeping track of this information allows for tracking of when the left side of some segment in the set is fully matched (e.g., at a .* node), which means that a jump may be performed as soon as the right side is fully matched. In the example of
What follows is a brief explanation of how the segments DFA 210 of
At state 208, a new set of segments is initiated for each of the nodes in the portion of the NFA being used for matching. The “skip” bracket is also updated because the .* node 2 was passed. At this position on the text string, it may be checked whether the current text character is any of the characters from the current set of segments (characters ‘b’, ‘c’, ‘a’). If none of the characters match, the current position on the text string may be jumped by 3. If any of the characters match, it may be determined whether left or right extensions are required. As one example, if the current character matches ‘c’ (e.g., a left ‘c’ extension), then the segments DFA 210 may move to state 216, and the segments may be updated. Then, because ‘c’ falls in the middle of “bca”, a left ‘b’ extension may be attempted. If that results in a match, the segments DFA 210 may move to state 218. Then, again because ‘c’ falls in the middle of “bca”, a right ‘a’ extension may be attempted. If that results in a match, the segments DFA 210 may move to state 220, and the match between the text string and the regex may be complete. If the left ‘b’ or right ‘a’ extension fails, the segments DFA may return to state 208 via a jump edge, where the jump value depends on which extension fails. Similar subroutines may be performed for a left ‘a’ edge from state 208 to state 210 and for a left ‘b’ edge from state 208 to state 222.
Continuing with the example of
Thus, the regex matching routine of the example of
Finally, after one more jump, a ‘b’ character (236) is detected in the text string (shown generally by reference number 252). Because the ‘b’ in the text string matches the ‘b’ in the current regex portion (“bca”), a right extension is attempted to see if the next character (238) to the right in the text string is a ‘c’. It is, so another right extension is attempted to see if the next character (240) to the right in the text string is an ‘a’. It is, and then the entire regex 200 is matched in text string 232.
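The probe, extend and jump behavior walked through above can be approximated for a purely literal pattern such as “bca” with the sketch below. It is a simplification under stated assumptions: there are no wildcards, the left and right extensions are collapsed into a single comparison of the candidate alignment, and every jump equals the pattern length (whereas, as noted above, the jump value may depend on which extension fails). The function name and the sample text are hypothetical.

```python
from typing import Tuple


def probe_and_extend(text: str, pattern: str) -> Tuple[int, int]:
    """Return (start index of the leftmost match or -1, number of probes made)."""
    m = len(pattern)
    if m == 0:
        return 0, 0
    probes = 0
    pos = m - 1                                # first probe: last slot of the first window
    while pos < len(text):
        probes += 1
        ch = text[pos]
        # Try every pattern state that matches the probed character,
        # rightmost state first so the leftmost alignment is checked first.
        for i in range(m - 1, -1, -1):
            if pattern[i] == ch:
                start = pos - i                # where the pattern would begin
                if start >= 0 and text[start:start + m] == pattern:
                    return start, probes
        pos += m                               # nothing overlapping pos matched: jump
    return -1, probes


print(probe_and_extend("zzzzzzzbcaz", "bca"))  # (7, 3)
```

In this sample text only three of the eleven characters are probed before the match is found, which reflects the kind of acceleration that character skipping is intended to provide.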
By comparing the example of
Method 300 may start at step 302 and continue to step 304, where a computing device (e.g., 400 of
Processor 410 may be one or more central processing units (CPUs), microprocessors, and/or other hardware devices suitable for retrieval and execution of instructions stored in machine-readable storage medium 420. In the particular embodiment shown in
Machine-readable storage medium 420 may be any electronic, magnetic, optical, or other physical storage device that stores executable instructions. Thus, machine-readable storage medium 420 may be, for example, Random Access Memory (RAM), an Electrically-Erasable Programmable Read-Only Memory (EEPROM), a storage drive, an optical disc, and the like. Machine-readable storage medium 420 may be disposed within computing device 400, as shown in
Referring to
The at least one repository 510 may store a string 512, a non-deterministic finite automaton (NFA) 514 that represents a regular expression, and a data structure 516 (e.g., a segments DFA) based on the NFA. The data structure may include a set of segments where each segment may indicate a segment starting state of the NFA. Each segment may represent zero or more consecutive states of the NFA starting at the segment starting state. Different segments of the set of segments may be capable of indicating different segment starting states on the NFA. Each segment may represent a partial match of the regular expression to the string. Regex matcher engine 520 may match the regular expression to the string 512. The regex matcher engine 520 may analyze the string 512 in relation to the NFA 514 and modify the data structure 516 as the string is analyzed. Such modification may include attempting to expand at least one of the segments in the set to represent additional states of the NFA.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2013/073249 | 12/5/2013 | WO | 00 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2015/084360 | 6/11/2015 | WO | A |
Number | Date | Country | |
---|---|---|---|
20160275205 A1 | Sep 2016 | US |