1. Field of the Invention
The present invention relates to a pattern matching technique for locating an occurrence of more than one text pattern in a given set of character strings as a subset of character strings.
2. Description of the Related Art
Techniques for locating a specified pattern in input data are essential to information processing, and their applications are diverse. Text search in word processing, DNA analysis in biotechnology and detection of computer viruses in electronic mail are a few of the potential fields of application. In particular, the Aho-Corasick string matching algorithm is the best-known technique for applications in which a plurality of mutually distinct text patterns are to be located (see “Efficient String Matching: An Aid to Bibliographic Search”, A. V. Aho and M. J. Corasick, Communications of the ACM, June 1975, Volume 18, Number 6, pages 333-340). According to the Aho-Corasick algorithm, characters are taken one at a time from the starting point of a text string and matched against a state transition diagram, and each character causes a transition from the current state to a state specified in the diagram.
As an example,
A prior art system that implemented the Aho-Corasick algorithm involves the use of a state transition table having a listing of transitions regarding all states and all characters. Such a state transition table is implemented as shown in
However, with the Aho-Corasick algorithm the amount of memory for implementing the state transition table increases significantly with the increase in the number of types of different characters because of the need to provide entries corresponding in number to the number of all transition states multiplied by all character types.
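As a rough illustration of this growth, the following Python sketch counts the entries of a table that holds one entry per pair of state and character type. The function name and the example figures (nine states, alphabets of 7 and 256 character types) are chosen here for illustration only.

```python
def dense_table_entries(num_states: int, num_char_types: int) -> int:
    """Entries in a full table: one per (state, character type) pair."""
    return num_states * num_char_types


# Same number of states, two different alphabet sizes.
small_alphabet = dense_table_entries(9, 7)
large_alphabet = dense_table_entries(9, 256)
print(small_alphabet, large_alphabet)
```

The entry count scales linearly with the number of character types even when the number of states is unchanged.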
The bitmapped Aho-Corasick algorithm is known as a technique for reducing the amount of memory for implementing a state transition table, as described in an article “Deterministic Memory-Efficient String Matching Algorithms for Intrusion Detection”, N. Tuck, T. Sherwood, B. Calder and G. Varghese, Proceedings of IEEE Infocom Conference [1], 0-7803-8356-7/04, 2004.
However, the bitmapped Aho-Corasick algorithm has a disadvantage in that, as the number of character types increases, the memory size still increases and the amount of calculation grows, with a resultant decrease in the speed of string matching. This is because the calculation involved in a single transition requires that “1-or-0” bit decisions be made repeatedly, on average {(number of character types)−1}/2 times, assuming that every character type occurs equally often in the input character string. If the number of character types is 256, the bit map is 256 bits wide and the “1-or-0” bit decision must be repeated 127.5 times on average for each state transition. This implies that a significant amount of computational resources is consumed. Moreover, since the width of the bit map is equal to the number of different characters, the amount of memory for storing a state transition table increases significantly with the number of different characters, and the speed of string matching decreases accordingly.
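The average of 127.5 bit decisions can be checked with a short Python sketch. It rests on a simplifying model that is an assumption of this example rather than a statement from the cited article: every bit position is equally likely to be the target, and reaching position k requires inspecting the k bits that precede it.

```python
from fractions import Fraction


def average_bit_decisions(num_char_types: int) -> Fraction:
    """Mean number of preceding bits inspected, over all target positions."""
    # Position k (0-based) has k bits before it; average k over all positions.
    total = sum(range(num_char_types))
    return Fraction(total, num_char_types)


print(float(average_bit_decisions(256)))  # 127.5
```

Under this model the result equals (n − 1)/2 for any number n of character types, so the per-transition work grows linearly with the alphabet size.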
It is therefore an object of the present invention to provide a pattern matching apparatus and method that creates a state transition table whose size does not depend on the number of different characters, whereby the speed of making a search for a character pattern is independent of the number of different characters.
According to a first aspect, the present invention provides a pattern matching method for detecting a plurality of character patterns in a string of input characters, comprising (a) creating a state transition table defining a plurality of rows respectively identified by address values, each of the rows containing a reference character, first and second hash functions and first and second address values, (b) receiving a target character from the input characters and determining a hash value by substituting the target character into a previously specified hash function, (c) summing the hash value with a previously specified address value to produce a new address value, (d) comparing the target character with the reference character contained in one of the rows identified by the new address value, and (e) depending on a result of the comparison, specifying one of the first and second hash functions of the identified row and one of the first and second address values of the identified row, and repeating (b) to (d) by using the currently specified hash function instead of the previously specified hash function and the currently specified address value instead of the previously specified address value for detecting the character patterns.
According to a second aspect, the present invention provides a pattern matching method for detecting a plurality of character patterns in a string of input characters, comprising determining a plurality of hash functions and respectively assigning the determined hash functions to transition states in a state transition diagram of the plurality of character patterns, determining a plurality of hash values by respectively substituting a set of characters into the assigned hash functions, sorting the set of characters into a plurality of character groups according to the determined hash values and assigning a unique address value to each of the character groups, dividing each of the character groups into two sub-groups so that one of the sub-groups contains a reference character, determining a next transition state of each of the sub-groups through least state transitions, respectively assigning the unique address values to the next transition states of all sub-groups, the hash functions of the next transition states, and a plurality of pattern numbers which will be detected when one of the sub-groups is reached in a character search, the pattern numbers respectively identifying a plurality of character patterns, storing the hash functions, the pattern numbers and the reference characters into a plurality of rows of a state transition table according to the unique address values, comparing a target character with one of the reference characters contained in one of the rows, selecting one of the two sub-groups of one of the character groups depending on a result of the comparison, determining a hash value by substituting the target character into the hash function of a next transition state, and summing the hash value with an address value stored in the same row of the next transition state to produce a new address value and accessing the state transition table using the new address value to produce a plurality of data necessary to perform a next transition.
According to a third aspect, the present invention provides a pattern matching system for detecting a plurality of character patterns in a string of input characters, comprising a state transition table having a plurality of rows respectively identified by address values, each of the rows containing a reference character, first and second hash functions and first and second address values, a hash calculator that receives a target character from the input characters and determines a hash value by substituting the target character into a previously specified hash function, an adder that sums the hash value with a previously specified address value to produce a new address value and supplies the new address value to the state transition table to identify one of the rows, a comparator that compares the target character with the reference character contained in the identified row to produce an output indicating a match or mismatch between the compared characters, and selector circuitry that, in response to a result of the comparator, specifies one of the first and second hash functions of the identified row and one of the first and second address values of the identified row and supplies the specified hash function to the hash calculator instead of the previously specified hash function and the specified address value to the table instead of the previously specified address value.
The present invention will be described in detail with reference to the following drawings, in which:
A pattern matching apparatus 1 illustrated in
The output of the adder 22 is supplied to the memory 23 as an address for accessing one of its rows. In response to an address from adder 22, the memory 23 produces a plurality of column outputs including a reference character 123, a matched transition flag 124, a mismatched transition flag 125, a matched pattern number 126, a mismatched pattern number 127, a matched hash function 128, a mismatched hash function 129, a matched next address 130, and a mismatched next address 131.
These outputs are supplied in pairs to a corresponding one of selectors 25, 26, 27 and 28. Specifically, the transition flags 124 and 125 are supplied to a flag selector 25, the pattern numbers 126 and 127 are supplied to a pattern number selector 26, the hash functions 128 and 129 are supplied to a hash function selector 27, and the next addresses 130 and 131 are supplied to a next address selector 28.
A comparator 24 is provided for matching a target character 120 from the character register 20 with the reference character 123. If they match, the comparator 24 produces a “1” output as a match flag. In response to the match flag, each of the selectors 25, 26, 27 and 28 selects the matched (upper) side of its pair of input signals. When the comparator 24 detects a mismatch between the target character and the reference character, the comparator 24 produces a “0” as a mismatch flag and each of the selectors selects the mismatched (lower) side of its pair of input signals.
Therefore, matched transition flag 124, matched pattern number 126, matched hash function 128, and matched next address 130 are selected when the target character 120 from register 20 matches the reference character 123, while mismatched transition flag 125, mismatched pattern number 127, mismatched hash function 129, and mismatched next address 131 are selected when the target character 120 mismatches the reference character 123.
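The behavior of the comparator 24 and the selectors 25 to 28 can be sketched in Python as follows. The `Row` type, its field names and the `select` function are hypothetical names introduced for this example, and each hash function of the form x % N is stored simply as its modulus N. The sample row uses the values that this description later lists for address “4”, with `None` standing for a “don't care” pattern number.

```python
from typing import NamedTuple, Optional, Tuple


class Row(NamedTuple):
    reference_char: str                 # column 123
    matched_flag: int                   # column 124
    mismatched_flag: int                # column 125
    matched_pattern: Optional[int]      # column 126
    mismatched_pattern: Optional[int]   # column 127
    matched_modulus: int                # column 128, x % N stored as N
    mismatched_modulus: int             # column 129
    matched_next: int                   # column 130
    mismatched_next: int                # column 131


def select(row: Row, target_char: str) -> Tuple[int, Optional[int], int, int]:
    """Return (flag, pattern number, modulus, next address) for one row."""
    if target_char == row.reference_char:   # comparator output "1"
        return (row.matched_flag, row.matched_pattern,
                row.matched_modulus, row.matched_next)
    # Comparator output "0": every selector takes the mismatched side.
    return (row.mismatched_flag, row.mismatched_pattern,
            row.mismatched_modulus, row.mismatched_next)


row_at_address_4 = Row("D", 1, 0, 2, None, 2, 1, 0, 3)
print(select(row_at_address_4, "D"))
print(select(row_at_address_4, "A"))
```

A single comparison thus drives all four selections at once, which is why one row can encode both outcomes of a transition.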
The output of flag selector 25 is delivered to an external circuit as a determined transition flag 102 as well as to the character register 20 to enable it to store an input character at the leading edge of a clock pulse. The output of pattern number selector 26 is delivered to the external circuit as a determined pattern number 103. Therefore, when the selector 25 produces a determined transition flag 102, the character register 20 is enabled and latches an input character in response to the leading edge of a clock pulse 100 and delivers the latched character to the comparator 24 and the hash calculator 21 in response to the next clock pulse.
The determined transition flag 102 is “1” when the current text search on the target character 120 is complete and is “0” when the current search is still in progress. The determined pattern number 103 is valid only when the determined transition flag 102 is “1”.
The output of hash function selector 27 is connected to a hash function register 29 for latching the selected hash function in response to the leading edge of a clock pulse and delivering the stored hash function to the hash calculator 21 in response to the next clock pulse. The output of next address selector 28 is connected to a next address register 30 to latch the selected next address in response to a clock pulse and deliver the stored next address to the adder 22 in response to the next clock pulse.
Hash calculator 21 holds a plurality of character codes respectively corresponding to the input characters. Hash calculator 21 receives the target character 120 from the input register 20 and substitutes the character code of the target character 120 into a hash function that is defined for each transition state and supplied from the hash function register 29 and produces a hash value. For each transition state, the hash function is defined as “fn(x)” according to a rule which will be described later (where “n” represents the transition state and “x” denotes the character code of the character concerned). In a preferred embodiment, the hash function fn(x)=x % N, where the symbol % is an operator indicating the residue of an arithmetic division x/N (where N is a natural number). If the character code of a target character 120 is “7” and the hash function is x % 3, the hash value equals 1 (= 7 % 3).
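The residue calculation of the preferred embodiment can be written as a one-line Python function. The function name `hash_value` is an assumption made for this sketch; the operator `%` in Python yields the same residue as described above for non-negative character codes.

```python
def hash_value(char_code: int, modulus: int) -> int:
    """Hash function fn(x) = x % N for a non-negative character code x."""
    if modulus < 1:
        raise ValueError("N must be a natural number")
    return char_code % modulus


print(hash_value(7, 3))  # 1, as in the example above
print(hash_value(7, 1))  # 0: x % 1 maps every character code to 0
```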
The hash value obtained in this way is summed in the adder 22 with the next address from the next address register 30 to produce an address for accessing the state transition memory 23.
The state transition table of
It is assumed that for the sake of simplicity the input character string consists of a set of seven characters {A, B, C, D, E, F, G} and each character is assigned a unique code as shown in
In the case of state “0”, the hash function f0(x) is defined as x % 2. By successively substituting all character codes into f0(x), hash values 0, 1, 0, 1, 0, 1, 0 are obtained for characters “A” to “G” as shown in
Each character group is divided into a first sub-group that contains a character pointing a transition from the current state to the next and a second sub-group that contains the other characters of the same character group. In the case of state “0”, characters pointing to the next state are “A” and “B” as shown in
Next, the transition from state “0” to the next is determined for sub-groups {A}, {C, E, G}, {B} and {D, F}. From
From the foregoing the following list of data is determined for state “0”:
a) Hash function f0(x)=x % 2.
b) Reference character of the first character group is A.
c) Reference character of the second character group is B.
d) Next state of reference character A is state “1” and the next state of the other characters of the same character group is state “0”.
e) Next state of the reference character B is state “2” (i.e., matched transition flag is “1”) and the next state of the other characters of the same character group is state “0” (i.e., mismatched transition flag is “1”).
In the case of state “1”, the hash function f1(x) is defined as x % 1. By successively substituting all character codes into f1(x), hash values 0, 0, 0, 0, 0, 0, 0 are obtained for characters “A” to “G” as shown in
Next, the transition from state “1” to the next is determined for sub-groups {B} and {A, C, D, E, F, G}. From
From the foregoing the following list of data is determined for state “1”:
a) Hash function f1(x)=x % 1.
b) Reference character of the sole character group is B.
c) The next state of reference character B is state “3” (i.e., matched transition flag is “1”) and the next state of the other characters of the sole character group is state “0” and indefinite (i.e., mismatched transition flag is “0”).
In the case of state “2”, the hash function f2(x) is defined as x % 1. By successively substituting all character codes into f2(x), hash values 0, 0, 0, 0, 0, 0, 0 are obtained for characters “A” to “G” as shown in
From the foregoing the following list of data is determined for state “2”:
a) Hash function f2(x)=x % 1.
b) Reference character of the sole character group is A.
c) The next state of reference character A is state “4” (i.e., matched transition flag is “1”) and the next state of the other characters of the sole character group is state “0” and indefinite (i.e., mismatched transition flag is “0”).
In the case of state “3”, the hash function f3(x) is defined as x % 3. By successively substituting all character codes into f3(x), hash values 0, 1, 2, 0, 1, 2, 0 are obtained for characters “A” to “G” as shown in
Since C, D, E and F are the characters for making a transition from state “3” to the next as seen from
Next, the transition from state “3” to the next is determined for sub-groups {D}, {A, G}, {E}, {B}, {C} and {F}. From
From the foregoing the following is a list of data determined for state “3”:
a) Hash function f3(x)=x % 3.
b) Reference character of the first character group {A, D, G} is D.
c) Reference character of the second character group {B, E} is E.
d) Reference character of the third character group {C, F} is C.
e) The next state of reference character D is state “6” (i.e., matched transition flag is “1”) and the next state of the other characters of the same character group is state “2” and indefinite (i.e., mismatched transition flag is “0”).
f) The next state of reference character E is state “7” (i.e., matched transition flag is “1”) and the next state of the character B of the same character group is state “2” (i.e., mismatched transition flag is “1”).
g) The next state of the reference character C is state “5” (i.e., matched transition flag is “1”) and the next state of the character F of the same character group is state “8” (i.e., mismatched transition flag is “1”).
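The grouping and sub-grouping steps described for states “0” to “3” can be reproduced with the following Python sketch. It assumes character codes 0 to 6 for “A” to “G”, an assignment inferred here from the hash values listed above rather than taken from the drawings, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumed character codes: A=0, B=1, ..., G=6.
CODES: Dict[str, int] = {ch: i for i, ch in enumerate("ABCDEFG")}


def character_groups(modulus: int) -> List[List[str]]:
    """Sort the characters into groups by hash value x % modulus."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for ch, code in CODES.items():
        groups[code % modulus].append(ch)
    return [groups[h] for h in sorted(groups)]


def split_group(group: List[str], reference_char: str) -> Tuple[List[str], List[str]]:
    """Divide a group into the reference character and the remaining characters."""
    rest = [ch for ch in group if ch != reference_char]
    return [reference_char], rest


print(character_groups(2))   # state "0": [['A', 'C', 'E', 'G'], ['B', 'D', 'F']]
print(character_groups(3))   # state "3": [['A', 'D', 'G'], ['B', 'E'], ['C', 'F']]
print(split_group(["A", "D", "G"], "D"))  # (['D'], ['A', 'G'])
```

With x % 1, as used for states “1” and “2”, the same function returns a single group containing all seven characters.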
A state transition diagram can be created using the lists of data obtained above as a modification of the state transition diagram of
The following description illustrates how the number of failure transitions can be reduced by comparison between
In
By using the lists of data obtained above with respect to states “0” to “3” a state transition table can be created as shown in
In the
Address “0” corresponds to character group {A, C, E, G},
Address “1” corresponds to character group {B, D, F},
Address “2” corresponds to character group {A, B, C, D, E, F, G},
Address “3” corresponds to character group {A, B, C, D, E, F, G},
Address “4” corresponds to character group {A, D, G},
Address “5” corresponds to character group {B, E}, and
Address “6” corresponds to character group {C, F}.
The columns of the
Reference character 123 in each address of
Corresponding to state “0”, for example, the top row (address 0) of the
Using the data stored in the
Note that, although not shown in
As shown in
During the fill-in process of column 130 if the next state indicated in the reference character's next state column 204 (
The matched hash function column 128 of address “i” of
During the fill-in process of column 128, if the next state indicated in the reference character's next state column 204 finds no corresponding state in the state column 200, the next state of a failure transition is used instead in a similar manner to that described with reference to the fill-in process of column 130 and therefore no description is given to avoid duplication.
The matched pattern number column 126 of address “i”,
Fill-in processes of columns 131, 129 and 127 of
The following is a description of the rule for defining the hash function fn(x) by using Σ to represent a set of all possible characters, Z to represent a set of all integers, Tn to represent a set of characters involved when transition is made from state “n”, and Gn(a) to represent a set of x (x ∈ Σ) that satisfy fn(x)=a and a ∈ Z. For ∀a ∈ Z, the hash function fn(x) must satisfy both Equations (1) and (2) given below:
where |S| represents the number of elements of S, and sgn( ) is the signum function. At transition state “3” in the
With the hash function fn(x)=x % N, it is preferable to minimize the size of the state transition table. Since fn(x) ranges from 0 to (N−1), state “n” occupies N addresses (rows) of the state transition table. The size of the state transition table can be reduced to a minimum by selecting a hash function fn(x) that minimizes N while satisfying Equations (1) and (2). Since Equations (1) and (2) are not satisfied when N<|Tn|÷2, a search is made for selecting such a hash function by starting with N=|Tn|÷2, successively incrementing the N value by one and checking to see if the hash function satisfies Equations (1) and (2). The hash function that is obtained when Equations (1) and (2) are satisfied is the one that minimizes the size of the state transition table.
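The incremental search for N can be sketched in Python as shown below. Because Equations (1) and (2) are not reproduced in this text, the function `groups_ok` is only a stand-in inferred from the worked examples for states “0” to “3”: it accepts a character group that holds at most one transition character, or exactly two transition characters and nothing else. That predicate, the assumed character codes 0 to 6 for “A” to “G”, and all function names are assumptions of this sketch.

```python
import math
from typing import Dict, Set

# Assumed character codes: A=0, B=1, ..., G=6.
CODES: Dict[str, int] = {ch: i for i, ch in enumerate("ABCDEFG")}


def groups_ok(modulus: int, transition_chars: Set[str]) -> bool:
    """Stand-in check (an assumption, not Equations (1) and (2) themselves)."""
    for h in range(modulus):
        group = [ch for ch, code in CODES.items() if code % modulus == h]
        hits = sum(1 for ch in group if ch in transition_chars)
        others = len(group) - hits
        if hits > 2 or (hits == 2 and others > 0):
            return False
    return True


def smallest_modulus(transition_chars: Set[str]) -> int:
    """Start at |Tn| / 2 (rounded up, at least 1) and increment N by one."""
    n = max(1, math.ceil(len(transition_chars) / 2))
    while not groups_ok(n, transition_chars):
        n += 1
    return n


print(smallest_modulus({"A", "B"}))            # state "0"
print(smallest_modulus({"B"}))                 # state "1"
print(smallest_modulus({"C", "D", "E", "F"}))  # state "3"
```

Under these assumptions the search yields N = 2, 1 and 3 for states “0”, “1” and “3”, which agrees with the hash functions x % 2, x % 1 and x % 3 used in the description.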
By appropriately determining the hash function, the number of different hash values can be made smaller than the number of different characters. For example, the number of different hash values for state “0” in the
The hash value is used as an incremental address value to be summed in the adder 22 with the next address value supplied from the next address register 30. If a given state has only one hash value, the given state has only one address, such as states “1” and “2” having unique addresses “2” and “3”, respectively. However, if a given state has more than one hash value, it has one address for each hash value, such as state “0” having addresses “0” and “1” and state “3” having addresses “4”, “5” and “6”.
If the next state is a single-address state, the address of the next state is uniquely determined by the next address supplied from the address register 30. In this case, the hash value is 0, which is summed with the next address, giving the same address value for accessing the state transition memory 23 as the next address value.
If the next state is a multi-address state, it is necessary to identify one of the addresses of the multi-address state. In this case, the hash value is one of “0”, “1” and “2”, which is summed with the next address from the address register 30. For example, if the next state corresponds to address “6” of multi-address state “3”, a hash value “2” is added to next address “4” to access the address “6” of state transition memory 23.
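The address calculation for single-address and multi-address states can be illustrated with the following Python sketch. The table `STATE_LAYOUT` collects the first address and the modulus N of each of states “0” to “3” as given in the description; the names `STATE_LAYOUT` and `row_address`, and the character codes 0 to 6 for “A” to “G”, are assumptions of this example.

```python
from typing import Dict, Tuple

# Assumed character codes: A=0, B=1, ..., G=6.
CODES: Dict[str, int] = {ch: i for i, ch in enumerate("ABCDEFG")}

# state -> (first address of the state, modulus N of its hash function x % N)
STATE_LAYOUT: Dict[int, Tuple[int, int]] = {
    0: (0, 2),  # addresses 0 and 1
    1: (2, 1),  # address 2 only
    2: (3, 1),  # address 3 only
    3: (4, 3),  # addresses 4, 5 and 6
}


def row_address(state: int, target_char: str) -> int:
    """Sum the state's first address and the hash value of the character."""
    base, modulus = STATE_LAYOUT[state]
    return base + CODES[target_char] % modulus


print(row_address(3, "C"))  # 6: hash value 2 added to next address 4
print(row_address(1, "G"))  # 2: a single-address state always yields its base
```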
Returning to
The following is a description of the operation of the pattern matching system of
In the absence of clock pulses, the pattern matching system 1 is initialized at step 301 by setting the first character “A” into the input character register 20, setting the hash function of state “0” (i.e., x % 2) as both hash functions 128 and 129, and setting transition flags 124 and 125 and next addresses 130 and 131 to “0”. As a result, flag selector 25 produces a “0” output, thus setting the transition flag 102 to “0”. Additionally, the hash function selector 27 produces the hash function x % 2, and the next address selector 28 produces address “0”.
In response to a clock pulse (step 302), the input register 20 supplies a target character 120 to both hash calculator 21 and comparator 24, the hash function register 29 supplies a hash function 133 to hash calculator 21 and the next address register 30 supplies a next address 134 to adder 22 (step 303).
Hash calculator 21 calculates a hash value 121 by substituting the target character 120 into the hash function 133 and supplies the hash value 121 to adder 22 (step 304). Adder 22 generates an address 122 by summing the hash value 121 and the next address value 134 and supplies the address 122 to the state transition memory 23 (step 305). State transition memory 23 reads the contents of columns 123 through 131 of a row identified by the address 122 for delivery to its output terminals (step 306).
Therefore, the comparator 24 is supplied with a target character 120 and a reference character 123 and determines whether they match or mismatch (step 307). If they match, the comparator 24 produces a “1” output, allowing the selectors 25, 26, 27 and 28 to output the matched transition flag 124 as a determined transition flag 102, matched pattern number 126 as a determined pattern number 103, matched hash function 128 and matched next address 130, respectively (step 308). If they mismatch, the comparator 24 produces a “0” output (step 309), allowing the selectors 25, 26, 27 and 28 to output the mismatched transition flag 125 as a determined transition flag 102, mismatched pattern number 127 as a determined pattern number 103, mismatched hash function 129 and mismatched next address 131, respectively.
If the transition flag 102 is “1” (step 310), and the target character 120 is not the last character (step 311), the input register 20 reads and stores the next character (step 312), and flow returns to step 302 to repeat the same process on receiving a subsequent clock pulse. Flow returns to step 302 to continue the process if the transition flag 102 is “0” (step 310). The operation of the system is terminated if the target character 120 is the last character of the input character string (step 311).
Therefore, in response to clock pulse #1, the input register 20 outputs the first character “A” to the hash calculator 21 and the comparator 24. Hash function register 29 outputs the hash function x % 2 as a hash function 133 to the hash calculator 21. Since the address selector 28 is supplied with “0” inputs, the next address register 30 outputs a next address 134 which is “0”. Since the character code of “A” is “0”, the hash calculator 21 produces a hash value “0”. This hash value is summed in the adder 22 with “0” from the address register 30. Thus, the adder 22 supplies an address 122 which is “0” to the memory 23.
Since the memory address is 0, the state transition memory 23 (
Reference character 123=A,
Matched transition flag 124=1,
Mismatched transition flag 125=1,
Matched pattern number 126=0,
Mismatched pattern number 127=0,
Matched hash function 128=x % 1,
Mismatched hash function 129=x % 2,
Matched next address 130=2, and
Mismatched next address 131=0.
As a result, the comparator 24 supplies a “1” output to all selectors 25, 26, 27, 28, which sets the determined transition flag 102 to “1” and the determined pattern number 103 to “0”. Additionally, the hash function 128=x % 1 is set in the function register 29 and the next address 130=2 is set in the address register 30. Since the transition flag 102 is set to “1”, the input register 20 stores the next character B.
In response to clock pulse #2, the input register 20 outputs the second character “B” to the hash calculator 21 and the comparator 24. Hash function register 29 outputs the hash function x % 1 as a hash function 133 to the hash calculator 21 and the address register 30 outputs the next address 134=2. Since the character code of “B” is “1” and the hash function is x % 1, the hash calculator 21 produces a hash value “0”. This hash value is summed in the adder 22 with “2” from the address register 30. Thus, the adder 22 supplies an address 122=2 to the memory 23. In response to the address “2”, the state transition memory 23 sets its outputs as follows:
Reference character 123=B,
Matched transition flag 124=1,
Mismatched transition flag 125=0,
Matched pattern number 126=0,
Mismatched pattern number 127=*(don't care),
Matched hash function 128=x % 3,
Mismatched hash function 129=x % 2,
Matched next address 130=4, and
Mismatched next address 131=0.
As a result, the comparator 24 supplies a “1” output to all selectors 25, 26, 27, 28, which sets the determined transition flag 102 to “1” and the determined pattern number 103 to “0”. Additionally, the hash function 128=x % 3 is set in the function register 29 and the next address 130=4 is set in the address register 30. Since the transition flag 102 is set to “1”, the input register 20 stores the third character A.
In response to clock pulse #3, the input register 20 outputs the third character “A” to the hash calculator 21 and the comparator 24. Hash function register 29 outputs a hash function 133=x % 3 to the hash calculator 21 and the address register 30 outputs the next address 134=4. Since the character code of “A” is “0”, the hash calculator 21 produces a hash value “0” again. This hash value is summed in the adder 22 with “4” from the address register 30. Thus, the adder 22 supplies an address 122=4 to the memory 23. In response to the address “4”, the state transition memory 23 sets its outputs as follows:
Reference character 123=D,
Matched transition flag 124=1,
Mismatched transition flag 125=0,
Matched pattern number 126=2,
Mismatched pattern number 127=*(don't care),
Matched hash function 128=x % 2,
Mismatched hash function 129=x % 1,
Matched next address 130=0, and
Mismatched next address 131=3.
As a result, the comparator 24 detects a mismatch and supplies a “0” output to all selectors 25, 26, 27, 28, which sets the determined transition flag 102 to “0” and the determined pattern number 103 to the “don't care” status. Additionally, the hash function 129=x % 1 is set in the function register 29 and the next address 131=3 is set in the address register 30. Since the transition flag 102 is set to “0”, the input register 20 does not store the next character.
In response to clock pulse #4, the input register 20 outputs the previous character “A” to the hash calculator 21 and the comparator 24. Hash function register 29 outputs a hash function 133=x % 1 to the hash calculator 21 and the address register 30 outputs the next address 134=3. Since the character code of “A” is “0”, the hash calculator 21 produces a hash value “0” again. This hash value is summed in the adder 22 with “3” from the address register 30. Thus, the adder 22 supplies an address 122=3 to the memory 23. In response to the address “3”, the state transition memory 23 sets its outputs as follows:
Reference character 123=A,
Matched transition flag 124=1,
Mismatched transition flag 125=0,
Matched pattern number 126=5,
Mismatched pattern number 127=*(don't care),
Matched hash function 128=x % 1,
Mismatched hash function 129=x % 2,
Matched next address 130=2, and
Mismatched next address 131=0.
As a result, the comparator 24 detects a match and supplies a “1” output to all selectors 25, 26, 27, 28, which sets the determined transition flag 102 to “1” and the determined pattern number 103 to “5”. Since the pattern number “5” corresponds to the pattern “BA” and the flag 102 is “1”, the pattern matching system 1 detects the pattern “BA” in the input character string in response to clock pulse #4. Additionally, the hash function 128=x % 1 is set in the function register 29 and the next address 130=2 is set in the address register 30. Since the transition flag 102 is set to “1”, the input register 20 latches the fourth character B. When the above process is repeated on the subsequent characters, the pattern “ABF” whose pattern number is “4” is detected in response to clock pulse #11.
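The four clock cycles traced above can be replayed in software with the following Python sketch. It is a minimal model under stated assumptions, not the circuit itself: only the rows at addresses “0”, “2”, “3” and “4” are included because only those rows are listed in the walkthrough, each hash function x % N is stored as its modulus N, `None` stands for a “don't care” pattern number, and the character codes 0 to 6 for “A” to “G” are inferred from the hash values. The names `ROWS` and `run` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Assumed character codes: A=0, B=1, ..., G=6.
CODES: Dict[str, int] = {ch: i for i, ch in enumerate("ABCDEFG")}

# address -> (reference char 123, flags 124/125, pattern numbers 126/127,
#             moduli of hash functions 128/129, next addresses 130/131)
ROWS: Dict[int, tuple] = {
    0: ("A", 1, 1, 0, 0, 1, 2, 2, 0),
    2: ("B", 1, 0, 0, None, 3, 2, 4, 0),
    3: ("A", 1, 0, 5, None, 1, 2, 2, 0),
    4: ("D", 1, 0, 2, None, 2, 1, 0, 3),
}


def run(text: str) -> List[Tuple[int, int, int, Optional[int]]]:
    """Return one (clock, address, flag, pattern number) tuple per clock pulse."""
    modulus, next_address = 2, 0   # initial values for state "0"
    trace = []
    clock, pos = 0, 0
    while pos < len(text):
        clock += 1
        ch = text[pos]
        address = next_address + CODES[ch] % modulus       # adder 22
        (ref, m_flag, mm_flag, m_pat, mm_pat,
         m_mod, mm_mod, m_next, mm_next) = ROWS[address]   # memory 23
        if ch == ref:                                      # comparator 24
            flag, pattern, modulus, next_address = m_flag, m_pat, m_mod, m_next
        else:
            flag, pattern, modulus, next_address = mm_flag, mm_pat, mm_mod, mm_next
        trace.append((clock, address, flag, pattern))
        if flag == 1:   # the next character is read only when the flag is "1"
            pos += 1
    return trace


for step in run("ABA"):
    print(step)
```

For the input “ABA” the sketch visits addresses 0, 2, 4 and 3 and reports pattern number 5 at the fourth clock pulse, matching the walkthrough. An input that reaches a row not listed in `ROWS` raises a `KeyError`, since the remaining rows of the table are not reproduced in this text.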
Consider the amount of computations necessary to perform a pattern match. With the hash function being x % N, one residue calculation by hash calculator 21, one addition by adder 22 and one comparison by comparator 24 are performed in a single state transition. The amount of computations involved in these operations does not vary with the number of different characters, although the number of bits for representing the characters may increase slightly. However, the amount of such increase is considerably small in comparison with the amount of increase in different characters. If the number of different characters is increased 256 times, the number of bits for representing these characters increases by only 8 bits (i.e., 8 = log2 256).
Accordingly, the speed of search for a pattern match is not affected by the number of different characters. With the prior art of
Number | Date | Country | Kind |
---|---|---|---|
2005-218382 | Jul 2005 | JP | national |