The present invention relates to the inspection and classification of high speed network traffic, and more particularly to the acceleration of classification of network content using pattern matching where the database of patterns used is relatively large in comparison to the available storage space.
Efficient transmission, dissemination and processing of data are essential in the current age of information. The Internet is an example of a technological development that relies heavily on the ability to process information efficiently. With the Internet gaining wider acceptance and usage, coupled with further improvements in technology such as higher-bandwidth connections, the amount of data and information that needs to be processed is increasing substantially. Of the many uses of the Internet, such as world-wide-web surfing and electronic messaging, which includes e-mail and instant messaging, some are detrimental to its effectiveness as a medium for exchanging and distributing information. Malicious attackers and Internet fraudsters have found ways of exploiting security holes in systems connected to the Internet to spread viruses and worms, gain access to restricted and private information, gain unauthorized control of systems, and in general disrupt the legitimate use of the Internet. The medium has also been exploited for mass-marketing purposes through the transmission of unsolicited bulk e-mail, also known as spam. Apart from creating inconvenience for the user on the receiving end of a spam message, spam also consumes network bandwidth at a cost to network infrastructure owners. Furthermore, spam poses a threat to the security of a network because viruses are sometimes attached to the e-mail.
Network security solutions have become an important part of the Internet. Due to the growing amount of Internet traffic and the increasing sophistication of attacks, many network security applications are faced with the need to increase both complexity and processing speed. However, these two factors are inherently conflicting since increased complexity usually involves additional processing.
Pattern matching is an important technique in many information processing systems and has gained wide acceptance in most network security applications, such as anti-virus, anti-spam and intrusion detection systems. Increasing both complexity and processing speed requires improvements to the hardware and algorithms used for efficient pattern matching.
An important component of a pattern matching system is the database of patterns against which an input data stream is matched. As network security applications evolve to handle more varied attacks, the sizes of the pattern databases used increase. Pattern database sizes have grown to the point that they significantly tax system memory resources, and this is especially true for specialized hardware solutions that scan data at high speed.
In accordance with one embodiment of the present invention, incoming network traffic is compressed using a hash function and the compressed result is used by a space-and-time efficient retrieval method that compares it with entries in a multitude of databases that store compressed data. In accordance with another embodiment of the present invention, incoming network traffic is used for comparison in the databases without being compressed using a hash function. The present invention, accordingly, accelerates the performance of content security applications and networked devices such as gateway anti-virus and email filtering appliances.
In some embodiments, the matching of the compressed data is performed by a pattern matching system and a data processing system, which may be a network security system configured to perform one or more of anti-virus, anti-spam and intrusion detection algorithms. The pattern matching system is configured to support large pattern databases. In one embodiment, the pattern matching system includes, in part, a hash value calculator, a compressed database pattern retriever, and first and second memory tables.
Incoming data byte streams are received by the hash value calculator, which is configured to compute the hash value for a substring of length N bytes of the input data byte stream (alternatively referred to hereinbelow as a data stream). The compressed database pattern retriever compares the computed hash value to the patterns stored in the first and second memory tables. If the comparison results in a match, a matched state is returned to the data processing system. A matched state holds information related to the memory location at which the match occurs as well as other information related to the matched pattern, such as the match location in the input data stream. If the computed hash value is not matched to the compressed patterns stored in the first and second memory tables, either a no-match state is returned to the data processing system or, alternatively, nothing is returned to the data processing system.
A matched state may correspond to multiple uncompressed patterns. If so, the data processing system disambiguates the match by identifying a final match from among the many matches found. In such embodiments, the data processing system may be configured to maintain an internal database used to map the matched state to a multitude of original uncompressed patterns. These patterns are then compared by data processing system to the pattern in the input data stream at the location specified by the matched state so as to identify the final match.
In one embodiment, if the data read from the second memory table includes the corresponding address of the first memory table used to compute the address of the data read from the second memory table, the match validator generates a matched state signal. In such embodiments, if the data read from the second memory table does not include the corresponding address of the first memory table used to compute the address of the data read from the second memory table, the match validator generates a no-match signal. In another embodiment, if the data read from the second memory table matches an identifier stored in the corresponding address of the first memory table used to compute the address of the data read from the second memory table, the match validator generates a matched state signal. In such embodiments, if the data read from the second memory table does not match the identifier stored in the corresponding address of the first memory table used to compute the address of the data read from the second memory table, the match validator generates a no-match signal. The match validator outputs a matched state that is used by a post processor to identify the pattern that matched.
Incoming data byte streams are received by hash value calculator 130 of pattern matching system 110. Hash value calculator 130 is configured to compute the hash value for a substring of length N bytes of the input data byte stream (alternatively referred to hereinbelow as data stream). Compressed database pattern retriever 140 compares the computed hash value to the patterns stored in first and second memory tables 150 and 160, as described further below. If the comparison results in a match, a matched state is returned to the data processing system 120. A matched state holds information related to the memory location at which the match occurs as well as other information related to the matched pattern, such as the match location in the input data stream. In one embodiment, if the computed hash value is not matched to the compressed patterns stored in first and second memory tables 150, 160, a no-match state is returned to the data processing system 120. In another embodiment, if the computed hash value is not matched to the compressed patterns stored in first and second memory tables 150, 160, nothing is returned to the data processing system.
A matched state may correspond to multiple uncompressed patterns. If so, data processing system 120 disambiguates the match by identifying a final match from among the many candidate matches found. In such embodiments, data processing system 120 may be configured to maintain an internal database used to map the matched state to a multitude of original uncompressed patterns. These patterns are then compared by data processing system 120 to the pattern in the input data stream at the location specified by the matched state so as to identify the final match.
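The disambiguation step just described may be sketched as follows. The map layout, the names, and the assumption that the matched state carries the starting offset of the match in the input stream are illustrative choices, not elements taken from the specification:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// A matched state may map to several original (uncompressed) patterns;
// each candidate is compared against the input stream at the reported
// match location to identify the final match or reject a false positive.
struct MatchState {
    uint32_t id;         // identifies the matched compressed entry
    std::size_t streamPos;  // assumed: start offset of the match in the stream
};

// matched-state id -> candidate uncompressed patterns (internal database)
using PatternMap = std::map<uint32_t, std::vector<std::string>>;

// Returns the index of the final matching candidate, or -1 when no
// candidate matches (i.e., the matched state was a false positive).
int disambiguate(const PatternMap& db, const MatchState& m,
                 const std::string& stream) {
    auto it = db.find(m.id);
    if (it == db.end()) return -1;
    const auto& candidates = it->second;
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        const std::string& p = candidates[i];
        // compare the candidate against the stream at the match location
        if (m.streamPos <= stream.size() &&
            stream.compare(m.streamPos, p.size(), p) == 0)
            return int(i);
    }
    return -1;
}
```

Only the final byte-for-byte comparison is exact; the compressed lookup merely narrows the search to a handful of candidates.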
Since hash value calculator 130 maps many substrings of length N bytes of the input data stream into a fixed-size pattern search key, there may be instances where a matched state does not correspond to any uncompressed pattern. Data processing system 120 is further configured to disambiguate the matched state by verifying whether the detected matched state is a false positive. It is understood that although data processing system 120 is operative to disambiguate and verify the matched state, the present invention achieves much faster matching than other known systems.
Compressed database pattern retriever 140 includes logic blocks configured to retrieve patterns from the memory tables that contain compressed databases. Such a format is non-ambiguous if overlapping patterns are not used, but becomes ambiguous when overlapping patterns are used. Such ambiguity of the patterns in the database is controlled via the compression algorithm used to generate the memory tables. Allowing ambiguous patterns increases the capacity of the database but also increases the amount of processing that data processing system 120 performs to resolve ambiguity in the patterns, as described above. The ambiguity just described does not relate to the collision of pattern search keys resulting from hashing operations. Instead, it applies only to the intentional overlapping of different pattern search keys in order to conserve memory.
Pattern search key segment 1 is modified by segment 1 modifier and supplied to memory accessor 240. Pattern search key segment 2 may or may not be modified by segment 2 modifier and subsequently supplied to memory accessor 245. Such modifications include, for example, arithmetic operations, bitwise logical operations, masking and permuting the order of bits. Memory accessor 240 receives the modified segment 1 as an address to perform a read operation on first memory table 150. The data read by memory accessor 240 from first memory table 150 is combined with the output of segment 2 modifier 235 by memory accessor 245 to compute the address for the read-out operation in second memory table 160. In some embodiments, memory accessor 245 adds the data read from first memory table 150 to the output of segment 2 modifier 235 to compute the address for the read-out operation in second memory table 160. In yet other embodiments, memory accessor 245 adds an offset to the sum of the data read from first memory table 150 and the output of segment 2 modifier 235 to compute the address for the read-out operation in second memory table 160. Data read from first memory table 150 and second memory table 160 is supplied to match validator 260 which is configured to determine if the input pattern search key is a valid pattern.
In one embodiment, if the data read from the second memory table 160 includes the corresponding address of the first memory table 150 used to compute the address of the data read from the second memory table 160, match validator 260 generates a matched state signal. In such embodiments, if the data read from the second memory table 160 does not include the corresponding full or partial address of the first memory table 150 used to compute the address of the data read from the second memory table 160, match validator 260 generates a no-match signal. In another embodiment, if the data read from the second memory table 160 matches an identifier stored in the corresponding address of the first memory table 150 used to compute the address of the data read from the second memory table 160, match validator 260 generates a matched state signal. In such embodiments, if the data read from the second memory table 160 does not match the identifier stored in the corresponding address of the first memory table 150 used to compute the address of the data read from the second memory table 160, match validator 260 generates a no-match signal. Match validator 260 outputs a matched state that is used by a post processor 220 to identify the pattern that matched. In one embodiment, the post-processor 220 is used to block the first N-1 invalid results where the N-gram recursive hash is used, as described further below.
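The retrieval and validation path just described can be sketched as follows. The XOR-mask segment modifiers, the table sizes, the modulo address wrapping, and the choice of the first validation variant (the second-table entry stores the first-table address itself) are illustrative assumptions:

```cpp
#include <cstdint>
#include <vector>

// Two-table lookup: segment 1 of the pattern search key (optionally
// modified) addresses the first table; the data read there, plus the
// modified segment 2 and an optional offset, addresses the second table.
// A match is declared only when the second-table entry points back at
// the first-table address that produced it.
struct LookupResult {
    bool matched;
    uint32_t secondAddr;  // the matched state: location in the second table
};

LookupResult lookup(const std::vector<uint32_t>& first,
                    const std::vector<uint32_t>& second,
                    uint32_t keySeg1, uint32_t keySeg2,
                    uint32_t seg1Mask, uint32_t seg2Mask,
                    uint32_t offset) {
    uint32_t addr1   = (keySeg1 ^ seg1Mask) % first.size();  // segment 1 modifier
    uint32_t modSeg2 = keySeg2 ^ seg2Mask;                   // segment 2 modifier
    uint32_t addr2   = (first[addr1] + modSeg2 + offset) % second.size();
    // Match validation: the second-table data must equal the first-table
    // address used to compute addr2.
    bool ok = (second[addr2] == addr1);
    return {ok, addr2};
}
```

The other validation variant described above would instead compare the second-table data against an identifier stored at the first-table address.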
It is understood that other embodiments of the present invention may use more than two memory tables, that may or may not be stored in the same physical memory banks or device.
Compressed database pattern retriever 305 may output a match state from the match validator module 365 prior to receiving the results from all of the K memory tables. In other words, match validator 365 may return a matched state after the i-th memory table has been read, where i is less than K and the first memory table is identified with i equal to zero. Such a situation may arise if match validator 365 receives sufficient information from reading the first i memory tables to enable match validator 365 to determine that the input pattern search key corresponds to a positive or negative match, thereby increasing the matching speed because fewer memory lookups are required. As such, compressed database pattern retriever 305 may bypass reading the remaining (K-(i+1)) memory tables and thus may begin to process the next pattern search key. Therefore, because pattern search keys are compared and matched state results are produced at a higher rate, a pattern matching system in accordance with the present invention has increased throughput.
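The early-termination behavior can be sketched as follows. The tri-state verdict, the callback used to model reading the i-th table, and the treatment of exhaustion as a no-match are illustrative assumptions:

```cpp
#include <functional>

// Verdict the match validator can form after reading a given table.
enum class Verdict { Match, NoMatch, Undecided };

struct EarlyResult {
    Verdict verdict;
    unsigned tablesRead;  // how many of the K tables were actually read
};

// Reads tables 0..K-1 in order, stopping as soon as the validator can
// decide, so the remaining K-(i+1) lookups are bypassed and the next
// pattern search key can be processed sooner.
EarlyResult matchAcrossTables(unsigned K,
        const std::function<Verdict(unsigned)>& readTable) {
    for (unsigned i = 0; i < K; ++i) {
        Verdict v = readTable(i);
        if (v != Verdict::Undecided)
            return {v, i + 1};  // early out: skip the remaining tables
    }
    return {Verdict::NoMatch, K};  // assumption: exhaustion means no match
}
```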
In one embodiment, the pattern search keys passed to the compressed database pattern retriever, such as compressed database pattern retriever 140 and 305 of
In one embodiment, the hash function used by hash value calculator 130 is implemented using recursive hash functions based on cyclic polynomials; see, for example, “Recursive Hashing Functions for n-Grams”, Jonathan D. Cohen, ACM Transactions on Information Systems, Vol. 15, No. 3, July 1997, pp. 291-320, the content of which is incorporated herein by reference in its entirety. The recursive hash function operates on N-grams of textual words or binary data. An N-gram is a textual word or binary data with N symbols, where N is defined by the application. In general, hash functions, including those that are non-recursive, generate an M-bit hash value from an input N-gram. Typically a symbol is represented by an 8-bit byte, thus resulting in N bytes in an N-gram. The hash functions enable mapping an N-gram into bins represented by M bits such that the N-grams are uniformly distributed over the 2^M bins. An example of a typical value of N is 10, and of M is 32.
Non-recursive hash functions re-calculate the complete hash function for every input N-gram, even if subsequent N-grams differ only in the first and last symbols. In contrast, the recursive variant can generate a hash value from the previously computed hash value, the departing symbol and the new input symbol. Such computationally efficient recursive hash functions can be implemented in software, in hardware, or in a combination of the two.
In one embodiment, the recursive hash function is based on cyclic polynomials. In another embodiment, the recursive hash function may use self-annihilating algorithms and is also based on cyclic polynomials, but requires N and M to both be a power of two. In self-annihilating algorithms, the old symbol of an N-gram does not have to be explicitly removed. The following is an exemplary recursive hash function based on cyclic polynomials, written in C++ and adapted for hardware implementation:
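A minimal sketch of such a cyclic-polynomial recursive hash is given below. The random transformation-table contents, the one-bit rotation of the running hash value, and the pre-rotation of T′ by N bit positions are assumptions consistent with this family of hash functions, not the exact listing of the embodiment:

```cpp
#include <array>
#include <cstdint>
#include <random>

constexpr unsigned N = 10;  // symbols per N-gram (example value from the text)
constexpr unsigned M = 32;  // bits per hash value (example value from the text)

// Rotate a 32-bit word left by s bit positions.
inline uint32_t rotl(uint32_t v, unsigned s) {
    s %= M;
    return s ? (v << s) | (v >> (M - s)) : v;
}

struct CyclicHash {
    std::array<uint32_t, 256> T;   // transformation table: symbol -> random word
    std::array<uint32_t, 256> Tp;  // inverse table T': T pre-rotated by N
    uint32_t h = 0;                // running hash value

    explicit CyclicHash(uint32_t seed = 1) {
        std::mt19937 rng(seed);    // assumed: tables filled with random words
        for (int i = 0; i < 256; ++i) {
            T[i]  = uint32_t(rng());
            Tp[i] = rotl(T[i], N); // contribution a symbol has N steps later
        }
    }

    // Slide the window one symbol: rotate the running hash, remove the
    // contribution of the symbol that entered N symbols ago via T', and
    // mix in the new symbol via T.
    uint32_t update(uint8_t incoming, uint8_t outgoing) {
        h = rotl(h, 1) ^ Tp[outgoing] ^ T[incoming];
        return h;
    }

    // Non-recursive reference: hash an N-gram from scratch.
    uint32_t full(const uint8_t* s) const {
        uint32_t v = 0;
        for (unsigned i = 0; i < N; ++i) v = rotl(v, 1) ^ T[s[i]];
        return v;
    }
};
```

After priming with one full N-gram, each subsequent hash value costs only one rotation, two table lookups and two XOR operations, independent of N.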
The inverse transformation table T′ is derived from the transformation table T, so the values in the table T determine the actual hash function that maps input symbols to hash values. The transformation table T is used to contribute an input symbol to the overall hash value. Conversely, the inverse transformation table T′ is used to remove the contribution of an input symbol from the hash value. When an input symbol is encountered in the input stream, a new hash value is calculated from the input symbol, the transformation table T and the current hash value. The contribution of this symbol to the hash value is removed N symbols later. This description assumes that an input data symbol corresponds to a single 8-bit input byte; therefore each table has 256 entries. However, the size of the input data symbol may be greater or less than a single 8-bit byte, in which case the sizes of the tables are correspondingly larger or smaller.
Referring to
For illustration purposes, the exemplary embodiment shown in
In the embodiment shown in
In the above exemplary embodiment, each hash value is shown as including 32 bits. Allocating one extra bit to each hash value doubles the amount of overall space addressable by the hash value, thus reducing the probability of unwanted collisions in the compressed memory tables. However, it also increases the number of bits required for the FIRST_ID and/or SECOND_ID fields, as more hash value bits would require validation. The sizes of FIRST_ID and SECOND_ID are limited by the width of the memories. Therefore, using 32-bit hash values requires an extra bit for the FIRST_ID field, and this can be accomplished by a corresponding reduction in the number of bits used to represent BASE_ADDR in the second memory table, because the full width of the memories is already utilized. In one embodiment, the number of bits allocated to BASE_ADDR does not need to be reduced when the number of bits allocated to FIRST_ID is increased. This is achieved by having FIRST_ID and BASE_ADDR share one or more bits. However, there are some restrictions on the values of FIRST_ID and BASE_ADDR that can be used. These restrictions depend on which bits of FIRST_ID and BASE_ADDR are shared.
In the above example, BASE_ADDR is represented by 20 bits, thus permitting the use of an offset into the second memory table 430 that can address up to 2^20=1,048,576 different locations. A reduction in the space addressable by BASE_ADDR reduces the total amount of usable space in the second memory table 430, which increases the number of undesirable pattern search key collisions. It is understood that more or fewer hash value bits may be used in order to increase or reduce the number of unwanted pattern search key collisions, and that the number of bits available to BASE_ADDR may decrease to the point where the number of unwanted pattern search key collisions actually increases due to the reduction in the amount of addressable space in the second memory table 430.
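An illustrative packing of a 32-bit first-memory-table entry is sketched below. Only the 20-bit BASE_ADDR width comes from the text; the 1-bit use flag and 11-bit FIRST_ID widths are assumptions chosen to fill a 32-bit word, and the bit-sharing variant is not shown:

```cpp
#include <cstdint>

// Assumed layout of a 32-bit first-table entry:
//   bit 31      : USE_F use bit
//   bits 30..20 : FIRST_ID (11 bits, assumed width)
//   bits 19..0  : BASE_ADDR (20 bits, per the example in the text)
constexpr unsigned BASE_ADDR_BITS = 20;
constexpr unsigned FIRST_ID_BITS  = 11;

inline uint32_t packFirstEntry(bool useF, uint32_t firstId, uint32_t baseAddr) {
    return (uint32_t(useF) << 31) |
           ((firstId & ((1u << FIRST_ID_BITS) - 1)) << BASE_ADDR_BITS) |
           (baseAddr & ((1u << BASE_ADDR_BITS) - 1));
}

inline bool     useF(uint32_t e)     { return (e >> 31) != 0; }
inline uint32_t firstId(uint32_t e)  { return (e >> BASE_ADDR_BITS) & ((1u << FIRST_ID_BITS) - 1); }
inline uint32_t baseAddr(uint32_t e) { return e & ((1u << BASE_ADDR_BITS) - 1); }
```

Widening FIRST_ID under this layout would narrow BASE_ADDR, which is the trade-off discussed above.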
Referring to
The base address, BASE_ADDR, retrieved from the first memory table 415 at the location defined by the parameters KEYSEG1 and FIRST_OFFSET, is subsequently added to a second constant and pre-determined offset value, denoted as SECOND_OFFSET, and further added to parameter value KEYSEG2 to determine an address in the second memory table 430. The offset, SECOND_OFFSET, facilitates the use of multiple second-key-segment blocks that correspond to different hash functions. Therefore, multiple and independent pattern databases can be stored in the same memory tables by using appropriate values for SECOND_OFFSET.
Since in the above exemplary embodiment, the second memory table 430 is a 32-bit memory, the least significant bit of the computed address for the second memory table 430 is extracted and used to select one of the inputs of the multiplexer 435. The upper 21 bits are used as the actual address for the second memory table 430. This allows two SECOND_ID parameters to be stored for every 32-bit entry in second memory table 430. The least significant bit of the second memory table address is used to select the specific SECOND_ID. In
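The half-word selection just described may be sketched as follows, assuming two 16-bit SECOND_ID values packed per 32-bit word:

```cpp
#include <cstdint>

// The upper bits of the computed second-table address select a 32-bit
// word; the least significant bit selects one of the two SECOND_ID
// half-words packed in that word. The 16-bit SECOND_ID width is assumed.
inline uint32_t wordAddress(uint32_t computedAddr) {
    return computedAddr >> 1;  // upper bits form the physical memory address
}

inline uint16_t selectSecondId(uint32_t word, uint32_t computedAddr) {
    // LSB of the computed address picks the low or high half-word
    return (computedAddr & 1u) ? uint16_t(word >> 16)
                               : uint16_t(word & 0xFFFFu);
}
```

Packing two identifiers per word doubles the number of SECOND_ID entries the same physical memory can hold.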
In order for a positive match to occur, the use-bits, USE_F and USE_S, have to be set. During the pattern compression process, a use bit is set if the entry stores a corresponding training pattern; otherwise it is cleared. The use bits are set or cleared when the training patterns are compiled, compressed and loaded into the tables. Therefore, a cleared use bit indicates a no-match condition. In some embodiments, if the use-bit in the first memory table is cleared, the lookup of the second memory table 430 may be bypassed so that the next processing cycle can be allocated to the lookup of the first memory table 415 instead of the second memory table 430; the next match cycle therefore begins in the first memory table 415 and the second memory table 430 is not accessed. In such situations, the match validator 260 has the ability to send a signal back to memory accessors down the chain of memory accessors indicating that further reads are not required. Consequently, the overall system operates faster because extra memory lookups are avoided.
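The bypass decision on a cleared use bit may be sketched as follows; placing USE_F in bit 31 of the first-table entry is an assumption:

```cpp
#include <cstdint>

// A cleared USE_F bit means the first-table entry stores no training
// pattern, so the second-table read can be skipped and the next cycle
// spent on the next key's first-table lookup instead.
enum class Step { NoMatchBypass, ReadSecondTable };

inline Step nextStep(uint32_t firstEntry) {
    bool useF = ((firstEntry >> 31) & 1u) != 0;  // assumed bit position
    return useF ? Step::ReadSecondTable : Step::NoMatchBypass;
}
```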
In practice, it is desirable to have M as large as possible so that an input N-gram is mapped to a large universe of hash values with minimal overlapping between different input N-grams. Using a large value of M means that one cannot directly use the hash values to address a physical memory, because the number of required memory addresses would be too large. For example, using a value of 31 for M implies that a physical memory of 2^31=2,147,483,648 entries is required in order for the hash values to directly address this memory space. However, the total number of unique N-grams that need to be represented is usually much less than 2^31. In other words, the universe of all possible hash values is usually sparsely populated by the database of patterns that hash into it. The present invention takes advantage of this property to reduce the space required to store the hash values of a corresponding pattern database to one that is of the order of the number of unique N-grams.
In the embodiments described above, the training patterns with length less than N are not stored in the compressed memory format. In
Hash value calculator 910 generates the M-bit hash value, which is then used by the memory lookup module 210 to retrieve the corresponding entry in the compressed first and second memory tables 150 and 160. If a matching entry is detected in the memory tables, memory lookup module 210 outputs a valid matched state, where the state is the address of the second memory table corresponding to the matched hash value. Due to the nature of the recursive hash function, match results corresponding to the first (N-1) symbols are invalid and are discarded by the post-processor 220.
In one embodiment, the above invention may be used together with a finite state machine that also performs pattern matching. Instead of padding patterns with length less than N as described above and illustrated by
One or more of the memory accessor modules 240, 245, 335, 340, 345 can implement the identity operation. That is, they do not perform any memory lookups or functions other than passing the input to the output without modification. The inputs to the memory accessor modules are modified key segments. So, in this embodiment, the values of modified key segments transmitted to memory accessor modules implementing the identity operation are passed directly to the match validator 365. In such embodiments, match validator 365 contains decision logic that is a function of only the modified key segments, with no dependencies on memory table values.
Although the foregoing invention has been described in some detail for purposes of clarity and understanding, those skilled in the art will appreciate that various adaptations and modifications of the just-described preferred embodiments can be configured without departing from the scope and spirit of the invention. For example, other pattern matching technologies may be used, or different network topologies may be present. Moreover, the described data flow of this invention may be implemented within separate network systems, or in a single network system, and running either as separate applications or as a single application. Therefore, the described embodiments should not be limited to the details given herein, but should be defined by the following claims and their full scope of equivalents.
The present application claims benefit under 35 USC 119(e) of U.S. provisional application No. 60/654,224, attorney docket number 021741-001900US, filed on Feb. 17, 2005, entitled “APPARATUS AND METHOD FOR FAST PATTERN MATCHING WITH LARGE DATABASES” the content of which is incorporated herein by reference in its entirety. The present application is related to copending application Ser. No. ______, entitled “COMPRESSION ALGORITHM FOR GENERATING COMPRESSED DATABASES”, filed contemporaneously herewith, attorney docket no. 021741-001920US, assigned to the same assignee, and incorporated herein by reference in its entirety.
| Number | Date | Country |
|---|---|---|
| 60654224 | Feb 2005 | US |