Reference will now be made in detail to a particular embodiment of the invention, an example of which is illustrated in the accompanying drawings. While the invention will be described in conjunction with the particular embodiment, it will be understood that it is not intended to limit the invention to the described embodiment. To the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims.
In the described embodiments, either a content addressable memory (CAM) or a CAM-equivalent collision-free hash-based lookup architecture with zero false positives is used for implementing large finite state machines (FSM) in hardware. In this way, state transitions and outputs are computed with a predictable latency consisting of a fixed and small number of memory lookups. Furthermore, using embedded memory, it is possible to optimize the memory bank architecture to suit the bit-widths and other lookup requirements of the FSM. In one embodiment, the arbitrary length string matching problem is formulated as a state machine traversal wherein the dictionary is represented as an FSM in which a state represents the past history of input characters received, a transition from one state to the next is predicated on the value of the current input character of the string, and the FSM is implemented as a CAM. The CAM stores the rows of the state transition table of the FSM such that each row contains the input, current state and corresponding next state for a transition.
Some states of the FSM are marked as accepting states such that when one of these states is reached, a specific string is known to have been matched. The accepting state information is also stored with the state transition rows in the CAM. The arbitrary length string being matched is streamed into the lookup architecture one or more input units (a character, for example) at a time. In general, the matching is performed by looking up the concatenation of the current input unit and the current state in the CAM to determine if a row with this combination is present in the FSM transition table. If such a row is detected, the corresponding next state is determined as part of the lookup. The traversal then continues with the just determined next state becoming the current state and using the next input unit from the string if such an input unit is available. If an additional input unit is not available, the process is said to have completed. Also, during the CAM lookup, it is determined if the next state is an accepting state. If it is an accepting state, the string match signal is issued; otherwise, it is not issued. If, during the CAM lookup, no entry is found corresponding to the current input unit and current state, the default transition from the current state, as specified in the FSM, is performed, the match signal is not issued, and the traversal continues as indicated above.
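By way of illustration only, the traversal described above can be sketched in software, with a Python dictionary standing in for the CAM. The names `traverse`, `cam`, and `default`, the integer state numbering, and the hand-built table for the dictionary {"ab", "abc"} are assumptions made for this sketch and are not taken from the described embodiment.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical types: a state is an int, an input unit is a single character.
Key = Tuple[int, str]    # (current state, current input unit)
Row = Tuple[int, bool]   # (next state, whether the next state is accepting)


def traverse(cam: Dict[Key, Row], default: Dict[int, int],
             stream: Iterable[str], start: int = 0) -> List[int]:
    """Return the stream positions at which a match signal would be issued."""
    state = start
    matches = []
    for pos, unit in enumerate(stream):
        row = cam.get((state, unit))   # one lookup per input unit
        if row is None:
            # No row for (state, unit): take the default transition,
            # issue no match signal, and move on to the next input unit.
            state = default.get(state, start)
            continue
        state, accepting = row         # next state becomes the current state
        if accepting:
            matches.append(pos)
    return matches


# Hand-built transition rows for the dictionary {"ab", "abc"} (illustrative).
# States: 0 = start, 1 = "a", 2 = "ab" (accepting), 3 = "abc" (accepting).
cam = {
    (0, "a"): (1, False),
    (1, "b"): (2, True),
    (2, "c"): (3, True),
    (1, "a"): (1, False),
    (2, "a"): (1, False),
    (3, "a"): (1, False),
}
default = {s: 0 for s in range(4)}     # every default transition returns to start

matches = traverse(cam, default, "xabcab")   # match signals at positions 2, 3, 5
```

Each input unit costs exactly one lookup in this sketch, which mirrors the fixed, predictable latency attributed to the CAM-based scheme.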
In a refinement of the above embodiment, the transition table for the dictionary FSM is implemented using the CAM-equivalent zero-false-positive lookup architecture described herein. The concatenation of the current input unit and current state is k-way hashed into k addresses in a first memory arranged, for example, in rows and columns where each of the rows has a first data field that includes a Bloom bit used to identify those incoming strings that cannot be present in the string dictionary. Each of the rows also includes a second data field that includes a unique bit that is used to determine a sub-set of the k hash locations that hold useful data (thereby eliminating the false positives inherent in Bloom filters). Each of the rows also includes a third data field that includes information that identifies an address in a second memory that stores the FSM transition table that is used to determine if the incoming string is stored in the string dictionary or not. In the case where the incoming string is stored in the string dictionary, the string-matching engine issues a match signal; otherwise, a no match signal is issued.
In the described refinement, a subset of the k addresses is identified that contains information that identifies an address in a second memory that stores the FSM transition table. The lookup is deemed successful when the second memory entry identified by one of the subset of the k addresses contains the same input unit and state pair as is currently being applied. Apart from this refinement in the lookup scheme, the subsequent FSM traversal and arbitrary length string matching is performed as above.
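By way of illustration only, the two-memory lookup described above can be sketched as follows. The class `CamEquivalent`, the helper `k_hashes`, the use of salted SHA-256 as a stand-in for hardware hash functions, the table size, and the build procedure (marking as unique any first-memory slot to which exactly one key hashes, and retrying with a new seed when some key has no such slot) are all assumptions of this sketch; the described embodiment does not specify them.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

Key = Tuple[int, str]    # (current state, current input unit)
Row = Tuple[int, bool]   # (next state, whether the next state is accepting)


def k_hashes(key: Key, k: int, size: int, seed: int) -> List[int]:
    """Return k first-memory addresses for a key (SHA-256 is a stand-in)."""
    state, unit = key
    out = []
    for i in range(k):
        digest = hashlib.sha256(f"{seed}:{i}:{state}:{unit}".encode()).digest()
        out.append(int.from_bytes(digest[:8], "big") % size)
    return out


class CamEquivalent:
    def __init__(self, rows: Dict[Key, Row], k: int = 3, size: int = 256,
                 max_seeds: int = 16):
        self.k, self.size = k, size
        for seed in range(max_seeds):      # hypothetical re-hash policy
            if self._build(rows, seed):
                self.seed = seed
                return
        raise ValueError("no collision-free layout found; increase size or k")

    def _build(self, rows: Dict[Key, Row], seed: int) -> bool:
        load = [0] * self.size             # number of keys hashing to each slot
        for key in rows:
            for a in set(k_hashes(key, self.k, self.size, seed)):
                load[a] += 1
        bloom = [False] * self.size        # first field: Bloom bit
        unique = [False] * self.size       # second field: unique bit
        pointer = [0] * self.size          # third field: second-memory address
        second = []                        # second memory: full transition rows
        for key, (nxt, acc) in rows.items():
            addrs = k_hashes(key, self.k, self.size, seed)
            owned = [a for a in addrs if load[a] == 1]
            if not owned:
                return False               # this key has no slot of its own
            for a in addrs:
                bloom[a] = True
            unique[owned[0]] = True
            pointer[owned[0]] = len(second)
            second.append((key, nxt, acc))
        self.bloom, self.unique = bloom, unique
        self.pointer, self.second = pointer, second
        return True

    def lookup(self, key: Key) -> Optional[Row]:
        addrs = k_hashes(key, self.k, self.size, self.seed)
        if not all(self.bloom[a] for a in addrs):
            return None                    # Bloom bits rule the key out
        for a in addrs:
            if self.unique[a]:
                stored_key, nxt, acc = self.second[self.pointer[a]]
                if stored_key == key:      # exact compare: no false positives
                    return nxt, acc
        return None


rows = {(0, "a"): (1, False), (1, "b"): (2, True), (2, "c"): (3, True)}
table = CamEquivalent(rows)
```

The final comparison against the row held in the second memory is what removes false positives in this sketch: a key that is absent from the table may pass the Bloom bits and may even land on another key's unique slot, but the stored input unit and state pair will not match it.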
It should be noted that if the FSM is implemented as a true CAM, the string lookup in the direct scheme or the next state lookup in the FSM-based scheme does not require a k-way hash since all that is required is to look up the concatenation of the input and current state in the CAM. K-way hashing and the associated filtering are only required when using the CAM-equivalent architecture.
In this way, the described string matching engine implemented using an FSM provides for efficient string matching using a low memory, collision-free, hash-based lookup scheme with low average case bandwidth and power requirements that overcomes prior art limitations by providing the ability to match against a large dictionary of long and arbitrary length strings at line speed. The described embodiments will now be described in terms of a string matching engine, system, and method useful in a number of applications where memory and computing resources are at a premium or high performance is desired. Such applications are typically found in portable devices such as personal communication devices 100 (shown in
The described string matching engine can be deployed as a macro program executed by a central processing unit (CPU) or included in a co-processor having its own memory and computing resources arranged to filter any incoming traffic for strings that have been identified as potential malware (i.e., a computer virus). In this way, malware detection can be off-loaded from the CPU, thereby freeing up computing and memory resources otherwise required for detection of malware that would have the potential to severely disrupt the operation of the personal communication device 100. In some cases, the strings stored in a string dictionary and used by the string matching engine to detect such malware are supplied and periodically updated by a third party on either a subscription basis or as part of a service contract between a user and a service provider.
The cell phone 100 also includes a user input device 114 that allows a user to interact with the cell phone 100. For example, the user input device 114 can take a variety of forms, such as a button, keypad, dial, etc. Still further, the cell phone 100 includes a display 116 (screen display) that can be controlled by the processor 104 to display information to the user. A data bus can facilitate data transfer between at least the ROM 112, RAM 110, the processor 104, and a CODEC 118 that produces analog output signals for an audio output device 120 (such as a speaker). The speaker 120 can be a speaker internal to the cell phone 100 or external to the cell phone 100. For example, headphones or earphones that connect to the cell phone 100 would be considered an external speaker. A wireless interface 122 operates to receive information from the processor 104 that opens a channel (either voice or data) for transmission and reception typically using RF carrier waves.
During operation, the wireless interface 122 receives an RF transmission carrying an incoming data stream 124 in the form of data packets 126. Copies of the data packets are made and in some cases undergo additional processing prior to being forwarded to the co-processor 104 for examination by the string-matching engine 108 for possible inclusion of strings associated with known computer malware. In the described embodiment, the group of stored strings (referred to as a string dictionary) used by the string matching engine 108 is provided by a third party and is periodically updated with new strings in order to detect new computer malware. It should be noted that the inputs to the string-matching engine do not need to be derived solely from traffic. For example, inputs to the string-matching engine can take the form of files already resident in the cell phone memory (RAM 110, ROM 112).
The string-matching engine 108 will provide a match flag 128 in those situations where the incoming data stream 124 includes a string 130 that matches one of the entries in the string dictionary. The match flag 128 will notify the CPU 104 that the cell phone 100 has been exposed to potentially harmful computer malware and appropriate prophylactic measures must be taken. These measures can include malware sequestration, inoculation, quarantine, etc. provided by a security protocol.
On the other hand, if a row in the transition table 208 having the input value 222 is detected, then the processor determines a corresponding next state and if the next state is determined to be an accepting state, then the processor instructs the string-matching engine 200 to issue a match signal. If, however, the next state is not an accepting state, then the processor determines if there is a next input unit, and if so, then the matching process is continued as long as the input character stream is not exhausted. In this manner, anchored or unanchored string matching is performed for the input stream. It should be noted that this approach works well in situations where the number of state plus input combinations required for modeling all the FSM transitions can be accommodated in the CAM/CAM-equivalent dictionary.
Alternatively, if the string dictionary is configured using a CAM equivalent architecture shown in
In a particularly useful embodiment, the addition of a new string to the string dictionary entails modifications to the dictionary FSM including some modification of the transition structure involving addition and deletion of transition edges. The addition/deletions reflected in the primary hash lookup table are the changes in the edge transitions rather than the actual strings.
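By way of illustration only, the edge-level update described above can be sketched as a difference between two transition tables. The function `edge_delta` and the two hand-built tables (for the dictionary {"ab"} before and {"ab", "ba"} after the addition, with every unlisted pair taking a default transition to state 0) are assumptions of this sketch rather than part of the described embodiment.

```python
from typing import Dict, Set, Tuple

Key = Tuple[int, str]    # (current state, current input unit)
Row = Tuple[int, bool]   # (next state, whether the next state is accepting)
Edge = Tuple[Key, Row]


def edge_delta(old: Dict[Key, Row],
               new: Dict[Key, Row]) -> Tuple[Set[Edge], Set[Edge]]:
    """Return (edges to add, edges to delete) when moving from old to new.

    A re-routed edge appears in both sets: once with its old target
    (to delete) and once with its new target (to add).
    """
    old_edges = set(old.items())
    new_edges = set(new.items())
    return new_edges - old_edges, old_edges - new_edges


# Dictionary {"ab"}: states 0 = start, 1 = "a", 2 = "ab" (accepting).
old_table = {
    (0, "a"): (1, False), (1, "a"): (1, False),
    (1, "b"): (2, True),  (2, "a"): (1, False),
}
# Dictionary {"ab", "ba"}: adds states 3 = "b" and 4 = "ba" (accepting).
new_table = {
    (0, "a"): (1, False), (0, "b"): (3, False),
    (1, "a"): (1, False), (1, "b"): (2, True),
    (2, "a"): (4, True),  (2, "b"): (3, False),
    (3, "a"): (4, True),  (3, "b"): (3, False),
    (4, "a"): (1, False), (4, "b"): (2, True),
}

added, removed = edge_delta(old_table, new_table)
```

In this example the only deleted edge is the one leaving state 2 on input "a", which previously led to state 1 and now leads to the new accepting state 4; all other changes are additions. Only these edge changes, and not the string "ba" itself, would need to be written to the lookup table.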
Referring back to
Returning to 806, if the FSM is implemented using a CAM-equivalent architecture, then at 826, the concatenation of the current input unit and current state is k-way hashed into k addresses in a memory arranged, for example, in rows and columns. It is determined at 828 whether there is a subset of the k addresses that contains information identifying an address corresponding to a dictionary entry. The lookup is determined to be successful at 830 when the dictionary entry identified by one of the subset of the k addresses contains the same input unit and state pair as is currently being applied, at which point a row-match signal is issued at 832; otherwise, a no row-match signal is issued at 834. In either case, control is subsequently passed back to 812.
By using the CAM/CAM-equivalent architecture for implementing the string-matching automaton, the invention is able to achieve the ideal O(n) complexity of the Aho-Corasick algorithm for matching against the entire dictionary consisting of strings of arbitrary length. Furthermore, the invention provides the ability to advance the input stream by more than one character for reducing the complexity below O(n) in a manner that is superior to the Boyer-Moore algorithm that is restricted to matching against single strings. The CAM/CAM-equivalent lookup architecture allows the inventive string matching to overcome this limitation. The greatest benefit in the Boyer-Moore algorithm comes from the ability to advance the input stream by a large count when the last character does not occur in the dictionary string. With the inventive string matching engine, the pre-determined set of characters to look up to enable a Boyer-Moore-like jump, as well as the actual value of the jump, are stored in the CAM/CAM-equivalent lookup table. For example, all characters up to a certain distance from the end of the dictionary strings could be stored in the lookup table. Using the CAM/CAM-equivalent scheme, it is possible to determine in a single step whether the last character of the input stream segment matches any of these stored characters. If not, the stream is allowed to advance by the predetermined Boyer-Moore increment. This is a substantial performance boost since the lookup of the last character is performed in O(1) time for the entire set of dictionary strings. This scheme is further advanced by storing character sequences rather than single characters in the lookup table for computing the input stream increment. This increases the likelihood that the lookup returns a “no match”, thus making the use of the input stream increment more frequent.
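By way of illustration only, a Boyer-Moore-like jump over a whole dictionary can be sketched with a shift table keyed by single characters, a Python dictionary again standing in for the CAM/CAM-equivalent lookup table. The sketch uses a set-Horspool style table (as in the Wu-Manber family of algorithms) aligned on the first m characters of each string, where m is the length of the shortest dictionary string; this choice, and the names `build_shift_table` and `search`, are assumptions of the sketch and not necessarily the scheme of the described embodiment. The dictionary strings are assumed to be non-empty.

```python
from typing import Dict, List, Sequence, Tuple


def build_shift_table(patterns: Sequence[str]) -> Tuple[int, Dict[str, int]]:
    """Return (m, shift): m is the shortest pattern length, and shift maps a
    character to the smallest distance from any of its occurrences in the
    first m characters of a pattern to the end of that m-character window."""
    m = min(len(p) for p in patterns)
    shift: Dict[str, int] = {}
    for p in patterns:
        for j, ch in enumerate(p[:m]):
            shift[ch] = min(shift.get(ch, m), m - 1 - j)
    return m, shift


def search(text: str, patterns: Sequence[str]) -> List[Tuple[int, str]]:
    """Return (start position, pattern) for every dictionary string in text."""
    m, shift = build_shift_table(patterns)
    hits = []
    i = m - 1                              # index of the last character in the window
    while i < len(text):
        jump = shift.get(text[i], m)       # one lookup; an absent character jumps by m
        if jump == 0:
            start = i - m + 1
            for p in patterns:             # verify the candidates at this alignment
                if text.startswith(p, start):
                    hits.append((start, p))
            jump = 1
        i += jump
    return hits


hits = search("xx malware and evil", ["evil", "malware"])
# hits == [(3, "malware"), (15, "evil")]
```

In the example, the character "n" does not occur in the first four characters of either string, so the window advances by the full increment of four without examining the intervening characters. Keying the table by character sequences instead of single characters, as the text suggests, would make such full jumps more frequent.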
Embodiments of the invention, including the apparatus disclosed herein, can be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. Apparatus embodiments of the invention can be implemented in a computer program product tangibly embodied in a machine-readable storage device for execution by a programmable processor; and method steps of the invention can be performed by a programmable processor executing a program of instructions to perform functions of the invention by operating on input data and generating output. Embodiments of the invention can be implemented advantageously in one or more computer programs that are executable on a programmable system including at least one programmable processor coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input device, and at least one output device. Each computer program can be implemented in a high-level procedural or object-oriented programming language, or in assembly or machine language if desired; and in any case, the language can be a compiled or interpreted language.
Suitable processors include, by way of example, both general and special purpose microprocessors. Generally, a processor will receive instructions and data from a read-only memory and/or a random access memory. Generally, a computer will include one or more mass storage devices for storing data files; such devices include magnetic disks, such as internal hard disks and removable disks; magneto-optical disks; and optical disks. Storage devices suitable for tangibly embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, such as EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM disks. Any of the foregoing can be supplemented by, or incorporated in, ASICs (application-specific integrated circuits).
A number of implementations of the invention have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention. Accordingly, other embodiments are within the scope of the following claims.
This patent application claims priority under 35 U.S.C. 119(e) to U.S. Provisional Patent Application No. 60/840,168, filed on Aug. 25, 2006 (Attorney Docket No. NETFP001P) entitled “STRING MATCHING ENGINE” by Choudhary et al. This application is also related to (i) co-pending application entitled “STRING MATCHING ENGINE” by Choudhary et al. (Attorney Docket No. NETFP001) having application Ser. No. 11/550,320 and filed Oct. 17, 2006, and (ii) co-pending application entitled “REGULAR EXPRESSION MATCHING ENGINE” by Ashar et al. (Attorney Docket No. NETFP003) having application Ser. No. ______ and filed ______, each of which is incorporated by reference in its entirety for all purposes.
Number | Date | Country
---|---|---
60840168 | Aug 2006 | US