This disclosure relates to pattern recognition and, more particularly, to substring detection for use in a pattern recognition system and method.
In some applications and systems, such as networking systems, it is desirable to recognize the appearance of particular data. For example, network security systems, such as firewalls or intrusion detection systems, may use pattern matching to recognize certain data strings received by the systems. Viruses and/or other types of unwanted data, for example, may be recognized for network protection. Pattern matching may also be used for other purposes, such as network monitoring or biological pattern matching (e.g., particular portions of a deoxyribonucleic acid (DNA) sequence may be of interest for medical research). To detect the appearance of desirable or undesirable data, one or more predefined data patterns may be compared to the data under study. If one or more of the predefined data patterns matches a portion of the data under study, appropriate action may be executed. If a pattern representative of a computer virus is detected, for example, the data that contains the virus may be removed to reduce infection. Existing multiple pattern matching algorithms are complex and may use relatively large memory footprints and a significant amount of time to perform multiple pattern matching on data strings.
The details of one or more implementations of this disclosure are set forth in the accompanying drawings and the description below. Other features and advantages will become apparent from the description, the drawings, and the claims.
Although the following Detailed Description will proceed with reference being made to illustrative embodiments, many alternatives, modifications, and variations thereof will be apparent to those skilled in the art. Accordingly, it is intended that the claimed subject matter be viewed broadly, and be defined only as set forth in the accompanying claims.
Referring to
In this embodiment, system 100 may include an internal network 118 that may include computer systems, network devices, and networks that are separate from external network 112. For example, internal network 118 may represent a network used by an entity such as a business, an educational institution, and the like. Internal network 118 may include a security device 120 (e.g., a switch, a router, a computer system, etc.) that provides a level of security (e.g., enforces security policies) to the internal network 118. For example, security device 120 may monitor inbound (and outbound) data for viruses and/or other types of undesirable data. By monitoring inbound data packet traffic, security device 120 may reduce the probability of a virus (or other type of undesirable data) being received at computer system 122, computer system 124 and/or computer system 126 (or other type of device connected within internal network 118).
To enforce security policies, security device 120 may include a security processor 130 that may be tasked to monitor the content of inbound data packets. Security processor 130 may include one input/output (I/O) port 132 for receiving inbound data packets and another I/O port 134 for appropriately passing inbound data packets within internal network 118 (e.g., passing packets to computer system 122). Security processor 130 may be implemented as a single processor or multiple processors. For example, security processor 130 may include one or more general processors (e.g., a microprocessor) and/or one or more specialized devices (e.g., a co-processor or an application specific integrated circuit (ASIC)). In some embodiments security processor 130 may be implemented as a monolithic structure (e.g., a single integrated circuit), or, in other embodiments, as a distributed structure. Although the exemplary embodiment shows the security processor 130 in the security device 120, a processor or device capable of the functions described herein may be used in any type of computing device including, but not limited to, desktop computers, laptop computers, mobile phones, handheld devices, and the like.
For illustrative purposes, security device 120 may receive a series of data packets 150. For example, computer system 114 may initiate sending data packet series 150 to computer system 122 via the Internet 116. Before data packet series 150 is delivered to computer system 122, security device 120 may check the series for a certain type of data such as viruses or other types of undesirable data. In some embodiments, security device 120 may be part of an intrusion detection system such as the open source network intrusion detection system available under the name Snort®.
Security processor 130 may include a pattern recognizer 136 for recognizing the data. In some embodiments, pattern recognizer 136 may be implemented in hardware, in software, or a combination thereof. Further, pattern recognizer 136 may implement one or more pattern recognition algorithms for checking data packet content of data packet series 150. One such pattern recognition algorithm is a multi-pattern matching algorithm known to those of ordinary skill in the art, such as the type used in Snort® intrusion detection systems, for determining if portions of the data packets match multiple patterns of interest. Such pattern recognition algorithms may use a considerable amount of time to check the content of all of the data in a data packet series 150. As other data packet series are received by security device 120, the amount of time needed for checking packet content may become significant. In aggregate, the delays may cause a transmission bottleneck to appear at security device 120 that may reduce throughput of the device.
To reduce the time needed to check data packet content (e.g., the contents of data packet series 150), security processor 130 may include a substring detector 138. Substring detector 138 provides a preliminary check on data packet content. To provide this check, substring detector 138 attempts to detect if one or more predefined substrings may be present in the data packet content. As illustrated below, each predefined substring may be a substring (e.g., a prefix) of one or more predefined patterns of interest. By detecting one or more predefined substrings in the data packet content, substring detector 138 may identify one or more suspect data packets that may contain a pattern of interest (e.g., a virus or other type of undesirable data). Upon detecting one or more predefined substrings in a suspect data packet, the suspect data packet may be provided to pattern recognizer 136 to perform further pattern recognition of the data packet contents, for example, using multi-pattern matching. If a substring is not detected, the data packet may be eliminated from further pattern recognition processing and passed on to its destination. Alternatively, a data packet may be placed aside for further pattern recognition processing (e.g., by pattern recognizer 136) at a later time. By identifying suspect data packets based upon one or more predefined substrings of predefined patterns of interest, pattern recognizer 136 may be more efficiently executed for recognizing the patterns of interest in the data packets. Thus, the probability of bottlenecking may be reduced and throughput may be increased.
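The two-stage arrangement described above can be illustrated with a minimal sketch. It assumes plain byte-string containment for the preliminary check (rather than fingerprints) and uses hypothetical names (`prefilter_packets`, `slow_matcher`) and hypothetical patterns; it is not the implementation of substring detector 138 or pattern recognizer 136.

```python
from typing import Callable, Iterable, List

def prefilter_packets(
    packets: Iterable[bytes],
    substrings: Iterable[bytes],
    full_match: Callable[[bytes], bool],
) -> List[bytes]:
    """Run the expensive matcher only on packets that contain a predefined substring."""
    needles = list(substrings)
    flagged = []
    for payload in packets:
        # Preliminary check: is any predefined substring present?
        if not any(s in payload for s in needles):
            continue  # not suspect; would be passed on to its destination
        # Suspect packet: hand it to the full pattern recognizer.
        if full_match(payload):
            flagged.append(payload)
    return flagged

# Hypothetical patterns of interest and their four-character prefixes.
PATTERNS = [b"interest", b"rates"]
PREFIXES = [p[:4] for p in PATTERNS]

calls = []  # records which packets reached the expensive matcher

def slow_matcher(payload: bytes) -> bool:
    calls.append(payload)
    return any(p in payload for p in PATTERNS)

packets = [b"hello world", b"low interest loan", b"integer overflow"]
flagged = prefilter_packets(packets, PREFIXES, slow_matcher)
```

In this sketch the first packet never reaches `slow_matcher`, while the third packet is a false positive of the preliminary check that the full matcher then rejects.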
In this embodiment, the predefined substrings used by substring detector 138 may be included in a substring library 140 that may be stored on a storage device 142. Storage device 142 may implement one or more storage techniques (e.g., magnetic, magneto-optical disks, or optical disks, etc.) and be accessible by security device 120. By being accessible by security device 120, predefined substrings stored in substring library 140 may be provided to substring detector 138. Other data may also be stored on storage device 142. For example, data representative of instructions executed by security processor 130 to perform the operations of substring detector 138, pattern recognizer 136, and/or I/O ports 132, 134 may be stored in one or more files on storage device 142. Along with storing substrings, data representative of the patterns of interest and/or fingerprints of substrings as described below, may also be stored in substring library 140.
Referring to
In the illustrated embodiment, the patterns of interest 242 and the substrings 244 are formed by ASCII characters. However, other types of data may be used to produce substrings for searching the contents of data packets. For example, numerical data (e.g., binary numbers, decimal numbers, hexadecimal numbers, etc.), graphical data, and the like may be used to produce substrings and to search packet content.
In some embodiments, the predefined substrings of the patterns of interest may be processed to produce a plurality of predefined fingerprints 246 prior to being used for searching packet content. In this embodiment, a fingerprint may be computed for each of the pattern substrings 244 obtained for the patterns of interest 242. In general, fingerprinting is an algorithmic operation that may transform a pattern of one length (W) into a pattern of a shorter length (F), wherein F<W. While the fingerprint of a substring may have a shorter length, the computed fingerprint has a relatively small probability of matching the fingerprint computed for a different substring (known as collision avoidance). Storing smaller sized fingerprints may conserve storage space in comparison to storing the entire substrings. Fingerprinting techniques known to those skilled in the art may be used to compute the predefined fingerprints. One example of a known fingerprinting technique that may be used is fingerprinting by random polynomials. According to this technique, randomly chosen irreducible polynomials may be used to fingerprint bit-strings. Such a technique is described in greater detail in “Fingerprinting by Random Polynomials”, M. Rabin, Technical Report TR-15-81, Harvard University, Department of Computer Science, 1981.
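As a rough illustration of mapping a W-bit pattern to a shorter F-bit fingerprint, the following sketch reduces a four-byte (32-bit) substring to a value below 2^16. It is an assumption-laden stand-in: it uses a simple polynomial hash modulo a prime, not the random irreducible polynomial construction of the cited Rabin report, and the constants `BASE` and `MODULUS` are chosen only for illustration.

```python
BASE = 257       # illustrative multiplier, larger than any byte value
MODULUS = 65521  # largest prime below 2**16, so fingerprints fit in 16 bits

def fingerprint(data: bytes) -> int:
    """Polynomial hash of the bytes, reduced modulo a 16-bit prime."""
    fp = 0
    for byte in data:
        fp = (fp * BASE + byte) % MODULUS
    return fp

# Fingerprints of the four example substrings.
table = {s: fingerprint(s) for s in (b"inte", b"rate", b"elec", b"sele")}
```

Because many 32-bit inputs share each 16-bit output, a fingerprint match only indicates a probable substring match, which is why matching packets are still passed to full pattern recognition.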
The computed fingerprints 246 of the substrings 244 may be stored, for example, in a lookup table. A substring table may be stored in a memory (e.g., volatile, nonvolatile, etc.) such that entries of the table may be accessed by security processor 130 and compared to the fingerprints computed for the content of one or more data packets. For example, the memory may include random access memory (RAM), read-only memory (ROM), static RAM (SRAM), and/or other type of volatile or nonvolatile memory. Other data structures may also be used to store the predefined fingerprints. Predefined fingerprints may also be stored in a substring library (e.g., substring library 140 in storage device 142 shown
Referring to
In other embodiments, however, the size of the substrings and portions of the input string may include more or fewer characters (e.g., two characters, six characters, etc.). In other embodiments, a substring detection operation may read portions of different sizes from the input string depending on the size of the predefined substrings. For example, four character string portions may be read from a data packet (and corresponding fingerprints computed) for comparing with predefined fingerprints computed from four character predefined substrings. Six character portions may also be read from a data packet (and corresponding fingerprints computed) for comparing with predefined fingerprints computed from six character predefined substrings.
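One way to handle predefined substrings of several sizes is to keep a separate fingerprint table per size and scan the input once per size. The sketch below assumes this layout; the helper names are hypothetical, and a truncated CRC-32 from Python's standard `zlib` module stands in for whatever fingerprinting technique is actually used.

```python
import zlib
from typing import Dict, Iterable, Set

def fp(data: bytes) -> int:
    """Stand-in fingerprint: low 16 bits of CRC-32 (an assumption, not the source's method)."""
    return zlib.crc32(data) & 0xFFFF

def build_tables(substrings: Iterable[bytes]) -> Dict[int, Set[int]]:
    """Group predefined fingerprints by the length of the substring they came from."""
    tables: Dict[int, Set[int]] = {}
    for s in substrings:
        tables.setdefault(len(s), set()).add(fp(s))
    return tables

def is_suspect(payload: bytes, tables: Dict[int, Set[int]]) -> bool:
    """Compare every portion of each table's width against that table."""
    for width, table in tables.items():
        for start in range(len(payload) - width + 1):
            if fp(payload[start:start + width]) in table:
                return True
    return False

tables = build_tables([b"inte", b"rate", b"select"])
```

Here four-character portions are only ever compared with fingerprints of four-character substrings, and six-character portions with fingerprints of six-character substrings.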
In one embodiment, data packet 1 may be examined for substrings and data packets 2 and N may then be sequentially examined for the substrings. In other embodiments, one or more data packets may be skipped. For example, after examining data packet 1, data packet 2 may be skipped and data packet N may be examined. The contents of data packet 2 may or may not be examined at a later time. In another embodiment, two or more of the data packets may be examined in parallel. For example, data packets 1, 2 and 3 may be examined simultaneously to reduce processing time.
The input strings (e.g., data packets) examined may also be of variable length. For example, the data packets may be capable of storing more or fewer than the eight characters stored in each of data packets 1, 2 and N. In other embodiments, input strings may be provided as a relatively continuous stream of characters and may not be segmented into packets. In still another embodiment, input strings may be included in a stream of fixed or variable length packets.
Substring detector 338 may compute a fingerprint of the portion of the input string and may compare the computed fingerprint to the predefined fingerprints 346. For example, the fingerprint of the portion “inte” may be compared to each of the fingerprints F(“inte”), F(“rate”), F(“elec”), F(“sele”). If a match is detected, data packet 1 may be provided to pattern recognizer 336 for executing further pattern recognition on the contents of the packet. In this particular example, a match may be detected since the fingerprint of “inte” is present in the predefined fingerprints F(“inte”), F(“rate”), F(“elec”), F(“sele”). If no match is detected in the substring of the data packet, another portion indicated by bracket 354 may be accessed, for example, the next four character portion (e.g., “ntel”). The substring detector 338 may then compute a fingerprint for this next portion of the input string and may compare this next computed fingerprint to the predefined fingerprints 346.
This process may be repeated for an entire data string (e.g., data packet 1) by sliding across the data string a window of a fixed size (e.g., represented by brackets 352, 354) corresponding to the size of the predefined substrings (e.g., a four character sliding window is shown). In this embodiment, a window of a fixed size is slid across the data string by one character at a time such that fingerprints are computed for successive fixed-length portions of the input string and compared to the predefined fingerprints. In the exemplary embodiment where the input string is a data packet, the fixed size of the window may correspond to a fixed number of bytes in a packet payload and the window may slide by one byte at a time.
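The one-character slide can be sketched as follows. The example assumes a simple polynomial hash modulo a prime as the fingerprint and, as an optional optimization not stated in the description, updates the fingerprint incrementally as the window moves instead of recomputing it from scratch. The function names and constants are hypothetical.

```python
from typing import Optional, Set

BASE = 257
MODULUS = 65521  # prime below 2**16; illustrative choice

def fingerprint(data: bytes) -> int:
    fp = 0
    for byte in data:
        fp = (fp * BASE + byte) % MODULUS
    return fp

def first_match_offset(payload: bytes, table: Set[int], width: int = 4) -> Optional[int]:
    """Slide a fixed-size window one byte at a time; return the first matching offset."""
    if len(payload) < width:
        return None
    top = pow(BASE, width - 1, MODULUS)  # weight of the byte leaving the window
    fp = fingerprint(payload[:width])
    for start in range(len(payload) - width + 1):
        if start > 0:
            outgoing = payload[start - 1]
            incoming = payload[start + width - 1]
            # Remove the outgoing byte, shift, and append the incoming byte.
            fp = ((fp - outgoing * top) * BASE + incoming) % MODULUS
        if fp in table:
            return start  # suspect: hand the packet to full pattern recognition
    return None

predefined = {fingerprint(s) for s in (b"inte", b"rate", b"elec", b"sele")}
offset = first_match_offset(b"intel co", predefined)
```

For the eight-character input `b"intel co"`, the first window is "inte", so the scan stops at offset 0 without examining "ntel" or later portions.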
Referring to
When the window is sliding by more than one character at a time, the predefined fingerprints 446 may include fingerprints of portions of the substrings in addition to fingerprints of the entire substrings. Comparing the computed fingerprints to fingerprints of portions of the substring prevents patterns from being missed when sliding the window by more than one character.
As shown in
The number of additional characters to be combined with substring portions to form augmented substring portions depends on the slide (S) of the window having the fixed length (W). In particular, the augmented substring portions may include one to S−1 upfront characters combined with the appropriately shortened substring. Where the slide (S) is two (2) characters, therefore, the augmented substring portions include substring portions with only one (1) upfront character (e.g., “*int” where * represents a character). If the slide (S) is three (3) characters, the augmented substring portions may include substring portions with one (1) additional character (e.g., “*int”) and substring portions with two (2) additional characters (e.g., “*#in” where * and # represent any possible character). If the slide (S) is four (4) characters, the augmented substring portions may include substring portions with one (1) additional character (e.g., “*int”), substring portions with two (2) additional characters (e.g., “*#in”), and substring portions with three (3) additional characters (e.g., “*#$i” where *, # and $ represent any possible character). The predefined fingerprint table 546 may thus be populated by performing a fingerprinting operation over the predefined substrings of each pattern and over the possible augmented portions and storing the results in a lookup table.
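The population of the table with augmented substring portions can be sketched as below. The sketch assumes a deliberately small alphabet for the "any possible character" positions so the enumeration stays readable (over all byte values, j upfront characters would give 256^j variants per substring), and it uses CRC-32 from Python's standard `zlib` module as a stand-in fingerprint. All function names are hypothetical.

```python
import zlib
from itertools import product
from typing import Iterable, Set

def augmented_portions(substring: bytes, slide: int, alphabet: bytes) -> Set[bytes]:
    """Return the substring plus every variant with 1..slide-1 upfront characters."""
    width = len(substring)
    portions = {substring}
    for extra in range(1, slide):
        head = substring[: width - extra]  # appropriately shortened substring
        for filler in product(alphabet, repeat=extra):
            portions.add(bytes(filler) + head)
    return portions

def build_table(substrings: Iterable[bytes], slide: int, alphabet: bytes) -> Set[int]:
    """Fingerprint every substring and every augmented portion."""
    return {
        zlib.crc32(p)
        for s in substrings
        for p in augmented_portions(s, slide, alphabet)
    }

def is_suspect(payload: bytes, table: Set[int], width: int, slide: int) -> bool:
    """Slide the window by `slide` characters and look up each window's fingerprint."""
    for start in range(0, len(payload) - width + 1, slide):
        if zlib.crc32(payload[start:start + width]) in table:
            return True
    return False

ALPHABET = b"abcdefghijklmnopqrstuvwxyz "  # assumed reduced alphabet
table = build_table([b"inte"], slide=2, alphabet=ALPHABET)
```

With W=4 and S=2, the input `b"an intel"` is read as "an i", " int", and "ntel"; the second window matches the augmented portion formed by one upfront character followed by "int".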
During substring detection, computed fingerprints of portions of the input string 450 corresponding to the sliding window may be compared to each of the predefined fingerprints of substrings and predefined fingerprints of augmented substrings. As shown in
In an alternative embodiment, shown in
During substring detection, computed fingerprints of portions of the input string 440 corresponding to the sliding window may be compared to the tables 646a including predefined fingerprints of the substrings. Fingerprints of one or more subsets of the sliding window may be computed for comparison to the tables 646b including fingerprints of shortened substring portions. The subset(s) of the sliding window includes the corresponding portion of the string without the first one to S−1 characters. In
Although the illustrated embodiments show a limited number of patterns of interest and predefined substrings of those patterns, substring detection may be used with other numbers of patterns. In one example, a sliding window of W=8 and S=2 may be used to detect about 500 patterns with a fingerprint table size of about 3.32 KB. This example of substring detection may be capable of eliminating over 80% of the input string that would have otherwise been processed using a multi-pattern matching algorithm.
Sliding windows that slide by more than one character at a time may also be used in substring detection without computing fingerprints. The predefined substrings and portions of the predefined substrings may be stored in the lookup tables without computing fingerprints. The portions of the input strings corresponding to the sliding windows may then be compared directly to the predefined substrings and portions of the predefined substrings.
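A fingerprint-free variant can be sketched with direct comparisons. The sketch assumes that the stored "portions of the predefined substrings" are the substrings with their last one to S−1 characters removed, and that each window is also compared after dropping its first one to S−1 characters; that reading, and the names used, are assumptions.

```python
from typing import Dict, Iterable, Set

def build_tables(substrings: Iterable[bytes], slide: int) -> Dict[int, Set[bytes]]:
    """tables[j] holds each substring with its last j characters removed (j = 0..slide-1)."""
    subs = list(substrings)
    return {j: {s[: len(s) - j] for s in subs} for j in range(slide)}

def is_suspect(payload: bytes, tables: Dict[int, Set[bytes]], width: int, slide: int) -> bool:
    """Slide by `slide` characters, comparing raw window contents instead of fingerprints."""
    for start in range(0, len(payload) - width + 1, slide):
        window = payload[start:start + width]
        for skipped, table in tables.items():
            # Drop the first `skipped` characters and compare with the shortened portions.
            if window[skipped:] in table:
                return True
    return False

tables = build_tables([b"inte", b"rate"], slide=2)
```

Direct comparison avoids fingerprint collisions at the cost of storing the full substrings and portions, and the shortened portions themselves are shorter and therefore match more often by chance.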
Referring to
The method of composing predefined fingerprints may also include computing 706 fingerprints for each substring and if applicable, the augmented substring portions and/or the shortened substring portions. As mentioned above, fingerprinting techniques known to those skilled in the art, such as fingerprinting by random polynomials, may be implemented for computing the fingerprints. The method of composing predefined fingerprints may further include storing 708 the predefined fingerprints, for example, in a lookup table. The predefined fingerprints (e.g., lookup tables) may be physically stored on a storage device such as storage device 142 and/or in a memory (e.g., volatile memory, nonvolatile memory, etc.).
Referring to
The method of detecting substrings may also include comparing 804 the computed fingerprints with one or more predefined fingerprints. In some embodiments, the predefined fingerprint(s) may include one or more lookup tables of predefined fingerprints. Comparing the fingerprints, a match 806 may be detected that indicates the substring associated with the predefined fingerprint is included in the portion of the input string associated with the computed fingerprint. If a match is detected, the method of detecting substrings may include initiating 808 pattern recognition processing on a data string (e.g., the contents of a data packet) that includes the substring. The pattern recognition processing may include processing using a multi-pattern matching algorithm as described above. After initiating pattern recognition processing, the method may be repeated for another input string or may continue to process subsequent portions of the same input string.
If a match is not detected, the method may include determining 810 if there is another subsequent portion of the input string to be processed. If the method uses a sliding window, for example, the method may determine if the sliding window is able to move (e.g., by one or more characters) to a subsequent location within the input string. If there is a subsequent portion to be processed, the method may then be repeated by producing 802 a fingerprint of the subsequent portion. As illustrated in
One or more of the operations associated with flowchart 700 and/or flowchart 800 may be performed by one or more programmable processors (e.g., a microprocessor, an ASIC, etc.) such as security processor 130 executing a computer program. The execution of one or more computer programs may include operating on input data (e.g., data provided from a memory and/or storage device, etc.) and generating output (e.g., sending data to a computer system, etc.). The operations may also be performed by a processor implemented as special purpose logic circuitry (e.g., an FPGA (field programmable gate array), an ASIC (application-specific integrated circuit), etc.).
The operations may also be executed by digital electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. The operations described in flowchart 700 and/or flowchart 800 may be implemented as a computer program product, e.g., a computer program tangibly embodied in an information carrier, e.g., in a machine-readable storage device (e.g., RAM, ROM, hard-drive, CD-ROM, etc.) or in a propagated signal. The computer program product may be executed by, or control the operation of, a data processing apparatus, e.g., a programmable processor, a computer, or multiple computers. A computer program may be written in one or more forms of programming languages, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may be deployed to be executed on one computing device (e.g., controller, computer system, etc.) or on multiple computing devices (e.g., multiple controllers) at one site or distributed across multiple sites and interconnected by a communication network.
While the substring detection is described above in the context of network protection, other types of data such as DNA sequences may be searched using substring detection. A substring detection method may also be used in any computing device including, but not limited to, desktop computers, laptop computers, handheld devices, and mobile phones. A substring detection method may also be used in other devices, such as a bio-sequence analyzer, that carry out complex pattern matching.
Referring to
The line card(s) 920 may be coupled to switch card 910 via a serial interconnect. The line card(s) 920 may include a local ASI switch component 922 that is linked to local framer/media access control (MAC)/physical layer (PHY) component(s) 924, NPU(s) 926, and/or CPU(s) 928. The framer/MAC/PHY component(s) 924 may be used to connect the line card(s) 920 to other locations via an I/O data link and may also be coupled directly to ASI line card switch component 922, for example, via an ASI link. The line card(s) 920 may also include memory and/or storage components (not shown) coupled to the CPU 928. The line card(s) 920 may further include a security processor 930 capable of supporting the NPU 926 by performing substring detection. The security processor 930 may perform substring detection before further pattern recognition processing performed by security processor 930 or NPU 926. Alternatively or additionally, the switch card(s) 910 may include a security processor 930a to perform the substring detection and/or further pattern recognition.
The control card(s) 940 may include a CPU 942 coupled between a memory 944 and a storage 946. The switch card(s) 910, line card(s) 920 and control card(s) 940 may be implemented in modular systems that employ serial-based interconnect fabrics, such as PCI Express™ components. One example of such modular communication systems includes Advanced Telecommunications Computing Architecture (AdvancedTCA) systems.
Accordingly, an apparatus, consistent with one embodiment, may include an integrated circuit configured to produce a fingerprint of at least one portion of an input data string, to compare the fingerprint of the portion of the data string to at least one predefined fingerprint for a predefined substring, and to initiate pattern recognition processing on the data string if the fingerprints match. The predefined substring includes a portion of a predefined pattern of interest.
A method, consistent with one embodiment, may include receiving at least one portion of an input string of data; producing a fingerprint of the portion of the input string of data; comparing the fingerprint of the portion of the string of data to at least one predefined fingerprint of a predefined substring, wherein the predefined substring includes a portion of at least one predefined string; and if the fingerprints match, initiating pattern recognition on the input string of data.
A system, consistent with a further embodiment, may include a switch fabric configured to route data packets and a plurality of line cards coupled to the switch fabric and configured to receive data packets. At least one of the line cards may include a security processor configured to produce a fingerprint of at least one portion of an input string from at least one of the data packets, to compare the fingerprint of the portion of the data string to at least one predefined fingerprint for a predefined substring, and to initiate pattern recognition processing on the data string if the fingerprints match. The predefined substring includes a portion of a predefined pattern of interest.
Another method consistent with an embodiment of the present invention may include sliding a window of a fixed length across an input string of data to obtain successive portions of the input string of data, wherein the sliding window slides across the input string by more than one character at a time; comparing the portions of the string of data to a plurality of predefined substrings of predefined patterns of interest and to portions of the predefined substrings; and if any one of the portions matches, initiating pattern recognition on the input string of data.
The terms and expressions which have been employed herein are used as terms of description and not of limitation, and there is no intention, in the use of such terms and expressions, of excluding any equivalents of the features shown and described (or portions thereof), and it is recognized that various modifications are possible within the scope of the claims. Other modifications, variations, and alternatives are also possible. Accordingly, the claims are intended to cover all such equivalents.