The invention relates to data transmission in packet-based transmission systems and, more particularly, to providing in-line, adaptive data compression using a deep packet inspection (DPI) process.
The communications bandwidth in conventional electronic component systems and networks is usually limited by the processing capabilities of the electronic systems, as well as the overall network characteristics. Some traditional attempts at addressing bandwidth limitations involve compression of the information included in a communication packet. Network equipment providers are continually pressed to increase the efficiency of their equipment to overcome these bandwidth limitations and provide improved compression techniques. The cost and hardware requirements to improve efficiency are significant. The typical solution requires a full “store and compress” approach, in which the stream is held in large temporary storage until compression is completed, introducing unwanted delay into the system.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
The present invention relates to a method of creating data compression in a packet stream to be transmitted, the method comprising analyzing an initial sample of the packet stream to identify data patterns, building a dictionary of identified data patterns and associating a unique token ID with each identified data pattern, creating a ruleset based on the dictionary, providing the ruleset to a deep packet inspection engine and directing the remainder of the packet stream through the deep packet inspection engine to scan and recognize data patterns from the ruleset, replacing each recognized data pattern with its associated token ID and identifying a start offset within the packet stream where the recognized data pattern was removed.
Additional embodiments of the invention are described in the remainder of the application, including the claims.
Embodiments of the present invention will become apparent from the following detailed description, the appended claims, and the accompanying drawings in which like reference numerals identify similar or identical elements.
Reference herein to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments necessarily mutually exclusive of other embodiments. The same applies to the term “implementation”.
It should be understood that the steps of the exemplary methods set forth herein are not necessarily required to be performed in the order described, and the order of the steps of such methods should be understood to be merely exemplary. Likewise, additional steps might be included in such methods, and certain steps might be omitted or combined, in methods consistent with various embodiments of the present invention.
Also for purposes of this description, the terms “couple”, “coupling”, “coupled”, “connect”, “connecting”, or “connected” refer to any manner known in the art or later developed in which energy is allowed to be transferred between two or more elements, and the interposition of one or more additional elements is contemplated, although not required. Conversely, the terms “directly coupled”, “directly connected”, etc., imply the absence of such additional elements. Signals and corresponding nodes or ports might be referred to by the same name and are interchangeable for purposes here. The term “or” should be interpreted as inclusive unless stated otherwise. Further, elements in a figure having subscripted reference numbers (e.g., 100-1, 100-2, . . . , 100-K) might be collectively referred to herein using the reference number 100.
Moreover, the terms “system,” “component,” “module,” “interface,” “model,” or the like are generally intended to refer to a computer-related entity, either hardware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a controller and the controller can be a component. One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers.
Deep packet inspection (DPI) is the process of identifying signatures (i.e., patterns or regular expressions) in the payload portion of a data packet. DPI is generally used as a security check to search for malicious types of internet traffic that can be buried in the data portion of the packet.
In accordance with one or more of the embodiments of the present invention, DPI techniques related to searching the payload portion of data packets are utilized to perform data compression, which is particularly valuable in many bandwidth-limited communication systems. A separate processor is initially used within a transmission source to scan, in real time, a data packet stream and recognize repetitive patterns occurring in the data. The processor builds a dictionary (ruleset), storing the set of repetitive patterns and defining a unique token identification (ID) to be associated with each pattern. Thereafter, the DPI engine uses this ruleset to recognize the repetitive data patterns in the packets being scanned, and replaces each relatively long data pattern with its short token ID, creating a stream of compressed data packets.
As long as the receiver of the compressed data stream has a dictionary of the same “data pattern-token ID” pairs as the original ruleset, the receiver will be able to re-create the initial data stream. The process as outlined below is able to work with long pattern lengths, replacing these long patterns with a relatively short token ID (e.g., 4 bytes, with additional bytes used to identify insertion location (i.e., “start” position) in the data stream).
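By way of illustration only, the following sketch shows one way such a token reference might be serialized. The 4-byte token ID reflects the example given above; the 2-byte start offset, the big-endian byte order, and the helper names are assumptions and not part of the described arrangement.

    import struct

    # Hypothetical on-the-wire layout for one token reference: a 4-byte token ID
    # followed by a 2-byte start offset, both big-endian. The 4-byte token ID
    # matches the example above; the 2-byte offset width is an assumption.
    TOKEN_RECORD = struct.Struct(">IH")

    def encode_token_reference(token_id, start_offset):
        # Pack a (token ID, start offset) pair into a 6-byte record.
        return TOKEN_RECORD.pack(token_id, start_offset)

    def decode_token_reference(record):
        # Unpack a 6-byte record back into its (token ID, start offset) fields.
        return TOKEN_RECORD.unpack(record)

    # Example: a pattern assigned token ID 0x2A was removed at byte offset 128.
    record = encode_token_reference(0x2A, 128)
    assert decode_token_reference(record) == (0x2A, 128)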
In accordance with this illustrated embodiment of the present invention, MPP 14 is used to determine if the particular type of data stream being prepared for transmission is an appropriate candidate for data compression (i.e., is it a data stream that is likely to include repetitive patterns or sequences, such as email). It is also possible to employ a user-configurable option to define specific data flows that need compression. Two different output paths from MPP 14 are shown, where a first output path O1 is shown as directly coupled to an output interface adapter 16. The data traffic that does not require data compression is directed onto this signal path and is thereafter prepared in output interface adapter 16 for transmission into the communication network (not shown).
Alternatively, if MPP 14 determines that the current packet stream is suitable for compression, the packets will be directed along a second output path O2 as shown, where this traffic is then applied as an input to a DPI engine 18. As shown in
CPU 20 then creates a ruleset R for use by DPI engine 18, the ruleset including both the identified data patterns and a set of unique token IDs which CPU 20 assigns to the data patterns in a one-to-one relationship. An exemplary dictionary showing this ruleset R is shown in
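Because the referenced figure is not reproduced here, the following purely hypothetical example suggests the form such a ruleset R might take; the data patterns and token ID values are invented for illustration only.

    # Purely illustrative ruleset R: each recognized data pattern is paired with a
    # unique token ID in a one-to-one relationship. The patterns and ID values
    # below are invented for the sake of example.
    ruleset = {
        b"Content-Type: text/plain; charset=UTF-8\r\n": 0x01,
        b"-----Original Message-----\r\n": 0x02,
        b"Please see my comments inline below.\r\n": 0x03,
    }

    # The receiving end holds the inverse mapping, token ID -> data pattern.
    inverse_ruleset = {token_id: pattern for pattern, token_id in ruleset.items()}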
With this ruleset in place, DPI engine 18 scans the incoming packets for data patterns as defined by ruleset R. When a pattern is found, DPI engine 18 reports the pattern's token ID and its location in the packet, with MPP 14 (or another module, such as a packet assembly engine) then removing this section of data and replacing it with the appropriate unique token ID and start location of the long data pattern. It is to be understood that DPI engine 18 will continue to perform, in parallel, its conventional function of scanning the payload portion of data packets for malicious program data while performing this data compression operation.
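A DPI engine performs this matching at line rate in hardware; purely as an illustrative software stand-in, the sketch below uses a simple byte search to show the scan-and-replace step, recording each recognized pattern's token ID together with the start offset at which the pattern was removed. The function name and matching strategy are assumptions.

    def compress_payload(payload, ruleset):
        # Scan the payload for patterns defined in the ruleset, remove each one,
        # and record its (token ID, start offset) pair. bytes.find() stands in
        # for the DPI engine's pattern-matching hardware.
        matches = []            # list of (token_id, start_offset) pairs
        out = bytearray()
        pos = 0
        while pos < len(payload):
            hit = None
            for pattern, token_id in ruleset.items():
                idx = payload.find(pattern, pos)
                if idx != -1 and (hit is None or idx < hit[0]):
                    hit = (idx, pattern, token_id)
            if hit is None:
                out += payload[pos:]             # no more patterns; copy the rest
                break
            idx, pattern, token_id = hit
            out += payload[pos:idx]              # copy the unmatched bytes
            matches.append((token_id, idx))      # remember where the pattern sat
            pos = idx + len(pattern)             # skip over the removed pattern
        return bytes(out), matches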
Once DPI engine 18 reaches the end of a particular packet, the set of token IDs and start locations are grouped together and added to the compressed packet (either at the beginning or end of the packet header) within a packet assembler 22. Once properly ordered, the final compressed packet is sent to output interface 16 for transmission across the communication network to the designated receiving location.
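The exact placement and field widths of the token match field are not fixed by this description; the sketch below assumes a 2-byte count followed by 6-byte entries prepended to the compressed payload, for illustration only.

    import struct

    def assemble_compressed_packet(compressed_payload, matches):
        # Group the (token ID, start offset) pairs into a token match field and
        # prepend it to the compressed payload. The 2-byte match count and the
        # 6-byte per-entry layout are assumptions made for illustration.
        token_match_field = struct.pack(">H", len(matches))
        for token_id, start_offset in matches:
            token_match_field += struct.pack(">IH", token_id, start_offset)
        return token_match_field + compressed_payload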
The compressed packet output from DPI engine 18 is shown on the right-hand side of
Referring to
As will be discussed below with an alternative embodiment of the present invention, the in-line compression arrangement may also perform a comparison of the length of the original packet to the compressed packet to define the “compression ratio” that is achieved by using the DPI pattern replacement process. The compression ratio is considered to be a measure of the efficiency of the compression process. An embodiment of the present invention allows for periodic monitoring of the compression ratio, providing the capability to recalculate the ruleset in an adaptive fashion.
Over time, it is possible that the initial data patterns identified by CPU 20 have become “outdated”, while newer patterns are not being recognized and, therefore, the compression process becomes inefficient. Thus, in an alternative embodiment of the present invention, CPU 20 receives feedback information from DPI engine 18 in terms of the current length of the compressed data traffic. CPU 20 uses this information to monitor the compression ratio on a periodic basis (the compression ratio being defined as the ratio of the length of the compressed data stream to the length of the “original” data stream) and sends an “update” signal to MPP 14 when the compression ratio becomes too high (i.e., approaches the value of “1”). In response to this update request, MPP 14 sends a current portion of the data stream to CPU 20, which performs the same pattern recognition analysis to generate a new, updated ruleset (sent to both DPI 18 and the receiver). During the period of time that CPU 20 is performing this update, MPP 14 is instructed to send all of the traffic through output O1, so that the ruleset for DPI engine 18 can be updated without interruption.
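As an illustrative sketch only, the monitoring logic might resemble the following; the threshold value and the callback standing in for the “update” signal to MPP 14 are assumptions.

    RATIO_THRESHOLD = 0.9  # assumed value; a ratio near 1 indicates little compression gain

    def compression_ratio(compressed_length, original_length):
        # Ratio of the compressed stream length to the original stream length.
        return compressed_length / original_length

    def monitor_compression(compressed_length, original_length, request_new_sample):
        # Invoked periodically with length feedback from the DPI engine. When the
        # ratio rises above the threshold, request_new_sample() stands in for the
        # "update" signal sent to MPP 14 asking for a fresh portion of the stream.
        ratio = compression_ratio(compressed_length, original_length)
        if ratio > RATIO_THRESHOLD:
            request_new_sample()
        return ratio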
In one embodiment of the present invention, a modular packet processor within a communication processor is used for identifying the packet type. This “type” information can then be used to make a determination regarding whether or not data compression would be appropriate. For example, email is known to be replete with patterns, particularly in an email “chain” where portions are copied multiple times within the body of the email. Thus, when the MPP recognizes a current data flow as being an email transmission, this data flow would be directed into the data compression process as described in detail below. As mentioned above, a user-configurable flag can be used to identify a data flow to be sent through a compression process.
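Purely for illustration, a simple software stand-in for this classification decision might look like the following; the port numbers used to recognize email-like traffic and the user-flagged flow set are assumptions.

    # Assumed destination ports for pattern-rich traffic such as email (SMTP,
    # IMAP, submission, IMAPS); the port list and the user-configured flow set
    # are illustrative only.
    PATTERN_RICH_PORTS = {25, 143, 587, 993}
    USER_FLAGGED_FLOWS = set()  # (source, destination, destination port) tuples

    def should_compress(source, destination, destination_port):
        # Decide whether a flow is a candidate for the compression path (O2)
        # or should bypass the DPI compression process entirely (O1).
        if (source, destination, destination_port) in USER_FLAGGED_FLOWS:
            return True
        return destination_port in PATTERN_RICH_PORTS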
Referring now to the particulars of the flow chart of
Returning to step 120, the compression process continues by sending the copy of the initial flow to a processor (step 130) which employs a predetermined algorithm to detect patterns in the data bits forming the stream (step 140). Coding algorithms such as Ziv-Lempel or Huffman may be used for this purpose, but should be considered as exemplary choices only. As the processor recognizes patterns, it builds a ruleset (step 150), creating linked pairs of the recognized pattern and a unique token ID.
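As the cited coding algorithms are exemplary only, the sketch below does not implement Ziv-Lempel or Huffman coding; it is a minimal fixed-length substring frequency scan, offered only to illustrate how linked pairs of recognized patterns and unique token IDs might be produced. All parameter values are assumptions.

    from collections import Counter

    def build_ruleset(sample, pattern_length=32, min_occurrences=3, max_patterns=256):
        # Build an initial ruleset from the copied sample of the flow (steps
        # 130-150): count fixed-length substrings, keep those seen often enough,
        # and assign each a unique token ID.
        counts = Counter(
            sample[i:i + pattern_length]
            for i in range(len(sample) - pattern_length + 1)
        )
        ruleset = {}
        next_token_id = 1
        for pattern, occurrences in counts.most_common():
            if occurrences < min_occurrences or len(ruleset) >= max_patterns:
                break
            ruleset[pattern] = next_token_id
            next_token_id += 1
        return ruleset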
The process continues searching until the entire copied portion of the data stream has been evaluated (step 160). At this point, the initial ruleset is defined as “complete”, containing a set of recognized data patterns, with a unique token ID being assigned to each data pattern. As shown in the flowchart of
In an alternative embodiment of the present invention, the processor also monitors the compression ratio on a periodic basis to evaluate the efficiency of the compression process on an on-going basis.
With an established threshold value, the process of
On the other hand, if the result of the comparison of step 230 is that the current compression ratio has gone above the threshold value, the process moves to request the modular packet processor to send a current portion of the incoming data stream to the central processing unit (step 240). At this point, the central processing unit re-initiates the pattern recognition process as described above in association with the flowchart of
Once the new ruleset is completed, the process as shown in
The process involved at the receiving end of the data flow to reassemble the data packet from the compressed version is rather straightforward. The receiver extracts the token match field from the header which, as mentioned above, includes the total number of patterns that need to be re-inserted. The assembler then replaces each token ID with its associated data pattern, as extracted from the current version of the ruleset. The start offset value indicates to the receiver the proper location to insert the associated data pattern.
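Assuming the same hypothetical token match field layout used in the earlier sketches, the receiver-side reassembly might be sketched as follows; re-inserting the patterns in ascending start-offset order reproduces the original payload.

    import struct

    def reassemble_packet(compressed_packet, inverse_ruleset):
        # Rebuild the original payload. The token match field layout (a 2-byte
        # count followed by 4-byte token ID + 2-byte start offset per entry)
        # mirrors the assembler sketch above and is an assumption.
        (count,) = struct.unpack_from(">H", compressed_packet, 0)
        offset = 2
        entries = []
        for _ in range(count):
            token_id, start = struct.unpack_from(">IH", compressed_packet, offset)
            entries.append((start, token_id))
            offset += 6
        payload = bytearray(compressed_packet[offset:])
        for start, token_id in sorted(entries):
            payload[start:start] = inverse_ruleset[token_id]  # re-insert the pattern
        return bytes(payload)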
It is also possible in an alternative embodiment of the present invention to provide inter-packet compression. This will occur when the DPI engine recognizes a pattern that begins in one packet and ends in the following packet. This possibility is illustrated in
Various arrangements of the present invention may be embodied in the form of methods and apparatuses for practicing those methods. Indeed, components and elements as used in one or more embodiments of the present invention may be embodied in the form of program code, for example, whether stored in a storage medium, loaded into and/or executed by a machine, or transmitted over some transmission medium or carrier, such as over electrical wiring or cabling, through fiber optics, or via electromagnetic radiation, wherein, when the program code is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention. When implemented on a general-purpose processor, the program code segments combine with the processor to provide a unique device that operates analogously to specific logic circuits.
The present invention can also be embodied in the form of a bitstream or other sequence of signal values electrically or optically transmitted through a medium, stored as magnetic-field variations in a magnetic recording medium, etc., generated using a method and/or an apparatus of the present invention.
It will be further understood that various changes in the details, materials, and arrangements of the parts which have been described and illustrated in order to explain the nature of this invention may be made by those skilled in the art without departing from the scope of the invention as expressed in the following claims.