Most compression techniques are concerned with processing a single data stream. Delta compression on the other hand is the compression of one data stream, referred to as the target data stream, in terms of another data stream, called the reference data stream, by computing a delta. The delta can be viewed as an encoding of the difference between the target and the reference data stream. The target data stream can be later recovered from the delta and the reference data stream. Delta compression can be based on byte-to-byte comparisons. Delta compression is different from hash-based deduplication methods. Delta compression can provide for a finer comparison result than hash-based deduplication methods.
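As a rough illustration of how a target can be recovered from a reference plus a delta, the following Python sketch encodes a target as COPY and ADD instructions. It is not the method of any particular tool: the names `make_delta` and `apply_delta` and the instruction format are assumptions made for this example, and Python's standard `difflib.SequenceMatcher` stands in for a real pattern matcher.

```python
from difflib import SequenceMatcher


def make_delta(reference: bytes, target: bytes):
    """Encode target as COPY (from reference) and ADD (literal) instructions.

    Hypothetical instruction format, for illustration only:
      ("COPY", reference_offset, length)
      ("ADD", literal_bytes)
    """
    ops = []
    matcher = SequenceMatcher(None, reference, target, autojunk=False)
    for tag, r1, r2, t1, t2 in matcher.get_opcodes():
        if tag == "equal":
            # Bytes shared with the reference are stored as a reference.
            ops.append(("COPY", r1, r2 - r1))
        elif tag in ("replace", "insert"):
            # Bytes that only exist in the target are stored literally.
            ops.append(("ADD", target[t1:t2]))
        # "delete": reference-only bytes need no instruction.
    return ops


def apply_delta(reference: bytes, delta) -> bytes:
    """Rebuild the target from the reference and the delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "COPY":
            _, start, length = op
            out += reference[start:start + length]
        else:
            out += op[1]
    return bytes(out)


reference = b"the quick brown fox jumps over the lazy dog"
target = b"the quick red fox jumps over the very lazy dog"
delta = make_delta(reference, target)
restored = apply_delta(reference, delta)
```

Only the ADD instructions carry literal data, so the delta stays small when the target shares long runs with the reference.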
Delta compression is used in revision control systems. By storing deltas between versions instead of the full data, these systems are able to reduce storage requirements significantly. For example, the Xdelta File System (XDFS) developed by Joshua MacDonald is a file system implemented with delta compression. Another application of delta compression is software distribution, especially of software that is distributed over the Internet. By distributing deltas, which are essentially patches, one can significantly reduce network traffic. Delta compression can also be used to improve HTTP performance. By exploiting the similarity between different pages on a given website, or between different versions of a given web page, one can reduce the latency of web access. VCDIFF is defined in RFC 3284 to support this kind of usage.
In many cases, however, insertions or deletions leave the reference data no longer aligned with the target data. If the reference data and target data are misaligned too far, the incoming target data cannot find a matching pattern in the reference data window, and the compression ratio degrades dramatically. Several delta compressors are already available, including xdelta, vdelta (and its newer variant VCDIFF), and zdelta. None of them avoids this problem.
Delta compression logic performs pattern matching using a reference window and a target window. The reference and target data are aligned during delta compression so that the reference and target windows contain similar data. In this way, a better compression ratio can be achieved.
An intelligent alignment can be implemented by identifying one or more anchor pairs by examining the target and reference data streams. The anchor pairs can be determined by using Rabin-Karp or a similar rolling hash algorithm. Each byte in the target or reference data stream has a rolling hash result that corresponds to a hash of a multiple-byte window.
A reference anchor candidate is located when a feature pattern is found in the rolling hash results of the reference data stream. The rolling hash result is also referred to as the fingerprint value of the reference anchor candidate. If an anchor candidate from the target data has the same fingerprint value as a counterpart from the reference data, an anchor pair is identified.
With such anchor pairs, the invention can use a much smaller reference window than other tools. This can reduce computational complexity and improve performance. The use of a smaller reference window also makes hardware implementation feasible by saving memory resources on a chip.
Adding or deleting data from an old file version to a new file version can happen daily. The advantage of delta compression is that only the difference need be stored.
The segment y can be identified and encoded in delta compression. Delta compression saves storage space by referring to data in the reference stream.
All delta compressors compare the incoming target data stream with a reference data stream. Some compressors also compare the incoming target data with the previous target data (target history).
A delta compression method can comprise determining anchors to align a reference window and target window for compression of a target data stream in terms of a reference data stream. The anchors can be determined by examining the target data stream and reference data stream. The target data stream can then be aligned with respect to the reference data stream. Pattern matching between the aligned target data stream and reference data stream can be used to delta compress the target data stream.
A delta decompression method can comprise using anchors to align a reference window and target window for decompressing of a target data stream in terms of a reference data stream using a delta. The anchors can be previously determined by examining the target data stream and reference data stream during compression of the target data stream. The target data stream can be decompressed using the aligned reference and target windows.
A delta compressor 600 can include a reference window 602, a target window 604, and an anchor determining block 606 to determine anchors by examining the target data stream and reference data stream. As discussed below, the anchor determining block 606 can use a rolling hash algorithm. An aligning block 608 can align the target data stream with respect to the reference data stream in the reference and target windows using the anchors. A pattern matching block 610 can pattern match between the aligned target data stream and reference data stream to delta compress the target data stream using encoder 612.
The compressed delta 616 can be stored on a computer readable medium 614. The delta 616 can be later used to decompress the target data stream along with the reference data stream. The delta 616 can include anchor pairs 618 from the anchor determining block 606 indicating where to align the reference and target data streams. Delta information 620 from the encoder 612 can indicate how to decompress the aligned target data stream with respect to the reference data stream. Thus, the delta 616 can be decompressed to produce the target data stream using the computer readable medium.
A delta decompressor 700 can use a computer readable medium 614 to provide the delta information 620 and anchor pairs 618. The anchor pairs 618 are provided to an alignment block 702 that aligns the reference window 704 and target window 706. The decompressor block 708 receives the delta information 620 and the aligned reference data and produces the target data stream that is sent to the target window 706.
The anchors can be selected by using a hash method, such as a rolling hash method. The hash method can be implemented in hardware.
The reference data stream and target data stream can be streamed into the reference window and target window respectively until an anchor value is reached. At that time, one of the reference or target data streams is stalled until the reference and target data streams are aligned.
Anchors mark the parts of the content that are the same in the reference and target data streams. Anchors can be represented as the pair (offset of reference anchor, offset of target anchor).
Anchors can be determined by rolling hash algorithms. A rolling hash is a hash function where the input is hashed in a sliding window that moves through the input. A few hash functions allow a rolling hash to be computed very quickly—the new hash value is rapidly calculated given only the old hash value, the old value removed from the hash window, and the new value added to the hash window—similar to the way a moving average function can be computed much more quickly than other low-pass filters. Hash functions can also be efficiently implemented in hardware.
Take the Rabin-Karp algorithm as an example. The Rabin-Karp algorithm is normally used with a very simple rolling hash function that only uses multiplications and additions:
$$H_k = \left(c_1\alpha^{k-1} + c_2\alpha^{k-2} + c_3\alpha^{k-3} + \cdots + c_k\alpha^{0}\right) \bmod M,$$
where $\alpha$ is a constant and $c_1, \ldots, c_k$ are the input characters.
In order to avoid manipulating huge H values, all math is done modulo M.
Removing and adding characters simply involves adding or subtracting the first or last term. Shifting all characters by one position to the left requires multiplying the entire sum $H_k$ by $\alpha$. The calculation of $H_{k+1}$ can be simplified as:
$$H_{k+1} = \left(\left(H_k - c_1\alpha^{k-1}\right)\alpha + c_{k+1}\right) \bmod M$$
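The recurrence can be checked with a short Python sketch that slides a k-byte window over a byte string and compares each rolling result against the hash computed directly from the definition. The function names and the values chosen for α (257) and M (2^31 − 1) are assumptions for illustration; the text does not specify them.

```python
ALPHA = 257              # assumed value of the constant alpha
MOD = (1 << 31) - 1      # assumed modulus M


def direct_hash(window: bytes) -> int:
    """H_k computed term by term from the definition."""
    k = len(window)
    return sum(c * pow(ALPHA, k - 1 - j, MOD) for j, c in enumerate(window)) % MOD


def rolling_hashes(data: bytes, k: int):
    """Return the hash of every k-byte window using the rolling update."""
    if k <= 0 or len(data) < k:
        return []
    top = pow(ALPHA, k - 1, MOD)          # alpha^(k-1) mod M, reused each step
    h = 0
    for c in data[:k]:                    # first window, Horner's rule
        h = (h * ALPHA + c) % MOD
    hashes = [h]
    for i in range(k, len(data)):
        oldest = data[i - k]              # byte leaving the window
        newest = data[i]                  # byte entering the window
        h = ((h - oldest * top) * ALPHA + newest) % MOD
        hashes.append(h)
    return hashes


data = b"delta compression with anchors"
window_size = 5
hashes = rolling_hashes(data, window_size)
```

Each update costs a constant number of operations regardless of the window size, which is what makes hashing every byte position practical.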
Sweeping through the whole reference data stream, each position of the rolling hash sliding window generates a hash result. If the hash result matches the predefined feature pattern (e.g., a selected number of least significant bits all equal to “0”), the hash result and reference offset are recorded as a reference anchor candidate. The hash result is also referred to as the fingerprint of the anchor candidate. An anchor candidate can be represented as the pair (anchor offset, anchor fingerprint).
The target anchor candidates can be determined in the same way. If the fingerprint of a target anchor candidate is the same as that of a reference anchor candidate, an anchor pair is identified.
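The candidate selection and pairing described above might be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the window size (16), the number of zero bits (4), the hash constants, and the choice to record the offset of the window's first byte are all hypothetical. Where several reference candidates share a fingerprint, the sketch simply keeps the first.

```python
import random

ALPHA = 257              # assumed hash base
MOD = (1 << 31) - 1      # assumed modulus


def anchor_candidates(data: bytes, window: int = 16, zero_bits: int = 4):
    """Return (offset, fingerprint) for each window whose hash has
    zero_bits least significant bits equal to 0 (the feature pattern)."""
    if len(data) < window:
        return []
    mask = (1 << zero_bits) - 1
    top = pow(ALPHA, window - 1, MOD)
    h = 0
    for c in data[:window]:
        h = (h * ALPHA + c) % MOD
    found = []
    start = 0
    while True:
        if h & mask == 0:
            found.append((start, h))      # offset = first byte of the window
        if start + window >= len(data):
            break
        h = ((h - data[start] * top) * ALPHA + data[start + window]) % MOD
        start += 1
    return found


def anchor_pairs(reference: bytes, target: bytes, **kw):
    """Pair target candidates with reference candidates by fingerprint."""
    ref_by_fp = {}
    for off, fp in anchor_candidates(reference, **kw):
        ref_by_fp.setdefault(fp, off)     # keep the first reference offset
    return [(ref_by_fp[fp], off)
            for off, fp in anchor_candidates(target, **kw)
            if fp in ref_by_fp]


# Demo: the target is the reference with 10 bytes inserted at the front.
rng = random.Random(0)
reference = bytes(rng.getrandbits(8) for _ in range(4096))
target = b"INSERTED--" + reference
pairs = anchor_pairs(reference, target)
```

Because the fingerprint depends only on the bytes inside the window, content that merely shifts position still produces the same fingerprints, so the pairs record how far the target has drifted from the reference.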
The hash result can be updated at the byte level such that a hash value is determined for each byte of the target and reference data stream. For example, for the following data stream: Byte0, Byte1, . . . , ByteN-1, ByteN, ByteN+1 . . . , if we define the window size to be N, the first rolling hash result can be calculated on [Byte0, Byte1, . . . , ByteN-1], the second rolling hash result can be calculated on [Byte1, Byte2, . . . ByteN] and the third result can be calculated on [Byte2, Byte3, . . . , ByteN+1]. In this way, each byte can correspond to a rolling hash result. Since each update only drops the oldest byte and adds the newest one, the complexity of the computation is linear in the length of the stream.
Anchor density can be adjusted. For example, an anchor pair can be identified every 2 KB on average by configuring the feature pattern to require that the 11 least significant bits be “0”. For a density of one anchor pair every 1 KB, the feature pattern can require that the 10 least significant bits be “0”. Higher density results in a better delta compression ratio, but requires more processing in the anchor determination.
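The relation between the number of required zero bits and the average anchor spacing can be worked out as follows, under the assumption (not stated in the text) that the rolling hash values are uniformly distributed, so that each of the n tested bits is 0 with probability 1/2 independently of the others.

```latex
\Pr\left[\text{the } n \text{ least significant bits of } H \text{ are } 0\right] = 2^{-n}
\quad\Longrightarrow\quad
\text{expected spacing between candidates} = 2^{n}\ \text{bytes}

n = 11:\quad 2^{11} = 2048\ \text{bytes} = 2\ \text{KB}
\qquad
n = 10:\quad 2^{10} = 1024\ \text{bytes} = 1\ \text{KB}
```

Each additional required zero bit therefore halves the expected number of anchor candidates.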
The workflow of one embodiment is shown in
Target data and reference data are streamed in for pattern match in step 503. During the pattern match processing in step 504, if an anchor pair is detected, the compressor has to align the reference window and target window. The anchor pair can be represented as the pair (offset of reference anchor, offset of target anchor). The compressor can maintain a reference offset counter and a target offset counter. The reference offset counter can be incremented when a new character is moved into the reference window. The target offset counter can be incremented when a new character is moved into the target window. An anchor is detected when either offset counter hits an anchor in step 506.
In the alignment process, if the reference data stream is ahead of the target data stream, i.e., the compressor meets the reference anchor before the corresponding target anchor in step 507, the compressor can stall the reference window while target data continues to be streamed in and pattern matched in step 508, until the target offset of the same anchor is met in step 509.
If the target data stream is ahead of the reference data stream, i.e., the compressor meets the target offset of an anchor first in step 510, the compressor can stall the target window and stream reference data into the reference window in step 511 until the reference offset of the same anchor is met in step 512. No pattern matching is performed while the target window is stalled.
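The stall-and-catch-up behavior of the two offset counters can be simulated with the Python sketch below. It only tracks the counters and records which stream advances at each step; it does not perform pattern matching or encoding. The function name, the step labels, and the assumption that both streams otherwise advance one byte at a time in lockstep are hypothetical, and anchors that do not increase in both streams are simply skipped.

```python
def alignment_steps(anchor_pairs, ref_len: int, tgt_len: int):
    """Simulate the reference/target offset counters.

    Returns one label per step:
      "both"           - both windows advance, pattern matching runs
      "target_only"    - reference stalled (or exhausted), target advances
      "reference_only" - target stalled, reference advances, no matching
    """
    # Keep only anchors that increase in both streams and lie inside them.
    pairs, last_r, last_t = [], -1, -1
    for r, t in sorted(anchor_pairs):
        if last_r < r < ref_len and last_t < t < tgt_len:
            pairs.append((r, t))
            last_r, last_t = r, t

    steps = []
    ref_off = tgt_off = 0        # reference and target offset counters
    i = 0                        # index of the next anchor pair to reach
    while tgt_off < tgt_len:
        ref_anchor, tgt_anchor = pairs[i] if i < len(pairs) else (None, None)
        at_ref = ref_off == ref_anchor
        at_tgt = tgt_off == tgt_anchor
        if at_ref and at_tgt:
            i += 1               # windows are aligned at this anchor
        elif at_ref:
            tgt_off += 1         # reference reached its anchor first: stall it
            steps.append("target_only")
        elif at_tgt:
            ref_off += 1         # target reached its anchor first: stall it
            steps.append("reference_only")
        elif ref_off < ref_len:
            ref_off += 1
            tgt_off += 1
            steps.append("both")
        else:
            tgt_off += 1         # reference exhausted, keep consuming target
            steps.append("target_only")
    return steps


# Three bytes were inserted into the target before the anchor at reference offset 4.
inserted = alignment_steps([(4, 7)], ref_len=10, tgt_len=13)
# Three bytes were deleted from the target before the anchor at reference offset 7.
deleted = alignment_steps([(7, 4)], ref_len=13, tgt_len=10)
```

In the first case the reference window waits while the three inserted target bytes are consumed; in the second the target window waits while the three deleted reference bytes are skipped.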
The post pattern match result is encoded and output.
During decompression, the same anchor pairs are input to the decompressor before decompression begins. When the decompressor detects an anchor during processing, it is able to align the reference window and target window in order to recover the data.
Experiments show that the invention can use a much smaller reference window than other tools. This can reduce computational complexity and improve performance. A smaller reference window also makes hardware implementation feasible by saving memory resources on a chip.
The foregoing description of preferred embodiments of the present invention has been provided for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many embodiments were chosen and described in order to best explain the principles of the invention and its practical application, thereby enabling others skilled in the art to understand the invention for various embodiments and with various modifications that are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims and their equivalents.