1. Field of the Invention
The present invention relates generally to data filtering and archiving. More particularly, the present invention relates to a system and method for efficiently detecting and storing multiple files that are similar to, or approximate duplicates of, one another, based on their attributes. More specifically, the method detects the most likely similar data pairs out of an original group of input data. In an archiving system, these similar pairs can be exploited by using delta encoding (storing the differences between files) rather than compressing each file of the pair individually.
2. Discussion of Related Art Including Information Disclosed Under 37 CFR §§1.97, 1.98
Archiving software such as STUFFIT®, ZIP®, RAR®, and similar utilities, enable users to combine or package multiple files into a single archive for distribution. At the same time, these products enable users to compress and encrypt the files so that bandwidth costs and storage requirements are minimized when sending the resulting archive across a communication channel or when storing it in a storage medium.
Files added to an archive are frequently approximate duplicates of other files already archived, or are very similar based on their respective attributes. Current archiving software, such as the utilities mentioned above, compresses each data set as a whole, without detecting duplicate sets, and therefore without being able to apply differencing technology, rather than "compression," to approximately duplicate or most likely similar data sets (i.e., most likely duplicate files). It would be advantageous, therefore, to detect when a data set being added to an archive is nearly identical to one already archived, on the basis of having the same or similar actual data, and, instead of compressing and storing an additional copy of the file data, simply store a reference to the compressed data already present in the first archived copy of the file. Moreover, it is desirable that the detection and coding of the identical files be as time efficient as possible.
Using a brute force method to compare an input set of files and find those file pairs that would benefit most (i.e., yield the smallest size) from a differencing method rather than a standard compression method is far too costly in terms of processing speed, temporary storage, and memory requirements: mathematically, the brute force method would require nearly O(n^2) differences to actually be attempted, with the smallest result of the various combinations then selected.
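The cost of the brute force approach can be illustrated with a short sketch. The file contents and the "diff size" metric below are toy stand-ins (counting differing byte positions), not the invention's actual differencing method; the point is only that every unordered pair must be attempted, i.e., n*(n-1)/2 diffs for n files.

```python
from itertools import combinations

def brute_force_best_pairs(files, diff_size):
    """Naive pairing: attempt a difference for every file pair and keep,
    for each file, the reference giving the smallest result.  For n files
    this requires n*(n-1)/2 difference attempts, i.e., O(n^2)."""
    best = {}
    for a, b in combinations(files, 2):      # every unordered pair
        size = diff_size(a, b)               # each diff is itself costly
        if b not in best or size < best[b][1]:
            best[b] = (a, size)
    return best

# Toy byte strings and a trivial stand-in "diff size" metric (count of
# differing byte positions); neither is the invention's actual method.
files = [b"abcdef", b"abcxef", b"zzzzzz"]
cost = lambda a, b: sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
pairs = brute_force_best_pairs(files, cost)
```

Even in this toy, each "diff" touches every byte of both inputs, so the quadratic number of attempts dominates as the file count grows.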
Current products, such as backup software, use diffing technology to produce archives smaller than those produced by compressing each file individually; but where the diffing algorithm chooses the files it compares based on their locations in the file system (as backup software does), it has a much better hint as to which files are possible matches.
In contrast with prior art systems and products, the present invention narrows N randomly selected files to be compressed into an archive down to a small subset of possible matched pairs, thereby reducing the large number of potential file pairs to those most likely to benefit from a differencing technique. It takes this approach rather than relying solely on any of the well-known compression techniques, including Huffman coding, arithmetic coding, Lempel-Ziv variants, and others.
Accordingly, the present invention provides a system and method that efficiently detects approximately duplicate files; then, rather than compress the second and subsequent occurrences of the duplicate data, the inventive method simply stores the differences, together with a reference to the first compressed copy of the data. This process effectively compresses multiple copies of data by nearly 100% (only a small amount of reference information is stored), without repeated compression of the matching data.
Further, unlike the "block" or "solid" mode currently used by state of the art archiving products, the present inventive method is not in any way dependent on the size of the files, compression history, or window size.
It must also be emphasized that, when decompressing/extracting archived files, the present inventive method of storing references to the original data requires the extraction process to apply decompression (such as Lempel-Ziv, Huffman, etc.) only to the first occurrence of duplicate data; subsequent duplicates are processed during extraction by applying differences to the first set of data after it has been processed. As matching files are encountered, this method simply copies the already-decompressed first-occurrence data portions if there was an exact match, or applies the differencing instructions if the data was nearly, but not exactly, identical to the data or file fork in question.
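The extraction behavior described above can be sketched as follows. The archive layout, entry names, and (offset, replacement) patch format are all hypothetical stand-ins invented for illustration; decompression of the first occurrence is elided (the payload is used as-is), since the method is independent of the codec used.

```python
def apply_patch(base, edits):
    """Apply simple (offset, replacement-bytes) edits to already-extracted
    data; a stand-in for whatever differencing instructions are stored."""
    data = bytearray(base)
    for offset, replacement in edits:
        data[offset:offset + len(replacement)] = replacement
    return bytes(data)

def extract(archive):
    """Sketch of extraction: only the first occurrence ("full" entries) is
    decompressed; exact duplicates are copied, near duplicates patched."""
    out = {}
    for entry in archive:
        if entry["kind"] == "full":
            out[entry["name"]] = entry["payload"]    # decompression elided
        elif entry["kind"] == "exact-dup":
            out[entry["name"]] = out[entry["ref"]]   # copy, no decompression
        else:                                        # "near-dup"
            out[entry["name"]] = apply_patch(out[entry["ref"]],
                                             entry["payload"])
    return out

archive = [
    {"name": "a", "kind": "full", "payload": b"hello world"},
    {"name": "b", "kind": "exact-dup", "ref": "a"},
    {"name": "c", "kind": "near-dup", "ref": "a", "payload": [(6, b"there")]},
]
restored = extract(archive)
```

Note that entries "b" and "c" are reconstructed without any decompression pass of their own, which is the efficiency claim of this paragraph.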
Additionally, the present invention provides a method that is not in any way tied to the actual differencing method used to generate a “diff” from the file/data pairs which the method detects as the most likely matches.
The foregoing summary broadly sets out the more important features of the present invention so that the detailed description that follows may be better understood, and so that the present contributions to the art may be better appreciated. There are additional features of the invention that will be described in the detailed description of the preferred embodiments of the invention which will form the subject matter of the claims appended hereto.
The foregoing summary, as well as the following detailed description of the invention, will be better understood when read in conjunction with the appended drawings. For the purpose of illustrating the invention, there are shown in the drawings embodiments which are presently preferred. It should be understood, however, that the invention is not limited to the precise arrangements and instrumentalities shown.
The invention will be better understood and objects other than those set forth will become apparent when consideration is given to the following detailed description thereof. Such description makes reference to the annexed drawings.
Definitions: The following written description makes use of the following terms and phrases. As used herein, the defined terms have the indicated meanings.
Data set: a set of one or more typed files or data, also possessing attributes (including but not limited to directory, name, extension, type, creator, creation time, modification time, and access time.)
Archive: a collection of files created for the purpose of storage or transmission, usually in compressed and otherwise transformed form; an archive consists of structural information and archive data.
Attributes: parts of an archive that contain information about files/data, including, but not limited to type, pre- and post-archive transform sizes, extension, type, creator, creation time, modification time, and access time.
Fixed attributes: attributes of a file that are established when the file is created and cannot be changed (such as creation time, creator, and file type).
Variable Attributes: The attributes of a file that can change each time a file is accessed or modified (such as size, name, modification date and hash values.)
Set of Attribute Weights: A table comprising and maintaining a list of each individual attribute, with "weights" assigned to the attributes based on how accurate each attribute has been in determining approximate matches in the past (e.g., "type" by itself has a higher weight than "mod date"). Weights are initialized using predefined values and updated over time during data processing.
Probable matches: Two or more files or data elements that are likely to be similar based on the weighted calculation for attributes done on them.
Delta encoding: a technique of storing data in the form of differences between sequential data rather than complete files.
Archive data: “data set” data in transformed form.
Archive creation: the process of combining multiple data sets and their attributes into an archive.
Archive expansion, full archive expansion: the process of recreating data sets, files, and their attributes from an archive.
Approximately duplicate files: two or more files having the same set of attributes, such as file size, type, creation date, creator, or calculated attributes.
Most likely duplicate files: When using the weighted attribute database in combination with the fixed and calculated attributes, “most likely duplicate files” are two or more files that appear to be most likely similar, and would thus benefit from a diffing process rather than stand-alone compression.
Archive transform, forward archive transform: transformation of data stored in an archive by application of algorithms including, but not limited to, compression, encryption, cryptographic signing, filtering, format detection, format-specific recompression, hash calculation, error protection and forward error correction.
Inverse archive transform: transformation of data that is the inverse of the forward archive transform, by application of algorithms including, but not limited to, decompression, decryption, verification of cryptographic signatures, inverse filtering, format-specific decompression, hash verification, error detection, and error correction.
Segment: part of a data set that is read in one operation.
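To make the "Set of Attribute Weights" definition concrete, the sketch below models it as a simple table plus an update rule. The attribute names, starting values, learning rate, and update rule are all assumptions made for illustration; the specification states only that weights are initialized with predefined values, that an attribute such as "type" alone outweighs "mod date," and that weights are updated over time.

```python
# Hypothetical initial Set of Attribute Weights; the attribute names and
# starting values are illustrative, not taken from the specification.  The
# only stated constraint is that "type" alone outweighs "mod date".
attribute_weights = {
    "type":      0.80,
    "creator":   0.60,
    "extension": 0.55,
    "size":      0.40,
    "mod_date":  0.25,
}

def update_weight(weights, attribute, prediction_was_correct, rate=0.05):
    """Move an attribute's weight toward 1 when its match prediction proved
    correct and toward 0 when it did not (a simple learning rule assumed
    here; the specification only says weights are 'updated over time')."""
    target = 1.0 if prediction_was_correct else 0.0
    weights[attribute] += rate * (target - weights[attribute])
    return weights[attribute]
```

Under this assumed rule, an attribute that repeatedly predicts matches correctly gradually gains influence over the weighted prediction, while an unreliable one fades.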
When creating an archive from a set of files/data sets, a straightforward way to detect full or partial duplicates is to compare all incoming file forks, such as data forks and resource forks.
Efficient detection of exact or approximately duplicate data or files is achieved as follows:
Referring to the appended drawings:
Using an “Exact Encoding Technique” the exactly duplicate data elements are filtered out 101, and stored separately 102. These steps effectively remove all files which are exact duplicates of each other, leaving only those files that are potentially approximate duplicates to be further identified using this technique.
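A minimal sketch of steps 101 and 102 follows. Grouping by a cryptographic digest is one common way to detect exact duplicates; SHA-256 is an assumed choice here, as the specification does not name a particular exact-match technique, and the tuple-based input/output shapes are invented for illustration.

```python
import hashlib

def filter_exact_duplicates(blobs):
    """Group inputs by digest: the first member of each group is kept
    (step 101), later members are recorded as references to it (step 102).
    SHA-256 is an assumed choice, not specified by the method."""
    seen = {}
    uniques, duplicates = [], []
    for name, data in blobs:
        digest = hashlib.sha256(data).digest()
        if digest in seen:
            duplicates.append((name, seen[digest]))  # reference only
        else:
            seen[digest] = name
            uniques.append((name, data))
    return uniques, duplicates

uniques, dups = filter_exact_duplicates(
    [("a.txt", b"same bytes"), ("b.txt", b"same bytes"), ("c.txt", b"other")])
```

Only the `uniques` list proceeds to the attribute-based similarity detection of the following steps.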
The remaining data set is passed to the algorithm to find most likely similar files. This starts with the attributes for each data element being extracted and generated 103.
Fixed Attributes 104 and Calculated Attributes 105 are extracted for each data element. Original data elements are also read 106 and passed to the Calculated Attributes extraction step 105.
Initial attribute weights are assigned by an "Initial Attribute Weighting" step 107 and stored in a "Set of Attribute Weights" 108; after extraction, the attributes of each data element are assigned weights according to the values stored in the Set of Attribute Weights. The assigned weights are then used in the weighted prediction process to create an ordered list of the most likely matches for the current element 109. Thus, step 109 takes two inputs for each of one or more attributes: (1) the currently predicted match between a pair of files or other data, for example a 0 to 100% likelihood of a match or other metric; and (2) how accurate that particular attribute's prediction has been in the past, i.e., a success rate, possibly 0 to 100% accuracy or some other metric. These two metrics for each of the possible attributes are then merged into a single weighted "Result," using a method taught in U.S. patent application Ser. No. 12/329,480, incorporated in its entirety by reference herein.
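The two-input merge of step 109 can be sketched as below. The accuracy-weighted average used here is purely an illustrative assumption; the actual merge is the method taught in Ser. No. 12/329,480, which is not reproduced in this specification.

```python
def merge_predictions(per_attribute):
    """Merge, for one candidate pair, each attribute's two inputs -- its
    current match likelihood (0-1) and its historical prediction accuracy
    (0-1) -- into a single weighted Result.  An accuracy-weighted average
    is assumed here for illustration only; the actual merge is the method
    of U.S. Ser. No. 12/329,480."""
    numerator = sum(likelihood * accuracy
                    for likelihood, accuracy in per_attribute)
    denominator = sum(accuracy for _, accuracy in per_attribute)
    return numerator / denominator if denominator else 0.0

# "type" predicts a certain match and has been 80% reliable; "mod date"
# is ambivalent and historically weak, so it barely moves the Result.
result = merge_predictions([(1.0, 0.8), (0.5, 0.2)])
```

Under this assumed rule, historically reliable attributes dominate the Result, which is the qualitative behavior the specification attributes to the weighted prediction process.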
From the Weighted prediction process an ordered list of the most probable matches for the given data sets is prepared 110.
Based on the list of probable matches, delta encoding is performed on the set of files in order from the highest to the lowest weighted prediction 111. The delta encoding is stopped when an increase in size is detected.
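Step 111 can be sketched as follows. The candidate record shape and the byte-counting "delta size" metric are invented stand-ins, and stopping at the first candidate whose result exceeds the best so far is one reading of "stopped when an increase in size is detected."

```python
def delta_encode_in_order(target, candidates, delta_size):
    """Attempt delta encoding against candidate references in descending
    order of weighted prediction (step 111), stopping as soon as a
    candidate produces a larger result than the best so far."""
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    best_ref, best_size = None, None
    for cand in ranked:
        size = delta_size(target, cand["data"])
        if best_size is not None and size > best_size:
            break                       # results are worsening: stop early
        if best_size is None or size < best_size:
            best_ref, best_size = cand["name"], size
    return best_ref, best_size

candidates = [
    {"name": "r1", "score": 0.9, "data": b"abcf"},
    {"name": "r2", "score": 0.5, "data": b"zzzz"},
]
ham = lambda a, b: sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
ref, size = delta_encode_in_order(b"abcd", candidates, ham)
```

Because the candidates arrive already ranked by the weighted prediction, the early stop means only a handful of diffs are attempted per file, in contrast with the O(n^2) brute force method discussed earlier.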
The data element is also compressed separately by standard compression techniques according to file attributes 114 and the result is stored in a “Compression by Attribute” database 115 which stores/learns the “Average” compression for a file with the given attributes.
The results of the delta encoding and the standard compression are compared, and the better result, i.e., the smaller of the delta encoding and the standard compression, is stored 113.
Based on the results from the comparison, the Set of Attribute Weights is updated 112 and the process for assigning a weight to each attribute is repeated for each input data element.
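Steps 113, 114, and 112 together can be sketched as below. zlib stands in for the "standard compression techniques" of step 114, and `on_outcome` is a hypothetical callback through which the result would feed the update of the Set of Attribute Weights (step 112); neither is prescribed by the specification.

```python
import zlib

def choose_and_record(target, delta_result, on_outcome):
    """Keep whichever encoding is smaller (step 113): the best delta
    encoding, or stand-alone compression (zlib as a stand-in for the
    'standard compression techniques' of step 114).  `on_outcome` is a
    hypothetical hook for updating the Set of Attribute Weights (112)."""
    ref, delta_size = delta_result
    compressed = zlib.compress(target)
    if ref is not None and delta_size < len(compressed):
        choice = ("delta", ref, delta_size)
    else:
        choice = ("standard", None, len(compressed))
    on_outcome(choice[0] == "delta")    # did the predicted pair pay off?
    return choice

outcomes = []
choice = choose_and_record(b"a" * 100, ("ref1", 3), outcomes.append)
```

Feeding the win/loss outcome back into the weight table is what lets the attribute weights improve as more data elements are processed.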
It should also be noted that files that have already been identified as members of a matched pair are removed from future comparisons for the remaining data sets/files still to be compared.
Although the invention has been described in language specific to structural features and/or methodological steps, it is to be understood that the invention defined in the appended claims is not necessarily limited to the specific features or steps described. Rather, the specific features and steps are disclosed as preferred forms of implementing the claimed invention.
Therefore, the above description and illustrations should not be construed as limiting the scope of the invention, which is defined by the appended claims.
The present application is a continuation-in-part of each of U.S. Utility patent application Ser. No. 12/208,296, filed Sep. 10, 2008 (Sep. 10, 2008), entitled EFFICIENT FULL OR PARTIAL DUPLICATE FORK DETECTION AND ARCHIVING; and U.S. Utility patent application Ser. No. 12/329,480, filed Dec. 5, 2008 (Dec. 5, 2008), entitled PREDICTION WEIGHTING METHOD BASED ON PREDICTION CONTEXTS, each of which application is incorporated in its entirety by reference herein.
| Number | Date | Country |
|---|---|---|
| 60/971,739 | Sep. 2007 | US |
| Relation | Number | Date | Country |
|---|---|---|---|
| Parent | 12/329,480 | Dec. 2008 | US |
| Child | 12/559,315 | | US |
| Parent | 12/208,296 | Sep. 2008 | US |
| Child | 12/329,480 | | US |