Compression may be carried out to increase the amount of data that may be stored on a data storage device. Unlike movies, pictures or audio recordings, which may tolerate lossy compression, user data must be retrievable in the same condition in which it was stored. This requires lossless compression. For many applications that are not particularly time-sensitive, a high degree of compression may be achieved, albeit at the expense of processing cycles, storage resources and time. For other applications, faster compression is more important than slower compression that may achieve a somewhat better compression ratio.
An embodiment defines a method of compressing data that, although not as efficient as some other compression methods, is faster and may be preferred in many applications. This may be termed “good enough” compression. What it lacks in efficiency, it gains in speed.
According to one embodiment, backward compound pointers are used to reference repeated byte pairs. Such pointers may be stored in a reverse pointer buffer during compression. This data structure comprises pointers to previous instances of repeated data and to previous matches. For example, bytes of the data may be examined using a sliding window that is, for example, two bytes wide. The window may be configured, for example, to slide across the data in one-byte increments.
This method assumes that byte pairs that match previous byte pairs are likely to recur further along in the data. In one embodiment, a sliding compression window is used. According to one embodiment, the window is two bytes in width, although windows of other widths may be used.
According to one embodiment, a table of all possible two-byte values is created and a pointer may be provided to each of the locations at which a given two-byte value occurs. Once a primary string (a string that is to be reduced in size by finding a match in the buffer) is encountered that starts with a given two-byte value, the table may be consulted and each string that starts with the same two bytes may be examined and compared to the primary string.
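By way of illustration only, the backward pointers described above might be sketched as follows. The function name `build_backward_pointers`, the use of -1 to mean "no previous instance" and the use of a Python dictionary are assumptions made for this sketch and are not taken from the disclosure.

```python
def build_backward_pointers(data: bytes) -> list[int]:
    """For each offset, return the nearest earlier offset at which the same
    two-byte value starts, or -1 if there is none (illustrative convention)."""
    last_seen: dict[bytes, int] = {}           # two-byte value -> latest offset seen
    back = [-1] * max(len(data) - 1, 0)
    # Slide a two-byte window across the data in one-byte increments.
    for offset in range(len(data) - 1):
        pair = data[offset:offset + 2]
        back[offset] = last_seen.get(pair, -1)  # pointer to the previous instance
        last_seen[pair] = offset
    return back
```

In this sketch, following the pointers backward from any offset visits every earlier location that starts with the same two bytes.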
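A minimal sketch of such a lookup table is given below, under stated assumptions. The name `index_byte_pairs` is hypothetical, and a dictionary keyed only by the two-byte values actually present stands in for a table of all 65,536 possible two-byte values.

```python
from collections import defaultdict

def index_byte_pairs(data: bytes) -> dict[bytes, list[int]]:
    """Map each two-byte value to every offset at which it starts."""
    locations: dict[bytes, list[int]] = defaultdict(list)
    # Two-byte window, advanced one byte at a time.
    for offset in range(len(data) - 1):
        locations[data[offset:offset + 2]].append(offset)
    return dict(locations)
```

Given a primary string, the candidate strings to compare against would then be those starting at the offsets listed under the primary string's first two bytes.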
Table 106 may be called a starting location table. The starting location table 106 may be configured to store, according to one embodiment, the first instance of all possible values within a given byte length. In the example of
In the example of
As shown in
According to one embodiment, using the starting location table 106 and the pointers table 104, the values of the string “2 B C D X” may be replaced by a pointer to the first instance of the beginning of the string, and a length of the string that is repeated. Here, only the values “2 B C D” of the string “2 B C D X” are repeated, as the value “X” does not follow the first instance of the string “2 B C D” at locations 6, 7, 8 and 9. In this case, therefore, the repeated string “2 B C D” at locations 13, 14, 15 and 16 may be replaced by “P6, 4”, indicating that the next 4 values may be found at the four consecutive locations beginning at location 6. Since the “X” value of the string “2 B C D X” is not repeated in the string “2 B C D Y”, the value “X” is simply appended to the expression “P6, 4” indicative of the repeated string. In this manner and according to one embodiment, the repeated string “2 B C D X” at locations 13, 14, 15, 16 and 17 may be replaced with the compressed string “P6, 4, X”.
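To illustrate how an expression such as “P6, 4, X” might be expanded back into the original values, the following sketch may be considered. The token representation (a `(start, length)` tuple for a pointer and an integer for a literal byte), the zero-based locations and the filler bytes surrounding the example strings are assumptions of this sketch only.

```python
def expand(tokens: list) -> bytes:
    """Rebuild the original bytes from literal bytes and (start, length) pointers."""
    out = bytearray()
    for token in tokens:
        if isinstance(token, tuple):
            start, length = token
            # Copy one byte at a time from data that has already been rebuilt.
            for i in range(length):
                out.append(out[start + i])
        else:
            out.append(token)  # literal byte value
    return bytes(out)

# "2BCDY" occupies locations 6-10; the repeat "2BCDX" belongs at locations 13-17.
tokens = list(b"qwerty2BCDYzz") + [(6, 4), ord("X")]
original = expand(tokens)
```

Here the pointer token plays the role of “P6, 4” and the trailing literal plays the role of the appended value “X”.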
According to one embodiment, in comparing strings, once a repeated byte is found, byte pairs may be compared until the byte pairs no longer match. For example, having identified that the value “2” is present in location 6 and repeated at location 13, the byte pair at locations 7 and 14 may be compared. If a match is found, the byte pair at locations 8 and 15 may be compared, and so on until the byte pair at locations 10 and 17 is compared and found not to match. Having identified a non-matching byte pair, the preceding matching bytes, if sufficient in number, may be compressed as detailed above and shown relative to
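The pairwise comparison described in the preceding paragraph might be sketched as shown below. The function name `match_length`, the zero-based locations and the filler bytes around the example strings are illustrative assumptions.

```python
def match_length(data: bytes, earlier: int, current: int) -> int:
    """Count the consecutive bytes starting at `current` that repeat the bytes
    starting at `earlier` (with `earlier` < `current`)."""
    length = 0
    while (current + length < len(data)
           and data[earlier + length] == data[current + length]):
        length += 1
    return length

# "2BCDY" at locations 6-10 and "2BCDX" at locations 13-17.
data = b"qwerty2BCDYzz2BCDX"
repeated = match_length(data, 6, 13)
```

The comparison stops at locations 10 and 17, where “Y” and “X” differ, so four matching bytes are reported in this example.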
According to one embodiment, an antecedent step may be carried out to determine whether the original, non-compressed data is deemed to be compressible or deemed to be sufficiently compressible so as to make the compression effort worthwhile. There are many different methods of determining whether data is compressible and any such methods may be utilized within the context of the present disclosure.
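The disclosure does not specify a particular compressibility test. Purely as one hypothetical possibility, a byte-frequency (Shannon entropy) estimate over a sample of the data might be used. The names `byte_entropy` and `looks_compressible` and the 7.5 bits-per-byte threshold are assumptions of this sketch.

```python
import math
from collections import Counter

def byte_entropy(sample: bytes) -> float:
    """Shannon entropy of the byte values in `sample`, in bits per byte (0 to 8)."""
    if not sample:
        return 0.0
    total = len(sample)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(sample).values())

def looks_compressible(sample: bytes, max_entropy: float = 7.5) -> bool:
    """Heuristic: treat data whose entropy is below the threshold as worth compressing."""
    return byte_entropy(sample) < max_entropy
```

Such an estimate considers only how often each byte value occurs, so it may misjudge data whose redundancy lies in repeated sequences rather than in a skewed distribution of byte values.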
Moreover, according to one embodiment, a determination may be made whether the repeated string has a predetermined minimum length. For example, the exemplary string “2 B C D” is 4 bytes long, whereas the compressed version thereof, namely “P6, 4”, is two bytes long. It may not be a judicious use of computing resources to compress any repeated string of less than, for example, 3 bytes in length. This minimum repeated length threshold may be set as desired. A larger threshold may result in a somewhat decreased compression ratio, but such compression may be carried out somewhat faster. Conversely, a smaller repeated length threshold may yield somewhat better compression, at the cost of a somewhat greater utilization of time and resources.
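The minimum repeated length threshold might be applied as in the following sketch. The names `MIN_REPEAT` and `emit`, the default of 3 bytes (taken from the example figure in the text) and the token representation are assumptions made for illustration.

```python
MIN_REPEAT = 3  # assumed threshold; the text notes it may be set as desired

def emit(data: bytes, earlier: int, current: int, length: int,
         min_repeat: int = MIN_REPEAT) -> list:
    """Return one (start, length) pointer if the repeat is long enough,
    otherwise return the repeated bytes as literal values."""
    if length >= min_repeat:
        return [(earlier, length)]
    return list(data[current:current + length])

data = b"qwerty2BCDYzz2BCDX"
long_repeat = emit(data, 6, 13, 4)   # 4 bytes repeated: meets the threshold
short_repeat = emit(data, 6, 13, 2)  # only 2 bytes considered: left as literals
```

Raising `min_repeat` in this sketch causes fewer pointers to be emitted, which corresponds to the trade-off between compression ratio and speed noted above.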
According to one embodiment, the repeated sequences of values may be determined across the entire chunk of data 102 being processed. In the example developed above, the chunk of data 102 was 1 MB in size. According to one embodiment, however, sequences of values may be considered to be “repeated” only if instances thereof appear within a predetermined span of data that is smaller than the size of the chunk of data 102 under current consideration. Such predetermined span may be, for example, 4 KB in length, 8 KB in length or almost any length up to the size of the chunk of data 102 under consideration. In this manner, instances of values that would otherwise be identified as being “repeated” may not be so identified if they are more than the predetermined span away from the starting location of the sequence of values currently under consideration. Accordingly, a larger predetermined span (e.g., 500 KB or 1 MB) may achieve a better compression ratio (i.e., the large size of the span may capture more “repeats” of the sequences of values and/or longer repeated sequences) than a comparatively smaller predetermined span. However, such better compression ratio may be associated with increased use of processing and memory resources, which may lead to increased processing time. Conversely, a smaller predetermined span (e.g., 4 KB or 8 KB) may utilize comparatively fewer computational and memory resources and thus may complete the compression faster, but may be associated with a comparatively lower compression ratio (i.e., the smaller size of the span may cause fewer “repeats” of sequences of values to be identified and/or the repeated sequences may be shorter).
According to one embodiment, after all of the data in the table 102 is processed to populate the starting location table 106 and the pointers table 104 and the data in the data table 102 is compressed as detailed above, another chunk (e.g., 1 MB) of data may be acquired, and the values in the starting location table 106 and the pointers table 104 discarded. The same tables 106, 104 may then be re-populated with starting values and pointers, respectively. Alternatively, the starting location table 106 and the pointers table 104 may be discarded and a new starting location table 106 and a new pointers table 104 may be instantiated upon the analysis of the new chunk of data. Successive chunks of data may be analyzed and compressed until all of the data has thus been analyzed and compressed.
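Restricting candidate repeats to a predetermined span might look like the sketch below. The function name `candidates_within_span` and the representation of candidates as a list of earlier offsets are hypothetical.

```python
def candidates_within_span(positions: list[int], current: int, span: int) -> list[int]:
    """Keep only the earlier offsets that lie within `span` bytes of `current`."""
    return [p for p in positions if 0 < current - p <= span]

# Earlier offsets at which the same starting values were seen.
earlier_offsets = [10, 5000, 9000]
near = candidates_within_span(earlier_offsets, current=9500, span=4096)  # 4 KB span
far = candidates_within_span(earlier_offsets, current=9500, span=1 << 20)  # 1 MB span
```

With the 4 KB span only the nearest candidate remains to be compared, whereas the 1 MB span retains all three, illustrating why a larger span may find more repeats at the cost of more comparisons.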
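The chunk-by-chunk processing described above might be sketched as follows. This is a simplified illustration rather than the disclosed method: it keeps only a table of first occurrences of each two-byte value (loosely corresponding to the starting location table) and omits the pointers table. The names `compress_chunk`, `compress_stream` and `CHUNK_SIZE` and the token representation are assumptions.

```python
from typing import Iterator

CHUNK_SIZE = 1 << 20  # 1 MB, as in the example chunk size given in the text

def compress_chunk(chunk: bytes, min_repeat: int = 3) -> list:
    """Emit literal byte values and (start, length) pointers for one chunk."""
    first_seen: dict[bytes, int] = {}  # fresh table for every chunk
    tokens: list = []
    pos = 0
    while pos < len(chunk):
        key = chunk[pos:pos + 2]
        earlier = first_seen.get(key) if len(key) == 2 else None
        length = 0
        if earlier is not None:
            # Extend the match until the bytes differ or the chunk ends.
            while (pos + length < len(chunk)
                   and chunk[earlier + length] == chunk[pos + length]):
                length += 1
        if length >= min_repeat:
            tokens.append((earlier, length))
            step = length
        else:
            tokens.append(chunk[pos])
            step = 1
        # Record the first location of each two-byte value just consumed.
        for p in range(pos, pos + step):
            pair = chunk[p:p + 2]
            if len(pair) == 2:
                first_seen.setdefault(pair, p)
        pos += step
    return tokens

def compress_stream(data: bytes, chunk_size: int = CHUNK_SIZE) -> Iterator[list]:
    """Compress successive chunks independently; no table survives a chunk."""
    for start in range(0, len(data), chunk_size):
        yield compress_chunk(data[start:start + chunk_size])
```

Because the table is rebuilt for every chunk in this sketch, a sequence that repeats only across a chunk boundary is not identified as repeated.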
This process may be carried out rapidly. Although other forms of compression may yield a greater compression ratio, embodiments in this disclosure favor speed of compression over achieving the maximum compression ratio.
When references (e.g., pointers) to the starting locations of all data values have populated the starting location table 106 and when references (e.g., pointers) to all second and subsequent instances of those data values have populated the reference table 104 (YES branch of B24), the separate instances of repeated sequences of values (such as the exemplary repeated sequence “2BCD” in
While certain embodiments of the disclosure have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the disclosure. Indeed, the novel methods, devices and systems described herein may be embodied in a variety of other forms. Furthermore, various omissions, substitutions and changes in the form of the methods and systems described herein may be made without departing from the spirit of the disclosure. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the disclosure. For example, those skilled in the art will appreciate that in various embodiments, the actual physical and logical structures may differ from those shown in the figures. Depending on the embodiment, certain steps described in the example above may be removed, others may be added. Also, the features and attributes of the specific embodiments disclosed above may be combined in different ways to form additional embodiments, all of which fall within the scope of the present disclosure. Although the present disclosure provides certain preferred embodiments and applications, other embodiments that are apparent to those of ordinary skill in the art, including embodiments which do not provide all of the features and advantages set forth herein, are also within the scope of this disclosure. Accordingly, the scope of the present disclosure is intended to be defined only by reference to the appended claims.
This application claims benefit of U.S. Provisional Patent Application Ser. No. 61/870,051 entitled “FASTER FILE COMPRESSION USING SLIDING COMPRESSION WINDOW AND BACKWARD COMPOUND POINTERS” filed Aug. 26, 2013, the disclosure of which is incorporated by reference herein in its entirety.
Number | Date | Country | |
---|---|---|---|
61870051 | Aug 2013 | US |