This invention relates to the field of data processing and data storage systems, and more particularly to data compression within such systems.
Data processing systems employ and generate extensive amounts of data, much of which is stored by data storage systems. One way of reducing the physical amount of storage required is to compress at least some of the data. Many data compression engines employing various data compression algorithms exist to accomplish the data compression, some providing greater degrees of compression of some types of data, others greater degrees of compression of other types of data, and some providing compression of all types of data. Some data compression engines can recompress data that has already been compressed. A prevalent means of measuring the degree of data compression is the compression ratio, meaning the ratio of the amount of original data to the amount of compressed data, such as 10:1. Another aspect of data compression is the latency of the compression of data and of the uncompression or decompression of compressed data. Typically, the greater the compression ratio, the more complex the data compression, and the greater the latency. Of course, the shortest latency is for access to data that is uncompressed.
As a result, the user manually selects the data compression engines based on expected needs and on the amount of data storage expected to be available, which is often cost driven. Also, the data compression engines may vary widely in cost, affecting the decision. The user may then establish a policy regarding what data and how much data is to be compressed versus what data and how much data is left uncompressed, as discussed in the incorporated '388 application. The policies may remain fixed and may or may not result in data storage space efficiency or in rapid access over time.
Methods, data storage subsystems, and computer program products are provided for data storage and data compression.
In one embodiment, a data storage subsystem comprises data storage; a plurality of data compression engines configured to compress data, each having a different compression algorithm; at least one input configured to receive data to be compressed and stored by the data storage; and at least one data handling system configured to perform steps comprising:
selecting at least one sample of data to be compressed and stored;
operating a plurality of the data compression engines to compress the selected sample(s);
determining the compression ratios of the operated data compression engines with respect to the selected sample(s);
operating the data compression engine determined to have the greatest compression ratio with respect to the selected sample(s), to compress the data; and
storing the compressed data.
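The steps above can be sketched as follows. This is a minimal illustration only, not the claimed implementation: the standard-library codecs `zlib`, `bz2` and `lzma` are assumed stand-ins for the data compression engines (the description does not name specific algorithms), and the names `ENGINES`, `compression_ratio`, `best_engine` and `compress_extent` are hypothetical.

```python
import bz2
import lzma
import zlib

# Hypothetical stand-ins for the data compression engines; the
# description does not name any particular algorithm.
ENGINES = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def compression_ratio(original: bytes, compressed: bytes) -> float:
    """Ratio of original size to compressed size (10.0 means 10:1)."""
    return len(original) / len(compressed)


def best_engine(samples, engines=ENGINES):
    """Compress every sample with every engine; return the engine name
    with the greatest aggregate ratio, plus all measured ratios."""
    if not samples:
        raise ValueError("at least one sample is required")
    total_in = sum(len(s) for s in samples)
    ratios = {}
    for name, compress in engines.items():
        total_out = sum(len(compress(s)) for s in samples)
        ratios[name] = total_in / total_out
    winner = max(ratios, key=ratios.get)
    return winner, ratios


def compress_extent(data: bytes, samples, engines=ENGINES):
    """Compress the whole extent with the engine that won on the samples."""
    name, _ = best_engine(samples, engines)
    return name, engines[name](data)
```

Only the samples are compressed by every engine; the full data is compressed once, by the winning engine, which is the point of sampling.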
In a further embodiment, the step of selecting the sample(s) comprises selecting a plurality of samples randomly from the data.
In another embodiment, the step of selecting the sample(s) comprises selecting a plurality of samples in accordance with a hashing algorithm.
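The two sampling embodiments may be illustrated as below. The description does not specify how the hashing algorithm is applied, so deriving sample offsets from a SHA-256 digest is purely an assumption, as are the function names, the `key` parameter, and the fixed sample size.

```python
import hashlib
import random


def random_samples(data: bytes, count: int, size: int, seed=None):
    """Pick `count` slices of `size` bytes at random offsets in the data."""
    if len(data) <= size:
        return [data]  # data too small to sample; use it whole
    rng = random.Random(seed)
    last_start = len(data) - size
    starts = [rng.randint(0, last_start) for _ in range(count)]
    return [data[s:s + size] for s in starts]


def hashed_samples(data: bytes, count: int, size: int, key: bytes = b"extent"):
    """Pick slices at offsets derived from a hash (assumed scheme):
    offset i = SHA-256(key || i) mod number of valid start positions."""
    if len(data) <= size:
        return [data]
    last_start = len(data) - size
    samples = []
    for i in range(count):
        digest = hashlib.sha256(key + i.to_bytes(4, "big")).digest()
        start = int.from_bytes(digest[:8], "big") % (last_start + 1)
        samples.append(data[start:start + size])
    return samples
```

Under this assumed scheme the hashed offsets are repeatable for a given key and data length, whereas the random offsets differ from run to run unless a seed is supplied.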
In still another embodiment, the data storage subsystem comprises data storage; a plurality of data compression engines configured to compress data, each having a different compression algorithm, the compression engines arranged in a plurality of tiers from low to high in accordance with expected latency to compress data and to uncompress compressed data; at least one input configured to receive data to be compressed and stored by the data storage; and at least one data handling system configured to perform steps comprising:
determining a present rate of access to data;
if the rate of access indicates the data is to be compressed, selecting a tier of data compression engine with respect to the data that is inverse to the present rate of access;
selecting at least one sample of the data;
operating a plurality of the data compression engines at the selected tier to compress the selected sample(s);
determining the greatest degree of compression of the operated data compression engines with respect to the selected sample(s); and
operating the data compression engine determined to have the greatest degree of compression with respect to the selected sample(s), to compress the data.
In a further embodiment, the present rate of access to the data comprises the inactivity time from the most recent access to at least a portion of the data, the shorter the inactivity time, the greater the rate of access.
In another embodiment, the present rate of access to the data comprises the number of accesses to at least a portion of the data within a time window, the greater the number of accesses, the greater the rate of access.
In a further embodiment, the data handling system is additionally configured to perform the step of storing the compressed data in a repository of the data storage, the repository associated with the data compression engine employed to compress the data.
In a still further embodiment, at least two of the repositories are classified into parent and child categories, each at a different tier, the parent repository associated with a data compression engine having a lesser degree of compression than the child, and the data handling system is additionally configured to perform the step of moving data between the parent and the child category repositories in accordance with the inverse of the present rate of access.
In another embodiment, the data handling system is configured to perform the step of compressing the data for the parent category repository, and recompressing the data for the child category repository.
In still another embodiment, the data handling system is configured to perform the step of compressing the data for the parent category repository, and uncompressing and again compressing the data for the child category repository.
In another embodiment, the data handling system is configured to perform the steps of determining whether none of the compression engines exceeds a minimum degree of compression, and, if so, disallowing compression of the data.
For a fuller understanding of the present invention, reference should be made to the following detailed description taken in conjunction with the accompanying drawings.
This invention is described in preferred embodiments in the following description with reference to the Figures, in which like numbers represent the same or similar elements. While this invention is described in terms of the best mode for achieving this invention's objectives, it will be appreciated by those skilled in the art that variations may be accomplished in view of these teachings without deviating from the spirit or scope of the invention.
Referring to
Referring to
Host interfaces 150, 154 may comprise one or more host adapters, Fibre Channel ports, one or more FICON ports, one or more ESCON ports, one or more SCSI ports, or other suitable ports. Each host interface is configured to communicate with a host system 105 or network 107 of
Data storage interfaces 160, 164 may comprise communication links for communication with data storage 120. The communication links may comprise device adapters, for example for serial interconnections, such as RS-232 or RS-422, Ethernet connections, SCSI interconnections, ESCON interconnections, FICON interconnections, a Local Area Network (LAN), a private Wide Area Network (WAN), a public wide area network, Storage Area Network (SAN), Transmission Control Protocol/Internet Protocol (TCP/IP), the Internet, and combinations thereof.
An example of a data handling subsystem 140 comprises an IBM® Enterprise Storage Server, IBM® Model DS 8000, or other comparable system.
The data may be arranged in various units of data of constant or variable sizes. In one example, data is arranged in units of data called “files”. In other examples, data is arranged in units of data called “data sets”, “extents” or “sectors”. Another example is related to addressing and is called “LUNs”, for Logical Units. Other examples are known to those of skill in the art. Hence, herein, the generic terms used are “data” or “extent”.
Alternatively, for data that has been sampled and the samples compressed according to various algorithms, the compression engines are arranged in tiers given what the sampling returned. So, for a given data, the latency to compress the data, based on compressing the samples, may differ from the latency to compress other data to a degree that the tiering is different from that of the other data.
Referring to
As an example, the data extent is to be compressed by one of the Tier 1 compression engines 180, 182, 185. The data handling system operates a plurality of the data compression engines 180, 182, 185 to compress the selected sample(s). In one example, the data handling system determines the compression ratios of the data compression engines 180, 182, 185 with respect to the selected sample(s), and determines the data compression engine having the greatest compression ratio. The data handling system then operates the data compression engine determined to have the greatest compression ratio with respect to the selected sample(s), to compress the data extent. The compressed data may be stored in the repository associated with the data compression engine.
The same process may be employed to further compress data or to compress data with data compression engines 190, 192, 193 that compress data according to algorithms that have the greatest degree of compression. In one example, it may be desirable to further compress data that has been compressed and is stored at one of the Tier 1 repositories 180, 182, 185. The further compression may occur in one of two ways: taking the existing compressed data and further compressing it, or uncompressing the existing compressed data and compressing it anew. Alternatively, certain compression engines may employ algorithms that are compatible, such that they have a parent and child relationship in that the parent compression engine 182 has low access latency, and one or more child compression engines 192, 193 have the greatest degree of compression.
Algorithms may be complementary and arranged in tiers as parent and child, in one example, because using them together up front would cause the compression and uncompression latency to be unacceptable. Thus, data is initially compressed by a parent compression engine and, when the rate of access indicates further compression is desirable, it is compressed further using the complementary child compression engine.
In one example, it may be desirable to further compress a compressed data extent of Tier 1 stored in repository 182 with a child compression engine 192 or 193. In one example, a child compression engine is direct in that the already compressed data may be compressed by the child compression engine without being uncompressed first. In another example, the child compression engine could compress the data only after it has been uncompressed. The data handling system 140 selects at least one sample of the data extent that is to be further compressed from repository 182. As an example, the data handling system operates the data compression engines 192, 193 to compress the selected sample(s). In one example, the data handling system determines the compression ratios of the data compression engines 192, 193 with respect to the selected sample(s), and determines the data compression engine having the greatest compression ratio. The data handling system then operates the data compression engine determined to have the greatest compression ratio with respect to the selected sample(s), to further compress the data extent. The compressed data may be stored in the repository associated with the data compression engine.
Compressing the selected samples may result in a determination that none of the compression engines exceeds a minimum degree of compression. For example, all of the compression engines may result in expanding the selected samples rather than actually compressing them, or the compression ratios may be unacceptably small. Thus, there is no point in compressing or further compressing the data, and compression of the data is disallowed, leaving the data uncompressed, or compressed to the level of the previous tier.
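The minimum-compression check can be sketched as follows. The threshold value `MIN_RATIO`, the function name `select_or_disallow`, and the use of `zlib` and `bz2` as example engines are all assumptions made for illustration; the description leaves the minimum degree of compression unspecified.

```python
import bz2
import zlib

# Hypothetical minimum acceptable ratio (original size / compressed size).
# A ratio below 1.0 means the engine expanded the sample.
MIN_RATIO = 1.1


def select_or_disallow(sample: bytes, engines: dict, min_ratio: float = MIN_RATIO):
    """Return the best engine name, or None when no engine exceeds
    the minimum ratio, in which case compression is disallowed."""
    ratios = {
        name: len(sample) / len(compress(sample))
        for name, compress in engines.items()
    }
    best = max(ratios, key=ratios.get)
    if ratios[best] <= min_ratio:
        return None  # leave the data as it is
    return best


EXAMPLE_ENGINES = {"zlib": zlib.compress, "bz2": bz2.compress}
```

High-entropy input, such as data that is already compressed or encrypted, typically expands slightly under general-purpose codecs and would be rejected by this check.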
In another example, the data of the extent may have been compressed by the host system, and the host requires or advises that the data not be further compressed. Still alternatively, the host may require that certain data be left uncompressed.
Referring to
Either periodically or at each access to data, in step 303, the extent object 200 for each accessed data is updated. In one example, the last access (e.g. time of access) is updated, and/or the running total of accesses (e.g. within a window) is updated.
In step 305, the rate of access for the data that has been accessed is determined from the information of the updated extent objects. The present rate of access may be determined from the inactivity time from the most recent access to at least a portion of the data, the shorter the inactivity time, the greater the rate of access. The inactivity time is based on the difference between the last access time from the extent object 200 and the present time. Alternatively, the present rate of access to the data comprises the number of accesses tracked in the extent object 200 to at least a portion of the data within a time window, the greater the number of accesses, the greater the rate of access.
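Both measures of the rate of access may be illustrated with a small data structure. The class name `ExtentObject` and its fields and methods are hypothetical; the description states only that the extent object 200 tracks the last access and a running total of accesses, and storing individual timestamps here is an assumption made so that a window count can be computed.

```python
from dataclasses import dataclass, field


@dataclass
class ExtentObject:
    """Assumed per-extent access metadata (names are illustrative)."""
    last_access: float = 0.0
    access_times: list = field(default_factory=list)

    def record_access(self, now: float) -> None:
        """Step 303 analogue: note the time of an access."""
        self.last_access = now
        self.access_times.append(now)

    def inactivity_time(self, now: float) -> float:
        """Time since the most recent access; shorter means a higher rate."""
        return now - self.last_access

    def accesses_in_window(self, now: float, window: float) -> int:
        """Number of accesses in (now - window, now]; more means a higher rate."""
        return sum(1 for t in self.access_times if now - window < t <= now)
```

For example, after accesses at times 10, 50 and 95, an evaluation at time 100 gives an inactivity time of 5 and two accesses within a window of 60.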
As discussed above, the data storage 120 comprises a plurality of data compression engines, each having a different compression algorithm, and arranged in a plurality of tiers from low to high in accordance with expected latency to compress data and to uncompress compressed data.
In step 307, a determination is made whether the rate of access for a particular extent of data meets or exceeds a first threshold “a”, which is a threshold indicating that the rate of access is high enough to warrant moving the data to a tier having lower compression latency. If the threshold is met or exceeded, in step 308, the uncompress instructions are obtained from the extent object 200 and the uncompress algorithm is obtained from the repository object 210, and the data is uncompressed and sent to a lower latency tier and compressed again, if needed, to a repository at the lower latency tier. In one example, a data extent of repository 185 is found to meet or exceed the rate of access threshold “a”, and is uncompressed, for example by the associated compression engine, and provided to the extent pool 170. Alternatively, since the data may have been uncompressed at the last access, that uncompressed data may already be stored in the extent pool 170, and the metadata will be updated to reflect the change. The space of the data in the repository 185 is made available for storing additional data, for example, as free space. In step 303, the move to the new repository is noted in the extent object 200, and the list of extents is changed accordingly in the repository objects for repository 170 and repository 185.
In step 309, the compression status of the extent object 200 for the extent of data is checked to determine whether the data is compressible, or further compressible. For example, none of the compression engines may exceed a minimum degree of compression with respect to the extent of data, as discussed above. The compression status may thus indicate that the extent of data is to be left uncompressed, or not further compressed.
Alternatively, the host may require that the data of the extent not be compressed, either because it has been compressed sufficiently by the host, or because the host requires that it be left uncompressed.
In one example, the compression status may be designated as “isCompressible”, and set to “False” if it is not compressible, or not further compressible.
If the compression status indicates the data is not compressible, or not further compressible, step 309 returns to step 303 disallowing compression or further compression of the data.
If the rate of access does not meet or exceed threshold “a”, in step 310, a determination is made whether the rate of access for a particular extent of data meets or is below a second threshold “b”, which is a threshold indicating that the rate of access is low enough to warrant moving the data to a higher latency, greater degree of compression, tier. Thus, if the rate of access indicates the data is to be compressed, a tier of data compression engine is selected with respect to the data that is inverse to the present rate of access. For example, the rate of access may be derived from the low number of or no accesses of the data within a prescribed time window, or the rate of access may comprise a time since the last access occurred.
If the rate of access is within the boundaries of the thresholds of steps 307 and 310, the data remains in the present repository and the process returns to step 303 for updating other extent objects as needed.
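The threshold logic of steps 307 through 313 might be summarized as in the sketch below. It assumes the rate of access is expressed as a number where larger means more active (for example, accesses per window) and that threshold "a" is greater than threshold "b"; the enum and function names are hypothetical.

```python
from enum import Enum


class Action(Enum):
    MOVE_TO_LOWER_LATENCY = "move to lower latency tier"
    MOVE_TO_HIGHER_COMPRESSION = "move to higher compression tier"
    STAY = "remain in present repository"


def decide(rate: float, threshold_a: float, threshold_b: float,
           is_compressible: bool, higher_tier_available: bool) -> Action:
    """Map the present rate of access to a tiering action."""
    if threshold_a <= threshold_b:
        raise ValueError("threshold a must exceed threshold b")
    if rate >= threshold_a:
        # Step 307: access is frequent enough to favour low latency.
        return Action.MOVE_TO_LOWER_LATENCY
    if rate <= threshold_b:
        # Steps 309, 310 and 313: compress further only if allowed
        # by the compression status and if a higher tier exists.
        if is_compressible and higher_tier_available:
            return Action.MOVE_TO_HIGHER_COMPRESSION
        return Action.STAY
    # Between the thresholds the data stays where it is.
    return Action.STAY
```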
If the rate of access for a data extent meets or is below the threshold “b” of step 310, step 313 determines whether a higher compression repository is available. For example, the data extent may presently reside in Tier 1 repository 185 of data storage 120 and be earmarked by a host as limiting its compressibility, as indicated by the compression status of the extent object 200. Alternatively, the repository object 210 may indicate that there is no direct child compression engine and repository and no child compression engine and repository that is available even after uncompression. In either case, the data remains in the present repository and the process returns to step 303 for updating other extent objects as needed.
If step 313 indicates that at least one higher compression engine and repository is available, the process discussed above to select the compression engine and repository is employed, beginning at step 320. In step 320, the data handling system 140 selects at least one sample of the data that is to be compressed, for example samples of a data extent that was compressed by one of the Tier 1 compression engines 182 and is to be further compressed. As discussed above, the samples may be selected randomly from the data to be compressed, or may be selected from the data to be compressed in accordance with a hashing algorithm.
In the example, the repository object information 210 for repository 182 may indicate that it is a parent repository and that its two direct child repositories are repositories 192 and 193. The data handling system operates the data compression engines 192, 193 in step 325 to compress the selected sample(s). In one example, in step 327, the data handling system determines the compression ratios of the data compression engines 192 and 193 with respect to the selected sample(s), and determines the data compression engine having the greatest compression ratio. Step 330 determines whether any of the data compression engines has provided a usable compression of the data sample(s). If not, for example because the compression ratio is less than a predetermined amount or the compression is negative (that is, it results in more data after the compression than before), step 333 resets the compression status for the extent in the extent object 200 to indicate that the data of the extent is not compressible, or not further compressible, and the process returns to step 303.
If at least one of the data compression engines, for example, engines 192 and 193, does provide an acceptable compression for the selected samples, step 335 sets the compression status in the extent object 200 to indicate that the extent can or should be able to be compressed further.
Step 340 comprises selecting the desired data compression engine, for example choosing between engines 192 and 193. In one example, the choice is the data compression engine having the greatest compression ratio with respect to the selected sample(s).
Step 341 determines whether the data of the extent is uncompressed, or, if compressed, may be compressed by the selected compression engine without being uncompressed, or whether the data of the extent must be uncompressed before being compressed again by the selected compression engine. For example, complementary parent and child compression engines may allow recompression of compressed data, whereas non-complementary compression engines may require uncompression, for example, as identified in the repository object 210.
If uncompression is required, the data of the extent is uncompressed in step 342.
The data handling system 140, in step 343, then operates the data compression engine determined to have the greatest compression ratio with respect to the selected sample(s), to compress the data extent in accordance with the compression algorithm of repository object 210. The compressed data may be stored in the repository associated with the data compression engine.
In the example of repositories and data compression engines 182, 192 and 193, the repositories are classified into parent and child categories, each at a different tier, the parent repository 182 associated with a data compression engine having a lesser degree of compression than the children 192 and 193, and the data handling system 140 moves data between the parent and the child category repositories in accordance with the inverse of the present rate of access. Thus, the data for the parent category repository 182 is compressed data, and is further compressed or recompressed for the selected child category 192, 193. Repositories 192 and 193 may comprise direct children of repository 182, in which the already compressed data may be recompressed.
In another example, repository 190 may comprise a child to parent repository 180 in which the compression engine could compress the data only after it has been uncompressed. Thus, step 342 comprises uncompressing the data compressed in repository 180 in accordance with the uncompress instructions of the extent object 200 for the extent and the algorithm of the repository object 210 for the data compression engine 180. Then, step 343 comprises compressing the uncompressed data in accordance with the compression algorithm of the repository object for the data compression engine 190.
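The difference between a direct child (recompressing already compressed data) and a child that requires uncompression first can be sketched as follows. The codecs `zlib` and `lzma` are assumed stand-ins for parent and child engines, and recording the applied codecs as a list of "steps" is one hypothetical way to represent the uncompress instructions of the extent object.

```python
import lzma
import zlib

# Hypothetical engines: name -> (compress, decompress).
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def uncompress(blob: bytes, steps: list) -> bytes:
    """Undo the recorded compression steps in reverse order."""
    for name in reversed(steps):
        blob = CODECS[name][1](blob)
    return blob


def recompress_direct(blob: bytes, steps: list, child: str):
    """Direct child: compress the already compressed data as it is."""
    return CODECS[child][0](blob), steps + [child]


def recompress_anew(blob: bytes, steps: list, child: str):
    """Non-direct child: uncompress first, then compress anew."""
    raw = uncompress(blob, steps)
    return CODECS[child][0](raw), [child]
```

In the direct case the recorded steps grow by one entry, so a later read must undo both the child and the parent compression; in the other case only the child step remains.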
Upon completion of the compression or recompression of the data in step 343, the process returns to step 303 to update both the extent object 200 with the current repository and the repository history, and the repository objects for the repositories respectively losing and gaining the data extent.
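The metadata update on a move may be illustrated as below. The class and field names are hypothetical; the description states only that the extent object records the current repository and the repository history, and that the repository objects list the extents they hold.

```python
from dataclasses import dataclass, field


@dataclass
class RepositoryObject:
    """Assumed per-repository metadata: the extents it currently holds."""
    name: str
    extents: set = field(default_factory=set)


@dataclass
class ExtentRecord:
    """Assumed per-extent location metadata."""
    extent_id: str
    current_repository: str
    repository_history: list = field(default_factory=list)


def record_move(extent: ExtentRecord, source: RepositoryObject,
                target: RepositoryObject) -> None:
    """Update the extent and both repository objects after a move."""
    if extent.extent_id not in source.extents:
        raise ValueError("extent is not listed in the source repository")
    source.extents.discard(extent.extent_id)   # repository losing the extent
    target.extents.add(extent.extent_id)       # repository gaining the extent
    extent.repository_history.append(extent.current_repository)
    extent.current_repository = target.name
```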
Referring to
Those of skill in the art will understand that changes may be made with respect to the methods discussed above, including changes to the ordering of the steps. Further, those of skill in the art will understand that differing specific component arrangements may be employed than those illustrated herein.
While the preferred embodiments of the present invention have been illustrated in detail, it should be apparent that modifications and adaptations to those embodiments may occur to one skilled in the art without departing from the scope of the present invention as set forth in the following claims.
This application is a Continuation of U.S. patent application Ser. No. 14/220,790, filed on Mar. 20, 2014, which is a Continuation of U.S. patent application Ser. No. 12/574,470, filed on Oct. 6, 2009, now U.S. Pat. No. 8,688,654. Commonly assigned U.S. patent application Ser. No. 12/128,388, filed May 28, 2008, now U.S. Patent Application Publication 2008/0307191, is incorporated for its showing of a data storage system having data compression.
Relation | Number | Date | Country
---|---|---|---
Parent | 14220790 | Mar 2014 | US
Child | 15398859 | | US
Parent | 12574470 | Oct 2009 | US
Child | 14220790 | | US