The present invention relates to storage and transfer of data in a virtual tape library environment. More specifically, the invention is a method and system for compressing data in a virtual tape library to conserve disk space and for computing an estimated compression ratio to provide a one-to-one correspondence in size between virtual and physical tapes.
Virtual tape libraries emulate physical tape libraries to more efficiently handle backup data. For those occasions when users need to generate physical tapes for off-site storage or data interchange, for example, the desired data must be written from the virtual tapes (which at least initially contain the data) to physical tapes. Unlike in mainframe virtual tape systems that stack multiple virtual tapes onto a single physical tape, a one-to-one correspondence in size between virtual and physical tapes is desirable in the open systems world.
The majority of modern physical tape drives perform data compression before data is stored on the physical tape media. The amount of data that fits on a tape depends not only on the data itself, but also on the physical tape's storage capacity and the tape drive's compression algorithm. These dependencies make it impossible to statically select an appropriate data capacity for the virtual tape a priori. As an illustration, if a virtual tape is fixed to 20 GB because the physical tape is 20 GB, a large portion of the physical tape may not be used once the data is transferred from the virtual tape to the physical tape. This is because the tape drive might compress the data down to 10 GB, for example. If, on the other hand, a 30 GB virtual tape is created for a 20 GB physical tape in an attempt to account for data compression and use the physical media more efficiently, it is possible that incompressible (random) data is written to that virtual tape. In this case, when the virtual tape is exported onto the physical tape, only the first 20 GB will fit on the physical media. While the first option is preferable, clearly neither choice is satisfactory. Therefore, it would be desirable to dynamically ensure that the amount of data written to each virtual tape is large enough not to waste physical resources while being small enough not to exceed the capacity of the physical tape.
Additionally, while physical tape drives typically compress data before writing it to tape, existing virtual tape libraries do not include this feature. This is mostly due to the great amount of processing power that is required to compress high-bandwidth data streams in real-time. It would, therefore, also be desirable to store data compactly on random access media, either in real-time (as data is written to the virtual device) or at a later time when more processing power is available. In either case, this is preferable because it keeps the footprint of virtual tapes low and thus saves comparatively expensive random access storage space.
A need therefore exists for a method and system for compressing data written to a virtual tape library (in real-time or otherwise) for efficient storage thereof and for computing an estimated compression ratio in real time to dynamically provide a one-to-one correspondence in size between virtual tapes and physical tapes.
The invention is a method and system for compressing data in a virtual tape library (VTL) and for dynamically computing an estimated compression ratio. With respect to compression, data written to the VTL may be compressed for efficient storage within the virtual tape library. That is, the method and system of the present invention keeps the footprint of the virtual tapes low to conserve random access storage space by storing data compactly on random access media, either in real-time as data is being written to a virtual device or later when additional processing power is available.
With respect to the writing of data itself, an estimated compression ratio is dynamically computed so that when data is written to a VTL there is a one-to-one correspondence in size between virtual and physical tapes. The compression ratio is estimated in real time, taking into account how the data will be compressed by a corresponding physical tape library. That is, the method and system of the present invention may dynamically (i.e. while data is being written) adjust the virtual tape size and return an “End-Of-Tape” (EOT) signal to the data protection application (DPA) once the VTL determines that the physical tape would be full if the data were to be exported. The compression and estimation aspects of the invention may be implemented together or independently, as desired.
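By way of illustration only, the EOT decision described above might be sketched as follows. The class name `VirtualTape`, its fields, and the rule of dividing the uncompressed bytes written by the estimated ratio are assumptions made for this sketch and are not taken from the specification.

```python
from dataclasses import dataclass

GB = 10**9  # decimal gigabyte, used only for the example figures


@dataclass
class VirtualTape:
    """Hypothetical virtual tape sized against one physical tape."""
    native_capacity_bytes: int  # uncompressed capacity of the physical tape
    bytes_written: int = 0      # uncompressed bytes received so far

    def write(self, record_len: int, estimated_ratio: float) -> bool:
        """Account for one record and report whether EOT should be signalled.

        estimated_ratio is the predicted drive compression ratio,
        e.g. 2.0 for 2:1. It must be positive.
        """
        if estimated_ratio <= 0:
            raise ValueError("estimated_ratio must be positive")
        self.bytes_written += record_len
        # Projected space the data would occupy on the physical tape.
        projected = self.bytes_written / estimated_ratio
        return projected >= self.native_capacity_bytes
```

With a 20 GB tape and an estimated ratio of 5:1, this sketch signals EOT only once 100 GB have been written, whereas with a ratio of 1:1 it signals EOT at 20 GB.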
In one embodiment, data stored in a VTL is decompressed prior to being exported. In another embodiment, the compression feature of a corresponding physical tape drive is disabled and data is exported in compressed format.
Referring initially to
Where the data is compressed asynchronously (i.e. where step 14 is negative), the data written to the VTL is saved on disk and may be compressed later (i.e. outside of the backup window), as desired. A combination of asynchronous and real-time compression may also be implemented. For example, incoming data can be compressed in real-time while enough processing power is available and stored uncompressed during periods where processing resources are being utilized for other functions. Even a 100% post-compression scheme typically works very well in this environment because typical backup configurations take advantage of a backup window at night. The remaining time can be spent compressing the data without interfering with regular business processes. This ensures that a suitable amount of free disk space is available to fit the backup data of the following day.
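By way of illustration only, a combined real-time and asynchronous scheme of this kind might be sketched as follows. The function names, the `cpu_load` input, the 0.8 threshold, and the use of zlib are all assumptions of the sketch; the specification does not prescribe how available processing power is measured or which compressor is used.

```python
import zlib
from typing import List, Tuple

# Each stored record is (payload, is_compressed).
Record = Tuple[bytes, bool]


def store_record(data: bytes, cpu_load: float,
                 load_threshold: float = 0.8) -> Record:
    """Compress in real time when load permits; otherwise store raw."""
    if cpu_load < load_threshold:
        return zlib.compress(data), True
    return data, False


def compress_backlog(store: List[Record]) -> List[Record]:
    """Post-processing pass, e.g. outside the backup window.

    Records that were stored raw are compressed; records that were
    already compressed in real time are left untouched.
    """
    return [
        (payload, True) if done else (zlib.compress(payload), True)
        for payload, done in store
    ]
```

Setting `load_threshold` to 0 in this sketch yields the 100% post-compression scheme, since no record is then compressed on arrival.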
Regardless of whether data written to the VTL is compressed asynchronously (step 18) or in real-time (step 16) or a combination of the two, the method 10 preferably proceeds to step 20. In step 20, data is written to the VTL and compressed in accordance with the decision made at step 14.
Where the compression portion of the invention (i.e. steps 14, 16, and 18) is being implemented independently (i.e. without compression estimation), it should be noted that the method shown in
Continuing with
Computing the estimated compression ratio varies depending on whether the data stream being written to the VTL is compressed in real time or asynchronously. Where data written to the VTL is compressed asynchronously, the compression ratio may be estimated by taking random samples of the data written to the VTL and compressing them according to the compression algorithm of the particular tape drive(s) with which the VTL corresponds or is otherwise associated. Where the VTL is associated with more than one tape drive, each tape drive's respective compression algorithm could potentially be used for dynamically computing an estimated compression ratio for data written to any of the tape drives. Alternatively, multiple algorithms could be run and the lowest estimate could be used.
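By way of illustration only, running several algorithms over the same samples and keeping the lowest estimate might be sketched as follows. The general-purpose compressors zlib, bz2, and lzma are used here purely as stand-ins; an actual tape drive's algorithm would differ, and the names `COMPRESSORS`, `ratio`, and `conservative_ratio` are hypothetical.

```python
import bz2
import lzma
import zlib
from typing import Callable, Dict, List

# Stand-ins for the compression algorithms of the associated tape drives.
COMPRESSORS: Dict[str, Callable[[bytes], bytes]] = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def ratio(samples: List[bytes], compress: Callable[[bytes], bytes]) -> float:
    """Uncompressed bytes divided by compressed bytes over all samples."""
    raw = sum(len(s) for s in samples)
    packed = sum(len(compress(s)) for s in samples)
    return raw / packed


def conservative_ratio(samples: List[bytes]) -> float:
    """Lowest ratio achieved by any of the candidate algorithms."""
    if not samples:
        raise ValueError("at least one sample is required")
    return min(ratio(samples, c) for c in COMPRESSORS.values())
```

Taking the minimum is the cautious choice in this sketch: the virtual tape is then sized for the drive that would compress the sampled data least.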
More specifically,
Once the deviation in the compression ratios has been measured, it is determined in step 58 whether the deviation between the two measures is below a first predetermined value. If yes, the method 50 proceeds to step 60 wherein the frequency of the random samples that are compressed is reduced (unless it is already at a predefined minimum sample frequency). If no, it is determined in step 62 whether the deviation is above a second predetermined value. If yes, the method 50 proceeds to step 64 wherein the frequency of the random samples that are compressed is increased (up to a maximum sample frequency, which could potentially be equivalent to compressing all data). If no, the deviation is within an acceptable range and there is no need to change the frequency of the random samples. In this case, the method 50 proceeds from step 62 to step 66, which simply maintains the current frequency at which the random samples are compressed. Typically, it is not necessary to change the sample frequency because these sequences converge rapidly. However, if it does become necessary to change the sample frequency, it is important to also adjust the weighting of the samples accordingly in order to avoid skewing the results. The overall goal of these procedures is to make the error bound on the estimated average compression ratio as small as possible.
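By way of illustration only, the decisions of steps 58 through 66 might be sketched as follows. The threshold values, the halving and doubling factor, and the frequency limits are assumptions chosen for the example, as is the inverse-frequency weighting, which is merely one possible way to keep samples taken at different frequencies from skewing the average.

```python
def adjust_sample_frequency(deviation: float, frequency: float,
                            low: float = 0.05, high: float = 0.20,
                            min_freq: float = 0.001, max_freq: float = 1.0,
                            factor: float = 2.0) -> float:
    """Return the sample frequency to use after observing a deviation.

    low and high play the role of the first and second predetermined
    values; max_freq = 1.0 corresponds to compressing all data.
    """
    if deviation < low:
        # Estimates agree closely: sample less often, down to the minimum.
        return max(min_freq, frequency / factor)
    if deviation > high:
        # Estimates disagree: sample more often, up to the maximum.
        return min(max_freq, frequency * factor)
    # Deviation within the acceptable range: keep the current frequency.
    return frequency


def sample_weight(frequency: float) -> float:
    """Assumed weighting: each sample stands for 1/frequency records."""
    return 1.0 / frequency
```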
It is important to note that taking samples at strictly fixed intervals is not recommended. In practice, it is preferable to use the sample frequency only as a guideline and to randomize the samples around that guideline. For example, before each record is written: (1) generate a random number in the interval [0,1]; and (2) if the random number is equal to or below the sample frequency (for example, 1% = 1/100 = 0.01), use the record as a sample by compressing it; otherwise, do not take the sample. This ensures that the samples are independent. In particular, it becomes possible for two consecutive data blocks to be sampled, even if the sample frequency is only 1/100.
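By way of illustration only, the two-step randomized selection described above might be sketched as follows. The function name `should_sample` and the use of a seeded `random.Random` instance are assumptions of the sketch; note that Python's `random()` draws from [0, 1) rather than the closed interval [0, 1].

```python
import random


def should_sample(sample_frequency: float, rng: random.Random) -> bool:
    """Decide independently, per record, whether to take a sample."""
    # Step (1): draw a random number; step (2): compare with the frequency.
    return rng.random() <= sample_frequency


# Example: at a 1% sample frequency, roughly 1 record in 100 is sampled,
# but the gaps between samples vary and consecutive samples can occur.
rng = random.Random(0)
picks = [should_sample(0.01, rng) for _ in range(100_000)]
sampled_count = sum(picks)
```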
From steps 60, 64, and 66, the method 50 proceeds to step 68. In step 68, the average compression ratio is estimated (along with a confidence interval, etc) for use in step 22 of
If, referring again to
It is possible, however, that the compression ratios achieved using the VTL's compression algorithms may differ from the compression ratios achieved using the tape drive's compression algorithm. This may be due to a difference in the algorithms themselves or simply a result of variations arising from implementing similar algorithms in different environments (e.g. using different dictionary sizes). Therefore, this approach should be implemented conservatively. The fact that a VTL compressed a certain data set at a 2.3:1 compression ratio does not mean that the physical tape drive, using a different compression algorithm, will achieve exactly the same compression ratio with the same data. Although the two ratios are likely to be close, it is important to use a more conservative estimate when using this approach; that is, it is better to under-estimate the compression ratio than to over-estimate it. It is generally more desirable to err on the side of not using all of a physical tape's capacity than to introduce the possibility that the virtual tape does not fit on the corresponding physical tape, because the latter case needs to be dealt with manually or at least introduces an additional layer of indirection to the restore process.
For example, if all data is compressed in real-time and a 2.3:1 compression ratio is achieved, an estimate of 2.1:1 may be sufficient. Using this lower estimate ensures that the physical tape is used relatively efficiently, while considerably reducing the probability that the virtual tape will not fit on the physical tape upon export. Statistical methods can be used to make this probability bound arbitrarily small. Other methods for estimating the compressibility of data may also be used, as desired.
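By way of illustration only, one statistical way to derive such a lower estimate is to subtract a multiple of the standard error from the mean of the observed ratios. This particular bound, the default multiplier `z = 2.0`, and the floor at 1.0 are assumptions of the sketch rather than a method stated in the specification.

```python
import math
import statistics
from typing import Sequence


def conservative_estimate(ratios: Sequence[float], z: float = 2.0) -> float:
    """Lower bound on the mean compression ratio.

    ratios holds observed per-sample ratios (e.g. 2.3 for 2.3:1).
    A larger z lowers the estimate and so reduces the chance that the
    virtual tape fails to fit on the physical tape.
    """
    if len(ratios) < 2:
        raise ValueError("at least two observations are required")
    mean = statistics.mean(ratios)
    std_err = statistics.stdev(ratios) / math.sqrt(len(ratios))
    # Assumed floor: never predict better than incompressible data.
    return max(1.0, mean - z * std_err)
```

For observed ratios clustered around 2.3:1, this sketch returns a value slightly below 2.3, and increasing `z` widens the safety margin.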
Once the estimated compression ratio is computed, an “End-Of-Tape” (EOT) signal is provided in step 24 to provide a one-to-one correspondence in size between physical and virtual tapes. The EOT signal is provided based on the estimated compression ratio independently of any compression being implemented by the VTL. It is important to note that the compression ratio is computed dynamically so that the EOT signal may be sent back to the DPA in real-time (i.e. while the DPA is writing data) before the end of a physical tape would be reached if the data were exported from the virtual tape library to the physical tape. For example, assuming a 20 GB physical tape, the EOT signal may be provided after 20 GB of uncompressible data is written to the VTL or after 100 GB of very compressible data is written. In the latter case, the VTL predicts a very high compression ratio for the given data and the algorithm of the tape drive and therefore estimates that five times more data will fit on the physical tape in a compressed format than in an uncompressed format. If this prediction were not performed, the EOT signal would be sent after only 20 GB of data transfer (i.e. once the native capacity of the virtual tape is reached). Consequently, 80% of the physical tape would be wasted. This makes the importance of compression estimation quite apparent and is why it is preferable to implement the compression-ratio-estimation embodiment together with compression as shown in
When data is exported from a VTL to a physical tape drive, the data compressed by the VTL may either be decompressed and then exported to the physical drive or be exported directly to the physical drive in compressed format. Where the data is decompressed prior to being exported, the data is read (i.e. decompressed) by the VTL before being exported to the physical tape drive. The physical tape drive then compresses the data prior to writing it to a physical tape. This approach is utilized where the VTL does not utilize exactly the same compression algorithm as the tape drive. That is, in such situations, decompression of the data by the VTL is necessary because the data has not been compressed in a format which can be decompressed by the physical tape drive. Where data is exported in compressed format, the compression feature of the tape drive is disabled and the compression algorithm implemented by the VTL is exactly the same as the compression algorithm implemented by the physical tape drive. That is, the VTL does not just utilize a physical tape drive's compression algorithm to achieve a similar compression result as previously described, but actually uses exactly the same implementation and format as if the algorithm were performed by the tape drive itself. This ensures that data compressed in the VTL may be read (i.e. decompressed) by the physical tape drive when the compressed data is exported directly thereto. In this case, the compression ratio achieved by the VTL may be used as the estimated compression ratio. This approach provides the benefit of removing a compression/decompression cycle from the workflow, but, as mentioned, requires the VTL to implement exactly the compression algorithm of the physical tape drive.
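By way of illustration only, the choice between the two export paths might be sketched as follows. The format identifiers, the `drive_compression_can_be_disabled` flag, and the names `ExportMode` and `plan_export` are hypothetical and serve only to make the decision explicit.

```python
from enum import Enum
from typing import Tuple


class ExportMode(Enum):
    DECOMPRESS_THEN_EXPORT = "decompress"
    EXPORT_COMPRESSED = "compressed"


def plan_export(vtl_format: str, drive_format: str,
                drive_compression_can_be_disabled: bool
                ) -> Tuple[ExportMode, bool]:
    """Return (export mode, whether drive compression stays enabled)."""
    same_format = vtl_format == drive_format
    if same_format and drive_compression_can_be_disabled:
        # The drive can read the VTL's output directly, so the
        # compression/decompression cycle is skipped.
        return ExportMode.EXPORT_COMPRESSED, False
    # Otherwise the VTL decompresses and the drive compresses again.
    return ExportMode.DECOMPRESS_THEN_EXPORT, True
```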
Referring now to
A compression estimator/real-time compressor 110 is functionally disposed between the tape emulation 108 and VTL 102, or more specifically the VTL's 102 random access disks that hold virtual tapes. As previously explained, data written to the VTL 102 may be compressed in real-time or asynchronously to optimize the use of disk space within the VTL 102. A processor 112 is provided for dynamically computing an estimated compression ratio as explained herein.
The estimated compression ratio of the physical tape drive enables data stored on virtual tapes 102a . . . 102n to be written to their corresponding physical tapes 104a . . . 104n with a one-to-one correspondence in size. For example, if the tape drive has an estimated compression ratio of 2:1 for a given data set and the native (uncompressed) storage capacity of a given tape 104a is 20 Gigabytes, the effective storage capacity of tape 104a is 40 Gigabytes for this data. The compression ratio estimated by the processor 112 either based on compression performed by the VTL's 102 compression estimator/real-time compressor 110 or based on method 50 in
The compression estimator/real-time compressor 110 may perform real-time compression as long as there are sufficient resources. Data written to the VTL 102, however, may also be compressed asynchronously. To perform asynchronous compression, the VTL 102 includes a back-end asynchronous compression agent 114. If the incoming data stream is not being compressed in real-time, the agent 114 may compress the stored data at any later time, as desired. The compression of data within the VTL 102 maximizes storage capacity within the VTL 102. If a physical tape needs to be created, the data resident on the corresponding virtual tape is decompressed and exported to the physical tape. The data may also be exported to a physical tape drive in compressed format where the VTL implements exactly the compression algorithm of the physical tape drive and the physical tape drive's compression feature is disabled.
Although the present invention has been described in detail, it is to be understood that the invention is not limited thereto, and that various changes can be made therein while remaining within scope of the invention, which is defined by the attached claims.
Number | Date | Country |
---|---|---|
20040230724 A1 | Nov 2004 | US |