1. Field of the Invention
The invention relates to data storage cartridges and tape libraries.
2. Background Art
The amount of digital data being created annually is increasing. It has been estimated that 5 EB of digital data were created in 2002, 161 EB of digital data were created in 2006, and 281 EB of digital data were created in 2007. It is projected that at least 1,773 EB of digital data will be created in 2011. Of this vast quantity of data, it is predicted that some 35% (600+ EB) will need to be safely preserved (archived) for ten years or more. This will inevitably result in very substantial costs for both the storage equipment required and the power needed to store the data for extended periods. Simply anticipating that it will be practical, and perhaps even feasible, to store these vast quantities of digital data on rigid disk (HDD) for extended periods is highly problematic.
A simple analysis, based on published data, reveals firstly that even with spinning down archival HDDs to idle mode it will still cost at least a billion dollars per year to store 600 EB of data. Secondly, it will be challenging for the HDD industry to produce sufficient high capacity, enterprise class drives on which to store this data. The cost of these HDDs alone could approach 50 billion dollars. Finally, the irrecoverable read error rate of rigid disk drives is today specified as one error per 10^15 bits read. Hence, without implementing additional data protection schemes such as dual parity RAID or more advanced error correction codes (ECC), with the inevitable increase in data storage overhead, these error rates will potentially result in data corruption during either a RAID re-build, or the necessary migration of data from one HDD sub-assembly to an upgraded system, or even during normal access over the extended lifetimes of the archived data.
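By way of illustration only, the consequence of such an error rate can be estimated with simple arithmetic. The sketch below assumes decimal exabytes, a rate of one irrecoverable error per 10^15 bits read, and a single complete read of a 600 EB archive; the function name is hypothetical and is not part of any described system.

```python
# Illustrative arithmetic only, using the figures quoted in the text.
EXABYTE_BYTES = 10**18    # assumption: decimal (SI) exabyte
BITS_PER_BYTE = 8
BITS_PER_ERROR = 10**15   # one irrecoverable read error per 10^15 bits read


def expected_read_errors(archive_exabytes: int) -> float:
    """Expected irrecoverable errors when the whole archive is read once."""
    bits_read = archive_exabytes * EXABYTE_BYTES * BITS_PER_BYTE
    return bits_read / BITS_PER_ERROR


if __name__ == "__main__":
    print(expected_read_errors(600))
```

Under these assumptions, one pass over the archive would be expected to encounter on the order of 4.8 million irrecoverable read errors, which is why additional protection schemes are needed.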
In contrast, storing vast quantities of archival data on tape storage systems will continue to be the most cost effective, in terms of both cost per TB and power use, and practical long term solution for the foreseeable future. Tape storage areal densities have been growing at greater than 40% compound annual growth rate in recent years and it is today feasible to store many TB of data on a single data cartridge containing some 1,000 m of tape.
However, storing these or greater quantities of data on a single cartridge presents several issues to the archival system. Accessing the data takes time, as each tape load is very time consuming and affects the reliability of the cartridge and tape drive. The speed at which data can be written to and read from a single tape drive is limited by the data rate of that drive, and during this process data stored elsewhere on the cartridge is not available to the host system. Structuring the data, for example through the use of associated metadata, is impractical and requires the use of an external independent file system. Additionally, updating metadata on a sequential access device can be problematic and may require rewriting user data that has not been modified.
In addition to the above problems, there is also a performance issue that needs to be addressed in high performance computing (HPC) environments. Writing a large data set to, or reading it from, a single tape drive takes considerable time, which is highly problematic for the large data sets routinely used in HPC. During this process, data stored elsewhere on the cartridge is not available. In many HPC applications, vast quantities of data must be cached before application computing can start. In these environments, it often takes days, or even weeks, to download the computational data set. The bottleneck is the speed at which a single tape drive can transfer data. Providing the ability to stripe a data set across several cartridges, which could be accessed in parallel, would increase performance in proportion to the number of tape cartridges assigned to the data set. This high performance configuration would be ideal for many HPC applications that now take days to stage data.
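The scaling argument above can be sketched numerically. The example below is a simplified model only: it assumes ideal linear scaling, one drive per cartridge, and ignores load, seek, and robot times. The data set size (500 TB) and per-drive rate (250 MB/s) are hypothetical values chosen for illustration.

```python
def staging_time_hours(dataset_tb: float, drive_mb_per_s: float,
                       cartridges: int) -> float:
    """Idealized time to stage a data set striped across `cartridges` drives."""
    if cartridges < 1:
        raise ValueError("at least one cartridge/drive is required")
    total_mb = dataset_tb * 1_000_000            # decimal TB -> MB
    aggregate_rate = drive_mb_per_s * cartridges  # drives stream in parallel
    return total_mb / aggregate_rate / 3600       # seconds -> hours


if __name__ == "__main__":
    # Hypothetical workload: 500 TB at 250 MB/s per drive.
    for n in (1, 4, 16):
        print(n, "drive(s):", round(staging_time_hours(500, 250, n), 1), "hours")
```

With these assumed figures, a single drive would need roughly 23 days to stage the data set, whereas sixteen drives operating in parallel would need about 35 hours.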
Finally, the need to manage archive data cost effectively requires the ability to have policy driven tiered storage management in which the metadata is stored with the files being archived.
For the foregoing reasons, there is a need for an improved data storage cartridge and tape library.
It is an object of the invention to provide an improved data storage cartridge and tape library.
In one embodiment of the invention, a data storage system is provided. The data storage system comprises a tape cartridge library. The tape cartridge library includes a plurality of storage cells. Each storage cell is configured to store a tape cartridge. The tape cartridge library further includes a plurality of tape drives. Each tape drive is configured to access a tape cartridge when the tape cartridge is received in the tape drive. The data storage system further comprises a plurality of tape cartridges in the tape cartridge library. Each tape cartridge includes a length of tape media and an amount of flash memory.
A robotic tape mover is provided for moving tape cartridges between the plurality of storage cells and the plurality of tape drives. The robotic tape mover may also be used for loading cartridges into the library and positioning them in the correct slots. A flash memory access mechanism such as a serial or parallel electrical connection, wireless connection, or other physical interface is configured in the tape cartridge library to access the flash memory of received cartridges at the plurality of tape drives and to access the flash memory of stored cartridges at the plurality of storage cells. The flash memory access mechanism may be located on an arm of the robotic tape mover.
It is appreciated that the flash memory access mechanism may be configured in a variety of ways. The flash memory access mechanism may be configured to access the flash memory of received cartridges at the plurality of tape drives when a received cartridge is loaded into a tape drive. The flash memory access mechanism may be configured to access the flash memory of stored cartridges at the plurality of storage cells when a stored cartridge is at rest in a storage cell. The flash memory access mechanism may include a wireless access device, or may include a wired access device.
In another embodiment of the invention, a data storage system for use with a plurality of tape cartridges, each tape cartridge including a length of tape media and an amount of flash memory, is provided. The data storage system comprises a tape cartridge library including a plurality of storage cells. Each storage cell is configured to store a tape cartridge. The tape cartridge library further includes a plurality of tape drives. Each tape drive is configured to access a tape cartridge when the tape cartridge is received in the tape drive.
A robotic tape mover is provided for moving tape cartridges between the plurality of storage cells and the plurality of tape drives. A flash memory access mechanism is configured in the tape cartridge library to access the flash memory of a tape cartridge when the tape cartridge is in the tape cartridge library. The flash memory access mechanism may be configured in a variety of ways.
Still further, the invention comprehends a tape cartridge for use in a data storage system. The tape cartridge comprises a housing, a length of tape media contained in the housing for storing data, and an amount of flash memory attached to the housing. An amount of flash memory greater than 1 GB is suitable in some embodiments of the invention.
In one embodiment of the invention, flash memory is embedded in a tape data cartridge to enable significant amounts of metadata to be written and accessed both when the cartridge is at rest in a data storage system (for example, in a tape library) and when the cartridge is loaded into the tape drive. Appropriate connectivity to access the flash memory in both the tape drive and in the storage cell of the library is provided with the tape library. In an alternative, data may be read from or written to the flash memory while a cartridge is being inserted into or removed from the library, or inserted into or removed from a library slot. The flash memory access mechanism may be a serial or parallel electrical connection, wireless connection, or other physical interface. The flash memory access mechanism may be located on the arm of the robotic tape mover. The robotic tape mover moves tape cartridges between the tape drives and the storage cells, and may load cartridges into the library and position them in the correct slots.
It is appreciated that the overall system architecture may vary depending on the implementation. For example, access by the host application to the flash memory may be provided in any suitable way. In addition, the particular connection to the flash memory may take any appropriate form such as, for example, known wireless communication approaches (Wi-Fi) or known wired approaches (USB, SCSI).
With continuing reference to
The inclusion of the flash memory 24 in the data cartridge 20 has many advantages. For example, current performance limitations in HPC environments are addressed by allowing the association of formatting information across multiple data cartridges. This information can then be used to intelligently stripe data across a set of data cartridges, thereby significantly increasing the data rate to and from the library. In this environment, the application will know where all the data is located, both physically and logically, and has access to several GB of metadata and format information for each cartridge. Thus, a set of data cartridges can be simultaneously accessed by a corresponding set of tape drives, each running at up to several hundred MB/s. Hence, the aggregate data rate for the system would easily match the data rate of any foreseeable HPC back-bone.
Business continuity and availability for an archive system is critical to help ensure that any failures in the archive system do not result in loss of data. By intelligently striping the content of a given data set, and providing distributed parity across several independent data cartridges, significant protection against such potential data loss or corruption may be provided. In addition, data cartridges can be very simply and easily removed from the library for transport to a remote facility where, once loaded into the remote system, the entire content of the cartridge metadata can be very quickly accessed. Hence, system level mirroring and replication for long term storage can be very easily accomplished as a background task. This allows search and index engines to use this highly portable metadata in a model that is independent of database, operating or file system limitations associated with storing metadata information on a server.
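One possible way to stripe a data set with distributed parity is sketched below, for illustration only. It assumes single XOR parity rotated across the cartridges (in the manner of RAID 5), fixed-size blocks, and zero padding of the final stripe; these choices, and all function names, are assumptions made for the example and are not prescribed by the embodiments.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_with_parity(data: bytes, n_cartridges: int, block: int):
    """Return one list of blocks per cartridge; parity rotates per stripe."""
    k = n_cartridges - 1                       # data blocks per stripe
    stripe_bytes = k * block
    # Zero-pad so the data fills a whole number of stripes.
    padded = data + bytes(-len(data) % stripe_bytes)
    cartridges = [[] for _ in range(n_cartridges)]
    for s, off in enumerate(range(0, len(padded), stripe_bytes)):
        chunks = [padded[off + i * block: off + (i + 1) * block]
                  for i in range(k)]
        parity_at = s % n_cartridges           # rotate the parity position
        chunks.insert(parity_at, xor_blocks(chunks))
        for cartridge, blk in zip(cartridges, chunks):
            cartridge.append(blk)
    return cartridges


def rebuild(cartridges, lost: int):
    """Recompute every block of one missing cartridge from the survivors."""
    survivors = [c for i, c in enumerate(cartridges) if i != lost]
    return [xor_blocks(row) for row in zip(*survivors)]


if __name__ == "__main__":
    carts = stripe_with_parity(bytes(range(50)), n_cartridges=4, block=8)
    print(rebuild(carts, lost=2) == carts[2])
```

In such a scheme, the stripe layout (block size, cartridge order, parity position) is the kind of formatting information that could be held in each cartridge's flash memory, so that a surviving set of cartridges carries what is needed to reconstruct a missing member.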
The ability to persistently store the metadata associated with the content of a cartridge also greatly facilitates data deduplication. Data deduplication is a method of reducing storage requirements by eliminating redundant data and only storing one unique instance of a data unit (bit, byte or file) on a storage medium such as a tape cartridge. Deduplication technology identifies variable-length blocks of data across various files and file types and then stores unique blocks once, replacing redundant blocks with data pointers. When an incoming data block is a duplicate of something that has already been stored, the block is not stored again. Each portion of ingested data is processed using a hash algorithm which generates a number that is, for practical purposes, unique to that piece of data, and this number is then stored in an index. If a file is updated, only the changed data is saved, thus avoiding the necessity for storing an entirely new file. Although highly efficient in terms of storage capacity, data deduplication can result in very large indexes, creating scalability issues as the data deduplication system grows. In embodiments of the invention, the persistent flash memory embedded in the data cartridge may be utilized to store the relevant indexes for the updated data fragments written in the content of the cartridge. Thus, the host system will be able to simultaneously write deduplicated data to many drives in parallel and keep track of the indices for each cartridge in the entire library while doing this. Data indexing and metadata are also important not only for establishing a mechanism for locating information at a later date, but also for exposing the appropriate content and context for application of the relevant established business data access policies.
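A minimal sketch of a per-cartridge deduplication index follows, for illustration only. It assumes fixed-size blocks rather than the variable-length blocks mentioned above, uses SHA-256 as the hash algorithm, and models the tape and the flash-resident index as in-memory Python structures; the class and attribute names are hypothetical.

```python
import hashlib


class Cartridge:
    """Hypothetical model: `tape` holds unique blocks in write order and
    `flash_index` maps each block digest to its position on tape."""

    def __init__(self):
        self.tape = []         # stands in for the sequential tape media
        self.flash_index = {}  # stands in for the index kept in flash

    def write(self, data: bytes, block: int = 4096):
        """Store data; return the tape positions (pointers) that rebuild it."""
        pointers = []
        for off in range(0, len(data), block):
            chunk = data[off:off + block]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.flash_index:
                # First time this block is seen: append it to tape.
                self.flash_index[digest] = len(self.tape)
                self.tape.append(chunk)
            # Duplicate blocks are recorded as pointers only.
            pointers.append(self.flash_index[digest])
        return pointers

    def read(self, pointers):
        """Reassemble data from a list of tape positions."""
        return b"".join(self.tape[p] for p in pointers)


if __name__ == "__main__":
    c = Cartridge()
    ptrs = c.write(b"A" * 8192 + b"B" * 4096)
    print(ptrs, len(c.tape))
```

Because the index in this model travels with the cartridge, a duplicate check on write requires only a lookup in that cartridge's flash memory and no tape access.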
Policy binding, through the use of metadata stored in the embedded flash memory in each data cartridge, may securely limit the access to the content of each file contained on that data cartridge. Additionally, it will be possible to provide encryption of the content stored on the data cartridge independently from the metadata associated with this content which will be stored in the persistent flash memory in the same data cartridge. Hence, the archival storage system will be able to discern the nature of the content contained on a given data cartridge, but without access to the necessary encryption keys will be unable to read the content of the data. To aid in addressing compliance requirements, an archive system must also prevent unauthorized access, modification, or deletion of documents.
By appropriately configuring the flash memory controller contained in the data cartridge, it will be possible to prevent over-writing, or deletion of the metadata stored on a given data cartridge. In addition, the proposed system will facilitate data protection through the use of write once, read many times (WORM) data cartridges based on both magnetic tape storage and optical tape storage technologies. The use of embedded persistent flash memory may also enable a detailed record of content access to be maintained. This may provide definitive information to the system for audit-logging and documentation purposes. With the significant increase in tape based storage areal data densities recently demonstrated, it will be feasible to shorten the length of the tape in the data cartridge while still providing at least one TB cartridge capacity.
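The write-once and audit-logging behavior described above might be modeled, purely as a sketch, by a metadata store that accepts new records but refuses overwrites and deletions while logging every access. The class below is hypothetical; it does not represent any particular flash memory controller, and its method names and log format are assumptions made for illustration.

```python
import time


class WormMetadataStore:
    """Hypothetical write-once metadata store with an access log."""

    def __init__(self):
        self._records = {}
        self.audit_log = []  # entries: (timestamp, user, action, key)

    def _log(self, user, action, key):
        self.audit_log.append((time.time(), user, action, key))

    def put(self, user, key, value):
        """Write a record once; any later write to the same key is refused."""
        if key in self._records:
            self._log(user, "rejected-overwrite", key)
            raise PermissionError(f"{key!r} already written; overwrite refused")
        self._records[key] = value
        self._log(user, "write", key)

    def get(self, user, key):
        """Read a record and log the access."""
        self._log(user, "read", key)
        return self._records[key]

    def delete(self, user, key):
        """Deletion is always refused, and the attempt is logged."""
        self._log(user, "rejected-delete", key)
        raise PermissionError("deletion is not permitted")


if __name__ == "__main__":
    store = WormMetadataStore()
    store.put("alice", "file1", {"retain_years": 10})
    print(store.get("bob", "file1"))
```

In this model, rejected operations are logged as well as successful ones, so the log records attempted modifications in addition to ordinary reads and writes.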
The need to manage archive data cost effectively also requires the ability to have policy-driven tiered storage management in which the metadata is stored with the files being archived. Embodiments of the invention provide the ability to update metadata without tape access, and have the metadata physically stored with the tape cartridge.
Advantageously, using such an approach, a sizeable (many TB) flash cache is now available to the file system which can use it to intelligently and efficiently drain the file content to the tape archive medium according to established archive policies.
In yet another advantage, embodiments of the invention may provide standardization of an open format for both the physical and logical interfaces of the cartridge, together with backward read capability over several generations of data cartridges which may enable, and protect, the archival nature of the stored data. This will also facilitate any transition to new storage devices and technologies as they become available.
In some embodiments of the invention, the library may become a very large, fast access, intelligent storage repository, which can be flexibly expanded and provisioned as necessary (by simply adding more cartridge slots). For example, embodiments of the invention may be employed in a data storage system that utilizes an object based, parallel file system.
While embodiments of the invention have been illustrated and described, it is not intended that these embodiments illustrate and describe all possible forms of the invention. Rather, the words used in the specification are words of description rather than limitation, and it is understood that various changes may be made without departing from the spirit and scope of the invention.