The present invention relates generally to the field of storage systems. More particularly, the present invention relates to increasing the fault tolerance of RAID storage systems.
Storage systems are used to store data. The amount of data being stored by storage systems is increasing rapidly. To cope with this growth, storage systems combine a large number of independent disk drives. These disk drives are organized as a Redundant Array of Independent Disks (RAID).
RAID storage systems can store large amounts of data and, to do so, use a number of disk drives. Each disk drive has a finite service life; the failure of a drive can be defined as its inability to store and retrieve data reliably. The failure of any one drive in a RAID system will result in the failure of the RAID storage system as a whole; because RAID systems use data redundancy, however, the resulting loss of data is avoided. The probability of the failure of such a RAID system can be quite high, because it is approximately the sum of the probabilities of failure of all the individual disk drives in the system.
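For N independent drives, where drive i fails within a given period with probability p_i, the exact relationship behind this statement is the following; the sum quoted above is the standard approximation for small p_i:

```latex
P_{\text{system fail}} \;=\; 1 - \prod_{i=1}^{N}\left(1 - p_i\right) \;\approx\; \sum_{i=1}^{N} p_i \qquad (p_i \ll 1)
```

For illustration, a system of 100 drives, each with an annual failure probability of 0.01, fails with probability 1 - 0.99^100, or about 63%, in a year.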
Since the probability of the failure of a RAID storage system is a function of the number of disk drives in the system, many RAID storage systems are organized into a number of smaller RAID sets. Each RAID set comprises a small number of disk drives. If one disk drive in a RAID set fails, it does not cause the loss of availability of data in the RAID storage system.
RAID storage systems support fault tolerance to disk drive failures and therefore prevent loss of data in the case of a disk drive failure. Fault tolerance is provided either by mirroring data onto a mirrored disk drive, or by using one or more parity disk drives to store parity information for the data stored on the other disk drives in the RAID set. In the event of the failure of a disk drive, the mirrored disk drive is used to restore the lost data, or the parity disk drive is used to regenerate the lost data by Exclusive-ORing the data on the remaining drives in the RAID set. When a disk drive in a RAID set fails, the RAID set goes critical. A critical RAID set does not itself cause loss of data, but data will be lost if another disk drive in the critical RAID set fails.
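As an illustration of parity-based regeneration, the following minimal Python sketch reconstructs the contents of a lost drive in a single-parity RAID set by Exclusive-ORing the corresponding blocks of the remaining drives. The function and variable names are purely illustrative and are not part of the claimed apparatus.

```python
from functools import reduce

def xor_blocks(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length blocks byte by byte."""
    return bytes(x ^ y for x, y in zip(a, b))

def regenerate_lost_drive(remaining: list[bytes]) -> bytes:
    """Regenerate a lost drive's block from the remaining data and
    parity blocks of a single-parity RAID set. Because parity is the
    XOR of all data blocks, XORing everything that survives yields
    exactly the missing block."""
    return reduce(xor_blocks, remaining)

# Example: three data drives and one parity drive.
d0, d1, d2 = b"\x0f\x0f", b"\xf0\xf0", b"\xaa\xaa"
parity = reduce(xor_blocks, [d0, d1, d2])
# If d1 is lost, XORing the remaining blocks recovers it.
assert regenerate_lost_drive([d0, d2, parity]) == d1
```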
One approach to increasing fault tolerance in RAID storage systems is to provide an additional parity drive in each RAID set. If one drive in a RAID set fails, the RAID set does not become critical, and the additional parity drive can be used to reconstruct data. Another approach to increasing fault tolerance is to mirror the entire RAID set. However, these approaches suffer from increased drive overhead due to multiple writes of the same data. Another disadvantage is the decreased usable or effective storage capacity, defined as the ratio of the number of drives used for user data to the total number of drives in the RAID system.
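To make the capacity cost concrete, consider the ratio defined above for an eight-drive RAID set with one parity drive (figures chosen for illustration):

```latex
\text{effective capacity} \;=\; \frac{N_{\text{data}}}{N_{\text{total}}} \;=\; \frac{7}{8} = 87.5\%
```

Adding a second parity drive lowers the ratio to 6/8 = 75%, and mirroring the entire set lowers it to 7/16, or about 43.8%.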
In order to increase fault tolerance to multiple drive failures and increase data availability, RAID storage systems migrate data from a failing disk drive to a spare disk drive before the disk drive fails completely. One such system is described in U.S. Pat. No. 6,598,174, titled “Method and Apparatus for Storage Unit Replacement in Non-redundant Array”, assigned to Dell Products L.P. This patent describes a storage system in which data from disk drives that are about to fail is migrated onto a spare disk drive, using an intermediate disk drive. The system is applicable to a non-redundant array, such as a RAID 0 configuration, and uses the Self-Monitoring, Analysis and Reporting Technology (SMART) provided with disk drives to predict drive failure. A description of SMART can be found in the paper titled “Improved Disk-Drive Failure Warnings” by Hughes et al., published in IEEE Transactions on Reliability, September 2002, pages 350-357.
Another system that employs data migration before drive failure to increase fault tolerance is described in U.S. Pat. No. 5,727,144, titled “Failure Prediction for Disk Arrays”, assigned to International Business Machines Corporation. This patent describes a system that copies data from a failing disk drive to a spare disk drive. If the disk drive fails before all the data has been copied, the system uses RAID regeneration techniques to reconstruct the lost data.
However, the systems described above do not entirely solve the problem of maintaining fault tolerance in the case of multiple drive failures in a RAID set. The spare drives, which are used to replace failed disk drives, are kept in a power-on condition until required. This reduces the expected service life of a spare disk drive, making it susceptible to failure and increasing the system's vulnerability to data loss. These systems use the SMART feature of disk drives only to predict drive failure, not to extend the service life of the drives. From the foregoing discussion, it is clear that there is a need for a system that increases fault tolerance, and the resulting data availability, in RAID storage systems. The system should be able to predict the failure of a disk drive using multiple sources, so that it can reduce the possibility of RAID sets becoming critical. The system should provide a high ratio of usable to total RAID storage capacity. It should also be able to efficiently manage power to the spare disk drives that are used to replace failing disk drives in a RAID storage system.
An object of the present invention is to increase fault tolerance and the resulting data availability of storage systems, by proactively replacing disk drives before their failure.
Another object of the present invention is to increase the ratio of the usable storage capacity to the total storage capacity of a storage system by powering on a spare disk drive only after a disk drive has been identified as failing.
Yet another object of the present invention is to proactively monitor drive attributes, such as those reported by SMART, and environmental sensor data.
The present invention is directed towards an apparatus and method for increasing the fault tolerance of RAID storage systems. The present invention is embodied within a storage controller of a RAID storage system. The apparatus comprises a first set of disk drives that are constantly monitored to identify failing disk drives; a second set of disk drives that are used to replace failing disk drives; a processing unit that identifies failing disk drives and replaces them with disk drives selected from the second set; and a memory unit that stores drive attributes obtained from the disk drives, together with sensor data. The processing unit further comprises a drive replacement logic unit and a drive control unit. The drive replacement logic unit identifies a failing disk drive from the first set of disk drives, based on the drive attributes stored in the memory unit, and initiates drive replacement. The drive control unit powers on a second disk drive, selected from the second set of disk drives, to replace the failing disk drive.
The second disk drive that is selected to replace a failing disk drive is not powered on until drive replacement is initiated. Data is copied from the failing disk drive to the second disk drive. Once all data is copied, the failing disk drive can be powered off and marked for replacement.
The present invention increases the ratio of usable storage capacity to the total storage capacity of the storage system, because the spare disk drives are not powered on and do not form part of the storage system until replacement is initiated. Additionally, this increases the service life of the spare disk drives, since they remain powered off until they are added to the RAID system. It also reduces the power consumption of the storage system. Since data is copied directly from a failing disk drive to a second disk drive, the performance overhead of regenerating data using RAID parity techniques is also reduced.
The preferred embodiments of the invention will hereinafter be described in conjunction with the appended drawings, provided to illustrate and not to limit the invention, wherein like designations denote like elements, and in which:
For the sake of convenience, the terms used to describe the various embodiments are defined below. It should be noted that these definitions are provided to merely aid the understanding of the description, and that they in no way limit the scope of the invention.
RAID—RAID is a storage architecture that enables high-capacity, high-speed data transfer at a low cost. A description of RAID can be found in the paper titled “A Case for Redundant Arrays of Inexpensive Disks (RAID)” by Patterson et al., Proceedings of the ACM SIGMOD International Conference on Management of Data (1988), pages 109-116.
Power-on State—In this state, power is supplied to a device. The device may not be in use, but it is consuming power. In the case of disk drives, a drive in the power-on state is continuously spinning, whether or not data is being read from or written to it.
Power-off State—In this state, power is not supplied to a device and the device is in an inactive state. In the case of disk drives, no power is supplied to a drive in the power-off state.
Spare Drive—A spare drive is a disk drive that is not currently used for any data read/write operations and is kept to replace a disk drive that has failed or has been predicted to fail. It may be in a power-on or a power-off state.
The disclosed invention is directed to a method and system for achieving fault tolerance in a storage system, by the replacement of failing disk drives. The replacement is carried out before a disk drive in the system completely fails. Conditions leading to the failure of a disk drive are detected, in order to carry out the replacement of the disk drive before its failure.
Command/data router 212 serves as an interface between processing unit 208 and active disk drives 216 and spare disk drives 218. This interface may take the form of a switch or a bus interconnect. Command/data router 212 routes I/O requests, and the data to be written, to the disk drive 108 specified by processing unit 208. In this way, command/data router 212 connects a plurality of disk drives 108 to a plurality of data-processing systems 102.
Drive replacement logic unit 302 determines whether one of the active disk drives 216 is going to fail. This decision is based on a number of factors, such as drive health statistics and the number of hours the drives have been in use. Drive health statistics include drive temperature, vibration, the number of remapped sectors, error counts, access time to data, data throughput and read/write errors. Storage system 104 uses sensors 306 to monitor temperature and vibration in the vicinity of active disk drives 216 and spare disk drives 218. Sensors 306 also include indirect indicators, such as a sensor that monitors the operating status of the cooling fans, which reflects the temperature of active disk drives 216 and spare disk drives 218. It should be apparent to one skilled in the art that other means of obtaining disk drive health statistics are also possible and do not limit the scope of the invention.

A drive attributes unit 308 scans data from sensors 306 continually or periodically. In an embodiment of the present invention, drive attributes unit 308 also scans drive health statistics using the hard-drive industry standard Self-Monitoring, Analysis and Reporting Technology (SMART), which is integrated in active disk drives 216 and spare disk drives 218. A failure profiles unit 310 keeps track of the expected failure rates and failure profiles of active disk drives 216 and spare disk drives 218. The expected failure rates and failure profiles determine the time to failure of active disk drives 216. They are calculated from attributes that include the number of power-on hours, the predicted mean time to failure (MTTF), the temperature of active disk drives 216 and the number of start-stops of active disk drives 216. It will be apparent to one skilled in the art that other attributes can also be used to calculate expected failure rates and failure profiles, without deviating from the scope of the invention.

Threshold unit 312 stores data relating to the threshold limits of active disk drives 216 and spare disk drives 218. This data includes drive temperature thresholds, limits for error counts, limits for data throughput rates, limits on access time to data, and so on. The threshold values can change with time and with the operation of active disk drives 216 and spare disk drives 218. For example, if a drive operates at an elevated temperature that is still below its threshold limit, the MTTF for that particular drive is reduced from that expected at a lower temperature, because operation at elevated temperatures increases the probability of the drive failing even at temperatures below the threshold limit.
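A minimal Python sketch of how drive replacement logic unit 302 might combine these inputs is given below. All attribute names, threshold values and the life-fraction cutoff are illustrative assumptions, not values prescribed by the invention.

```python
from dataclasses import dataclass

@dataclass
class DriveAttributes:
    """Snapshot of one drive's health data, e.g. as gathered by drive
    attributes unit 308 from SMART and from nearby sensors."""
    temperature_c: float
    remapped_sectors: int
    read_write_errors: int
    power_on_hours: int

@dataclass
class Thresholds:
    """Per-drive limits, analogous to the data held by threshold
    unit 312 (values here are placeholders)."""
    max_temperature_c: float = 55.0
    max_remapped_sectors: int = 100
    max_read_write_errors: int = 50

def is_drive_failing(attrs: DriveAttributes,
                     limits: Thresholds,
                     expected_mttf_hours: float) -> bool:
    """Return True if any monitored attribute has crossed its limit,
    or if the drive has consumed most of its expected service life."""
    if attrs.temperature_c > limits.max_temperature_c:
        return True
    if attrs.remapped_sectors > limits.max_remapped_sectors:
        return True
    if attrs.read_write_errors > limits.max_read_write_errors:
        return True
    # The 0.9 life fraction is an arbitrary illustrative cutoff.
    return attrs.power_on_hours > 0.9 * expected_mttf_hours
```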
Drive replacement logic unit 302 uses the information provided by drive attributes unit 308, failure profiles unit 310 and threshold unit 312 to determine whether a drive is nearing failure and needs replacement before it actually fails. Drive replacement logic unit 302 then signals the drive control unit to power on a spare drive, copy the data from the failing drive to the spare drive, and replace the failing drive.
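The replacement sequence itself could be sketched as follows; the drive objects and their power_on(), power_off(), read_block(), write_block() and block_count() operations are hypothetical stand-ins for the commands issued by the drive control unit.

```python
def replace_failing_drive(failing, spares):
    """Sketch of the replacement flow: power on a spare that has been
    held in the power-off state, copy all data across, then retire
    the failing drive."""
    spare = spares.pop()   # select a powered-off spare drive
    spare.power_on()       # the spare consumes no power until this point
    for block in range(failing.block_count()):
        spare.write_block(block, failing.read_block(block))
    failing.power_off()    # the drive is now marked for physical removal
    return spare
```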
In an embodiment of the present invention, active disk drives 216 and spare disk drives 218 are arranged to form RAID sets or arrays.
In order to replace a failing disk drive, data has to be copied from the failing disk drive to a spare disk drive.
During the process of replacing active disk drive 404a with spare disk drive 406a, data write or read requests may be directed to active disk drive 404a. These requests must be serviced while keeping the data being copied to spare disk drive 406a consistent.
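One way to keep the copy consistent while the array remains online is to mirror writes to both drives for the duration of the copy. A hedged sketch, using the same hypothetical drive operations as above (synchronization with the background copy task is elided):

```python
def write_during_replacement(block, data, failing, spare, copied_upto):
    """Writes go to the failing drive and, for any block the
    background copy has already processed (index < copied_upto), to
    the spare as well, so the spare never holds stale data. Blocks at
    or beyond copied_upto will be picked up by the copy itself."""
    failing.write_block(block, data)
    if block < copied_upto:
        spare.write_block(block, data)

def read_during_replacement(block, failing):
    """Reads continue to be served from the failing drive until the
    copy completes and the spare takes over."""
    return failing.read_block(block)
```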
In one embodiment, the present invention is implemented in a power-managed RAID. Power-managed RAID is described in co-pending U.S. Patent Publication No. 20040054939, published on Mar. 18, 2004, titled “Method and Apparatus for Power-efficient High-capacity Scalable Storage System”, assigned to Copan Systems, Inc., which is incorporated herein by reference. Disk drives 108 are power managed, meaning that they are switched on only when data read/write requests are directed to them. When such a power-managed disk drive is predicted to fail, it is powered on along with the selected spare disk drive that will replace it, and data is copied from the power-managed disk drive to the selected spare disk drive. The spare disk drive is also power managed; if no read/write requests are directed to it for a long time, it is powered off.
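A sketch of the power-managed access pattern follows; the wrapper class and its idle timeout are assumptions made for illustration only.

```python
import time

class PowerManagedDrive:
    """Wraps a drive so that it is powered on only while I/O is
    directed at it, and powered off again after an idle period."""
    IDLE_TIMEOUT_S = 600  # illustrative idle timeout

    def __init__(self, drive):
        self.drive = drive
        self.powered = False
        self.last_io = 0.0

    def _ensure_on(self):
        if not self.powered:
            self.drive.power_on()
            self.powered = True
        self.last_io = time.monotonic()

    def read_block(self, block):
        self._ensure_on()
        return self.drive.read_block(block)

    def write_block(self, block, data):
        self._ensure_on()
        self.drive.write_block(block, data)

    def maybe_power_off(self):
        """Called periodically; powers the drive off once it has been
        idle for longer than the timeout."""
        if self.powered and time.monotonic() - self.last_io > self.IDLE_TIMEOUT_S:
            self.drive.power_off()
            self.powered = False
```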
In another embodiment, the present invention is implemented in an n-way mirror of mirrors. In an n-way mirror of mirrors, n drives are mirrors of one another. In such an arrangement, multiple disk drives store a copy of the data stored on a primary drive. If the failure of the primary drive or any one of the multiple disk drives is predicted, a spare can be powered on to replace the failing disk drive.
In another embodiment, the present invention is implemented in an existing RAID storage system. If the RAID storage system supports the creation of bi-level arrays, the present invention can be implemented by making the spare disk drives appear as virtual disk drives to the RAID storage system. Virtual spare disk drives are not physically present in a bi-level array or mirror set, but appear to the RAID storage system as if they were. The RAID storage system directs data transfers to both drives in the mirrored set; however, data is actually written only to the active disk drive, and data directed to the virtual spare disk drive is not saved. A software layer is created to handle these I/O requests and to ensure that the virtual spare disk drive is not powered on and allocated to a mirror set until the failure of an active disk drive is predicted.
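A minimal sketch of such a software layer is shown below; the class and method names are assumptions made for illustration, not the invention's defined interface.

```python
class VirtualSpare:
    """Presents a disk drive interface to the RAID layer while the
    physical spare stays powered off. Writes are acknowledged but
    discarded until a physical spare is attached."""

    def __init__(self):
        self.physical_drive = None

    def write_block(self, block, data):
        if self.physical_drive is not None:
            self.physical_drive.write_block(block, data)
        # Otherwise acknowledge and discard: the same data exists on
        # the active drive of the mirror set, so nothing is lost.

    def attach_physical_spare(self, drive):
        """Called when the failure of an active drive is predicted."""
        drive.power_on()
        self.physical_drive = drive
```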
In another embodiment of the present invention, if a failing disk drive fails before all of its data has been copied onto a spare disk drive, or if a disk drive fails without warning so that its replacement cannot be initiated, RAID engine 305 uses RAID parity regeneration techniques to regenerate the data that has not been copied to the spare disk drive. It will be apparent to one skilled in the art that alternate techniques for regenerating data are also possible without deviating from the scope of the invention.
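Combining the copy progress with parity regeneration, the uncopied portion of a drive that failed mid-replacement might be recovered as sketched below (hypothetical interfaces, single-parity case):

```python
def complete_spare_after_failure(spare, raid_set, failed, copied_upto):
    """Blocks below copied_upto were copied before the drive failed;
    the remainder is regenerated by XORing the corresponding blocks
    of the surviving drives in the single-parity RAID set."""
    survivors = [d for d in raid_set if d is not failed]
    for block in range(copied_upto, spare.block_count()):
        data = bytes(spare.block_size())  # start from an all-zero block
        for drive in survivors:
            chunk = drive.read_block(block)
            data = bytes(x ^ y for x, y in zip(data, chunk))
        spare.write_block(block, data)
```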
An advantage of the present invention is that spare disk drives are not powered on until the failure of a disk drive is predicted. This increases the service life of the spare disk drives, because they are not in operation while the active disk drives are functioning without errors; hence, the number of spare disk drive failures is reduced. Another advantage is that no data reconstruction using RAID regeneration techniques is required, because data is copied from a failing drive before its failure; this reduces the performance overhead caused by the regeneration of data. Another advantage is that the ratio of available storage capacity to total storage capacity is high, because spare disk drives are not in use until the failure of an active disk drive is predicted. Yet another advantage is that multiple failing disk drives can be replaced in parallel. The system also consumes less power and generates less heat and vibration, again because the spare disk drives are not always in a power-on condition.
While the preferred embodiments of the invention have been illustrated and described, it will be clear that the invention is not limited to these embodiments only. Numerous modifications, changes, variations, substitutions and equivalents will be apparent to those skilled in the art, without departing from the spirit and scope of the invention, as described in the claims.
This application claims the priority of U.S. Provisional Patent Application No. 60/475,904, entitled “Method and Apparatus for Efficient Fault-tolerant Disk Drive Replacement in RAID Storage Systems” by Guha, et al., filed Jun. 5, 2003, which is hereby incorporated by reference in its entirety.
Cited U.S. Patent Documents:

Number | Name | Date | Kind |
---|---|---|---|
4467421 | White | Aug 1984 | A |
5088081 | Farr | Feb 1992 | A |
5438674 | Keele et al. | Aug 1995 | A |
5530658 | Hafner et al. | Jun 1996 | A |
5557183 | Bates et al. | Sep 1996 | A |
5666538 | DeNicola | Sep 1997 | A |
5680579 | Young et al. | Oct 1997 | A |
5720025 | Wilkes et al. | Feb 1998 | A |
5727144 | Brady | Mar 1998 | A |
5787462 | Hafner et al. | Jul 1998 | A |
5805864 | Carlson et al. | Sep 1998 | A |
5828583 | Bush et al. | Oct 1998 | A |
5913927 | Nagaraj et al. | Jun 1999 | A |
5917724 | Brousseau et al. | Jun 1999 | A |
5961613 | DeNicola | Oct 1999 | A |
6078455 | Enarson et al. | Jun 2000 | A |
6128698 | Georgis | Oct 2000 | A |
6401214 | Li | Jun 2002 | B1 |
6598174 | Parks | Jul 2003 | B1 |
6600614 | Lenny et al. | Jul 2003 | B2 |
6680806 | Smith | Jan 2004 | B2 |
6735549 | Ridolfo | May 2004 | B2 |
6771440 | Smith | Aug 2004 | B2 |
6816982 | Ravid | Nov 2004 | B2 |
6892276 | Chatterjee et al. | May 2005 | B2 |
6957291 | Moon et al. | Oct 2005 | B2 |
6959399 | King et al. | Oct 2005 | B2 |
6982842 | Jing et al. | Jan 2006 | B2 |
6986075 | Ackaret et al. | Jan 2006 | B2 |
7035972 | Guha et al. | Apr 2006 | B2 |
7107491 | Graichen et al. | Sep 2006 | B2 |
7210005 | Guha et al. | Apr 2007 | B2 |
7266668 | Hartung et al. | Sep 2007 | B2 |
20020007464 | Fung | Jan 2002 | A1 |
20020062454 | Fung | May 2002 | A1 |
20020144057 | Li et al. | Oct 2002 | A1 |
20030196126 | Fung | Oct 2003 | A1 |
20030200473 | Fung | Oct 2003 | A1 |
20040006702 | Johnson | Jan 2004 | A1 |
20040054939 | Guha | Mar 2004 | A1 |
20040111251 | Trimmer et al. | Jun 2004 | A1 |
20040153614 | Bitner et al. | Aug 2004 | A1 |
20050060618 | Guha | Mar 2005 | A1 |
20050177755 | Fung | Aug 2005 | A1 |
20050210304 | Hartung | Sep 2005 | A1 |
20050268119 | Guha et al. | Dec 2005 | A9 |
20060053338 | Cousins | Mar 2006 | A1 |
20060075283 | Hartung et al. | Apr 2006 | A1 |
20060090098 | Le et al. | Apr 2006 | A1 |
20070028041 | Hallyal et al. | Feb 2007 | A1 |
20070220316 | Guha et al. | Sep 2007 | A1 |
Prior Publication Data:

Number | Date | Country
---|---|---
20040260967 A1 | Dec 2004 | US |
Related U.S. Application Data (Provisional Application):

Number | Date | Country
---|---|---
60475904 | Jun 2003 | US |