Claims
- 1. An apparatus for improving fault tolerance of a storage system, the apparatus comprising:
a. a first set of disk drives;
b. a second set of disk drives, the second set of disk drives in power-off condition;
c. a processing unit, the processing unit comprising:
i. a drive replacement logic unit, the drive replacement logic unit identifying a failing disk drive from the first set of disk drives;
ii. a drive control unit, the drive control unit receiving an indication from the drive replacement logic unit to replace the failing disk drive with a spare disk drive from the second set of disk drives, the drive control unit powering-on the spare disk drive to replace the failing disk drive; and
d. a memory unit, the memory unit storing drive health status data and information for the first set of disk drives.
- 2. The apparatus as recited in claim 1, wherein the processing unit comprises a RAID engine, the RAID engine performing data striping, data mirroring and parity functions.
- 3. The apparatus as recited in claim 1, wherein the first and second set of disk drives are individually controllable to power on or off independent of the remainder of disk drives.
- 4. The apparatus as recited in claim 1, wherein the first set of disk drives form one or more RAID sets.
- 5. The apparatus as recited in claim 1, wherein the memory unit comprises:
a. a drive attributes unit, the drive attributes unit receiving and storing disk drive attribute data from each disk drive of the first set of disk drives;
b. a failure profile unit, the failure profile unit storing failure profiles for each disk drive from the first set of disk drives; and
c. a threshold unit, the threshold unit storing attribute thresholds for various health factors for the first set of disk drives, the attribute thresholds indicating levels above which disk drives from the first set of disk drives are likely to fail.
- 6. The apparatus as recited in claim 5 further comprising at least one environmental sensor.
- 7. The apparatus as recited in claim 6, wherein the environmental sensors comprise at least one temperature sensor, the temperature sensor monitoring temperature of at least one disk drive from the first set of disk drives.
- 8. The apparatus as recited in claim 6, wherein the environmental sensors comprise at least one vibration sensor, the vibration sensor monitoring vibrations of at least one disk drive from the first set of disk drives.
- 9. The apparatus as recited in claim 6, wherein the memory unit receives drive attribute data from at least one environmental sensor.
- 10. The apparatus as recited in claim 5, wherein the memory unit receives drive attributes data from the first set of disk drives.
- 11. The apparatus as recited in claim 10, wherein the drive attributes data is received using the SMART standard.
- 12. A processing unit for improving fault tolerance of a storage system, the storage system comprising a first set of disk drives storing data and a second set of disk drives, the processing unit comprising:
a. a drive replacement logic unit, the drive replacement logic unit identifying a failing disk drive from the first set of disk drives; and
b. a drive control unit, the drive control unit receiving an indication from the drive replacement logic unit to replace the failing disk drive with a spare disk drive from the second set of disk drives, the drive control unit powering-on the spare disk drive to replace the failing disk drive.
- 13. The processing unit as recited in claim 12, wherein the storage system comprises a RAID system.
- 14. The processing unit as recited in claim 12, wherein each of the first and second set of disk drives are individually controllable to power on or off independent of the remainder of disk drives.
- 15. The processing unit as recited in claim 12, wherein the first set of disk drives are arranged to form one or more RAID sets.
- 16. The processing unit as recited in claim 12, further comprising a memory unit.
- 17. The processing unit as recited in claim 16, wherein the memory unit comprises:
a. a drive attributes unit, the drive attributes unit receiving and storing disk drive attribute data from each disk drive of the first set of disk drives;
b. a failure profile unit, the failure profile unit storing failure profiles for the first set of disk drives; and
c. a threshold unit, the threshold unit storing attribute thresholds for various health factors for the first set of disk drives.
- 18. The processing unit as recited in claim 17 further comprising at least one environmental sensor.
- 19. The processing unit as recited in claim 18, wherein the environmental sensors comprise at least one temperature sensor, the temperature sensor monitoring temperature of at least one disk drive from the first set of disk drives.
- 20. The processing unit as recited in claim 18, wherein the environmental sensors comprise at least one vibration sensor, the vibration sensor monitoring vibrations of at least one disk drive from the first set of disk drives.
- 21. The processing unit as recited in claim 18, wherein the memory unit receives drive attribute data from at least one environmental sensor.
- 22. The processing unit as recited in claim 17, wherein the memory unit receives drive attributes data from the first set of disk drives.
- 23. The processing unit as recited in claim 22, wherein the drive attributes data is received using the SMART standard.
- 24. A method for improving fault tolerance of a storage system, the storage system comprising a first set of disk drives and a second set of disk drives in power-off condition, the method comprising the steps of:
a. monitoring the first set of disk drives to identify a failing disk drive from the first set of disk drives;
b. powering-on a spare disk drive from the second set of disk drives on receipt of a signal to replace the failing disk drive from the first set of disk drives; and
c. copying data from the failing disk drive from the first set of disk drives to the spare disk drive from the second set of disk drives.
- 25. The method as recited in claim 24, wherein the step of monitoring the first set of disk drives further comprises the steps of:
a. receiving information regarding temperature and vibrations of the first set of disk drives;
b. receiving drive status information from the first set of disk drives; and
c. comparing the received information to identify a failing drive.
- 26. The method as recited in claim 24 further comprising the step of adding the spare disk drive to the first set of disk drives.
- 27. The method as recited in claim 24 further comprising the step of removing the failing disk drive from the first set of disk drives.
- 28. The method as recited in claim 27 further comprising the step of powering off the failing disk drive after copying data from the failing disk drive to the spare disk drive replacing the failing disk drive.
- 29. The method as recited in claim 24, wherein the step of copying data further comprises the step of storing data received by the storage system to the failing disk drive and to the spare disk drive.
- 30. The method as recited in claim 24, wherein the step of copying data further comprises the step of reading data requested by the storage system from the failing disk drive.
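As an illustration only, the method of claims 24 through 30 can be sketched in Python as a small in-memory model. Nothing in this sketch comes from the specification: the `Drive` and `StorageSystem` classes, the attribute name `temperature_c`, and the threshold value are all hypothetical, and real drives would be monitored and powered through hardware interfaces rather than Python objects. The sketch assumes that a drive counts as failing when any monitored attribute exceeds its stored threshold.

```python
from dataclasses import dataclass, field


@dataclass
class Drive:
    """Hypothetical stand-in for a disk drive (not from the specification)."""
    name: str
    powered_on: bool = False
    blocks: dict = field(default_factory=dict)      # block number -> data
    attributes: dict = field(default_factory=dict)  # health readings, e.g. temperature


def is_failing(drive, thresholds):
    """Return True if any monitored attribute is above its threshold."""
    return any(drive.attributes.get(key, 0) > limit
               for key, limit in thresholds.items())


class StorageSystem:
    def __init__(self, active, spares, thresholds):
        self.active = list(active)   # first set: powered on, storing data
        self.spares = list(spares)   # second set: kept powered off
        self.thresholds = thresholds
        self.copying = {}            # failing drive name -> spare being filled

    def find_failing(self):
        """Monitoring step: return the first failing active drive, or None."""
        return next((d for d in self.active
                     if is_failing(d, self.thresholds)), None)

    def begin_replacement(self, failing):
        """Power on a spare and mark it as the copy target for `failing`."""
        spare = self.spares.pop(0)
        spare.powered_on = True
        self.copying[failing.name] = spare
        return spare

    def write(self, drive, block, data):
        """While a copy is in progress, store new data on both drives."""
        drive.blocks[block] = data
        spare = self.copying.get(drive.name)
        if spare is not None:
            spare.blocks[block] = data

    def read(self, drive, block):
        """Reads are served from the original (possibly failing) drive."""
        return drive.blocks[block]

    def finish_replacement(self, failing):
        """Copy remaining data, swap the spare in, power off the failing drive."""
        spare = self.copying.pop(failing.name)
        spare.blocks.update(failing.blocks)
        self.active[self.active.index(failing)] = spare
        failing.powered_on = False
        return spare
```

A caller would typically poll `find_failing()`, call `begin_replacement()` when it returns a drive, keep routing I/O through `write()` and `read()` while the copy runs, and then call `finish_replacement()`. Mirroring writes during the copy keeps the spare consistent without pausing the storage system, which is the point of the dual-write step in claim 29.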
REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the priority of U.S. Provisional Patent Application No. 60/475,904, entitled “Method and Apparatus for Efficient Fault-tolerant Disk Drive Replacement in RAID Storage Systems” by Guha, et al., filed Jun. 5, 2003, which is hereby incorporated by reference in its entirety.
Provisional Applications (1)
| Number | Date | Country |
| --- | --- | --- |
| 60475904 | Jun 2003 | US |