1. Field of the Invention
The present invention relates to a technology for managing storage units in redundant arrays of independent disks (RAID).
2. Description of the Related Art
Disk array apparatuses are widely used to improve the reliability of data storage and to increase the access speed. In such disk array apparatuses, a plurality of disk devices are connected to a loop such as a fiber channel arbitrated loop (FC-AL) to configure a RAID.
Sometimes a defect occurs in one of the disk devices and the defective disk device needs to be recovered. In that case, the loop becomes fully occupied during the recovery processing, and therefore, access to the other disk devices is inhibited.
A countermeasure has been disclosed in Japanese Patent Application Laid-Open No. 2004-94774. Specifically, the defective disk is isolated from the loop so that other disk devices can be used without problem.
However, if the defective disk device is essential to maintaining the RAID configuration, it cannot be isolated.
It is an object of the present invention to at least solve the problems in the conventional technology.
An apparatus according to one aspect of the present invention manages a plurality of storage units forming a RAID and includes a determining unit that determines whether to isolate a defective storage unit from among the storage units based on a configuration of the defective storage unit; and an isolating unit that isolates the defective storage unit when the determining unit determines to isolate a defective storage unit.
A method according to another aspect of the present invention is for managing a plurality of storage units forming a RAID and includes determining whether to isolate a defective storage unit based on a configuration of the defective storage unit; and isolating the defective storage unit when it is determined at the determining to isolate a defective storage unit.
The above and other objects, features, advantages and technical and industrial significance of this invention will be better understood by reading the following detailed description of presently preferred embodiments of the invention, when considered in connection with the accompanying drawings.
Exemplary embodiments of the present invention will be described below with reference to accompanying drawings. The present invention is not limited to these embodiments.
The concept of a storage-unit management apparatus according to an embodiment is explained first. A plurality of disk devices is connected to a loop, forming a RAID configuration. When defects intermittently occur in one of the disk devices (hereinafter, “defective disk device”), the storage-unit management apparatus determines whether the defective disk device is important for maintaining the RAID configuration. If the defective disk device is not important, the defective disk device is isolated from the loop. If the defective disk device is important, the defective disk device is inhibited from being isolated from the loop, and the defective disk device is isolated only after copying data stored in the defective disk device to a backup disk device that is in the loop. As a result, the RAID configuration can be maintained while the defective disk device is being recovered.
The disk array apparatus 500 includes channel adaptors 10a to 10d, a front end router 20, controller modules 100a to 100d, and disk storage units 200a to 200p.
The channel adaptor 10a connects the disk array apparatus 500 to an external host computer (not shown). The channel adaptor 10a passes the data that it obtains from the host computer to the controller module 100a. Because the channel adaptors 10b to 10d have configurations and functions similar to those of the channel adaptor 10a, they will not be described in detail.
The front end router 20 connects the controller modules 100a to 100d to each other and enables communication of data among them. The disk storage units 200a to 200p form a RAID. The controller module 100a holds information relating to the RAID configuration, and controls the disk storage units 200a to 200p based on that information. Because the controller modules 100b to 100d have configurations and functions similar to those of the controller module 100a, they will not be explained in detail.
The DMA unit 110 enables communication of the controller module 100a with the other controller modules 100b to 100d via the front end router 20. The DMA unit 110 uses predetermined communication protocols in communicating with the other controller modules 100b to 100d. The interface unit 120 enables communication of the controller module 100a with the channel adaptor 10a and/or the disk storage units 200a to 200p. The interface unit 120 uses predetermined communication protocols in communicating with the channel adaptor 10a or the disk storage units 200a to 200p.
The disk-device managing unit 130 manages the disk storage units 200a to 200p. When the disk-device managing unit 130 receives a notification that a defective disk device has been detected, it transmits an isolation permission table 140a to the disk storage unit that is the source of the notification. The isolation permission table 140a is stored in the recording unit 140.
The isolation permission table 140a records information relating to the RAID configuration of disk devices stored in the disk storage units 200a to 200p.
The isolation permission table 140a includes items such as “Disk No.”, “Mount DE-No.”, “Mount SLOT-No.”, “RAID Group Category”, “RAID Level”, “RAID Status”, and “Permit/Prohibit Isolation”. Any other items may be added to this list if necessary.
“Disk No.” is a number that uniquely identifies the physical position of a disk device mounted on a system, and is expressed by combining the “Mount DE-No.” and the “Mount SLOT-No.” For example, when the “Mount DE-No.” is “00” and the “Mount SLOT-No.” is “01”, the “Disk No.” is “0001”.
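The composition of the “Disk No.” described above can be illustrated with a minimal sketch. The function names `disk_no` and `split_disk_no` are hypothetical, and the sketch assumes that both the “Mount DE-No.” and the “Mount SLOT-No.” are two-character strings, as in the example above; the embodiment does not specify the field widths.

```python
def disk_no(mount_de_no: str, mount_slot_no: str) -> str:
    """Combine the Mount DE-No. and the Mount SLOT-No. into a Disk No.

    Assumption: each field is a two-character string, as in the
    "00" + "01" -> "0001" example.
    """
    if len(mount_de_no) != 2 or len(mount_slot_no) != 2:
        raise ValueError("expected two-character DE and SLOT numbers")
    return mount_de_no + mount_slot_no


def split_disk_no(number: str) -> tuple[str, str]:
    """Recover (Mount DE-No., Mount SLOT-No.) from a four-character Disk No."""
    if len(number) != 4:
        raise ValueError("expected a four-character Disk No.")
    return number[:2], number[2:]


print(disk_no("00", "01"))     # 0001
print(split_disk_no("0001"))   # ('00', '01')
```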
“RAID Group Category” is information for identifying disk devices included in the same RAID group. In the example shown in
“RAID Level” is the level of the RAID formed by each disk device. The “RAID Level” ranges from RAIDs 0 to 5. In the example shown in
“RAID Status” represents the status of the RAID. In the example shown in
“Permit/Prohibit Isolation” is information indicating whether to permit isolation of a disk device when a defect occurs in that disk device. A “Permit/Prohibit Isolation” status of “Prohibit” means that the disk device is essential for maintaining the RAID configuration and therefore cannot be isolated.
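One way the isolation permission table 140a might be represented in software is sketched below. This is an illustration only: the class name `TableEntry`, the function `may_isolate`, and all sample values (group names, levels, status strings) are hypothetical and are not taken from the embodiment. The sketch also assumes that a disk device not listed in the table is treated as not isolatable, which the embodiment does not specify.

```python
from dataclasses import dataclass


@dataclass
class TableEntry:
    """One row of the isolation permission table (field names are illustrative)."""
    disk_no: str             # "Disk No."
    mount_de_no: str         # "Mount DE-No."
    mount_slot_no: str       # "Mount SLOT-No."
    raid_group: str          # "RAID Group Category"
    raid_level: int          # "RAID Level"
    raid_status: str         # "RAID Status" (free-form here; values are hypothetical)
    isolation: str           # "Permit" or "Prohibit"


def may_isolate(table: dict[str, TableEntry], disk_no: str) -> bool:
    """Return True only when the table explicitly permits isolation.

    Assumption: an unknown disk is conservatively treated as "Prohibit".
    """
    entry = table.get(disk_no)
    return entry is not None and entry.isolation == "Permit"


# Hypothetical sample rows, keyed by Disk No.
rows = [
    TableEntry("0000", "00", "00", "group-A", 1, "status-x", "Permit"),
    TableEntry("0001", "00", "01", "group-A", 1, "status-x", "Prohibit"),
]
table = {row.disk_no: row for row in rows}

print(may_isolate(table, "0000"))  # True
print(may_isolate(table, "0001"))  # False
```

Keying the table by “Disk No.” keeps the lookup for a defective disk device to a single dictionary access.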
When the disk-device managing unit 130 obtains information relating to a disk device from the disk storage units 200a to 200p (information such as the start or end of recovery), the disk-device managing unit 130 updates the contents of the isolation permission table 140a based on the obtained information.
In the example shown in
Although not shown in
Referring back to
When a disk device of the disk storage unit 200a becomes defective, the disk storage unit 200a notifies the controller module 100a. In response, the controller module 100a sends the isolation permission table 140a to the disk storage unit 200a. Based on the isolation permission table 140a, the disk storage unit 200a determines whether the defective disk device can be isolated: if the “Permit/Prohibit Isolation” status of the defective disk device is “Permit”, it can be isolated, and if the status is “Prohibit”, it cannot.
The loop control unit 300 includes an interface unit 310, an environment/communication controller 320, an FC buffer 330, a switch unit 340, a power controller 350, a light emitting diode (LED) display controller 360, a voltage controller 370, a temperature monitoring unit 380, a fan-rotation-signal monitoring unit 390, a memory 400, and a switch controller 410.
The interface unit 310 uses predetermined protocols in communicating with the controller modules 100a to 100d. The environment/communication controller 320 controls communication with the disk devices 210 to 230, and manages various units (not shown) included in the disk storage unit 200a. A power unit, a fan, an LED, and the like are examples of the various units included in the disk storage unit 200a.
When a defective disk device is being recovered, the environment/communication controller 320 notifies the controller module 100a of information relating to the defective disk device. Once the recovery is complete, the environment/communication controller 320 notifies the controller module 100a of this fact.
The FC buffer 330 temporarily stores data exchanged between the controller modules 100a to 100d and the disk devices 210 to 230.
The switch unit 340 isolates one of the disk devices connected to the loop, according to instructions from the switch controller 410. As a result, a defective disk device can be isolated from other disk devices in the loop.
The power controller 350 controls the power unit according to instructions from the environment/communication controller 320. Upon being notified by the environment/communication controller 320 of the presence of a defective disk device, the LED display controller 360 flashes an LED (not shown) so that an administrator can know that there is a defective disk device.
The voltage controller 370 monitors and controls a voltage of the disk storage unit 200a. The temperature monitoring unit 380 monitors a temperature inside the disk storage unit 200a, and notifies the environment/communication controller 320 of information relating to the temperature. The fan-rotation-signal monitoring unit 390 monitors the number of rotations of a fan inside a casing (not shown). The memory 400 stores information for controlling hardware (for example, disk devices, the power unit, the fan, and the LED) relating to the disk storage unit 200a.
The switch controller 410 controls the switch unit 340. Specifically, when there is a defective disk device, the switch controller 410 notifies the controller module 100a of this defect and obtains the isolation permission table 140a. Based on the isolation permission table 140a, the switch controller 410 determines whether the defective disk device can be isolated: if the “Permit/Prohibit Isolation” status of the defective disk device is “Permit”, it can be isolated, and if the status is “Prohibit”, it cannot. If the defective disk device can be isolated, the switch controller 410 isolates the defective disk device from the loop.
On the other hand, if the defective disk device cannot be isolated, the switch controller 410 copies data recorded on the defective disk device to a backup disk device that is operating normally. Because the backup disk device can now be used instead of the defective disk device, the defective disk device is then isolated.
For example, when a defect occurs in the disk device 210 that cannot be isolated, and the disk device 230 is the backup disk device, the switch controller 410 copies data recorded in the disk device 210 to the disk device 230. Thereafter, data to be written into the disk device 210 is written into the disk device 230.
The switch controller 410 determines, based on the isolation permission table 140a, whether to isolate the defective disk device (step S103). If the defective disk device can be isolated (step S104, Yes), the switch controller 410 isolates the defective disk device (step S105).
On the other hand, when the defective disk device cannot be isolated (step S104, No), the switch controller 410 copies data stored in the defective disk device to a backup disk device that is operating normally (step S106). Because the backup disk device can now be used instead of the defective disk device, the switch controller 410 isolates the defective disk device (step S105).
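The branch at steps S103 to S106 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the switch controller 410: the function `handle_defective_disk`, its parameters, and the callbacks `copy_data` and `isolate` are hypothetical stand-ins for the copy operation and for the switch unit 340, and the permission table is reduced to a mapping from “Disk No.” to “Permit” or “Prohibit”.

```python
from typing import Callable, Dict, List


def handle_defective_disk(
    defective: str,
    permission: Dict[str, str],
    backup: str,
    copy_data: Callable[[str, str], None],
    isolate: Callable[[str], None],
) -> List[str]:
    """Run steps S103 to S106 and return the sequence of steps taken."""
    steps = ["S103"]  # S103: consult the isolation permission table
    # Assumption: a disk missing from the table is treated as "Prohibit".
    permitted = permission.get(defective, "Prohibit") == "Permit"

    if permitted:
        steps.append("S104:Yes")
    else:
        steps.append("S104:No")
        # S106: copy the data to a normally operating backup disk first
        copy_data(defective, backup)
        steps.append("S106")

    # S105: the defective disk is isolated on both paths
    isolate(defective)
    steps.append("S105")
    return steps


# Usage with hypothetical disk numbers; the callbacks only record what happened.
log: List[str] = []
permission = {"0000": "Permit", "0001": "Prohibit"}

def record_copy(src: str, dst: str) -> None:
    log.append(f"copy {src}->{dst}")

def record_isolate(disk: str) -> None:
    log.append(f"isolate {disk}")

print(handle_defective_disk("0000", permission, "0002", record_copy, record_isolate))
print(handle_defective_disk("0001", permission, "0002", record_copy, record_isolate))
print(log)
```

In both branches the defective disk device ends up isolated; the only difference is whether the copy to the backup disk device is performed first.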
Thus, when a defect is detected in a disk device connected to the loop, the switch controller 410 obtains the isolation permission table 140a from the controller module 100a and determines, based on the isolation permission table 140a, whether to isolate the disk device. Accordingly, a defective disk device can be recovered while maintaining the RAID configuration.
As described above, in the disk array apparatus 500 according to this embodiment, when one of the disk storage units 200a to 200p (for example, the disk storage unit 200a) detects a defect in a disk device stored therein, the disk storage unit 200a requests the isolation permission table 140a from the controller module 100a. The disk storage unit 200a then determines, based on the isolation permission table 140a, whether the defective disk device can be isolated. Because the defective disk device is isolated only when permitted, defects in the disk device can be recovered while maintaining the RAID configuration.
Similar results can be obtained by storing an isolation permission table in each of the disk storage units 200a to 200p. In this case, the disk storage units 200a to 200p mutually exchange information to update the isolation permission tables.
According to the above embodiment, a RAID configuration to which a defective storage unit belongs can be maintained while recovering the defective storage unit.
Moreover, a defect in a storage unit can be efficiently recovered while maintaining a RAID configuration to which the defective storage unit belongs.
Although the invention has been described with respect to a specific embodiment for a complete and clear disclosure, the appended claims are not to be thus limited but are to be construed as embodying all modifications and alternative constructions that may occur to one skilled in the art that fairly fall within the basic teaching herein set forth.
Number | Date | Country | Kind |
---|---|---|---|
2005-181115 | Jun 2005 | JP | national |