It is important that computer data storage systems reliably maintain data stored in a disk drive in the event of a memory failure or error. One commercially available technique for increasing disk drive reliability relies upon redundant data storage on disks, such as in “Redundant Array of Inexpensive Disks” (RAID) systems. RAID systems are currently used in many computer systems that demand improved reliability for data storage while utilizing relatively inexpensive memory disks. For example, banking and secure transactions over the Internet often rely upon RAID systems.
Certain levels of RAID involve the use of mirror disks, in which data is copied among different disks. Today, changing such mirror-related configurations of RAID arrays requires taking the disk volume off-line. It would therefore be desirable to provide a mechanism to improve the reliability of RAID systems and to reduce the down-time that results from reconfiguring the disk drives.
The same numbers are used throughout the drawings to reference similar features and components.
FIGS. 3a and 3b are a block diagram of one embodiment of the invention showing a RAID disk drive reconfiguring process as shown in FIG. 2 that can be configured either as an Add_Location function or a Swap_Location function.
FIGS. 4a and 4b are a block diagram of one embodiment of the invention showing a Delete_Mirror function flow, which is a version of the RAID disk drive reconfiguring process as shown in
Much has gone into the development of disk drives. Disk drives provide one technique to store large amounts of data in a random access memory (RAM) format for such uses as databases. Disk drive reliability is especially important in critical RAM and database applications such as banking, finance, security, and personal matters. One technique that increases the reliability of disk drives involves so-called “Redundant Array of Inexpensive Disks” (RAID) systems.
The RAID protocol is subdivided into a number of RAID levels, each of which provides different functionality. RAID level-1 is also known as “RAID mirroring”, in which data from a primary disk drive is copied (i.e., “mirrored”) into one or more mirror disk drives. RAID mirroring improves redundancy, lowers latency, increases bandwidth for reading and/or writing data, and allows for recoverability from hard-disk crashes. This disclosure describes a number of RAID mirror embodiments that reconfigure host-based RAID mirrors while keeping the RAID system on-line.
In RAID level-1 mirror systems, each mirror disk drive (and there may be a number thereof) stores data identical to that stored in the primary disk drive to provide redundant storage of data. Data stored in the disk drives of the RAID array can be compared using the RAID algorithms. If one disk drive (either the primary disk drive or one of the mirror disk drives) contains data that is inconsistent with the other disk drives, then a RAID algorithm can determine which data set(s) to rely upon using error handling techniques.
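By way of illustration only, the mirroring and comparison described above can be sketched as follows. The class name `MirroredVolume`, the use of in-memory dictionaries as stand-ins for disk drives, and the majority-vote policy are all assumptions made for this sketch; the disclosure does not specify which error handling technique a given RAID algorithm uses.

```python
from collections import Counter


class MirroredVolume:
    """Toy RAID level-1 volume: one primary plus N mirrors (hypothetical)."""

    def __init__(self, mirror_count=2):
        # drives[0] plays the primary; the remaining entries are mirrors.
        # Each drive is modelled as a dict mapping block number -> data.
        self.drives = [dict() for _ in range(1 + mirror_count)]

    def write(self, block, data):
        # Mirroring: every drive receives an identical copy of the data.
        for drive in self.drives:
            drive[block] = data

    def read(self, block):
        # Compare the copies held by all drives. Majority voting is an
        # assumed policy; on a tie, the first value encountered wins.
        copies = [drive.get(block) for drive in self.drives]
        value, _votes = Counter(copies).most_common(1)[0]
        inconsistent = [i for i, c in enumerate(copies) if c != value]
        return value, inconsistent
```

In this sketch, `read` returns both the data relied upon and the indices of any drives whose copy disagrees, so that a caller could schedule those drives for repair.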
RAID systems that operate based on a RAID protocol include disk arrays in which part of the physical storage capacity is used to store redundant information about user data. The redundant data is stored on other portions of the physical storage. The redundant information enables regeneration of user data in the event that one of the array's member disks (or the access path thereto) fails.
Each RAID disk drive may be configured as a non-volatile, randomly accessible, rewritable mass storage device. The physical mass storage device of each RAID disk drive can include (but is not limited to) rotating magnetic disks, rotating optical disks, solid-state disks, non-volatile electronic storage elements [e.g., programmable read-only memories (PROMs), erasable programmable read-only memories (EPROMs), and electrically erasable programmable read-only memories (EEPROMs)], and/or flash drives.
This disclosure provides a mechanism by which the RAID disks are maintained on-line while re-configuring the RAID system. By maintaining the RAID system on-line during reconfiguring, the users of computer environments can experience continued operation of the application programs. Although an illustrative RAID Level 1 system is described as being reconfigured in the present disclosure, the on-line reconfiguring concepts disclosed herein are applicable to other RAID and other redundant non-RAID systems. As RAID and other redundant disk systems are becoming more commonplace, allowing the application programs to run more continuously is becoming more important.
To maintain the RAID system on-line while being reconfigured, process pairs are used to access information from the disk drive. The process pair includes a first processor running a first process that relates to a first copy of an image of an operating system. The process pair further includes a second processor (that in certain embodiments may actually be the first processor) running a second process that relates to a second copy of an image of an operating system. The use of multiple processors running multiple processes is considered as a clustered system.
To change the state of a RAID system, the operation that results in a change of the state is initially performed relative to the first processor/process while activity in the second processor/process is suspended or maintained. The operation that results in a change of the state is then performed relative to the second processor/process while any activity in the first processor/process is suspended or maintained.
By using this technique, in which the state of one of the two processes is maintained while the other is being changed (either the second process is suspended while the first process is modified, or vice versa), the RAID system can be maintained on-line during the disk reconfiguration process. In current systems that rely on a single process, the RAID system has to be taken off-line to reduce the potential for errors. As such, the present disclosure provides a mechanism by which the RAID system can be maintained on-line continuously during the reconfiguration processes.
Certain embodiments of RAID systems are capable of error handling by which data or operational errors of the RAID drives are handled. The techniques by which a RAID system can detect and handle errors can vary between different RAID disk drives. In this disclosure, the RAID disk drives can undergo on-line error handling while being reconfigured.
Error handling can be more reliably performed using the process pair as described above. For example, if an error is detected in the first processor/process as the change of the state is initially performed relative to the first processor/process, then the state of the first processor/process can be returned to its original state using the state of the second processor/process before the state of the second processor/process of the process pair is changed.
By comparison, if the first processor/process of the process pair is successfully changed (and does not invoke error handling), then an attempt is made to change the second processor/process in the process pair. If an error is detected in the second processor/process, then the overall process is moved forward to the state of the first processor/process, which has already been changed. This technique allows for reconfiguration of the RAID system (one process at a time) while maintaining the RAID system on-line, which is important to provide continual application support.
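The staged change and the two recovery directions described above may be sketched, under stated assumptions, as follows. The names `ReconfigError`, `restore`, and `reconfigure_pair` are hypothetical, and each process's configuration state is modelled as a plain dictionary; an actual process pair would hold its state in system tables and driver structures.

```python
import copy


class ReconfigError(Exception):
    """Raised by a change operation that fails part-way (hypothetical)."""


def restore(target, source):
    # Overwrite target's state in place with a deep copy of source's state.
    target.clear()
    target.update(copy.deepcopy(source))


def reconfigure_pair(first, second, change):
    """Apply `change` to one process at a time.

    `first` and `second` are dicts holding each process's state; `change`
    is a callable that mutates a state dict and may raise ReconfigError.
    """
    try:
        # The second process is left untouched, so it keeps the original state.
        change(first)
    except ReconfigError:
        # Error in the first process: return it to the original state,
        # which the second process still holds.
        restore(first, second)
        return "rolled_back"
    try:
        # The first process now holds the changed state while the second changes.
        change(second)
    except ReconfigError:
        # Error in the second process: move it forward to the changed
        # state already held by the first process.
        restore(second, first)
        return "rolled_forward"
    return "committed"
```

At every point in this sketch, at least one of the two state dictionaries is internally consistent, which is the property that allows the volume to remain on-line.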
There are several aspects to reconfiguring the RAID disk drives using the reconfiguring and/or mirroring processes described in this disclosure. One aspect of the RAID disk drive reconfiguring process is that users should be able to mirror an unmirrored disk volume without taking the disk volume out of service. The rationale for this aspect is that users do not want to take their applications out of service to change the disk configuration from an unmirrored configuration to a mirrored configuration.
Another aspect is that users should be able to unmirror a mirrored disk volume without taking the disk volume out of service since users do not want to take their applications out of service to change the disk configuration from mirrored to unmirrored.
Another aspect is that users should be able to reconfigure a mirrored disk volume to use a disk drive in an alternate location without taking the disk volume out of service. One rationale for this aspect is that users do not want to take their applications out of service to change the disk configuration to use a different disk drive. The disk drives of the mirrored disk volume should be synchronized before another mirror-oriented reconfiguring is performed. In other words, if the user wants to move both halves of a mirrored disk volume, then the user should move one disk drive at a time and allow the disk revive to complete before the other disk drive is moved.
Yet another aspect of the disclosed ODR mechanism is that users should be able to switch roles between the primary and mirror disk drives of a mirrored disk volume without taking the disk volume out-of-service. A rationale for this aspect is that users should be provided continued service without taking their applications out of service to perform this type of configuration change.
Another aspect of the disclosed ODR mechanism is that users should be able to replace a failing disk drive of a mirrored disk volume 112 while being physically off-site. A rationale for this aspect is that there may be no users at the site where the ODR mechanism is physically housed. In these situations, users still need to be able to replace a disk drive that is showing signs of incipient failure without having to go on-site and physically replace the disk drive. One technique to meet this desired aspect is to provide a number of non-configured disk drives to the system, thereby allowing the off-site user to perform an Online Disk Replacement that replaces the ailing disk drive with one of the spare disk drives.
Another aspect described in this disclosure is known as “Disk Sparing”, in which certain disk drives should not be configured automatically within the ODR mechanism 100 since they can be used as part of a pool of spare disk drives. This pool of spare disk drives can allow the user to configure the system to automatically replace a failed disk drive with one of the spare disk drives by performing an automated ODR mechanism that starts a disk revive.
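A minimal sketch of the disk sparing concept follows, offered for illustration only. The names `SparePool` and `replace_failed_drive`, the location strings, and the `revive_pending` flag are assumptions of this sketch rather than elements of the ODR mechanism 100.

```python
class SparePool:
    """Pool of non-configured spare disk drives, identified by location."""

    def __init__(self, spares):
        self.spares = list(spares)

    def take(self):
        # Hand out the next spare, or fail if the pool is exhausted.
        if not self.spares:
            raise LookupError("no spare disk drives available")
        return self.spares.pop(0)


def replace_failed_drive(volume, failed_role, pool):
    """Swap a spare into the failed role and flag the volume for a revive.

    `volume` maps a role ("primary" or "mirror") to a drive location.
    """
    spare = pool.take()
    volume[failed_role] = spare
    # The replacement drive is empty, so a disk revive must copy the
    # surviving drive's data onto it before the mirror is synchronized.
    volume["revive_pending"] = True
    return spare
```

Keeping the spares outside the configured volumes is what allows the replacement to be triggered automatically, or by an off-site user, without anyone physically handling a drive.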
Another aspect of this disclosure is that users should be able to configure their system to automatically perform an Online Disk Replacement using one of a pool of spare disk drives. Another aspect of the disclosed ODR mechanism 100 is that users should be able to script actions such as moving from a non-interleaved to an interleaved disk configuration. A rationale for this aspect is that moving from a non-interleaved to an interleaved disk configuration may change the configuration for a large number of disks, including actions such as powering off the disk drive to be moved to its new location. If the user intends to physically move the disk drives, then it is acceptable to have the user make use of two scripts, one that prepares the disk drives to be moved (including powering the disk drive off) and another that completes the configuration changes once the disk drives have been moved.
I. General Disk Reconfiguring System
As described with respect to
One embodiment of the RAID system 102 includes a disk array controller 108 that is coupled to the disk array 104 to coordinate data transfer to and from the storage disk drives 114. The RAID system 102 further includes a RAID management system 110 that provides a mechanism to reconfigure an array of host-based RAID level-1 mirror disk drives without taking the disk drives 114 out of service (which would otherwise take application programs that run on many of the disk drives off-line). Arrays of RAID disks typically include a plurality of the storage disk drives 114, the hardware used to connect the storage disk drives to one or more host computers 124, and management software that controls the operation of the physical disk drives 114. The RAID system 102 is coupled to a host computer 124 via an I/O interface bus 118. The software, hardware, and/or firmware of the RAID system 102 can present the data stored in the storage disk drives 114 in such a manner that many of the storage disk drives 114 within the disk array 104 can appear to the users as a virtual disk running on the host computer 124. The “virtual disk” may be realized in the disk array using the management software.
The disk array controller 108 is coupled to the disk array 104 via one or more interface buses 119, such as a commercially available small computer system interface (SCSI), ATA, Universal Serial Bus (USB), FibreChannel, IEEE 1394, etc. The RAID management system 110 is operatively coupled to a disk array controller 108 via an interface protocol 116. The RAID management system 110 can be configured either as a separate component as shown, or contained within the disk array controller 108 or within the host computer 124. The RAID management system 110 and the disk array controller 108 together provide the process pair functionality to the disk array 104. The process pair functionality includes a first processor that runs a first process relating to a first copy of the operating system, and a second processor that runs a second process relating to a second copy of the operating system. The components of the process pairs (the processors, the processes, and the copies of the operating systems) are not shown due to the variety of potential implementations. The RAID management system 110 can provide a data manager for controlling disk storage and reliability levels, and for transferring data among various reliability storage levels. The RAID management system 110 can also implement distributed write disk logging.
In the system shown, the disk array controller 108 is provided as a single or multiple controllers. The methods disclosed within this disclosure can be practiced with a single disk array controller 108, more than two controllers, or other architectures.
The disk array 104 can be characterized as different storage spaces, including its physical storage space and one or more virtual storage spaces. For example, storage disk drives 114 in disk array 104 are conceptualized as being arranged in a disk volume 112 of multiple disks 114. These various views of storage are related through mapping techniques. For example, the physical storage space of the RAID system 102 is mapped into a virtual storage space that delineates storage areas according to the various data reliability levels. Some areas within the virtual storage space are allocated for a first reliability storage level, such as RAID mirror (RAID level-1), and other areas are allocated for another reliability storage level, such as parity or striping. These areas may be configured on the same or separate disks or any combination thereof.
The RAID system 102 can include a memory map store 122 that provides for persistent storage of the virtual mapping information used to map the disk array 104. The memory map store 122 is external to the disk array (and is resident in the disk array controller 108). The memory mapping information is updated continually or at regular (or irregular) intervals by the disk array controller 108 or RAID management system 110 as the various mapping configurations among the different views change.
The memory map store 122 can be embodied as one or more non-volatile random access memories within the disk array controller 108. The memory map store 122 provides for redundant storage of the memory mapping information that can be used to provide the process pair (two processes with two copies of the operating system). The virtual mapping information is duplicated and stored in the memory map store 122 according to mirror redundancy techniques. In this manner, the memory map store 122 is dedicated to storing the original mapping information and the redundant mapping information.
As indicated, the disk array 104 includes multiple storage disk drives 114 (which may also be configured as multiple locations within a single disk drive device to store redundant data). The management of data on redundant storage disk drives 114 is coordinated by the RAID management system 110. When viewed by the user or host application program, an application level virtual view can represent a single large storage capacity indicative of the available storage space on storage disk drives 114. The RAID management system 110 can dynamically alter the configuration of the RAID areas over the physical storage space.
As a result, the mapping of the RAID areas in a RAID-level virtual view onto the disks and the mapping of a front end virtual view to the RAID view are generally in a state of change. In one embodiment, the memory map store 122 maintains the current mapping information used by the RAID management system 110 to map the RAID areas onto the disk drives, as well as the information employed to map between the two virtual views. As the RAID management system 110 dynamically alters the RAID level mappings, it also updates the mapping information in the memory map store to reflect the alterations.
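The two levels of mapping and their redundant storage can be pictured with the following sketch, which is illustrative only. The class name `MemoryMapStore`, the table names `front_to_raid` and `raid_to_disk`, and the representation of a physical extent as a (disk, offset) pair are assumptions; the disclosure does not prescribe a data layout for the memory map store 122.

```python
class MemoryMapStore:
    """Holds the mapping tables in duplicate (original plus redundant copy)."""

    TABLES = ("front_to_raid", "raid_to_disk")

    def __init__(self):
        # Two independent copies of every table, mirroring the mapping data.
        self._copies = [{name: {} for name in self.TABLES} for _ in range(2)]

    def update(self, table, key, value):
        # Every alteration is applied to both copies so they stay identical.
        for tables in self._copies:
            tables[table][key] = value

    def resolve(self, front_block):
        # Front-end virtual view -> RAID-level view -> physical extents.
        tables = self._copies[0]
        raid_area = tables["front_to_raid"][front_block]
        return tables["raid_to_disk"][raid_area]

    def copies_match(self):
        return self._copies[0] == self._copies[1]
```

For example, a front-end block mapped to a RAID level-1 area would resolve to two physical extents, one on each disk drive of the mirror.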
Different mechanisms described in this disclosure provide for reconfiguring of the disk drives in the mirrored disk volume 112 without taking the disk volume out of service. Therefore, any application programs that rely on data stored within the disk volume 112 can continue their operation as the disk volume is maintained online. This disclosure also describes how to determine the order to update various system tables, when to invoke specific actions and how to handle the different errors that might occur.
Within this disclosure, the term “process” is frequently used. It is intended that the term “process” apply to, and be interchangeable with, the “processor” that performs the “process”.
The RAID disk drive reconfiguring process 200 includes a “receive request to reconfigure the RAID” operation 202. The RAID disk drive reconfiguring process 200 continues to where the disk array 104 as described with respect to
If the answer to decision 206 is yes, then there might be some error in the reconfiguring process, and the RAID disk drive reconfiguring process 200 continues to decision 209, in which it is determined whether the reconfiguring process 200 is substantially complete. An aspect of determining whether the reconfiguring process 200 is substantially complete is that if an error is detected when the reconfiguring process is first starting up (and is not substantially complete, such as would be the case if the error was in the first process of the process pair), it typically is easier and more reliable to return the ODR mechanism 100 to its original state prior to beginning the reconfiguring process.
By comparison, if the reconfiguring process has already accurately completed the first process of the process pair, and the error is in the second process of the process pair, then the reconfiguring process is considered to be substantially complete. The term “substantially complete” may be equated to the process pair concept as described within this disclosure. An error detected in the first process while it is being processed (as the second process is sustained in its original state) would likely relate to a reconfiguration process that is not substantially complete. By comparison, an error detected in the second process (after the first process has completed its processing and its completed state is being sustained) could be considered an error occurring when the reconfiguration process is substantially complete. If the reconfiguration process is substantially complete, then the reconfiguration process continues to 212 as described with respect to
In one embodiment, a number of factors can be used to determine whether to roll back to the original state or continue to the finished state. Whether to roll back or continue may depend on two factors: 1) the type of error encountered; and 2) where in the process the processor that is running the process is. In one embodiment, a processor halt occurs. Following a processor halt, the process determines whether a new or old configuration is being run, and cleans up the configuration of the process accordingly. If the processor is running one process (e.g., with a new configuration), then the processor completes what it can, which includes steps such as a database update. If the processor is running another process (e.g., with the old configuration), then the processor performs reversal steps such as reversing system tables.
If the processor encounters some other error (for example, failing to update a system table), then the decision becomes whether the process contains sufficient configuration state information to commit the configuration change. If the process does not contain sufficient configuration state information, then the process is backed out with the state being returned to its original state. By comparison, if the reconfiguring process 200 is substantially complete, then it is typically easier and more reliable for the ODR mechanism 100 to finish the reconfiguring process, and then correct the resulting errors produced by the reconfiguring process. By following this reconfiguring logic and
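The roll-back versus continue decision outlined in the preceding paragraphs may be sketched as a small decision function. This is a simplified illustration: the `Action` enumeration, the function name `choose_recovery`, and the string `"processor_halt"` are hypothetical labels, and a practical implementation would likely weigh more detailed state.

```python
from enum import Enum


class Action(Enum):
    ROLL_BACK = "return to the original state"
    ROLL_FORWARD = "finish the reconfiguration, then correct errors"


def choose_recovery(error_type, running_new_configuration=False,
                    has_sufficient_state=False):
    """Pick a recovery direction from the error type and current progress."""
    if error_type == "processor_halt":
        # After a halt, the surviving process cleans up according to the
        # configuration it is running: complete what it can if new
        # (e.g., database update), reverse if old (e.g., system tables).
        if running_new_configuration:
            return Action.ROLL_FORWARD
        return Action.ROLL_BACK
    # Any other error (e.g., a failed system table update): commit only if
    # the process holds enough configuration state to do so.
    if has_sufficient_state:
        return Action.ROLL_FORWARD
    return Action.ROLL_BACK
```

The function deliberately defaults to rolling back, reflecting the observation that returning to the original state is typically the easier and more reliable choice early in the reconfiguring process.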
II. Example Disk Device Reconfiguration
This disclosure now describes a number of embodiments of the RAID disk drive reconfiguring process 200 described with respect to FIGS. 2, 3a, 3b, 4a, and 4b that can operate based on the process pairs as described within this disclosure. One embodiment of the RAID disk drive reconfiguring process 200 that can alternatively add a mirror disk drive or switch between a mirror disk drive and a primary disk drive is described with respect to
In
Each one of the RAID processes 304, 306 includes a disk process file manager 308, a driver 309, and a SCSI interface monitor 310 (which is a low-level privileged system process responsible for maintenance of the in-memory I/O configuration tables on behalf of the Kernel). At any given instant in time, one of the RAID processes 304 or 306 controls both the primary disk drive (and the associated process) and the mirror disk drive (and the associated process). The particular one of the respective RAID processes 306 or 304 that controls the disk drives may be switched using a RAID disk drive reconfiguring process 200 as described herein. This disclosure provides one implementation in which the roles between the different processes are organized. In other embodiments of the operating systems, a different mechanism may be provided (for example, by combining the SSM and SIFM functions).
In
The embodiment of the RAID disk drive reconfiguring process 200 as described with respect to
In a “Reserve Location in Database” operation 316, the storage subsystem manager 312 writes a location reservation record for the target mirror into a database since a mirror disk drive is to be added to the disk volume 112. This location reservation record is the “DISK_Altkey_Record” record used by the storage subsystem manager 312 to find disk volumes by location. The purpose of this reservation is to ensure that another configuration request against the same location will fail—as soon as the location is reserved, the location is considered in use. In this implementation, the location record contains a modified version of the disk-volume name, which enables the system to remove orphaned reservation records in case of a crash in the midst of ODR processing. The “Reserve Location In Database” operation 316 is invoked after receiving the “Operation Parsing” operation 315 indicating with respect to
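The location reservation behaviour described above can be illustrated with the following sketch. The names `ConfigDatabase` and `LocationAlreadyReserved` are hypothetical, and the `~` prefix used to mark a reservation is an assumption standing in for the unspecified "modified version of the disk-volume name".

```python
class LocationAlreadyReserved(Exception):
    """Raised when a second request targets an in-use location."""


class ConfigDatabase:
    RESERVED_PREFIX = "~"  # assumed marker for a not-yet-committed record

    def __init__(self):
        self.by_location = {}  # location -> (possibly modified) volume name

    def reserve(self, location, volume_name):
        # As soon as a location is reserved it counts as in use, so any
        # other configuration request against it fails.
        if location in self.by_location:
            raise LocationAlreadyReserved(location)
        self.by_location[location] = self.RESERVED_PREFIX + volume_name

    def commit(self, location, volume_name):
        # Replace the reservation with the final record.
        self.by_location[location] = volume_name

    def remove_orphans(self):
        # After a crash mid-ODR, leftover reservations are recognizable
        # by their modified name and can be removed.
        orphans = [loc for loc, name in self.by_location.items()
                   if name.startswith(self.RESERVED_PREFIX)]
        for loc in orphans:
            del self.by_location[loc]
        return orphans
```

Because a reservation and a committed record share one key space in this sketch, a single lookup suffices to reject a competing request.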
In the “Update Path Configuration in Primary” operation 318, the storage subsystem manager calls a function specifying an “Add-Mirror” call. This call can be used when the path configuration in the system tables should be updated in both processes 304 and 306 before the disk process is told to begin its “ODR processing using the ODR_Begin” operation. This call ensures that the disk process has up-to-date information in the system tables in case it has to switch to the backup process before the ODR action has been completed. A reply 319 responds to the “Update Path Configuration in Primary” operation 318.
In the “Update Path Configuration in Backup” operation 320, the storage subsystem manager 312 of the control process 302 calls an operation for documentation to specify which action to perform (i.e., “Add-Mirror” or “Swap Mirror” in
In the “ODR-Begin” operation 324, the storage subsystem manager 312 sends a message to the primary disk process, telling it to commence the RAID disk drive reconfiguring process 200 as described with respect to
There are two inverse operations described within this disclosure: the “Driver_Brother_Down” operation, which causes the link between the primary process and the backup process to be broken, and the “Driver_Brother_Up” operation, which causes the link between the primary process and the backup process to be (re)established as described in this disclosure. While the link is broken, the Disk Driver 309 in the primary disk process 304 cannot, for example, send checkpoints to the Disk Driver 309 in the backup disk process 306, thereby ensuring that configuration changes do not occur in an uncontrolled manner. (If the link were not severed, then the configuration change would be copied as part of the checkpoint, which could cause, for example, I/O paths to be deleted during a pending I/O.)
The “Disable I/O in Backup” operation 330 brings the volume down in the backup process. The “Begin_ODR” operation 324 is invoked prior to beginning the reconfiguring of the SCSI Interface Monitor (SIFM) 310 using the “Disable I/O in Backup” operation 330. The “Driver_Brother_Down” operation 328 breaks the link between processors for the Driver ODR_State within the ODR mechanism. Replies 329 and 332 respond to the “Disable I/O in Backup” operation 330.
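The effect of breaking and re-establishing the link can be sketched as follows, for illustration only. The class `DiskDriver` and its methods are hypothetical stand-ins for the Disk Driver 309, and a checkpoint is simplified here to a copy of the path configuration.

```python
class DiskDriver:
    """Toy driver holding a path configuration and a link to its brother."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.brother = None  # the driver in the other process, if linked

    def brother_up(self, other):
        # (Re)establish the link to the driver in the other process.
        self.brother = other

    def brother_down(self):
        # Break the link; checkpoints can no longer reach the brother.
        self.brother = None

    def checkpoint(self):
        """Copy this driver's configuration to its brother.

        Returns False, leaving the brother untouched, when the link is down.
        """
        if self.brother is None:
            return False
        self.brother.paths = list(self.paths)
        return True
```

With the link down, a path change made in one driver stays local, so the other driver can continue serving pending I/O on its existing paths until it is reconfigured deliberately.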
In the “ODR_Reconfigure” operation 334, the storage subsystem manager tells the backup disk process to reconfigure its Disk Driver 309. The “ODR_Reconfigure” operation 334 causes the backup disk process to invoke the following operations: a) call the “Driver_Stop” operation 336, which causes the Disk Driver 309 to think that the disk process is going away thereby causing the Disk Driver to clean up configuration entries and data structure in the backup process; b) call the “Driver_Environment” operation 338, which causes the Disk Driver 309 in the backup process to retrieve the changed path information from the system tables via the SIFM 310 in the backup process 306 and create new data structures; and c) call the “Driver_Initialize” operation 344 which causes the Disk Driver 309 to go through setup processing.
Note that the I/O remains disabled in the backup process at this point following 330 (the disk volume is still logically down in the backup process). Once the operations 336, 338, and 344 have been completed, the Disk Driver 309 in the backup disk process 306 is ready to make use of the new path information in 340. The “Driver_Initialize” operation 344 is invoked as part of the first pass through the ODR processing, and represents the first time the “ODR_Reconfigure” operation is invoked. The Reply 346 responds to the “ODR_Reconfigure” operation 334.
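The three-step reconfiguration of the backup Disk Driver (stop, environment, initialize) can be sketched as follows. The class `BackupDiskDriver`, the function `odr_reconfigure`, and the dictionary used in place of the SIFM-maintained system tables are assumptions made for this illustration.

```python
class BackupDiskDriver:
    """Toy model of the Disk Driver in the backup disk process."""

    def __init__(self, system_tables):
        # Stand-in for the system tables reached via the SIFM.
        self.system_tables = system_tables
        self.paths = None
        self.ready = False
        self.io_enabled = False  # I/O stays disabled throughout

    def driver_stop(self):
        # Behave as if the disk process were going away: discard the
        # existing configuration entries and data structures.
        self.paths = None
        self.ready = False

    def driver_environment(self):
        # Retrieve the changed path information and build new structures.
        self.paths = list(self.system_tables["paths"])

    def driver_initialize(self):
        # Setup processing; requires the environment step to have run.
        if self.paths is None:
            raise RuntimeError("driver_environment must run first")
        self.ready = True


def odr_reconfigure(driver):
    driver.driver_stop()
    driver.driver_environment()
    driver.driver_initialize()
    return driver.ready
```

The driver ends the sequence ready to use the new paths but with I/O still disabled, matching the state of the backup process before the ownership switch.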
In the “ODR Primary” operation 350, the storage subsystem manager 312 tells the primary disk process to perform a “Primary_Disk” action which causes the primary disk process to switch roles with the backup disk process in a “Switch” operation 352, followed by a “Handshake” operation 354. Following the “Switch” operation 352, the backup disk process now acts as the primary disk process, and the primary disk process acts as the backup disk process. During this processing, the disk process uses the path information to determine how to handle the special ownership-switch request. The “ODR Primary” operation 350 is invoked after the Disk Driver 309 in the backup process 306 has been reconfigured following the “Enable I/O in Primary” operation 356.
If the path information is different between the primary and backup disk processes, then the “ODR Primary” operation 350: a) starts the disk process thread in the new primary process; and b) stops the disk process thread in the new backup process; and c) calls the “Driver_Brother_Down” operation, which causes the Disk Driver 309 in the primary process to break its link to the Disk Driver 309 in the backup process in 368. This checking is used to ensure that the disks in the disk array 104 as shown in
If the path information is the same between the primary and backup disk processes 304, 306, then the “ODR Primary” operation 350: a) starts the disk process thread in the new primary process (there is no need to start the disk process thread in the backup process since it is already running from prior to the ownership switch); and b) calls the “Driver_Brother_Up” operation, which causes the Disk Driver 309 in the primary process to recreate its link to the Disk Driver 309 in the backup process. This means that Driver checkpointing is activated again. The configuration change is complete in the disk process. The Replies 358 and 360 are both responsive to the “ODR Primary” operation 350.
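The ownership switch and the subsequent comparison of path information can be sketched as follows. The function name `odr_primary`, the dictionary layout of `pair`, and the return strings are hypothetical; the sketch models only the role swap, the thread state, and the link state.

```python
def odr_primary(pair):
    """Switch roles, then branch on whether both processes share paths.

    `pair` holds "primary" and "backup" dicts (each with "paths" and
    "thread_running") plus a "link_up" flag for the driver link.
    """
    # Switch: the backup becomes the primary and vice versa.
    pair["primary"], pair["backup"] = pair["backup"], pair["primary"]
    # In both branches the disk process thread starts in the new primary.
    pair["primary"]["thread_running"] = True

    if pair["primary"]["paths"] != pair["backup"]["paths"]:
        # The new backup still holds the old configuration: stop its
        # thread and keep the link broken so it can be reconfigured.
        pair["backup"]["thread_running"] = False
        pair["link_up"] = False
        return "backup_pending"

    # Both processes hold the same configuration: re-link the drivers,
    # which reactivates checkpointing.
    pair["link_up"] = True
    return "complete"
```

Calling the function once with differing paths, reconfiguring the new backup, and calling it again yields the completed state with the original roles restored.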
In the “Begin_ODR” operation 362, the storage subsystem manager 312 sends a message to the primary disk process, telling it to (once again following 324) commence the RAID disk drive reconfiguring process 200. This “Begin_ODR” operation 362 causes the primary disk process to: a) stop the disk process thread in the backup process in 364; and b) call the “Driver_Brother_Down” operation in 366, which causes the Disk Driver 309 in the primary process to break its link to the Disk Driver 309 in the backup process in the “Disable I/O in Backup” operation 368. This means that the Disk Driver 309 in the primary disk process cannot, for example, send checkpoints to the Disk Driver 309 in the backup disk process. The “Disable I/O in Backup” operation 368 results in the volume being down in the backup. The “Begin_ODR” operation 362 is invoked after changing the configuration of the SIFM 310. The Reply 370 is responsive to the “Begin_ODR” operation 362.
In the “ODR_Reconfigure” operation 372, the storage subsystem manager tells the primary disk process to reconfigure the Disk Driver 309. The “ODR_Reconfigure” request causes the backup disk process to: a) call the “Driver_Stop” operation 374, which causes the Disk Driver 309 to think that the disk process is going away, thereby causing the Disk Driver 309 to clean up configuration entries and data structures in the backup process; b) call the “Driver_Environment” operation 376, which causes the Disk Driver 309 in the backup process to retrieve the changed path information from the SIFM 310 in the backup process and to create new data structures; and c) call the “Driver_Initialize” operation 379, which causes the Disk Driver 309 to go through setup processing. The I/O remains disabled in the backup process at this point; the disk volume is still logically reduced in the backup process.
Once the operations 374, 376, and 379 have been completed, the Disk Driver 309 in the backup disk process is ready to make use of the new path information in 378. These operations are invoked as part of the second pass through the ODR processing; this is the second time that the “ODR_Reconfigure” operation is invoked. The Reply 380 is responsive to the “ODR_Reconfigure” operation 372.
In the “ODR Primary” operation 381, the storage subsystem manager tells the primary disk process to perform the Primary_Disk action. The “ODR Primary” operation 381 causes the primary disk process to “Switch” roles with the backup disk process in 382, after which 383 is a “Handshake” operation. The new backup disk process, which was switched from the original primary disk process in 352, is therefore switched again to the current primary disk process after 382. Additionally, the new primary disk process, which was switched from the original backup disk process in 352, is therefore switched again to the current backup disk process after 382. During this processing, the Disk Process uses the path information to determine how to handle the special ownership-switch request. The “ODR Primary” operation 381 is invoked after the Disk Driver 309 in the backup disk process has been reconfigured following the “Enable I/O in Primary” operation 384.
If the path information of the primary disk drive differs from that of the backup disk drive, then a) start the disk process thread in the new primary process; b) stop the disk process thread in the new backup process; and c) call the “Driver_Brother_Down” operation to break the link between the primary process and the backup process. At this point, the new path configuration is in use in the primary disk process. The Disk Driver 309 in the primary disk process is temporarily no longer in communication with the Disk Driver 309 in the backup disk process, which means that path reconfiguring can be performed in the backup disk process.
If the path information is the same between the primary and backup disk processes, then: a) start the disk process thread in the new primary process (there is no need to start the disk process thread in the backup process since it is already running from prior to the ownership switch); and b) call the “Driver_Brother_Up” operation, which causes the Disk Driver 309 in the primary process to recreate its link to the Disk Driver 309 in the backup process. The configuration change is then complete in the disk process and is in active use. The replies 385, 387, and 388 are responsive to the “ODR Primary” operation 381 (or other associated operations).
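The choice made after the ownership switch, based on whether the two disk processes hold the same path information, can be sketched as below. The function name `after_ownership_switch` and the use of plain strings for the actions are assumptions made for illustration; only the action names themselves come from the description.

```python
def after_ownership_switch(primary_paths, backup_paths):
    """Return the actions taken once the roles have been switched."""
    if primary_paths != backup_paths:
        # Paths differ: the backup still has to be reconfigured, so the
        # link between the two Disk Drivers is broken.
        return [
            "start disk process thread in new primary",
            "stop disk process thread in new backup",
            "Driver_Brother_Down",
        ]
    # Paths match: the backup thread is already running, so only the
    # primary thread is started and the driver link is recreated.
    return [
        "start disk process thread in new primary",
        "Driver_Brother_Up",
    ]


first_pass = after_ownership_switch({"P": "loc-2"}, {"P": "loc-1"})
second_pass = after_ownership_switch({"P": "loc-2"}, {"P": "loc-2"})
```

In this sketch the first call ends with “Driver_Brother_Down” and the second with “Driver_Brother_Up”, which corresponds to the two passes through the ODR processing.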
In the Update Database operation 389, the storage subsystem manager 312 updates the system-configuration database depending on what type of ODR processing was performed. To add a disk as described with respect to
The control process 302 then transmits an update path configuration in backup request 320 to the RAID process 306 (that can be configured in
One embodiment of the RAID disk drive reconfiguring process 200 that acts to delete a mirror disk drive as a mirror is described with respect to
In the “Begin_ODR” operation 404, the storage subsystem manager sends a message to the primary disk process, telling it to commence the Online Disk Remirroring processing. This message causes the primary disk process to perform the following steps: a) perform the “stop the disk process thread in the backup process” operation 406, and b) call the “Driver_Brother_Down” operation in 408, which causes the Disk Driver 309 in the primary process to break its link to the Disk Driver 309 in the backup process in 410. This means that the Disk Driver 309 in the primary disk process cannot, for example, send checkpoints to the Disk Driver 309 in the backup disk process. The “Begin_ODR” operation is invoked before changing the SCSI Interface Monitor configuration. Reply 414 is responsive to the “Begin_ODR” operation 404.
In the “Prepare To Delete Mirror In SCSI Interface Monitor (SIFM)” operation 418, the storage subsystem manager calls the “Predelete_Mirror” operation, which tells the SIFM that the mirror disk drive is about to be deleted. This request causes the SIFM to present the new path information from the Disk Driver 309 when it performs a path fetch, but to keep a copy of the old path information in memory for fallback purposes. The “Prepare to Delete Mirror in SIFM” operation 418 is invoked before calling the first “ODR_Primary” operation and before reconfiguring the Disk Driver 309 in the backup disk process. The reply 420 is responsive to the “Prepare to Delete Mirror in SIFM” operation 418.
In the “ODR_Reconfigure” operation 422, the storage subsystem manager tells the backup disk process to reconfigure the Disk Driver 309. The “Disk Process_Reconfigure Request” causes the backup disk process to invoke the following three steps: a) call the “Driver_Stop” operation 424, which causes the Disk Driver 309 to think that the disk process is going away, thereby causing the Disk Driver 309 to clean up configuration entries and data structures in the backup process; b) call the “Driver_Environment” operation 426, which causes the Disk Driver 309 in the backup process to retrieve the changed path information from the SIFM in the backup process in 428 and to create new data structures; and c) call the “Driver_Initialize” operation 430, which causes the Disk Driver 309 to go through setup processing. The I/O remains disabled in the backup process at this point; the disk volume is still logically down in the backup process.
Once these steps have been completed, the Disk Driver 309 in the backup disk process is ready to make use of the new path information from 428. The “ODR_Reconfigure” operation 422 is invoked as part of the first pass through the ODR processing. This is the first time Reconfigure Driver is invoked. Reply 432 is responsive to the “ODR_Reconfigure” operation 422.
In the “Prepare To Delete Mirror In SCSI Interface Monitor (SIFM)” operation 434, the storage subsystem manager tells the SIFM that the mirror disk drive is about to be deleted. This request causes the SIFM to present the new path information from the Disk Driver 309 when it performs a path fetch, but to keep a copy of the old path information in memory for fallback purposes. The “Prepare To Delete Mirror In SIFM” operation 434 is invoked before calling the first “ODR_Primary” operation and before reconfiguring the Disk Driver 309 in the backup disk process. The reply 436 is responsive to the “Prepare To Delete Mirror in SCSI Interface Monitor (SIFM)” operation 434.
In the “ODR_Primary” operation 438, the storage subsystem manager tells the primary disk process to perform the Primary_Disk action, which causes the primary disk process to switch roles with the backup disk process in 440. Following the “Switch” operation 440, the backup disk process is now the primary disk process and vice versa. During this processing, the Disk Process uses the path information to determine how to handle the special ownership-switch request using the “Handshake” operation 442 and the “Enable I/O in Primary” operation 444. The Reply 448 is responsive to the “ODR_Primary” operation 438.
If following the “ODR_Primary” operation 438, the path information is different between the primary and backup disk processes, then: a) start the disk process thread in the new primary process; b) stop the disk process thread in the new backup process; and c) call the “Driver_Brother_Down” operation. At this point: a) the new path configuration is in use in the primary disk process; and b) the Disk Driver 309 in the primary disk process is no longer in communication with the Disk Driver 309 in the backup disk process, which means that path reconfiguring can be performed in the backup disk process.
If the path information is the same between the primary and backup disk processes, then: a) start the disk process thread in the new primary process (there is no need to start the disk process thread in the backup process since it is already running from prior to the ownership switch); and b) call the “Driver_Brother_Up” operation, which causes the Disk Driver 309 in the primary process to recreate its link to the Disk Driver 309 in the backup process. This means that Driver checkpointing is activated again. The “ODR_Primary” operation 438 can be invoked after the Disk Driver 309 in the backup disk process has been reconfigured.
In the “Begin_ODR” operation 450, the storage subsystem manager sends a message to the primary disk process, telling it to commence the Online Disk Remirroring processing. This “Begin_ODR” operation 450 causes the primary disk process to perform the following steps: a) stop the disk process thread in the backup process in 452, and b) call the “Driver_Brother_Down” operation in 454, which causes the Disk Driver 309 in the primary process to break its link to the Disk Driver 309 in the backup process in 456. This means that the Disk Driver 309 in the primary disk process cannot, for example, send checkpoints to the Disk Driver 309 in the backup disk process; however, the Disk File Manager checkpointing remains active. In this second pass, the “Begin_ODR” operation 450 is invoked after changing the SCSI Interface Monitor configuration. The reply 458 is responsive to the “Begin_ODR” operation 450.
In the “Prepare To Delete Mirror In SCSI Interface Monitor (SIFM)” operation 460, the storage subsystem manager specifies to the SIFM that the mirror disk drive is about to be deleted. This request causes the SIFM to present the new path information from the Disk Driver 309 when it performs a path fetch, but to keep a copy of the old path information in memory for fallback purposes. The “Prepare To Delete Mirror In SIFM” operation 460 is invoked after calling the first “ODR_Primary” operation and before reconfiguring the Disk Driver 309 in the backup disk process. The reply 462 is responsive to the “Prepare To Delete Mirror In SIFM” operation 460.
In the “ODR_Reconfigure” operation 464, the storage subsystem manager tells the backup disk process to reconfigure the Disk Driver 309. The “ODR_Reconfigure” request causes the backup disk process to invoke the following three steps: a) call the “Driver_Stop” operation 466, which causes the Disk Driver 309 to think that the disk process is going away, thereby causing the Disk Driver 309 to clean up configuration entries and data structures in the backup process; b) call the “Driver_Environment” operation 468, which causes the Disk Driver 309 in the backup process to retrieve the changed path information from the SIFM in the backup process in 470 and to create new data structures; and c) call the “Driver_Initialize” operation 472, which causes the Disk Driver 309 to go through setup processing. The I/O remains disabled in the backup process at this point; the disk volume is still logically down in the backup process. The “ODR_Reconfigure” operation 464 is invoked as part of the second pass through the ODR processing. (This is the second time Reconfigure Driver is invoked.) The reply 476 is responsive to the “ODR_Reconfigure” operation 464.
In the “ODR_Primary” operation 480, the storage subsystem manager tells the primary disk process to perform the Primary_Disk action, which causes the primary disk process to “Switch” roles with the backup disk process in 482. Following the “Switch” operation 482, the backup disk process is now the primary disk process and vice versa. During this processing, the Disk Process uses the path information to determine how to handle the special ownership-switch request using the “Handshake” operation 484 and the “Enable I/O in Primary” operation 486. The Reply 488 is responsive to the “ODR_Primary” operation 480.
If, following the “ODR_Primary” operation 480, the path information is different between the primary and backup disk processes, then: a) start the disk process thread in the new primary process 490; b) stop the disk process thread in the new backup process; and c) call the “Driver_Brother_Down” operation. At this point: a) the new path configuration is in use in the primary disk process; and b) the Disk Driver 309 in the primary disk process is no longer in communication with the Disk Driver 309 in the backup disk process, which means that path reconfiguring can be performed in the backup disk process. If the path information is the same between the primary and backup disk processes, then: a) start the disk process thread in the new primary process (there is no need to start the disk process thread in the backup process since it is already running from prior to the ownership switch); and b) call the “Driver_Brother_Up” operation, which causes the Disk Driver 309 in the primary process to recreate its link to the Disk Driver 309 in the backup process. This means that Driver checkpointing is activated again. The “ODR_Primary” operation 480 can be invoked after the Disk Driver 309 in the backup disk process has been reconfigured. The replies 492 and 494 are responsive to the “ODR_Primary” operation 480.
In the Update Configuration Database 496, the storage subsystem manager 312 updates the system-configuration database depending on what type of ODR processing was performed. The mirror-disk path information is removed from the driver record. The storage subsystem manager then replies to the user, indicating success. The reply 498 is responsive to the Update Configuration Database 496.
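The order of requests in the Delete_Mirror flow described above can be summarized as a step table. The following Python sketch is illustrative only: the operation names and reference numerals are taken from the description, while the `run_steps` helper, the `send` callback, and the stop-at-first-failure behavior are assumptions made for this example.

```python
# (reference numeral, operation name), in the order the text describes them.
DELETE_MIRROR_STEPS = [
    (404, "Begin_ODR"),
    (418, "Prepare To Delete Mirror In SIFM"),
    (422, "ODR_Reconfigure"),        # first pass
    (434, "Prepare To Delete Mirror In SIFM"),
    (438, "ODR_Primary"),
    (450, "Begin_ODR"),
    (460, "Prepare To Delete Mirror In SIFM"),
    (464, "ODR_Reconfigure"),        # second pass
    (480, "ODR_Primary"),
    (496, "Update Configuration Database"),
]


def run_steps(steps, send):
    """Send each request in order; stop at the first one that fails.

    `send` is a hypothetical callback returning True when the reply to
    the request indicates success.
    """
    completed = []
    for numeral, name in steps:
        if not send(numeral, name):
            break
        completed.append(numeral)
    return completed


all_done = run_steps(DELETE_MIRROR_STEPS, lambda numeral, name: True)
stopped_early = run_steps(DELETE_MIRROR_STEPS, lambda numeral, name: numeral != 438)
```

Keeping the sequence as data makes the two passes easy to see: each pass contains one “ODR_Reconfigure” request followed by one “ODR_Primary” request.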
III. Reconfiguring Operations
The RAID disk drive reconfiguring process 200 can be implemented as a process pair that runs in different processes. In one embodiment, the path configuration associated with the RAID disk drive reconfiguring process 200 is kept in system tables replicated to the processes in which the disk process pairs are running. In one embodiment, the configuration information associated with the RAID disk drive reconfiguring process 200 is maintained in separate system database tables that are managed by a separate program. In another embodiment, it is not necessary to maintain the configuration information that is associated with the RAID disk drive reconfiguring process 200 in separate system database tables; instead, the information is maintained within a single system table as a process pair. It is not necessary to have a separate process managing the overall procedure from the process that is running the RAID disk drive reconfiguring process 200 (even though multiple distinct managing and running processes are useful) since a single procedure is handled within the process pair itself.
In one embodiment of the ODR mechanism 100 of the present disclosure, reconfiguring might not be allowed when the disk volume 112 is in a transitional state such as during a disk revive. If the user has an overriding reason to move the disk drive while it is in a transitional state, then the user can temporarily halt the revive or other transitional activity, after which the disk drive is moved. The disk revive is restarted by the operator once the disk drive has been moved.
Many RAID disk drive reconfiguring processes 200 use the ODR mechanism 100 to move at least one disk drive. For example, swapping a primary disk drive with a mirror disk drive as described with respect to
During one embodiment of moving the primary disk drive that is associated with moving the mirror disk drive as shown in TABLE 1, the operator can follow the procedure described in TABLE 2.
There are a number of operations that are described within this disclosure that are used to reconfigure disk drives and thereby provide the RAID disk drive reconfiguring process 200 functionality to the ODR mechanism 100. These operations are intended to be illustrative and not limiting.
A certain number of embodiments of the ODR mechanism 100 provide the capability to power on or off a disk drive from the system configuration facility. The “Power_Off” operation and the “Power_On” operation provide for the capability to perform a consistency check between the different storage components with regard to configuration information. The “Control Disk” operation accepts a Power {OFF|ON} attribute, which provides the “Power_Off” and “Power_On” operations. The Power {OFF|ON} attribute is used to power off or on a disk drive that is in a Stopped state.
In certain embodiments, the “Alter Disk” operation can include a “Swap_Mirror” operation and a “Mirror_Location” operation. The “Swap_Mirror” operation attribute is used to switch roles between the primary disk drive and mirror disk drives of a mirror disk volume 112. The “Mirror_Location” operation attribute is used to add a disk drive to a non-mirrored disk volume 112 thereby creating a mirrored disk volume. The “Mirror_Location” operation attribute is allowed even if the disk volume 112 is in a Started state.
A “Delete Disk” operation acts to delete a mirror disk drive (even if the primary disk drive is in a Started state). To use the “Delete Disk” operation, the mirror disk drive transitions to a Stopped state.
In one embodiment of the “Status Disk” operation, a Consistency operation attribute is added that is used to validate that the configuration information is equal in the system-configuration database, the disk process, and the SCSI Interface Monitor system tables.
Both the primary disk process and the backup disk process can do I/O to both the primary disk drive and the mirror disk drives. The backup disk process is configured to take over from the primary disk process at any point during processing. To change the path configuration of a disk drive, the steps shown in TABLE 3 are taken.
To change the mirror disk drive-related configuration of a disk volume 112, the user can first put the disk volume into a stopped state using the “Stop Disk” operation. The “Stop Disk” operation causes the ODR mechanism to be inaccessible to user processes. The stopped state is used for mirror disk drive-related configuration changes to be accepted. If the user does not put the disk volume 112 into a stopped summary state, then an error is generated. If the user wishes to delete the mirror disk drive of a mirror disk volume, the system configuration facility “Delete Disk” operation is used.
If the user wants to add a mirror disk drive to an unmirrored disk volume 112, the system configuration facility “Alter Disk” operation is used, specifying the location of the mirror disk drive. For internal disks, the storage subsystem manager can derive that the mirror-backup path is the same as the mirror path.
If the user wants to move the primary disk drive, the disk volume 112 can first be deleted from the system configuration using the system configuration facility “Delete Disk” operation and then be added back to the system configuration using the system configuration facility Add Disk operation.
If the user wants to switch the roles between the primary disk drive and the mirror disk drive of the disk volume 112, the disk volume can first be deleted from the system configuration using the system configuration facility “Delete Disk” operation and then be added back to the system configuration using the system configuration facility Add Disk operation.
Once the configuration change is complete, the user places the disk volume 112 in a started summary state using the system configuration facility “Start Disk” operation. If a mirror disk drive has been added to the disk volume 112, the user is asked whether a disk revive should be started using the “Start Disk” operation.
If a mirror disk drive has been added to the disk volume 112 but there is no physical disk drive present in the system, then a “Start Disk” warning is generated. When the disk drive is inserted, the disk revive is started automatically if the “Autorevive” setting is set to ON and the disk volume 112 consists of internal disks. If either of these two conditions is false, then the operator can issue a “Start Disk” operation to cause the disk revive to be started. Certain embodiments of the present disclosure provide for certain automatic capabilities, in which automatic action is provided using, e.g., disk inserts.
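The two conditions that govern whether the disk revive starts on its own when the disk drive is inserted can be written as a small decision function. This is only a sketch; the function name `on_disk_inserted`, its parameters, and the returned strings are hypothetical.

```python
def on_disk_inserted(autorevive_on, internal_disks):
    """Decide what happens when the missing mirror disk drive is inserted.

    The revive starts automatically only if the "Autorevive" setting is ON
    and the disk volume consists of internal disks; otherwise the operator
    has to issue the "Start Disk" operation.
    """
    if autorevive_on and internal_disks:
        return "start disk revive automatically"
    return "wait for operator to issue Start Disk"


automatic = on_disk_inserted(autorevive_on=True, internal_disks=True)
manual = on_disk_inserted(autorevive_on=True, internal_disks=False)
```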
One embodiment of the storage subsystem manager provides processing that supports non-stop process management, multitasking, and the processing of blocking calls in a standalone environment. Therefore, the storage subsystem manager itself is focused on processing different events, such as programmatic command buffer operations and different system events such as process reload.
IV. “Alter Disk” Processing
Modifying one, or several of, the disk(s) 114 of the disk array as described with respect to
In the case described with respect to TABLE 4, the mirror disk drive-location attributes are not changed while the disk volume 112 is in a started state. Furthermore, the only path attribute that is specified is VAL_DEVICEPATH_NONE. Once all the attributes have been validated, the storage subsystem manager generates an alter-audit Event Management System event and passes the processing to the “Phys add_alter” operation in the storage subsystem manager. This is done since much of the processing used to process the “Alter Disk” operation involves invoking blocking function calls. The “Phys add_alter_” operation performs the actions shown in TABLE 5.
When receiving an SPI buffer containing an object-type token with the value of Obj_Disk and an operation-type token with the value of CMD_Delete, the storage subsystem manager checks whether: a) the forced token is provided, and if so validates the value; b) a path token is provided, and if so validates the value; and c) the pool token is provided, and if so validates the value.
One embodiment of the standard processing that is applied for all operations within this disclosure is shown in TABLE 6.
Certain embodiments of the storage subsystem manager support automated actions in response to disk-insertion events, which are currently limited to internal disk drives. There are a number of illustrative settings that can control automation as described with respect to TABLE 7.
There are several techniques to control whether the automated actions take place. A prescribed operation can turn off certain settings. Alternatively, each disk-volume configuration record contains the above settings, thereby allowing the operator to turn the automatic actions on or off on a per-disk-volume level. In yet another embodiment, an object controls the default settings for all attributes when a disk volume 112 is added. These automated actions can also be invoked during system startup as part of discovery processing.
In one embodiment, the storage subsystem manager code verifies the operation buffer. Additionally, the buffer is forwarded to one of multiple processes in the storage subsystem manager. This code practice causes a problem for the Online Disk Remirroring (ODR) since the ODR involves many steps to execute the operation, which can cause database corruption if a process outage or a software defect causes a storage subsystem manager process to stop.
The remedy for this situation is that the mirror-related portion of the processing is moved, and the use of the storage subsystem manager process is limited to deal with calls to specific blocking functions only. Thus, the programming paradigm for the “Alter Disk” operation is as shown in TABLE 8.
Whether to process the mirror-related attributes first or last is a matter of design preference. Given that calls to blocking functions are made by the storage subsystem helper processes, it is impossible for the storage subsystem manager itself to know whether a system-configuration database request was performed if the storage subsystem manager fails before the result of the request is returned. A server process should handle duplicate requests using the opener table; that is, it should keep track of the n latest requests on a per-opener level. However, given that the Application Programming Interface (API) of the Configuration Services is implemented as a connection-less API, no opener-table processing has to be done.
Therefore, the process can perform a pre-check before commencing the mirror disk drive reconfiguring, to ensure that there are no records for the target disk drive location. The new mirror disk drive is addressed by location in the system configuration facility “Alter Disk” operation. The process can ignore any record-does-not-exist errors when deleting records and any record-already-exists errors when doing inserts.
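Because a request may be repeated after a failure, the pre-check and the tolerant insert and delete described above can be sketched as follows. This is a minimal illustration under stated assumptions: the in-memory `ConfigDb` class, the two exception types, and the helper names are hypothetical and do not represent the system-configuration database API.

```python
class RecordExists(Exception):
    """Raised by the sketch database when inserting a duplicate key."""


class RecordMissing(Exception):
    """Raised by the sketch database when deleting an unknown key."""


class ConfigDb:
    """Hypothetical in-memory stand-in for the configuration database."""

    def __init__(self):
        self.records = {}

    def insert(self, key, value):
        if key in self.records:
            raise RecordExists(key)
        self.records[key] = value

    def delete(self, key):
        if key not in self.records:
            raise RecordMissing(key)
        del self.records[key]


def location_is_free(db, location):
    # Pre-check: no record may exist for the target disk drive location.
    return location not in db.records


def insert_ignoring_duplicates(db, key, value):
    # A repeated insert is treated as already done.
    try:
        db.insert(key, value)
    except RecordExists:
        pass


def delete_ignoring_missing(db, key):
    # A repeated delete is treated as already done.
    try:
        db.delete(key)
    except RecordMissing:
        pass


db = ConfigDb()
free_before = location_is_free(db, "loc-7")
insert_ignoring_duplicates(db, "loc-7", "mirror")
insert_ignoring_duplicates(db, "loc-7", "mirror")   # retry is harmless
after_insert = dict(db.records)
delete_ignoring_missing(db, "loc-7")
delete_ignoring_missing(db, "loc-7")                # retry is harmless
```

Ignoring these two error classes makes each database step safe to repeat, which is what allows the operation to be re-driven when it is unknown whether an earlier request was performed.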
Other operations need to be blocked because a race condition can occur when processing the system configuration facility “Alter Disk” operation, especially when mirroring an unmirrored disk or when changing disk drives in a mirrored disk volume 112: another operation can also use the free location. To address this race condition, the storage subsystem manager can pre-configure the free location as soon as the operation buffer has been validated, thus prohibiting other operations from making use of the location.
This pre-configured alternate-key record is treated as follows: a) if the remirroring operation succeeds, the pre-configured alt-key record is made a permanent configuration record; b) if the remirroring operation fails, or if there is a system failure in the midst of processing the remirroring operation, the pre-configured alt-key record is deleted.
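The life cycle of the pre-configured record can be sketched as a small reservation table. The `LocationTable` class, its method names, and the `"pending"`/`"permanent"` markers are assumptions introduced for illustration only.

```python
class LocationTable:
    """Hypothetical table of disk drive locations and their record state."""

    def __init__(self):
        self.records = {}   # location -> "pending" or "permanent"

    def reserve(self, location):
        # Pre-configure the free location so that no other operation
        # can claim it while the remirroring is in progress.
        if location in self.records:
            return False
        self.records[location] = "pending"
        return True

    def finish(self, location, succeeded):
        if succeeded:
            # Success: the pre-configured record becomes permanent.
            self.records[location] = "permanent"
        else:
            # Failure: the pre-configured record is deleted.
            self.records.pop(location, None)

    def recover_after_failure(self):
        # After a system failure, any record still pending is deleted.
        for location in [k for k, v in self.records.items() if v == "pending"]:
            del self.records[location]


table = LocationTable()
first_claim = table.reserve("loc-9")
second_claim = table.reserve("loc-9")   # rejected: already reserved
table.finish("loc-9", succeeded=True)
```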
Another concern in this implementation is that operations such as the Primary_Disk and the Info_Disk and Label operations should be blocked while the configuration change is pending. This should be addressed by ensuring that the task table is scanned for other tasks manipulating the object. The storage subsystem manager is capable of checking whether a task for the specific object exists thereby allowing the storage subsystem manager to reject another request to do work for that task.
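A task-table check of this kind can be sketched as follows. The `TaskTable` class, its methods, and the object name `$DATA01` are hypothetical and serve only to show how a second request against the same object could be rejected while a configuration change is pending.

```python
class TaskTable:
    """Hypothetical table of objects that currently have a task running."""

    def __init__(self):
        self._tasks = {}   # object name -> operation in progress

    def begin(self, obj, operation):
        # Scan for an existing task on the object; reject if one exists.
        if obj in self._tasks:
            return False
        self._tasks[obj] = operation
        return True

    def end(self, obj):
        # Remove the task once the configuration change has completed.
        self._tasks.pop(obj, None)


tasks = TaskTable()
alter_accepted = tasks.begin("$DATA01", "Alter Disk")
primary_accepted = tasks.begin("$DATA01", "Primary_Disk")   # blocked
tasks.end("$DATA01")
retry_accepted = tasks.begin("$DATA01", "Primary_Disk")
```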
Certain disk drives should not be configured automatically within the ODR mechanism 100 since they are intended to be used as part of a pool of spare disk drives. This pool of spare disk drives can allow the user to configure the system to automatically replace a failed disk drive with one of the spare disk drives by performing an automated ODR mechanism that starts a disk revive. This feature is known as “Disk Sparing”. The system configuration facility POOL object is a collection of software resources as shown in TABLE 9.
Providing the disk drive pool allows handling of both automatic configuration and automated disk sparing. If the location is in the disk drive pool, then the disk drive in that location cannot be configured automatically.
The user interface can modify the behavior of the system configuration facility “Alter Disk” and the “Delete Disk” operations as a result of the ODR mechanism. The “Control Disk” operation is enhanced to allow the operator to power disk drives on or off. A Consistency option is added to the Status Disk operation to allow the operator to display whether the configuration information is equal between the system-configuration database, the two SCSI Interface Monitor processes, and the two disk processes that make up the disk process pair.
Mirroring-related disk drive configuration changes are made online; that is, while the disk volume 112 is in a Started state. The reconfiguring action is not allowed when the disk volume 112 is in a transitional state (for example, during a disk revive). If the user has a reason to move the disk drive and a disk revive is in progress, then the user can stop the revive after which the disk drive is moved. The disk revive is restarted by the operator once the disk drive has been moved.
The “Alter Disk” operation has many different options where the Mirror_Location attribute applies to the online disk remirroring. Therefore, the “Alter Disk” operation is changed to allow the Mirror_Location attribute to be executed even if one of the disk paths is in the Started state. In addition, a “Swap_Mirror” attribute is added to the “Alter Disk” operation.
The Mirror_Location attribute of the “Alter Disk” operation is used to add the location of the mirror disk drive in the disk volume 112. For brevity, the Mirror_Location attribute is used in this document to represent the MBACKUPLOCATION, MBACKUPSAC, Mirror_Location, and MIRRORSAC attributes, which all affect the definition of the mirror disk drive. Consider that in certain embodiments: a) the user is given the option to start a disk revive once the reconfiguring is completed; b) the Mirror_Location attributes are not specified if the “Swap_Mirror” attribute is specified; and c) this operation is allowed when all the paths are in a Started state (online disk remirroring) or in a Stopped state (offline disk remirroring).
The “Swap_Mirror” attribute of the “Alter Disk” operation can cause the two disk drives of a mirrored disk volume 112 to switch roles; that is, the primary disk drive becomes the mirror disk drive and vice versa. Consider that in certain embodiments: a) the disk volume 112 should be mirrored for this operation to work; b) both disk drives should have the same number of paths; that is, both disk drives have, or do not have, backup paths; c) the Mirror_Location attribute is not specified if the “Swap_Mirror” attribute is specified; d) for offline disk remirroring, all configured disk paths should be in a Stopped state; and e) for online disk remirroring, this operation is allowed when the disk paths are in specific states only. TABLE 10 shows illustrative legal states.
In certain embodiments: the user can specify only a P path or an M path, and all paths for the disk drive (that include the Primary path (P), the Backup path (B), the Mirror path (M), and the Mirror-Backup path (MB)). The disk drive should be stopped before one of these operation options is used.
The “Delete Disk” operation is changed to allow the mirror disk drive to be deleted while the disk volume 112 is in a Started state. The operation can continue to use the disk volume 112 in a Stopped state if the whole disk volume is to be deleted.
To delete the mirror disk drive, the user can issue the “Delete Disk” operation, thereby indicating that the mirror disk drive is to be removed from the configuration. Consider that in certain embodiments: a) all paths to the mirror disk drive (-M and -MB) should be in the Stopped state before this operation is used; b) if the primary disk drive is to be deleted, then the roles between the disk drives should first be switched using the “Swap_Mirror” attribute of the “Alter Disk” operation; c) for offline disk remirroring, all configured disk paths should be in a Stopped state; and d) for online disk remirroring, the -P and -B paths should be in a Started state while the -M and -MB paths should be in a Stopped state.
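The path-state conditions for deleting the mirror disk drive can be checked with a short function. This is a sketch only: the function name `delete_mirror_mode`, the dictionary representation of path states, and the returned mode strings are assumptions, and paths that are not configured are simply absent from the dictionary.

```python
def delete_mirror_mode(states):
    """Classify a mirror delete as "offline", "online", or not allowed (None).

    `states` maps each configured path ("P", "B", "M", "MB") to either
    "Started" or "Stopped".
    """
    if not states:
        return None
    # Offline disk remirroring: every configured path is Stopped.
    if all(state == "Stopped" for state in states.values()):
        return "offline"
    primary_side = [states[p] for p in ("P", "B") if p in states]
    mirror_side = [states[p] for p in ("M", "MB") if p in states]
    # Online disk remirroring: -P and -B Started, -M and -MB Stopped.
    if (primary_side and mirror_side
            and all(s == "Started" for s in primary_side)
            and all(s == "Stopped" for s in mirror_side)):
        return "online"
    return None


online = delete_mirror_mode(
    {"P": "Started", "B": "Started", "M": "Stopped", "MB": "Stopped"})
offline = delete_mirror_mode({"P": "Stopped", "M": "Stopped"})
rejected = delete_mirror_mode({"P": "Started", "M": "Started"})
```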
The Status Disk operation is amended with a Consistency option, which is used to verify whether the path configuration stored in the system-configuration database is equal to the path configuration used by the two disk processes (of the disk-volume process pair) and the two SCSI Interface Monitor processes.
The Status Disk, Consistency operation can be issued at any time and returns the configuration information for the configured paths. In addition, information is displayed if there is a difference between the configuration information and the information from one of the other sources; that is, the disk process or the SCSI Interface Monitor.
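A consistency comparison of this kind can be sketched as a function that reports, per source, the paths whose configuration differs from the system-configuration database. The function name `consistency_report`, the dictionary layout, and the source labels are hypothetical.

```python
def consistency_report(database_paths, other_sources):
    """Return {source: {path: (database value, source value)}} for mismatches.

    `database_paths` is the path configuration from the system-configuration
    database; `other_sources` maps a source label (a disk process or a SCSI
    Interface Monitor process) to the path configuration it is using.
    """
    differences = {}
    for source, paths in other_sources.items():
        diff = {}
        for name in sorted(set(database_paths) | set(paths)):
            if database_paths.get(name) != paths.get(name):
                diff[name] = (database_paths.get(name), paths.get(name))
        if diff:
            differences[source] = diff
    return differences


database = {"P": "loc-1", "M": "loc-2"}
sources = {
    "disk process (primary)": {"P": "loc-1", "M": "loc-2"},
    "disk process (backup)": {"P": "loc-1", "M": "loc-2"},
    "SIFM (primary)": {"P": "loc-1", "M": "loc-2"},
    "SIFM (backup)": {"P": "loc-1"},
}
report = consistency_report(database, sources)
```

An empty result means all five sources agree; in the example only the backup SCSI Interface Monitor entry differs, because it lacks the mirror path.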
Users can initiate configuration changes and may need to respond to prompts. These illustrative embodiments assume that the user is running the system configuration facility in interactive mode (e.g., entering operations at the system configuration facility prompt), which may even be integrated within a Graphical User Interface (GUI) as is generally understood in computer technologies.
The “Autorevive” setting does not have any meaning when doing an Online Disk Remirroring since that setting requires that a disk-insertion event is sent to a storage subsystem manager process. This event triggers the storage subsystem manager to check this setting. In one embodiment, if the new disk drive is in place when the “Alter Disk” operation is given, then no insertion event is generated. If the new disk drive is not in place when the “Alter Disk” operation is given, then a storage warning is generated.
If the disk drive is inserted at a later time, then the disk revive is started automatically if the disk drive is an internal disk drive, and the “Autorevive” setting for the disk volume 112 is set to ON. If not, then the operator should start the disk revive manually using the “Start Disk” operation.
Errors and Error Recovery
The ODR mechanism 100 as described with respect to
In one embodiment, after the first process and the second process have both been completed, the output values of the first and the second process should be compared to each other. In those instances where the second process mirrors the first process (that is, it runs an identical process on identical input data), the output data of the first process should be identical to that of the second process. If the output of the first process does not match the output of the second process in these instances, then it should be concluded that an error occurred in at least one of the processes. The first output and the second output values can then be analyzed to determine where the error occurred so that it can be corrected.
When an error is detected, the RAID system 102 thereby determines whether the data state in the disk array 104 as described with respect to
A number of events may be generated by the disk process file management during the RAID disk drive reconfiguring process 200 within the ODR mechanism 100. The events include both an ODR internal error indicator and an ODR recovery error indicator. ODR error messages that relate to ODR failures within the ODR mechanism 100 are provided to the user. One embodiment of a generalized ODR error within the ODR mechanism 100 is described in TABLE 11.
In which:
Consider that the ODR-step includes, e.g., begin, reconfigure, or primary. The ODR-error-detail provides an error detail for an ODR-error. The ODR-processor represents a processor number of the pair that was processing ODR work within the ODR mechanism. The ODR-process-mode describes whether the reporting process was acting as the primary process, or the backup process.
The generalized ODR error shown in TABLE 11 is caused by an error occurring during some aspect of the RAID disk drive reconfiguring process 200. In one embodiment, the RAID disk drive reconfiguring process 200 within the ODR mechanism 100 may fail if a retry is not successful. Reliable error handling enhances the reliability of the ODR mechanism 100. In general, no recovery is needed if there are no associated failures. If the ODR error is due to a soft-down disk process, the ODR mechanism 100 issues a manual abort of the operation. If the ODR error results from, e.g., a processor failure, the ODR_State within the ODR mechanism 100 is checked to determine whether the operation succeeded or failed. If necessary, the ODR mechanism 100 can re-issue the operation after reloading the failed processor. One embodiment of the recovery for an ODR error is described in TABLE 12.
in which:
Relative to TABLE 12, the ODR-step represents begin, reconfigure, or primary. The ODR-error represents an error. The ODR-error-detail represents the error detail. The ODR-processor includes the processor number of the pair that was processing ODR work within the ODR mechanism. The ODR-process-mode represents that the reporting process was acting as the primary process (or backup).
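For illustration, the fields of such an ODR error event could be carried in a small record such as the one sketched below. The class name, the field types, and the message layout are assumptions; only the field meanings follow the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OdrErrorEvent:
    odr_step: str          # e.g. "begin", "reconfigure", or "primary"
    odr_error: int         # the error (integer type is an assumption)
    odr_error_detail: int  # the error detail (integer type is an assumption)
    odr_processor: int     # processor number of the pair processing ODR work
    odr_process_mode: str  # "primary" or "backup"

    def __post_init__(self):
        # The reporting process acts as either the primary or the backup.
        if self.odr_process_mode not in ("primary", "backup"):
            raise ValueError(f"unknown process mode: {self.odr_process_mode}")

    def describe(self):
        """Format the event as a one-line message (hypothetical layout)."""
        return (
            f"ODR error {self.odr_error} (detail {self.odr_error_detail}) "
            f"in step {self.odr_step}, processor {self.odr_processor}, "
            f"{self.odr_process_mode} process"
        )
```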
The ODR recovery error indicates the cause of an automatic or manual ODR operation recovery attempt failure. The effect of the ODR recovery error is that the ODR recovery attempt may fail within the ODR mechanism 100 if the retry is not successful. To recover from an ODR recovery error caused by a processor failure, check the ODR_State within the ODR mechanism 100 to determine whether the operation succeeded or failed. If the operation failed, re-issue the operation after reloading the failed processor, if needed. An off-line reconfiguration may be used to recover via the system configuration facility using the STOP, RESET, ALTER, and START operations as described in this disclosure.
The interfaces provided in TABLE 13 are used by the disk process file manager and the disk process driver components during the RAID disk drive reconfiguring process 200 as performed by the ODR mechanism 100. These operations, and modified versions of them, may also be used during the error detection and error recovery procedures described in this disclosure.
The Driver_Stop_ODR operation acts to halt the RAID disk drive reconfiguring process 200. With the Driver_Stop_Normal operation, the disk process file management does not actually halt the RAID disk drive reconfiguring process 200, and can continue providing checkpoints from the primary disk drive to the backup processor.
Once the “Driver_Brother_Down” operation is called, input/output cannot be provided via the backup. This means that the volume may go down if there is path loss in the current primary processor. The down state is sent to the backup via checkpoint by the disk process file management.
There are a variety of inter-process communication messages that are associated with certain embodiments of the RAID disk drive reconfiguring process 200. A message interface is used to transfer the inter-process communication messages between the system configuration facility/storage subsystem manager and the disk process file management to control the RAID disk drive reconfiguring process 200 within the ODR mechanism 100. The message definition of each inter-process communication message is maintained in a database that is maintained by the system configuration facility/storage subsystem manager product. The illustrative fields of TABLE 14 for the inter-process communication messages are included in the ODR request structure.
The ODR_Reason operation provides an indication of the configuration change that is under way. The ODR_Target operation ensures the correct ODR_Target is selected (thereby limiting switching within the process).
The ODR_State, which is maintained by the system configuration facility/storage subsystem manager, keeps track of the processor that is being reconfigured. In one embodiment, an ODR_State of Reconfig0 means that the first processor is being worked on. An ODR_State of Reconfig1 or RECOVERY means the second processor is being worked on (with the first one being complete).
The use of abort state can result in the use of a full backup process stop and restart instead of the current driver stop and restart. This depends upon the current state of the primary process and backup processes. Illustrative actions performed for each RAID disk drive reconfiguring process 200 (similar to those described with respect to
If the value of the ODR_State as described in TABLE 15 is greater than zero, there is a presumption that any error can leave the volume in an inconsistent state. An inconsistent state is defined to be where the processes in each processor are running with different configurations. The system configuration facility can automatically attempt to resolve the inconsistency with a recovery attempt, when possible. Exceptions to this are processor failures where the remaining process is running the old configuration (presume abort) or the new configuration (presume success).
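The role of the ODR_State can be sketched as an enumeration. Because TABLE 15 is not reproduced here, the numeric values and the name of the zero state below are placeholders chosen only so that the "greater than zero" test can be demonstrated; they are assumptions, not the values of any embodiment.

```python
from enum import IntEnum


class OdrState(IntEnum):
    # Placeholder values; the actual values are those of TABLE 15.
    NONE = 0
    RECONFIG0 = 1
    RECONFIG1 = 2
    RECOVERY = 3


def processor_being_worked_on(state):
    """Map an ODR_State to the processor currently being reconfigured."""
    if state == OdrState.RECONFIG0:
        return "first"
    if state in (OdrState.RECONFIG1, OdrState.RECOVERY):
        return "second"  # the first processor is already complete
    return None


def presume_inconsistent_on_error(state):
    """Any error with ODR_State greater than zero may leave the volume inconsistent."""
    return state > 0
```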
One embodiment of the ODR reply structure that relates to the disk process (DP) includes the fields shown in TABLE 16.
The recovery action used depends upon the state of the disk drive volume. The following table provides the recovery action for each step in the RAID disk drive reconfiguring process 200. Any retry or delay and retry return should be counted, and a minimum number of retries should be attempted.
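Counting retries and attempting a minimum number of them might be implemented along the lines of the following sketch. The return values "ok", "retry", and "delay_and_retry", the default of three retries, and the function name are all assumptions made for the example.

```python
import time


def attempt_with_retries(step, min_retries=3, delay_seconds=0.0, sleep=time.sleep):
    """Call step() until it returns "ok" or the retry budget is used up.

    step() returns "ok", "retry", or "delay_and_retry" (hypothetical values).
    Returns a tuple (succeeded, retries_counted).
    """
    retries = 0
    while True:
        result = step()
        if result == "ok":
            return True, retries
        if retries >= min_retries:
            return False, retries
        if result == "delay_and_retry":
            sleep(delay_seconds)  # wait before the next attempt
        retries += 1  # every retry or delay-and-retry return is counted
```

The `sleep` parameter is injectable so that the delay can be replaced in tests without waiting.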
Any error related to processor or soft-down failure of either the primary or backup can cancel the RAID disk drive reconfiguring process 200. For processor failure, the current configuration state in use by the remaining processor can determine if the RAID disk drive reconfiguring process 200 was successful or aborted.
If an error occurs during reconfigure, the storage subsystem manager should perform an “ODR_Primary” operation to ensure cleanup of the disk process. This does not result in any processor switch, since the disk process retains the failure state and avoids any processor switch. An error should be returned to the primary request, without any retry; or manually by some system configuration facility operation. TABLE 17 provides one illustrative embodiment of a reconfiguring recovery action for an ODR mechanism.
The ODR recovery process is performed automatically by the storage subsystem manager. If the ODR recovery request is sent to the disk process, the following steps are performed: a) stop the backup process (e.g., using the “Begin_ODR” operation); b) restart the backup process (e.g., using the “ODR_Reconfigure” operation sent to the primary); and c) switch to the backup process (e.g., using the “ODR_Primary” operation). In one embodiment, the correct configuration state is established by returning the configuration state back to the volume in the processor prior to performing the recovery operation. Any failure cancels the recovery operation, and the cancellation is reported to the user.
The ODR Recovery request is sent to the disk process in the current primary processor, and it can operate on the disk process in the current backup processor. The ODR Recovery request restores an inconsistent backup processor process to a state that is consistent with the current primary processor process. The correct configuration may be new or old. The abort operation restores the full non-stop function. The correct configuration is that configuration currently in use in the primary processor process. Errors that use an abort process are not expected, since most errors are the result of processor or soft-down failures. A soft-down failure is an internal software failure that removes the thread(s) from service and forces an unplanned switch to the backup threads.
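The three recovery steps and the cancel-on-failure behavior can be sketched as follows. The callables `send_request` and `report_to_user`, and the wording of the report, are hypothetical stand-ins for the message interface between the storage subsystem manager and the disk process.

```python
# Operation names as given in the steps above.
RECOVERY_STEPS = (
    "Begin_ODR",        # a) stop the backup process
    "ODR_Reconfigure",  # b) restart the backup process (sent to the primary)
    "ODR_Primary",      # c) switch to the backup process
)


def run_recovery(send_request, report_to_user):
    """Run the recovery steps in order; a failure cancels the remaining steps.

    send_request(step) returns True on success (hypothetical interface).
    """
    for step in RECOVERY_STEPS:
        if not send_request(step):
            report_to_user(f"ODR recovery cancelled at {step}")
            return False
    return True
```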
Any processor failures that occur during the RAID disk drive reconfiguring process 200 may be related to the attempt to perform the ODR operation. Upon such failures, the RAID disk drive reconfiguring process 200 is declared successful if the volume is successfully loaded using the new configuration. The RAID disk drive reconfiguring process 200 is aborted if the volume uses the old configuration. If the volume is in a down state after a processor failure, the recommended action may be to reset and restart the volume, after reverting to the prior configuration. Internally detected failures can result in a soft-down disk process member. Once this occurs, the ODR processing is discontinued. The soft down state may or may not be related to the ODR process.
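As a sketch under stated assumptions, the outcome after a processor failure can be derived from the state of the volume and the configuration it is using. The string labels for the volume state and configuration are hypothetical.

```python
def outcome_after_processor_failure(volume_state, config_in_use=None):
    """Classify the reconfiguring process after a processor failure.

    volume_state:  "up" or "down" (hypothetical labels).
    config_in_use: "new" or "old"; ignored when the volume is down.
    """
    if volume_state == "down":
        # Recommended action: revert to the prior configuration,
        # then reset and restart the volume.
        return "reset and restart after reverting"
    if config_in_use == "new":
        return "successful"
    if config_in_use == "old":
        return "aborted"
    raise ValueError(f"unknown configuration: {config_in_use}")
```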
One embodiment of aborting is to support an ODR abort request (that can shut down the processes and restart them) as needed. The system configuration facility/storage subsystem manager should perform the abort processes in both processors (both processors are stopped and restarted). This can avoid the need to determine which processors are soft down, and which have the old or new configuration.
If the driver does not checkpoint the UP state, and the primary moves to a soft down state, the volume is moved to a down state. This continues to use an abort to stop and re-start the processes. It may be possible to use the RESET and the START operation to recover for this case, since the volume is in a down state. A test program is used to verify the disk process file manager and driver functions.
The processing of the different system events (one embodiment shown in TABLE 18) is specific to each event and is not directly relevant to the ODR processing design. Therefore, this processing is not described any further in this document. The processing of SPI operations differs depending on what operation was received but there is a common method that is used for most if not all operations.
This portion of the disclosure relates errors and error recovery more particularly to system facilities. A variety of system configuration facility errors that might be generated within the ODR mechanism 100 are now described. The cause of the “Disk Not Mirrored” error is that the operation is not supported for a non-mirrored disk volume 112. The effect of the “Disk Not Mirrored” error is that the operation is not executed, and the system configuration facility waits for the next operation. The recommended action for the “Disk Not Mirrored” error is to not perform this operation on the specified disk.
In certain embodiments, the error text is adjusted depending on the source of the error and the effect of the error. The cause of this message is that an error occurred when changing the configuration. The effect of the error is that the configuration change did not occur. The “Configuration Status” indicates whether the configuration information is consistent in the processes in which the disk process is running. One recommended action is to determine whether to retry the operation.
Once the error-detail portion of the error message has been addressed, the user can either reissue the operation indicating the desired new configuration, or reissue the operation indicating the previous configuration.
In the “disk not mirrored” error, the configuration change did succeed (i.e., the desired configuration is in use). However, the configuration information in the process that contained the original primary disk process when the operation was entered was not updated properly. The cause of the “Disk Not Mirrored” error is that an error occurred when changing the configuration. With the “Disk Not Mirrored” error, the configuration was changed but in one process only. The error message indicates which subsystem encountered the error, what the error was, and which process was affected by the error. A recommended action of the “Disk Not Mirrored” error is to make the configuration change which caused the configuration information in the indicated process to become invalid. Because of this, fault tolerance might be lost. For example, the backup disk process (the original primary process) may no longer be able to access the mirror disk drive.
Depending on the error indicated in the error- and error-detail portion of the message, the user can reissue the operation indicating the desired new configuration, or reissue the operation indicating the previous configuration. Using the original configuration can cause the storage subsystem manager to attempt to undo the configuration back to the original setting.
One example of an “Alter Disk” error may state: “Wrong state for DISK State: Started”. This error can result from attempting to perform an action that is illegal for the specified disk volume 112. The effect of this error is that the operation is not executed. The system configuration facility waits for the next operation.
There are a variety of embodiments of implementation details to the RAID disk drive reconfiguring process 200 that may include all or some of the following:
In one embodiment, the disk power off and disk power on processing is implemented as an extension of the current “Control Disk” operation. The power-related operations are supported for internal disks only since there is no way to tell an external disk drive to power on or off. The power-related operations for internal disks are handled by the Service Process. The power-related attributes of the “Control Disk” operation are handled in this manner:
One consideration of the RAID mechanism design relates to determining how errors are handled. There are situations where the configuration information in the storage subsystem manager, the disk process, the driver, and the SCSI Interface Monitor (SIFM) can become out-of-sync.
Any ODR mechanism 100 and/or RAID disk drive reconfiguring process 200 design should be configured to expect that such an out-of-sync situation might occur, and be able to clean up any out-of-sync situation. When an out-of-sync situation occurs, the backup disk process should go into a soft-down state, allowing it to reply with a configuration-is-inconsistent error. (An Event Management System event should be generated the first time this is detected). To address this error, the user should delete the configuration and then add the disk back into the system-configuration database.
The primary disk process does not go into a soft-down state if it encounters a configuration-is-inconsistent error due to an Online Disk Remirroring (ODR) action. Therefore, continued application access is provided.
The ODR mechanism 100 should be able to handle out-of-sync replies. When a configuration inconsistency is detected, it is desired that the backup disk process goes into a soft-down state and replies with a configuration-is-inconsistent error. The backup disk process, its associated Driver, and the SIFM all have to reply with a configuration-is-inconsistent error if a configuration inconsistency is detected; they cannot call halt. The backup disk process should generate an Event Management System event when the error is detected and go into a soft-down state (which indicates that the backup disk process can no longer do anything but reply with the configuration-is-inconsistent error, reply to a device-info request, and accept a stop message).
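The restricted behavior of a soft-down backup disk process can be illustrated with the small state sketch below. The class, the request names, and the reply strings are hypothetical; the sketch models only the rules stated above: an event on first detection, and a limited set of replies once the process is soft-down.

```python
class BackupDiskProcess:
    """Toy model of a backup disk process that can enter a soft-down state."""

    def __init__(self, emit_event):
        self.soft_down = False
        self._emit_event = emit_event  # stand-in for Event Management System output
        self._event_sent = False

    def detect_inconsistency(self):
        # Generate the event only the first time the inconsistency is detected.
        if not self._event_sent:
            self._emit_event("configuration-is-inconsistent")
            self._event_sent = True
        self.soft_down = True  # never halt; reply with an error instead
        return "configuration-is-inconsistent"

    def handle(self, request):
        if not self.soft_down:
            return "ok"  # placeholder for normal processing
        if request == "device-info":
            return "device-info reply"
        if request == "stop":
            return "stopped"
        return "configuration-is-inconsistent"
```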
The primary disk process can have good configuration information. It can have access to the primary disk drive and maybe the mirror disk drive. In one embodiment, the role between the primary and backup disk process is not switched until the configuration change succeeds in the backup process. Therefore, if the configuration change fails prior to the first ownership switch, the original configuration is still valid in the primary disk process. If the configuration change fails after the first ownership switch, the primary disk process contains the new configuration information.
To clean up the configuration inconsistency, the user should issue operations from the system configuration facility. As documented in the system configuration facility Error Messages above, the normal corrective action is to use the “Alter Disk” or the “Delete Disk” operation to correct the invalid configuration. If this action fails, an application outage can occur and the user should reconfigure the disk from scratch, which is the topic of this subsection.
If an error occurs, the user is asked to use the system configuration facility “Alter Disk” or “Delete Disk” operation to remedy the situation. Therefore, the storage subsystem manager is designed to allow the same operation to be used several times to, for example, retry a configuration change. It is possible that a process failure occurs during an online disk drive re-mirroring. This disclosure describes what actions can be taken depending on which process failed and where in the processing the process failure occurred. The storage subsystem manager can detect a process failure using this operation:
If a process fails, the storage subsystem manager can use the server operation for the Status Disk, Consistency operation to determine the current state of the system-configuration database record, the disk process, and the SIFM. Given that there is no way to revert the configuration of the disk process at this point, the storage subsystem manager can take appropriate steps to change the SIFM configuration and the system-configuration database to match the path configuration returned by the disk process in the surviving process.
If the primary process loses access to the disk drive, in one embodiment the processing of the RAID disk drive reconfiguring process 200 is aborted. That is, the configuration is returned to whatever known state is applicable, and the user is informed that an error occurred. The reasoning behind this approach is threefold: it is likely that the backup disk process is also unable to access the disk drive(s) if the primary disk process loses access to the disk drive(s); it is unlikely that this situation ever happens at a user site; and alternate error handling, such as trying to switch to the backup disk process to attempt to recover from the situation, may be more complicated.
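A minimal sketch of this reconciliation is shown below, assuming that each source of configuration information can be represented as a dictionary of path names. The function name and the dictionary layout are hypothetical.

```python
def reconcile_to_surviving_process(disk_process_paths, sifm_config, config_db_record):
    """Make the SIFM configuration and database record match the disk process.

    The path configuration of the disk process in the surviving process is
    treated as authoritative because it cannot be reverted at this point.
    Returns (new_sifm_config, new_config_db_record, names_of_updated_sources).
    """
    target = dict(disk_process_paths)
    updated = []
    if sifm_config != target:
        updated.append("SIFM")
    if config_db_record != target:
        updated.append("system-configuration database")
    # Return independent copies so that later changes to one do not affect the other.
    return dict(target), dict(target), updated
```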
Therefore, in one embodiment of the ODR mechanism 100, when the primary disk process loses access to the disk drive(s):
A “disk process recovery” is part of the requests that are supported by the disk process. Recovery requests are sent to the disk process whenever the storage subsystem manager detects that the RAID disk drive reconfiguring process is to be cancelled. Upon receipt, the disk process can perform whatever actions are needed to clean up after the processing of the RAID disk drive reconfiguring process, for example, start disk process threads and invoke the “Driver_BROTHER_Up” operation to reestablish the link between the driver in the primary and the backup disk processes. When used in this fashion, the storage subsystem manager is responsible for generating appropriate Event Management System events indicating the reason for the cancelled ODR processing.
The disk process can also use this operation when it detects a situation that causes the disk process to cancel the RAID disk drive reconfiguring process 200 by itself. In those cases, the disk process can invoke the ODR-recovery processing after which it responds with an error to the storage subsystem manager, indicating that the RAID disk drive reconfiguring process 200 has been cancelled. When used in this fashion, the disk process is responsible for generating appropriate Event Management System events indicating the reason for the cancelled RAID disk drive reconfiguring process 200. The storage subsystem manager invokes the disk process recovery as three steps which are invoked in sequence with the action value set to “recovery” as follows: “ODR_Begin”, “ODR_Reconfigure”, and “ODR_Primary”.
The following actions are common to all error handling:
TABLE 19 describes how errors such as a failure of a backup process are handled in one embodiment of the ODR mechanism 100.
The following table describes how a failure of the primary disk process is handled in one embodiment of the ODR mechanism 100.
Different embodiments of the storage subsystem manager are equipped with specific test interfaces. The test instrumentation can be available on multiple levels including: a) a per operation-level basis that can allow failures to be induced during specific steps in the operation processing; and b) a subsystem level to allow failures to be induced during the non-operation portions of the storage subsystem manager processing, for example, when handling certain system messages.
V. Conclusion
Although the disclosure is described in language specific to structural features and methodological steps, it is to be understood that the disclosure is not necessarily limited to the specific features or steps described. Rather, the specific features and steps disclosed represent different forms of implementing the disclosure.