The present disclosure relates generally to computer networks, and, more particularly, to a system and method for maintaining data integrity in a cluster network.
As the value and use of information continues to increase, individuals and businesses seek additional ways to process and store information. One option available to these users is an information handling system. An information handling system generally processes, compiles, stores, and/or communicates information or data for business, personal, or other purposes thereby allowing users to take advantage of the value of the information. Because technology and information handling needs and requirements vary between different users or applications, information handling systems may vary with respect to the type of information handled; the methods for handling the information; the methods for processing, storing or communicating the information; the amount of information processed, stored, or communicated; and the speed and efficiency with which the information is processed, stored, or communicated. The variations in information handling systems allow for information handling systems to be general or configured for a specific user or specific use such as financial transaction processing, airline reservations, enterprise data storage, or global communications. In addition, information handling systems may include or comprise a variety of hardware and software components that may be configured to process, store, and communicate information and may include one or more computer systems, data storage systems, and networking systems.
A server cluster is a group of independent servers that is managed as a single system. Compared with groupings of unmanaged servers, a server cluster is characterized by higher availability, manageability, and scalability. At a minimum, a server cluster includes two servers, which are sometimes referred to as nodes and which are connected to one another by a network or other communication links. A storage subsystem, including in some instances a shared storage subsystem, may be coupled to the cluster. A storage subsystem may include one or more storage enclosures, which may house a plurality of disk-based hard drives.
A server cluster network may include an architecture in which each of the server nodes of the network is directly connected to a single, adjacent storage enclosure and in which each server node is coupled to other storage enclosures in addition to the storage enclosure that is adjacent to the server node. In this configuration, the storage enclosures of the cluster reside between each of the server nodes of the cluster. Each storage enclosure includes an expansion port. To access a storage enclosure of the cluster other than the storage enclosure that is adjacent to it, a server node must pass the communication through the expansion ports of one or more of the storage enclosures, including the storage enclosure that is adjacent to the server node.
A RAID array may be formed of drives that are distributed across one or more storage enclosures. RAID storage involves the organization of multiple disks into an array of disks to obtain performance, capacity, and reliability advantages. In addition, the servers of the network may communicate with the storage subsystem according to the Serial Attached SCSI (SAS) communications protocol. Serial Attached SCSI is a storage network interface that is characterized by a serial, point-to-point architecture.
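By way of a brief illustration of the reliability advantage mentioned above, the following Python sketch shows how the content of a single missing drive in a distributed-parity (RAID Level 5) stripe can be derived by XOR-ing the surviving data and parity blocks. The function name, block values, and three-drive stripe layout are assumptions made for this example and are not taken from the disclosure.

    # Minimal sketch: deriving a missing RAID Level 5 block from the
    # surviving blocks of a stripe. Block contents are illustrative.
    def xor_blocks(blocks):
        """XOR a list of equal-length byte blocks together."""
        result = bytearray(len(blocks[0]))
        for block in blocks:
            for i, value in enumerate(block):
                result[i] ^= value
        return bytes(result)

    # A three-drive stripe: two data blocks and one parity block.
    data_1 = b"\x01\x02\x03\x04"
    data_2 = b"\x10\x20\x30\x40"
    parity = xor_blocks([data_1, data_2])   # parity written when the stripe is created

    # If the drive holding data_2 becomes inaccessible, its block can be
    # rebuilt from the remaining data block and the parity block.
    rebuilt = xor_blocks([data_1, parity])
    assert rebuilt == data_2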
If a storage enclosure of the cluster were to fail, or if the communication links between the storage enclosures were to fail, drives of one or more of the storage enclosures would be inaccessible to the cluster nodes. In the case of a failed storage enclosure, for example, the drives of the failed storage enclosure and the drives of each storage enclosure that is distant from the server node would be inaccessible to the server node. In this example, the drives of the failed storage enclosure and the drives of any storage enclosure that is only accessible through the failed storage enclosure would not be visible to the affected server node. In addition, because any failed storage enclosure is necessarily located between two server nodes, each server node of the cluster may have a different view of the available drives of the storage network. As a result of an enclosure failure, the drives of a RAID array may be separated from the server nodes of the storage network such that each server node of the cluster can access some, but not all, of the physical drives of the RAID array. In this circumstance, the server node that was the logical owner of the RAID array may or may not be able to access the RAID array.
In accordance with the present disclosure, a system and method for failure recovery and failure management in a cluster network is disclosed. Following a failure of a storage enclosure or a communication link failure between storage enclosures, each server node of the cluster determines whether the server node can access the drives of each logical unit owned by the server node. If the server node cannot access a set of drives of the logical unit that includes an operational set of data, an alternate server node is queried to determine if the alternate server node can access a set of drives of the logical unit that includes an operational set of data. A server node may not be able to access a complete set of drives of the logical unit if the drives of the logical unit reside on the failed enclosure or are inaccessible due to a broken storage link between storage enclosures. If the alternate server node can access a set of drives of the logical unit that includes an operational set of data, ownership of the logical unit may be transferred to the alternate server node, depending on the storage methodology of the logical unit.
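A minimal sketch of this decision flow is set out below in Python. The dictionary-and-set representation of server nodes and logical units, and the names recover, accessible_drives, and has_operational_set, are illustrative assumptions rather than an implementation described in the disclosure.

    # Sketch of the post-failure ownership check described above.
    def accessible_drives(node, logical_unit):
        """Drives of the logical unit that the node can still reach."""
        return logical_unit["drives"] & node["visible_drives"]

    def has_operational_set(logical_unit, drives):
        """True if the drives hold a complete set of data or a set from
        which a complete set of data could be derived."""
        missing = len(logical_unit["drives"]) - len(drives)
        tolerated = {0: 0,                                   # striping: no loss tolerated
                     1: len(logical_unit["drives"]) - 1,     # mirroring: one copy suffices
                     5: 1}                                   # distributed parity: one loss
        return missing <= tolerated[logical_unit["raid_level"]]

    def recover(logical_unit, owner, alternate):
        """Return the node that should own the logical unit after a failure."""
        if has_operational_set(logical_unit, accessible_drives(owner, logical_unit)):
            return owner          # owner retains ownership
        if has_operational_set(logical_unit, accessible_drives(alternate, logical_unit)):
            return alternate      # ownership may be transferred to the alternate node
        return None               # neither node can serve the logical unit

    # Example: a two-drive mirrored array split by an enclosure failure.
    node_a = {"visible_drives": {"X1"}}
    node_b = {"visible_drives": {"X2"}}
    array_x = {"drives": {"X1", "X2"}, "raid_level": 1}
    assert recover(array_x, node_a, node_b) is node_a   # owner retains RAID array X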
The system and method disclosed herein is technically advantageous because it provides a failure recovery mechanism in the event of a failure of an entire storage enclosure or of a communication link between storage enclosures. Even though these types of failures may interrupt the ability of a first server node to communicate with all of the physical drives of a logical unit owned by that server node, the system and method disclosed herein provides a technique for accessing an operational set of data from the logical unit. In some instances, the accessible drives may comprise a complete set of the drives of the logical unit. In other instances, the accessible drives may comprise a set of drives from which a complete set of data could be derived, as in the case of a single inaccessible drive in a RAID level 5 array. Thus, because of the storage methodology of the drives of the logical unit, the accessible drives of the logical unit may comprise an operational set of data, even if all of the drives of the logical unit are not accessible.
Another technical advantage of the failure recovery technique disclosed herein is that the technique accounts for the differing set of storage enclosures that may be visible to each server node of the network. When a storage enclosure fails, the failure presents each server node with a different set of operational storage enclosures and a different set of storage arrays. Thus, a portion of some drive arrays may be accessible to each server node; some drive arrays may be accessible by only one server node; and some drive arrays may not be accessible by either server node. Despite the differing views of each server node, ownership of the logical units is managed such that the first server node having ownership retains ownership of the logical unit unless the first server node cannot access the entire content of the storage array and the alternate or other server node can access the entire content of the storage array.
Another technical advantage of the failure recovery technique disclosed herein is the ability of the recovery technique to preserve the data integrity and maintain the availability of the logical units of the computer network. When a set of drives is identified by a node as including an operational set of data, a designation is written to the drives to identify the drives as including the master copy of the data. The designation of the drives as including the master copy of the data prevents data discontinuities from occurring when the failed storage enclosure is restored and previously inaccessible drives, which may contain stale data, become accessible. Other technical advantages will be apparent to those of ordinary skill in the art in view of the following specification, claims, and drawings.
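One way such a designation might be recorded is sketched below in Python, assuming a small per-drive metadata record that carries a master-copy flag and a generation counter; the record layout, field names, and functions are assumptions made for illustration only.

    import json
    import time

    def write_master_designation(drive_metadata, logical_unit_id, generation):
        """Record on an accessible drive that it holds the master copy."""
        record = {
            "logical_unit": logical_unit_id,
            "master_copy": True,
            "generation": generation,     # incremented each time a designation is written
            "timestamp": time.time(),
        }
        drive_metadata[logical_unit_id] = json.dumps(record)

    def is_stale(drive_metadata, logical_unit_id, current_generation):
        """A restored drive with an older generation must be resynchronized
        from the master copy before it rejoins the logical unit."""
        raw = drive_metadata.get(logical_unit_id)
        if raw is None:
            return True
        return json.loads(raw)["generation"] < current_generation

    # Example: mark one drive's metadata region and confirm it is current.
    drive_metadata = {}
    write_master_designation(drive_metadata, "array_y", generation=1)
    assert not is_stale(drive_metadata, "array_y", current_generation=1)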
A more complete understanding of the present embodiments and advantages thereof may be acquired by referring to the following description taken in conjunction with the accompanying drawings, in which like reference numbers indicate like features, and wherein:
For purposes of this disclosure, an information handling system may include any instrumentality or aggregate of instrumentalities operable to compute, classify, process, transmit, receive, retrieve, originate, switch, store, display, manifest, detect, record, reproduce, handle, or utilize any form of information, intelligence, or data for business, scientific, control, or other purposes. For example, an information handling system may be a personal computer, a network storage device, or any other suitable device and may vary in size, shape, performance, functionality, and price. The information handling system may include random access memory (RAM), one or more processing resources such as a central processing unit (CPU) or hardware or software control logic, ROM, and/or other types of nonvolatile memory. Additional components of the information handling system may include one or more disk drives, one or more network ports for communication with external devices as well as various input and output (I/O) devices, such as a keyboard, a mouse, and a video display. The information handling system may also include one or more buses operable to transmit communications between the various hardware components.
Shown in
Cluster network 10 includes five storage enclosures 24, each of which includes four drives 26. In the example of
The storage enclosures of the network may include a plurality of RAID arrays, including RAID arrays in which the drives of the RAID array are distributed across multiple storage enclosures. In the example of
Shown in
In this example, storage enclosure 3 has failed. RAID array X is a two-drive Level 1 RAID array. Because of the failure of storage enclosure 3, only drive X1 of RAID array X is accessible by and visible to server node A. Nonetheless, because RAID array X is a Level 1 RAID array, the entire content of the RAID array is available on drive X1. Like server node A, server node B can access the entire content of RAID array X, as the entire content of the two-drive Level 1 RAID array is duplicated on each of the two drives, including drive X2 in storage enclosure 5. In this example, because server node A was the logical owner of RAID array X before the failure of storage enclosure 3, and because server node A can access the entire content of the RAID array, server node A remains the logical owner of RAID array X immediately following the failure of storage enclosure 3. Server node A will remain the owner of RAID array X so long as server node A can verify that it can access the entire content of the RAID array. If server node A later fails, and ownership of RAID array X would otherwise pass to server node B, server node B must not accept ownership of RAID array X, because updates made by server node A to drive X1 after the enclosure failure are not reflected on drive X2, which may therefore contain stale data. In this circumstance, server node B must fail any new I/O operations to RAID array X and maintain the existing content of RAID array X even though server node B may have access to drive X2, which in this example is a mirrored drive in a RAID Level 1 storage format that allows each mirrored drive to be seen by each node.
With respect to RAID array Y, drive Y1 is the only drive of RAID array Y that is visible to server node A. RAID array Y is a three-drive Level 5 RAID array. Because server node A can access only one drive of the three-drive array, the entire content of RAID array Y is not accessible by and through server node A. Server node B, however, can access drives Y2 and Y3 of RAID array Y. Because two of the three drives of the distributed parity array are visible to server node B, the entire content of RAID array Y is accessible by server node B, as the content of drive Y1 can be rebuilt from drives Y2 and Y3. Thus, while server node A cannot access the entire content of RAID array Y, server node B can access the entire content of RAID array Y. In this scenario, ownership of RAID array Y could be passed from server node A, which cannot access the entire content of the array, to server node B, which can access the entire content of the array. With respect to RAID array W, assuming that RAID array W is logically owned by server node A, the loss of the storage enclosure does not affect access to RAID array W, as all of the drives of RAID array W are still accessible by server node A. If it were the case that the drives of RAID array W were logically owned by server node B, server node B could not access any of the drives of RAID array W, and ownership of RAID array W could be passed to server node A, which does have access to the drives of RAID array W.
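The same accessibility standard can be applied to RAID arrays X, Y, and W in a short Python sketch. The function operational and the drive count assumed for RAID array W are illustrative assumptions; the outcomes, however, match the discussion above.

    def operational(raid_level, total_drives, accessible):
        """True if the accessible drives hold the full content of the array
        or a set of data from which the full content can be derived."""
        missing = total_drives - accessible
        if raid_level == 1:
            return accessible >= 1    # each mirror carries the full content
        if raid_level == 5:
            return missing <= 1       # one lost drive is rebuilt from parity
        return missing == 0           # otherwise every drive is required

    # RAID array X (two-drive Level 1): node A sees X1, node B sees X2.
    print(operational(1, 2, 1))                          # True for either node

    # RAID array Y (three-drive Level 5): node A sees Y1; node B sees Y2 and Y3.
    print(operational(5, 3, 1), operational(5, 3, 2))    # False True

    # RAID array W (level and drive count assumed): node A sees every drive,
    # node B sees none of them.
    print(operational(5, 3, 3), operational(5, 3, 0))    # True False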
Shown in
If it is determined at step 32 that at least one drive of a logical unit owned by the server node cannot be accessed, the accessible drives of the logical unit are identified by the server node at step 34. It is next determined at step 36 if the accessible drives of the logical unit comprise a complete set of data of the logical unit or otherwise comprise a set of data from which a complete set of data could be derived. If the accessible drives of the logical unit comprise a complete set of data of the logical unit or otherwise comprise a set of data from which a complete set of data could be derived, the accessible drives of the logical unit are defined as including an operational set of data. As an example, if the server node is only able to access one drive of a two-drive Level 1 RAID array, the server node nevertheless has access to a complete set of data, as the content of each drive is mirrored on the other drive of the array. As another example, if the server node is only able to access two drives of a three-drive Level 5 RAID array, the server node nevertheless has access to a complete set of data, as the content of the missing drive can be derived from the data and parity information on the two accessible drives. As a final example, if two drives of a three-drive Level 5 RAID array are inaccessible, the accessible drives of the RAID array do not form a complete set of data or a set of data from which a complete set of data could be derived. Applying this standard to the example of
If it is determined at step 36 that the server node that owns the logical unit does have access to the drives of the logical unit that comprise a complete set of data or a set of data from which a complete set of data could be derived, any missing drives of the logical unit are rebuilt at step 38 on one of the active storage enclosures. In the example of
If it is determined at step 36 that the server node that owns the logical unit does not have access to drives of the logical unit that comprise a complete set of data or a set of data from which a complete set of data could be derived, the server node identifies the logical unit to the alternate node at step 46. The alternate node does not have ownership of the logical unit. The method used by the alternate node to evaluate the data integrity of the logical unit relative to the ability of the alternate node to access drives of the logical unit is described in
Shown in
At step 54 it is determined if the drives of the logical unit that are accessible by the alternate node comprise a complete set of data of the logical unit or otherwise comprise a set of data from which a complete set of data could be derived. With respect to the example of
If it is determined at step 54 that the drives of the logical unit that are accessible by the alternate node comprise a complete set of data of the logical unit or otherwise comprise a set of data from which a complete set of data could be derived, the flow diagram continues with step 56, where the alternate node becomes the owner of the logical unit. Because the original owner of the logical unit could not access a complete set of data comprising the logical unit, the alternate node, which can access a complete set of data comprising the logical unit, becomes the owner of the logical unit. At step 58, any missing drives of the logical unit are rebuilt on one of the active storage enclosures. In the example of
If it is determined at step 54 that the drives of the logical unit that are accessible by the alternate node do not comprise a complete set of data of the logical unit or do not otherwise comprise a set of data from which a complete set of data could be derived, the flow diagram continues with step 60, where the logical unit is marked as having failed and communication with any of the drives comprising the logical unit is discontinued. Following the restoration of the failed storage enclosure, the original configuration of the drives of each logical unit can be restored. In doing so, the master copy of the data can be used to update drives that were not included in the logical unit during the period that the failed storage enclosure was not operational. In the example of
As an additional example,
It should be recognized that the technique disclosed herein is sufficiently robust that, in the event of a failure of a storage enclosure, an alternate server node may not attempt to access the resources of a logical unit until it is authorized or instructed to do so by the server node that owns the logical unit. It should also be recognized that, following a failure in a first storage enclosure, it may be necessary to prevent an attempted recovery from a second, subsequent failure. Thus, in the case of a two-drive mirrored array, for example, if a storage enclosure were to fail, and if both drives remain operational after the failure, it would be possible for the array to remain operational even though the node owning the logical unit is able to access only the first drive of the array. Updates to the first drive, however, would not be reflected in the second drive. As such, if the node owning the logical unit were to later fail, the alternate node should be prevented from accessing the second drive, as this drive will not include an updated set of data.
In sum, the present disclosure concerns a technique in which each server node attempts, in the event of a storage enclosure failure, to catalog or identify those drives that are visible to the server node. For each array owned by the server node, if the server can access (a) drives having the entire content of the array or (b) drives from which the entire content of the array can be derived, the server node retains ownership of the array. If the server node that owns a certain array cannot access drives having the entire content of the array or drives from which the entire content of the array can be derived, and if the alternate server node can access (a) drives having the entire content of the array or (b) drives from which the entire content of the array can be derived, ownership of the array is passed to the alternate server node. In addition, in the case of RAID 1 and RAID 10 arrays, an ownership message can be sent to the alternate server node to notify the alternate server node that it should not write to drives of the RAID array in the event of a failure of the first server node. Preventing the alternate server node from writing to the drives of the RAID array will preserve the data integrity of the RAID array in the event of a failure in the server node that owns the RAID array following a storage enclosure failure.
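A sketch of how such an ownership message might be expressed and acted upon is given below in Python; the message fields and handler behavior are assumptions made for illustration and are not prescribed by the disclosure.

    def build_ownership_message(array_id, owner_node_id):
        """Message sent to the alternate node after an enclosure failure
        for a mirrored (RAID 1 or RAID 10) array."""
        return {
            "array": array_id,
            "owner": owner_node_id,
            "writes_permitted_on_owner_failure": False,   # surviving mirror may hold stale data
        }

    def handle_owner_failure(message, array_id):
        """Alternate node's handling of a later failure of the owning node."""
        if message["array"] == array_id and not message["writes_permitted_on_owner_failure"]:
            return "fail_io"            # fail new I/O rather than serve stale data
        return "accept_ownership"

    message = build_ownership_message("X", "node_a")
    assert handle_owner_failure(message, "X") == "fail_io"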
The failure recovery methodology described herein provides a mechanism for preserving the data integrity of logical units following the failure of a storage enclosure of the network. The logical units owned by each server node are identified. If a server node can access a complete set of data on a logical unit that is owned by the server node, the server node retains ownership of the logical unit and continues to read and write data to the logical unit. If the server node that owns a logical unit cannot access a complete set of data on the logical unit, and if an alternate server node can access a complete set of data on the logical unit, the ownership of the logical unit is transferred to the alternate server node, which coordinates reads and writes to the logical unit. When the failed storage enclosure is returned to operational status, the master copy of the data of each logical unit is identified from a designation written to each drive that includes a master copy of the data of the logical unit. The data of the master copy can be distributed to a drive that was not included in the logical unit during the period that the failed storage enclosure was not operational.
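The following Python sketch illustrates how a restored drive might be resynchronized from the designated master copy once the failed enclosure returns to service, assuming a per-drive generation counter of the kind sketched earlier; the data model is an assumption made for illustration.

    def restore_drive(master_copy, restored_drive):
        """Overwrite the possibly stale content of a drive that was offline
        with the master copy, then bring its generation up to date."""
        if restored_drive["generation"] < master_copy["generation"]:
            restored_drive["blocks"] = list(master_copy["blocks"])
            restored_drive["generation"] = master_copy["generation"]
        return restored_drive

    master_copy = {"generation": 7, "blocks": [b"new-0", b"new-1"]}
    stale_drive = {"generation": 5, "blocks": [b"old-0", b"old-1"]}
    restore_drive(master_copy, stale_drive)
    assert stale_drive["blocks"] == master_copy["blocks"]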
Although the present disclosure has been described in detail, it should be understood that various changes, substitutions, and alterations can be made hereto without departing from the spirit and the scope of the invention as defined by the appended claims.