Embodiments herein are related to mirroring data and more particularly to mirroring data and metadata from an NVLog to a plurality of mirroring nodes.
Data replication and data recovery have become paramount in computing systems today. Data loss and/or data corruption caused by user error, system attacks, hardware failure, software failure, and the like have the potential to cause computing inconsistencies which lead to user inconvenience and sometimes catastrophic system failures. As such, various redundancy techniques have been developed to protect against data loss.
Mirroring is one such technique, which replicates data stored at a first location onto a second location, thereby creating two copies of the data. If one of the data locations fails, the lost data can be recovered from the other location. An example of a technique using mirroring is RAID (redundant array of independent disks) 1. In RAID 1, at least two storage media are used, wherein the data written to a first storage medium is mirrored on a second storage medium. With this technique, the second storage medium acts as a redundancy mechanism and can be used to reconstruct the data on the first storage medium should any of that data be lost.
Another redundancy technique is parity, which is used by RAID 4. In RAID 4, parity (rather than mirroring) is used for data redundancy. Parity relies on parity data such as error correction codes (ECC) (e.g., XOR, Reed-Solomon (RS) codes, etc.), which is stored on a disk designated as the parity disk, and uses that parity data to reconstruct a data block should it be lost or otherwise become unavailable.
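As a simplified illustration of the XOR variant of parity (a minimal sketch only, not a RAID 4 implementation; the block sizes and helper names are assumptions made for illustration), a lost block can be rebuilt by XOR-ing the parity block with the surviving blocks:

```go
package main

import "fmt"

// xorParity computes the parity block for a stripe of equally sized data blocks.
func xorParity(blocks [][]byte) []byte {
	parity := make([]byte, len(blocks[0]))
	for _, b := range blocks {
		for i, v := range b {
			parity[i] ^= v
		}
	}
	return parity
}

// reconstruct rebuilds a lost block by XOR-ing the parity block with the surviving blocks.
func reconstruct(parity []byte, surviving [][]byte) []byte {
	lost := append([]byte(nil), parity...)
	for _, b := range surviving {
		for i, v := range b {
			lost[i] ^= v
		}
	}
	return lost
}

func main() {
	d0, d1, d2 := []byte{1, 2, 3}, []byte{4, 5, 6}, []byte{7, 8, 9}
	p := xorParity([][]byte{d0, d1, d2})
	// Pretend d1 was lost; recover it from the parity block and the remaining blocks.
	fmt.Println(reconstruct(p, [][]byte{d0, d2})) // [4 5 6]
}
```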
Computer users traditionally envision backing up large groups of user-visible data such as software code, documents, photographs, music, and the like. For example, a user often stores important family photos on more than one storage medium, such that the loss of one of the storage media does not result in the loss of the family photos. However, user-transparent data, such as computer operations (e.g. inputs/outputs (I/Os)) and metadata, are also important to replicate, despite their transparency to the user.
An example of such replication is the replication of a cache memory. Cache memory is used by a computer system for fast reads and writes as the computer conducts computing operations. Replication of the cache by a partner computer system aids in data recovery should the content stored in the cache be lost. An example of a cache is a write cache, which provides a non-volatile log (NVLog) for logging client operations. A write cache may be stored in non-volatile random access memory (NVRAM) because NVRAM provides quick access times as compared to other means of data storage (e.g. disk storage). In addition to logging client operations, the NVLog may also store metadata which describes the data contained within the NVLog.
While the NVLog provides for quick access time, the NVLog traditionally has a lower storage capacity and limited read/write endurance. As such, the NVLog may be periodically flushed to a more permanent memory having higher storage capacity (e.g. hard disks) at points in time called Consistency Points (CPs). At any given point in time, the current view of a client's computing operations and metadata can be viewed as data in the NVLog and on the permanent memory. As mentioned above, replication of the NVLog and permanent memory is desirable so that all computing operations data and metadata can be recovered should some or all of the computing operations data and metadata be lost for any reason.
Traditionally, computing operations data and metadata are replicated on a single partner computer system, as mentioned above. The partner has access to both the NVLog and the permanent storage of the client system, which provides for a complete backup. The client's NVLog may be replicated on the partner computer system's NVLog while the client's permanent memory may be replicated on the partner computer system's permanent memory.
In order to avoid data loss or corruption, at any given point in time the NVLog of the client and the replicated NVLog located on the partner node should be consistent in terms of the data and the metadata they contain. For this reason, the data and metadata are logged in a certain order, and that ordering is maintained while the NVLog gets mirrored to the partner node. To ensure that data in the client is consistent with data in the partner node, I/O incoming to the client is acknowledged only after the data and corresponding metadata have been logged in NVRAM locally and also in the partner node. In order to ensure that the data and corresponding metadata are logged in both the client and the partner, the following functionality is traditionally utilized: in-order placement of the mirrored NVLog payload in the partner node's NVRAM; completion of the mirroring operation at the client only after the corresponding payload has been placed in the partner's NVRAM; and completion of mirroring operations in the same order in which they were issued.
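A minimal sketch of that traditional one-partner protocol is shown below; the Entry type, the local and partner write hooks, and the ack callback are assumptions made for illustration, not the actual NVLog interfaces:

```go
package mirror

// Entry is a hypothetical NVLog record: operation data plus its metadata.
type Entry struct {
	SeqNo   uint64
	Payload []byte
	Meta    []byte
}

// logAndMirror sketches the traditional one-partner protocol: each entry is
// written to the local NVLog and to the partner's NVRAM in issue order, and
// the client I/O is acknowledged only after both writes have completed.
func logAndMirror(entries []Entry, local, partner func(Entry) error, ack func(uint64)) error {
	for _, e := range entries { // issue order is preserved by iterating in order
		if err := local(e); err != nil {
			return err
		}
		if err := partner(e); err != nil { // completes only once placed in partner NVRAM
			return err
		}
		ack(e.SeqNo) // acknowledge the I/O only after both copies exist
	}
	return nil
}
```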
In the past, such a replication technique has been sufficient because there has traditionally been a one-to-one relationship between the client system and the partner system. However, moving forward with distributed filesystems (for example, the Write Anywhere File Layout (WAFL®) architecture developed by NetApp, Inc.) that may be distributed throughout multiple nodes in one or more networks (for example, a cluster system), a client system's NVLog and permanent memory may not be located locally to the client system. For example, a client system's data may be distributed throughout one or more clustered network environments. Likewise, replication may involve one or more replication partner computer systems, and the data stored by one or more replication partners may be distributed throughout one or more clustered network environments. As such, the traditional replication of a client system using a partner system adds substantial performance overhead with each remotely located memory that is added to the overall system and, as such, is not scalable. Furthermore, it adds complexity to the client systems because traditional systems and methods often require the client to be aware of the presence and nature of the replication partner and to manage the data replication.
Systems and methods herein are operable to simultaneously mirror data to a plurality of mirror partner nodes. In embodiments, a mirror client may be unaware of the number of mirror partner nodes and/or the location of the plurality of mirror partner nodes. In operation, a mirror client may issue a single mirror command requesting initiation of a mirror operation. An interconnect layer may receive the mirror command and, using information therein, determine a number of active mirror partner nodes to mirror the data and determine the location of the active mirror partner nodes. With this information at hand, the single mirror command may be split into a plurality of mirror instances, one for each identified active mirror partner node. The respective mirror instances may comprise the data to be mirrored, the physical identifier of the respective mirror partner node, and properties of the respective mirror partner node. After the mirror instances are created, the interconnect layer simultaneously sends the mirror instances to a mirror layer, which creates a write command for each respective mirror instance. Thereafter, the mirror layer simultaneously executes each of the write commands.
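The fan-out described above might be sketched as follows; the MirrorCommand, MirrorInstance, and Partner types and the write hook are illustrative assumptions rather than the actual interconnect-layer API:

```go
package mirror

import "sync"

// MirrorCommand is the single command issued by the mirror client; the client
// supplies only the payload and a virtual identifier, not the partner list.
type MirrorCommand struct {
	VID     uint64
	Payload []byte
}

// Partner describes an active mirror partner as resolved from the VID.
type Partner struct {
	PID    string
	Offset int64
}

// MirrorInstance is created per active partner, carrying everything needed to
// execute the write on that particular partner.
type MirrorInstance struct {
	PartnerPID string
	Offset     int64
	Payload    []byte
}

// splitAndDispatch splits one mirror command into one instance per active
// partner and launches all writes concurrently.
func splitAndDispatch(cmd MirrorCommand, partners []Partner,
	write func(MirrorInstance) error) []error {

	errs := make([]error, len(partners))
	var wg sync.WaitGroup
	for i, p := range partners {
		inst := MirrorInstance{
			PartnerPID: p.PID,
			Offset:     p.Offset,
			Payload:    append([]byte(nil), cmd.Payload...), // duplicate the ic-stream
		}
		wg.Add(1)
		go func(i int, inst MirrorInstance) {
			defer wg.Done()
			errs[i] = write(inst) // one write command per mirror instance
		}(i, inst)
	}
	wg.Wait()
	return errs
}
```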
Upon initiating the write commands, the mirror layer may send an update to the interconnect layer indicating whether each respective write command successfully launched. Using the information in the update, the interconnect layer may determine whether any of the mirroring instances successfully launched. If at least one mirroring instance successfully launched, then the interconnect layer sends a single update to the mirroring client indicating that the mirror command was successfully launched. If any of the mirroring instances were not successfully launched, then the interconnect layer determines which of the mirror partner nodes failed to launch its respective mirror instance and changes the status of that mirror partner node to inactive.
After issuing a mirror command, the mirror client may wait until all writing operations (e.g. successful finishing of local NVLog writing operations and successful finishing of one or more mirroring operations) have completed before acknowledging that the information was logged in the local NVLog. As such, the mirror client may send a query requesting confirmation that the mirror command was successful. The interconnect layer may receive the query and determine which mirror operations are of interest based on information in the query. With knowledge of which mirror operations are of interest, the interconnect layer may simultaneously issue a plurality of calls, one for each mirror instance, requesting an update of the completion status of the mirror instance. The mirror layer may receive the issued calls and poll the mirror partner nodes for a completion update.
Upon determining the completion status of the mirror partner nodes, the mirror layer may raise a completion update for a respective mirror partner node to the interconnect layer. Based at least on the received completion update, the interconnect layer determines whether any mirror operation has successfully finished without error prior to a timer expiring. If at least one mirror operation successfully finished without error prior to the timer expiring, the interconnect layer reports to the mirroring client that the mirroring command was a success. In embodiments, as long as at least one mirror operation has successfully finished without error prior to the timer expiring, the interconnect layer may report a success even if one or more mirroring operations resulted in an error and even if one or more mirroring operations did not complete prior to the timer's expiration.
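That reporting rule reduces to a small predicate over the per-mirror outcomes gathered before the timer expired; the MirrorOutcome type below is an assumption used only for illustration:

```go
package mirror

// MirrorOutcome is the per-partner result gathered before the timer expired.
type MirrorOutcome int

const (
	Finished MirrorOutcome = iota // completed without error before timeout
	Errored                       // reported an error before timeout
	Pending                       // still outstanding when the timer expired
)

// commandSucceeded reports SUCCESS to the mirror client if at least one
// mirror operation finished without error before the timer expired, even if
// other mirrors errored or never completed in time.
func commandSucceeded(outcomes []MirrorOutcome) bool {
	for _, o := range outcomes {
		if o == Finished {
			return true
		}
	}
	return false
}
```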
The foregoing has outlined rather broadly the features and technical advantages of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of the invention will be described hereinafter which form the subject of the claims of the invention. It should be appreciated by those skilled in the art that the conception and specific embodiment disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims. The novel features which are believed to be characteristic of the invention, both as to its organization and method of operation, together with further objects and advantages will be better understood from the following description when considered in connection with the accompanying figures. It is to be expressly understood, however, that each of the figures is provided for the purpose of illustration and description only and is not intended as a definition of the limits of the present invention.
The modules, components, etc. of data storage systems 102 and 104 may comprise various configurations suitable for providing operation as described herein. For example, nodes 116 and 118 may comprise processor-based systems, such as file server systems, computer appliances, computer workstations, etc. Accordingly, nodes 116 and 118 of embodiments comprise a processor (e.g., central processing unit (CPU), application specific integrated circuit (ASIC), programmable gate array (PGA), etc.), memory (e.g., random access memory (RAM), read only memory (ROM), disk memory, optical memory, flash memory, etc.), and suitable input/output circuitry (e.g., network interface card (NIC), wireless network interface, display, keyboard, data bus, etc.). The foregoing processor-based systems may operate under control of an instruction set (e.g., software, firmware, applet, code, etc.) providing operation as described herein. Throughout the description, if item numbers are related (for example, 116a and 116n are related in that the item numbers share the same number (116) but have separate letter designations (a and n)), then the related items may be collectively referred to herein by just their number (for example, 116a and 116n may collectively be referred to herein as 116).
Data store devices 128 and 130 may, for example, comprise disk memory, flash memory, optical memory, and/or other suitable computer readable media. Data store devices 128 and 130 may comprise battery-backed non-volatile RAM (NVRAM) to provision NVLogs 132 and 134, which operate as write caches. Data modules 124 and 126 of nodes 116 and 118 may be adapted to communicate with data store devices 128 and 130 according to a storage area network (SAN) protocol (e.g., small computer system interface (SCSI), Fibre Channel Protocol (FCP), INFINIBAND, etc.), and thus data store devices 128 and 130 may appear as locally attached resources to the operating system. That is, as seen from an operating system on nodes 116 and 118, data store devices 128 and 130 may appear as locally attached to the operating system. In this manner, nodes 116 and 118 may access data blocks through the operating system, rather than expressly requesting abstract files.
Network modules 120 and 122 may be configured to allow nodes 116 and 118 to connect with client systems, such as clients 108 and 110 over network connections 112 and 114, to allow the clients to access data stored in data storage systems 102 and 104. Moreover, network modules 120 and 122 may provide connections with one or more other components of system 100, such as through network 106. For example, network module 120 of node 116 may access data store device 130 via communication via network 106 and data module 126 of node 118. The foregoing operation provides a distributed storage system configuration for system 100.
Clients 108 and 110 of embodiments comprise a processor (e.g., CPU, ASIC, PGA, etc.), memory (e.g., RAM, ROM, disk memory, optical memory, flash memory, etc.), and suitable input/output circuitry (e.g., NIC, wireless network interface, display, keyboard, data bus, etc.). The foregoing processor-based systems may operate under control of an instruction set (e.g., software, firmware, applet, code, etc.) providing operation as described herein.
Network 106 may comprise various forms of communication infrastructure, such as a SAN, the Internet, the public switched telephone network (PSTN), a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), a wireless network (e.g., a cellular communication network, a wireless LAN, etc.), and/or the like. Network 106, or a portion thereof may provide infrastructure of network connections 112 and 114 or, alternatively, network connections 112 and/or 114 may be provided by network infrastructure separate from network 106, wherein such separate network infrastructure may itself comprise a SAN, the Internet, the PSTN, a LAN, a MAN, a WAN, a wireless network, and/or the like.
As can be appreciated from the foregoing, system 100 provides a data storage system in which various digital data (or other data) may be created, maintained, modified, and accessed (referred to collectively as data management). A logical mapping scheme providing logical data block mapping information, stored within and stored without the data structures, may be utilized by system 100 in providing such data management. For example, a filesystem implemented by data store devices 128 and 130 may implement a logical data block allocation technique.
In the exemplary configuration of system 100, clients 108 and 110 utilize data storage systems 102 and 104 to store and retrieve data from data store devices 128 and 130. In such an embodiment, for example, client 108 can send data packets to N-module 120 in node 116 within data storage system 102. Node 116 can forward the data packets to data store device 128 using D-module 124. In this example, the client 108 may access data store device 128, to store and/or retrieve data, using data storage system 102 connected by network connection 112. Further, in this embodiment, client 110 can exchange data with N-module 122 in node 118 within data storage system 104 (e.g., which may be remote from data storage system 102). Node 118 can forward the data to data storage device 130 using D-module 126, thereby accessing the data storage device 130.
Disaster recovery (DR) provides protection in the event of a failure of one or more sites. In a DR scenario, a threshold number of nodes (e.g. all nodes) within a site fail, thereby causing the entire site to be considered in failure. Upon a site failure, DR is operational to cause an operational node located in a different site to take over operations that were previously serviced by the failing node and its data storage devices. In DR, the node taking over the operations is located at a greater distance from the failing node (e.g. kilometers away).
In order for an operational node to take over operations of a failing node, embodiments herein take precautionary steps to mirror the NVLog of a node onto a plurality of partner nodes. Example clustered storage system 200 is a scalable cluster system wherein the NVLog of one or more nodes is mirrored onto a plurality of partner nodes. Example clustered storage system 200 comprises a number of nodes including nodes 216a, 216n, 218a, and 218n. In embodiments, nodes 216a, 216n, 218a, and 218n may respectively correspond to figure 1's nodes 116a, 116n, 118a, and 118n, and their data storage devices. Example clustered storage system 200 is dynamically scalable, and as such, any number of nodes may be added to or taken away from example clustered storage system 200 at any given time. Nodes 216a and 216n are part of cluster A and located within the same site, site 240. Node 216a is located within meters of node 216n. Because nodes 216a and 216n are in relatively close proximity to each other, nodes 216a and 216n are grouped into high availability (HA) group 217. Node 216a is operational to mirror the NVLog of node 216n, and node 216n is operational to mirror the NVLog of node 216a. Due to the mirroring, if node 216a fails, node 216n has a mirror of node 216a's NVLog. Thus, an HA operation may cause node 216n to take over the operations of failing node 216a, and the takeover may be seamless (e.g. without data loss or data corruption).
Nodes 218a and 218n are part of cluster B and located within the same site, site 242. Node 218a is located within meters of node 218n. Because nodes 218a and 218n are in relatively close proximity to each other, nodes 218a and 218n are grouped into HA group 219. Node 218a is operational to mirror the NVLog of node 218n, and node 218n is operational to mirror the NVLog of node 218a. Due to the mirroring, if node 218a fails, node 218n has a mirror of node 218a's NVLog. Thus, an HA operation may cause node 218n to take over the operations of failing node 218a, and the takeover may be seamless.
Site 240 is remotely located from site 242. As such, HA group 217 is remotely located from HA group 219, and thus, the HA groups are relatively far away from each other (e.g. kilometers away). HA group 217 and HA group 219 are grouped into DR group 220. HA group 217 is operational to mirror the NVLog of HA group 219, and HA group 219 is operational to mirror the NVLog of HA group 217. Due to the mirroring, if HA group 217 fails, HA group 219 has a mirror of HA group 217's data. Thus, a DR operation may cause HA group 219 to take over the operations of failing HA group 217, and the takeover may be without data loss or data corruption.
In example clustered storage system 200, when the mirroring client (located in node 216a in this example) issues a mirror command, the mirror command is split into a plurality of mirror instances for the nodes within node 216a's HA group 217 and the nodes within node 216a's DR group 220. As such, in this example, at least three mirroring instances will issue, one to each of nodes 216n, 218a, and 218n. Further, when it comes time to determine whether a mirroring command was successful, a query is issued which is used to determine whether the NVLog was successfully mirrored on any of nodes 216n, 218a, and 218n.
Embodiments disclosed herein synchronously mirror data and metadata of an NVLog to two or more partners by encapsulating the management of multiple partners and their properties in an interconnect layer. The mirroring clients (WAFL/RAID/EMS subsystems, etc.) are unaware of the existence of multiple partners and issue a single operation for mirroring and a single query to determine the status of the mirroring. In embodiments, the interconnect layer manages the multiple partners, and as such, mirroring clients may remain naïve, if desired, about the partners involved in the mirroring process.
When mirroring client 401b issues mirroring command 406, interconnect layer 402 receives the mirroring command 406. The mirroring command 406 may include an ic-stream 407 and a virtual identifier (VID) 408. The ic-stream 407 comprises the data that is to be mirrored, which may include operational data and its associated metadata. The VID 408 can be used to determine which partners are to mirror the ic-stream, the number of mirroring partners, and the location of the partners.
With the VID 408, interconnect layer 402 may access non-volatile management (NVM) layer 411 to make a runtime determination of which partner mirrors are active at that time and, if desired, which partner mirrors are inactive at that time. NVM layer 411 maintains a per-mirror status indicating whether a mirror is active at any given time. Further, NVM layer 411 may also maintain a VID-to-PID mapping, which maps a virtual identifier (VID) to the physical identifiers (PIDs) of mirroring partners.
In embodiments, VID 408 may be mapped to mirroring partners 216n, 218a, and 218n. Further still, NVM layer 411 may also maintain mirror properties for use in determining offsets that may be included in a mirroring instance. Examples of mirror properties may include whether a particular mirror is an HA node or a DR node, the type of connection to the node (e.g. the transport used), which sections of a storage device (e.g. NVRAM) are used for a respective partner, memory protection keys used for mirroring, and the like. For example, NVM layer 411 may maintain properties that identify node 216n as being an HA node and nodes 218a and 218n as being DR nodes.
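One way to picture the NVM layer's bookkeeping is a mapping from a VID to the partners behind it, each carrying its status and the properties used to compute offsets; the field names below are illustrative assumptions, not the actual NVM layer structures:

```go
package mirror

// MirrorKind distinguishes local HA partners from remote DR partners.
type MirrorKind int

const (
	HA MirrorKind = iota
	DR
)

// PartnerInfo holds the per-mirror state the NVM layer maintains.
type PartnerInfo struct {
	PID       string     // physical identifier of the partner node
	Kind      MirrorKind // HA or DR
	Transport string     // e.g. the interconnect transport used
	NVRAMBase int64      // section of the partner NVRAM reserved for this client
	Active    bool       // runtime per-mirror status
}

// NVMLayer maps a virtual identifier to the physical partners behind it.
type NVMLayer struct {
	vidToPartners map[uint64][]PartnerInfo
}

// ActivePartners resolves a VID to the partners that are active right now.
func (n *NVMLayer) ActivePartners(vid uint64) []PartnerInfo {
	var active []PartnerInfo
	for _, p := range n.vidToPartners[vid] {
		if p.Active {
			active = append(active, p)
		}
	}
	return active
}
```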
With the active partners, their PIDs, and their offsets determined, interconnect layer 402 splits mirroring command 406 into a plurality of mirror instances 409a-409n by generating a mirror instance 409 for each respective active partner, wherein each respective mirror instance 409 comprises information relevant to that particular partner and is used to execute a mirroring operation on that particular partner. For example, mirror command 406 may be split into mirror instance 409a for partner node 216n, mirror instance 409b for node 218a, and mirror instance 409n for node 218n. When creating the mirror instances (409a-409n), interconnect layer 402 duplicates ic-stream 407 and includes in each mirror instance (409a-409n) a duplicate of the ic-stream (407a-407n), the physical address of the partner (PID 410a-410n), and the offsets for that partner (412a-412n). With the mirror instances 409 created, interconnect layer 402 synchronously sends each of the mirror instances (409a-409n) to the mirror layer 403. For each mirror instance (409a-409n), mirror layer 403 issues a write command (413a-413n) to interconnect services 405 to write the respective ic-stream to the respective partner according to its respective offset.
Upon receiving write commands 413a-413n, interconnect services 405 executes the write commands 413, which write the ic-streams to the partner nodes. If a node has become inactive in the time between NVM layer 411 determining that the node was active and the time of executing its respective write command, execution of the respective write command will likely fail. After beginning execution of the writing operations, interconnect services 405 can send one or more updates 415a-415n to mirror layer 403 updating the status (e.g. active, inactive, etc.) and PID of the partner nodes. Mirror layer 403 may send one or more updates 416a-416n to interconnect layer 402 which include an update of the status of the respective partners (e.g. active, inactive, etc.) and the physical address of the respective partners (e.g. PID 410a-410n). Interconnect layer 402 may send the updated status and PIDs to NVM layer 411 for use in maintaining the per-mirror status and PID mapping. Further, interconnect layer 402 may use this updated status and physical address information to determine whether one or more write commands (413a-413n) successfully launched. If interconnect layer 402 determines that at least one write command successfully issued, interconnect layer 402 may inform mirroring client 401b of the success by populating a VID and returning the populated VID in update 417 to the mirroring client (e.g. RAID 401b).
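A minimal sketch of how those launch updates might be folded into a single answer for the client (the LaunchUpdate type and markInactive hook are assumptions introduced for illustration):

```go
package mirror

// LaunchUpdate is a hypothetical per-instance status raised by the mirror layer
// after a write command is started.
type LaunchUpdate struct {
	PID      string
	Launched bool
}

// processLaunchUpdates marks partners whose write command failed to launch as
// inactive and reports whether the single mirror command can be considered
// successfully launched (at least one instance started).
func processLaunchUpdates(updates []LaunchUpdate, markInactive func(pid string)) bool {
	anyLaunched := false
	for _, u := range updates {
		if u.Launched {
			anyLaunched = true
			continue
		}
		markInactive(u.PID) // failed partner is taken out of the active set
	}
	return anyLaunched
}
```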
At this point, mirroring client 401b has issued a single mirror command 406, and, transparent to the client, interconnect layer 402 has split the single mirror command 406 into a plurality of simultaneous mirror instances 409a-409n. Further, the interconnect layer 402, transparent to the mirror client 401b, determined whether one or more mirroring operations has successfully begun, and if so, mirroring client 401b is informed that the mirroring operation is underway. As explained above, in order to prevent data loss or corruption, it is advantageous to prevent the mirroring client from performing further write operations to the NVLog after issuing mirror command 406, until mirroring client 401b confirms that the mirroring operation was successfully finished. In the present embodiments, mirroring operations may be executing on a plurality of nodes; thus, in order to prevent data loss and/or data corruption, it is desirable that mirroring client 401b withhold an acknowledgement indicating completion of a mirroring operation from the clients (e.g. 108 and 110 of figure 1) until mirroring client 401b determines that the mirroring operations have successfully finished.
In short, mirroring clients 401a-401n issue a single query 701 to determine the completion status of mirroring command 406. Interconnect layer 402 splits query 701 into a plurality of calls 702a-702n, one for each respective mirroring instance 409a-409n, and simultaneously issues the plurality of calls 702a-702n to mirror layer 403. Calls 702a-702n request an update on the completion status of each respective mirror instance 409a-409n. Mirror layer 403 receives the calls 702 and begins checking the status of each write command 413a-413n in parallel. Upon determining a respective write command 413's completion status, mirror layer 403 makes a callback to the interconnect layer 402 indicating the respective write command 413's status. Once interconnect layer 402 determines that each mirror instance 409a-409n is complete (whether it is a finished operation, an error, an operation that did not finish before a timeout occurred, or the like), interconnect layer 402 sends a single return 707 to mirror client 401b indicating a SUCCESS or FAILURE. As indicated above, to ease understanding, examples herein assume that mirroring client 401b is located in node 216a (of figure 2).
In detail, completion determination process 600 begins in step 601 wherein mirroring client 401b (located in node 216a) issues query 701. Query 701 comprises the latest VID which was received in update 417 (of figure 4).
If at step 603, interconnect layer 402 determines that timer 708 for the writing operations has not yet timed out, the process moves to step 604 where interconnect layer 402 simultaneously issues calls 702a-702n to mirror layer 403 requesting updates regarding the writing status of each mirror (e.g. nodes 216n, 218a, and 218n). Examples of writing statuses include an indication that a mirroring error has occurred, an indication that a mirroring operation has finished, and/or the like. In step 605, upon mirror layer 403 receiving calls 702a-702n, mirror layer 403 issues polls 703a-703n to interconnect services 405 for the writing status of each executing write command 413a-413n. Node 216a is located close in proximity to node 216n. As such, the poll issued to node 216n (the HA node in this example) is likely to complete quickly because latencies associated with node 216n may be relatively minimal due at least to distance. In contrast, node 216a is located at a distance from nodes 218a and 218n (the DR nodes in this example). As such, polls issued to the DR nodes are likely to complete more slowly (in comparison to the HA node) due to latencies associated with the DR nodes caused at least by distance. As will be seen below, interconnect layer 402 is operable to handle the varying latencies caused by the varying locations and transports associated with the varying partners.
In response to polls 703a-703n, interconnect services 405 returns poll updates 704a-704n, which indicate the status of the mirroring operations (e.g. mirroring successfully finished, mirroring error, and the like) for each respective mirroring operation. Due to the varying latencies discussed above, mirror layer 403 may receive poll updates 704a-704n at different times. In embodiments, as mirror layer 403 receives poll updates 704a-704n, at step 606a, mirror layer 403 raises an upcall (e.g. one or more upcalls 705a-705n) to interconnect layer 402 indicating the mirroring status of the respective mirrors (e.g. nodes 216n, 218a, and 218n). Interconnect layer 402 receives upcalls 705a-705n at varying times, due at least to latency issues, and logs the mirror status of each mirror (e.g. nodes 216n, 218a, and 218n) as the upcalls 705 are received.
Upon receiving upcalls 705a-705n, interconnect layer 402 determines whether a respective upcall 705 indicates that a mirroring operation is in error. A mirroring operation may be in error because the mirroring node has experienced a hardware and/or software failure that affects a portion of the node (e.g. the mirroring portion) or the entire node. Upon receiving an upcall 705 indicating an error, interconnect layer 402 marks the failing node as being offline (step 606b). Treatment of an offlined node is discussed further below with reference to step 611.
Process 600 now moves to step 607, wherein interconnect layer 402 determines whether all mirroring operations have been completed (e.g. by determining whether all expected upcalls have been received). If interconnect layer 402 determines that all mirroring operations have not yet been completed (e.g. either by successfully finishing the operation or by reporting an error), then the process 600 moves back to step 603 to determine whether timer 708 has expired. If at step 603, interconnect layer 402 determines that timer 708 has not yet expired, the process repeats steps 604-607 and 603. In embodiments, when repeating steps 604-607 and 603, interconnect layer 402 may issue progressively fewer calls in each repeated step 604. For example, when repeating step 604, interconnect layer 402 may limit the calls 702 issued to those nodes for which interconnect layer 402 does not yet know whether their mirroring operations are complete. For example, due to latencies, interconnect layer 402 is likely to receive upcalls corresponding to HA nodes before receiving upcalls corresponding to DR nodes. If that is the case, then when repeating step 604, issued calls may be limited to calls to DR nodes. If in step 604 fewer calls are issued, then in steps 605-607, a correspondingly smaller number of polls may be issued, and a correspondingly smaller number of poll updates may be returned. Steps 604-607 will be repeated until step 607 determines that all mirroring operations are complete or until step 603 determines that timer 708 has expired.
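The repeated steps 603-607 amount to a polling loop that shrinks its call set as completions arrive and stops when the timer expires; the sketch below uses assumed names (PollResult, the poll hook) and is not the actual mirror-layer interface:

```go
package mirror

import "time"

// PollResult is a hypothetical per-mirror poll update.
type PollResult int

const (
	StillRunning PollResult = iota // no completion reported yet
	Done                           // finished without error
	Failed                         // reported an error
)

// pollUntilDoneOrTimeout repeatedly polls only the mirrors whose outcome is
// still unknown, so each pass issues progressively fewer calls, and gives up
// when the timer expires. It returns the final outcome per partner PID.
func pollUntilDoneOrTimeout(pids []string, poll func(pid string) PollResult,
	timeout time.Duration) map[string]PollResult {

	results := make(map[string]PollResult)
	pending := append([]string(nil), pids...)
	deadline := time.Now().Add(timeout)

	for len(pending) > 0 && time.Now().Before(deadline) {
		var stillPending []string
		for _, pid := range pending {
			switch r := poll(pid); r {
			case Done, Failed:
				results[pid] = r // completed; drop from the next pass
			default:
				stillPending = append(stillPending, pid)
			}
		}
		pending = stillPending
		time.Sleep(10 * time.Millisecond) // brief pause between passes
	}
	for _, pid := range pending {
		results[pid] = StillRunning // timer expired before completion
	}
	return results
}
```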
If at step 607 interconnect layer 402 determines that all mirror instances 409a-409n are complete, process 600 moves to step 608 wherein interconnect layer 402 updates the last completed VID. Then in step 609, interconnect layer 402 determines whether the last completed VID is greater than or equal to the VID supplied in query 701. Because interconnect layer 402 updates the last completed VID only upon determining that all mirror instances 409a-409n are complete, determining that the last completed VID has reached the queried VID ensures that all the mirror operations for mirror command 406 have finished. If at step 609 interconnect layer 402 determines that the last completed VID is greater than or equal to the queried VID, process 600 moves to step 615, wherein interconnect layer 402 issues return 707 indicating SUCCESS to the mirroring client 401b. When mirroring client 401b knows that its mirroring command was a success, mirroring client 401b may perform operations accordingly. For example, if mirroring client 401b has withheld an acknowledgement of a writing operation until determining the success of mirror command 406, upon receiving a return 707 indicating success, mirroring client 401b may acknowledge the writing operation to clients 108 and 110.
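Read that way, the step 608/609 check reduces to a monotonic comparison; the sketch below assumes VIDs increase with each mirror command, which is an inference from the description rather than something the text states outright:

```go
package mirror

// vidTracker records the highest VID whose mirror instances have all completed.
type vidTracker struct {
	lastCompleted uint64
}

// markAllInstancesComplete is called at step 608 once every mirror instance
// for the command identified by vid has completed.
func (t *vidTracker) markAllInstancesComplete(vid uint64) {
	if vid > t.lastCompleted {
		t.lastCompleted = vid
	}
}

// queryComplete is the step 609 check: the command named in the query has
// finished if the last completed VID has caught up to the queried VID.
func (t *vidTracker) queryComplete(queriedVID uint64) bool {
	return t.lastCompleted >= queriedVID
}
```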
Referring back to step 607, if at step 607 interconnect layer 402 again determines that all mirror instances 409a-409n are not yet complete, then process 600 moves to step 603 again. If at step 603, interconnect layer 402 determines that timer 708 has expired, the process moves to step 610 wherein interconnect layer 402 determines whether any mirror operation successfully finished without error (as determined via upcalls 705a-705n). If at least one mirror operation has successfully finished, then mirroring command 406 may be considered a success because the NVLog is mirrored on at least one partner. However, because some of nodes 216n, 218a, and 218n may not have successfully finished the mirroring operation without error, interconnect layer 402 tracks which nodes were successful and which nodes failed so that auto-takeover may remain enabled for the successful nodes and auto-takeover may be disabled for the unsuccessful nodes (thereby preventing an unsuccessful node with incomplete and/or stale data from initiating an auto-takeover).
The interconnect layer 402's ability to issue a return 707 indicating success even though one or more of mirroring nodes 216n, 218a, and/or 218n failed is unique as compared to mirroring systems which are based on a one-to-one partnership, because in a one-to-one partnership, the failure of the single mirroring node causes the report to indicate failure. In contrast, present embodiments may issue return 707 indicating a success even if one or more mirroring nodes fail. Interconnect layer 402 tracks which mirror nodes (e.g. 216n, 218a, and 218n) succeeded and which failed to ensure that auto-takeovers are successful. Further, interconnect layer 402's operations are transparent to mirroring client 401b, thereby making embodiments herein operable with legacy mirroring clients without having to add overhead and complexity to the legacy mirroring clients.
Referring back to step 610, interconnect layer 402 determines whether any mirroring operation successfully finished without error before expiration of timer 708. If no mirror operation has successfully finished without error, the process moves to step 614 wherein interconnect layer 402 sends return 707 indicating a FAILURE to the mirror client 401b. Upon mirror client 401b receiving a FAILURE, mirror client 401b knows that mirroring command 406 failed, at which point, if desired, mirror client 401b may take steps to correct the failure, for example issuing a new mirror command and continuing to hold other operations.
If in step 610, interconnect layer 402 determines that at least one mirroring operation has successfully finished without error, the process moves to step 611, wherein interconnect layer 402 marks mirrors which were unable to return a poll update (e.g. 704a-704n) before timer 708 expired as offline. Marking a mirror node as being offline indicates that for some reason that node failed to complete the mirror operation within the allotted time and as a result contains stale and/or incomplete data. Examples of mirrors which were unable to complete the mirroring operation before the timer expired include mirrors that reported an error before timer 708 expired and mirrors that had not finished their writing operation when timer 708 expired. When a mirroring node (e.g. nodes 216n, 218a, and 218n) is marked as offline, the offlined node is prevented from performing an auto-takeover.
Taking a node offline may take some time to complete, so in step 612, interconnect layer 402 determines whether all offlined mirrors have been successfully taken completely offline before return 707 is issued. In some embodiments, step 612 is limited to determining whether certain ones of the mirrors are completely offline. For example, interconnect layer 402 may determine whether a failed HA mirror (e.g. node 216n) is completely offline, but skip that determination for DR mirrors (e.g. nodes 218a and 218n). In other embodiments, step 612 is performed for all failed mirrors (e.g. both failed HA mirrors and failed DR mirrors). Making the determination in step 612 ensures that mirroring client 401b, which upon receiving a SUCCESS may be naive regarding which mirror nodes succeeded and which mirror nodes failed, does not receive a SUCCESS while an offlining operation of interest is still in process. Returning a SUCCESS before an offlining operation is complete may lead to a situation wherein a failed node initiates an auto-takeover before the failed node is completely taken offline.
If at step 612, interconnect layer 402 determines that the offlining operations of nodes of interest have not yet completed when it is time to issue return 707, then interconnect layer 402 marks all mirroring nodes (e.g. 216n, 218a, and 218n) offline and issues a return 707 that indicates a FAILURE. Upon mirror client 401b receiving return 707 indicating a FAILURE, mirror client 401b knows that mirroring command 406 failed, at which point, if desired, mirror client 401b may take steps to correct the failure, for example issuing a new mirror command and maintaining a hold on other local NVLog write operations.
If at step 612, interconnect layer 402 determines that all of the offlining operations of interest successfully completed when it is time to issue return 707, then at step 615, interconnect layer 402 issues return 707 indicating a SUCCESS.
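Steps 610 through 615 can be summarized in a single decision sketch; the map of per-partner outcomes and the markOffline, offlineDone, and mustVerify hooks are assumptions introduced for illustration, not the actual interconnect-layer interfaces:

```go
package mirror

// finalReturn sketches steps 610 through 615. finishedOK maps each partner PID
// to whether its mirror finished without error before the timer expired.
func finalReturn(finishedOK map[string]bool, markOffline func(pid string),
	offlineDone func(pid string) bool, mustVerify func(pid string) bool) string {

	anySuccess := false
	for _, ok := range finishedOK {
		if ok {
			anySuccess = true
			break
		}
	}
	if !anySuccess {
		return "FAILURE" // step 614: no mirror finished in time
	}

	var offlined []string
	for pid, ok := range finishedOK {
		if !ok {
			markOffline(pid) // step 611: errored or timed-out mirror holds stale data
			offlined = append(offlined, pid)
		}
	}
	for _, pid := range offlined {
		if mustVerify(pid) && !offlineDone(pid) {
			// step 612 failure branch: an offlining operation of interest is still
			// in progress when the return is due, so mark all mirrors offline and fail.
			for p := range finishedOK {
				markOffline(p)
			}
			return "FAILURE"
		}
	}
	return "SUCCESS" // step 615
}
```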
Because mirroring operations are executing on both HA groups and DR groups, which will be on different transports and located at varying distances thereby experiencing differing latencies, it is likely that the completions will occur at differing times for the various nodes. The above example process allows interconnect layer 402 to manage mirroring completions that occur at differing times while also repeatedly checking timer 708 to determine whether time has expired for the writing operations to be complete. Further, if interconnection layer 402 determines that a writing operation is not complete before the timer expires, interconnection layer 402 marks the nodes having incomplete operations as offline, thereby preventing a node with stale and/or incomplete data from performing a take-over.
With the above described systems and methods, mirroring operations become scalable without mirroring clients having to know about the systems' dynamic size and/or dynamic structure. For example, mirroring nodes may be added and/or taken away from an HA group and/or a DR group and mirroring nodes may become active and/or inactive without the mirroring client knowing or having to account for the changes. Some or all changes to HA groups and/or DR groups may be handled by interconnect layer 402 making dynamics of the mirroring groups transparent to mirroring clients, thereby taking burden away from mirroring clients. As such, if desired, mirroring clients of the proposed systems and methods may continue to operate in the same or similar manner as compared to how legacy mirroring clients operate.
Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims. Moreover, the scope of the present application is not intended to be limited to the particular embodiments of the process, machine, manufacture, composition of matter, means, methods and steps described in the specification. As one of ordinary skill in the art will readily appreciate from the disclosure of the present invention, processes, machines, manufacture, compositions of matter, means, methods, or steps, presently existing or later to be developed that perform substantially the same function or achieve substantially the same result as the corresponding embodiments described herein may be utilized according to the present invention. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or steps.