In a high-availability (HA) cluster, storage controllers (also referred to herein as “storage nodes”) may be deployed in an active-passive configuration, in which a primary storage node takes on the role of an active node and at least one secondary storage node takes on the role of a standby node. In the active-passive configuration, the active node may process storage input/output (IO) requests from host computers and maintain page reference information in its memory, while the standby node may not be currently interacting with the host computers. Storage controllers in an HA cluster may also be deployed in an active-active configuration, in which two or more active nodes collaborate to process storage IO requests from host computers and maintain page reference information in their memories in a mirrored (image) style.
In the event of a process or equipment malfunction (also referred to herein as a “high availability (HA) event”) on an active node in an active-passive configuration, a system-level failover can occur, in which tasks of the active node, including processing storage IO requests and maintaining page reference information, are entirely taken over by a standby node. An appropriate set of actions can then be executed in a high-availability (HA) process flow (also referred to herein as an “HA flow”) to address actual or potential ramifications of the HA event. However, in an active-active configuration, multiple such HA events can occur simultaneously on two or more active nodes, requiring multiple HA flows to be executed concurrently to address any actual or potential ramifications of the HA events. Because such concurrent HA flows can have dependencies in which certain HA flows are dependent upon other HA flows or processes to execute their functions, a more unified approach to addressing HA events occurring on storage nodes in an active-active configuration is needed.
Techniques are disclosed herein for providing a centralized framework for handling execution of high-availability (HA) process flows in an active-active storage node configuration. The disclosed techniques can include an HA flows execution framework manager (also referred to herein as the “framework manager”), which can be implemented on one of multiple storage nodes in an active-active configuration. In the disclosed techniques, the framework manager can receive, periodically or at intervals, explicit or implicit notifications and/or reports of functional statuses of processes and/or equipment associated with the storage nodes in the active-active configuration. The framework manager can make determinations regarding whether and/or how to address any actual or potential process and/or equipment malfunctions (or “HA events”) based on the received notifications and/or reports. If the framework manager determines to address one or more actual or potential HA events occurring in the active-active configuration, then the framework manager can implement an HA flow for each HA event as an asynchronous process thread. The framework manager can represent each HA flow as an instance of an HA flow object and store the HA flow object for each HA flow waiting to be executed in a persistent repository or database. The framework manager can define each HA flow with reference to one or more dependencies specifying its relationships with one or more other HA flows and/or certain software, firmware, and/or hardware modules or components in the active-active configuration. Based at least on the dependencies defining conditions for the HA flow, the framework manager can determine whether to refuse a request to execute the HA flow, service the request to execute the HA flow, abort one or more HA flows in execution, and/or postpone execution of the HA flow to a later time.
By receiving notifications and/or reports of functional statuses of processes and/or equipment associated with storage nodes in an active-active configuration, making determinations regarding whether and/or how to address actual or potential HA events occurring on the processes and/or equipment associated with the storage nodes based on the received notifications and/or reports, and, in response to a request to execute an HA flow for a respective HA event, determining whether to refuse the request to execute the HA flow, service the request to execute the HA flow, abort one or more HA flows in execution, and/or postpone execution of the HA flow to a later time based on one or more dependencies defining conditions for the HA flow, mutual interference of HA flows or other process threads in the active-active configuration can be reduced or eliminated. As a result, recovery times from HA events occurring in the active-active configuration can be reduced.
In certain embodiments, a method of handling execution of high-availability (HA) process threads in an active-active storage node configuration includes receiving notifications of functional statuses of processes or equipment associated with storage nodes in an active-active configuration, determining that an HA event has occurred on one of the processes or equipment associated with the storage nodes in the active-active configuration based on the received notifications, and, in response to a request to execute a first HA process thread to address the HA event, performing one or more of refusing the request to execute the first HA process thread, servicing the request to execute the first HA process thread, aborting one or more HA process threads in execution, and postponing execution of the first HA process thread based on one or more dependencies defining conditions for the first HA process thread.
In certain arrangements, the method further includes specifying a set of parameters and a set of executable steps for the first HA process thread. The set of parameters includes the one or more dependencies defining the conditions for the first HA process thread and an abort policy specifying rules regarding whether or when to abort the one or more HA process threads in execution.
In certain arrangements, the method further includes, in response to the request to execute the first HA process thread not being refused, allocating a first HA process thread object representing the first HA process thread, and adding the first HA process thread object to a persistent database.
In certain arrangements, the method further includes checking the specified rules in the abort policy and aborting one or more of the HA process threads in execution based on the specified rules.
In certain arrangements, the method further includes checking the dependencies defining the conditions for the first HA process thread with regard to one or more other HA process threads represented by other HA process thread objects in the persistent database.
In certain arrangements, the method further includes, in response to the dependencies dictating an order in which the first HA process thread and the other HA process threads are to be executed, performing the postponing of the execution of the first HA process thread to satisfy the dependencies.
In certain arrangements, the method further includes checking the specified rules in the abort policy and aborting all of the HA process threads in execution based on the specified rules.
In certain arrangements, the method further includes checking the dependencies defining the conditions for the first HA process thread with regard to one or more other HA process threads represented by other HA process thread objects in the persistent database, and, in response to the dependencies dictating an order in which the first HA process thread and the other HA process threads are to be executed, performing the postponing of the execution of the first HA process thread to satisfy the dependencies.
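By way of a non-limiting illustration, the ordering dictated by such dependencies can be sketched with a topological sort. The flow names below (drive_rebuild, node_failover, cache_resync) are hypothetical, and the use of the Python standard-library graphlib module is an assumption made for this sketch only; the disclosure does not prescribe a particular ordering algorithm.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: each HA process thread maps to the set of
# HA process threads that must complete before it may execute.
dependencies = {
    "drive_rebuild": set(),
    "node_failover": {"drive_rebuild"},
    "cache_resync": {"node_failover"},
}

# static_order() yields every thread after all of its prerequisites.
execution_order = list(TopologicalSorter(dependencies).static_order())

# The "first HA process thread" of the text; it is postponed whenever
# at least one other thread must run ahead of it.
first_thread = "cache_resync"
postponed = execution_order.index(first_thread) > 0

print(execution_order)
print(f"{first_thread} postponed: {postponed}")
```

In this sketch, a thread is postponed whenever any prerequisite precedes it in the computed order. A circular dependency would cause graphlib to raise CycleError, which an implementation could treat as a reason to refuse the request.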
In certain arrangements, the method further includes, for each respective HA process thread from among the one or more other HA process threads represented by the other HA process thread objects in the persistent database, determining one or more of whether a request to execute the respective HA process thread should be refused and whether execution of the respective HA process thread should be postponed as necessary to satisfy its dependencies.
In certain arrangements, the method further includes, having determined whether the request to execute the respective HA process thread should be refused or whether the execution of the respective HA process thread should be postponed, initiating execution of the first HA process thread.
In certain embodiments, a system for handling execution of high-availability (HA) process threads in an active-active storage node configuration includes a persistent database, a memory, and processing circuitry configured to execute program instructions out of the memory to receive notifications of functional statuses of processes or equipment associated with storage nodes in an active-active configuration, to determine that an HA event has occurred on one of the processes or equipment associated with the storage nodes in the active-active configuration based on the received notifications, and, in response to a request to execute a first HA process thread to address the HA event, to perform one or more of refusing the request to execute the first HA process thread, servicing the request to execute the first HA process thread, aborting one or more HA process threads in execution, and postponing execution of the first HA process thread based on one or more dependencies defining conditions for the first HA process thread.
In certain arrangements, the processing circuitry is further configured to execute the program instructions out of the memory to specify a set of parameters and a set of executable steps for the first HA process thread, in which the set of parameters includes the one or more dependencies defining the conditions for the first HA process thread and an abort policy specifying rules regarding whether or when to abort the one or more HA process threads in execution.
In certain arrangements, the processing circuitry is further configured to execute the program instructions out of the memory, in response to the request to execute the first HA process thread not being refused, to allocate a first HA process thread object representing the first HA process thread, and to add the first HA process thread object to the persistent database.
In certain arrangements, the processing circuitry is further configured to execute the program instructions out of the memory to check the specified rules in the abort policy, and to abort one or more of the HA process threads in execution based on the specified rules.
In certain arrangements, the processing circuitry is further configured to execute the program instructions out of the memory to check the dependencies defining the conditions for the first HA process thread with regard to one or more other HA process threads represented by other HA process thread objects in the persistent database.
In certain arrangements, the processing circuitry is further configured to execute the program instructions out of the memory, in response to the dependencies dictating an order in which the first HA process thread and the other HA process threads are to be executed, to perform the postponing of the execution of the first HA process thread to satisfy the dependencies.
In certain arrangements, the processing circuitry is further configured to execute the program instructions out of the memory to check the specified rules in the abort policy, to abort all of the HA process threads in execution based on the specified rules, to check the dependencies defining the conditions for the first HA process thread with regard to one or more other HA process threads represented by other HA process thread objects in the persistent database, and, in response to the dependencies dictating an order in which the first HA process thread and the other HA process threads are to be executed, to perform the postponing of the execution of the first HA process thread to satisfy the dependencies.
In certain arrangements, the processing circuitry is further configured to execute the program instructions out of the memory, for each respective HA process thread from among the one or more other HA process threads represented by the other HA process thread objects in the persistent database, to determine one or more of whether a request to execute the respective HA process thread should be refused and whether execution of the respective HA process thread should be postponed as necessary to satisfy its dependencies.
In certain arrangements, the processing circuitry is further configured to execute the program instructions out of the memory, having determined whether the request to execute the respective HA process thread should be refused or whether the execution of the respective HA process thread should be postponed, to initiate execution of the first HA process thread.
In certain embodiments, a computer program product includes a set of non-transitory, computer-readable media having instructions that, when executed by processing circuitry, cause the processing circuitry to perform a method of handling execution of high-availability (HA) process threads in an active-active storage node configuration. The method includes receiving notifications of functional statuses of processes or equipment associated with storage nodes in an active-active configuration, determining that an HA event has occurred on one of the processes or equipment associated with the storage nodes in the active-active configuration based on the received notifications, and, in response to a request to execute a first HA process thread to address the HA event, performing one or more of refusing the request to execute the first HA process thread, servicing the request to execute the first HA process thread, aborting one or more HA process threads in execution, and postponing execution of the first HA process thread based on one or more dependencies defining conditions for the first HA process thread.
Other features, functions, and aspects of the present disclosure will be evident from the Detailed Description that follows.
The foregoing and other objects, features, and advantages will be apparent from the following description of particular embodiments of the present disclosure, as illustrated in the accompanying drawings, in which like reference characters refer to the same parts throughout the different views.
Techniques are disclosed herein for providing a centralized framework for handling execution of high-availability (HA) process flows (also referred to herein as “HA flow(s)”) in an active-active storage node configuration. The disclosed techniques can include receiving notifications and/or reports of functional statuses of processes and/or equipment associated with storage nodes in an active-active configuration, making determinations regarding whether and/or how to address actual or potential malfunctions (also referred to herein as “HA events”) occurring on the processes and/or equipment associated with the storage nodes based on the received notifications and/or reports, and, in response to a request to execute an HA flow for a respective HA event, determining whether to refuse the request to execute the HA flow, service the request to execute the HA flow, abort one or more HA flows in execution, and/or postpone execution of the HA flow to a later time based on one or more dependencies defining conditions for the HA flow. In this way, mutual interference of HA flows or other process threads in an active-active configuration can be reduced or eliminated, and recovery times from HA events occurring in the active-active configuration can be reduced.
The communications medium 103 can be configured to interconnect the plurality of host computers 102.1, . . . , 102.n and the active-active storage system 104 to enable them to communicate and exchange data and/or control signaling. As shown in
It is noted that each of the multiple storage nodes (e.g., storage node A 112.1, storage node B 112.2) included in the active-active storage system 104 can be configured to include at least a communications interface, processing circuitry, a memory, an OS, and a malfunction monitor like the storage node 200 of
In the context of the processing circuitry 204 being implemented using one or more processors executing specialized code and data, a computer program product can be configured to deliver all or a portion of the specialized code and data to the respective processor(s). Such a computer program product can include one or more non-transient computer-readable storage media, such as a magnetic disk, a magnetic tape, a compact disk (CD), a digital versatile disk (DVD), an optical disk, a flash drive, a solid-state drive (SSD), a secure digital (SD) chip or device, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), and so on. Further, the non-transient computer-readable storage media can be encoded with sets of program instructions for performing, when executed by the respective processor(s), the various techniques and/or methods disclosed herein.
During operation, the framework manager (e.g., the framework manager 212; see
The disclosed techniques for providing a centralized framework for handling execution of HA flows in an active-active storage node configuration will be further understood with reference to the following illustrative examples. In a first example, it is assumed that the framework manager 212 (see
Having determined that an HA event has occurred on a process or equipment associated with one of the storage nodes A 112.1, B 112.2, the framework manager 212 implements a new HA flow for the HA event as an asynchronous process thread. In this first example, the new HA flow is defined by a set of parameters and a set of executable steps. For example, the set of parameters can include (i) zero, one, or more dependencies specifying the new HA flow's relationships with one or more other HA flows represented by HA flow objects in the persistent HA flow object database 214, (ii) an abort policy specifying rules regarding whether and/or when to abort certain HA flows in execution at the time a request to execute the new HA flow is generated, and (iii) logging and statistics information. In some embodiments, the abort policy can be priority-based or can explicitly specify which HA flows in execution to abort. It is noted that certain HA flows in execution will be aborted only if required by the abort policy. In cases where there is no need to abort or otherwise interrupt an HA flow in execution, the HA flow will not be aborted or interrupted. Further, the set of executable steps can include a set of actions to be taken by the new HA flow to address the HA event. Upon implementation of the new HA flow for the HA event, the framework manager 212 generates a request to execute the new HA flow.
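For purposes of illustration only, the set of parameters and the set of executable steps defining an HA flow can be sketched as a simple data structure. All class names, field names, and flow names below (HAFlow, AbortPolicy, AbortMode, priority, fence_node, and so on) are hypothetical and are not taken from the disclosure; the numeric priority is merely one assumed way of realizing a priority-based abort policy.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable, List, Set


class AbortMode(Enum):
    NONE = auto()         # never abort HA flows in execution
    BY_PRIORITY = auto()  # abort running flows of lower priority (assumed scheme)
    EXPLICIT = auto()     # abort only the running flows named in `targets`


@dataclass
class AbortPolicy:
    mode: AbortMode = AbortMode.NONE
    targets: Set[str] = field(default_factory=set)

    def should_abort(self, new_priority: int, running_name: str,
                     running_priority: int) -> bool:
        """Return True if the named running flow must be aborted."""
        if self.mode is AbortMode.BY_PRIORITY:
            return running_priority < new_priority
        if self.mode is AbortMode.EXPLICIT:
            return running_name in self.targets
        return False


@dataclass
class HAFlow:
    name: str
    priority: int = 0
    depends_on: Set[str] = field(default_factory=set)           # (i) dependencies
    abort_policy: AbortPolicy = field(default_factory=AbortPolicy)  # (ii) abort policy
    log: List[str] = field(default_factory=list)                # (iii) logging/statistics
    steps: List[Callable[[], None]] = field(default_factory=list)  # executable steps

    def run(self) -> None:
        """Execute each step in order, recording its completion."""
        for step in self.steps:
            step()
            self.log.append(f"{self.name}: completed {step.__name__}")


# Usage with two placeholder steps.
def fence_node() -> None:
    pass


def rebuild_mirror() -> None:
    pass


flow = HAFlow(
    name="node_failover",
    priority=10,
    depends_on={"drive_rebuild"},
    abort_policy=AbortPolicy(AbortMode.BY_PRIORITY),
    steps=[fence_node, rebuild_mirror],
)
flow.run()
```

Keeping the abort policy as a separate object allows a priority-based policy and an explicit-target policy to share one interface, so that a flow in execution is aborted only when the policy requires it.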
In this first example, once the request to execute the new HA flow has been generated, the framework manager 212 determines, as appropriate, (i) whether the request should be immediately refused, (ii) whether any HA flows in execution should be aborted, in accordance with the abort policy, and (iii) whether execution of the new HA flow should be postponed to a later time. For example, such refusal of the request to execute the new HA flow can be based on the storage node A 112.1 or B 112.2 of interest having been taken offline or any other suitable reason. If the request is not immediately refused, then the framework manager 212 allocates an HA flow object configured to represent the new HA flow and adds the HA flow object to the HA flow object database 214. Further, the framework manager 212 checks the rules specified in the abort policy for the new HA flow and aborts zero, one, or more asynchronous process threads for HA flows in execution, as warranted by the rules. In addition, the framework manager 212 checks the dependencies of the new HA flow vis-a-vis one or more other HA flows represented by HA flow objects in the HA flow object database 214. If the HA flow dependencies dictate a certain order in which the HA flows may be executed, then the framework manager 212 can postpone the execution of the new HA flow, as necessary, to satisfy the dependencies.
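The sequence of determinations in this first example can be sketched as follows. This is a minimal illustration under stated assumptions: an in-memory dictionary stands in for the persistent HA flow object database 214, an explicit list of abort targets stands in for the abort policy, and the state names and the function handle_request are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Flow:
    name: str
    node: str                                   # storage node the flow concerns
    depends_on: Set[str] = field(default_factory=set)
    abort_targets: Set[str] = field(default_factory=set)
    state: str = "new"  # new | pending | running | aborted | refused


def handle_request(flow: Flow, db: Dict[str, Flow],
                   offline_nodes: Set[str]) -> str:
    # (i) Refuse immediately, e.g., if the node of interest is offline.
    if flow.node in offline_nodes:
        flow.state = "refused"
        return flow.state

    # Allocate the flow object and add it to the (stand-in) database.
    db[flow.name] = flow

    # (ii) Abort zero, one, or more running flows, as the policy warrants.
    for other in db.values():
        if other.state == "running" and other.name in flow.abort_targets:
            other.state = "aborted"

    # (iii) Postpone if any dependency is still pending or running.
    blocked = any(
        db[dep].state in ("pending", "running")
        for dep in flow.depends_on
        if dep in db
    )
    flow.state = "pending" if blocked else "running"
    return flow.state


db: Dict[str, Flow] = {}
s1 = handle_request(Flow("drive_rebuild", node="A"), db, set())
s2 = handle_request(
    Flow("node_failover", node="B", depends_on={"drive_rebuild"}), db, set())
s3 = handle_request(Flow("cache_resync", node="A"), db, {"A"})
s4 = handle_request(
    Flow("full_restart", node="B", abort_targets={"drive_rebuild"}), db, set())
print(s1, s2, s3, s4)
```

In this sketch, a refused request never reaches the database, which mirrors the text: an HA flow object is allocated only if the request is not immediately refused.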
Having determined that the request to execute the new HA flow should not be immediately refused, aborted zero, one, or more asynchronous process threads for HA flows in execution, and/or postponed the execution of the new HA flow as necessary to satisfy any dependencies, the framework manager 212 can determine whether any other factors exist preventing immediate execution of the new HA flow. If so, then the framework manager 212 can determine, periodically or at intervals, whether such factors preventing execution of the new HA flow continue to exist. Once it is determined that such factors no longer exist, then the framework manager 212 starts execution of the new HA flow in the asynchronous process thread.
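The periodic re-checking described above can be sketched as a polling loop that starts the asynchronous process thread once no blocking factor remains. The function name start_when_clear, the polling interval, and the timeout are assumptions introduced for this illustration; the disclosure does not specify a timeout.

```python
import threading
import time
from typing import Callable, Optional, Sequence


def start_when_clear(
    run_flow: Callable[[], None],
    blockers: Sequence[Callable[[], bool]],
    poll_interval: float = 0.01,
    timeout: float = 1.0,
) -> Optional[threading.Thread]:
    """Poll the blocking factors; start the flow in its own thread once
    none remains. Returns None if the (assumed) timeout expires first."""
    deadline = time.monotonic() + timeout
    while any(check() for check in blockers):
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)
    thread = threading.Thread(target=run_flow, name="ha-flow")
    thread.start()
    return thread


# Hypothetical blocking factor that clears on the third poll.
polls = {"count": 0}


def peer_still_booting() -> bool:
    polls["count"] += 1
    return polls["count"] < 3


done = []
thread = start_when_clear(lambda: done.append("flow ran"), [peer_still_booting])
if thread is not None:
    thread.join()
```

Each blocking factor is modeled as a callable returning True while the factor persists, so additional factors can be added without changing the loop.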
In a second example, it is assumed that the framework manager 212 (see
Having determined that an HA event has again occurred on a process or equipment associated with one of the storage nodes A 112.1, B 112.2, the framework manager 212 implements another new HA flow for the HA event as an asynchronous process thread. As in the first example, the new HA flow of the second example is defined by a set of parameters and a set of executable steps. For example, the set of parameters can include (i) zero, one, or more dependencies specifying the new HA flow's relationships with one or more other HA flows represented by HA flow objects in the persistent HA flow object database 214, (ii) an abort policy specifying rules regarding whether and/or when to abort certain HA flows in execution at the time a request to execute the new HA flow is generated, and (iii) logging and statistics information. In this second example, however, the rules specified in the abort policy dictate that all HA flows in execution are to be aborted. Upon implementation of the new HA flow for the HA event, the framework manager 212 generates a request to execute the new HA flow.
In this second example, once the request to execute the new HA flow has been generated, the framework manager 212 determines, as appropriate, (i) whether the request should be immediately refused, (ii) whether any HA flows in execution should be aborted, in accordance with the abort policy, and (iii) whether execution of the new HA flow should be postponed to a later time. If the request is not immediately refused, then the framework manager 212 allocates an HA flow object configured to represent the new HA flow and adds the HA flow object to the HA flow object database 214. Further, the framework manager 212 checks the rules specified in the abort policy for the new HA flow and aborts all asynchronous process threads for HA flows in execution, as warranted by the rules. In addition, the framework manager 212 checks the dependencies of the new HA flow vis-a-vis one or more other HA flows represented by HA flow objects in the HA flow object database 214 and postpones the execution of the new HA flow, as necessary, to satisfy the dependencies. Moreover, for each HA flow from among the other HA flows represented by HA flow objects in the HA flow object database 214, the framework manager 212 further determines, as appropriate, (i) whether the request to execute the HA flow should be immediately refused and (ii) whether execution of the HA flow should be postponed as necessary to satisfy its dependencies. Once these further determinations are made and satisfied, the framework manager 212 starts execution of the new HA flow in the asynchronous process thread.
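The second example, in which the abort policy dictates that all HA flows in execution be aborted, can be sketched as follows. As before, this is an illustrative sketch only: the dictionary stands in for the HA flow object database 214, and the function name admit_with_abort_all, the state names, and the flow names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Flow:
    name: str
    node: str
    depends_on: Set[str] = field(default_factory=set)
    state: str = "pending"  # pending | running | aborted | refused


def admit_with_abort_all(new_flow: Flow, db: Dict[str, Flow],
                         offline_nodes: Set[str]) -> str:
    # Refuse immediately if the node of interest is offline.
    if new_flow.node in offline_nodes:
        new_flow.state = "refused"
        return new_flow.state

    db[new_flow.name] = new_flow

    # The abort policy of this example: abort every flow in execution.
    for other in db.values():
        if other.state == "running":
            other.state = "aborted"

    # Re-evaluate each other queued flow: refuse it, or leave it postponed.
    for other in db.values():
        if other is new_flow or other.state != "pending":
            continue
        if other.node in offline_nodes:
            other.state = "refused"

    # Postpone the new flow while any of its dependencies is still queued.
    unmet = any(
        dep in db and db[dep].state == "pending"
        for dep in new_flow.depends_on
    )
    new_flow.state = "pending" if unmet else "running"
    return new_flow.state


db = {
    "drive_rebuild": Flow("drive_rebuild", "A", state="running"),
    "cache_resync": Flow("cache_resync", "B", state="running"),
    "volume_remount": Flow("volume_remount", "B", state="pending"),
}
new = Flow("cluster_recovery", "A", depends_on={"drive_rebuild"})
result = admit_with_abort_all(new, db, offline_nodes={"B"})
print(result, {name: f.state for name, f in db.items()})
```

In this sketch the new flow starts only after the other queued flows have been re-evaluated, reflecting the order of determinations described above.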
A method of handling execution of HA process threads in an active-active storage node configuration is described below with reference to
Several definitions of terms are provided below for the purpose of aiding the understanding of the foregoing description, as well as the claims set forth herein.
As employed herein, the term “storage system” is intended to be broadly construed to encompass, for example, private or public cloud computing systems for storing data, as well as systems for storing data comprising virtual infrastructure and those not comprising virtual infrastructure.
As employed herein, the terms “client,” “host,” and “user” refer, interchangeably, to any person, system, or other entity that uses a storage system to read/write data.
As employed herein, the term “storage device” may refer to a storage array including multiple storage devices. Such a storage device may refer to any non-volatile memory (NVM) device, including hard disk drives (HDDs), solid state drives (SSDs), flash devices (e.g., NAND flash devices, NOR flash devices), and/or similar devices that may be accessed locally and/or remotely (e.g., via a storage area network (SAN)). A storage array (drive array, disk array) may refer to a data storage system used for block-based, file-based, or object storage. Storage arrays can include, for example, dedicated storage hardware containing HDDs, SSDs, and/or all-flash drives. A data storage entity may be a filesystem, an object storage, a virtualized device, a logical unit (LU), a logical unit number (LUN), a logical volume (LV), a logical device, a physical device, and/or a storage medium. An LU may be a logical entity provided by a storage system for accessing data from the storage system and may be used interchangeably with a logical volume. The terms LU and LUN may be used interchangeably. A LUN may be a logical unit number for identifying an LU and may also refer to one or more virtual disks or virtual LUNs, which may correspond to one or more virtual machines. A physical storage unit may be a physical entity such as a drive or disk or an array of drives or disks for storing data in storage locations that can be accessed by addresses. A physical storage unit may be used interchangeably with a physical volume.
As employed herein, the term “storage medium” may refer to one or more storage media such as a hard drive; a combination of hard drives; flash storage; a combination of flash storage; a combination of hard drives, flash storage, and other storage devices; and/or any other suitable types or combinations of computer-readable storage media. A storage medium may also refer to both physical and logical storage media, include multiple levels of virtual-to-physical mappings, and include an image or disk image. A storage medium may be computer-readable and may be referred to as a computer-readable program medium.
As employed herein, the term “IO request” or “IO” may be used to refer to an input or output request such as a data read request or data write request.
As employed herein, the terms “such as,” “for example,” “e.g.,” “exemplary,” and variants thereof describe non-limiting embodiments and mean “serving as an example, instance, or illustration.” Any embodiments described herein using such phrases and/or variants are not necessarily to be construed as preferred or more advantageous over other embodiments, and/or to exclude the incorporation of features from other embodiments. In addition, the term “optionally” is employed herein to mean that a feature or process, etc., is provided in certain embodiments and not provided in certain other embodiments. Any particular embodiment of the present disclosure may include a plurality of “optional” features unless such features conflict with one another.
While various embodiments of the present disclosure have been particularly shown and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the present disclosure, as defined by the appended claims.