High availability in storage systems refers to the ability to provide uninterrupted, seamless operation of storage services over an extended period of time, including the ability to withstand various failures such as component failures, drive failures, node failures, network failures, or the like. High availability may be determined and quantified by two main factors: MTBF (Mean Time Between Failures) and MTTR (Mean Time to Repair). MTBF refers to the average time between failures of a system; a higher MTBF indicates a longer period of time that the system is likely to remain available. MTTR refers to the average time taken to repair the system and make it available again following a failure; a lower MTTR indicates a higher level of system availability. HA may be measured as a percentage of time that the system is available, where 100% implies that the system is always available. Steady-state availability may be estimated as MTBF/(MTBF+MTTR).
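As an illustrative sketch only (the function name and example values are hypothetical, not drawn from this disclosure), the relationship between MTBF, MTTR, and availability can be expressed as follows:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A node that fails on average every 10,000 hours and takes 2 hours to
# repair is available about 99.98% of the time.
print(f"{availability(10_000, 2):.4%}")  # prints 99.9800%
```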
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.
In general, this disclosure describes techniques for providing high availability in a data center. In particular, the data center may include two types of nodes associated with storing data: access nodes and data nodes. The access nodes may be configured to provide access to data of a particular volume or other unit of data, while the data nodes may store the data itself (e.g., data blocks and parity blocks for erasure coding, or copies of the data blocks). According to the techniques of this disclosure, a node of the data center may be designated as a secondary access node for a primary access node. In the event that the primary access node fails, the secondary access node may be configured as the new primary access node, and a new secondary access node may be designated, where the new secondary access node is configured for the new primary access node. In this manner, the system may be tolerant to multiple failures of nodes, because new secondary access nodes may be established following the failure of a primary access node.
In one aspect, the method comprises storing a unit of data to each of a plurality of data nodes of a data center, designating a first node of the data center as a primary access node for the unit of data, the primary access node being configured to service access requests to the unit of data using one or more of the plurality of data nodes, determining that the first node is not available, and performing a failover process by reconfiguring a second node of the data center as the primary access node for the unit of data.
The details of one or more examples are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description and drawings, and from the claims.
A high availability (HA) system for storing data in a data center may provide various data durability schemes to implement a fault tolerant system. An example of such a scheme is dual redundancy, where redundancy is provided through duplication of modules, components, and/or interfaces within a system. While dual redundancy may improve the availability of a storage system, the system may be unavailable if a component or module and its associated redundant component or module fail at the same time. For example, in a conventional HA system, redundancy mechanisms can be implemented to perform failover using fixed primary and secondary components/modules. In such a system, dual redundancy can be provided for components of the system where, if a primary component is down, a secondary component operating as a backup can perform the tasks that would otherwise have been performed by the primary component. However, such techniques can suffer from system unavailability when both the primary and secondary components fail. In such cases, the system is unavailable until it is serviced and brought back online. This is typically true even when a customer has multiple nodes, as each node is independent and does not have access to the data in other nodes. Furthermore, the recovery or replacement of failed controllers is typically a manual operation, which results in downtime of the system.
In view of the observations above, this disclosure describes various techniques that can provide an HA system protected against failure of multiple nodes, improving data availability of such systems. Various techniques related to implementing scale-out and disaggregated storage clustering using virtual access nodes are provided. Unlike traditional scale-up storage systems that can only fail over between fixed primary and secondary access nodes, the techniques of this disclosure allow for selecting and implementing a new secondary access node in response to a failover, allowing for dynamic designation of access nodes. Furthermore, the techniques of this disclosure enable automatic handling of failovers without manual intervention. In addition, the repair or replacement of storage access or data nodes can be performed without disruption to customers accessing the data. The techniques of this disclosure offer scalable, more resilient protection against failure than conventional HA techniques. In some examples, a new secondary access node can be selected and implemented in response to each subsequent failover, which can (in some cases) be a previous primary access node that has failed and been repaired. Such techniques allow for a higher system availability compared to traditional methods. In some examples, these techniques allow failing over a nearly unlimited number of times, limited only by the availability of a quorum of storage nodes for hosting a given volume of data. Additionally or alternatively, these techniques can include keeping a secondary access node independent from storage nodes (nodes where data shards are stored in drives, such as solid state drives (SSDs)), which allows the failover to remain unaffected by any ongoing rebuilds due to one or more storage nodes failing. These techniques may be applicable to a variety of storage services, such as block storage, object storage, and file storage.
In some examples, data center 10 may represent one of many geographically distributed network data centers. Data center 10 may provide various services, such as information services, data storage, virtual private networks, file storage services, data mining services, scientific- or super-computing services, web hosting services, etc. Customers 11 may be individuals or collective entities, such as enterprises and governments. For example, data center 10 may provide services for several enterprises and end users.
In some examples, SDN controller 21 operates to configure data processing units (DPUs) 17 to logically establish one or more virtual fabrics as overlay networks dynamically configured on top of the physical underlay network provided by switch fabric 14. For example, SDN controller 21 may learn and maintain knowledge of DPUs 17 and establish a communication control channel with each DPU 17. SDN controller 21 may use its knowledge of DPUs 17 to define multiple sets (DPU groups 19) of two or more DPUs 17 to establish different virtual fabrics over switch fabric 14. More specifically, SDN controller 21 may use the communication control channels to notify each DPU 17 which other DPUs 17 are included in the same set. In response, each DPU 17 may dynamically set up FCP tunnels with the other DPUs 17 included in the same set as a virtual fabric over a packet-switched network. In this way, SDN controller 21 may define the sets of DPUs 17 for each of the virtual fabrics, and the DPUs can be responsible for establishing the virtual fabrics. As such, underlay components of switch fabric 14 may be unaware of virtual fabrics.
DPUs 17 may interface with and utilize switch fabric 14 so as to provide full mesh (any-to-any) interconnectivity between DPUs 17 of any given virtual fabric. In this way, the servers connected to any of the DPUs 17 of a given virtual fabric may communicate packet data for a given packet flow to any other of the servers coupled to the DPUs 17 for that virtual fabric. Further, the servers may communicate using any of a number of parallel data paths within switch fabric 14 that interconnect the DPUs 17 of said virtual fabric. More details of DPUs operating to spray packets within and across virtual overlay networks are available in U.S. Provisional Patent Application No. 62/638,788, filed Mar. 5, 2018, entitled “NETWORK DPU VIRTUAL FABRICS CONFIGURED DYNAMICALLY OVER AN UNDERLAY NETWORK” (Attorney Docket No. 1242-036USP1) and U.S. patent application Ser. No. 15/939,227, filed Mar. 28, 2018, entitled “NON-BLOCKING ANY-TO-ANY DATA CENTER NETWORK WITH PACKET SPRAYING OVER MULTIPLE ALTERNATE DATA PATHS” (Attorney Docket No. 1242-002US01), the contents of which are incorporated herein by reference in their entireties for all purposes.
Data center 10 further includes storage service 22. In general, storage service 22 may configure various DPU groups 19 and storage nodes 12 to provide high availability (HA) for data center 10. For a given customer 11, one of storage nodes 12 (e.g., storage node 12A) may be configured as an access node, and one or more other nodes of storage nodes 12 (e.g., storage node 12N) may be configured as data nodes. In general, an access node provides access to a volume (i.e., a logical unit of data) for the given customer 11, while data nodes store data of the volume for redundancy but are otherwise not used to access the data for the given customer 11 (while acting as data nodes).
Storage service 22 may be responsible for access to various durable volumes of data from customers 11. For example, storage service 22 may be responsible for creating/deleting/mounting/unmounting/remounting various durable volumes of data from customers 11. Storage service 22 may designate one of storage nodes 12 as a primary access node for a durable volume. The primary access node may be the only one of storage nodes 12 that is permitted to direct data nodes of storage nodes 12 with respect to storing data for the durable volume. At any point in time, storage service 22 ensures that only one access node can communicate with the data nodes for a given volume. In this manner, these techniques may ensure data integrity and consistency in split-brain scenarios.
Each of storage nodes 12 may be configured to act as either an access node, a data node, or both for different volumes. For example, storage node 12A may be configured as an access node for a durable volume associated with a first customer 11 and as a data node for a durable volume associated with a second customer 11. Storage node 12N may be configured as a data node for the durable volume associated with the first customer 11 and as an access node for the durable volume associated with the second customer 11.
Storage service 22 may associate each durable volume with a relatively small number of storage nodes 12 (compared to N storage nodes 12). If one or more of the storage nodes 12 of a given durable volume goes down, a new, different set of storage nodes 12 may be associated with the durable volume. For example, storage service 22 may monitor the health of each of storage nodes 12 and initiate failover for corresponding durable volumes in response to detecting a failed node.
Storage service 22 may periodically monitor the health of storage nodes 12 for various purposes. For example, storage service 22 may receive periodic heartbeat signals from each storage node 12. When network connectivity is lost by a storage node 12 (or when a storage node 12 crashes), storage service 22 can miss the heartbeat signal from said storage node 12. In response, storage service 22 may perform an explicit health check on the node to confirm whether the node is not reachable or has failed. When storage service 22 detects that the health of an access node is below a predetermined threshold (e.g., the access node is in the process of failing or has in fact failed), storage service 22 may initiate a failover process. For example, storage service 22 may configure one of the data nodes or a secondary access node as a new primary access node for the volume such that access requests from the associated customer 11 are serviced by the new primary access node, thereby providing high availability. Access requests may thus be serviced by the new primary access node without delay, because data of the volume need not be relocated before the access requests are serviced. In some cases, storage service 22 may copy data of the volume to another storage node 12 that was not originally configured as a data node for the volume in order to maintain a sufficient level of redundancy for the volume (e.g., following a failover).
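The following sketch illustrates one possible form of the heartbeat monitoring and explicit health check described above; the class, callables, and thresholds are hypothetical assumptions, not the actual implementation of storage service 22:

```python
import time

HEARTBEAT_INTERVAL = 5.0   # seconds between expected heartbeats (hypothetical)
MISSED_LIMIT = 3           # missed heartbeats before an explicit health check

class HealthMonitor:
    """Tracks per-node heartbeats and triggers failover on confirmed failure."""

    def __init__(self, probe, failover):
        self.last_seen = {}       # node_id -> timestamp of last heartbeat
        self.probe = probe        # callable(node_id) -> bool, explicit health check
        self.failover = failover  # callable(node_id), initiates failover

    def on_heartbeat(self, node_id):
        self.last_seen[node_id] = time.monotonic()

    def check(self):
        now = time.monotonic()
        for node_id, seen in list(self.last_seen.items()):
            if now - seen > MISSED_LIMIT * HEARTBEAT_INTERVAL:
                # Heartbeats missed: confirm with an explicit health check
                # before declaring the node failed and failing over.
                if not self.probe(node_id):
                    self.failover(node_id)
```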
Accordingly, storage service 22 may allow for failover of a durable volume in a scale-out and disaggregated storage cluster. As long as there are a sufficient number of storage nodes 12 to host the durable volume, storage service 22 may perform such failover as many times as needed without disrupting access to the durable volume. Moreover, failover may be achieved without relocating data of the durable volume, unless one of the data nodes is also down. Thus, failover may occur concurrently with or separately from rebuilding of data for the volume. In this manner, storage service 22 may provide high availability even when some of storage nodes 12 containing data for the durable volume have failed either during or following the failover.
Data center 10 further includes storage initiators 24, each of which is a component (software or a combination of software and hardware) in a compute node 13. Customers 11 may request access to a durable volume (e.g., read, write, or modify data of the durable volume) via a storage initiator 24. A storage initiator 24 may maintain information identifying which storage node(s) 12 are configured as access nodes for each durable volume. Thus, when a new node is selected as either a new primary access node or a secondary access node, storage service 22 may send data representing the new primary access node and/or secondary access node to the appropriate storage initiator 24. In this manner, customers 11 need not be informed of failovers or of the precise locations of data for their durable volumes.
In some examples, a primary storage node (also referred to herein as a "primary access node") may be independent of storage nodes 12. In other examples, the primary storage node may be one of storage nodes 12. In response to a request to access data of a volume, the primary storage node may send a request to read, write, or modify data to/from a data node of storage nodes 12 associated with the volume.
As further described herein, in some examples, each DPU 17 is a highly programmable I/O processor specially designed for offloading certain functions from storage nodes 12 and compute nodes 13. In some examples, each DPU 17 includes one or more processing cores consisting of a number of internal processor clusters, e.g., MIPS cores, equipped with hardware engines that offload cryptographic functions, compression and regular expression (RegEx) processing, data storage functions, and networking operations. In this way, each DPU 17 includes components for fully implementing and processing network and storage stacks on behalf of one or more storage nodes 12 or compute nodes 13. In addition, each DPU 17 may be programmatically configured to serve as a security gateway for its respective storage nodes 12 or compute nodes 13, freeing up the processors of the servers to dedicate resources to application workloads.
In some example implementations, each DPU 17 may be viewed as a network interface subsystem that implements full offload of the handling of data packets (with zero copy in server memory) and storage acceleration for the attached server systems. In some examples, each DPU 17 may be implemented as one or more application-specific integrated circuits (ASICs) or other hardware and software components, each supporting a subset of the servers. DPUs 17 may also be referred to as access nodes or devices including access nodes. In other words, the term access node may be used herein interchangeably with the term DPU. Additional example details of various example DPUs are described in U.S. Provisional Patent Application No. 62/559,021, filed Sep. 15, 2017, entitled “Access Node for Data Centers,” and U.S. Provisional Patent Application No. 62/530,691, filed Jul. 10, 2017, entitled “Data Processing Unit for Computing Devices,” the contents of which are incorporated herein by reference in their entireties for all purposes.
In some examples, DPUs 17 are configurable to operate in a standalone network appliance having one or more DPUs 17. For example, DPUs 17 may be arranged into multiple different DPU groups 19, each including any number of DPUs 17 up to, for example, N DPUs 17. As described above, the number N of DPUs 17 may be different than the number N of storage nodes 12 or the number N of compute nodes 13. Multiple DPUs 17 may be grouped (e.g., within a single electronic device or network appliance), referred to herein as a DPU group 19, for providing services to a group of servers supported by the set of DPUs 17 internal to the device. In some examples, a DPU group 19 may comprise four DPUs 17, each supporting four servers so as to support a group of sixteen servers.
In some examples, each DPU group 19 may be configured as a standalone network device and may be implemented as a two rack unit (2RU) device that occupies two rack units (e.g., slots) of an equipment rack. In some examples, DPU 17 may be integrated within a server, such as a single 1RU server in which four CPUs are coupled to the forwarding ASICs described herein on a motherboard deployed within a common computing device. In some examples, one or more of DPUs 17, storage nodes 12, and/or compute nodes 13 may be integrated in a suitable size (e.g., 10RU) frame that may become a network storage compute unit (NSCU) for data center 10. For example, a DPU 17 may be integrated within a motherboard of a storage node 12 or a compute node 13 or otherwise co-located with a server in a single chassis.
In some examples, DPUs 17 interface and utilize switch fabric 14 so as to provide full mesh (any-to-any) interconnectivity such that any of storage nodes 12 and/or compute nodes 13 may communicate packet data for a given packet flow to any other server using any of a number of parallel data paths within the data center 10. For example, in some example network architectures, DPUs 17 spray individual packets for packet flows among the other DPUs 17 across some or all of the multiple parallel data paths in the switch fabric 14 of data center 10 and reorder the packets for delivery to the destinations so as to provide full mesh connectivity.
A data transmission protocol referred to as a Fabric Control Protocol (FCP) may be used by the different operational networking components of any of the DPUs 17 to facilitate communication of data across switch fabric 14. The use of FCP may provide certain advantages. For example, the use of FCP may significantly increase the bandwidth utilization of the underlying switch fabric 14. Moreover, in example implementations described herein, the servers of the data center may have full mesh interconnectivity and may nevertheless be non-blocking and drop-free.
FCP is an end-to-end admission control protocol in which, in some examples, a sender explicitly requests permission from a receiver to transfer a certain number of bytes of payload data. In response, the receiver issues a grant based on its buffer resources, QoS, and/or a measure of fabric congestion. In general, FCP enables spraying of packets of a flow across all paths between a source node and a destination node, which may provide numerous advantages, including resilience against request/grant packet loss, adaptive and low latency fabric implementations, fault recovery, reduced or minimal protocol overhead cost, support for unsolicited packet transfer, support for FCP capable/incapable nodes to coexist, flow-aware fair bandwidth distribution, transmit buffer management through adaptive request window scaling, receive buffer occupancy based grant management, improved end-to-end QoS, security through encryption and end-to-end authentication, and/or improved ECN marking support. More details on the FCP are available in U.S. Provisional Patent Application No. 62/566,060, filed Sep. 29, 2017, entitled “Fabric Control Protocol for Data Center Networks with Packet Spraying Over Multiple Alternate Data Paths,” the entire content of which is incorporated herein by reference for all purposes.
As described herein, the new processing architecture utilizing a DPU may be efficient for stream processing applications and environments compared to previous systems. For example, stream processing is a type of data processing architecture well-suited for high performance and high efficiency processing. A stream is defined as an ordered, unidirectional sequence of computational objects that can be of unbounded or undetermined length. In a simple example, a stream originates from a producer and terminates at a consumer and is operated on sequentially. In some examples, a stream can be defined as a sequence of stream fragments, where each stream fragment includes a memory block contiguously addressable in physical address space, an offset into that block, and a valid length. Streams can be discrete, such as a sequence of packets received from the network, or continuous, such as a stream of bytes read from a storage device. A stream of one type may be transformed into another type as a result of processing. For example, TCP receive (Rx) processing can consume segments (fragments) to produce an ordered byte stream. The reverse processing can be performed in the transmit (Tx) direction. Independently of the stream type, stream manipulation can involve efficient fragment manipulation.
In some examples, the plurality of cores 140 may be capable of processing a plurality of events related to each data packet of one or more data packets, which can be received by networking unit 142 and/or PCIe interfaces 146, in a sequential manner using one or more “work units.” In general, work units are sets of data exchanged between cores 140 and networking unit 142 and/or PCIe interfaces 146, where each work unit may represent one or more of the events related to a given data packet of a stream. As one example, a work unit (WU) can be a container that is associated with a stream state and used to describe (e.g., point to) data within a stream (e.g., data stored in memory). For example, work units may dynamically originate within a peripheral unit coupled to the multi-processor system (e.g., injected by a networking unit, a host unit, or a solid-state drive interface), or within a processor itself, in association with one or more streams of data, and terminate at another peripheral unit or another processor of the system. The work unit can be associated with an amount of work that is relevant to the entity executing the work unit for processing a respective portion of a stream.
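A minimal sketch of a work-unit container follows; the field names are illustrative assumptions based on the description above (stream state, a pointer into a stream, and the associated event), not the disclosure's actual layout:

```python
from dataclasses import dataclass

@dataclass
class WorkUnit:
    """Container describing a portion of a stream to be processed.

    Field names are hypothetical; the disclosure describes a work unit as
    associated with stream state and pointing to data within a stream.
    """
    stream_id: int   # identifies the stream (and its state) this work applies to
    block_addr: int  # contiguously addressable memory block holding the fragment
    offset: int      # offset of the fragment within the block
    length: int      # valid length of the fragment
    event: str       # the event this work unit represents (e.g., "rx_packet")

# A core that finishes processing one event may hand another core a new
# work unit for the same stream, as in the core 140A/140B example above.
wu = WorkUnit(stream_id=7, block_addr=0x1000, offset=0, length=1500, event="rx_packet")
```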
One or more processing cores of a DPU may be configured to execute program instructions using a work unit stack containing a plurality of work units. In some examples, in processing the plurality of events related to each data packet, a first core 140 (e.g., core 140A) may process a first event of the plurality of events. Moreover, first core 140A may provide to a second core 140 (e.g., core 140B) a first work unit. Furthermore, second core 140B may process a second event of the plurality of events in response to receiving the first work unit from first core 140A.
DPU 17 may act as a combination of a switch/router and a number of network interface cards. For example, networking unit 142 may be configured to receive one or more data packets from, and to transmit one or more data packets to, one or more external devices (e.g., network devices). Networking unit 142 may perform network interface card functionality, packet switching, etc., and may use large forwarding tables and offer programmability. Networking unit 142 may expose Ethernet ports for connectivity to a network.
Memory controller 144 may control access to memory unit 134 by cores 140, networking unit 142, and any number of external devices (e.g., network devices, servers, external storage devices, or the like). Memory controller 144 may be configured to perform a number of operations to perform memory management. For example, memory controller 144 may be capable of mapping accesses from one of the cores 140 to either coherent cache memory 136 or non-coherent buffer memory 138. In some examples, memory controller 144 may map the accesses based on one or more of an address range, an instruction or an operation code within the instruction, a special access, or a combination thereof.
Additional details regarding the operation and advantages of DPUs are available in U.S. patent application Ser. No. 16/031,921, filed Jul. 10, 2018, and titled “DATA PROCESSING UNIT FOR COMPUTE NODES AND STORAGE NODES,” (Attorney Docket No. 1242-004US01) and U.S. patent application Ser. No. 16/031,676, filed Jul. 10, 2018, and titled “ACCESS NODE FOR DATA CENTERS” (Attorney Docket No. 1242-005USP1), the contents of which are incorporated herein by reference in their entireties for all purposes.
Within DPU group 19, connections 42 may be copper (e.g., electrical) links arranged as 8×25 GE links between each of DPUs 17 and optical ports of DPU group 19. Between DPU group 19 and the switch fabric, connections 42 may be optical Ethernet connections coupled to the optical ports of DPU group 19. The optical Ethernet connections may connect to one or more optical devices within the switch fabric, e.g., optical permutation devices described in more detail below. The optical Ethernet connections may support more bandwidth than electrical connections without increasing the number of cables in the switch fabric. For example, each optical cable coupled to DPU group 19 may carry 4×100 GE optical fibers with each fiber carrying optical signals at four different wavelengths or lambdas. In other examples, the externally available connections 42 may remain as electrical Ethernet connections to the switch fabric.
In some examples, an 8-way mesh of DPUs 17 (e.g., a logical rack of two NSCUs 40) can be implemented where each DPU 17 may be connected to each of the other seven DPUs 17 by a 50 GE connection. For example, each of connections 46 between DPUs 17 within the same DPU group 19 may be a 50 GE connection arranged as 2×25 GE links. Each of connections 44 between a DPU 17 of a DPU group 19 and the DPUs 17 in the other DPU group 19 may include four 50 GE links. In some examples, each of the 50 GE links may be arranged as 2×25 GE links such that each of connections 44 includes 8×25 GE links.
In some examples, Ethernet connections 44, 46 provide full-mesh connectivity between DPUs 17 within a given structural unit, such as a full physical rack that includes four NSCUs 40 having four DPU groups 19 and supports a 16-way mesh of DPUs 17 for said DPU groups 19. In such examples, connections 46 can provide full-mesh connectivity between DPUs 17 within the same DPU group 19, and connections 44 can provide full-mesh connectivity between each DPU 17 and twelve other DPUs 17 within three other DPU groups 19. A DPU group 19 may be implemented to have the requisite number of externally available Ethernet ports to provide such connectivity. In the examples described above, a DPU group 19 may have at least forty-eight externally available Ethernet ports to connect to the DPUs 17 in the other DPU groups 19.
In the case of a 16-way mesh of DPUs 17, each DPU 17 may be connected to each of the other fifteen DPUs by a 25 GE connection, for example. Each of connections 46 between DPUs 17 within the same DPU group 19 may be a single 25 GE link. Each of connections 44 between a DPU 17 of a DPU group 19 and the DPUs 17 in the three other DPU groups 19 may include 12×25 GE links.
Solid-state storage 41 may include any number of storage devices with any amount of storage capacity. In some examples, solid state storage 41 may include twenty-four SSD devices with six SSD devices for each DPU 17. The twenty-four SSD devices may be arranged in four rows of six SSD devices with each row of SSD devices being connected to a DPU 17. Each of the SSD devices may provide up to 16 Terabytes (TB) of storage for a total of 384 TB per DPU group 19. As described in more detail below, in some cases, a physical rack may include four DPU groups 19 and their supported node groups 52. In that case, a typical physical rack may support approximately 1.5 Petabytes (PB) of local solid-state storage. In some examples, solid-state storage 41 may include up to 32 U.2×4 SSD devices. In other examples, NSCU 40 may support other SSD devices, e.g., 2.5″ Serial ATA (SATA) SSDs, mini-SATA (mSATA) SSDs, M.2 SSDs, and the like. As can readily be appreciated, various combinations of such devices may also be implemented.
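The capacity figures above can be checked with simple arithmetic; the following sketch merely restates the numbers given in the text (decimal TB-to-PB conversion assumed):

```python
ssds_per_dpu = 6
dpus_per_group = 4
tb_per_ssd = 16
tb_per_group = ssds_per_dpu * dpus_per_group * tb_per_ssd  # 6 * 4 * 16 = 384 TB
groups_per_rack = 4
pb_per_rack = groups_per_rack * tb_per_group / 1000        # 1536 TB, approximately 1.5 PB
print(tb_per_group, pb_per_rack)  # 384 1.536
```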
In examples where each DPU 17 is included on an individual DPU sled with local storage for the DPU 17, each of the DPU sleds may include four SSD devices and some additional storage that may be hard drive or solid-state drive devices. The four SSD devices and the additional storage may provide approximately the same amount of storage per DPU 17 as the six SSD devices described in the previous examples.
In some examples, each DPU 17 supports a total of 96 PCIe lanes. In such examples, each of connections 48 may be an 8×4-lane PCI Gen 3.0 connection via which each DPU 17 may communicate with up to eight SSD devices within solid state storage 41. In addition, each of connections 50 between a given DPU 17 and the four storage nodes 12 and/or compute nodes 13 within the node group 52 supported by the given DPU 17 may be a 4×16-lane PCIe Gen 3.0 connection. In these examples, DPU group 19 has a total of 256 external-facing PCIe links that interface with node groups 52. In some scenarios, DPUs 17 may support redundant server connectivity such that each DPU 17 connects to eight storage nodes 12 and/or compute nodes 13 within two different node groups 52 using an 8×8-lane PCIe Gen 3.0 connection.
In some examples, each of DPUs 17 supports a total of 64 PCIe lanes. In such examples, each of connections 48 may be an 8×4-lane PCI Gen 3.0 connection via which each DPU 17 may communicate with up to eight SSD devices within solid state storage 41. In addition, each of connections 50 between a given DPU 17 and the four storage nodes 12 and/or compute nodes 13 within the node group 52 supported by the given DPU 17 may be a 4×8-lane PCIe Gen 4.0 connection. In these examples, DPU group 19 has a total of 128 external facing PCIe links that interface with node groups 52.
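Similarly, the PCIe lane budgets in the two configurations described above can be verified as a worked check of the stated numbers:

```python
ssd_lanes  = 8 * 4   # connections 48: 8 x 4-lane links to SSD devices
gen3_lanes = 4 * 16  # connections 50 (Gen 3.0): 4 x 16-lane links to node group
gen4_lanes = 4 * 8   # connections 50 (Gen 4.0): 4 x 8-lane links to node group
print(ssd_lanes + gen3_lanes)  # 96 lanes total per DPU (first example)
print(ssd_lanes + gen4_lanes)  # 64 lanes total per DPU (second example)
```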
Each storage node 162 includes a set of one or more communicatively-coupled storage devices 160 to store data. Each storage device 160 coupled to a respective DPU 158 may be virtualized and presented as volumes to a storage initiator 152 through virtual storage controllers. As such, reads and writes issued by a storage initiator 152 may be served by storage nodes 162. The data for a given volume may be stored in a small subset of storage nodes 162 (e.g., 6 to 12 nodes) for providing durability. Storage nodes 162 that contain the data for a volume may be referred to as “data nodes” for that volume. For a given volume, one of the storage nodes 162 may serve as an access node to which a storage initiator 152 sends input/output (IO) requests. The access node is responsible for reading/writing the data from/to the data nodes and responding to requests from a storage initiator 152. A given storage node 162 may serve as an access node for some volumes and as a data node for other volumes.
Data of a volume may be made durable by using erasure coding or replication. Other data redundancy schemes may also be implemented. When using erasure coding (EC) for a durable volume, parity blocks may be calculated for a group of data blocks. Each of these blocks may be stored on a different storage node. These nodes may be the data nodes of the corresponding volume. Compared to replication, EC is generally more efficient in terms of additional network traffic between storage nodes and the storage overhead required for redundancy. Because redundancy is created through EC, data is not lost even when one or more data nodes have failed. The number of node failures that can be tolerated depends on how many parity blocks are calculated. For instance, two simultaneous node failures can be tolerated when using two parity blocks.
When using replication, data is replicated across a given number of nodes to provide durability against one or more nodes going down. The number of replicas for a given volume determines how many node failures can be tolerated.
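As a sketch of the tradeoff described above, the storage overhead and failure tolerance of replication versus erasure coding can be compared as follows (the function names are illustrative):

```python
def replication_overhead(replicas: int) -> float:
    """Raw-to-usable storage ratio for replication; tolerates replicas - 1 failures."""
    return float(replicas)

def ec_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Raw-to-usable storage ratio for erasure coding; tolerates parity_blocks failures."""
    return (data_blocks + parity_blocks) / data_blocks

# Tolerating two node failures: 3-way replication costs 3.0x raw storage,
# while 4+2 erasure coding costs only 1.5x, illustrating why EC is more
# storage-efficient for the same failure tolerance.
print(replication_overhead(3))  # 3.0
print(ec_overhead(4, 2))        # 1.5
```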
Storage initiator 202 may access one or more volumes by connecting to associated storage controllers presented by storage target clusters.
System 250 may implement an approach to high availability that allows a volume to fail over a virtually unlimited number of times, as long as there are enough nodes and capacity in the cluster to be able to do so. Such an approach can utilize virtual storage controllers and can achieve durability by sharing the data over many storage nodes with redundancy. This design and implementation of HA allows storage initiator 254 to access data of a volume via multiple paths, with each path going through an access node, e.g., primary access node 256. Secondary access node 258 may act as a backup access node to primary access node 256 in case primary access node 256 fails or is otherwise unavailable. The access nodes 256, 258 may act as storage target nodes that allow access to durable storage volumes that are erasure coded or replicated across multiple data nodes. The data nodes 260 may be nodes used by durable storage volumes to store data of the volumes. For a given volume, a storage target node can act as either an access node or a data node or both an access node and a data node.
Data of the volume can be stored across a number of data nodes 260, the number of which can depend on the durability scheme implemented. For example, there may be a total of six data nodes for an EC scheme of four data blocks and two parity blocks. Every chunk of data may be split into four data blocks, two parity blocks may be generated from those four data blocks, and the resulting six blocks may be stored across the six data nodes. This four-data-block, two-parity-block EC scheme thus allows data to remain available even when there are up to two data node failures.
Any storage target node in the cluster can be selected as a primary access node or as a secondary access node for a given volume. The secondary access node for a volume should not be the same node as the primary access node of that volume. There may be other criteria, such as failure domains, when selecting the primary and secondary access nodes as well as the data nodes for a given volume. The number of data nodes and parity nodes for a given volume may be configurable and may be subject to an upper bound. The number of data nodes and parity nodes may be selected based on the storage overhead, cluster size, and the number of simultaneous node failures that can be tolerated.
At any given time, one of the two access nodes may act as a primary node (e.g., primary access node 256) and requests from storage initiator 254 may be served by said primary access node 256. Primary access node 256 has access to the volume data residing in data nodes 260, whereas connections from secondary access node 258 to data nodes 260 are not established unless secondary access node 258 becomes a primary access node. Secondary access node 258 does not service any input/output requests when configured as the secondary access node. However, storage initiator 254 is connected to secondary access node 258 as well as primary access node 256. Cluster services 252 may manage all nodes (illustrated as dashed lines to indicate control plane operations), including the DPU-based storage initiator nodes, e.g., storage initiator 254.
In the event that primary access node 290 goes down or is otherwise unavailable, secondary access node 300 becomes the primary access node, and storage initiator 282 fails over to secondary access node 300. Another node from the storage cluster may then be selected and configured as a new secondary access node.
This same process of failing over to the current secondary access node as a new primary access node and selecting a new secondary access node from the storage cluster may repeat any time the current primary node goes down. A selected new secondary access node may be a previous primary access node that has been repaired and made available. Thus, failover can be performed a nearly unlimited number of times, as long as there is enough capacity left in the cluster for rebuilds when needed and a quorum of nodes (e.g., six nodes when using 4+2 erasure coding) remains available in the cluster. Both primary virtual storage controller 292 and secondary virtual storage controller 302 may provide the same volume identity (e.g., volume 294) and any other controller identity that needs to be maintained. This is possible because the storage controllers are virtual and can be created or deleted with the needed properties on any of the nodes of the data center.
The storage service may ensure that only one path is primary at any given time, thereby avoiding the possibility of storage initiator 282 communicating with both primary access node 290 and secondary access node 300 at the same time. Otherwise, there would be a potential for loss or corruption of data due to the states of the two nodes diverging. To avoid the split-brain situation of both access nodes becoming active, the storage service may ensure that the current secondary is made primary (active) only after confirming that any write input/output operations to the failed primary cannot reach the actual data nodes, by cutting off the connections between the failed primary access node and the data nodes. These techniques for a durable volume with separation of access and data nodes allow the storage service to avoid this split-brain situation without having to rely on a third party for either the metadata or the confirmation of the primary access node failure. When there is a path change, the information may be passed to storage initiator 282. When storage initiator 282 is also a DPU-based initiator, the host software need not be aware of any path changes, as the initiator DPU handles the failover together with the storage service.
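One possible ordering of the fence-then-promote sequence described above is sketched below; the node, volume, and initiator methods are hypothetical placeholders rather than an actual API of the storage service:

```python
def fail_over(volume, failed_primary, secondary, data_nodes):
    """Promote the secondary only after fencing the failed primary.

    Illustrative ordering per the description above; all methods used here
    are hypothetical placeholders.
    """
    # 1. Fence: ensure in-flight writes from the failed primary can no
    #    longer reach the data nodes, preventing split-brain divergence.
    for node in data_nodes:
        node.drop_connection(from_node=failed_primary, volume=volume)

    # 2. Only now make the secondary the active (primary) access node.
    secondary.promote_to_primary(volume)

    # 3. Inform the storage initiator of the path change so subsequent
    #    input/output is directed to the new primary.
    volume.initiator.update_primary_path(secondary)
```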
When a secondary access node fails, the storage service may select a new secondary node, mount the volume at the new secondary access node, and provide information to storage initiator 282 indicative of the new secondary access node. The data path will not be disrupted in this case as the primary access node is not affected.
The storage target nodes storing data for a volume can be different from the storage target nodes providing access to the volume. In such cases, there need not be any data relocation when a primary access node fails unless the primary access node also hosts a data shard for the volume.
It is unlikely that both primary access node 290 and secondary access node 300 would fail at the same time. However, when this happens, the storage service may wait for one of these access nodes to come back online before initiating a failover, as there might be some input/output commands that have not yet been flushed to the storage devices of storage nodes 288. The primary and secondary access nodes may be kept in different power failure domains to reduce the likelihood of both primary and secondary access nodes going down at the same time. In some examples, there can be multiple secondary nodes for a given volume, which can provide for a more robust system at the expense of additional overhead and traffic in maintaining write data at all of the access nodes.
Storage service 156 may then mount the volume to the current secondary access node (322). Storage service 156 may also create a virtual storage controller on the secondary access node (324) and attach the volume to the virtual storage controller on the secondary access node (326). Storage service 156 may then make the secondary access node the primary access node for the volume (328), e.g., providing information to an associated storage initiator indicating that the previous secondary access node is now the primary access node. Thus, the new primary access node may then service access requests received from the storage initiator (330).
Following the timeout period (or in the case that there is not a quorum of nodes available for the durable volume following the initial failure of the primary access node (“NO” branch of 340)), if the old primary access node has recovered (“YES” branch of 346), storage service 156 may resynchronize data blocks on the old primary access node (348) and make the old primary access node the new secondary access node (350). On the other hand, if the old primary access node has not recovered within the timeout period (“NO” branch of 346), storage service 156 may rebuild the failed blocks (352) and select a new secondary access node (354). In either case, storage service 156 may instruct the storage initiator to connect to the new secondary access node (356).
If a failed node has crashed or rebooted due to some failure, then the failed node volume state may be lost. Hence, the storage initiator may not be able to access the durable volume through this node anymore. If the primary access node is available and storage service 156 cannot reach it (e.g., due to network failure somewhere between storage service 156 and the node) but the node is still accessible by the storage initiator (e.g., due to management path failure when data and management networks are separate), then storage service 156 may ensure that the storage initiator cannot access the data via this failed node. Storage service 156 may be configured to avoid sending any commands to the primary access node in this case. Storage service 156 may also reach the data nodes of the durable volume and delete the connections from the failed primary access node to the data nodes. As a result, the primary access node would be rendered incapable of sending any further commands from the storage initiator to the data nodes. Thus, these techniques may protect the data center from data corruption or loss due to split brain situations.
Storage service 156 may initiate a resynchronization or rebuild of the data blocks. If data blocks of a data node have failed, and the DPU managing those data blocks does not come back within a certain period of time, storage service 156 may initiate a full rebuild of those data blocks on another data node or data nodes. If the failed data node comes back online within a waiting period, storage service 156 may avoid a full rebuild and instead issue a resynchronization instruction to bring the data blocks up to date with all data writes that have happened since the data block failure.
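A sketch of this wait-then-decide logic follows, assuming a hypothetical waiting period and hypothetical node methods:

```python
import time

RESYNC_WAIT_SECONDS = 300  # hypothetical waiting period before a full rebuild

def repair_data_blocks(failed_node, spare_node, is_back_online):
    """Resynchronize if the failed data node returns in time; otherwise rebuild.

    All node methods here are hypothetical placeholders for the behavior
    described above.
    """
    deadline = time.monotonic() + RESYNC_WAIT_SECONDS
    while time.monotonic() < deadline:
        if is_back_online(failed_node):
            # Node recovered within the waiting period: bring its blocks up
            # to date with only the writes that happened since the failure.
            failed_node.resynchronize()
            return
        time.sleep(5)
    # Node did not return in time: rebuild its blocks from the surviving
    # data and parity blocks onto another data node.
    spare_node.rebuild_blocks_from_peers()
```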
To select the new secondary access node, storage service 156 may wait for a period of time to see if the failed node comes back online. If the failed node is able to recover within the time period, storage service 156 may use that node as the new secondary access node. However, if the failed node does not come back within the waiting period, storage service 156 may select a new node as a new secondary access node. The selection of the new secondary access node may depend on various factors, such as power failure domains and input/output load. Once the new secondary access node is selected, storage service 156 may mount the volume to the secondary access node in passive mode (or inactive mode) and attach the volume to a newly created storage controller of the secondary access node. Storage service 156 may then instruct the storage initiator to establish a connection with the newly created secondary storage controller. In the case that storage initiators are not managed by storage service 156, the information may be pulled by storage initiators using a standard discovery service.
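The selection criteria described above (reuse the repaired node if it returns in time; otherwise prefer a node in a different power failure domain with low input/output load) might be sketched as follows, with hypothetical node attributes:

```python
def select_new_secondary(old_node, candidates, primary, came_back_in_time):
    """Select a new secondary access node after a failover.

    `came_back_in_time` is a hypothetical callable implementing the waiting
    period; `power_domain` and `io_load` are hypothetical node attributes.
    """
    if came_back_in_time(old_node):
        return old_node
    eligible = [node for node in candidates
                if node is not primary
                and node.power_domain != primary.power_domain]
    # Among eligible candidates, pick the least-loaded node.
    return min(eligible, key=lambda node: node.io_load)
```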
The method further includes determining that the first node is not available (364). Determination of availability can be performed in various ways. In some examples, periodic heartbeat signals are received from each node. A missed heartbeat signal for a given node indicates that the given node may be unavailable, e.g., due to failure of the node or lost network connectivity. When a heartbeat signal is missed, the health of the given node can be checked. If the health of the given node is below a predetermined threshold, the given node can be determined to be unavailable.
The method further includes performing a failover process (366). The failover process can be performed in various ways, including the methods described above.
The method may optionally include determining that the second node is not available (368) and performing a second failover process (370). The second failover process can be performed in various ways. Due to the unavailability of the primary access node (the second node), the second failover process can include reconfiguring a different node as the primary access node. Depending on the availability of the nodes of the data center, different nodes may be selected. For example, if the first node (the original primary access node) is available, it may be selected and designated as the new primary access node. In some examples, a different, third node is selected and designated as the new primary access node. As described above, the “third node” may be a node that was designated as the new secondary access node.
The techniques described in this disclosure may be implemented, at least in part, in hardware, software, firmware or any combination thereof. For example, various aspects of the described techniques may be implemented within one or more processors, including one or more microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), or any other equivalent integrated or discrete logic circuitry, as well as any combinations of such components. The term “processor” or “processing circuitry” may generally refer to any of the foregoing logic circuitry, alone or in combination with other logic circuitry, or any other equivalent circuitry. A control unit comprising hardware may also perform one or more of the techniques of this disclosure.
Such hardware, software, and firmware may be implemented within the same device or within separate devices to support the various operations and functions described in this disclosure. In addition, any of the described units, modules or components may be implemented together or separately as discrete but interoperable logic devices. Depiction of different features as modules or units is intended to highlight different functional aspects and does not necessarily imply that such modules or units must be realized by separate hardware or software components. Rather, functionality associated with one or more modules or units may be performed by separate hardware or software components, or integrated within common or separate hardware or software components.
The techniques described in this disclosure may also be embodied or encoded in a computer-readable medium, such as a computer-readable storage medium, containing instructions. Instructions embedded or encoded in a computer-readable medium may cause a programmable processor, or other processor, to perform the method, e.g., when the instructions are executed. Computer-readable media may include non-transitory computer-readable storage media and transient communication media. Computer readable storage media, which is tangible and non-transitory, may include random access memory (RAM), read only memory (ROM), programmable read only memory (PROM), erasable programmable read only memory (EPROM), electronically erasable programmable read only memory (EEPROM), flash memory, a hard disk, a CD-ROM, a floppy disk, a cassette, magnetic media, optical media, or other computer-readable storage media. It should be understood that the term “computer-readable storage media” refers to physical storage media, and not signals, carrier waves, or other transient media.
Various examples have been described. These and other examples are within the scope of the following claims. The following paragraphs provide additional support for the claims of the subject application. One aspect provides a method of providing access to data of a data center, the method comprising storing a unit of data to each of a plurality of data nodes of a data center, designating a first node of the data center as a primary access node for the unit of data, the primary access node being configured to service access requests to the unit of data using one or more of the plurality of data nodes, determining that the first node is not available, and performing a failover process by reconfiguring a second node of the data center as the primary access node for the unit of data. In this aspect, additionally or alternatively, the method further comprises receiving a request from a client associated with the unit of data to access the unit of data using the primary access node and, after determining that the first node is not available, providing access to the client to the unit of data via the second node. In this aspect, additionally or alternatively, the failover process is performed by a storage service, and wherein access to the unit of data via the second node is provided by a virtual storage controller unit in response to the failover process. In this aspect, additionally or alternatively, the request from the client to access the unit of data comprises at least one of reading, writing, or modifying the unit of data. In this aspect, additionally or alternatively, the failover process is performed without copying the unit of data when the plurality of data nodes is available. In this aspect, additionally or alternatively, the method further comprises determining that at least one data node of the plurality of data nodes is unavailable, wherein the at least one data node is below a predetermined number and performing a rebuild of the at least one data node that is unavailable during the performing of the failover process. In this aspect, additionally or alternatively, performing the failover process further comprises determining that a first candidate node is not available and reconfiguring a second candidate node as the primary access node, wherein the second candidate node is the second node. In this aspect, additionally or alternatively, the method further comprises determining that a data node of the plurality of data nodes is unavailable, configuring an additional node of the data center, separate from the plurality of data nodes, as a data node for the unit of data, and copying at least a portion of the unit of data to the additional node. In this aspect, additionally or alternatively, the method further comprises monitoring a periodic signal from the primary access node, wherein determining that the primary access node is not available comprises determining that the periodic signal has not been received from the primary access node. In this aspect, additionally or alternatively, the method further comprises determining that the second node is not available and performing a second failover process by reconfiguring the first node as the primary access node for the unit of data.
Another aspect provides a storage system for providing access to data of a data center, the storage system comprising a plurality of data processing units, a plurality of nodes, each node associated with a respective data processing unit of the plurality of data processing units, wherein the storage system is configured to store a unit of data to one or more storage devices of one or more data nodes, wherein the one or more data nodes are nodes within the plurality of nodes, designate a first access node as a primary access node for the unit of data, wherein the first access node is a node within the plurality of nodes different from the one or more data nodes, and wherein the primary access node is configured to service access requests to the unit of data using the one or more data nodes, determine that the primary access node is not available, and perform a failover process by reconfiguring a second access node as the primary access node for the unit of data, wherein the second access node is a node within the plurality of nodes different from the first access node and the one or more data nodes. In this aspect, additionally or alternatively, the storage system is further configured to receive a request from a client associated with the unit of data to access the unit of data using the primary access node and, after determining that the first node is not available, providing access to the client to the unit of data via the second node. In this aspect, additionally or alternatively, the failover process is performed by a storage service, and wherein providing access to the client to the unit of data via the second node is allowed by a virtual storage controller unit of the second node in response to the failover process. In this aspect, additionally or alternatively, performing the failover process further comprises determining that a first candidate node is not available and reconfiguring a second candidate node as the primary access node, wherein the second candidate node is the second access node. In this aspect, additionally or alternatively, the storage system is further configured to determine that the second access node, which is currently designated as the primary access node, is not available, and perform a second failover process by reconfiguring the first access node as the primary access node.
Another aspect provides a method for providing access to data of a data center, the method comprising storing a unit of data to each of a plurality of data nodes of a data center, designating a first node of the data center as a primary access node for the unit of data, the primary access node being configured to service access requests to the unit of data using one or more of the plurality of data nodes, designating a second node of the data center as a secondary access node for the unit of data, determining that the primary access node is not available, and performing a failover process by reconfiguring the second node of the data center as the primary access node for the unit of data by removing connections from the first node to the unit of data, establishing connections from the second node to the unit of data, and designating a node different from the second node as the secondary access node. In this aspect, additionally or alternatively, the method further comprises receiving a request from a client associated with the unit of data to access the unit of data using the primary access node and, after determining that the first node is not available, providing access to the client to the unit of data via the second node. In this aspect, additionally or alternatively, performing the failover process further comprises determining that a first candidate node is not available and reconfiguring a second candidate node as the primary access node, wherein the second candidate node is the second node. In this aspect, additionally or alternatively, the method further comprises determining that a current primary access node is not available and performing a second failover process. In this aspect, additionally or alternatively, performing the second failover process includes designating the first node as the primary access node.
It will be understood that the configurations and/or approaches described herein are exemplary in nature, and that these specific embodiments or examples are not to be considered in a limiting sense, because numerous variations are possible. The specific routines or methods described herein may represent one or more of any number of processing strategies. As such, various acts illustrated and/or described may be performed in the sequence illustrated and/or described, in other sequences, in parallel, or omitted. Likewise, the order of the above-described processes may be changed.
The subject matter of the present disclosure includes all novel and non-obvious combinations and sub-combinations of the various processes, systems and configurations, and other features, functions, acts, and/or properties disclosed herein, as well as any and all equivalents thereof.
This application claims priority to U.S. Provisional Patent Ser. No. 63/490,862, filed Mar. 17, 2023, the entirety of which is hereby incorporated herein by reference for all purposes.