Embodiments generally relate to storage devices. More particularly, embodiments relate to cluster-wide rebuild reduction against storage node failures.
Software Defined Storage (SDS) is a scale-out, shared-nothing architecture (e.g., nodes do not share resources) that presents a single namespace by aggregating storage resources from servers in a cluster. SDS may provide elastic scalability and resilience against hardware failures by distributing multiple copies of data among fault domains. When a storage node fails in the cluster, SDS triggers a cluster-wide rebalance to ensure that surviving nodes maintain the same number of data copies or erasure coded data fragments to meet resiliency goals. It typically takes several hours to complete the rebalance activity, wherein the rebalance time is directly proportional to the amount of capacity lost due to the storage node failure. Moreover, during the rebalancing activity, the cluster may be vulnerable to data loss if there are cascading failures.
The various advantages of the embodiments will become apparent to one skilled in the art by reading the following specification and appended claims, and by referencing the following drawings, in which:
Storage node capacity is typically maintained at low hundreds of terabytes (TB) in order to avoid long rebalance times and negative impacts to service level agreements (SLAs). Emerging workloads such as artificial intelligence (AI) workloads are driving significant data growth, which in turn creates a demand to increase storage node capacities to multi-petabyte (PB) levels. The technology described herein provides an enhancement that minimizes or avoids cluster-wide rebalancing activity and therefore enables larger (e.g., multi-petabyte) storage capacities per server.
More particularly, embodiments involve disaggregating storage nodes, coupled with storage software innovations, to eliminate or minimize rebuild times due to server failures. Storage devices and memory are disaggregated from storage servers using low-latency, high-bandwidth interfaces such as CXL (Compute Express Link, e.g., CXL Specification, Rev. 3.0, Aug. 1, 2022, Compute Express Link Consortium) interfaces/switches, while storage software discovers memory, device and switch topologies. Placement enhancements use this information to ensure that data is distributed among fault domains by comprehending the disaggregated memory and storage and avoiding single points of failure. Storage software, along with platform hot-plug functionality, may also be used to bootstrap storage devices from a failed server to a healthy server in the cluster. Conventional storage software bootstrapping techniques may be modified to re-use resident CXL in-memory metadata and cache, and to start instantaneously.
In the case of a storage drive failure, a separate unordered virtual channel with PCIe (Peripheral Component Interconnect Express, e.g., PCI Express® Base Specification 6.0, Version 1.0, Jan. 11, 2022, PCI Special Interest Group) source ordering technology is used to stream writes via, for example, PCIe non-transparent bridging (NTB) over a network (e.g., remote direct memory access/RDMA). Such an approach provides faster replication on surviving storage devices in the cluster to rehydrate lost data.
Turning now to
More particularly, a first storage server 12 manages/hosts data chunks (e.g., Chunks 1-3) that are resident/stored on a first solid state drive (SSD, e.g., non-volatile memory) 14, a second storage server 16 manages/hosts data chunks (e.g., Chunks 4-6) that are resident/stored on a second SSD 18, and an nth storage server 20 manages/hosts data chunks (e.g., Chunks 7-9) that are resident/stored on a third SSD 22. Rather than co-locating the SSDs 14, 18, 22 with the storage servers 12, 16, 20, respectively, the storage servers 12, 16, 20 are connected to the SSDs 14, 18, 22 via a first switch 24 and a second switch 26. In one example, the switches 24, 26 are low-latency, high-bandwidth interfaces such as CXL interfaces.
Thus, a storage daemon running inside the first storage server 12 may use a CXL memory buffer (not shown) attached to the switches 24, 26 to hold storage metadata, wherein the storage metadata includes details on how to map Chunks 1-3 to physical block sectors inside the first SSD 14. Similarly, the second storage server 16 may use a CXL memory buffer (not shown) attached to the switches 24, 26 to hold storage metadata that includes details on how to map Chunks 4-6 to physical block sectors inside the second SSD 18.
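By way of illustration only, the following Python sketch shows one possible form for such chunk-to-physical-block metadata; the names (e.g., Extent, ChunkMap, "ssd-14") are hypothetical and merely approximate what a storage daemon might keep in the switch-attached CXL memory buffer.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Extent:
    """A contiguous run of physical block sectors on an SSD."""
    device_id: str      # e.g., "ssd-14"
    start_lba: int      # first logical block address of the run
    num_sectors: int    # length of the run in sectors

@dataclass
class ChunkMap:
    """Per-server metadata held in the switch-attached CXL memory buffer.

    Maps each logical data chunk (e.g., "chunk-1") to the physical
    extents that back it, so a surviving server can take ownership of
    the chunks without rebuilding the mapping from drive contents.
    """
    entries: Dict[str, List[Extent]] = field(default_factory=dict)

    def add(self, chunk_id: str, extent: Extent) -> None:
        self.entries.setdefault(chunk_id, []).append(extent)

    def locate(self, chunk_id: str) -> List[Extent]:
        return self.entries[chunk_id]

# Example: the first storage server records where Chunks 1-2 live on the first SSD.
chunk_map = ChunkMap()
chunk_map.add("chunk-1", Extent(device_id="ssd-14", start_lba=0, num_sectors=8192))
chunk_map.add("chunk-2", Extent(device_id="ssd-14", start_lba=8192, num_sectors=8192))
```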
If a failure is detected in, for example, the first storage server 12, other servers in the cluster 10 such as, for example, the second storage server 16 may be automatically selected (e.g., based on topology data, one or more hash criteria, etc.) and configured (e.g., via hot-plug flow, storage daemon service, metadata, etc.) to host some or all of the data that is resident on the first SSD 14. Such an approach enables a cluster-wide rebalance of the storage cluster 10 to be bypassed/avoided.
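By way of illustration only, the following Python sketch shows one way the hash criteria might be realized with a consistent-hash ring restricted to surviving servers that still reach the failed server's devices; the names and parameters are hypothetical.

```python
import hashlib
from bisect import bisect_right
from typing import List

def _hash(key: str) -> int:
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

class HashRing:
    """Consistent-hash ring over candidate servers.

    Candidates are restricted to servers that still have switch
    connectivity to the failed server's devices, so chunk ownership can
    move without moving the data itself.
    """
    def __init__(self, servers: List[str], vnodes: int = 64) -> None:
        self._ring = sorted(
            (_hash(f"{s}#{v}"), s) for s in servers for v in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    def owner(self, chunk_id: str) -> str:
        idx = bisect_right(self._keys, _hash(chunk_id)) % len(self._ring)
        return self._ring[idx][1]

# Example: choose a new host for chunk-2 among the survivors on the same switches.
survivors = ["storage-server-16", "storage-server-20"]
print(HashRing(survivors).owner("chunk-2"))
```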
The illustrated architecture may also include a first volatile memory 28 (e.g., dynamic random access memory/DRAM) associated with the first SSD 14, a second volatile memory 30 associated with the second SSD 18, a third volatile memory 32 associated with the third SSD 22, and so forth. In such a case, the storage servers 12, 16, 20 may also be connected to the volatile memories 28, 30, 32 via the switches 24, 26. Accordingly, the second storage server 16 may be configured to host some or all of the data resident on the first volatile memory 28 in response to a failure in the first storage server 12 and without triggering a cluster-wide rebalance of the storage cluster 10.
In one example, each of the switches 24, 26 includes a fabric manager to build a cluster-wide topology, as will be discussed in greater detail. Additionally, each of the SSDs 14, 18, 22 and the volatile memories 28, 30, 32 may be connected to multiple switches 24, 26 to protect against failures in the switches 24, 26 themselves. The illustrated storage servers 12, 16, 20 also include memory (e.g., storing instructions to manage failures, support non-storage related input/output (IO) processing, etc.), central processing units (CPUs, e.g., host processors to execute the instructions in memory), and network interface cards (NICs, e.g., wired/wireless).
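By way of illustration only, the following Python sketch approximates the kind of cluster-wide topology that a fabric manager might assemble, along with a query for the servers that share a switch with a given device; the identifiers are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set

class ClusterTopology:
    """Cluster-wide topology as it might be assembled from per-switch
    fabric manager reports: which servers, SSDs and memory devices are
    reachable through which switch."""

    def __init__(self) -> None:
        self._by_switch: Dict[str, Set[str]] = defaultdict(set)

    def attach(self, switch_id: str, endpoint_id: str) -> None:
        self._by_switch[switch_id].add(endpoint_id)

    def servers_sharing_switch(self, device_id: str, servers: Set[str]) -> Set[str]:
        """Servers that can reach the device through at least one switch."""
        reachable: Set[str] = set()
        for endpoints in self._by_switch.values():
            if device_id in endpoints:
                reachable |= endpoints & servers
        return reachable

# Example: both switches see the first SSD, the first DRAM and all three servers.
topo = ClusterTopology()
for sw in ("switch-24", "switch-26"):
    for ep in ("server-12", "server-16", "server-20", "ssd-14", "dram-28"):
        topo.attach(sw, ep)

print(topo.servers_sharing_switch("ssd-14", {"server-16", "server-20"}))
```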
The illustrated architecture therefore ensures that surviving nodes maintain the same number of data copies to meet resiliency goals while avoiding the several hours that it typically takes to complete a rebalance activity. Moreover, the storage cluster 10 is less vulnerable to data loss if there are cascading failures.
Illustrated processing block 52 provides for detecting a first failure (e.g., CPU failure) in a first storage server, wherein the first storage server is connected to a first NVM (e.g., SSD) via a switch (e.g., CXL switch). Block 54 selects a second storage server that is connected to the first NVM via the switch, wherein the first storage server and the second storage server are in (e.g., part of) a shared storage cluster such as the storage cluster 10 (
In some architectures, the first storage server and the second storage server are also connected to a volatile memory (e.g., DRAM) via the switch. In such a case, block 56 may also configure the second storage server to host second data resident on the volatile memory in response to the first failure in the first storage server. While some examples may reference two storage servers, the examples may be readily expanded to N servers. The method 50 therefore enhances performance at least to the extent that using the switch to configure another storage server to host the first and/or second data ensures that surviving nodes maintain the same number of data copies to meet resiliency goals while avoiding the several hours that it typically takes to complete the rebalance activity. Moreover, bypassing the cluster-wide rebalance enables the storage cluster to be less vulnerable to data loss if there are cascading failures.
Illustrated processing block 62 detects a failed host, wherein block 64 initiates a re-connection (e.g., path selection) in response to the host failure. In an embodiment, block 66 uses cluster metadata 68 to select data chunks that belong to the failed host. Additionally, block 70 may locate/find servers that share a CXL switch with the failed host. In one example, block 72 identifies an appropriate server to host the affected data chunks using consistent hashing (e.g., one or more levels of indirection to achieve reachability, availability and load balancing objectives). Block 74 uses a hot-plug flow to attach a memory and/or SSD partition to the selected server (e.g., enumeration without reboot). In an embodiment, block 76 starts a storage daemon service, wherein block 78 reads CXL memory metadata 80. Block 82 may then announce that the selected server is ready to service clients.
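By way of illustration only, the following Python sketch strings blocks 62-82 together as a single failover routine; the collaborators (cluster_metadata, topology, hash_ring, hotplug, daemon, announce) are hypothetical placeholders for the services described above.

```python
def handle_host_failure(failed_host, cluster_metadata, topology, hash_ring,
                        hotplug, daemon, announce):
    """Sketch of the re-connection flow of blocks 62-82 (hypothetical APIs).

    No data is copied: the failed host's chunks are re-homed onto a
    surviving server that already shares a CXL switch with the failed
    host's SSD and CXL memory.
    """
    # Blocks 66/68: chunks owned by the failed host, from the cluster metadata.
    chunks = cluster_metadata.chunks_owned_by(failed_host)

    # Block 70: surviving servers that share a CXL switch with the failed host.
    candidates = topology.servers_sharing_switch_with(failed_host)

    # Block 72: pick the new owner via consistent hashing over the candidates.
    target = hash_ring.restricted_to(candidates).owner(failed_host)

    # Block 74: hot-plug the failed host's SSD partition and CXL memory
    # onto the selected server (enumeration without reboot).
    hotplug.attach(target, cluster_metadata.devices_of(failed_host))

    # Blocks 76-80: start the storage daemon and re-use the resident
    # CXL in-memory metadata rather than rebuilding it from the drive.
    daemon.start(target, metadata=daemon.read_cxl_metadata(target))

    # Block 82: announce that the selected server is ready to serve clients.
    announce(target, chunks)
```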
Therefore, the cluster monitoring service detects server failures and initiates a re-connection operation. Moreover, the cluster metadata 68 includes ownership of data chunks, wherein the cluster metadata 68 is used to identify a list of data chunks to be hosted on surviving nodes in the storage cluster. In order to avoid rebalancing, the cluster service selects only storage servers that have access to the storage and memory of the failed storage server. This approach is in stark contrast to traditional SDS implementations, which start moving data over the network for rebalancing operations. In the disaggregated implementation, there is no data movement when a storage server fails. The solution becomes a simple mounting operation of the storage media and CXL memory onto one of the surviving servers attached to the CXL switch. Moreover, since the CXL memory is intact (e.g., unlike traditional implementations), there is no need to rebuild in-memory metadata or cached data. In an embodiment, storage software manages the CXL memory exclusively using custom memory allocators to organize the memory, which may include direct access (dax) extensions and kernel managed memory. In one example, the CXL memory is organized in a tree hierarchy to locate all metadata buffers inside the memory. This approach enables near real-time fast failover as there is no reading of drive contents to rebuild the metadata. Additionally, any cached data from clients will still be intact and rehydrating the cache again (e.g., attaching event handlers) is avoided.
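By way of illustration only, the following Python sketch models the tree-organized CXL metadata with ordinary objects standing in for buffers at fixed offsets in a dax-mapped region; the names and offsets are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class MetaNode:
    """Node in the tree used to organize metadata buffers in CXL memory.

    In a real system the nodes would live at fixed offsets inside a
    dax-mapped CXL region; here plain Python objects stand in for them.
    """
    name: str
    buffer_offset: Optional[int] = None          # offset of the metadata buffer, if a leaf
    children: Dict[str, "MetaNode"] = field(default_factory=dict)

    def child(self, name: str) -> "MetaNode":
        return self.children.setdefault(name, MetaNode(name))

def locate(root: MetaNode, path: str) -> Optional[int]:
    """Walk the tree (e.g., 'pool/chunk-2/extent-map') to find a buffer
    without reading any drive contents."""
    node = root
    for part in path.split("/"):
        node = node.children.get(part)
        if node is None:
            return None
    return node.buffer_offset

# Example: register and then locate the extent map for chunk-2.
root = MetaNode("root")
root.child("pool").child("chunk-2").child("extent-map").buffer_offset = 0x4000
print(hex(locate(root, "pool/chunk-2/extent-map")))
```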
Turning now to
Illustrated processing block 102 provides for detecting a second failure in a second NVM, wherein the second NVM includes a redundant copy of the first data resident on the first NVM. Block 104 establishes a source-ordered virtual channel between the first NVM and a third NVM, wherein block 106 copies the first data from the first NVM to the third NVM over the source-ordered virtual channel via one or more unordered stream writes.
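By way of illustration only, the following Python sketch approximates the rehydration of blocks 102-106 at the software level: because the stream writes carry no ordering dependency on one another, they can be issued concurrently and confirmed once at the end. The read_fn/write_fn callables and other names are hypothetical stand-ins for the underlying NTB/RDMA transfer mechanism.

```python
from concurrent.futures import ThreadPoolExecutor, wait

def rehydrate(source_nvm, target_nvm, chunk_ids, read_fn, write_fn, streams=8):
    """Sketch of re-replication after a drive failure (hypothetical APIs).

    The surviving copy on source_nvm is streamed to target_nvm; the copies
    are issued concurrently (mirroring the unordered stream writes of
    blocks 104-106) and completion is confirmed only once at the end.
    """
    def copy_one(chunk_id):
        data = read_fn(source_nvm, chunk_id)
        write_fn(target_nvm, chunk_id, data)

    with ThreadPoolExecutor(max_workers=streams) as pool:
        futures = [pool.submit(copy_one, c) for c in chunk_ids]
        done, _ = wait(futures)
        for f in done:
            f.result()   # surface any copy error
```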
In one example, the logic 114 includes transistor channel regions that are positioned (e.g., embedded) within the substrate(s) 112. Thus, the interface between the logic 114 and the substrate(s) 112 may not be an abrupt junction. The logic 114 may also be considered to include an epitaxial layer that is grown on an initial wafer of the substrate(s) 112.
Example 1 includes a performance-enhanced computing system comprising a storage cluster including a plurality of storage servers, a plurality of switches coupled to the storage cluster, a plurality of non-volatile memories (NVMs) coupled to the plurality of switches, a processor, and a cluster service memory coupled to the processor, the cluster service memory including a set of instructions, which when executed by the processor, cause the processor to detect a first failure in a first storage server of the plurality of storage servers, wherein the first storage server is connected to a first NVM in the plurality of NVMs via a switch of the plurality of switches, select a second storage server of the plurality of storage servers, wherein the second storage server is connected to the first NVM via the switch, and configure the second storage server to host first data resident on the first NVM, wherein configuring the second storage server to host the first data bypasses a cluster-wide rebalance of the storage cluster.
Example 2 includes the computing system of Example 1, wherein the second storage server is selected based on topology data associated with the storage cluster and one or more hash criteria.
Example 3 includes the computing system of Example 1, wherein to configure the second storage server to host the first data, the instructions, when executed, further cause the processor to conduct a hot-plug flow with respect to the first NVM and the second storage server, initiate a storage daemon service, and read metadata from the switch.
Example 4 includes the computing system of Example 1, further including a volatile memory connected to the first storage server and the second storage server via the switch, wherein the instructions, when executed, further cause the processor to configure the second storage server to host second data resident on the volatile memory in response to the first failure in the first storage server.
Example 5 includes the computing system of any one of Examples 1 to 4, wherein the instructions, when executed, further cause the processor to detect a second failure in a second NVM, wherein the second NVM includes a redundant copy of the first data, establish a source-ordered virtual channel between the first NVM and a third NVM, and copy the first data from the first NVM to the third NVM over the source-ordered virtual channel via one or more unordered stream writes.
Example 6 includes at least one computer readable storage medium comprising a set of instructions, which when executed by a computing system, cause the computing system to detect a first failure in a first storage server, wherein the first storage server is connected to a first non-volatile memory (NVM) via a switch, select a second storage server that is connected to the first NVM via the switch, wherein the first storage server and the second storage server are in a storage cluster, and configure the second storage server to host first data resident on the first NVM, wherein configuring the second storage server to host the first data bypasses a cluster-wide rebalance of the storage cluster.
Example 7 includes the at least one computer readable storage medium of Example 6, wherein the second storage server is selected based on topology data associated with the storage cluster and one or more hash criteria.
Example 8 includes the at least one computer readable storage medium of Example 6, wherein to configure the second storage server to host the first data, the instructions, when executed, further cause the computing system to conduct a hot-plug flow with respect to the first NVM and the second storage server, initiate a storage daemon service, and read metadata from the switch.
Example 9 includes the at least one computer readable storage medium of Example 6, wherein the first storage server and the second storage server are connected to a volatile memory via the switch, and wherein the instructions, when executed, further cause the computing system to configure the second storage server to host second data resident on the volatile memory in response to the first failure in the first storage server.
Example 10 includes the at least one computer readable storage medium of any one of Examples 6 to 9, wherein the instructions, when executed, further cause the computing system to detect a second failure in a second NVM, wherein the second NVM includes a redundant copy of the first data, establish a source-ordered virtual channel between the first NVM and a third NVM, and copy the first data from the first NVM to the third NVM over the source-ordered virtual channel via one or more unordered stream writes.
Example 11 includes a semiconductor apparatus comprising one or more substrates, and logic coupled to the one or more substrates, wherein the logic is implemented at least partly in one or more of configurable or fixed-functionality hardware, the logic to detect a first failure in a first storage server, wherein the first storage server is connected to a first non-volatile memory (NVM) via a switch, select a second storage server that is connected to the first NVM via the switch, wherein the first storage server and the second storage server are in a storage cluster, and configure the second storage server to host first data resident on the first NVM, wherein configuring the second storage server to host the first data bypasses a cluster-wide rebalance of the storage cluster.
Example 12 includes the semiconductor apparatus of Example 11, wherein the second storage server is selected based on topology data associated with the storage cluster and one or more hash criteria.
Example 13 includes the semiconductor apparatus of Example 11, wherein to configure the second storage server to host the first data, the logic is further to conduct a hot-plug flow with respect to the first NVM and the second storage server, initiate a storage daemon service, and read metadata from the switch.
Example 14 includes the semiconductor apparatus of Example 11, wherein the first storage server and the second storage server are connected to a volatile memory via the switch, and wherein the logic is to configure the second storage server to host second data resident on the volatile memory in response to the first failure in the first storage server.
Example 15 includes the semiconductor apparatus of any one of Examples 11 to 14, wherein the logic is further to detect a second failure in a second NVM, wherein the second NVM includes a redundant copy of the first data, establish a source-ordered virtual channel between the first NVM and a third NVM, and copy the first data from the first NVM to the third NVM over the source-ordered virtual channel via one or more unordered stream writes.
Example 16 includes a method of operating a performance-enhanced computing system, the method comprising detecting a first failure in a first storage server, wherein the first storage server is connected to a first non-volatile memory (NVM) via a switch, selecting a second storage server that is connected to the first NVM via the switch, wherein the first storage server and the second storage server are in a storage cluster, and configuring the second storage server to host first data resident on the first NVM, wherein configuring the second storage server to host the first data bypasses a cluster-wide rebalance of the storage cluster.
Example 17 includes the method of Example 16, wherein the second storage server is selected based on topology data associated with the storage cluster and one or more hash criteria.
Example 18 includes the method of Example 16, wherein configuring the second storage server to host the first data includes conducting a hot-plug flow with respect to the first NVM and the second storage server, initiating a storage daemon service, and reading metadata from the switch.
Example 19 includes the method of Example 16, wherein the first storage server and the second storage server are connected to a volatile memory via the switch, the method further including configuring the second storage server to host second data resident on the volatile memory in response to the first failure in the first storage server.
Example 20 includes the method of any one of Examples 16 to 19, further including detecting a second failure in a second NVM, wherein the second NVM includes a redundant copy of the first data, establishing a source-ordered virtual channel between the first NVM and a third NVM, and copying the first data from the first NVM to the third NVM over the source-ordered virtual channel via one or more unordered stream writes.
Example 21 includes an apparatus comprising means for performing the method of any one of Examples 16 to 20.
Thus, unlike traditional shared nothing implementations, server failures (e.g., processor, host memory, network and/or motherboard failures) will not trigger cluster-wide rebalancing under the technology described herein. Storage software with disaggregated resources can fail over nearly instantaneously once the cluster monitoring service detects the node failure. CXL memory failures also will not trigger cluster-wide rebalancing; instead, storage software may reinitialize from pooled CXL memory. This approach is similar to a server hot-plug flow for memory, without needing to initialize from scratch. Additionally, a storage device failure may trigger only a non-cluster-wide (e.g., limited) rebalance activity. Due to low-level replication using NTB over the network (e.g., RDMA) and unordered streamed writes, this approach greatly reduces the amount of time needed to rebuild the failed drive data in the cluster.
Embodiments are applicable for use with all types of semiconductor integrated circuit (“IC”) chips. Examples of these IC chips include but are not limited to processors, controllers, chipset components, programmable logic arrays (PLAs), memory chips, network chips, systems on chip (SoCs), SSD/NAND controller ASICs, and the like. In addition, in some of the drawings, signal conductor lines are represented with lines. Some may be different, to indicate more constituent signal paths, have a number label, to indicate a number of constituent signal paths, and/or have arrows at one or more ends, to indicate primary information flow direction. This, however, should not be construed in a limiting manner. Rather, such added detail may be used in connection with one or more exemplary embodiments to facilitate easier understanding of a circuit. Any represented signal lines, whether or not having additional information, may actually comprise one or more signals that may travel in multiple directions and may be implemented with any suitable type of signal scheme, e.g., digital or analog lines implemented with differential pairs, optical fiber lines, and/or single-ended lines.
Example sizes/models/values/ranges may have been given, although embodiments are not limited to the same. As manufacturing techniques (e.g., photolithography) mature over time, it is expected that devices of smaller size could be manufactured. In addition, well known power/ground connections to IC chips and other components may or may not be shown within the figures, for simplicity of illustration and discussion, and so as not to obscure certain aspects of the embodiments. Further, arrangements may be shown in block diagram form in order to avoid obscuring embodiments, and also in view of the fact that specifics with respect to implementation of such block diagram arrangements are highly dependent upon the platform within which the embodiment is to be implemented, i.e., such specifics should be well within purview of one skilled in the art. Where specific details (e.g., circuits) are set forth in order to describe example embodiments, it should be apparent to one skilled in the art that embodiments can be practiced without, or with variation of, these specific details. The description is thus to be regarded as illustrative instead of limiting.
The term “coupled” may be used herein to refer to any type of relationship, direct or indirect, between the components in question, and may apply to electrical, mechanical, fluid, optical, electromagnetic, electromechanical or other connections. In addition, the terms “first”, “second”, etc. may be used herein only to facilitate discussion, and carry no particular temporal or chronological significance unless otherwise indicated.
As used in this application and in the claims, a list of items joined by the term "one or more of" may mean any combination of the listed terms. For example, the phrases "one or more of A, B or C" may mean A; B; C; A and B; A and C; B and C; or A, B and C.
Those skilled in the art will appreciate from the foregoing description that the broad techniques of the embodiments can be implemented in a variety of forms. Therefore, while the embodiments have been described in connection with particular examples thereof, the true scope of the embodiments should not be so limited since other modifications will become apparent to the skilled practitioner upon a study of the drawings, specification, and following claims.