Current data storage devices such as volatile and non-volatile memory often include a fault tolerance mechanism to ensure that data is not lost in the event of a device error or failure. An example of a fault tolerance mechanism provided to current data storage devices is a redundant array of independent disks (RAID). RAID is a storage technology that controls multiple memory modules and provides fault tolerance by storing data with redundancy. RAID technology may store data with redundancy in a variety of ways. Examples of redundant data storage include duplicating data and storing the data in multiple memory modules and adding parity to store calculated error recovery bits. The multiple memory modules, which may include the data and associated parity, are often accessed concurrently by multiple redundancy controllers.
Features of the present disclosure are illustrated by way of example and not limited in the following figure(s), in which like numerals indicate like elements, in which:
For simplicity and illustrative purposes, the present disclosure is described by referring mainly to an example thereof. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be readily apparent however, that the present disclosure may be practiced without limitation to these specific details. In other instances, some methods and structures have not been described in detail so as not to unnecessarily obscure the present disclosure. As used herein, the terms “a” and “an” are intended to denote at least one of a particular element, the term “includes” means includes but not limited to, the term “including” means including but not limited to, and the term “based on” means based at least in part on.
In addition, the following terms will be used throughout the remainder of the present disclosure:
Disclosed herein are examples of methods to serialize concurrent accesses by multiple redundancy controllers to fault tolerant memory. Fault tolerant memory, for instance, may include memory using redundant array of independent disks (RAID) technology. The disclosed examples provide stripe locking methods to eliminate computational hazards (e.g., race conditions) by sequentially executing, without overlap, those sequences of access primitives to the multiple fault-tolerant memory modules that require atomic access for correct operation.
According to an example, a lock may be requested, by a first redundancy controller, from a parity media controller. The lock may be requested to perform a first locked sequence that accesses multiple memory modules in a stripe. A stripe may include data stored in at least one data memory module and parity stored in at least one parity memory module. In other words, a stripe may include cachelines distributed across multiple modules which contain redundant information and must be atomically accessed to maintain the consistency of the redundant information, as discussed further below. In all examples presented here, a one-cacheline RAID block size is assumed, and all memory access primitives are assumed to operate at a one-cacheline granularity. It will be readily apparent however, that the present disclosure may be practiced without limitation to this specific RAID block size. A locked sequence may include any sequence of primitives that requires atomic access to multiple media controllers for correct operation. This may include a write sequence or an error correction sequence.
The lock for a stripe, for example, may be acquired by a redundancy controller so that a locked sequence may be performed on the stripe. To acquire the lock, the redundancy controller issues a lock request message to the media controller that stores the parity cacheline for the stripe. The media controller consults a stripe-specific lock flag associated with the parity cacheline to determine whether the stripe is already locked or unlocked. In response to the stripe already being locked, the lock request is added to a conflict queue. However, in response to the stripe being unlocked, the lock may be granted by the media controller to the requesting redundancy controller, by issuing a lock completion response to the redundancy controller. The lock request message and subsequent lock completion message form a lock primitive. A lock primitive is the first primitive of any locked sequence. The lock provides the first redundancy controller exclusive access to the stripe, which prevents a second redundancy controller from concurrently performing a second locked sequence.
The lock for the stripe, for example, may then be released after the first redundancy controller completes the first sequence that accesses multiple memory modules in the stripe. To release the lock, the redundancy controller issues an unlock request message to the media controller that stores the parity cacheline for the locked stripe. The media controller in turn releases the lock by clearing the lock flag, and then sending an unlock completion message to the redundancy controller. The unlock request message and subsequent unlock completion message form an unlock primitive. An unlock primitive is the last primitive of any locked sequence. Additionally, the lock may be terminated for the stripe after the media controller determines that the duration of the lock has exceeded a predetermined time threshold. Termination of a lock under these conditions is termed lock breaking. Unlike released locks, broken locks are not associated with any unlock primitive. When a lock is broken, the parity cacheline is poisoned (flagged as invalid). The affected stripe may subsequently be reconstructed by a redundancy controller using an error correction sequence.
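The release and lock-breaking behavior described above can be sketched as follows. This is a minimal illustration only: the class name `ParityLine`, the explicit `now` and `timeout` arguments, and the choice to reject an unlock that arrives after the lock was broken are all assumptions made for the example and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParityLine:
    """Hypothetical per-cacheline state kept by the parity media controller."""
    locked: bool = False            # stripe-specific lock flag
    owner: Optional[int] = None     # id of the redundancy controller holding the lock
    locked_at: float = 0.0          # time at which the lock was granted
    poisoned: bool = False          # True once the parity cacheline is flagged invalid

    def lock(self, owner: int, now: float) -> bool:
        """Grant the lock if the stripe is unlocked; return True on grant."""
        if self.locked:
            return False
        self.locked, self.owner, self.locked_at = True, owner, now
        return True

    def unlock(self, owner: int) -> bool:
        """Release the lock by clearing the flag, if `owner` holds it.

        Rejecting an unlock from a non-holder is an assumption of this sketch.
        """
        if not self.locked or self.owner != owner:
            return False
        self.locked, self.owner = False, None
        return True

    def break_if_expired(self, now: float, timeout: float) -> bool:
        """Break a lock held longer than `timeout` and poison the parity line."""
        if self.locked and now - self.locked_at > timeout:
            self.locked, self.owner = False, None
            self.poisoned = True    # stripe must later be reconstructed
            return True
        return False
```

In this sketch a broken lock leaves `poisoned` set, which stands in for the poisoned parity cacheline that a redundancy controller would later repair with an error correction sequence.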
Memory technologies such as fast, non-volatile memory enable high performance compute and storage servers. These servers require fault tolerance (e.g., using RAID) to be a viable solution. Next generation high performance compute and storage servers may use fast direct-mapped load/store model storage in lieu of slower Direct Memory Access (DMA) on PCI buses to mechanical and solid state drives. Direct mapped storage has low latency requirements because processor load/store accesses directly target this nonvolatile memory. Computer systems may consist of pools of compute servers and pools of shared non-volatile memory. Each server in the pool may concurrently access the fault tolerant memory. Accordingly, RAID write sequences and error correction sequences require atomic access to two or more of the memory modules. On a scalable fabric or bus, concurrent accesses to multiple memory modules are unordered and not inherently atomic. Consequently, data and parity may become inconsistent due to concurrent accesses to the stripe.
In other words, some sequences of primitives issued by a redundancy controller need to be executed atomically to avoid race conditions that could result in parity inconsistency. For example, a processor write command may result in a locked sequence consisting of lock, read, write and unlock primitives issued by a redundancy controller targeting data and parity cachelines on different media controllers. Similar traffic sequences may reach a media controller from multiple redundancy controllers (e.g., acting on behalf of multiple servers). To ensure atomicity of each sequence requiring it, this invention implements locking and unlocking primitives, used to serialize locked sequences of memory access primitives arriving at media controllers from multiple redundancy controllers.
Accordingly, the disclosed examples enable concurrent access to memory while ensuring parity-data consistency. When applied within the context of a direct-mapped load/store access model, a hardware-implemented redundancy controller can perform the required read and write sequences within a time constraint compatible with an outstanding read or write command issued by a processor. Generally speaking, the disclosed examples provide fault tolerance for high-performance and low-latency direct mapped storage. As such, a pool of memory/storage may be shared among multiple redundancy controllers acting on behalf of multiple servers that are attached to a common fabric or bus, while maintaining parity-data consistency. Moreover, the disclosed examples enable data sharing or aggregation between independent redundancy controllers, and allow memory to be hot-swapped in running systems without operating system (OS) intervention. Further, the disclosed examples do not have a single point of failure in systems with multiple redundancy controllers and media controllers.
With reference to
For example, the compute node 100 may include a processor 102, an input/output interface 106, a private memory 108, and a redundancy controller 110 (e.g., a RAID controller). In one example, the compute node 100 is a server but other types of compute nodes may be used. The compute node 100 may be a node of a distributed data storage system. For example, the compute node 100 may be part of a cluster of nodes that services queries and provides data storage for multiple users or systems, and the nodes may communicate with each other to service queries and store data. The cluster of nodes may provide data redundancy to prevent data loss and minimize down time in case of a node failure.
The processor 102 may be a microprocessor, a micro-controller, an application specific integrated circuit (ASIC), field programmable gate array (FPGA), or other type of circuit to perform various processing functions. The private memory 108 may include volatile dynamic random access memory (DRAM) with or without battery backup, non-volatile phase change random access memory (PCRAM), spin transfer torque-magnetoresistive random access memory (STT-MRAM), resistive random access memory (reRAM), memristor, FLASH, or other types of memory devices. For example, the memory may be solid state, persistent, dense, fast memory. Fast memory can be memory having an access time similar to DRAM memory. The I/O interface 106 may include a hardware and/or a software interface. The I/O interface 106 may be a network interface connected to a network, such as the Internet, a local area network, etc. The compute node 100 may receive data and user-input through the I/O interface 106. Where examples herein describe redundancy controller behavior occurring in response to read or write commands issued by the processor 102, this should not be taken restrictively. The examples are also applicable if such read or write commands are issued by an I/O device via the I/O interface 106.
The components of computing node 100 may be coupled by an interconnect fabric 105 (
The redundancy controller 110, for example, may generate certain sequences of primitives independently, not directly resulting from processor commands. These include sequences used for scrubbing, initializing, migrating, or error-correcting memory. The redundancy controller 110 is depicted as including a stripe locking module 112 and a read/write module 114. Blocks 112 and 114 are shown to illustrate the functionality of the redundancy controller 110. However, the functionality is implemented by hardware. The modules 112 and 114 for example are hardware of the redundancy controller 110, and the modules 112 and 114 may not be machine readable instructions executed by a general purpose computer. The stripe locking module 112, for example, may acquire and release a lock for a given stripe in memory. The read/write module 114, for example, may process read or write sequences to the memory.
With reference to
The multiple compute nodes 100A-N may be coupled to the memory modules 104A-M by the network interconnect module 140. The memory modules 104A-M may include media controllers 120A-M and memories 121A-M. Each media controller, for instance, may communicate with its associated memory and control access to the memory by the processor 102. The media controllers 120A-M provide access to regions of memory. The regions of memory are accessed by multiple redundancy controllers in the compute nodes 100A-N using access primitives such as read, write, lock, unlock, etc. In order to support aggregation or sharing of memory, media controllers 120A-M may be accessed by multiple redundancy controllers (e.g., acting on behalf of multiple servers). Thus, there is a many-to-many relationship between redundancy controllers and media controllers. The memory 121A-M may include volatile dynamic random access memory (DRAM) with battery backup, non-volatile phase change random access memory (PCRAM), spin transfer torque-magnetoresistive random access memory (STT-MRAM), resistive random access memory (reRAM), memristor, FLASH, or other types of memory devices. For example, the memory may be solid state, persistent, dense, fast memory. Fast memory can be memory having an access time similar to DRAM memory.
As described in the disclosed examples, the redundancy controller 110 may maintain fault tolerance across the memory modules 104A-M. The redundancy controller 110 may receive read or write commands from one or more processors 102, I/O devices, or other sources. In response to these, it generates sequences of primitive accesses to multiple media controllers 120A-M. The redundancy controller 110 may also generate certain sequences of primitives independently, not directly resulting from processor commands. These include sequences used for scrubbing, initializing, migrating, or error-correcting memory.
Stripe locks acquired and released by the stripe locking module 112 guarantee atomicity for locked sequences. Accordingly, the term “stripe lock” has been used throughout the text to describe these locks. For any given stripe, actual manipulation of the locks, including request queueing, lock ownership tracking, granting, releasing, and breaking, may be managed by the media controller that stores the parity cacheline for the stripe. Locking and unlocking is coordinated between the redundancy controllers and the relevant media controllers using lock and unlock primitives, which include lock and unlock request and completion messages. Media controllers 120A-M implement lock semantics on a per-cacheline address basis. Cachelines that represent stripe parity storage receive lock and unlock primitives from redundancy controllers, while those that represent data storage do not receive lock and unlock primitives. By associating locks with cacheline addresses, media controllers 120A-M may participate in the locking protocol without requiring explicit knowledge about the stripe layouts implemented by the redundancy controllers. Where the term “stripe lock” is used herein in the context of media controller operation, this should not be taken to imply any knowledge by the media controller of stripe layout. Media controllers 120A-M may distinguish locks from each other by address only, without regard to the stripe layout.
Referring to
According to this example, if memory module 1 fails, the data cachelines from memory module 2 may be combined with the corresponding-stripe parity cachelines from memory module 3 (using the Boolean XOR function) to reconstruct the missing cachelines. For instance, if memory module 1 fails, then stripe 1 may be reconstructed by performing an XOR function on data cacheline A2 and parity cacheline Ap to determine data cacheline A1. In addition, the other stripes may be reconstructed in a similar manner using the fault tolerant scheme of this example. In general, a cacheline on a single failed memory module may be reconstructed by XORing the corresponding-stripe cachelines on all of the surviving memory modules.
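The XOR reconstruction described above can be illustrated with a short sketch. The helper names `xor_bytes` and `reconstruct`, the four-byte cacheline size, and the sample byte values are assumptions chosen only for illustration.

```python
from functools import reduce
from typing import Sequence


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length cachelines."""
    if len(a) != len(b):
        raise ValueError("cachelines must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def reconstruct(surviving: Sequence[bytes]) -> bytes:
    """Rebuild the missing cacheline by XORing all surviving cachelines of the stripe."""
    return reduce(xor_bytes, surviving)


# Stripe 1: data cachelines A1 and A2, parity cacheline Ap = A1 ^ A2.
a1 = bytes([0x12, 0x34, 0x56, 0x78])
a2 = bytes([0xAB, 0xCD, 0xEF, 0x01])
ap = xor_bytes(a1, a2)

# Memory module 1 fails: A1 is recovered from A2 and Ap.
recovered = reconstruct([a2, ap])
```

Because XOR is its own inverse, the same `reconstruct` call recovers whichever single cacheline of the stripe is missing, data or parity.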
With reference to
Two redundancy controllers (redundancy controller #0 of a first server and redundancy controller #1 of a second server) and three media controllers are depicted in
In block 302, the redundancy controller #0 issues a sequence to write data D0′ to the first data memory module 303 in stripe 301. To perform the write sequence 302, the redundancy controller #0 may perform the following primitives. As shown in arcs 304 and 306, the redundancy controller #0 reads the old data from the first data memory module 303 and receives the old data D0 from the media controller #0. The redundancy controller #0 then reads the old parity from the parity memory module 305 and receives the old parity P from the media controller #1, as shown in arcs 308 and 310. At this point, the redundancy controller #0 writes the new data D0′ to the first data memory module 303 and in return receives a completion message from the media controller #0, as shown in arcs 312 and 314. Next the redundancy controller #0 calculates the new parity P′ as described below. Finally, the redundancy controller #0 writes the new parity P′ to the parity memory module 305 and in return receives a completion message from the media controller #1, as shown in arcs 316 and 318.
After the write sequence 302 is completed by the redundancy controller #0, the redundancy controller #1 issues a sequence to write data D1′ to the second memory module 307 in stripe 301, as shown in block 320. To perform the write sequence 320, the redundancy controller #1 may perform the following primitives. As shown in arcs 322 and 324, the redundancy controller #1 reads the old data from the second data memory module 307 and receives the old data D1 from the media controller #2. The redundancy controller #1 then reads the old parity from the parity memory module 305 and receives the old parity P′, which was written in arc 316 above, from the media controller #1, as shown in arcs 326 and 328. At this point, the redundancy controller #1 writes the new data D1′ to the second data memory module 307 and in return receives a completion message from the media controller #2, as shown in arcs 330 and 332. Next the redundancy controller #1 calculates the new parity P″. Finally, the redundancy controller #1 writes the new parity P″ to the parity memory module 305 and in return receives a completion message from the media controller #1, as shown in arcs 334 and 336.
The parity in this example may be calculated as follows, where ^ indicates an XOR operation:
P=D0^D1 (1)
P′=D0′^D1 (after write D0′) (2)
P″=D0′^D1′ (after write D1′) (3)
The redundancy controller #0 computes the new parity P′ in arc 316 as (4) P′=D0 ^ D0′ ^ P. Equation (1) may be rewritten as (5) D1=D0 ^ P and substituted into equation (4) to arrive at (6) P′=D0′ ^ D1, which matches equation (2). Additionally, redundancy controller #1 computes the new parity P″ as (7) P″=D1 ^ D1′ ^ P′. Equation (2) can be rewritten as (8) D0′=P′ ^ D1 and substituted into equation (7) to arrive at (9) P″=D0′ ^ D1′, which matches equation (3). Therefore, the parity is consistent with the data after the two consecutive writes in the example of
In block 402, the redundancy controller #0 issues a sequence to write data D0′ to the first data memory module 303 in stripe 301. To perform the write sequence 402, the redundancy controller #0 may perform the following primitives. As shown in arcs 404 and 406, the redundancy controller #0 reads the old data from the first data memory module 303 and receives the old data D0 from the media controller #0. The redundancy controller #0 then reads the old parity from the parity memory module 305 and receives the old parity P from the media controller #1, as shown in arcs 408 and 410. At this point, the redundancy controller #0 writes the new data D0′ to the first data memory module 303 and in return receives a completion message from the media controller #0, as shown in arcs 412 and 414. Next the redundancy controller #0 calculates a new parity. Finally, the redundancy controller #0 writes the new parity to the parity memory module 305 and in return receives a completion message from the media controller #1, as shown in arcs 416 and 418.
In this example, however, the redundancy controller #1 issues a sequence to write data D1′ to the second data memory module 307 in stripe 301 concurrently with the write sequence 402, as shown in block 420. To perform the write sequence 420, the redundancy controller #1 may perform the following primitives. As shown in arcs 422 and 424, the redundancy controller #1 reads the old data from the second data memory module 307 and receives the old data D1 from the media controller #2. The redundancy controller #1 then reads the old parity from the parity memory module 305 and receives the old parity P, which is the same parity received by redundancy controller #0 in arc 410 above, from the media controller #1, as shown in arcs 426 and 428. At this point, the redundancy controller #1 writes the new data D1′ to the second data memory module 307 and in return receives a completion message from the media controller #2, as shown in arcs 430 and 432. Next the redundancy controller #1 calculates a new parity. Finally, the redundancy controller #1 writes the new parity to the parity memory module 305 and in return receives a completion message from the media controller #1, as shown in arcs 434 and 436.
The parity in this example may be calculated as:
P=D0^D1. (1)
After the two concurrent writes 402 by the redundancy controller #0 and 420 by the redundancy controller #1, the expected parity is:
P″=D0′^D1′. (2)
However, after the two concurrent writes 402 and 420, the final parity (P_final) is incorrect because the final write is by redundancy controller #0, which writes (3) P_final=D0 ^ D0′^ P. Equation (1) may be rewritten as (4) D1=D0 ^ P and substituted into equation (3) to arrive at (5) P_final=D0′^ D1, which does not match the expected parity shown in equation (2). Therefore, it is hazardous to allow these write sequences 402, 420 to occur concurrently. In other words, since both redundancy controllers read and wrote to the parity memory module 305 concurrently, a race condition occurred and the parity was left inconsistent with respect to the data.
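The hazard can be reproduced with a small simulation. In this sketch, cachelines are modeled as small integers, the three memory modules as one dictionary, and each unlocked write sequence as a generator that pauses after every primitive. The function names, the sample values, and the `schedule` lists are hypothetical and serve only to make the interleaving explicit.

```python
def write_sequence(mem, key, new_value):
    """Unlocked RAID write sequence; each next() executes exactly one primitive."""
    old_data = mem[key]                            # read old data
    yield
    old_parity = mem["P"]                          # read old parity
    yield
    mem[key] = new_value                           # write new data
    yield
    mem["P"] = old_data ^ new_value ^ old_parity   # write new parity
    yield


def run(schedule, d0=0b1010, d1=0b0110, d0_new=0b1111, d1_new=0b0001):
    """Execute primitives of redundancy controllers 0 and 1 in the given order."""
    mem = {"D0": d0, "D1": d1, "P": d0 ^ d1}
    sequences = {
        0: write_sequence(mem, "D0", d0_new),
        1: write_sequence(mem, "D1", d1_new),
    }
    for controller in schedule:
        next(sequences[controller])
    return mem


def consistent(mem):
    """True when the stored parity equals the XOR of the stored data."""
    return mem["P"] == mem["D0"] ^ mem["D1"]


# Consecutive sequences: controller 0 finishes before controller 1 starts.
sequential = run([0, 0, 0, 0, 1, 1, 1, 1])

# Overlapping sequences: both read the same old parity P, and
# controller 0 performs the final parity write.
interleaved = run([0, 1, 0, 1, 0, 1, 1, 0])
```

With these sample values, `consistent(sequential)` holds, while the interleaved run ends with parity equal to D0′ ^ D1 even though the stored data is D0′ and D1′.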
In block 502, the redundancy controller #0 receives a command to write data D0′ to the first data memory module 303 in stripe 301. To perform the write sequence 502, the redundancy controller #0 may first request a lock from media controller #1, which hosts the parity cacheline, as shown in arc 504.
Since there is no single point of serialization with concurrent redundancy controllers #0 and #1, a point of serialization may be created at media controller #1 of the parity memory module 305. The point of serialization may be created at media controller #1 because any sequence that modifies the stripe 301 must communicate with memory module 305, because it is the memory module hosting the parity cacheline for the stripe 301. As a common resource accessed by both redundancy controllers #0 and #1 when accessing stripe 301, the media controller #1 of memory module 305 becomes the point of serialization for stripe 301.
According to an example, the lock may be an active queue inside the media controller #1. The active queue may include a stripe-specific flag or bit that indicates whether the stripe 301 is currently locked. That is, the media controller of the parity memory module may (i) keep track of all pending lock requests for a stripe, grant the lock requests one at a time so that each requester gets a turn in sequence to hold the lock for that stripe and (ii) perform this independently for unrelated stripes. In this regard, any subsequent lock requests from other redundancy controllers to the locked stripe are in conflict and may be added to a conflict queue for later granting when the current lock is released. As an example, each media controller may implement a first in, first out (FIFO) conflict queue for each cacheline address, or a similar algorithm to ensure that each sequence eventually acquires the stripe-lock and makes forward progress. Media controllers may associate locks with cacheline addresses, since multiple stripes storing their parity cachelines on the same memory module must store the locks at different cacheline addresses to keep them distinct. Media controllers can thus manage locks for stripes, without requiring any detailed knowledge of the layout of the stripes.
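A per-address lock table with FIFO conflict queues might be sketched as follows. The class `MediaControllerLocks`, its method names, the string requester identifiers, and the choice to hand the lock directly to the next queued requester on unlock are assumptions for illustration; the disclosure leaves the concrete data structure open.

```python
from collections import defaultdict, deque


class MediaControllerLocks:
    """Hypothetical lock table keyed by parity cacheline address."""

    def __init__(self):
        self.owner = {}                       # address -> current lock holder
        self.conflicts = defaultdict(deque)   # address -> FIFO of waiting requesters

    def lock_request(self, addr, requester):
        """Return True if the lock is granted now; otherwise queue the request."""
        if addr not in self.owner:
            self.owner[addr] = requester
            return True
        self.conflicts[addr].append(requester)
        return False

    def unlock_request(self, addr, requester):
        """Release the lock and return the next requester granted, or None."""
        if self.owner.get(addr) != requester:
            raise ValueError("unlock from a requester that does not hold the lock")
        queue = self.conflicts[addr]
        if queue:
            # Oldest conflicting request gets the lock next (first in, first out).
            self.owner[addr] = queue.popleft()
            return self.owner[addr]
        del self.owner[addr]
        return None
```

Because the table is keyed only by address, locks for unrelated stripes never interact, and the controller needs no knowledge of how stripes are laid out.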
In arc 506, media controller #1 has determined that the stripe 301 is not locked and grants a lock to redundancy controller #0. In response to acquiring the lock, the redundancy controller #0 may read the old data from the first data memory module 303 and receive the old data D0 from the media controller #0, as shown in arcs 508 and 510. The redundancy controller #0 then reads the old parity from the parity memory module 305 and receives the old parity P from the media controller #1, as shown in arcs 512 and 514. At this point, the redundancy controller #0 may write the new data D0′ to the first data memory module 303 and in return receive a write completion message from the media controller #0, as shown in arcs 516 and 518. Next the redundancy controller #0 may calculate the new parity P′. Finally, the redundancy controller #0 may write the new parity P′ to the parity memory module 305 and in return receive a write completion message from the media controller #1, as shown in arcs 520 and 522. After these primitives have been completed by the redundancy controller #0, the redundancy controller #0 may release the lock in the media controller #1 and in return receive an unlock completion message, as shown in arcs 524 and 526.
In this example, the redundancy controller #1 may concurrently issue a sequence to write data D1′ to the second data memory module 307 in stripe 301, as shown in block 528. The redundancy controller #1 may first request a lock from media controller #1, which hosts the parity cacheline, as shown in arc 530. However, the media controller #1 has determined that the stripe 301 is locked by redundancy controller #0. Therefore, the lock request by redundancy controller #1 may be placed into the conflict queue. The lock request may be removed from the conflict queue after the lock has been released by redundancy controller #0 as shown in arcs 524 and 526. Accordingly, once the media controller #1 has determined that the stripe 301 is not locked it may grant the lock to redundancy controller #1, as shown in arc 532.
In response to acquiring the lock, the redundancy controller #1 may read the old data from the second data memory module 307 and receive the old data D1 from the media controller #2, as shown in arcs 534 and 536. The redundancy controller #1 may then read the old parity from the parity memory module 305 and receive the old parity P′, which is the same parity written by the redundancy controller #0 in arc 520 above, from the media controller #1, as shown in arcs 538 and 540. At this point, the redundancy controller #1 may write the new data D1′ to the second data memory module 307 and in return receive a write completion message from the media controller #2, as shown in arcs 542 and 544. Next the redundancy controller #1 may calculate the new parity P″. Finally, the redundancy controller #1 may write the new parity P″ to the parity memory module 305 and in return receive a write completion message from the media controller #1, as shown in arcs 546 and 548. After these primitives have been completed by the redundancy controller #1, the redundancy controller #1 may release the lock in the media controller #1 and in return receive an unlock completion message, as shown in arcs 550 and 552.
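The two locked write sequences above can be approximated in software as follows. This is only an analogy: a `threading.Lock` stands in for the stripe lock held at the parity media controller, two threads stand in for the two redundancy controllers, and the names `Stripe` and `locked_write` are hypothetical. Unlike the FIFO conflict queue described earlier, `threading.Lock` does not guarantee any particular grant order.

```python
import threading


class Stripe:
    """Toy stripe: two data cachelines and one parity cacheline, as integers."""

    def __init__(self, d0, d1):
        self.cells = {"D0": d0, "D1": d1, "P": d0 ^ d1}
        # Stand-in for the stripe lock managed by the parity media controller.
        self.stripe_lock = threading.Lock()


def locked_write(stripe, key, new_value):
    """Locked sequence: lock, read data, read parity, write data, write parity, unlock."""
    with stripe.stripe_lock:                     # lock primitive on entry, unlock on exit
        old_data = stripe.cells[key]             # read old data
        old_parity = stripe.cells["P"]           # read old parity
        stripe.cells[key] = new_value            # write new data
        stripe.cells["P"] = old_data ^ new_value ^ old_parity  # write new parity


stripe = Stripe(d0=0x0F, d1=0x33)
writers = [
    threading.Thread(target=locked_write, args=(stripe, "D0", 0xA5)),
    threading.Thread(target=locked_write, args=(stripe, "D1", 0x5A)),
]
for t in writers:
    t.start()
for t in writers:
    t.join()
```

Whichever thread acquires the lock first, the second one reads the parity written by the first, so the final parity equals the XOR of the two new data values.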
The parity in this example may be calculated as:
P=D0^D1 (1)
P′=D0′^D1 (after write D0′) (2)
P″=D0′^D1′ (after write D1′) (3)
Equation (1) may be rewritten as (4) D1=D0 ^ P and substituted into equation (2) to arrive at (5) P′=D0′ ^ D0 ^ P. As a result, the new parity P′ may always be computed by reading the old data (D0), reading the old parity (P), and performing an XOR function on these values with the new data (D0′). Similarly, equation (2) may be rewritten as (6) D0′=D1 ^ P′ and substituted into equation (3) to arrive at (7) P″=D1′ ^ D1 ^ P′. Therefore, the parity is consistent with the data after the two concurrent write sequences by multiple redundancy controllers in the example of
As shown in arc 604, the redundancy controller #0 may attempt to read data from the first data memory module 303 of the stripe 301. However, the media controller #0 of the first data memory module 303 has returned an error message for data D0, as shown in arc 606. In this situation, the redundancy controller #0 may try to correct the read error by RAID-reconstructing data D0. To perform the error correction sequence, the redundancy controller #0 may first request a lock from media controller #1, which hosts the parity cacheline, as shown in arc 608.
In arc 610, the media controller #1 has determined that the stripe 301 is not locked and grants a lock to redundancy controller #0. Accordingly, redundancy controller #0 may read old data from the second data memory module 307 and receive the old data D1 from the media controller #2, as shown in arcs 612 and 614. The redundancy controller #0 may then read the parity from the parity memory module 305 and receive the parity P from the media controller #1, as shown in arcs 616 and 618. The redundancy controller #0 may then calculate the corrected data D0. The redundancy controller #0 may write the corrected data D0 (i.e., reconstructed data D0) to the data memory module 303 and in return receive a write completion message from the media controller #0, as shown in arcs 620 and 622. According to an example, the corrected data D0 may be reconstructed by performing an XOR function on the parity P received from the media controller #1 and the data D1 received from the media controller #2. Once the read error correction sequence has been completed, the redundancy controller #0 may release the lock from the stripe 301 and in return receive an unlock completion message from the media controller #1, as shown in arcs 624 and 626.
In this example, the redundancy controller #1 concurrently issues a sequence to write data D1′ to the second data memory module 307 in stripe 301, as shown in block 628. The redundancy controller #1 may first request a lock from media controller #1, which hosts the parity cacheline, as shown in arc 630. However, the media controller #1 has determined that the stripe 301 is locked by redundancy controller #0. Therefore, the lock request by redundancy controller #1 may be placed into a conflict queue. The lock request may be removed from the conflict queue after the lock has been released by redundancy controller #0 as shown in arc 626. Accordingly, once the media controller #1 has determined that the stripe 301 is not locked it may grant the lock to redundancy controller #1, as shown in arc 632.
In response to acquiring the lock, the redundancy controller #1 may read the old data from the second data memory module 307 and receive the old data D1 from the media controller #2, as shown in arcs 634 and 636. The redundancy controller #1 may then read the old parity from the parity memory module 305 and receive the old parity P from the media controller #1, as shown in arcs 638 and 640. At this point, the redundancy controller #1 may write the new data D1′ to the second data memory module 307 and in return receive a write completion message from the media controller #2, as shown in arcs 642 and 644. Next the redundancy controller #1 may calculate a new parity P′. Finally, the redundancy controller #1 may write the new parity P′ to the parity memory module 305 and in return receive a write completion message from the media controller #1, as shown in arcs 646 and 648. After these primitives have been completed by the redundancy controller #1, the redundancy controller #1 may release the lock in the media controller #1 and in return receive an unlock completion message, as shown in arcs 650 and 652.
The parity in this example may be calculated as:
P=D0^D1 (1)
P′=D0^D1′ (after D1′ Write) (2)
Equation (1) may be rewritten as D0=P^D1 (3). Therefore, when a read error is observed on data D0, D0 may be regenerated by reading D1, reading P, and performing an XOR function on the returned values. Accordingly, the parity is consistent with the data after the concurrent read error correction sequence and the write sequence by the multiple redundancy controllers in the example of
As shown in block 702, the redundancy controller #1 may issue a sequence to write data D1′ to the second data memory module 307 in stripe 301. The redundancy controller #1 may first request a lock from media controller #1, which hosts the parity cacheline, as shown in arc 704. In block 706, the media controller #1 has determined that the stripe 301 is not locked and grants a lock to redundancy controller #1.
In response to acquiring the lock, the redundancy controller #1 may read the old data from the second data memory module 307 and receive the old data D1 from the media controller #2, as shown in arcs 708 and 710. The redundancy controller #1 may then read the old parity from the parity memory module 305 and receive the old parity P from the media controller #1, as shown in arcs 712 and 714. At this point, the redundancy controller #1 may write the new data D1′ to the second data memory module 307 and in return receive a write completion message from the media controller #2, as shown in arcs 716 and 718. Next the redundancy controller #1 may calculate a new parity P′. Finally, the redundancy controller #1 may write the new parity P′ to the parity memory module 305 and in return receive a write completion message from the media controller #1, as shown in arcs 720 and 722. After these primitives have been completed by the redundancy controller #1, the redundancy controller #1 may release the lock in the media controller #1 and in return receive an unlock completion message, as shown in arcs 724 and 726.
As shown in block 728, the redundancy controller #0 may encounter a read error while attempting to access data D0 from the first data memory module 303. As shown in arc 730, the redundancy controller #0 may attempt to read data from the first data memory module 303 of the stripe 301. However, the media controller #0 of the first data memory module 303 may return an error message for data D0, as shown in arc 732. In this situation, the redundancy controller #0 may try to correct the read error by RAID-reconstructing data D0. To perform the error correction sequence, the redundancy controller #0 may first request a lock from media controller #1, which hosts the parity cacheline, as shown in arc 734.
However, the media controller #1 may determine that the stripe 301 is locked by redundancy controller #1. Therefore, the lock request by redundancy controller #0 may be placed into a conflict queue. The lock request may be removed from the conflict queue after the lock has been released by redundancy controller #1 as shown in arcs 724 and 726. Accordingly, once the media controller #1 has determined that the stripe 301 is not locked it may grant the lock to redundancy controller #0, as shown in arc 736.
In response to acquiring the lock, the redundancy controller #0 may read data from the second data memory module 307 and receive the data D1′ from the media controller #2, as shown in arcs 738 and 740. The redundancy controller #0 may then read the parity from the parity memory module 305 and receive the parity P′ from the media controller #1, as shown in arcs 742 and 744. The redundancy controller #0 may calculate the corrected data D0, and then write the corrected data D0 (i.e., reconstructed data D0) to the first data memory module 303 and in return receive a write completion message from the media controller #0, as shown in arcs 746 and 748. According to an example, the corrected data D0 may be reconstructed by performing an XOR function on the parity P′ received from the media controller #1 and the data D1′ received from the media controller #2. Once the read error correction sequence has been completed, the redundancy controller #0 may release the lock from the stripe 301 and in return receive an unlock completion message from the media controller #1, as shown in arcs 750 and 752.
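The reconstruction step of this error correction sequence may be illustrated with a short sketch. The function name `reconstruct` and the sample values are hypothetical. The example in the description uses a single surviving data module, and the generalization to several surviving modules shown here is an assumption that follows from XOR parity.

```python
from functools import reduce
from operator import xor


def reconstruct(surviving_values, parity):
    """Rebuild the unreadable data value by XOR-ing the parity with every surviving data value."""
    return reduce(xor, surviving_values, parity)


# Illustrative values only: D0 is the value that became unreadable.
d0, d1_new = 0x5A, 0x3C
p_new = d0 ^ d1_new                      # parity P' stored after the D1' write
recovered_d0 = reconstruct([d1_new], p_new)
```

Because the lock is held for the whole sequence, the values of D1′ and P′ read here belong to the same consistent state of the stripe.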
The parity in this example may be calculated as:
P=D0^D1 (1)
P′=D0^D1′ (after D1′ Write) (2)
Equation (2) may be rewritten as D0=P′^D1′ (3). Therefore, when a read error is observed on data D0, D0 may be regenerated by reading D1′, reading P′, and performing an XOR function on the returned values. Accordingly, the parity is consistent with the data after the concurrent write sequence and the read error correction sequence by the multiple redundancy controllers in the example of
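To make the hazard concrete, the following sketch compares the serialized ordering described above with one hypothetical unlocked interleaving, in which the correcting controller reads the old data D1 before the write and the new parity P′ after it. The function names and sample values are illustrative assumptions and do not appear in the disclosure.

```python
def serialized(d0: int, d1: int, d1_new: int) -> int:
    """The write completes under the lock before the correction sequence starts."""
    p_new = d0 ^ d1_new          # parity P' after the D1' write
    return p_new ^ d1_new        # correction reads D1' and P'


def interleaved(d0: int, d1: int, d1_new: int) -> int:
    """Hypothetical unlocked ordering that mixes old data with new parity."""
    stale_d1 = d1                # correcting controller reads old D1 first
    p_new = d0 ^ d1_new          # writing controller then stores D1' and P'
    return p_new ^ stale_d1      # correction combines P' with the stale D1


d0, d1, d1_new = 0b1100, 0b1010, 0b0101
good = serialized(d0, d1, d1_new)
bad = interleaved(d0, d1, d1_new)
```

In the serialized case the regenerated value equals D0, while in the interleaved case it does not, which is the kind of race condition the stripe lock is intended to prevent.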
With reference to
Also, other types of primitives are applicable in the context of this invention. For example, the performance of a critical sequence can be improved by combining primitives. For example, referring to
With reference to
In block 810, the stripe locking module 112 of a first redundancy controller, for instance, may request a lock from the media controller on which the stripe's parity is stored to perform a first sequence that accesses multiple memory modules in a stripe. A stripe may include data stored in at least one data memory module and parity stored in at least one parity module. As discussed above, since there is typically no single point of serialization for multiple redundancy controllers that access a single stripe, a point of serialization may be created at the parity media controller. The point of serialization may be created at the parity media controller because any sequence that modifies the stripe has to access the parity.
In block 820, the stripe locking module 112, for instance, may acquire the lock for the stripe. The lock may provide the first redundancy controller with exclusive access to the stripe. For instance, the lock prevents a second redundancy controller from concurrently performing a second sequence that accesses multiple memory modules in the stripe during the locked sequence. The method for acquiring the lock is discussed further below with reference to
In block 830, the read/write module 114 may perform the first sequence on the stripe. The first sequence may include a sequence that would be hazardous if not atomic, such as one that modifies memory in the stripe, or accesses multiple cachelines within the stripe. A read sequence not requiring error correction, however, is not hazardous because it only accesses a single memory module and so is inherently atomic, and it does not modify the stored value.
In block 840, the stripe locking module 112, for instance, may release the lock for the stripe. In one example, the lock may be removed once the first redundancy controller has completed the first sequence that modifies the multiple memory modules in the stripe.
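The request, acquire, perform, and release steps of blocks 810 through 840 may be sketched as a scoped lock. This is a simplified single-process analogy: `ParityMediaController`, `stripe_lock`, and `write_d1` are hypothetical names, Python's `threading.Lock` stands in for the lock held at the parity media controller, and threads stand in for redundancy controllers. Unlike a conflict queue, `threading.Lock` does not guarantee any particular grant order.

```python
import threading
from contextlib import contextmanager


class ParityMediaController:
    """Hypothetical stand-in that keeps one lock per stripe."""

    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()

    def _lock_for(self, stripe_id):
        with self._guard:
            return self._locks.setdefault(stripe_id, threading.Lock())

    @contextmanager
    def stripe_lock(self, stripe_id):
        lock = self._lock_for(stripe_id)
        lock.acquire()           # blocks 810 and 820: request and acquire the lock
        try:
            yield                # block 830: the caller performs its sequence
        finally:
            lock.release()       # block 840: release the lock


stripe = {"D0": 1, "D1": 2, "P": 1 ^ 2}
controller = ParityMediaController()


def write_d1(new_value):
    """Read-modify-write of D1 and P, executed atomically under the stripe lock."""
    with controller.stripe_lock(301):
        old_data, old_parity = stripe["D1"], stripe["P"]
        stripe["D1"] = new_value
        stripe["P"] = old_parity ^ old_data ^ new_value


threads = [threading.Thread(target=write_d1, args=(v,)) for v in (7, 9, 11)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Releasing the lock in a `finally` clause mirrors the requirement that the lock be released once the sequence ends, including when the sequence raises an error.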
With reference to
In block 910, a new or queued lock request from a redundancy controller may be processed by the media controller to determine if a stripe is locked. If it is determined that the stripe is currently locked at block 920, the media controller may add the lock request to a stripe-specific conflict queue as shown in block 930. The queued lock request will remain queued at block 935 until the stripe is unlocked. After the stripe is unlocked, the lock request may be removed from the conflict queue, as shown in block 938. Accordingly, the lock request may be granted the lock when the current lock is released, as shown in block 940 and discussed further below.
If it is determined that the stripe is not locked at block 920, the media controller may immediately grant a lock for the stripe, as shown in block 940. At this point, the media controller may wait to receive a corresponding stripe unlock request or a timer expiration, as shown in block 950. In block 960, the media controller determines whether a stripe unlock request has been received.
In response to receiving a stripe unlock request corresponding to the locked stripe, the media controller may release the lock, as shown in block 990. On the other hand, when no corresponding stripe unlock request has been received, the media controller may check a lock timer to determine whether the duration of the lock has exceeded a predetermined time threshold, and if so, infer that the lock must be held by a redundancy controller that has failed, as shown in block 970. If the lock is determined to be expired at block 972, the corresponding parity cacheline is poisoned or flagged as invalid as shown in block 980. Accordingly, the lock is released as shown in block 982 and the poisoned parity of the stripe may subsequently be reconstructed by a redundancy controller using an error correction sequence. Alternatively, if the lock is determined not to be expired at block 972, the media controller may wait to receive a stripe unlock request corresponding to the locked stripe or a timer expiration, as shown in block 950.
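The lock handling of blocks 910 through 990 may be modeled as a small state machine. The sketch below is illustrative only: the `StripeLock` class, its method names, the first-in-first-out ordering of the conflict queue, and the explicit `now` timestamps are all assumptions made for the example, and clearing the poison flag by a later error correction sequence is not modeled.

```python
from collections import deque


class StripeLock:
    """Hypothetical model of one stripe's lock state in the parity media controller."""

    def __init__(self, timeout):
        self.timeout = timeout           # assumed predetermined time threshold
        self.owner = None
        self.granted_at = None
        self.conflict_queue = deque()    # assumed first-in-first-out ordering
        self.parity_poisoned = False

    def request(self, requester, now):
        """Blocks 910 to 940: grant at once if unlocked, otherwise queue the request."""
        if self.owner is None:
            self._grant(requester, now)
            return True
        self.conflict_queue.append(requester)
        return False

    def unlock(self, requester, now):
        """Blocks 960 and 990: release on an unlock request from the lock holder."""
        if requester != self.owner:
            raise ValueError("unlock request does not match the lock holder")
        self._release(now)

    def tick(self, now):
        """Blocks 970 to 982: poison the parity and release a lock held past the threshold."""
        if self.owner is not None and now - self.granted_at > self.timeout:
            self.parity_poisoned = True
            self._release(now)

    def _grant(self, requester, now):
        self.owner, self.granted_at = requester, now

    def _release(self, now):
        self.owner = self.granted_at = None
        if self.conflict_queue:          # block 938: dequeue the next waiting request
            self._grant(self.conflict_queue.popleft(), now)


lock = StripeLock(timeout=5)
first = lock.request("RC0", now=0)       # granted immediately
second = lock.request("RC1", now=1)      # queued behind RC0
lock.unlock("RC0", now=2)                # RC1 is granted the lock
owner_after_unlock = lock.owner
lock.tick(now=10)                        # RC1 never unlocks, so the lock expires
```

In this run the second requester is granted the lock only after the first releases it, and its failure to unlock within the threshold leaves the parity flagged as poisoned with the stripe unlocked.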
What has been described and illustrated herein are examples of the disclosure along with some variations. The terms, descriptions and figures used herein are set forth by way of illustration only and are not meant as limitations. Many variations are possible within the scope of the disclosure, which is intended to be defined by the following claims—and their equivalents—in which all terms are meant in their broadest reasonable sense unless otherwise indicated.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2014/053704 | 9/2/2014 | WO | 00 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2016/036347 | 3/10/2016 | WO | A |
Number | Date | Country | |
---|---|---|---|
20170185343 A1 | Jun 2017 | US |