This disclosure generally relates to information handling systems, and more particularly relates to improving parity redundant array of independent drives (RAID) write latency in non-volatile memory express (NVMe) devices.
As the value and use of information continues to increase, individuals and businesses seek additional ways to process and store information. One option is an information handling system. An information handling system generally processes, compiles, stores, and/or communicates information or data for business, personal, or other purposes. Because technology and information handling needs and requirements may vary between different applications, information handling systems may also vary regarding what information is handled, how the information is handled, how much information is processed, stored, or communicated, and how quickly and efficiently the information may be processed, stored, or communicated. The variations in information handling systems allow for information handling systems to be general or configured for a specific user or specific use such as financial transaction processing, reservations, enterprise data storage, or global communications. In addition, information handling systems may include a variety of hardware and software resources that may be configured to process, store, and communicate information and may include one or more computer systems, data storage systems, and networking systems.
An information handling system includes a host configured to write a non-volatile memory express (NVMe) command, and a plurality of NVMe devices configured as a RAID array. Each of the NVMe devices may use internal hardware resources to perform offload operations of the NVMe command.
It will be appreciated that for simplicity and clarity of illustration, elements illustrated in the Figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements are exaggerated relative to other elements. Embodiments incorporating teachings of the present disclosure are shown and described with respect to the drawings presented herein, in which:
The use of the same reference symbols in different drawings indicates similar or identical items.
The following description in combination with the Figures is provided to assist in understanding the teachings disclosed herein. The following discussion will focus on specific implementations and embodiments of the teachings. This focus is provided to assist in describing the teachings, and should not be interpreted as a limitation on the scope or applicability of the teachings.
In an embodiment, the host 100 includes a system memory 102 that further includes an application program 103 executing within an operating system (OS) 104. The host 100 includes one or more CPUs 105 that are coupled to the system memory 102 in which the application program 103 and the operating system 104 have been stored for execution by the CPU(s). A chip set 106 may further provide one or more input/output (I/O) interfaces to couple external devices to the host 100.
The host 100 may generate I/O transactions 110 targeting a coupled storage subsystem 120 that includes a virtual hard drive (VHD) 122. The host 100 further employs a storage cache device 130 that is configured to cache the I/O transactions 110. The storage cache device 130 is analogous to an L1 data cache employed by the CPU. The storage cache device 130 includes one or more cache storage devices 132 and cache metadata 134 that is maintained by a storage cache module in OS 104. The host 100 enables and supports the storage cache device 130 with the storage cache module in the OS 104.
At the storage subsystem 120, a storage controller 124 may map the VHD 122 to a RAID array 140. In an embodiment, the storage controller 124 includes a RAID controller 126 that may be configured to control multiple NVMe devices 142-146 that make up the RAID array 140. The number of NVMe devices presented is for ease of illustration and different numbers of NVMe devices may be utilized in the RAID array 140. The NVMe devices may be independent solid state data storage drives (SSD) that may be accessed through a peripheral component interconnect express (PCIe) bus 150.
In an embodiment, the host 100 is configured to write an NVMe command. The NVMe command may be directed to the storage controller 124 and the RAID array 140. In this embodiment, the NVMe command may include features to improve parity RAID write latency in the information handling system.
As an overview of the read-modify-write process, the host 100 sends a write command including a new data write (D′) that replaces a value of data (D) in the data drive. Based on this write command, the RAID controller sends a first XOR offload instruction that is implemented by the data drive. The data drive performs the first XOR offload instruction and stores the result to a buffer memory. The data drive may further fetch the D′ to update its stored data. To update parity, the RAID controller, which is aware of the stored result's location, sends a second XOR offload instruction to the servicing parity drive. The parity drive performs the second XOR offload instruction between the result stored in the memory buffer and the parity data to generate new parity data (P′). The parity drive is then updated with the new parity value, and the RAID controller may send a write completion to the host.
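The parity arithmetic underlying this sequence can be illustrated with a short sketch. The buffer sizes, values, and function names below are illustrative only; the sketch simply shows that the partial result D XOR D′, combined with the old parity P, yields a new parity P′ that keeps the stripe consistent (P′ = P XOR D XOR D′).

#include <stdint.h>
#include <stdio.h>

#define STRIP_BYTES 8  /* toy strip size for illustration */

/* XOR two equal-length buffers into out. */
static void xor_buffers(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t len)
{
    for (size_t i = 0; i < len; i++)
        out[i] = a[i] ^ b[i];
}

int main(void)
{
    uint8_t d[STRIP_BYTES]     = {1, 2, 3, 4, 5, 6, 7, 8};  /* old data D on the data drive     */
    uint8_t d_new[STRIP_BYTES] = {9, 9, 9, 9, 9, 9, 9, 9};  /* new data D' from the host        */
    uint8_t p[STRIP_BYTES]     = {1, 3, 1, 7, 1, 5, 1, 9};  /* old parity P on the parity drive */
    uint8_t partial[STRIP_BYTES], p_new[STRIP_BYTES];

    /* Step 1 (data drive): partial = D XOR D', held in a temporary buffer. */
    xor_buffers(d, d_new, partial, STRIP_BYTES);

    /* Step 2 (parity drive): P' = partial XOR P, which equals P XOR D XOR D'. */
    xor_buffers(partial, p, p_new, STRIP_BYTES);

    /* Sanity check: P' XOR D' equals P XOR D for every byte, so the stripe stays consistent. */
    for (size_t i = 0; i < STRIP_BYTES; i++)
        printf("%u %u\n", (unsigned)(p_new[i] ^ d_new[i]), (unsigned)(p[i] ^ d[i]));
    return 0;
}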
In an embodiment, an NVMe command 200 includes a request to write the new data, which is represented by a D′ 202. For example, the D′ 202 is used to overwrite the value of data (D) 204 in the NVMe device 142 of the RAID array 140. As a consequence, parity data (P) 206 in the NVMe device 146 needs to be recalculated and re-written to the same parity drive. In this embodiment, the D 204 and the P 206 may belong to the same RAID stripe. Furthermore, for this RAID stripe, the NVMe device 142 is configured to store data while the NVMe device 146 is configured to store the parity data. The NVMe devices 142 and 146 may be referred to as the data drive and parity drive, respectively.
To implement the new data write and the parity update in the NVMe devices 142 and 146, respectively, a first XOR offloading operation 208 and a second XOR offloading operation 210 are performed by the corresponding NVMe devices on the RAID array. The RAID array, for example, is configured as a RAID 5 that uses disk striping with parity. Other RAID configurations such as RAID 6 may utilize the XOR offloading and updating operations as described herein.
In an embodiment, the first XOR offloading operation 208 may involve participation of the RAID controller 126 and the NVMe device 142. The first XOR offloading operation includes a sending of a first XOR offload instruction 212 by the RAID controller to the NVMe device 142, an XOR calculation 214 performed by the NVMe device 142, a sending of a first XOR command completion 216 by the NVMe device 142 to the RAID controller, a writing 218 of the D′ by the RAID controller to the NVMe device 142, and a sending of a new data write completion status 220 by the NVMe device 142 to the RAID controller to complete the first XOR_AND_UPDATE offload instruction. In this embodiment, the first XOR command completion 216 may not be sent and the NVMe device 142 may fetch the D′ to replace the D 204. After the D 204 is overwritten, the first XOR offload instruction 212 is completed and the NVMe device may send the write completion status 220. Further in this embodiment, a result of the XOR calculation between the D and the D′ is stored in a controller memory buffer (CMB) storage 222.
The CMB storage may include a persistent memory or a volatile memory. That is, the XOR offloading operation is not dependent on persistency of intermediate data, and the RAID controller may decide on the usage of the intermediate data and the longevity of the data in the CMB storage. In an embodiment, the CMB storage may include PCIe base address registers (BARs) or regions within a BAR that can be used to store either generic intermediate data or data associated with the NVMe block command. The BARs may be used to hold memory addresses used by the NVMe device. In this embodiment, each NVMe device in the RAID array may include a CMB storage that facilitates access to data during the offload operations to ease the calculation burden of the RAID controller, to minimize use of RAID controller DRAMs, and to open the RAID controller interfaces to other data movement or bus utilization.
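As a conceptual sketch only, the snippet below shows how a CMB range might be addressed once the CMB has been located within one of the device's BARs (for example, by parsing the controller's CMBLOC and CMBSZ registers). The structure and helper function are assumptions made for illustration, not an actual driver interface.

#include <stdint.h>

struct cmb_region {
    volatile uint8_t *bar_base;  /* mapped BAR that backs the CMB         */
    uint64_t          offset;    /* byte offset of the CMB within the BAR */
    uint64_t          size;      /* usable CMB size in bytes              */
};

/* Return a pointer to a range inside the CMB, or a null pointer if the
   requested range falls outside the CMB. */
static volatile uint8_t *cmb_range(const struct cmb_region *cmb,
                                   uint64_t range_off, uint64_t range_len)
{
    if (range_off > cmb->size || range_len > cmb->size - range_off)
        return 0;
    return cmb->bar_base + cmb->offset + range_off;
}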
In an embodiment, the second XOR offloading operation 210 may involve participation of the RAID controller, the NVMe device 146 as the parity drive, and the NVMe device 142 as the peer drive that stored the previous partial XOR operation results in a temporary buffer such as the CMB storage 222. The second XOR offloading operation includes a sending of a second XOR offload instruction 224 by the RAID controller 126 to the NVMe device 146, a requesting of a CMB memory range 226 by the NVMe device 146 from the peer NVMe device 142, a reading 228 by the NVMe device 146 of the stored results including the requested CMB memory range from the CMB storage, an XOR calculation 230 by the NVMe device 146, and a sending of a second XOR offloading command completion 232 by the NVMe device 146 to the RAID controller. In this embodiment, the XOR calculation is performed on the read CMB memory range stored in the CMB storage 222 and the parity data 206 to generate a new parity data (P′) 234. The P′ 234 is then stored into the NVMe device 146 to replace the P 206. Afterward, the RAID controller may send a write completion 236 to the host 100 to complete the write of the D′ 202.
In an embodiment, the first XOR offload instruction 212 and the second XOR offload instruction 224 may take the form:
XOR_AND_UPDATE(Input1,Input2,Output)
where the XOR_AND_UPDATE indicates the NVMe command or instruction for the servicing NVMe device to perform a partial XOR offload operation and to perform the update after completion of the partial XOR calculation. The action of the XOR command may be performed to update a logical block address (LBA) location, such as when the D′ is written to the data drive or when the P′ is written to update the parity drive. The action of the XOR command may also result in holding the resultant buffer in the temporary memory buffer such as the CMB storage. In this embodiment, the two inputs Input1 and Input2 of the XOR_AND_UPDATE command may be taken from at least two of the following: from an LBA range (starting LBA and number of logical blocks) on the NVMe drive that is servicing the command; from an LBA range on a peer drive, where the peer drive is identified with its BDF (Bus, Device, Function); or from a memory address range. The output parameter Output of the XOR_AND_UPDATE command may include the LBA range of the NVMe drive that is servicing the command, or the memory address range with the same possibilities as the input parameters. The memory address range may allow addressing of the CMB storage on the local drive, the CMB storage on a remote drive, the host memory, and/or remote memory.
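For illustration, these operands can be modeled with the following hypothetical descriptor; the field names and layout are assumptions made for this sketch and are not a published NVMe structure.

#include <stdint.h>

/* Hypothetical operand descriptor for XOR_AND_UPDATE(Input1, Input2, Output). */
enum xau_operand_type {
    XAU_LBA_LOCAL,   /* LBA range on the drive servicing the command */
    XAU_LBA_PEER,    /* LBA range on a peer drive, addressed by BDF  */
    XAU_MEM_RANGE    /* memory address range (CMB, host, or remote)  */
};

struct xau_operand {
    enum xau_operand_type type;
    union {
        struct { uint64_t slba; uint32_t nlb; } lba;                 /* starting LBA, number of logical blocks */
        struct { uint16_t bdf; uint64_t slba; uint32_t nlb; } peer;  /* peer drive BDF plus its LBA range      */
        struct { uint64_t addr; uint32_t len; } mem;                 /* memory address range                   */
    } u;
};

struct xor_and_update_cmd {
    struct xau_operand input1;   /* first XOR operand                        */
    struct xau_operand input2;   /* second XOR operand                       */
    struct xau_operand output;   /* location that is updated with the result */
};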
For example, the first input for the first XOR offload instruction 212 includes the D 204 on a first LBA range on the NVMe device 142 that is servicing the XOR_AND_UPDATE command, while the second input includes the D′ that may be fetched from an input buffer of a host memory or a RAID controller's memory. In another example, the D′ may be fetched from the host memory or the RAID controller's memory and stored in a second LBA range of the same NVMe device. In this other example, the two inputs D and D′ are read from the first and second LBA ranges, respectively, of the same NVMe device. After the completion of the partial XOR operation between the D and the D′, the D′ may be written to replace the D 204 on the first LBA range. Furthermore, the updating portion of the XOR_AND_UPDATE command includes storing of the partial XOR calculation results to a memory address range in the CMB storage. The memory address range in the CMB storage is specified by the output parameter of the XOR_AND_UPDATE command. In an embodiment, the RAID controller takes note of this output parameter and is aware of the memory address range in the CMB storage where the partial XOR calculation results are stored.
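Continuing with the hypothetical descriptor above, the first XOR offload instruction for this example might be populated as follows; the LBA values, buffer addresses, and lengths are placeholders.

/* Uses struct xor_and_update_cmd and enum xau_operand_type from the sketch above. */
struct xor_and_update_cmd first_xor = {
    .input1 = { .type = XAU_LBA_LOCAL,
                .u.lba = { .slba = 0x1000, .nlb = 8 } },            /* D on the first LBA range of the data drive   */
    .input2 = { .type = XAU_MEM_RANGE,
                .u.mem = { .addr = 0xC0000000ULL, .len = 4096 } },  /* D' in a host or RAID controller input buffer */
    .output = { .type = XAU_MEM_RANGE,
                .u.mem = { .addr = 0xD0000000ULL, .len = 4096 } },  /* CMB range that receives D XOR D'             */
};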
To update the parity drive, the two inputs for the second XOR offload instruction 224 may include, for example, the memory address range of the CMB storage where the previous partial XOR calculation results are stored, and the LBA range on the servicing NVMe device 146 that stored the parity data. The two inputs in this case are not both LBA ranges; rather, one input includes the LBA range while the other input includes the memory address range of the CMB storage. In an embodiment, the NVMe device 146 is configured to access the stored partial XOR calculation results from the peer drive such as the NVMe device 142. In this embodiment, the NVMe device 146 may access the CMB storage without going through the RAID controller since the CMB storage is attached to the peer drive. For example, the NVMe device 146 is connected to the peer NVMe device 142 through a PCIe switch. In this example, the NVMe device 146 may access the stored partial XOR calculation results by using the memory address range of the CMB storage. The memory address range is one of the two inputs to the second XOR offload instruction.
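In the same hypothetical notation, the second XOR offload instruction for this example might look like the following; again, all addresses and ranges are placeholders.

/* Uses struct xor_and_update_cmd and enum xau_operand_type from the sketch above. */
struct xor_and_update_cmd second_xor = {
    .input1 = { .type = XAU_MEM_RANGE,
                .u.mem = { .addr = 0xD0000000ULL, .len = 4096 } },  /* D XOR D' in the peer drive's CMB  */
    .input2 = { .type = XAU_LBA_LOCAL,
                .u.lba = { .slba = 0x2000, .nlb = 8 } },            /* P on the servicing parity drive   */
    .output = { .type = XAU_LBA_LOCAL,
                .u.lba = { .slba = 0x2000, .nlb = 8 } },            /* same LBA range is updated with P' */
};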
In an embodiment, the XOR_AND_UPDATE command may be integrated into the NVMe protocol such that each NVMe device in the RAID array may be configured to receive the NVMe command and to use its internal hardware resources to perform the offload operations of the NVMe command.
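One possible way to carry such a command, shown purely as an assumption for illustration, is a vendor-specific I/O opcode inside a standard 64-byte NVMe submission queue entry, with the data pointer referencing an operand descriptor such as the one sketched above. The opcode value and dword usage below are not defined by the NVMe specification.

#include <stdint.h>
#include <string.h>

/* Simplified 64-byte NVMe submission queue entry. */
struct nvme_sqe {
    uint8_t  opcode;
    uint8_t  flags;
    uint16_t cid;      /* command identifier          */
    uint32_t nsid;     /* namespace identifier        */
    uint64_t cdw2_3;   /* reserved / command specific */
    uint64_t mptr;     /* metadata pointer            */
    uint64_t prp1;     /* data pointer, entry 1       */
    uint64_t prp2;     /* data pointer, entry 2       */
    uint32_t cdw10, cdw11, cdw12, cdw13, cdw14, cdw15;
};

#define XOR_AND_UPDATE_OPC 0xC8  /* assumed value in the vendor-specific I/O opcode range */

static void build_xau_sqe(struct nvme_sqe *sqe, uint16_t cid, uint32_t nsid,
                          uint64_t desc_phys_addr)
{
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = XOR_AND_UPDATE_OPC;
    sqe->cid    = cid;
    sqe->nsid   = nsid;
    sqe->prp1   = desc_phys_addr;  /* physical address of the XOR_AND_UPDATE operand descriptor */
}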
In an embodiment and for the write operation that includes the writing of the D′ to the NVMe device 142, the host 100 may send the first XOR offload instruction 212 to the data drive. In this context, the host may refer to the host memory or the RAID controller's memory. In this embodiment, one input to the first XOR offload instruction may include a first LBA range on the storage media 302, and the other input may be taken from an input buffer of the host memory or the RAID controller's memory. The host 100 may transfer 304 the D′ in response to a data request from the NVMe device, and the internal processors may perform the XOR calculations between the D and the D′. In a case where the host transfers 304 the D′ to a second LBA range on the storage media 302, the other input for the first XOR offload instruction may include the second LBA range on the same NVMe device. In this regard, the internal processors of the data drive may read the current data on the first LBA range and the D′ on the second LBA range, and perform the XOR calculations between the D and the D′. After the XOR calculations, the D′ is written to the first LBA range to replace the D. Here, the write of the D′ is similar to the write D′ 218 described above.
The XOR circuit 300 may include a hardware circuit with an application program that performs the XOR operation between the two inputs of the XOR_AND_UPDATE command. In an embodiment, the first input includes the current data read from the first LBA range while the second input D′ may be received from the host write. In this embodiment, the XOR circuit performs the XOR operations to generate results that will be stored, for example, in a first memory address range of the CMB storage 222. The usage and/or longevity of the stored data in the first memory address range of the CMB storage 222 may be managed by the RAID controller. In another embodiment, the generated results may be stored in the CMB storage on a remote drive, in the host memory, and/or in remote memory. After completion of the first XOR offloading operation, the NVMe device may send the write completion status 220 to the RAID controller.
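A minimal sketch of this data-drive behavior follows, with the drive's storage media and CMB modeled as plain byte arrays; real firmware would use its internal DMA engines and flash translation layer, so the names and layout here are illustrative assumptions.

#include <stdint.h>
#include <string.h>

#define BLK 4096u  /* logical block size used by this sketch */

static uint8_t media[16 * BLK];  /* stand-in for the data drive's storage media */
static uint8_t cmb[4 * BLK];     /* stand-in for the data drive's CMB storage   */

/* Data-drive side of the first XOR offload: XOR D with D', place the partial
   result in the CMB, then update the LBA location with D'. */
static void first_xor_offload(uint32_t data_blk,     /* block holding D                */
                              const uint8_t *d_new,  /* D' received from the host      */
                              uint32_t cmb_off)      /* CMB byte offset for the result */
{
    uint8_t *d = &media[data_blk * BLK];
    uint8_t *result = &cmb[cmb_off];

    /* Partial XOR: result = D XOR D', held in the CMB for the parity drive. */
    for (uint32_t i = 0; i < BLK; i++)
        result[i] = d[i] ^ d_new[i];

    /* Update the LBA location: D' overwrites D only after the XOR completes. */
    memcpy(d, d_new, BLK);
}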
In an embodiment and to update the parity drive, the host 100 may send the second XOR offload instruction 224 to the NVMe device 146. The two inputs for the second XOR offload instruction may include, for example, a third LBA range on the storage media 402 and the first memory address range of the CMB storage 222. In this example, the third LBA range may include the parity data such as the P 206 described above.
The XOR circuit 400 may include a hardware circuit that performs the XOR operation between the two inputs. In an embodiment, the XOR circuit performs the XOR operations to generate the P′ in a write buffer. Further in this embodiment, the parity drive writes the P′ from the write buffer to the storage media 406, where the P′ may be stored in the same or a different LBA range from where the old parity data was originally stored. The parity drive may maintain atomicity of data for the XOR operation and updates the location only after the XOR calculation or operation has been successfully completed. After completion of the second XOR offloading operation, the NVMe device that is servicing the command may send the write completion 236 to the host.
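A companion sketch of this parity-drive behavior follows, again with the media modeled as a byte array; the partial result is assumed to have already been read from the peer drive's CMB, and the parity location is overwritten only after the full XOR into a write buffer has completed.

#include <stdint.h>
#include <string.h>

#define BLK_BYTES 4096u  /* logical block size used by this sketch */

static uint8_t parity_media[16 * BLK_BYTES];  /* stand-in for the parity drive's storage media */

/* Parity-drive side of the second XOR offload: P' = (D XOR D') XOR P. */
static void second_xor_offload(uint32_t parity_blk,        /* block holding P                   */
                               const uint8_t *peer_result) /* D XOR D' read from the peer's CMB */
{
    uint8_t write_buf[BLK_BYTES];
    const uint8_t *p = &parity_media[parity_blk * BLK_BYTES];

    /* Stage P' in the write buffer first. */
    for (uint32_t i = 0; i < BLK_BYTES; i++)
        write_buf[i] = peer_result[i] ^ p[i];

    /* Commit P' to media only after the full XOR has completed, preserving atomicity. */
    memcpy(&parity_media[parity_blk * BLK_BYTES], write_buf, BLK_BYTES);
}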
In another embodiment, such as the read-peers method, the new parity data may be calculated by performing the XOR operation on the value of the D′ and the data of the NVMe device 144 in the same RAID stripe. In this other embodiment, the read and XOR offload operations may be performed at each NVMe device following the XOR offloading operation and updating processes described herein. However, the P′ is written to a different drive in the stripe, such as the parity drive.
For both the read-modify-write and read-peers processes, XOR offload operations may be performed in parallel by distributing the XOR_AND_UPDATE commands to multiple devices instead of a centralized implementation at the RAID controller. This may result in an optimized data path, reduce the number of steps required for I/O completion, and leverage PCIe peer-to-peer transaction capabilities.
At block 614, sending the second XOR offload instruction 224 to the parity drive. The second XOR offload instruction includes the XOR_AND_UPDATE command with two inputs and one output. At block 616, reading the old parity data from the third LBA range in the parity drive. The third LBA range, for example, is one of the two inputs of the XOR_AND_UPDATE command. At block 618, reading the results from the CMB memory range 226 of the CMB storage by the parity drive. At block 620, performing the XOR calculation 230 by the NVMe device 146 to generate the P′. And at block 622, storing the P′ from the write buffer or buffer memory to the storage media.
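A short sketch of the controller-side sequencing behind these blocks is shown below; the functions are hypothetical stand-ins that only trace the ordering of the two offload commands and the final completion returned to the host.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in: issue the first XOR_AND_UPDATE to the data drive and
   return the CMB address where the partial result (D XOR D') was stored. */
static uint64_t send_first_xor_offload(void)
{
    printf("first XOR_AND_UPDATE -> data drive\n");
    return 0xD0000000ULL;  /* example CMB address recorded by the RAID controller */
}

/* Hypothetical stand-in: issue the second XOR_AND_UPDATE to the parity drive,
   pointing it at the peer CMB range holding the partial result. */
static void send_second_xor_offload(uint64_t cmb_result_addr)
{
    printf("second XOR_AND_UPDATE -> parity drive (peer CMB result at 0x%llx)\n",
           (unsigned long long)cmb_result_addr);
}

int main(void)
{
    uint64_t result_addr = send_first_xor_offload();   /* new data write and partial XOR   */
    send_second_xor_offload(result_addr);               /* parity read, XOR, and P' update  */
    printf("write completion returned to the host\n");  /* completion reported to the host  */
    return 0;
}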
Although only a few exemplary embodiments have been described in detail herein, those skilled in the art will readily appreciate that many modifications are possible in the exemplary embodiments without materially departing from the novel teachings and advantages of the embodiments of the present disclosure. Accordingly, all such modifications are intended to be included within the scope of the embodiments of the present disclosure as defined in the following claims. In the claims, means-plus-function clauses are intended to cover the structures described herein as performing the recited function and not only structural equivalents.
Devices, modules, resources, or programs that are in communication with one another need not be in continuous communication with each other, unless expressly specified otherwise. In addition, devices, modules, resources, or programs that are in communication with one another can communicate directly or indirectly through one or more intermediaries.
The above-disclosed subject matter is to be considered illustrative, and not restrictive, and the appended claims are intended to cover any and all such modifications, enhancements, and other embodiments that fall within the scope of the present invention. Thus, to the maximum extent allowed by law, the scope of the present invention is to be determined by the broadest permissible interpretation of the following claims and their equivalents, and shall not be restricted or limited by the foregoing detailed description.