The present disclosure relates to Peripheral Component Interconnect Express (PCIe) and Non-Volatile Memory Express (NVMe) data storage and communications.
Non-Volatile Memory Express (NVMe) is a leading technology for directly connecting storage (e.g., a solid state drive (SSD) or Flash drive) to a host (e.g., a computer or server) via a Peripheral Component Interconnect Express (PCIe) interface. The directly-connected host transfers data to and from an NVMe storage device using direct memory access (DMA).
When implemented in a network, for example, the host (e.g., a server) directly attached to the NVMe storage device may also be connected to other computers (e.g., clients) through the network. The client computers, which are not directly attached to the NVMe storage device, access the NVMe storage device in a conventional manner through the network and the server.
RDMA (Remote Direct Memory Access) and NVMe-over-RDMA are known protocols for enabling remotely-connected hosts to transfer data to and from the NVMe storage device. For example, in a write operation, the RDMA protocol communicates data from the system memory of the client over a network to the system memory of the server. At the server, the data is buffered in the system memory of the server, and then the NVMe storage device may access the buffered data. If the server is attached to many clients and each client wishes to access the storage device(s) of the server, the system memory of the server may become a bottleneck.
A known improvement of the RDMA method partially lessens the bottleneck caused by the system memory of the server by utilizing a buffer memory associated with the controller of the NVMe storage device. However, this known method still depends on the involvement of the system memory of the server host and further requires use of special, proprietary, or non-standard NVMe storage devices, limiting the compatibility of this known method in various system implementations.
Therefore, improvements to networked-NVMe transfer approaches are desirable.
Embodiments of the present disclosure will now be described, by way of example only, with reference to the attached Figures.
The present disclosure provides a method and system for transferring NVMe data over a network. Embodiments of the present disclosure comprise using a discrete buffer memory device to transfer the user data from a client host of the network, and process command requests so that a NVMe storage device of the network can be instructed to directly access the memory of the discrete buffer memory. In an implementation, the discrete buffer memory device comprises a controller and a random access memory for generating commands and storing the commands in a submission queue of the random access memory. The controller can clear commands from the submission queue based on completion commands received in a completion queue of the random access memory.
Embodiments of the method and system according to the present disclosure provide improvements to NVMe-over-RDMA operations. In particular, these improvements reduce or eliminate the bottleneck caused by the system memory of a server host, which is found in conventional RDMA approaches, without necessitating the use of special, proprietary, or non-standard NVMe storage devices that include a memory buffer. By using a discrete buffer memory device according to the embodiments of the present disclosure, the involvement of the server's CPU and/or the server's system memory, in the data transfer between the client and the NVMe storage device attached to the server, may be reduced or eliminated.
In an embodiment, the present disclosure provides a system comprising: a client host having user data stored thereon, the client host configured to generate a non-volatile memory express (NVMe)-over-remote direct memory access (RDMA) write command request for requesting a direct memory access transfer of the user data; a server host in communication with the client host through a network; a NVMe storage device in communication with the server host; and a discrete buffer memory device in communication with the server host, the discrete buffer memory device configured to, independently of the server host: generate a write command from the NVMe-over-RDMA write command request; store the user data from the client host; and send an interrupt signal to the NVMe storage device.
In an example embodiment, the discrete buffer memory device comprises: a controller for generating the write command; and a random access memory for storing the user data and storing the write command in a submission queue allocated in the random access memory.
In an example embodiment, the controller is configured to submit the write command into the submission queue and send the interrupt signal for the NVMe storage device to retrieve the write command from the submission queue.
In an example embodiment, the interrupt signal causes the NVMe storage device to retrieve the write command from the random access memory, perform the direct memory access transfer of the stored user data from the random access memory to the NVMe storage device, generate a completion command, perform a direct memory access transfer of the completion command into a completion queue allocated in the random access memory of the discrete buffer memory device, and send a second interrupt signal to the controller of the discrete buffer memory device.
In an example embodiment, the controller, upon receiving the second interrupt signal, is configured to clear from the submission queue the write command associated with the completion command, generate from the completion command a response to the NVMe-over-RDMA write command request, and transmit the response to the client host.
In an example embodiment, the system further comprises a plurality of NVMe storage devices attached to the server host and a plurality of submission queues are allocated in the random access memory of the discrete buffer memory device such that each submission queue is associated with a respective NVMe storage device of the plurality of NVMe storage devices.
In an example embodiment, the NVMe-over-RDMA write command request is associated with a particular NVMe storage device of the plurality and the controller is configured to generate the write command according to the particular NVMe storage device and submit the write command into a particular submission queue associated with the particular NVMe storage device.
In an example embodiment, the discrete buffer memory device further comprises: a battery for providing power to the device during a power failure; and a non-volatile Flash memory for storing the data and commands from the random access memory, during and after the power failure.
In an example embodiment, the discrete buffer memory device comprises a virtual-to-logical block address mapping table for storing logical block addresses of the plurality of NVMe storage devices.
In an example embodiment, the discrete buffer memory device configures the server host to route NVMe-over-RDMA requests and the user data to the discrete buffer memory device.
In another embodiment, the present disclosure provides a discrete buffer memory device comprising: a peripheral component interconnect express (PCIe) bus interface for connecting the discrete buffer memory device to a server host, the server host in communication with a plurality of NVMe storage devices and the server host in communication with a client host through a network; a controller connected to the PCIe bus for, independently of the server host: receiving a non-volatile memory express (NVMe)-over-remote direct memory access (RDMA) write command request and user data from the client host, generating a write command from the NVMe-over-RDMA write command request, and sending an interrupt signal to a particular NVMe storage device of the plurality; and a random access memory connected to the controller for, independently of the server host, storing the user data received from the client host and storing the write command.
In an example embodiment, a plurality of submission queues are allocated in the random access memory of the discrete buffer memory device such that each submission queue is associated with a respective NVMe storage device of the plurality of NVMe storage devices.
In an example embodiment, the NVMe-over-RDMA write command request is associated with the particular NVMe storage device of the plurality of NVMe storage devices and the controller is configured to generate the write command according to the particular NVMe storage device and submit the write command into a particular submission queue associated with the particular NVMe storage device.
In an example embodiment, the controller is configured to submit the write command into the particular submission queue and send the interrupt signal for the particular NVMe storage device to retrieve the write command from the particular submission queue.
In an example embodiment, a completion queue is allocated in the random access memory of the discrete buffer memory device, and the completion queue receives a direct memory access transfer of a completion command generated from the particular NVMe storage device after the interrupt signal causes the particular NVMe storage device to retrieve the write command from the random access memory and perform a direct memory access transfer of the stored user data from the random access memory to the particular NVMe storage device.
In an example embodiment, the controller receives a second interrupt signal from the particular NVMe storage device, clears from the submission queue the write command associated with the completion command, generates from the completion command a response to the NVMe-over-RDMA write command request, and transmits the response to the client host.
In an example embodiment, the discrete buffer memory device further comprises a battery for providing power to the device during a power failure, and a non-volatile Flash memory for storing the data and commands from the random access memory, during and after the power failure.
In an example embodiment, the discrete buffer memory device further comprises a virtual-to-logical block address mapping table for storing logical block addresses of the plurality of NVMe storage devices.
In an example embodiment, the controller configures the server host to route NVMe-over-RDMA requests and the user data to the discrete buffer memory device.
For simplicity and clarity of illustration, reference numerals may be repeated among the figures to indicate corresponding or analogous elements. Numerous details are set forth to provide an understanding of the embodiments described herein. The embodiments may be practiced without these details. In other instances, well-known methods, procedures, and components have not been described in detail to avoid obscuring the embodiments described.
Before describing details of embodiments of the present disclosure, further discussion of the surrounding context will be provided.
The NVMe protocol is a logical protocol for block device access of local storage devices. The NVMe protocol is designed to deal with local non-volatile memory access using logical block addresses (LBAs) for read and write input/output (IO) commands. The NVMe protocol specifies submission queues for receiving commands, completion queues for clearing commands from submission queues, and a special memory-mapped PCIe address range in the storage device allocated for NVMe commands only. This NVMe command address range in the NVMe storage device acts as the interface through which the host communicates with the NVMe storage device.
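To make this queueing model concrete, the following C sketch shows a simplified submission queue entry and a submission queue tail doorbell write. This is an illustration only: the structure abbreviates the 64-byte common command format defined by the NVMe specification, and a doorbell stride of 4 bytes (CAP.DSTRD = 0) is assumed.

```c
#include <stdint.h>

/* Simplified 64-byte NVMe I/O submission queue entry, reduced from the
 * common command format in the NVMe specification. */
struct nvme_sqe {
    uint8_t  opcode;      /* 0x01 = Write, 0x02 = Read               */
    uint8_t  flags;
    uint16_t cid;         /* command identifier, echoed in the CQE   */
    uint32_t nsid;        /* namespace identifier                    */
    uint64_t rsvd;
    uint64_t mptr;        /* metadata pointer                        */
    uint64_t prp1;        /* PRP entry 1: data buffer address        */
    uint64_t prp2;        /* PRP entry 2                             */
    uint64_t slba;        /* starting logical block address          */
    uint16_t nlb;         /* number of logical blocks (0-based)      */
    uint16_t control;
    uint32_t dsmgmt;
    uint32_t rsvd2[2];
};

/* Ring the submission queue tail doorbell for queue `qid`. `bar0` points
 * at the device's memory-mapped NVMe register space; doorbells start at
 * offset 0x1000, with a 4-byte stride assumed (CAP.DSTRD == 0). */
static inline void nvme_ring_sq_doorbell(volatile uint8_t *bar0,
                                         uint16_t qid, uint16_t new_tail)
{
    volatile uint32_t *db =
        (volatile uint32_t *)(bar0 + 0x1000 + (2 * qid) * 4);
    *db = new_tail;
}
```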
The NVMe protocol is a latest-generation storage protocol targeted at delivering low-latency access to data. However, using NVMe devices in a network may increase the latency of access to data stored on a given NVMe device of the network. In a conventional network (not shown), the network interface controller (NIC) of the host attached to the given NVMe storage device adds an additional layer of processing and latency during remote access of NVMe storage devices.
The NVMe-over-Remote Direct Memory Access (RDMA) protocol simplifies the remote access of NVMe storage devices over the network by eliminating some intermediate processing between the host CPU and the NIC and by offloading some processing from the host CPU to the NIC. The NVMe-over-RDMA protocol also integrates the NVMe block device logical protocol with the DMA operations on the server host's system memory, allowing a client host to send data to and receive data from the NVMe storage device with less intermediate processing.
Despite this reduction of intermediate processing steps, the known NVMe-over-RDMA approach still carries significant inefficiencies, as described in relation to the following example.
The NVMe-over-RDMA write command data structure and the user data are transferred from the system memory of the client host 12 to the system memory of a server host 16 over a network 18 according to the RDMA mechanism described above. Once the user data and the write command request data structure are located in the system memory of the server host 16, the CPU of the server host 16 processes the write command request into a submission queue allocated in the system memory of the server host.
Finally, the server host's CPU signals for the NVMe storage device 14 to perform a DMA transfer of the user data from the system memory of the server host into a non-volatile memory of the NVMe storage device.
Though a read operation is not described in detail, it is similar to the write operation described above, with the user data transferred in the opposite direction.
In the operation described above, the command processing and the buffering of user data both pass through the CPU and system memory of the server host 16, such that the system memory of the server host 16 may become a bottleneck.
In another known implementation of the NVMe protocol, a Controller Memory Buffer (CMB, not shown) is integrated with each NVMe storage device 14 and assumes some of the NVMe-over-RDMA operations of the server host's CPU and system memory. The CMB is a controller and buffer memory integrated inside the NVMe storage device.
The buffer memory of the CMB stores user data for the internal NVMe non-volatile memory to perform a DMA transfer. Although the CMB approach helps reduce data transfers between the system memory of the server host and the non-volatile memory of the NVMe storage device, the approach still depends on the CPU and system memory of the server host 16 to manage and process NVMe-over-RDMA command requests and NVMe device interrupts. In particular, submission and completion queues are instantiated in the server host 16 memory, and the server host 16 CPU must parse the NVMe-over-RDMA commands and send the parsed commands to the submission queue in the server host 16 memory. Therefore, the processes still depend on the server host 16, and the system memory of the server host 16 may still become a bottleneck.
Furthermore, the known CMB approach also creates significant limitations that affect the practicality and compatibility of the CMB approach. The CMB approach requires use of special, proprietary, or non-standard NVMe storage devices. These special NVMe storage devices require extra memory, which increases the cost of the NVMe storage device. Additionally, an NVMe storage device that supports CMB only supports a limited size CMB due to NVMe storage device size limitations.
Another limitation of the known CMB approach is that the CMB uses a volatile memory and the user data transferred to and buffered in the CMB's volatile memory can be lost during a power failure.
Another limitation, in the case of multiple NVMe storage devices attached to the server host, is that the client host must be aware of the topology of the NVMe storage devices attached to the server host and their respective CMB addresses.
Another limitation, in the case of multiple NVMe storage devices attached to the server host, is the need to synchronize accesses to the CMB between the server host CPU and the server host NIC. Since there is no indication of whether the CMB is already occupied by user data sent to it by the server host CPU, any data transfer using the RDMA protocol from the NIC of the server host into the CMB may cause data corruption.
Accordingly, the present disclosure provides a system comprising a discrete buffer memory device for overcoming one or more of the deficiencies in the known approaches described above. The discrete buffer memory device allows for NVMe-over-RDMA to be used either with conventional NVMe storage devices (i.e., devices which do not have CMBs) or NVMe devices with CMBs.
A discrete buffer memory device according to an embodiment of the present disclosure provides a cost-effective solution to effectively support NVMe-over-RDMA protocol data transfers without the need to implement a customized or proprietary solution such as CMB. In an embodiment, the discrete buffer memory device is used to eliminate the server host CPU's involvement with parsing the NVMe-over-RDMA commands and sending the parsed commands to a NVMe submission queue. In an embodiment, the discrete buffer memory device additionally eliminates the need for the NVMe storage device attached to the server host to transfer the NVMe command data structures between the system memory of the server host and the NVMe device itself. In an embodiment, the discrete buffer memory device is also used to eliminate the need for the client host to be aware of the topology of the NVMe storage devices attached to the server host. Furthermore, in an embodiment the discrete buffer memory device allows for extended functionality such as virtualization of multiple storage devices attached to the server.
The server host 116 is in communication with a client host 112 through a network 118. In an example implementation, each of the server host 116 and the client host 112 connects to the network 118 through a network interface controller (NIC) connected to a PCIe switch.
The server host 116 is also in communication with a NVMe storage device 114 through the PCIe bus interface, which may also comprise the PCIe switch. The system 100 may optionally comprise additional NVMe storage devices 114 so that the server host 116 is in communication with a plurality of NVMe storage devices 114.
In an example write operation of the system 100, the client host 112 has user data that it wishes to write to the NVMe storage device 114. The client host generates a NVMe-over-RDMA write command request for requesting a direct memory access transfer of the user data. In an example embodiment, the discrete buffer memory device 110 is discrete in that it is separate and distinct from both the server host 116 and the NVMe storage device 114.
In an embodiment, the discrete buffer memory device 110 has configured the PCIe switch of the server host 116 to route NVMe-over-RDMA requests and user data, for each NVMe storage device 114 attached to server host 116, to the discrete buffer memory device 110. Therefore, the request and user data from the client host 112 are sent to the discrete buffer memory device 110.
The discrete buffer memory device 110 generates a write command based on the NVMe-over-RDMA write command request, and stores the write command in internal memory of the discrete buffer memory device 110, for example in the RAM 124 discussed further below. The discrete buffer memory device 110 then sends an interrupt signal to the NVMe storage device 114.
The interrupt signal instructs the NVMe storage device 114 to retrieve the write command from the discrete buffer memory device 110. The write command enables the NVMe storage device 114 to perform the direct memory access transfer of the stored user data from the discrete buffer memory device 110 to the NVMe storage device 114.
The discrete buffer memory device 110 comprises a PCIe bus interface 120 for connecting the discrete buffer memory device 110 to the server host 116.
The discrete buffer memory device 110 also comprises a controller 122 connected to the PCIe bus interface 120. The controller 122 receives the NVMe-over-RDMA write command request and user data from the client host 112 through the PCIe bus interface 120. The controller 122 also generates and sends the interrupt signal to the NVMe storage device 114 through the PCIe bus interface 120.
The controller 122 is connected to a random access memory (RAM) 124 for storing the user data received from the client host 112 and storing the write command generated by the controller 122. The RAM 124 may be a double data rate (DDR) RAM connected to the controller 122 through a DDR memory bus.
The interrupt signal sent over the PCIe bus interface 120 causes the NVMe storage device 114 to retrieve the write command from the RAM 124 and perform a direct memory access transfer of the stored user data from the RAM 124 to the NVMe storage device 114.
The above-mentioned operations of the controller 122 are performed independently of the server host 116.
In a further embodiment, the discrete buffer memory device 110 comprises a battery 126 and a non-volatile memory 128. The controller 122 is connected to non-volatile memory 128 by a Flash bus. Since the RAM 124 is a volatile memory, any data buffered or stored on the RAM 124 will be lost when the discrete buffer memory device 110 loses power.
In the case of a power failure or power shutdown event, the battery 126 provides the discrete buffer memory device 110 with power so that the controller 122 can transfer or back up data from the RAM 124 to the non-volatile memory 128. At boot time, the controller 122 can load the data from the non-volatile memory 128 back into the RAM 124 and continue to work from the last state it reached before the power-down event. Although the non-volatile memory 128 is illustrated in an example embodiment as a Flash memory, other types of non-volatile memory may be used.
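A minimal sketch of this backup-and-restore behavior is given below in C. It is illustrative firmware pseudologic only: the hooks flash_write, flash_read, and ram_region are hypothetical names, not an actual device API.

```c
#include <stddef.h>

/* Hypothetical firmware hooks; names and signatures are illustrative
 * only, not taken from any real device firmware. */
extern void  flash_write(size_t flash_off, const void *src, size_t len);
extern void  flash_read(size_t flash_off, void *dst, size_t len);
extern void *ram_region(size_t *len);   /* queues plus buffered user data */

/* On a power-fail event, running on battery 126, copy the RAM 124 state
 * (submission/completion queues and buffered user data) into the
 * non-volatile memory 128. */
void on_power_fail(void)
{
    size_t len;
    void *ram = ram_region(&len);
    flash_write(0, ram, len);
}

/* At boot time, restore the saved state so the controller 122 resumes
 * from the last state reached before the power-down event. */
void on_boot_restore(void)
{
    size_t len;
    void *ram = ram_region(&len);
    flash_read(0, ram, len);
}
```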
The client host 112 has direct memory access to the discrete buffer memory device 110 because the discrete buffer memory device 110 is exposed to the client host 112 according to the PCIe standard via memory mapping. Specifically, the client host 112 has address space allocated for mapping to hardware registers of the discrete buffer memory device 110 and for mapping to RAM 124 addresses of the discrete buffer memory device 110. The PCIe memory map of the discrete buffer memory device 110 in the client host 112 is divided into the following three types of memory address ranges:
1) PCIe registers map as defined by the PCIe standard—this interface is used by the host to read/write to the PCIe registers of the device.
2) NVMe registers map (a subset of the PCIe registers address range) as defined by the NVMe standard—this interface is used by the host to read/write to the NVMe registers of the device.
3) Random access memory (RAM)—mapped as a PCIe BAR (base address register) area as specified in the PCIe standard specification. The client host 112 can directly access the RAM 124 of the discrete buffer memory device 110 by addressing according to this memory map.
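As an illustration of the third address range, the following C program sketches how a Linux user-space client might map the device's RAM BAR through the PCIe sysfs resource file and access it directly. This is a minimal sketch under stated assumptions: the sysfs path, the choice of resource2 as the RAM BAR, and the mapping length are hypothetical and depend on how the device enumerates.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Illustrative sysfs path; the bus/device/function numbers depend on
     * where the discrete buffer memory device enumerates on this host. */
    const char *bar = "/sys/bus/pci/devices/0000:03:00.0/resource2";
    int fd = open(bar, O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    /* Map 1 MiB of the RAM BAR; the real BAR size would come from the
     * resource file itself or from lspci. */
    size_t len = 1 << 20;
    volatile uint8_t *ram = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);
    if (ram == MAP_FAILED) { perror("mmap"); return 1; }

    ram[0] = 0xA5;                    /* direct access to device RAM */
    printf("read back: 0x%02x\n", ram[0]);

    munmap((void *)ram, len);
    close(fd);
    return 0;
}
```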
Embodiments of the present disclosure overcome one or more of the limitations of known approaches by providing a discrete buffer memory device 110 that enables access, according to the NVMe-over-RDMA protocol, between the client host 112 and the NVMe storage device 114, independently of the server host 116. In other words, the discrete buffer memory device 110 can perform necessary NVMe-over-RDMA protocol tasks without the intervention of the server host system memory and CPU.
The handling of an NVMe-over-RDMA command request by the discrete buffer memory device 110, according to an embodiment of the present disclosure, will now be described. The NVMe-over-RDMA request is transferred through the network (including the NIC and the PCIe switch) to the discrete buffer memory device 110. The NVMe-over-RDMA command data structures are stored into a PCIe address range that is allocated in the RAM 124 for managing NVMe submission and completion queues.
At 201, the discrete buffer memory device 110 receives an NVMe-over-RDMA command request.
Next, at 202, controller 122 parses the NVMe-over-RDMA request into a NVMe command data structure.
At 203, the controller 122 writes the NVMe command data structure into a submission queue 132 associated with the relevant NVMe storage device 114, where the submission queue 132 is instantiated in addresses of the RAM 124 allocated for this purpose.
At 204, the discrete buffer memory device 110 generates and writes an interrupt signal into a doorbell register of the relevant NVMe storage device 114 indicative that an NVMe command is in the submission queue 132 associated with that NVMe storage device 114.
Upon receiving the interrupt signal, at 205, the NVMe storage device 114 performs a DMA transfer of the command from the associated submission queue 132 in the RAM 124. In an embodiment of the present disclosure, the command instructs the NVMe storage device 114 to perform a DMA transfer of user data from specific buffer space 130 addresses in the RAM 124 where the user data is stored.
Upon completion of handling the NVMe command (e.g., performing the DMA transfer), at 206, the NVMe storage device 114 generates an NVMe completion command.
At 207, the NVMe storage device 114 performs a DMA write of the NVMe completion command to the completion queue 134 in the RAM 124.
At 208, the NVMe storage device 114 sends an interrupt signal to the discrete buffer memory device 110 that the completion command is in the completion queue 134.
Upon receiving the interrupt signal, at 209, the controller 122 clears, from the submission queue 132, the command that corresponds to the completion command received in the completion queue 134.
At 210, controller 122 generates a response to the NVMe-over-RDMA protocol request by transferring a PCIe message to the client host 112 through the network 118.
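Steps 201 to 210 can be condensed into two event handlers running on the controller 122. The C sketch below is illustrative only: all helper functions and constants are hypothetical names, and details such as PRP list construction, queue wrap-around signaling, and error handling are omitted.

```c
#include <stdint.h>

#define MAX_DEVICES 16   /* illustrative limits */
#define SQ_DEPTH    64

/* Abbreviated command/completion entries (cf. the earlier submission
 * queue entry sketch). */
struct nvme_sqe { uint8_t opcode; uint16_t cid; uint64_t prp1, slba; };
struct nvme_cqe { uint16_t status; uint16_t cid; };

/* Hypothetical controller-firmware helpers; names are illustrative. */
extern struct nvme_sqe *sq_slot(int dev, uint16_t tail);    /* in RAM 124 */
extern void parse_rdma_request(const void *req, struct nvme_sqe *out,
                               int *dev);
extern void ring_device_doorbell(int dev, uint16_t tail);
extern void clear_sq_entry(int dev, uint16_t cid);
extern void send_rdma_response(const struct nvme_cqe *cqe);

static uint16_t sq_tail[MAX_DEVICES];

/* Steps 201-204: an NVMe-over-RDMA command request arrives in RAM 124. */
void on_rdma_request(const void *req)
{
    int dev;
    struct nvme_sqe cmd;
    parse_rdma_request(req, &cmd, &dev);          /* 202: parse request  */
    *sq_slot(dev, sq_tail[dev]) = cmd;            /* 203: queue command  */
    sq_tail[dev] = (sq_tail[dev] + 1) % SQ_DEPTH;
    ring_device_doorbell(dev, sq_tail[dev]);      /* 204: doorbell write */
    /* 205-208: DMA of the command, the user data, and the completion
     * entry are performed by the NVMe storage device itself. */
}

/* Steps 209-210: the device has DMA-written a completion command and
 * raised an interrupt toward the controller 122. */
void on_completion_interrupt(int dev, const struct nvme_cqe *cqe)
{
    clear_sq_entry(dev, cqe->cid);                /* 209: retire command  */
    send_rdma_response(cqe);                      /* 210: reply to client */
}
```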
In a further embodiment of the present disclosure, the discrete buffer memory device 110 also comprises a virtual-to-logical (V2L) block address mapping table (not shown) for storing logical block addresses of the plurality of NVMe storage devices 114. The discrete buffer memory device 110 can receive an NVMe-over-RDMA request and decide which of the NVMe storage devices 114 will handle the request. For example, the virtual-to-logical table comprises virtual block address entries X that each store an NVMe storage device number N and the real LBA in the NVMe storage device N.
The discrete buffer memory device 110 exposes the V2L table only to its internal controller 122, and exposes the virtual block addresses to the client host 112. In an embodiment, the exposed virtual block addresses indicate to the client host 112 that the discrete buffer memory device 110 has a storage capacity equal to the sum of the storage capacities of the plurality of NVMe storage devices 114. The V2L table thus allows the discrete buffer memory device 110 to hide the topology of the NVMe storage devices 114.
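A dense-table form of such a V2L mapping might look like the following C sketch; the entry layout and the table capacity are assumptions made for illustration, not the layout of any particular implementation.

```c
#include <stdint.h>

/* One hypothetical V2L entry: virtual block address X maps to (NVMe
 * storage device number N, real LBA within device N). The table is
 * visible only to the controller 122; the client host 112 sees only
 * virtual block addresses. */
struct v2l_entry {
    uint16_t dev;   /* NVMe storage device number N */
    uint64_t lba;   /* real LBA inside device N     */
};

/* Illustrative capacity: one entry per virtual block address. */
static struct v2l_entry v2l_table[1 << 20];

/* Translating a virtual block address is a single array lookup, which
 * lets the device hide the storage topology from the client host. */
static inline struct v2l_entry v2l_lookup(uint64_t virtual_lba)
{
    return v2l_table[virtual_lba];
}
```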
Various embodiments of the present disclosure provide a number of advantages over known NVMe-over-RDMA approaches.
The discrete buffer memory device 110 provides a cost-effective solution to effectively support NVMe-over-RDMA without the need to implement a customized or proprietary solution such as CMB.
Furthermore, the discrete buffer memory device 110 may be used to eliminate the server host CPU's involvement with parsing the NVMe-over-RDMA commands and sending the parsed commands to a NVMe submission queue. The discrete buffer memory device additionally eliminates the need for the NVMe storage device attached to the server host to transfer the NVMe command data structures between the system memory of the server host and the NVMe device itself.
Furthermore, the discrete buffer memory device may allow for extended functionality such as virtualization of multiple storage devices attached to the server.
The discrete buffer memory device may also be used to hide the topology of the NVMe storage devices from the client host.
The discrete buffer memory device may also provide a power-failure recovery management method. This is enabled in an example embodiment in which the discrete buffer memory device, which includes a battery backup, ensures that any data already located in the volatile random access memory of the discrete buffer memory device will be flushed into the non-volatile memory of the discrete buffer memory device upon a sudden power failure.
In the preceding description, for purposes of explanation, numerous details are set forth in order to provide a thorough understanding of the embodiments. However, it will be apparent to one skilled in the art that these specific details are not required. In other instances, well-known electrical structures and circuits are shown in block diagram form in order not to obscure the understanding. For example, specific details are not provided as to whether the embodiments described herein are implemented as a software routine, hardware circuit, firmware, or a combination thereof.
Embodiments of the disclosure can be represented as a computer program product stored in a machine-readable medium (also referred to as a computer-readable medium, a processor-readable medium, or a computer usable medium having a computer-readable program code embodied therein). The machine-readable medium can be any suitable tangible, non-transitory medium, including magnetic, optical, or electrical storage medium including a diskette, compact disk read only memory (CD-ROM), memory device (volatile or non-volatile), or similar storage mechanism. The machine-readable medium can contain various sets of instructions, code sequences, configuration information, or other data, which, when executed, cause a processor to perform steps in a method according to an embodiment of the disclosure. Those of ordinary skill in the art will appreciate that other instructions and operations necessary to implement the described implementations can also be stored on the machine-readable medium. The instructions stored on the machine-readable medium can be executed by a processor or other suitable processing device, and can interface with circuitry to perform the described tasks.
The above-described embodiments are intended to be examples only. Alterations, modifications and variations can be effected to the particular embodiments by those of skill in the art without departing from the scope, which is defined solely by the claims appended hereto.
This application claims the benefit of U.S. Provisional Application No. 62/270,348, filed Dec. 21, 2015, which is herein incorporated by reference.