The present disclosure relates generally to remote direct memory access (RDMA) operations and particularly to improved RDMA techniques utilizing network interface controllers having onboard processors.
Remote direct memory access (RDMA) allows a client device to access remote memory devices over a network supporting such features using an RDMA network interface controller (rNIC). There are protocols extending the remote memory access functionality of RDMA to storage access, for example non-volatile memory express (NVMe)-over-Fabrics. While this is advantageous in itself, approaches implemented to date use significant processing resources of the client device's central processing unit (CPU) to provide some storage services. As such, ‘smart’ rNICs (SmartNICs) have been proposed as a solution, which include an onboard processor to offload at least a portion of the client CPU's operations to the rNIC.
It would therefore be advantageous to provide a solution that would overcome the challenges noted above.
A summary of several example embodiments of the disclosure follows. This summary is provided for the convenience of the reader to provide a basic understanding of such embodiments and does not wholly define the breadth of the disclosure. This summary is not an extensive overview of all contemplated embodiments, and is intended to neither identify key or critical elements of all embodiments nor to delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more embodiments in a simplified form as a prelude to the more detailed description that is presented later. For convenience, the term “certain embodiments” may be used herein to refer to a single embodiment or multiple embodiments of the disclosure.
Certain embodiments disclosed herein include a method for improved remote direct memory access (RDMA) for multi-host network interface controllers (NIC), the method including: allocating a first key to a first host, the first key corresponding to a first address of a memory device of the first host; and allocating the first key to a second host, wherein the second host is an RDMA NIC (rNIC) configured to offload at least a portion of storage operations from the first host.
Certain embodiments disclosed herein also include a non-transitory computer readable medium having stored thereon instructions for causing a processing circuitry to perform a process, the process including: allocating a first key to a first host, the first key corresponding to a first address of a memory device of the first host; and allocating the first key to a second host, wherein the second host is an RDMA NIC (rNIC) configured to offload at least a portion of storage operations from the first host.
Certain embodiments disclosed herein also include a system for multi-host network interface controllers (NIC), including: a processing circuitry; and a memory, the memory containing instructions that, when executed by the processing circuitry, configure the system to: allocate a first key to a first host, the first key corresponding to a first address of a memory device of the first host; and allocate the first key to a second host, wherein the second host is an RDMA NIC (rNIC) configured to offload at least a portion of storage operations from the first host.
The subject matter disclosed herein is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the disclosed embodiments will be apparent from the following detailed description taken in conjunction with the accompanying drawings.
It is important to note that the embodiments disclosed herein are only examples of the many advantageous uses of the innovative teachings herein. In general, statements made in the specification of the present application do not necessarily limit any of the various claimed embodiments. Moreover, some statements may apply to some inventive features but not to others. In general, unless otherwise indicated, singular elements may be in plural and vice versa with no loss of generality. In the drawings, like numerals refer to like parts through several views.
The various disclosed embodiments include network interface controllers (NICs) that include an onboard processor capable of offloading remote direct memory access (RDMA) operations from the main processor of a client device. The rNIC defines each processor as a host and allocates addresses in a memory, which are addressable by a key. The present disclosure suggests utilizing a single key per address, allowing the main processor and the rNIC processor to generate instructions based on a single key. This improves the efficiency and speed of the operations between a client device and a remote storage device, accessible over a network.
The processing circuitry 110 is coupled via a bus 105 to a memory 120. The memory 120 may include a memory portion 122 that contains instructions that, when executed by the processing circuitry 110, perform the method described in more detail herein. The memory 120 may be further used as a working scratch pad for the processing circuitry 110, a temporary storage, and others, as the case may be. The memory 120 may be a volatile memory such as, but not limited to, random access memory (RAM), or non-volatile memory (NVM), such as, but not limited to, flash memory. The memory 120 may further include a memory portion 124 containing addresses which can be associated with keys allocated from an NIC 130, to which the processing circuitry 110 is further coupled.
The NIC 130 is further discussed below, and may provide connectivity over a network to one or more storage servers. The processing circuitry 110 may be coupled to a storage 140, which may be used for the purpose of holding a copy of the method executed in accordance with the disclosed technique. The storage 140 may also be used for storing therein data blocks received over a network from a storage server. The processing circuitry 110 and/or the memory 120 may also include machine-readable media for storing software. Software shall be construed broadly to mean any type of instructions, whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise. Instructions may include code (e.g., in source code format, binary code format, executable code format, or any other suitable format of code). The instructions, when executed by the one or more processors, cause the processing system to perform the various functions described in further detail herein.
The processing circuitry 210 may be realized as one or more hardware logic components and circuits. For example, and without limitation, illustrative types of hardware logic components that can be used include field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), system-on-a-chip systems (SOCs), general-purpose microprocessors, microcontrollers, digital signal processors (DSPs), and the like, or any other hardware logic components that can perform calculations or other manipulations of information.
The rNIC 200 is configured to offload at least a portion of the storage operations of a client device to utilize less resources of the client device processor when performing various storage tasks. The rNIC 200 is connected to a client processor device, such as the processing circuitry 110 of the client device 100 of
The client device processing circuitry 110 may be defined by the rNIC 200 as a first host, with the rNIC 200 processing circuitry 210 being defined as a second host. Typically, each host is allocated a plurality of keys by the rNIC 200, where each key corresponds to a single address in a memory, such as memory 120 of the client device 100. For example, the first host may request a remote data block be fetched, i.e., read, from a remote storage device to a first key, corresponding to a first address of the memory into which the remote data block should be read. The rNIC will receive the request, and then send an RDMA request with a second key corresponding to a second address, to the remote storage server to read the block into the second address.
A third operation will be initiated to move the data block from the second address to the first address, in order to complete the operation. This approach includes multiple reads and memory allocations, and it may be beneficial to avoid at least some of these operations in order to increase the efficiency of the process. For example, by allocating a first key to a first host corresponding to a first address, and allocating a second key to a second host corresponding to the first address, the second host is able to send the remote storage server an RDMA request to read data directly into the first memory address, without having to go through the second memory address. This may increase the total speed of the transaction.
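The difference between the two paths can be illustrated with a toy model. The following Python sketch is not the disclosed implementation: `KeyTable`, `staged_read`, and `direct_read` are hypothetical names, client memory is modeled as a dictionary from address to bytes, and the remote storage server is reduced to a plain assignment. It only counts how many memory writes each path performs.

```python
class KeyTable:
    """Toy per-host key table mapping (host, key) to a memory address."""

    def __init__(self):
        self._entries = {}

    def allocate(self, host, key, address):
        self._entries[(host, key)] = address

    def resolve(self, host, key):
        return self._entries[(host, key)]


def staged_read(table, memory, block, first_key, second_key):
    """Conventional path: land in the second address, then move to the first."""
    writes = 0
    staging = table.resolve("rnic", second_key)
    memory[staging] = block              # server writes into the second address
    writes += 1
    final = table.resolve("client", first_key)
    memory[final] = memory[staging]      # extra move from second to first address
    writes += 1
    return writes


def direct_read(table, memory, block, second_key):
    """Proposed path: the rNIC's key already resolves to the first address."""
    final = table.resolve("rnic", second_key)
    memory[final] = block                # single write, straight to the destination
    return 1


# Conventional allocation: the rNIC's key points at a separate address.
staged_table = KeyTable()
staged_table.allocate("client", "k1", 0x1000)
staged_table.allocate("rnic", "k2", 0x2000)
staged_memory = {}
staged_writes = staged_read(staged_table, staged_memory, b"blk", "k1", "k2")

# Proposed allocation: both hosts' keys correspond to the first address.
aliased_table = KeyTable()
aliased_table.allocate("client", "k1", 0x1000)
aliased_table.allocate("rnic", "k2", 0x1000)
aliased_memory = {}
direct_writes = direct_read(aliased_table, aliased_memory, b"blk", "k2")
```

In this model the staged path performs two writes and the direct path performs one, and in both cases the block ends up at the first address.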
The client device 100 is connected to a storage server 300, e.g., via a network connection. The storage server 300 includes a storage rNIC 310, a storage memory 320, and a plurality of storage devices 330-1 through 330-N, where N is an integer equal to or greater than 1, generally referred to as storage device 330. A storage device 330 may be, for example, a solid state storage device (SSD), and may be addressable as a plurality of data blocks with associated addresses.
The client device 100 may initiate a storage operation, such as a read operation, with a storage device 330. Typically, the client device processor 110 will send a request to read a data block from a remote storage 330. The request may include a key which is associated with an address of a memory 120 of the client device to which the data block is to be written. In some RDMA schemes, the rNIC 200 is configured to then send a request to the storage server 300 using a second key associated with a second address of the memory 120, or associated with a first address of memory 220. The storage server 300 reads the data block from the storage device 330, and sends the data block to the address associated with the second key. The rNIC 200 will then send the data block from the address associated with the second key to the address associated with the first key. This is wasteful of system resources: it incurs unnecessary write operations, which not only add time to the operation but also execute superfluous write commands that can accelerate drive failure. In embodiments where the second key is associated with a first address of the rNIC memory 220, a further write operation is required between the rNIC memory 220 and the client device memory 120. This may introduce a further bottleneck, and will unnecessarily limit the operation to the speed of whichever memory is slower (typically, the memory 220 of the rNIC 200 is slower).
Therefore it is proposed that the rNIC 200 associate the first key and the second key with a first address of the memory 120. This way, when the rNIC 310 of the storage server 300 performs the read operation, the data block will be written directly to its final destination. In some embodiments, the client device 100 may be a virtual machine, container, or other such virtualization. In such embodiments, the client device 100 is not an actual physical machine, but rather a virtualization itself, which may be identifiable to the rNIC 200 as a machine. In such embodiments, a plurality of virtualizations may each communicate as a host with the rNIC 200, for example as a single-root I/O virtualization.
In an embodiment, the network 410 may be configured to provide connectivity of various sorts, as may be necessary, including but not limited to, wired and/or wireless connectivity, such as, for example, local area network (LAN), wide area network (WAN), metro area network (MAN), worldwide web (WWW), Internet, and any combination thereof, as well as cellular connectivity.
The network 410 further provides connectivity to a plurality of storage servers 300-1 through 300-K, where K is an integer equal to or greater than 1. Each storage server 300 includes one or more storage devices, as shown in
At S510, a first key associated with a first address of a memory is allocated to a first host, e.g., by an rNIC. The first host may be, for example, a processor of a client device, a virtual machine, container, or other virtualization. The first key may be an identifier used by the rNIC to uniquely identify the first address by the first host. The first key may be defined implicitly, e.g., on the rNIC, so that the second host may be able to refer to it via RDMA.
At S520, a second key associated with the first address of the memory is allocated to a second host. The second host may be a processing circuitry (or core of a multi-core processor) of an rNIC. In an embodiment, the second key may be identical to the first key, so that the same key may be allocated twice such that both hosts can refer to the same address equally. In this embodiment, the second host is allocated a second key, and the second key is associated with the first address, thereby allowing both hosts to access the same memory address.
At S530, a check is performed to determine if additional keys should be allocated. If so, execution continues at S510, otherwise execution terminates. In a non-limiting example, the check is performed by determining the size of the address space, and then checking if there are unallocated addresses.
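The allocation loop of S510 through S530 might be sketched as follows. This is a minimal Python illustration under stated assumptions: `allocate_keys` is a hypothetical helper, keys are modeled as consecutive integers, and the sketch follows the embodiment in which the second key is identical to the first. The check for unallocated addresses is placed at the top of the loop for brevity.

```python
def allocate_keys(addresses):
    """Return {host: {key: address}} with one shared key per address."""
    tables = {"first_host": {}, "second_host": {}}
    unallocated = list(addresses)
    next_key = 0
    while unallocated:                          # S530: unallocated addresses remain?
        address = unallocated.pop(0)
        key = next_key
        next_key += 1
        tables["first_host"][key] = address     # S510: key allocated to the first host
        tables["second_host"][key] = address    # S520: same key allocated to the second host
    return tables


tables = allocate_keys([0x1000, 0x2000, 0x3000])
```

After the loop finishes, both hosts hold the same key-to-address mapping, so either host can name a given address with the same key.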
When initiating an RDMA operation, a first host will typically send the rNIC a request with a first key, e.g., allocated by an rNIC. The first host in this example is the client device. A request for a data block stored on a remote storage device accessible via a storage server over a network is generated in response, e.g., via the rNIC. The request of the rNIC, which is a second host, identifies the data block, and includes a second key (allocated to the second host) associated with the first address of the memory. When the request is received, e.g., by the storage server, the storage server responds by sending the data directly to the first address rather than going through an intermediate address which would typically be associated with the second key.
Upon completion of the response on the client memory, a completion notification is generated by the second host. The second host is configured to respond by generating a completion notification on the first host. The client device (i.e. first host) may communicate with the rNIC using different protocols, for example NVMe, NVMe-over-fabrics, or iSCSI. In certain embodiments, a plurality of first hosts may communicate with the rNIC, each utilizing a different protocol.
At S610, an instruction is generated, e.g., by a client device, for an rNIC to write a data block to an address of a remote storage device, such as a remote storage device connected to a remote storage server. The instruction may include a first key associated with a first address of a client device. In some embodiments, the client device may not be aware that the remote storage device is in fact remote. For example, the remote storage device may be exposed to the client device as a virtual storage, local storage, and the like. A virtual storage may include virtual addresses, each corresponding to a single physical address of any of a plurality of remote storage devices.
At S620, a request is generated for the remote storage device to write the data block associated with the first key to the remote storage device.
At S630, a completion response based on the first key is received in response to the remote storage device writing the data block. By using the first key, and not as previous solutions suggest using an intermediate second key, the client device receives the data block directly (without going through a memory allocated to the rNIC) and a completion indication may be sent to the rNIC. In certain embodiments, the rNIC may then generate a completion indication for the request from the first host.
At S640, a check is performed to determine if another instruction should be executed. If so, execution continues at S610, otherwise execution terminates. In some embodiments, if another instruction should be executed, execution may continue at S710 of
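The write flow of S610 through S630 can be modeled in a few lines. The Python sketch below is illustrative only: `StorageServer`, `RNic`, and their methods are hypothetical names, client memory and the key table are dictionaries, and the network is replaced by direct method calls. It assumes the storage server obtains the block from the client address that the first key resolves to.

```python
class StorageServer:
    """Toy remote storage server holding blocks by remote address."""

    def __init__(self):
        self.blocks = {}

    def handle_write(self, client_memory, key_table, key, remote_address):
        # Take the block straight from the address the first key resolves to.
        self.blocks[remote_address] = client_memory[key_table[key]]
        return {"status": "done", "key": key}


class RNic:
    """Toy rNIC that forwards write instructions and relays completions."""

    def __init__(self, key_table, server):
        self.key_table = key_table
        self.server = server

    def write(self, client_memory, key, remote_address):
        # S620: request the remote storage to write the block tied to the key.
        completion = self.server.handle_write(
            client_memory, self.key_table, key, remote_address
        )
        # S630: completion based on the first key, relayed to the first host.
        return {"to": "first_host", **completion}


# S610: the client issues a write instruction carrying the first key.
client_memory = {0x1000: b"payload"}
key_table = {"k1": 0x1000}
server = StorageServer()
rnic = RNic(key_table, server)
write_result = rnic.write(client_memory, "k1", remote_address=42)
```

No rNIC-owned buffer appears anywhere in the sketch: the only copy of the block that is made is the one stored by the server.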
At S710, an instruction is generated, e.g., from a client device, for an rNIC to read a data block from an address of a remote storage device connected to a remote storage server. The request may include a first key, allocated by the rNIC, associating the first key with a first address of the client device, into which the data block should be read. In some embodiments, the client device may not be aware that the remote storage device is in fact remote. For example, the remote storage device may be exposed to the client device as a virtual storage, local storage, etc. A virtual storage may include virtual addresses, each corresponding to a single physical address of any of a plurality of remote storage devices.
At S720, a request is generated for the remote storage device to read the data block into the memory address associated with the first key.
At S730, the remote storage server is caused to write the data block based on the first key (i.e., in the memory of the first host) in response to reading the data block from the remote storage device. By using the first key rather than an intermediate second key, the first host receives the data block in a shorter time period. A completion indication is generated, e.g., by the remote storage device to the rNIC, which then generates a completion indication to the first host.
At S740, a check is performed to determine if another instruction should be executed. If so, execution continues at S710, otherwise execution terminates. In some embodiments, if it is determined that another instruction should be executed, execution continues at S610 of
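The read flow of S710 through S730 can be sketched in the same toy style. In the Python below, `read_block` is a hypothetical function, and the client memory, key table, and remote storage are plain dictionaries. It is an assumption-laden illustration of the described steps, not an RDMA implementation.

```python
def read_block(client_memory, key_table, storage_blocks, first_key, remote_address):
    """Read a remote block directly into the address tied to first_key."""
    # S720: the request names the remote block and carries the first key.
    request = {"key": first_key, "remote_address": remote_address}

    # S730: the server writes the block into the first host's memory,
    # at the address the first key resolves to (no intermediate address).
    destination = key_table[request["key"]]
    client_memory[destination] = storage_blocks[request["remote_address"]]

    # The server signals completion to the rNIC, which relays it to the host.
    server_completion = {"key": first_key, "status": "done"}
    return {"to": "first_host", **server_completion}


read_memory = {}
read_keys = {"k1": 0x1000}
remote_blocks = {7: b"remote-data"}
read_result = read_block(read_memory, read_keys, remote_blocks, "k1", 7)
```

After the call, the block sits at the first address and nothing else has been written to client memory.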
The methods described above may extend to more complex operations performed on a storage device. For example, a first host may request a data block be written to a storage device. The request includes a first key associated with a first address (or other identifier) of a data block to be written. The storage device includes a data redundancy scheme, in this example mirrored volumes. In response to receiving the request, the rNIC generates a first write instruction for a first remote storage device, and a second write instruction for a second remote storage device, which is a mirror of the first remote storage device. In other embodiments, the rNIC may generate a plurality of write instructions, each for a remote storage device. Each write instruction includes the first key. The rNIC then receives completion indications from each remote storage device, and generates a completion indication for the first host when all (or, in some embodiments, a certain number) of the remote storage devices have returned a completion indication.
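The mirrored write and its completion rule might look like the following sketch. `MirrorDevice` and `mirrored_write` are hypothetical names, and the `required` parameter is an assumed way of expressing "all, or a certain number" of completions. Device failure is modeled crudely with an `online` flag.

```python
class MirrorDevice:
    """Toy remote storage device that may be offline."""

    def __init__(self, online=True):
        self.online = online
        self.blocks = {}

    def write(self, key, remote_address, block):
        # Each write instruction carries the first key; returns True on completion.
        if not self.online:
            return False
        self.blocks[remote_address] = block
        return True


def mirrored_write(devices, key, remote_address, block, required=None):
    """Issue one write per device; complete once enough devices have completed."""
    if required is None:
        required = len(devices)          # default: wait for every mirror
    completions = sum(
        1 for device in devices if device.write(key, remote_address, block)
    )
    return completions >= required


healthy = [MirrorDevice(), MirrorDevice()]
all_done = mirrored_write(healthy, "k1", 5, b"blk")

degraded = [MirrorDevice(), MirrorDevice(online=False)]
strict_done = mirrored_write(degraded, "k1", 5, b"blk")
relaxed_done = mirrored_write(degraded, "k1", 5, b"blk", required=1)
```

With both mirrors online the host sees a completion. With one mirror offline, the strict rule withholds it, while a threshold of one completion still reports success.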
In another example, the client device generates a request to write a new data block at a first memory address to a storage device having erasure coding protection, such as use of a parity volume, containing blocks which are generated by calculating a parity between, for example, a first block from one storage device, and a second block from another storage device. In such an example, a request to write the new data block is sent to the rNIC from the first host, associated with a first key. The second host (e.g., rNIC processor) may determine to write the data block on a first storage device, and generate a parity to be written on a second storage device. The rNIC may generate a parity by reading an old parity block from the second storage device, and generating a new parity block between the old parity block and the new data block (to remove the old data block from the parity). In some embodiments, the new parity block is generated by the rNIC; in others, it may be generated by the parity storage device. Once the rNIC receives a completion indication from the first storage device (where the data block was written to) and the second storage device (where the new parity block is written to), the second host generates a completion indication to the first host. The method described in U.S. patent application Ser. No. 15/684,439, assigned to the common assignee and incorporated by reference herein, can be modified using the teachings herein as indicated by the above example.
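As a worked illustration of the parity update, the sketch below assumes simple bytewise XOR parity across two data blocks. The disclosure does not fix a particular code, so XOR is an assumption, and `xor_blocks` and `updated_parity` are hypothetical helpers. Under that assumption, the old data block is removed from the parity and the new one added by XORing all three blocks together.

```python
def xor_blocks(a, b):
    """Bytewise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def updated_parity(old_parity, old_data, new_data):
    """Remove old_data from the parity, then fold new_data in."""
    return xor_blocks(xor_blocks(old_parity, old_data), new_data)


# Parity initially protects old_data together with a block on another device.
other_block = bytes([0x0F, 0xF0])
old_data = bytes([0xAA, 0x55])
new_data = bytes([0x12, 0x34])
old_parity = xor_blocks(old_data, other_block)

new_parity = updated_parity(old_parity, old_data, new_data)
```

The updated parity equals the XOR of the new data block with the untouched block on the other device, so the other device never has to be read during the update.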
The various embodiments disclosed herein can be implemented as hardware, firmware, software, or any combination thereof. Moreover, the software is preferably implemented as an application program tangibly embodied on a program storage unit or computer readable medium consisting of parts, or of certain devices and/or a combination of devices. The application program may be uploaded to, and executed by, a machine comprising any suitable architecture. Preferably, the machine is implemented on a computer platform having hardware such as one or more central processing units (“CPUs”), a memory, and input/output interfaces. The computer platform may also include an operating system and microinstruction code. The various processes and functions described herein may be either part of the microinstruction code or part of the application program, or any combination thereof, which may be executed by a CPU, whether or not such a computer or processor is explicitly shown. In addition, various other peripheral units may be connected to the computer platform such as an additional data storage unit and a printing unit. Furthermore, a non-transitory computer readable medium is any computer readable medium except for a transitory propagating signal.
As used herein, the phrase “at least one of” followed by a listing of items means that any of the listed items can be utilized individually, or any combination of two or more of the listed items can be utilized. For example, if a system is described as including “at least one of A, B, and C,” the system can include A alone; B alone; C alone; A and B in combination; B and C in combination; A and C in combination; or A, B, and C in combination.
All examples and conditional language recited herein are intended for pedagogical purposes to aid the reader in understanding the principles of the disclosed embodiment and the concepts contributed by the inventor to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. Moreover, all statements herein reciting principles, aspects, and embodiments of the disclosed embodiments, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure.
This application claims the benefit of U.S. Provisional Application No. 62/629,825 filed on Feb. 13, 2018, the contents of which are hereby incorporated by reference. This application is also a continuation in part of: (a) U.S. patent application Ser. No. 15/975,379 filed on May 9, 2018, now pending, which is a continuation of U.S. patent application Ser. No. 14/726,919 filed Jun. 1, 2015 now U.S. Pat. No. 9,971,519, which claims the benefit of U.S. Provisional Application Nos.: 62/126,920 filed on Mar. 2, 2015, 62/119,412 filed on Feb. 23, 2015, 62/096,908 filed on Dec. 26, 2014, 62/085,568 filed on Nov. 30, 2014, and 62/030,700 filed Jul. 30, 2014; (b) U.S. patent application Ser. No. 14/934,830 filed on Nov. 6, 2015, which claims the benefit of U.S. Provisional Application 62/172,265 filed Jun. 8, 2015; and (c) U.S. patent application Ser. No. 15/684,439 filed Aug. 23, 2017, which claims the benefit of U.S. Provisional Application No. 62/381,011 filed Aug. 29, 2016. All of the applications referenced above are herein incorporated by reference.
Number | Date | Country | |
---|---|---|---|
20190187916 A1 | Jun 2019 | US |
Number | Date | Country | |
---|---|---|---|
62629825 | Feb 2018 | US | |
62381011 | Aug 2016 | US | |
62172265 | Jun 2015 | US | |
62126920 | Mar 2015 | US | |
62119412 | Feb 2015 | US | |
62096908 | Dec 2014 | US | |
62085568 | Nov 2014 | US | |
62030700 | Jul 2014 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 14726919 | Jun 2015 | US |
Child | 15975379 | US | |
Parent | 16270239 | US | |
Child | 15975379 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 15975379 | May 2018 | US |
Child | 16270239 | US | |
Parent | 15684439 | Aug 2017 | US |
Child | 16270239 | US | |
Parent | 14934830 | Nov 2015 | US |
Child | 15684439 | US |