1. Technical Field
The present invention relates generally to RDMA (remote direct memory access) systems, and more particularly relates to an asynchronous completion notification system for an RDMA network interface card.
2. Related Art
RDMA (remote direct memory access) is a network interface card (NIC) feature that lets one computer directly place information into the memory of another computer. The technology reduces latency by minimizing demands on bandwidth and processing overhead. Traditional hardware and software architecture imposes a significant load on a server's CPU and memory because data must be copied between the kernel and application. Memory bottlenecks become more severe as connection speeds exceed the processing power and memory bandwidth of servers.
RDMA gets around this by implementing a reliable transport protocol in hardware on the NIC and by supporting zero-copy networking with kernel bypass. Zero-copy networking lets the NIC transfer data directly to or from application memory, eliminating the need to copy data between application memory and the kernel.
Kernel bypass lets applications issue commands to the NIC without having to execute a kernel call. The RDMA request is issued from user space to the local NIC and over the network to the remote NIC without requiring any kernel involvement. This reduces the number of context switches between kernel space and user space while handling network traffic.
The RDMA protocol is defined by the RDMA Consortium, which, in part, maps the RDMA features of InfiniBand onto Ethernet. The RDMA and InfiniBand standards provide the concept of a completion queue (CQ) for holding “consumer reports” about the completion of requests posted to the work (i.e., send or receive) queue. Each entry in the CQ is called a completion queue entry (CQE). The standards also provide the concept of an asynchronous completion notification mechanism, which is used to notify the consumer when a new CQE is placed in the CQ. In this mode of operation, the consumer can register an asynchronous completion notification handler, which is called when:
As part of any efficient software implementation using an Asynchronous Completion Notification mechanism, the RNIC (RDMA network interface card) needs to guarantee that, given proper software behavior, no CQE will ever be left unattended in the CQ, i.e., each CQE placed in the CQ will either be retrieved by a Poll For Completion or be indicated by a call to the Asynchronous Completion Notification Handler routine registered by the software.
Unfortunately, using known implementation techniques, situations may arise wherein one or more CQEs may be left unattended in the CQ. Accordingly, a need exists for an Asynchronous Completion Notification system that can guarantee that no CQE will ever be left unattended in the CQ.
The present invention addresses the above-mentioned problems, as well as others, by providing an asynchronous completion notification system and method that guarantees that no CQE will ever be left unattended in the CQ. In a first aspect, the invention provides an asynchronous completion notification system for use in an RDMA (remote direct memory access) network interface card (RNIC) having a completion queue (CQ) for holding completion queue entries (CQEs), comprising: a system for storing a first CQE number of the most recent CQE placed into the CQ; a system for storing a second CQE number of the most recent CQE retrieved from the CQ; a system for packaging the second CQE number with each request completion notification verb that is issued; and a processing system for processing the request completion notification verb, wherein the processing system compares the first CQE number with the second CQE number to determine whether asynchronous completion notification should be immediately performed.
In a second aspect, the invention provides a method for implementing asynchronous completion notification in an RDMA (remote direct memory access) network interface card (RNIC) having a completion queue (CQ) for holding completion queue entries (CQEs), comprising: storing a first CQE number of a most recent CQE placed into the CQ; storing a second CQE number of a most recent CQE retrieved from the CQ; issuing a request for completion notification; packaging the second CQE number with the request; and processing the request, wherein the processing step compares the first CQE number with the second CQE number to determine whether asynchronous completion notification should be immediately performed.
In a third aspect, the invention provides a system for implementing asynchronous completion notification in an RDMA (remote direct memory access) network interface card (RNIC) having a completion queue (CQ) for holding completion queue entries (CQEs), comprising: means for storing a first CQE number of a most recent CQE placed into the CQ; means for storing a second CQE number of a most recent CQE retrieved from the CQ; means for issuing a request for completion notification; means for packaging the second CQE number with the request; and means for processing the request, wherein the processing means compares the first CQE number with the second CQE number to determine whether asynchronous completion notification should be immediately performed.
These and other features of this invention will be more readily understood from the following detailed description of the various aspects of the invention taken in conjunction with the accompanying drawings in which:
Overview
Described below is an asynchronous completion notification system and method applicable to an RNIC (or other system implementing RDMA or InfiniBand), which guarantees that no CQE will ever be left unattended in the CQ. It is assumed for the purpose of this description that the reader has an understanding of the RDMA protocol and its implementation in an RNIC environment. The RDMA protocol is available on the Web at <www.rdmaconsortium.org/home>.
Exemplary Software Approaches
Various software approaches can be utilized to facilitate operations of an asynchronous completion notification system. As noted above, in an RDMA environment, an asynchronous completion notification system needs to guarantee that, given proper software behavior, no CQE will ever be left unattended in the CQ. A simple software operational mode to address this would be to:
The problem with this approach is a possible race between software calling Request Completion Notification and hardware placing the next CQE in the CQ. When hardware receives the completion notification request, it cannot associate the request with the CQEs placed to the CQ (i.e., it cannot tell whether a given CQE was placed before or after the Request Completion Notification verb was called). The race may cause an undershoot condition in which a CQE is placed in memory but no notification is raised. In a worst-case scenario, a deadlock could occur: the application waits for a notification for the last packet, but that notification is never asserted.
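The race can be illustrated with a minimal Python sketch. The `NaiveCQ` class and its method names are hypothetical; the sketch assumes software drains the CQ and then requests notification, and that hardware simply arms a flag without any CQE numbering. It models only the behavior described above and is not an RNIC implementation.

```python
class NaiveCQ:
    """Hypothetical model: notification is armed with no CQE numbering."""

    def __init__(self):
        self.entries = []        # CQEs currently held in the CQ
        self.armed = False       # set by Request Completion Notification
        self.notifications = 0   # number of handler invocations

    def place_cqe(self, cqe):
        """Hardware side: place a CQE and notify only if armed."""
        self.entries.append(cqe)
        if self.armed:
            self.armed = False
            self.notifications += 1

    def poll(self):
        """Software side: Poll For Completion."""
        return self.entries.pop(0) if self.entries else None

    def request_notification(self):
        """Software side: Request Completion Notification."""
        self.armed = True


cq = NaiveCQ()
cq.place_cqe("cqe-1")
cq.poll()                  # retrieves cqe-1
cq.poll()                  # returns None, software believes the CQ is empty
cq.place_cqe("cqe-2")      # hardware places a CQE before the request arrives
cq.request_notification()  # hardware cannot tell that cqe-2 is still unpolled
```

At the end of this sequence `cqe-2` remains in the CQ and no notification has been raised; it stays unattended unless a further CQE happens to arrive.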
A second possible software approach, which would help address the problem identified with the first approach, is to apply a race resolution (similar to that used in addressing race problems in interrupt handling flows). In this case, race resolution is implemented in software by performing the following steps:
Note that the problem could be overcome by taking the overshoot approach in which the worst-case delays are calculated between the notification and the last Poll For Completion request. However, the overshoot approach may result in a high probability of cases in which Poll For Completion is invoked when actually there is no new available CQE in the CQ. Accordingly, the present invention provides a more efficient asynchronous completion notification system that does not incur this overhead.
Asynchronous Completion Notification System
Referring to
Asynchronous completion notification system 10 operates in conjunction with a completion queue (CQ) 12, which holds completion queue entries (CQEs) 14. CQEs 14 are typically placed to the CQ 12 by a hardware device, shown here as a CQ placement system 24. CQEs 14 are typically retrieved out of the CQ 12 via software 26, i.e., using a poll for completion verb, depicted here as a CQE retrieval system 28. In the example depicted in
Software 26 also includes a completion request system 32 for issuing a request 34 for asynchronous completion notification to the notification processing system 11. Request 34 generally comprises a request completion notification verb 35, which is defined in the RDMA protocol. Upon receiving the request 34, notification processing system 11 must determine when asynchronous completion notification (ACN) should be performed by the asynchronous completion notification report system 22. This process is facilitated by a comparison system 20, which examines CQE data to determine whether or not one or more CQEs 14 were placed in the CQ 12 since the last poll for completion. An exemplary methodology for implementing this process is described below with reference to both
For each CQ 12, a counter (LastPlacedCQENumber) 16 is maintained. This counter is incremented whenever a new CQE is placed to the CQ 12, see step S1 of
When software 26 retrieves a CQE from the completion queue 12, the verb layer 31 stores the CQENumber of the last retrieved CQE in LastPolledCQENumber 30 (step S2). In the example shown in
If the LastPolledCQENumber 30 equals the LastPlacedCQENumber 16, then no CQE was placed to CQ since the last Poll For Completion, which preceded Request Completion Notification 35. Therefore, the asynchronous completion notification 32 should be performed only when the next CQE is placed to the CQ (step S6).
However, if the LastPolledCQENumber 30 is smaller than the LastPlacedCQENumber 16 (step S7), then one or more CQEs were placed to the CQ 12 after the software 26 called Request Completion Notification 35. Therefore, the asynchronous completion notification report system 22 is notified to immediately perform asynchronous completion notification (step S8).
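The comparison described above can be sketched in Python as follows. The class names `CompletionQueue` and `VerbLayer`, the method names, and the initial counter value of 0 are assumptions made for illustration; only the LastPlacedCQENumber and LastPolledCQENumber values and the comparison itself come from the description.

```python
class CompletionQueue:
    """Hypothetical hardware-side model of the CQ and notification logic."""

    def __init__(self):
        self.entries = []                # (CQE number, payload) pairs
        self.last_placed_cqe_number = 0  # LastPlacedCQENumber
        self.armed = False               # notify when the next CQE is placed
        self.notifications = 0           # number of handler invocations

    def place_cqe(self, payload):
        # Increment the counter whenever a new CQE is placed.
        self.last_placed_cqe_number += 1
        self.entries.append((self.last_placed_cqe_number, payload))
        if self.armed:
            self._notify()

    def request_notification(self, last_polled_cqe_number):
        if last_polled_cqe_number < self.last_placed_cqe_number:
            # At least one CQE has not been polled: notify immediately.
            self._notify()
        else:
            # Numbers are equal: defer until the next CQE is placed.
            self.armed = True

    def _notify(self):
        self.armed = False
        self.notifications += 1


class VerbLayer:
    """Hypothetical software-side model that tracks LastPolledCQENumber."""

    def __init__(self, cq):
        self.cq = cq
        self.last_polled_cqe_number = 0  # LastPolledCQENumber

    def poll(self):
        if not self.cq.entries:
            return None
        number, payload = self.cq.entries.pop(0)
        self.last_polled_cqe_number = number
        return payload

    def request_notification(self):
        # Package LastPolledCQENumber with the request.
        self.cq.request_notification(self.last_polled_cqe_number)


cq = CompletionQueue()
verbs = VerbLayer(cq)
cq.place_cqe("a")
verbs.poll()                  # LastPolledCQENumber becomes 1
cq.place_cqe("b")             # LastPlacedCQENumber becomes 2 before the request
verbs.request_notification()  # 1 < 2, so notification is immediate
```

Because the request carries the number of the last polled CQE, the ordering of the request relative to the placement of "b" no longer matters in this sketch: either the placement triggers the armed notification, or the comparison triggers an immediate one.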
In the example shown in
This implementation of the asynchronous completion notification system 10 allows software 26 to use the simple software approach described above, without any risk of missing a CQE placed to CQ 12. Namely:
If software 26 implements the race resolution approach,
It is further noted that the present embodiment can be extended to support both types of completion notifications, solicited and unsolicited defined by RDMA and InfiniBand. Solicited refers to a completion notification of a request requiring solicited notification, or request completed in error. Unsolicited refers to a completion notification of any other request. The asynchronous completion notification system 10 described above with reference to
When the software 26 issues a request 34, it not only passes the Request Completion Notification 35 and LastPolledCQENumber 30 (step S12), but also specifies a notification type 36, i.e., either solicited or unsolicited. Then, when the processing system 11 detects request completion notification (step S13), processing system 11 checks the type of completion notification at step S14 (i.e., solicited/unsolicited or both). For an unsolicited notification request, the methodology described above with reference to
For a solicited notification request, comparison system 20 examines LastPlacedSolicitedCQENumber 18 instead of LastPlacedCQENumber 16. As noted, LastPlacedSolicitedCQENumber 18 includes the CQE number of the last CQE that was identified as solicited (i.e., solicited CQE). If LastPolledCQENumber 30 is equal to or greater than the LastPlacedSolicitedCQENumber 18 (e.g., if LastPolledCQENumber was 115) (step S16), then no solicited CQE was placed to CQ 12 since the last Poll For Completion, which preceded Request Completion Notification 35. Therefore, the asynchronous completion notification should be performed only when the next solicited CQE is placed to the CQ (step S17).
However, if the LastPolledCQENumber 30 is smaller than the LastPlacedSolicitedCQENumber 18 (step S18), then one or more solicited CQEs were placed to the CQ 12 after the software 26 called Request Completion Notification 35. Therefore, the asynchronous completion notification report system 22 should immediately perform asynchronous completion notification (step S19).
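The solicited and unsolicited cases can be sketched together in Python. The `NotifyType` enum, the `NotificationProcessor` class, and all method names are hypothetical, and the sketch assumes counters start at 0 and that a request names exactly one notification type (the "both" case is not modeled).

```python
from enum import Enum


class NotifyType(Enum):
    SOLICITED = "solicited"
    UNSOLICITED = "unsolicited"


class NotificationProcessor:
    """Hypothetical model of the two-counter comparison."""

    def __init__(self):
        self.last_placed_cqe_number = 0            # LastPlacedCQENumber
        self.last_placed_solicited_cqe_number = 0  # LastPlacedSolicitedCQENumber
        self.armed_type = None                     # pending request type, if any
        self.notifications = 0

    def place_cqe(self, solicited):
        """Hardware side: number the new CQE and fire any armed request."""
        self.last_placed_cqe_number += 1
        if solicited:
            self.last_placed_solicited_cqe_number = self.last_placed_cqe_number
        if self.armed_type is NotifyType.UNSOLICITED:
            self._notify()
        elif self.armed_type is NotifyType.SOLICITED and solicited:
            self._notify()
        return self.last_placed_cqe_number

    def request_notification(self, last_polled_cqe_number, notify_type):
        # Choose which counter to compare against, based on the request type.
        if notify_type is NotifyType.SOLICITED:
            reference = self.last_placed_solicited_cqe_number
        else:
            reference = self.last_placed_cqe_number
        if last_polled_cqe_number < reference:
            self._notify()                 # a matching CQE is already waiting
        else:
            self.armed_type = notify_type  # wait for the next matching CQE

    def _notify(self):
        self.armed_type = None
        self.notifications += 1


p = NotificationProcessor()
p.place_cqe(solicited=False)   # CQE 1
p.place_cqe(solicited=True)    # CQE 2, last solicited CQE
p.place_cqe(solicited=False)   # CQE 3
# Software has polled through CQE 2, so no solicited CQE is pending.
p.request_notification(2, NotifyType.SOLICITED)
deferred = p.notifications     # still 0: notification is deferred
p.place_cqe(solicited=False)   # CQE 4 does not trigger a solicited request
p.place_cqe(solicited=True)    # CQE 5 triggers the deferred notification
```

In this sketch a solicited request ignores unsolicited CQEs entirely, which is why polling through CQE 2 is enough to defer the notification even though CQE 3 has not been polled.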
In the example depicted in
It is understood that the systems, functions, mechanisms, methods, engines and modules described herein can be implemented in hardware, software, or a combination of hardware and software. They may be implemented by any type of computer system or other apparatus adapted for carrying out the methods described herein. A typical combination of hardware and software could be a general-purpose computer system with a computer program that, when loaded and executed, controls the computer system such that it carries out the methods described herein. Alternatively, a specific use computer, containing specialized hardware for carrying out one or more of the functional tasks of the invention could be utilized. The present invention can also be embedded in a computer program product, which comprises all the features enabling the implementation of the methods and functions described herein, and which—when loaded in a computer system—is able to carry out these methods and functions. Computer program, software program, program, program product, or software, in the present context mean any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: (a) conversion to another language, code or notation; and/or (b) reproduction in a different material form.
The foregoing description of the preferred embodiments of the invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teachings. Such modifications and variations that are apparent to a person skilled in the art are intended to be included within the scope of this invention as defined by the accompanying claims.
Number | Date | Country |
---|---|---|
20050117430 A1 | Jun 2005 | US |