The present invention pertains to the field of computer architecture and more specifically to the efficient processing of RNIC interface (RI) management control operations (e.g. memory registration) required by RDMA (Remote Direct Memory Access) type work requests issued by an RNIC interface (RI) running on computer systems such as servers.
In complex computer systems, particularly those in large transaction processing environments, a group of servers is often clustered together over a network fabric that is optimized for sharing large blocks of data between the servers in the cluster. In such clustering fabrics, the data is transferred over the fabric directly between buffers resident in the host memories of the communicating servers, rather than being copied and packetized first by the operating system (OS) of the sending server and then being de-packetized and copied to memory by the OS of the receiving server in the cluster. This saves significant computing resources in the transacting servers in the form of OS overhead that may be applied to other tasks. This technique of establishing connections that bypass the traditional protocol stack resident in the OS of the transacting servers, and instead transferring data directly between specified buffers in the user memory of the transacting servers, is generally referred to as remote direct memory access or RDMA.
Different standards have been established defining the manner and the protocols by which direct memory connections between servers are securely established and taken down, as well as the manner in which data is transferred over those connections. For example, Infiniband is a clustering standard that is typically deployed as a fabric that is separate and distinct from fabrics handling other types of transactions between the servers and devices such as user computers or high-performance storage devices. Another such standard is the iWARP standard that was developed by the RDMA Consortium to combine RDMA type transactions with packet transactions using TCP/IP over Ethernet. Copies of the specifications defining the iWARP standard may be obtained at the Consortium's web site at www.rdmaconsortium.org. The iWARP specifications and other documents available from the RDMA Consortium web site are incorporated herein in their entirety by this reference. These and other RDMA standards, while differing significantly in their transaction formats, are typically predicated on a common paradigm called a queue pair (QP). The QP is the primary mechanism for communicating information about where data is located that should be sent or received using one of the standard RDMA network data transfer operations.
A QP is typically made up of a send queue (SQ) and a receive queue (RQ), and can also be associated with at least one completion queue (CQ). QPs are created when an application running on a local server issues a request to an RNIC interface (RI) that a memory transaction be processed that directly accesses host memory in the local server and possibly host memory in a remote server. The QPs are the mechanism by which work request operations associated with the processing of the transaction request made by the application are actually queued up, tracked and processed by the RNIC adapter.
The memory region(s) specified in a direct memory transaction are logically (although not typically physically) contiguous. Thus, the RI also coordinates retrieving a virtual to physical translation for the pages of physical memory actually used by a memory region and programs the RNIC adapter with this information so that the RNIC may directly access the actual physical locations in host memory that make up the memory region as if they were physically contiguous. Access privileges are also retrieved for that memory region and stored in the RNIC with the address translation information. This RI management process is known as memory registration. Most RI management processes, including memory registration, are presumed by the RDMA standards to be synchronous, such that they complete before any associated work request is processed by the RNIC on behalf of the application. Thus, a management process such as memory registration blocks the processing of any associated work request by the RNIC until it is complete.
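The translation that memory registration hands to the adapter can be illustrated with a minimal sketch. It is not taken from the disclosure: the 4096-byte page size, the `MemoryRegion` record, the `register_region` helper, and the dictionary used as a host page table are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size; the text does not specify one


@dataclass
class MemoryRegion:
    """Hypothetical adapter-side record produced by memory registration."""
    base_va: int        # starting virtual address of the region
    length: int         # length of the region in bytes
    page_list: list     # physical page addresses, in virtual-address order
    remote_write: bool  # example access privilege stored with the translation

    def translate(self, va: int) -> int:
        """Map a virtual address inside the region to a physical address."""
        if not (self.base_va <= va < self.base_va + self.length):
            raise ValueError("address outside registered region")
        # Offset measured from the start of the first page backing the region.
        offset = va - (self.base_va - self.base_va % PAGE_SIZE)
        page_index, page_offset = divmod(offset, PAGE_SIZE)
        return self.page_list[page_index] + page_offset


def register_region(base_va, length, page_table, remote_write=False):
    """Collect the physical pages backing [base_va, base_va + length).

    page_table is a hypothetical dict mapping virtual page number to the
    physical address of that page.
    """
    first_vpn = base_va // PAGE_SIZE
    last_vpn = (base_va + length - 1) // PAGE_SIZE
    pages = [page_table[vpn] for vpn in range(first_vpn, last_vpn + 1)]
    return MemoryRegion(base_va, length, pages, remote_write)
```

In this sketch a region that starts 100 bytes into one page and spans 8192 bytes is backed by three physical pages that need not be adjacent, which is why the number of addresses transferred to the adapter grows with the size of the region.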
Because memory registration operations (MR OPs) must access many of the same resources in the adapter that are also processing the execution of previously enqueued work requests, because they can be large in number, and because they can be quite time consuming to perform when the virtual to physical translations lead to many physical addresses which all must be transferred to and stored within the RNIC, the completion of memory registration operations may be significantly delayed. This forces the adapter to block further processing of work requests associated with the MR OPs for the entire length of the delay. These factors can significantly increase the overall transaction latency from the perspective of the application, and thus decrease throughput of the fabric in general. This may not be tolerable for many applications.
Therefore, it would be desirable to decrease the latency of RDMA type transactions (and thereby increase network throughput) between servers caused by the blocking of RNIC work requests while they await completion of requisite RI management transactions such as memory registration operations. It would be further desirable to achieve this reduced latency/increased throughput while maintaining compatibility with the specifications of RDMA protocols that require serial completion of memory registration operations prior to performing RDMA memory operations from and to those regions.
Processing of RDMA type network transactions between servers over a network typically requires that the memory regions comprising the source and target buffers for such transactions be pre-registered with their respective RDMA capable adapters through which the direct data placement transactions will be conducted. The memory registration process provides each adapter with a virtual to physical address translation for the pages of physical memory that make up the contiguous virtual memory region being specified in the RDMA operation, as well as the access privilege information associated with the memory region. Specifications for RDMA standard protocols, such as iWARP, require that this memory registration process be complete before the work request generated in response to the RDMA transaction specifying the memory region may be processed.
Embodiments of the present invention are disclosed herein that provide two separate pipelines. One is the traditional transmit and receive transaction pipeline used to process RDMA work requests, and the other is a management/control pipeline that is designated to handle RI control operations such as the memory registration process. Embodiments of the invention employ a separate QP-like structure, called a control QP (CQP), which interfaces with a control processor (CP) to form the pipeline designated to handle all control path traffic associated with the processing of work requests, including memory registration operations (MR OPs) and the creation and destruction of QPs used for posting and tracking RDMA transactions requested by applications running on the system.
In processing an RDMA memory transaction request from an application in accordance with embodiments of the invention, an RDMA verb is called that identifies the requisite RI management processes that must be executed to program the adapter (i.e. RNIC) in support of that memory transaction. Among these is typically a memory registration operation (MR OP) that is enqueued in a CQP of the adapter. Once the MR OP has been queued in the control path pipeline of the adapter to register the memory region specified by the memory transaction, a pending bit is set for that memory region and the call to the RDMA verb is returned. The RDMA transaction is posted to the appropriate QP and the RI generates a work request for the adapter specifying access to the memory region being registered by the pending MR OP. This work request is enqueued in the transaction pipeline of the adapter.
The processing of the work request is permitted to proceed as if the processing of the associated MR OP has already been completed. If the work request gets ahead of the MR OP, the pending bit associated with the memory region being registered will notify the adapter's work request transaction pipeline to stall (and possibly reschedule) completion of the work request until the processing of the MR OP for that memory region is complete. When the memory registration process for the memory region is complete, the pending bit for that memory region is reset and the adapter transaction pipeline is permitted to continue processing the work request using the newly registered memory region. Whenever the MR OP completes prior to the adapter transaction pipeline attempting to complete the QP work request, no transaction processing is stalled and the latency inherent in what has been traditionally performed as a serial process is completely hidden from the application requesting the RDMA memory transaction. This serves to lower the overall latency as well as increase the throughput of the network commensurately with the number and size of pending memory registration operations. At the same time, the memory registration process is guaranteed to complete before the work request is completed, thus maintaining compatibility with the RDMA specification.
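The interaction between the two pipelines described above can be sketched as a toy model. This is only an illustration under stated assumptions: the `Adapter` class, its method names, and the single-step scheduling are hypothetical and do not describe the adapter hardware; a stalled work request is simply placed back at the tail of the queue, which is one of many possible rescheduling policies.

```python
from collections import deque


class Adapter:
    """Toy model of a control pipeline (CQP) and a transaction pipeline (QP)."""

    def __init__(self):
        self.pending = {}    # memory region id -> pending bit
        self.cqp = deque()   # queued MR OPs, identified by region id
        self.qp = deque()    # queued work requests as (name, region id)
        self.completed = []  # names of completed work requests
        self.stalls = 0      # number of times a work request was rescheduled

    def register(self, region):
        """Registration verb: enqueue the MR OP, set the pending bit, return."""
        self.pending[region] = True
        self.cqp.append(region)

    def post(self, name, region):
        """Post a work request that references a memory region."""
        self.qp.append((name, region))

    def step_control(self):
        """Control processor completes one MR OP and resets its pending bit."""
        if self.cqp:
            self.pending[self.cqp.popleft()] = False

    def step_transaction(self):
        """Transaction pipeline attempts to complete one work request."""
        if not self.qp:
            return
        name, region = self.qp.popleft()
        # A region that was never registered raises KeyError in this sketch.
        if self.pending[region]:
            # Work request got ahead of the MR OP: stall and reschedule.
            self.stalls += 1
            self.qp.append((name, region))
        else:
            self.completed.append(name)
```

When `step_control` runs before `step_transaction` reaches the work request, `stalls` stays at zero and the registration latency is hidden; when the order is reversed, the work request is retried and still completes only after the pending bit has been reset.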
Common to both clustering implementations of
An application running on host processor (610,
Provided that the adapter resources (e.g. sufficient adapter memory 654,
Thus, the RI is now free to continue processing the RDMA type memory request operation to this memory region x even though the actual registration process may not yet have begun. The RI is now free to post a work request on the appropriate QP to initiate the processing of the transaction. This also involves a write to the WQE allocate register, which informs the CUWS 954,
Once received over the SQbus and scheduled for execution by the context update and work scheduler (CUWS) 954,
Thus, the memory registration process and the associated work request are able to proceed in parallel and independently of one another. The CP 964 is free to process management control operations (including the MR OPs) posted to the CQP 939, and the transaction pipeline (including the transmit (TX) 966 and receive (RX) 968 pipelines) proceeds with processing the QP work request (500,
Specific examples of the pipelined execution of work requests in parallel with the memory registration operations in accordance with embodiments of the invention are illustrated in
The example of
As previously mentioned, a call to the registration verb is returned after the foregoing steps have been performed, notwithstanding that the MR OP has not yet been processed. As shown in Row 2, this permits the RI running on the local server host processor 610,
As indicated in Row 3 of
Those of skill in the art will appreciate that the example of
In the example of
Once returned from the verb call, the RI is free to post a SEND OP on its QPN that advertises to the remote application running on the remote server that the data will be sourced from memory region x using an STag=x. Those of skill in the art will recognize that the STag (also known as a Steering Tag) is the format defined by the iWARP specification for identifying memory regions. This posted SEND also requests an RDMA write operation. This is indicated in Row 2 of the pipelined sequence. This posted SEND also includes a write to the WQE allocate register to notify the Context Update and Work Scheduler 954,
At some time in the future, the TX pipeline of the local adapter begins to process the SEND OP, but because this SEND OP does not require access to the memory region x, its processing does not need to be suspended notwithstanding that the MR OP has not yet completed. This step is indicated in Row 3 of the sequence. Sometime after, as indicated in Row 4, the remote node receives the SEND OP requesting the RDMA Write operation to the memory region x STag and this is posted on the SQ of the remote node's QP.
At some point in the future, as indicated in Row 5, the local server adapter's RX pipeline receives the RDMA write as a work request from the remote server, but because the memory region x is going to be the sink for this transaction, and because in this scenario the pending bit has yet to be cleared for memory region x because the MR OP has not been completed, the RX pipeline processing of this RDMA write work request is suspended until that happens. Finally, in Row 6, the MR OP has been completed and the pending bit has been cleared through mechanisms previously discussed, and thus the RDMA Write Op work request is resumed and completed to memory region x subsequently in Row 7.
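Rows 5 through 7 of this sequence can be sketched as follows. The `RxPipeline` class, its method names, and the use of a byte array as the registered buffer are hypothetical and serve only to illustrate the ordering; resuming suspended writes immediately when the pending bit is cleared is an assumed policy, not one prescribed by the text.

```python
class RxPipeline:
    """Toy receive pipeline that suspends inbound RDMA Writes on a pending bit."""

    def __init__(self):
        self.regions = {}    # STag -> {"pending": bool, "buf": bytearray}
        self.suspended = []  # inbound writes waiting as (stag, offset, data)

    def start_registration(self, stag, size):
        """MR OP has been enqueued: the region exists but is still pending."""
        self.regions[stag] = {"pending": True, "buf": bytearray(size)}

    def complete_registration(self, stag):
        """MR OP has completed: clear the pending bit and resume waiters."""
        self.regions[stag]["pending"] = False
        waiting = [w for w in self.suspended if w[0] == stag]
        self.suspended = [w for w in self.suspended if w[0] != stag]
        for write in waiting:
            self._place(*write)

    def rdma_write(self, stag, offset, data):
        """Handle an inbound RDMA Write; return True if placed immediately."""
        if self.regions[stag]["pending"]:
            self.suspended.append((stag, offset, data))
            return False
        self._place(stag, offset, data)
        return True

    def _place(self, stag, offset, data):
        """Place the payload directly into the registered buffer."""
        buf = self.regions[stag]["buf"]
        if offset < 0 or offset + len(data) > len(buf):
            raise ValueError("write outside registered region")
        buf[offset:offset + len(data)] = data
```

An inbound write that arrives while the pending bit is set leaves the buffer untouched, and the data is placed only after `complete_registration` is called, which mirrors the suspend in Row 5 and the resume and completion in Rows 6 and 7.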
Those of skill in the art will appreciate that it is much more likely that the MR OP will have been completed while the servers are exchanging operations (i.e. Rows 3, 4 and 5) and that the completion of the RDMA transaction will not be held up. Moreover, it should be appreciated that the scenario illustrated in
RDMA Read operations are similar to the RDMA Write operations as shown in
As shown in the block diagram of
The schedules are effectively developed in a work queue manager (WQM) 1025. The WQM 1025 handles scheduling for all transmissions of transactions of all protocol types in the protocol engine 901. One of the main activities of the WQM 1025 is to determine when data needs to be retrieved from the adapter memory 654,
A TCP off-load engine (TOE) 1035 includes sub modules of transmit logic and receive logic to handle processing for accelerated TCP/IP connections. The receive logic parses the TCP/IP headers, checks for errors, validates the segment, processes received data, processes acknowledges, updates RTT estimates and updates congestion windows. The transmit logic builds the TCP/IP headers for outgoing packets, performs ARP table look-ups, and submits the packet to the transaction switch 970,
Typically the host operating system provides the adapter 650 with a set of restrictions defining which user-level software processes are allowed to use which host memory address ranges in work requests posted to the adapter 650. Enforcement of these restrictions is handled by an accelerated memory protection (AMP) module 1028. The AMP module 1028 validates the iWARP STags using the memory region table (MRT) 980,
As previously discussed, when work has been placed on a QP or a CQP, a doorbell is rung to inform the protocol engine 901 that work has been placed in those queues that must be performed. Doorbell 1005 is provided to form an interface between the host CPU 610,
The CP 964 has the capability to initialize and destroy QPs and memory windows and regions. As previously discussed, while processing RDMA QP transactions, the iWARP module 1030 and other QP transaction pipeline components monitor the registration status of the memory regions as maintained in the MRT in the adapter memory and will stall any QP work requests referencing memory regions for which registration has not yet completed (i.e. for which the pending bit is still set). Stalled QP work requests can be rescheduled in any manner known to those of skill in the art. A rescheduled QP work request will be permitted to complete once a check shows that the pending bit for the memory region it references has been cleared.
A second processor is the out-of-order processor (OOP) 1041. The out-of-order processor 1041 is used to handle the problem of TCP/IP packets being received out-of-order and is responsible for determining and tracking the holes and properly placing new segments as they are obtained. A transmit error processor (TEP) 1042 is provided for exception handling and error handling for the TCP/IP and iWARP protocols. The final processor is an MPA reassembly processor 1044. This processor 1044 is responsible for managing the receive window buffer for iWARP and processing packets that have MPA FPDU alignment or ordering issues.
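The hole tracking attributed to the out-of-order processor can be illustrated with a short sketch. It is not the disclosed implementation: the function names are hypothetical, byte ranges are half-open `(start, end)` tuples, and TCP sequence number wraparound is ignored for simplicity.

```python
def add_segment(received, start, end):
    """Merge the byte range [start, end) into a sorted list of received ranges."""
    merged = []
    for s, e in received:
        if e < start or s > end:
            # Disjoint and not adjacent to the new segment: keep unchanged.
            merged.append((s, e))
        else:
            # Overlapping or touching: absorb into the new segment.
            start, end = min(s, start), max(e, end)
    merged.append((start, end))
    merged.sort()
    return merged


def holes(received, next_expected):
    """Return the gaps between next_expected and the highest byte received."""
    gaps = []
    cursor = next_expected
    for s, e in received:
        if s > cursor:
            gaps.append((cursor, s))
        cursor = max(cursor, e)
    return gaps
```

For example, after bytes 100 to 200 and 300 to 400 have arrived, `holes` reports the gap from 200 to 300; once that segment arrives, the three ranges merge into one and no hole remains.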
Embodiments of the present invention have been disclosed herein that provide a pipeline for handling management control operations such as memory registration that is independent of the one that handles QP work requests generated for RDMA type memory transactions. In embodiments of the invention, the queue pair paradigm is leveraged to make integration of the control pipeline with the QP work request pipeline more straightforward. The QP work request pipeline monitors the completion of pending memory registration operations for each memory region, and stalls the processing of any QP transactions using memory regions for which registration has not completed. Because most of the control operations will complete before the processing of their associated QP work requests completes, the latency that is typically associated with the control operations such as memory registration is eliminated and throughput of the network is increased. Because the processing of those QP work requests that do win the race may be suspended and rescheduled, the serial nature of the registration process is still maintained per existing RDMA standards, and the mechanism is hidden from the applications running on the servers in a network such as a server cluster.
It will be understood from the foregoing description that modifications and changes may be made in various embodiments of the present invention without departing from its true spirit. The descriptions in this specification are for purposes of illustration only and are not to be construed in a limiting sense. The scope of the present invention is limited only by the language of the following claims.
Number | Date | Country | |
---|---|---|---|
20070226750 A1 | Sep 2007 | US |