1. Field of the Invention
The invention relates to data transfer between network devices, and more particularly to direct memory access transfers between two network devices.
2. Description of the Related Art
A converged network would support the following four broad classes of traffic:
1) Data networking also known as IP (TCP/IP) networking,
2) Storage,
3) High-performance computing/clustering (HPC) and
4) Management (especially sideband management traffic)
HPC traffic depends heavily on the application and the manner in which parallelism has been extracted from it. On one end is the need for very low latency, small message communication. At the same time, different nodes running a parallelized application may exchange fairly large amounts of data (Oracle RAC, the Lustre file system, etc.).
The data exchanges may be, at least partially, addressed by RDMA (Remote Direct Memory Access), in which one end of an application can cause a large amount of data to be moved between its local memory and that of its remote peer with low CPU utilization.
The high-end of the HPC market is seeing increasing penetration by InfiniBand, which offers low latency, RDMA, and high bandwidth. But the use of InfiniBand runs counter to the trend toward a converged network based on a single underlying layer 2 transport such as Ethernet.
The iWARP (Internet Wide Area RDMA Protocol) protocol suite provides support for RDMA over a layer 2 Ethernet environment and is layered on top of the TCP protocol. However, it is a fairly heavyweight protocol, since TCP is designed to run over lossy networks with long round-trip delays and therefore requires significant bookkeeping. Thus iWARP is not a satisfactory solution due to cost and performance issues.
The development of Converged Enhanced Ethernet (CEE) has brought new capabilities to Ethernet in the data center, particularly the development of Fibre Channel over Ethernet (FCoE) to address the storage requirement, instead of iSCSI over TCP. iSCSI has similar drawbacks to iWARP due to the inclusion of TCP. FCoE provides a high performance protocol over a layer 2 Ethernet network, so that a CEE network can readily provide data networking using IP and storage using FCoE, but a satisfactory solution is still needed for RDMA support in HPC environments.
iWARP RDMA is based on two elements. First is a suite of layered protocols specified by the IETF that enable one application end-point to safely read and write memory of another end-point, with no application interaction during the data transfer phase. The suite of protocols consists of RDMAP (RDMA Protocol) and Direct Data Placement (DDP) at the application layer. These are layered on top of different transports. The transport layer is either a combination of MPA (Marker PDU Aligned) layered on top of TCP, or SCTP (Stream Control Transmission Protocol) by itself. Second is a service access model that provides a virtual network interface to applications. This model is specified in the RDMA Consortium's RNIC (RDMA Network Interface Controller) Verbs specification, and provides a safe, connection oriented, virtual network interface that can be used by applications directly from user space.
As mentioned earlier, there are two alternatives for the transport layer of iWARP. The most common layering is a combination of TCP and MPA. The MPA layer provides a “framing” mechanism on top of TCP, i.e., it changes TCP from a byte-stream oriented protocol to a “packetized” or “framing” protocol. The other alternative for the transport layer is to use SCTP. However, there appear to be no commercial implementations of SCTP.
Above the transport is the DDP layer which provides a generic mechanism for sending tagged (direct placement) and untagged messages from the sender to the receiver. In addition, the DDP layer specifies memory protection semantics. The RDMAP layer on top of the DDP layer adds the specific capabilities required to support the RNIC Verbs, such as RDMA Read, RDMA Write, Send, Send with Invalidate, etc.
The service interface provided by the RDMAP forms an integral component of iWARP/RDMA. This is sometimes called the RNIC API or Verbs. Multiple protocols can use the Verbs to provide richer services, such as iSCSI over RDMA (iSER); NFS over RDMA; MPI, which is a very common middleware for clustering; Sockets Direct Protocol (SDP), which provides a sockets-like API; Oracle Reliable Datagram Service (RDS) for Oracle's RAC cluster database product; etc.
The iWARP protocol layers are a modified TCP implementation, an MPA layer, a DDP layer and an RDMAP layer. The TCP implementation is expected to be modified for iWARP. The modifications are to allow for Upper Layer Protocol Data Units (ULPDUs) to be aligned with TCP segment boundaries. In a traditional TCP implementation, the alignment between Upper Layer PDUs and TCP headers is typically not maintained. The other modification is to allow TCP to pass an incoming out-of-order Upper Layer PDU to the MPA layer, something a traditional TCP implementation will not do. These modifications together are intended to convert TCP from a byte-stream oriented protocol to a “reliable” datagram type protocol, without changing the wire protocol of TCP.
The MPA Layer adds a protocol header and markers. The purpose is two-fold. One is to provide a framing mechanism for iWARP. This is accomplished through the definition of a header for an iWARP frame (called the Framing PDU) and a trailer CRC. The other purpose is to provide for recovery of the Framing PDU (FPDU) header when TCP segments are received out-of-order. This is accomplished by insertion of a periodic marker in the TCP byte stream. The MPA fields include the ULPDU Length field, which indicates the length of the Framing PDU; a CRC, which starts on the next 4-byte boundary after the FPDU; and a number of PAD bytes, which are implicit depending on the length of the FPDU. Markers occur at a periodic interval of 512 bytes, but this does not imply that a marker's location can be determined by fixed arithmetic on the TCP sequence number. The starting TCP sequence number for markers is different for different connections. In other words, to locate a marker for a TCP connection, the connection context must be located to determine the starting sequence number for the markers.
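As a minimal sketch of the marker-location arithmetic just described, the following C fragment computes where the next marker falls within a received TCP segment. It assumes a hypothetical per-connection value, marker_base, recovered from the connection context; the function and parameter names are illustrative, not taken from the MPA specification.

```c
#include <stdint.h>

#define MPA_MARKER_INTERVAL 512u

/* Offset within the segment of the first MPA marker, or seg_len if no
 * marker falls inside this segment. marker_base is the per-connection
 * TCP sequence number at which marker spacing begins; it must be read
 * from the connection context, since the interval is relative to that
 * base and not to sequence number zero. */
static uint32_t first_marker_offset(uint32_t seg_seq, uint32_t seg_len,
                                    uint32_t marker_base)
{
    uint32_t into_cycle = (seg_seq - marker_base) % MPA_MARKER_INTERVAL;
    uint32_t to_next = (into_cycle == 0)
                           ? 0
                           : MPA_MARKER_INTERVAL - into_cycle;
    return (to_next < seg_len) ? to_next : seg_len;
}
```

This is precisely why marker recovery forces a context lookup before any per-segment processing can begin.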
The DDP layer uses the DDP header. Its purpose is to transmit tagged and untagged messages from the sender to the receiver. Key features are listed here.
1) Message Identification, segmentation and reassembly of Untagged Messages: Untagged Messages are not “Direct Placement” messages. The DDP protocol provides a Message Sequence number and a Queue Number to identify a message within a connection stream. The Sequencing information allows a message to be segmented and reassembled.
2) Tagged Messages for Direct Placement: These messages are for direct placement to a location at the sink defined by the Stag (scatter-gather list handle) and an offset within that list.
3) The untagged messages are “delivered” to the remote consumer whereas the tagged messages are not delivered to the remote consumer. In other words, the tagged messages are consumed within the RNIC.
4) The DDP Layer depends on the underlying transport to provide reliable delivery of messages. Out of order messages can be provided for placement, but delivery is in order.
5) DDP Messages are within the context of a stream or a connection. This feature is provided by the underlying transport layer.
6) The DDP layer requires that the Stag on which the operation is performed be valid for the stream on which the data transfer is taking place. It also requires offset checks.
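To make the tagged/untagged distinction concrete, the following is an illustrative C view of the two DDP header forms. The field widths follow the DDP specification (RFC 5041), but these structs are a sketch only: the real headers are big-endian on the wire and pack the control bits into a fixed layout that this sketch does not reproduce.

```c
#include <stdint.h>

/* Tagged DDP header: direct placement into a registered buffer. */
struct ddp_tagged_hdr {
    uint8_t  ctrl;   /* T=1, L (last segment) flag, version bits */
    uint32_t stag;   /* Stag identifying the registered buffer */
    uint64_t to;     /* tagged offset into that buffer */
};

/* Untagged DDP header: delivered to the remote consumer. */
struct ddp_untagged_hdr {
    uint8_t  ctrl;   /* T=0, L (last segment) flag, version bits */
    uint32_t qn;     /* queue number (0, 1, 2 as used by RDMAP) */
    uint32_t msn;    /* message sequence number within the queue */
    uint32_t mo;     /* message offset, for segmentation/reassembly */
};
```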
The RDMA protocol layer further refines the DDP layer to provide the defined RNIC interface.
1) The DDP untagged message with Queue Number 0 supports four different variations of the ability to send a message from the source to the destination.
2) The DDP untagged message with Queue Number 1 supports sending an RDMA read request message to the remote side (data source). It is assumed that the RNIC on the remote end can properly follow all the rules, and return the read data without informing the consumer.
3) The DDP untagged message with Queue Number 2 supports sending an error message.
4) The DDP tagged message can be of two different types.
5) RDMA headers are defined only for the RDMA Read Request and Error messages.
6) The RDMA protocol layer further clarifies the protection and ordering semantics associated with RDMA.
The iWARP protocol suite makes a significant distinction between placement and delivery. Placement of an inbound DDP Segment is allowed as long as the segment can be cleanly delineated and meets other rules.
1) Delineation implies that alignment is maintained at the sender and through middle-boxes. Delineation is recovered by confirming markers and the CRC in the context of the TCP sequence number.
2) “Rules” means that the segment passes all of the protocol validation rules and all access control rules.
But messages can be delivered only in order. This implies that a complete RNIC implementation has to track “meta-data” at the message level in the case of out-of-order packets. This imposes significant complexity on a conformant implementation. Some examples are given below; the list is not complete.
1) If a Send with Invalidate or a Send with Solicited Event and Invalidate message is received out of order, the invalidation component cannot be acted on until the missing segments are received. This is because if an Stag is invalidated and a missing segment targeted that Stag, the segment would encounter a “false” protection fault.
2) The same discussion applies to RDMA Read Requests. If an RDMA Read Request is received, processing of the request must not start while the stream has missing segments. The missing segments could have an impact on the data returned with the Read Request or on the validity of the Read Request itself.
3) RDMA Read responses also have an impact on ordering. A user making an RDMA Read Request can specify invalidation of the local Stag after the completion of the RDMA Read Request.
It is important to note from the above discussion that the issues are not simply around implementing a TCP offload engine (TOE). There are complexities in the iWARP specification that complicate its implementation. These are further factors limiting the success and usefulness of iWARP.
Following are some of the features of the RNIC Interface. The RDMA Verbs specification allows for direct user space access to the RNIC for sending and receiving untagged messages, and RDMA Read and Write operations; without going through the operating system kernel. In other words, user space applications have access to a Virtual Interface.
1) Protection and isolation of host memory, as well as of RNIC resources, is critical to the concept of user space access.
To achieve protection and isolation along with direct user space access, the life-cycle of resources such as Queues, Protection Domains, and Memory Handles/Stags is managed by a kernel mode driver.
All communication happens on a bidirectional stream. The simplest mapping of that is to a TCP connection, or any other mechanism offered by the underlying transport. The implication of this is that, whether in software or in hardware, there is the concept of a “stream context” or a “connection context”.
A queue pair is tied one-to-one with a stream.
1) The Send Queue of the pair is used for sending commands to the RNIC. These commands include Send commands that transfer data from the local system to the remote end point of the connection, RDMA commands that can be either RDMA Read or RDMA Write commands, commands that invalidate Stags, and commands that associate a Memory Window with a memory region.
2) The Receive Queue is used to provide Receive Buffers for inbound data (Send data from the remote node perspective). The inbound Receive Buffers are expected to be consumed in order, i.e., the buffers are expected to be consumed in a one-to-one relationship with the Message Sequence Number in the untagged DDP messages.
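A minimal sketch of the in-order consumption rule in item 2: receive buffers are matched one-to-one against the Message Sequence Number of inbound untagged DDP messages. The structure and function names here are illustrative assumptions, not taken from the Verbs specification.

```c
#include <stdint.h>
#include <stddef.h>

struct recv_buffer { void *addr; size_t len; };

struct recv_queue {
    struct recv_buffer *ring;
    uint32_t head;         /* index of the next buffer to consume */
    uint32_t posted;       /* number of buffers posted so far */
    uint32_t expected_msn; /* MSN that must match the head buffer */
};

/* Returns the buffer for an inbound untagged message, or NULL if the
 * MSN does not match the strict in-order expectation or no buffer is
 * available. This captures the one-to-one MSN/buffer relationship. */
static struct recv_buffer *consume_for_msn(struct recv_queue *rq,
                                           uint32_t msn)
{
    if (msn != rq->expected_msn || rq->head == rq->posted)
        return NULL;
    rq->expected_msn++;
    return &rq->ring[rq->head++];
}
```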
There is a very flexible mapping between Completion Queues and the Send and Receive Queues of different connections.
Instead of each connection having its own Receive Queue, a Shared Receive Queue can serve multiple connections. Shared Receive Queues do not have the same ordering constraints as Receive Queues. An RNIC is allowed to limit the number of Completion Queues and Shared Receive Queues it supports.
All Queue Pairs/Connections and Stags/Memory Regions are associated with Protection Domains. Operations are allowed on an Stag/Memory Region only when the Protection Domains of the memory region and the queue pair are identical. This is however not a sufficient condition for the operation to be allowed. Explicit permissions for the operation being performed must match as well.
Fundamental to achieving protection and isolation, while at the same time allowing a remote end-point to read and write memory on the system, is the concept of Stags—which are “RDMA Memory Handles”. The concept is similar to how the virtual to physical memory mapping on a system provides protection and isolation among user space applications on a system.
1) The remote end-point in an RDMA connection is provided Stags for the local end-point for RDMA operations. On the local end-point, the Stags map to a set of physical addresses (a list of pages, a start offset, and a total length).
2) The Stags are associated with specific protection domains, and have specific access rights enabled (local read, local write, remote read, remote write).
3) In general, the creation of Stags, their association with system physical memory, and access rights is performed through the privileged module in kernel space.
4) The RNIC is required to enforce all the access rights when reading from or writing into the memory associated with an Stag (in addition to performing bounds checks, state validity etc.). This ensures that a non-privileged application cannot read from or write into local or remote memory to which it has not been granted appropriate access.
5) A special type of memory region is called a shared memory region. This is a memory region with a Stag like other types of memory regions. However, there are other Stags that refer to the memory associated with this Stag as well. Therefore the same “memory list” or “memory registration” can be used with different Stags, each with their own Protection Domain or access rights.
6) Memory Windows are a special case of Stags. They expose a Window into a memory region. The key here is that association of a memory window and controlling access rights (subset) is performed by an operation on the Send Queue and not through the kernel module. This operation can be performed by a non-privileged application. A non-privileged application would use the kernel interface to create a memory region with the associated memory. It would also use the kernel interface to allocate a memory window Stag. The memory window stag can then be associated with a window into the previously registered memory region through the Send Queue in what is considered a light-weight operation.
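The access checks described in the list above can be sketched as follows. This is an illustrative model of the validation an RNIC performs before touching memory through an Stag: protection-domain match, explicit access rights, and bounds. The names, flag values and structure layout are assumptions, not taken from the Verbs specification.

```c
#include <stdint.h>
#include <stdbool.h>

enum { ACC_LOCAL_READ = 1, ACC_LOCAL_WRITE = 2,
       ACC_REMOTE_READ = 4, ACC_REMOTE_WRITE = 8 };

struct stag_entry {
    uint32_t pd;       /* protection domain of the memory region */
    uint32_t access;   /* enabled rights bitmask */
    uint64_t base_to;  /* starting tagged offset */
    uint64_t length;   /* total registered length */
    bool     valid;    /* state validity */
};

/* All three checks must pass before memory is touched. A real
 * implementation must also guard against arithmetic overflow in the
 * bounds comparison. */
static bool stag_check(const struct stag_entry *e, uint32_t qp_pd,
                       uint32_t needed_access, uint64_t to, uint64_t len)
{
    if (!e->valid || e->pd != qp_pd)           /* PDs must be identical */
        return false;
    if ((e->access & needed_access) != needed_access)
        return false;                          /* explicit permissions */
    if (to < e->base_to || to + len > e->base_to + e->length)
        return false;                          /* bounds check */
    return true;
}
```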
In a traditional I/O model, the driver provides a scatter-gather list to the NIC/HBA (Host Bus Adaptor). Each item/element in the scatter-gather list consists of a host memory physical address and a length. The NIC reads from or writes into the memory described by the scatter-gather list.
1) However in the RDMA model, with direct user space access, each scatter-gather list element uses an Stag and an offset within the Stag, instead of the host memory physical address. This allows the RNIC to ensure that the consumer has permissions to access the memory that it wishes to through a combination of protection domain with which the consumer and Stag are associated, and the indirection from the Stag/offset to the host physical memory.
2) Stag Value of 0: Only privileged users are allowed to use an Stag value of 0 in scatter-gather lists; an Stag of 0 means that the offset is the actual host physical address.
There are two operations on the Send Queue that are allowed or disallowed depending on the privilege level of the consumer. One is the use of Stag value of 0 in the scatter-gather lists. The other is the ability to associate an already allocated memory region (not window) Stag with a set of host pages. The two privileges are controlled independently, however it is likely that they will be allowed/disallowed to the same consumers. Having this privilege means that the application can perform memory management operations without having to go through the privileged kernel module.
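A brief sketch of the RDMA-model scatter-gather element and the Stag 0 privilege check described above. The element names an Stag and an offset rather than a physical address; all names here are illustrative assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

struct rdma_sge {
    uint32_t stag;   /* 0 means offset is a host physical address */
    uint64_t offset; /* offset within the Stag, or physical address */
    uint32_t length;
};

/* Stag 0 bypasses the Stag indirection entirely, so it is reserved
 * for privileged consumers; all other Stags are validated through the
 * Stag table (see the earlier stag_check sketch). */
static bool sge_permitted(const struct rdma_sge *sge, bool privileged)
{
    if (sge->stag == 0)
        return privileged;
    return true;
}
```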
Fencing an operation implies that the operation will not start until all prior operations that may impact this operation have completed. There are two types of fencing attributes defined, read fence and local fence.
In addition to the ordering requirements imposed by the protocol, additional ordering requirements are imposed by the interface. Work Items must be completed in the order they are posted on the Send Queue and the Receive Queue.
The following are some of the challenges of implementing TCP and MPA in an adapter.
1) TCP is byte-stream oriented whereas DDP requires the transport layer to preserve segment boundaries. This requires the addition of the MPA layer. In addition, it requires buffering for, and detection/handling of, partial PDUs.
2) One of the mechanisms used by MPA for frame recovery is markers. Markers are inserted relative to the initial sequence number of the TCP connection, i.e., to detect the location of a marker in a TCP segment, the TCP context is required.
3) MPA marker insertion and deletion itself adds additional requirements. One example is when retransmitting segments when the path MTU has changed.
4) TCP context lookup requires specialized hardware support. This is because the TCP context is identified by a sparsely populated, long 4-tuple (source IP address, destination IP address, source port, destination port) whose size with IPv6 is practically unmanageable. There are two classic ways of doing the lookup. One is the use of Content Addressable Memory (CAM), which is very expensive in terms of silicon size and does not scale easily to the number of connections (64K or more) required for some applications. The other mechanism is to use hash lists, which do not have a deterministic search time (see the sketch following this list).
5) TCP state machine processing creates an inter-dependence between the receive path and the send path. This is due to the “ACK clocking” and window updates of the TCP protocol.
6) TCP processing requires double traversal of the send queue. The first traversal is to send the data, and the second traversal is to signal completion to the host. In theory, whenever a new sequence number is acknowledged by the remote peer, the queue must be traversed a second time to determine the point up to which completions for work requests can be signaled to the host driver.
7) The TCP state machine includes many timers that have to be tracked such as delayed ACK timer, retransmission timer, persist timer, keep-alive timer and 2MSL timer. Even if the TCP context which is fairly large is kept elsewhere, the “data-path” timers for every connection must be accessed relatively frequently by the device. This adds cost in terms of on-chip memory, or attached SRAM, or consumes interconnect bus bandwidth.
8) The standard allows for multiple frames in a single TCP segment. In other words, the upper layer state machine may have to be applied multiple times to the same TCP segment.
9) Out-of-order placement and in-order delivery requires the device to track holes in the TCP byte stream, and track actions that have been deferred and must be performed when the holes are filled. This adds significant complexity in the processing stream, and additional storage requirements.
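To make item 4 above concrete, the following sketch shows a hash-list lookup of the TCP context keyed by the connection 4-tuple. It illustrates why the search time is non-deterministic: a hash bucket's chain must be walked for every inbound segment, and chain length is workload dependent. The hash function choice and the structures are assumptions.

```c
#include <stdint.h>
#include <string.h>

struct tuple4 {
    uint8_t  src_ip[16], dst_ip[16];  /* sized for IPv6 */
    uint16_t src_port, dst_port;
};

struct tcp_ctx {
    struct tuple4   key;
    struct tcp_ctx *next;             /* hash-chain link */
    /* ... sequence state, timers, MPA marker base, etc. ... */
};

#define CTX_BUCKETS 65536u

static struct tcp_ctx *ctx_lookup(struct tcp_ctx **table,
                                  const struct tuple4 *key)
{
    /* FNV-1a over the tuple bytes; any mixing function would do. */
    uint32_t h = 2166136261u;
    const uint8_t *p = (const uint8_t *)key;
    for (size_t i = 0; i < sizeof(*key); i++)
        h = (h ^ p[i]) * 16777619u;

    /* Chain walk: O(chain length), not deterministic. */
    for (struct tcp_ctx *c = table[h % CTX_BUCKETS]; c; c = c->next)
        if (memcmp(&c->key, key, sizeof(*key)) == 0)
            return c;
    return NULL;
}
```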
As mentioned earlier, all of these complexities can be overcome, but doing so adds cost and causes performance degradation. Thus a solution that alleviates these complexities while increasing performance and decreasing cost is desirable, to provide RDMA over Ethernet and allow a fully converged layer 2 network.
The solution should:
1) provide reliable delivery,
2) preserve DDP Segment boundaries,
3) provide a strong digest (stronger than the TCP checksum) to protect the DDP Segment,
4) indicate the path MTU to the DDP layer,
5) be able to carry the length of the DDP Segment,
6) be stream oriented, so that a DDP stream maps to a single TCP connection when running over TCP, and
7) either have in-order delivery or have the delivery order exposed to DDP.
Embodiments according to the present invention utilize a new transport protocol between the IP layer and the DDP layer. Further, the embodiments all operate on a CEE-compliant layer 2 Ethernet network to allow the new transport protocol to be simplified, providing higher performance and simpler implementation. Thus the use of the new transport protocol allows a CEE-compliant layer 2 Ethernet network to provide data networking using IP, storage using FCoE and RDMA using IP and the new transport protocol, without suffering the previous performance penalties in any of these aspects.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an implementation of apparatus and methods consistent with the present invention and, together with the detailed description, serve to explain advantages and principles consistent with the invention.
The CPU 200, CPU 208 and the network interface 214 interact to form an RDMA stack according to the preferred embodiment. Certain layers may be handled in the network interface 214, such as the layer 2 operations, while higher layers are handled by the CPUs 200, 208 using their respective software.
These are simplified illustrations of hosts and storage; other configurations can readily be developed and are known to those skilled in the art.
With that explanation, more detail of the DRCP layer is provided here. From a service user perspective, there is no difference for an application between running over DRCP (over CEE) and running over TCP with MPA. However, the transport is greatly simplified, as discussed below.
The preferred DRCP layer provides several advantages. The protocol header is optimized for use with the DRCP layer. The context size is optimized for the DRCP layer and is significantly smaller than the TCP context. There is no confusion with prior art usage, which can cause issues due to the direct data placement feature: with TCP or UDP, there is always the option of using unassigned port numbers for an application, so when a packet is received it is not entirely clear whether the packet is a candidate for direct data placement. This is resolved with the DRCP layer. There are also no implementation issues in cases such as starting in streaming mode and switching to MPA mode, as is true for iWARP over TCP.
In certain embodiments, RDMA Read Requests and/or RDMA Read Responses participate in the same sequence numbering as the other types of packets. Practically speaking, RDMA Read Responses do not require sequence numbers if somehow they can be associated with the corresponding request (an additional requirement is that packets of a response are transmitted with increasing offset). In certain embodiments the RDMA Read Requests are numbered like any other packets, but the response packets carry the sequence number of the RDMA Read Request packet; and are also identified through a special flag in the header. A missing RDMA read response packet can be easily detected in this case through a combination of response time-out and incorrect offset value.
These embodiments shift some of the implementation burden from the transmitter to the receiver. If the implementation of the RDMA Read response is done inside the RNIC instead of the driver, the RNIC does not have to change the sequence number order from the order known to the driver.
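A sketch of the read-response variant described above, in which response packets echo the sequence number of the originating RDMA Read Request and loss is detected through a response timeout plus a non-advancing offset. All names here are assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

struct read_req_state {
    uint64_t req_seq;         /* sequence number of the Read Request */
    uint64_t expected_offset; /* next offset a response must carry */
    uint64_t deadline_ms;     /* response timeout deadline */
};

/* A response must echo the request's sequence number (plus a response
 * flag in the header, not shown) and carry offsets that increase
 * contiguously; a gap indicates a missing response packet. */
static bool response_in_order(struct read_req_state *st,
                              uint64_t rsp_seq, uint64_t rsp_offset,
                              uint32_t rsp_len)
{
    if (rsp_seq != st->req_seq)
        return false;
    if (rsp_offset != st->expected_offset)
        return false;  /* missing or reordered response detected */
    st->expected_offset += rsp_len;
    return true;
}
```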
As known to those skilled in the art, CEE, formed by the combination of the Priority-Based Flow Control (IEEE P802.1Qbb), Enhanced Transmission Selection (IEEE P802.1Qaz) and Congestion Notification (IEEE P802.1Qau) projects of the IEEE, provides a near-lossless layer 2 transport that also provides congestion management. This makes it possible to provide a reliable transport with a relatively simple protocol.
In the background some of the complications with the existing protocol were described. Table 1 below describes how those complications are mitigated with DRCP.
DRCP is connection oriented since DDP as defined requires a connection oriented protocol. The connection state is relatively small and is described fully later.
The source and destination IP addresses and the source and destination port numbers identify the two end-points of a connection uniquely.
DRCP supports a simplified connection establishment model as compared to regular TCP. One end of the connection, as defined by the protocol running on top of the transport, must be an active opener of the connection, and the other end must be the listener or a passive opener. Typically the listener will wait for connection requests on the well-defined port number for an upper layer protocol. The well-defined port number is preferably assigned by the appropriate standards organization. The active opener will use as its source port number an unused dynamic port number that is not part of the well-defined port numbers for any ULPs. Therefore the problem of simultaneous open is avoided.
The Listener, upon receiving the packet, may decide to accept, reject or ignore the packet. If it decides to accept the packet, it sets up the fields as follows (there must not be any payload). The source and destination ports are reversed from the packet sent by the active opener. The Sequence Number, also called the Initial Sequence Number (ISN), should be selected in a manner that minimizes confusion with delayed packets in the network. This is called the ISN for the listener end of the connection. The Acknowledgement Number is set to one greater than the Sequence Number in the packet received from the active opener. The SYN and ACK flags are set. The Message Window is a value picked by the listener. It should typically be a large value. The Receive Handle (64-bits) is the receiver handle of the listener and is selected by local means.
The Active Opener, upon receiving the packet, will send an acknowledgement to the Listener. Payload may accompany the packet. The source and destination ports are the same as sent by the Active Opener in the first packet it sent to the listener. The Sequence Number depends on whether a payload is present or not: it is incremented by one from the ISN if a payload is present, and otherwise is the same as the ISN. The Acknowledgement Number is set to one greater than the Sequence Number in the packet received from the listener. The ACK flag is set. The Message Window is a value picked by local means. The Receive Handle (64-bits) is the receiver handle of the active opener as received in the packet from the listener. All subsequent packets from the active opener to the listener carry the receive handle sent by the listener for this connection.
At this point the connection is open and data can be transferred.
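The handshake just described can be sketched as follows. The header layout is inferred from the fields named in the text (ports, sequence and acknowledgement numbers using modulo 2**64 arithmetic, flags, Message Window, and a 64-bit Receive Handle); the exact wire layout, the widths of the flag and window fields, and the flag encodings are assumptions.

```c
#include <stdint.h>

enum { F_SYN = 1, F_ACK = 2, F_FIN = 4, F_RST = 8,
       F_NACK = 16, F_ACK_REQUESTED = 32 };

struct drcp_hdr {
    uint16_t src_port, dst_port;
    uint64_t seq;          /* per-packet sequence number (mod 2**64) */
    uint64_t ack;          /* acknowledgement number */
    uint32_t flags;
    uint32_t msg_window;   /* window counted in messages, not bytes */
    uint64_t recv_handle;  /* peer's receiver handle */
};

/* Listener accepts a connection request; no payload is permitted in
 * this packet. Field assignments follow the text above. */
static void make_syn_ack(struct drcp_hdr *out, const struct drcp_hdr *in,
                         uint64_t listener_isn, uint32_t window,
                         uint64_t local_handle)
{
    out->src_port    = in->dst_port;  /* ports reversed */
    out->dst_port    = in->src_port;
    out->seq         = listener_isn;  /* listener's ISN */
    out->ack         = in->seq + 1;   /* opener's ISN plus one */
    out->flags       = F_SYN | F_ACK;
    out->msg_window  = window;        /* typically a large value */
    out->recv_handle = local_handle;  /* selected by local means */
}
```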
Reliable data transfer requires the ability to detect missing data, the ability to acknowledge received data, and the ability to detect missing packets and retransmit data. DRCP satisfies the first requirement by using a unique sequence number for every packet containing data, or in which the SYN or FIN flag is set. It is noted that numbering pure ACK or NACK packets increases the size of the metadata required for retransmission, because the sender not only has to keep track of sequence numbers used for data packets, but also has to keep track of sequence numbers for pure ACK/NACK packets. For this reason, the preferred embodiment numbers only packets containing data or the SYN/FIN flag.
The following is an overview of the packets that may be sent from one end to the other during the established phase of the connection. See the description on termination for a usage of the FIN and RST flags. Also see the description on handling loss of packets for usage of NACK flag.
One end of the connection may send packets containing payload to the other end. These packets may be formatted as follows. The source and destination port usage is as previously discussed. Since a payload is present in the packet, the sequence number is one greater than that of the last packet containing payload that was sent (see the discussion of retransmission later). If the ACK flag is set, the Acknowledgement Number is valid; see the rules for acknowledgement and setting the ACK Sequence Number below. The Message Window is a value picked by local means. The payload consists of DDP and RDMAP protocol headers and, if required, a DDP/RDMAP payload. An entire DDP message should consist either of a single packet, which is also the last packet, or of one or more non-last packets followed by a last packet. All packets but the last packet must be equal sized. The size (including the transport layer and DDP/RDMAP headers, the payload, and the trailing CRC) of all packets but the last packet must be a multiple of 4 bytes, equal to or less than the path MTU selected by the sender. The last packet may require padding for the CRC, done using the pad bytes in the header. The ability to send payload packets is governed by the Window Size: the message sequence number must lie in the Window advertised by the receiver.
The receiver must acknowledge receipt of data by sending acknowledgements to the sender. The acknowledgement may be piggy-backed onto a packet containing payload, or it may be sent on a pure ACK packet (one that does not contain any ULP payload). The following are some points for sending acknowledgements.
1) An acknowledgement is sent when the ACK flag is set. The Acknowledgement Number is set to one greater than the sequence number being acknowledged. All packets prior to the sequence number being acknowledged are implicitly acked.
2) The ACK flag will be set in most packets. The value of the Acknowledgement Number may or may not change from packet to packet. The Acknowledgement Number is always increasing (using modulo 2**64 arithmetic).
3) Acknowledgement is at message boundaries. In other words, the acknowledgement number will always be one greater than the sequence number corresponding to the end of a message.
4) The receiver may send the acknowledgement as part of a data packet, or in a packet that does not contain data. The former is called a piggy-backed ACK and the latter a pure ACK.
5) The receiver may coalesce ACKs, i.e., not send an ACK for every message received. The receiver can use a timer of 200 milliseconds, or another configurable value, to send an acknowledgement. The receiver also should be aware of the outstanding Message Window it has advertised when determining when to send an acknowledgement.
6) The sender may request that acknowledgement be expedited by setting the “ACK Requested” flag. The flag may be set in any of the packets but is meaningful only in the last packet of a message. The receiver should try to honor this request and send an acknowledgement. Excessive use of the flag is discouraged. Its primary purpose is to trigger an ACK when the sender is sending large messages to this receiver and would like to free memory relatively quickly, or when the sender has sent a large number of messages to this receiver.
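The acknowledgement policy above can be condensed into a short decision sketch. The message-boundary, timer and “ACK Requested” rules follow the text; the coalescing threshold and the state layout are assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

struct ack_state {
    uint64_t last_acked;    /* last acknowledgement number sent */
    uint64_t msg_boundary;  /* seq at end of last complete message + 1 */
    uint32_t unacked_msgs;  /* messages received since the last ACK */
    bool     ack_requested; /* flag seen on the last packet of a message */
};

static bool should_send_ack(const struct ack_state *s,
                            bool delayed_ack_timer_fired)
{
    if (s->msg_boundary == s->last_acked)
        return false;             /* ACK only at new message boundaries */
    if (s->ack_requested)
        return true;              /* sender asked for an expedited ACK */
    if (delayed_ack_timer_fired)
        return true;              /* 200 ms (or configured) timer */
    return s->unacked_msgs >= 8;  /* assumed coalescing threshold */
}
```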
It is expected that DRCP will run over CEE with pause enabled for the priority used to transmit this protocol. As such, packet loss is expected to be an extremely infrequent event, with the following two being the primary reasons for lost/dropped/out-of-order packets: a link/route flap due to failure or addition of equipment, and rare cases of link transmission error.
DRCP must handle these conditions reasonably well. It must not require a connection reset in these conditions. It must be able to recover the missing data packets.
Since packet loss or out of order is considered an infrequent event, it is considered acceptable to retransmit all of the data from the point where missing or out of order segments are detected.
The following sub-sections analyze some of the common scenarios. The first scenario focuses on the case where packets are misplaced within a sequence; this is also the most likely scenario. The case where multiple non-contiguous packets are lost should be uncommon and is discussed later. Consider the following scenario. Message M is in transmission. One or more packets for message M are lost or out-of-order. The trailing packets of message M are received. The receiver will detect that some packets are out of order. The receiver will send a NACK to the sender. The NACK Acknowledgement Sequence Number contains the sequence number of the next packet that was expected. If the missing packets are lost, the recovery is fairly straightforward. The peer will retransmit all of the byte stream from the point of the first detected missing segment (see later driver action). Everything works as normal from the point of retransmission. If the missing segments were out of order, things get more complicated. Essentially, all of the packets will be seen twice, but the problem is solved by discarding duplicate packets. Assume that the proper stream is made up of four chunks, A, B, C and D, each following the other in order. In other words, the starting sequence number of B is the ending sequence number of A, and so on. Say the stream is reordered as A, C, B, D. The receiver detects the A to C jump and sends a NACK to the sender. The sender retransmits B and everything beyond it. The original plus retransmitted stream is A, C, B, D, B, C, D, E, F. The receiver then detects the B to D jump and sends a NACK to the sender, requesting the peer to retransmit from C onwards. The original plus retransmitted stream is now A, C, B, D, B, C, D, E, F, C, D, E, F, G. This situation must be avoided in one of two ways. Either the receiver avoids sending a NACK for some time after sending one, or the receiver sends the NACK but the sender avoids retransmission with the knowledge that it has already sent D twice. A double loss will still be recovered by the sender's timer-based algorithm.
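The first avoidance option above, in which the receiver suppresses NACKs for a period after sending one, can be sketched as follows. The hold-off mechanism and the names are assumptions; sequence comparisons are simplified, and a real implementation would use modulo 2**64 arithmetic.

```c
#include <stdint.h>
#include <stdbool.h>

struct rx_state {
    uint64_t next_expected;    /* next in-order sequence number */
    uint64_t nack_holdoff_end; /* no further NACKs until this time */
};

/* Returns true if a NACK (carrying next_expected) should be emitted
 * for this data packet. Duplicates of already-received packets are
 * silently discarded, which resolves the double-delivery in the
 * A, C, B, D reordering example above. */
static bool on_data_packet(struct rx_state *rx, uint64_t seq,
                           uint64_t now_ms, uint64_t holdoff_ms)
{
    if (seq < rx->next_expected)
        return false;                /* duplicate: discard silently */
    if (seq == rx->next_expected) {
        rx->next_expected = seq + 1; /* in order: accept */
        return false;
    }
    /* Gap detected: NACK once, then hold off to avoid the repeated
     * retransmission storm illustrated in the text. */
    if (now_ms >= rx->nack_holdoff_end) {
        rx->nack_holdoff_end = now_ms + holdoff_ms;
        return true;
    }
    return false;
}
```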
The scenario of the loss of trailing data packets is detected by the transmitter, as it does not receive ACKs from the peer. The standard TCP method of retransmitting the first un-acknowledged packet does not work. This is because the protocol expects acknowledgement on message boundaries, and duplicate packets will be silently dropped by the device.
The preferred option is to transmit a pure-ACK packet that does not advance either the ACK sequence number or the window. If these values do have to be advanced, the sender can send two pure-ACK packets; one that advances the ACK sequence number and/or the window, and the second one that is an exact replica of the first.
The receiver, on noticing a pure-ACK packet that does not advance either the ACK sequence number or the window, will generate a pure-ACK that acknowledges the last complete message received. This enables the sender to determine the message that was lost, so it can retransmit from that point onwards. This does waste an additional RTT, but that is likely to be small compared to the total detection time.
Consider the case of a pure-ACK packet that is received out of order or is lost. A pure-ACK packet carries a sequence number that is the same as that of the last packet with payload. If the receiver detects that the sequence number has jumped unexpectedly, the behavior should be the same as in the case of detection of out of order packets described previously. The receiver should be aware that it can receive two pure-ACK packets that are out of order with respect to each other. In other words, the receiver will see a pure-ACK packet that acknowledges a sequence number higher than that acknowledged by a pure-ACK packet that follows immediately after; the latter packet should be ignored. The receiver may see a pure ACK packet with a sequence number lower than that of the preceding data packet (accounting for modulo 2**64 wraparound). In such a case, the receiver can either discard the pure ACK packet or go into a mode similar to that of out of order packets; either behavior should provide proper recovery. A pure-ACK packet may be lost in the middle somewhere. Typically the subsequent data packet will provide the acknowledgement sequence number and window update as required, and no recovery action will be needed. A trailing pure-ACK packet may be lost; the recovery in this case will be similar to the previously described “trailing packet lost” scenario.
DRCP utilizes concepts of Message Windows and Window Updates. This is somewhat similar to the concept in TCP, but different. The size of the Window is not in terms of bytes or packets, but in terms of messages. The reason for this is that a host resource is consumed when a message is sent. Each untagged message consumes a single work item from the receive queue. Each tagged message requires an acknowledgement, and thereby may consume a resource at the receiver.
The size of the Message Window is fairly large, and the advertised window should be large enough that the RNIC does not have to create an interlock between the send path and the receive path. If the advertised window is large, the host driver can queue a significant number of requests on the Send Queue, to the extent allowed by the Window. As the Window is extended, the driver can observe the value from completions and queue additional items on the Send Queue.
Finally, if an application protocol is designed in a manner such that flow control at the DRCP level is not required, an extremely large window (16 million messages) can be advertised at all times.
Window updates will typically piggyback on packets with payload or on pure-ACK packets that advance the acknowledgement sequence number. However, there may be a need to send window updates on pure-ACK packets that do not otherwise update the acknowledgement sequence number. This may happen when there is a resource constraint at the receiver and it does not wish to receive additional messages.
Only one scenario has to be considered as an exception. That is of a trailing window update in a pure-ACK packet that does not increment the acknowledgement sequence number. Such a scenario should be treated like the “trailing packet lost” scenario discussed earlier.
Connection termination is modeled after TCP. There are two ways to close a connection. Typically the connection should be shut down gracefully. This is accomplished by using a segment with the FIN flag set. A packet with the FIN flag set must not carry any payload information or have the SYN or the RST flag set. Unlike pure-ACK packets, it is assigned a unique packet sequence number, i.e., it carries a sequence number one larger than the last packet carrying payload. If one end sends a FIN to the other, it can only send pure-ACK packets after that, unless the packets are required for retransmission. A half-close condition of the connection is supported. Simultaneous close is supported. A connection is closed when both ends have sent a FIN to each other and sent acknowledgements for the FIN packets. Acknowledgement for the FIN packet may be piggybacked on a non-FIN or a FIN packet. If a connection cannot be closed gracefully due to an error condition, one end can send a packet with the RST flag set. This must be done on a packet that does not contain a payload. An acknowledgement is not expected from the other end.
The following timers are used for DRCP:
1) Delayed ACK timer, probably with the same 200 msec value as is used in TCP;
2) Retransmission/persist timer with a fixed value of 1 sec and the Send Queue being drained;
3) 2MSL for TIME WAIT state: in TCP this timer is used by the end that sent the last FIN, i.e., did not receive an acknowledgement for the FIN, before reusing the 4-tuple for the connection. A similar timer is used here, but the value should be more realistic than that in TCP. It should be a multiple of the maximum lifetime of a packet in a CEE network; probably a minute should be sufficient. When this timer is running, no other timers are running for this connection; and
4) Optional keep-alive timer, with a locally determined value; a value greater than 60 minutes is preferred. When the keep-alive timer is running, none of the other timers is running for this connection.
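The timer set above can be sketched as a small table plus an arming helper that enforces the stated exclusivity of the 2MSL and keep-alive timers. Values follow the text where given; the structure and names are illustrative.

```c
#include <stdint.h>

enum drcp_timer { T_DELAYED_ACK, T_RETRANSMIT, T_2MSL, T_KEEPALIVE };

struct drcp_timers {
    uint64_t expiry_ms[4];   /* 0 means the timer is not running */
};

static const uint64_t timer_value_ms[4] = {
    200,              /* delayed ACK, as in TCP */
    1000,             /* retransmission/persist: fixed 1 sec */
    60u * 1000,       /* 2MSL substitute: about a minute */
    60u * 60 * 1000,  /* optional keep-alive: >= 60 minutes preferred */
};

static void arm_timer(struct drcp_timers *t, enum drcp_timer which,
                      uint64_t now_ms)
{
    /* 2MSL and keep-alive run alone: cancel everything else first. */
    if (which == T_2MSL || which == T_KEEPALIVE)
        for (int i = 0; i < 4; i++)
            t->expiry_ms[i] = 0;
    t->expiry_ms[which] = now_ms + timer_value_ms[which];
}
```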
The following is a re-iteration of some of the requirements from layer 2. DRCP is defined for use over CEE links only. The entire path from one end-point to the other end-point should consist of CEE links. The following requirements are also end to end. The priority or priorities assigned to DRCP must have Priority Flow Control enabled. ETS must be enabled. Congestion Management must be enabled. The two end-points of the connection should support outbound scheduling at the granularity of a connection-specific Send Queue. It is desirable to restrict DRCP to a single subnet, at least in its initial deployments. One way to restrict it is by protocol, using a specific ether-type, though that is optional. In practice, it can be done by setting the TTL field in the IP header (if used) to 1 and by ensuring that the source and destination are on the same subnet.
Therefore a new transport layer for use with RDMA operations is provided. The new layer, termed DRCP, overcomes many of the difficulties of the use of TCP and MPA as in iWARP. This occurs in part because the DRCP layer is intended for use in a CEE-compliant Ethernet network. DRCP then handles the transport layer functions while also smoothly and simply interfacing between the requirements of RDMAP and DDP and IP. The DRCP layer is defined to allow simple and high performance operation so that a converged Ethernet environment becomes feasible and practical.
While certain exemplary embodiments have been described in detail and shown in the accompanying drawings, it is to be understood that such embodiments are merely illustrative of and not restrictive on the invention, and that various changes may be devised without departing from the basic scope thereof, which is determined by the claims that follow.
This application claims the benefit under 35 U.S.C. §119(e) of U.S. Provisional Patent Application Ser. No. 61/146,218, entitled “Protocols and Partitioning for RDMA over CEE” by Steven Wilson, Scott Kipp and Somesh Gupta, filed Jan. 21, 2009, which is hereby incorporated by reference.