The present invention relates to supporting replication within a protocol offload network controller. Replication is supported in the send direction by sending the same data over multiple offloaded connections to possibly multiple destinations through possibly multiple ports. Replication is supported in the receive direction by forwarding received data to multiple destinations.
A Network Interface Controller (NIC)—which may be, for example, network interface circuitry, such as within a system on a chip (SoC)—is typically used to couple one or more processors to a packet network through at least one interface, called a port. NIC circuitry has been an area of rapid development as advanced packet processing functionality and protocol offload have become common for so-called “smart NICs.”
In accordance with an aspect of the invention, network interface circuitry, forming a local node, is configured to couple a host computer and a network. The network interface circuitry comprises at least one processor configured to offload from the host computer at least a portion of communication protocol processing to accomplish at least two stateful communication connections between the host computer and a peer via the network. The processor further has the capability to replicate data: to send the same data over multiple connections in the send direction and, separately, to forward the same received data to multiple destinations in the receive direction.
The uses for replication include data replication in Distributed File Systems (DFS) such as Hadoop® FS, and the Microsoft Azure™ cloud. These distributed file systems typically replicate data on write to implement reliability; e.g., when data is written to the Hadoop® FS, the default is to replicate the data three times, typically over a network connection, and write the data to three different locations to minimize the chance of catastrophic failure resulting from the malfunction of a storage device. The current state of the art writes the data three times from the storage controller host application as if it were three different writes. This invention describes how the sending of the replicated data can be offloaded to the protocol offload device, thereby making the replication more efficient: it requires fewer CPU cycles, less host memory bandwidth, in some use cases less PCIe bandwidth, and in some use cases less Ethernet bandwidth.
Another application is to implement a reliable multi-cast service on top of a reliable transport layer service such as TCP, which in turn runs over an unreliable data link layer service such as Ethernet. The TCP/IP protocol is point-to-point but, with the replication capability, a reliable multi-cast can be implemented to deliver reliable messaging services to multiple subscribers.
The inventor has realized that data replication can be supported efficiently in a protocol offload device (such as a protocol offload device to offload transport layer protocol processing from a host) by supporting a shared memory (SHM) abstraction for the send and receive buffers that are used in protocol offload devices. The protocol offload send and receive buffers are accessed using a per-offloaded-connection virtual address method that maps transport protocol sequence numbers (such as TCP protocol sequence numbers) to memory locations within buffers, and, for example, either page tables and paged memory or segment tables and segmented memory are used to access the memory. The mapping process from sequence numbers to memory addresses may be similar to a process used in conventional computer systems to map a virtual address to an address in memory. The SHM abstraction in the protocol offload device may be implemented by adding a reference count to each allocated page or segment, similar to the way it would be implemented in a conventional computer system. The reference count, in the protocol offload device, is initialized to the number of sharers (transport layer connections collectively referred to as a replication group of connections) when the page or segment is allocated from a free list pool and is decremented by each of the sharers when it has finished using the page. The page is returned to the free list pool when all the sharers have finished using it. A sharer is using a page until all the data has been successfully sent for that sharer, which is typically detected by observing the value of a connection sequence number. Once the sequence number has progressed beyond the end of a page or segment, the sharer no longer requires the page or segment and it can be freed. When all the sharers in a replication group have progressed beyond a page, the reference count has been decremented back to zero and the page can be returned to what is conventionally referred to as a free list pool. With the SHM abstraction, a protocol offload device has thus accomplished sending the same data multiple times, replicating the data from a single copy of the data provided to the network interface circuitry by the host. A protocol offload device can support multiple offloaded connections simultaneously, and the invention enables supporting multiple replication groups simultaneously. A DFS could, for example, create a replication group for each write, or it could use multiple replication groups of connections and then load balance across the different groups based on the progress within each group.
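By way of illustration, a minimal sketch of such a reference-counted shared page, in C, is shown below; the names shm_page, shm_page_alloc and shm_page_release, the 4 KB page size, and the free-list organization are hypothetical and serve only to illustrate the counting scheme described above.

```c
/*
 * Minimal sketch of a reference-counted shared send-buffer page.
 * All names and the 4 KB page size are hypothetical.
 */
#include <stdint.h>
#include <stddef.h>

struct shm_page {
    struct shm_page *next;     /* link on the free list pool              */
    uint32_t         refcnt;   /* number of sharers still using this page */
    uint8_t          data[4096];
};

static struct shm_page *free_list; /* head of the free list pool          */

/* Allocate one page for a replication group of n_sharers connections. */
static struct shm_page *shm_page_alloc(uint32_t n_sharers)
{
    struct shm_page *p = free_list;
    if (p != NULL) {
        free_list = p->next;
        p->refcnt = n_sharers;  /* initialized to the number of sharers   */
    }
    return p;                   /* NULL if the pool is empty              */
}

/*
 * Called by a sharer once its send sequence number has progressed beyond
 * the end of the page; the last sharer returns the page to the pool.
 */
static void shm_page_release(struct shm_page *p)
{
    if (--p->refcnt == 0) {
        p->next   = free_list;
        free_list = p;
    }
}
```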
Referring to
The first example of replication shown in the Figure has node0 replicate the send data to node2, node3, and node5, and the replication is accomplished in the protocol offload device within node0 as is described in more detail below. In the case where the switches 221 and 222, within cluster 220, are PCIe switches, the protocol offload device within 227 optionally replicates the data in the receive direction: it processes the TCP/IP protocol in the receive direction and then sends multiple copies on PCIe to different nodes within a PCIe fabric. In the case where 221 and 222 are Ethernet switches, the protocol offload device within 227 replicates the data in the send direction. For example, the storage controller software within node2 227 determines that it has received a write request and opens (or has already opened) three connections to node 226, node 225, and node 223, respectively, and writes the data to these three offloaded connections from the same SHM send buffer.
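For illustration, a hypothetical host-side view of such a write might look as follows; the repl_group_* API is an assumption made for this sketch and is not an actual driver interface, the actual mechanism being the CPL_TX_DATA_PACK path described below.

```c
/*
 * Hypothetical host-side usage: one write replicated over three already
 * opened offloaded connections from a single shared send buffer.
 * The repl_group_* API is assumed for this sketch.
 */
#include <stddef.h>

struct repl_group; /* opaque handle for a replication group of connections */

struct repl_group *repl_group_create(const int *conn_tids, size_t n_conns);
int  repl_group_write(struct repl_group *g, const void *buf, size_t len);
void repl_group_destroy(struct repl_group *g);

/* Replicate one write to the three destination connections. */
int replicate_write(const void *data, size_t len, int tid_a, int tid_b, int tid_c)
{
    int tids[3] = { tid_a, tid_b, tid_c };
    struct repl_group *g = repl_group_create(tids, 3);
    if (g == NULL)
        return -1;
    /* A single buffer is handed to the device; the device sends it on all
     * three connections from the same SHM send buffer.                    */
    int rc = repl_group_write(g, data, len);
    repl_group_destroy(g);
    return rc;
}
```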
There can be failures in equipment during the replication process; for example, node5 250 might become unreachable during the replication process, and it is important that the replication process complete successfully in spite of failures if possible. When node5 becomes unreachable, the TCP protocol retry mechanism will periodically try to resend the data that has not been acknowledged but will eventually give up after a settable maximum number of attempts has been made. The connection from node0 to node2 and the connection from node0 to node3 will independently send their data, and as long as node2 and node3 are working correctly the sent data will be acknowledged; these two connections are not affected by the connection from node0 to node5 falling behind because its sent data is not being acknowledged. Once the sending connection from node0 to node5 gives up attempting to send, the connection will be aborted, and as part of the abort process the offload device will free any memory resources held by the aborting connection: the page reference count will be decremented for each of the pages allocated to the failing connection. When the failing connection is the last connection holding on to a page, the page will be returned to the free list pool. The effect of the failure of the connection from node0 to node5 is therefore not detected by the other connections in the replication group; the other connections are not affected by a failure in one of the other connections in the group. In this failure case the software on node0 will be notified of the failure of one of the replication group connections and will need to react accordingly, in this case by creating the copy of the data that was not successfully created as part of the replication process and storing it on a different node instead.
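A minimal sketch of the corresponding abort-time cleanup follows, again with hypothetical structures (offload_conn, held_pages) and reusing the shm_page_release routine from the earlier sketch.

```c
/*
 * Sketch of abort-time cleanup for a failed replication-group connection.
 * offload_conn and held_pages are hypothetical; shm_page_release() is the
 * release routine from the earlier shared-page sketch.
 */
struct shm_page;
void shm_page_release(struct shm_page *p);

struct offload_conn {
    struct shm_page **held_pages; /* pages still allocated to this connection */
    unsigned          n_held;
};

/* Free every shared page held by an aborting connection; the last holder
 * of each page returns it to the free list pool.                           */
static void conn_abort_free_pages(struct offload_conn *c)
{
    for (unsigned i = 0; i < c->n_held; i++)
        shm_page_release(c->held_pages[i]);
    c->n_held = 0;
}
```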
The protocol processing pipeline employs CPL messages transmitted between the protocol processing pipeline circuitry, the host, and a control plane processor; the pipeline both receives and sends such messages. (The host side is typically network interface circuitry driver software operating on the host, and there is typically a control plane processor in the same interface circuitry as the protocol processing pipeline.) CPL messages are described, for example, in U.S. Pat. No. 7,945,705 entitled “METHOD FOR USING A PROTOCOL LANGUAGE TO AVOID SEPARATE CHANNELS FOR CONTROL MESSAGES INVOLVING ENCAPSULATED PAYLOAD DATA MESSAGES.” For this application, the CPL_TX_DATA_PACK message is added, which is shown in
Refer now to
For the send direction from ACE or PCIe, a multi-header CPL_TX_DATA_PACK arrives in 104a, and the arbiter 102 grants all the CPL_TX_DATA headers through the pipeline without allowing any other source of arbitration, such as Ethernet packets or internal events such as timer events or tx or rx modulation events, to win the grant. The CPL_TX_DATA headers each contain a connection identifier tid; each in succession looks up its 4-tuple information in DB 110, and when it reaches the connection manager 112 a new page or pages are allocated if required. The CPL_TX_DATA_PACK message is shown in
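Purely as an illustration of the multi-header idea, a hypothetical layout for such a message might look as follows; the field names, widths, and ordering are assumptions for this sketch and do not reproduce the actual CPL wire format.

```c
/*
 * Hypothetical layout of a multi-header CPL_TX_DATA_PACK work request:
 * one CPL_TX_DATA header per replication-group member (tid), followed by
 * a single copy of the payload shared by all members. Field names, widths,
 * and ordering are illustrative only.
 */
#include <stdint.h>

struct cpl_tx_data_hdr {
    uint32_t tid;   /* connection identifier of one replication-group member */
    uint32_t len;   /* payload bytes to send on this connection              */
    uint32_t flags; /* per-connection send flags                             */
};

struct cpl_tx_data_pack {
    uint8_t  opcode;              /* CPL_TX_DATA_PACK opcode                    */
    uint8_t  n_headers;           /* number of CPL_TX_DATA headers that follow  */
    uint16_t reserved;
    struct cpl_tx_data_hdr hdr[]; /* n_headers entries, then the shared payload */
};
```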
The send buffer is accessed through a virtual address mechanism in which the high-order bits of the TCP sequence numbers are mapped to memory addresses using page tables, and the low-order bits of the sequence numbers form the offset within those pages. The location of the data is flexible: it can reside in SRAM, in DRAM, or in system memory. The access to this memory can be through a second level of virtual memory access; e.g., the system memory is typically accessed through a second table that in effect maps from the protocol offload address space to virtual or physical system memory addresses. The SHM mechanism is not dependent on the physical location of the send buffer or the existence of these secondary mappings; the SHM mechanism operates in a virtual address space with the location and multiple levels of mapping hidden.
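A minimal sketch of this first-level mapping, assuming an illustrative 4 KB page size and a simple flat page table (both assumptions for the sketch):

```c
/*
 * Sketch of the per-connection mapping from a TCP sequence number to a
 * send-buffer address: low-order bits give the offset within a page,
 * high-order bits index the connection's page table. The 4 KB page size
 * and flat table are illustrative.
 */
#include <stdint.h>

#define PAGE_SHIFT 12u                            /* 4 KB pages (illustrative) */
#define PAGE_OFFSET_MASK ((1u << PAGE_SHIFT) - 1u)

struct send_buf {
    uint32_t  seq_base;   /* TCP sequence number of byte 0 of the buffer */
    uint64_t *page_table; /* maps page index to a memory address         */
};

/* Translate a TCP sequence number into a buffer memory address. */
static uint64_t seq_to_addr(const struct send_buf *sb, uint32_t seq)
{
    uint32_t off  = seq - sb->seq_base;           /* byte offset in the buffer */
    uint32_t page = off >> PAGE_SHIFT;            /* high-order bits           */
    return sb->page_table[page] + (off & PAGE_OFFSET_MASK); /* low-order bits  */
}
```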
The processing of the multi-header CPL_TX_DATA_PACK is separate from the processing of the offloaded connection, i.e., each of the offloaded connections operates without knowledge of the SHM abstraction that is being employed. The offloaded connections that use the SHM abstraction, i.e., that are participating in a replication group, support all the features supported for connections that do not use the SHM abstraction. For example, the offloading of the iWARP RDMA and iSCSI protocols and of encryption/decryption and authentication protocols is still supported. The protocol processing part is not dependent on the pages being shared, and each offloaded connection progresses at its own pace and releases the allocated pages, decrementing the reference count when the particular connection has finished using a page. When all the connections have finished using a page, the reference count goes down to zero and, according to the SHM abstraction, the page is released back to the free-list pool. The same applies to a connection-closing Ethernet TCP/IP packet that carries a FIN or a RST or any other of the TCP flags: such packets are processed independently of the sharing, and the same applies to CPL close and abort messages that are initiated from the PCIe/ACE side. The SHM abstraction provides the illusion, to the processing of each individual connection, of dedicated memory pages, and it is left to the free-list manager to manage the abstraction. The independent processing of the CPL abort and TCP RST messages for the different connections can be important when one or more of the replication connections fails due to hardware or software failure somewhere in the path from the sender to one of the receivers. Such failures can include, for example, power outages and end equipment malfunctions of various kinds. The failing connection in this case periodically attempts to retransmit data, according to the rules specified as part of the TCP protocol specification, until the sender finally gives up after some number of re-transmit attempts. The control plane receives a CPL abort message from the connection and then sends a message to a higher layer software entity that is managing the replication process. The failed connection is thereby removed from the replication group, and the higher layer software entity reacts to the connection failure according to its own rules and/or procedures such as, for example, designating a different connection to replace the failed connection in the replication group. As part of a connection abort, any resources held by an offloaded connection are freed: the memory page reference count is decremented for any pages allocated to the aborting connection, and the memory pages are returned to the memory page free list if the failing connection is the last connection to hold onto the pages.
Refer now to
For received packets that arrive on 104b to the arbiter, the 4-tuple is looked up to determine the tid of the offloaded connection, which in turn is used to fetch the connection state from the CB 114. When the connection manager determines that the packet is to be accepted, it allocates a REG entry 108 to store the connection state update and issues an rx modulation event to each of a list of connections, each identified by a tid, stored as part of the state of the connection. These rx modulation events have a payload that includes each tid and a REG index value indicating where the state update is stored. When the register REG is allocated, it has a reference count that indicates the number of sharers. Each of the rx-modulation events is injected into the pipeline in 104d and uses its tid to look up the connection state, and when it reaches the connection manager 112 it uses the register state to update the connection state and decrements the reference count of the register entry. When the reference count reaches 0, the register entry is freed and is available to be used for another group of receive sharers. In the case when a REG is not available when one needs to be allocated, the Ethernet TCP/IP packet is dropped, and in that case it will eventually be resent by the sender.
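A minimal sketch of such a reference-counted REG entry and its allocate/release discipline is shown below; the structure and function names (reg_entry, reg_alloc, reg_release) are hypothetical.

```c
/*
 * Sketch of a reference-counted REG entry used for receive-side sharing.
 * reg_entry, reg_alloc and reg_release are hypothetical names.
 */
#include <stdint.h>
#include <stdbool.h>

struct reg_entry {
    uint32_t refcnt;       /* sharers that still have to apply the update   */
    uint32_t state_update; /* connection state update stored by the packet  */
    bool     in_use;
};

/* Allocate a REG entry for n_sharers; return -1 if none is free, in which
 * case the packet is dropped and will eventually be resent by the sender.  */
static int reg_alloc(struct reg_entry *regs, unsigned n_regs, uint32_t n_sharers)
{
    for (unsigned i = 0; i < n_regs; i++) {
        if (!regs[i].in_use) {
            regs[i].in_use = true;
            regs[i].refcnt = n_sharers;
            return (int)i;             /* REG index carried in the rx event */
        }
    }
    return -1;
}

/* Called by each sharer's rx-modulation event after applying the update. */
static void reg_release(struct reg_entry *regs, int idx)
{
    if (--regs[idx].refcnt == 0)
        regs[idx].in_use = false;      /* free for another group of sharers */
}
```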
It is possible to receive multiple packets that each allocate a REG entry, so that multiple rx-modulation events are issued to each of the sharers. The rx-modulation events are FIFO ordered, and multiple rx-modulation events can therefore be outstanding to the same tid and are processed in order. It is also possible to update the same REG entry, to keep track of which sharers have processed earlier updates, and to issue rx-modulation events only for those sharers. Refer now to
The tid that receives the multi-cast can, in addition, be a proxy tid that forwards its payload to another connection that is sending. The proxy messages, for example CPL_RX2TX_DATA shown in
The interface to the offloaded replication can either be part of, for example, RDMADataStreamer, or it can be part of a generalized memcpy( )-like Linux library function. In the first case, the multiple connections would be created within the Java class, and the CPL_TX_DATA_PACK would then be generated within that Java class. It is also possible that the protocol offload device participates in a ccNUMA protocol and that the interface to the replication facility includes a memcpy(dst0, dst1, . . . , dstn-1, src) library function that copies the data to dst0, dst1, . . . , dstn-1 from src using CPL_TX_DATA_PACK.
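A hypothetical shape for such a library entry point is sketched below; the name repl_memcpy and the array-of-destinations calling convention are assumptions for this sketch (the memcpy(dst0, dst1, . . . , dstn-1, src) form above is variadic-like), and the function would internally submit a single CPL_TX_DATA_PACK.

```c
/*
 * Hypothetical shape of the memcpy( )-like replication entry point. The
 * destinations are passed as an array here; internally the library would
 * submit a single CPL_TX_DATA_PACK covering all destinations.
 */
#include <stddef.h>

int repl_memcpy(void *const dst[], size_t n_dst, const void *src, size_t len);

/* Example use: replicate one 4 KB block from src to three destinations. */
static int replicate_block(void *d0, void *d1, void *d2, const void *src)
{
    void *dsts[3] = { d0, d1, d2 };
    return repl_memcpy(dsts, 3, src, 4096);
}
```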
For receive side replication, the configuration may be accomplished through the Java class configuration, e.g., by configuring the receiving connection to copy the received data to objects that accomplish the local write part of the write operation.
The send and receive side replication can also be managed via a multicast join/leave type of mechanism, in which a join/leave is used to join or leave a multi-cast group implemented with the TCP/IP replication. Joining entails creating multiple connections in the send direction, one for each distinct destination, with another option being to create a proxy connection on the receive side. The replication on the receive side can be used to multi-cast to multiple subscribers reachable over the same PCIe bus or ACE interface.
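A hypothetical join/leave style interface for managing such a replication group is sketched below; all names (mc_group, mc_join, mc_leave, mc_send) are assumptions for this sketch rather than an actual API.

```c
/*
 * Hypothetical join/leave interface for a reliable multi-cast group built
 * on the TCP/IP replication facility. All names are assumptions.
 */
#include <stddef.h>

struct mc_group; /* replication group acting as a reliable multi-cast group */

struct mc_group *mc_group_create(void);
int  mc_join(struct mc_group *g, const char *peer_addr, unsigned short port); /* returns member tid */
int  mc_leave(struct mc_group *g, int tid);
int  mc_send(struct mc_group *g, const void *buf, size_t len); /* replicated send to all members */
```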
We have thus described a system and method in which a protocol offload device (such as a NIC to offload transport layer protocol processing from a host) may reliably replicate and transmit data, originating from the host, to a plurality of peers via a network. The reliable transmission may be, for example, at the transport layer. Further, the NIC may employ a shared memory mechanism such that the host need not provide multiple copies of the data to the NIC; rather, the host may consider the transmission, from its point of view, to be a single transaction. The NIC handles transmission of the same data to multiple peers and, upon completion of reliable transmission, notifies the host that the transmission transaction is complete. Furthermore, the protocol offload device may reliably replicate received packets from the network to multiple destinations on a host. Finally, the protocol offload device is capable of simultaneously replicating received packets to multiple destinations on that host and proxying the received packets to connections that send the packets out to the network.