Certain embodiments of the invention may be found in a method and system for network protocol offloading. Aspects of the invention may comprise establishing a path between a host socket and an offloaded socket in a TCP offload engine (TOE) for offloading a TCP connection to the TOE. Offload functions associated with extensions to the host socket may enable TCP offload and IP layer bypass extensions in a network device driver for generating the offload path. In this regard, a flag, for example, in the host socket extensions may indicate when connection offloading is to occur. The offload path may be established after the TCP connection is established via a native stack in the host or after a listening socket is offloaded to the TOE for establishing the connection. Data for retransmission for the offloaded connection may be stored in the host or in the TOE. The offloaded connection may be terminated in the TOE or may be migrated to the host for termination.
The host 101 may comprise suitable logic, circuitry, and/or code that may enable performing user applications that may require network connections. For example, the host 101 may be an application server, a web server, and/or an email server. The host 101 may enable at least one user application to execute in at least one processor. The host 101 may also enable TCP processing or protocol offloading of TCP sessions or connections that may be associated with user applications. In this regard, the host 101 may select one of the TCP connections to be offloaded to the TOE 114 in the NIC 112. For example, the host 101 may offload connections that may be established for long periods of time, such as connections to security or emergency systems. In another example, the host 101 may itself process connections that may be set up and terminated quickly, as a result of the overhead that may be required during the connection. Other criteria may be utilized without departing from the scope of the invention.
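By way of illustration only, the selection criterion described above may be pictured as a simple policy function. The following is a minimal sketch, assuming a hypothetical per-connection lifetime estimate (`expected_lifetime_s`) and an arbitrary threshold; neither the field nor the value is specified in this description.

```python
from dataclasses import dataclass

# Hypothetical threshold in seconds; the description gives no concrete value.
LONG_LIVED_THRESHOLD_S = 60.0


@dataclass
class Connection:
    name: str
    expected_lifetime_s: float  # assumed estimate of how long the connection stays open


def should_offload(conn: Connection) -> bool:
    """Return True when the connection is expected to live long enough to offload."""
    return conn.expected_lifetime_s >= LONG_LIVED_THRESHOLD_S
```

Under these assumptions, a long-lived connection to a security system would be offloaded, while a short request/response exchange would remain on the native stack. Other criteria could be substituted in `should_offload` without changing the surrounding logic.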
The host 101 may also enable the coexistent operation of a native software stack associated with an operating system (OS) executing in at least one of the processors within the host 101 with an offloaded protocol stack associated with software executing in the TOE 114. The offloaded protocol stack may process TCP sessions offloaded to the TOE 114, while the native stack may process those TCP sessions that remain in the host 101. In this regard, the host 101 may enable creation of a communication path between the host 101 and the TOE 114 for TCP offloading. The communication path may be referred to as a plumbing channel between a host endpoint and an offload endpoint, that is, a channel between the native stack and the offloaded protocol stack. The host 101 may enable extending the capabilities of operating systems, such as Linux and/or Windows OS, for example, to enable creation of plumbing channels for TCP offloading operations that remain transparent to user applications.
The CPUs 106a and 106b in the host 101 may comprise suitable logic, circuitry, and/or code that may be enabled for processing user applications, networking connections, and/or other operations, such as management and/or maintenance operations, for example. While
The south bridge 104 may be communicatively connected to the north bridge 102. The south bridge 104 may comprise suitable logic, circuitry, and/or code that may enable I/O expansion by allowing the I/O peripherals 110 to communicate with the north bridge 102. The I/O peripherals 110 may comprise suitable logic, circuitry, and/or code that may enable introducing information and/or commands to the host 101 and/or receiving and/or displaying information from the host 101.
The NIC 112 may comprise suitable logic, circuitry, and/or code that may enable performing networking processing operations. The NIC 112 may be communicatively coupled to the host 101. In some instances, the host 101 may be communicatively coupled to more than one NIC 112. Similarly, the NIC 112 may be communicatively coupled to more than one host 101. The TOE 114 in the NIC 112 may comprise suitable logic, circuitry, and/or code to perform network-processing operations offloaded from the host 101. In this regard, the TOE 114 may perform network-processing operations for at least one TCP connection offloaded from the host 101. The GbE MAC/PHY block 116 in the TOE 114 may comprise suitable logic, circuitry, and/or code to perform OSI layer 2 and layer 1 operations for communicating information in a TCP connection. While the GbE MAC/PHY block 116 is shown to support 1 Gigabit-per-second (Gbps) communication rate and/or 10 Gbps communication rate, it need not be so limited and may support a plurality of communication rates such as 10 Megabits-per-second (Mbps) and/or 100 Mbps, for example.
In operation, a user application, such as a web server application, for example, may require that a connection be established with a remote device in the network. The host 101 may establish the connection and may determine whether the connection is to be handled by the TOE 114. When the TOE 114 is to handle the network connection, the host 101 may offload the connection to the TOE 114. In some instances, the TOE 114 may be utilized to establish the connection when the host 101 is aware that the connection to be established is to be handled by the TOE 114. The TOE 114 may handle the TCP-related networking operations during the time the TCP connection is offloaded. In some instances, the host 101 may migrate the TCP connection back to the host 101 for handling TCP-related networking operations. When the connection is to be terminated, either the TOE 114 or the host 101 may handle the connection termination.
At the user level space 201a, there are shown user applications such as remote direct memory access (RDMA) applications and library (apps/lib) 202 and sockets applications 204, for example. At the kernel space level 201b, there are shown a plurality of software modules such as a system call interface module 206, a file system module 208, a small computer system interface (SCSI) module 210, an Internet SCSI (iSCSI) module 214, an iSCSI extension to RDMA (iSER) module 212, an RDMA VERB module 222, a switch module 216, an offload module 218, a TCP/IP module 220, and a network device driver module 224, for example. At the hardware device space 201c, there are shown a plurality of software modules such as a messaging interface and DMA interface module 226, an RDMA module 228, an offload module 230, a raw sockets Ethernet (RAW ETH) module 232, a TCP/IP engine module 234, and a MAC/PHY interface module 236, for example.
The modules in the kernel space 201b and in the hardware device space 201c that are shown with hash lines correspond to extensions in the system architecture that may enable creation of a communication path between the host 101 and the TOE 114 for TCP offloading. The communication path may be referred to as a plumbing channel between a host endpoint and an offload endpoint, that is, a channel between the native stack and the offloaded protocol stack. The switch 216, for example, may be utilized to intercept a system call for a TCP operation. The switch 216 may determine, based on information in the system call, whether the particular TCP session or connection has been offloaded to the TOE 114 or has not been offloaded to the TOE 114. When the TCP session has not been offloaded to the TOE 114, communication between the host CPU and the TOE may occur via the TCP/IP module 220 in the host CPU and the RAW ETH module 232 in the TOE. The path comprising the TCP/IP module 220 and the RAW ETH module 232 may be referred to as path 1. The MAC/PHY interface module 236 may enable communication between the TOE and the network when path 1 is selected via the switch 216.
When the TCP session has been offloaded to the TOE 114, communication between the host CPU and the TOE may occur via the offload module 218 in the host CPU and the offload module 230 in the TOE. The path or channel comprising the offload module 218 and the offload module 230 may be referred to as path 2. Path 2 may also be referred to as a plumbing channel, for example. In this regard, the offload module 218 may correspond to the host endpoint and the offload module 230 may correspond to the offload endpoint of the plumbing channel. The TCP/IP engine 234 and the MAC/PHY interface module 236 may enable communication between the TOE and the network when path 2 or plumbing channel is selected via the switch 216.
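By way of illustration only, the path selection performed by the switch 216 may be sketched as follows. This is a simplified model, not the kernel implementation: the `Session` class and its `offloaded` attribute are hypothetical stand-ins for the per-session information carried by the system call, and the returned strings merely name the modules described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Session:
    session_id: int
    offloaded: bool = False  # assumed marker: True once the session is offloaded to the TOE


def select_path(session: Session) -> List[str]:
    """Return the chain of modules a TCP system call would traverse for this session."""
    if session.offloaded:
        # Path 2 (plumbing channel): host offload module to TOE offload module.
        return ["offload 218", "offload 230", "TCP/IP engine 234", "MAC/PHY 236"]
    # Path 1: native TCP/IP module in the host, raw Ethernet module in the TOE.
    return ["TCP/IP 220", "RAW ETH 232", "MAC/PHY 236"]
```

In this sketch, both paths end at the MAC/PHY interface module 236; only path 2 passes through the TCP/IP engine 234 in the TOE, since on path 1 the TCP/IP processing has already been done by the native stack.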
When TCP offload support is provided in the software architecture, as in the software architecture 200, for example, RDMA capabilities may also be provided on top of the offloading capabilities. For example, RDMA VERB module 222 may provide a management layer of software that enables controlling the hardware for RDMA operations. In this regard, the communication between the RDMA VERB module 222 and the system call interface module 206 and the iSER module 212 may be referred to as an RDMA control path for the software architecture 200. An RDMA data path may be established based on the RDMA control operations between the RDMA apps/lib 202 in the user level space 201a and the RDMA module 228 in the hardware device space 201c, for example.
At the user level space 301a, a socket system call interface 302 may communicate with a socket data structure (sock) 304 at the INET level 301b. The INET level 301b may correspond to an Internet address family that supports communication via TCP/IP. The sock 304 may comprise a plurality of member functions such as inet_stream_connect 306, inet_accept 308, inet_sendmsg 310, inet_recvmsg 312, and additional member functions 314, for example. An application may call sock 304 via the socket system call interface 302 when trying to open a new TCP connection. The application may then call a member function associated with sock 304, such as inet_stream_connect 306, for example.
The sock 304 data structure may be utilized to call on and communicate with the sock data structure (sk) 316 at the TCP layer 301c. The sk 316 may comprise a plurality of member functions such as tcp_v4_connect 322, tcp_accept 324, tcp_sendmsg 326, tcp_recvmsg 328, and additional member functions 330, for example. An additional data structure, such as offload socket (offl_sk) 318, may be attached to sk 316 for extending the capabilities of sk 316 to enable channel plumbing. The sk 316 may comprise at least one flag that may indicate if a TCP connection is offloaded. Associated with offl_sk 318 may be a plurality of offload functions (offl_funcs) 320. When the offload session flag in the sk 316 is set, the offload functions 320 may enable bypassing the TCP, IP, and Ethernet operations 332 in the TCP layer 301c and in the IP layer 301d, and passing directly to the device driver layer 301e. The device driver layer 301e may comprise a plurality of functions such as open 336, stop 338, hard_start_xmit 340, set_config 342, set_mac_address 344, and additional functions 346, for example. Offload extensions 334 to the device driver layer 301e may enable the network device driver to provide TCP offloading and kernel bypass operations. When the offload session flag in the sk 316 is not set, a direct connection between the TCP layer 301c and the IP layer 301d may take place. Notwithstanding the embodiment of the extensions to the native stack described in
The operating system 402 may comprise a native TCP/IP stack 410 and a data structure 408 associated with the offloaded TCP connection. The data structure 408 may be a host socket, for example. The data structure 408 may enable data to flow between the application in the host system and the TOE 406, which bypasses the kernel level. Communication between the operating system 402 and the TOE device driver 404 may occur via offload functions, such as the offload functions 320 in
The TOE device driver 404 may communicate with the TOE 406 via a messaging interface, such as the messaging interface and DMA interface 226 in the hardware device level 201c in
On the host side of the endpoint association, the first TCP session 420 may comprise a socket_1 422 corresponding to the TCP layer, a route_1 424 corresponding to the IP layer, and an interface data structure (ifa) 426 corresponding to the TOE device driver. The ifa 426 may correspond to the Ethernet interface for data communication, for example. On the offloaded side of the endpoint association, the first TCP session 420 may comprise a plumbing channel 1 (plumb_1) 428 corresponding to session-specific information, a route_1 430 corresponding to cached information, and a MAC interface 1 (MAC_int_1) 432 corresponding to permanent communication information. The socket_1 422 in the host may be associated with the plumb_1 428 in the TOE. Similarly, routing information in route_1 424 in the host may be associated with routing information in route_1 430 in the TOE. Moreover, the ifa 426 in the host may be associated with the MAC_int_1 432 in the TOE.
The second TCP session 440 may comprise, on the host side of the endpoint association, a socket_2 442 corresponding to the TCP layer, and on the offloaded side of the endpoint association, a plumb_2 448 corresponding to session-specific information. The second TCP session 440 may utilize the same routing and interfacing capabilities, that is, route_1 430 and MAC_int_1 432 associated with route_1 424 and ifa 426, respectively, that the first TCP session 420 utilizes, even when a different socket and plumbing channel exist for each of the TCP sessions.
The third TCP session 450, on the host side of the endpoint association, may comprise a socket_3 452 corresponding to the TCP layer, a route_2 454 corresponding to the IP layer, and an ifa 456 corresponding to the TOE device driver. On the offloaded side of the endpoint association, the third TCP session 450 may comprise a plumbing channel 3 (plumb_3) 458 corresponding to session-specific information, a route_2 460 corresponding to cached information, and a MAC interface 2 (MAC_int_2) 462 corresponding to permanent communication information. The socket_3 452 in the host may be associated with the plumb_3 458 in the TOE. Similarly, routing information in route_2 454 in the host may be associated with routing information in route_2 460 in the TOE. Moreover, the ifa 456 in the host may be associated with the MAC_int_2 462 in the TOE.
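By way of illustration only, the three endpoint associations described above may be modeled as records in which the socket and plumbing channel are unique per session while the routing and interface entries may be shared. The following is a minimal sketch; the class name, field names, and the labels `ifa_1` and `ifa_2` are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointAssociation:
    host_socket: str  # TCP-layer endpoint in the host
    plumb: str        # session-specific plumbing channel in the TOE
    route: str        # routing entry (host IP layer, cached in the TOE)
    host_ifa: str     # interface data structure for the TOE device driver
    toe_mac: str      # MAC interface in the TOE


# Keys are the session reference numerals used in the description.
sessions = {
    420: EndpointAssociation("socket_1", "plumb_1", "route_1", "ifa_1", "MAC_int_1"),
    440: EndpointAssociation("socket_2", "plumb_2", "route_1", "ifa_1", "MAC_int_1"),
    450: EndpointAssociation("socket_3", "plumb_3", "route_2", "ifa_2", "MAC_int_2"),
}


def shares_route_and_interface(a: EndpointAssociation, b: EndpointAssociation) -> bool:
    """True when two sessions reuse the same routing and interface entries."""
    return (a.route, a.host_ifa, a.toe_mac) == (b.route, b.host_ifa, b.toe_mac)
```

In this model the first and second sessions share routing and interface entries while keeping distinct sockets and plumbing channels, and the third session uses entries of its own.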
In operation, when data is transmitted from a particular TCP session, the data may flow and/or utilize information from the various components illustrated in
Notwithstanding the exemplary endpoint associations for TCP offloading illustrated in
On the client 501 side, an application may call a socket 518 which may call a binding operation, bind 520, which may be utilized to get the local IP address and port number. After bind 520, a connect operation 522 may be called to initialize a handshake process to create or establish a TCP connection with the server 500. Both the server 500 and the client 501 may utilize their respective native stacks to handle the opening process. For example, the client 501 may communicate a request signal, SYN, via the RAW ETH 530 from the connect operation 522 to start the opening process. The server 500 may be listening until it receives the SYN signal via the RAW ETH 516 and may call an accept operation 508 to handle the opening process. The RAW ETH 530 and the RAW ETH 516 may be the same or substantially similar to the RAW ETH 232 illustrated in
After the TCP connection has been established, the server 500 may determine that the TCP connection is to be offloaded to the TOE 514. In this regard, the accept operation 508 in the server 500 may spawn a new socket 510. The new socket 510 may generate a message or signal 511a, such as MSG_TCP_CREATE_PLUMB, for example, to the TOE 514 to create a TCP plumbing channel. The message 511a may comprise information regarding the address of the new socket 510. The TOE 514 may respond by generating a message or signal 511b, such as TCP_PLUMB_RSP, for example, to the new socket 510 with the address of the plumbing channel 512. Once the endpoint association is established between the new socket 510 and an offloaded socket in the TOE 514 via the plumbing channel 512, the TCP connection with the client 501 may be offloaded to the TOE 514.
Similarly, after the TCP connection has been established, the client 501 may determine that the TCP connection is to be offloaded to the TOE 528. In this regard, the connect operation 522 in the client 501 may spawn a new socket 524. The new socket 524 may generate a message or signal 525a, such as MSG_TCP_CREATE_PLUMB, for example, to the TOE 528 to create a TCP plumbing channel. The message 525a may comprise information regarding the address of the new socket 524. The TOE 528 may respond by generating a message or signal 525b, such as TCP_PLUMB_RSP, for example, to the new socket 524 with the address of the plumbing channel 526. Once the endpoint association is established between the new socket 524 and an offloaded socket in the TOE 528 via the plumbing channel 526, the TCP connection with the server 500 may be offloaded to the TOE 528.
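By way of illustration only, the exchange of the MSG_TCP_CREATE_PLUMB and TCP_PLUMB_RSP messages may be sketched as a request/response between a host socket and a TOE. This is a toy model under stated assumptions: the dictionary message layout, the `Toe` and `HostSocket` classes, and the address values are hypothetical, and only the two message names come from the description.

```python
import itertools


class Toe:
    """Toy TOE that allocates plumbing-channel addresses on request."""

    def __init__(self):
        self._next_addr = itertools.count(0x1000)  # arbitrary illustrative address range
        self.channels = {}  # plumbing-channel address -> host socket address

    def handle(self, msg):
        if msg["type"] == "MSG_TCP_CREATE_PLUMB":
            plumb_addr = next(self._next_addr)
            # Record the endpoint association: plumbing channel <-> host socket.
            self.channels[plumb_addr] = msg["socket_addr"]
            return {"type": "TCP_PLUMB_RSP", "plumb_addr": plumb_addr}
        raise ValueError("unsupported message: %r" % msg["type"])


class HostSocket:
    """Toy host socket spawned after the TCP connection is established."""

    def __init__(self, addr):
        self.addr = addr
        self.plumb_addr = None
        self.offloaded = False

    def offload(self, toe):
        # The request carries the address of the new host socket.
        rsp = toe.handle({"type": "MSG_TCP_CREATE_PLUMB", "socket_addr": self.addr})
        if rsp["type"] == "TCP_PLUMB_RSP":
            # The response carries the address of the plumbing channel.
            self.plumb_addr = rsp["plumb_addr"]
            self.offloaded = True
        return self.offloaded
```

After `offload` returns, each side holds the address of its peer endpoint, which is the condition the description gives for offloading the TCP connection. The same sketch applies symmetrically to the server and the client.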
Regarding the offloading listening socket operations 602, before a TCP connection is established, a host socket (h_so) 622 and a listening socket 606 may be called by a host. The host socket 622 may correspond to an initial host endpoint for TCP offloading endpoint association. A plumbing channel 605 may be created between the host socket 622 and an offload socket (offl_so) 610 in the TOE. The offload socket 610 may correspond to an initial offload endpoint for TCP offloading endpoint association. The listening socket 606 may be utilized for listening to requests that may be sent by a client for opening a TCP connection. The listening socket 606 may send a message or signal 607a with the address of the listening socket 606 to the peer TOE to create a plumbing channel to enable offloading the listening operation to the TOE. The TOE may send a message or signal 607b back to the listening socket 606 in the host with the plumbing channel address. Once the plumbing channel is established, the listening operation may be offloaded to an offloaded listening socket 608.
The offloaded listening socket 608 may be utilized to open a TCP connection via a handshake process. When a request, SYN, is received from a client, the offloaded listening socket 608 may create a new offloaded socket (new_offl_so) 612 in the TOE and may also generate an acknowledgment, ACK, and its own synchronization request, SYN, back to the client. The new_offl_so 612 may be incomplete. When the client responds by sending its acknowledgment, ACK, the new_offl_so 612 may be completed and may comprise information regarding its own address and the address of the host socket 622. The new_offl_so 612 may correspond to a new offload endpoint for TCP offloading endpoint association.
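By way of illustration only, the handshake handled by the offloaded listening socket may be sketched as follows. This is a simplified model: the `OffloadedListener` class, the use of dictionaries for sockets, and the representation of TCP flags as sets of strings are assumptions made for this example, and sequence numbers, timers, and error handling are omitted.

```python
class OffloadedListener:
    """Toy model of an offloaded listening socket performing the three-way handshake."""

    def __init__(self, host_socket_addr):
        self.host_socket_addr = host_socket_addr
        self.pending = {}    # client -> incomplete new offloaded socket
        self.completed = []  # completed new offloaded sockets

    def on_segment(self, client, flags):
        """Process one incoming segment; return the flags of the reply, or None."""
        if flags == {"SYN"}:
            # Create the new offloaded socket in an incomplete state and answer SYN+ACK.
            self.pending[client] = {"client": client, "complete": False}
            return {"SYN", "ACK"}
        if flags == {"ACK"} and client in self.pending:
            # The client's ACK completes the socket, which now records the host socket address.
            so = self.pending.pop(client)
            so["complete"] = True
            so["host_socket_addr"] = self.host_socket_addr
            self.completed.append(so)
        return None
```

An ACK from a client with no pending socket is ignored in this sketch; a real implementation would need to handle that case according to the TCP specification.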
The new_offl_so 612 may be part of the offloading socket processing operations 604 associated with the offloading of the TCP connection. After the connection is established and the new_offl_so 612 is completed, the TOE may issue a message or signal to the host, such as MSG_TCP_NASCENT, for example, to indicate that the TCP connection has been established. The host may allocate a new host socket (new_ho_so) 620 as a result of the MSG_TCP_NASCENT message and may issue or send a message, such as a MSG_TCP_NASCENT_DONE, for example, to indicate to the TOE that the new host socket 620 has been allocated. The new host socket 620 may correspond to a new host endpoint for TCP offloading endpoint association. The message MSG_TCP_NASCENT_DONE may comprise information regarding the address of the new host socket 620 and of the new offloaded socket 612 to establish a plumbing channel that enables TCP offloading. Notwithstanding the processes or operations illustrated in
The network 806 may comprise suitable logic, circuitry, and/or code that may enable communication between the remote system 808 and the host 802 via the NIC 804. The remote system 808 may comprise suitable logic, circuitry, and/or code that may enable establishing a communication link for exchanging data with the host 802 via the network 806 and the NIC 804.
During transmission operation, the host socket 810 may send a message or signal, such as MSG_TCP_TX_REQ, to the TOE 816 to request that a packet of data from data_1 812 and/or data_2 814 be transmitted to the remote system 808 via the network 806. After the request is received, the data packet may be direct memory accessed (DMA) by the TOE 816 from the host 802. The TOE 816 may frame the data packet and may transmit the framed data packet to the remote system 808 via the network 806. When the remote system 808 receives the framed data packet, it may generate an acknowledgment message, ACK, that may be communicated to the TOE 816 via the network 806. After receiving the ACK message, the TOE 816 may generate a message or signal to the host socket 810 via the plumbing channel to release the transmitted data that was held in the send queue for retransmission purposes.
During transmission operation, the host socket 810 may send a message or signal, such as MSG_TCP_TX_REQ, to the TOE 816 to request that a packet of data from data_1 812 and/or data_2 814 be transmitted to the remote system 808 via the network 806. After the request is received, the data packet may be transferred via DMA by the TOE 816 from the host 802 and may be stored in the local copies data_1 818 and data_2 820. After the transfer is completed, the TOE 816 may generate a message or signal to the host 802 to indicate that the DMA transfer has been completed. The host 802 may release the transmitted data that was held in the send queue for retransmission purposes. The TOE 816 may frame the data packet from the local copies and may transmit the framed data packet to the remote system 808 via the network 806. When the remote system 808 receives the framed data packet, it may generate an acknowledgment message, ACK, that may be communicated to the TOE 816 via the network 806. After receiving the ACK message, the TOE 816 may release the transmitted data that was held in the offload send queue for retransmission purposes.
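By way of illustration only, the two retransmission-buffering alternatives described above differ in which side retains the transmitted data and in which event triggers its release. The following is a minimal sketch; the `SendQueue` class, the function names, and the use of a sequence number as a key are hypothetical and introduced only for this example.

```python
class SendQueue:
    """Holds transmitted data until it is no longer needed for retransmission."""

    def __init__(self):
        self._held = {}  # sequence number -> payload

    def hold(self, seq, payload):
        self._held[seq] = payload

    def release(self, seq):
        self._held.pop(seq, None)

    def __contains__(self, seq):
        return seq in self._held


# Alternative 1: retransmission data stays in the host until the peer's ACK arrives.
def send_host_buffered(host_q, seq, payload):
    host_q.hold(seq, payload)  # the TOE reads the data by DMA; the host keeps its copy


def on_ack_host_buffered(host_q, seq):
    host_q.release(seq)  # the TOE signals the host over the plumbing channel after the ACK


# Alternative 2: the TOE keeps a local copy, so the host may release on DMA completion.
def send_toe_buffered(host_q, toe_q, seq, payload):
    host_q.hold(seq, payload)
    toe_q.hold(seq, payload)  # local copy stored in the TOE after the DMA transfer
    host_q.release(seq)       # DMA-complete indication lets the host release early


def on_ack_toe_buffered(toe_q, seq):
    toe_q.release(seq)  # the TOE releases its own copy once the peer's ACK arrives
```

The trade-off suggested by this sketch is that the second alternative frees host memory sooner at the cost of buffer space in the TOE.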
In operation, the TOE 924 shown in
During an active closing operation, the local server 1002 may initiate closing or termination of the TCP connection. In this regard, the tcp_offl_disconnect 1008 may generate a message or signal, such as MSG_TCP_TX_REQ, for example, to the TOE 1010. The message may have a flag set, such as fin=1, for example, to indicate to the TOE 1010 that the TCP connection with the peer 1004 may be finished or terminated. In active closing, the TOE 1010 may generate a message or signal, such as FIN, for example, to the peer 1004 requesting to terminate or close the TCP connection. The peer 1004 may acknowledge the request with an ACK signal to the TOE 1010. The peer 1004 may also send a FIN message requesting termination of the TCP connection to the TOE 1010. The TOE 1010 may acknowledge receipt of the request with an ACK signal to the peer 1004. After sending the ACK signal to the peer 1004, the TOE 1010 may generate a message or signal, such as MSG_TCP_MIGRATE_IND, for example, to the host socket to have the TCP session or connection migrated to the local host portion of the local server 1002. In this regard, migrating the TCP connection to the host for further processing and termination may enable the native stack in the local host to handle TIME_WAIT state information associated with the TCP connection to be terminated. The local host may then wait for some period of time, for example, approximately sixty (60) seconds, and may clean up all the data structures on the TIME_WAIT state related to the closed TCP connection.
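By way of illustration only, the TIME_WAIT handling performed by the native stack after the connection is migrated back to the host may be sketched as a timed cleanup. This is a simplified model: the `MigratedConnection` class and its methods are hypothetical, time is passed in explicitly rather than read from a clock, and the sixty-second value is the approximate figure given above.

```python
TIME_WAIT_S = 60.0  # approximate wait period mentioned in the description


class MigratedConnection:
    """Toy model of a connection migrated to the host for TIME_WAIT processing."""

    def __init__(self, migrated_at):
        self.migrated_at = migrated_at  # time at which MSG_TCP_MIGRATE_IND was handled
        self.cleaned_up = False

    def tick(self, now):
        """Clean up the connection's data structures once the wait period has elapsed."""
        if not self.cleaned_up and now - self.migrated_at >= TIME_WAIT_S:
            self.cleaned_up = True
        return self.cleaned_up
```

Keeping this timer in the host rather than in the TOE means that TOE resources for the connection may be freed as soon as the migration completes.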
In passive closing, the peer 1004 may generate a message or signal, such as FIN, for example, to the TOE 1010 requesting to terminate or close the TCP connection. The TOE 1010 may acknowledge the request with an ACK signal to the peer 1004. In this regard, the TOE 1010 may change from an established TCP state 1012 to a close_wait TCP state 1014. The established TCP state 1012 indicates that a TCP connection is established while the close_wait TCP state 1014 indicates that the TOE 1010 is waiting for the local application to close or terminate the TCP connection. Once the local application calls the function tcp_close, a message MSG_TCP_TX_REQ with the termination flag fin set to 1 may be sent to the TOE 1010 via the offload function tcp_offl_disconnect 1008. The TOE 1010 may send a FIN message requesting termination of the TCP connection to the peer 1004. The TOE 1010 may change from the close_wait TCP state 1014 to the last_ack TCP state 1016, waiting for the peer 1004 to generate an acknowledgment, ACK, to close the connection. The TOE 1010 may handle the closing of the TCP connection with the peer 1004 and may generate a message or signal, such as MSG_TCP_UNPLUMB_IND, for example, to the local host portion of the local server 1002 to indicate that the TCP connection with peer 1004 has been closed.
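By way of illustration only, the passive-close sequence in the TOE may be expressed as a small state machine. The following is a minimal sketch: the event strings, the action strings, and the final `CLOSED` state name are labels chosen for this example, while the established, close_wait, and last_ack states and the message names come from the description.

```python
# (current state, event) -> (next state, action taken by the TOE)
TRANSITIONS = {
    ("ESTABLISHED", "recv FIN"): ("CLOSE_WAIT", "send ACK"),
    ("CLOSE_WAIT", "MSG_TCP_TX_REQ fin=1"): ("LAST_ACK", "send FIN"),
    ("LAST_ACK", "recv ACK"): ("CLOSED", "send MSG_TCP_UNPLUMB_IND to host"),
}


def run(events, state="ESTABLISHED"):
    """Apply events in order; return the final state and the list of actions taken."""
    actions = []
    for event in events:
        state, action = TRANSITIONS[(state, event)]
        actions.append(action)
    return state, actions
```

An event that is not valid in the current state raises `KeyError` in this sketch, which makes out-of-order sequences easy to spot when experimenting with the model.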
In operation, the offload functions 1122 associated with the offload socket 1120, may be utilized to offload the route cache 1114 and the ARP cache 1116 to the TOE 1104. In this regard, functions such as tcp_offl_rtalloc and tcp_offl_arpresolve, for example, may be utilized to indicate that the route cache 1114 and the ARP cache 1116 are to be offloaded. The offload functions 1122 may be utilized to generate a message or signal, such as MSG_TCP_CREATE_PLUMB, for example, to create a plumbing channel 1126 for offloading to the TOE 1104. The TOE 1104 may respond to the request with a message or signal, such as MSG_TCP_CREATE_PLUMB_RSP, to indicate that the plumbing channel 1126 has been created. Once the plumbing channel exists, the route cache 1114 and the ARP cache 1116 may be offloaded to the TOE 1104 as route cache 1128 and the ARP cache 1130, for example.
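By way of illustration only, the ordering constraint described above, in which the caches are offloaded only once the plumbing channel exists, may be sketched as follows. The function name, the use of dictionaries for the caches, and the sample addresses are hypothetical and chosen only for this example.

```python
def offload_caches(route_cache, arp_cache, plumbing_channel_created):
    """Return TOE-side copies of the host route and ARP caches.

    Raises RuntimeError when the plumbing channel has not been created yet.
    """
    if not plumbing_channel_created:
        raise RuntimeError("plumbing channel must exist before caches are offloaded")
    # Copy rather than alias, so the TOE-side caches are independent of the host's.
    return dict(route_cache), dict(arp_cache)


# Illustrative host-side entries (documentation addresses, not from the description).
host_route_cache = {"192.0.2.0/24": "192.0.2.1"}
host_arp_cache = {"192.0.2.1": "00:00:5e:00:53:01"}
```

Because the copies are independent in this sketch, keeping the TOE-side caches consistent with later host-side updates would require additional messages that are not modeled here.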
The approach described herein may enable offloading protocol processing for selected TCP sessions from host processors to a TCP offload engine in operating systems that do not have a standard manner of supporting protocol offloading by providing extensions to the native stack that generate an offload path or plumbing channel between a host endpoint and an offload endpoint.
Accordingly, the present invention may be realized in hardware, software, or a combination of hardware and software. The present invention may be realized in a centralized fashion in at least one computer system, or in a distributed fashion where different elements are spread across several interconnected computer systems. Any kind of computer system or other apparatus adapted for carrying out the methods described herein is suited. A typical combination of hardware and software may be a general-purpose computer system with a computer program that, when being loaded and executed, controls the computer system such that it carries out the methods described herein.
The present invention may also be embedded in a computer program product, which comprises all the features enabling the implementation of the methods described herein, and which when loaded in a computer system is able to carry out these methods. Computer program in the present context means any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: a) conversion to another language, code or notation; b) reproduction in a different material form.
While the present invention has been described with reference to certain embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the scope of the present invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the present invention without departing from its scope. Therefore, it is intended that the present invention not be limited to the particular embodiment disclosed, but that the present invention will include all embodiments falling within the scope of the appended claims.