1. Field of the Invention
The present invention relates to a method, system, and program for managing data transmission through a network.
2. Description of Related Art
In a network environment, a network adapter on a host computer, such as an Ethernet controller, Fibre Channel controller, etc., will receive Input/Output (I/O) requests or responses to I/O requests initiated from the host. Often, the host computer operating system includes a device driver to communicate with the network adapter hardware to manage I/O requests to transmit over a network. The host computer may also implement a protocol which packages data to be transmitted over the network into packets, each of which contains a destination address as well as a portion of the data to be transmitted. Data packets received at the network adapter are often stored in an available allocated packet buffer in the host memory. A transport protocol layer can process the packets received by the network adapter that are stored in the packet buffer, and access any I/O commands or data embedded in the packet.
For instance, the computer may implement the Transmission Control Protocol (TCP) and Internet Protocol (IP) to encode and address data for transmission, and to decode and access the payload data in the TCP/IP packets received at the network adapter. IP specifies the format of packets, also called datagrams, and the addressing scheme. TCP is a higher level protocol which establishes a connection between a destination and a source. Another protocol, Remote Direct Memory Access (RDMA), establishes a higher level connection and permits, among other operations, direct placement of data at a specified memory location at the destination.
A “message” comprising a plurality of data packets can be sent over the connection established between a source and a destination. Depending upon the size of the message, the packets of a message might not be sent all at once in one continuous stream. Instead, the message may be subdivided into “segments,” in which one segment comprising one or more packets may be dispatched at a time. The message may be sent by a send loop function such as tcp_output, for example, in which a message segment can be sent each time the send function enters its send loop.
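The subdivision of a message into segments can be illustrated with a minimal sketch. The function name split_into_segments and the fixed segment_size parameter are assumptions made for illustration only; the description above does not specify how segment boundaries are chosen, and a segment here is simplified to a contiguous byte range rather than a group of packets.

```python
from typing import List


def split_into_segments(message: bytes, segment_size: int) -> List[bytes]:
    """Split a message into consecutive segments of at most segment_size bytes.

    Hypothetical helper: the fixed segment size is an assumption.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    return [
        message[start:start + segment_size]
        for start in range(0, len(message), segment_size)
    ]


# A 10-byte message dispatched 4 bytes at a time yields three segments,
# the last one shorter than the others.
message = bytes(range(10))
segments = split_into_segments(message, 4)
```

Each element of segments would then be dispatched on a separate pass of the send loop.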
A device driver, application or operating system can utilize significant host processor resources to handle network transmission requests to the network adapter. One technique to reduce the load on the host processor is the use of a TCP/IP Offload Engine (TOE) in which TCP/IP protocol related operations are implemented in the network adapter hardware as opposed to the device driver or other host software, thereby saving the host processor from having to perform some or all of the TCP/IP protocol related operations. The transport protocol operations include packaging data in a TCP/IP packet with a checksum and other information, and unpacking a TCP/IP packet received from over the network to access the payload or data.
In the TCP protocol as specified in the industry accepted TCP RFC (request for comment), each byte of data (including certain flags) of a packet is assigned a unique sequence number. As each packet is successfully sent to the destination host, an acknowledgment is sent by the destination host to the source host, notifying the source host by packet byte sequence numbers of the successful receipt of the bytes of that packet. Accordingly, the stream 10 includes a portion 12 of packets which have been both sent and acknowledged as received by the destination host. The stream 10 further includes a portion 14 of packets which have been sent by the source host but have not yet been acknowledged as received by the destination host. The source host maintains a TCP Unacknowledged Data Pointer 16 which points to the sequence number of the first unacknowledged sent byte. The TCP Unacknowledged Data Pointer 16 is stored in a field 17a, 17b . . . 17n (
The capacity of the packet buffer used to store data packets received at the destination host is generally limited in size. In accordance with the TCP protocol, the destination host advertises how much buffer space it has available by sending a value referred to herein as a TCP Window indicated at 20 in
For example, if the destination host sends a TCP Window value of 128 KB (kilobytes) for a particular TCP connection, the source host will, according to the TCP protocol, limit the amount of data it sends over that TCP connection to 128 KB until it receives an acknowledgment from the destination host that it has received some or all of the data. If the destination host acknowledges that it has received the entire 128 KB, the source host can send another 128 KB. On the other hand, if the destination host acknowledges receiving only 96 KB, for example, the source host will send only an additional 32 KB over that TCP connection until it receives further acknowledgments.
A TCP Next Data Pointer 22 stored in a field 23a, 23b . . . 23n of the associated Protocol Control Block 18a, 18b . . . 18n, points to the sequence number of the next byte to be sent to the destination host. A portion 24 of the datastream 10 between the TCP Next Data Pointer 22 and the end 28 of the TCP Window 20 represents packets which have not yet been sent but are permitted to be sent under the TCP protocol without waiting for any additional acknowledgments because these packets are still within the TCP Window 20 as shown in
As the destination host sends acknowledgments to the source host, the TCP Unacknowledged Data Pointer 16 moves to indicate the acknowledgment of bytes of additional packets for that connection. The beginning boundary 30 of the TCP Window 20 shifts with the TCP Unacknowledged Data Pointer 16, and the TCP Window end boundary 28 shifts with it, so that additional packets may be sent for the connection.
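The relationship among the TCP Unacknowledged Data Pointer, the TCP Next Data Pointer and the TCP Window may be sketched as follows. The class SendWindow and its method names are hypothetical. The sketch assumes that the advertised window value stays constant between acknowledgments and ignores sequence number wraparound, neither of which is addressed in the description above.

```python
from dataclasses import dataclass


@dataclass
class SendWindow:
    unacked: int   # sequence number of the first unacknowledged sent byte
    next_seq: int  # sequence number of the next byte to be sent
    window: int    # advertised TCP Window, in bytes

    def sendable(self) -> int:
        """Bytes that may be sent without waiting for further acknowledgments."""
        window_end = self.unacked + self.window
        return max(0, window_end - self.next_seq)

    def on_send(self, nbytes: int) -> int:
        """Advance the next data pointer by at most the sendable amount."""
        sent = min(nbytes, self.sendable())
        self.next_seq += sent
        return sent

    def on_ack(self, ack_seq: int) -> None:
        """Slide the window start forward when new bytes are acknowledged."""
        if self.unacked < ack_seq <= self.next_seq:
            self.unacked = ack_seq


w = SendWindow(unacked=1000, next_seq=1000, window=4096)
first = w.on_send(3000)    # fits entirely within the window
second = w.on_send(2000)   # truncated to the remaining window space
blocked = w.sendable()     # window is now full
w.on_ack(2000)             # 1000 bytes acknowledged, window slides by 1000
reopened = w.sendable()
```

In this sketch the end of the window is always computed from the unacknowledged pointer, so advancing that pointer on an acknowledgment is what permits additional bytes to be sent.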
In one system, as described in copending application Ser. No. 10/663,026, filed Sep. 15, 2003, entitled “Method, System and Program for Managing Data Transmission Through a Network” and assigned to the assignee of the present application, a computer when sending data over a TCP connection can impose a Virtual Window 200 (
Notwithstanding, there is a continued need in the art to improve the performance of connections.
Referring now to the drawings in which like reference numbers represent corresponding parts throughout:
In the following description, reference is made to the accompanying drawings which form a part hereof and which illustrate several embodiments of the present invention. It is understood that other embodiments may be utilized and structural and operational changes may be made without departing from the scope of the present invention.
The network adapter 112 includes a network protocol layer 116 to send and receive network packets to and from remote devices over a network 118. The network 118 may comprise a Local Area Network (LAN), the Internet, a Wide Area Network (WAN), Storage Area Network (SAN), etc. The embodiments may be configured to transmit data over a wireless network or connection, such as wireless LAN, Bluetooth, etc. In certain embodiments, the network adapter 112 and various protocol layers may implement the Ethernet protocol including Ethernet protocol over unshielded twisted pair cable, token ring protocol, Fibre Channel protocol, Infiniband, Serial Advanced Technology Attachment (SATA), parallel SCSI, serial attached SCSI cable, etc., or any other network communication protocol known in the art.
A device driver 120 executes in memory 106 and includes network adapter 112 specific commands to communicate with a network controller of the network adapter 112 and interface between the operating system 110, applications 114 and the network adapter 112. The network controller can implement the network protocol layer 116 and can control other protocol layers including a data link layer and a physical layer which includes hardware such as a data transceiver. In an embodiment employing the Ethernet protocol, the data transceiver could be an Ethernet transceiver.
In certain implementations, the network controller of the adapter 112 includes a transport protocol layer 121 as well as the network protocol layer 116 and other protocol layers. For example, the network controller of the network adapter 112 can implement a TCP/IP offload engine (TOE), in which many transport layer operations can be performed within the offload engines of the transport protocol layer 121 implemented within the network adapter 112 hardware or firmware, as opposed to the device driver 120, operating system 110 or an application 114.
The network layer 116 handles network communication and provides received TCP/IP packets to the transport protocol layer 121. The transport protocol layer 121 interfaces with the device driver 120, or operating system 110 or application 114 and performs additional transport protocol layer operations, such as processing the content of messages included in the packets received at the network adapter 112 that are wrapped in a transport layer, such as TCP and/or IP, the Internet Small Computer System Interface (iSCSI), Fibre Channel SCSI, parallel SCSI transport, or any transport layer protocol known in the art. The transport offload engine 121 can unpack the payload from the received TCP/IP packet and transfer the data to the device driver 120, operating system 110 or an application 114.
In certain implementations, the network controller and network adapter 112 can further include an RDMA protocol layer as well as the transport protocol layer 121. For example, the network adapter 112 can implement an RDMA offload engine, in which RDMA layer operations are performed within the offload engines of the RDMA protocol layer implemented within the network adapter 112 hardware, as opposed to the device driver 120, operating system 110 or an application 114.
Thus, for example, an application 114 transmitting messages over an RDMA connection can transmit the message through the device driver 120 and the RDMA protocol layer of the network adapter 112. The data of the message can be sent to the transport protocol layer 121 to be packaged in a TCP/IP packet before transmitting it over the network 118 through the network protocol layer 116 and other protocol layers including the data link and physical protocol layers.
The memory 106 further includes file objects 124, which also may be referred to as socket objects, which include information on a connection to a remote computer over the network 118. The application 114 uses the information in the file object 124 to identify the connection. The application 114 would use the file object 124 to communicate with a remote system. The file object 124 may indicate the local port or socket that will be used to communicate with a remote system, a local network (IP) address of the computer 102 in which the application 114 executes, how much data has been sent and received by the application 114, and the remote port and network address, e.g., IP address, with which the application 114 communicates. Context information 126 comprises a data structure including information the device driver 120, operating system 110 or an application 114, maintains to manage requests sent to the network adapter 112 as described below.
If a particular TCP connection of the source host is accorded a relatively large TCP window 20 (
In one embodiment, as discussed below, the programmable Message Segment Send Limit may be globally programmed so that each connection is allowed the same number of executions of the send loop. Alternatively, a different Message Segment Send Limit may be programmed for each connection. In this manner, each connection may be given the same priority or alternatively, each connection may be given a weighted priority. This weighted priority may be provided by, for example, assigning different Message Segment Send Limits to various connections.
In this implementation, a programmable Message Segment Send Limit is stored in a field 224a, 224b . . . 224n of the associated Protocol Control Block 222a, 222b . . . 222n for each connection. In another aspect, the Message Segment Send Limit programmed for each connection may be selectively enabled for each connection. Thus, a Limit Enable is stored in a field 226a, 226b . . . 226n of the associated Protocol Control Block 222a, 222b . . . 222n for each connection.
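By way of illustration only, the per-connection fields described above might be represented as in the following sketch. The class ProtocolControlBlock, its attribute names, and the helpers program_global_limit and program_weighted_limits are hypothetical names chosen for this example; the actual layout of a Protocol Control Block is not specified here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProtocolControlBlock:
    unacknowledged_data_pointer: int = 0
    next_data_pointer: int = 0
    message_segment_send_limit: int = 0  # corresponds to field 224a . . . 224n
    limit_enable: bool = False           # corresponds to field 226a . . . 226n


def program_global_limit(pcbs: List[ProtocolControlBlock], limit: int) -> None:
    """Give every connection the same limit, that is, the same priority."""
    for pcb in pcbs:
        pcb.message_segment_send_limit = limit
        pcb.limit_enable = True


def program_weighted_limits(pcbs: List[ProtocolControlBlock],
                            limits: List[int]) -> None:
    """Give each connection its own limit, that is, a weighted priority."""
    if len(pcbs) != len(limits):
        raise ValueError("one limit is required per connection")
    for pcb, limit in zip(pcbs, limits):
        pcb.message_segment_send_limit = limit
        pcb.limit_enable = True


pcbs = [ProtocolControlBlock() for _ in range(3)]
program_global_limit(pcbs, 4)
equal = [pcb.message_segment_send_limit for pcb in pcbs]
program_weighted_limits(pcbs, [8, 4, 1])
weighted = [pcb.message_segment_send_limit for pcb in pcbs]
```

A connection whose limit_enable remains False would simply ignore its limit field, which is one possible way to model the selective enabling described above.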
To begin transmitting the messages of the various connections which have been established, a first connection is selected (block 230). The particular connection which is selected may be selected using a variety of techniques. In one embodiment, the connections may be assigned different levels of priority. Other techniques may be used as well.
A suitable send function is called (block 232) or initiated for the selected connection. In the illustrated embodiment, the send function may operate substantially in accordance with the TCP_Output function as implemented by the Berkeley Software Distribution (BSD). However, the send function of the illustrated embodiment has been modified as set forth in
The interval of a send function is started (block 240,
If the limiting of sending of message segments has been enabled as indicated by the Limit Enable variable, a segment send count is initialized (block 246) to the value of the Message Segment Send Limit programmed for the selected connection as indicated by the field 224a, 224b . . . 224n of the associated Protocol Control Block 222a, 222b . . . 222n for the selected connection. If the limiting of sending of message segments has not been enabled, the initialization of the segment count is skipped as shown in
During this interval of
If conditions are such that a segment can be sent, a determination (block 252) is made again as to whether the Message Segment Send Limit programmed for the connection has been enabled. If the limiting of sending of message segments during the interval has been enabled as indicated by the Limit Enable variable, the segment send count previously initialized (block 246) to the value of the Message Segment Send Limit is decremented (block 254) for the selected connection. If the limiting of sending of message segments has not been enabled, the decrementing of the segment send count is skipped as shown in
A segment of the message of the selected connection is then sent (block 256). Upon sending the packet or packets of the message segment, the TCP Next Data Pointer 22 stored in the field 23a, 23b . . . 23n of the associated Protocol Control Block 222a, 222b . . . 222n for the selected connection is updated to point to the first byte of the next message segment to be sent.
A determination (block 260) is made as to whether the entire message (in this example, all the message segments of the message) has been sent. If not, a determination (block 262) is made again as to whether the Message Segment Send Limit programmed for the connection has been enabled. If the limiting of sending of message segments has been enabled as indicated by the Limit Enable variable, a determination (block 264) is made as to whether the segment send count has reached zero, that is, whether the number of successive message segments sent in this interval of execution of the send function has reached the maximum number as indicated by the Message Segment Send Limit.
If it is determined (block 264) that the segment send count has not reached zero, that is, that the maximum number of successive message segments as indicated by the Message Segment Send Limit has not yet been sent in this execution of the send function, the segment sending interval is continued in which successive additional message segments are sent (block 256) and the segment send count is decremented (block 254) for each message segment sent until either conditions do not permit (block 250) the sending of another message segment, the entire message has been sent (block 260), or the number of successive message segments sent in this execution interval of the send function has reached (block 264) the maximum number as indicated by the Message Segment Send Limit.
Once conditions do not permit (block 250) the sending of another message segment, or the number of successive message segments sent in this execution of the send function has reached (block 264) the maximum number as indicated by the Message Segment Send Limit, or the entire message has been sent (block 260), the message segment sending interval ends and the appropriate send function fields of the Protocol Control Block 222a, 222b . . . 222n for the selected connection are saved (block 270) and the process returns (block 272) from the called send function.
Although the entire message may not have been sent (block 260), and although conditions may still permit (block 250) the sending of another message segment, once the number of successive message segments sent in this execution interval of the send function has reached (block 264) the maximum number as indicated by the Message Segment Send Limit, further sending of message segments is suspended at this time for the selected connection to permit other connections to have access to the send resources of the source host. Since the entire message for the selected connection has not been sent, the appropriate send function fields of the Protocol Control Block 222a, 222b . . . 222n for the selected connection are saved (block 270) and the process returns (block 272) from the called send function.
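The send function logic described in the preceding paragraphs may be sketched as follows. This is a simplified illustration and not the BSD tcp_output function: the Connection class, the can_send callback standing in for the determination of block 250, and the list of pre-built segments are all assumptions made for this example. Block numbers in the comments refer to the blocks named in the description.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Connection:
    segments: List[bytes]        # message segments of the message to send
    next_index: int = 0          # stands in for the TCP Next Data Pointer
    send_limit: int = 0          # Message Segment Send Limit
    limit_enable: bool = False   # Limit Enable
    sent: List[bytes] = field(default_factory=list)


def send_function(conn: Connection,
                  can_send: Callable[[Connection], bool] = lambda c: True) -> bool:
    """Run one send interval; return True once the entire message is sent."""
    count = 0
    if conn.limit_enable:
        count = conn.send_limit                      # block 246
    while conn.next_index < len(conn.segments) and can_send(conn):  # block 250
        if conn.limit_enable:                        # block 252
            count -= 1                               # block 254
        conn.sent.append(conn.segments[conn.next_index])  # block 256
        conn.next_index += 1                         # advance next data pointer
        if conn.next_index >= len(conn.segments):    # block 260
            break
        if conn.limit_enable and count <= 0:         # blocks 262, 264
            break
    # Connection state persists in conn, standing in for block 270.
    return conn.next_index >= len(conn.segments)     # block 272


limited = Connection([b"s0", b"s1", b"s2", b"s3", b"s4"],
                     send_limit=2, limit_enable=True)
results = [send_function(limited) for _ in range(3)]

unlimited = Connection([b"s0", b"s1", b"s2", b"s3", b"s4"])
done_at_once = send_function(unlimited)
```

With a limit of two, the five-segment message requires three calls, leaving the send resources free for other connections between calls. Note that, because the count is decremented before the first segment is sent, a limit of zero in this sketch still permits one segment per call.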
Upon returning from the called send function, a determination (block 300,
The send function is then called again (block 232) to start another message segment sending interval in which message segments of the message of the selected connection are sent. Again, the send function may utilize the programmable Message Segment Send Limit to limit the number of successive segments which are sent during the interval of the send function call for the next selected connection if enabled for that connection. Upon the return from the send function call when the interval of the send function call is ended, connections are successively selected (block 230) and the send function is called (block 232) and new message segment sending intervals entered for each selected connection until all the messages have been sent (block 300) which permits the process to exit (block 302).
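The successive selection of connections may be sketched as below. Round-robin selection is an assumption made for this example, since the description leaves the selection technique open, and the function send_all and its dictionary arguments are hypothetical. Each send function call is reduced here to sending at most the programmed number of segments.

```python
from collections import deque
from typing import Dict, List, Tuple


def send_all(pending: Dict[str, int],
             limits: Dict[str, int]) -> List[Tuple[str, int]]:
    """Send every message, one limited interval per connection at a time.

    pending maps a connection name to its number of unsent segments;
    limits maps a connection name to its Message Segment Send Limit.
    Returns a log of (connection, segments sent in that interval).
    """
    remaining = dict(pending)
    queue = deque(name for name, count in remaining.items() if count > 0)
    log: List[Tuple[str, int]] = []
    while queue:                              # block 300: messages remain
        name = queue.popleft()                # block 230: select a connection
        # block 232: one send interval, at least one segment per call
        burst = min(remaining[name], max(1, limits[name]))
        remaining[name] -= burst
        log.append((name, burst))
        if remaining[name] > 0:
            queue.append(name)                # revisit after the others
    return log                                # block 302: exit


log = send_all({"A": 5, "B": 2}, {"A": 2, "B": 2})
```

In this run connection B completes its two-segment message after a single interval of connection A, rather than waiting for all five of A's segments. Assigning a larger limit to one connection would give it a correspondingly larger share of each round.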
The described techniques for processing requests directed to a network card may be implemented as a method, apparatus or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof. The term “article of manufacture” as used herein refers to code or logic implemented in hardware logic (e.g., an integrated circuit chip, Programmable Gate Array (PGA), Application Specific Integrated Circuit (ASIC), etc.) or a computer readable medium, such as magnetic storage medium (e.g., hard disk drives, floppy disks, tape, etc.), optical storage (CD-ROMs, optical disks, etc.), volatile and non-volatile memory devices (e.g., EEPROMs, ROMs, PROMs, RAMs, DRAMs, SRAMs, firmware, programmable logic, etc.). Code in the computer readable medium is accessed and executed by a processor. The code in which preferred embodiments are implemented may further be accessible through a transmission medium or from a file server over a network. In such cases, the article of manufacture in which the code is implemented may comprise a transmission medium, such as a network transmission line, wireless transmission media, signals propagating through space, radio waves, infrared signals, etc. Thus, the “article of manufacture” may comprise the medium in which the code is embodied. Additionally, the “article of manufacture” may comprise a combination of hardware and software components in which the code is embodied, processed, and executed. Of course, those skilled in the art will recognize that many modifications may be made to this configuration without departing from the scope of the present invention, and that the article of manufacture may comprise any information bearing medium known in the art.
In the described embodiments, certain operations were described as being performed by the device driver 120, or by one or more of the protocol layers of the network adapter 112. In alternative embodiments, operations described as performed by the device driver 120 may be performed by the network adapter 112, and vice versa.
In the described embodiments, various protocol layers and operations of those protocol layers were described. The operations of each of the various protocol layers may be implemented in hardware, firmware, drivers, operating systems, applications or other software, in whole or in part, alone or in various combinations thereof.
In the described embodiments, the packets are transmitted from a network adapter card to a remote computer over a network. In alternative embodiments, the transmitted and received packets processed by the protocol layers or device driver may be transmitted to a separate process executing in the same computer in which the device driver and transport protocol driver execute. In such embodiments, the network card is not used as the packets are passed between processes within the same computer and/or operating system.
In certain implementations, the device driver and network adapter embodiments may be included in a computer system including a storage controller, such as a SCSI, Integrated Drive Electronics (IDE), Redundant Array of Independent Disk (RAID), etc., controller, that manages access to a non-volatile storage device, such as a magnetic disk drive, tape media, optical disk, etc. In alternative implementations, the network adapter embodiments may be included in a system that does not include a storage controller, such as certain hubs and switches.
In certain implementations, the device driver and network adapter embodiments may be implemented in a computer system including a video controller to render information to display on a monitor coupled to the computer system including the device driver and network adapter, such as a computer system comprising a desktop, workstation, server, mainframe, laptop, handheld computer, etc. Alternatively, the network adapter and device driver embodiments may be implemented in a computing device that does not include a video controller, such as a switch, router, etc.
In certain implementations, the network adapter may be configured to transmit data across a cable connected to a port on the network adapter. Alternatively, the network adapter embodiments may be configured to transmit data over a wireless network or connection, such as wireless LAN, Bluetooth, etc.
The illustrated logic of
The network adapter 508 may be implemented on a network card, such as a Peripheral Component Interconnect (PCI) card or some other I/O card, or on integrated circuit components mounted on the motherboard or in software.
The foregoing description of various embodiments of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. It is intended that the scope of the invention be limited not by this detailed description, but rather by the claims appended hereto. The above specification, examples and data provide a complete description of the manufacture and use of the composition of the invention. Since many embodiments of the invention can be made without departing from the spirit and scope of the invention, the invention resides in the claims hereinafter appended.