The present invention relates to the field of computer networking, and in particular to long distance or Wide Area Network (WAN) communications.
In WAN optimization products, and some other products, there is a need to tunnel multiple flows in the same Transmission Control Protocol (TCP) tunnel. Carrying multiple local area network (LAN) TCP connections over one WAN TCP connection can cause head of line blocking. Head of line blocking occurs if there is a frame loss for one of the data flows. In this case, the flow with the missing frame is stuck in the TCP tunnel until the lost frame is retransmitted. Flows queued behind that flow are also affected, as they will not be delivered until the first flow has passed through the TCP tunnel. This results in unnecessary time delays.
One way to avoid this problem is to establish a WAN TCP connection for each LAN TCP connection. However, this requires many resources and is very inefficient.
Thus, what is needed is an efficient method for carrying multiple LAN TCP connections over one WAN TCP connection while avoiding a head of line blocking problem.
The preferred embodiment uses a method to share a TCP tunnel between multiple flows without the head of line blocking problem. When a complete but out of order protocol data unit (PDU) is stuck behind an incomplete PDU in a TCP tunnel, the complete but out of order PDU is removed from the tunnel. To do this, the boundaries of the PDUs of the different flows are preserved and the TCP receive window advertisement is increased. The receive window is opened when out-of-order data is initially received. As out-of-order complete PDUs are pulled out of the receive queue, place holders are used in the receive queue to indicate data that was in the queue, which addresses double counting. As out-of-order data PDUs are pulled out of the queue, the window advertisement is increased. This keeps the sending side from running out of TX window and stopping transmission of new data.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an implementation of apparatus and methods consistent with the present invention and, together with the detailed description, serve to explain advantages and principles consistent with the invention.
Referring to
One goal of the embodiments of the present invention is to extend a Virtual Cluster Switch (VCS) and TRILL network across data centers and meet the scalability requirements needed by the deployments. A CNE device can be implemented in a two-box solution, wherein one box is capable of layer 2/layer 3/Fibre Channel over Ethernet (L2/L3/FCoE) switching and is part of the VCS, and the other facilitates the WAN tunneling to transport Ethernet and/or Fibre Channel (FC) traffic over WAN. The CNE device can also be implemented in a one-box solution, wherein a single piece of network equipment combines the functions of L2/L3/FCoE switching and WAN tunneling.
VCS as a layer-2 switch uses TRILL as its inter-switch connectivity and delivers a notion of a single logical layer-2 switch. This single logical layer-2 switch delivers a transparent LAN service. All the edge ports of VCS support standard protocols and features like Link Aggregation Control Protocol (LACP), Link Layer Discovery Protocol (LLDP), virtual LANs (VLANs), media access control (MAC) learning, and the like. VCS achieves a distributed MAC address database using Ethernet Name Service (eNS) and attempts to avoid flooding as much as possible. VCS also provides various intelligent services, such as virtual link aggregation group (vLAG), advance port profile management (APPM), End-to-End FCoE, Edge-Loop-Detection, and the like. More details on VCS are available in U.S. patent application Ser. Nos. 13/098,360, entitled “Converged Network Extension,” filed 29 Apr. 2011; 12/725,249, entitled “Redundant Host Connection in a Routed Network,” filed 16 Mar. 2010; 13/087,239, entitled “Virtual Cluster Switching,” filed 14 Apr. 2011; 13/092,724, entitled “Fabric Formation for Virtual Cluster Switching,” filed 22 Apr. 2011; 13/092,580, entitled “Distributed Configuration Management for Virtual Cluster Switching,” filed 22 Apr. 2011; 13/042,259, entitled “Port Profile Management for Virtual Cluster Switching,” filed 7 Mar. 2011; 13/092,460, entitled “Advanced Link Tracking for Virtual Cluster Switching,” filed 22 Apr. 2011; 13/092,701, entitled “Virtual Port Grouping for Virtual Cluster Switching,” filed 22 Apr. 2011; 13/092,752, entitled “Name Services for Virtual Cluster Switching,” filed 22 Apr. 2011; 13/092,877, entitled “Traffic Management for Virtual Cluster Switching,” filed 22 Apr. 2011; and 13/092,864, entitled “Method and System for Link Aggregation Across Multiple Switches,” filed 22 Apr. 2011, all hereby incorporated by reference.
In embodiments of the present invention, for the purpose of cross-data-center communication, each data center is represented as a single logical RBridge. This logical RBridge can be assigned a virtual RBridge ID or use the RBridge ID of the CNE device that performs the WAN tunneling.
Similar to the data center 844 , data center 846 includes a VCS 842, which in turn includes a member switch 832. Member switch 832 is coupled to a host 841, which includes virtual machines (VMs) 834 and 836, both of which are coupled to virtual switches 838 and 840. Also included in VCS 842 is a CNE device 830. CNE device 830 is coupled to member switch 832 via an Ethernet (TRILL) link and an FC link. CNE device 830 is also coupled to a target storage device 822 and a clone of target storage device 820.
In previous embodiments, moving VM 802 of the network architecture of
When forwarding TRILL frames from data center 844 to data center 846, CNE device 818 modifies the egress TRILL frames' header so that the destination RBridge identifier is the RBridge identifier assigned to data center 846. CNE device 818 then uses the FCIP tunnel to deliver these TRILL frames to CNE device 830, which in turn forwards these TRILL frames to their respective layer-2 destinations.
VCS uses the FC control plane to automatically form a fabric and assign RBridge identifiers to each member switch. In one embodiment, the CNE architecture keeps the TRILL and SAN fabrics separate between data centers. From a TRILL point of view, each VCS (which corresponds to a respective data center) is represented as a single virtual RBridge. In addition, the CNE device can be coupled to a VCS member switch with both a TRILL link and an FC link. However, since the CNE devices keep the TRILL VCS fabric and SAN (FC) fabrics separate, the FC link between the CNE device and the member switch is generally configured for FC multi-fabric.
As illustrated in
In one embodiment, each data center's VCS includes a node designated as the ROOT RBridge for multicast purposes. During the initial setup, the CNE devices in the VCSs exchange each VCS's ROOT RBridge identifier. In addition, the CNE devices also exchange each data center's RBridge identifier. Note that this RBridge identifier represents the entire data center. Information related to data-center RBridge identifiers is distributed as a static route to all the nodes in the local VCS.
Assume that host A needs to send multicast traffic to host Z, and that host A already has the knowledge of host Z's MAC address. During operation, host A assembles an Ethernet frame 1002, which has host Z's MAC address (denoted as MAC-Z) as its destination address (DA), and host A's MAC address (denoted as MAC-A) as its source address (SA). Based on frame 1002, member switch RB1 assembles a TRILL frame 1003, whose TRILL header 1006 includes the RBridge identifier of data center DC-1's root RBridge (denoted as “DC1-ROOT”) as the destination RBridge, and RB1 as the source RBridge. (That is, within DC-1, the multicast traffic is distributed on the local multicast tree.) The outer Ethernet header 1004 of frame 1003 has CNE device RB4's MAC address (denoted as MAC-RB4) as the destination address, and member switch RB1's MAC address (denoted as MAC-RB1) as the source address.
When frame 1003 reaches CNE device RB4, CNE device RB4 further modifies the frame's TRILL header to produce frame 1005. CNE device RB4 replaces the destination RBridge identifier in the TRILL header 1010 with data center DC-2's root RBridge identifier DC2-ROOT. The source RBridge identifier is changed to data center DC-1's virtual RBridge identifier, DC1-RB (which allows data center DC-2 to learn data center DC-1's RBridge identifier). Outer Ethernet header 1008 has the core router's MAC address (MAC-RTR) as its destination address, and CNE device RB4's MAC address (MAC-RB4) as its source address.
Frame 1005 is subsequently transported across the IP WAN in an FCIP tunnel and reaches CNE device RB6. Correspondingly, CNE device RB6 updates the header to produce frame 1007. Frame 1007's TRILL header 1014 remains the same as frame 1005. The outer Ethernet header 1012 now has member switch RB5's MAC address, MAC-RB5, as its destination address, and CNE device RB6's MAC address, MAC-RB6, as its source address. Once frame 1007 reaches member switch RB5, the TRILL header is removed, and the inner Ethernet frame is delivered to host Z.
In various embodiments, a CNE device can be configured to allow or disallow unknown unicast, broadcast (e.g., Address Resolution Protocol (ARP)), or multicast (e.g., Internet Group Management Protocol (IGMP) snooped) traffic to cross data center boundaries. By having these options, one can limit the amount of broadcast, unknown unicast, and multicast (BUM) traffic across data centers. Note that all TRILL encapsulated BUM traffic between data centers can be sent with the remote data center's root RBridge identifier. This translation is done at the terminating point of the FCIP tunnel.
Additional mechanisms can be implemented to minimize BUM traffic across data centers. For instance, the TRILL ports between the CNE device and any VCS member switch can be configured to not participate in any of the VLAN multicast group IDs (MGIDs). In addition, the eNS on both VCSs can be configured to synchronize their learned MAC address database to minimize traffic with unknown MAC destination address. In one embodiment, before the learned MAC address databases are synchronized in different VCSs, frames with unknown MAC destination addresses are flooded within the local data center only.
To further minimize BUM traffic, broadcast traffic such as ARP traffic can be reduced by snooping ARP responses to build ARP databases on VCS member switches. The learned ARP databases are then exchanged and synchronized across different data centers using eNS. Proxy-based ARP is used to respond to all known ARP requests in a VCS. Furthermore, multicast traffic across data centers can be reduced by distributing the multicast group membership across data centers through sharing the IGMP snooping information via eNS.
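The ARP-snooping and proxy-ARP behavior described above can be illustrated with a short sketch. This is a minimal illustration under assumed names (ArpDatabase and its methods are not from this specification), not the disclosed implementation:

```python
class ArpDatabase:
    """Per-VCS ARP cache built by snooping ARP responses (illustrative)."""

    def __init__(self):
        self._entries = {}  # IP address -> MAC address

    def snoop_response(self, ip, mac):
        # Learn the IP-to-MAC binding from a snooped ARP reply.
        self._entries[ip] = mac

    def merge(self, other):
        # eNS-style synchronization: merge entries learned in another
        # data center into the local database.
        self._entries.update(other._entries)

    def proxy_reply(self, requested_ip):
        # Answer a known ARP request locally instead of flooding it
        # across the inter-data-center link; None means unknown.
        return self._entries.get(requested_ip)
```

With two such databases merged via eNS, an ARP request for a host in the remote data center can be answered locally, so the broadcast never crosses the WAN.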
The process of forwarding unicast traffic between data centers is as follows. During the FCIP tunnel formation, the logical RBridge identifiers representing data centers are exchanged. When a TRILL frame arrives at the entry node of the FCIP tunnel, wherein the TRILL destination RBridge is set as the RBridge identifier of the remote data center, the source RBridge in the TRILL header is translated to the logical RBridge identifier assigned to the local data center. When the frame exits the FCIP tunnel, the destination RBridge field in the TRILL header is set as the local (i.e., the destination) data center's virtual RBridge identifier. The MAC DA and VLAN ID in the inner Ethernet header are then used to look up the corresponding destination RBridge (i.e., the RBridge identifier of the member switch to which the destination host is attached), and the destination RBridge field in the TRILL header is updated accordingly.
In the destination data center, based on an ingress frame, all the VCS member switches learn the mapping between the MAC SA (in the inner Ethernet header of the frame) and the TRILL source RBridge (which is the virtual RBridge identifier assigned to the source data center). This allows future egress frames destined to that MAC address to be sent to the right remote data center. Because the RBridge identifier assigned to a given data center does not correspond to a physical RBridge, in one embodiment, a static route is used to map a remote data-center RBridge identifier to the local CNE device.
When frame 1003 reaches CNE device RB4, CNE device RB4 further modifies the frame's TRILL header to produce frame 1005. CNE device RB4 replaces the source RBridge identifier in the TRILL header 1011 with data center DC-1's virtual RBridge identifier DC1-RB (which allows data center DC-2 to learn data center DC-1's RBridge identifier). Outer Ethernet header 1008 has the core router's MAC address (MAC-RTR) as its DA, and CNE device RB4's MAC address (MAC-RB4) as its SA.
Frame 1005 is subsequently transported across the IP WAN in an FCIP tunnel and reaches CNE device RB6. Correspondingly, CNE device RB6 updates the header to produce frame 1007. Frame 1007's TRILL header 1015 has an updated destination RBridge identifier, which is RB5, the VCS member switch in DC-2 that couples to host Z. The outer Ethernet header 1012 now has member switch RB5's MAC address, MAC-RB5, as its DA, and CNE device RB6's MAC address, MAC-RB6, as its SA. Once frame 1007 reaches member switch RB5, the TRILL header is removed, and the inner Ethernet frame is delivered to host Z.
Flooding across data centers of frames with unknown MAC DAs is one way for the data centers to learn the MAC addresses in another data center. All unknown SAs are learned as MACs behind an RBridge, and the CNE device is no exception. In one embodiment, eNS can be used to distribute the learned MAC address databases, which reduces the amount of flooding across data centers.
In order to optimize flushes, even though MAC addresses are learned behind RBridges, the actual VCS edge port associated with a MAC address can be present in the eNS MAC updates. However, the edge port IDs might no longer be unique across data-centers. To resolve this problem, all eNS updates across data centers will qualify the MAC entry with the data-center's RBridge identifier. This configuration allows propagation of port flushes across data centers.
In the embodiments described herein, VCSs in different data centers do not join each other, and thus the distributed configurations are kept separate. However, in order to allow virtual machines to move across data centers, there may be some configuration data that needs to be synchronized across data centers. In one embodiment, a special module (in either software or hardware) is created for CNE purposes. This module is configured to retrieve the configuration information needed to facilitate moving virtual machines across data centers, and this information is synchronized between two or more VCSs.
In one embodiment, the learned MAC address databases are distributed across data centers. Additionally, edge port state change notifications (SCNs) may be distributed across data centers. When a physical RBridge is going down, the SCN is converted to multiple port SCNs on the inter-data-center FCIP link.
In order to protect the inter-data-center connectivity, a VCS can form a vLAG between two or more CNE devices. In this model, the vLAG RBridge identifier is used as the data-center RBridge identifier. The FCIP control plane is configured to be aware of this arrangement and exchange the vLAG RBridge identifiers in such cases.
Various software modules 1216 are present in the CNE/LDCM device 1200. These include an underlying operating system 1218, a control plane module 1220 to manage interaction with the VCS, a TRILL management module 1222 for TRILL functions above the control plane, an FCIP management module 1224 to manage the FCIP tunnels over the WAN, an FC management module 1226 to interact with the FC SAN and an address management module 1228. An additional module is a high availability (HA) module 1230, which in turn includes a flow-based TCP submodule 1232. The software in the connection flow-based TCP submodule 1232 is executed in the CPUs 1204 to perform the flow-based TCP operations described below relating to
The cloud virtual interconnect 1304 preferably includes the following components: an FCIP trunk, as more fully described in U.S. patent application Ser. No. 12/880,495, entitled “FCIP Communications with Load Sharing and Failover”, filed 13 Sep. 2010, which is hereby incorporated by reference, which aggregates multiple TCP connections to support WAN bandwidths ranging from 100 Mbps up to 20 Gbps. The CVI also supports multi-homing and enables transparent failover between redundant network paths.
Adaptive rate limiting (ARL) is performed on the TCP connections to change the rate at which data is transmitted through the TCP connections. ARL uses the information from the TCP connections to determine and dynamically adjust the rate limit for the TCP connections. This allows the TCP connections to utilize the maximum available bandwidth. ARL also provides a flexible number of priorities for defining policies, and users can provision the priorities as needed.
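The ARL adjustment described above can be sketched as follows. The update rule and all constants here are assumptions made for the example (the specification does not give a specific algorithm); the sketch simply raises the limit toward the available bandwidth while the connection is healthy and backs off when the connection reports loss:

```python
def adjust_rate_limit(current_limit, loss_detected,
                      max_bandwidth, min_bandwidth,
                      increase_step=0.05, decrease_factor=0.5):
    """Return a new rate limit (bits/s) for one adjustment interval.

    Illustrative sketch: additive-increase toward max_bandwidth when no
    loss is observed, multiplicative-decrease on loss, clamped to the
    provisioned [min_bandwidth, max_bandwidth] range.
    """
    if loss_detected:
        # Back off multiplicatively when the TCP connection sees loss.
        new_limit = current_limit * decrease_factor
    else:
        # Probe upward additively toward the available bandwidth.
        new_limit = current_limit + increase_step * max_bandwidth
    return max(min_bandwidth, min(new_limit, max_bandwidth))
```

A per-priority instance of such a limiter would allow the flexible priority policies mentioned above.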
High bandwidth TCP (HBTCP) is designed to be used for high throughput applications, such as virtual machine and storage migration, over long fat networks. It overcomes the negative effects of traditional TCP/IP in a WAN. In order to optimize performance, the following changes can be made.
1) Scaled Windows: In HBTCP, scaled windows are used to support WAN latencies of up to 350 ms or more. Maximum consumable memory will be allocated per session to maintain the line rate.
2) Optimized reorder resistance: HBTCP has more resistance to duplicate acknowledgements and requires more duplicate ACKs to trigger a fast retransmit.
3) Optimized fast recovery: In HBTCP, instead of reducing the cwnd by half, it is reduced by substantially less than 50% in order to make provision for the cases where extensive network reordering is done.
4) Quick Start: The slow start phase is modified to quick start where the initial throughput is set to a substantial value and throughput is only minimally reduced when compared to the throughput before the congestion event.
5) Congestion Avoidance: By carefully matching the amount of data sent to the network speed, congestion is avoided, rather than pumping more traffic and causing a congestion event; as a result, the congestion avoidance phase can be disabled.
6) Optimized slow recovery: The retransmission timer in HBTCP (15 ms) expires much quicker than in traditional TCP and is used when fast retransmit cannot provide recovery. This triggers the slow start phase earlier when a congestion event occurs.
7) Lost packet continuous retry: Instead of waiting on an ACK for a SACK retransmitted packet, continuously retransmit the packet to improve the slow recovery, as described in more detail in U.S. patent application Ser. No. 12/972,713, entitled “Repeated Lost Packet Retransmission in a TCP/IP Network”, filed Dec. 20, 2010, which is hereby incorporated by reference.
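Two of the behaviors in the list above, the raised duplicate-ACK threshold (item 2) and the gentler fast recovery (item 3), can be sketched as follows. The specific numbers (a threshold of 6 duplicate ACKs, a 12.5% window reduction) are illustrative assumptions; the specification only states "more duplicate ACKs" and "substantially less than 50%":

```python
DUP_ACK_THRESHOLD = 6   # traditional TCP triggers fast retransmit at 3
RECOVERY_FACTOR = 0.875  # traditional TCP halves cwnd on fast recovery

def should_fast_retransmit(dup_ack_count):
    # Item 2: require more duplicate ACKs before fast retransmit,
    # tolerating network reordering without spurious retransmits.
    return dup_ack_count >= DUP_ACK_THRESHOLD

def fast_recovery_cwnd(cwnd):
    # Item 3: reduce the congestion window far less aggressively than
    # the standard 50% cut (here by 12.5%, an assumed value).
    return cwnd * RECOVERY_FACTOR
```

On a long fat network, keeping most of the window through a recovery event preserves throughput across the high bandwidth-delay product.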
The vMotion migration data used in VM mobility for VMware systems enters the CNE/LDCM device 1302 through the LAN Ethernet links of the CEE switching chip 1210, and the compressed, encrypted data is sent over the WAN infrastructure through the WAN uplink, using the Ethernet ports 1206 of the SOC 1202. Similarly, for storage migration, the data from the SAN FC link provided by the FC switching chip 1208 is sent over the WAN uplink. The control plane module 1220 takes care of establishing, maintaining and terminating TCP sessions with the application servers and the destination LDCM servers.
LAN termination 1402 has a layer 2, Ethernet or CEE, module 1420 connected to the LAN ports. An IP virtual edge routing module 1422 connects the layer 2 module 1420 to a Hyper-TCP module 1424. The Hyper-TCP module 1424 operation is described in more detail below and includes a TCP classifier 1426 connected to the virtual edge routing module 1422. The TCP classifier 1426 is connected to a data process module 1428 and a session manager 1430. An event manager 1432 is connected to the data process module 1428 and the session manager 1430. The event manager 1432, the data process module 1428 and the session manager 1430 are all connected to a socket layer 1434, which acts as the interface for the Hyper-TCP module 1424 and the LAN termination 1402 to the application module 1408.
SAN termination 1404 has an FC layer 2 module 1436 connected to the SAN ports. A batching/debatching module 1438 connects the FC layer 2 module 1436 to a routing module 1440. Separate modules are provided for Fibre connection (FICON) traffic 1442, FCP traffic 1444 and F_Class traffic 1446, with each module connected to the routing module 1440 and acting as interfaces between the SAN termination 1404 and the application module 1408.
The application module 1408 has three primary applications, hypervisor 1448, web/security 1452 and storage 1454. The hypervisor application 1448 cooperates with the various hypervisor motion functions, such as vMotion, XenMotion and MS Live Migration. A caching subsystem 1450 is provided with the hypervisor application 1448 for caching of data during the motion operations. The web/security application 1452 cooperates with virtual private networks (VPNs), firewalls and intrusion systems. The storage application 1454 handles iSCSI, network attached storage (NAS) and SAN traffic and has an accompanying cache 1456.
The data compaction engine 1410 uses the compression engine 1212 to handle compression/decompression and deduplication operations to allow improved efficiency of the WAN links.
The main function of the HRDA layer 1412 is to ensure communication reliability at the network level and also at the transport level. As shown, the data centers are consolidated by extending the L2 TRILL network over IP through the WAN infrastructure. The redundant links are provisioned to act as backup paths. The HRDA layer 1412 performs a seamless switchover to the backup path in case the primary path fails. HBTCP sessions running over the primary path are prevented from experiencing any congestion event by retransmitting any unacknowledged segments over the backup path, as the unacknowledged segments and their acknowledgements are assumed to be lost. The HRDA layer 1412 also ensures reliability for TCP sessions within a single path. If an HBTCP session fails, any migration application using the HBTCP session will also fail. In order to prevent the applications from failing, the HRDA layer 1412 transparently switches to a backup HBTCP session.
The CVI 1406 includes an IP module 1466 connected to the WAN links. An IPSEC module 1464 is provided for link security. A HBTCP module 1462 is provided to allow the HBTCP operations as described above and to perform the out of order delivery of PDUs to the upper layer and advertised receive window changes as described below. A quality of service (QoS)/ARL module 1460 handles the QoS and the ARL function described above. A trunk module 1458 handles trunking operations.
Hyper-TCP is a component in accelerating the migration of live services and applications over long distance networks. Simply put, a TCP session between the application client and server is locally terminated, and by leveraging the high bandwidth transmission techniques between the data centers, application migration is accelerated.
Hyper-TCP primarily supports two modes of operation:
1) Data Termination Mode (DTM): In data termination mode, the end device TCP sessions are not altered but the data is locally acknowledged and data sequence integrity is maintained.
2) Complete Termination Mode (CTM): In the complete termination mode, end device TCP sessions are completely terminated by the LDCM. Data sequence is not maintained between end devices but data integrity is guaranteed.
There are primarily three phases in Hyper-TCP. They are Session Establishment, Data Transfer and Session Termination. These three phases are explained below.
1) Session Establishment: During this phase, the connection establishment packets are snooped and the TCP session data, like connection end points, window size, MTU and sequence numbers, are cached. The layer 2 information, like the MAC addresses, is also cached. The TCP session state on the Hyper-TCP server is the same as that of the application server, and the TCP session state of the Hyper-TCP client is the same as that of the application client. With the cached TCP state information, the Hyper-TCP devices can locally terminate the TCP connection between the application client and server and locally acknowledge the receipt of data packets. Hence, the round trip times (RTTs) calculated by the application are masked from including the WAN latency, which results in better performance.
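The session state cached during establishment can be sketched as a simple record. The field and function names here are illustrative assumptions (the specification lists only the categories of cached data: end points, window size, MTU, sequence numbers, and MAC addresses):

```python
from dataclasses import dataclass

@dataclass
class HyperTcpSessionState:
    """Snooped TCP session parameters cached by a Hyper-TCP device."""
    client_endpoint: tuple   # (IP, port) of the application client
    server_endpoint: tuple   # (IP, port) of the application server
    window_size: int
    mtu: int
    client_seq: int          # initial sequence numbers from the handshake
    server_seq: int
    client_mac: str          # cached layer 2 information
    server_mac: str

def snoop_establishment(cache, pkt):
    # Cache the parameters from a snooped connection-establishment
    # exchange so the Hyper-TCP device can later terminate the TCP
    # session locally and acknowledge data on the server's behalf.
    # pkt is an assumed dict-shaped view of the snooped handshake.
    key = (pkt["src"], pkt["dst"])
    cache[key] = HyperTcpSessionState(
        client_endpoint=pkt["src"], server_endpoint=pkt["dst"],
        window_size=pkt["window"], mtu=pkt["mtu"],
        client_seq=pkt["seq"], server_seq=pkt["ack_seq"],
        client_mac=pkt["src_mac"], server_mac=pkt["dst_mac"])
```

With this record, local acknowledgements can be generated with sequence numbers and MAC addresses consistent with what the end devices expect.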
The session create process is illustrated in
2) Data Transfer Process: Once the session has been established, the data transfer is always locally handled between a Hyper-TCP device and the end device. A Hyper-TCP server acting as a proxy destination server for the application client locally acknowledges the data packets, and the TCP session state is updated. The data is handed over to the HBTCP session between the Hyper-TCP client and server. The HBTCP session compresses and forwards the data to the Hyper-TCP client. This reduces the RTTs seen by the application client and the source, as it masks the latencies incurred on the network. The data received at the Hyper-TCP client is treated as if it had been generated by the Hyper-TCP client, and it is handed to the Hyper-TCP process running between the Hyper-TCP client and the application server. Upon congestion in the network, the amount of data fetched from the Hyper-TCP sockets is controlled.
This process is illustrated in
3) Session Termination: A received FIN/RST is transparently sent across like the session establishment packets. This is done to ensure the data integrity and consistency between the two end devices. The FIN/RST received at the Hyper-TCP server will be transparently sent across only when all the packets received prior to receiving a FIN have been locally acknowledged and sent to the Hyper-TCP client. If a FIN/RST packet has been received on the Hyper-TCP client, the packet will be transparently forwarded after all the enqueued data has been sent and acknowledged by the application server. In either direction, once the FIN has been received and forwarded, the further transfer of packets is done transparently and is not locally terminated.
This is shown in more detail in
Flow-Based TCP
In WAN optimization products, and some other products, there is sometimes a need to tunnel multiple flows in the same TCP tunnel. Carrying multiple LAN TCP connections over one WAN TCP connection helps reduce the number of TCP connections across the WAN, but it can also introduce a head of the line blocking problem. Head of the line blocking occurs when there is a frame loss for one of the flows; as a result, other flows are not delivered until the lost frame is retransmitted. In the preferred embodiment of the invention, this problem is addressed by using stream based TCP connections, where each LAN TCP connection is mapped to a stream and each stream data unit is sent with a stream identifier. TCP delivers stream data units out of order, but packets in a stream data unit are always in order. CVI guarantees that data units for a stream are always delivered in order.
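The per-stream ordering guarantee described above can be sketched with a small demultiplexer. The framing of (stream identifier, sequence number, payload) and all names are assumptions made for the example, not the disclosed wire format:

```python
class StreamDemux:
    """Delivers data units in order within each stream, independently
    of the arrival order across streams (illustrative sketch)."""

    def __init__(self):
        self._next_seq = {}   # stream_id -> next expected sequence
        self._pending = {}    # stream_id -> {seq: payload}
        self.delivered = []   # (stream_id, payload) handed to upper layer

    def receive(self, stream_id, seq, payload):
        self._next_seq.setdefault(stream_id, 0)
        self._pending.setdefault(stream_id, {})[seq] = payload
        # Flush every consecutive data unit now available for this
        # stream; a gap in one stream never blocks the other streams.
        while self._next_seq[stream_id] in self._pending[stream_id]:
            n = self._next_seq[stream_id]
            self.delivered.append(
                (stream_id, self._pending[stream_id].pop(n)))
            self._next_seq[stream_id] = n + 1
```

Note that a lost data unit stalls only its own stream: data units for other streams continue to be delivered, which is exactly the head of the line blocking avoidance described above.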
The head of line blocking problem and the solution for it are illustrated in
The preferred embodiment of the present invention introduces a method for sharing the TCP tunnel between multiple flows without having this head of the line blocking problem. The method involves allowing the data streams that are transmitted after a stuck data stream to pass through the TCP channel to the remote side without having to wait for the stuck data stream to pass through. Thus, as shown in
This is achieved by first removing complete but out of order PDUs from the TCP receive queue. In order to do that, the boundaries of the PDUs of different data streams are preserved to distinguish one PDU from another. A variety of methods can be employed to preserve PDU boundaries. In one embodiment, to preserve PDU boundaries the data is parsed to look for PDU/CVI headers. When out-of-order packets are received, it may not be clear where the next PDU/CVI header will be. Thus, in this embodiment every byte of payload data is searched until a header is found, and it is validated that it is in fact a header and not payload data. This method may be time consuming and is not very efficient.
An alternative embodiment for preserving PDU boundaries involves using the urgent flag of the data stream as a pointer to the PDU boundary. In this embodiment, the urgent flag and offset are used to denote the beginning of the PDU/CVI header within a TCP segment.
In one embodiment, the TCP transmit engine keeps a running total of the number of bytes in a PDU sent in order to identify when the start of the next PDU falls within the TCP segment. This is done through a set of counters that identify when a PDU header is in the segment. If there is a PDU header, the TCP transmit engine sets the urgent flag and sets the urgent pointer to the byte count of the previous PDU in the segment (the value can be anywhere from 0 to the MSS). If a packet does not have the start of a PDU header in it, the urgent flag is not set, indicating that the entire segment follows a PDU header.
To prevent unneeded waiting and reassembly of the PDU header on the remote side, the segment size may be truncated so as to include the start of the PDU header up through the entire PDU length field in a single segment. This causes some TCP segments to be smaller than the optimal MSS, but it prevents waiting on the remote side for reassembly.
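The transmit-side bookkeeping described above can be sketched as follows. The sketch assumes at most one new PDU starts per segment (consistent with the single urgent pointer); the function name and return convention are assumptions for the example:

```python
def mark_segment(bytes_left_in_pdu, segment_len, next_pdu_len):
    """Return (urgent_flag, urgent_pointer, new_bytes_left).

    bytes_left_in_pdu: running count of unsent bytes in the current PDU.
    urgent_pointer: offset of the next PDU header within this segment,
    i.e. the byte count of the previous PDU carried first in the segment.
    """
    if bytes_left_in_pdu >= segment_len:
        # The entire segment lies inside the current PDU: no header
        # here, so the urgent flag is left clear.
        return False, 0, bytes_left_in_pdu - segment_len
    # A new PDU header begins inside this segment; point at it.
    urgent_pointer = bytes_left_in_pdu
    consumed_of_next = segment_len - bytes_left_in_pdu
    return True, urgent_pointer, next_pdu_len - consumed_of_next
```

The running total carried in `new_bytes_left` is the counter state the transmit engine consults for the next segment.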
Reassembly of PDUs in TCP Receive
When a packet is received that has the urgent flag set, a check is made to verify that enough of the PDU header is present to read the PDU size. If there is enough data to read the PDU size, the size is read and a PDU boundary is noted. From that point on, the start of each PDU can be determined and all incoming packets processed. PDU boundaries are determined, and when an entire PDU is received, it is immediately sent up to the upper layer. This process allows packets to be sent to the upper layer out of order, preventing head of line blocking.
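The reassembly process above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the 4-byte big-endian length prefix, the byte-per-entry buffer, and all names are assumptions made for the example.

```python
class PduReassembler:
    """Delivers complete PDUs to the upper layer as soon as all of
    their bytes arrive, even while earlier sequence space has holes."""

    def __init__(self):
        self._buf = {}            # sequence offset -> one byte
        self._boundaries = set()  # known PDU start offsets
        self.delivered = []       # (boundary_seq, pdu_bytes)

    def receive(self, seq, data, urgent_offset=None):
        # urgent_offset mirrors the TCP urgent pointer: the offset of a
        # PDU header within this segment, when one starts here.
        if urgent_offset is not None:
            self._boundaries.add(seq + urgent_offset)
        for i, b in enumerate(data):
            self._buf[seq + i] = b
        self._drain()

    def _have(self, start, count):
        return all(start + i in self._buf for i in range(count))

    def _drain(self):
        for b in sorted(self._boundaries):
            # Assumed PDU format: 4-byte big-endian length, then payload.
            if not self._have(b, 4):
                continue
            length = int.from_bytes(
                bytes(self._buf[b + i] for i in range(4)), "big")
            # Chain the next boundary from the length field, so later
            # in-order segments need no urgent pointer.
            self._boundaries.add(b + 4 + length)
            if self._have(b + 4, length):
                pdu = bytes(self._buf.pop(b + 4 + i) for i in range(length))
                for i in range(4):
                    self._buf.pop(b + i)
                self.delivered.append((b, pdu))
                self._boundaries.discard(b)
```

A complete PDU that arrives after a hole is delivered immediately, while the incomplete PDU in front of it waits for its retransmission.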
The method of using the urgent flag as a pointer to the PDU boundary is easy to implement, but it only allows for one boundary per packet and prevents filling the full MSS when a PDU is small, particularly when jumbo frames are used. This is because the larger the jumbo frame, the greater the chance of multiple boundaries in a packet. This issue is addressed by using the PDU length value to calculate the start of the next PDU. This can be continued as long as segments are received in order. When an out of order segment is received, the urgent pointer is used to find the next PDU, so that the next PDU length can be obtained to continue the process. Thus, PDU boundaries can be preserved by using the urgent flag as a pointer.
The second step involved in successfully removing complete but out-of-order PDUs from the TCP tunnel is to open the receive window when a complete but out-of-order PDU is removed from the TCP receive queue. The size of an advertised receive window is generally restricted to two times the normal operating receive window size.
The receive window is generally opened when out-of-order data is initially received. As out-of-order but complete PDUs are pulled out of the receive queue, however, that data would be counted twice towards the receive window size, because the data cannot be ACKed until it can be sent up to the TCP user. To alleviate this problem, place holders are used in the receive queue to indicate data that was in the queue but no longer exists in the queue. Thus, in the receive queue, a placeholder is inserted to indicate that data has been sent up to the user. The placeholder has byte counters for what has been sent and what remains to be sent, to properly adjust the window sizes. This facilitates continued processing of the queue. When a segment is sent up to the application layer out of order, credits are applied to the advertised receive window for the number of bytes sent up. Thus, the size of the data that is sent up is added into the advertised TCP receive window. As a result, the TCP receive window advertisement reflects the available size of the receive queue, and the receive window is kept open for new data.
If out-of-order PDUs having sizes X1, X2, X3, ..., respectively, are pulled out of the queue, the window advertisement is calculated as:
win_adv = max_win_size + (X1 + X2 + X3 + ...) - bytes_still_in_RX_queue
As each placeholder frame is removed from the receive queue, the receive window size is decreased by the amount that was credited for it. This brings the receive window size back down to its normal value once all gaps in the receive queue have been filled.
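The advertisement arithmetic above can be written directly as a small Python function, using the document's own names (win_adv, max_win_size, bytes_still_in_RX_queue); the parameter order is an assumption.

```python
def win_adv(max_win_size, pulled_pdu_sizes, bytes_still_in_rx_queue):
    """win_adv = max_win_size + (X1 + X2 + X3 + ...) - bytes_still_in_RX_queue

    pulled_pdu_sizes: sizes X1, X2, ... of complete out-of-order PDUs
    already delivered to the upper layer (their placeholders remain in
    the queue and are counted in bytes_still_in_rx_queue).
    """
    return max_win_size + sum(pulled_pdu_sizes) - bytes_still_in_rx_queue
```

For example, with a 65536-byte window, 1000- and 2000-byte PDUs pulled early, and 3500 bytes still queued (the two placeholders plus 500 bytes of an incomplete PDU), the advertisement is 65036. Once the retransmission fills the gap and the placeholders drain, the terms cancel and the advertisement returns to 65536.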
Table 1 below represents what could be processed, and what the advertised window would be, at each of the given time stamps for the above example in prior-art TCP tunnel transfers. It should be noted that in the prior-art TCP cases, the upper layer could not process any PDUs until after time index T6, the point of retransmission. In addition, the window size would steadily decrease until the retransmission is received.
With early credit back to the RX window when a PDU is passed along to the upper layer, the same example progresses as shown in Table 2. In this case, the upper layer can process full PDUs at earlier time stamps, and the advertised window does not drop as far.
If the data in Tables 1 and 2 above is examined side by side, it can be seen that the further removed a retransmission is from where it was originally supposed to be received, the worse the blocking is for prior-art TCPs. Table 3 below shows a side-by-side comparison based on the following assumptions:
The disclosed method of manipulating the receive window size keeps the sending side from exhausting its transmit window and stopping transmission of new data when the receive side is able to pull out-of-order data from the RX queue. This helps reduce head-of-line blocking when multiple flows share the same WAN TCP connection.
As shown in the figures, when an out-of-order segment arrives, a check is made to determine whether the PDU length field pointed to by the urgent pointer is fully contained within the segment:
if ((segment_size - urgent_pointer - length_offset - length_size) > 0) { /* length is not segmented */ }
Once the entire length field has been received, the length of the PDU is determined and the PDU is processed on the queue as normal.
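The segmented-length check above can be mirrored in Python; the predicate is the same expression the document gives, and the variable names (segment_size, urgent_pointer, length_offset, length_size) follow the document's notation.

```python
def length_field_complete(segment_size, urgent_pointer, length_offset, length_size):
    """True if the PDU length field pointed at by the urgent pointer is
    wholly contained in this segment. When False, part of the length
    field arrives in a later segment, so the bytes received so far must
    be buffered until the rest of the field is available."""
    return segment_size - urgent_pointer - length_offset - length_size > 0
```

For example, with a 1460-byte segment whose urgent pointer indicates a PDU starting at offset 1400 and a 2-byte length field at offset 0 within the PDU, the field is complete; with the PDU starting at offset 1459, the length field straddles the segment boundary and processing must wait.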
The above description is intended to be illustrative, and not restrictive. For example, the above-described embodiments may be used in combination with each other. Many other embodiments will be apparent to those of skill in the art upon reviewing the above description. The scope of the invention should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled. In the appended claims, the terms “including” and “in which” are used as the plain-English equivalents of the respective terms “comprising” and “wherein.”
This application claims the benefit under 35 U.S.C. §119(e) of U.S. Provisional Patent Application Ser. No. 61/567,288, entitled "Flow-based TCP," filed Dec. 6, 2011, which is hereby incorporated by reference. This application is also related to U.S. patent applications Ser. Nos. 13/677,929, entitled "Lossless Connection Failover for Single Devices," 13/677,909, entitled "TCP Connection Relocation," and 13/677,922, entitled "Lossless Connection Failover for Mirrored Devices," all three filed concurrently herewith, which are hereby incorporated by reference.
References Cited: U.S. Patent Documents

Number | Name | Date | Kind
---|---|---|---
6434620 | Boucher et al. | Aug 2002 | B1
6560243 | Mogul | May 2003 | B1
6563821 | Hong et al. | May 2003 | B1
20040042458 | Elzu | Mar 2004 | A1
20040143642 | Beckmann et al. | Jul 2004 | A1
20050033878 | Pangal et al. | Feb 2005 | A1
20050063307 | Samuels et al. | Mar 2005 | A1
20050135416 | Ketchum et al. | Jun 2005 | A1
20050165985 | Vangal et al. | Jul 2005 | A1
20070076726 | Weston et al. | Apr 2007 | A1
20080126553 | Boucher et al. | May 2008 | A1
20090006710 | Daniel et al. | Jan 2009 | A1
20090080332 | Mizrachi et al. | Mar 2009 | A1
20100232427 | Matsushita et al. | Sep 2010 | A1
20100281195 | Daniel et al. | Nov 2010 | A1
20100318700 | Rangan et al. | Dec 2010 | A1
20110029734 | Pope et al. | Feb 2011 | A1
20120163396 | Cheng et al. | Jun 2012 | A1

Prior Publication Data

Number | Date | Country
---|---|---
20130315260 A1 | Nov 2013 | US

Related U.S. Provisional Application

Number | Date | Country
---|---|---
61567288 | Dec 2011 | US