Unless otherwise indicated herein, the approaches described in this section are not admitted to be prior art by inclusion in this section.
Communications networks are generally packet-switched networks that operate based on the Internet Protocol (IP). When one endpoint (e.g., host) has data to send to another endpoint, the data may be transmitted as a series of packets. Transmission Control Protocol (TCP) is a transport layer protocol that offers reliable data transfer between two endpoints. TCP is a connection-oriented protocol that requires endpoints to establish a connection before data transfer occurs. Although widely implemented, TCP is designed to ensure that data is delivered in a particular order from one endpoint to another, which may not be optimal for network throughput and resource utilization.
In the following detailed description, reference is made to the accompanying drawings, which form a part hereof. In the drawings, similar symbols typically identify similar components, unless context dictates otherwise. The illustrative embodiments described in the detailed description, drawings, and claims are not meant to be limiting. Other embodiments may be utilized, and other changes may be made, without departing from the spirit or scope of the subject matter presented here. It will be readily understood that the aspects of the present disclosure, as generally described herein, and illustrated in the drawings, can be arranged, substituted, combined, and designed in a wide variety of different configurations, all of which are explicitly contemplated herein.
Conventional Transmission Control Protocol (TCP) is designed to ensure in-order transfer during a connection. For example, when a web server has several web page objects (e.g., O1, O2, O3, O4 and O5) to send to a client for display on a web browser, these objects are usually sent according to a particular order. The web server may first send O1, followed by O2, O3, O4 and then O5 during the TCP connection. The order in which the web server sends the data affects how the data is displayed on the web browser at the client. If a packet containing part or all of O1 is dropped, all other packets containing O2, O3, O4 and O5 will not be delivered to the web browser until the dropped packet is retransmitted.
The above problem is known as head-of-line blocking, which adversely affects application throughput and causes unnecessary delay and longer page load times at the client. This problem is especially evident in applications where objects (e.g., O1, O2, O3, O4 and O5 in the above example) are independent of each other. In this case, it is not necessary for the source to send, and the destination to receive, them according to a particular order. One conventional approach to mitigate head-of-line blocking is to establish multiple TCP connections, such as one for each web page object. However, this approach is inefficient and consumes significant resources, such as those required to open and manage multiple sockets at both endpoints.
According to examples of the present disclosure, a multipath connection may be established to transfer multiple data sets. For example, Multipath Transmission Control Protocol (MPTCP) is a multipath connection protocol that utilizes multiple paths simultaneously to transfer data between endpoints. An MPTCP connection begins similarly to a regular TCP connection, and includes multiple subflows that are established as required. Instead of using a TCP connection that enforces in-order transfer strictly, subflows allow multiple data sets to be transferred independently over multiple paths. This reduces head-of-line blocking and thereby improves application throughput and performance.
In the following, although “multipath connection” is exemplified using an MPTCP connection, it should be understood that any other suitable protocol may be used. Throughout the present disclosure, the term “multipath connection” may refer generally to a set of subflows between two endpoints. The term “data set” may refer generally to any group or block of data. For example, in the above web browsing application, a data set may contain a single page object (e.g., O1) or multiple page objects (e.g., O1 and O2). Depending on the application, a data set may be in any other suitable form, such as a memory page, file, storage object, virtual machine data, virtual disk file, etc.
In more detail,
Each endpoint 110/120 executes application 112/122 (one shown for simplicity) having access to protocol stack 116/126 via socket 114/124. Protocol stack 116/126 is divided into several layers, such as transport layer (e.g., MPTCP, TCP), network layer (e.g., Internet Protocol (IP) layer), etc. Socket 114/124 serves as a protocol-independent interface for application 112/122 to access protocol stack 116/126, such as via socket system calls. Here, the term “socket system call” may refer generally to a software function that is called or invoked by application 112/122 (e.g., a user-level process) to access a service supported by protocol stack 116/126 (e.g., a kernel-level process), such as to send data to EP-B 120, etc.
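Purely for illustration, a minimal sketch of such a socket system call sequence (standard POSIX sockets; the peer address and port number are placeholders, not values from the examples herein) is:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* Create a stream socket; the kernel-level protocol stack services it. */
    int s = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in peer;
    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port = htons(5001);                      /* placeholder port */
    inet_pton(AF_INET, "192.0.2.20", &peer.sin_addr); /* placeholder peer IP */

    /* connect() and send() are socket system calls handled by the
     * transport layer (e.g., TCP or MPTCP) in the protocol stack. */
    connect(s, (struct sockaddr *)&peer, sizeof(peer));
    send(s, "data", 4, 0);
    close(s);
    return 0;
}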
As defined in Request for Comments (RFC) 6824 published by the Internet Engineering Task Force (IETF), an MPTCP connection may be established between EP-A 110 and EP-B 120 provided at least one of the endpoints is multi-homed (i.e., having multiple network interfaces) and multi-addressed (i.e., having multiple IP addresses). In the example in
EP-A 110 and EP-B 120 are connected via various intermediate devices, such as R1 130, R2 140, R3 150, R4 152, R5 154 and R6 156. Each intermediate device may be any suitable physical or virtual network device, such as a router, switch, gateway, any combination thereof, etc. EP-A 110 is connected to R1 130 via NIC-A1 118 and NIC-A2 119, and EP-B 120 to R2 140 via NIC-B 128. R1 130 provides multiple paths between EP-A 110 and EP-B 120. In particular, a first path is formed by the connection between R1 130 and R2 140 via R3 150, a second path via R4 152, a third path via R5 154, and a fourth path via R6 156.
In the example in
The mapping between an MPTCP connection and socket 114/124 is generally one-to-one. For example in
More detailed examples will be discussed with reference to
Referring first to 210 in
Any suitable approach may be used at 210, such as detecting socket system calls invoked by application 112 to send D1 160 and D2 162 via socket 114. As will be further described using examples in
The term “tag information” may refer generally to information, such as a label, keyword, category, class, or the like, assigned to a particular piece of data to indicate its independence from other data assigned with different tag information. In one example in
In particular, at 230 in
SF1 170 is identified by first set of tuples 172 and SF2 180 by second set of tuples 182 configured by EP-A 110. The term “set of tuples” may generally refer to a 4-tuple in the form of (source IP address, source port number, destination IP address, destination port number) for uniquely identifying a bi-directional connection between EP-A 110 and EP-B 120. In practice, a 5-tuple set (i.e., 4-tuple plus protocol information) may also be used. In the example in
At 250 in
Using example process 200, D1 160 and D2 162 that are independent of each other may be sent in parallel to EP-B 120 using the MPTCP connection to improve application throughput and performance. Since in-order transfer is not required, the effect of head-of-line blocking is also reduced. For example, when a packet containing part or whole of D1 160 is lost, it is not necessary for D2 162 to wait for the retransmission of the lost packet before D2 162 can be delivered to application 122 at destination EP-B 120.
It should be understood that in-order transfer of packets on a particular subflow (e.g., SF1 170) is still maintained. In practice, the packets are assigned a sequence number to maintain a particular order within a subflow. However, since D1 160 and D2 162 are independent of each other and sent on respective subflows SF1 170 and SF2 180, they may arrive at EP-B 120 for delivery to application 122 in any order. For example in
In practice, example process 200 may be implemented by endpoints without requiring any software and/or hardware changes to intermediate devices R1 130 to R6 156. Since there are usually many intermediate devices connecting a pair of endpoints, the costs of implementing example process 200 may be reduced.
In the following, various examples will be explained with reference to
In the following, network environment 100 in
In the example in
It should be understood that a “virtual machine” is one form of workload. In general, a workload may represent an addressable data compute node or isolated user space instance. In practice, any suitable technologies aside from hardware virtualization may be used to provide isolated user space instances. For example, other workloads may include physical hosts, client computers, containers (e.g., running on top of a host operating system without the need for a hypervisor or separate operating system), virtual private servers, etc. The virtual machines may also be complete computation environments, containing virtual equivalents of the hardware and system software components of a physical computing system.
Server devices are inter-connected via top-of-rack (ToR) leaf switches and spine switches. For example, intermediate devices R1 130 and R2 140 (introduced in
Due to the leaf-spine topology, all server devices are exactly the same number of hops away from each other. For example, packets from left-most rack unit 310 to right-most rack unit 317 may be routed with equal cost via any one of spine switches R3 150, R4 152, R5 154 and R6 156. Leaf switches and/or spine switches may implement flow balancing features such as Equal Cost Multipath (ECMP) routing, NIC teaming, Link Aggregation Control Protocol (LACP), etc. For example, leaf switch R1 130 is ECMP-capable and configured to distribute subflows from downstream server device 320 hosting EP-A 110 to any one of the upstream spine switches R3 150, R4 152, R5 154 and R6 156.
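As a simplified illustration only (real switches use hardware hash functions and vendor-specific fields), ECMP-style distribution may be thought of as hashing the subflow tuples and picking one of the equal-cost uplinks, so that subflows differing in source IP address or source port number generally land on different spine switches:

#include <stdint.h>

#define NUM_UPLINKS 4  /* e.g., towards spine switches R3 150, R4 152, R5 154 and R6 156 */

/* Simplified ECMP-style selection: hash the subflow tuples and pick one
 * of the equal-cost uplinks. Not an actual switch implementation. */
static unsigned select_uplink(uint32_t src_ip, uint16_t src_port,
                              uint32_t dst_ip, uint16_t dst_port)
{
    uint32_t h = src_ip ^ dst_ip ^ ((uint32_t)src_port << 16) ^ dst_port;
    h ^= h >> 16;
    h *= 0x45d9f3bu;   /* simple integer mix */
    h ^= h >> 16;
    return h % NUM_UPLINKS;
}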
In data center environment 300, various applications may require high volume data transfers, such as virtual machine migrations, backups, cloning, file transfers, data placement on a virtual storage area network (SAN), fault tolerance, high availability (HA) operations, etc. In some cases, data transfers may involve sending a large amount of Internet Small Computer System Interface (iSCSI) traffic and/or Network File System (NFS) traffic between endpoints. In these applications, since in-order transfer is generally not required, example process 200 may be used to send multiple data sets in parallel.
Referring first to 410 and 420 in
At 515 and 520 in
At 525 and 530 in
To implement first example process 500, conventional socket system call send(s, buffer, len, flags) may be modified to include an additional parameter to specify the tag information. This allows application 112 to tag the socket system calls at 515 and 520, and protocol stack 116 to identify the tag information at 525. In cases where it is not feasible to modify send( ), second example process 540 in
For example, “Tag-D1” may be included in “buffer-D1” and “Tag-D2” in “buffer-D2”. In this case, “send(s, buffer-D1, len, flags)” is a first socket system call invoked to send D1 160, and “send(s, buffer-D2, len, flags)” is a second socket system call invoked to send D2 162. In practice, the tag information may be included at the beginning of each buffer, at the end of the buffer, or at any other predetermined location for ease of identification by protocol stack 116.
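A minimal sketch of this approach, assuming an unmodified send() and a hypothetical fixed-width tag prefix at the beginning of the buffer (the tag length, buffer size and helper name are illustrative only):

#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

#define TAG_LEN 8   /* assumed fixed-width tag prefix */

/* Prepend tag information (e.g., "Tag-D1" or "Tag-D2") to the payload so
 * that the protocol stack can identify it at a predetermined location,
 * here the beginning of the buffer, using an unmodified send(). */
static ssize_t send_with_tag(int s, const char *tag, const void *data, size_t len)
{
    char buffer[TAG_LEN + 1400]; /* illustrative bounded buffer */
    if (len > sizeof(buffer) - TAG_LEN)
        return -1;
    memset(buffer, 0, TAG_LEN);
    strncpy(buffer, tag, TAG_LEN);
    memcpy(buffer + TAG_LEN, data, len);
    return send(s, buffer, TAG_LEN + len, 0);
}

Application 112 might then invoke send_with_tag(s, "Tag-D1", d1, d1_len) and send_with_tag(s, "Tag-D2", d2, d2_len) so that protocol stack 116 can identify the tag information at the beginning of each buffer.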
At 565 and 570 in
Although socket system call “send( )” is used in the above examples, it should be understood that any other suitable system call may be used, such as “sendto( ),” “sendmsg( ),” “write( ),” “writev( ),” etc. Any alternative tagging approach may also be used. For example, in another approach, the control information “flags” in each “send( )” system call may be modified to include a flag (e.g., MSG_TAG) containing the tag information.
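A minimal sketch of this flags-based alternative, assuming hypothetical flag values that are not part of the standard sockets API and would require a correspondingly modified protocol stack:

#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical flag values carrying the tag information in the "flags"
 * argument of send(); a modified protocol stack would interpret them. */
#define MSG_TAG_D1 0x100000
#define MSG_TAG_D2 0x200000

static void send_tagged_via_flags(int s, const void *d1, size_t l1,
                                  const void *d2, size_t l2)
{
    send(s, d1, l1, MSG_TAG_D1); /* protocol stack associates D1 with one subflow */
    send(s, d2, l2, MSG_TAG_D2); /* and D2 with another subflow */
}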
Referring to
At 440, 442 and 444 in
At 450 in
At 460, 462 and 464 in
At 470 and 480 in
During packet forwarding, next-hop leaf switch R1 130 performs path selection based on tuples 172/182 configured for each subflow 170/180. For example, at 472 and 474 in
At destination EP-B 120, since D1 160 and D2 162 are received on different subflows, it is not important whether D1 160 arrives before D2 162, D2 162 before D1 160, or both at the same time. However, in-order transfer of packets on a particular subflow (e.g., SF1 170) is maintained. For example, if a packet containing part of D1 160 is lost, the packet has to be retransmitted by EP-A 110 and received by EP-B 120 before subsequent packets on the same subflow are delivered to application 122. However, since D2 162 is sent independently, the packet loss will not affect D2 162 on SF2 180, thereby reducing head-of-line blocking.
After the subflows are established, it should be understood that application 112 may send further data associated with D1 160 using “Tag-D1.” This indicates to protocol stack 116 that the further data should be sent on the same SF1 170 to maintain order. On the other hand, application 112 may send further data associated with D2 162 using “Tag-D2.” Again, this is to indicate to protocol stack 116 that the further data should be sent on the same SF2 180 to maintain order. In practice, protocol stack 116 may store the tag information, and associated subflow, in any suitable data store (e.g., table) for subsequent reference.
Although two subflows are shown in the examples, it should be understood that EP-A 110 and EP-B 120 may be configured to support any number of subflows to transfer data in parallel. For example, application 112 may use “Tag-D3” for a third data set “D3.” In this case, protocol stack 116 may determine whether a subflow has already been established for D3, such as by comparing “Tag-D3” with the known “Tag-D1” and “Tag-D2” (e.g., retrieved from the data store mentioned above). If not established, protocol stack 116 proceeds to establish a third subflow (e.g., SF3) to send D3 in parallel with D1 160 on SF1 170 and D2 162 on SF2 180. This further improves application throughput and performance.
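Purely as an illustration of such a data store (the structure, sizes and field names are assumptions), protocol stack 116 might keep a table along the following lines:

#include <string.h>

#define MAX_SUBFLOWS 8
#define TAG_LEN      16

/* Illustrative tag-to-subflow table kept by the protocol stack. */
struct tag_entry {
    char tag[TAG_LEN];  /* e.g., "Tag-D1", "Tag-D2" */
    int  subflow_id;    /* e.g., SF1, SF2, ... */
    int  in_use;
};

static struct tag_entry tag_table[MAX_SUBFLOWS];

/* Return the subflow already associated with the tag, or -1 if none. */
static int lookup_subflow(const char *tag)
{
    for (int i = 0; i < MAX_SUBFLOWS; i++)
        if (tag_table[i].in_use && strncmp(tag_table[i].tag, tag, TAG_LEN) == 0)
            return tag_table[i].subflow_id;
    return -1;
}

A miss (return value of -1 in this sketch) would indicate that a new subflow, such as SF3 for “Tag-D3,” should be established.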
Although EP-A 110 is multi-homed (i.e., having NIC-A1 118 and NIC-A2 119) in the example in
In more detail,
To establish an MPTCP connection over NIC-A 610, different port numbers are configured for subflows SF1 620 and SF2 630. For example, first set of tuples 622 is configured to be (IP-A, Port-A1, IP-B, Port-B) and second set of tuples 632 to be (IP-A, Port-A2, IP-B, Port-B), both sharing the same IP address. As such, although EP-A 110 is not multi-homed and multi-addressed, an MPTCP connection may be established to take advantage of the multiple paths between EP-A 110 and EP-B 120.
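For illustration, assuming standard sockets and placeholder addresses and port numbers, an endpoint might fix the local port of each connection with bind() before connecting, so that two connections share the same IP address but differ in source port, analogous to SF1 620 and SF2 630:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Open a TCP connection bound to the given local port before connecting,
 * so two subflows sharing the same local IP differ only in source port. */
static int open_subflow(const char *local_ip, uint16_t local_port,
                        const char *remote_ip, uint16_t remote_port)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in local, remote;
    memset(&local, 0, sizeof(local));
    local.sin_family = AF_INET;
    local.sin_port = htons(local_port);            /* e.g., Port-A1 or Port-A2 */
    inet_pton(AF_INET, local_ip, &local.sin_addr); /* e.g., IP-A */
    bind(s, (struct sockaddr *)&local, sizeof(local));

    memset(&remote, 0, sizeof(remote));
    remote.sin_family = AF_INET;
    remote.sin_port = htons(remote_port);            /* Port-B */
    inet_pton(AF_INET, remote_ip, &remote.sin_addr); /* IP-B */
    connect(s, (struct sockaddr *)&remote, sizeof(remote));
    return s;
}

Calling open_subflow() twice with the same local IP address but different local port numbers yields the two distinct sets of tuples described above.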
For example, according to first set of tuples 622, R1 130 may perform path selection to forward D1 160 on SF1 620 to EP-B 120, such as on a first path formed by R1 130, R3 150 and R2 140. A different path may be used for D2 162. For example, according to second set of tuples 632, R1 130 may perform path selection to forward D2 162 on SF2 630 to EP-B 120, such as on a second path formed by R1 130, R6 156 and R2 140.
In practice, the maximum number of subflows that can be used to transfer data in parallel may be configured. For example, EP-A 110 may be configured to have cognizance of the multiple paths between EP-A 110 and EP-B 120 and operate in a “network-cognizant mode.” Here, the term “cognizance” (of the multiple paths) and the related term “network-cognizant” may refer generally to one endpoint (e.g., EP-A 110) having awareness or knowledge of the multiple paths (e.g., further within the network) that lead to another endpoint.
Such cognizance of the multiple paths may then be exploited by application 112 of EP-A 110 to include tag information when sending data to EP-B 120 according to the examples in
Any suitable approach may be used to configure the network-cognizant mode and/or MAX_SF. For example, the configuration may be performed by a user (e.g., network administrator) who has knowledge of the leaf-spine topology and the number of leaf and spine switches in data center environment 300. It is also possible to initiate the configuration programmatically (e.g., using a script), such as based on relevant information (e.g., message, trigger, etc.) from a leaf switch, spine switch, endpoint, management device (not shown for simplicity), etc.
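Purely as an illustration of programmatic configuration (the variable name and mechanism are assumptions, not part of any standard), an endpoint operating in the network-cognizant mode might read MAX_SF from its environment:

#include <stdlib.h>

/* Hypothetical way an endpoint might learn MAX_SF; the environment
 * variable name is an assumption for illustration only. */
static int max_subflows(void)
{
    const char *v = getenv("MPTCP_MAX_SF");
    int max_sf = (v != NULL) ? atoi(v) : 1; /* default: single subflow */
    return (max_sf > 0) ? max_sf : 1;
}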
When operating in the network-cognizant mode, application throughput may be further improved using the example in
In the example in
At protocol stack 116, based on the cognizance of multiple paths (e.g., MAX_SF=4) between EP-A 110 and EP-B 120 as well as the independence of the different data sets, multiple subflows may be established to send D1 160, D2 162, D3 710 and D4 720 in parallel. In addition to SF1 170 and SF2 180 explained above, subflow SF3 730 is established to send D3 710 over NIC-A1 118. To distinguish SF3 730 from other subflows, third set of tuples 732=(IP-A1, Port-A3, IP-B, Port-B) is configured for SF3 730. Further, SF4 740 is established to send D4 720 over NIC-A2 119 and identified by fourth set of tuples 742=(IP-A2, Port-A4, IP-B, Port-B).
At R1 130, path selection may be performed to select a first path via R3 150 for D1 160 based on tuples 172 and a second path via R6 156 for D2 162 based on tuples 182. Additionally, a third path via R5 154 is selected for D3 710 based on tuples 732 and a fourth path via R4 152 for D4 720 based on tuples 742. Again, at destination EP-B 120, the order of arrival is not important. For example, D2 (see 192) may be first delivered to application 122, followed by D4 (see 750), then D1 (see 190) and finally D3 (see 760). As such, head-of-line blocking may be reduced because it is not necessary for data sets on different subflows to wait for each other.
Similar to the example in
Examples of the present disclosure are suitable in networks, such as data center environment 300, in which paths between EP-A 110 and EP-B 120 are under a single autonomous domain. In this case, the multiple paths between EP-A 110 and EP-B 120 are usually known, which may be exploited to establish an MPTCP connection to send data sets in parallel. Such custom network environments should be contrasted with public networks (e.g., the Internet) in which endpoints and intermediate devices are usually controlled by different autonomous systems or domains.
Although EP-A 110 is shown as the initiator of the MPTCP connection, it will be appreciated that the examples discussed using
The above examples can be implemented by hardware (including hardware logic circuitry), software or firmware or a combination thereof. The above examples may be implemented by any suitable computing device, network device, computer system, etc., which may include a processor and memory that may communicate with each other via a bus, etc. The network device may include a non-transitory computer-readable medium having stored thereon instructions that, when executed by the processor, cause the processor to perform processes described herein with reference to
The techniques introduced above can be implemented in special-purpose hardwired circuitry, in software and/or firmware in conjunction with programmable circuitry, or in a combination thereof. Special-purpose hardwired circuitry may be in the form of, for example, one or more application-specific integrated circuits (ASICs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), and others. The term ‘processor’ is to be interpreted broadly to include a processing unit, ASIC, logic unit, or programmable gate array etc.
The foregoing detailed description has set forth various embodiments of the devices and/or processes via the use of block diagrams, flowcharts, and/or examples. Insofar as such block diagrams, flowcharts, and/or examples contain one or more functions and/or operations, it will be understood by those within the art that each function and/or operation within such block diagrams, flowcharts, or examples can be implemented, individually and/or collectively, by a wide range of hardware, software, firmware, or any combination thereof.
Those skilled in the art will recognize that some aspects of the embodiments disclosed herein, in whole or in part, can be equivalently implemented in integrated circuits, as one or more computer programs running on one or more computers (e.g., as one or more programs running on one or more computing devices), as one or more programs running on one or more processors (e.g., as one or more programs running on one or more microprocessors), as firmware, or as virtually any combination thereof, and that designing the circuitry and/or writing the code for the software and/or firmware would be well within the skill of one skilled in the art in light of this disclosure.
Software and/or firmware to implement the techniques introduced here may be stored on a non-transitory computer-readable storage medium and may be executed by one or more general-purpose or special-purpose programmable microprocessors. A “computer-readable storage medium”, as the term is used herein, includes any mechanism that provides (i.e., stores and/or transmits) information in a form accessible by a machine (e.g., a computer, network device, personal digital assistant (PDA), mobile device, manufacturing tool, any device with a set of one or more processors, etc.). A computer-readable storage medium may include recordable/non-recordable media (e.g., read-only memory (ROM), random access memory (RAM), magnetic disk or optical storage media, flash memory devices, etc.).
The drawings are only illustrations of an example, wherein the units or procedure shown in the drawings are not necessarily essential for implementing the present disclosure. Those skilled in the art will understand that the units in the device in the examples can be arranged in the device in the examples as described, or can be alternatively located in one or more devices different from that in the examples. The units in the examples described can be combined into one module or further divided into a plurality of sub-units.
Number | Date | Country | Kind
--- | --- | --- | ---
6569/CHE/2015 | Dec 2015 | IN | national
Benefit is claimed under 35 U.S.C. 119(a)-(d) to Foreign application Serial No. 6569/CHE/2015 filed in India entitled “TRANSFERRING MULTIPLE DATA SETS USING A MULTIPATH CONNECTION”, on Dec. 8, 2015, by Nicira, Inc., which is herein incorporated in its entirety by reference for all purposes. The present application (Attorney Docket No. N221) is related in subject matter to U.S. patent application Ser. No. ______ (Attorney Docket No. N203.01) and U.S. patent application Ser. No. ______ (Attorney Docket No. N203.02), which are incorporated herein by reference.