The field relates generally to communications over a wide area network and, more particularly, to a connection-oriented protocol for wide area network communication devices.
The Transmission Control Protocol (TCP) has been very successful and significantly contributes to the popularity of the Internet. See, for example, M. Allman et al., “TCP Congestion Control,” Request for Comments 5681 (RFC5681) (September 2009). A majority of Internet communications are transmitted using TCP. Recently, however, with the rapid advance of optical networks and rich Internet applications, TCP has been found to be less efficient as the network bandwidth-delay product (BDP) increases. Bandwidth-delay product refers to the product of the capacity of a data link (typically, in bits per second) and its round-trip delay time (typically, in seconds). BDP represents an amount of data measured in bits (or bytes) that is equivalent to the maximum amount of data on the network circuit at any given time (e.g., data that has been transmitted but not yet acknowledged).
The Additive Increase Multiplicative Decrease (AIMD) algorithm of TCP reduces the TCP congestion window significantly but fails to recover to the available bandwidth quickly. See, for example, D. Chiu and R. Jain, “Analysis of the Increase/Decrease Algorithms for Congestion Avoidance in Computer Networks,” J. of Computer Networks and ISDN Systems, Vol. 17, No. 1, 1-14 (June 1989). Theoretical flow level analysis has shown that TCP becomes more vulnerable to packet loss as the BDP increases. See, for example, T. V. Lakshman and U. Madhow, “The Performance of TCP/IP for Networks with High Bandwidth-Delay Products and Random Loss,” IEEE ACM Trans. on Networking, Vol. 5, No. 3, 336-50 (July 1997).
A need therefore exists for improved techniques for overcoming the inefficiency problem of TCP over high-speed wide area networks.
Illustrative embodiments of the present invention provide connection-oriented communication devices with round trip time estimation. In at least one embodiment, a method is provided for communicating between a first communication device and a second communication device over at least one wide area communication network. The exemplary method at the first communication device comprises the steps of: the first communication device sending a Round Trip Time (RTT) packet to the second communication device, wherein the RTT packet comprises a timestamp, wherein the second communication device receives the RTT packet, copies the timestamp into a reply RTT packet and sends the reply RTT packet to the first communication device; receiving the reply RTT packet from the second communication device; and determining a current Round Trip Time based on a difference between the timestamp and a current time. A current Round Trip Time is optionally processed by a congestion avoidance and control algorithm.
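The timestamp-echo exchange described above can be sketched as follows. The 8-byte packet layout, microsecond resolution, and function names are illustrative assumptions, not taken from the protocol specification:

```python
import struct
import time

def make_rtt_packet() -> bytes:
    """Sender side: build an RTT probe carrying the current timestamp (microseconds)."""
    now_us = int(time.time() * 1_000_000)
    return struct.pack("!Q", now_us)

def make_reply(rtt_packet: bytes) -> bytes:
    """Receiver side: copy the timestamp unchanged into the reply RTT packet."""
    return rtt_packet  # the timestamp is echoed back verbatim

def current_rtt_seconds(reply_packet: bytes) -> float:
    """Sender side: the RTT is the difference between the current time
    and the timestamp echoed in the reply."""
    (sent_us,) = struct.unpack("!Q", reply_packet)
    return time.time() - sent_us / 1_000_000
```

The resulting RTT value would then be fed into the congestion avoidance and control algorithm.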
In at least one embodiment, a train comprises a plurality of packets and the second communication device determines an available network bandwidth by dividing a size of the train by an amount of time it took to receive the train. A length of the train is optionally based on the available network bandwidth.
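A minimal sketch of this bandwidth estimate, and of deriving a train length from it, might look as follows; the packet size, send interval, and function names are hypothetical:

```python
PACKET_SIZE = 1400  # bytes per packet, illustrative

def available_bandwidth_bps(train_bytes: int, receive_seconds: float) -> float:
    """Bandwidth estimate: size of the train divided by the time to receive it."""
    return train_bytes * 8 / receive_seconds

def next_train_length(bandwidth_bps: float, send_interval_s: float,
                      packet_size: int = PACKET_SIZE) -> int:
    """Choose a train length that keeps the estimated bandwidth busy
    for one send interval."""
    return max(1, int(bandwidth_bps * send_interval_s / (packet_size * 8)))
```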
In one or more embodiments, data of a transaction is divided into a plurality of chunks, and a bitmap is maintained for the chunks of the transaction indicating whether a given chunk has been acknowledged. The bitmap is optionally divided into a plurality of ranges, wherein each range has a due time indicating when the corresponding range must be transmitted and wherein the first communication device sends unacknowledged data chunks of a given range based on the due time.
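One possible shape for the per-transaction chunk bitmap with range due times is sketched below; the chunk size, range width, and class layout are illustrative assumptions rather than the specified data model:

```python
import time
from dataclasses import dataclass, field

CHUNK_SIZE = 1400          # illustrative chunk size in bytes
CHUNKS_PER_RANGE = 64      # illustrative number of chunks per range

@dataclass
class Transaction:
    data: bytes
    acked: list = field(init=False)   # one flag per chunk: acknowledged or not
    due: dict = field(init=False)     # range index -> due time for (re)transmission

    def __post_init__(self):
        n_chunks = (len(self.data) + CHUNK_SIZE - 1) // CHUNK_SIZE
        self.acked = [False] * n_chunks
        n_ranges = (n_chunks + CHUNKS_PER_RANGE - 1) // CHUNKS_PER_RANGE
        now = time.monotonic()
        self.due = {r: now for r in range(n_ranges)}  # all ranges due immediately

    def unacked_in_range(self, r: int) -> list:
        """Chunk indices of a range that still await acknowledgement."""
        start = r * CHUNKS_PER_RANGE
        end = min(start + CHUNKS_PER_RANGE, len(self.acked))
        return [i for i in range(start, end) if not self.acked[i]]

    def ranges_due(self, now: float = None) -> list:
        """Ranges whose due time has arrived and must be transmitted."""
        now = time.monotonic() if now is None else now
        return [r for r, t in self.due.items() if t <= now]
```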
In one exemplary embodiment, at least one application executing on the first communication device or the second communication device controls a size of a queue at the corresponding communication device based on conditions of the wide area communication network.
In one or more embodiments, the second communication device processes a transaction identifier of each received packet and processes a given received packet if the transaction identifier is known to the second communication device. In addition, the second communication device optionally processes a transaction identifier of each received packet and allocates a new transaction if the transaction identifier is not known to the second communication device and satisfies a predefined transaction identifier criteria. The second communication device can process a chunk identifier of each received packet to determine if a given packet is a new packet. The second communication device optionally updates a bitmap indicating whether a given chunk has been acknowledged and provides the bitmap to the first communication device.
Illustrative embodiments described herein provide significant improvements relative to the existing TCP protocol. In some of these embodiments, connection-oriented communication devices can estimate round trip time and thereby provide improved flexibility and efficiency in congestion avoidance and control algorithms compared to, for example, the TCP protocol.
Illustrative embodiments of the present invention will be described herein with reference to exemplary communication devices and associated clients, servers, and other processing and storage devices. It is to be appreciated, however, that the invention is not restricted to use with the particular illustrative device configurations shown.
In one exemplary embodiment, a data transfer protocol, comprising connection-oriented communication methods and apparatus, is provided, based on the User Datagram Protocol (UDP). See, for example, J. Postel, “User Datagram Protocol,” Request for Comments 768 (RFC768) (August 1980), incorporated by reference herein in its entirety. Data and control packets are transferred using UDP. The connection-oriented aspect of the invention allows congestion control, reliability, and security to be maintained.
According to another aspect of the invention, a unicast duplex communication protocol is provided that supports reliable messaging. One or more exemplary embodiments provide for reliable simultaneous transactions over UDP. In the following discussion, the communication methods and systems described herein are referred to as “BURST,” a recursive acronym for “BURST is UDP Reliable Simultaneous Transactions.”
In one or more embodiments, a BURST communication system allows applications to control the size of memory windows, also referred to as memory constraints, in one or more of sending queues and receive queues. In this manner, a BURST communication system permits effective use of network bandwidth since more live data can be stored locally, as discussed further below, for example, in conjunction with
In at least one embodiment, the disclosed BURST data transfer protocol allows the available network bandwidth for a given connection to be automatically measured. In one exemplary embodiment, discussed further below in conjunction with
In one or more embodiments, the protocol decision logic is optionally concentrated in the BURST transmitter, making the BURST receiver fully passive and allowing for robust BURST implementations with application programming interfaces (APIs) compatible with Berkeley sockets. Generally, Berkeley sockets are an API for Internet sockets and Unix domain sockets, used for inter-process communication (IPC), and are commonly implemented as a library of linkable modules.
Another aspect of the invention provides a protocol handshake procedure that allows clients to reconnect to a different server on the fly, for better load balancing and flexibility. One or more embodiments provide a connection-oriented, message-oriented, unicast, and duplex BURST data transfer protocol.
BURST Data Model
A given chunk is acknowledged by the receiver (400;
As noted above, in one or more embodiments, the application 310 can control the size of memory windows in the sending queue 340 to accommodate the storage of transactions 120. In this manner, the available network bandwidth can be used more effectively since more live data can be stored. The transactions 120 are stored into the sending queue 340, which is handled by a shaper 360, discussed below.
The size of each queue, such as the sending queue 340, is the memory constraint set by an application. The application may know the conditions of the WAN 380 (e.g., latency, maximum bandwidth and loss percent). Based on these values, the application can estimate how much memory is needed to effectively utilize this particular WAN 380.
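One plausible estimate of the required memory is the bandwidth-delay product of the WAN, padded for the expected loss percent; the formula below is an illustration of this reasoning, not taken from the specification:

```python
def queue_memory_bytes(bandwidth_bps: float, rtt_seconds: float,
                       loss_percent: float) -> int:
    """Estimate queue memory as the bandwidth-delay product (bytes that can be
    in flight), padded so retransmissions under loss can also be held."""
    bdp_bytes = bandwidth_bps * rtt_seconds / 8
    return int(bdp_bytes / (1 - loss_percent / 100))
```

For example, a 80 Mbps link with 100 ms RTT and no loss would need about 1 MB of in-flight data.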
It may happen that an application cannot provide all of the memory required, e.g., when the WAN 380 is 40 Gbps with a large delay, and the memory budget is tight. In this case, the disclosed BURST protocol operates on a best-effort basis and can still exceed the performance of TCP.
It is noted that the BURST transmitter application 310 (and receiver application 460 of
The sending logic 330 maintains a sorted list of the range due times for transactions 120 in the sending queue 340. The unacknowledged data chunks of a range are sent when their range due time becomes current.
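The sorted list of range due times can be maintained with a priority queue, as in this hypothetical sketch (the class and method names are assumptions):

```python
import heapq
import time

class DueTimeScheduler:
    """Keep (due_time, transaction_id, range_index) entries sorted by due time,
    and pop the entries whose due time has become current."""

    def __init__(self):
        self._heap = []

    def schedule(self, due_time: float, tid: int, range_index: int) -> None:
        heapq.heappush(self._heap, (due_time, tid, range_index))

    def pop_due(self, now: float = None) -> list:
        """Return all (tid, range_index) pairs that are due at `now`."""
        now = time.monotonic() if now is None else now
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, tid, range_index = heapq.heappop(self._heap)
            due.append((tid, range_index))
        return due
```

The sending logic would sleep until the earliest due time in the heap, pop the due ranges, and send their unacknowledged chunks.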
In addition, the sending logic 330 probes the available network bandwidth and periodically measures network Round Trip Time (RTT). In one or more embodiments, RTT is measured by sending probing RTT packets that carry a current timestamp. When a BACKRTT reply packet is received, the difference between the current time and the carried time allows for RTT evaluation. Available network bandwidth is calculated from the network queuing delay, discussed further below.
The sending logic 330 also processes control messages received from the local receiver 400. The local receiver 400 may request the BURST transmitter 300 to send BACKRTT and BITMAP packets, as discussed further below in conjunction with
As noted above, the data chunks (DATA packet) of the transactions 120 are stored into the sending queue 340. In addition, a number of control messages (e.g., RTT, BACKRTT and BITMAP packets) are also stored in the same sending queue 340, handled by the shaper 360.
Generally, the shaper 360 ensures (i) correct available network bandwidth consumption, (ii) congestion avoidance, and (iii) proper packet preparation and timing for available network bandwidth probing using link probing logic 350. In one or more embodiments, the shaper 360 sends the packets onto the wide area network (WAN) 380 in bursts, i.e., a number of packets are sent one after another without any delays between them. The burst of packets is referred to as a train.
Congestion avoidance logic 370 and the probed available bandwidth are the inputs for the shaper 360. Based on this input, the shaper 360 calculates the length of the train. Each packet that gets into a train is assigned a train sequence number, e.g., starting from 0. The train sequence numbers are used by the receiver 400 to calculate the network queuing delay, as discussed further below in conjunction with
The sending logic 330 sleeps until the range sending time of another transaction is due, or a periodic RTT measurement is required, for example, based on a timer. While sending logic 330 sleeps, the transmitter 300 can receive chunk acknowledgements, and the chunks bitmap 130 is updated accordingly. A given transaction 120 becomes complete when all of the chunks in the given transaction 120 are acknowledged. Thereafter, the size of the completed transaction 120 is subtracted from the memory consumption tracking 335 and its memory can be freed.
As noted above, in one or more embodiments, a BURST communication system allows applications to control the size of memory windows, also referred to as memory constraints, in one or more of sending queues 340 and receive (ready) queues 450. In this manner, a BURST communication system permits effective use of network bandwidth since more live data can be stored. An application 460 at the BURST receiver 400 may know the conditions of the WAN 410 (e.g., latency, maximum bandwidth and loss percent). Based on these values, the application can estimate how much memory is needed to effectively utilize this particular WAN 410.
The exemplary BURST receiver 400 maintains the list of incomplete transactions 430, as well as a list of complete transactions (not shown in
The receiver 400 receives an incoming flow of different packet types: DATA, RTT, BITMAP and BACKRTT. All packets have a train sequence number that is used to calculate the available network bandwidth.
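One common way to derive bandwidth from train sequence numbers is packet-train dispersion: since the packets of a train are sent back-to-back, their arrival spacing reflects the bottleneck rate. The sketch below assumes this approach; it is illustrative rather than the specified algorithm:

```python
def bandwidth_from_train(arrivals: list, packet_bytes: int):
    """
    arrivals: list of (train_seq, arrival_time_s) tuples for one train,
    sorted by train sequence number. Because the packets were sent with no
    gaps between them, the arrival dispersion is imposed by the bottleneck:
    bandwidth ~= bits received between first and last packet / dispersion.
    Returns bits per second, or None if the train is too short to measure.
    """
    if len(arrivals) < 2:
        return None
    first_seq, first_t = arrivals[0]
    last_seq, last_t = arrivals[-1]
    dispersion = last_t - first_t
    if dispersion <= 0:
        return None
    received_bits = (last_seq - first_seq) * packet_bytes * 8
    return received_bits / dispersion
```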
Each DATA packet received during step 610 has a transaction identifier and a chunk identifier in its header. As shown in
If it is determined during step 620 that the transaction identifier is known, then the packet is processed further. If, however, it is determined during step 620 that the transaction identifier is not known, then its value is compared to tidlow during step 630. If the transaction identifier is less than tidlow, then the transaction was completely received, but the sender did not receive a finishing BITMAP for the transaction; the packet is discarded and the finishing BITMAP is scheduled to send again during step 635.
If the transaction identifier is not known to the receiver and is not less than tidlow, then a new transaction has appeared and the packet is processed further. A memory buffer for the transaction is allocated during step 645 and its size is added to the memory consumption tracker. If the memory consumption is above a predefined high watermark, then the receiver sets the do_not_send flag to the transmitter with the BACKRTT packet. In addition, a bitmap is created for the chunks of the transaction, where each bit is set to 0 (i.e., not present/not acknowledged).
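The transaction-identifier checks of steps 620 through 645 might be sketched as follows; the class layout, return codes, memory accounting, and watermark value are hypothetical:

```python
HIGH_WATERMARK = 64 * 1024 * 1024  # illustrative memory limit in bytes

class Receiver:
    def __init__(self):
        self.transactions = {}   # tid -> {"bitmap": [...], "chunks": {...}}
        self.tid_low = 0         # transactions below this were fully received
        self.memory_used = 0
        self.do_not_send = False # signalled to the transmitter via BACKRTT

    def handle_data(self, tid: int, chunk_id: int, payload: bytes,
                    total_chunks: int) -> str:
        if tid not in self.transactions:
            if tid < self.tid_low:
                # Transaction was fully received earlier; the sender missed
                # the finishing BITMAP, so resend it and drop the packet.
                return "discard_resend_bitmap"
            # New transaction: allocate a buffer and a zeroed chunk bitmap.
            self.transactions[tid] = {
                "bitmap": [0] * total_chunks,
                "chunks": {},
            }
            # Rough memory estimate using this payload's size as the chunk size.
            self.memory_used += total_chunks * len(payload)
            if self.memory_used > HIGH_WATERMARK:
                self.do_not_send = True
        tx = self.transactions[tid]
        if tx["bitmap"][chunk_id]:
            return "duplicate"   # chunk already received and acknowledged
        tx["bitmap"][chunk_id] = 1
        tx["chunks"][chunk_id] = payload
        return "accepted"
```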
As shown in
The exemplary packet handling process processes the information in the DATA packet header and the state variables during step 665, and determines whether to send a BITMAP packet that contains a collective acknowledgement for all chunks in a range. If it is determined during step 665 that the DATA packet has an explicit request to send the BITMAP packet for the range, the BITMAP packet for the DATA packet's range is scheduled to be sent during step 668.
If it is determined during step 670 that the DATA packet's range is not equal to rangelast, then the BITMAP packet for rangelast is scheduled to be sent during step 672 and rangelast is updated with the current range. If it is determined during step 675 that the DATA packet's transaction identifier is not equal to tidlast, the BITMAP packet for transaction tidlast's range rangelast is scheduled to be sent during step 678, and tidlast and rangelast are updated with the current values.
The exemplary packet handling process analyzes the chunk identifier gaps and evaluates the network packet loss percent during step 680. The loss percent value can be sent to the remote transmitter 300 with a BACKRTT packet. If it is determined during step 682 that not all chunks for a transaction have been received, then the process waits for subsequent packets until all chunks of the transaction are received during step 684.
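The loss-percent evaluation from chunk identifier gaps could be approximated as below; this simple gap count is an assumption about the method, which the text does not fully specify:

```python
def loss_percent(chunk_ids: list) -> float:
    """Estimate packet loss from gaps in the received chunk-identifier
    sequence: chunks between the lowest and highest observed identifiers
    that never arrived are counted as lost."""
    if not chunk_ids:
        return 0.0
    expected = max(chunk_ids) - min(chunk_ids) + 1
    received = len(set(chunk_ids))
    return 100.0 * (expected - received) / expected
```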
Once all chunks for a transaction are received during step 682, then the transaction is complete. During step 685, the transaction is moved to the complete transactions list, and made available for an application; the transaction's size is subtracted from the memory consumption tracker; the do_not_send flag is updated accordingly and tidlow and tidhigh are updated accordingly to complete the data packet processing.
The received BITMAP packet is passed to the local transmitter during step 688 and the transmitter updates the outbound transaction bitmaps accordingly. The timestamp contained in the received RTT packet is copied into a BACKRTT packet and the BACKRTT packet is scheduled to be sent during step 690. The local transmitter optionally applies a time correction to the BACKRTT timestamp to exclude processing time, making the RTT evaluation more precise.
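The optional time correction can be sketched as shifting the echoed timestamp forward by the local processing time, so that the sender's (current time − timestamp) difference excludes time spent at the receiver; the microsecond units are an assumption:

```python
def make_backrtt_timestamp(rtt_timestamp_us: int,
                           receive_time_us: int,
                           send_time_us: int) -> int:
    """Echo the RTT probe's timestamp, shifted forward by the receiver's
    processing time so that the sender's RTT estimate excludes it."""
    processing_us = send_time_us - receive_time_us
    return rtt_timestamp_us + processing_us
```

For example, if 200 microseconds elapsed between receiving the RTT packet and sending the BACKRTT reply, the echoed timestamp is advanced by 200 microseconds.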
As shown in
When the initiator 710 has completed the data transfer 770, the initiator 710 sends a disconnection request 775 to the server 730. The server 730 responds with a disconnection response 780 to the initiator 710. The initiator 710 then acknowledges the disconnection with a reply 780.
In one or more exemplary embodiments, BURST is a three-way handshake protocol that is immune to “SYN flooding”-type attacks. In addition, a CONN_RSP packet can carry an IP address and port that is different from those of the Listener 720, allowing for on-the-fly client reconnection to a different server.
In one or more exemplary embodiments, security can be implemented externally using BURST as a reliable transport. For example, OpenSSL or RSA BSAFE Transport Layer Security implementations can be used on top of BURST.
Among other benefits, the disclosed BURST protocol can be employed for bulk data transfers, such as replication and cloud data uploads.
The foregoing applications and associated embodiments should be considered as illustrative only, and numerous other embodiments can be configured using the techniques disclosed herein, in a wide variety of different applications.
It should also be understood that the connection-oriented communication techniques, as described herein, can be implemented at least in part in the form of one or more software programs stored in memory and executed by a processor of a processing device such as a computer. As mentioned previously, a memory or other storage device having such program code embodied therein is an example of what is more generally referred to herein as a “computer program product.”
The communication devices may be implemented using one or more processing platforms. One or more of the processing modules or other components may therefore each run on a computer, storage device or other processing platform element. A given such element may be viewed as an example of what is more generally referred to herein as a “processing device.”
Referring now to
The cloud infrastructure 1000 may encompass the entire given system or only portions of that given system, such as one or more of client, servers, controller, authentication server or relying server in the system.
Although only a single hypervisor 1004 is shown in the embodiment of
An example of a commercially available hypervisor platform that may be used to implement hypervisor 1004 and possibly other portions of the system in one or more embodiments of the invention is the VMware® vSphere™, which may have an associated virtual infrastructure management system, such as the VMware® vCenter™. The underlying physical machines may comprise one or more distributed processing platforms that include storage products, such as VNX™ and Symmetrix VMAX™, both commercially available from EMC Corporation of Hopkinton, Mass. A variety of other storage products may be utilized to implement at least a portion of the system.
In some embodiments, the cloud infrastructure additionally or alternatively comprises a plurality of containers implemented using container host devices. For example, a given container of cloud infrastructure illustratively comprises a Docker container or other type of LXC. The containers may be associated with respective tenants of a multi-tenant environment, although in other embodiments a given tenant can have multiple containers. The containers may be utilized to implement a variety of different types of functionality within the system. For example, containers can be used to implement respective compute nodes or cloud storage nodes of a cloud computing and storage system. The compute nodes or servers may be associated with respective cloud tenants of a multi-tenant environment. Containers may be used in combination with other virtualization infrastructure such as virtual machines implemented using a hypervisor.
Another example of a processing platform is processing platform 1100 shown in
The processing device 1102-1 in the processing platform 1100 comprises a processor 1110 coupled to a memory 1112. The processor 1110 may comprise a microprocessor, a microcontroller, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other type of processing circuitry, as well as portions or combinations of such circuitry elements, and the memory 1112, which may be viewed as an example of a “computer program product” having executable computer program code embodied therein, may comprise random access memory (RAM), read only memory (ROM) or other types of memory, in any combination.
Also included in the processing device 1102-1 is network interface circuitry 1114, which is used to interface the processing device with the network 1104 and other system components, and may comprise conventional transceivers.
The other processing devices 1102 of the processing platform 1100 are assumed to be configured in a manner similar to that shown for processing device 1102-1 in the figure.
Again, the particular processing platform 1100 shown in the figure is presented by way of example only, and the given system may include additional or alternative processing platforms, as well as numerous distinct processing platforms in any combination, with each such platform comprising one or more computers, storage devices or other processing devices.
Multiple elements of the system may be collectively implemented on a common processing platform of the type shown in
As is known in the art, the methods and apparatus discussed herein may be distributed as an article of manufacture that itself comprises a computer readable medium having computer readable code means embodied thereon. The computer readable program code means is operable, in conjunction with a computer system, to carry out all or some of the steps to perform the methods or create the apparatuses discussed herein. The computer readable medium may be a tangible recordable medium (e.g., floppy disks, hard drives, compact disks, memory cards, semiconductor devices, chips, application specific integrated circuits (ASICs)) or may be a transmission medium (e.g., a network comprising fiber-optics, the world-wide web, cables, or a wireless channel using time-division multiple access, code-division multiple access, or other radio-frequency channel). Any medium known or developed that can store information suitable for use with a computer system may be used. The computer-readable code means is any mechanism for allowing a computer to read instructions and data, such as magnetic variations on a magnetic media or height variations on the surface of a compact disk.
It should again be emphasized that the above-described embodiments of the invention are presented for purposes of illustration only. Many variations and other alternative embodiments may be used. For example, the techniques are applicable to a wide variety of other types of storage systems that can benefit from the connection-oriented communication techniques disclosed herein. Also, the particular configuration of communication device elements shown herein, and the associated connection-oriented communication techniques, can be varied in other embodiments. Moreover, the various simplifying assumptions made above in the course of describing the illustrative embodiments should also be viewed as exemplary rather than as requirements or limitations of the invention. Numerous other alternative embodiments within the scope of the appended claims will be readily apparent to those skilled in the art.
Number | Date | Country | Kind |
---|---|---|---|
2015155284 | Dec 2015 | RU | national |
Number | Name | Date | Kind |
---|---|---|---|
7274299 | Osman | Sep 2007 | B2 |
7817546 | Filsfils | Oct 2010 | B2 |
8064356 | Krzanowski | Nov 2011 | B1 |
8102852 | Marcondes | Jan 2012 | B2 |
8125970 | Gu | Feb 2012 | B2 |
8274891 | Averi | Sep 2012 | B2 |
8559982 | Wu | Oct 2013 | B2 |
8761037 | Krzanowski | Jun 2014 | B2 |
8769366 | Axelsson | Jul 2014 | B2 |
8773993 | Shojania | Jul 2014 | B2 |
8929305 | Cho | Jan 2015 | B2 |
9100135 | Tosti | Aug 2015 | B2 |
9357534 | Xu | May 2016 | B2 |
9451571 | Lorenz | Sep 2016 | B2 |
9459337 | Aldana | Oct 2016 | B2 |
9667518 | Lakshmikantha | May 2017 | B2 |
9918283 | Braxton | Mar 2018 | B2 |
9935756 | Marri Sridhar | Apr 2018 | B2 |
9998979 | Cui | Jun 2018 | B2 |
20080144624 | Marcondes | Jun 2008 | A1 |
20090010158 | Filsfils | Jan 2009 | A1 |
20100110922 | Ketheesan | May 2010 | A1 |
20100202303 | Gu | Aug 2010 | A1 |
20100205499 | Axelsson | Aug 2010 | A1 |
20110090856 | Cho | Apr 2011 | A1 |
20110274124 | Tosti | Nov 2011 | A1 |
20120063345 | Krzanowski | Mar 2012 | A1 |
20120117273 | Averi | May 2012 | A1 |
20120281715 | Shojania | Nov 2012 | A1 |
20130262951 | Axelsson | Oct 2013 | A1 |
20140003310 | Kamath et al. | Jan 2014 | A1 |
20140241163 | Lee | Aug 2014 | A1 |
20140355461 | Aldana | Dec 2014 | A1 |
20150045048 | Xu | Feb 2015 | A1 |
20160006526 | Cho | Jan 2016 | A1 |
20160044524 | Ben-Haim | Feb 2016 | A1 |
20160088581 | Lorenz | Mar 2016 | A1 |
20160119968 | Kim | Apr 2016 | A1 |
20160212032 | Tsuruoka | Jul 2016 | A1 |
20160241373 | Marri Sridhar | Aug 2016 | A1 |
20160323421 | Sakurai | Nov 2016 | A1 |
20160359753 | Robitaille | Dec 2016 | A1 |
20170013569 | Braxton | Jan 2017 | A1 |
20170078176 | Lakshmikantha | Mar 2017 | A1 |
20170093731 | Flajslik | Mar 2017 | A1 |
20170094298 | Gu | Mar 2017 | A1 |
20170188192 | Mujtaba | Jun 2017 | A1 |
20170208534 | Cui | Jul 2017 | A1 |
20170280343 | Chu | Sep 2017 | A1 |
20170295585 | Sorrentino | Oct 2017 | A1 |
20180152853 | Soder | May 2018 | A1 |
20180152910 | Ryu | May 2018 | A1 |
20180167894 | Braxton | Jun 2018 | A1 |
Number | Date | Country |
---|---|---|
6192490 | Mar 1991 | AU |
2023553 | Mar 1991 | CA |
102197611 | Sep 2011 | CN |
102197611 | Jun 2014 | CN |
0415843 | Mar 1991 | EP |
2359510 | Aug 2011 | EP |
2359510 | Oct 2012 | EP |
2605453 | Jun 2013 | EP |
WO-2010045961 | Apr 2010 | WO |
Entry |
---|
Hedayat et al., “A Two-Way Active Measurement Protocol (TWAMP),” Request for Comments 5357 (RFC 5357) (Oct. 2008). |
Chiu et al., “Analysis of the Increase/Decrease Algorithms for Congestion Avoidance in Computer Networks”, J. of Computer Networks and ISDN Systems, vol. 17, No. 1, 1-14 (Jun. 1989). |
Lakshman et al., “The Performance of TCP/IP for Networks with High Bandwidth-Delay Products and Random Loss,” IEEE ACM Trans. on Networking, vol. 5 No. 3, 336-50 (Jul. 1997). |
Allman et al., “TCP Congestion Control,” Request for Comments 5681 (RFC5681) (Sep. 2009). |
E. Rescorla et al., “Datagram Transport Layer Security Version 1.2”, Request for Comments 6347 (RFC 6347) (Jan. 2012). |
W. Eddy, “TCP SYN Flooding Attacks and Common Mitigations” Request for Comments 4987 (RFC4987) (Aug. 2007). |
Tan et al., “A Compound TCP Approach for High-speed and Long Distance Networks”, in IEEE Infocom, Apr. 2006, Barcelona, Spain. |
J. Postel, “User Datagram Protocol”, downloaded from http://www.rfc-base.org/txt/rfc-768.txt, ISI, Aug. 28, 1980. |
Marchenko et al., “Congestion Avoidance and Control for UDP-Based Protocols”, U.S. Appl. No. 14/236,273, filed Aug. 25, 2014. |
Number | Date | Country | |
---|---|---|---|
20170187598 A1 | Jun 2017 | US |