Traditional computing systems are connected to an external network via a single network device such as a wired or wireless Network Interface Card (NIC). Some systems may include a secondary NIC to provide failover functionality in case issues arise with the primary NIC. In either case, network traffic to and from the computing system is carried by one network device at any given time.
New topologies allow applications to transmit messages over multiple network devices. For example, an application may execute on a computer server including two network devices. The application may establish two connections with a remote application and tie each connection to a respective network device, each of which represents a different network path. The application then transmits messages associated with a first connection over the respective network device of the first connection and transmits messages associated with a second connection over the respective network device of the second connection. The bandwidth desired by the application may be apportioned among the network paths. Each network device may therefore operate at a lower bandwidth than would otherwise be required, providing improved network resiliency and scalability.
These topologies require applications to load-balance messages across multiple network paths. Moreover, a failure of a network device will halt the transmission of messages of the application-to-application connection which is associated with the failed network device.
Systems are desired to transparently leverage all available network paths. Such systems may be used to maximize the network bandwidth which is effectively available to an application connection. Such systems may increase resilience against failures of one or more available network paths. From an application standpoint, it would be beneficial to transmit and receive data over an application-to-application connection without regard to or knowledge of the multiple underlying network paths over which the data may pass.
The following description is provided to enable any person skilled in the art to make and use the described embodiments. Various modifications, however, will be readily apparent to those skilled in the art.
Some embodiments provide an intermediate layer to shield applications from an underlying multi-path fabric implementation. Transparent to an application, data transmitted by the application to a remote application is not tied to a single network path and may travel over multiple network paths. In some embodiments, the intermediate layer segments a message into portions, and, for each portion, the intermediate layer dynamically selects one of several available network paths on which to schedule transmission of the portion. The selection may be based on respective monitored characteristics of the available network paths, including but not limited to a current status (e.g., enabled, disabled, online, offline), queue size, congestion, and performance.
Embodiments may respond seamlessly to NIC failures. For example, a case is considered in which four network paths are available for scheduling transmission of portions of a message, and one of the network paths fails. This failure will be reflected in the monitored characteristics of the failed network path, and future message portions will therefore not be scheduled on the failed network path. Any message portions which were previously scheduled on the failed network path but were not successfully received are re-scheduled onto one of the remaining available network paths. Embodiments may therefore improve failure tolerance without requiring action from the sending application or significantly impacting performance.
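As a minimal sketch of this behavior only, the following Python fragment re-schedules unacknowledged portions away from a path whose monitored status has gone offline. All names are hypothetical, and path selection is reduced here to a simple round-robin over the remaining online paths:

```python
# Minimal sketch: re-schedule unacknowledged portions away from a failed path.
# All names are hypothetical; selection is simplified to "pick any online path".

def reschedule_on_failure(portions, paths):
    """portions: list of dicts with 'path' and 'acked' keys.
    paths: dict mapping path id -> {'online': bool}."""
    online = [p for p, state in paths.items() if state["online"]]
    if not online:
        raise RuntimeError("no network path available")
    for i, portion in enumerate(portions):
        if not portion["acked"] and not paths[portion["path"]]["online"]:
            # Move the portion to a healthy path; round-robin over online paths.
            portion["path"] = online[i % len(online)]
    return portions


if __name__ == "__main__":
    paths = {"nic116-nic126": {"online": True},
             "nic117-nic126": {"online": False}}
    portions = [{"path": "nic117-nic126", "acked": False},
                {"path": "nic116-nic126", "acked": True}]
    print(reschedule_on_failure(portions, paths))
```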
Computing system 110 also includes NICs 116 through 119. Embodiments of computing system 110 are not limited to any particular number of network devices. As shown, multipathing component 114 logically resides between CPUs 112 and NICs 116 through 119. In some embodiments, one or more hardware accelerators such as a Network Processing Unit (NPU) may be substituted for CPUs 112.
Generally, multipathing component 114 receives an instruction from CPUs 112 (or, alternatively, NPUs) to transmit a message over an application connection. The application connection defines a remote application to which the message should be sent. Notably, the received instruction does not specify any particular network path or paths over which the message should be sent.
Multipathing component 114 segments the message into multiple work queue elements (WQEs) as will be described below. Also, as will be described below, multipathing component 114 monitors characteristics of network paths which include NICs 116 through 119 and schedules the WQEs on the network paths according to the characteristics.
NICs 116 through 119 may comprise remote direct memory access (RDMA)-enabled NICs (RNICs), but embodiments are not limited thereto. RDMA is a communication protocol which moves data from the memory of one computing system directly into the memory of another computing system with minimal involvement from their respective CPUs. The RDMA protocol allows a computing system to place data directly into its final memory destination in another computing system without additional or interim data copies.
Accordingly, in some embodiments, CPUs 112 may simply use RDMA verbs as is known in the art to instruct multipathing component 114 to transmit messages, with multipathing component 114 appearing as a single RNIC to CPUs 112. Multipathing component 114 receives these RDMA verbs and operates in response as described herein to segment and transmit the messages via NICs 116 through 119.
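By way of illustration only, the sketch below models the multipathing component as a single device with one send entry point, as it would appear to the CPUs. The verb-like post_send() and its parameters are stand-ins chosen for this sketch, not an actual RDMA verbs call:

```python
# Hypothetical facade: the CPUs see one "device" with a single post_send();
# internally the component fans the message out over several NICs.

class MultipathDevice:
    def __init__(self, nic_ids):
        self.nic_ids = list(nic_ids)          # underlying physical NICs
        self.queues = {nic: [] for nic in nic_ids}

    def post_send(self, aqp_id, message: bytes, mtu: int = 1500):
        """Verb-like entry point (hypothetical). Splits the message into
        MTU-sized packets and spreads them across the available NICs."""
        packets = [message[i:i + mtu] for i in range(0, len(message), mtu)]
        for seq, pkt in enumerate(packets):
            nic = self.nic_ids[seq % len(self.nic_ids)]
            self.queues[nic].append((aqp_id, seq, pkt))


dev = MultipathDevice(["nic116", "nic117", "nic118", "nic119"])
dev.post_send(aqp_id=1, message=b"x" * 4000)
print({nic: len(q) for nic, q in dev.queues.items()})
```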
Network 130 may comprise any number and type of public and private networks suitable to support communication from NICs 116 through 119 to computing system 120 via a desired protocol. Network 130 may comprise a cabled Ethernet network within a datacenter, the public cloud, or any combination thereof, for example.
Computing system 120 may comprise any computing device executing an application 121 to which application 111 wishes to transmit a message. Similarly to computing system 110, computing system 120 comprises one or more CPUs 122, random access memory including transmit buffer 123 and receive buffer 125, and related unshown hardware to provide an environment for the execution of application 121. Computing system 120 may comprise a blade server or a rack server installed within a server rack located in the same datacenter as computing system 110, but embodiments are not limited thereto.
Computing system 120 includes NIC 126. According to some embodiments, NIC 126 comprises an RNIC with which CPUs 122 (or one or more corresponding accelerators of computing system 120) communicate directly. Computing system 120 therefore does not include a multipathing component, and all network traffic is transmitted from and received at NIC 126. In view of NICs 116 through 119 of computing system 110 and NIC 126 of computing system 120, four network paths exist between computing system 110 and computing system 120 (i.e., NIC 116 to NIC 126, NIC 117 to NIC 126, NIC 118 to NIC 126, and NIC 119 to NIC 126). These four network paths are visible to multipathing component 114 but not to either of application 111 or application 121, although messages sent from application 111 to application 121 may travel over all four network paths as described herein.
Prior to S210, a communication channel, or connection, is established between two applications executing on two different network nodes. Computing systems 110 and 120 are examples of network nodes according to some embodiments. Establishment of a communication channel may typically be performed by the two endpoint applications, rather than by a multipath component as described herein, and therefore is depicted in
An AQP may be negotiated between two applications as is known in the art. For example, the sender application may establish the AQP by exchanging control messages with the receiver application based on physical end node addressing (e.g., IP address, MAC address), ordering policy, QoS, etc. The AQP may be assigned specific characteristics such as endpoint characteristics (e.g., in-order, relaxed-order, QoS priority) and network fabric characteristics (e.g., QoS priority, Traffic Class (TC) assignment).
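For illustration, the characteristics negotiated for an AQP could be captured in a simple record such as the following. The field names are assumptions based on the characteristics listed above, not a prescribed data structure:

```python
from dataclasses import dataclass

# Hypothetical record of negotiated AQP characteristics (names are illustrative).
@dataclass
class AqpCharacteristics:
    remote_ip: str            # physical end node addressing
    remote_mac: str
    ordering: str             # e.g., "in-order" or "relaxed-order"
    qos_priority: int         # endpoint and fabric QoS priority
    traffic_class: int        # network fabric Traffic Class (TC) assignment

aqp = AqpCharacteristics("10.0.0.2", "aa:bb:cc:dd:ee:ff",
                         ordering="relaxed-order", qos_priority=3, traffic_class=1)
print(aqp)
```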
At S210, an instruction is received to send a message to the remote application of the AQP. According to some embodiments, multipathing component 114 receives an RDMA instruction from CPUs 112 including a pointer to transmit buffer 113 (which identifies the data of the message) and an identifier of the AQP.
The message is segmented into a plurality of segments, or WQEs, at S220. According to some embodiments, each AQP is associated with a dedicated WQE dispatcher which segments messages associated with the AQP at S220. Using such dedicated WQE dispatchers, each AQP progresses independently and is non-blocking with respect to any other AQPs.
Each WQE includes N packets, where the maximum size of each packet is the maximum transmission unit (MTU) size of the Internet Protocol transport medium. The AQP applications may negotiate N prior to segmentation of each new message to be transmitted. At S230, a network path is determined for a WQE based on one or more network characteristics of a plurality of network paths. The plurality of network paths may be determined by a multipathing component according to some embodiments. Determination of the plurality of network paths may be facilitated by node pre-registration and/or a node discovery mechanism.
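A minimal sketch of the segmentation at S220, assuming the message is supplied as a byte buffer and that N and the MTU have already been negotiated, might look as follows (names hypothetical):

```python
# Segment a message into WQEs of N packets, each packet at most MTU bytes.

def segment_message(message: bytes, n: int, mtu: int):
    packets = [message[i:i + mtu] for i in range(0, len(message), mtu)]
    wqes = [packets[i:i + n] for i in range(0, len(packets), n)]
    return wqes  # wqes[k] is WQE number k, a list of up to N packets


wqes = segment_message(b"a" * 10000, n=2, mtu=1500)
print(len(wqes), [sum(len(p) for p in w) for w in wqes])
# 10000 bytes -> 7 packets -> 4 WQEs (the last WQE holds a single short packet)
```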
Multipathing component 310 also shows queues 313 through 316 to which WQEs may be scheduled at S230. In the illustrated example, each of queues 313 through 316 is associated with a respective network path. If the receiving node includes one NIC, the number of available network paths (and queues) may be equal to the number of NICs of the transmitting node. If the receiving node and the transmitting node each include multiple NICs, the number of available network paths (and queues) may be equal to the product of the number of NICs of the transmitting node and the number of NICs of the receiving node. This calculation is based on the realization that each NIC of the transmitting node may form a network path with each NIC of the receiving node.
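The path count described above follows directly from pairing each transmitting NIC with each receiving NIC, as in this short sketch (identifiers illustrative):

```python
from itertools import product

tx_nics = ["nic116", "nic117", "nic118", "nic119"]
rx_nics = ["nic126"]                       # one NIC at the receiving node

paths = [(tx, rx) for tx, rx in product(tx_nics, rx_nics)]
print(len(paths))                          # 4 paths: 4 transmit NICs x 1 receive NIC
```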
A WQE scheduler 311 may determine a network path on which to schedule a WQE at S230 based on network path characteristics 312. Generally, network path monitor 317 may periodically collect performance-related information from each network path and compile the performance-related information into network path characteristics 312. Network path characteristics 312 may specify congestion state, traffic pacing requirements, node reachability, round-trip time (RTT), etc. According to some embodiments, network path monitor 317 updates network path characteristics 312 each time an acknowledgement packet is received from a remote receiver node of a given network path.
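One way such characteristics could be maintained is sketched below. The use of an exponentially weighted moving average for RTT is an assumption, since the embodiments do not prescribe a particular smoothing method:

```python
import time

# Hypothetical per-path characteristics record updated on each acknowledgement.
class PathCharacteristics:
    def __init__(self):
        self.rtt = None          # smoothed round-trip time (seconds)
        self.last_ack = None     # timestamp of most recent acknowledgement
        self.online = True

    def on_ack(self, send_time: float, alpha: float = 0.125):
        sample = time.monotonic() - send_time
        self.rtt = sample if self.rtt is None else (1 - alpha) * self.rtt + alpha * sample
        self.last_ack = time.monotonic()


chars = PathCharacteristics()
chars.on_ack(send_time=time.monotonic() - 0.002)   # simulate a ~2 ms RTT sample
print(round(chars.rtt * 1000, 3), "ms")
```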
In some embodiments, a WQE scheduler 311 determines a network path based on a weighted round-robin evaluation of network path characteristics 312.
The lists may be updated and re-sorted in response to each received acknowledgement packet. In some embodiments, a “keep alive” ping is periodically sent over each network path and the RTT for each path is updated based on the response to the ping. In a case that a network path goes down, its RTT would become infinite and cause the network path to jump to the top of sorted list 410.
According to some embodiments of S230, weighted round-robin component 420 receives a query from a WQE scheduler, inputs the lower-most network path of each sorted list to a weighted round-robin function, and returns a network path output by the function to the WQE scheduler. In some embodiments, the network path for a WQE is determined at S230 by randomly selecting a network path from the available network paths.
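A simplified version of this selection could be expressed as follows. The sorting convention (worst path first, best path last) and the weighting scheme are assumptions chosen only to match the description of sorted lists feeding a weighted round-robin function:

```python
# Hypothetical weighted round-robin selection over the best ("lower-most")
# entry of each sorted characteristic list.

class WeightedRoundRobin:
    def __init__(self, weights):
        self.weights = dict(weights)          # candidate -> static weight
        self.current = {c: 0 for c in weights}

    def pick(self, candidates):
        # Smooth weighted round-robin over the offered candidates.
        for c in candidates:
            self.current[c] += self.weights[c]
        chosen = max(candidates, key=lambda c: self.current[c])
        self.current[chosen] -= sum(self.weights[c] for c in candidates)
        return chosen


# Lists sorted worst-first, so the last element is the current best per metric.
sorted_by_rtt = ["path_d", "path_c", "path_b", "path_a"]
sorted_by_queue_depth = ["path_a", "path_d", "path_c", "path_b"]

wrr = WeightedRoundRobin({"path_a": 3, "path_b": 1, "path_c": 1, "path_d": 1})
candidates = [sorted_by_rtt[-1], sorted_by_queue_depth[-1]]   # ["path_a", "path_b"]
print([wrr.pick(candidates) for _ in range(4)])               # path_a appears more often
```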
Use of network path characteristics at S230 may facilitate removal of a network path from usage, for example to apply updates or conduct other maintenance. Specifically, network path monitor 317 may ensure that the network path to be removed is associated with network path characteristics 312 indicating poor performance, regardless of actual performance-related information received from the network path to be removed. Consequently, the network path will not be determined for scheduling a WQE at S230 and can be subjected to maintenance without jeopardizing message delivery.
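In keeping with the drain behavior described above, such an override might simply pin the path's reported metrics to worst-case values; the specific metrics and values below are assumptions made for illustration:

```python
import math

# Hypothetical characteristics table: path id -> metrics used by the scheduler.
characteristics = {
    "path_a": {"rtt": 0.002, "queue_depth": 3, "online": True},
    "path_b": {"rtt": 0.003, "queue_depth": 1, "online": True},
}

def drain_for_maintenance(path_id):
    # Report worst-case metrics regardless of measurements, so the
    # scheduler never selects this path while maintenance is in progress.
    characteristics[path_id].update(rtt=math.inf, queue_depth=math.inf, online=False)

drain_for_maintenance("path_b")
print(characteristics["path_b"])
```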
Determination of a network path at S230 may be based on any suitable factors. In some embodiments, each network path is limited to a certain number (e.g., the Bandwidth Delay Product (BDP) of the network) of outstanding (i.e., assigned but not yet sent) WQEs. Network paths which include this number of outstanding WQEs are not considered in the determination at S230. Determination of a network path at S230 may also ensure that the number of packets to be transmitted does not exceed a rate limiting policy associated with the AQP.
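The two constraints described above could be applied as an eligibility filter ahead of the path selection. The token-style budget check is an assumption, as the embodiments only state that a rate limiting policy associated with the AQP is respected:

```python
# Filter out paths that already hold their maximum of outstanding WQEs, and
# verify the AQP's rate budget before committing additional packets.

def eligible_paths(paths, bdp_limit):
    """paths: dict of path id -> {'outstanding_wqes': int}."""
    return [p for p, s in paths.items() if s["outstanding_wqes"] < bdp_limit]

def within_rate_limit(packets_to_send, aqp_budget_packets):
    # Hypothetical per-AQP budget replenished elsewhere (e.g., per interval).
    return packets_to_send <= aqp_budget_packets


paths = {"path_a": {"outstanding_wqes": 8}, "path_b": {"outstanding_wqes": 3}}
print(eligible_paths(paths, bdp_limit=8))          # ['path_b']
print(within_rate_limit(4, aqp_budget_packets=10)) # True
```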
After a network path has been determined for a WQE at S230, the WQE is scheduled for transmission via the determined network path at S240. Scheduling the WQE at S240 may comprise adding the WQE to a transmission queue associated with the network path, such as queues 313-316. At S250, it is determined whether any WQEs of the message remain to be scheduled to a network path. If so, flow returns to S230 and continues as described above. If not, process 200 terminates.
Scheduling of a WQE at S240 may be independent of the scheduling of any other WQEs of the message. This independence may result in later-scheduled WQEs being transmitted before earlier-scheduled WQEs. However, each packet of each WQE will be transmitted in order, relative to the other packets of the WQE.
If transmission of a WQE does not complete within a timeout period, or a negative acknowledgement (NACK) is received indicating that a packet is missing, the multipathing component will schedule the WQE or selected packets thereof for retransmission. If N=1 or another small number, the WQE includes few packets and may simply be scheduled for retransmission on the same network path. If N is greater than the BDP of the intervening network, lost packets may be individually identified and scheduled for retransmission (also on the same network path), rather than scheduling retransmission of the entire WQE.
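This retransmission policy might be sketched as follows, keyed on whether individual lost packets can be identified. The structures are hypothetical; the whole-WQE versus per-packet decision mirrors the small-N versus large-N cases above:

```python
# Re-schedule a timed-out or NACKed WQE on the same network path, either as a
# whole or packet-by-packet, depending on how large the WQE is relative to BDP.

def retransmit(wqe, lost_packets, n, bdp, queue):
    """wqe: list of packets; lost_packets: indices reported missing (may be empty
    on a timeout); queue: transmission queue of the same network path."""
    if n > bdp and lost_packets:
        # Large WQE and specific losses known: resend only the missing packets.
        for idx in lost_packets:
            queue.append(("packet", idx, wqe[idx]))
    else:
        # Small WQE (or losses unknown): resend the entire WQE.
        queue.append(("wqe", None, wqe))


queue = []
wqe = [b"p0", b"p1", b"p2", b"p3"]
retransmit(wqe, lost_packets=[1, 3], n=64, bdp=16, queue=queue)
print([(kind, idx) for kind, idx, _ in queue])     # [('packet', 1), ('packet', 3)]
```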
At the receiver node (e.g., computing system 120), every received WQE will be processed without regard to order, while packets within a WQE are marked for completion in order. As described above, the message may be associated with an AQP using a field in the protocol header of the message. A memory buffer (e.g., buffer 125) is allocated on the receiver node for a given message based on control messages. For each received WQE, the start address within the buffer is known based on the WQE number. Within a received WQE, the receiver checks whether packets arrived in order and places the WQE at the appropriate location in the memory buffer based on the WQE start address, WQE number and MTU.
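Because each WQE carries at most N packets of at most MTU bytes, one plausible computation of a WQE's buffer offset, assuming fixed-size packets except possibly the last, is simply the WQE number multiplied by N and the MTU, for example:

```python
# Place a received WQE into the message buffer using only its WQE number,
# N and the MTU; packets within the WQE are written in order.

def place_wqe(buffer: bytearray, wqe_number: int, packets, n: int, mtu: int):
    start = wqe_number * n * mtu                  # WQE start address within buffer
    offset = start
    for pkt in packets:                           # packets arrive in order per WQE
        buffer[offset:offset + len(pkt)] = pkt
        offset += len(pkt)


buf = bytearray(12)
place_wqe(buf, wqe_number=1, packets=[b"cc", b"dd"], n=2, mtu=2)  # bytes 4..7
place_wqe(buf, wqe_number=0, packets=[b"aa", b"bb"], n=2, mtu=2)  # bytes 0..3
print(bytes(buf))   # b'aabbccdd\x00\x00\x00\x00' despite out-of-order WQE arrival
```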
In some implementations, the receiver node is not configured to place received WQEs directly into appropriate locations of a memory buffer. Such receivers may simply reorder received WQEs in sequence and store the reordered WQEs. In either case, the receiver node may determine whether any WQEs are missing and respond to the sender node accordingly.
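For a receiver without direct placement, a small reorder stage could buffer WQEs until they can be delivered in sequence and report gaps so that retransmission can be requested, roughly as follows (structure is an assumption):

```python
# Hypothetical reorder stage: deliver WQEs strictly in sequence and report
# which WQE numbers are still missing so retransmission can be requested.

class ReorderStage:
    def __init__(self):
        self.expected = 0
        self.pending = {}                 # wqe_number -> payload held out of order

    def receive(self, wqe_number, payload):
        delivered = []
        self.pending[wqe_number] = payload
        while self.expected in self.pending:
            delivered.append(self.pending.pop(self.expected))
            self.expected += 1
        return delivered                  # contiguous, in-order WQEs ready to store

    def missing(self):
        if not self.pending:
            return []
        return [n for n in range(self.expected, max(self.pending) + 1)
                if n not in self.pending]


stage = ReorderStage()
print(stage.receive(1, b"B"))   # [] -> WQE 0 not yet seen, WQE 1 held back
print(stage.missing())          # [0]
print(stage.receive(0, b"A"))   # [b'A', b'B'] delivered in order
```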
Embodiments are not limited to four NICs or to equal numbers of NICs on either side of an AQP. Sixteen network paths exist between computing system 510 and computing system 520. Specifically, network paths are defined by NIC 516/NIC 526, NIC 516/NIC 527, NIC 516/NIC 528, NIC 516/NIC 529, NIC 517/NIC 526, NIC 517/NIC 527, NIC 517/NIC 528, NIC 517/NIC 529, NIC 518/NIC 526, NIC 518/NIC 527, NIC 518/NIC 528, NIC 518/NIC 529, NIC 519/NIC 526, NIC 519/NIC 527, NIC 519/NIC 528, and NIC 519/NIC 529. Any number of network paths may exist between two network nodes according to some embodiments.
Either of multipathing component 514 and multipathing component 521 may operate as described above with respect to process 200 to transmit a message associated with an AQP over any combination of the sixteen network paths. Received WQEs are placed directly into receive buffers 515 and 525 as described above.
Multipathing component 614 of computing system 610 may operate as described above with respect to process 200. Computing system 620 may transmit messages over NICs 626-629, where each application-to-application connection is tied to one specific NIC. Computing system 620 includes reorder engine 625 for receiving WQEs transmitted by multipathing component 614, reordering the WQEs, and storing the reordered WQEs. Reorder engine 625 may also identify missing WQEs and instruct CPUs 622 to request retransmission thereof.
Any number of servers of datacenter 700 may implement any of the computing systems described herein. For example, a server of datacenter 700 may implement computing system 110 to transmit messages of a given AQP over multiple network paths to another server of datacenter 700. Any server of datacenter 700 may also transmit messages of a given AQP over multiple network paths to another server which is not located within datacenter 700.
The foregoing diagrams represent logical architectures for describing processes according to some embodiments, and actual implementations may include more or different components arranged in other manners. Other topologies may be used in conjunction with other embodiments. Moreover, each component or device described herein may be implemented by any number of devices in communication via any number of other public and/or private networks. Two or more of such computing devices may be located remote from one another and may communicate with one another via any known manner of network(s) and/or a dedicated connection. Each component or device may comprise any number of hardware and/or software elements suitable to provide the functions described herein as well as any other functions.
All processes mentioned herein may be embodied in processor-executable program code read from one or more non-transitory computer-readable media, such as ROM, RAM, a hard disk drive, a solid-state drive, and a flash drive, and then stored in a compressed, uncompiled and/or encrypted format. Such program code may be executed by one or more processors such as, but not limited to, a microprocessor, a microcontroller, and a processor core. In some embodiments, hard-wired circuitry may be used in place of, or in combination with, program code for implementation of processes according to some embodiments. Embodiments are therefore not limited to any specific combination of hardware and software.
Embodiments described herein are solely for the purpose of illustration. Those skilled in the art will recognize that other embodiments may be practiced with modifications and alterations to that which is described above.