The present disclosure relates generally to the prioritization of packets in flows.
In packet switching networks, the terms “traffic flow,” “packet flow,” “network flow,” and “flow” may be used interchangeably. The term “flow” may refer to a sequence of packets sent from a source device to a destination, which may be a destination device, a multicast group, or a broadcast domain. RFC 3697, “IPv6 Flow Label Specification,” J. Rajahalme, A. Conta, B. Carpenter, and S. Deering, March 2004, defines a flow as “a sequence of packets sent from a particular source to a particular unicast, anycast, or multicast destination that the source desires to label as a flow. A flow could consist of all packets in a specific transport connection or a media stream. However, a flow is not necessarily 1:1 mapped to a transport connection.”
RFC 3917, “Requirements for IP Flow Information Export (IPFIX),” J. Quittek, T. Zseby, B. Claise, and S. Zander, October 2004, provides that “[a]ll packets belonging to a particular flow have a set of common properties.” Often, such properties are defined by the value of one or more packet header fields, such as a source IP address field, destination IP address field, transport header field (e.g., source port number and/or destination port number), or application header field (e.g., Real-time Transport Protocol (RTP) header fields). The properties of a flow may also include one or more characteristics of the packet itself (e.g., number of MPLS labels) and/or values of one or more fields derived from packet treatment (e.g., next hop IP address, output interface, etc.). A packet is identified as belonging to a flow if it completely satisfies all the defined properties of the flow.
Today, data center fabrics handle a mix of short flows and long flows. Short flows are typically latency-sensitive, while long flows are typically bandwidth-intensive. A key challenge in today's data center fabrics is that congestion caused by long flows severely degrades the performance for short flows.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the disclosed embodiments. It will be apparent, however, to one skilled in the art, that the disclosed embodiments may be practiced without some or all of these specific details. In other instances, well-known process steps have not been described in detail in order to simplify the description.
In one embodiment, a next set of packets in a first flow may be identified. A counter may be incremented, where the counter indicates a first number of initial sets of packets in the first flow that have been identified. The identified next set of packets may be prioritized such that the first number of initial sets of packets in the first flow are prioritized and a sequential order of all packets in the first flow is maintained. The identifying, incrementing, and prioritizing may be repeated until no further sets of packets in the first flow remain to be identified or the first number of initial sets of packets is equal to a first predefined number.
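For purposes of illustration only, the following sketch expresses this loop in Python; the names (identify_next_set, prioritize, PREDEFINED_NUMBER) are hypothetical assumptions and not part of any particular embodiment.

```python
# Minimal illustrative sketch of the identify/increment/prioritize loop
# described above. All names are hypothetical.

PREDEFINED_NUMBER = 10  # the "first predefined number" of initial sets

def prioritize_initial_sets(flow, identify_next_set, prioritize):
    """Prioritize up to PREDEFINED_NUMBER initial sets of packets in a flow."""
    count = 0  # number of initial sets of packets identified so far
    while count < PREDEFINED_NUMBER:
        next_set = identify_next_set(flow)  # e.g., next packet or next flowlet
        if next_set is None:                # no further sets remain
            break
        count += 1                          # increment the counter
        prioritize(next_set)                # e.g., add to a high priority queue
    return count
```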
The performance degradation experienced by short flows can be illustrated with reference to Transmission Control Protocol (TCP) flows. By design, long TCP flows consume all the available buffer space at their bottleneck link. As a result, a short flow that shares a bottleneck link with a long flow can experience a significant queuing delay as its packets wait behind the packets of the long flow. Even worse, there may not be buffer space left for the short flow's packets, causing them to be dropped at the bottleneck. Packet drops typically cause a short flow to take a TCP timeout which, by default, increases its completion time by approximately 200-300 ms.
Data centers are used as an infrastructure for many online services such as online stores, social networking, or web search. Many short flows in data centers can ideally complete within 1 ms. As a result, a TCP timeout increasing the completion time of a short flow by 200-300 ms can be extremely costly for these flows, resulting in an increase in completion time of 1-2 orders of magnitude. Therefore, for users accessing online services provided by data centers, even a small fraction of a second may be noticeable and negatively impact the user experience.
There is an even more significant delay if a TCP Synchronize (SYN) or SYN Acknowledgement (SYN ACK) packet is dropped during the initiation of a TCP connection. The retransmission timeout for a SYN or SYN ACK is approximately 3 seconds in current TCP implementations. Hence, dropping a SYN or SYN ACK is so costly that it can severely impact even long flows, as well as short flows.
There have been a number of proposals for enhancing congestion control in data centers to mitigate the performance penalties of long flows on short flows. For example, one class of data center protocols divides the link bandwidth equally among flows. However, the result has been far from optimal in terms of minimizing the average flow completion time (AFCT).
It is advantageous to minimize the AFCT of two or more flows being processed. For example, a network device such as a router or switch may receive packets associated with two flows. More particularly, a first flow may be a long flow that takes ten seconds to complete, while a second flow may be a short flow that takes one second to complete assuming it does not encounter a bottleneck. If the network device were to process the long flow first, the completion times would be 10 seconds for the long flow and 11 seconds for the short flow, resulting in an AFCT of 10.5 seconds. However, if the network device were to process the short flow first, the completion times would be 1 second for the short flow and 11 seconds for the long flow, resulting in an AFCT of 6 seconds.
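The arithmetic of this example may be verified directly, as in the following illustrative check:

```python
# Illustrative check of the AFCT example: a 10-second long flow and a
# 1-second short flow processed one after the other by a single device.
long_flow, short_flow = 10.0, 1.0

# Long flow first: long completes at 10 s, short at 11 s.
afct_long_first = (long_flow + (long_flow + short_flow)) / 2    # 10.5 s

# Short flow first: short completes at 1 s, long at 11 s.
afct_short_first = (short_flow + (short_flow + long_flow)) / 2  # 6.0 s

print(afct_long_first, afct_short_first)  # 10.5 6.0
```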
As indicated by the example set forth above, if the short flows were to be processed before long flows, this would improve the completion time for the short flows drastically while only minimally impacting the completion time for long flows. As a result, one way to improve the AFCT is to prioritize short flows over long flows. However, it is often difficult or impossible to determine whether a flow will be long or short. For example, at the beginning of a voice over Internet Protocol (IP) call, it would be impossible to determine whether the call will be long or short. Another problem with shortest job first scheduling is the potential for long flows to suffer from starvation if there is always a shorter flow to be processed. Therefore, the issue of minimizing the AFCT is a difficult problem to solve.
In accordance with various embodiments, a pre-defined number of initial sets of packets of a flow are prioritized. As a result, the length of the flow is irrelevant and need not be determined to apply the disclosed embodiments. Furthermore, through the application of the disclosed embodiments, it is possible to prioritize the beginning of long flows, as well as short flows.
In this example, the server(s) 102 may receive messages such as request messages via a network 104 from one or more computers 106, 108, 110. The server(s) 102 may have access to one or more data stores 112, which may include one or more memories. The server(s) may respond to the requests by accessing the data stores 112 as appropriate, and sending response messages via the network 104 to the corresponding computers 106, 108, 110.
Messages that are sent via the network 104 are composed of packets. Each packet may be associated with a particular flow, which is uniquely defined by a set of common properties, which may include the value of one or more packet header fields. For example, each flow may be uniquely identified by a source Internet Protocol (IP) address, a destination IP address, a source port, and/or a destination port. These flows may be propagated through the network 104 via network devices such as routers and/or switches.
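By way of non-limiting illustration, such a flow key might be derived from header fields as in the following sketch, in which the dict-based packet representation and field names are assumptions:

```python
# Illustrative sketch: derive a flow key from packet header fields.
# The dict-based packet representation is an assumption for illustration.

def flow_key(packet):
    """Return the tuple of common properties identifying the packet's flow."""
    return (packet["src_ip"], packet["dst_ip"],
            packet["src_port"], packet["dst_port"])

pkt = {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2",
       "src_port": 51512, "dst_port": 443}
print(flow_key(pkt))  # ('10.0.0.1', '10.0.0.2', 51512, 443)
```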
Within the network 104, a network device such as a router or switch may receive a first flow of packets from a first source device addressed to a first destination. For example, the first flow of packets may be sent from the computer 106 to the server(s) 102. However, the network device may also receive a second flow of packets from a second source device addressed to the first destination. For example, the second flow of packets may be sent from the computer 108 to the server(s) 102. Unfortunately, the network device may not have the resources to simultaneously process (e.g., forward) packets in the first flow and packets in the second flow. As a result, the network device may apply a prioritization mechanism to each flow in order to optimize the user experience, as well as ensure some level of fairness within the system. The prioritization mechanism may be performed entirely at the network device and may be advantageously performed without explicit congestion signaling. Various embodiments for prioritizing packets will be described in further detail below.
In accordance with various embodiments, packets may be prioritized on a per-packet basis or as groups of packets, which may be referred to as “flowlets.” In some implementations, flowlets may correspond to bursts of packets. The existence of a time delay between the receipt of one packet and the receipt of a subsequent packet that is greater than or equal to a pre-defined time, t, may be used to delineate one burst of packets from another burst of packets. The time delay may be ascertained by comparing a time stamp from one packet with the time stamp from a subsequently received packet. In this example, four different bursts of packets are illustrated, where the four different bursts of packets are separated by a time delay that is greater than or equal to the pre-defined time, t. As shown in
A network device such as a switch or router may process at least a portion of packets of a flow as they are obtained. More particularly, at least a portion of the packets of the flow may be processed dynamically in real-time as they are received. The network device may also periodically obtain at least a portion of the packets of the flow from a queue so that they may be processed. An example network device will be described in further detail below with reference to
In some embodiments, a network device or group of devices may divide a flow into two or more sets of packets having corresponding priorities (e.g., for network transmission). Each set of packets may include only those packets in a flowlet (e.g., burst of packets). Each flowlet may include one or more packets.
In other embodiments, each set of packets may be a single packet. Thus, a pre-defined number of initial packets in a flow may be prioritized.
As will be described in further detail below, a pre-defined number of initial sets of packets from each flow may be prioritized. If a flow has fewer than the pre-defined number of sets of packets, then all sets of packets for that flow may be prioritized. However, if the flow has more than the pre-defined number of sets of packets, then only the first pre-defined number of sets of packets may be prioritized. A method of prioritizing a pre-defined number of sets of packets from flows will be described below with reference to
For example, a count indicating the number of sets of packets in the first flow may be initialized to zero and incremented to count the number of sets of packets in the flow. Those packets corresponding to a count that is less than or equal to a first predefined number may be prioritized over those packets having a count that is greater than the first predefined number, as will be described in further detail below. Thus, where the total number of sets of packets in the first flow is greater than or equal to the first pre-defined number, the first pre-defined number of initial sets of packets will be treated with a high priority. Stated another way, the first number of sets of packets that are prioritized will be equal to the first pre-defined number. However, where the total number of sets of packets in the first flow is less than the first pre-defined number, all sets of packets in the first flow will be treated with a high priority. In other words, the first number of sets of packets that are prioritized will be equal to the number of sets of packets in the first flow.
Various implementations disclosed herein employ a technique for distinguishing one set from the next in a packet flow. Each set so identified is counted and when a threshold number of initial sets in a flow is reached, the process may apply a lower priority to later received packets.
In some embodiments, each set of packets in the first number of initial sets of packets in the first flow may be a burst of packets. A set of packets may be identified by the presence of at least a pre-defined time delay that separates the time of receipt of a set of packets (e.g., the time of receipt of the last packet in the set of packets) from the time of receipt of a subsequent set of packets (e.g., the time of receipt of the first packet in the subsequent set of packets). In other words, the presence of less than the pre-defined time delay between two packets results in the two packets being in the same set of packets. As a result, the identification of a particular packet as being within a particular set of packets may be determined according to the time that the packet has been received. Accordingly, the first number of initial sets of packets in the first flow may be separated from one another by at least a predefined period of time.
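For purposes of illustration only, this delineation rule might be expressed as in the following sketch, in which the timestamp representation (seconds as floats) and the threshold value are assumptions:

```python
# Illustrative sketch: a packet begins a new set (burst) when the gap since
# the flow's previously received packet is at least t seconds.

FLOWLET_GAP_T = 0.0005  # the pre-defined time t; the value is an assumption

def starts_new_flowlet(prev_timestamp, timestamp, t=FLOWLET_GAP_T):
    """True if the inter-packet time delay delineates a new set of packets."""
    return prev_timestamp is None or (timestamp - prev_timestamp) >= t

print(starts_new_flowlet(None, 1.000))    # True: first packet of the flow
print(starts_new_flowlet(1.000, 1.0001))  # False: same burst
print(starts_new_flowlet(1.000, 1.010))   # True: gap >= t, new burst
```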
In alternative embodiments, the process does not attempt to group packets into sets. Rather it simply counts the packets in a new flow. When a threshold number of packets in the flow is received, the process applies the lower priority to later received packets. In this alternate approach, each packet is in some sense a “set,” even though no attempt is made to group successive packets. Accordingly, each set of packets in the first number of initial sets of packets in the first flow may contain only a single packet.
The first number of initial sets of packets in the first flow may be prioritized at 304 without prioritizing the remaining packets in the first flow. Such prioritization may include dynamically prioritizing each individual set of packets of the first number of initial sets immediately in real-time as the packets are received or obtained (e.g., from a queue). The processing of individual sets of packets will be described in further detail below with reference to
The prioritization of packets in the first number of initial sets of packets in the first flow may be performed such that a sequential order of all packets in the first flow is maintained. For example, the packets in the first flow may be delivered to their destination in the sequential order. More particularly, the prioritization of the first number of initial sets of packets in the first flow may be performed such that the first number of initial sets of packets are processed (e.g., forwarded) prior to packets associated with one or more other flow(s). This may be accomplished by assigning a high priority to the first number of initial sets of packets (e.g., those packets having a count that is less than or equal to the first predefined number). For example, the first number of initial sets of packets may be added to a high priority queue. Any remaining sets of packets in the first flow may be processed as usual, or more slowly than normal. In some embodiments, any remaining sets of packets in the first flow may be assigned a lower priority. For example, the remaining sets of packets may be added to a lower priority queue, which may be a medium or low priority queue. While the assignment of a priority to “sets” of packets is described herein, it follows that the packets within the corresponding sets are assigned the same priority.
The prioritization of the first number of initial sets of packets in the first flow may be performed independent of content of a payload of packets in the first number of initial sets of packets in the first flow. Furthermore, such prioritization of the first number of initial sets of packets may be performed without marking packets in the first number of initial sets of packets. For example, packets need not be marked to indicate a corresponding prioritization or length of the flow.
Although not shown in
A counter indicating a first number of initial sets of packets in the first flow that have been identified may be incremented at 404. The identified next set of packets may be prioritized at 406 such that the first number of initial sets of packets in the first flow are prioritized and a sequential order of all packets in the first flow is maintained. For example, the packets in the first flow may be delivered to the destination in the sequential order. More particularly, this may be accomplished by assigning a high priority to the identified set of packets. For example, the identified set of packets may be added to a high priority queue. The identifying, incrementing, and prioritizing may be repeated as shown until the first number of initial sets of packets is equal to a first predefined number 408 or no further sets of packets in the first flow remain to be identified 410. Any remaining packets in the first flow may be assigned a lower priority (e.g., by adding the remaining packets to a medium or low priority queue), as will be described in further detail below with reference to
As shown in
The set counter may be compared at 426 to a predefined number, which represents the maximum number of high priority sets in a given flow. The network device may determine whether the set counter meets or exceeds the predefined number at 428. If the set counter does not meet or exceed the predefined number, the process may continue for further packets in the flow. More particularly, if the flow includes further packets at 430, a new set of packets may be initiated at 432 and the set counter may be incremented. If the flow does not include further packets, the process may end at 434. If the set counter meets or exceeds the predefined number, the packet priority may be set to low and any remaining packets in the flow may be treated as having a low priority at 436. In this instance, any further sets of packets in the flow need not be delineated or counted.
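For purposes of illustration only, the following sketch combines these flowchart steps into a single hypothetical per-packet handler; the queue interfaces, per-flow state layout, and constants are assumptions:

```python
# Illustrative per-packet sketch of the flowchart logic described above.
# Per-flow state: the set counter, the last-seen timestamp, and a flag
# marking that the flow's high priority budget is exhausted.

PREDEFINED_NUMBER = 10   # maximum number of high priority sets per flow
FLOWLET_GAP_T = 0.0005   # pre-defined time delay delineating sets (assumed)

flow_state = {}  # flow key -> {"sets": int, "last_ts": float, "low": bool}

def handle_packet(key, timestamp, enqueue_high, enqueue_low):
    st = flow_state.setdefault(key, {"sets": 0, "last_ts": None, "low": False})
    if st["low"]:
        # Counter previously met the predefined number: remaining packets are
        # low priority and further sets need not be delineated or counted.
        enqueue_low(key)
        return
    if st["last_ts"] is None or (timestamp - st["last_ts"]) >= FLOWLET_GAP_T:
        st["sets"] += 1          # a new set of packets begins; increment counter
    st["last_ts"] = timestamp
    if st["sets"] <= PREDEFINED_NUMBER:
        enqueue_high(key)        # packet belongs to an initial, prioritized set
    else:
        st["low"] = True         # predefined number exceeded; demote the flow
        enqueue_low(key)
```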
As described above, sets of packets may be delineated by a time delay that is greater than or equal to a pre-defined time delay. In this manner, flowlets of a flow may be identified and prioritized, as disclosed herein.
A very long-lived flow, such as that used by the Network File System (NFS) protocol, may carry multiple messages. In some embodiments, dynamic flowlet prioritization may be performed for each of the messages such that a particular number of initial sets of packets in each of the messages is prioritized. Each of the messages in the long-lived flow may be identified by detecting much larger gaps between packets.
In some embodiments, a second pre-defined time delay that is greater than the pre-defined time delay may be defined. For example, the second pre-defined time delay may be on the order of tens to hundreds of milliseconds. Where the time delay between two sets of packets is greater than or equal to the second pre-defined time delay, the second set of packets may be treated as the first set of packets in a new flow. Through the application of the second pre-defined time delay, it is possible to treat each of the messages in a very long flow as a separate flow.
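Under the assumptions of the preceding sketches, supporting this second threshold requires only one additional comparison, as illustrated below; the threshold value is again an assumption:

```python
# Illustrative extension: a gap of at least SECOND_GAP resets the per-flow
# state, so each message within a very long-lived flow is treated as the
# beginning of a new flow.

SECOND_GAP = 0.05  # second pre-defined time delay; the value is an assumption

def maybe_reset_flow(st, timestamp):
    """Reset the set counter if the inter-message gap is large enough."""
    if st["last_ts"] is not None and (timestamp - st["last_ts"]) >= SECOND_GAP:
        st["sets"], st["low"] = 0, False  # start over as a new flow
```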
The methods described herein may be repeated for further flows. The disclosed embodiments may be applied to flows consistently on a per-set basis. For example, the disclosed embodiments may be applied either on a per-packet basis or per-flowlet (e.g., per-burst) basis. Similarly, the pre-defined time delay used to delineate flowlets (e.g., bursts of packets) may be applied to flows consistently.
In some embodiments, the pre-defined number specifies a number of initial sets of packets to be prioritized for a given flow regardless of the characteristics of the flow, such as the length of the flow or the type of traffic of the flow. Thus, the pre-defined number may be applied to each flow regardless of a type of traffic being transmitted in the corresponding flow.
In some other embodiments, the pre-defined number of sets of packets to be prioritized may depend, at least in part, upon the characteristic(s) of the flow, such as the type of traffic being carried in the flow. Example types of traffic include control traffic, data traffic, voice over IP, video, gaming, etc. Other example types of traffic include high priority traffic, low priority traffic, and best effort. More particularly, the pre-defined number may be one of two or more pre-defined numbers, where each of the two or more pre-defined numbers is associated with a corresponding set of one or more flow characteristics such as particular traffic types. For example, the pre-defined number may be 100 for higher priority traffic such as control traffic, while the pre-defined number may be 10 for lower priority traffic such as gaming. As a result, the pre-defined number may be associated with one or more particular traffic type(s) (e.g., voice or video traffic), enabling the pre-defined number of initial sets of packets of a flow carrying one of the particular traffic type(s) to be prioritized. Accordingly, the pre-defined number of sets of packets to be prioritized for flows may be identical, or may vary with the traffic types being transmitted, as described herein.
A network device operating as described herein may be statically or dynamically configured with a single pre-defined number indicating the number of sets of packets to be prioritized for flows. Alternatively, the network device may be statically or dynamically configured with two or more pre-defined numbers such that each of the two or more pre-defined numbers is associated with a corresponding set of one or more flow characteristics (e.g., traffic types).
In some implementations, two or more types of traffic may correspond to two or more queues of packets such that each of the queues of packets is associated with at least one of the types of traffic. The prioritization mechanisms described herein may be applied to all traffic flows or types of traffic (or queues), or to a subset of traffic flows or types of traffic (or queues). For example, it may be undesirable to apply the prioritization mechanisms described herein to a particular traffic type or queue to which an absolute priority has been assigned, since the order of the corresponding packets is guaranteed.
Packet Forwarding
The prioritization of sets of packets for a given flow may include servicing or otherwise processing the sets of packets according to the priorities that have been assigned. Such processing may include processing data transmitted in the prioritized sets of packets. Alternatively, such processing may include forwarding each of the prioritized sets of packets. This processing may be performed by a network device such as that described herein.
The sets of packets in a given flow that have been prioritized may have absolute priority over remaining sets of packets in the flow. For example, a high priority queue including the prioritized packets or information associated therewith may have absolute priority over a lower priority queue in order to guarantee that the packets in the flow are not re-ordered. However, by giving the high priority queue absolute priority, there is a possibility that long flows will suffer starvation.
Alternatively, rather than having absolute priority over the lower priority queue, the high priority queue may be serviced more frequently than the lower priority queue. For example, the high priority queue may be serviced 10 times more frequently than the lower priority queue, which may be represented as a 10:1 ratio. In such an implementation, the sets of packets in the lower priority queue to be serviced may be determined based, at least in part, upon a queuing latency associated with the high priority queue. For example, the queuing latency may be a maximum or average queuing latency. Where multiple high priority queues are implemented, sets of packets in the lower priority queue to be serviced may be determined based, at least in part, upon a total queuing latency equal to the sum of the queuing latencies experienced in all of the high priority queues.
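For purposes of illustration only, such ratio-based (rather than absolute) servicing might be realized as in the following sketch, in which the 10:1 ratio and queue interface are assumptions:

```python
from collections import deque

# Illustrative sketch: service the high priority queue ten times for each
# service of the lower priority queue (a 10:1 ratio), rather than giving
# the high priority queue absolute priority.

HIGH_TO_LOW_RATIO = 10

def service_round(high_q: deque, low_q: deque, forward):
    """Perform one service round across the two queues."""
    for _ in range(HIGH_TO_LOW_RATIO):
        if high_q:
            forward(high_q.popleft())
    if low_q:
        forward(low_q.popleft())

high_q, low_q = deque(["h1", "h2"]), deque(["l1"])
service_round(high_q, low_q, print)  # prints h1, h2, then l1
```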
Latency is a measure of time delay experienced in a system. Queue latency may refer to a time between adding a set of packets to a queue and servicing (e.g., forwarding) the set of packets.
In some embodiments, a time gap between the receipt of two sequential sets of packets in a first number of initial sets of packets in a flow may be ascertained. More particularly, the time gap may be ascertained by comparing a time stamp from a last packet of the first set of packets with a time stamp of a first packet of the second set of packets. Where the second set of the first number of initial sets of packets in the first flow is in a lower priority queue, the second set of packets may then be serviced (e.g., forwarded) according to the time gap. More particularly, it may be determined whether the time gap is greater than a total queuing latency associated with the high priority queue(s). The second set of the first number of initial sets of packets may be forwarded according to whether the time gap is greater than the total queuing latency associated with the high priority queue(s). If the gap between sets of packets is greater than the total queuing latency experienced by packets in the high priority queue(s), then packets in the set after that gap cannot be delivered to their destination before the last packet in the previous set that was placed in the high priority queue. This guarantees that all the packets of the flow are delivered to their destination in order.
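For purposes of illustration only, this ordering test might be expressed as follows; the latency accounting is an assumption:

```python
# Illustrative sketch: a set waiting in a lower priority queue may be
# forwarded once the gap that preceded it exceeds the total queuing latency
# of the high priority queue(s), preserving the flow's packet order.

def may_forward_low_priority_set(gap_between_sets, high_queue_latencies):
    """gap_between_sets: seconds between the last packet of the prior set and
    the first packet of this set; high_queue_latencies: one value per queue."""
    return gap_between_sets > sum(high_queue_latencies)

print(may_forward_low_priority_set(0.010, [0.002, 0.003]))  # True
print(may_forward_low_priority_set(0.004, [0.002, 0.003]))  # False
```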
The term “path” may refer to a transmission channel between two nodes of a network that a packet follows. More particularly, the path may include one or more intermediary network devices such as routers or switches that forward packets along the path. There may be any number of intermediary network devices in a particular path.
In order to minimize the likelihood of packet reordering, all packets within a particular flow may be forwarded by the network device along a single path. For example, the network device may apply a hash function to information defining the flow (e.g., source address and destination address) to pick a particular path. In other implementations, sets of packets in a flow may be transmitted via two or more paths. For example, the sets of packets in a flow may be transmitted via a low latency path and a high latency path.
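By way of non-limiting illustration, such hash-based path selection might be expressed as in the following sketch; the use of a CRC over the flow-defining fields is an assumption, as actual devices may use other hash functions:

```python
import zlib

# Illustrative sketch: pin all packets of a flow to one path by hashing the
# fields that define the flow and reducing modulo the number of paths.
# The CRC32 hash is an assumption; devices may use other hash functions.

def pick_path(src_ip, dst_ip, src_port, dst_port, num_paths):
    key = f"{src_ip},{dst_ip},{src_port},{dst_port}".encode()
    return zlib.crc32(key) % num_paths  # stable path index for this flow

print(pick_path("10.0.0.1", "10.0.0.2", 51512, 443, num_paths=3))
```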
Generally, the techniques for performing the disclosed embodiments may be implemented by a device such as a network device. In some embodiments, the network device is designed to handle network traffic. Such network devices typically have multiple network interfaces. Specific examples of such network devices include routers and switches.
The disclosed embodiments may be implemented in one or more network devices within a network. A few example network architectures will be described in further detail below.
In order to meet the demands of a worldwide user base, the modern datacenter may be composed of hundreds, thousands, or even tens of thousands of data servers. However, a large number of servers within a datacenter places a corresponding high demand on the datacenter's networking infrastructure. Network traffic taxing this infrastructure may represent communications between servers within the datacenter itself, or it may represent requests for information or services originating outside the datacenter, such as from client computers located throughout the worldwide internet (hereinafter just “internet”). With regard to the latter, the total number of servers in a datacenter is typically many times the total number of connections to the internet, and so the sharing of a limited number of internet connections between many servers is typically an important consideration.
“Access-Aggregation-Core” Network Architecture
Datacenter network design may follow a variety of topological paradigms—a given topology just referring to the system of networking lines/links which carry network traffic (i.e., data) and the networking switches, which control the flow of traffic over the lines/links in the network. One of the most common topological paradigms in use today is the aptly-named “access-aggregation-core” architecture. As the “core” part of the name suggests, such an architecture follows a hierarchical paradigm, wherein information traveling between hypothetical points A and B first travels up the hierarchy away from point A and then back down the hierarchy towards point B.
Shared usage of links and network devices (such as just described) leads to bottlenecks in a network exhibiting a tree structure architecture like the access-aggregation-core (AAC) network shown in
Though the blocking problem is an inevitable consequence of the tree-structure paradigm, various solutions have been developed within this paradigm to lessen the impact of the problem. One technique is to build redundancy into the network by adding additional links between high traffic nodes in the network. In reference to
“Leaf-Spine” Network Architecture
Another way of addressing the ubiquitous “blocking” problem manifested in the modern datacenter's networking infrastructure is to design a new network around a topological paradigm where blocking does not present as much of an inherent problem. One such topology is often referred to as a “multi-rooted tree” topology (as opposed to a “tree”), which can be said to embody a full bi-partite graph if each spine network device is connected to each leaf network device and vice versa. Networks based on this topology are oftentimes referred to as “Clos Networks,” “flat networks,” “multi-rooted networks,” or just as “multi-rooted trees.” In the disclosure that follows, a “leaf-spine” network architecture designed around the concept of a “multi-rooted tree” topology will be described. While it is true that real-world networks are unlikely to completely eliminate the “blocking” problem, the described “leaf-spine” network architecture, as well as others based on “multi-rooted tree” topologies, are designed so that blocking does not occur to the same extent as in traditional network architectures.
Roughly speaking, leaf-spine networks lessen the blocking problem experienced by traditional networks by being less hierarchical and, moreover, by including considerable active path redundancy. In analogy to microprocessor design where increased performance is realized through multi-core or multi-processor parallelization rather than simply by increasing processor clock speed, a leaf-spine network realizes higher performance, at least to a certain extent, by building the network “out” instead of building it “up” in a hierarchical fashion. Thus, a leaf-spine network in its basic form consists of two tiers: a spine tier and a leaf tier. Network devices within the leaf tier—i.e., “leaf network devices”—provide connections to all the end devices, and network devices within the spine tier—i.e., “spine network devices”—provide connections among the leaf network devices. Note that in a prototypical leaf-spine network, leaf network devices do not directly communicate with each other, and the same is true of spine network devices. Moreover, in contrast to an AAC network, a leaf-spine network in its basic form has no third core tier connecting the network devices within the second tier to a much smaller number of core network device(s), typically configured in a redundant fashion, which then connect to the outside internet. Instead, the third tier core is absent and connection to the internet is provided through one of the leaf network devices, again effectively making the network less hierarchical. Notably, internet connectivity through a leaf network device avoids forming a traffic hotspot on the spine which would tend to bog down traffic not travelling to and from the outside internet.
It should be noted that very large leaf-spine networks may actually be formed from 3 tiers of network devices. As described in more detail below, in these configurations, the third tier may function as a “spine” which connects “leaves” formed from first and second tier network devices, but a 3-tier leaf-spine network still works very differently than a traditional AAC network due to the fact that it maintains the multi-rooted tree topology as well as other features. To present a simple example, the top tier of a 3-tier leaf-spine network still does not directly provide the internet connection(s), that still being provided through a leaf network device, as in a basic 2-tier leaf-spine network.
Though in
To illustrate, consider, analogously to the example described above, communication between end device A and end device K simultaneous with communication between end devices I and J, which led to blocking in AAC network 500. As shown in
As a second example, consider the scenario of simultaneous communication between end devices A and F and between end devices B and G which will clearly also lead to blocking in AAC network 500. In the leaf-spine network 600, although two leaf network devices 625 are shared between the four end devices 610, specifically network devices 1 and 3, there are still three paths of communication between these two devices (one through each of the three spine network devices I, II, and III) and therefore there are three paths collectively available to the two pairs of end devices. Thus, it is seen that this scenario is also non-blocking (unlike
As a third example, consider the scenario of simultaneous communication between three pairs of end devices—between A and F, between B and G, and between C and H. In AAC network 500, this results in each pair of end devices having ⅓ the bandwidth required for full rate communication, but in leaf-spine network 600, once again, since 3 paths are available, each pair has exactly the bandwidth it needs for full rate communication. Thus, in a leaf-spine network having single links of equal bandwidth connecting devices, as long as the number of spine network devices 635 is equal to or greater than the number of end devices 610 which may be connected to any single leaf network device 625, then the network will have enough bandwidth for simultaneous full-rate communication between the end devices connected to the network.
More generally, the extent to which a given network is non-blocking may be characterized by the network's “bisectional bandwidth,” which is determined by dividing a network that has N end devices attached to it into 2 equal sized groups of size N/2, and determining the total bandwidth available for communication between the two groups. If this is done for all possible divisions into groups of size N/2, the minimum bandwidth over all such divisions is the “bisectional bandwidth” of the network. Based on this definition, a network may then be said to have “full bisectional bandwidth” and have the property of being “fully non-blocking” if each leaf network device's total uplink bandwidth to the spine tier 630 (the sum of the bandwidths of all links connecting the leaf network device 625 to any spine network device 635) is at least equal to the maximum downlink bandwidth to end devices associated with any of the leaf network devices on the network.
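For purposes of illustration only, the “fully non-blocking” condition just stated reduces to a per-leaf comparison, as in the following sketch (the link-list representation is an assumption):

```python
# Illustrative sketch of the full bisectional bandwidth test stated above:
# every leaf's total uplink bandwidth to the spine tier must be at least the
# maximum total downlink bandwidth to end devices over all leaf devices.

def fully_non_blocking(leaves):
    """leaves: list of dicts with 'uplinks' and 'downlinks' bandwidth lists."""
    max_downlink = max(sum(leaf["downlinks"]) for leaf in leaves)
    return all(sum(leaf["uplinks"]) >= max_downlink for leaf in leaves)

# Example: five leaves, each with three uplinks and three downlinks of
# equal bandwidth, as in the leaf-spine network described above.
leaves = [{"uplinks": [1, 1, 1], "downlinks": [1, 1, 1]} for _ in range(5)]
print(fully_non_blocking(leaves))  # True
```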
To be precise, when a network is said to be “fully non-blocking” it means that no “admissible” set of simultaneous communications between end devices on the network will block—the admissibility constraint simply meaning that the non-blocking property only applies to sets of communications that do not direct more network traffic at a particular end device than that end device can accept as a consequence of its own bandwidth limitations. Whether a set of communications is “admissible” may therefore be characterized as a consequence of each end device's own bandwidth limitations (assumed here equal to the bandwidth limitation of each end device's link to the network), rather than arising from the topological properties of the network per se. Therefore, subject to the admissibility constraint, in a non-blocking leaf-spine network, all the end devices on the network may simultaneously communicate with each other without blocking, so long as each end device's own bandwidth limitations are not implicated.
The leaf-spine network 600 thus exhibits full bisectional bandwidth because each leaf network device has at least as much bandwidth to the spine tier (i.e., summing bandwidth over all links to spine network devices) as it does bandwidth to the end devices to which it is connected (i.e., summing bandwidth over all links to end devices). To illustrate the non-blocking property of network 600 with respect to admissible sets of communications, consider that if the 12 end devices in
To implement leaf-spine network 600, the leaf tier 620 would typically be formed from 5 ethernet switches of 6 ports or more, and the spine tier 630 from 3 ethernet switches of 5 ports or more. The number of end devices which may be connected is then the number of leaf tier switches j multiplied by ½ the number of ports n on each leaf tier switch, or ½·j·n, which for the network of
However, not every network is required to be non-blocking and, depending on the purpose for which a particular network is built and the network's anticipated loads, a fully non-blocking network may simply not be cost-effective. Nevertheless, leaf-spine networks still provide advantages over traditional networks, and they can be made more cost-effective, when appropriate, by reducing the number of devices used in the spine tier, or by reducing the link bandwidth between individual spine and leaf tier devices, or both. In some cases, the cost-savings associated with using fewer spine-network devices can be achieved without a corresponding reduction in bandwidth between the leaf and spine tiers by using a leaf-to-spine link speed which is greater than the link speed between the leaf tier and the end devices. If the leaf-to-spine link speed is chosen to be high enough, a leaf-spine network may still be made to be fully non-blocking—despite saving costs by using fewer spine network devices.
The extent to which a network having fewer spine tier devices is non-blocking is given by the smallest ratio of leaf-to-spine uplink bandwidth versus leaf-to-end-device downlink bandwidth assessed over all leaf network devices. By adjusting this ratio, an appropriate balance between cost and performance can be dialed in. In
This concept of oversubscription and building cost-effective networks having less than optimal bandwidth between spine and leaf network devices also illustrates the improved failure domain provided by leaf-spine networks versus their traditional counterparts. In a traditional AAC network, if a device in the aggregation tier fails, then every device below it in the network's hierarchy will become inaccessible until the device can be restored to operation (assuming no split etherchannel or equal cost multi-pathing (ECMP)). Furthermore, even if redundancy is built into that particular device, or if it is paired with a redundant device, or if it is a link to the device which has failed and there are redundant links in place, such a failure will still result in a 50% reduction in bandwidth, or a doubling of the oversubscription. In contrast, redundancy is intrinsically built into a leaf-spine network and such redundancy is much more extensive. Thus, as illustrated by the usefulness of purposefully assembling a leaf-spine network with fewer spine network devices than is optimal, absence or failure of a single device in the spine (or link to the spine) will only typically reduce bandwidth by 1/k where k is the total number of spine network devices.
It is also noted once more that in some networks having fewer than the optimal number of spine network devices (e.g., less than the number of end devices connecting to the leaf network devices), the oversubscription rate may still be reduced (or eliminated) by the use of higher bandwidth links between the leaf and spine network devices relative to those used to connect end devices to the leaf network devices.
Example “Leaf-Spine” Network Architecture
The following describes an example implementation of a leaf-spine network architecture. It is to be understood, however, that the specific details presented here are for purposes of illustration only, and are not to be viewed in any manner as limiting the concepts disclosed herein. With this in mind, leaf-spine networks may be implemented as follows:
Leaf network devices may be implemented as ethernet switches having: (i) 48 ports for connecting up to 48 end devices (e.g., servers) at data transmission speeds of 10 GB/s (gigabits per second)—i.e., ‘downlink ports’; and (ii) 12 ports for connecting to up to 12 spine network devices at data transmission speeds of 40 GB/s—i.e., ‘uplink ports.’ Thus, each leaf network device has 480 GB/s total bandwidth available for server connections and an equivalent 480 GB/s total bandwidth available for connections to the spine tier. More generally, leaf network devices may be chosen to have a number of ports in the range of 10 to 50 ports, or 20 to 100 ports, or 50 to 1000 ports, or 100 to 2000 ports, wherein some fraction of the total number of ports are used to connect end devices (‘downlink ports’) and some fraction are used to connect to spine network devices (‘uplink ports’). In some embodiments, the ratio of uplink to downlink ports of a leaf network device may be 1:1, or 1:2, or 1:4, or the aforementioned ratio may be in the range of 1:1 to 1:20, or 1:1 to 1:10, or 1:1 to 1:5, or 1:2 to 1:5. Likewise, the uplink ports for connection to the spine tier may have the same bandwidth as the downlink ports used for end device connection, or they may have different bandwidths, and in some embodiments, higher bandwidths. For instance, in some embodiments, uplink ports may have bandwidths which are in a range of 1 to 100 times, or 1 to 50 times, or 1 to 10 times, or 1 to 5 times, or 2 to 5 times the bandwidth of downlink ports. In the particular embodiment described above, the bandwidth of the uplink ports is 4 times the bandwidth of the downlink ports—e.g., downlink port data transmission speeds are 10 GB/s and uplink port data transmission speeds are 40 GB/s. Depending on the embodiment, the downlink data transmission speed may be selected to be 10 MB/s (megabit/second), 100 MB/s, 1 GB/s (gigabit/second), 10 GB/s, 40 GB/s, 100 GB/s, 1 TB/s (terabit/second), and the corresponding uplink port data transmission speed may be chosen according to the foregoing proportions (of uplink to downlink port transmission speeds). Likewise, depending on the embodiment, the downlink data transmission speed may be selected from within a range of between about 10 MB/s and 1 TB/s, or between about 1 GB/s and 100 GB/s, or between about 10 GB/s and 40 GB/s, and the corresponding uplink port data transmission speed may also be chosen according to the previously described proportions (of uplink to downlink port transmission speeds).
Moreover, depending on the embodiment, leaf network devices may be switches having a fixed number of ports, or they may be modular, wherein the number of ports in a leaf network device may be increased by adding additional modules. The leaf network device just described having 48 10 GB/s downlink ports (for end device connection) and 12 40 GB/s uplink ports (for spine tier connection) may be a fixed-sized switch, and is sometimes referred to as a ‘Top-of-Rack’ switch. Fixed-sized switches having a larger number of ports are also possible, however, typically ranging in size from 48 to 150 ports, or more specifically from 48 to 100 ports, and may or may not have additional uplink ports (for communication to the spine tier) potentially of higher bandwidth than the downlink ports. In modular leaf network devices, the number of ports obviously depends on how many modules are employed. In some embodiments, ports are added via multi-port line cards in similar manner to that described below with regards to modular spine network devices.
Spine network devices may be implemented as ethernet switches having 576 ports for connecting with up to 576 leaf network devices at data transmission speeds of 40 GB/s. More generally, spine network devices may be chosen to have a number of ports for leaf network device connections in the range of 10 to 50 ports, or 20 to 100 ports, or 50 to 1000 ports, or 100 to 2000 ports. In some embodiments, ports may be added to a spine network device in modular fashion. For example, a module for adding ports to a spine network device may contain a number of ports in a range of 10 to 50 ports, or 20 to 100 ports. In this manner, the number of ports in the spine network devices of a growing network may be increased as needed by adding line cards, each providing some number of ports. Thus, for example, a 36-port spine network device could be assembled from a single 36-port line card, a 72-port spine network device from two 36-port line cards, a 108-port spine network device from a trio of 36-port line cards, a 576-port spine network device could be assembled from 16 36-port line cards, and so on.
Links between the spine and leaf tiers may be implemented as 40 GB/s-capable ethernet cable (such as appropriate fiber optic cable) or the like, and server links to the leaf tier may be implemented as 10 GB/s-capable ethernet cable or the like. More generally, links, e.g. cables, for connecting spine network devices to leaf network devices may have bandwidths which are in a range of 1 GB/s to 1000 GB/s, or 10 GB/s to 100 GB/s, or 20 GB/s to 50 GB/s. Likewise, links, e.g. cables, for connecting leaf network devices to end devices may have bandwidths which are in a range of 10 MB/s to 100 GB/s, or 1 GB/s to 50 GB/s, or 5 GB/s to 20 GB/s. In some embodiments, as indicated above, links, e.g. cables, between leaf network devices and spine network devices may have higher bandwidth than links, e.g. cable, between leaf network devices and end devices. For instance, in some embodiments, links, e.g. cables, for connecting leaf network devices to spine network devices may have bandwidths which are in a range of 1 to 100 times, or 1 to 50 times, or 1 to 10 times, or 1 to 5 times, or 2 to 5 times the bandwidth of links, e.g. cables, used to connect leaf network devices to end devices.
In the particular example of each spine network device implemented as a 576-port @ 40 GB/s switch and each leaf network device implemented as a 48-port @ 10 GB/s downlink & 12-port @ 40 GB/s uplink switch, the network can have up to 576 leaf network devices each of which can connect up to 48 servers, and so the leaf-spine network architecture can support up to 576·48=27,648 servers. And, in this particular example, due to the maximum leaf-to-spine transmission rate (of 40 GB/s) being 4 times that of the maximum leaf-to-server transmission rate (of 10 GB/s), such a network having 12 spine network devices is fully non-blocking and has full bisectional bandwidth.
As described above, the network architect can balance cost with oversubscription by adjusting the number of spine network devices. In this example, a setup employing 576-port switches as spine network devices may typically employ 4 spine network devices which, in a network of 576 leaf network devices, corresponds to an oversubscription rate of 3:1. Adding a set of 4 more 576-port spine network devices changes the oversubscription rate to 3:2, and so forth.
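The ratios quoted here follow directly from the stated port counts, as the following illustrative check shows:

```python
# Illustrative check of the oversubscription figures above, for leaf devices
# with 48 x 10 GB/s downlinks and one 40 GB/s uplink per spine device.

DOWNLINK_BW = 48 * 10  # 480 GB/s of server-facing bandwidth per leaf

def oversubscription(num_spines):
    uplink_bw = num_spines * 40  # total leaf-to-spine bandwidth
    return DOWNLINK_BW / uplink_bw

print(oversubscription(4))   # 3.0 -> 3:1 with 4 spine network devices
print(oversubscription(8))   # 1.5 -> 3:2 after adding 4 more
print(oversubscription(12))  # 1.0 -> fully non-blocking with 12
```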
Datacenters typically consist of servers mounted in racks. Thus, in a typical setup, one leaf network device, such as the ‘Top-of-Rack’ device described above, can be placed in each rack providing connectivity for up to 48 rack-mounted servers. The total network then may consist of up to 576 of these racks connected via their leaf-network devices to a spine-tier rack containing between 4 and 12 576-port spine tier devices.
Leaf-Spine Network Architectures Formed from More than Two Tiers of Network Devices
The two-tier leaf-spine network architecture described above having 576-port @ 40 GB/s switches as spine network devices and 48-port @ 10 GB/s downlink & 12-port @ 40 GB/s uplink switches as leaf network devices can support a network of up to 27,648 servers, and while this may be adequate for most datacenters, it may not be adequate for all. Even larger networks can be created by employing spine tier devices with more than 576 ports accompanied by a corresponding increased number of leaf tier devices. However, another mechanism for assembling a larger network is to employ a multi-rooted tree topology built from more than two tiers of network devices—e.g., forming the network from 3 tiers of network devices, or from 4 tiers of network devices, etc.
One simple example of a 3-tier leaf-spine network may be built from just 4-port switches and this is schematically illustrated in
The disclosed embodiments may be implemented in one or more network devices within a network such as that described herein. Within a leaf-spine network, the disclosed embodiments may be implemented in one or more leaf network devices and/or one or more spine network devices within one or more spine tiers.
The interfaces 804 are typically provided as interface cards (not shown to simplify illustration), which may be referred to as “line cards”. Generally, the interfaces 804 control the sending and receiving of packets over the network and may also support other peripherals used with the network device 800. The communication path between interfaces/line cards may be bus based or switch fabric based (such as a cross-bar). Among the interfaces that may be provided are Fibre Channel (“FC”) interfaces, Ethernet interfaces, frame relay interfaces, cable interfaces, Digital Subscriber Line (DSL) interfaces, token ring interfaces, and the like. In addition, various very high-speed interfaces may be provided, such as fast Ethernet interfaces, Gigabit Ethernet interfaces, Asynchronous Transfer Mode (ATM) interfaces, High-Speed Serial Interfaces (HSSI), Packet over Sonet (POS) interfaces, Fiber Distributed Data Interfaces (FDDI), Asynchronous Serial Interfaces (ASI)s, DigiCable Headend Expansion Interfaces (DHEI), and the like.
In some implementations of the invention, when acting under the control of the ASICs 802, the CPU 806 may be responsible for implementing specific functions associated with a desired network device. According to some embodiments, the CPU 806 accomplishes all these functions under the control of software, including an operating system and any appropriate applications software.
The CPU 806 may include one or more processors or specially designed hardware for controlling the operations of the network device 800. The CPU 806 may also include memory such as non-volatile RAM and/or ROM, which may be used for a variety of purposes such as, for example, caching and/or storing data, programming instructions, etc. However, there are many different ways in which memory could be coupled to the system.
Regardless of the network device's configuration, it may employ one or more memories or memory modules (such as, for example, memory block 806) configured to store data, program instructions for the general-purpose network operations and/or other information relating to the functionality of the techniques described herein. For example, the memory block 806 may correspond to a random access memory (RAM). The program instructions may control the operation of an operating system and/or one or more applications, for example. Because such information and program instructions may be employed to implement the systems/methods described herein, the disclosed embodiments relate to machine-readable media that include program instructions, state information, etc. for performing various operations described herein. Examples of machine-readable media include, but are not limited to, magnetic media such as hard disks and magnetic tape; optical media such as CD-ROM disks; magneto-optical media; and hardware devices that are specially configured to store and perform program instructions, such as read-only memory devices (ROM) and random access memory (RAM). Examples of program instructions include both machine code, such as produced by a compiler, and files containing higher level code that may be executed by the computer using an interpreter.
Although the network device shown in
Although illustrative embodiments and applications of the disclosed embodiments are shown and described herein, many variations and modifications are possible which remain within the concept, scope, and spirit of the disclosed embodiments, and these variations would become clear to those of ordinary skill in the art after perusal of this application. Moreover, the disclosed embodiments need not be performed using the steps described above. Accordingly, the present embodiments are to be considered as illustrative and not restrictive, and the disclosed embodiments are not to be limited to the details given herein, but may be modified within the scope and equivalents of the appended claims.
This application is a continuation of and claims priority from U.S. patent application Ser. No. 14/099,638, entitled “Dynamic Flowlet Prioritization,” by Attar et al, filed on Dec. 6, 2013, which claims priority from U.S. Provisional Application No. 61/900,277, entitled “Dynamic Flowlet Prioritization,” by Attar et al, filed on Nov. 5, 2013, both of which are hereby incorporated by reference in their entirety and for all purposes.
20070076605 | Cidon et al. | Apr 2007 | A1 |
20070091795 | Bonaventure et al. | Apr 2007 | A1 |
20070097872 | Chiu | May 2007 | A1 |
20070159987 | Khan et al. | Jul 2007 | A1 |
20070160073 | Toumura et al. | Jul 2007 | A1 |
20070165515 | Vasseur | Jul 2007 | A1 |
20070171814 | Florit et al. | Jul 2007 | A1 |
20070177525 | Wijnands et al. | Aug 2007 | A1 |
20070183337 | Cashman et al. | Aug 2007 | A1 |
20070211625 | Liu et al. | Sep 2007 | A1 |
20070217415 | Wijnands et al. | Sep 2007 | A1 |
20070223372 | Haalen et al. | Sep 2007 | A1 |
20070233847 | Aldereguia et al. | Oct 2007 | A1 |
20070258382 | Foll et al. | Nov 2007 | A1 |
20070258383 | Wada | Nov 2007 | A1 |
20070274229 | Scholl et al. | Nov 2007 | A1 |
20070280264 | Milton et al. | Dec 2007 | A1 |
20080031130 | Raj et al. | Feb 2008 | A1 |
20080031146 | Kwak et al. | Feb 2008 | A1 |
20080031247 | Tahara et al. | Feb 2008 | A1 |
20080092213 | Wei et al. | Apr 2008 | A1 |
20080123559 | Haviv et al. | May 2008 | A1 |
20080147830 | Ridgill et al. | Jun 2008 | A1 |
20080151863 | Lawrence et al. | Jun 2008 | A1 |
20080177896 | Quinn et al. | Jul 2008 | A1 |
20080212496 | Zou | Sep 2008 | A1 |
20080219173 | Yoshida et al. | Sep 2008 | A1 |
20080225853 | Melman et al. | Sep 2008 | A1 |
20080259809 | Stephan et al. | Oct 2008 | A1 |
20080259925 | Droms et al. | Oct 2008 | A1 |
20080310421 | Teisberg et al. | Dec 2008 | A1 |
20090052332 | Fukuyama et al. | Feb 2009 | A1 |
20090067322 | Shand et al. | Mar 2009 | A1 |
20090094357 | Keohane et al. | Apr 2009 | A1 |
20090103566 | Kloth et al. | Apr 2009 | A1 |
20090116402 | Yamasaki | May 2009 | A1 |
20090122805 | Epps et al. | May 2009 | A1 |
20090161567 | Jayawardena et al. | Jun 2009 | A1 |
20090188711 | Ahmad | Jul 2009 | A1 |
20090193103 | Small et al. | Jul 2009 | A1 |
20090225671 | Arbel et al. | Sep 2009 | A1 |
20090232011 | Li et al. | Sep 2009 | A1 |
20090238196 | Ukita et al. | Sep 2009 | A1 |
20090268614 | Tay et al. | Oct 2009 | A1 |
20090271508 | Sommers et al. | Oct 2009 | A1 |
20100020719 | Chu et al. | Jan 2010 | A1 |
20100020726 | Chu et al. | Jan 2010 | A1 |
20100128619 | Shigei | May 2010 | A1 |
20100150155 | Napierala | Jun 2010 | A1 |
20100161787 | Jones | Jun 2010 | A1 |
20100189080 | Hu et al. | Jul 2010 | A1 |
20100191813 | Gandhewar et al. | Jul 2010 | A1 |
20100191839 | Gandhewar et al. | Jul 2010 | A1 |
20100223655 | Zheng | Sep 2010 | A1 |
20100260197 | Martin et al. | Oct 2010 | A1 |
20100287227 | Goel et al. | Nov 2010 | A1 |
20100299553 | Cen | Nov 2010 | A1 |
20100312875 | Wilerson et al. | Dec 2010 | A1 |
20100312928 | Brownell | Dec 2010 | A1 |
20110022725 | Farkas | Jan 2011 | A1 |
20110110241 | Atkinson et al. | May 2011 | A1 |
20110110587 | Banner | May 2011 | A1 |
20110138310 | Gomez et al. | Jun 2011 | A1 |
20110158248 | Vorunganti et al. | Jun 2011 | A1 |
20110170426 | Kompella et al. | Jul 2011 | A1 |
20110199891 | Chen | Aug 2011 | A1 |
20110199941 | Ouellette et al. | Aug 2011 | A1 |
20110203834 | Yoneya et al. | Aug 2011 | A1 |
20110228795 | Agrawal et al. | Sep 2011 | A1 |
20110239189 | Attalla | Sep 2011 | A1 |
20110243136 | Raman et al. | Oct 2011 | A1 |
20110249682 | Kean et al. | Oct 2011 | A1 |
20110268118 | Schlansker et al. | Nov 2011 | A1 |
20110273987 | Schlansker et al. | Nov 2011 | A1 |
20110280572 | Vobbilisetty et al. | Nov 2011 | A1 |
20110286447 | Liu | Nov 2011 | A1 |
20110299406 | Vobbilisetty et al. | Dec 2011 | A1 |
20110310738 | Lee et al. | Dec 2011 | A1 |
20110321031 | Dournov et al. | Dec 2011 | A1 |
20120007688 | Zhou et al. | Jan 2012 | A1 |
20120030150 | McAuley et al. | Feb 2012 | A1 |
20120030666 | Laicher et al. | Feb 2012 | A1 |
20120057505 | Xue | Mar 2012 | A1 |
20120063318 | Boddu et al. | Mar 2012 | A1 |
20120102114 | Dunn et al. | Apr 2012 | A1 |
20120147752 | Ashwood-Smith et al. | Jun 2012 | A1 |
20120163396 | Cheng et al. | Jun 2012 | A1 |
20120167013 | Kaiser et al. | Jun 2012 | A1 |
20120195233 | Wang et al. | Aug 2012 | A1 |
20120275304 | Patel et al. | Nov 2012 | A1 |
20120281697 | Huang | Nov 2012 | A1 |
20120300669 | Zahavi | Nov 2012 | A1 |
20120300787 | Korger | Nov 2012 | A1 |
20120314581 | Rajamanickam et al. | Dec 2012 | A1 |
20130055155 | Wong et al. | Feb 2013 | A1 |
20130064246 | Dharmapurikar et al. | Mar 2013 | A1 |
20130090014 | Champion | Apr 2013 | A1 |
20130097335 | Jiang et al. | Apr 2013 | A1 |
20130124708 | Lee et al. | May 2013 | A1 |
20130151681 | Dournov et al. | Jun 2013 | A1 |
20130182712 | Aguayo et al. | Jul 2013 | A1 |
20130208624 | Ashwood-Smith | Aug 2013 | A1 |
20130223276 | Padgett | Aug 2013 | A1 |
20130227108 | Dunbar et al. | Aug 2013 | A1 |
20130227689 | Pietrowicz et al. | Aug 2013 | A1 |
20130250779 | Meloche et al. | Sep 2013 | A1 |
20130250951 | Koganti | Sep 2013 | A1 |
20130276129 | Nelson et al. | Oct 2013 | A1 |
20130311663 | Kamath et al. | Nov 2013 | A1 |
20130311991 | Li et al. | Nov 2013 | A1 |
20130322258 | Nedeltchev et al. | Dec 2013 | A1 |
20130322446 | Biswas et al. | Dec 2013 | A1 |
20130322453 | Allan | Dec 2013 | A1 |
20130329605 | Nakil et al. | Dec 2013 | A1 |
20130332399 | Reddy et al. | Dec 2013 | A1 |
20130332577 | Nakil et al. | Dec 2013 | A1 |
20130332602 | Nakil et al. | Dec 2013 | A1 |
20140006549 | Narayanaswamy et al. | Jan 2014 | A1 |
20140016501 | Kamath et al. | Jan 2014 | A1 |
20140043535 | Motoyama et al. | Feb 2014 | A1 |
20140043972 | Li et al. | Feb 2014 | A1 |
20140047264 | Wang et al. | Feb 2014 | A1 |
20140050223 | Foo et al. | Feb 2014 | A1 |
20140056298 | Vobbilisetty et al. | Feb 2014 | A1 |
20140064281 | Basso et al. | Mar 2014 | A1 |
20140068750 | Tjahjono et al. | Mar 2014 | A1 |
20140086097 | Qu et al. | Mar 2014 | A1 |
20140086253 | Yong et al. | Mar 2014 | A1 |
20140105039 | Mcdysan | Apr 2014 | A1 |
20140105062 | Mcdysan et al. | Apr 2014 | A1 |
20140105216 | Mcdysan | Apr 2014 | A1 |
20140108489 | Glines et al. | Apr 2014 | A1 |
20140146817 | Zhang | May 2014 | A1 |
20140146824 | Angst et al. | May 2014 | A1 |
20140149819 | Lu et al. | May 2014 | A1 |
20140185348 | Vattikonda et al. | Jul 2014 | A1 |
20140185349 | Terzioglu et al. | Jul 2014 | A1 |
20140201375 | Beereddy et al. | Jul 2014 | A1 |
20140219275 | Allan et al. | Aug 2014 | A1 |
20140241353 | Zhang et al. | Aug 2014 | A1 |
20140244779 | Roitshtein et al. | Aug 2014 | A1 |
20140269705 | DeCusatis et al. | Sep 2014 | A1 |
20140269712 | Kidambi | Sep 2014 | A1 |
20140307744 | Dunbar et al. | Oct 2014 | A1 |
20140321277 | Lynn, Jr. et al. | Oct 2014 | A1 |
20140328206 | Chan et al. | Nov 2014 | A1 |
20140334295 | Guichard et al. | Nov 2014 | A1 |
20140341029 | Allan et al. | Nov 2014 | A1 |
20140372582 | Ghanwani et al. | Dec 2014 | A1 |
20150009992 | Zhang | Jan 2015 | A1 |
20150010001 | Duda et al. | Jan 2015 | A1 |
20150016277 | Smith et al. | Jan 2015 | A1 |
20150052298 | Brand et al. | Feb 2015 | A1 |
20150092551 | Moisand et al. | Apr 2015 | A1 |
20150092593 | Kompella | Apr 2015 | A1 |
20150113143 | Stuart et al. | Apr 2015 | A1 |
20150124611 | Attar et al. | May 2015 | A1 |
20150124629 | Pani | May 2015 | A1 |
20150124631 | Edsall et al. | May 2015 | A1 |
20150124633 | Banerjee et al. | May 2015 | A1 |
20150124640 | Chu et al. | May 2015 | A1 |
20150124644 | Pani | May 2015 | A1 |
20150124806 | Banerjee et al. | May 2015 | A1 |
20150124817 | Merchant et al. | May 2015 | A1 |
20150124821 | Chu et al. | May 2015 | A1 |
20150124823 | Pani et al. | May 2015 | A1 |
20150124824 | Edsall et al. | May 2015 | A1 |
20150124825 | Dharmapurikar et al. | May 2015 | A1 |
20150124833 | Ma et al. | May 2015 | A1 |
20150127797 | Attar et al. | May 2015 | A1 |
20150188771 | Allan et al. | Jul 2015 | A1 |
20150236900 | Chung | Aug 2015 | A1 |
20150378712 | Cameron et al. | Dec 2015 | A1 |
20150378969 | Powell et al. | Dec 2015 | A1 |
20160036697 | DeCusatis et al. | Feb 2016 | A1 |
20160119204 | Murasato et al. | Apr 2016 | A1 |
20160315811 | Yadav et al. | Oct 2016 | A1 |
20170085469 | Chu et al. | Mar 2017 | A1 |
20170207961 | Saxena et al. | Jul 2017 | A1 |
20170214619 | Chu et al. | Jul 2017 | A1 |
20170237651 | Pani | Aug 2017 | A1 |
20170237678 | Ma et al. | Aug 2017 | A1 |
20170250912 | Chu et al. | Aug 2017 | A1 |
Number | Date | Country |
---|---|---|
WO 2003067799 | Aug 2003 | WO |
WO 2014071996 | May 2014 | WO |
Entry |
---|
Aslam, Faisal, et al., “NPP: A Facility Based Computation Framework for Restoration Routing Using Aggregate Link Usage Information,” Proceedings of QoS-IP: Quality of Service in Multiservice IP Networks, Feb. 2005, pp. 150-163. |
Chandy, K. Mani, et al., “Distributed Snapshots: Determining Global States of Distributed Systems,” ACM Transactions on Computer Systems, Feb. 1985, vol. 3, No. 1, pp. 63-75. |
Khasnabish, Bhumip, et al., “Mobility and Interconnection of Virtual Machines and Virtual Network Elements; draft-khasnabish-vmmi-problems-03.txt,” Network Working Group, Dec. 30, 2012, pp. 1-29. |
Kodialam, Murali, et al., “Dynamic Routing of Locally Restorable Bandwidth Guaranteed Tunnels using Aggregated Link Usage Information,” Proceedings of IEEE INFOCOM, 2001, vol. 1, pp. 376-385. |
Li, Li, et al., “Routing Bandwidth Guaranteed Paths with Local Restoration in Label Switched Networks,” IEEE Journal on Selected Areas in Communications, Feb. 7, 2005, vol. 23, No. 2, pp. 1-11. |
Mahalingam, M., et al., “VXLAN: A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks,” Internet Engineering Task Force, Internet Draft, located at https://tools.ietf.org/html/draft-mahalingam-dutt-dcops-vxlan-06, Oct. 2013, pp. 1-24. |
Moncaster, T., et al., “The Need for Congestion Exposure in the Internet,” Internet-Draft, Oct. 26, 2009, pp. 1-22. |
Narten, T., et al., “Problem Statement: Overlays for Network Virtualization,” draft-ietf-nvo3-overlay-problem-statement-04, Internet Engineering Task Force, Jul. 31, 2013, pp. 1-24. |
Pan, P., et al., “Fast Reroute Extensions to RSVP-TE for LSP Tunnels,” RFC 4090, May 2005, pp. 1-38. |
Raza, Saqib, et al., “Online Routing of Bandwidth Guaranteed Paths with Local Restoration using Optimized Aggregate Usage Information,” IEEE-ICC '05 Communications, May 2005, vol. 1, 8 pages. |
Sinha, Shan, et al., “Harnessing TCP's Burstiness with Flowlet Switching,” Proceedings of the 3rd ACM Workshop on Hot Topics in Networks (HotNets-III), Nov. 2004. |
U.S. Office Action dated Dec. 16, 2015 issued in U.S. Appl. No. 14/099,638. |
U.S. Final Office Action dated Jun. 23, 2016 issued in U.S. Appl. No. 14/099,638. |
U.S. Office Action dated Nov. 3, 2016 issued in U.S. Appl. No. 14/099,638. |
U.S. Notice of Allowance dated May 19, 2017 issued in U.S. Appl. No. 14/099,638. |
Author Unknown, “Subset - Wikipedia, the free encyclopedia,” Dec. 25, 2014, pp. 1-3. |
Whitaker et al., “Forwarding Without Loops in Icarus,” IEEE OPENARCH 2002, pp. 63-75. |
Zhang, Junjie, et al., “Optimizing Network Performance using Weighted Multipath Routing,” Aug. 27, 2012, 7 pages. |
Number | Date | Country |
---|---|---|
20170346748 A1 | Nov 2017 | US |
Number | Date | Country |
---|---|---|
61900277 | Nov 2013 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 14099638 | Dec 2013 | US |
Child | 15682339 | | US |