This disclosure generally relates to networking systems and, in particular, to analysis and configuration of bottlenecks and flows in a network.
In data networks, congestion control algorithms such as those implemented in TCP/IP stacks typically focus on two general objectives: (1) Maximizing network utilization and (2) ensuring fairness among flows competing for network bandwidth. For the past three decades since the first congestion control algorithm was implemented as part of the TCP protocol, many different variations have been designed, implemented and extensively put into practice to help address these two objectives. In this context, it is well-known that regardless of the complexity of a communication path, the performance of a flow is uniquely determined by the capacity of its bottleneck link and its end-to-end round trip time (RTT). This funnel view has steered much of the research towards the single-bottleneck characterization problem, leading to the implicit assumption that bottlenecks in a network have a flat structure, and potentially hiding the need for a better understanding of bottlenecks in distributed networks.
While the problem of characterizing the influences and relationships existing between bottlenecks has not been addressed from a formal and practical standpoint, it has not been entirely overlooked in the literature. For instance, some techniques describe the bottleneck relationships as dependency chains and observe that performance perturbations of bottleneck links may affect other bottleneck links, propagating through all links in the network, but they do not address the problem of formally identifying the hidden structure that controls such perturbations.
To address the above-described problems, we introduce the Theory of Bottleneck Ordering, a framework that, by revealing the bottleneck structure of a network, describes (qualitatively and quantitatively) the influence that bottlenecks exert onto each other. We then use three well-known congestion control algorithms to validate whether bottlenecks and flows in real networks behave as predicted by the theory. To get a broader sense of how the theory performs, we choose representatives from two widely used classes of congestion control algorithms: Bottleneck Bandwidth and Round-trip propagation time (BBR), from the class of congestion-based algorithms, and Cubic and Reno, from the class of loss-based additive-increase/multiplicative-decrease (AIMD) algorithms. The key contributions of the techniques described below can be summarized as follows:
In addition, we note that various known techniques for determining flow size that use metrics such as the number of bytes, transmission rate, duration, jitter or burstiness, to name a few, may not accurately model the flow size and, to address this problem, we provide a formal mathematical definition of flow size that depends on the topological structure of the network and a utility function that describes its QoS properties. This leads to a theory of flow ordering, a mathematical framework that explains the connection between flow size and the QoS properties of a network. We also describe techniques for flow ordering using the bottleneck structure of the network.
Accordingly, in one aspect a method is provided for improving the utilization of a network by configuring processing of flows therein, where the network includes several links and has a number of flows that are processed according to P policies, where P is greater than 1. The method includes evaluating a number of candidate partitionings of the several flows into P groups, where the evaluation of each candidate partitioning is performed using a QoS function that, using network topology, determines network performance for each candidate partitioning. Specifically, according to a candidate partitioning, a first flow belongs to a first group of the P groups and one or more flows belong to a second group of the P groups and, to determine the network performance, the QoS function accounts for an effect of rate drift in a rate of the first flow on respective rates of all other flows in the network. The method also includes designating a candidate partitioning that maximizes the network performance as best partitioning. The several flows in the network are allocated to the P groups according to the best partitioning, and one or more flows in each group are processed according to a respective one of the P policies.
The QoS function may include a generalized P-partitioned set function receiving (P-1) sets as inputs. In some embodiments, the QoS function includes a max-min/max-flow function. The max-min/max-flow function may include: (i) a weighted max-min/max-flow function, (ii) a max-min/weighted max-flow function, or (iii) a weighted max-min/weighted max-flow function. In some embodiments, the QoS function guarantees, for at least one flow among the several network flows, a specified minimum flow rate.
In some embodiments, the QoS function is inclusive. As such, the several candidate partitionings that are evaluated may include one or more sets of partitionings, where one or more partitionings in each set corresponds to a respective value of a parameter (λ). Evaluating the several candidate partitionings may include selecting a sequence of parameter values in order and, for each parameter value, performing the following two operations: (1) evaluating from a corresponding set one or more candidate partitionings using the QoS function; and (2) designating a candidate partitioning that maximizes the network performance as optimal partitioning for that parameter value. This yields a number of optimal partitionings corresponding to the sequence of parameter values. The method may also include designating as the best partitioning from the several optimal partitionings, a partitioning for which the network performance according to the QoS function is maximum.
In some embodiments, evaluating the one or more candidate partitionings for a current parameter value includes identifying a nested neighborhood of the optimal partitioning for a preceding parameter value. In the optimal partitioning for the preceding parameter value, each flow belonging to a first group in the P groups may be larger in size than any flow in a second group in the P groups, and each partitioning may be encoded using P symbols. Identifying the nested neighborhood may include selecting a candidate partitioning such that: (1) the Manhattan distance between an encoding of the candidate partitioning and an encoding of the optimal partitioning for the preceding parameter value is less than or equal to a specified threshold; and (2) each flow designated to the first group in the optimal partitioning for the preceding parameter value is also designated to a corresponding first group in the candidate partitioning.
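As a non-limiting illustration of the nested-neighborhood search just described, the following Python sketch assumes P=2, a hypothetical encode() that maps a partitioning to its P-symbol vector (symbol 1 for the first group, symbol 2 for the second), a user-supplied qos() function, and a hypothetical candidates_for() that supplies the candidate set for each parameter value; none of these names come from the embodiments themselves.
# Sketch of the parameter sweep with nested-neighborhood pruning (P = 2 assumed).
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def sweep(lambdas, candidates_for, qos, encode, threshold):
    best_overall, best_overall_q, prev_opt = None, float("-inf"), None
    for lam in lambdas:
        pool = candidates_for(lam)
        if prev_opt is not None:
            prev_code = encode(prev_opt)
            prev_first = {i for i, s in enumerate(prev_code) if s == 1}
            pool = [c for c in pool
                    if manhattan(encode(c), prev_code) <= threshold            # condition (1)
                    and prev_first <= {i for i, s in enumerate(encode(c)) if s == 1}]  # condition (2)
        if not pool:
            continue
        opt = max(pool, key=qos)      # optimal partitioning for this parameter value
        prev_opt = opt
        if qos(opt) > best_overall_q:
            best_overall, best_overall_q = opt, qos(opt)
    return best_overall               # best partitioning across all parameter values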
In some embodiments, evaluating a particular candidate partitioning includes constructing a flow gradient graph for the network. The method may also include decrementing a respective flow rate of each flow in at least one partition according to the candidate partitioning, by a unit rate (δ), and propagating reduction in the flow rates through the flow gradient graph to obtain a flow rate drift at each flow in the network. The method may also include aggregating the flow rate drifts using a function based on the QoS function to obtain the network performance.
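One way to picture the drift propagation is sketched below. This is only a schematic illustration under our own simplifying assumptions (the drift arriving at a vertex is split evenly among its successors, and propagation is bounded by a hop limit); the precise propagation and aggregation rules are determined by the QoS function of the particular embodiment. The flow gradient graph is represented as an adjacency map from each vertex, ("flow", f) or ("link", l), to its successors; its construction is described below.
# Schematic drift propagation over a flow gradient graph (simplified sketch, not the
# embodiments' exact rule). A rate reduction of delta at a flow is pushed along directed
# edges, split evenly at each vertex, and the per-flow drifts are accumulated.
def propagate_drift(graph, start_flow, delta, max_hops=32):
    drifts = {}
    def push(vertex, amount, hops):
        kind, name = vertex
        if kind == "flow":
            drifts[name] = drifts.get(name, 0.0) + amount
        succ = graph.get(vertex, [])
        if not succ or hops >= max_hops:   # hop limit guards this sketch against cycles
            return
        share = amount / len(succ)
        for nxt in succ:
            push(nxt, share, hops + 1)
    push(("flow", start_flow), -delta, 0)  # the first flow's rate is decremented by delta
    return drifts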
Constructing the flow gradient graph may include creating a link vertex for each bottleneck link in the network and a flow vertex for each flow in the network. In addition, the method may include, for each link (or each bottleneck link) in the network: (1) identifying one or more flows bottlenecked by that link, and adding a respective directed edge from a link vertex corresponding to that link to the respective flow vertices corresponding to each of the one or more bottlenecked flows; and (2) identifying non-bottlenecked flows passing through the link, and adding a respective directed edge from a flow vertex corresponding to each non-bottlenecked flow to the link vertex corresponding to the link. In some embodiments, the one or more bottlenecked flows and the one or more non-bottlenecked flows are identified by constructing a bottleneck precedence graph.
The network can be a data network, an energy-distribution network, a cellular network, or a goods-distribution network. In some embodiments, P is equal to two and, according to the best partitioning, one or more flows from the several flows in the network belong to a first group and are designated elephant flows, and the remaining flow or flows belong to a second group and are designated mouse flows. Each of the P policies may define a respective processing priority for the flows belonging to the corresponding one of the P groups.
In another aspect, a system is provided for improving the utilization of a network by configuring processing of flows therein, where the network includes several links and has a number of flows that are processed according to P policies, where P is greater than 1. The system includes a first processor, and a first memory in electrical communication with the first processor. The first memory includes instructions that may be executed by a processing unit. The processing unit may include the first processor and/or a second processor, and is in electronic communication with a memory module that may include the first memory and/or a second memory.
The instructions program the processing unit to evaluate a number of candidate partitionings of the several network flows into P groups using a QoS function that, using network topology, determines network performance for each candidate partitioning. According to a candidate partitioning, a first flow belongs to a first group of the P groups and one or more other flows belong to a second group of the P groups. To determine the network performance, the QoS function accounts for effect of rate drift in a rate of the first flow on respective rates of all other flows in the several network flows. The instructions program the processing unit to designate a candidate partitioning that maximizes the network performance as best partitioning, where the several network flows are allocated to the P groups according to the best partitioning, and one or more flows in each group are processed according to a respective one of the P policies. In various embodiments, the instructions can program the processing unit to perform one or more of the additional method steps described above.
In another aspect, an article of manufacture is provided that includes a non-transitory storage medium having stored therein instructions which, when executed by a processing unit, program the processing unit, which is in electronic communication with a memory module, for improving the utilization of a network by configuring processing of flows therein, where the network includes several links and has a number of flows that are processed according to P policies, where P is greater than 1.
To this end, the instructions program the processing unit to evaluate a number of candidate partitionings of the several network flows into P groups using a QoS function that, using network topology, determines network performance for each candidate partitioning. According to a candidate partitioning, a first flow belongs to a first group of the P groups and one or more other flows belong to a second group of the P groups. To determine the network performance, the QoS function accounts for effect of rate drift in a rate of the first flow on respective rates of all other flows in the several network flows. The instructions program the processing unit to designate a candidate partitioning that maximizes the network performance as best partitioning, where the several network flows are allocated to the P groups according to the best partitioning, and one or more flows in each group are processed according to a respective one of the P policies. In various embodiments, the instructions can program the processing unit to perform one or more of the additional method steps described above.
The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee. Various embodiments of the present invention taught herein are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which:
Part 1: A Theory of Bottleneck Ordering for Networks
We describe below the Theory of Bottleneck Ordering, a mathematical framework that reveals the bottleneck structure of networks in general, including data networks, energy-distribution networks, cellular networks, networks for distributing physical goods, etc. This theoretical framework provides insights into the inherent topological properties of a network at least in three areas: (1) It identifies the regions of influence of each bottleneck; (2) it reveals the order in which bottlenecks (and flows traversing them) converge to their steady state transmission rates in distributed congestion control algorithms; and (3) it provides key insights to the design of optimized traffic engineering policies.
We demonstrate the efficacy of the proposed theory in TCP congestion-controlled networks for two broad classes of algorithms: congestion-based algorithms (BBR) and loss-based additive-increase/multiplicative-decrease algorithms (Cubic and Reno). Among other results, our network experiments show that: (1) Qualitatively, both classes of congestion control algorithms behave as predicted by the bottleneck structure of the network; (2) flows generally compete for bandwidth only with other flows operating at the same bottleneck level; (3) BBR flows can achieve higher performance and fairness than Cubic and Reno flows due to their ability to operate at the right bottleneck level; (4) the bottleneck structure of a network is generally ever-changing and its levels can be folded due to variations in the flows' round trip times; and (5) against conventional wisdom, low-hitter flows can have a large impact on the overall performance of a network.
Transmission of Bottleneck Information
In the context of data networks, it is well-known that regardless of how many links a connection traverses, from TCP's viewpoint, any network path generally behaves as a single link with the same round-trip time and bottleneck rate. What is less understood is the fact that not all bottleneck links are of equal importance. To provide some intuition for this argument, let us consider some initial simple network examples.
Consider the network configurations illustrated in
Consider now the network configuration in
While this network configuration is still very simple, an interesting question arises from its topological structure: Are both bottleneck links l1 and l2 equally important with regard to the overall performance of the network? For instance, it is easy to see that the derivative of s2 with respect to c1 is 0.5, ∂s2/∂c1=0.5, while the derivative of s1 with respect to c2 is 0, ∂s1/∂c2=0. That means the performance of link l2 (measured in terms of its fair share) depends on the performance of link l1, but not vice-versa, thus revealing a notion of hierarchy or ordering between these two bottleneck links. The bottleneck structure that we formally introduce herein is a graph that captures these relationships.
An intuitive way to understand the bottleneck structure of a network involves modeling the bottleneck links as nodes communicating with each other, where these nodes implement a distributed congestion control algorithm to determine their fair share while using the smallest possible number of iterations. In
Consider yet another network configuration having three links and four flows, as shown in
While the bottleneck structures of the second and third examples are apparently similar (both have two bottleneck links connected by an edge), there is a subtle yet important difference: In the second example (
Note that the BPG also reveals the number of iterations required by links to converge to their fair share. For instance, while two links related by direct precedence can communicate in one iteration (
The above examples helped us to informally introduce the bottleneck structure of some simple network configurations, providing initial intuition behind the existence of (1) bottleneck hierarchies, (2) the notion of convergence time for a distributed congestion control algorithm, and (3) the concept of direct and indirect precedence between two bottlenecks. In the discussion below, we formalize all these intuitive concepts into a mathematical theory of bottleneck ordering and introduce an algorithm to compute the bottleneck precedence graph for arbitrary network topologies.
Parallel Convergence in the Water-Filling Algorithm
We start our work with an implicit assumption: That in steady state, all congestion-controlled flows are bottlenecked at least at one link. The general intuition for this statement is that all congestion control algorithms—by their fundamental objective—are designed not to leave unused bandwidth. This fact is certainly true for algorithms implemented as part of a TCP/IP stack. The second observation we make is that, as shown above, the max-min rate allocation provides a natural way to identify the bottleneck link of every flow in steady state. Intuitively, a flow cannot get a rate above its bottleneck's fair share, as that would be a capacity violation if all other flows acted in the same way.
Thus, in what follows, we use the max-min assignment to analyze the influence that the bottlenecks exert onto each other and reveal the topological structure that inter-relates them. Two notes are relevant in this regard. First, the same general bottleneck principles apply to allocation schemes other than max-min (for instance, weighted max-min or proportional fairness). The bottleneck structure in these other schemes might be different, but the concept of bottleneck structure is universal and exists in them too, for the reason mentioned above. Secondly, as we show in our experiments discussed below, TCP's congestion control (a widely used algorithm in modern networks) does follow very closely the bottleneck structure revealed by the max-min assumption, which makes this choice highly practical.
Next, we describe an algorithm to compute the max-min rate allocation of a network. The original algorithm—called the water-filling Algorithm—was introduced by Bertsekas and Gallager (Dimitri P Bertsekas, Robert G Gallager, and Pierre Humblet, Data Networks, volume 2, Prentice-Hall International New Jersey (1992)). To develop the theoretical framework, however, we use a variant of the water-filling algorithm introduced by Ros-Giralt and Tsai (Jordi Ros and Wei Kang Tsai, “A Theory of Convergence Order of Maxmin Rate Allocation and an Optimal Protocol,” Proceedings IEEE INFOCOM 2001, volume 2, pages 717-726 (2001)). This variant, presented in Algorithm 1, exploits a parallelization property in the original water-filling algorithm that allows certain bottleneck links to converge concurrently, leading to a lower execution time when the algorithm runs on a parallel computing architecture.
ℒ := Set of links in the input network;
ℱ := Set of flows in the input network;
ℱl := Set of flows going through link l;
cl := Capacity of link l;
rf := Transmission rate of flow f;
ℬ := Set of bottleneck links;
𝒰k := Set of unresolved links at iteration k;
𝒞k := Set of converged flows at iteration k;
𝒰0 = ℒ; 𝒞0 = {∅};
ℬ = ℒ \ 𝒰k;
The algorithm relies on two parameters for each link l at iteration k: the link's fair share slk over its not-yet-converged flows, and its upstream fair share ulk, i.e., the minimum fair share among the unresolved links with which l shares at least one flow.
The algorithm tracks the set of unresolved links 𝒰k (links whose final fair share has not been determined yet) and the set of converged flows 𝒞k (flows whose final transmission rate has been determined). At each iteration k, for each unresolved link l that has a fair share equal to its upstream fair share, slk=ulk (line 14), three actions are taken: (1) The transmission rate of the remaining flows going through link l is assigned to the fair share value (line 15), (2) link l is removed from the set of unresolved links (line 16), and (3) all of the remaining flows going through it are added to the set of converged flows (line 17). This process ends when all the flows and all the bottleneck links have converged to their final transmission rate and fair share values, respectively. Upon completion, the algorithm returns the set of bottleneck links ℬ, the transmission rate of all the flows {rf, ∀f∈ℱ}, the fair share of all the bottleneck links {slk, ∀l∈ℬ}, and the value of k, which corresponds to the depth of the bottleneck structure, as we demonstrate below.
The FastWaterFilling algorithm exploits the following property, formally proven elsewhere: A link that has the minimum fair share among all the links it shares flows with can immediately converge to its final fair share. Because, in general, more than one link satisfies this property at each iteration k, multiple links can converge concurrently at the same iteration. On average, this approach reduces the number of iterations from O(|ℒ|) in Bertsekas's original water-filling algorithm down to O(log(|ℒ|)). In the techniques described herein, we observe that this parallelization property captures the exact natural convergence time of bottleneck links in real networks. Our observation relies on the fact that modern networks use distributed congestion control algorithms, such as TCP BBR, Cubic, and Reno, that effectively behave like a large parallel computer. This property plays a crucial role in understanding the bottleneck structure of a network, as we show next.
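To make the procedure concrete, here is a minimal Python sketch of the parallel water-filling idea described above; the function name, the data structures, and the handling of links that turn out not to be bottlenecks are our own choices rather than the original Algorithm 1 listing.
# Minimal sketch of FastWaterFilling. capacity: {link: capacity}; flow_paths: {flow: links traversed}.
def fast_water_filling(capacity, flow_paths):
    flows_of = {l: {f for f, path in flow_paths.items() if l in path} for l in capacity}
    unresolved = {l for l in capacity if flows_of[l]}   # U_k: links not yet resolved
    rate = {}                                           # converged flow rates (C_k = rate.keys())
    fair_share, bottlenecks, k = {}, set(), 0
    while unresolved:
        k += 1
        share = {}
        for l in unresolved:
            remaining = flows_of[l] - rate.keys()
            used = sum(rate[f] for f in flows_of[l] & rate.keys())
            share[l] = (capacity[l] - used) / len(remaining)
        # Parallel step: every link whose fair share is minimal among the unresolved
        # links it shares flows with converges in this iteration.
        newly = [l for l in unresolved
                 if share[l] <= min((share[m] for m in unresolved
                                     if m != l and flows_of[l] & flows_of[m]),
                                    default=float("inf"))]
        for l in newly:
            for f in flows_of[l] - rate.keys():
                rate[f] = share[l]          # remaining flows converge to the fair share
            fair_share[l] = share[l]
            bottlenecks.add(l)
        unresolved -= set(newly)
        # Links whose flows all converged elsewhere are not bottlenecks; drop them.
        unresolved = {l for l in unresolved if flows_of[l] - rate.keys()}
    return bottlenecks, rate, fair_share, k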
Precedent Link Relations
A more detailed analysis of the FastWaterFilling algorithm shows that the order of execution of its while loop reveals a bottleneck structure that is unique to every network. In this hidden structure, bottleneck links relate to each other by following well-defined mathematical relationships that describe how (and in what magnitude) a link can affect the performance of another link. In particular, we show that there exist only two essential relationships between bottlenecks, direct precedence and indirect precedence, which we formally introduce as follows:
Definition 1 Direct precedence. Let l and l′ be two links such that (1) l converges at iteration k, (2) l′ converges at iteration k′, for some k′>k, and (3) ℱl∩ℱl′≠{Ø}. Then we say that link l is a direct precedent link (or simply a direct precedent) of link l′.
Definition 2 Indirect precedence. Let l and l′ be two links such that (1) l converges at iteration k, (2) l′ converges at iteration k′, for some k′>k, and (3) ℱl∩ℱl′={Ø} but there exists another link lr such that (4.1) ℱl∩ℱlr≠{Ø} and ℱl′∩ℱlr≠{Ø}, and (4.2) lr converges at an iteration k″, for some k″>k′. Then we say that link l is an indirect precedent link (or simply an indirect precedent) of link l′, and that lr is a relay link between l and l′.
The relevancy of these two definitions is justified in the next two lemmas, which state the order of bottleneck convergence and the degree to which one bottleneck can affect the performance of another one in arbitrary networks.
Lemma 1 Bottleneck convergence order. A link l is removed from the set of unresolved links 𝒰k at iteration k, 𝒰k=𝒰k−1\{l}, if and only if all of its direct and indirect precedent links have already been removed from this set at iteration k−1.
Proof. See Appendix A.1.2.
This lemma introduces the notion of link convergence. We use this term to indicate that a link's fair share has been resolved and will no longer change as the FastWaterFilling algorithm continues iterating.
Intuitively, Lemma 1 says that for a link l to converge, all of its direct and indirect precedent links must have converged first (this is the necessary condition). Moreover, it also says that if k is the highest iteration at which any of its direct and indirect precedent links converges, then link l will converge immediately after it at iteration k+1 (this is the sufficient condition). The importance of this lemma lies in the fact that it reveals the hidden bottleneck structure of the network. Bottleneck links in a network are not all equal with respect to the performance impact they exert onto each other. On the contrary, they follow a well-defined structure uniquely characterized by the topological properties of the network. The following lemma characterizes the influence of bottlenecks onto each other:
Lemma 2 Bottleneck influence. A bottleneck l can influence the performance of another bottleneck l′, i.e., ∂sl′/∂cl≠0, if and only if there exists a set of bottlenecks {l1, l2, . . . , ln} such that li is a direct precedent of li+1, for 1≤i≤n−1, l1=l and ln=l′.
Proof. See Appendix A.1.3.
Note that in the above lemma we capture the influence of a link l on another link l′ using the derivative ∂sl′/∂cl. That is, if a change in the effective capacity cl of link l changes the fair share sl′ of another link l′, then we say that link l influences link l′. The lemma states that a bottleneck link l influences another bottleneck link l′ if there exists a set of direct precedent links that provide a path between l and l′.
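The condition in Lemma 2 is simply a reachability question on the BPG. The small illustrative helper below (ours, not from the disclosure) assumes the BPG is given as an adjacency mapping from each link to the links it directly precedes, and checks whether l′ is reachable from l along precedent edges.
# Lemma 2 as a reachability check: l can influence l' iff there is a directed
# path of precedent edges from l to l' in the BPG.
def influences(bpg, l, l_prime):
    stack, seen = list(bpg.get(l, ())), set()
    while stack:
        node = stack.pop()
        if node == l_prime:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(bpg.get(node, ()))
    return False
# Example with hypothetical links: influences({"l1": ["l2"], "l2": ["l3"]}, "l1", "l3") -> True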
In
Next, we provide an algorithm to compute all the precedent links to help construct the bottleneck structure of arbitrary networks and a detailed example to illustrate the practical implications of Lemmas 1 and 2.
The Bottleneck Structure of Networks
To compute the bottleneck structure of a network, we need to obtain the direct and indirect precedent links of every bottleneck link. These precedent links correspond to edges in a digraph that has all the bottleneck links as its vertices, revealing the inherent mesh structure that characterizes the relationships between the bottlenecks and how they influence each other. This structure can be formally defined as follows:
Definition 3 Bottleneck precedence graph. We define the bottleneck precedence graph (BPG) as a tuple V, E of vertices V and edges E such that V=ℬ and E={𝒟lk, ∀l∈ℬ}∪{ℐlk, ∀l∈ℬ}, where 𝒟lk and ℐlk are the sets of direct and indirect precedents of link l, as computed by the BPG algorithm introduced below. To differentiate them, we typically represent edges corresponding to direct and indirect precedent links with solid and dashed lines, respectively. We also equivalently refer to the BPG graph as the bottleneck structure of a network.
As noted above, the FastWaterFilling algorithm already computes the set of bottleneck links ℬ. Thus, to obtain the bottleneck structure, we can extend this algorithm with additional logic to compute the direct and indirect precedent links. We refer to this procedure as the Bottleneck Precedence Graph (BPG) Algorithm, which we introduce in Algorithm 2.
ℒ, ℱ, ℱl, cl, ℬ, rf, 𝒰k, 𝒞k are as in Algorithm 1;
𝒟lk := Set of direct precedents of link l at iteration k;
ℐlk := Set of indirect precedents of link l at iteration k;
ℛlk := Set of relays of link l at iteration k;
𝒰0 = ℒ; 𝒞0 = {∅};
𝒟l0 = ℐl0 = ℛl0 = {∅}, ∀l ∈ ℒ;
ℬ = ℒ \ 𝒰k;
For every link l, the BPG Algorithm keeps track of the sets of direct precedents 𝒟lk, indirect precedents ℐlk, and relays ℛlk (lines 2-4). Every time a link converges, these sets are updated accordingly (lines 15-23).
The BPG algorithm returns a tuple of vertices (corresponding to the bottleneck links ℬ) and edges (corresponding to the sets of direct precedents 𝒟lk and indirect precedents ℐlk of every bottleneck link l, where k is the algorithm's last iteration), which provide all the necessary information to construct the bottleneck structure of the network. The last value of the iterator k is also of relevance because it corresponds to the length of the longest path in the BPG graph, as shown in the next lemma:
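To make the construction concrete, the following Python sketch derives the precedence edges after a water-filling pass by applying Definitions 1 and 2 directly (including our reading of the relay condition), rather than the incremental bookkeeping of Algorithm 2; the function name and the convergence-iteration bookkeeping are our own assumptions.
# Sketch: conv_iter maps each link to the iteration at which it converged;
# flows_of maps each link to the set of flows traversing it.
def build_bpg(conv_iter, flows_of):
    direct, indirect = set(), set()
    links = list(conv_iter)
    for l in links:
        for lp in links:
            if conv_iter[l] >= conv_iter[lp]:
                continue                      # l must converge strictly earlier than l'
            if flows_of[l] & flows_of[lp]:
                direct.add((l, lp))           # Definition 1: shared flow => direct precedent
            else:
                # Definition 2 (as reconstructed above): a relay link shares flows with
                # both l and l' and converges after l'.
                for lr in links:
                    if (conv_iter[lr] > conv_iter[lp]
                            and flows_of[l] & flows_of[lr]
                            and flows_of[lp] & flows_of[lr]):
                        indirect.add((l, lp))
                        break
    return direct, indirect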
Lemma 3 Depth of the bottleneck structure. The diameter of the BPG graph is equal to the last value of the iterator k in the BPG algorithm. We refer to this value as the depth or simply the number of levels of the bottleneck structure.
Proof. See Appendix A.1.4
Next, we use two examples to illustrate how the BPG graph is constructed. In both these examples, we use the network shown in
Example 1 Computation of the BPG graph. In this simple example we assume the presence of five flows {f1, . . . , f5} as shown in
Note also that there exists an indirect precedence between links l15 and l1. As in a direct precedent case, this also means flows bottlenecked at link l1 cannot converge until flows bottlenecked at link l15 have converged. The difference here is that there exists no common flow going through both links, unlike in the case of direct precedents. Instead, execution of the BPG algorithm shows that links l15 and l1 have link l4 as a relay link (l
Example 2 B4's bottleneck structure for full-mesh/shortest-path. In this example, we compute the bottleneck structure of the B4 network for a more realistic case considering a larger number of flows. In particular, we assume the existence of a flow between each pair of data centers—i.e., a full mesh connectivity—with each flow taking the shortest path. As a first-order approximation, we reason that the full mesh assumption is intuitively meaningful if we consider that, in general, every data center may need to communicate with every other data center. Also, the choice of shortest paths can serve as a raw approximation of the actual routing. In a production system, both of these assumptions can be relaxed by using actual historical or real-time flow and routing information of the network to compute its precise bottleneck structure. This could, for instance, be inferred using NetFlow, sFlow or BGP-LS protocols, among others (see also Appendix A.5 for details). The full-mesh/shortest-path model provides a simple but powerful base benchmark that can help measure the approximate bottleneck structure of a network by using only its fixed topological structure.
The BPG graph of the B4 network under the full-mesh/shortest-path assumption and with all link capacities set to 100 Gbps is presented in
In contrast, links l16, l11, l9, l18, l19, l14, l15, l3, l6, l7, l12 and l1 are leaves in the bottleneck structure, which means they have no impact on the performance of any other link. This analysis reveals that when looking for high-impact bottlenecks, network operators should consider links that are located at higher levels in the BPG graph. Of interest is comparing the influence of links l8 and l10, both being the transatlantic connections between the US and EU regions. The bottleneck structure, which takes into account the structural properties of the topology, the routing and the flow information of the network, indicates that link l8 has a higher influence than l10 (from
Furthermore, the graph shows that link l8 can affect the performance of link l10 (since there is a directed path from l8 to l10), but not vice versa. Similarly, on the transpacific connections side, the BPG graph shows that link l2 has a more significant influence than link l4 in terms of nodes they can affect, but in this case, neither of these two links can influence the performance of each other since there is no directed path connecting them. Accordingly, the kind of structural information that the BPG graph reveals helps network operators identify and organize the relevancy of each bottleneck.
Property 1 Monotonic fair shares along a precedence path. Let sl
Proof. See Appendix A.1.1.
Minimum Convergence Time of Distributed Congestion Control Algorithms
We described above that an intuitive way to understand the bottleneck structure of a network includes modeling its bottleneck links as nodes that are trying to compute a globally optimal fair-share value by exchanging locally available information. This dualism between the two models-bottleneck structure and distributed communication-provides a robust framework to measure the minimum convergence time attainable by any distributed congestion control algorithm. We capture this concept in the following lemma:
Lemma 4 Minimum convergence time of a distributed congestion control algorithm. Let τ(li,lj) be a weight assigned to each edge (li,lj) of the BPG graph as follows: (1) If li is a direct precedent of lj, then τ(li,lj) is the time that it takes for a message to be sent from li to lj; (2) If li is an indirect precedent of lj, then τ(li,lj)=max{τ(li,lr)+τ(lr,lj) | for any relay link lr between li and lj}. Let l1−l2− . . . −ln be a longest path terminating at link ln according to these weights. Then the minimum convergence time for link ln is Σ1≤i≤n−1 τ(li,li+1).
Proof. See Appendix A.1.5.
It is important to note that the above lower bound is not attainable in the case of TCP congestion control algorithms because, in such protocols, there is no direct communication mechanism between links, since they are end-to-end protocols. Instead, TCP algorithms rely on implicit feedback based on signals such as packet loss or round trip time, which effectively requires a significantly longer time to propagate from one link to another. Nevertheless, the importance of the above lemma lies in the claim that convergence time increases as the depth of the bottleneck structure increases, and this general principle holds even for end-to-end protocols. In the discussion below, we test this lemma against a TCP congestion control algorithm and empirically demonstrate that convergence time does increase as the depth of the bottleneck structure increases.
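Since precedence edges always point from links that converge earlier to links that converge later, the BPG is acyclic, and the bound of Lemma 4 reduces to a longest-path computation over the weighted precedent edges. The short sketch below is ours and assumes the weighted edges are given as a mapping from each link to its (precedent, τ) pairs.
# Minimum convergence time of Lemma 4 as a longest path in the (acyclic) weighted BPG.
from functools import lru_cache

def min_convergence_time(weighted_preds, link):
    """weighted_preds: {link: [(precedent_link, tau), ...]} with tau the edge weight."""
    @lru_cache(maxsize=None)
    def longest(l):
        return max((longest(p) + tau for p, tau in weighted_preds.get(l, [])), default=0.0)
    return longest(link)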
The Influence of Flows onto Bottleneck Links and onto Each Other
While the theory of bottleneck ordering helps understand the relations that exist between links and the influences they exert onto each other, the BPG graph resulting from this mathematical framework does not reveal any information regarding the performance of flows. There exists, however, a simple way to transform the BPG graph into a structure that does take into account flows and helps characterize their performance. We refer to this new structure as the flow gradient graph, formally defined as follows:
Definition 4 Flow gradient graph. The flow gradient graph is a digraph such that: (1) there is one vertex for every bottleneck link and one vertex for every flow in the network; (2) for every flow bottlenecked at a link, there is a directed edge from the vertex of that link to the vertex of that flow; and (3) for every flow that traverses a link but is not bottlenecked at it, there is a directed edge from the vertex of that flow to the vertex of that link.
The flow gradient graph can be constructed using the FastWaterFilling algorithm described above, since this procedure computes the bottleneck links ℬ and, for every flow in the network, discovers the flow's bottleneck right at line 17 (Algorithm 1), when the flow is entered into the converged set. We use an example to illustrate how this graph is constructed.
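Before the example, the following sketch (ours) shows one way to assemble the graph of Definition 4 from the water-filling outputs; the names bottleneck_of (flow to its bottleneck link) and flow_paths (flow to the links it traverses) are assumptions of this sketch.
# Build the flow gradient graph as an adjacency map from each vertex ("link", l) or
# ("flow", f) to its successors, following Definition 4.
def flow_gradient_graph(bottlenecks, bottleneck_of, flow_paths):
    graph = {}
    def add_edge(u, v):
        graph.setdefault(u, []).append(v)
    for f, path in flow_paths.items():
        for l in path:
            if l not in bottlenecks:
                continue                              # only bottleneck links become vertices
            if bottleneck_of[f] == l:
                add_edge(("link", l), ("flow", f))    # link -> flow bottlenecked at it
            else:
                add_edge(("flow", f), ("link", l))    # non-bottlenecked flow -> link
    return graph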
Example 3 Computation of the flow gradient graph. Consider the network in
The flow gradient graph is a useful extension of the BPG since it allows us to generalize the bottleneck analysis onto flows. We can see bottlenecks and flows as two sides of the same coin; thus, if the network were an economy, links and flows would correspond to the supply and the demand, respectively. This duality principle implies that all the general lemmas and properties described herein regarding the bottleneck structure of a network have a dual correspondence in the domain of flows. For instance, Lemma 2 (bottleneck influence) can be translated to the domain of flows as follows:
Lemma 5 Flow influence. A flow f can influence the performance of another flow f′, i.e., ∂rf′/∂rf≠0, if and only if there exists a directed path from the vertex of flow f to the vertex of flow f′ in the flow gradient graph.
Proof. See Appendix A.1.6.
Using this Lemma, we can infer the following properties from the flow gradient graph in
Experimental Results
To experimentally evaluate the ability of the proposed theory to predict bottleneck and flow performance in real networks, we developed a sandbox using Mininet and the POX SDN controller. The sandbox takes as input information related to a network's topology, routing, and traffic flows. With that, the sandbox can create the desired network configurations, including the configurations presented in
On the Bottleneck Structure of TCP Networks
In this initial experiment, we tested the hypothesis that TCP congestion-controlled networks behave as predicted by the theory of bottleneck ordering. In particular, we experimented with the three networks presented in
We show the corresponding BPG graphs in
In this experiment, we also measured the degree of fairness achieved by each congestion control algorithm using Jain's fairness index normalized to the max-min rate allocation. (See Appendix A.4 for a summarized description.) Since the bottleneck structure of the network is also the solution to the theoretical max-min problem, this index effectively serves as an indicator of the degree to which the congestion control algorithm under test can assign rates according to the bottleneck structure itself.
The results for BBR and Cubic are presented in
The Effect of Latency on the Bottleneck Structure of a Network
In the discussion above, we saw that AIMD algorithms fail to identify the bottleneck structure of the test networks and, as a consequence, they perform significantly worse than BBR. We now study this phenomenon in more detail.
The first observation we make is that differences in flow RTTs can play a role in the capabilities of a congestion control algorithm to identify the bottleneck structure. For instance, in
Due to its lower RTT, the lower level flow f1 can capture bandwidth from the higher-level flow f2, and this leads to the collapse of the 2 levels in the Cubic case. This is in contrast with BBR (
To demonstrate the effect of RTT, we tested the 3-level network (
The above results show that BBR does significantly better at identifying each flow's true bottleneck, but can we still fool this congestion control algorithm into collapsing its bottleneck structure? In the next experiment, we pursue this objective by using the 3-level network, initially setting all link latencies to zero, and progressively increasing the RTT of the flows on the upper bottleneck levels to verify if the structure collapses. To control the RTT of each flow individually, we add a fictitious link attached to the source of each flow with a large capacity (so it does not become a bottleneck) and a latency that we adjust to achieve a specific end-to-end RTT value for each flow. In
On How Low-Hitter Flows Can Be High-Impact Flows
Next, we illustrate an example of how the theory of bottleneck ordering can be used to support traffic engineering operations. We start with an initial network configuration, and our objective is to identify the flows that have the largest impact on the performance of the rest of the flows. Based on the information from this analysis, a network operator can choose to re-route high-impact traffic or assign it to a lower priority queue to help improve the overall performance of the network.
To drive this experiment, we consider the network introduced in
In
This result might seem counter-intuitive if we consider the fact that flow f6 is among the smallest throughput flows, but it is easy to explain if we take into account the bottleneck structure of the network. In particular, from the flow gradient graph in
This result indicates that traditional approaches, which focus on finding the heavy-hitter flows, might lead to sub-optimal choices because they do not take into account the full bottleneck structure of the network. We reason that network operators interested in finding the high-impact flows should also consider the BPG and the flow gradient graph structures to aid their optimization process.
Finally, we also note that we performed the same experiments using TCP Cubic and Reno and obtained the same behavior in both algorithms. (For the sake of brevity, we include those results in Appendix A.3.) Overall, the results show that the same traffic engineering analysis based on the bottleneck structure of the network applies not only to BBR but to TCP congestion-controlled networks in general.
Convergence Time with Increased Number of Bottleneck Levels and Flow Competition
In the discussion above, we saw that the depth of the bottleneck structure of a network is closely related to the convergence time of a distributed congestion control algorithm. We set out to experimentally verify this property by using the network configuration in
To complement the convergence time analysis, we also tested the impact of increasing the number of flows at each level. With this objective, we implemented a flow multiplier factor as part of our sandbox. Multiplying the number of flows by a factor m means that each flow in a given network is replicated (preserving its source, destination, and route) m times. This approach can effectively increase by m the number of flows competing at the same level, without modifying the bottleneck structure of the network.
Table 4 presents the results of running the described experiment varying the number of levels n and the flow multiplier factor m from 1 to 4, for a total of 16 experiments, using the BBR algorithm. The results show the behavior predicted by the theoretical framework: Convergence time increases (1) as the number of bottleneck levels increases, and (2) as the number of flows at the same level increases.
While network operators cannot influence the number of flows, they have some ability to alter the number of levels in the BPG graph. For instance, one strategy would include configuring routes (or even designing network topologies) that tend to generate bottleneck structures of smaller depth. This strategy would help flows converge faster and lead to networks operating at higher utilization. The study and design of routes and topologies that yield shallow bottleneck structures, while very interesting, is left for future work.
To the best of our knowledge, the proposed theoretical framework is the first to address the problem of characterizing the bottleneck structure of a network and the influences and relationships existing between bottlenecks from both formal and practical standpoints. A known technique refers to the bottleneck relationships as dependency chains and observes that performance perturbations of bottleneck links may affect other bottleneck links, but it does not address the problem of formally identifying the hidden structure that controls such perturbations.
Various other bottleneck characterization techniques appear to assume implicitly that the bottlenecks in a network have a flat structure. Traditional traffic optimization approaches generally only focus on heavy-hitter flows using a single-link model. In contrast, the techniques described herein reveal the bottleneck structure of a network, which allows optimizing traffic engineering by taking into account the network's topological, routing, and flow properties.
In the techniques described herein, we base the bottleneck analysis on max-min rate assignment. An initial algorithm to compute the max-min rate allocation of a network has been improved and is known. Our algorithm to construct the bottleneck structure is based on a further extension of that improved algorithm.
Congestion control algorithms have been one of the most widely researched subjects in the field of computer communications since the Internet congestion collapse of 1986, which occurred due to the lack of such algorithms. Starting from the first TCP congestion control algorithm implemented in 1987, key research has focused on characterizing the bottleneck to help maximize the performance of flows individually while achieving fairness collectively. This single-bottleneck-oriented approach, however, has left a gap in the understanding of the global structure of the problem.
In this first part, we addressed this gap by introducing a theory of bottleneck ordering. The theoretical framework leads to the BPG algorithm, a procedure that reveals the bottleneck structure of a network in the form of a graph—the BPG graph. This graph provides vital insights to the understanding of the collective bottleneck behavior in a distributed network, among others: (1) The implicit relationships that exist between bottlenecks; (2) the bounded regions in the network that links and flows can influence; and (3) the convergence properties of the congestion control algorithm run by the network.
Our experimental results show that: (1) Real networks do have a bottleneck structure that may follow the BPG graph closely; (2) RTT variations can lead to imperfections (levels that fold) in the bottleneck structure, which can severely affect performance; and (3) congestion-based algorithms such as BBR are able to infer the bottleneck structure significantly better than loss-based/AIMD algorithms such as Cubic and Reno, which leads them to achieve superior performance in both throughput and fairness. The theory also reveals intriguing results, such as the observation that low-hitter flows can have a higher impact on a network than heavy-hitter flows, and the notion that routing and topology can play a role in the design of shallower bottleneck structures that would improve the convergence time of congestion control algorithms.
Part 2: A Theory of Flow Ordering for Networks
How large is a network flow? Existing work in the literature has addressed this question by considering metrics such as the number of bytes, transmission rate, duration, jitter or burstiness, to name a few. But is one metric better than another one and, more importantly, what is the true meaning of the elusive concept of flow size? In the techniques described herein, we show that a formal mathematical definition of flow size depends on the topological structure of the network and a utility function that describes its QoS properties. This leads to a theory of flow ordering, a mathematical framework that explains the connection between the abstract concept of flow size and the QoS properties of networks, including data networks, energy-distribution networks, cellular networks, and networks for transport of physical goods. To put the theory to the test, we apply it to derive computationally tractable algorithms to resolve the problem of topologically ordering TCP flows according to their QoS impact.
The objective of this part of the discussion is to develop the following concept: the size of a flow should not only be measured in terms of arbitrary scalar metrics such as rate, byte counts or duration, as has been done in the past, but also through a formal framework that takes into account the mathematical properties—in terms of its topology and its QoS utility function—of a data network. The contributions of this technique are as follows:
In the following, we present the main body of this work. We first introduce the concept of partitioned QoS, the general definition of elephant flow, the topological properties of inclusiveness and nested convexity and an optimal greedy algorithm.
Thereafter, we provide empirical measurements to help validate the theory and illustrate its applicability, and provide conclusions.
Network Model and QoS Functions
We start by introducing a definition of the network model that will be used throughout the discussion below. As used herein, a network can be a data network, an energy (e.g., electricity) distribution network, a cellular network, a network for transporting goods, etc.
Definition 5 Network. We define a network N as a tuple T,F such that: (1) T is a graph describing the topology of the network, with S its set of vertices and L its set of edges connecting those vertices; and (2) F is a set of flows, where each flow fi∈F is a connection between a source vertex and a destination vertex that traverses a path of edges in T.
In the language of data networks, the vertices in S correspond to switches (or routers) whereas the edges in L correspond to links connecting the switches in the network. Together they uniquely characterize the topology of the network T. In addition, a flow fi∈F corresponds to a specific communication connection (e.g., depending on the network architecture and model, this could correspond to a TCP/UDP connection, an MPLS path, an IP tunnel, etc.) between a source vertex and a destination vertex.
Next we introduce the concept of partitioned QoS function, the starting point from where our theoretical framework is constructed:
Definition 6 Partitioned QoS function. Let N=T,F be a network and assume that each flow fi∈F is assigned resources from N according to one of l possible policies (P1, P2, . . . , Pl). For instance, while not limited to, an example of policies can be those conforming to the rule “flows assigned to policy Pi are forwarded at a higher priority than flows assigned to policy Pi+1”. Let the tuple F1, F2, . . . , Fl be an l-part partition on the set of flows F such that the flows in Fi are assigned to policy Pi, for 1≤i≤l. We will say that q(F1, F2, . . . , Fl) is an l-partitioned QoS function of network N if q(F1, F2, . . . , Fl)>q(F1′, F2′, . . . , Fl′) implies that the quality of service (QoS) achieved with configuration F1, F2, . . . , Fl is greater than with configuration F1′, F2′, . . . , Fl′. When there is no need to make the number of partitions explicit, we will also say that q( ) is a partitioned QoS function or simply a QoS function.
A partitioned QoS function is therefore a utility function on the space of all the possible l-part partitions of F. Throughout the discussion below, the following encoding will provide a convenient way to express some of the mathematical and algorithmic results:
Definition 7 Flow encoding. A network configuration expressed as an l-part partition of a set of flows F, F1, F2, . . . , Fl, can be conveniently encoded as an |F|-dimensional vector p where each element pi is such that:
pi=k iff fi∈Fk (1)
Hence, any l-part partition has one corresponding vector in the space [l]|F|, where [l]={k∈ℕ|k≤l}. Using this encoding, a partitioned QoS function can be expressed as a utility function mapping the set of vectors in [l]|F| onto the set of real numbers ℝ:
q:[l]|F|→ℝ (2)
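As a concrete illustration of Definition 7 (names ours), the following Python snippet encodes an l-part partition as the vector p of equation (1) and counts the number of possible configurations, which equals the size of the space [l]|F|.
# Encode an l-part partition of flows as the vector p of equation (1): p_i = k iff f_i in F_k.
def encode_partition(flows, partition):
    """flows: ordered list [f_1, ..., f_|F|]; partition: list of l disjoint sets F_1..F_l."""
    return [next(k for k, group in enumerate(partition, start=1) if f in group)
            for f in flows]

flows = ["f1", "f2", "f3"]
partition = [{"f2"}, {"f1", "f3"}]          # F_1 = {f2}, F_2 = {f1, f3}
print(encode_partition(flows, partition))   # -> [2, 1, 2]
print(len(partition) ** len(flows))         # size of the search space: l^|F| = 8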
A partitioned QoS function can also be expressed in terms of finding ways to store |F| objects into l boxes according to a utility function q( ). In this case, the search space is given by the sum of all multinomial coefficients, which corresponds also to the size of the space [l]|F|:
Σk1+k2+ . . . +kl=|F| |F|!/(k1!k2! . . . kl!)=l|F| (3)
The problem of finding an optimal network configuration F1, F2, . . . , Fl can now be reduced to the problem of identifying the element in [l]|F| that maximizes q( ):
Definition 8 Partitioned QoS problem. We will say that an l-part partition F1*, F2*, . . . , Fl* of F is a QoS optimal partition if q(F1*, F2*, . . . , Fl*)≥q(F1, F2, . . . , Fl) for any other l-part partition F1, F2, . . . , Fl of F. We will refer to the problem of finding such an optimal partition as the partitioned QoS problem or the l-partitioned QoS problem if there is a need to make the number of partitions explicit.
General Definition of Elephant and Mouse Flow
The previous formulation leads to a formal interpretation and definition of the concepts of elephant and mouse flows:
Definition 9 General definition of elephant and mouse flows. Let N=T,F be a network and assume that each of its flows is assigned resources from the network according to two possible policies, P1 and P2. Let also q( ) be a 2-partitioned QoS function. Then we will say that Fe⊆F is the set of elephant flows if and only if Fe,F\Fe is a QoS optimal partition. Consequently, we will say that F\Fe is the set of mouse flows.
The above definition shows that the problem of identifying the set of elephant flows in a network corresponds to an l-partitioned QoS problem with l=2. It also allows us to establish a direct relationship between the meaning of elephant flow and its effects on the QoS of a network: the set of elephant flows is the set that, when exclusively assigned to policy P1, yields a maximal QoS for the network. The same can be said of mouse flows: the set of mouse flows is the set that, when exclusively assigned to policy P2, also yields a maximal QoS. Notice that, without loss of generality, we will take the convention F1,F2=Fe,F\Fe, so that the sets of elephant and mouse flows are assigned policies P1 and P2, respectively.
Following the notation in equation (1), any 2-partitioned QoS function q( ) can be encoded using a mapping between the set of |F|-dimensional binary vectors and the set of real numbers:
q:b|F|→ℝ (4)
where b=[2]={0,1}.
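The general definition above can be made operational with a brute-force search over the binary encodings of equation (4). The sketch below is our own illustration, practical only for small |F|; it returns the elephant set Fe of Definition 9 for any user-supplied 2-partitioned QoS function q that takes the two sets of flows as arguments.
from itertools import product

# Exhaustively search the 2^|F| binary configurations of equation (4) and return the
# QoS-optimal 2-partition (F_e, F \ F_e) of Definition 9.
def elephant_set(flows, q):
    flows = list(flows)
    best, best_val = None, float("-inf")
    for bits in product((0, 1), repeat=len(flows)):
        f1 = {f for f, b in zip(flows, bits) if b == 1}   # candidate elephant set F_e
        f2 = set(flows) - f1                              # mouse set F \ F_e
        val = q(f1, f2)
        if val > best_val:
            best, best_val = f1, val
    return best, best_val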
Since the above is a general definition of elephant and mouse flows, any of the definitions proposed in the existing literature should correspond to a particular solution of a specific 2-partitioned QoS problem. The next lemma reveals the structural connection between the existing solutions—characterized by the problem's implicit network topology and QoS function—and the general definition of elephant and mouse flows:
Lemma 6 Metric-based partitioned QoS. Let m(f) be any function mapping the set of flows F to the set of nonnegative real numbers, m:F→ℝ+. We will call m(f) a flow-metric function or simply the metric function when the context is understood. Examples of metric functions include the number of bytes or the transmission rate of a flow. Assume the following 2-partitioned QoS function:
where r(f) is the rate of flow f, C is a network capacity parameter, K1 and K2 are two arbitrary positive numbers chosen so that K2>K1, and π(f) and ρ(f) are two penalty and reward functions satisfying the following conditions:
π(fi)>π(fj)⇔m(fi)>m(fj) and ρ(fi)>ρ(fj)⇔m(fi)<m(fj), for all fi,fj∈F (6)
Then the set of elephant flows corresponds to F1=Fe={f1, f2, . . . , fλ*} such that m(fi)>m(fi+1), Σ∀i>λ*r(fi)≤C, and Fe has minimal cardinality.
Proof. Equation (5) can be modeled using the network described in
The optimal 2-partitioned QoS is therefore one that redirects the minimum number of the largest flows—according to the metric function m( )—to the low-priority queue without exceeding the capacity C of the bottleneck link. This partitioned QoS problem can be trivially solved using the following greedy algorithm. Let {f1, f2, . . . , f|F|} be the list of flows ordered according to the rule m(fi)≥m(fi+1). Then the set of elephant flows corresponds to F1=Fe={f1, f2, . . . , fλ*} such that Σ∀i>λ*r(fi)≤C and Fe has minimal cardinality, i.e., λ* is minimal.
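The greedy procedure just described can be written in a few lines. The sketch below is our own rendering and assumes m and r are dictionaries giving each flow's metric value and rate, with C the bottleneck capacity of Lemma 6.
# Greedy solution to the metric-based partitioned QoS problem: order flows by the metric
# m in decreasing order and mark the smallest prefix as elephants so that the rates of
# the remaining (mouse) flows fit within the capacity C.
def metric_based_elephants(m, r, C):
    ordered = sorted(m, key=m.get, reverse=True)     # f_1, f_2, ... with m(f_i) >= m(f_i+1)
    elephants, residual = set(), sum(r.values())
    for f in ordered:
        if residual <= C:                            # remaining flows already fit
            break
        elephants.add(f)                             # move the next largest flow to P_1
        residual -= r[f]
    return elephants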
The 2-partitioned QoS function in equation (5) demonstrates that for any elephant flow detection algorithm in the literature which uses a metric function m(f) to classify elephant flows, there exists at least one partitioned QoS function—e.g., that in equation (5)—whose solution leads to the ordering given by the metric. For instance, elephant flow definitions such as those based on the metrics of flow rate, byte count or burstiness are all equivalent to solving the partitioned QoS problem in
General Definitions of Flow Ordering and Flow Size
We now focus on the problem of deriving a formal definition of flow ordering and flow size for data networks. While we will discuss this subject for the case l=2, the following results can be shown to apply to the general case of an arbitrary number of policies l≥2.
Definition 10 λ-optimal QoS partition. Let N,P1,P2,q( ) be a 2-partitioned QoS problem. We will say that Fλ⊂F defines a λ-optimal QoS partition on the set of flows F if q(Fλ,F\Fλ)≥q(F′,F\F′) for all F′ such that F′⊂F and |Fλ|=|F′|=λ. When the meaning is clear, we will simply use the term λ-optimal partition.
Notice that using Definitions 9 and 10, we have that the set of elephant flows Fe is an |Fe|-optimal QoS partition. The intuition behind the concept of a λ-optimal QoS partition is as follows. Assume first that λ=1. The 1-optimal QoS partition corresponds to a single-flow partition F1={f1} such that the exclusive mapping of flow f1 onto policy P1 leads to the best QoS configuration among all other possible single-flow partition configurations. It follows that flow f1 has the biggest QoS impact—as its mapping onto policy P1 is QoS maximal—and we can conclude that among all flows, f1 is the largest one from a QoS impact standpoint. This reveals a recursive strategy to identify the ordering of flows according to an abstract concept of size: having identified the λ-optimal partition Fλ={f1, . . . , fλ}, the (λ+1)-th largest flow is the flow fλ+1 such that Fλ+1=Fλ∪{fλ+1} is a (λ+1)-optimal partition.
Notice that in order for the above construction process to succeed, the condition Fλ⊂Fλ+1 must be satisfied at each iteration. This leads to the first property that a 2-partitioned QoS function needs to satisfy to ensure the flows can be ordered according to their size:
Property 2 Inclusive QoS functions. We will say that a 2-partitioned QoS function is inclusive if its λ-optimal partition Fλ includes the (λ−1)-optimal partition Fλ−1, for 2≤λ≤|F|. Equivalently, assuming the vector pλ is the configuration encoding of Fλ,F\Fλ (see Definition 7), a 2-partitioned QoS function is inclusive if pλ−1=pλ−1∧pλ, where ∧ is the bitwise logical conjunction.
We are now in a position to formally introduce the concept of flow size ordering:
Definition 11 Flow ordering. Let N, P1, P2, q( ) be a 2-partitioned QoS problem and assume that q( ) is inclusive. Then we will say that f1 is the largest elephant flow in the set F if and only if:
q({f1},F\{f1})≥q({f′},F\{f′}), for all f′∈F. (7)
Similarly, we will say that a flow f2 is the second largest flow in the set F if and only if:
q({f1,f2},F\{f1,f2})≥q({f1,f′},F\{f1,f′}), (8)
for all f′∈F and f′≠f1. More generally, we will say that a flow fi∈F is the i-th largest flow in the set F if and only if:
q(Fi,F\Fi)≥q(F′,F\F′) (9)
where
Fi=Fi−1∪{fi},fi∉Fi−1
F′=Fi−1∪{f′}, for all f′∉Fi−1 (10)
for 1≤i≤|F| and F0={Ø}.
We will also say that the ordered list {f1, f2, . . . , f|F|} defines the flow size ordering of the 2-partitioned QoS problem N, P1, P2, q( ).
Definition 11 introduces both the mathematical definition of flow ordering and a greedy algorithm to order the flows according to such ordering. It is easy to see that if the QoS function is inclusive, then the greedy algorithm is optimal, i.e., it yields the correct flow ordering and the correct set of λ-optimal partitions. Further, for the particular case of l=2, the top flows in this ordering correspond to the set of elephant flows. We will elaborate a bit more on this point below.
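A direct transcription of this greedy procedure is sketched below (our own code, assuming an inclusive 2-partitioned QoS function q that takes the two sets of flows as arguments); it returns both the flow size ordering of Definition 11 and the marginal QoS gains σq(fi) used in Definition 12 below.
# Greedy flow ordering of Definition 11: at step i, add to F_{i-1} the flow that maximizes
# q(F_{i-1} U {f}, F \ (F_{i-1} U {f})). If q is inclusive, this yields the flow size
# ordering and the lambda-optimal partitions; sigma[i] records the QoS gain of each step.
def flow_ordering(flows, q):
    remaining, chosen = set(flows), []
    prev_q, ordering, sigma = q(set(), set(flows)), [], []
    while remaining:
        best_f, best_q = None, float("-inf")
        for f in remaining:
            candidate = set(chosen) | {f}
            val = q(candidate, set(flows) - candidate)
            if val > best_q:
                best_f, best_q = f, val
        chosen.append(best_f)
        remaining.discard(best_f)
        ordering.append(best_f)
        sigma.append(best_q - prev_q)     # sigma_q(f_i) of Definition 12
        prev_q = best_q
    return ordering, sigma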
While knowing the flow ordering is often enough to characterize and optimize network problems, sometimes it can be useful to have an exact measurement or metric for the size of a flow. For instance, as we have seen, all metric-based solutions found in the literature rely on such a metric to provide the definition of elephant flow and to construct heavy-hitter detection algorithms. Next, we construct a formal metric that captures the mathematical meaning of flow size according to the QoS properties of the network.
Property 3 Decreasing returns to gains. Let {f1, f2, . . . , f|F|} be the flow size ordering of a 2-partitioned QoS problem N, P1, P2, q( ) and let σq (fi)=q(Fi,F\Fi)−q(Fi−1, F\Fi−1). We will say that q( ) has decreasing returns to gains, if the following is true:
σq(fi)≤σq(fi−1), for 2≤i≤|F| (11)
Intuitively, the property of decreasing returns to gains tells us that as more flows are moved into policy P1, the QoS gains achieved by performing such action become smaller. This property leads to a natural definition of flow size as follows:
Definition 12 Flow size and q-size metric. Let {f1, f2, . . . , f|F|} be the flow size ordering of a 2-partitioned QoS problem N, P1, P2, q( ) and assume q( ) has decreasing returns to gains. We define σq(fi)=q(Fi, F\Fi)−q(Fi−1, F\Fi−1) as the size of flow fi. We will also refer to the function σq(fi) as the q-size metric of the QoS problem N, P1, P2, q( ). When the subindex q is redundant, we will simply use the expression σ(fi).
The formal concept of flow size can be understood by analogy with an intuitive method that one can use to measure the size of objects in nature. Suppose that we have different types of objects and that we are interested in knowing their volume ordering. We can immerse each object into a pool of water and measure the amount of water displaced. The object with the n-th largest volume will displace the n-th largest amount of water. This approach is equivalent to the definition of flow size ordering introduced above. The water in the pool can be seen as the total level of QoS available. When a flow is classified as an elephant, it is moved into policy P1, as if an object were removed from the pool, enhancing the overall QoS of the network by an amount equal to the amount of water displaced. The size of the flow fi corresponds to the quantity of water displaced, which mathematically can be expressed as the difference in QoS between the two configurations (Fi, F\Fi) and (Fi−1, F\Fi−1). This value corresponds to σ(fi).
The property of decreasing returns to gains is necessary to provide a natural metric onto the set of real numbers upon which we can measure in absolute terms the size of a flow, but it is not needed in order to define the ordering of the flows according to their size. To define the ordering, a 2-partitioned QoS function q( ) only needs to satisfy the inclusiveness property. In the discussion below, we identify the convexity properties that need to be supported by q( ) to ensure inclusiveness. Before doing that, we complete this analysis by showing that all metric-based definitions of flow ordering have decreasing returns to gains:
Lemma 7 Decreasing returns to gains for metric-based ordering. All metric-based partitioned QoS functions have decreasing returns to gains.
Proof. Consider the metric-based partitioned QoS problem described in Lemma 6. Using the condition m(fi)>m(fi+1) and equation (6), we can apply equation (9) to derive the flow size ordering as follows:
q({Ø},F)=K1−Σ∀f∈F π(f)=q0
q(F1,F\F1)=q0+π(f1)=q1
q(F2,F\F2)=q1+π(f2)=q2
( . . . )
q(Fλ*−1,F\Fλ*−1)=qλ*−2+π(fλ*−1)=qλ*−1
q(Fλ*,F\Fλ*)=K2+Σ∀f∈F\Fλ* ρ(f)=qλ*
q(Fλ*+1,F\Fλ*+1)=qλ*−ρ(fλ*+1)=qλ*+1
( . . . )
q(F|F|−1,F\F|F|−1)=q|F|−2−ρ(f|F|−1)=q|F|−1
q(F,{Ø})=q|F|−1−ρ(f|F|)=q|F| (12)
where Fi=Fi−1∪{fi}, for all 1≤i≤|F|, F0={Ø} and F|F|=F.
The above sequence has two separated regions graphically illustrated in
The second region generates the sequence Fλ*, . . . , F|F|, which corresponds to all possible λ-optimal partitions under the assumption that the network is not congested (the lower level of equation (5)). In this region, moving a flow fi from policy P2 to policy P1 decreases the QoS of the network by ρ(fi), i.e., the QoS gain foregone by flow fi when scheduled as low priority. Similarly, using Definition 9, {fλ*+1, . . . , f|F|} corresponds to the set of mouse flows in the network. The optimal partition is reached when λ=λ*, that is, when all elephant flows are moved to policy P1, which is accomplished with the configuration (Fλ*, F\Fλ*), and the total QoS is:
K2+Σ∀f∈F\Fλ* ρ(f) (13)
It is easy to see that σ(fi)=π(fi) for 1≤i≤λ* and σ(fi)=−ρ(fi) for λ*<i≤|F|. Now using equation (6), we can conclude that all metric-based QoS functions have decreasing returns to gains.
Study of Inclusive QoS Functions
The property of inclusiveness is relevant because if a QoS function satisfies it, then the greedy algorithm introduced in Definition 11 is optimal and the problem of ordering the flows in a network becomes tractable. We now study the mathematical properties of inclusive QoS functions. We start with a simple example.
Example 4 Inclusive paths on 3-flow networks. Let N, P1, P2, q( ) be a 2-partitioned QoS problem and assume that F={f1,f2,f3}. We are interested in understanding which possible mappings of the flows onto policies P1 and P2 preserve the property of inclusiveness. Using the encoding in Definition 7, and since |F|=3 and l=2, the set of possible flow mappings corresponds to the following binary symbols:
[l]^|F|=[2]^3={{000},{001},{010},{011},{100},{101},{110},{111}} (14)
The encoding of the solution set works as follows. A zero in the i-th position of each symbol means that flow fi is classified as a mouse flow and hence mapped onto policy P2. Similarly, a one in the j-th position of each symbol means that flow fj is classified as an elephant flow and hence mapped onto policy P1. To find an optimal mapping, we start with the symbol {000}, which maps all flows onto policy P2, and attempt to increase the QoS of the network by migrating flows to policy P1 following the recursive strategy described in equation (9). The complete set of possible optimization paths following this strategy is presented in
As indicated in the graph, out of a total of nine possible paths, there exist six paths that satisfy the property of inclusiveness:
Example 4 illustrates with a simple network configuration that not all possible optimization paths lead to a well-defined flow ordering; only those that satisfy the property of inclusiveness do. We are now interested in identifying the topological properties that partitioned QoS functions must satisfy to ensure that the property of inclusiveness holds. To this end, we first introduce the concepts of partition distance and nested neighborhood:
Definition 13 Partition and nested neighborhoods. Let F1, F2 and F1′, F2′ be two 2-part partitions of a set of flows F and let the vectors p and p′ be their corresponding configuration encoding (definition 7). We define the partition distance d( ) between F1, F2 and F1′, F2′ as the difference between the number of flows assigned to F1 and F1′. (Equivalently, it is also the difference between the number of flows assigned to F2 and F2′). Mathematically:
dπ(F1,F2,F′1,F′2)=||F1|−|F′1||=||F2|−|F′2|| (16)
or, in a more compact form using the configuration encoding:
dπ(p,p′)=|∥p∥1−∥p′∥1| (17)
where ∥x∥1 is the Manhattan norm. We also define the partition neighborhood nπ( ) of a 2-part partition as the set of all 2-part partitions that are at a partition distance of 1 from it:
nπ(F1,F2)={(F′1,F′2) s.t. dπ(F1,F2,F′1,F′2)=1} (18)
equivalently:
nπ(p)={p′ s.t. dπ(p,p′)=1} (19)
The partition neighborhood allows us to define the concept of nested neighborhood no( ) as follows:
no(F1,F2)={(F′1,F′2)∈nπ(F1,F2) s.t. F1⊂F′1}⇔no(p)={p′∈nπ(p) s.t. p=p∧p′} (20)
where ∧ is the bitwise logical conjunction.
Notice that the nested neighborhood of a 2-part partition is contained inside its partition neighborhood, no(p)⊂nπ(p), hence the name nested. We can now introduce the concept of nested convexity which will allow us to characterize the property of inclusiveness:
Definition 14 Nested convexity on 2-partitioned QoS functions. Let a configuration encoding p be a ∥p∥1-optimal partition of an arbitrary 2-partitioned QoS problem N, P1, P2, q( ). The QoS function q( ) is said to be nested convex in the vicinity of p if the following is true:
∀p′∈nπ(p), ∃p″∈no(p) s.t. q(p″)≥q(p′) (21)
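Under the bit-vector encoding of Definition 7, the objects in Definitions 13 and 14 reduce to a few lines of bit manipulation. The following sketch is a minimal illustration under our own naming, assuming each configuration is encoded as a Python tuple of 0/1 entries (a 1 mapping the corresponding flow onto policy P1); it implements the partition distance (17), the partition and nested neighborhoods (19)-(20), and a literal check of condition (21).

    from itertools import product
    from typing import Callable, Iterable, List, Tuple

    Encoding = Tuple[int, ...]   # one 0/1 entry per flow; a 1 maps the flow onto policy P1

    def all_encodings(n_flows: int) -> List[Encoding]:
        # Every possible 2-policy configuration encoding, as in equation (14).
        return list(product((0, 1), repeat=n_flows))

    def partition_distance(p: Encoding, q: Encoding) -> int:
        # d_pi(p, p') = | ||p||_1 - ||p'||_1 |, equation (17).
        return abs(sum(p) - sum(q))

    def partition_neighborhood(p: Encoding, encodings: Iterable[Encoding]) -> List[Encoding]:
        # n_pi(p): encodings at partition distance exactly 1 from p, equation (19).
        return [q for q in encodings if partition_distance(p, q) == 1]

    def nested_neighborhood(p: Encoding, encodings: Iterable[Encoding]) -> List[Encoding]:
        # n_o(p): members of n_pi(p) that preserve the ones of p, i.e. p = p AND p', equation (20).
        return [q for q in partition_neighborhood(p, encodings)
                if all(pi == (pi & qi) for pi, qi in zip(p, q))]

    def is_nested_convex_at(p: Encoding, encodings: List[Encoding],
                            qos: Callable[[Encoding], float]) -> bool:
        # Literal check of condition (21) in the vicinity of p.
        nested = nested_neighborhood(p, encodings)
        return all(any(qos(p2) >= qos(p1) for p2 in nested)
                   for p1 in partition_neighborhood(p, encodings))

For the three-flow system of Example 4, all_encodings(3) reproduces the eight symbols listed in equation (14).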
Finally, we are in a position to characterize the property of inclusiveness using the convexity properties of the QoS function:
Lemma 8 Nested convexity and inclusiveness. A 2-partitioned QoS function is inclusive if and only if it is nested convex in the vicinity of its λ-optimal partitions, for all 1≤λ≤|F|.
Proof. Let N, P1, P2, q( ) be a 2-partitioned QoS problem and assume that q( ) is nested convex in the vicinity of its λ-optimal partitions but not inclusive. Then, from Property 2, there must be a λ-optimal partition which does not include the (λ−1)-optimal partition, for some λ between 2 and |F|. For such λ, we have that pλ−1≠pλ−1∧pλ and, from equation (20), the following holds:
pλ∉no(pλ−1) (22)
Since pλ is a λ-optimal partition, we also have that:
q(pλ)>q(p), ∀p∈no(pλ−1) (23)
But equations (22) and (23) contradict the nested convexity definition in equation (21). This proves the "if" part of the lemma. Assume now that q( ) is inclusive but not nested convex in the vicinity of some λ-optimal partition pλ, and let pλ+1 be a (λ+1)-optimal partition. By definition, we have that pλ+1 is in the partition neighborhood of pλ, that is, pλ+1∈nπ(pλ). From equation (21) and since q( ) is not nested convex in the vicinity of pλ, we also have that q(pλ+1)>q(p′) for all p′∈no(pλ). This implies that pλ+1 is not in the nested neighborhood of pλ, and hence that pλ≠pλ∧pλ+1, which contradicts the assumption that q( ) is inclusive from Property 2.
In Appendix B.1 we provide a topological interpretation of the property of nested convexity and the reason it leads to a polynomial-time greedy algorithm to compute the λ-optimal partitions. We conclude this initial theoretical framework by summarizing in
Identification of the Flow Ordering in TCP Networks
To network operators, understanding which flows are responsible for the largest degradation of performance is critical in order to quickly debug network problems. Consider, for example, the case of a misconfigured BGP route. Such a situation causes certain flows to use routes that are less efficient, thereby negatively impacting the overall QoS of the network. According to the theory of flow ordering described herein, if we are able to correctly order the flows in a network based on their QoS impact, these large flows should reveal themselves at the top of the list. Another common example is the case of flows that violate their service level agreement (SLA), whether due to a network misconfiguration or to a violation of the contract between the user and the network provider.
Traditional methods attempt to identify such flows by looking at scalar metrics such as flow transmission rate, total number of bytes transmitted, or jitter. As shown in Lemma 6, however, such approaches do not take into consideration the topological structure of the network (e.g., the graph structure of the network, the routes followed by the flows, the capacity of the links, or the inner effects of the congestion control algorithm regulating the flows' transmission rates) and are unable to capture the QoS utility function inherent to a network. We now show how the theory of flow ordering can be used in practice to order TCP flows according to their impact on the overall QoS of the network and its topological structure.
The theoretical framework described herein contemplates an area of research around the problem of designing suitable partitioned QoS functions that are practical and useful to network operators. Towards this objective, in what follows we describe and study a first class of QoS functions that are applicable to real networks.
Definition 15 Max-min/max-flow (m3f) QoS function. Let N=T, F be a network and let Fe be an arbitrary subset of flows, Fe⊆F. We define the max-min/max-flow (m3f) QoS function as follows:
where r* is the max-min rate allocation corresponding to network N and δ is a non-negative real number which we refer to as the traffic shaper hyperparameter.
The above QoS function corresponds to the sum of all the flow rates subject to the constraint that all flows are transmitting at the max-min rate while traffic shaping the flows in Fe by an amount δ. Intuitively, this QoS function models the following properties:
We reason that the m3f QoS function is suitable to model networks that are operated by TCP congestion control algorithms, such as Cubic or BBR, among many others, because such algorithms aim at maximizing network utilization while ensuring fairness among all flows. Some TCP congestion control algorithms are explicitly designed to converge to max-min (e.g., XCP), while others converge to other fairness criteria such as proportional fairness. In addition, it is well known that algorithms using additive increase/multiplicative decrease (AIMD), a type of control loop used by many TCP variants, converge to use equal amounts of bandwidth from their bottleneck link, similar to max-min. Equivalent mathematical models for other bandwidth allocation schemes using the framework described herein are contemplated.
A key characteristic of our model, in conjunction with the theory of flow ordering, is that it allows us to reason about the optimal value of Fe (the set of flows that need to be traffic shaped) that maximizes the overall QoS (the welfare) of the network. In this model, the hyperparameter δ can also be interpreted as the delta step in a gradient descent algorithm. In particular, in the case of the m3f QoS function, the objective is to identify the subset of flows Fe⊂F such that, when they are traffic shaped by a small step size δ, the new resulting max-min rate allocation increases maximally towards the max-flow rate allocation.
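Since equation (24) is only partially reproduced above, the following sketch should be read as one plausible rendering of the m3f QoS function based on its prose description: it computes a standard progressive-filling max-min allocation, caps every flow in Fe at δ below its unshaped max-min rate (our assumed traffic-shaping convention), and returns the sum of the resulting rates. The network encoding (links as a capacity dictionary, flows as the collections of links they traverse) and all names are illustrative assumptions, not the formal definition.

    def max_min_allocation(link_capacity, flow_links, rate_cap=None, eps=1e-9):
        # Progressive-filling max-min allocation.
        #   link_capacity: dict link -> capacity
        #   flow_links:    dict flow -> collection of links the flow traverses
        #   rate_cap:      optional dict flow -> maximum admissible rate (traffic shaper)
        rate_cap = dict(rate_cap or {})
        remaining = dict(link_capacity)
        unresolved = set(flow_links)
        rates = {}
        while unresolved:
            users = {l: [f for f in unresolved if l in flow_links[f]] for l in remaining}
            share = {l: remaining[l] / len(fs) for l, fs in users.items() if fs}
            # Each unresolved flow is limited by its tightest link and, if present, its cap.
            limit = {f: min([share[l] for l in flow_links[f]] +
                            ([rate_cap[f]] if f in rate_cap else []))
                     for f in unresolved}
            level = min(limit.values())
            for f in [f for f in unresolved if limit[f] <= level + eps]:
                rates[f] = limit[f]
                unresolved.remove(f)
                for l in flow_links[f]:
                    remaining[l] -= rates[f]
        return rates

    def m3f_qos(link_capacity, flow_links, shaped_flows, delta):
        # One plausible rendering of the m3f QoS function of Definition 15: the sum of
        # max-min rates after shaping every flow in shaped_flows to delta below its
        # unshaped max-min rate.
        base = max_min_allocation(link_capacity, flow_links)
        caps = {f: max(base[f] - delta, 0.0) for f in shaped_flows}
        shaped = max_min_allocation(link_capacity, flow_links, rate_cap=caps)
        return sum(shaped.values())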
It should be understood that m3f is only one example of a QoS function that may be used for ordering flows according to flow sizes or flow metrics so as to improve the overall QoS of a network. In general, the size or metric of a flow represents the network resources required for processing the flow: the larger the size or metric of a flow, the greater the network resources required to process it. Network resources include one or more of link bandwidth, processing capacity, processing time, memory, etc. When the designation of a flow is changed from a relatively less resource-consuming flow to a relatively more resource-consuming flow, a suitable QoS function may apply or increase a penalty associated with the metric of that flow. The QoS function may also apply or increase a reward associated with the respective metrics of one or more other flows. Thus, the QoS function accounts for the overall QoS of the network, which may also be referred to as the network performance or global welfare of the network.
The penalty/reward, which are described above in connection with Equations (5) and (6), represent a reduction/increase in the transmission rate associated with the flow. A reduction in the rate of a flow can decrease the instantaneous resource requirement of that flow, making a greater portion of the available network resource(s) available to one or more other flows, which can increase the overall QoS of the network. Examples of QoS functions include set functions. The m3f function can also be weighted. In particular, max-min can be weighted, max-flow can be weighted, or they both can be weighted. Some QoS functions are constrained such that one or more flows are guaranteed at least respective specified minimum transmission rates or bandwidths. Some QoS functions may be constrained such that one or more flows are limited to at most respective specified maximum transmission rates or bandwidths.
This formulation provides us with a natural way to compute derivatives or gradients in the form of infinitesimal adjustments to flow rates that help improve the QoS of a network. We formalize this concept in the next definition:
Definition 16 Flow gradients. Let N=T,F be a network associated with a QoS function q( ) (Definition 6). We define the gradient of a flow f∈F, ∇f
More generally, we define the gradient produced by the set of flows Fe⊂F as:
Note that this notation should not be directly compared with the traditional definition of gradient used in calculus. The main difference is that in our problem we deal with discrete set functions instead of continuous and differentiable functions. Unlike the general definition of gradient, which results in a vector-valued real function, the flow gradient is a scalar-valued set function. Thus one could argue that the name derivative would be more suitable than gradient. We use instead the more general terminology of gradient to better reflect the concept of incremental path, since our objective is to identify optimization paths within the problem's feasible set that lead to better QoS. Note that since derivatives are a special case of gradients for unidimensional functions, the gradient terminology is still mathematically accurate.
The following example illustrates how the proposed framework can be used to order the flows for the case of the max-min/max-flow QoS function:
Example 5 Ordering flows in a network. Let N be the network example described in
In
Each gradient ∇F
where ri(X) corresponds to the max-min rate of flow fi after all the flows in X are traffic shaped by δ.
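In code, expression (27) can be approximated directly by a finite difference, at the cost of recomputing the full max-min allocation. The brute-force rendering below, which reuses the hypothetical m3f_qos sketch above and treats the step delta as a tunable assumption, is exactly the computation whose cost is discussed next.

    def flow_gradient(link_capacity, flow_links, shaped_flows, delta=1e-3):
        # Finite-difference approximation of expression (27): the change in total QoS
        # when the flows in shaped_flows are traffic shaped by delta, divided by delta.
        q_before = m3f_qos(link_capacity, flow_links, shaped_flows=set(), delta=0.0)
        q_after = m3f_qos(link_capacity, flow_links, shaped_flows=shaped_flows, delta=delta)
        return (q_after - q_before) / delta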
Computing such an expression presents two challenges. First, it requires the calculation of a new max-min solution for the whole network after the traffic shapers are added. Since real networks can run millions of flows (or many more), this could be a time-consuming computation, especially when such gradients need to be computed in real time as network conditions vary. Second, the computation of the limit in expression (27) requires choosing a small enough δ to ensure the value of the gradient is accurate, but it is not clear how this value can be computed efficiently without requiring multiple iterations, with each iteration involving another calculation of the max-min rate allocation. The algorithm we propose next overcomes both of these challenges by using the knowledge gained from the topological structure of the network. Before describing the algorithm, we first need to introduce the concept of flow gradient graph:
Definition 17 Flow gradient graph. (In what follows we will take the classic definitions of saturated link and bottleneck link in the max-min sense.)
Let b(f) and
We define the flow gradient graph of a network N=S, L, F as a directed graph Γ(N)=V, E constructed as follows:
b(f) and
The key characteristic of the flow gradient graph is that it encodes the necessary information to efficiently compute the flow gradients based on the bottleneck structure of the network. In particular, the gradient of any flow f∈F can be computed by applying a rate shift of δ to the flow vertex associated with f and measuring the resulting rate drift propagation on all other flows by traveling through the directed graph. For this process to be successful, there are three invariants that must be preserved when computing the resulting drift for the affected flows:
The first invariant is needed to ensure that the rate drifts maintain all links saturated—otherwise links could become either unsaturated or the total link utilization could become larger than its capacity—while the second invariant is needed to ensure max-min fairness is preserved. Invariants 1 and 2 work together to compute the resulting rate drifts in the link vertices and are graphically represented in
For the sake of brevity, we only include the formal description of the flow gradient algorithm as supplemental material in Appendix B.2 and we focus the next discussion around an example to describe how the algorithm works.
This ensures that the sum of all rate drifts arriving and departing from l1 is equal to zero (Invariant 1) and that all the departing rate drifts are equal to each other (Invariant 2). Since link vertex l2 has a total input drift of −δ+δ/2=−δ/2 arriving from flow vertices f3 and f6, applying again Invariants 1 and 2 the output drift corresponding to flows f2 and f4 must be equal to δ/4. Now at link vertex l3 there is a total input drift of −δ+δ/4=−3δ/4 arriving from flow vertices f4 and f6, which leads to an output drift onto flow vertex f5 of 3δ/4. However, since flow f5 has a second bottleneck with a drift of 0, applying invariant 3 we have that the total drift leaving flow vertex f5 is 0. The actual gradient of flow f6 can now be easily obtained according to equation (27) by dividing the sum of all drifts by δ, (−δ+δ/2+δ/2+δ/4+δ/4+0)/δ, which results in the expected value of ½, in agreement with the value shown in
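As a quick check of the arithmetic in this walkthrough (the underlying topology appears in the referenced figure and the general propagation algorithm in Appendix B.2, neither of which is reproduced here), the accumulated drifts, expressed in units of δ so that the division in equation (27) cancels, sum to the reported gradient:

    # Drifts (in units of delta) accumulated while propagating flow f6's rate shift.
    drifts = [
        -1.0,        # f6 itself is traffic shaped by -delta
        0.5, 0.5,    # output drifts of link vertex l1 (Invariants 1 and 2)
        0.25, 0.25,  # output drifts of link vertex l2 onto flows f2 and f4
        0.0,         # flow f5: its second bottleneck carries a drift of 0 (Invariant 3)
    ]
    gradient_f6 = sum(drifts)   # 0.5, matching the value derived above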
As illustrated in the above example, the flow gradient graph provides two beneficial properties. First, the algorithm only requires recomputing the parts of the network that are strictly affected by the rate drift. In particular, the flows that need to be recomputed to derive the gradient of flow f correspond to the subset of flows that are descendants of flow f's parent link vertices. For instance, the gradient of flow f5 is trivially equal to −δ and can be computed directly without recomputing the rate of any other flow, since flow f5's parent vertices l3 and l4 do not have any children vertices in the flow gradient graph other than f5 itself. Thus the proposed algorithm can be seen as a streaming algorithm in that it avoids recomputing the max-min rates for all flows by only considering the necessary rate drift changes from a given initial max-min solution.
This property is important to ensure the scalability of the real-time computation for networks involving a very large number of flows. Second, the algorithm is independent of the value of δ. This property is a result of the fact that the max-min optimization problem can be characterized by a piecewise linear function, since it can be expressed as a recurrence of linear programming problems. The flow gradient graph is able to capture this property in that the sum of all the rate drifts is proportional to δ, and this factor gets canceled out by the denominator in equation (27). Thus in practice we can fully eliminate the variable δ from the flow gradient graph and work only with the actual rate drift weights, as is shown in the rest of the graphs in
Notice that the flow gradient graph is a valid framework to compute gradients provided that the value of δ is small enough so that the overall bottleneck structure of the network is preserved. Choosing a δ too large could change the structure of the flow gradient graph itself resulting in an erroneous computation. Fortunately, the case of small δ is precisely the case we are interested in, since our objective is to obtain the flow ordering by computing changes in QoS when flows are subject to infinitesimal traffic shapers according to equation (27).
Measurements and Evaluation
We describe below two experiments that we conducted to help empirically validate some of the principles described in our theoretical framework. In the first experiment, we aimed at quantifying and providing some empirical evidence on the property of nested convexity for small networks. In the second experiment, we demonstrate how the theory of flow ordering can be used to identify large flows in Google's B4 network.
Measurements of Nested Convexity in Small Networks
In this experiment we developed a network generator and measured the property of nested convexity for the max-min/max-flow QoS function. To bound the number of possible networks in the search space, we limit this experiment to networks where all links have the same capacity. While real networks in general have links with different capacities, we reason that the case of equal link capacity is relevant in controlled environments such as enterprise networks and also serves as a first-order approximation for networks that do not satisfy this requirement. Our network generator makes use of the combinatorial properties of the problem to ensure the networks generated are unique, eliminating duplicated networks that are symmetric with each other.
We present measurements for networks of up to 7 links and 7 flows. Measuring networks with a higher number of links and flows rapidly becomes intractable, thus for this experiment we only focus on these smaller networks. While such small networks may not be realistic in the real world, the results may still be interesting in small networking environments or in scenarios where flows are aggregated into a reduced number of "superflows". For instance, in MPLS networks, many IP connections can be aggregated into a reduced number of MPLS paths. Such a set of MPLS paths can be modeled as flows, in which case our objective becomes the ordering of these paths according to their overall impact on the QoS of the network.
Table 5 presents the results. All measurements are done by inspecting all possible networks with the corresponding number of links and flows except for the last three cases at the lower right corner of the table which are marked with “*”. These three cases generated an intractable amount of networks and instead we randomly explored 100 million networks for each of them and projected the results as estimations. The four numbers measured in each cell are (in this order): the total number of networks, the number of non-nested convex networks, the number of singular networks and the number of singular networks that are non-nested convex.
We define a singular network as a network that has at least one flow with more than one bottleneck link. We use the term singular because, in the continuum space of rate allocations, the case where a flow happens to have the exact same bottleneck rate in more than one link can be seen as a singularity. Notice that as the number of flows increases, the chances of having singular networks decrease because it becomes more difficult for two links to yield the exact same bottleneck rate to a given flow. In real-world scenarios, singular networks are very unlikely or may only manifest in infrequent instantaneous situations.
A surprising result derived from Table 5 is that, for all networks explored, non-singularity is a sufficient condition for nested convexity. This can be seen by the fact that the second and fourth numbers (the number of non-nested convex networks and the number of singular networks that are non-nested convex) in each cell are identical. We formalize this in a lemma:
Lemma 9 Nested convexity for small networks. Let N be a non-singular network with constant link capacities and assume its QoS can be characterized with the m3f function described in equation (24). If N has up to 6 links and up to 6 flows, then N is deterministically nested convex. If N has up to 7 links and up to 7 flows, then N is nested convex with very high probability.
The contribution of the above lemma is that it enables a very simple mechanism to verify if a network is nested convex for small networks. This is because the property of singularity can be identified by simple inspection of the max-min rate in each bottleneck link of the network, a procedure which is well known to be solvable in polynomial time. At the time of this writing we are still running our network generator to evaluate if all the networks with 7 flows and 7 links are deterministically nested convex. While it will not necessarily add more insights to our argument, this computation will complete in about two weeks, in which case we will be able to assert whether the above lemma also holds deterministically for all networks up to 7 links and 7 flows. We note also that the above lemma does not hold for larger networks. In Appendix B.3 we present the smallest network we found that does not satisfy the above lemma.
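Because a singular network is simply one in which some flow has more than one bottleneck link, the inspection suggested by the lemma can be written directly on top of a max-min allocation. The sketch below is one way to do so under our assumptions: it reuses the hypothetical max_min_allocation helper sketched earlier and takes a link to be a bottleneck of a flow, in the classic max-min sense, when the link is saturated and the flow's rate is maximal among the flows traversing it.

    def is_singular(link_capacity, flow_links, eps=1e-9):
        # True if some flow has more than one bottleneck link in the max-min sense.
        rates = max_min_allocation(link_capacity, flow_links)
        for f, links in flow_links.items():
            bottlenecks = 0
            for l in links:
                through = [g for g, ls in flow_links.items() if l in ls]
                saturated = sum(rates[g] for g in through) >= link_capacity[l] - eps
                maximal = all(rates[f] >= rates[g] - eps for g in through)
                if saturated and maximal:
                    bottlenecks += 1
            if bottlenecks > 1:
                return True
        return False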
Table 5 shows nested convexity measurements for small networks with constant link capacity. Each cell shows (in this order) the total number of (1) networks, (2) non-nested convex networks, (3) singular networks and (4) singular networks that are non-nested convex. (Results marked with "*" were obtained by randomly exploring 100 million networks.)
Identifying Top Flows in Google's B4 Network
We describe several experiments to help illustrate how a network operator can use the proposed framework to identify large flows according to their impact on the QoS of the network and based on the topological structure of the network. To help bring some realism to the experiment, we choose the topology of a real-world network, in this case Google's B4 network. This topology is described in Appendix B.4. In our tests, we assumed that all links have a constant capacity. Note that under this assumption, the actual value of the link capacity is irrelevant to the analysis.
Our first experiment involved measuring the flow ordering assuming there exists a flow between each pair of nodes in the network. Because there are 12 nodes, this results in a total of 144 flows. These flows can be interpreted as traffic-engineered paths that are used to move traffic from one site to another, and our objective was to determine how efficient each path is according to the topological structure of the network. We started by assuming each flow uses a shortest path between its endpoints and later introduced less efficient flows to verify whether the algorithm is able to capture their larger QoS impact. For this experiment we assumed the max-min/max-flow QoS function.
Using the flow gradient graph algorithm to perform an exhaustive computation of all gradients for all possible subsets of flows up to cardinality 4, we obtain the λ-optimal partitions {f12}, {f12,f131}, {f12,f131,f132}, {f12,f131,f132,f142} for λ=1,2,3,4, respectively. Thus the largest flow is f12, followed by f131, f132 and f142. For 5 and more flows the search space becomes intractable, since from equation (3) a total of 2^144 executions of the flow gradient graph algorithm would need to be computed for the full exploration. It is in this regard that our theoretical framework can be used to identify the flow ordering in a tractable manner. Assuming the property of nested convexity (equivalently, inclusiveness, as shown in Lemma 8), we can compute the full ordering of all 144 flows in polynomial time (a total of 144*143/2 executions of the flow gradient graph algorithm) by using the greedy algorithm introduced in Definition 11. By using this approach, we obtained the flow ordering:
One observation from the experiment is that many of the positive flow gradients are very similar to each other and have a small value. (See Appendix B.4 for full detail on all the gradients.) For instance, the largest flow f12 has a gradient of 0.66, which from equation (7) implies that the gradient of any other flow fi≠f12 must be no larger than 0.66. This reflects the notion that all flows are relatively efficient and that there is no flow that stands out as substantially larger than any other flow.
We reason that this is true because we chose each flow to follow a shortest path between its endpoints, which makes them efficient flows from a topological standpoint. Thus the m3f QoS function is able to capture this property. To further test this characteristic, we add 12 more flows to our network experiment, f*1, . . . , f*12, and we force each flow f*i to go through links l0, . . . , li−1, for 1≤i≤12. Our goal is to verify that the proposed theoretical framework and the m3f function are capable of identifying flow f*12 as a very inefficient and large flow in the network, since it goes through all possible links. Indeed, by computing the flow ordering for all 166 flows (the previous 144 plus the new 12 flows), we now obtain the top-10 flow ordering f*12>f*10>f*9>f*11>f*8>f7>f5>f47>f130>f11, which correctly identifies f*12 as the top flow.
Interestingly, the m3f QoS function assigns a higher size to f*10 than to f*11, even though the latter uses one more link. This is the type of subtle insight that the proposed topological approach to flow classification brings, since the ordering is not only based on the number of links used by a flow, but on a comprehensive view of the complete structure of the network, including the inner interactions among all the flows with respect to each other and the links they traverse. Finally, we also observe that the proposed framework is able to quantify the magnitude by which a flow is larger than other flows. Indeed, the gradient of the top flow f*12 is equal to 3.87, ∇f*
Conclusions and Open Questions
Suppose that you are a network operator and suddenly detect that the performance of your network has been drastically reduced due to an unexpected set of misbehaving flows, perhaps because of a misconfigured BGP route or because such flows are not respecting their service level agreement. You are interested in identifying these flows, but your network is running millions of flows and you cannot afford to debug each of them individually. The flow ordering techniques described herein provide a mathematical tool to tackle this problem in a tractable manner. By providing a real-time list of flows ordered according to their QoS impact, network operators can focus on debugging the top flows that are most likely to have caused the collapse of the network.
A.1 Mathematical Proofs
Property 4 Monotonic fair shares. Let slk be the fair share of link l at iteration k of the BPG algorithm. Then slk≤slk+1.
Proof. This result corresponds to Property 3.6 discussed in: Jordi Ros-Giralt and Wei Kang Tsai, "A Theory of Convergence Order of Maxmin Rate Allocation and an Optimal Protocol," Proceedings IEEE INFOCOM 2001, volume 2, pp. 717-726 (2001), which is incorporated herein by reference in its entirety.
A.1.1 Property 1
Monotonic fair shares along a precedence path. Let sl
Proof. By induction, it is sufficient to prove that the lemma holds for n=2. Let k1 and k2 be the values of k at the iterations when links l1 and l2 are removed from the set of unresolved links (BPG algorithm/line 13). This implies sl
A.1.2 Lemma 1
Bottleneck convergence order. A link l is removed from the set of unresolved links L^k at iteration k in the BPG algorithm, L^k=L^k\{l}, if and only if all of its direct and indirect precedent links have already been removed from this set at iteration k−1.
Proof. We start with the sufficient condition. Assume that at an arbitrary iteration k all direct and indirect precedents of a link l1 have converged. We will also assume that link l1 does not converge at iteration k and arrive at a contradiction. It must be that sl
To address the necessary condition, suppose that a link l1 converges at iteration k and that another link l2 that has not converged yet at the same iteration is either a direct or indirect precedent of link l1. By construction of the BPG algorithm, however, this is not possible since once link l1 converges at iteration k, it is removed from the set of unresolved links L^k, and thus no further direct or indirect precedent links can be added to it.
A.1.3 Lemma 2
Bottleneck influence. A bottleneck l can influence the performance of another bottleneck l′, i.e., ∂sl′/∂cl≠0, if and only if there exists a set of bottlenecks {l1, l2, . . . , ln} such that li is a direct precedent of li+1, for 1≤i≤n−1, l1=l and ln=l′.
Proof. We start with the sufficient condition. By induction, it is enough to show the case n=2. Assume that link l1 is a direct precedent of link l2; then there must exist a flow f bottlenecked at link l1 that traverses both l1 and l2, i.e., a flow shared by the two links. Note that any infinitesimal change on the capacity of link l1 changes the rate rf of flow f, since this flow is bottlenecked at that link (BPG Algorithm/lines 9 and 12). Note also that such a variation propagates to link l2, inducing a change in its fair share (line 9), which implies that ∂sl
To address the necessary condition, note first that the performance of a link is uniquely determined by the fair share equation (line 9):
slk=(cl−Σrf)/|Flk|, ∀l∈L^k (28)
This equation depends on internal link parameters (such as its capacity cl and the number of unresolved flows |Flk| going through it at iteration k) as well as external parameters, namely the rates rf of the flows traversing l that are bottlenecked at some other link. Thus sl can only be externally influenced by changing these rates. It is easy to see that these rates correspond to flows that are bottlenecked at links that are direct precedents of link l. Thus the fair share of a link can only change if the rate of one or more of the flows bottlenecked at one of its direct precedent links changes. Since the rates of these flows are equal to the fair share of these direct precedent links, this implies that in order to induce a change in the fair share sl, it is necessary to change the capacity of one or more of its direct precedent links.
A.1.4 Lemma 3
Depth of the bottleneck structure. The diameter of the BPG graph, which we will also refer to as the depth or the number of levels of the bottleneck structure, is equal to the last value of the iterator k in the BPG algorithm.
Proof. Since the diameter of the BPG graph corresponds to the maximum length of any of its paths and since at every iteration the algorithm adds one more vertex to any longest path in the graph, the value of k at the end of the algorithm must be equal to the size of any longest path.
A.1.5 Lemma 4
Minimum convergence time of a distributed congestion control algorithm. Let τ(li,lj) be a weight assigned to each edge (li,lj) of the BPG graph as follows: (1) if li is a direct precedent of lj, then τ(li,lj) is the time that it takes for a message to be sent from li to lj; (2) if li is an indirect precedent of lj, then τ(li,lj)=max{τ(li,lr)+τ(lr,lj) | lr is a relay link between li and lj}. Let l1−l2− . . . −ln be a longest path terminating at link ln according to these weights. Then the minimum convergence time for link ln is Σ1≤i≤n−1 τ(li,li+1).
Proof. This lemma is a consequence of Lemma 1. From Lemma 1, we know that a link l cannot resolve its fair share until all of its direct and indirect precedents have resolved their own fair share. Furthermore, we know that if all the direct and indirect precedents of a link l have been resolved at iteration k, then link l can converge immediately after, at iteration k+1. To derive the current lemma, we only need to take into account the time that it takes to communicate a message from the direct and indirect precedents of link l to link l itself. Because direct precedent links share a flow, communication can propagate directly. For the case of indirect precedence, communication has to go through the relay link. In particular, we need to select the relay link that imposes the longest distance, as that is the longest time it can take to propagate the state between a link and its indirect precedent. The τ( ) function introduced in the lemma captures the value of these communication times for the two possible relations between bottleneck links.
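Since the bound in Lemma 4 is the weight of the longest path terminating at the link of interest, it can be evaluated with a standard longest-path recursion once the BPG graph and the τ weights are available. The sketch below is a minimal illustration under an assumed adjacency-list encoding of the (acyclic) BPG graph; the encoding and names are ours.

    from functools import lru_cache
    from typing import Dict, Hashable, List, Tuple

    def min_convergence_time(edges: Dict[Hashable, List[Tuple[Hashable, float]]],
                             target: Hashable) -> float:
        # Weight of the longest path terminating at target in a DAG whose adjacency
        # list maps each link to its (successor, tau) pairs (Lemma 4).
        preds: Dict[Hashable, List[Tuple[Hashable, float]]] = {}
        for u, outs in edges.items():
            for v, tau in outs:
                preds.setdefault(v, []).append((u, tau))

        @lru_cache(maxsize=None)
        def longest_to(v) -> float:
            return max((longest_to(u) + tau for u, tau in preds.get(v, [])), default=0.0)

        return longest_to(target)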
A.1.6 Lemma 5
Flow influence. A flow f can influence the performance of another flow f′, i.e., ∂rf′/∂rf≠0, if and only if there exists a set of bottlenecks {l1, l2, . . . , ln} such that (1) li is a direct precedent of li+1, for 1≤i≤n−1, (2) flow f′ is bottlenecked at link ln and (3) flow f goes through l1.
Proof. Let f and f′ be two arbitrary flows and assume that flow f′ is bottlenecked at link l′. From lines 9 and 12 of the BPG algorithm, the rate of a flow corresponds to the fair share of its bottleneck. Thus flow f can only influence flow f′ if it can affect link l′'s fair share, sl′. Assume that we apply an infinitesimal change to flow f's rate rf. This could be achieved by applying a traffic shaper that slightly decreases its rate (negative infinitesimal) or by imposing a minimum rate constraint that assigns a rate slightly higher than the fair share of its bottleneck link sl (positive infinitesimal). For any link l traversed by flow f, such a small perturbation will lead to an infinitesimal change in its fair share sl. Now from Lemma 2 we know that a link l can only influence another link l′ if there exists a directed path from l to l′ in the BPG graph. Furthermore, Lemma 2 is a necessary and sufficient condition. Thus both the necessary and sufficient conditions of this lemma must hold too if we set l=l1 and l′=ln.
A.2 Results of the Bottleneck Structure with TCP Reno
We demonstrated above that BBR generally performs better in capturing a given network's bottleneck structure than AIMD-based protocols such as Cubic and Reno. While the results of BBR and Cubic are presented in
A.3 Results of the Low-Hitter Flow Experiment for TCP Cubic and Reno
The plots in
A.4 Jain's Fairness Index
Jain's index is a metric that rates the fairness of a set of values x1, x2, . . . , xn according to the following equation:
J(x1, x2, . . . , xn)=(Σ1≤i≤n xi)^2/(n·Σ1≤i≤n xi^2)
The index value ranges from 1/n (worst case) to 1 (best case).
For multi-link networks, the value xi must be normalized to an optimal fairness allocation. We normalize xi as the ratio ri/Oi, where ri is the rate achieved by flow fi through the experiments and Oi is its expected max-min fair throughput.
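The index and the normalization above amount to only a few lines; the sketch below assumes the achieved rates and the expected max-min throughputs are supplied as parallel sequences.

    from typing import Sequence

    def jains_index(values: Sequence[float]) -> float:
        # Jain's fairness index: (sum x_i)^2 / (n * sum x_i^2); ranges from 1/n to 1.
        n = len(values)
        return sum(values) ** 2 / (n * sum(x * x for x in values))

    def normalized_jains_index(achieved: Sequence[float], expected: Sequence[float]) -> float:
        # Multi-link case: normalize each achieved rate by its expected max-min throughput.
        return jains_index([a / o for a, o in zip(achieved, expected)])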
A.5 Notes on Using the Framework in Production Networks
We now present practical considerations that would allow the integration of the proposed theory into modern data networks. The proposed theoretical framework requires three inputs:
Of the multiple avenues of potential operational use cases, we list the following:
B.1 Topological Interpretation of the Property of Nested Convexity
B.2 Flow Gradient Computation Algorithm
Algorithm 3 provides the pseudocode of the algorithm introduced above in connection with Definition 16.
b(f) := Flow f's bottleneck links;
(f) := Flow f's non-bottleneck links;
b(l) := Link l's bottleneck flows;
(l) := Link l's non-bottleneck flows;
:= List of flows to be processed;
= ∪ (f*);
B.3 Smallest Non-Nested Convex Networks
B.4 Google's B4 Network Topology and Flow Gradients
It is clear that there are many ways to configure the device and/or system components, interfaces, communication links, and methods described herein. The disclosed methods, devices, and systems can be deployed on convenient processor platforms, including network servers, personal and portable computers, and/or other processing platforms. Other platforms can be contemplated as processing capabilities improve, including personal digital assistants, computerized watches, cellular phones and/or other portable devices. The disclosed methods and systems can be integrated with known network management systems and methods. The disclosed methods and systems can operate as an SNMP agent, and can be configured with the IP address of a remote machine running a conformant management platform. Therefore, the scope of the disclosed methods and systems are not limited by the examples given herein, but can include the full scope of the claims and their legal equivalents.
The methods, devices, and systems described herein are not limited to a particular hardware or software configuration, and may find applicability in many computing or processing environments. The methods, devices, and systems can be implemented in hardware or software, or a combination of hardware and software. The methods, devices, and systems can be implemented in one or more computer programs, where a computer program can be understood to include one or more processor executable instructions. The computer program(s) can execute on one or more programmable processing elements or machines, and can be stored on one or more storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), one or more input devices, and/or one or more output devices. The processing elements/machines thus can access one or more input devices to obtain input data, and can access one or more output devices to communicate output data. The input and/or output devices can include one or more of the following: Random Access Memory (RAM), Redundant Array of Independent Disks (RAID), floppy drive, CD, DVD, magnetic disk, internal hard drive, external hard drive, memory stick, or other storage device capable of being accessed by a processing element as provided herein, where such aforementioned examples are not exhaustive, and are for illustration and not limitation.
The computer program(s) can be implemented using one or more high level procedural or object-oriented programming languages to communicate with a computer system; however, the program(s) can be implemented in assembly or machine language, if desired. The language can be compiled or interpreted. Sets and subsets, in general, include one or more members.
As provided herein, the processor(s) and/or processing elements can thus be embedded in one or more devices that can be operated independently or together in a networked environment, where the network can include, for example, a Local Area Network (LAN), wide area network (WAN), and/or can include an intranet and/or the Internet and/or another network. The network(s) can be wired or wireless or a combination thereof and can use one or more communication protocols to facilitate communication between the different processors/processing elements. The processors can be configured for distributed processing and can utilize, in some embodiments, a client-server model as needed. Accordingly, the methods, devices, and systems can utilize multiple processors and/or processor devices, and the processor/processing element instructions can be divided amongst such single or multiple processor/devices/processing elements.
The device(s) or computer systems that integrate with the processor(s)/processing element(s) can include, for example, a personal computer(s), workstation (e.g., Dell, HP), personal digital assistant (PDA), handheld device such as cellular telephone, laptop, handheld, or another device capable of being integrated with a processor(s) that can operate as provided herein. Accordingly, the devices provided herein are not exhaustive and are provided for illustration and not limitation.
References to “a processor”, or “a processing element,” “the processor,” and “the processing element” can be understood to include one or more microprocessors that can communicate in a stand-alone and/or a distributed environment(s), and can thus can be configured to communicate via wired or wireless communication with other processors, where such one or more processor can be configured to operate on one or more processor/processing elements-controlled devices that can be similar or different devices. Use of such “microprocessor,” “processor,” or “processing element” terminology can thus also be understood to include a central processing unit, an arithmetic logic unit, an application-specific integrated circuit (IC), and/or a task engine, with such examples provided for illustration and not limitation.
Furthermore, references to memory, unless otherwise specified, can include one or more processor-readable and accessible memory elements and/or components that can be internal to the processor-controlled device, external to the processor-controlled device, and/or can be accessed via a wired or wireless network using a variety of communication protocols, and unless otherwise specified, can be arranged to include a combination of external and internal memory devices, where such memory can be contiguous and/or partitioned based on the application. For example, the memory can be a flash drive, a computer disc, CD/DVD, distributed memory, etc. References to structures include links, queues, graphs, trees, and such structures are provided for illustration and not limitation. References herein to instructions or executable instructions, in accordance with the above, can be understood to include programmable hardware.
Although the methods and systems have been described relative to specific embodiments thereof, they are not so limited. As such, many modifications and variations may become apparent in light of the above teachings. Many additional changes in the details, materials, and arrangement of parts, herein described and illustrated, can be made by those skilled in the art. Accordingly, it will be understood that the methods, devices, and systems provided herein are not to be limited to the embodiments disclosed herein, can include practices otherwise than specifically described, and are to be interpreted as broadly as allowed under the law.
This application claims priority to and benefit of U.S. Provisional Patent Application No. 62/735,366, entitled “Systems and Methods for Flow Ordering for Data Networks,” filed on Sep. 24, 2018 and to U.S. Provisional Patent Application No. 62/740,757, entitled “Systems and Methods for QoS Based Analysis of Flows in Data Networks,” filed on Oct. 3, 2018, the entire contents of each of which are incorporated herein by reference.
This invention was made with government support under contract No. DE-SC0011358 awarded by the U.S. Department of Energy (DoE). The government has certain rights in the invention.
Number | Date | Country | |
---|---|---|---|
62740757 | Oct 2018 | US | |
62735366 | Sep 2018 | US |