This invention relates to methods for network congestion control. Congestion control is a critical traffic-management task responsible for minimizing network congestion. Network congestion typically occurs when packets arrive at a routing node at a higher rate than they can be forwarded. Congestion increases packet delivery times and causes packet drops, the latter due to queue overflows and packet timeouts. In turn, this causes network resources to be wasted on packet retransmissions and in-transit packet storage. In the context of Internet Protocol (IP) networks, the Transmission Control Protocol (TCP) and its variants provide the end-to-end congestion-control functionality. The success of TCP in the Internet is to a large extent due to the low and stable delays and the low bit-error rates characteristic of typical Internet connections, and the low cost of packet retransmissions in terms of network resources. In the context of intermittently-connected and/or lossy networks (ICLNs), widely-used congestion-control solutions have been more elusive.
Described herein is a method for controlling congestion in ICLNs comprising the following steps. The first step provides for determining, at a local network node, a payoff score for each of a plurality of active flows of network traffic. Each active flow consists of a stream of in-transit packets at the local network node that come from a common source and share a common destination. Each active flow's payoff score is based on a pricing model that considers both a sojourn time and a position in a queue of each of an active flow's constituent packets. Another step provides for allocating unused buffer space across all active flows in the local network node based on relative traffic loads with a buffer-space allocation (BSA) agent. Another step provides for controlling a rate at which packets from all active flows are received at the local network node with a hop-by-hop local-flow-control (LFC) agent according to each flow's payoff score.
Throughout the several views, like elements are referenced using like references. The elements in the figures are not drawn to scale and some dimensions are exaggerated for clarity.
The disclosed method below may be described generally, as well as in terms of specific examples and/or specific embodiments. For instances where references are made to detailed examples and/or embodiments, it should be appreciated that any of the underlying principles described are not to be limited to a single embodiment, but may be expanded for use with any of the other methods and systems described herein as will be understood by one of ordinary skill in the art unless otherwise stated specifically.
ICLNs are characterized by high transmission delays and jitter, low-quality links with high bit-error rates, and the lack of a persistent end-to-end path between the traffic source and destination. Thus, they violate fundamental TCP assumptions, which limits, and in some cases prevents altogether, the use of TCP in ICLNs. Furthermore, for a large set of applications, ICLN traffic is expected to be bursty and dominated by short request/response messaging for command-and-control (C2) and information collection and dissemination tasks. Even versions of TCP designed to deal with long delays and high bandwidth-delay-product networks can be challenged in ICLNs due to their need for round-trip-time (RTT) and retransmission-timeout (RTO) parameter estimates. Furthermore, the delay in the issuance of congestion-control actions at the traffic sources can make TCP-based solutions ineffective, especially when bursty traffic is the dominant traffic type flowing through the network.
Method 10 represents an alternative paradigm to end-to-end congestion control that enables intermediate network nodes to take congestion-control actions independently and without notifying the traffic source. In this context, nodes are able to decide whether and at what speed they want to receive packets from any given flow. Method 10 uses LFC and BSA management to develop a congestion-control framework for ICLNs. An LFC agent may be used to decide whether or not to perform one of the following actions for a given active flow to mitigate network congestion according to an LFC policy: reduce a flow speed, pause the given flow, or restart the given flow if paused. The LFC policy may be learned via a Proximal Policy Optimization (PPO) deep reinforcement learning algorithm using a Markov Decision Process (MDP) as a modeling abstraction of the queue dynamics. The overarching buffer-space allocation given to each flow may be defined via a portfolio selection strategy that encourages large buffer allocations to flows that provide a higher payoff to the node. The payoff may be measured based on a pricing model for packets that embeds within it the notions of sojourn time and packet position in the queue, in lieu of the queue occupancy level, as local congestion indicators. Method 10 thus mitigates the risk of bufferbloat.
Each traffic flow flowing through the node may be managed by an individual LFC agent that controls the rate at which packets from the flow are received. The LFC agent policy may be based on aggregate packet sojourn statistics, its buffer-space allocation and its buffer occupancy level. The LFC agent can stop a flow temporarily to mitigate congestion, while forcing nodes upstream to adjust the forwarding rates for the congested flow.
When a queue within the node becomes active, the node's BSA is responsible for allocating buffer space for that queue. The BSA also reevaluates the queues' buffer-space allocations periodically to adjust them based on the traffic characteristics of each flow. When a queue is full based on its BSA, any new packet arriving at that queue is dropped due to the tail-drop policy used. Packets are forwarded from active queues based on a schedule defined by the packet scheduler (PS). Let F denote the number of active queues in node n. The PS defines a bandwidth-allocation profile μn:=[μn,1, . . . , μn,F]′, where (·)′ denotes the transpose operator, μn,f∈[0, 1], ∀f, is the portion of the available transmission bandwidth B allocated to the f-th queue, and 1′μn=1, where 1 denotes a vector of ones of appropriate size. The PS also defines the service schedule for each queue, and thereby the order in which packets are transmitted.
The PS policy can be designed based on the QoS requirements and priority associated with each flow. Furthermore, it can be dynamically adjusted to accommodate the dynamics of the traffic and the available transmission bandwidth. The LFC module regulates the flow rate for each active flow. In an embodiment of method 10, the LFC can request the PS of any neighboring node, say node n′, to reduce the bandwidth allocation, pause, or restart any flow. The PS of the neighboring node can be configured to always grant such requests. When a flow rate is reduced or the flow is paused in response to an LFC request, the PS does not reallocate the available bandwidth to any other flow. Thus, the effective bandwidth-allocation profile at a node n′, in which there are F′ active flows, is μn′E:=[μn′,1E, . . . , μn′,F′E]′, with μn′,f′E∈[0,1] denoting the effective bandwidth allocation to flow f′, and satisfies 1′μn′E≤1. When a flow is paused, an explicit congestion notification (ECN) may also be sent to the source of the flow. Similarly to their role in active queue management (AQM), ECNs direct the source to reduce its traffic generation rate and accelerate the issuance of congestion-control actions (cf. implicit congestion-control methods such as TCP).
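To make the bookkeeping concrete, the short Python sketch below computes an effective bandwidth-allocation profile consistent with the behavior described above; the slow-down factor and the function name are illustrative assumptions, since the text does not specify how much a slowed-down flow's allocation is reduced.

```python
def effective_allocation(mu, flow_states, slow_down_factor=0.5):
    """Compute the effective allocation profile mu^E at a node.

    mu: nominal PS allocations (sum to 1); flow_states: 'normal', 'slowed-down', or 'paused'.
    Bandwidth given up by paused or slowed-down flows is NOT reallocated, so sum(mu_E) <= 1.
    The slow_down_factor of 0.5 is an assumed reduction, not a value taken from the text.
    """
    mu_e = []
    for share, state in zip(mu, flow_states):
        if state == "paused":
            mu_e.append(0.0)
        elif state == "slowed-down":
            mu_e.append(share * slow_down_factor)
        else:
            mu_e.append(share)
    return mu_e

mu_e = effective_allocation([0.25, 0.25, 0.25, 0.25],
                            ["normal", "paused", "slowed-down", "normal"])
assert sum(mu_e) <= 1.0
```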
Embodiments of method 10 include designing BSA and LFC policies to mitigate packet congestion locally. The BSA policy uses summary statistics captured per queue to define a BSA profile that maximizes the packet throughput achieved by the node. Summary statistics subsume the average benefit (for a collection of packets) and risk associated with a given BSA profile. The LFC policy uses internal queue statistics and information from the BSA and PS modules to control the effective packet acceptance rate for any flow. The resulting LFC and BSA policies will be compatible with any PS policy if the PS can control individual flow transmission rates as requested by the LFC module of a neighboring node.
The BSA policy may be developed based on a portfolio management strategy. In that context, the node has a portfolio of F holdings where each stock in the portfolio corresponds to a traffic flow. The f-th buffer occupancy level qf (measured in bytes), with qf≤q̆f and q̆f∈[Q0, . . . , Q−(F−1)Q0] denoting the queue-space allocation for the f-th queue, corresponds to the amount of stock f owned by the node. The unused queue space across all queues Ω≥0, which represents the node's available investment capital, is defined as
where each summand in the first sum corresponds to the available queue space per queue based on the current BSA. The node receives a conveyance fee per packet it forwards. The price of a packet with index i in the f-th queue is defined as a function of its size bi (in bytes), its sojourn time δi, i.e., the time it spent in its queue waiting to be forwarded, and its position in the queue ζi as
where ηf>0 and γf∈(0, 1] are tuning parameters that respectively control the rate at which the price of a packet decreases as a function of δi and the base for the packet-price adjustment according to its position in the queue, g(·) is a monotonically increasing function of the packet size, and αf>0 is a scalar used to adjust the price of a packet based on the priority of the flow it belongs to. According to Equation (2), the price of a packet decreases as δi increases and is discounted based on the packet's position in the queue with an exponential discount factor γf. The discounting term approaches unity as the packet moves up in the queue. When a packet is dequeued, its corresponding ζi=0 and, thus, the discounting factor equals unity. The per-packet conveyance fee received by the node for packet i in queue f equals its price when dequeued, i.e., pf(bi, δi, 0).
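The following minimal Python sketch illustrates a packet-pricing function with the qualitative properties described above; the exponential decay in δi and the choice g(b)=b are assumptions made for illustration and are not necessarily the exact form of Equation (2).

```python
import math

def packet_price(b, delta, zeta, alpha_f=1.0, eta_f=0.1, gamma_f=0.9):
    """Illustrative packet price: increases with packet size b, decays with sojourn
    time delta at a rate set by eta_f, and is discounted by gamma_f**zeta according
    to the queue position zeta (zeta = 0 once the packet is dequeued)."""
    g = lambda size: size  # assumed monotonically increasing function of packet size
    return alpha_f * g(b) * math.exp(-eta_f * delta) * gamma_f ** zeta

# Conveyance fee collected by the node when the packet is dequeued (zeta = 0):
fee = packet_price(b=120, delta=2.5, zeta=0)
```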
Based on Equation (2) and the number of active flows, the BSA policy seeks a space-allocation profile for the node's available queue space that maximizes its expected profit based on the conveyance fees it expects to collect when forwarding packets. To that end, the notion of rate of return, which defines the fractional change in price for a packet since its arrival to the queue, is introduced. With Qf(t) defining the set of packets in the f-th queue at time t, the rate of return of stock f, i.e., that of the f-th traffic flow, at time t is defined as
where ti0<t denotes the arrival time of packet i to the queue, δi(t) its sojourn time at time t, ζi(t) its position in the queue at time t, Λf,0:=Σi∈Q
weighs the impact of the sojourn time on the return rate. Low ρf(t) values indicate high sojourn times and little advancement of packets towards the front of the queue since their arrival and are, thus, reasonable congestion indicators. The rate of return ρf(t) is set to a positive constant when the queue is empty.
Next, we consider the queue-space allocation problem from the vantage point of portfolio selection. The BSA policy seeks to redistribute the unused queue space Ω to traffic flows yielding a high return rate, i.e., flows with weaker congestion indicators. A flow f that yields high or low aggregate rates of return should respectively receive a large or small q̆f. Although counterintuitive, limiting the BSA for congested flows serves as an indicator of congestion to the LFC policy and prevents exacerbating congestion by limiting the number of packets from that flow that can be received and stored.
Continuing with the stock-portfolio analogy, the rate of return of each stock may be evaluated periodically over a window of M samples comprising the columns of R(t):=[ρ(t−1), . . . , ρ(t−M)], where ρ(t−m):=[ρ1(t−m), . . . , ρF(t−m)]′, m=1, . . . , M. Let
where θ>0 is a tuning parameter that represents the risk-aversion level chosen for the selection of the portfolio, and ≥ denotes an entry-wise inequality. Vector w defines the BSA across the portfolio. The variance term in Equation (4) captures the variability of the rates of return and, thus, the perceived risk of allocating buffer space to a particular stock. The constrained optimization problem in Problem (4) is convex and, thus, has a unique maximum. It can be solved efficiently via one of several convex optimization algorithms known in the art. Once ŵ:=[ŵ1, . . . , ŵF]′ is available, the new BSA profile is defined via Ωŵ as:
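As a concrete illustration of this portfolio-selection step, the sketch below solves a mean-variance problem of the kind described above using the cvxpy library (one of several convex solvers that could be used); since Equation (4) is not reproduced in this text, the sample-mean/covariance objective, the small diagonal regularizer, and the variable names are assumptions.

```python
import numpy as np
import cvxpy as cp

def bsa_weights(R, theta=1.0):
    """R: F x M matrix whose columns are the last M per-flow rate-of-return samples.
    Returns w_hat, the fraction of the unused queue space Omega assigned to each flow."""
    F = R.shape[0]
    r_bar = R.mean(axis=1)                    # mean rate of return per flow
    Sigma = np.cov(R) + 1e-9 * np.eye(F)      # sample covariance (regularized to stay PSD)
    w = cp.Variable(F)
    # Assumed mean-variance form of the objective in Problem (4): reward minus theta-weighted risk
    objective = cp.Maximize(r_bar @ w - theta * cp.quad_form(w, Sigma))
    constraints = [cp.sum(w) == 1, w >= 0]    # 1'w = 1 and entry-wise w >= 0
    cp.Problem(objective, constraints).solve()
    return w.value

# Redistribute Omega bytes of unused queue space across F = 4 flows
R = np.vstack([0.5 + 0.1 * np.random.randn(10) for _ in range(4)])
w_hat = bsa_weights(R)
space_shares = 1200.0 * w_hat                 # Omega * w_hat, cf. the BSA profile definition above
```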
Per active flow, the LFC policy, enforced by an LFC agent, decides whether to reduce the flow speed, pause it or restart it, if paused. This decision-making process followed by the LFC agent is modeled as an MDP using a modeling abstraction of the queue dynamics. The LFC agent may be trained via a deep RL framework using the PPO algorithm. The MDP model and PPO algorithm are discussed in the ensuing paragraphs.
The MDP describes the evolution of the environment as the LFC agent interacts with it. With τ defining the decision epoch and S (A) the state (action) space, the LFC agent observes the state sf(τ)∈S and decides the action af(τ)∈A to execute. The environment evolves to a new state sf(τ+1)∈S and the LFC agent receives a reward rf(τ+1)∈ℝ. The dynamics of the environment are characterized by the set of transition probabilities of the form:
with s′f, sf∈S, r′f∈ℝ, and af∈A, which define the probability that the system transitions to s′f and receives a reward r′f given that the system is currently in sf and the LFC agent takes action af. In our case, the probabilistic transitions are due to the stochastic nature of the traffic and the available transmission bandwidth, and their impact on the BSA policy, which translates to unpredictable variability in the BSA. Note that the time in between decision epochs is defined by the transmission schedule associated with each traffic flow. The elements of the MDP describing the decision-making process followed by the LFC agent are described next.
With respect to the state space, consider the sequence of sojourn times for all packets dequeued from queue f during the last Z decision epochs. The state of the f-th queue observed by its supervising LFC agent at the beginning of decision epoch τ is:
where q̆f(τ) is the latest queue-space allocation issued by the BSA, μf(τ)B(τ) is the bandwidth allocated by the PS, lf(τ)∈[0,1] is the packet loss rate experienced by the queue due to its tail-drop policy during (τ−1, τ], qf(τ) is the queue occupancy, ϕf(τ)∈{normal, paused, slowed-down} is a categorical variable defining the state of the flow, and ωf(τ) is a vector containing summary sojourn-time statistics for that sequence. In particular, ωf(τ):=[df,min(τ), df,max(τ), df,mean(τ), df,median(τ), df,std(τ)]′, where df,min(τ), df,max(τ), df,mean(τ), df,median(τ), and df,std(τ) are respectively the minimum, maximum, mean, median, and standard deviation of the elements of the sequence.
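A minimal sketch of how such an observation could be assembled from recent measurements is given below; the field order follows the listing above, while the integer encoding of the categorical flow state is an assumption made for illustration. Note that the resulting observation has 10 entries, which is consistent with the 10-neuron input layer used in the numerical tests described later.

```python
import numpy as np

FLOW_STATE = {"normal": 0, "slowed-down": 1, "paused": 2}   # assumed numeric encoding

def build_state(q_alloc, bandwidth, loss_rate, occupancy, flow_state, sojourn_times):
    """Assemble the per-queue observation described above from the latest BSA allocation,
    PS bandwidth share, loss rate, occupancy, flow state, and recent sojourn times."""
    d = np.asarray(sojourn_times, dtype=float)
    omega = (np.array([d.min(), d.max(), d.mean(), np.median(d), d.std()])
             if d.size else np.zeros(5))
    return np.concatenate(([q_alloc, bandwidth, loss_rate, occupancy,
                            FLOW_STATE[flow_state]], omega))

s = build_state(q_alloc=600.0, bandwidth=0.5, loss_rate=0.02, occupancy=420.0,
                flow_state="normal", sojourn_times=[1.2, 0.8, 3.4])
assert s.shape == (10,)
```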
With respect to the action space, upon observing the state of the queue at decision epoch τ, the LFC agent selects a discrete action af(τ)∈A(sf(τ))⊂{no-change, pause, slow-down, restart}. The set of actions available to the LFC agent is a function of sf(τ) via ϕf(τ), and is defined as A(sf(τ)):={no-change, slow-down, pause, restart}\(ϕf(τ)), where \ denotes the set-difference operation, and
Note that |A(sf(τ))|=3 for all sf(τ)∈S and all f, τ. When the LFC agent of a given node chooses the pause, slow-down, or restart action, the LFC agent/module generates a request message for the PS in the neighboring node from which the flow is coming to execute the action.
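The sketch below shows one way to realize the state-dependent action set; because the mapping from the flow state ϕf(τ) to the single excluded action is given by an equation not reproduced in this text, the EXCLUDED dictionary is only an assumption chosen to satisfy |A(sf(τ))|=3.

```python
ALL_ACTIONS = {"no-change", "slow-down", "pause", "restart"}

# Assumed mapping from the categorical flow state to the one action removed from the set;
# the exact mapping used in method 10 is not reproduced here.
EXCLUDED = {"normal": "restart", "paused": "pause", "slowed-down": "slow-down"}

def available_actions(flow_state):
    """Return the action set A(s_f) for the current flow state (always 3 actions)."""
    return ALL_ACTIONS - {EXCLUDED[flow_state]}

assert all(len(available_actions(phi)) == 3 for phi in EXCLUDED)
```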
With respect to the transition function, the packet queueing and dequeuing processes characterize the evolution of the MDP. These processes are externally driven by the traffic flow dynamics and a medium access control (MAC) protocol. The queue occupancy level and the flow state evolve according to:
where Ψf(τ) is a random variable representing the packet arrivals (in bytes) to the queue in the interval (τ, τ+1]. Equation (9) assumes that only packets stored in the queue prior to τ can be transmitted in the interval (τ, τ+1]. The flow state evolves according to
The evolution of the remaining variables in s(τ) cannot be easily obtained in analytical form. Their evolution can, however, be traced numerically as the MDP evolves.
The reward received by the LFC agent when taking action af(τ) in state sf(τ) and evolving to sf(τ+1) is given by:
where ĥf(τ+1):=(df,mean(τ+1)−df,median(τ+1))/df,std(τ+1) denotes the nonparametric skew statistic and C>0 is a tuning parameter. The nonparametric skew ĥf(τ+1)∈[−1,1] measures the skewness of the sojourn-time probability density function (p.d.f.). If the flow is not in a paused state, then the agent is rewarded with a positive value if the sojourn-time p.d.f. is left-skewed, in which case the mass of the sojourn-time p.d.f. is concentrated to the left of its mean value. Otherwise, the LFC agent receives a zero or negative reward. The LFC agent is also rewarded proportionally to the packet acceptance rate (1−lf(τ+1))∈[0,1] experienced by the queue during the interval (τ, τ+1]. If the flow is paused, then the LFC agent receives zero reward.
Within an RL framework, the LFC agent seeks to learn a policy π: S → A that maps the observed state of the environment sf∈S into an action af∈A. In our case, the policy π defines whether to reduce the flow speed, pause it, or restart it if paused. Typically, RL seeks the policy that maximizes the expected sum of rewards given by:
where γ∈(0,1) is a discount factor, Π defines the set of feasible policies π, and the expectation is taken over the state-action marginal distribution of the pair (sf, af) trajectory induced by a policy π.
In an embodiment of method 10, a PPO algorithm is used to learn the LFC policy πθ, where θ denotes the set of deterministic parameters defining the policy. Let βτ(θ):=πθ(af(τ)|sf(τ))/πθold(af(τ)|sf(τ)) denote the probability ratio between the current policy πθ and the previous policy πθold, where θold denotes the policy parameters prior to the update.
where Ê[·] denotes the empirical expected value, ϵ∈(0, 1), and the function clip(x, 1−ϵ, 1+ϵ) clips the value of the scalar x to be within [1−ϵ, 1+ϵ]. The objective function in Equation (13) removes the incentive for choosing a θ that would cause βτ(θ)Â to go outside the interval [1−ϵ, 1+ϵ]. After applying the minimum operator, Equation (13) becomes a lower bound for the classical policy-gradient objective βτ(θ)Â(sf(τ), af(τ)). PPO improves the stability of policy learning by constraining the size of the change allowed in π between stochastic gradient descent updates.
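The clipped surrogate objective described above can be written compactly as in the NumPy sketch below, which mirrors the standard PPO formulation; the batch values are illustrative.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective of Equation (13): the element-wise minimum of the
    unclipped and clipped ratio-weighted advantages, averaged over sampled epochs."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.mean(np.minimum(unclipped, clipped))

# ratio = beta_tau(theta) = pi_theta(a|s) / pi_theta_old(a|s) for a batch of samples
ratio = np.array([0.7, 1.05, 1.4])
advantage = np.array([1.0, -0.5, 2.0])
objective_value = ppo_clip_objective(ratio, advantage)   # maximized during training
```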
Algorithm 1, presented below, represents a summarized embodiment of method 10 as a local queue-management framework for controlling congestion in an ICLN. Algorithm 1 relies on the concurrent execution of the BSA and LFC policies discussed above. Although operating concurrently, these policies are executed at configurable and possibly different time scales. The execution of the proposed local queue-management policies in node n is summarized as Algorithm 1, which presumes the existence of F active flows and their corresponding BSA profile and a baseline LFC policy πθ* learned via the PPO algorithm described above, where θ* denotes the set of parameters defining the LFC policy, and which uses nf and ñf to denote, respectively, the one-hop neighbor of n from which flow f is forwarded to n and the source node for flow f. The procedure UPDATE QUEUE OCCUPANCY represents the evolution of the queue occupancy due to packet arrivals and departures as discussed above. Note that in this work πθ* is assumed to be common to all agents, and thus to all traffic classes. Additional refinement in the definition of the LFC policies based on the traffic class is possible.
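As a rough companion to Algorithm 1, the Python sketch below shows one possible orchestration of the concurrent BSA and LFC loops described above; the helper functions (observe_state, lfc_policy, send_request, send_ecn, update_bsa) and the scheduling constants are stand-ins, not steps copied from Algorithm 1.

```python
import random

def run_node(num_flows=4, epochs=400, bsa_period=10):
    """Illustrative per-node loop combining periodic BSA updates with per-flow LFC actions."""
    flow_states = ["normal"] * num_flows

    def observe_state(f):        # stand-in: builds s_f(tau) as in the state definition above
        return [random.random() for _ in range(10)]

    def lfc_policy(state):       # stand-in for the learned baseline policy pi_theta*
        return random.choice(["no-change", "slow-down", "pause", "restart"])

    def send_request(f, action): # stand-in: request sent to the PS of upstream neighbor n_f
        pass

    def send_ecn(f):             # stand-in: explicit congestion notification to the source of flow f
        pass

    def update_bsa():            # stand-in: portfolio-based reallocation; also adds/removes flows
        pass

    for tau in range(epochs):
        if tau % bsa_period == 0:          # BSA runs on its own, possibly different, time scale
            update_bsa()
        for f in range(num_flows):
            action = lfc_policy(observe_state(f))
            if action != "no-change":
                send_request(f, action)
                if action == "pause":
                    send_ecn(f)
                flow_states[f] = {"pause": "paused", "slow-down": "slowed-down",
                                  "restart": "normal"}.get(action, flow_states[f])
            # UPDATE QUEUE OCCUPANCY: packet arrivals and departures evolve the queue here

run_node()
```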
The LFC and BSA policies can be updated periodically based on a fixed schedule. Alternatively, the LFC policy can be updated after a round of interactions with the neighboring node where the corresponding flow originates or after a local packet-forwarding round occurs. In this case, the length of time between LFC decision epochs varies based on the configuration of the MAC protocol used by the node. Recall that the LFC policy interacts with the PSs in neighboring nodes to change the state of a flow. Implementing this feedback mechanism may require the availability of an independent control channel or, if per-hop acknowledgement messages are used, the use of a dedicated set of 2 bits in the acknowledgement header to notify the neighboring PS of any transmission-rate change requests, and a similar mechanism to notify the LFC agent that the request was successful. In addition to the periodic BSA updates, the BSA policy also provides a new BSA profile when a new flow is created or when an inactive flow is removed.
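A possible realization of the 2-bit request field mentioned above is sketched below; the particular bit assignment is an assumption, since the text only specifies that two dedicated bits are used.

```python
# Assumed 2-bit encoding of LFC requests carried in a per-hop acknowledgement header.
REQUEST_CODES = {"no-change": 0b00, "slow-down": 0b01, "pause": 0b10, "restart": 0b11}
CODE_TO_REQUEST = {v: k for k, v in REQUEST_CODES.items()}

def encode_request(action, ack_header_byte=0):
    """Pack the request into the two least-significant bits of an acknowledgement byte."""
    return (ack_header_byte & ~0b11) | REQUEST_CODES[action]

def decode_request(ack_header_byte):
    return CODE_TO_REQUEST[ack_header_byte & 0b11]

assert decode_request(encode_request("pause")) == "pause"
```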
The following description illustrates the performance of method 10 via numerical tests that focus on the management of F=4 active flows within a single node. In this test, each queue was modeled in Python as an instance of a common environment class that inherited its structure from the OpenAI gym application programming interface (API). The traffic arrivals for each queue were modeled as a Poisson process with exponentially distributed packet interarrival times configured via their mean arrival rate λ. Packet departures were modeled via a fair-queuing policy that distributed the fixed transmission bandwidth B=2 kilobytes per second equally across all queues. Packet sizes were modeled as independent and identically distributed uniform random variables in the interval [100, 150] bytes. The packet dequeuing process incurred a nominal delay of 10 milliseconds per packet, irrespective of the packet size. The evolution {af(τ)}, f=1, . . . , 4, was modeled as described above using a fixed step interval of 60 seconds that corresponded to the time in between decision epochs τ. Per flow f, the sojourn statistics in ωf(τ) were computed using 60-second sojourn-time metrics aggregated over a sliding window that comprised the last 100 measurements. The BSA policy was configured to be executed on a schedule. All scenarios considered for training and testing were limited to 6 hours and 400 decision epochs. The following subsections describe the training of the LFC agents and the BSA strategy applied to a single node.
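The sketch below reproduces the traffic and service assumptions used in the tests (Poisson arrivals, uniform packet sizes, a fair share of B = 2 kB/s, and a 10 ms per-packet dequeuing delay) for a single 60-second decision interval; it deliberately ignores backlog carried across intervals and the BSA/LFC interactions, so it is a simplification rather than the full environment.

```python
import random

def simulate_interval(lam=2.0, duration=60.0, bandwidth_bps=2000 / 4, dequeue_delay=0.01):
    """One decision interval for a single queue: Poisson arrivals with rate lam packets/s
    (exponential interarrival times) and packet sizes uniform in [100, 150] bytes."""
    t, sizes = 0.0, []
    while True:
        t += random.expovariate(lam)          # exponential interarrival time
        if t > duration:
            break
        sizes.append(random.uniform(100, 150))

    sent_bytes, busy = 0.0, 0.0
    for size in sizes:
        service_time = size / bandwidth_bps + dequeue_delay
        if busy + service_time > duration:    # no transmission time left in this interval
            break
        busy += service_time
        sent_bytes += size
    return len(sizes), sent_bytes

arrivals, forwarded = simulate_interval()
```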
With respect to the numerical tests mentioned above, the LFC agent policy was trained using the PPO implementation in RLlib, and the hyper-parameter search workflow was implemented with Ray Tune. Our PPO algorithm configuration used the tanh activation function and computed the advantage function Â using a value-function baseline obtained via a critic deep neural network (DNN). During example training of the LFC policy for the numerical tests, q̆f and λ were modeled as uniform random variables in the intervals [75, 1,275] kilobytes (KB) and [0.5, 8] packets per second, respectively. These parameters were randomly drawn at the beginning of each episode and kept fixed throughout. One training approach sampled q̆f and λ over their entire distribution domain. Although the episode rewards seemed to converge, the auxiliary policy-training curves showed a near-zero value-function (VF) explained variance and entropy after a few iterations for a wide range of hyper-parameter choices. This behavior was believed to be caused by the LFC agent quickly learning a suboptimal, deterministic policy and being unable to learn to predict the VF over the PPO horizon. Poor asymptotic performance in environments with high variability, or when solving tasks that involve subtasks of significantly different difficulty, has been documented as a common challenge when training machine learning algorithms. The LFC agent training method may be refined with ideas from curriculum learning, which suggests gradually increasing the complexity of the environment during the training process.
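For readers who want a starting point, the sketch below shows a minimal RLlib/Ray Tune setup of the kind described above, written against the pre-2.0 configuration-dict API; the stand-in environment and all hyper-parameter values other than the DNN1 hidden-layer sizes and the tanh activation are assumptions, not the settings used in the tests.

```python
from ray import tune

config = {
    "env": "CartPole-v1",              # stand-in; the tests used a custom gym queue environment
    "framework": "torch",
    "gamma": 0.99,                     # assumed discount factor
    "clip_param": 0.2,                 # assumed PPO clipping parameter epsilon
    "model": {
        "fcnet_hiddens": [16, 16],     # DNN1: two hidden layers with 16 neurons each
        "fcnet_activation": "tanh",
    },
    "lr": tune.grid_search([1e-4, 5e-4]),   # example hyper-parameter search with Ray Tune
}

tune.run("PPO", config=config, stop={"training_iteration": 100})
```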
For the example embodiment of method 10 employed in the numerical tests referenced above, the variability in the environment was incrementally increased via the parameter sequences ε1=(1200, 975, 675, 375, 75) and ε2=(0.5, 2.1, 3.7, 5.3, 6.9, 8.5). A total of 10 tasks indexed as k=0, . . . , 9 were defined using ε1 and ε2. For task k, the parameters ε̂1(k):=ε1(⌊k/2⌋) and ε̂2(k):=ε2(⌊(k+1)/2⌋), where ⌊·⌋ denotes the rounding-down operation and ε1(m) (ε2(m)) the m-th entry of the sequence ε1 (ε2), were used to define the sampling intervals. Thus, when using task k from the curriculum, q̆f and λ were sampled from the intervals [ε̂1(k), 1275] KB and [0.5, ε̂2(k)] packets per second. Mean reward curves obtained when using our curriculum-based (CB) training approach were compared with the reward curves obtained when no CB (NCB) training approach was used. In this comparison, all curves corresponded to the case C=1.0. In the CB case, each task in the curriculum was used for 5·10⁵ iterations before transitioning to the next task. The CB approach consistently outperformed the NCB approach in terms of mean reward per iteration. After 5·10⁶ iterations, the mean reward achieved was 362.1 for CB and 334.1 for NCB.
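The curriculum schedule just described can be generated directly from ε1 and ε2, as the short sketch below shows; the sequences and index formulas follow the text, while the loop itself is only a demonstration.

```python
import math

eps1 = (1200, 975, 675, 375, 75)        # KB: lower bounds for sampling the allocation q_f
eps2 = (0.5, 2.1, 3.7, 5.3, 6.9, 8.5)   # packets/s: upper bounds for sampling lambda

def task_intervals(k):
    """Sampling intervals for curriculum task k = 0, ..., 9."""
    lo_q = eps1[math.floor(k / 2)]             # eps_hat_1(k) = eps1(floor(k / 2))
    hi_lam = eps2[math.floor((k + 1) / 2)]     # eps_hat_2(k) = eps2(floor((k + 1) / 2))
    return (lo_q, 1275), (0.5, hi_lam)         # [KB] and [packets/s]

schedule = [task_intervals(k) for k in range(10)]
# schedule[0] == ((1200, 1275), (0.5, 0.5)); schedule[9] == ((75, 1275), (0.5, 8.5))
```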
The performance of two CB training approaches using different DNN configurations (i.e., DNN1 and DNN2) was also compared. The input and output layers for both configurations had 10 and 1 neurons, respectively. DNN1 featured two hidden layers with 16 neurons each, and DNN2 featured two hidden layers with 128 neurons each. In this case, tasks k=0, . . . , 8 were used to train the policy for 5·10⁶ iterations each, while task k=9 was used for 10⁷ iterations. The CB approach with DNN1 achieved a mean reward of 358.8 at the end of training and outperformed the CB approach using DNN2, which achieved a mean reward of 347.8 at the end of training. Additional metrics such as the total loss, the VF loss, the VF explained variance, and the entropy values may be used to select the final LFC policy. In the performance comparison mentioned above, the DNN1 configuration obtained a total loss value of 0.65, a VF loss of 0.84, a VF explained variance of 0.35, and an entropy of 0.68. The DNN2 configuration obtained a total loss value of 1.91, a VF loss of 2.41, a VF explained variance of −0.09, and an entropy of 0.20.
The LFC policy obtained when using DNN1 was further analyzed. The performance of this policy when managing the queue was compared with the performance of the policies obtained via the NCB approach using DNN1, the CB approach using DNN2, a policy that chooses actions at random, and a deterministic threshold-based policy, termed THP. The THP policy stops accepting packets when packets are first dropped. Then, it waits until the queue is emptied to resume accepting packets. Table I illustrates the performance of these policies, in terms of the average reward and related mean statistics, for different q̆f and λ.
All results in Table I are averages obtained over 25 Monte Carlo runs for each policy. Interestingly, the CB-DNN1 approach yielded a balanced performance featuring low sojourn times and high packet throughput in most cases.
From the above description of the method 10, it is manifest that various techniques may be used for implementing the concepts of method 10 without departing from the scope of the claims. The described embodiments are to be considered in all respects as illustrative and not restrictive. The method/apparatus disclosed herein may be practiced in the absence of any element that is not specifically claimed and/or disclosed herein. It should also be understood that method 10 is not limited to the particular embodiments described herein, but is capable of many embodiments without departing from the scope of the claims.
This application claims the benefit of U.S. provisional patent application 63/525,122, filed 5 Jul. 2023, titled “Method for Controlling Congestion in Intermittently-Connected and Lossy Computer Networks” (Navy Case #211162), which application is hereby incorporated by reference herein in its entirety.
The United States Government has ownership rights in this invention. Licensing and technical inquiries may be directed to the Office of Research and Technical Applications, Naval Information Warfare Center Pacific, Code 72120, San Diego, CA, 92152; voice (619) 553-5118; NIWC_Pacific_T2@us.navy.mil. Reference Navy Case Number 211162.